# Data review for ETL and E/R
The purpose of this notebook is to review the current data in order to define an E/R schema and an ETL pipeline.
We will work with the Spark library (PySpark).

In [1]:
import findspark
import pickle
import pandas as pd
import os
os.environ["SPARK_HOME"] = r"F:\DataScience\spark\spark-3.5.3-bin-hadoop3\spark-3.5.3-bin-hadoop3"

findspark.init()


from pyspark.sql import SparkSession
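# Note: with master("local[4]") everything runs inside the driver JVM, so
# "spark.executor.memory" likely has no effect here; "spark.driver.memory" is the setting that usually applies.
spark = SparkSession.builder.master("local[4]").config("spark.executor.memory", "8g").appName("PySpark").getOrCreate()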
# To close the session, call spark.stop()

In [37]:
from pyspark.sql.functions import explode

In [2]:
print(spark.version)

3.5.3


# Google data
Here we basically have two datasets.

1. metadata-sitios: Metadata for every site. It includes the number of reviews, description, location, category, average rating, etc.
2. review-estados: The individual reviews, organized by state. Relevant fields: user, text, rating, gmap_id (the business ID).

## Site metadata

In [4]:
ruta_lectura = r"F:\DataScience\PF - DataNova\datasets\Google Maps\metadata-sitios\1.json"

In [None]:
df = spark.read.json(ruta_lectura)
# Consider using persist or cache

In [None]:
# Table schema: 15 columns in total
df.printSchema()

root
 |-- MISC: struct (nullable = true)
 |    |-- Accessibility: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Activities: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Amenities: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Atmosphere: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Crowd: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Dining options: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- From the business: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Getting here: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Health & safety: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Highlights: array (nullable = true)
 |    |

In [None]:
#preview
df.show(10)

+--------------------+--------------------+----------+--------------------+-----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|                MISC|             address|avg_rating|            category|description|             gmap_id|               hours|          latitude|          longitude|                name|num_of_reviews|price|    relative_results|               state|                 url|
+--------------------+--------------------+----------+--------------------+-----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|{[Wheelchair acce...|Porter Pharmacy, ...|       4.9|          [Pharmacy]|       NULL|0x88f16e41928ff68...|[[Friday, 8AM–6PM...|           32.3883|           -83.3571|     Porte

In [None]:
# Number of rows in the file.
df.count()

275001

In [12]:
df.describe().show()

+-------+--------------------+------------------+--------------------+--------------------+-------------------+-------------------+--------------------+------------------+-----+------------------+--------------------+
|summary|             address|        avg_rating|         description|             gmap_id|           latitude|          longitude|                name|    num_of_reviews|price|             state|                 url|
+-------+--------------------+------------------+--------------------+--------------------+-------------------+-------------------+--------------------+------------------+-----+------------------+--------------------+
|  count|              264939|            275001|               13155|              275001|             275001|             275001|              274994|            275001|13450|            195523|              275001|
|   mean|                NULL| 4.307215610124819|                NULL|                NULL|  37.49011244811361| -92.274707529695

### First look
1. Candidate columns to remove:
    - description: Provides no relevant data. Only 13155 of 275001 sites have a description, i.e. fewer than 5%.
    - state: Not relevant to our analysis. It only records the status of the business (open or closed) at the time the data was extracted.
    - url: URL of the site on Google Maps.
    - price: Too many null values (only 13450 of 275001 rows are populated).
    - hours: Business opening hours. (Evaluate removal unless it is needed for the analysis.)

2. We have no city or state information; a library such as 'geopy' or 'geocoder' could be used to derive it from latitude and longitude. Optional: extract it from the address column.
3. Key columns: gmap_id, avg_rating, category.
4. MISC: Has a defined schema. It is important to validate which data within this schema may be useful.
5. CATEGORY: Analyze the existing categories (e.g., with a word cloud).
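
For the optional route in point 2 (deriving the state from the address column instead of reverse geocoding), a regular expression may be enough. The sketch below assumes addresses end with `<ST> <ZIP>`, which has not been verified against the dataset; `extract_state` and the sample addresses are hypothetical:

```python
import re

# Assumed address layout: "..., <City>, <ST> <ZIP>" (two-letter state, 5-digit ZIP,
# optional ZIP+4). This layout is an assumption, not verified on the dataset.
STATE_ZIP_RE = re.compile(r",\s*([A-Z]{2})\s+\d{5}(?:-\d{4})?\s*$")


def extract_state(address):
    """Return the two-letter state code, or None when the pattern does not match."""
    if not address:
        return None  # the address column contains nulls
    match = STATE_ZIP_RE.search(address)
    return match.group(1) if match else None


# Hypothetical addresses, made up for illustration.
samples = [
    "Example Shop, 1 Main St, Springfield, IL 62701",
    "Example Cafe, 2 Oak Ave, Austin, TX 78701-1234",
    "Address without a state",
    None,
]
states = [extract_state(a) for a in samples]
```

In Spark, the same pattern could be passed to `pyspark.sql.functions.regexp_extract` with group index 1, which returns an empty string for rows that do not match. Rows left without a state would still need the geocoding fallback.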

### MISC

In [33]:
df.select("MISC.*").show()

+--------------------+----------+--------------------+----------+------------------+---------------+--------------------+------------+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|       Accessibility|Activities|           Amenities|Atmosphere|             Crowd| Dining options|   From the business|Getting here|     Health & safety|      Highlights|           Offerings|            Payments|            Planning|         Popular for|Recycling|     Service options|
+--------------------+----------+--------------------+----------+------------------+---------------+--------------------+------------+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|[Wheelchair acces...|      NULL|                NULL|      NULL|              NULL|           NULL|                NULL|        NULL|[M

In [44]:
pass

### CATEGORY
This is a column of type ARRAY.

In [None]:
# Display
category_df = df.select("category")
category_df.show(truncate=False)

+-----------------------------------------------------------------------+
|category                                                               |
+-----------------------------------------------------------------------+
|[Pharmacy]                                                             |
|[Textile exporter]                                                     |
|[Korean restaurant]                                                    |
|[Fabric store]                                                         |
|[Fabric store]                                                         |
|[Fabric store]                                                         |
|[Restaurant]                                                           |
|[Nail salon, Waxing hair removal service]                              |
|[Bakery, Health food restaurant]                                       |
|[Greeting card shop, Service establishment]                            |
|[Dentist, Cosmetic dentist, Dental cl

In [42]:
# Explode
category_df_exploded = category_df.select(explode("category").alias("element"))
category_df_exploded.count()

530472

In [43]:
category_df_exploded.show(truncate=False)

+---------------------------+
|element                    |
+---------------------------+
|Pharmacy                   |
|Textile exporter           |
|Korean restaurant          |
|Fabric store               |
|Fabric store               |
|Fabric store               |
|Restaurant                 |
|Nail salon                 |
|Waxing hair removal service|
|Bakery                     |
|Health food restaurant     |
|Greeting card shop         |
|Service establishment      |
|Dentist                    |
|Cosmetic dentist           |
|Dental clinic              |
|Auto glass shop            |
|Window tinting service     |
|Beauty salon               |
|Ski rental service         |
+---------------------------+
only showing top 20 rows



In [47]:
category_df_exploded_count = category_df_exploded.groupBy("element").count()

In [None]:
# 3769 distinct values.
category_df_exploded_count.count()

3769

In [51]:
# Show the values in ascending order of count (the 100 least frequent).
category_df_exploded_count.orderBy("count", ascending=True).show(100, truncate=False)

+-----------------------------------------+-----+
|element                                  |count|
+-----------------------------------------+-----+
|Pueblan restaurant                       |1    |
|Smart Car dealer                         |1    |
|Pottery                                  |1    |
|Festival hall                            |1    |
|Video                                    |1    |
|Rehearsal studio                         |1    |
|TB clinic                                |1    |
|Fire fighters academy                    |1    |
|Confectionery wholesaler                 |1    |
|Chemical exporter                        |1    |
|CNG fittment center                      |1    |
|Piano maker                              |1    |
|Motocross                                |1    |
|Chamber of handicrafts                   |1    |
|Orchid farm                              |1    |
|Microbiologist                           |1    |
|Port                                     |1    |


In [52]:
# Think about additional treatment for this column.
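
One possible treatment, given that many of the 3769 categories appear only once, is to normalize the labels and bucket the rare ones together. The sketch below is plain Python over a small count table; `build_category_mapping`, the `min_count` threshold, the `"other"` label and the toy counts are all hypothetical choices:

```python
from collections import Counter


def normalise(name):
    """Collapse repeated whitespace and ignore case."""
    return " ".join(name.split()).casefold()


def build_category_mapping(raw_counts, min_count=5, other_label="other"):
    """Map each raw category to its normalized form, or to other_label if rare."""
    # Merge counts of labels that become identical after normalization.
    merged = Counter()
    for name, n in raw_counts.items():
        merged[normalise(name)] += n
    return {
        name: normalise(name) if merged[normalise(name)] >= min_count else other_label
        for name in raw_counts
    }


# Toy counts, made up for illustration.
toy_counts = {"Fabric store": 3, "fabric  store": 2, "Pueblan restaurant": 1}
mapping = build_category_mapping(toy_counts, min_count=5)
```

Since the aggregated table has only 3769 rows, the counts could be collected to the driver with something like `{row["element"]: row["count"] for row in category_df_exploded_count.collect()}` before building the mapping.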

# Stopping the Spark session

In [53]:
spark.stop()