The ETL process covers the full dataset and is carried out with **pyspark** and **polars**. The data is read from its source on Google Drive and then loaded into a data lakehouse on GCP.

We install the pyspark and polars libraries, together with gcsfs and fastparquet

In [1]:
!pip install pyspark polars gcsfs fastparquet

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Collecting fastparquet
  Downloading fastparquet-2023.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting cramjam>=2.3 (from fastparquet)
  Downloading cramjam-2.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=4dbec9506eff7535a10bff26e4e307529fac54a16a5de5

We import the required libraries

In [2]:
import os
import json
import pandas as pd
import polars as pl
from datetime import date, timedelta, datetime
import time
import re

import pyspark.pandas as ps
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import *

from google.cloud import storage
import pyarrow.parquet as pq
from google.colab import auth
from google.colab import drive
drive.mount('/content/drive')
auth.authenticate_user()



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We start a Spark session

In [3]:
spark = SparkSession.builder \
        .appName("ETL_maps") \
        .config("spark.driver.memory", "8g") \
        .config("spark.executor.memory", "8g") \
        .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
        .getOrCreate()

In [None]:
spark

# ETL metadata-sitios

Business information, including location, attributes, and categories.

In [4]:
sitio1 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/1.json')

We display the contents of the sitio1 DF

In [None]:
# Show the sitio1 DF
sitio1.show(7)

# Summary statistics of the DF
sitio1.describe().show()

+--------------------+--------------------+----------+-------------------+-----------+--------------------+--------------------+----------+-------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|                MISC|             address|avg_rating|           category|description|             gmap_id|               hours|  latitude|          longitude|                name|num_of_reviews|price|    relative_results|               state|                 url|
+--------------------+--------------------+----------+-------------------+-----------+--------------------+--------------------+----------+-------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|{[Wheelchair acce...|Porter Pharmacy, ...|       4.9|         [Pharmacy]|       NULL|0x88f16e41928ff68...|[[Friday, 8AM–6PM...|   32.3883|           -83.3571|     Porter Pharmacy|            16| NULL|[0x8

In [5]:
sitio2 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/2.json')

In [None]:
# Show the sitio2 DF
sitio2.show(7)

# Summary statistics of the DF
sitio2.describe().show()

+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|                MISC|             address|avg_rating|            category|         description|             gmap_id|               hours|          latitude|         longitude|                name|num_of_reviews|price|    relative_results|               state|                 url|
+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|{[Wheelchair acce...|Porter Pharmacy, ...|       4.9|          [Pharmacy]|                NULL|0x88f16e41928ff68...|[[Friday, 8AM–6PM...|           32.38

We open the remaining **sitio** files

In [None]:
sitio3 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/3.json')
sitio4 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/4.json')
sitio5 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/5.json')
sitio6 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/6.json')
sitio7 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/7.json')
sitio8 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/8.json')
sitio9 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/9.json')
sitio10 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/10.json')
sitio11 = spark.read.json('/content/drive/MyDrive/Google Maps/metadata-sitios/11.json')
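
The reads above differ only in the file number, so the paths can be generated in a loop. The sketch below is one possible way to do that. `numbered_json_paths` is a hypothetical helper that is not part of the original notebook, and it assumes the files are named `1.json` through `N.json`, as in the cell above.

```python
def numbered_json_paths(base_dir, count):
    """Return the paths base_dir/1.json ... base_dir/<count>.json."""
    base_dir = base_dir.rstrip("/")
    return [f"{base_dir}/{i}.json" for i in range(1, count + 1)]


paths = numbered_json_paths(
    "/content/drive/MyDrive/Google Maps/metadata-sitios", 11
)

# With the Spark session created earlier, each file could then be read with:
#   sitios = [spark.read.json(p) for p in paths]
```

Keeping the DataFrames in a list also avoids maintaining eleven separate variable names when the schemas are inspected later.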

We inspect the structure of each schema

In [None]:
sitios = [sitio1, sitio2, sitio3, sitio4, sitio5, sitio6, sitio7, sitio8, sitio9, sitio10, sitio11]

for i, df in enumerate(sitios):
    print(f"Schema of sitio{i + 1}:")
    df.printSchema()

Schema of sitio1:
root
 |-- MISC: struct (nullable = true)
 |    |-- Accessibility: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Activities: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Amenities: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Atmosphere: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Crowd: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Dining options: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- From the business: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Getting here: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Health & safety: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- Highlights: array (nullab

According to the schemas, the **MISC** column in the DFs for sitios 1, 2, 3, 6 and 9 has a different type from the same column in DFs 4, 5, 7, 8, 10 and 11. We therefore union the DFs in two groups, according to the data type of their **MISC** column.
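
Rather than comparing the printed schemas by eye, the groups could be derived programmatically. The following is a minimal sketch under stated assumptions: `group_by_signature` is a hypothetical helper, and the demo "schemas" are made-up tuples of field names (including the invented field `Payments`), not the real MISC structs.

```python
from collections import defaultdict


def group_by_signature(named_items, signature):
    """Group names by the value signature(item) returns for each item."""
    groups = defaultdict(list)
    for name, item in named_items:
        groups[signature(item)].append(name)
    return dict(groups)


# Stand-in data: each "schema" is just a tuple of field names.
# These values are hypothetical and only illustrate the grouping.
demo = [
    ("sitio1", ("Accessibility", "Amenities")),
    ("sitio4", ("Accessibility", "Amenities", "Payments")),
    ("sitio2", ("Accessibility", "Amenities")),
]
groups = group_by_signature(demo, signature=lambda fields: fields)
print(groups)

# With real Spark DataFrames, a possible signature is the MISC type string:
#   signature=lambda df: df.schema["MISC"].dataType.simpleString()
```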

In [None]:
# sitioA holds the DFs whose MISC column has 16 fields
sitioA = sitio1.union(sitio2).union(sitio3).union(sitio6).union(sitio9)

In [None]:
# sitioB holds the DFs whose MISC column has 17 fields
sitioB = sitio4.union(sitio5).union(sitio7).union(sitio8).union(sitio10).union(sitio11)

In [None]:
'''
def desanidar_columna(df, columna):
  # Get the schema of the column to be unnested
  esquema = df.schema[columna].dataType

  # Iterate over the fields of the nested column and promote each one
  # to a top-level column with the same name
  for nombre_campo in esquema.names:
    df = df.withColumn(nombre_campo, col(f"{columna}.{nombre_campo}"))

  # Drop the nested column
  df = df.drop(columna)

  return df

'''
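
The disabled function promotes each MISC field to a top-level column under its own name, which could clash with an existing column and leaves names such as `Health & safety` containing spaces. As a hedged sketch, the helper below only builds the source path and a prefixed alias for each field; `struct_field_exprs` and the `MISC_` prefix are assumptions introduced here, not part of the original notebook.

```python
import re


def struct_field_exprs(column, field_names, prefix=None):
    """Build (source_path, alias) pairs for flattening a struct column.

    Field names are wrapped in backticks so that names containing dots
    stay intact, and aliases are prefixed to avoid clashing with
    existing top-level columns. Field names that themselves contain a
    backtick are not handled by this sketch.
    """
    prefix = column if prefix is None else prefix
    pairs = []
    for name in field_names:
        path = f"{column}.`{name}`"
        alias = f"{prefix}_" + re.sub(r"\W+", "_", name).strip("_")
        pairs.append((path, alias))
    return pairs


pairs = struct_field_exprs("MISC", ["Accessibility", "Health & safety"])

# With PySpark, the pairs could then be applied as:
#   df.select("*", *[col(p).alias(a) for p, a in pairs]).drop("MISC")
```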

In [None]:
# sitioA = desanidar_columna(sitioA, 'MISC')
# sitioB = desanidar_columna(sitioB, 'MISC')

We display both DFs

In [None]:
sitioA.show(5)

sitioB.show(5)

+--------------------+--------------------+----------+-------------------+-----------+--------------------+--------------------+----------+------------+----------------+--------------+-----+--------------------+-----------------+--------------------+
|                MISC|             address|avg_rating|           category|description|             gmap_id|               hours|  latitude|   longitude|            name|num_of_reviews|price|    relative_results|            state|                 url|
+--------------------+--------------------+----------+-------------------+-----------+--------------------+--------------------+----------+------------+----------------+--------------+-----+--------------------+-----------------+--------------------+
|{[Wheelchair acce...|Porter Pharmacy, ...|       4.9|         [Pharmacy]|       NULL|0x88f16e41928ff68...|[[Friday, 8AM–6PM...|   32.3883|    -83.3571| Porter Pharmacy|            16| NULL|[0x88f16e41929435...|Open ⋅ Closes 6PM|https://www.googl.

We drop the MISC column from both DFs

In [None]:
sitioA = sitioA.drop('MISC')
sitioB = sitioB.drop('MISC')

We union the two DataFrames

In [None]:
sitios = sitioA.union(sitioB)

sitios.show(5)
sitios.count()

+--------------------+----------+-------------------+-----------+--------------------+--------------------+----------+------------+----------------+--------------+-----+--------------------+-----------------+--------------------+
|             address|avg_rating|           category|description|             gmap_id|               hours|  latitude|   longitude|            name|num_of_reviews|price|    relative_results|            state|                 url|
+--------------------+----------+-------------------+-----------+--------------------+--------------------+----------+------------+----------------+--------------+-----+--------------------+-----------------+--------------------+
|Porter Pharmacy, ...|       4.9|         [Pharmacy]|       NULL|0x88f16e41928ff68...|[[Friday, 8AM–6PM...|   32.3883|    -83.3571| Porter Pharmacy|            16| NULL|[0x88f16e41929435...|Open ⋅ Closes 6PM|https://www.googl...|
|City Textile, 300...|       4.5| [Textile exporter]|       NULL|0x80c2c98c0e3c1

3025011

In [None]:
# Count the null values in each column
sitios.select([sum(col(columna).isNull().cast("int")).alias(columna) for columna in sitios.columns]).show()

+-------+----------+--------+-----------+-------+------+--------+---------+----+--------------+-------+----------------+------+---+
|address|avg_rating|category|description|gmap_id| hours|latitude|longitude|name|num_of_reviews|  price|relative_results| state|url|
+-------+----------+--------+-----------+-------+------+--------+---------+----+--------------+-------+----------------+------+---+
|  80511|         0|   17419|    2770722|      0|787405|       0|        0|  37|             0|2749808|          295058|746455|  0|
+-------+----------+--------+-----------+-------+------+--------+---------+----+--------------+-------+----------------+------+---+
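
The expression above relies on `sum` being Spark's aggregate function (brought in by `from pyspark.sql.functions import *`) rather than Python's built-in. What it computes can be sketched on plain Python rows. The rows below are made-up stand-ins, not records from the dataset, and `null_counts` is a hypothetical helper.

```python
def null_counts(rows, columns):
    """Count, per column, how many rows hold None (or lack the key)."""
    return {c: sum(1 for row in rows if row.get(c) is None) for c in columns}


# Made-up rows for illustration only.
rows = [
    {"name": "A", "price": None, "state": "Open"},
    {"name": "B", "price": "$$", "state": None},
    {"name": None, "price": None, "state": None},
]
counts = null_counts(rows, ["name", "price", "state"])
print(counts)
```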



We drop other unnecessary columns

In [None]:
sitios = sitios.drop('description', 'hours', 'num_of_reviews', 'price', 'relative_results', 'state', 'url')

In [None]:
# Drop rows that contain null values
sitios = sitios.dropna()

sitios.show(5)
sitios.count()

+--------------------+----------+-------------------+--------------------+----------+------------+----------------+
|             address|avg_rating|           category|             gmap_id|  latitude|   longitude|            name|
+--------------------+----------+-------------------+--------------------+----------+------------+----------------+
|Porter Pharmacy, ...|       4.9|         [Pharmacy]|0x88f16e41928ff68...|   32.3883|    -83.3571| Porter Pharmacy|
|City Textile, 300...|       4.5| [Textile exporter]|0x80c2c98c0e3c16f...|34.0188913|-118.2152898|    City Textile|
|San Soo Dang, 761...|       4.4|[Korean restaurant]|0x80c2c778e3b73d3...|34.0580917|-118.2921295|    San Soo Dang|
|Nova Fabrics, 220...|       3.3|     [Fabric store]|0x80c2c89923b27a4...|34.0236689|-118.2329297|    Nova Fabrics|
|Nobel Textile Co,...|       4.3|     [Fabric store]|0x80c2c632f933b07...|34.0366942|-118.2494208|Nobel Textile Co|
+--------------------+----------+-------------------+-------------------

2927086

We remove duplicates

In [None]:
sitios = sitios.dropDuplicates()

In [None]:
sitios.count()

2901730
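
To put the three printed row counts in perspective, the rows removed at each cleaning step can be computed directly. This is a small sketch using only the counts shown in the outputs above; `removed_per_step` is a hypothetical helper.

```python
def removed_per_step(counts):
    """Return (label, rows_removed, share_of_previous) for each step."""
    steps = []
    for (_, before), (label, after) in zip(counts, counts[1:]):
        removed = before - after
        steps.append((label, removed, removed / before))
    return steps


# Row counts printed by the cells above.
counts = [
    ("union", 3025011),
    ("dropna", 2927086),
    ("dropDuplicates", 2901730),
]
for label, removed, share in removed_per_step(counts):
    print(f"{label}: {removed} rows removed ({share:.2%})")
```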

We save the `sitios` DF

In [None]:
sitios = sitios.toPandas()

In [None]:
sitios.to_parquet('gs://yelp-and-maps-data-processed/Maps/metadata_sitios_clean.parquet')

# ETL review-estados

Provides the user reviews, organized by state

## `Pennsylvania`

In [6]:
pennsylvania1 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/1.json')

In [None]:
# Show the DF
pennsylvania1.show(5)

# Summary statistics of the DF
pennsylvania1.describe().show()

+--------------------+-----------------+----+------+----+--------------------+-------------+--------------------+
|             gmap_id|             name|pics|rating|resp|                text|         time|             user_id|
+--------------------+-----------------+----+------+----+--------------------+-------------+--------------------+
|0x89c6c63c8cd8714...|  Jaron Whitfield|NULL|     5|NULL|Joe is quite uniq...|1517731762839|10494474255907975...|
|0x89c6c63c8cd8714...|Jonathan McCarthy|NULL|     5|NULL|For such a small ...|1476276291163|11760970283298032...|
|0x89c6c63c8cd8714...|        Rocky Kev|NULL|     5|NULL|I usually give th...|1338826945578|11056324201842663...|
|0x89c6c63c8cd8714...|      Josep Valls|NULL|     5|NULL|My bike had been ...|1363286110554|11289597350540139...|
|0x89c6c63c8cd8714...|   Timaree Schmit|NULL|     5|NULL|Always an easy ex...|1548798329760|11061967488596382...|
+--------------------+-----------------+----+------+----+--------------------+----------

In [7]:
# Open the remaining review files for the state of Pennsylvania
pennsylvania2 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/2.json')
pennsylvania3 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/3.json')
pennsylvania4 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/4.json')
pennsylvania5 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/5.json')
pennsylvania6 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/6.json')
pennsylvania7 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/7.json')
pennsylvania8 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/8.json')
pennsylvania9 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/9.json')
pennsylvania10 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/10.json')
pennsylvania11 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/11.json')
pennsylvania12 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/12.json')
pennsylvania13 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/13.json')
pennsylvania14 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/14.json')
pennsylvania15 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/15.json')
pennsylvania16 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Pennsylvania/16.json')

In [8]:
# Union all the reviews for the state of Pennsylvania
pennsylvania = pennsylvania1.union(pennsylvania2)\
                            .union(pennsylvania3)\
                            .union(pennsylvania4)\
                            .union(pennsylvania5)\
                            .union(pennsylvania6)\
                            .union(pennsylvania7)\
                            .union(pennsylvania8)\
                            .union(pennsylvania9)\
                            .union(pennsylvania10)\
                            .union(pennsylvania11)\
                            .union(pennsylvania12)\
                            .union(pennsylvania13)\
                            .union(pennsylvania14)\
                            .union(pennsylvania15)\
                            .union(pennsylvania16)
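
A long chain of `.union(...)` calls can also be written as a fold over a list. The sketch below is one way to do it under stated assumptions: `union_all` is a hypothetical helper, and its default combines frames with `unionByName`, which matches columns by name, whereas the `union` used above matches them by position. The demonstration uses plain lists as stand-ins for DataFrames.

```python
from functools import reduce


def union_all(frames, combine=lambda left, right: left.unionByName(right)):
    """Fold a sequence of DataFrames into one using `combine`."""
    frames = list(frames)
    if not frames:
        raise ValueError("union_all needs at least one DataFrame")
    return reduce(combine, frames)


# Stand-in demonstration with lists, where "union" is concatenation.
merged = union_all([[1, 2], [3], [4, 5]], combine=lambda a, b: a + b)
print(merged)

# With Spark DataFrames collected in a list `frames`, this would be:
#   pennsylvania = union_all(frames)
```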

In [9]:
pennsylvania.count()

2400000

We drop unnecessary columns

In [9]:
pennsylvania = pennsylvania.drop('pics', 'resp')

We remove duplicates

In [10]:
pennsylvania = pennsylvania.dropDuplicates()

In [34]:
pennsylvania.show(5)

pennsylvania.count()

+--------------------+-----------------+------+--------------------+-------------+--------------------+
|             gmap_id|             name|rating|                text|         time|             user_id|
+--------------------+-----------------+------+--------------------+-------------+--------------------+
|0x89c6c63c8cd8714...|   Jonathon Kelly|     5|This place helped...|1526927434415|10296684864808028...|
|0x89c6c715d1821fe...|tristan ellsworth|     5|The space is a be...|1565892693381|10129630006189785...|
|0x89cb978262d9556...|     Debra Shelow|     5|Very knowledgeabl...|1624748223955|10749061111743553...|
|0x89cfa78597aa43c...|   Nicholas Allen|     3|             It's ok|1610114444795|10484283200506647...|
|0x89cfa76d9e04132...|    Shawn Leriche|     5|Will be bringing ...|1552448684633|10895915575071586...|
+--------------------+-----------------+------+--------------------+-------------+--------------------+
only showing top 5 rows



2366432

Summary of the DF

In [35]:
pennsylvania.summary().show()

+-------+--------------------+---------------------+------------------+--------------------+--------------------+--------------------+
|summary|             gmap_id|                 name|            rating|                text|                time|             user_id|
+-------+--------------------+---------------------+------------------+--------------------+--------------------+--------------------+
|  count|             2366432|              2366432|           2366432|             1328047|             2366432|             2366432|
|   mean|                NULL|                  NaN| 4.346586760151992|                NULL|1.553536247296049E12|1.092980635383948...|
| stddev|                NULL|                  NaN|1.0905154765812306|                NULL|4.185426643685574E10|5.271360752968564...|
|    min|0x405d7bcaf6acac0...|   "eye's only" blank|                 1|! These guys know...|        662601600000|10000004067989084...|
|    25%|                NULL|                  1.0|   

In [11]:
pennsylvania.printSchema()

root
 |-- gmap_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- rating: long (nullable = true)
 |-- text: string (nullable = true)
 |-- time: long (nullable = true)
 |-- user_id: string (nullable = true)



We normalize the **time** column

In [12]:
# Convert 'time' from epoch milliseconds to seconds and cast it to a timestamp
pennsylvania = pennsylvania.withColumn("time", (col("time")/1000).cast("timestamp"))
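
The `time` values are epoch milliseconds, so dividing by 1000 yields the seconds that Spark's timestamp cast expects. The same conversion can be checked in plain Python. This is a minimal sketch that assumes UTC (Spark renders timestamps in the session time zone); `millis_to_utc` is a hypothetical helper, and the input is the first `time` value shown in the deduplicated Pennsylvania output above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids floating-point rounding.
    return EPOCH + timedelta(milliseconds=ms)


# First `time` value in the deduplicated Pennsylvania output above.
ts = millis_to_utc(1526927434415)
print(ts.isoformat())
```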

In [37]:
pennsylvania.show(5, truncate= False)

+-------------------------------------+-----------------+------+---------------------------------------------------------------+-----------------------+---------------------+
|gmap_id                              |name             |rating|text                                                           |time                   |user_id              |
+-------------------------------------+-----------------+------+---------------------------------------------------------------+-----------------------+---------------------+
|0x89c6c63c8cd87141:0x54d0d283872eecbb|Jonathon Kelly   |5     |This place helped me out big time. Fast easy and cheap. 5✨     |2018-05-21 18:30:34.415|102966848648080282781|
|0x89c6c715d1821fe3:0x9cfa8308c0ce2289|tristan ellsworth|5     |The space is a beautiful spot to drink my espresso before work!|2019-08-15 18:11:33.381|101296300061897859237|
|0x89cb978262d9556f:0x71621300db132dd0|Debra Shelow     |5     |Very knowledgeable and helpful staff. Worth the drive.       

We drop rows with nulls in the essential columns

In [13]:
pennsylvania = pennsylvania.dropna(subset= ['gmap_id', 'name', 'rating', 'time', 'user_id'])

In [15]:
pennsylvania.count()

2366432

We export the `pennsylvania` DataFrame

In [14]:
pennsylvania = pennsylvania.toPandas()

In [15]:
pennsylvania.to_parquet('gs://yelp-and-maps-data-processed/Maps/reviews-estados/review-Pennsylvania/pennsylvania.parquet')

## `Florida`

In [4]:
florida1 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/1.json')

In [5]:
# Show the DF
florida1.show(5)

# Summary statistics of the DF
florida1.describe().show()

+--------------------+----------------+----+------+--------------------+--------------------+-------------+--------------------+
|             gmap_id|            name|pics|rating|                resp|                text|         time|             user_id|
+--------------------+----------------+----+------+--------------------+--------------------+-------------+--------------------+
|0x8893863ea87bd5d...| Julie A. Gerber|NULL|     1|{Thank you for th...|Update: Their “re...|1628003250740|10147185615514872...|
|0x8893863ea87bd5d...|Martin Sheffield|NULL|     5|{Thank you for re...|He's a knowledgea...|1595031217005|11547723478903832...|
|0x8893863ea87bd5d...|    Brian Truett|NULL|     5|                NULL|Best doctor I've ...|1522924253567|10180501024489283...|
|0x8893863ea87bd5d...|        Tina Sun|NULL|     1|                NULL|I was told he is ...|1467907819586|10634442288149374...|
|0x8893863ea87bd5d...|    James Haynes|NULL|     5|                NULL|Takes the time to...|1480

In [7]:
# Open the remaining review files for the state of Florida
florida2 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/2.json')
florida3 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/3.json')
florida4 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/4.json')
florida5 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/5.json')
florida6 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/6.json')
florida7 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/7.json')
florida8 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/8.json')
florida9 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/9.json')
florida10 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/10.json')
florida11 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/11.json')
florida12 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/12.json')
florida13 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/13.json')
florida14 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/14.json')
florida15 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/15.json')
florida16 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/16.json')
florida17 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/17.json')
florida18 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/18.json')
florida19 = spark.read.json('/content/drive/MyDrive/Google Maps/reviews-estados/review-Florida/19.json')

In [9]:
# Union all the reviews for the state of Florida
florida = florida1.union(florida2)\
                  .union(florida3)\
                  .union(florida4)\
                  .union(florida5)\
                  .union(florida6)\
                  .union(florida7)\
                  .union(florida8)\
                  .union(florida9)\
                  .union(florida10)\
                  .union(florida11)\
                  .union(florida12)\
                  .union(florida13)\
                  .union(florida14)\
                  .union(florida15)\
                  .union(florida16)\
                  .union(florida17)\
                  .union(florida18)\
                  .union(florida19)

In [13]:
# Number of rows
florida.count()

2850000

We drop unnecessary columns

In [14]:
florida = florida.drop('pics', 'resp')

We remove duplicates

In [15]:
florida = florida.dropDuplicates()

In [16]:
florida.show(5)

florida.count()

+--------------------+-----------------+------+--------------------+-------------+--------------------+
|             gmap_id|             name|rating|                text|         time|             user_id|
+--------------------+-----------------+------+--------------------+-------------+--------------------+
|0x88909517e0c1c69...|     Alex Liddell|     5|Dorris is the Bes...|1550276337325|11490682933386983...|
|0x88c2d19dba9bebd...|    lourdes lopez|     5|Best salon ever! ...|1620184323580|11329173691561468...|
|0x88e5b08ff343d57...|               RJ|     1|Horrible Service ...|1451491894296|10560153291067776...|
|0x88e62d723d9a4d5...|Harley David Lott|     5|The Postal Gal wa...|1602436267168|11438683392577048...|
|0x88d9b86f7110f73...|     isasmella456|     5|In November I wil...|1505857953482|10897592033683909...|
+--------------------+-----------------+------+--------------------+-------------+--------------------+
only showing top 5 rows



2730604

Summary of the DF

In [17]:
florida.summary().show()

+-------+--------------------+--------------------+------------------+----------------+--------------------+--------------------+
|summary|             gmap_id|                name|            rating|            text|                time|             user_id|
+-------+--------------------+--------------------+------------------+----------------+--------------------+--------------------+
|  count|             2730604|             2730604|           2730604|         1656239|             2730604|             2730604|
|   mean|                NULL|                 NaN|4.3153954216722745|            39.5|1.555296931944298E12|1.093050756831163E20|
| stddev|                NULL|                 NaN|1.1688216680046275|43.1335136523794|4.362055671455134...|5.275339485885162...|
|    min|0x0:0xa1c0f34736d...|"Sugarcube" Bradburn|                 1|               !|       1041379200000|10000002688865548...|
|    25%|                NULL|                75.0|                 4|             9.0|   

In [18]:
florida.printSchema()

root
 |-- gmap_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- rating: long (nullable = true)
 |-- text: string (nullable = true)
 |-- time: long (nullable = true)
 |-- user_id: string (nullable = true)



We normalize the **time** column

In [19]:
# Convert 'time' from epoch milliseconds to seconds and cast it to a timestamp
florida = florida.withColumn("time", (col("time")/1000).cast("timestamp"))

In [20]:
florida.show(5, truncate= False)

+-------------------------------------+-----------------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+---------------------+
|gmap_id                              |name             |rating|text                                                                                                                                                                                                                                                                                                                                                                                                          |time                   |use

We drop rows with nulls in the essential columns

In [21]:
florida = florida.dropna(subset= ['gmap_id', 'name', 'rating', 'time', 'user_id'])

In [22]:
florida.count()

2730604

We export the `florida` DataFrame

In [23]:
florida = florida.toPandas()

In [24]:
florida.to_parquet('gs://yelp-and-maps-data-processed/Maps/reviews-estados/review-Florida/florida.parquet')

We stop the Spark session

In [None]:
spark.stop()