### Description of the process for joining the metadata and reviews datasets:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

We load the JSON files and concatenate them:

In [2]:


file_paths = ['Google Maps/reviews-estados/review-California/{}.json'.format(i) for i in range(1, 20)]

dfs = [pd.read_json(file_path, lines=True) for file_path in file_paths]

# Concatenate all the datasets:
dfc = pd.concat(dfs, axis=0, join='inner')
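
The loading step can be illustrated on a small scale. The following is a minimal sketch, assuming each file is in JSON Lines format (one review per line), as `lines=True` implies. The in-memory payloads and the `extra` column are made up for illustration. It also adds `ignore_index=True`, which the cell above does not use.

```python
import io
import pandas as pd

# Two tiny JSON Lines payloads standing in for the numbered review files (made-up data).
part_a = io.StringIO(
    '{"user_id": 1, "rating": 5, "gmap_id": "a"}\n'
    '{"user_id": 2, "rating": 3, "gmap_id": "b"}\n'
)
part_b = io.StringIO(
    '{"user_id": 3, "rating": 4, "gmap_id": "a", "extra": "x"}\n'
)

frames = [pd.read_json(part, lines=True) for part in (part_a, part_b)]

# join='inner' keeps only the columns present in every frame, so "extra" is dropped.
# ignore_index=True rebuilds a 0..n-1 index instead of repeating each file's own index.
reviews = pd.concat(frames, axis=0, join='inner', ignore_index=True)
print(reviews)
```

Without `ignore_index=True`, the concatenated frame keeps each file's own row labels, so labels repeat across files. The later merge builds a fresh index anyway, so this mainly matters if the concatenated frame is used directly.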

We decided to combine the metadata with the reviews. To do so, we use the metadata groups that we had
previously split apart to make them easier to handle:

In [4]:
# Load the metadata1 dataset
metadata1 = pd.read_parquet('Google Maps/metadata-sitios/metadata1.parquet')


We merge the two dataframes:

In [5]:

FL1 = pd.merge(dfc, metadata1, on='gmap_id')
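
A minimal sketch of what this merge does, using made-up stand-ins for `dfc` and a metadata chunk. `pd.merge` performs an inner join by default, so rows without a matching `gmap_id` on either side are dropped. The `validate` argument is an optional safeguard that the notebook does not use. The example also shows where column names such as `name_x` and `name_y` come from when both frames share a column name.

```python
import pandas as pd

# Made-up stand-ins for the reviews (dfc) and one metadata chunk.
reviews = pd.DataFrame({
    'gmap_id': ['a', 'a', 'b', 'z'],
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],    # reviewer name
    'rating': [5, 3, 4, 1],
})
metadata = pd.DataFrame({
    'gmap_id': ['a', 'b', 'c'],
    'name': ['Hotel A', 'Cafe B', 'Shop C'],  # business name
})

# Inner join: review 'z' and business 'c' have no match and are dropped.
# validate='many_to_one' raises MergeError if gmap_id is repeated in the metadata.
merged = pd.merge(reviews, metadata, on='gmap_id', how='inner', validate='many_to_one')

# Both frames have a 'name' column, so pandas appends the default suffixes _x and _y.
print(merged.columns.tolist())
```

If the metadata contained repeated `gmap_id` values, every matching review would be duplicated once per repeat. A check like `validate` surfaces that before the result is written to disk.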

In [6]:
# Save the resulting dataframe to a .parquet file
FL1.to_parquet('FL1.parquet', index=False)

In [7]:
# Free memory:
del FL1, metadata1

In [8]:

# Load the metadata2 data
metadata2 = pd.read_parquet('Google Maps/metadata-sitios/metadata2.parquet')

In [9]:
# Merge the two dataframes
FL2 = pd.merge(dfc, metadata2, on='gmap_id')

In [10]:
FL2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256627 entries, 0 to 256626
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           256627 non-null  float64
 1   name_x            256627 non-null  object 
 2   time              256627 non-null  int64  
 3   rating            256627 non-null  int64  
 4   text              159832 non-null  object 
 5   pics              11168 non-null   object 
 6   resp              31838 non-null   object 
 7   gmap_id           256627 non-null  object 
 8   name_y            256627 non-null  object 
 9   address           255866 non-null  object 
 10  description       139631 non-null  object 
 11  latitude          256627 non-null  float64
 12  longitude         256627 non-null  float64
 13  category          256616 non-null  object 
 14  avg_rating        256627 non-null  float64
 15  num_of_reviews    256627 non-null  int64  
 16  price             11

In [12]:
# Save the resulting dataframe to a .parquet file
FL2.to_parquet('FL2.parquet', index=False)

In [13]:
# Free memory
del FL2, metadata2

In [3]:
# Read the metadata3 data
metadata3 = pd.read_parquet('Google Maps/metadata-sitios/metadata3.parquet')

In [4]:
# Merge the two dataframes
FL3 = pd.merge(dfc, metadata3, on='gmap_id')

In [5]:
FL3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179747 entries, 0 to 179746
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           179747 non-null  float64
 1   name_x            179747 non-null  object 
 2   time              179747 non-null  int64  
 3   rating            179747 non-null  int64  
 4   text              111242 non-null  object 
 5   pics              6765 non-null    object 
 6   resp              26300 non-null   object 
 7   gmap_id           179747 non-null  object 
 8   name_y            179747 non-null  object 
 9   address           179285 non-null  object 
 10  description       41502 non-null   object 
 11  latitude          179747 non-null  float64
 12  longitude         179747 non-null  float64
 13  category          179690 non-null  object 
 14  avg_rating        179747 non-null  float64
 15  num_of_reviews    179747 non-null  int64  
 16  price             47

In [6]:
# Save the resulting dataframe to a .parquet file
FL3.to_parquet('FL3.parquet', index=False)
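
The same load, merge and save pattern is repeated for each metadata chunk. As a rough sketch, it could be expressed as a loop. `merge_in_chunks` is a hypothetical helper, not part of the notebook, and the frames are made up. Unlike the cells above, which write each result to disk and delete it to free memory, this version keeps every partial result in memory for brevity.

```python
import pandas as pd

def merge_in_chunks(reviews, metadata_chunks, key='gmap_id'):
    """Merge reviews against each metadata chunk in turn and stack the results.

    Hypothetical helper, not part of the original notebook.
    """
    parts = []
    for chunk in metadata_chunks:
        # Inner join: keep only reviews whose business appears in this chunk.
        parts.append(pd.merge(reviews, chunk, on=key))
    return pd.concat(parts, axis=0, join='inner', ignore_index=True)

reviews = pd.DataFrame({'gmap_id': ['a', 'b', 'c'], 'rating': [5, 4, 3]})
chunks = [
    pd.DataFrame({'gmap_id': ['a'], 'avg_rating': [4.5]}),
    pd.DataFrame({'gmap_id': ['b', 'c'], 'avg_rating': [4.0, 3.5]}),
]
combined = merge_in_chunks(reviews, chunks)
print(combined)
```

For data of the size handled here, writing each partial result to parquet inside the loop, as the notebook does by hand, is likely the safer choice.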

In [3]:
df_fl = pd.concat([pd.read_parquet(f'FL{i}.parquet') for i in range(1, 4)], axis=0, join='inner')

We save the final dataset for the state of Florida:

In [4]:
df_fl.to_parquet('df_fl.parquet', index=False)

___________________________________________________________________________________________________________

### Exploratory Data Analysis - Google reviews: State of Florida

This notebook performs an exploratory analysis of Google Maps reviews of lodging-related businesses in the state of Florida.


We load the Florida dataset that was concatenated earlier:

In [2]:
# Load the dataset
df_fl = pd.read_parquet('df_fl.parquet')

In [3]:
# Check the dataframe's info and size
df_fl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2869646 entries, 0 to 2869645
Data columns (total 22 columns):
 #   Column            Dtype  
---  ------            -----  
 0   user_id           float64
 1   name_x            object 
 2   time              int64  
 3   rating            int64  
 4   text              object 
 5   pics              object 
 6   resp              object 
 7   gmap_id           object 
 8   name_y            object 
 9   address           object 
 10  description       object 
 11  latitude          float64
 12  longitude         float64
 13  category          object 
 14  avg_rating        float64
 15  num_of_reviews    int64  
 16  price             object 
 17  hours             object 
 18  MISC              object 
 19  state             object 
 20  relative_results  object 
 21  url               object 
dtypes: float64(4), int64(3), object(15)
memory usage: 481.7+ MB


First we reviewed the columns 'name_x', 'pics', 'resp', 'MISC', 'gmap_id', 'relative_results' and 'url', and decided to drop them for the following reasons:
'name_x': User name, irrelevant to our study.
'pics'  : Photographs taken by users. We drop it because we do not consider it necessary.
'resp'  : The business's response to the user's review. Very few rows have a value.
'MISC'  : Service options. Not relevant to our study.
'gmap_id':  Google codes with no relevant meaning.
'relative_results': Google codes with no relevant meaning.
'url'   : Web address of the business. Not relevant.


In [4]:
# Drop the columns that will not be used in the analysis:

df_fl = df_fl.drop(['name_x','pics', 'resp', 'MISC', 'url', 'relative_results', 'gmap_id', 'hours'], axis=1)

We check the number of null values:

In [5]:
null_counts = df_fl.isnull().sum()
print(null_counts)

user_id                 0
time                    0
rating                  0
text              1085638
name_y                  0
address             19654
description       1527743
latitude                0
longitude               0
category             1381
avg_rating              0
num_of_reviews          0
price             1582625
state              206090
dtype: int64


Based on the above, the "text" column, which holds the comments of the users' reviews, has a large number of missing values (1,085,638 of 2,869,646 rows, about 38%). We decided to keep these rows, since we may be able to implement a method that fills in the missing comments from the rating entered by the user. The same applies to the "description", "address", "price" and "state" columns.
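
Raw null counts are easier to compare when shown next to the share of rows they represent. A minimal sketch on a made-up frame (the column names mirror the dataset, the values do not):

```python
import pandas as pd

df = pd.DataFrame({
    'rating': [5, 4, 3, 1],
    'text': ['great', None, None, 'bad'],
    'price': [None, None, None, '$$'],
})

# isnull().sum() counts missing values per column;
# isnull().mean() gives the fraction of missing values per column.
missing = pd.DataFrame({
    'n_missing': df.isnull().sum(),
    'pct_missing': (df.isnull().mean() * 100).round(1),
})
print(missing.sort_values('pct_missing', ascending=False))
```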



We unnest (explode) the 'category' column:

In [6]:
df_fl = df_fl.explode('category')
df_fl = df_fl.dropna(subset=['category'])
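
A minimal sketch of what `explode` does, on made-up data: each element of a list-valued cell becomes its own row, and the original row label is repeated for every element. This is why the row labels in the `head()` output further down appear more than once. The `reset_index` call is optional and is not something the notebook does.

```python
import pandas as pd

df = pd.DataFrame({
    'business': ['Clinic', 'Beach'],
    'category': [['Family practice physician', 'General practitioner'], ['Lodging']],
})

exploded = df.explode('category')
# The original row label is repeated for every list element.
print(exploded.index.tolist())

# Optional: rebuild a unique 0..n-1 index.
exploded = exploded.reset_index(drop=True)
```

Because every review is repeated once per category, counts and averages computed after this step weight businesses with many categories more heavily.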

In [7]:
# Check for duplicate rows:
df_fl.duplicated().sum()

502499

In [8]:
# Drop the duplicate rows:
df_fl = df_fl.drop_duplicates()
df_fl.shape

(9191314, 14)

We fill the missing values in the "price" column with 'No Price':

In [9]:
df_fl['price'] = df_fl['price'].fillna('No Price')


We convert the 'time' column to datetime:

In [10]:
df_fl['time'] = pd.to_datetime(df_fl['time'], unit='ms')
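
A small sketch of the conversion, assuming the `time` column holds Unix epoch milliseconds, as `unit='ms'` implies. The timestamp is an arbitrary example, and the millisecond value is derived from it so that the round trip can be checked.

```python
import pandas as pd

# Build an epoch-milliseconds value from a known timestamp so the round trip can be checked.
expected = pd.Timestamp('2021-08-03 15:07:30.740')
epoch_ms = expected.value // 10**6   # Timestamp.value is nanoseconds since the Unix epoch

times = pd.Series([epoch_ms])
converted = pd.to_datetime(times, unit='ms')
print(converted.iloc[0] == expected)
```

The result is timezone-naive and expressed in UTC, so local review times in Florida would need an explicit timezone conversion.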

In [11]:
df_fl.head()

Unnamed: 0,user_id,time,rating,text,name_y,address,description,latitude,longitude,category,avg_rating,num_of_reviews,price,state
0,1.014719e+20,2021-08-03 15:07:30.740,1,Update: Their “reply” to my review amounted to...,"Brian Shaheen, MD","Brian Shaheen, MD, 2421 Thomas Dr, Panama City...",,30.159982,-85.752277,Family practice physician,4.2,18,No Price,Open ⋅ Closes 5PM
0,1.014719e+20,2021-08-03 15:07:30.740,1,Update: Their “reply” to my review amounted to...,"Brian Shaheen, MD","Brian Shaheen, MD, 2421 Thomas Dr, Panama City...",,30.159982,-85.752277,General practitioner,4.2,18,No Price,Open ⋅ Closes 5PM
2,1.154772e+20,2020-07-18 00:13:37.005,5,He's a knowledgeable doctor but the way he run...,"Brian Shaheen, MD","Brian Shaheen, MD, 2421 Thomas Dr, Panama City...",,30.159982,-85.752277,Family practice physician,4.2,18,No Price,Open ⋅ Closes 5PM
2,1.154772e+20,2020-07-18 00:13:37.005,5,He's a knowledgeable doctor but the way he run...,"Brian Shaheen, MD","Brian Shaheen, MD, 2421 Thomas Dr, Panama City...",,30.159982,-85.752277,General practitioner,4.2,18,No Price,Open ⋅ Closes 5PM
4,1.01805e+20,2018-04-05 10:30:53.567,5,"Best doctor I've ever had, I never wait to be ...","Brian Shaheen, MD","Brian Shaheen, MD, 2421 Thomas Dr, Panama City...",,30.159982,-85.752277,Family practice physician,4.2,18,No Price,Open ⋅ Closes 5PM


In [12]:
df_fl.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9191314 entries, 0 to 2869643
Data columns (total 14 columns):
 #   Column          Dtype         
---  ------          -----         
 0   user_id         float64       
 1   time            datetime64[ns]
 2   rating          int64         
 3   text            object        
 4   name_y          object        
 5   address         object        
 6   description     object        
 7   latitude        float64       
 8   longitude       float64       
 9   category        object        
 10  avg_rating      float64       
 11  num_of_reviews  int64         
 12  price           object        
 13  state           object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(7)
memory usage: 1.0+ GB


In [13]:
# Rename the "name_y" column to "business name"
df_fl.rename(columns={'name_y': 'business name'}, inplace=True)

We filter for lodging-related businesses using the 'category' column:

In [14]:
# Build a list of keywords
keywords = ['Hotel', 'Hostel', 'Motel', 'Resort', 'Inn', 'Lodging', 'Lodge', 'Accommodation', 
            'Bed and Breakfast (B&B)', 'Guesthouse', 'Boutique Hotel', 'Vacation Rental', 
            'Homestay', 'Cabin', 'Suites', 'Spa Resort', 'Boutique Inn', 'Extended Stay', 
            'Boutique Accommodation', 'Retreat']


# Build a regex that matches any keyword as a whole word (the search below is case-insensitive)
import re
pattern = r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'


# Create a dataframe containing hotels only
fl_hotels = df_fl[df_fl['category'].str.contains(pattern, case=False, na=False)]
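
A reduced sketch of how this pattern behaves, using a shortened keyword list and made-up category strings. `\b` only matches between a word character and a non-word character, so a keyword that ends in `)` cannot satisfy the trailing `\b` when it sits at the end of the string. The lookaround variant `pattern_alt` is one possible alternative, offered as an assumption and not as the notebook's method.

```python
import re
import pandas as pd

keywords = ['Hotel', 'Inn', 'Bed and Breakfast (B&B)']
alternatives = '|'.join(re.escape(k) for k in keywords)

# Same construction as in the notebook cell above.
pattern = r'\b(?:' + alternatives + r')\b'
# Alternative (assumption): "not preceded or followed by a word character".
pattern_alt = r'(?<!\w)(?:' + alternatives + r')(?!\w)'

categories = pd.Series([
    'Hotel',
    'Extended stay hotel',
    'Dinner theater',            # contains "inn", but not as a whole word
    'Bed and Breakfast (B&B)',
    None,
])

mask = categories.str.contains(pattern, case=False, na=False)
mask_alt = categories.str.contains(pattern_alt, case=False, na=False)
print(mask.tolist())      # [True, True, False, False, False]
print(mask_alt.tolist())  # [True, True, False, True, False]
```

Whole-word matching keeps categories such as "Dinner theater" out of the result, but it is still a keyword heuristic. A look at `fl_hotels['category'].value_counts()` helps confirm that the matched categories really are lodging businesses.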



In [15]:
fl_hotels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8240 entries, 8580 to 2824135
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   user_id         8240 non-null   float64       
 1   time            8240 non-null   datetime64[ns]
 2   rating          8240 non-null   int64         
 3   text            5052 non-null   object        
 4   business name   8240 non-null   object        
 5   address         8240 non-null   object        
 6   description     3526 non-null   object        
 7   latitude        8240 non-null   float64       
 8   longitude       8240 non-null   float64       
 9   category        8240 non-null   object        
 10  avg_rating      8240 non-null   float64       
 11  num_of_reviews  8240 non-null   int64         
 12  price           8240 non-null   object        
 13  state           140 non-null    object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(7)
m

In [16]:
fl_hotels.describe()

Unnamed: 0,user_id,time,rating,latitude,longitude,avg_rating,num_of_reviews
count,8240.0,8240,8240.0,8240.0,8240.0,8240.0,8240.0
mean,1.093445e+20,2018-10-27 09:57:13.618773760,3.99284,28.301509,-81.7819,3.899114,209.320752
min,1.000044e+20,2010-06-01 13:19:56.098000,1.0,24.567741,-87.335831,1.7,8.0
25%,1.04909e+20,2017-11-20 22:39:54.828750080,3.0,27.471287,-82.284642,3.2,38.0
50%,1.091813e+20,2018-11-22 02:41:01.690000128,5.0,28.399987,-81.546144,4.2,75.0
75%,1.139757e+20,2019-10-01 04:09:53.748000,5.0,29.468722,-81.276165,4.5,238.0
max,1.184463e+20,2021-09-05 02:20:23.092000,5.0,30.790038,-80.058026,5.0,828.0
std,5.231066e+18,,1.369947,1.485378,1.390632,0.768984,269.216817


In [18]:
fl_hotels.head()

Unnamed: 0,user_id,time,rating,text,business name,address,description,latitude,longitude,category,avg_rating,num_of_reviews,price,state
8580,1.006212e+20,2020-11-02 20:17:10.289,5,"The beach is simply beautiful, totally recomme...",Anna Maria Beach,"Anna Maria Beach, Holmes Beach, FL 34217",,27.497579,-82.712635,Lodging,4.8,8,No Price,
8582,1.010228e+20,2021-01-12 01:37:06.400,5,If you want to feel lile you are living in an ...,Anna Maria Beach,"Anna Maria Beach, Holmes Beach, FL 34217",,27.497579,-82.712635,Lodging,4.8,8,No Price,
8584,1.034052e+20,2021-03-08 02:05:03.569,5,"Quiet beach, highly recommend",Anna Maria Beach,"Anna Maria Beach, Holmes Beach, FL 34217",,27.497579,-82.712635,Lodging,4.8,8,No Price,
8586,1.118156e+20,2021-05-15 19:45:25.511,5,(Translated by Google) The ideal place to admi...,Anna Maria Beach,"Anna Maria Beach, Holmes Beach, FL 34217",,27.497579,-82.712635,Lodging,4.8,8,No Price,
8588,1.118296e+20,2020-08-28 00:58:58.361,4,,Anna Maria Beach,"Anna Maria Beach, Holmes Beach, FL 34217",,27.497579,-82.712635,Lodging,4.8,8,No Price,


In [19]:
fl_hotels.to_parquet('fl_hotels.parquet', index=False)