### Description of the process for merging the metadata and review datasets:

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

We load the JSON files and concatenate them:

In [2]:


file_paths = ['Google Maps/reviews-estados/review-California/{}.json'.format(i) for i in range(1, 19)]

dfs = [pd.read_json(file_path, lines=True) for file_path in file_paths]

# Concatenate all the datasets, keeping only the columns they share:
dfc = pd.concat(dfs, axis=0, join='inner')
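
As a rough illustration of the loading step above, the following sketch reads a couple of JSON Lines files and stacks them. The helper name `load_review_parts` and the tiny temporary files are hypothetical stand-ins, not part of the original notebook, and `ignore_index=True` is an optional addition that gives the result a fresh 0..n-1 index.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_review_parts(paths):
    """Read JSON Lines files and stack them into one DataFrame (hypothetical helper)."""
    frames = [pd.read_json(p, lines=True) for p in paths]
    # join='inner' keeps only the columns present in every file.
    return pd.concat(frames, axis=0, join='inner', ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    part1 = Path(tmp) / '1.json'
    part2 = Path(tmp) / '2.json'
    part1.write_text('{"gmap_id": "a", "rating": 5, "pics": null}\n'
                     '{"gmap_id": "b", "rating": 3, "pics": null}\n')
    part2.write_text('{"gmap_id": "c", "rating": 4}\n')
    reviews = load_review_parts([part1, part2])

print(reviews.shape)  # (3, 2): 'pics' is dropped because the second file lacks it
```

Because of `join='inner'`, a column that is missing from even one file disappears from the result, so it may be worth comparing the column sets of the parts before concatenating.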

We decided to combine the metadata with the reviews. To do so, we use the metadata groups
that we had previously split apart to make them easier to handle:

In [4]:
# Load the metadata1 dataset
metadata1 = pd.read_parquet('Google Maps/metadata-sitios/metadata1.parquet')


We merge the two dataframes on 'gmap_id':

In [5]:

Cal1 = pd.merge(dfc, metadata1, on='gmap_id')
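
To make the behaviour of this merge more concrete, here is a minimal sketch on two hypothetical miniature tables. It shows where the `name_x` and `name_y` columns come from (both inputs have a `name` column) and how `validate` and `indicator`, which the notebook does not use, could help check the join.

```python
import pandas as pd

# Hypothetical miniature versions of the reviews and metadata tables.
reviews = pd.DataFrame({
    'gmap_id': ['a', 'a', 'b', 'z'],
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],      # reviewer name
    'rating': [5, 4, 3, 1],
})
sites = pd.DataFrame({
    'gmap_id': ['a', 'b', 'c'],
    'name': ['Cafe A', 'Hotel B', 'Bar C'],   # business name
})

# Inner join (the default): only reviews whose gmap_id exists in `sites` survive.
# validate='many_to_one' raises MergeError if `sites` repeats a gmap_id.
merged = pd.merge(reviews, sites, on='gmap_id', how='inner', validate='many_to_one')
print(merged.columns.tolist())  # ['gmap_id', 'name_x', 'rating', 'name_y']

# A left join with indicator=True shows how many reviews found no metadata.
check = pd.merge(reviews, sites, on='gmap_id', how='left', indicator=True)
unmatched = int((check['_merge'] == 'left_only').sum())
print(len(merged), unmatched)  # 3 1
```

If the metadata contains repeated `gmap_id` values, an unvalidated merge silently multiplies the matching review rows, which is one possible source of duplicates later on.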

In [6]:
# Save the resulting dataframe to a .parquet file
Cal1.to_parquet('Cal1.parquet', index=False)

In [7]:
# Free memory:
del Cal1, metadata1

In [8]:

# Load the metadata2 dataset
metadata2 = pd.read_parquet('Google Maps/metadata-sitios/metadata2.parquet')

In [9]:
# Merge the two dataframes
Cal2 = pd.merge(dfc, metadata2, on='gmap_id')

In [10]:
Cal2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256627 entries, 0 to 256626
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           256627 non-null  float64
 1   name_x            256627 non-null  object 
 2   time              256627 non-null  int64  
 3   rating            256627 non-null  int64  
 4   text              159832 non-null  object 
 5   pics              11168 non-null   object 
 6   resp              31838 non-null   object 
 7   gmap_id           256627 non-null  object 
 8   name_y            256627 non-null  object 
 9   address           255866 non-null  object 
 10  description       139631 non-null  object 
 11  latitude          256627 non-null  float64
 12  longitude         256627 non-null  float64
 13  category          256616 non-null  object 
 14  avg_rating        256627 non-null  float64
 15  num_of_reviews    256627 non-null  int64  
 16  price             11

In [12]:
# Save the resulting dataframe to a .parquet file
Cal2.to_parquet('Cal2.parquet', index=False)

In [13]:
# Free memory
del Cal2, metadata2

In [3]:
# Load the metadata3 dataset
metadata3 = pd.read_parquet('Google Maps/metadata-sitios/metadata3.parquet')

In [4]:
# Merge the two dataframes
Cal3 = pd.merge(dfc, metadata3, on='gmap_id')

In [5]:
Cal3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179747 entries, 0 to 179746
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           179747 non-null  float64
 1   name_x            179747 non-null  object 
 2   time              179747 non-null  int64  
 3   rating            179747 non-null  int64  
 4   text              111242 non-null  object 
 5   pics              6765 non-null    object 
 6   resp              26300 non-null   object 
 7   gmap_id           179747 non-null  object 
 8   name_y            179747 non-null  object 
 9   address           179285 non-null  object 
 10  description       41502 non-null   object 
 11  latitude          179747 non-null  float64
 12  longitude         179747 non-null  float64
 13  category          179690 non-null  object 
 14  avg_rating        179747 non-null  float64
 15  num_of_reviews    179747 non-null  int64  
 16  price             47

In [6]:
# Save the resulting dataframe to a .parquet file
Cal3.to_parquet('Cal3.parquet', index=False)
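
The three merge steps above repeat the same pattern, so they could be expressed as a loop. The sketch below is one possible way to do that. `merge_by_chunks` is a hypothetical helper and the small frames are made-up data. Unlike the notebook, it keeps the merged parts in memory instead of writing each one to disk.

```python
import pandas as pd


def merge_by_chunks(reviews, metadata_chunks, key='gmap_id'):
    """Merge `reviews` with each metadata chunk in turn and stack the results."""
    parts = []
    for chunk in metadata_chunks:          # chunks are consumed one at a time
        parts.append(pd.merge(reviews, chunk, on=key))
    return pd.concat(parts, axis=0, join='inner', ignore_index=True)


# Hypothetical miniature data standing in for dfc and the metadata files.
reviews = pd.DataFrame({'gmap_id': ['a', 'b', 'c'], 'rating': [5, 3, 4]})
chunks = [
    pd.DataFrame({'gmap_id': ['a'], 'avg_rating': [4.5]}),
    pd.DataFrame({'gmap_id': ['c', 'd'], 'avg_rating': [3.9, 2.0]}),
]
combined = merge_by_chunks(reviews, iter(chunks))
print(combined['gmap_id'].tolist())  # ['a', 'c']
```

With real files, passing a generator such as `(pd.read_parquet(p) for p in paths)` would load only one metadata chunk at a time. If memory is tight, writing each merged part to disk as the notebook does remains the safer choice.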

In [3]:
df_ca = pd.concat([pd.read_parquet(f'Cal{i}.parquet') for i in range(1, 4)], axis=0, join='inner')

We save the final dataset for the state of California:

In [4]:
df_ca.to_parquet('df_ca.parquet', index=False)

___________________________________________________________________________________________________________

### Exploratory Data Analysis - Google reviews: State of California

This notebook presents an exploratory analysis of Google Maps reviews of lodging-related businesses in the state of California.


We load the previously concatenated California dataset:

In [3]:
# Load the dataset
df_ca = pd.read_parquet('df_ca.parquet')

In [4]:
# Check the dataframe's info and size
df_ca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482596 entries, 0 to 482595
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           482596 non-null  float64
 1   name_x            482596 non-null  object 
 2   time              482596 non-null  int64  
 3   rating            482596 non-null  int64  
 4   text              300916 non-null  object 
 5   pics              19435 non-null   object 
 6   resp              66383 non-null   object 
 7   gmap_id           482596 non-null  object 
 8   name_y            482596 non-null  object 
 9   address           480769 non-null  object 
 10  description       186592 non-null  object 
 11  latitude          482596 non-null  float64
 12  longitude         482596 non-null  float64
 13  category          482528 non-null  object 
 14  avg_rating        482596 non-null  float64
 15  num_of_reviews    482596 non-null  int64  
 16  price             16

We first examined the columns 'name_x', 'pics', 'resp', 'MISC', 'gmap_id', 'relative_results' and 'url', and decided to drop them for the following reasons:
'name_x': User name; irrelevant to our study.
'pics'  : Photos taken by users. We drop them because we do not consider them necessary.
'resp'  : The business's reply to the user's review. Very few rows have a value.
'MISC'  : Service options. Not relevant to our study.
'gmap_id':  Google codes with no relevant meaning.
'relative_results': Google codes with no relevant meaning.
'url'   : Web address of the business. Not relevant.


In [4]:
# Drop the columns that will not be used in the analysis:

df_ca= df_ca.drop(['name_x','pics', 'resp', 'MISC', 'url', 'relative_results', 'gmap_id', 'hours'], axis=1)

We check the number of null values:

In [5]:
null_counts = df_ca.isnull().sum()
print(null_counts)

user_id                 0
time                    0
rating                  0
text               507434
name_y                 26
address              7574
description       1066984
latitude                0
longitude               0
category             2214
avg_rating              0
num_of_reviews          0
price             1074162
state              137642
dtype: int64


As shown above, the "text" column, which holds the review comments written by users, has a large number of missing values. We decided to keep these rows, since we may be able to implement a method to fill in the missing comments based on the rating the user entered. The same applies to the "description", "address", "price" and "state" columns.
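
Raw null counts are easier to compare when expressed as a share of the rows. The snippet below is a small sketch of that idea on a made-up frame. The column names mirror the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same kind of gaps as the real columns.
df = pd.DataFrame({
    'rating': [5, 4, 3, 1],
    'text': ['Great', None, None, 'Bad'],
    'price': [None, None, None, '$$'],
})

# isna().mean() gives the fraction of missing values per column.
null_share = df.isna().mean().mul(100).round(1)
print(null_share.to_dict())  # {'rating': 0.0, 'text': 50.0, 'price': 75.0}
```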



We unnest (explode) the 'category' column so that each category gets its own row:

In [6]:
df_ca = df_ca.explode('category')
df_ca = df_ca.dropna(subset=['category'])
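
The effect of `explode` followed by `dropna` can be seen on a tiny, hypothetical frame. Note that the original index label is repeated for every category of a row, which is why the index of the exploded dataframe is no longer a simple range.

```python
import pandas as pd

# Hypothetical businesses: one with two categories, one with none, one with one.
df = pd.DataFrame({
    'business': ['A', 'B', 'C'],
    'category': [['Hotel', 'Inn'], None, ['Cafe']],
})

exploded = df.explode('category').dropna(subset=['category'])
print(exploded['category'].tolist())  # ['Hotel', 'Inn', 'Cafe']
print(exploded.index.tolist())        # [0, 0, 2]
```

Calling `reset_index(drop=True)` afterwards would restore a unique index if later steps rely on one.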

In [7]:
# Count duplicate rows:
df_ca.duplicated().sum()

1527812

In [8]:
# Drop duplicate rows:
df_ca= df_ca.drop_duplicates()
df_ca.shape

(1330208, 14)
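
For reference, `duplicated()` flags every repeat after the first occurrence of a row, so its sum equals the number of rows that `drop_duplicates()` removes. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical rows: the second row repeats the first exactly.
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'category': ['Hotel', 'Hotel', 'Hotel', 'Inn'],
})

n_dupes = int(df.duplicated().sum())   # rows identical to an earlier row
deduped = df.drop_duplicates()
print(n_dupes, deduped.shape)  # 1 (3, 2)
```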

We fill the missing values in the "price" column with 'No Price':

In [9]:
df_ca['price'] = df_ca['price'].fillna('No Price')


We convert the 'time' column to datetime:

In [10]:
df_ca['time'] = pd.to_datetime(df_ca['time'], unit='ms')
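
The `unit='ms'` argument tells pandas that the integers are milliseconds since the Unix epoch (1970-01-01 UTC). The value below is a hypothetical raw timestamp used only to illustrate the conversion.

```python
import pandas as pd

# Hypothetical raw value: milliseconds since 1970-01-01 UTC.
raw = pd.Series([1609909927056])

as_datetime = pd.to_datetime(raw, unit='ms')
expected = pd.Timestamp('2021-01-06 05:12:07.056')
print(as_datetime.iloc[0] == expected)  # True
```

The resulting timestamps are timezone-naive and expressed in UTC, so they would need an explicit conversion before being read as California local time.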

In [11]:
df_ca.head()

Unnamed: 0,user_id,time,rating,text,name_y,address,description,latitude,longitude,category,avg_rating,num_of_reviews,price,state
0,1.089912e+20,2021-01-06 05:12:07.056,5,Love there korean rice cake.,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",,34.058092,-118.29213,Korean restaurant,4.4,18,No Price,Open ⋅ Closes 6PM
2,1.112903e+20,2021-02-09 05:47:28.663,5,Good very good,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",,34.058092,-118.29213,Korean restaurant,4.4,18,No Price,Open ⋅ Closes 6PM
4,1.126404e+20,2020-03-08 05:04:42.296,4,They make Korean traditional food very properly.,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",,34.058092,-118.29213,Korean restaurant,4.4,18,No Price,Open ⋅ Closes 6PM
6,1.174403e+20,2019-03-07 05:56:56.355,5,Short ribs are very delicious.,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",,34.058092,-118.29213,Korean restaurant,4.4,18,No Price,Open ⋅ Closes 6PM
8,1.005808e+20,2017-05-16 05:01:41.933,5,Great food and prices the portions are large,San Soo Dang,"San Soo Dang, 761 S Vermont Ave, Los Angeles, ...",,34.058092,-118.29213,Korean restaurant,4.4,18,No Price,Open ⋅ Closes 6PM


In [12]:
df_ca.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1330208 entries, 0 to 614558
Data columns (total 14 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   user_id         1330208 non-null  float64       
 1   time            1330208 non-null  datetime64[ns]
 2   rating          1330208 non-null  int64         
 3   text            801500 non-null   object        
 4   name_y          1330195 non-null  object        
 5   address         1319592 non-null  object        
 6   description     253243 non-null   object        
 7   latitude        1330208 non-null  float64       
 8   longitude       1330208 non-null  float64       
 9   category        1330208 non-null  object        
 10  avg_rating      1330208 non-null  float64       
 11  num_of_reviews  1330208 non-null  int64         
 12  price           1330208 non-null  object        
 13  state           1231676 non-null  object        
dtypes: datetime64[ns](1), fl

In [13]:
# Rename the "name_y" column to "business name"
df_ca.rename(columns={'name_y': 'business name'}, inplace=True)

We filter for lodging-related businesses using the 'category' column:

In [17]:
# Create a list of keywords
keywords = ['Hotel', 'Hostel', 'Motel', 'Resort', 'Inn', 'Lodging', 'Lodge', 'Accommodation', 
            'Bed and Breakfast (B&B)', 'Guesthouse', 'Boutique Hotel', 'Vacation Rental', 
            'Homestay', 'Cabin', 'Suites', 'Spa Resort', 'Boutique Inn', 'Extended Stay', 
            'Boutique Accommodation', 'Retreat']


# Build a regex that matches any of the keywords as a whole word
import re
pattern = r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'


# Create a dataframe containing only lodging businesses
ca_hotels = df_ca[df_ca['category'].str.contains(pattern, case=False, na=False)]
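
To see what this pattern accepts, the sketch below applies the same regex to a few category strings with the standard `re` module. The helper `is_lodging` and the test strings are hypothetical. The `\b` anchors keep 'Inn' from matching inside 'Innovation', but they also mean the keyword 'Bed and Breakfast (B&B)' never matches at the end of a string, because no word boundary follows the closing parenthesis.

```python
import re

keywords = ['Hotel', 'Hostel', 'Motel', 'Resort', 'Inn', 'Lodging', 'Lodge', 'Accommodation',
            'Bed and Breakfast (B&B)', 'Guesthouse', 'Boutique Hotel', 'Vacation Rental',
            'Homestay', 'Cabin', 'Suites', 'Spa Resort', 'Boutique Inn', 'Extended Stay',
            'Boutique Accommodation', 'Retreat']

pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
matcher = re.compile(pattern, flags=re.IGNORECASE)


def is_lodging(category):
    """Return True if the category contains one of the keywords as a whole word."""
    return matcher.search(category) is not None


for cat in ['Resort hotel', 'Indoor lodging', 'Korean restaurant',
            'Innovation center', 'Bed and Breakfast (B&B)']:
    print(cat, is_lodging(cat))
# Resort hotel True
# Indoor lodging True
# Korean restaurant False
# Innovation center False
# Bed and Breakfast (B&B) False
```

If bed-and-breakfast categories matter for the analysis, a simpler keyword such as 'Breakfast' might be worth testing against the actual category values.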



In [18]:
ca_hotels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2658 entries, 31716 to 612711
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   user_id         2658 non-null   float64       
 1   time            2658 non-null   datetime64[ns]
 2   rating          2658 non-null   int64         
 3   text            1521 non-null   object        
 4   business name   2658 non-null   object        
 5   address         2643 non-null   object        
 6   description     560 non-null    object        
 7   latitude        2658 non-null   float64       
 8   longitude       2658 non-null   float64       
 9   category        2658 non-null   object        
 10  avg_rating      2658 non-null   float64       
 11  num_of_reviews  2658 non-null   int64         
 12  price           2658 non-null   object        
 13  state           111 non-null    object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(7)
m

In [19]:
ca_hotels.describe()

Unnamed: 0,user_id,time,rating,latitude,longitude,avg_rating,num_of_reviews
count,2658.0,2658,2658.0,2658.0,2658.0,2658.0,2658.0
mean,1.097295e+20,2018-09-15 08:03:35.470112512,3.920993,35.706194,-119.284191,3.885779,46.676072
min,1.000041e+20,2009-12-30 05:24:16.369000,1.0,32.679144,-124.067515,2.2,13.0
25%,1.057088e+20,2017-08-23 21:12:33.820999936,3.0,33.946246,-120.885765,3.3,27.0
50%,1.095537e+20,2018-10-24 00:17:07.139000064,5.0,34.265423,-118.420527,3.9,38.0
75%,1.143784e+20,2019-09-15 23:26:49.742000128,5.0,37.791257,-117.864265,4.5,55.0
max,1.184397e+20,2021-08-15 23:47:37.763000,5.0,41.728145,-114.731168,5.0,138.0
std,5.204594e+18,,1.368417,2.386993,1.981616,0.7065,29.103142


In [21]:
ca_hotels.head()

Unnamed: 0,user_id,time,rating,text,business name,address,description,latitude,longitude,category,avg_rating,num_of_reviews,price,state
31716,1.089268e+20,2021-06-13 02:27:17.434,4,Great pet friendly place on the American River...,Coloma Cottages,"Coloma Cottages, 5941 New River Rd, Coloma, CA...",,38.799352,-120.885765,Resort hotel,5.0,48,No Price,
31716,1.089268e+20,2021-06-13 02:27:17.434,4,Great pet friendly place on the American River...,Coloma Cottages,"Coloma Cottages, 5941 New River Rd, Coloma, CA...",,38.799352,-120.885765,Hotel,5.0,48,No Price,
31716,1.089268e+20,2021-06-13 02:27:17.434,4,Great pet friendly place on the American River...,Coloma Cottages,"Coloma Cottages, 5941 New River Rd, Coloma, CA...",,38.799352,-120.885765,Indoor lodging,5.0,48,No Price,
31716,1.089268e+20,2021-06-13 02:27:17.434,4,Great pet friendly place on the American River...,Coloma Cottages,"Coloma Cottages, 5941 New River Rd, Coloma, CA...",,38.799352,-120.885765,Inn,5.0,48,No Price,
31716,1.089268e+20,2021-06-13 02:27:17.434,4,Great pet friendly place on the American River...,Coloma Cottages,"Coloma Cottages, 5941 New River Rd, Coloma, CA...",,38.799352,-120.885765,Lodge,5.0,48,No Price,
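
The rows above show a single review repeated once per matching category, a side effect of exploding 'category' before filtering. If the analysis needs one row per review, a sketch like the following could collapse them. It assumes that 'user_id', 'time' and 'business name' together identify a review, which the notebook does not state, and the sample data is made up.

```python
import pandas as pd

# Hypothetical sample: the first review appears under three categories.
hotels = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'time': pd.to_datetime(['2021-06-13', '2021-06-13', '2021-06-13', '2020-01-01']),
    'business name': ['Coloma Cottages'] * 3 + ['Other Inn'],
    'category': ['Resort hotel', 'Hotel', 'Inn', 'Inn'],
})

review_key = ['user_id', 'time', 'business name']   # assumed to identify a review
one_per_review = hotels.drop_duplicates(subset=review_key)
print(len(hotels), len(one_per_review))  # 4 2
```

Keeping the repeated rows inflates statistics such as the mean rating in favour of businesses with many matching categories, so the choice depends on what is being measured.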


In [22]:
ca_hotels.to_parquet('ca_hotels.parquet', index=False)