### Description of the process for merging the metadata and review datasets:

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

We load the JSON files and concatenate them:

In [2]:


file_paths = ['Google Maps/reviews-estados/review-Nevada/{}.json'.format(i) for i in range(1, 13)]

dfs = [pd.read_json(file_path, lines=True) for file_path in file_paths]

# Concatenate all the datasets:
dfc = pd.concat(dfs, axis=0, join='inner')
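
A minimal sketch of the same load-and-concatenate step, run on two tiny made-up JSON Lines files written to a temporary directory. The file names and contents are hypothetical. `ignore_index=True` is an optional addition (the cell above does not use it) that gives the concatenated frame a fresh 0..n-1 index instead of repeating each file's own index.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data: two small JSON Lines files, one review per line.
tmp_dir = Path(tempfile.mkdtemp())
samples = {
    "1.json": [{"gmap_id": "a", "rating": 5}, {"gmap_id": "b", "rating": 3}],
    "2.json": [{"gmap_id": "a", "rating": 4}],
}
for name, rows in samples.items():
    with open(tmp_dir / name, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

# Read every file as JSON Lines and stack the rows into one frame.
paths = sorted(tmp_dir.glob("*.json"))
parts = [pd.read_json(p, lines=True) for p in paths]
reviews = pd.concat(parts, axis=0, join="inner", ignore_index=True)

print(reviews.shape)
```

With `join='inner'`, only the columns present in every file survive the concatenation.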

We found it convenient to combine the metadata with the reviews. To do so, we use the metadata groups
that we had previously split up to make them easier to handle:

In [4]:
# Load the metadata1 dataset
metadata1 = pd.read_parquet('Google Maps/metadata-sitios/metadata1.parquet')


We merge the two dataframes:

In [5]:

NV1 = pd.merge(dfc, metadata1, on='gmap_id')
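
The merge above relies on the pandas default `how='inner'`, so reviews whose `gmap_id` is absent from the metadata chunk are dropped, and columns that share a name in both frames receive the `_x` / `_y` suffixes. The sketch below, on made-up frames, shows how `indicator=True` and `validate` could be used to count the dropped reviews and to guard against repeated `gmap_id` values in the metadata. The sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical reviews (many per place) and place metadata (one row per place).
reviews = pd.DataFrame({
    "gmap_id": ["a", "a", "b", "z"],
    "name": ["user1", "user2", "user3", "user4"],
    "rating": [5, 4, 3, 1],
})
places = pd.DataFrame({
    "gmap_id": ["a", "b", "c"],
    "name": ["Place A", "Place B", "Place C"],
})

# Same call as in the notebook: default inner join on the shared key.
inner = pd.merge(reviews, places, on="gmap_id")

# Audit variant: keep every review, flag its origin, and require unique keys
# on the metadata side (raises MergeError otherwise).
audit = pd.merge(
    reviews, places, on="gmap_id",
    how="left", indicator=True, validate="many_to_one",
)
unmatched = int((audit["_merge"] == "left_only").sum())

print(len(inner), unmatched)
```

Here the review for `gmap_id` "z" has no metadata, so it disappears from `inner` and is counted in `unmatched`.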

In [6]:
# Save the resulting dataframe to a .parquet file
NV1.to_parquet('NV1.parquet', index=False)

In [7]:
# Free memory:
del NV1, metadata1
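
Before deleting a frame it can help to see how much memory it actually holds. A minimal sketch on a made-up frame, using `DataFrame.memory_usage(deep=True)`. The explicit `gc.collect()` call is optional and is an addition, not part of the original cell.

```python
import gc

import pandas as pd

# Hypothetical frame standing in for a merged chunk.
demo = pd.DataFrame({
    "gmap_id": ["a", "b", "c"] * 1000,
    "rating": [5, 4, 3] * 1000,
})

# deep=True also counts the Python string objects, not just the pointers.
size_mb = demo.memory_usage(deep=True).sum() / 1024**2
print(f"{size_mb:.3f} MiB")

# Drop the reference and ask the garbage collector to run.
del demo
gc.collect()
```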

In [8]:

# Load the metadata2 dataset
metadata2 = pd.read_parquet('Google Maps/metadata-sitios/metadata2.parquet')

In [9]:
# Merge the two dataframes
NV2 = pd.merge(dfc, metadata2, on='gmap_id')

In [10]:
NV2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256627 entries, 0 to 256626
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           256627 non-null  float64
 1   name_x            256627 non-null  object 
 2   time              256627 non-null  int64  
 3   rating            256627 non-null  int64  
 4   text              159832 non-null  object 
 5   pics              11168 non-null   object 
 6   resp              31838 non-null   object 
 7   gmap_id           256627 non-null  object 
 8   name_y            256627 non-null  object 
 9   address           255866 non-null  object 
 10  description       139631 non-null  object 
 11  latitude          256627 non-null  float64
 12  longitude         256627 non-null  float64
 13  category          256616 non-null  object 
 14  avg_rating        256627 non-null  float64
 15  num_of_reviews    256627 non-null  int64  
 16  price             11

In [12]:
# Save the resulting dataframe to a .parquet file
NV2.to_parquet('NV2.parquet', index=False)

In [13]:
# Free memory
del NV2, metadata2

In [3]:
# Load the metadata3 dataset
metadata3 = pd.read_parquet('Google Maps/metadata-sitios/metadata3.parquet')

In [4]:
# Merge the two dataframes
NV3 = pd.merge(dfc, metadata3, on='gmap_id')

In [5]:
NV3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179747 entries, 0 to 179746
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           179747 non-null  float64
 1   name_x            179747 non-null  object 
 2   time              179747 non-null  int64  
 3   rating            179747 non-null  int64  
 4   text              111242 non-null  object 
 5   pics              6765 non-null    object 
 6   resp              26300 non-null   object 
 7   gmap_id           179747 non-null  object 
 8   name_y            179747 non-null  object 
 9   address           179285 non-null  object 
 10  description       41502 non-null   object 
 11  latitude          179747 non-null  float64
 12  longitude         179747 non-null  float64
 13  category          179690 non-null  object 
 14  avg_rating        179747 non-null  float64
 15  num_of_reviews    179747 non-null  int64  
 16  price             47

In [6]:
# Save the resulting dataframe to a .parquet file
NV3.to_parquet('NV3.parquet', index=False)

In [3]:
df_nv = pd.concat([pd.read_parquet(f'NV{i}.parquet') for i in range(1, 4)], axis=0, join='inner')

We save the final dataset for the state of Nevada:

In [4]:
df_nv.to_parquet('df_nv.parquet', index=False)
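
The load, merge and save steps above are repeated once per metadata file. The same idea can be sketched as a loop, shown here on in-memory toy frames so that it runs without the original files. The function name `merge_in_chunks` and the sample data are hypothetical; in the notebook each chunk would come from `pd.read_parquet`.

```python
import pandas as pd


def merge_in_chunks(reviews, metadata_chunks, key="gmap_id"):
    """Inner-merge the reviews with each metadata chunk and stack the results."""
    merged_parts = []
    for chunk in metadata_chunks:
        merged_parts.append(pd.merge(reviews, chunk, on=key))
    return pd.concat(merged_parts, axis=0, join="inner", ignore_index=True)


# Hypothetical reviews and two metadata chunks.
reviews = pd.DataFrame({"gmap_id": ["a", "b", "c", "x"], "rating": [5, 4, 3, 2]})
chunks = [
    pd.DataFrame({"gmap_id": ["a"], "category": ["Hotel"]}),
    pd.DataFrame({"gmap_id": ["b", "c"], "category": ["Motel", "Cafe"]}),
]

merged = merge_in_chunks(reviews, chunks)
print(merged)
```

Processing one chunk at a time keeps only a single metadata frame in memory, which is the reason the notebook splits the work this way.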

___________________________________________________________________________________________________________

### Exploratory Data Analysis - Google reviews: State of Nevada

In this notebook we carry out an exploratory analysis of Google Maps reviews of lodging-related businesses in the state of Nevada.


We load the Nevada dataset that was concatenated previously:

In [8]:
# Load the dataset
df_nv = pd.read_parquet('df_nv.parquet')

In [4]:
# Check the contents and size of the dataframe
df_nv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482596 entries, 0 to 482595
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           482596 non-null  float64
 1   name_x            482596 non-null  object 
 2   time              482596 non-null  int64  
 3   rating            482596 non-null  int64  
 4   text              300916 non-null  object 
 5   pics              19435 non-null   object 
 6   resp              66383 non-null   object 
 7   gmap_id           482596 non-null  object 
 8   name_y            482596 non-null  object 
 9   address           480769 non-null  object 
 10  description       186592 non-null  object 
 11  latitude          482596 non-null  float64
 12  longitude         482596 non-null  float64
 13  category          482528 non-null  object 
 14  avg_rating        482596 non-null  float64
 15  num_of_reviews    482596 non-null  int64  
 16  price             16

First we examined the columns 'name_x', 'pics', 'resp', 'MISC', 'gmap_id', 'relative_results' and 'url' and decided to drop them for the following reasons:
'name_x': User name; irrelevant to our study.
'pics'  : Photographs taken by users. We drop it because we do not consider it necessary.
'resp'  : The business's reply to the user's review. Very few rows have a value (66,383 of 482,596).
'MISC'  : Service options. Not relevant to our study.
'gmap_id':  Google identifiers with no relevant meaning for the analysis.
'relative_results': Google identifiers with no relevant meaning for the analysis.
'url'   : Web address of the business. Not relevant.


In [9]:
# Drop the columns that will not be used in the analysis:

df_nv = df_nv.drop(['name_x','pics', 'resp', 'MISC', 'url', 'relative_results', 'gmap_id', 'hours'], axis=1)
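
A small sketch of the drop step on a made-up frame. Passing `errors='ignore'` is an optional variation (the cell above does not use it) that lets the same call succeed even when some of the listed columns are absent, for example when the cell is re-run after the columns were already removed.

```python
import pandas as pd

# Hypothetical frame that contains only some of the columns to drop.
demo = pd.DataFrame({
    "name_x": ["user1"],
    "pics": [None],
    "rating": [5],
    "text": ["ok"],
})
unused = ["name_x", "pics", "resp", "MISC", "url",
          "relative_results", "gmap_id", "hours"]

# Report which of the listed columns are not present, then drop the rest.
missing = [col for col in unused if col not in demo.columns]
demo = demo.drop(columns=unused, errors="ignore")

print(list(demo.columns), missing)
```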

We check the number of null values:

In [10]:
null_counts = df_nv.isnull().sum()
print(null_counts)

user_id                0
time                   0
rating                 0
text              181680
name_y                 0
address             1827
description       296004
latitude               0
longitude              0
category              68
avg_rating             0
num_of_reviews         0
price             317555
state              37249
dtype: int64
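
Raw counts are easier to compare as a share of the rows. A minimal sketch on a made-up frame that puts the null count and the percentage side by side; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame with missing values in two columns.
demo = pd.DataFrame({
    "rating": [5, 4, 3, 1],
    "text": ["good", None, None, "bad"],
    "price": [None, None, None, "$$"],
})

null_counts = demo.isnull().sum()
null_pct = (demo.isnull().mean() * 100).round(1)

# One row per column, most incomplete columns first.
summary = (
    pd.DataFrame({"nulls": null_counts, "pct": null_pct})
    .sort_values("nulls", ascending=False)
)
print(summary)
```

Applied to the counts printed above, 181,680 missing `text` values out of 482,596 rows is about 37.6%.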


According to the output above, the "text" column, which holds the comments of the reviews written by users, has a large number of missing values (181,680). We decided to keep these rows, since we may be able to implement some method to fill in the comments from the rating entered by the user. The same applies to the "description", "address", "price" and "state" columns.



We unnest (explode) the 'category' column:

In [11]:
df_nv = df_nv.explode('category')
df_nv = df_nv.dropna(subset=['category'])
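
`explode` turns each element of a list-valued cell into its own row and repeats the other columns, which is why the row count grows after this step and why index labels repeat. A toy illustration with made-up data:

```python
import pandas as pd

# Hypothetical businesses; 'category' holds a list (or nothing) per row.
demo = pd.DataFrame({
    "business": ["Castle Coffee", "Sunset Motel", "Unknown"],
    "category": [["Coffee shop", "Cafe"], ["Motel"], None],
})

# One row per category; rows without any category are then removed.
exploded = demo.explode("category")
exploded = exploded.dropna(subset=["category"])

print(exploded)
```

The original index is kept, so the two rows that came from "Castle Coffee" both carry label 0. Calling `reset_index(drop=True)` afterwards would give unique labels if they are needed.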

In [12]:
# Check for duplicate rows:
df_nv.duplicated().sum()

58395

In [13]:
# Drop duplicate rows:
df_nv = df_nv.drop_duplicates()
df_nv.shape

(1375620, 14)
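
A toy illustration of the duplicate check: `duplicated()` flags every repeat of a fully identical row (the first occurrence is not flagged), and `drop_duplicates()` removes the flagged rows. The data is made up.

```python
import pandas as pd

# Hypothetical frame in which the first two rows are identical.
demo = pd.DataFrame({
    "user_id": [1, 1, 2],
    "rating": [5, 5, 5],
    "category": ["Hotel", "Hotel", "Hotel"],
})

n_dupes = int(demo.duplicated().sum())
deduped = demo.drop_duplicates()

print(n_dupes, deduped.shape)
```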

We fill the missing values in the "price" column with 'No Price':

In [14]:
df_nv['price'] = df_nv['price'].fillna('No Price')


We convert the "time" column to datetime:

In [15]:
df_nv['time'] = pd.to_datetime(df_nv['time'], unit='ms')
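
The `time` column holds Unix timestamps in milliseconds, which is why `unit='ms'` is passed. A minimal sketch of the conversion for a single value; the sample number is chosen to reproduce the first timestamp in the `head()` output of this notebook.

```python
import pandas as pd

# Milliseconds elapsed since 1970-01-01 00:00:00 UTC.
raw_ms = pd.Series([1622740504476])

# The result is a timezone-naive timestamp expressed in UTC.
converted = pd.to_datetime(raw_ms, unit="ms")

print(converted.iloc[0])  # 2021-06-03 17:15:04.476000
```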

In [16]:
df_nv.head()

Unnamed: 0,user_id,time,rating,text,name_y,address,description,latitude,longitude,category,avg_rating,num_of_reviews,price,state
0,1.028432e+20,2021-06-03 17:15:04.476,5,No frills smaller coffee counter in the corner...,Castle Coffee,"Castle Coffee, Excalibur Hotel and Casino, 385...",,36.099575,-115.176338,Coffee shop,3.2,24,No Price,Open ⋅ Closes 2PM
2,1.149544e+20,2019-11-07 20:43:33.820,1,Save yourself the frustration and just ignore ...,Castle Coffee,"Castle Coffee, Excalibur Hotel and Casino, 385...",,36.099575,-115.176338,Coffee shop,3.2,24,No Price,Open ⋅ Closes 2PM
4,1.165234e+20,2018-01-06 15:06:03.362,3,I passed two Starbucks in order to get to this...,Castle Coffee,"Castle Coffee, Excalibur Hotel and Casino, 385...",,36.099575,-115.176338,Coffee shop,3.2,24,No Price,Open ⋅ Closes 2PM
6,1.033041e+20,2017-09-23 22:46:02.095,3,It does the job and is my preference over the ...,Castle Coffee,"Castle Coffee, Excalibur Hotel and Casino, 385...",,36.099575,-115.176338,Coffee shop,3.2,24,No Price,Open ⋅ Closes 2PM
8,1.141624e+20,2016-07-12 04:46:11.892,3,,Castle Coffee,"Castle Coffee, Excalibur Hotel and Casino, 385...",,36.099575,-115.176338,Coffee shop,3.2,24,No Price,Open ⋅ Closes 2PM


In [31]:
df_nv.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1375620 entries, 0 to 482595
Data columns (total 14 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   user_id         1375620 non-null  float64       
 1   time            1375620 non-null  datetime64[ns]
 2   rating          1375620 non-null  int64         
 3   text            841737 non-null   object        
 4   business name   1375620 non-null  object        
 5   address         1370277 non-null  object        
 6   description     707912 non-null   object        
 7   latitude        1375620 non-null  float64       
 8   longitude       1375620 non-null  float64       
 9   category        1375620 non-null  object        
 10  avg_rating      1375620 non-null  float64       
 11  num_of_reviews  1375620 non-null  int64         
 12  price           1375620 non-null  object        
 13  state           1317712 non-null  object        
dtypes: datetime64[ns](1), fl

In [32]:
# Rename the column "name_y" to "business name"
df_nv.rename(columns={'name_y': 'business name'}, inplace=True)

We filter for lodging-related businesses, based on the 'category' column:

In [33]:
# Create a list with the keywords
keywords = ['Hotel', 'Hostel', 'Motel', 'Resort', 'Inn', 'Lodging', 'Lodge', 'Accommodation', 
            'Bed and Breakfast (B&B)', 'Guesthouse', 'Boutique Hotel', 'Vacation Rental', 
            'Homestay', 'Cabin', 'Suites', 'Spa Resort', 'Boutique Inn', 'Extended Stay', 
            'Boutique Accommodation', 'Retreat']


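# Build a regex that matches any keyword as a whole word (\b = word boundary).
# Note: the trailing \b cannot match right after ')', so 'Bed and Breakfast (B&B)'
# only matches when a word character immediately follows it.
import re
pattern = r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'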


# Create a dataframe containing hotels only
nv_hotels = df_nv[df_nv['category'].str.contains(pattern, case=False, na=False)]
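
A toy check of how the word-boundary pattern behaves, using a shortened keyword list and made-up category values. It is only an illustration of the matching rules, not a re-run of the filter on the real data.

```python
import re

import pandas as pd

# Shortened, illustrative keyword list.
keywords = ["Hotel", "Motel", "Inn", "Bed and Breakfast (B&B)"]
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

# Hypothetical category values, including a missing one.
categories = pd.Series([
    "Hotel",
    "Extended stay hotel",
    "Innovation center",
    "Dinner theater",
    "Bed and Breakfast (B&B)",
    None,
])

mask = categories.str.contains(pattern, case=False, na=False)
print(mask.tolist())  # [True, True, False, False, False, False]
```

The boundaries keep "Inn" from matching inside "Innovation" or "Dinner", and `na=False` treats missing categories as non-matches. The last keyword does not match its own text because `\b` requires a word character on one side, and the closing parenthesis at the end of the string provides none.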



In [34]:
nv_hotels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3907 entries, 8245 to 464140
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   user_id         3907 non-null   float64       
 1   time            3907 non-null   datetime64[ns]
 2   rating          3907 non-null   int64         
 3   text            2383 non-null   object        
 4   business name   3907 non-null   object        
 5   address         3907 non-null   object        
 6   description     213 non-null    object        
 7   latitude        3907 non-null   float64       
 8   longitude       3907 non-null   float64       
 9   category        3907 non-null   object        
 10  avg_rating      3907 non-null   float64       
 11  num_of_reviews  3907 non-null   int64         
 12  price           3907 non-null   object        
 13  state           766 non-null    object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(7)
me

In [35]:
df_nv.describe()

Unnamed: 0,user_id,time,rating,latitude,longitude,avg_rating,num_of_reviews
count,1375620.0,1375620,1375620.0,1375620.0,1375620.0,1375620.0,1375620.0
mean,1.09316e+20,2019-04-08 03:27:32.202178048,4.241728,36.81825,-116.0234,4.214695,1083.728
min,1e+20,2005-06-10 00:00:00,1.0,35.12886,-119.9955,1.1,6.0
25%,1.047904e+20,2018-05-30 01:59:30.780000,4.0,36.09158,-115.2982,4.0,78.0
50%,1.093092e+20,2019-05-10 21:25:28.222000128,5.0,36.15698,-115.1905,4.3,225.0
75%,1.138704e+20,2020-04-24 00:06:42.374000128,5.0,36.25719,-115.123,4.6,1095.0
max,1.184467e+20,2021-09-07 10:09:29.993000,5.0,41.98962,-114.0472,5.0,9998.0
std,5.277704e+18,,1.202381,1.384651,1.770253,0.4788529,1991.07


In [37]:
nv_hotels.head()

Unnamed: 0,user_id,time,rating,text,business name,address,description,latitude,longitude,category,avg_rating,num_of_reviews,price,state
8245,1.107179e+20,2020-01-03 03:03:00.050,5,Clean rooms no bugs or rodents great price and...,Sunset Motel,"Sunset Motel, 2091 W 4th St, Reno, NV 89503",,39.523581,-119.837903,Motel,3.0,38,No Price,
8246,1.040384e+20,2018-07-30 18:58:30.945,4,Very good for a Reno motel. Clean and family ...,Sunset Motel,"Sunset Motel, 2091 W 4th St, Reno, NV 89503",,39.523581,-119.837903,Motel,3.0,38,No Price,
8247,1.0633e+20,2018-06-18 02:48:23.106,4,Owner is very helpful to everyone he deals wit...,Sunset Motel,"Sunset Motel, 2091 W 4th St, Reno, NV 89503",,39.523581,-119.837903,Motel,3.0,38,No Price,
8248,1.089285e+20,2018-08-23 06:16:42.832,5,I give this place 5 stars because Harry and hi...,Sunset Motel,"Sunset Motel, 2091 W 4th St, Reno, NV 89503",,39.523581,-119.837903,Motel,3.0,38,No Price,
8249,1.175148e+20,2019-01-03 04:13:34.728,3,It keeps you warm and it's as clean as you'd e...,Sunset Motel,"Sunset Motel, 2091 W 4th St, Reno, NV 89503",,39.523581,-119.837903,Motel,3.0,38,No Price,


In [38]:
nv_hotels.to_parquet('nv_hotels.parquet', index=False)