# Contents

This notebook looks for sites that appear in both datasets, in order to build a list of unique businesses before the two datasets are joined.

The plan is as follows: every business in the GOOGLE dataset is kept, and businesses in the YELP dataset that duplicate a GOOGLE entry are removed.
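The plan can be sketched on toy data before running it on the real files. The rows below are hypothetical, and the sketch assumes that two rows describe the same business when they share `postal_code`, `x` and `y`, which is the matching rule used later in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real datasets.
google = pd.DataFrame({
    "business_id": ["g1", "g2"],
    "postal_code": [10001, 10002],
    "x": [1.00, 2.00],
    "y": [5.00, 6.00],
    "source": "google",
})
yelp = pd.DataFrame({
    "business_id": ["y1", "y2"],
    "postal_code": [10001, 10003],
    "x": [1.00, 3.00],
    "y": [5.00, 7.00],
    "source": "yelp",
})

keys = ["postal_code", "x", "y"]

# Yelp rows whose key triple also appears in Google are treated as repeats.
repeated = google.merge(yelp, on=keys, how="inner", suffixes=("_g", "_y"))

# Keep every Google row, and only the Yelp rows that were not matched.
yelp_unique = yelp[~yelp["business_id"].isin(repeated["business_id_y"])]
combined = pd.concat([google, yelp_unique], ignore_index=True)

print(combined["business_id"].tolist())
```

Here `y1` shares its key with `g1` and is dropped, while `y2` has no Google counterpart and is kept.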

# Library imports

In [1]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

# Data loading

In [2]:
# Read the GOOGLE restaurants dataset
dfgrst_coord = pd.read_parquet('dfgrst_coord.parquet')

dfgrst_coord['x'] = dfgrst_coord['x'].round(2)
dfgrst_coord['y'] = dfgrst_coord['y'].round(2)

dfgrst_coord['latitude'] = dfgrst_coord['latitude'].round(4)
dfgrst_coord['longitude'] = dfgrst_coord['longitude'].round(4)

print(dfgrst_coord.info())
dfgrst_coord.sample(2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231427 entries, 0 to 231426
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   business_id  231427 non-null  object 
 1   latitude     231427 non-null  float64
 2   longitude    231427 non-null  float64
 3   name         231427 non-null  object 
 4   state        231427 non-null  object 
 5   city         231427 non-null  object 
 6   postal_code  231427 non-null  int64  
 7   source       231427 non-null  object 
 8   x            231427 non-null  float64
 9   y            231427 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 17.7+ MB
None


Unnamed: 0,business_id,latitude,longitude,name,state,city,postal_code,source,x,y
207356,0x88e729309129abc7:0x9ddcf17f86c3de60,29.0286,-80.928,Gypsy Fresh Grill,FL,New Smyrna Beach,32168,google,878.36,-5500.98
169212,0x864de31fcd2ece05:0x12e69e1e14253e92,32.9351,-97.5503,tacos el compa,TX,Azle,76020,google,-702.59,-5300.73


In [3]:
# Read the YELP restaurants dataset
dfyrst_coord = pd.read_parquet('dfyrst_coord.parquet')

dfyrst_coord['x'] = dfyrst_coord['x'].round(2)
dfyrst_coord['y'] = dfyrst_coord['y'].round(2)
dfyrst_coord['postal_code'] = dfyrst_coord['postal_code'].astype('int64')

dfyrst_coord['latitude'] = dfyrst_coord['latitude'].round(4)
dfyrst_coord['longitude'] = dfyrst_coord['longitude'].round(4)

print(dfyrst_coord.info())
dfyrst_coord.sample(2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59187 entries, 0 to 59186
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   business_id  59187 non-null  object 
 1   latitude     59187 non-null  float64
 2   longitude    59187 non-null  float64
 3   name         59187 non-null  object 
 4   state        59187 non-null  object 
 5   city         59187 non-null  object 
 6   postal_code  59187 non-null  int64  
 7   source       59187 non-null  object 
 8   x            59187 non-null  float64
 9   y            59187 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 4.5+ MB
None


Unnamed: 0,business_id,latitude,longitude,name,state,city,postal_code,source,x,y
1540,yPzni14snM06Kk5xLy6e6w,39.6127,-86.1576,Wang Cai,IN,Greenwood,46142,yelp,328.9,-4897.01
13825,iTjhwZjjlGjyPywY_rgCaA,39.9466,-75.1746,Palm Tree Gourmet,PA,Philadelphia,19103,yelp,1249.77,-4721.69
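Both loading cells round `x` and `y` to two decimals, so any later join on those columns compares rounded values for exact equality. The sketch below, with hypothetical coordinates, illustrates one consequence of that choice: two points that are very close can still land on different rounded values and fail to match.

```python
import pandas as pd

# Hypothetical coordinates chosen to sit near a rounding boundary.
a = pd.DataFrame({"id": ["a1", "a2"], "x": [100.004, 200.004]})
b = pd.DataFrame({"id": ["b1", "b2"], "x": [100.001, 200.006]})

# Same rounding as in the loading cells.
a["x"] = a["x"].round(2)
b["x"] = b["x"].round(2)

# Exact-equality join on the rounded coordinate.
matches = a.merge(b, on="x", how="inner", suffixes=("_a", "_b"))
print(matches)
```

The pair `a1`/`b1` differs by 0.003 and matches, while `a2`/`b2` differs by only 0.002 and does not, because 200.004 rounds down and 200.006 rounds up. A distance threshold would avoid this edge case, at the cost of a more expensive comparison.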


# Processing

In [4]:
# Repeated businesses: rows that share postal_code, x and y in both datasets
df_gy = pd.merge(dfgrst_coord, dfyrst_coord, how='inner', on=['postal_code', 'x', 'y'])
print(df_gy.info())
df_gy.sample(2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8324 entries, 0 to 8323
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   business_id_x  8324 non-null   object 
 1   latitude_x     8324 non-null   float64
 2   longitude_x    8324 non-null   float64
 3   name_x         8324 non-null   object 
 4   state_x        8324 non-null   object 
 5   city_x         8324 non-null   object 
 6   postal_code    8324 non-null   int64  
 7   source_x       8324 non-null   object 
 8   x              8324 non-null   float64
 9   y              8324 non-null   float64
 10  business_id_y  8324 non-null   object 
 11  latitude_y     8324 non-null   float64
 12  longitude_y    8324 non-null   float64
 13  name_y         8324 non-null   object 
 14  state_y        8324 non-null   object 
 15  city_y         8324 non-null   object 
 16  source_y       8324 non-null   object 
dtypes: float64(6), int64(1), object(10)
memory usage: 1.

Unnamed: 0,business_id_x,latitude_x,longitude_x,name_x,state_x,city_x,postal_code,source_x,x,y,business_id_y,latitude_y,longitude_y,name_y,state_y,city_y,source_y
6899,0x54afaab6c31ecfab:0x821cdfc25102a7e,43.6901,-116.3531,Idaho Pizza Company,ID,Eagle,83616,google,-2044.96,-4128.03,iAj-I-UKC3T3qJnNUnkBRw,43.69,-116.3531,Brewforia Beer Market,ID,Eagle,yelp
4331,0x88c2e92bcdd55095:0xd2936bdb00563c12,27.9709,-82.5689,BLUFIN Waterfront Grill,FL,Tampa,33607,google,727.74,-5579.52,LVGGURuHa5vh9cmgwgUIFg,27.9709,-82.5689,Totally Tiki Tours,FL,Tampa,yelp
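Both sampled rows above pair businesses with different names, which suggests that sharing `postal_code`, `x` and `y` does not always mean the two records are the same business (they may simply share an address). One possible refinement, not part of the original pipeline, is to also compare names. The sketch below uses `difflib.SequenceMatcher` from the standard library; the `name_similarity` helper and the 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Rows shaped like df_gy. The first two name pairs come from the sample output
# above; the third pair is a hypothetical exact match.
pairs = pd.DataFrame({
    "name_x": ["Idaho Pizza Company", "BLUFIN Waterfront Grill", "Sonic Drive-In"],
    "name_y": ["Brewforia Beer Market", "Totally Tiki Tours", "Sonic Drive-In"],
})
pairs["name_sim"] = [
    name_similarity(a, b) for a, b in zip(pairs["name_x"], pairs["name_y"])
]

# Hypothetical threshold; it would need tuning against manually checked pairs.
THRESHOLD = 0.8
likely_same = pairs[pairs["name_sim"] >= THRESHOLD]
print(likely_same)
```

With a filter like this, only the coordinate matches whose names also agree would be treated as repeats.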


In [5]:
# Businesses to remove from dfy
dfy_a_eliminar = df_gy['business_id_y']
dfy_a_eliminar = dfy_a_eliminar.to_frame()

print(dfy_a_eliminar.info())
dfy_a_eliminar.sample(2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8324 entries, 0 to 8323
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   business_id_y  8324 non-null   object
dtypes: object(1)
memory usage: 65.2+ KB
None


Unnamed: 0,business_id_y
1510,Q5-CkgsvtlRF5GphtoBFRQ
6751,ZKKaP4Lx-dexf4T1aCsBKw


In [6]:
# Final list of businesses to keep from dfyrst (YELP restaurants)
dfy_rst_final = dfyrst_coord[~dfyrst_coord['business_id'].isin(dfy_a_eliminar['business_id_y'])]

print(dfy_rst_final.info())
dfy_rst_final.sample(2)

<class 'pandas.core.frame.DataFrame'>
Index: 52291 entries, 0 to 59186
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   business_id  52291 non-null  object 
 1   latitude     52291 non-null  float64
 2   longitude    52291 non-null  float64
 3   name         52291 non-null  object 
 4   state        52291 non-null  object 
 5   city         52291 non-null  object 
 6   postal_code  52291 non-null  int64  
 7   source       52291 non-null  object 
 8   x            52291 non-null  float64
 9   y            52291 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 4.4+ MB
None


Unnamed: 0,business_id,latitude,longitude,name,state,city,postal_code,source,x,y
18834,KAShxVv6oTtYeX9D_n7Vxw,43.5907,-116.2913,HeavenEssence Floral & Gifts,ID,Boise,83709,yelp,-2043.88,-4137.07
53796,R0RbI29U_ft7lNAV6h-VRg,32.1345,-110.9717,Sonic Drive-In,AZ,TUCSON,85706,yelp,-1930.89,-5037.6
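The inner merge followed by `isin` can return several rows for one Yelp record when more than one Google row shares the same key. An alternative that keeps exactly one row per Yelp record is an anti-join using the `indicator` option of `merge`. This is a sketch on hypothetical rows, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical rows: g1 and g2 share the same key triple.
google = pd.DataFrame({
    "business_id": ["g1", "g2", "g3"],
    "postal_code": [1, 1, 2],
    "x": [0.5, 0.5, 1.5],
    "y": [0.1, 0.1, 0.2],
})
yelp = pd.DataFrame({
    "business_id": ["y1", "y2"],
    "postal_code": [1, 3],
    "x": [0.5, 9.0],
    "y": [0.1, 9.0],
})
keys = ["postal_code", "x", "y"]

# Deduplicate the Google keys so the left merge cannot multiply Yelp rows.
google_keys = google[keys].drop_duplicates()

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
flagged = yelp.merge(google_keys, on=keys, how="left", indicator=True)
yelp_only = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(yelp_only)
```

Because the Google keys are deduplicated first, `flagged` has the same number of rows as `yelp`, which makes the result easy to check.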


In [7]:
# Final list of businesses to keep in dfgyrst (YELP and GOOGLE restaurants combined)
dfgy_rst_final = pd.concat([dfgrst_coord, dfy_rst_final])

print(dfgy_rst_final.info())
dfgy_rst_final.sample(2)

<class 'pandas.core.frame.DataFrame'>
Index: 283718 entries, 0 to 59186
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   business_id  283718 non-null  object 
 1   latitude     283718 non-null  float64
 2   longitude    283718 non-null  float64
 3   name         283718 non-null  object 
 4   state        283718 non-null  object 
 5   city         283718 non-null  object 
 6   postal_code  283718 non-null  int64  
 7   source       283718 non-null  object 
 8   x            283718 non-null  float64
 9   y            283718 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 23.8+ MB
None


Unnamed: 0,business_id,latitude,longitude,name,state,city,postal_code,source,x,y
7417,0x88e5c77f74a1248f:0x9c6350c77c4d69f3,30.1887,-81.7192,Healthy Wave LLC,FL,Orange Park,32073,google,793.14,-5449.51
211557,0x885a91d4f3256fe9:0xdfb4d89316dd2fc7,36.5493,-82.4465,Indian Springs Ice Cream and Grill,TN,Kingsport,37664,google,672.79,-5073.7
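The `info()` output above reports `Index: 283718 entries, 0 to 59186`, which means the concatenated frame keeps the original row labels of both inputs, so some labels appear twice. If unique labels are wanted, `pd.concat` accepts `ignore_index=True`. A minimal sketch on hypothetical frames:

```python
import pandas as pd

g = pd.DataFrame({"business_id": ["g1", "g2", "g3"]})
# A filtered frame keeps its original labels (here 0 and 2).
y = pd.DataFrame({"business_id": ["y1", "y2", "y3"]}).iloc[[0, 2]]

stacked = pd.concat([g, y])                        # labels 0, 1, 2, 0, 2
reindexed = pd.concat([g, y], ignore_index=True)   # labels 0, 1, 2, 3, 4

print(stacked.index.is_unique, reindexed.index.is_unique)
```

Duplicate labels do not affect the exported rows, but they can make label-based selection such as `.loc[0]` return more than one row.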


# Data export

In [8]:
# Note: the code is left commented out on purpose so that re-running the notebook
# does not overwrite the files already generated

#dfgy_rst_final.to_parquet('dfgy_rest_uniques.parquet')
#dfy_rst_final.to_parquet('dfy_rest_uniques.parquet')
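Instead of commenting the export out, one option is to guard the write so it only happens when the target file does not exist yet. The helper name `export_if_missing` is hypothetical, and this is only a sketch of the idea.

```python
from pathlib import Path

import pandas as pd


def export_if_missing(df: pd.DataFrame, path: str) -> bool:
    """Write df to a parquet file only when path does not exist yet.

    Returns True if the file was written, False if the write was skipped.
    """
    target = Path(path)
    if target.exists():
        return False
    df.to_parquet(target)  # requires a parquet engine such as pyarrow
    return True
```

A call such as `export_if_missing(dfgy_rst_final, 'dfgy_rest_uniques.parquet')` would then be safe to leave uncommented when re-running the notebook.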

# Conclusions

The goal of this ETL is to obtain a single list of unique businesses by combining the yelp and google datasets.

* The original "business_id" field of each dataset is kept, but because each dataset uses a different id format, another way of uniquely identifying each site is needed
* To find matching businesses, the Cartesian coordinate fields "x" and "y" are used, together with the "postal_code" field
* Finally, the original google list is kept, and the businesses in the yelp list that duplicate it are removed
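The row counts printed by the `info()` calls in this notebook can be reconciled with a few lines of arithmetic. The figures are copied from those outputs. The interpretation of the difference assumes that `business_id` is unique within the Yelp dataset, which the notebook does not verify.

```python
# Row counts copied from the info() outputs in this notebook.
google_rows = 231_427
yelp_rows = 59_187
merge_rows = 8_324       # rows in df_gy
yelp_kept = 52_291       # rows in dfy_rst_final
combined_rows = 283_718  # rows in dfgy_rst_final

# Yelp rows removed as repeats.
yelp_removed = yelp_rows - yelp_kept

# The combined list is all Google rows plus the Yelp rows that were kept.
assert google_rows + yelp_kept == combined_rows

# Merge rows beyond one per removed Yelp record, i.e. Yelp records that
# matched more than one Google row (under the uniqueness assumption above).
extra_matches = merge_rows - yelp_removed

print(yelp_removed, extra_matches)
```

The totals are consistent, and the surplus of merge rows indicates that the key `postal_code`, `x`, `y` is not one-to-one between the two datasets.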