# Oxxo: Cleaning

This is the second part of the series on gathering information about the popular Oxxo stores.

Here, the data previously obtained from Google Maps is cleaned.

As a preview of a further attempt, the store locations were also cross-referenced with socioeconomic information. Unfortunately, the data published by institutes such as INEGI (Instituto Nacional de Estadística y Geografía) and CONEVAL (Consejo Nacional de Evaluación de la Política de Desarrollo Social) is only broken down to the municipality level (for example, the Azcapotzalco borough in Mexico City), so a finer spatial resolution could not be achieved.

Thus, only the socioeconomic status by municipality was added. This information comes from the "Encuesta Nacional de Ingresos y Gastos de los Hogares 2020" (National Survey of Household Income and Expenditure 2020), abbreviated ENIGH.

ENIGH covers aspects such as household income and expenses by subcategory, housing and household characteristics, employment, etc. The dataset used here, "conjunto_de_datos_viviendas_enigh_2020_ns.csv", represents 87755 households across Mexico.

In [26]:
import pandas as pd
import numpy as np
import geopy
import os

In [27]:
oxxo_1 = pd.read_csv("../data/oxxo/oxxo_coordinates_0.csv")
oxxo_2 = pd.read_csv("../data/oxxo/oxxo_coordinates_1.csv")
oxxo_3 = pd.read_csv("../data/oxxo/oxxo_coordinates_2.csv")

oxxo_df = pd.concat([oxxo_1,oxxo_2,oxxo_3])
oxxo_df

Unnamed: 0,latitude,longitude,name,comments,rating,cp
0,19.353753,-99.189937,OXXO Helenico,4.0,4.3,1000
1,19.348316,-99.185520,Oxxo La Paz,13.0,3.5,1000
2,19.351621,-99.185912,OXXO,9.0,3.0,1000
3,19.341629,-99.203068,Oxxo,27.0,4.0,1000
4,19.360927,-99.185247,Oxxo,15.0,3.0,1000
...,...,...,...,...,...,...
38929,19.300681,-99.112765,Oxxo Camil,2.0,3.0,16797
38930,19.266287,-98.900100,OXXO OBREGÓN MEX,11.0,3.5,16797
38931,19.291616,-98.908509,Oxxo Santa cruz II,12.0,3.4,16797
38932,19.338992,-98.953441,Oxxo Geo san Isidro,24.0,2.5,16797


In [28]:
mask_only_oxxo = oxxo_df["name"].str.contains("oxxo", case=False)
# Raw string: in a normal string literal "\b" is a backspace, not a regex word boundary.
mask_only_stores = ~oxxo_df["name"].str.contains(r"\bgas\b", case=False)

oxxo_df = oxxo_df[mask_only_oxxo & mask_only_stores]
oxxo_df

Unnamed: 0,latitude,longitude,name,comments,rating,cp
0,19.353753,-99.189937,OXXO Helenico,4.0,4.3,1000
1,19.348316,-99.185520,Oxxo La Paz,13.0,3.5,1000
2,19.351621,-99.185912,OXXO,9.0,3.0,1000
3,19.341629,-99.203068,Oxxo,27.0,4.0,1000
4,19.360927,-99.185247,Oxxo,15.0,3.0,1000
...,...,...,...,...,...,...
38929,19.300681,-99.112765,Oxxo Camil,2.0,3.0,16797
38930,19.266287,-98.900100,OXXO OBREGÓN MEX,11.0,3.5,16797
38931,19.291616,-98.908509,Oxxo Santa cruz II,12.0,3.4,16797
38932,19.338992,-98.953441,Oxxo Geo san Isidro,24.0,2.5,16797
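
One detail worth checking in name filters like the one above: `\b` only acts as a regex word boundary when the pattern is written as a raw string. In a normal Python string literal it is the backspace character, so the pattern matches nothing. A minimal sketch with the standard `re` module, using made-up store names (the names and the `matches` helper are illustrative assumptions, not part of the dataset):

```python
import re

# Hypothetical store names, for illustration only.
names = ["OXXO Gas Cuautitlán", "Oxxo Gasca", "OXXO"]

plain = "\bgas\b"   # "\b" is the backspace character (\x08) here
raw = r"\bgas\b"    # "\b" is a regex word boundary here


def matches(pattern, values):
    """Return the values that contain the pattern, ignoring case."""
    return [v for v in values if re.search(pattern, v, flags=re.IGNORECASE)]


print(matches(plain, names))
print(matches(raw, names))
```

With the raw pattern, "Oxxo Gasca" is kept because "gas" is not a whole word there, which is the point of using word boundaries.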


In [29]:
oxxo_df = oxxo_df.drop_duplicates(["latitude", "longitude"]).reset_index(drop=True)
oxxo_df

Unnamed: 0,latitude,longitude,name,comments,rating,cp
0,19.353753,-99.189937,OXXO Helenico,4.0,4.3,1000
1,19.348316,-99.185520,Oxxo La Paz,13.0,3.5,1000
2,19.351621,-99.185912,OXXO,9.0,3.0,1000
3,19.341629,-99.203068,Oxxo,27.0,4.0,1000
4,19.360927,-99.185247,Oxxo,15.0,3.0,1000
...,...,...,...,...,...,...
1800,19.263130,-99.105382,Oxxo,,,16000
1801,19.273990,-99.124151,OXXO Aldama,321.0,3.9,16010
1802,19.274346,-99.120606,Museo OXXO,1.0,5.0,16010
1803,32.524194,-116.997760,OXXO POSTAL,1.0,5.0,16083


In [5]:
geolocator = geopy.ArcGIS()

# Build one "lat,lon" query string per store.
get_address = lambda row: ",".join(row.transform(str).values)
lat_lon_list = oxxo_df[["latitude", "longitude"]].apply(get_address, axis=1)

# Collect the raw responses and build the DataFrame once at the end,
# instead of concatenating inside the loop.
records = []

for i, lat_lon in enumerate(lat_lon_list):

    address = geolocator.reverse(lat_lon)
    records.append(address.raw)

    if ((i + 1) % 10) == 0:
        print(f"Iteration {i + 1} out of {len(lat_lon_list)}", end="\r")

print(f"Iteration {i + 1} out of {len(lat_lon_list)}")

# Row i of address_df corresponds to row i of oxxo_df.
address_df = pd.DataFrame(records)

Iteration 1805 out of 1805
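
A reverse-geocoding loop like the one above stops at the first lookup that raises an error, and a lookup can also return nothing for some coordinates. The following is only a sketch of a more tolerant loop. `reverse_all` and `fake_reverse` are hypothetical names, and `reverse_fn` is an assumed stand-in for the geocoder call that returns a dict of address fields or `None`; no network request is made here.

```python
def reverse_all(points, reverse_fn):
    """Reverse-geocode each (lat, lon) pair, keeping positions aligned.

    reverse_fn is an assumed stand-in for a geocoder call: it takes a
    "lat,lon" string and returns a dict of address fields, or None.
    """
    records = []
    for lat, lon in points:
        try:
            result = reverse_fn(f"{lat},{lon}")
        except Exception:
            # Broad catch for the sketch; real code would catch the
            # geocoder's own error types.
            result = None
        # An empty dict keeps row i of the output aligned with point i.
        records.append(result if result is not None else {})
    return records


def fake_reverse(query):
    """Stub geocoder with one known coordinate (made-up data)."""
    table = {"19.35,-99.18": {"City": "Guadalupe Inn", "Region": "Ciudad de México"}}
    return table.get(query)


records = reverse_all([(19.35, -99.18), (0.0, 0.0)], fake_reverse)
print(records)
```

Keeping one entry per input point, even when it is empty, preserves the positional alignment that the later index-based merge relies on.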

In [30]:
get_columns = ["Address","Neighborhood", "City", "Postal", "Subregion", "Region"]
new_address_df = address_df[get_columns]

In [31]:
new_oxxo_df = oxxo_df.merge(new_address_df, right_index=True, left_index=True, how="inner")
new_oxxo_df.drop("cp", inplace=True, axis=1)

In [32]:
mask_cdmx = (new_oxxo_df["Region"] == "Ciudad de México") | (new_oxxo_df["Region"] == "México")
new_oxxo_df = new_oxxo_df[mask_cdmx].reset_index(drop=True)
new_oxxo_df.head()

Unnamed: 0,latitude,longitude,name,comments,rating,Address,Neighborhood,City,Postal,Subregion,Region
0,19.353753,-99.189937,OXXO Helenico,4.0,4.3,Avenida Revolución,Sn Ángel,Guadalupe Inn,1020,Álvaro Obregón,Ciudad de México
1,19.348316,-99.18552,Oxxo La Paz,13.0,3.5,Avenida Miguel Ángel de Quevedo 36B,,Chimalistac,1050,Álvaro Obregón,Ciudad de México
2,19.351621,-99.185912,OXXO,9.0,3.0,Avenida Vito Alessio Robles 12,,Florida,1030,Álvaro Obregón,Ciudad de México
3,19.341629,-99.203068,Oxxo,27.0,4.0,Calle Veracruz 87,,Progreso Tizapán,1080,Álvaro Obregón,Ciudad de México
4,19.360927,-99.185247,Oxxo,15.0,3.0,Calle Manuel M. Ponce,Sn Ángel,Guadalupe Inn,1020,Álvaro Obregón,Ciudad de México


In [33]:
convert_marks = {"á":"a","é":"e","í":"i","ó":"o","ú":"u"}

def ConvertAccentMarks(text):
    
    new_string = ""
    for letter in text:
        
        if letter in convert_marks.keys():
            new_string += convert_marks[letter] 
            
        else:
            new_string += letter
    
    return new_string
            
new_oxxo_df["Subregion"] = new_oxxo_df["Subregion"].apply(str.lower).apply(ConvertAccentMarks)
new_oxxo_df["Region"] = new_oxxo_df["Region"].apply(str.lower)
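
The mapping above covers the five lowercase accented vowels, which is enough once the text has been lowercased. If other marks need handling (for example ü or ñ), a more general alternative is Unicode normalization from the standard library. This is not the method used in this notebook, and `strip_accents` is a hypothetical helper name:

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after decomposing each character (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("álvaro obregón"))
print(strip_accents("Coyoacán"))
```

Note that this also turns "ñ" into "n", which may or may not be desirable when matching municipality names.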

In [34]:
path_enigh = "../data/oxxo/conjunto_de_datos_enigh_ns_2020_csv/" 

path_vivienda = "conjunto_de_datos_viviendas_enigh_2020_ns/conjunto_de_datos/conjunto_de_datos_viviendas_enigh_2020_ns.csv"
path_state_vivienda = "conjunto_de_datos_viviendas_enigh_2020_ns/catalogos/ubica_geo.csv"

In [35]:
vivienda_df = pd.read_csv(os.path.join(path_enigh, path_vivienda),
                          usecols=["folioviv","ubica_geo","est_socio"])

states_keys = pd.read_csv(os.path.join(path_enigh, path_state_vivienda))

In [36]:
# est_socio equivalences (socioeconomic stratum)
# 1 = Bajo (low)
# 2 = Medio bajo (lower-middle)
# 3 = Medio alto (upper-middle)
# 4 = Alto (high)

vivienda_df.head()

Unnamed: 0,folioviv,ubica_geo,est_socio
0,100013605,1001,3
1,100013606,1001,3
2,100017801,1001,3
3,100017802,1001,3
4,100017803,1001,3


In [37]:
status_equiv = {1:"Bajo", 2:"Medio Bajo", 3:"Medio Alto", 4:"Alto"}
vivienda_df["est_socio"] = vivienda_df["est_socio"].apply(lambda x: status_equiv[x])
vivienda_df.head()

Unnamed: 0,folioviv,ubica_geo,est_socio
0,100013605,1001,Medio Alto
1,100013606,1001,Medio Alto
2,100017801,1001,Medio Alto
3,100017802,1001,Medio Alto
4,100017803,1001,Medio Alto


In [38]:
states_keys.head()

Unnamed: 0,ubica_geo,entidad,desc_ent,municipio,des_mun
0,1001,1,Aguascalientes,1,Aguascalientes
1,1002,1,Aguascalientes,2,Asientos
2,1003,1,Aguascalientes,3,Calvillo
3,1004,1,Aguascalientes,4,Cosio
4,1005,1,Aguascalientes,5,Jesus Maria


In [39]:
mask_states = (states_keys["desc_ent"] == "Ciudad De México") | (states_keys["desc_ent"] == "México" )
states_keys = states_keys[mask_states]
states_keys.head()

Unnamed: 0,ubica_geo,entidad,desc_ent,municipio,des_mun
166,9002,9,Ciudad De México,2,Azcapotzalco
167,9003,9,Ciudad De México,3,Coyoacan
168,9004,9,Ciudad De México,4,Cuajimalpa De Morelos
169,9005,9,Ciudad De México,5,Gustavo A. Madero
170,9006,9,Ciudad De México,6,Iztacalco


In [40]:
population_df = states_keys.merge(vivienda_df, how="left", on="ubica_geo")
population_df = population_df.drop(["municipio","folioviv","ubica_geo","entidad"], axis=1)
population_df

Unnamed: 0,desc_ent,des_mun,est_socio
0,Ciudad De México,Azcapotzalco,Medio Alto
1,Ciudad De México,Azcapotzalco,Medio Alto
2,Ciudad De México,Azcapotzalco,Medio Alto
3,Ciudad De México,Azcapotzalco,Medio Alto
4,Ciudad De México,Azcapotzalco,Medio Alto
...,...,...,...
6047,México,San Jose Del Rincon,Bajo
6048,México,San Jose Del Rincon,Bajo
6049,México,San Jose Del Rincon,Bajo
6050,México,San Jose Del Rincon,Bajo


In [41]:
population_df["des_mun"] = population_df["des_mun"].apply(str.lower).apply(ConvertAccentMarks)
population_df["desc_ent"] = population_df["desc_ent"].apply(str.lower)

In [42]:
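# Count households per (state, municipality, status). Naming the column in
# reset_index avoids relying on the default name, which differs across pandas versions.
population_group = population_df.groupby(["desc_ent","des_mun"]).value_counts().reset_index(name="Counts")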
population_group.head()

Unnamed: 0,desc_ent,des_mun,est_socio,Counts
0,ciudad de méxico,alvaro obregon,Medio Bajo,104
1,ciudad de méxico,alvaro obregon,Medio Alto,80
2,ciudad de méxico,alvaro obregon,Alto,24
3,ciudad de méxico,azcapotzalco,Medio Alto,47
4,ciudad de méxico,azcapotzalco,Medio Bajo,40


In [43]:
population_est_socio = population_group.pivot_table("Counts", ["desc_ent","des_mun"], "est_socio")
population_est_socio = population_est_socio.reset_index()
population_est_socio = population_est_socio[["desc_ent","des_mun","Bajo","Medio Bajo","Medio Alto","Alto"]]
population_est_socio.head()

est_socio,desc_ent,des_mun,Bajo,Medio Bajo,Medio Alto,Alto
0,ciudad de méxico,alvaro obregon,,104.0,80.0,24.0
1,ciudad de méxico,azcapotzalco,,40.0,47.0,23.0
2,ciudad de méxico,benito juarez,,,54.0,39.0
3,ciudad de méxico,coyoacan,,53.0,60.0,43.0
4,ciudad de méxico,cuajimalpa de morelos,,48.0,22.0,11.0
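
The two steps above (count households per municipality and status, then pivot to one column per status) amount to a cross-tabulation. A plain-Python sketch with made-up rows makes the reshaping explicit, and shows why some cells end up empty: a municipality with no sampled households in a category has no count for it at all.

```python
from collections import Counter

# Made-up (municipality, status) rows, for illustration only.
rows = [
    ("azcapotzalco", "Medio Alto"),
    ("azcapotzalco", "Medio Bajo"),
    ("azcapotzalco", "Medio Alto"),
    ("benito juarez", "Alto"),
]
levels = ["Bajo", "Medio Bajo", "Medio Alto", "Alto"]

# Step 1: count each (municipality, status) pair.
counts = Counter(rows)

# Step 2: one row per municipality, one column per status.
# None marks combinations that never occur (NaN in the pivot table).
table = {
    mun: [counts.get((mun, level)) for level in levels]
    for mun in sorted({mun for mun, _ in rows})
}
print(table)
```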


In [44]:
population_est_socio[["Bajo","Medio Bajo","Medio Alto","Alto"]] = population_est_socio[["Bajo","Medio Bajo","Medio Alto","Alto"]].astype("Int32")
population_est_socio.head()

est_socio,desc_ent,des_mun,Bajo,Medio Bajo,Medio Alto,Alto
0,ciudad de méxico,alvaro obregon,,104.0,80,24
1,ciudad de méxico,azcapotzalco,,40.0,47,23
2,ciudad de méxico,benito juarez,,,54,39
3,ciudad de méxico,coyoacan,,53.0,60,43
4,ciudad de méxico,cuajimalpa de morelos,,48.0,22,11


In [45]:
df = new_oxxo_df.merge(population_est_socio, how="left",
                  left_on=["Region", "Subregion"], right_on=["desc_ent", "des_mun"])

df.drop(["des_mun","desc_ent"], axis=1, inplace=True)

In [46]:
df.head()

Unnamed: 0,latitude,longitude,name,comments,rating,Address,Neighborhood,City,Postal,Subregion,Region,Bajo,Medio Bajo,Medio Alto,Alto
0,19.353753,-99.189937,OXXO Helenico,4.0,4.3,Avenida Revolución,Sn Ángel,Guadalupe Inn,1020,alvaro obregon,ciudad de méxico,,104,80,24
1,19.348316,-99.18552,Oxxo La Paz,13.0,3.5,Avenida Miguel Ángel de Quevedo 36B,,Chimalistac,1050,alvaro obregon,ciudad de méxico,,104,80,24
2,19.351621,-99.185912,OXXO,9.0,3.0,Avenida Vito Alessio Robles 12,,Florida,1030,alvaro obregon,ciudad de méxico,,104,80,24
3,19.341629,-99.203068,Oxxo,27.0,4.0,Calle Veracruz 87,,Progreso Tizapán,1080,alvaro obregon,ciudad de méxico,,104,80,24
4,19.360927,-99.185247,Oxxo,15.0,3.0,Calle Manuel M. Ponce,Sn Ángel,Guadalupe Inn,1020,alvaro obregon,ciudad de méxico,,104,80,24
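
Because the merge is a left join on lowercased, accent-stripped names, a municipality spelled differently in the two sources would silently end up with empty socioeconomic columns. A quick way to look for such cases is to compare the key sets before merging. The sketch below uses hypothetical key pairs, and `unmatched_keys` is an illustrative helper, not part of the notebook:

```python
def unmatched_keys(left_keys, right_keys):
    """Return left-side keys with no counterpart on the right, sorted."""
    return sorted(set(left_keys) - set(right_keys))


# Hypothetical (state, municipality) pairs, for illustration only.
store_keys = [
    ("ciudad de méxico", "alvaro obregon"),
    ("méxico", "naucalpan"),
]
enigh_keys = [
    ("ciudad de méxico", "alvaro obregon"),
    ("méxico", "naucalpan de juarez"),
]

print(unmatched_keys(store_keys, enigh_keys))
```

An empty result would mean every store found a matching municipality in the survey data.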


In [47]:
df.to_csv("../data/oxxo/oxxo_data.csv", header=True, index=False)