# Cleaning data
This notebook combines a set of CSV files describing places into a single DataFrame, filters out low-quality and irrelevant rows, adds two computed columns (`bayesian_mean` and `Comuna`), and saves the filtered, enriched dataset to `final_places.csv`.

## Importing Required Libraries

In [75]:
from unidecode import unidecode
import pandas as pd
import os

## Combining Multiple CSV Files into One DataFrame

In [76]:
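places_dir = "./data/output/places"
dataframes = []

# Read every file named places*.csv; final_places.csv does not match the prefix and is skipped.
for archivo in os.listdir(places_dir):
    if archivo.endswith(".csv") and archivo.startswith("places"):
        df = pd.read_csv(os.path.join(places_dir, archivo))
        dataframes.append(df)

# "Unnamed: 0" is the index column that was written when the source files were saved.
df_combined = pd.concat(dataframes, ignore_index=True).drop(columns="Unnamed: 0")
df_combined.head()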

| | name | desc | score | c_score | price | category | accessibility | schedule | web | search_parameters | phone | address | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Los Fabio's Popular | Información Opciones de servicio\n\nAsientos... | 4.5 | 74.0 | $ 10.000-20.000 | Hamburguesería | | 6 pm.,9 pm. | | Comuna Popular Restaurantes | 319 6117987 | Cra. 42c #107-001, La Isla, Medellín, Popular,... | 6.295462 | -75.5485 |
| 1 | BRASAS MI SOR | Información Opciones de servicio\n\nRetiros ... | 5.0 | 1.0 | | Restaurante | | | https://brasas-mi-sor.ola.click/ | Comuna Popular Restaurantes | 314 7757452 | Cra. 43 #110 a 58, La Isla, Medellín, Popular,... | 6.299469 | -75.547267 |
| 2 | Mandingas la 107 | Información Opciones de servicio\n\nPara lle... | 4.8 | 20.0 | $ 1-10.000 | Restaurante de comida para llevar | | 6 pm.,6 . | | Comuna Popular Restaurantes | 319 3353560 | Cra 49B #107-3, Villa Niza, Medellín, Santa Cr... | 6.297972 | -75.554448 |
| 3 | Grosseto Pizzeria | Información Accesibilidad\n\nEntrada accesib... | 4.4 | 14.0 | | Restaurante | Accesible con silla de ruedas | 12 pm.,6 . | | Comuna Popular Restaurantes | 324 2886618 | Calle 126, Cra. 42ee #88, Medellín, Antioquia | 6.304309 | -75.546252 |
| 4 | Las Paisanas | Información Opciones de servicio\n\nAsientos... | 4.6 | 15.0 | | Restaurante | | | | Comuna Popular Restaurantes | 302 3874084 | 0505, Medellín, Antioquia | 6.30567 | -75.55314 |
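
The loading step above could also be wrapped in a small helper. The following is a minimal sketch, not the notebook's actual code: the function name `load_places` is hypothetical, and it assumes every input file was saved with its index as the first column, so `index_col=0` can take the place of dropping `Unnamed: 0` afterwards.

```python
from pathlib import Path

import pandas as pd


def load_places(directory):
    """Read every places*.csv file in `directory` into one DataFrame (hypothetical helper)."""
    # The glob pattern matches only names starting with "places",
    # so a previously saved final_places.csv is not read back in.
    paths = sorted(Path(directory).glob("places*.csv"))
    if not paths:
        raise FileNotFoundError(f"no places*.csv files found in {directory}")
    # index_col=0 treats the first (unnamed) column as the index
    # instead of loading it as an "Unnamed: 0" data column.
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_places("./data/output/places")` would then return a frame comparable to `df_combined`, provided the assumption about the index column holds for all files.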


## Filtering the Data

In [77]:
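df_filtered = df_combined.copy()
# Keep places with a score above 3 and a c_score above 10; rows where either value is missing are dropped too.
df_filtered = df_filtered[df_filtered["score"] > 3]
df_filtered = df_filtered[df_filtered["c_score"] > 10]
# Drop places whose name contains "infantil" (accents and case are ignored).
df_filtered = df_filtered[df_filtered["name"].apply(lambda x: "infantil" not in unidecode(x).lower())]
# Drop rows whose search category (the third word of search_parameters) is "Hoteles".
df_filtered = df_filtered[df_filtered["search_parameters"].apply(lambda x: x.split()[2] != "Hoteles")]
len(df_filtered)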

1345
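
To see how the numeric and category filters behave, here is a small sketch on made-up rows (the names and values in `toy` are hypothetical, not taken from the dataset). It combines the conditions into one boolean mask and leaves out the `unidecode` name filter for brevity. Note that a comparison against a missing value evaluates to `False`, so rows without a score are removed as well.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, one per filtering outcome.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "score": [4.5, 5.0, 4.0, np.nan],
    "c_score": [74.0, 1.0, 50.0, 30.0],
    "search_parameters": [
        "Comuna Popular Restaurantes",
        "Comuna Popular Restaurantes",
        "Comuna Popular Hoteles",
        "Comuna Popular Restaurantes",
    ],
})

# Third word of search_parameters, extracted with vectorized string methods.
category = toy["search_parameters"].str.split().str[2]

mask = (
    (toy["score"] > 3)          # NaN > 3 is False, so row D is dropped
    & (toy["c_score"] > 10)     # row B has too few ratings
    & (category != "Hoteles")   # row C is a hotel search result
)
kept = toy[mask]
```

Only row `A` survives all three conditions. Applying the filters one at a time, as the notebook does, gives the same result; a single mask just makes it easier to inspect which condition removed a given row.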

## Adding Computed Columns

In [None]:
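# Bayesian mean: pull each score toward 4, weighted as if every place had 10 extra ratings of 4.
df_filtered["bayesian_mean"] = (df_filtered["score"] * df_filtered["c_score"] + 4 * 10) / (df_filtered["c_score"] + 10)
# Comuna: second word of search_parameters, e.g. "Popular" in "Comuna Popular Restaurantes".
df_filtered["Comuna"] = df_filtered["search_parameters"].apply(lambda x: x.split()[1])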
df_filtered.head()
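
The `bayesian_mean` formula can be read as a weighted average between a place's own score and a fixed prior. The sketch below restates it with named constants; the names `PRIOR_MEAN` and `PRIOR_WEIGHT` and the function `bayesian_mean` are labels introduced here for illustration, and the two example rows are hypothetical. It also assumes that `c_score` is the number of ratings behind `score`, which is how the formula uses it.

```python
import pandas as pd

PRIOR_MEAN = 4     # the constant 4 in the notebook's formula
PRIOR_WEIGHT = 10  # the constant 10, acting like 10 extra ratings of PRIOR_MEAN


def bayesian_mean(score, c_score, prior_mean=PRIOR_MEAN, prior_weight=PRIOR_WEIGHT):
    """Weighted average of the observed score and the prior (illustrative helper)."""
    return (score * c_score + prior_mean * prior_weight) / (c_score + prior_weight)


# Hypothetical places: a perfect score from few ratings vs. a good score from many.
toy = pd.DataFrame({"score": [5.0, 4.6], "c_score": [11.0, 200.0]})
toy["bayesian_mean"] = bayesian_mean(toy["score"], toy["c_score"])
```

For the first row the result is (5.0 × 11 + 40) / 21 ≈ 4.52, and for the second it is (4.6 × 200 + 40) / 210 ≈ 4.57. The place with 200 ratings ends up ranked higher even though its raw score is lower, which is the effect the adjustment is meant to produce.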


## Data Summary

In [81]:
print("Initial data:", len(df_combined))
print("Filtered data:", len(df_filtered))

Initial data: 1781
Filtered data: 1345


## Saving the Final Dataset

In [82]:
# The DataFrame index is written as an unnamed first column; pass index=False to omit it.
df_filtered.to_csv("./data/output/places/final_places.csv")
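
By default `to_csv` also writes the DataFrame index, which is where a column such as `Unnamed: 0` comes from when the file is read back. The sketch below shows the difference on a hypothetical two-row frame, writing to in-memory buffers rather than to disk.

```python
import io

import pandas as pd

# Hypothetical data standing in for df_filtered.
toy = pd.DataFrame({"name": ["A", "B"], "score": [4.5, 4.8]})

# Default behaviour: the index is saved as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are saved.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)
```

Here `cols_default` contains an extra `Unnamed: 0` entry that `cols_no_index` does not. Whether to keep the index in `final_places.csv` depends on how downstream code reads the file.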