## data_cleaning.ipynb

This notebook cleans restaurant reviews collected via web scraping. It extracts structured fields from the raw text, removes duplicates, handles missing values, and saves the result for further analysis.


In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))

import pandas as pd
import numpy as np
import re
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from src import cleaning



### Select the raw data to process

In [2]:
raw_data_path = '../data/raw/'

name = '17punto10'
reviews_raw = pd.read_csv(raw_data_path + 'collected_reviews_' + name + '.csv')
resumme_raw = pd.read_csv(raw_data_path + 'resumme_' + name + '.csv')
display(resumme_raw)
display(reviews_raw.sample(5))

Unnamed: 0,stars,reviews
0,5,571
1,4,191
2,3,35
3,2,13
4,1,13


Unnamed: 0,author,local_guide_info,rating,review,date_text,text_backup
12,Javier,Local Guide · 82 reseñas · 66 fotos,4 estrellas,"El sitio está un poco escondido, difícil para ...",Hace 10 meses,Javier\nLocal Guide · 82 reseñas · 66 fotos\n...
4,Carolina Gonzalez,Local Guide · 54 reseñas · 66 fotos,5 estrellas,SIEMPRE DE 10!\nHemos venido varias veces y ta...,Hace 3 semanas,Carolina Gonzalez\nLocal Guide · 54 reseñas · ...
18,Gema Pérez,Local Guide · 54 reseñas · 192 fotos,5 estrellas,Atención y comida maravillosa.\nNo sabría eleg...,Hace un año,Gema Pérez\nLocal Guide · 54 reseñas · 192 fot...
22,Coupage Unique,Local Guide · 285 reseñas · 1.979 fotos,4 estrellas,Uno de los mejores lugares de Zaragoza en rela...,Hace un año,Coupage Unique\nLocal Guide · 285 reseñas · 1....
20,David Larreal (El del sombrero),39 reseñas · 103 fotos,3 estrellas,Una visita en este lugar! Entre la ensalada dé...,Hace un año,David Larreal (El del sombrero)\n39 reseñas · ...
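As a quick sanity check, the star summary in `resumme_raw` is enough to recover the total number of reviews and the overall mean rating. The sketch below rebuilds that table by hand from the values shown above; the variable names `summary`, `total_reviews`, and `mean_rating` are illustrative and not part of the notebook.

```python
import pandas as pd

# Star histogram copied from the resumme_raw output above
summary = pd.DataFrame({
    'stars': [5, 4, 3, 2, 1],
    'reviews': [571, 191, 35, 13, 13],
})

# Total number of reviews across all star levels
total_reviews = int(summary['reviews'].sum())

# Mean rating weighted by the number of reviews at each star level
mean_rating = (summary['stars'] * summary['reviews']).sum() / total_reviews

print(total_reviews, round(mean_rating, 2))
```

Comparing these figures with the scraped sample gives a rough idea of how representative the collected reviews are.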


### Search words selected

Define a dictionary of regular expressions to extract specific fields (service, meal type, price range, scores, etc.) from the review text.

In [3]:
restaurant_search_words = {
    'service': r'Servicio\n([^\n]+)',
    'meal_type': r'Tipo de comida\n([^\n]+)',
    'price_per_person': r'Precio por persona\n([0-9€\- ]+)',
    'food_score': r'Comida: (\d+)',
    'service_score': r'Servicio: (\d+)',
    'atmosphere_score': r'Ambiente: (\d+)',
    'recommended': r'Platos recomendados\n([^\n]+)'
}
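To make the patterns concrete, here is a minimal sketch of how they could be applied to a single raw review. The notebook does not show how `cleaning.applyExtractDetails` works, so the helper `extract_fields`, the sample text, and the choice to return an empty string when a pattern does not match are all assumptions made for illustration.

```python
import re

# A subset of the patterns defined above
search_words = {
    'service': r'Servicio\n([^\n]+)',
    'price_per_person': r'Precio por persona\n([0-9€\- ]+)',
    'food_score': r'Comida: (\d+)',
    'meal_type': r'Tipo de comida\n([^\n]+)',
}

# Hypothetical raw text, shaped like the text_backup column
sample_text = (
    "Ana\nLocal Guide · 12 reseñas\nMuy buen sitio.\n"
    "Servicio\nComí allí\nPrecio por persona\n20-30 €\n"
    "Comida: 5\nServicio: 4"
)

def extract_fields(text, patterns):
    """Return the first capture group of each pattern, or '' when absent."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1).strip() if match else ''
    return result

fields = extract_fields(sample_text, search_words)
print(fields)
```

Note that `Servicio\n` only matches the section header followed by a line break, so the later `Servicio: 4` score line is not captured as the service type.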

In [4]:
reviews = reviews_raw.copy()

### Removing duplicates

Check for duplicated rows in the dataset and remove them to ensure data integrity.

In [5]:
# Stringify list-valued columns on a copy, because lists are unhashable
# and would make duplicated() fail
check_dups = reviews.copy()
for col in check_dups.columns:
    if check_dups[col].map(lambda x: isinstance(x, list)).any():
        check_dups[col] = check_dups[col].astype(str)

# Count duplicated rows on the hashable copy
duplicate_mask = check_dups.duplicated()
duplicates_count = duplicate_mask.sum()
print(f"Number of duplicated rows: {duplicates_count}")

# Drop the duplicated rows from the original frame using the same mask
reviews = reviews.loc[~duplicate_mask].copy()
print("Duplicates removed successfully.")

Number of duplicated rows: 0
Duplicates removed successfully.


### Prepare and process all fields

Clean and convert relevant columns to numeric types, extract additional details, and drop columns that are no longer needed. Note that `avg_price_per_person` stores the upper bound of the reported price range (for example, 30 for "20-30 €"), not a true average.

In [6]:
reviews['local_guide_reviews'] = reviews['local_guide_info'].apply(cleaning.extractReviewCount)
reviews['rating_score'] = reviews['rating'].apply(cleaning.extractStarRating)
reviews = cleaning.applyExtractDetails(reviews, search_words = restaurant_search_words)
reviews['recommendations_list'] = reviews['recommended'].apply(cleaning.extractRecommendations)
reviews['date'] = reviews['date_text'].apply(cleaning.convertToDate)

# Convert the extracted score strings to numbers; unparseable values become NaN
for col in ['food_score', 'service_score', 'atmosphere_score']:
    reviews[col] = pd.to_numeric(reviews[col], errors='coerce')

# Despite its name, this column keeps the upper bound of the price range (e.g. 30 for "20-30 €")
reviews['avg_price_per_person'] = reviews['price_per_person'].str.extract(r'-(\d+)\s*€', expand=False)
reviews['avg_price_per_person'] = pd.to_numeric(reviews['avg_price_per_person'], errors='coerce').astype('Int64')


reviews.drop(columns = ['text_backup', 'local_guide_info', 'rating', 'author', 'recommended', 'date_text'], inplace = True)
reviews.reset_index(inplace=True)
reviews.rename(columns={'index': 'review_id', 'price_per_person':'price_per_person_category'}, inplace=True)
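The helpers in `src.cleaning` are not shown in this notebook. As a rough idea of what such parsers might look like, the sketch below handles the three text formats visible in the raw sample: the review count, the star rating, and the relative date. Every function name here (`extract_review_count`, `extract_star_rating`, `convert_to_date`) and the explicit `reference` date parameter are assumptions, and the real implementations may differ. In particular, the processed dates in this notebook all fall on the first day of a month, which suggests the real helper rounds its result, while this sketch does not.

```python
import re
from datetime import date
from dateutil.relativedelta import relativedelta

def extract_review_count(info):
    """Parse '... 82 reseñas ...' into 82 (dots are thousands separators)."""
    if not isinstance(info, str):
        return None
    match = re.search(r'([\d.]+)\s+reseñas?', info)
    return int(match.group(1).replace('.', '')) if match else None

def extract_star_rating(rating):
    """Parse '4 estrellas' into 4."""
    if not isinstance(rating, str):
        return None
    match = re.search(r'(\d+)\s+estrellas?', rating)
    return int(match.group(1)) if match else None

# Spanish time units mapped to relativedelta keyword arguments
UNITS = {'día': 'days', 'semana': 'weeks', 'mes': 'months', 'año': 'years'}

def convert_to_date(text, reference):
    """Turn 'Hace 10 meses' into a date relative to `reference`."""
    if not isinstance(text, str):
        return None
    match = re.search(r'Hace (un|una|\d+) (día|semana|mes|año)', text)
    if not match:
        return None
    amount = 1 if match.group(1) in ('un', 'una') else int(match.group(1))
    return reference - relativedelta(**{UNITS[match.group(2)]: amount})

reference = date(2024, 9, 15)  # hypothetical scraping date
print(extract_review_count('Local Guide · 285 reseñas · 1.979 fotos'))
print(extract_star_rating('4 estrellas'))
print(convert_to_date('Hace 10 meses', reference))
```

Passing the reference date explicitly keeps the conversion reproducible, since "Hace 10 meses" means something different each time the notebook is rerun.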

### Check null values

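Count the missing values in each column, then fill gaps in `local_guide_reviews` and `rating_score` with a default of 1. Neither column has missing values in this dataset, so the fill acts only as a safeguard. The score columns and `avg_price_per_person` keep their missing values.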

In [7]:
# Check for missing values in each column
missing_values = reviews.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Percentage of missing values in each column
missing_percentage = (reviews.isnull().mean() * 100).round(2)
print("Percentage of missing values per column:")
print(missing_percentage)


Missing values per column:
review_id                     0
review                        0
local_guide_reviews           0
rating_score                  0
service                       0
meal_type                     0
price_per_person_category     0
food_score                    9
service_score                 9
atmosphere_score              9
recommendations_list          0
date                          0
avg_price_per_person         11
dtype: int64
Percentage of missing values per column:
review_id                     0.00
review                        0.00
local_guide_reviews           0.00
rating_score                  0.00
service                       0.00
meal_type                     0.00
price_per_person_category     0.00
food_score                   30.00
service_score                30.00
atmosphere_score             30.00
recommendations_list          0.00
date                          0.00
avg_price_per_person         36.67
dtype: float64
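`isnull()` does not count empty strings, which is a likely reason why `service`, `meal_type`, and `price_per_person_category` report 0 missing values above even though many reviews lack those fields. The following sketch, built on a small hypothetical frame rather than the real data, shows how blank strings could be counted alongside true nulls. The names `df`, `text_cols`, `null_counts`, and `blank_counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaned frame
df = pd.DataFrame({
    'service': ['', 'Comí allí', ''],
    'meal_type': ['', 'Cena', 'Comida'],
    'food_score': [5.0, np.nan, 4.0],
})

text_cols = ['service', 'meal_type']

# True nulls only: empty strings are not counted here
null_counts = df.isnull().sum()

# Blank or whitespace-only strings in the text columns
blank_counts = df[text_cols].apply(
    lambda s: s.fillna('').str.strip().eq('').sum()
)

print(null_counts)
print(blank_counts)
```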


In [8]:
# Fill NA values
reviews['local_guide_reviews'] = reviews['local_guide_reviews'].fillna(1)
reviews['rating_score'] = reviews['rating_score'].fillna(1)

### Variables distribution

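Summarize the numeric variables with `describe()` and print value counts for the categorical ones. This shows how ratings, review counts, and prices are distributed, and how often the optional fields (service, meal type, price range) are filled in.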

In [9]:
# Summary of numeric columns
print("Summary of numeric variables:")
display(reviews.describe())

# Summary of categorical columns
print("Distribution of categorical variables:")
for col in reviews.select_dtypes(include=['object']).columns:
    if col in ("review", 'recommendations_list', 'date'):
        continue
    print(f"\n{col} distribution:")
    print(reviews[col].value_counts())


Summary of numeric variables:


Unnamed: 0,review_id,local_guide_reviews,rating_score,food_score,service_score,atmosphere_score,avg_price_per_person
count,30.0,30.0,30.0,21.0,21.0,21.0,19.0
mean,14.5,78.333333,4.8,4.904762,4.857143,4.761905,37.894737
std,8.803408,82.964194,0.484234,0.300793,0.358569,0.538958,12.283208
min,0.0,3.0,3.0,4.0,4.0,3.0,30.0
25%,7.25,28.75,5.0,5.0,5.0,5.0,30.0
50%,14.5,43.5,5.0,5.0,5.0,5.0,30.0
75%,21.75,97.75,5.0,5.0,5.0,5.0,40.0
max,29.0,311.0,5.0,5.0,5.0,5.0,80.0


Distribution of categorical variables:

service distribution:
service
             25
Comí allí     5
Name: count, dtype: int64

meal_type distribution:
meal_type
          24
Comida     4
Cena       2
Name: count, dtype: int64

price_per_person_category distribution:
price_per_person_category
           11
20-30 €    10
30-40 €     6
40-50 €     2
70-80 €     1
Name: count, dtype: int64
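The price categories above are ranges, and the cleaning step keeps only the number after the hyphen, that is, the upper bound. If a midpoint is preferred, both bounds can be extracted and averaged. This is a sketch of one alternative, not what the notebook does; the `price` series and the column names `low`, `high`, and `mid` are hypothetical.

```python
import pandas as pd

# Hypothetical values shaped like price_per_person_category
price = pd.Series(['20-30 €', '30-40 €', '', '70-80 €'])

# Capture both ends of the range; rows that do not match become NaN
bounds = price.str.extract(r'(\d+)\s*-\s*(\d+)\s*€')
bounds = bounds.apply(pd.to_numeric, errors='coerce')
bounds.columns = ['low', 'high']

# Midpoint of each range as an estimate of the typical spend
bounds['mid'] = bounds[['low', 'high']].mean(axis=1)

print(bounds)
```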


### Saving clean data to processed folder

In [12]:
csv_file_path = '../data/processed/'
reviews.to_csv(csv_file_path + name + '_reviews.csv', index=False)
print('OK! -> processed reviews saved at', csv_file_path + name + '_reviews.csv')

OK! -> processed reviews saved at ../data/processed/17punto10_reviews.csv
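One caveat when reloading the processed file: CSV stores list columns such as `recommendations_list` as plain strings and dates as text. A minimal sketch of restoring both on read is shown below, using an in-memory buffer and a hypothetical two-row frame instead of the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with a list column and a date column
df = pd.DataFrame({
    'review_id': [0, 1],
    'recommendations_list': [['Tarta de Manzana'], []],
    'date': ['2024-05-01', '2023-01-01'],
})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the list strings back into lists and the dates into datetimes
restored = pd.read_csv(
    buffer,
    converters={'recommendations_list': ast.literal_eval},
    parse_dates=['date'],
)

print(restored.dtypes)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.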


In [13]:
display(reviews.sample(20))

Unnamed: 0,review_id,review,local_guide_reviews,rating_score,service,meal_type,price_per_person_category,food_score,service_score,atmosphere_score,recommendations_list,date,avg_price_per_person
8,8,"Fuimos de celebración,el sitio es pequeño, de ...",40,5,,,30-40 €,5.0,5.0,5.0,[],2024-03-01,40.0
9,9,Es un sitio donde la comida es sencillamente d...,64,4,,,,5.0,5.0,5.0,[],2023-01-01,
11,11,Volvemos al 17punto10 y… que gozada!!!\n\nUn s...,69,5,,,,,,,[],2023-01-01,
12,12,"El sitio está un poco escondido, difícil para ...",82,4,,,40-50 €,4.0,4.0,3.0,[],2023-11-01,50.0
13,13,"Empezamos con las croquetas de bacon ibérico, ...",78,5,,,30-40 €,5.0,5.0,5.0,[Panceta Asada Mojo Rojo Y Chimichurri Casero ...,2024-02-01,40.0
5,5,Fuimos recomendados por un amigo y fue todo un...,158,5,,,30-40 €,5.0,5.0,5.0,[],2024-03-01,40.0
3,3,Espacio confortable aún estando en la barra ja...,16,5,,,20-30 €,5.0,5.0,5.0,[Tarta de Manzana Al Revés Con Helado de Canela],2024-05-01,30.0
6,6,Un sitio muy interesante para visitar. El serv...,103,5,,,20-30 €,5.0,5.0,5.0,"[Cremoso de Chocolate Blanco, Burrata Con Toma...",2024-04-01,30.0
24,24,Una maravilla de restaurante. Personal amabilí...,31,5,Comí allí,Comida,20-30 €,5.0,5.0,5.0,[],2023-01-01,30.0
23,23,Un sitio muy agradable en el que probar comida...,11,5,Comí allí,Cena,70-80 €,,,,[],2023-01-01,80.0
