# ETL (Extract, Transform, Load)

This dataset contains detailed information on over 125,000 restaurants across 31 major cities in Europe, collected from Tripadvisor. Each entry includes:

- City location

- Cuisine (single or multiple cuisines per restaurant)

- Rating (on a 1–5 scale)

- Ranking within the city

- Price Range (Low, Mid, High)

- Number of Reviews

- Sample of customer reviews and their dates

- Tripadvisor URL and unique restaurant ID

The dataset supports a wide range of analyses, including cuisine trends, pricing patterns, customer preferences, and regional comparisons. While rich in content, it also contains missing values that require preprocessing for accurate analysis.

## Objectives
The objective of this step is to extract, clean, and transform raw data into a structured format suitable for analysis and visualization, ensuring consistency, accuracy, and usability throughout the project.

## Inputs
The dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/damienbeneschi/krakow-ta-restaurans-data-raw/data).

## Outputs
The cleaned CSV file can be found [here]().

# ETL Process

- Load the dataset
- Understand dataset structure and content
- Clean the dataset
- Convert data types
- Add country column
- Add cuisine counts column
- Save the cleaned dataset as a CSV file
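
As an illustration of the cuisine count step listed above, here is a minimal sketch. It assumes the cuisine field is stored as a stringified Python list (for example `"['French', 'Dutch', 'European']"`) and that missing or unparsable values should count as zero. The helper name `count_cuisines` is hypothetical and not part of the notebook.

```python
import ast

def count_cuisines(raw_value):
    """Count the cuisines in a stringified list such as "['French', 'Dutch']"."""
    if not isinstance(raw_value, str):
        # Assumption: missing values (e.g. NaN) count as zero cuisines
        return 0
    try:
        parsed = ast.literal_eval(raw_value)  # safely turn the text into a Python object
    except (ValueError, SyntaxError):
        # Assumption: text that is not a valid literal also counts as zero
        return 0
    return len(parsed) if isinstance(parsed, list) else 0

print(count_cuisines("['French', 'Dutch', 'European']"))  # prints 3
```

Applied to a DataFrame, this could look like `df['Cuisine Count'] = df['Cuisine Style'].apply(count_cuisines)`, where the new column name `Cuisine Count` is only a placeholder.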

---

# Change working directory
Change the working directory from the current folder to its parent folder, because the notebooks are stored in a subfolder.
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\amron\\Desktop\\euro-dine-insights\\jupyter_notebooks'

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [5]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [7]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\amron\\Desktop\\euro-dine-insights'
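
Re-running the `os.chdir` cell moves one level further up each time, so the notebook can end up outside the project folder. The sketch below guards against that by changing directory only when the current folder is the notebook subfolder. The function name `move_to_project_root` is hypothetical, and the default folder name `jupyter_notebooks` is taken from the path shown above.

```python
import os

def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent folder only if we are currently inside notebook_folder."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_folder:
        os.chdir(os.path.dirname(cwd))  # step up exactly once
    return os.getcwd()

# Safe to call repeatedly: the second call leaves the directory unchanged
project_root = move_to_project_root()
project_root = move_to_project_root()
```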

Define the paths to the raw and processed dataset folders

In [8]:
raw_data_dir = os.path.join(current_dir, 'data_set', 'raw')  # folder holding the raw dataset

processed_data_dir = os.path.join(current_dir, 'data_set', 'processed')  # folder for the cleaned output
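
Because the cleaned file will later be written to the processed folder, it can help to make sure that folder exists before saving. This is a minimal sketch and not part of the original notebook. The helper name `ensure_dir` is hypothetical.

```python
import os

def ensure_dir(path):
    """Create path (and any missing parent folders) if needed, then return it."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path

# Example call in the notebook (assumes processed_data_dir is defined as above):
# ensure_dir(processed_data_dir)
```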


---

# Import packages

In [9]:
import numpy as np #import numpy
import pandas as pd #import pandas
import matplotlib.pyplot as plt #import matplotlib
import seaborn as sns #import seaborn
import plotly.express as px # import plotly
sns.set_style('whitegrid') #set style for visuals

---

# Load the raw dataset

In [10]:
# Load the raw dataset (pandas was already imported above as pd)
df = pd.read_csv(os.path.join(raw_data_dir, 'TA_restaurants_curated.csv'))

The raw dataset is loaded with pandas for the ETL process.

---

# Understand the dataset structure and content

In [11]:
#displaying data
df.head() 

Unnamed: 0.1,Unnamed: 0,Name,City,Cuisine Style,Ranking,Rating,Price Range,Number of Reviews,Reviews,URL_TA,ID_TA
0,0,Martine of Martine's Table,Amsterdam,"['French', 'Dutch', 'European']",1.0,5.0,$$ - $$$,136.0,"[['Just like home', 'A Warm Welcome to Wintry ...",/Restaurant_Review-g188590-d11752080-Reviews-M...,d11752080
1,1,De Silveren Spiegel,Amsterdam,"['Dutch', 'European', 'Vegetarian Friendly', '...",2.0,4.5,$$$$,812.0,"[['Great food and staff', 'just perfect'], ['0...",/Restaurant_Review-g188590-d693419-Reviews-De_...,d693419
2,2,La Rive,Amsterdam,"['Mediterranean', 'French', 'International', '...",3.0,4.5,$$$$,567.0,"[['Satisfaction', 'Delicious old school restau...",/Restaurant_Review-g188590-d696959-Reviews-La_...,d696959
3,3,Vinkeles,Amsterdam,"['French', 'European', 'International', 'Conte...",4.0,5.0,$$$$,564.0,"[['True five star dinner', 'A superb evening o...",/Restaurant_Review-g188590-d1239229-Reviews-Vi...,d1239229
4,4,Librije's Zusje Amsterdam,Amsterdam,"['Dutch', 'European', 'International', 'Vegeta...",5.0,4.5,$$$$,316.0,"[['Best meal.... EVER', 'super food experience...",/Restaurant_Review-g188590-d6864170-Reviews-Li...,d6864170


Upon loading the dataset, we observed two columns, Unnamed: 0.1 and Unnamed: 0, that duplicate the default index in the rows shown. Since they carry no additional information, they will be dropped in the data cleaning step to avoid redundancy.
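
A minimal sketch of how such index-copy columns could be dropped, assuming they can be identified by the `Unnamed` name prefix that pandas assigns to header-less columns. The helper name `drop_index_copies` and the two-row sample frame are hypothetical and only illustrate the idea. The notebook's own cleaning step may differ.

```python
import pandas as pd

def drop_index_copies(frame):
    """Return a copy of frame without columns whose name starts with 'Unnamed'."""
    keep = ~frame.columns.str.startswith("Unnamed")  # boolean mask over column names
    return frame.loc[:, keep].copy()

# Tiny stand-in for the real dataset, using column names from the preview above
sample = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "Name": ["La Rive", "Vinkeles"],
    "City": ["Amsterdam", "Amsterdam"],
})
cleaned = drop_index_copies(sample)
print(list(cleaned.columns))  # prints ['Name', 'City']
```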

---