# Webscraping & Applied ML
<p> </p>
Sarujan DENSON <br>
Yahya EL OUDOUNI <br>
Mohamed Houssem REZGUI <br>
DIA 2

# Data cleaning

## Hotel dataset

Thanks to the API, we collected data on 45 hotels. The data contains no errors and no missing values, but it is stored in a json file, so we now convert it into a csv file.

In [1]:
import json
import csv

def json_to_csv(json_file,csv_file):
    #Load the json file
    with open(json_file,'r',encoding='utf-8') as f:
        data=json.load(f)
    #Names of fields we would like to have in our csv file
    fields=["chainCode","iataCode","dupeId","name","hotelId","latitude","longitude","countryCode","value","unit","amenities","rating","lastUpdate"]

    #Write in the csv file
    with open(csv_file,'w',newline='',encoding='utf-8') as csvfile:
        writer=csv.DictWriter(csvfile,fieldnames=fields)
        writer.writeheader() #headers

        #Collect the value of each field from the json data and write it to the csv file
        for item in data.get("data",[]):
            row={
                "chainCode":item.get("chainCode"),
                "iataCode":item.get("iataCode"),
                "dupeId":item.get("dupeId"),
                "name":item.get("name"),
                "hotelId":item.get("hotelId"),
                "latitude":item.get("geoCode",{}).get("latitude"),
                "longitude":item.get("geoCode",{}).get("longitude"),
                "countryCode":item.get("address",{}).get("countryCode"),
                "value":item.get("distance",{}).get("value"),
                "unit":item.get("distance",{}).get("unit"),
                "amenities": ",".join(item.get("amenities",[])),
                "rating":item.get("rating"),
                "lastUpdate":item.get("lastUpdate")
            }
            writer.writerow(row)

json_file='/content/paris_hotels_list.json'  #input file in json format
csv_file='paris_hotels.csv'    #output file in csv format
json_to_csv(json_file,csv_file)
print(f"Data of Paris hotels in csv format: {csv_file}")

Data of Paris hotels in csv format: paris_hotels.csv
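As an alternative sketch, the same flattening of nested json fields can be done with `pd.json_normalize`. The record below is a made-up placeholder (not taken from our dataset) that only mimics the nested structure read by `json_to_csv`.

```python
import pandas as pd

# Hypothetical record with the same nesting as the API response (values are invented)
sample_payload = {
    "data": [
        {
            "chainCode": "XX",
            "iataCode": "PAR",
            "name": "HYPOTHETICAL HOTEL",
            "hotelId": "XXPAR001",
            "geoCode": {"latitude": 48.8566, "longitude": 2.3522},
            "address": {"countryCode": "FR"},
            "distance": {"value": 0.5, "unit": "KM"},
        }
    ]
}

# json_normalize flattens nested dictionaries into dotted column names
hotels = pd.json_normalize(sample_payload["data"])

# Rename the dotted columns to the flat names used in our csv file
hotels = hotels.rename(columns={
    "geoCode.latitude": "latitude",
    "geoCode.longitude": "longitude",
    "address.countryCode": "countryCode",
    "distance.value": "value",
    "distance.unit": "unit",
})
print(hotels.columns.tolist())
```

Calling `hotels.to_csv(...)` on the result would then write the csv file in one step.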


## Restaurant dataset

Using our scraping code with Selenium and BeautifulSoup, we collected data on 100 Paris restaurants from the website TheFork and stored it in a csv file. For each restaurant we have the name, the link to the restaurant's page, the address, the price, the speciality, the ratings, the description, the menu and the reviews. However, the scraping is not perfect: the resulting csv file contains some errors, some bad lines and some missing values. We therefore clean it and export the result to a new csv file for a machine learning problem.

First step:


*   We remove quotes from the headers and values
*   We replace commas with hyphens in every value (this mostly affects the free-text "Description", "Menu" and "Avis" fields, but also addresses and marks) to avoid confusion with the delimiter in the csv file

In [3]:
def clean_csv(input_file,output_file):

    with open(input_file,'r',encoding='utf-8') as infile:
        reader=csv.reader(infile) #read the restaurant csv file
        headers=next(reader) #clean the headers by removing quotes
        clean_headers=[header.strip('"') for header in headers]

        #Clean the data: strip surrounding quotes and replace commas with hyphens in every field (this mostly affects "Description", "Menu" and "Avis")
        cleaned_data=[]
        for row in reader:
            clean_row=[]
            for value in row:
                clean_value=value.strip('"').replace(",","-") #put hyphens
                clean_row.append(clean_value)
            cleaned_data.append(clean_row)

    #write in a new csv file containing restaurant data
    with open(output_file,'w',newline='',encoding='utf-8') as outfile:
        writer=csv.writer(outfile)
        writer.writerow(clean_headers)  #with clean headers
        writer.writerows(cleaned_data)  #with clean data

input_file='/content/all_restaurants_thefork_paris_detailed.csv'  #input file
output_file='paris_restaurants.csv'  #output file
clean_csv(input_file,output_file)
print(f"First version of the csv file of restaurant data: {output_file}")

First version of the csv file of restaurant data: paris_restaurants.csv
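For comparison, here is a minimal sketch of another way to keep commas from clashing with the delimiter: letting the `csv` module quote the fields instead of replacing the commas. This is not what we did above; the address is written with its original commas purely for illustration.

```python
import csv
import io

# A row whose address contains commas (illustrative value)
row = ["Asahi", "56, rue de Belleville, 75020, Paris"]

# csv.writer quotes any field that contains the delimiter
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
written = buffer.getvalue()

# csv.reader undoes the quoting, so the row round-trips unchanged
parsed = next(csv.reader(io.StringIO(written)))
print(parsed == row)
```

With this approach the file must be read back with default quoting (not `quoting=3`), otherwise the quotes are treated as ordinary characters.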


Now we can read the csv file which contains data of Paris restaurants

In [4]:
import pandas as pd
df = pd.read_csv('/content/paris_restaurants.csv', delimiter=',', quoting=3, on_bad_lines='skip')
df

Unnamed: 0,Title,Link,Address,Price,Type_of_Restaurant,Mark,Number_of_Reviews,Image,Ambiance,Plats,Service,Description,Menu,Avis
0,Asahi,https://www.thefork.fr/restaurant/asahi-r47875...,56- rue de Belleville- 75020- Paris,20 €,Japonais,9-4,281.0,https://res.cloudinary.com/tf-lab/image/upload...,9.4,9.5,9.5,JAPONAIS PIONNIER – Monsieur YE est un pionnie...,Takoyaki- Wakame- Karaage- Ikageso- Carpaccio ...,Cadre sympathique- service rapide et bons cons...
1,Le Fil Rouge Café,https://www.thefork.fr/restaurant/le-fil-rouge...,3- rue René Boulanger- 75010- Paris,26 €,Américain,9-4,511.0,https://res.cloudinary.com/tf-lab/image/upload...,9.5,9.2,9.6,En plein coeur du 10e arrondissement- à deux p...,Sticks mozzarella- Oignons rings (5 pièces)- Q...,le décor- l'ambiance et le fait maison...on y ...
2,Feyrouz,https://www.thefork.fr/restaurant/feyrouz-r753...,10 Rue de Lourmel- 75015- Paris,25 €,Libanais,9-1,76.0,https://res.cloudinary.com/tf-lab/image/upload...,8.7,9.2,9.5,Chez Feyrouz- l'ambiance familiale et la bonne...,Taboule- Moutabal- Taboule- Fatouche- Laban Co...,l accueil et la gentillesse sont le seul point...
3,Galia - Maxim Godigna,https://www.thefork.fr/restaurant/galia-maxim-...,123 Rue Didot- 75014- Paris,44 €,Fusion,9-1,1.0,https://res.cloudinary.com/tf-lab/image/upload...,8.4,9.3,9.2,UNE TABLE QUI PROMET - Diplômé de la haute éco...,Ceviche de poisson d'arrivage- condiment goyav...,Tres décevant par rapport à notre dernière vis...
4,La Table de Colette,https://www.thefork.fr/restaurant/la-table-de-...,17 Rue Laplace- 75005- Paris,70 €,Français,9-2,579.0,https://res.cloudinary.com/tf-lab/image/upload...,8.9,9.2,9.2,La Table de Colette offre une expérience culin...,Menu en 7 temps- Accord mets et boissons- Menu...,Une cuisine à la croisée de la modernité- des ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,Sorza,https://www.thefork.fr/restaurant/sorza-r25344...,51- rue Saint Louis en L'Ile- 75004- Paris,39 €,Français,9-4,1.0,https://res.cloudinary.com/tf-lab/image/upload...,9.3,9.4,9.5,Sorza- situé dans le charmant cadre de l'Île S...,Bouteille d'eau- Demi-bouteille d'eau- Café- S...,Superbe expérience ! Plats délicieux et servic...
86,L'OLIVIER,https://www.thefork.fr/restaurant/l-olivier-r1...,88- rue Ordener- 75018- Paris,46 €,Français,9-1,2.0,https://res.cloudinary.com/tf-lab/image/upload...,9.3,9.4,9.5,L'OLIVIER charme par son ambiance chaleureuse ...,Rolls- Crevettes Bio De Madagascar- Mayonnaise...,La viande était très tendre mais la sauce béar...
87,Visconti Madeleine,https://www.thefork.fr/restaurant/visconti-mad...,4- rue de l'Arcade- 75008- Paris,48 €,Italien,9-1,3.0,https://res.cloudinary.com/tf-lab/image/upload...,8.7,9.2,9.2,Visconti Madeleine vous transporte en Italie a...,Rucola e grana padano- Caprese- Bruschetta mis...,L’accueil et la qualité des mets Une valeur sû...
88,LES FILAOS,https://www.thefork.fr/restaurant/les-filaos-r...,5 Rue Guy de Maupassant- 75116- Paris,25 €,Mauricien,,,https://res.cloudinary.com/tf-lab/image/upload...,9.0,9.1,9.2,Les Filaos séduit par son cadre chaleureux et ...,Carpaccio d'Espadon au Citron Vert- Crabe Farc...,Restaurant calme- salle agréable et bonne cuis...


We can now read our csv file. However, 10 lines are dropped because `on_bad_lines='skip'` tells pandas to skip lines it cannot parse (typically lines with too many fields), which here are a side effect of our scraping method. There are now 90 restaurants in our new csv file.
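To see which lines are affected, a small helper like the one below can list the rows whose number of fields differs from the header. It is only a sketch: `find_bad_lines` and the sample text are hypothetical, and it flags rows with too few fields as well as too many.

```python
import csv
import io

def find_bad_lines(text):
    """Return (line_number, fields) for rows whose field count differs from the header's."""
    # csv.QUOTE_NONE has the value 3, mirroring quoting=3 in the read_csv call
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            bad.append((reader.line_num, row))
    return bad

# Tiny made-up sample: the third line has one field too many
sample_text = "Title,Price\nAsahi,20\nFeyrouz,25,extra\nSorza,39\n"
bad_rows = find_bad_lines(sample_text)
print(bad_rows)
```

On the real file, the text would come from reading `paris_restaurants.csv` instead of `sample_text`.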

Second step :    


*   We remove '€' from the price
*   We replace the hyphen with a dot in the "Mark" field
*   We create a new column containing only the last two digits of the postal code
*   We change the type of the "Number_of_Reviews" column from float to integer (missing values become 0)


In [5]:
df['Price']=df['Price'].str.replace('€','').str.strip()

df['Mark']=df['Mark'].str.replace('-','.', regex=False)

#For the postal code, we use a regex to find the first sequence of 5 digits in the address field, then keep its last two digits to create the new column
df['PostalCode_LastTwo']=df['Address'].str.extract(r'(\d{5})')[0].str[-2:]

df['Number_of_Reviews']=df['Number_of_Reviews'].fillna(0).astype(int)

df.to_csv('/content/paris_restaurants_2.csv',index=False)
df

Unnamed: 0,Title,Link,Address,Price,Type_of_Restaurant,Mark,Number_of_Reviews,Image,Ambiance,Plats,Service,Description,Menu,Avis,PostalCode_LastTwo
0,Asahi,https://www.thefork.fr/restaurant/asahi-r47875...,56- rue de Belleville- 75020- Paris,20,Japonais,9.4,281,https://res.cloudinary.com/tf-lab/image/upload...,9.4,9.5,9.5,JAPONAIS PIONNIER – Monsieur YE est un pionnie...,Takoyaki- Wakame- Karaage- Ikageso- Carpaccio ...,Cadre sympathique- service rapide et bons cons...,20
1,Le Fil Rouge Café,https://www.thefork.fr/restaurant/le-fil-rouge...,3- rue René Boulanger- 75010- Paris,26,Américain,9.4,511,https://res.cloudinary.com/tf-lab/image/upload...,9.5,9.2,9.6,En plein coeur du 10e arrondissement- à deux p...,Sticks mozzarella- Oignons rings (5 pièces)- Q...,le décor- l'ambiance et le fait maison...on y ...,10
2,Feyrouz,https://www.thefork.fr/restaurant/feyrouz-r753...,10 Rue de Lourmel- 75015- Paris,25,Libanais,9.1,76,https://res.cloudinary.com/tf-lab/image/upload...,8.7,9.2,9.5,Chez Feyrouz- l'ambiance familiale et la bonne...,Taboule- Moutabal- Taboule- Fatouche- Laban Co...,l accueil et la gentillesse sont le seul point...,15
3,Galia - Maxim Godigna,https://www.thefork.fr/restaurant/galia-maxim-...,123 Rue Didot- 75014- Paris,44,Fusion,9.1,1,https://res.cloudinary.com/tf-lab/image/upload...,8.4,9.3,9.2,UNE TABLE QUI PROMET - Diplômé de la haute éco...,Ceviche de poisson d'arrivage- condiment goyav...,Tres décevant par rapport à notre dernière vis...,14
4,La Table de Colette,https://www.thefork.fr/restaurant/la-table-de-...,17 Rue Laplace- 75005- Paris,70,Français,9.2,579,https://res.cloudinary.com/tf-lab/image/upload...,8.9,9.2,9.2,La Table de Colette offre une expérience culin...,Menu en 7 temps- Accord mets et boissons- Menu...,Une cuisine à la croisée de la modernité- des ...,05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,Sorza,https://www.thefork.fr/restaurant/sorza-r25344...,51- rue Saint Louis en L'Ile- 75004- Paris,39,Français,9.4,1,https://res.cloudinary.com/tf-lab/image/upload...,9.3,9.4,9.5,Sorza- situé dans le charmant cadre de l'Île S...,Bouteille d'eau- Demi-bouteille d'eau- Café- S...,Superbe expérience ! Plats délicieux et servic...,04
86,L'OLIVIER,https://www.thefork.fr/restaurant/l-olivier-r1...,88- rue Ordener- 75018- Paris,46,Français,9.1,2,https://res.cloudinary.com/tf-lab/image/upload...,9.3,9.4,9.5,L'OLIVIER charme par son ambiance chaleureuse ...,Rolls- Crevettes Bio De Madagascar- Mayonnaise...,La viande était très tendre mais la sauce béar...,18
87,Visconti Madeleine,https://www.thefork.fr/restaurant/visconti-mad...,4- rue de l'Arcade- 75008- Paris,48,Italien,9.1,3,https://res.cloudinary.com/tf-lab/image/upload...,8.7,9.2,9.2,Visconti Madeleine vous transporte en Italie a...,Rucola e grana padano- Caprese- Bruschetta mis...,L’accueil et la qualité des mets Une valeur sû...,08
88,LES FILAOS,https://www.thefork.fr/restaurant/les-filaos-r...,5 Rue Guy de Maupassant- 75116- Paris,25,Mauricien,,0,https://res.cloudinary.com/tf-lab/image/upload...,9.0,9.1,9.2,Les Filaos séduit par son cadre chaleureux et ...,Carpaccio d'Espadon au Citron Vert- Crabe Farc...,Restaurant calme- salle agréable et bonne cuis...,16
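The transformations of this step can be checked on a tiny sample. The sketch below reuses three addresses from the table above and additionally converts "Price" and "Mark" to numbers with `pd.to_numeric`, which is an optional extra and not part of the cell above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Address": [
        "56- rue de Belleville- 75020- Paris",
        "5 Rue Guy de Maupassant- 75116- Paris",
        "17 Rue Laplace- 75005- Paris",
    ],
    "Price": ["20 €", "25 €", "70 €"],
    "Mark": ["9-4", None, "9-2"],
})

# Strip the currency symbol, then convert the price to a number
sample["Price"] = pd.to_numeric(sample["Price"].str.replace("€", "", regex=False).str.strip())

# Restore the decimal point; the missing mark stays missing
sample["Mark"] = pd.to_numeric(sample["Mark"].str.replace("-", ".", regex=False))

# First run of 5 digits = postal code; keep its last two digits as a string
sample["PostalCode_LastTwo"] = sample["Address"].str.extract(r"(\d{5})")[0].str[-2:]
print(sample[["Price", "Mark", "PostalCode_LastTwo"]])
```

Note that the 16th district also uses postal codes such as 75116, which the last-two-digits rule still maps to "16".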


Let's look at the type of each field

In [6]:
df.dtypes

Unnamed: 0,0
Title,object
Link,object
Address,object
Price,object
Type_of_Restaurant,object
Mark,object
Number_of_Reviews,int64
Image,object
Ambiance,float64
Plats,float64


Let's check whether there are any missing values

In [7]:
df.isnull().sum()

Unnamed: 0,0
Title,0
Link,0
Address,0
Price,0
Type_of_Restaurant,1
Mark,1
Number_of_Reviews,0
Image,0
Ambiance,1
Plats,1


So there are some missing values :
*   10 missing values for "Description"
*   1 missing value each for "Type_of_Restaurant", "Mark", "Ambiance", "Plats", "Service" and "Menu"



Third step : To solve these issues,


*   We fill missing values in the "Description", "Type_of_Restaurant" and "Menu" fields with the value "Inconnue" (French for "unknown")
*   For missing values in the rating fields ("Ambiance", "Mark", "Plats" and "Service"), we fill in the mean of the values observed for the other restaurants



In [9]:
#Fill with "Inconnue" value for the three columns
df['Description']=df['Description'].fillna('Inconnue')
df['Type_of_Restaurant']=df['Type_of_Restaurant'].fillna('Inconnue')
df['Menu']=df['Menu'].fillna('Inconnue')

#compute the mean value for each field and then put it on missing values for each field
cols_to_fill=['Ambiance','Mark','Plats','Service']
for col in cols_to_fill:
    df[col]=pd.to_numeric(df[col],errors='coerce')
    mean_value=df[col].mean()
    df[col]=df[col].fillna(mean_value)

#Check now if there are any missing values
missing_values_after_fill=df.isnull().sum()
print("\nMissing values after filling :\n",missing_values_after_fill)
df.to_csv('/content/paris_restaurants_3.csv',index=False)


Missing values after filling :
 Title                 0
Link                  0
Address               0
Price                 0
Type_of_Restaurant    0
Mark                  0
Number_of_Reviews     0
Image                 0
Ambiance              0
Plats                 0
Service               0
Description           0
Menu                  0
Avis                  0
PostalCode_LastTwo    0
dtype: int64


Now we can see that we have no missing values
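As a small worked example of the mean imputation used in the third step, the sketch below fills one missing rating and also keeps a flag column recording which values were imputed. The flag column `Ambiance_imputed` is a hypothetical addition, not something we did above.

```python
import pandas as pd

ratings = pd.DataFrame({"Ambiance": [9.0, None, 8.0]})

# Remember which rows were missing before filling them
ratings["Ambiance_imputed"] = ratings["Ambiance"].isna()

# mean() ignores missing values, so the mean here is (9.0 + 8.0) / 2
ratings["Ambiance"] = ratings["Ambiance"].fillna(ratings["Ambiance"].mean())
print(ratings)
```

Keeping such a flag makes it possible to tell real ratings from filled-in ones later, for example when training a model.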

Fourth step :    


*   We add two new fields, longitude and latitude, using the approximate coordinates of each district (arrondissement) of Paris. Each restaurant is assigned the coordinates of its district, looked up from the column containing the last two digits of its postal code. This will help us later in our machine learning problem, where we need the distance between a hotel and a restaurant. The hotel dataset already has these fields, so it will be easy to find the restaurants near the selected hotel in our app.



In [10]:
# Average value of latitude and longitude of each district of Paris
coordinates={
    '01':{'latitude':48.8626,'longitude':2.3364},  # 1st district
    '02':{'latitude':48.8682,'longitude':2.3442},  # 2nd district
    '03':{'latitude':48.8635,'longitude':2.3604},  # 3rd district
    '04':{'latitude':48.8555,'longitude':2.3560},  # 4th district
    '05':{'latitude':48.8440,'longitude':2.3499},  # 5th district
    '06':{'latitude':48.8503,'longitude':2.3332},  # 6th district
    '07':{'latitude':48.8564,'longitude':2.3126},  # 7th district
    '08':{'latitude':48.8738,'longitude':2.3186},  # 8th district
    '09':{'latitude':48.8762,'longitude':2.3370},  # 9th district
    '10':{'latitude':48.8768,'longitude':2.3590},  # 10th district
    '11':{'latitude':48.8570,'longitude':2.3768},  # 11th district
    '12':{'latitude':48.8402,'longitude':2.4131},  # 12th district
    '13':{'latitude':48.8323,'longitude':2.3554},  # 13th district
    '14':{'latitude':48.8301,'longitude':2.3230},  # 14th district
    '15':{'latitude':48.8412,'longitude':2.2923},  # 15th district
    '16':{'latitude':48.8638,'longitude':2.2673},  # 16th district
    '17':{'latitude':48.8851,'longitude':2.3084},  # 17th district
    '18':{'latitude':48.8925,'longitude':2.3447},  # 18th district
    '19':{'latitude':48.8840,'longitude':2.3824},  # 19th district
    '20':{'latitude':48.8686,'longitude':2.3992},  # 20th district
}

#Add the two new columns based on the last two digits of the postal code of each restaurant
df['latitude_restaurant']=df['PostalCode_LastTwo'].apply(lambda x: coordinates[str(x).zfill(2)]['latitude'])
df['longitude_restaurant']=df['PostalCode_LastTwo'].apply(lambda x: coordinates[str(x).zfill(2)]['longitude'])

df.to_csv('/content/paris_restaurants_4.csv', index=False)
df.head()

Unnamed: 0,Title,Link,Address,Price,Type_of_Restaurant,Mark,Number_of_Reviews,Image,Ambiance,Plats,Service,Description,Menu,Avis,PostalCode_LastTwo,latitude_restaurant,longitude_restaurant
0,Asahi,https://www.thefork.fr/restaurant/asahi-r47875...,56- rue de Belleville- 75020- Paris,20,Japonais,9.4,281,https://res.cloudinary.com/tf-lab/image/upload...,9.4,9.5,9.5,JAPONAIS PIONNIER – Monsieur YE est un pionnie...,Takoyaki- Wakame- Karaage- Ikageso- Carpaccio ...,Cadre sympathique- service rapide et bons cons...,20,48.8686,2.3992
1,Le Fil Rouge Café,https://www.thefork.fr/restaurant/le-fil-rouge...,3- rue René Boulanger- 75010- Paris,26,Américain,9.4,511,https://res.cloudinary.com/tf-lab/image/upload...,9.5,9.2,9.6,En plein coeur du 10e arrondissement- à deux p...,Sticks mozzarella- Oignons rings (5 pièces)- Q...,le décor- l'ambiance et le fait maison...on y ...,10,48.8768,2.359
2,Feyrouz,https://www.thefork.fr/restaurant/feyrouz-r753...,10 Rue de Lourmel- 75015- Paris,25,Libanais,9.1,76,https://res.cloudinary.com/tf-lab/image/upload...,8.7,9.2,9.5,Chez Feyrouz- l'ambiance familiale et la bonne...,Taboule- Moutabal- Taboule- Fatouche- Laban Co...,l accueil et la gentillesse sont le seul point...,15,48.8412,2.2923
3,Galia - Maxim Godigna,https://www.thefork.fr/restaurant/galia-maxim-...,123 Rue Didot- 75014- Paris,44,Fusion,9.1,1,https://res.cloudinary.com/tf-lab/image/upload...,8.4,9.3,9.2,UNE TABLE QUI PROMET - Diplômé de la haute éco...,Ceviche de poisson d'arrivage- condiment goyav...,Tres décevant par rapport à notre dernière vis...,14,48.8301,2.323
4,La Table de Colette,https://www.thefork.fr/restaurant/la-table-de-...,17 Rue Laplace- 75005- Paris,70,Français,9.2,579,https://res.cloudinary.com/tf-lab/image/upload...,8.9,9.2,9.2,La Table de Colette offre une expérience culin...,Menu en 7 temps- Accord mets et boissons- Menu...,Une cuisine à la croisée de la modernité- des ...,5,48.844,2.3499
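Since these coordinates are meant for computing the distance between a hotel and a restaurant, here is a minimal sketch of one common way to do it, the haversine formula. The function name `haversine_km` is our own choice for illustration, and the example points are two district centres from the `coordinates` dictionary.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Centre of the 1st district vs centre of the 20th district
d = haversine_km(48.8626, 2.3364, 48.8686, 2.3992)
print(f"{d:.2f} km")
```

Because every restaurant in a district shares the same coordinates, distances computed this way are only accurate to the district level.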


Now we check again the types of each column and missing values

In [11]:
df.isnull().sum()

Unnamed: 0,0
Title,0
Link,0
Address,0
Price,0
Type_of_Restaurant,0
Mark,0
Number_of_Reviews,0
Image,0
Ambiance,0
Plats,0


In [12]:
df.dtypes

Unnamed: 0,0
Title,object
Link,object
Address,object
Price,object
Type_of_Restaurant,object
Mark,float64
Number_of_Reviews,int64
Image,object
Ambiance,float64
Plats,float64


Fifth step : We now determine whether each restaurant is eco-responsible. We consider a restaurant eco-responsible when all of the following conditions hold:


*   The rating "Mark" is greater than or equal to 9
*   The average of the ratings "Ambiance", "Plats" and "Service" is greater than or equal to 9
*   The "Description", "Menu" or "Avis" column contains a word such as "végétarien" or "bio"

If all three conditions hold, the restaurant is marked as eco-responsible.






In [14]:
import numpy as np

def is_eco_responsible(row):
    # We check whether at least one keyword appears in any of the 3 text fields (Description, Menu, Avis)
    eco_keywords=['végétarien', 'bio']
    text_data=f"{row['Description']} {row['Menu']} {row['Avis']}".lower()  # spaces keep words from merging across fields
    contains_eco_keywords=any(keyword in text_data for keyword in eco_keywords)

    # We check if the ratings are greater or equal to 9
    meets_rating_criteria=row['Mark']>=9
    meets_avg_criteria=np.mean([row['Ambiance'],row['Plats'],row['Service']])>=9

    #We return "Oui" or "Non" to indicate whether the restaurant is eco-responsible, based on the ratings and the keywords
    return "Oui" if contains_eco_keywords and meets_rating_criteria and meets_avg_criteria else "Non"

#we create a new column Eco_responsable which contains "Oui" or "Non" based on what we have done previously
df['Eco_responsable'] = df.apply(is_eco_responsible, axis=1)
df.to_csv('/content/paris_restaurants_5.csv', index=False)
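The keyword test above is a plain substring search, so "bio" would also match inside a longer word such as "symbiose". A possible refinement, sketched below under our own assumptions (the pattern and the helper `mentions_eco_keyword` are hypothetical), is to match whole words with a regular expression.

```python
import re

# Whole-word match for "bio" and for "végétarien" with its feminine/plural endings
ECO_PATTERN = re.compile(r"\b(?:bio|végétarien(?:ne)?s?)\b", re.IGNORECASE)

def mentions_eco_keyword(*texts):
    """Return True if any of the given texts contains an eco keyword as a whole word."""
    return any(ECO_PATTERN.search(str(text)) for text in texts)

print(mentions_eco_keyword("Rolls- Crevettes Bio De Madagascar"))
print(mentions_eco_keyword("Une symbiose parfaite"))
```

Whether whole-word matching is preferable depends on the data: it would no longer match "biologique", which the substring search does catch.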

Last step : Add indexes for the two csv files containing data on Paris Hotels and Paris Restaurants

In [15]:
#Load the two csv files
restaurants_file='/content/paris_restaurants_5.csv'
hotels_file='/content/paris_hotels.csv'

#Load the data
df_restaurants=pd.read_csv(restaurants_file)
df_hotels=pd.read_csv(hotels_file)

#Add index for each file
df_restaurants['Index']=range(1,len(df_restaurants)+1)
df_hotels['Index']=range(1,len(df_hotels)+1)

# Final csv files with data cleaning
df_restaurants.to_csv('/content/paris_restaurants_final.csv', index=False)
df_hotels.to_csv('/content/paris_hotels_final.csv', index=False)