## Exploratory Data Analysis of 1,500 North American Restaurants
The aim of this analysis is to pinpoint the differences between the restaurant industries of the two countries in the dataset (US and Canada). I also try to identify the predominant cuisines in each country using simple NLP techniques such as lemmatization and `CountVectorizer`. Afterwards, I analyze the rating given to each restaurant and the cuisine categories each restaurant falls under.

Challenges: one challenge in this analysis is that some Canadian restaurants are listed bilingually, since French is the main language in some areas of Canada. This raises a second problem, the handling of synonyms across languages: a naive approach counts an English term and its French equivalent as two separate words.
The cuisine labels are also not always clear, for example "food" or "sandwich". Nevertheless, the most popular cuisine label is "american".
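
One possible way to reduce the bilingual-synonym problem is to map known French terms to their English equivalents before counting. The sketch below is only an illustration: `FR_TO_EN` and `normalize_terms` are hypothetical names, and the lookup table contains a handful of pairs seen in the sample rows, not a complete vocabulary.

```python
# Hypothetical French-to-English lookup; the entries are illustrative only
# and would need to be extended after inspecting the real vocabulary.
FR_TO_EN = {
    "café": "coffee",
    "thé": "tea",
    "alcool": "alcohol",
    "ailes": "wings",
}

def normalize_terms(cuisines: str) -> str:
    """Map known French words to English and drop labels that become duplicates."""
    seen = []
    for label in cuisines.lower().split(","):
        # Translate word by word; unknown words are kept unchanged.
        words = [FR_TO_EN.get(word, word) for word in label.split()]
        normalized = " ".join(words)
        if normalized and normalized not in seen:
            seen.append(normalized)
    return ", ".join(seen)

print(normalize_terms("Alcohol, Alcool, Ailes, Wings, Pizza"))
# alcohol, wings, pizza
```

A word-level lookup like this cannot handle terms whose translation depends on context, so it should be read as a starting point rather than a full solution.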

In [3]:
import pandas as pd


df =  pd.read_csv("../data/North_America_Restaurants.csv")
df.head(5)

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Burger King,Manitowoc,WI,54220,US,"American, Burger, Burgers, Family Meals, Fast ...",True,True,2.4,42
1,Petro-Canada,Airdrie,AB,T4A,CA,"Ben & Jerry's, Café/Thé, Coffee/Tea, Convenien...",True,True,4.1,1
2,Boba Bae,Ashwaubenon,WI,54304,US,"American, Asian Food, Bubble Tea, Coffee & Tea...",True,True,4.0,88
3,1001 Nights Shawarma,Kitchener,ON,N2C,CA,"Beau, Bon, Local, Chicken, Dessert, Desserts, ...",True,True,4.6,1077
4,Chirpyhut Fried Chicken (JlgJ),Richmond,BC,V6X 2B8,CA,"Ailes, Allergy Friendly, American, Beau, Bon, ...",True,True,4.6,30


In [4]:
df['cuisines'] = df['cuisines'].str.lower()
df['cuisines'][0]


'american, burger, burgers, family meals, fast food, subs & sandwiches'

In [5]:
df.describe()

Unnamed: 0,weighted_rating_value,aggregated_rating_count
count,1500.0,1500.0
mean,3.724533,85.500667
std,0.989005,277.071136
min,1.0,1.0
25%,2.9,5.0
50%,4.1,25.0
75%,4.6,68.0
max,5.0,4211.0


In [6]:
df.isnull().sum()

name                       0
city                       0
state                      0
zipcode                    0
country                    0
cuisines                   1
pickup_enabled             0
delivery_enabled           0
weighted_rating_value      0
aggregated_rating_count    0
dtype: int64

In [7]:
df.dropna(subset=['cuisines'], inplace=True)

In [96]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(df['cuisines'])
word_counts = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
word_freq = word_counts.sum(axis=0)

word_freq_df = pd.DataFrame({'word': word_freq.index, 'count': word_freq.values})
word_freq_df = word_freq_df.sort_values(by='count', ascending=False)
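# Caution: this assignment aligns on the row index, so each word receives the country
# of the restaurant that happens to share its index; it is not a per-country word count.
word_freq_df['country'] = df['country']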
word_freq_df.head(15)

Unnamed: 0,word,count,country
1391,food,988,US
106,american,755,US
473,breakfast,645,US
2694,sandwiches,644,
753,chicken,619,CA
1303,et,602,CA
1217,drinks,569,CA
513,brunch,509,US
1797,ice,428,
938,cream,426,US
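
The `country` column in the table above comes from assigning `df['country']` to a frame that is indexed by word, so pandas aligns the two on the row index and the label next to each word does not describe where the word is used. A minimal sketch of counting labels separately per country is shown below, using only the standard library. The function name `count_terms_by_country` and the toy rows are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter, defaultdict

def count_terms_by_country(rows):
    """rows: iterable of (country, cuisines) pairs; returns {country: Counter}."""
    counts = defaultdict(Counter)
    for country, cuisines in rows:
        # Split on commas so that multi-word labels such as "ice cream" stay intact.
        labels = [label.strip().lower() for label in cuisines.split(",")]
        counts[country].update(label for label in labels if label)
    return counts

# Toy rows standing in for the country and cuisines columns; not real data.
rows = [
    ("US", "American, Burgers, Ice Cream"),
    ("US", "American, Pizza"),
    ("CA", "Chicken, Ice Cream"),
]
by_country = count_terms_by_country(rows)
print(by_country["US"].most_common(1))  # [('american', 2)]
print(by_country["CA"]["ice cream"])    # 1
```

With the notebook's DataFrame, the rows could be supplied as `zip(df['country'], df['cuisines'])`, provided this is done while the commas between labels are still present.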


In [47]:
import matplotlib.pyplot as plt

colors = {
    'US': 'blue',
    'CA': 'red'
}
word_freq_df['color'] = word_freq_df['country'].map(colors)
word_freq_df

Unnamed: 0,word,count,country,color
1392,food,988,US,blue
85,american,755,CA,red
455,breakfast,645,US,blue
2664,sandwiches,644,,
741,chicken,619,CA,red
...,...,...,...,...
2387,pasta thai,1,,
2386,pasta subs,1,,
1051,dessert sandwiches,1,CA,red
562,burgers classic,1,CA,red


In [48]:
top_n = 10

# Get top 10 words for each country
top_words_df = pd.DataFrame()

for country in word_freq_df['country'].unique():
    top_words = word_freq_df[word_freq_df['country'] == country].sort_values(by='count', ascending=False).head(top_n)
    top_words_df = pd.concat([top_words_df, top_words])


# Reset index for easier plotting
top_words_df.reset_index(drop=True, inplace=True)
top_words_df

Unnamed: 0,word,count,country,color
0,food,988,US,blue
1,breakfast,645,US,blue
2,burgers,370,US,blue
3,dessert,350,US,blue
4,breakfast brunch,310,US,blue
5,fast,292,US,blue
6,fast food,292,US,blue
7,care,287,US,blue
8,dessert desserts,268,US,blue
9,asian,265,US,blue


In [41]:
import plotly.express as px

# Create an interactive scatter plot
fig = px.scatter(
    top_words_df,
    x='word',
    y='count',
    size='count',
    color='country',
    hover_name='word',
    hover_data={'count': True, 'country': True},
    title='Top 10 Most Dominant Words by Country',
    labels={'word': 'Word', 'count': 'Frequency'}
)

# Update layout to add a legend and adjust the appearance
fig.update_layout(
    legend_title='Country',
    xaxis_title='Word',
    yaxis_title='Frequency',
    xaxis=dict(tickmode='linear'),
    showlegend=True
)

# Show the plot
fig.show()


### Interpretation:
The plot above shows the most frequent terms in the `cuisines` column. The naive use of `CountVectorizer` is not sufficient here: "ice" (428) and "cream" (426) have almost identical counts, which suggests that they nearly always occur together as "ice cream", yet they are also counted as two separate words because they are separated by a space. The word "food" can be treated as a stop word in this case, because it does not carry any specific information.
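
To make the "ice" / "cream" observation concrete, the sketch below generates word n-grams in a simplified way. It is not scikit-learn's implementation (the real `CountVectorizer` also applies a token pattern and stop-word filtering); the helper name `ngrams` is an assumption for illustration.

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    result = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

print(ngrams("ice cream sandwich"))
# ['ice', 'cream', 'sandwich', 'ice cream', 'cream sandwich']
```

With a range of one to two words, every occurrence of "ice cream" contributes one count to "ice", one to "cream", and one to the bigram "ice cream", which is why the three counts end up so close to each other.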

The data is not monolingual (English only), because some areas of Canada are French-speaking. Words that are synonyms in different languages are counted as different words.

The data needs to be preprocessed:
- Lowercasing: puts all labels in a consistent case (`CountVectorizer` also lowercases by default)
- Lemmatization: transforms a word to its dictionary form (lemma)
- Stemming: cuts off word endings (suffixes) to obtain a stem
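
The difference between the last two steps can be illustrated with a toy example. Both functions below are deliberately simplistic stand-ins: `naive_stem` is a hypothetical suffix stripper and `lookup_lemma` uses a small hand-written table, whereas a real lemmatizer relies on a dictionary such as WordNet.

```python
SUFFIXES = ("ing", "es", "s")  # checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma table (hypothetical, for illustration only).
LEMMAS = {"sandwiches": "sandwich", "wings": "wing", "fries": "fry"}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)

for word in ["sandwiches", "wings", "fries"]:
    print(word, naive_stem(word), lookup_lemma(word))
# sandwiches sandwich sandwich
# wings wing wing
# fries fri fry
```

The last line shows the typical trade-off: stemming is cheap but can produce non-words such as "fri", while lemmatization returns a real word but needs a vocabulary.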

### Preprocessing

In [8]:
from nltk.stem  import WordNetLemmatizer
from nltk.tokenize  import word_tokenize
import nltk

nltk.download('wordnet')
nltk.download('punkt')

lemmatizer = WordNetLemmatizer()

def remove_punctuation(text):
    import string
    return text.translate(str.maketrans('', '', string.punctuation))

def lemmatize_text(text):
    # Tokenize the text into words
    tokens = word_tokenize(text)
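    # Lemmatize each token as a verb (pos='v'); noun plurals that are not also
    # verb forms (e.g. "burgers", "desserts") are therefore left unchanged
    lemmatized_tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]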
    # Join the lemmatized tokens back into a single string
    return ' '.join(lemmatized_tokens)

df['cuisines'] = df['cuisines'].apply(remove_punctuation)
df['cuisines'] = df['cuisines'].apply(lemmatize_text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Nouam\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nouam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
df['cuisines']

0       american burger burgers family meals fast food...
1       ben jerrys caféthé coffeetea convenience cool ...
2       american asian food bubble tea coffee tea dessert
3       beau bon local chicken dessert desserts global...
4       ail allergy friendly american beau bon local b...
                              ...                        
1495    american burgers chicken dessert dinner kid me...
1496    american fast food healthy pizza ice cream fre...
1497    convenience everyday essentials grocery home p...
1498    american breakfast and brunch coffee and tea d...
1499    alcohol asian asian fusion chinese dessert des...
Name: cuisines, Length: 1499, dtype: object
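
The output above contains fused tokens such as "caféthé" and "coffeetea", because deleting the "/" joins the two neighbouring words. A possible alternative, sketched here under the assumption that punctuation should act as a word separator, is to replace punctuation with spaces instead. The name `replace_punctuation` is hypothetical.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def replace_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(replace_punctuation("café/thé, coffee/tea, ben & jerry's"))
# café thé coffee tea ben jerry s
```

The apostrophe in "jerry's" leaves a stray "s" behind. With the default token pattern of `CountVectorizer`, single-character tokens are ignored, so this should not affect the counts.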

In [13]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stop_words = {
    'food', 'et','friendly','offer','offres','offre','offering','offers','sub'
}

combined_stop_words = ENGLISH_STOP_WORDS.union(custom_stop_words)

vectorizer = CountVectorizer(ngram_range=(1, 4), stop_words=list(combined_stop_words))
X = vectorizer.fit_transform(df['cuisines'])
word_counts = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
word_freq = word_counts.sum(axis=0)

word_freq_df = pd.DataFrame({'word': word_freq.index, 'count': word_freq.values})
word_freq_df = word_freq_df.sort_values(by='count', ascending=False)
word_freq_df.reset_index(drop=True, inplace=True)

# Caution: index-aligned assignment; the country shown next to a word is not a per-country count.
word_freq_df['country'] = df['country']
word_freq_df.head(20)


Unnamed: 0,word,count,country
0,sandwich,1021,US
1,american,755,CA
2,breakfast,645,US
3,chicken,619,CA
4,drink,569,CA
5,brunch,509,CA
6,ice,506,CA
7,cream,426,US
8,ice cream,426,US
9,sandwich sandwich,410,CA


In [14]:
top_n = 10

# Get top 10 words for each country
top_words_df = pd.DataFrame()

for country in word_freq_df['country'].unique():
    top_words = word_freq_df[word_freq_df['country'] == country].sort_values(by='count', ascending=False).head(top_n)
    top_words_df = pd.concat([top_words_df, top_words])


# Reset index for easier plotting
top_words_df.reset_index(drop=True, inplace=True)
top_words_df

Unnamed: 0,word,count,country
0,sandwich,1021,US
1,breakfast,645,US
2,cream,426,US
3,ice cream,426,US
4,desserts,392,US
5,snack,321,US
6,breakfast brunch,310,US
7,fast,292,US
8,care,287,US
9,personal care,287,US


In [75]:
import plotly.express as px

# Create an interactive scatter plot
fig = px.scatter(
    top_words_df,
    x='word',
    y='count',
    size='count',
    color='country',
    hover_name='word',
    hover_data={'count': True, 'country': True},
    title='Top 10 Most Dominant Words by Country',
    labels={'word': 'Word', 'count': 'Frequency'}
)

# Update layout to add a legend and adjust the appearance
fig.update_layout(
    legend_title='Country',
    xaxis_title='Word',
    yaxis_title='Frequency',
    xaxis=dict(tickmode='linear'),
    showlegend=True
)

# Show the plot
fig.show()

### Analysis:
The plot above highlights the frequency of each "cuisine" term in each country (USA, Canada). The cuisines offered by restaurants in both countries are mainly fast-food oriented. This finding is in line with eating habits in North America.

### Sentiment analysis: 
Now that we have explored the types of cuisines offered by restaurants in each country, I would like to focus on the ratings given to each restaurant. For this, we only consider restaurants that have more than 50 ratings (the threshold used in the code below, although the frame is named `df_100`), since a rating aggregated from only a few reviews is less reliable.

In [19]:
df_100 = df[df['aggregated_rating_count']>50]
df_100.reset_index(drop=True, inplace=True)

df_100

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Boba Bae,Ashwaubenon,WI,54304,US,american asian food bubble tea coffee tea dessert,True,True,4.0,88
1,1001 Nights Shawarma,Kitchener,ON,N2C,CA,beau bon local chicken dessert desserts global...,True,True,4.6,1077
2,7 West Cafe,Toronto,ON,M4Y 1R4,CA,2pour1 2pour1 alcohol alcool allergy friendly ...,True,True,4.4,354
3,Petro-Canada,Montréal,QC,H3S,CA,convenience everyday essentials grocery home p...,True,True,4.7,73
4,Beef 'o' Brady's - Kingsport Tn,Kingsport,TN,37660,US,american burger burgers chicken desserts salad...,True,True,4.5,548
...,...,...,...,...,...,...,...,...,...,...
478,Pizza & Grill,San Antonio,TX,78208,US,american bistro cheesesteak chicken desserts d...,True,True,2.1,667
479,Golden Chick,Dallas,TX,75217,US,bbq family friendly family meals fast food sou...,False,True,3.7,116
480,The Saffron Biryani,Jersey City,NJ,7306,US,chicken coffee and tea dessert indian seafood ...,True,True,2.5,63
481,Jimbob's Pizza,Eau Claire,WI,54701,US,american fast food healthy pizza ice cream fre...,False,True,4.1,127
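
With the filtered restaurants available, one possible next step is to relate ratings to cuisine terms. The sketch below computes the mean rating per term over restaurants above the rating-count threshold. It is a standard-library illustration on made-up rows: `mean_rating_by_term` and its `min_ratings` parameter are assumptions, not part of the original analysis.

```python
from collections import defaultdict

def mean_rating_by_term(rows, min_ratings=50):
    """rows: (cuisines, rating, rating_count) tuples; returns {term: mean rating}."""
    totals = defaultdict(lambda: [0.0, 0])  # term -> [sum of ratings, number of restaurants]
    for cuisines, rating, rating_count in rows:
        if rating_count <= min_ratings:
            continue  # keep only restaurants with strictly more than min_ratings
        for term in set(cuisines.split()):  # count each term once per restaurant
            totals[term][0] += rating
            totals[term][1] += 1
    return {term: total / n for term, (total, n) in totals.items()}

# Toy rows, not taken from the dataset.
rows = [
    ("american burgers", 4.0, 120),
    ("american pizza", 3.0, 80),
    ("pizza", 5.0, 10),  # dropped: too few ratings
]
means = mean_rating_by_term(rows)
print(means["american"])  # 3.5
print(means["pizza"])     # 3.0
```

An unweighted mean treats a restaurant with 51 ratings the same as one with 4,000; weighting by `rating_count` would be a reasonable variation.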


In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stop_words = {
    'food', 'et','friendly','offer','offres','offre','offering','offers','sub'
}

combined_stop_words = ENGLISH_STOP_WORDS.union(custom_stop_words)

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=list(combined_stop_words))
X = vectorizer.fit_transform(df_100['cuisines'])
word_counts = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
word_freq = word_counts.sum(axis=0)

word_freq_df = pd.DataFrame({'word': word_freq.index, 'count': word_freq.values})
word_freq_df = word_freq_df.sort_values(by='count', ascending=False)
word_freq_df.reset_index(drop=True, inplace=True)

# Caution: index-aligned assignment from the unfiltered df; not a per-country count.
word_freq_df['country'] = df['country']
word_freq_df.head(20)


Unnamed: 0,word,count,country
0,sandwich,336,US
1,chicken,273,CA
2,american,253,US
3,breakfast,228,CA
4,lunch,148,CA
5,brunch,143,CA
6,hot,137,CA
7,burgers,128,US
8,wing,123,US
9,group,123,CA


In [21]:
top_n = 20

# Get the top_n words for each country
top_words_df = pd.DataFrame()

for country in word_freq_df['country'].unique():
    top_words = word_freq_df[word_freq_df['country'] == country].sort_values(by='count', ascending=False).head(top_n)
    top_words_df = pd.concat([top_words_df, top_words])


# Reset index for easier plotting
top_words_df.reset_index(drop=True, inplace=True)
top_words_df

Unnamed: 0,word,count,country
0,sandwich,336,US
1,american,253,US
2,burgers,128,US
3,wing,123,US
4,drink,120,US
5,pizza,108,US
6,desserts,101,US
7,family,90,US
8,comfort,86,US
9,hamburgers,86,US


In [22]:
import plotly.express as px

# Create an interactive scatter plot
fig = px.scatter(
    top_words_df,
    x='word',
    y='count',
    size='count',
    color='country',
    hover_name='word',
    hover_data={'count': True, 'country': True},
    title=f'Top {top_n} Most Dominant Words by Country',
    labels={'word': 'Word', 'count': 'Frequency'}
)

# Update layout to add a legend and adjust the appearance
fig.update_layout(
    legend_title='Country',
    xaxis_title='Word',
    yaxis_title='Frequency',
    xaxis=dict(tickmode='linear'),
    showlegend=True
)

# Show the plot
fig.show()