# Find Dishonest Restaurant
<img src="https://www.netclipart.com/pp/m/349-3494556_forex-scams-by-dishonest-person-lying-cartoon.png" width="600px">

## Description

Some dishonest restaurants cheat TripAdvisor and their guests by inflating their rating above what it should be.

The main aim of the project is to try to predict the rating of the restaurant with given data.

If the model's predictions differ significantly from the actual rating, we have most likely found a dishonest restaurant.
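The idea can be sketched as a simple residual check. This is only an illustration, not the project's final method: the function name `flag_suspicious`, the toy ratings, and the threshold of 1.0 rating points are all assumptions made for the example. Only ratings that are higher than predicted are flagged, since the concern here is inflated ratings.

```python
def flag_suspicious(actual, predicted, threshold=1.0):
    """Return indices where the published rating exceeds the prediction
    by more than `threshold` (a hypothetical cut-off, for illustration)."""
    return [
        i for i, (a, p) in enumerate(zip(actual, predicted))
        if a - p > threshold
    ]


# Toy data (hypothetical): published ratings vs. model predictions
actual_ratings = [4.5, 5.0, 3.5, 4.0]
predicted_ratings = [4.4, 3.2, 3.6, 3.9]

suspicious = flag_suspicious(actual_ratings, predicted_ratings)
print(suspicious)
```

In practice the threshold would have to be chosen from the distribution of the model's errors rather than fixed by hand.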

### Column Definition

Restaurant_id — restaurant / restaurant chain identification number;

City — In what city it is located;

Cuisine Style — related to a restaurant cuisine;

Ranking — the place that this restaurant occupies among all restaurants in its city;

Rating — restaurant rating according to TripAdvisor (Target Variable);

Price Range — restaurant price range;

Number of Reviews — number of reviews the restaurant has received;

Reviews — data about two reviews that are displayed on the restaurant's website;

URL_TA — URL on TripAdvisor;

ID_TA — identifier of the restaurant in TripAdvisor's database.


---
### Import Libraries
---

In [1]:
from jupyterthemes import jtplot
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import re
from datetime import timedelta
from textblob import TextBlob  # for sentiment analysis
from wordcloud import WordCloud  # for creating cloud of words

warnings.filterwarnings('ignore')


pd.set_option('display.max_rows', 50)  # Show more rows
pd.set_option('display.max_columns', 50)  # Show more columns
plt.style.use('ggplot')  # Nice plotting

%matplotlib inline

jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)

ModuleNotFoundError: No module named 'textblob'

In [None]:
# colors = ['#001c57', '#50248f', '#a6a6a6', '#38d1ff']
colors = ['#50248f', '#38d1ff']
sns.palplot(sns.color_palette(colors))

### 1. Read and Check the Dataset

In [None]:
df = pd.read_csv('main_task.csv')
print(f'Dataset shape: {df.shape}')

### 1.1 Show basic info

In [None]:
display(df.head())
df.info()

### 1.2 Show the data types

In [None]:
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ['Count', 'Column Type']
dtype_df.groupby('Column Type').agg('count').reset_index()

Let's see what type of data each cell actually holds.

In [None]:
# label-based access avoids positional indexing on a row Series
for col in df.columns:
    print(col, type(df.loc[1, col]))

This is more interesting. Let's briefly look at the content of the object columns.

In [None]:
obj = df.dtypes[df.dtypes == 'object'].index
print(obj)

In [None]:
for i in obj:
    print(f'Col Name: {i}, Content: {df[i].unique()}\n')

Clearly, some columns hold data that looks like a list but is actually stored as a string or a float.

Such as:
 - the column Cuisine Style looks like a list, but the inspected cell has a float type;
 - the column Reviews looks like a nested list with the template [[comment_1, comment_2], [date1, date2]], but in fact it has the str type.
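A string cell of this kind can be turned into a real list with `ast.literal_eval` from the standard library. The sketch below is a minimal illustration: the helper name `parse_list_cell` and the sample strings are hypothetical, and replacing a bare `nan` token with `None` assumes that `nan` never occurs as a real word inside the review text.

```python
import ast
import re


def parse_list_cell(cell):
    """Parse a string such as "['French', 'Cafe']" into a Python list.

    Non-string cells (for example a float NaN) yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    # literal_eval cannot read a bare `nan`, so swap it for None first
    # (assumption: `nan` never appears as a real word in the text)
    cleaned = re.sub(r'\bnan\b', 'None', cell)
    return ast.literal_eval(cleaned)


cuisines = parse_list_cell("['European', 'French']")
reviews = parse_list_cell(
    "[['Good food', 'Nice place'], ['12/31/2017', '11/20/2017']]")
missing = parse_list_cell(float('nan'))
```

Unlike `eval`, `literal_eval` only accepts Python literals, so a malformed or malicious cell raises an error instead of being executed.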

Rename the columns, replacing spaces with underscores and capital letters with lowercase ones.

In [None]:
df.rename(columns={'Restaurant_id': 'restaurant_id',
                   'City': 'city',
                   'Cuisine Style': 'cuisine_style',
                   'Ranking': 'ranking',
                   'Rating': 'rating',
                   'Price Range': 'price_range',
                   'Number of Reviews': 'reviews_number',
                   'Reviews': 'reviews',
                   'URL_TA': 'url_ta',
                   'ID_TA': 'id_ta'}, inplace=True)
# show the data
df.head(1)

### 1.3 Missing values

Let's look at the missing data.

In [None]:
# Plot missing values
cols = df.columns
fig, ax = plt.subplots(figsize=(7, 7))
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colors))

# Show in percents
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print(f'{col} - {round(pct_missing*100)}%')

---
## Summary
---

 - The dataset has 40k rows and 10 columns.
 - The column 'cuisine_style' has 23% missing values.
 - The column 'price_range' has 35% missing values.
 - The column 'reviews_number' has 6% missing values, while the column 'reviews' has none. This is a discrepancy. A visual check of the column 'reviews' reveals the value '[[],[]]'. It is definitely a missing value that needs to be treated in further data processing.
 - The type of a particular cell sometimes differs from the column's pandas dtype. This needs attention later on.
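Such hidden missing values can be counted with a small check. This is a sketch under assumptions: the helper `is_empty_reviews` and the sample cells are hypothetical, and whitespace is ignored because the exact spacing of the empty value in the file is not known here.

```python
def is_empty_reviews(cell):
    """True when the reviews cell holds no review text at all."""
    if not isinstance(cell, str):
        return True  # a real NaN counts as missing as well
    # ignore whitespace so '[[], []]' and '[[],[]]' are treated alike
    return ''.join(cell.split()) == '[[],[]]'


# Hypothetical sample of cells from the reviews column
sample = ["[['Great'], ['01/05/2018']]", '[[], []]', '[[],[]]']
hidden_missing = sum(is_empty_reviews(c) for c in sample)
print(hidden_missing)
```

On the real data the same helper could be applied with `df['reviews'].apply(is_empty_reviews).sum()`.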


## 2. Exploratory Data Analysis

### 2.1 Target Variable analysis

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(121)
sns.distplot(df.rating.values, bins=10, color=colors[0])
plt.title('Rating Distribution\n', fontsize=15)
plt.xlabel('Rating')
plt.ylabel('Quantity (frequency)')

plt.subplot(122)
sns.boxplot(df.rating.values, color=colors[1])
plt.title('Rating Distribution\n', fontsize=15)
plt.xlabel('Rating')

In [None]:
df['rating'].describe()

The target variable has a roughly normal distribution shifted towards the upper end of the 1 to 5 scale. The first and third quartiles are 3.5 and 4.5, and the median is 4. Outliers have also been observed for the target variable.
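The outliers follow from the usual 1.5 × IQR rule, which is also the default whisker length of a box plot. A worked example with the quartiles quoted above (the `ratings` list is a hypothetical sample, not the dataset):

```python
q1, q3 = 3.5, 4.5            # quartiles reported above
iqr = q3 - q1                # interquartile range
lower = q1 - 1.5 * iqr       # lower whisker limit
upper = q3 + 1.5 * iqr       # upper whisker limit

ratings = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0]  # hypothetical sample
outliers = [r for r in ratings if r < lower or r > upper]
print(lower, upper, outliers)
```

Because the upper limit lies above the maximum possible rating of 5, only low ratings can show up as outliers.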

### 2.2 Restaurant_Id column

In [None]:
print(f'Unique Id quantity: {df.restaurant_id.nunique()}')
df['restaurant_id'].value_counts()

---

The total number of rows is 40k, while the number of unique ids is 11k.

At first glance these look like duplicates. However:

According to the column description, a repeated id may indicate that the dataset contains chain restaurants.

---

### 2.3 City column

How are cities distributed?

In [None]:
plt.figure(figsize=(15, 5), dpi=100)
sns.countplot(x=df['city'], order=df['city'].value_counts().index)
plt.xticks(rotation=45)
plt.title('Cities Distribution\n', fontsize=15)
plt.xlabel('City Name')
plt.ylabel('Quantity (frequency)')

print(f'Total Number of Cities in DataSet: {df.city.nunique()}')

---

The overwhelming majority of restaurants in the dataset are located in London, Paris, and Madrid.

All cities are European.

The city name Oporto stands out: it is an alternative English name for Porto (Portugal).

Most likely the city should be named Porto instead of Oporto.

---

### 2.4 Cuisine_style column

First of all, we have to manage the list of cuisines in a way that we can use the data and produce some statistics.

In [None]:
# Before we start, we need to record in the dataset where the values were missing

# Create a column indicating that the cuisine is not given for this restaurant
df['cuisine_style_empty'] = df['cuisine_style'].isnull().astype('uint8')

# Fill missing values in column with 'unknown'
df['cuisine_style'] = df['cuisine_style'].fillna("['unknown']")

# convert the string in the column into a list
# (ast.literal_eval only parses literals, so it is safer than eval)
import ast
df['cuisine_style'] = df['cuisine_style'].apply(ast.literal_eval)

# Create separate dataframe for the analysis

df1 = df[['city', 'cuisine_style', 'ranking', 'rating',
          'reviews_number']].explode('cuisine_style')

# -1 because we already filled the missing values with 'unknown'; don't count it
print(df1['cuisine_style'].nunique()-1)

Check the top 10 most common cuisine styles

In [None]:
df_cuisine_style = df1['cuisine_style'].value_counts(
).sort_values(ascending=False)[:10]

count_ths = np.arange(0, 1.3e4, 5e3)
# tick labels in thousands; one label per tick position
count = (count_ths / 1e3).astype(int)


fig = plt.figure(figsize=(15, 5))
ax = plt.subplot()


plt.bar(x=df_cuisine_style.index, height=df_cuisine_style, color=colors[0])

plt.yticks(count_ths, count)
plt.xticks(rotation=45)
plt.ylabel('Total Places (Thousands)')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.title('Top 10 most common cuisines')
ax.tick_params(direction='out', length=0, width=0, colors='grey')

--- 
The dataset has 125 unique cuisines.

Vegetarian Friendly places are clearly the most common ones around Europe, followed by mostly European-style cuisine.

---

Which are the cuisines that people tend to review?

In [None]:
df_cuisine_style = df1.groupby(
    'cuisine_style').reviews_number.sum().sort_values(ascending=False)[:10]

In [None]:
count_ths = np.arange(0, 3.3e6, 5e5)
# tick labels in millions; one label per tick position
count = count_ths / 1e6

fig = plt.figure(figsize=(15, 5))
ax = plt.subplot()

plt.bar(x=df_cuisine_style.index, height=df_cuisine_style, color=colors[1])

plt.yticks(count_ths, count)
plt.xticks(rotation=45)
plt.ylabel('Total Reviews (Million)')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.title('Top 10 most reviewed cuisines')
ax.tick_params(direction='out', length=0, width=0, colors='grey')

---

The chart is more or less similar to the one above. But it is notable that restaurants with Vegan and Gluten Free Options are very likely to be reviewed by customers.

People do not tend to review restaurants where the cuisines are not shown. So we made the right decision to keep (rather than delete) the rows with missing cuisines by replacing the NaN value with 'unknown'. Perhaps this information will help us in further modeling.

---

### 2.5 Ranking column

How is it distributed?

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(121)
sns.distplot(df.ranking.values, bins=25, color=colors[0])
plt.title('Ranking Distribution\n', fontsize=15)
plt.xlabel('Ranking')
plt.ylabel('Quantity (frequency)')

plt.subplot(122)
sns.boxplot(df.ranking.values, color=colors[1])
plt.title('Ranking Distribution\n', fontsize=15)
plt.xlabel('Ranking')

In [None]:
df['ranking'].describe()

---

The Ranking distribution is concentrated on the left side (low values) and spreads from 1 to 16444. The first and third quartiles are 973 and 5260, and the median is 2285.

However, according to the data description, Ranking is the place that a restaurant occupies among all restaurants in its city. So we cannot analyze it separately from the cities.

---

So, let's plot the distribution of the ranking by city (taking the top 10 cities in the dataset).

In [None]:
plt.figure(figsize=(15, 5))

for city in (df['city'].value_counts())[0:10].index:
    sns.distplot(df['ranking'][df['city'] == city], kde=False, label=city)

plt.legend(prop={'size': 10})
plt.title('Ranking Distribution among cities\n', fontsize=15)
plt.xlabel('Ranking')
plt.ylabel('Quantity (frequency)')

---

As we can see, the ranking has a normal distribution within each separate city. And as we already know, London takes first place by the number of restaurants presented. So it is not surprising that the overall distribution is shifted to the left side: every city contributes low ranking values, but only big cities with lots of restaurants contribute high ones.

Then, in the feature engineering section (below), we need to account for this by creating an equivalent ranking.
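One possible normalization is to divide each ranking by the number of restaurants listed for its city, so that rankings from small and large cities share a comparable scale. A minimal sketch with hypothetical rows (the city counts here are toy values, not the dataset's):

```python
from collections import Counter

# Hypothetical rows: (city, ranking within that city)
rows = [('London', 1), ('London', 4), ('London', 2), ('London', 3),
        ('Oslo', 1), ('Oslo', 2)]

restaurants_per_city = Counter(city for city, _ in rows)

# Divide each ranking by the number of restaurants listed for its city
relative_rank = [rank / restaurants_per_city[city] for city, rank in rows]
print(relative_rank)
```

After this step the last place in Oslo and the last place in London both map to 1.0, even though their raw rankings differ.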

---

Check the correlation

In [None]:
# numeric_only=True (pandas >= 1.5) skips the text and list columns
X = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(7, 7))
sns.heatmap(X, vmax=.7, square=True, annot=True)

print(f'Rank of Matrix: {np.linalg.matrix_rank(X)}')
print(f'Determinant of matrix :{np.round(np.linalg.det(X),3)}')
print(f'Shape of matrix :{np.shape(X)}')

---

Here we can see some correlations with the target variable.

There are no strong correlations between the features. This is also confirmed by the rank of the correlation matrix: it is a full-rank matrix.

Most likely, we won't have multicollinearity problems in modeling.
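To see why the rank and determinant are useful checks, consider what happens when one feature is an exact linear function of another. The sketch below uses synthetic data (the arrays `a`, `b`, `c` are made up for the illustration) and an explicit tolerance for the rank computation:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = 2 * a + 1  # exact linear function of `a`: a deliberately redundant feature

corr = np.corrcoef(np.vstack([a, b, c]))  # 3 x 3 correlation matrix

# With a redundant feature the matrix loses rank and its determinant
# collapses towards zero
rank = np.linalg.matrix_rank(corr, tol=1e-8)
det = np.linalg.det(corr)
print(rank, det)
```

A full-rank correlation matrix with a determinant clearly away from zero therefore suggests that no feature is a linear combination of the others.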

----

### 2.6 Price range column

First of all, from section 1 we know that this column has 35% NaN values and that prices are presented as dollar symbols.

Let's replace these symbols for the three price ranges with more intuitive labels (cheap, medium, high) and replace the NaN values (set as not available).

In [None]:
df['price_range'] = df['price_range'].fillna('NA')
price_ranges = {'$': 'Cheap', '$$ - $$$': 'Medium',
                '$$$$': 'High', 'NA': 'NotAvailable'}
df['price_range'] = df['price_range'].map(price_ranges)

In [None]:
plt.figure(figsize=(10, 5))

sns.boxplot(x='price_range', y='rating', data=df)

plt.title('Price Range to Rating\n', fontsize=15)
plt.xlabel('Price range')
plt.ylabel('Rating')

---

Restaurants with high and cheap prices get low ratings less often than restaurants with medium prices.

---

### 2.7 Reviews number column

---

We already know that this column has NaN values. We also know that vegan cuisine is among the most reviewed.

Missing values will be filled in the Feature Engineering section.

Important note:

Filling the missing data must be done together with processing the reviews column, as we noted a discrepancy between the number of reviews and the reviews themselves.

---

### 2.8 Reviews column

#### 2.8.1 Pre-processing and analysis of missing values

Let's recall what we saw in section 1 of this notebook.

In [None]:
print(df['reviews'][0])
print(df['reviews'][3])
print(f'Type: {type(df.loc[1, "reviews"])}')

As mentioned above, the content has a structured form: [ [ ],[ ] ].

**However**, it is not a list but just a string.

Next, we want to extract the data from the column 'reviews' into separate columns and remove the original one from the dataset.

In the end we should get the following columns in our dataset:

***Review_1*** - we put review No.1 (if any).

***Date_1*** - we put the date when the review was added

***Review_2*** - we put review No.2 (if any).

***Date_2*** - we put the date when the review was added

***Diff_rev*** - Time difference in days between first and second review
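The time difference itself is plain date arithmetic. A minimal sketch with the standard library, assuming the dates are stored as MM/DD/YYYY strings (the helper `days_between` and the sample dates are hypothetical):

```python
from datetime import datetime


def days_between(date_a, date_b, fmt='%m/%d/%Y'):
    """Absolute number of days between two date strings."""
    first = datetime.strptime(date_a, fmt)
    second = datetime.strptime(date_b, fmt)
    return abs((first - second).days)


gap = days_between('12/31/2017', '11/20/2017')
print(gap)
```

Taking the absolute value makes the result independent of the order in which the two dates are listed.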


In [None]:
# create a template for search
lrx = re.compile(r'\[\[.*\]\]')  # raw string avoids invalid escape sequences


def review_extraction(row):
    '''Extract data from the column reviews and split it
    into separate columns
    INPUT: A single row of the dataset
    OUTPUT: The same row with additional columns'''

    nan = None  # defined before eval(): the string may contain bare nan entries
    cell = row['reviews']
    aux_list = [[], []]  # auxiliary list for saving temporary data
    if type(cell) == str and lrx.fullmatch(cell):  # compare with the search template
        aux_list = eval(cell)  # transform into a list

    row['first_review'] = aux_list[0][1] if len(aux_list[0]) > 1 else nan
    row['last_review'] = aux_list[0][0] if len(aux_list[0]) > 0 else nan

    row['first_date'] = pd.to_datetime(
        aux_list[1][1] if len(aux_list[1]) > 1 else nan)
    row['last_date'] = pd.to_datetime(aux_list[1][0] if len(
        aux_list[1]) > 0 else nan, format='%m/%d/%Y', errors='coerce')

    row['first_date'] = pd.to_datetime(row['first_date'])
    row['last_date'] = pd.to_datetime(row['last_date'])

    return row

In [None]:
# apply the function to dataset and see the result
df = df.apply(review_extraction, axis=1)

# show data
df.head(4)

Create a column with time difference between first and second reviews

In [None]:
# Create a function to transform date to days

def get_days(delta):
    '''Return the number of whole days in a timedelta'''
    # the parameter is not called `timedelta`, to avoid shadowing the import
    return delta.days

In [None]:
# find the difference between the date of the first review and the last one
# add this information into a new column

df['diff_rev'] = df['last_date'] - df['first_date']

# call the function and get difference in days
df['diff_rev'] = df['diff_rev'].apply(get_days)

# show data
df.head(4)

In [None]:
sns.boxplot(df.diff_rev.values, color=colors[0])
plt.title('Days between comments Distribution\n', fontsize=15)
plt.xlabel('Days')

As we can see, there are negative values here. That might mean that some dates do not refer to the comment placed first, i.e. the order is inverted.

Let's fix it.

In [None]:
# simply flip the sign to positive where it is negative
df['diff_rev'] = df['diff_rev'].abs()

A restaurant's rating may depend not only on how much time has passed between the last two reviews, but also on how many days have passed since the last review was posted to the current date.

Create a relevant column

In [None]:
CURRENT_DATE = pd.to_datetime('12/01/2021')

In [None]:
df['days_from_last_rev'] = df['last_date'].apply(
    lambda date: CURRENT_DATE - date)
df['days_from_last_rev'] = df['days_from_last_rev'].apply(get_days)

In [None]:
# check how many review cells have no review
df[['first_review', 'last_review']].isnull().sum()

In [None]:
# check how many cells reviews_number have no review
df['reviews_number'].isnull().sum()

Now, let's filter our dataset and compare the missing reviews with the missing review numbers.

In [None]:
no_rev_num = df[df['reviews_number'].isnull()]
no_rev_num.head()

In [None]:
no_rev_num[['first_review', 'last_review']].isnull().sum()

That's pretty interesting. The review number does not accurately reflect the number of reviews. For example, some rows have at least one review in the review columns, yet the review number is missing.

We probably need to either drop the column with the number of reviews or fill in the missing data somehow. We will decide in section 3 'Feature Engineering'.

Let's go ahead with the analysis.

Before we start, let's create two new columns that indicate which review has a missing value.

In [None]:
# Create columns which indicate that a review is not available
df['first_review_miss'] = df['first_review'].isnull().astype('uint8')
df['last_review_miss'] = df['last_review'].isnull().astype('uint8')

In [None]:
# Replace NaN values with 'no comment' for further data processing
df['last_review'] = df['last_review'].fillna('no comment')
df['first_review'] = df['first_review'].fillna('no comment')

# show data
df.head(4)

#### 2.8.2 Sentiment Analysis

For a more convenient analysis, let's extract only the columns with reviews into a new data frame.

In [None]:
# .copy() gives an independent frame, so adding columns later does not warn
df_sentiment = df[['first_review', 'last_review']].copy()

# show data
df_sentiment.head(1)

Clean the text 

In [None]:
# Create a function to clean comments

def cleanTxt(text):
    '''Function is called for cleaning text from trash
    INPUT: dirty string
    OUTPUT: More or less clean string'''

    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # Remove @
    text = re.sub(r'#', '', text)  # remove #
    text = re.sub('^a-zA-Z', ' ', text)
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlink
    text = re.sub(r'👍🏻', '', text)
    # there are much more emoji. I don't know how to identify them so far
    text = re.sub(r'🍕', '', text)

    text = text.lower()
    text = text.strip()
    #text = text.split()
    return text
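One possible way to catch emoji in general, rather than listing them one by one, is to filter characters by their Unicode category. This is a sketch, not a complete emoji filter: the helper `strip_emoji` is hypothetical, it also drops other non-ASCII symbols such as ©, and it does not remove every invisible combining character.

```python
import unicodedata


def strip_emoji(text):
    """Drop non-ASCII symbol characters such as emoji and skin-tone modifiers."""
    # So = other symbols (most emoji), Sk = modifier symbols (skin tones),
    # Cf = format characters (e.g. the zero-width joiner)
    drop = {'So', 'Sk', 'Cf'}
    return ''.join(
        ch for ch in text
        if ch.isascii() or unicodedata.category(ch) not in drop
    )


cleaned = strip_emoji('great pizza 🍕👍🏻!')
print(cleaned)
```

ASCII characters are always kept, so ordinary punctuation such as `^` is not affected by the category filter.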

In [None]:
# Apply function to clean a text

df_sentiment['first_review'] = df_sentiment['first_review'].apply(cleanTxt)
df_sentiment['last_review'] = df_sentiment['last_review'].apply(cleanTxt)

Generate subjectivity and polarity

In [None]:
# Create a function to get the subjectivity
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity


def get_polarity(text):
    return TextBlob(text).sentiment.polarity

In [None]:
# Create new cols and call the func

df_sentiment['subjectivity_fst'] = df_sentiment['first_review'].apply(
    get_subjectivity)
df_sentiment['subjectivity_snd'] = df_sentiment['last_review'].apply(
    get_subjectivity)
df_sentiment['polarity_fst'] = df_sentiment['first_review'].apply(get_polarity)
df_sentiment['polarity_snd'] = df_sentiment['last_review'].apply(get_polarity)
# show data
df_sentiment.head()

Let's look at the most frequent words.

In [None]:
# Plot Word Cloud
all_words_1 = ' '.join(
    [reviews for reviews in df_sentiment['first_review'] if reviews != 'no comment'])
all_words_2 = ' '.join(
    [reviews for reviews in df_sentiment['last_review'] if reviews != 'no comment'])
all_words = all_words_1 + ' ' + all_words_2  # keep a space between the two parts
wordCloud = WordCloud(width=500, height=300, random_state=21,
                      max_font_size=119).generate(all_words)

plt.figure(figsize=(10, 5))
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis('off')
plt.show()

As we can see, the most frequent words are 'nice', 'good', 'food', etc. So people tend to leave positive reviews and often mention 'food'.
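A word cloud is easy to read but hard to quantify. The same observation can be checked numerically with a frequency count; the sketch below uses a hypothetical list of reviews and skips the 'no comment' placeholder:

```python
from collections import Counter
import re

# Hypothetical reviews; 'no comment' is the placeholder for a missing review
reviews = ['Nice food and good service', 'Good food', 'no comment']

words = []
for review in reviews:
    if review == 'no comment':  # skip the placeholder
        continue
    words.extend(re.findall(r'[a-z]+', review.lower()))

top_words = Counter(words).most_common(2)
print(top_words)
```

For real text it would also make sense to drop stop words such as 'and' before counting.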

Let's create a new column where we identify the sentiment itself

In [None]:
# Create a function to compute the negative, neutral and positive analysis

def get_analysis(score):
    if score < 0:
        return 'negative'
    elif score == 0:
        return 'neutral'
    else:
        return 'positive'

In [None]:
df_sentiment['analysis_fst'] = df_sentiment['polarity_fst'].apply(get_analysis)
df_sentiment['analysis_snd'] = df_sentiment['polarity_snd'].apply(get_analysis)
df_sentiment.head()

In [None]:
# # Print all of the positive reviews

# j=1
# sortedDF_1 = df_sentiment.sort_values(by='polarity_1')
# for i in range(0,sortedDF_1.shape[0]):
#   if (sortedDF_1['analysis_1'][i] == 'positive'):
#     print(str(j)+ ')' +sortedDF_1['review_1'][i])
#     print()
#     j += 1

Let's check how much positive feedback we have and how it changed.

In [None]:
# Get the percentage of positive reviews
p_rev = df_sentiment[df_sentiment['analysis_fst'] == 'positive']
p_rev = p_rev['first_review']
round((p_rev.shape[0] / df_sentiment.shape[0])*100, 1)

In [None]:
# Get the percentage of positive reviews
p_rev = df_sentiment[df_sentiment['analysis_snd'] == 'positive']
p_rev = p_rev['last_review']
round((p_rev.shape[0] / df_sentiment.shape[0])*100, 1)

The share of positive feedback increased between the first review placed on the website and the last one. Interesting, why?

Let's check how much negative feedback we have and how it changed.

In [None]:
# Get the percentage of negative reviews
n_rev = df_sentiment[df_sentiment['analysis_fst'] == 'negative']
n_rev = n_rev['first_review']
round((n_rev.shape[0] / df_sentiment.shape[0])*100, 1)

In [None]:
# Get the percentage of negative reviews
n_rev = df_sentiment[df_sentiment['analysis_snd'] == 'negative']
n_rev = n_rev['last_review']
round((n_rev.shape[0] / df_sentiment.shape[0])*100, 1)

The same holds for negative feedback, but the change is not as significant.

Let's look at the distribution of negative and positive feedback.

In [None]:
# plot and visualize

plt.title('Sentiment Analysis of first comments')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
df_sentiment['analysis_fst'].value_counts().plot(kind='bar')
plt.show()

# show the value counts

df_sentiment['analysis_fst'].value_counts()

In [None]:
# plot and visualize

plt.title('Sentiment Analysis of last comments')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
df_sentiment['analysis_snd'].value_counts().plot(kind='bar')
plt.show()

# show the value counts

df_sentiment['analysis_snd'].value_counts()

As we can see, it is more or less the same. We also note that we cannot rely on the neutral comments, because we replaced the missing values with 'no comment', which yields a neutral sentiment.
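The effect of the placeholder on the neutral share can be illustrated with a small example. The labelled pairs below are hypothetical and serve only to show how the share changes once placeholder rows are excluded:

```python
# Hypothetical (review text, sentiment label) pairs
labelled = [
    ('great food', 'positive'),
    ('no comment', 'neutral'),   # placeholder for a missing review
    ('it was ok', 'neutral'),
    ('bad service', 'negative'),
    ('no comment', 'neutral'),   # placeholder for a missing review
]

# Keep only the labels of rows that contain a real review
real = [label for text, label in labelled if text != 'no comment']

share_all = sum(label == 'neutral' for _, label in labelled) / len(labelled)
share_real = sum(label == 'neutral' for label in real) / len(real)
print(share_all, share_real)
```

In this toy sample the neutral share drops from 60% to about 33% once the placeholders are left out.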

### 2.9 url_ta column

We're not going to analyse this column. We will just drop it in feature engineering.

### 2.10 id_ta column

In [None]:
# transform to numeric
df['id_ta'] = df['id_ta'].apply(lambda x: int(x[1:]))

In [None]:
# Check the counts of id
df['id_ta'].value_counts()

We have some duplicates here. As per the dataset description, this is a unique identifier for each restaurant. We will drop the duplicates in the feature engineering section.

## 3. Feature Engineering and Tidying Up the Dataset

### 3.1 Restaurant_Id column

Based on the EDA in Section 2, we can add a column that indicates whether the restaurant belongs to a chain.

In [None]:
# Create a list with restaurants which might be in chain

chained_rest_list = list(df['restaurant_id'].value_counts()[
    df['restaurant_id'].value_counts() > 1].index)

# If it is in a chain, mark it with 1 in a new column, otherwise 0
df['chained_rest'] = df['restaurant_id'].isin(
    chained_rest_list).astype('uint8')

# Check
df['chained_rest'].value_counts()

### 3.2 City column

In [None]:
# Fix Oporto
df['city'] = df['city'].replace(['Oporto'], 'Porto')

In [None]:
all_cities = df['city'].value_counts().index

Let's add a column that indicates whether the restaurant is in a capital city.

In [None]:
# Because we don't have too many cities, let's create a dict that states whether each city is a capital
capital = [True, True, True, False, True, False, True,
           True, True, True, True, True, False, False,
           False, True, True, True, True, True, True,
           True, False, False, False, False, True,
           True, True, True, True]

capital_dict = dict(zip(list(all_cities), capital))
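A positional list like this depends on the order returned by `value_counts()`, which changes if the data changes. An order-independent alternative is to name the capitals explicitly. The sketch below assumes the same classification as the list above, including Edinburgh counted as a capital; the set name `CAPITALS` and the sample cities are chosen for the example:

```python
# Capitals among the cities of this dataset (assumption: same classification
# as the positional list, with Edinburgh counted as the capital of Scotland)
CAPITALS = {
    'London', 'Paris', 'Madrid', 'Berlin', 'Rome', 'Prague', 'Lisbon',
    'Vienna', 'Amsterdam', 'Brussels', 'Stockholm', 'Budapest', 'Warsaw',
    'Dublin', 'Copenhagen', 'Athens', 'Edinburgh', 'Oslo', 'Helsinki',
    'Bratislava', 'Luxembourg', 'Ljubljana',
}

cities = ['London', 'Barcelona', 'Porto', 'Oslo']  # hypothetical sample
capital_by_name = {city: city in CAPITALS for city in cities}
print(capital_by_name)
```

With a DataFrame, the same set could be used as `df['city'].isin(CAPITALS)`.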

In [None]:
df['capital'] = df['city'].map(capital_dict)

# show data
df.head()

Add the population and the country each city belongs to.

In [None]:
city_population = {
    'London': 8787892,
    'Paris': 2187526,
    'Madrid': 3300000,
    'Barcelona': 1593075,
    'Berlin': 3726902,
    'Milan': 1331586,
    'Rome': 2860000,
    'Prague': 1300000,
    'Lisbon': 505526,
    'Vienna': 1900000,
    'Amsterdam': 872080,
    'Brussels': 144784,
    'Hamburg': 1840000,
    'Munich': 1558395,
    'Lyon': 506615,
    'Stockholm': 975904,
    'Budapest': 1752286,
    'Warsaw': 1720398,
    'Dublin': 1793579,
    'Copenhagen': 1330993,
    'Athens': 3090508,
    'Edinburgh': 476100,
    'Zurich': 402275,
    'Porto': 237559,
    'Geneva': 196150,
    'Krakow': 779115,
    'Oslo': 697549,
    'Helsinki':  656229,
    'Bratislava': 563682,
    'Luxembourg': 626108,
    'Ljubljana': 295504
}

--- 

In [None]:
city_country = {
    'London': 'United Kingdom',
    'Paris': 'France',
    'Madrid': 'Spain',
    'Barcelona': 'Spain',
    'Berlin': 'Germany',
    'Milan': 'Italy',
    'Rome': 'Italy',
    'Prague': 'Czech',
    'Lisbon': 'Portugal',
    'Vienna': 'Austria',
    'Amsterdam': 'Netherlands',
    'Brussels': 'Belgium',
    'Hamburg': 'Germany',
    'Munich': 'Germany',
    'Lyon': 'France',
    'Stockholm': 'Sweden',
    'Budapest': 'Hungary',
    'Warsaw': 'Poland',
    'Dublin': 'Ireland',
    'Copenhagen': 'Denmark',
    'Athens': 'Greece',
    'Edinburgh': 'Schotland',
    'Zurich': 'Switzerland',
    'Porto': 'Portugal',
    'Geneva': 'Switzerland',
    'Krakow': 'Poland',
    'Oslo': 'Norway',
    'Helsinki': 'Finland',
    'Bratislava': 'Slovakia',
    'Luxembourg': 'Luxembourg',
    'Ljubljana': 'Slovenija'
}

In [None]:
# add columns with the city population
df['city_population'] = df['city'].map(city_population)

# add column with countries
df['country'] = df['city'].map(city_country)

In [None]:
# Reduce the number of less popular cities
df['new_city'] = df['city']
# Create a list of top cities (restaurant count above the 70th percentile of per-city counts)
top_cities_list = df['new_city'].value_counts()[
    df['new_city'].value_counts() > np.percentile((df['new_city'].value_counts().values), 70)].index.tolist()

In [None]:
cities_to_drop = list(set(all_cities)-set(top_cities_list))

In [None]:
df.loc[df['new_city'].isin(cities_to_drop), 'new_city'] = 'Other'
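The grouping logic can be traced on a tiny example. The threshold is the 70th percentile of the per-city counts, so only cities whose count exceeds it keep their name. The city labels below are hypothetical:

```python
from collections import Counter

import numpy as np

cities = ['A'] * 5 + ['B'] * 3 + ['C'] + ['D']  # hypothetical city column
counts = Counter(cities)                         # A: 5, B: 3, C: 1, D: 1

# 70th percentile of the per-city counts
cutoff = np.percentile(list(counts.values()), 70)

# Cities above the cutoff keep their name, the rest become 'Other'
top = {city for city, n in counts.items() if n > cutoff}
grouped = [city if city in top else 'Other' for city in cities]
print(cutoff, top)
```

Note that the percentile is taken over the number of cities, not over the number of rows, so the kept cities can still cover most of the dataset.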

In [None]:
plt.figure(figsize=(15, 5), dpi=100)
sns.countplot(x=df['new_city'], order=df['new_city'].value_counts().index)
plt.xticks(rotation=45)
plt.title('Cities Distribution\n', fontsize=15)
plt.xlabel('City Name')
plt.ylabel('Quantity (frequency)')

print(f'Total Number of Cities in DataSet: {df.city.nunique()}')

Encode the city column

In [None]:
from sklearn.preprocessing import OneHotEncoder
# requires scikit-learn >= 1.2 (sparse_output, get_feature_names_out)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_city = pd.DataFrame(OH_encoder.fit_transform(df[['new_city']]))

# Adding column names to the encoded data set.
OH_city.columns = OH_encoder.get_feature_names_out(['new_city'])

# Show data
OH_city.head(2)

In [None]:
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder() # create an object
# le.fit(df['city'])
# df['city_CODE'] = le.transform(df['city'])

# #show data
# df.head(2)

Encode the Country column

In [None]:
# OH_country = pd.DataFrame(OH_encoder.fit_transform(df[['country']]))

# # Adding column names to the encoded data set.
# OH_country.columns = OH_encoder.get_feature_names(['country'])

# # Show data
# OH_country.head(2)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # create an object
le.fit(df['country'])
df['country_CODE'] = le.transform(df['country'])

# show data
df.head(2)

In [None]:
# df_test = pd.concat([OH_city, OH_country], axis=1)

In [None]:
# X = df_test.corr()

# print(f'Rank of Matrix: {np.linalg.matrix_rank(X)}')
# print(f'Determinat of matrix :{np.round(np.linalg.det(X),3)}')
# print(f'Shape of matrix :{np.shape(X)}')

### 3.3 Cuisine_style column

Let's add a new feature: the number of cuisines in a restaurant.

In [None]:
df['cuisine_count'] = df['cuisine_style'].apply(lambda x: len(x))

# show data
df.head(2)

Add one more feature that shows whether a restaurant offers rare cuisines.

Assume a cuisine is rare if it occurs in the dataset fewer than 50 times.
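The rule can be illustrated on toy data. The cuisine lists below are hypothetical, and the threshold is lowered from 50 to 2 only so that the effect is visible on three restaurants:

```python
from collections import Counter

RARE_THRESHOLD = 2  # the notebook uses 50; lowered here for the toy data

restaurant_cuisines = [  # hypothetical cuisine lists
    ['Italian', 'Pizza'],
    ['Italian', 'Tibetan'],
    ['Pizza', 'Italian'],
]

# Count how often each cuisine occurs across all restaurants
counts = Counter(c for cuisines in restaurant_cuisines for c in cuisines)
rare = {c for c, n in counts.items() if n < RARE_THRESHOLD}

# Number of rare cuisines offered by each restaurant
rare_per_restaurant = [sum(c in rare for c in cuisines)
                       for cuisines in restaurant_cuisines]
print(rare, rare_per_restaurant)
```

Storing the rare cuisines in a set keeps the membership test fast even when the list of cuisines grows.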

In [None]:
# count the cuisines once, then keep those seen fewer than 50 times
cuisine_counts = df.explode('cuisine_style')['cuisine_style'].value_counts()
cuisine_rare_lst = cuisine_counts[cuisine_counts < 50].index.tolist()

In [None]:
def get_cuisine_rare(row):
    '''Function called for counting the number of
    rare cuisines
    INPUT: A cell from dataset
    OUTPUT: Number of rare cuisines'''

    number = 0
    for i in cuisine_rare_lst:
        if i in row:
            number += 1  # count qty of rare cuisines
    return number

In [None]:
# create a column with rare cuisine and call func
df['rare_cuisine'] = df['cuisine_style'].apply(get_cuisine_rare)

Perhaps it is important to know whether a restaurant's cuisine is local to the region where the restaurant is located.

Let's add this feature.

In [None]:
# Create a global variable (dict) with countries and their related cuisines
# I don't know how to use NLP for this case, so this work needs to be done by hand :(

cuisine_region = {
    'France': ['French', 'Central European'],
    'Sweden': ['Swedish', 'Scandinavian'],
    'United Kingdom': ['British'],
    'Germany': ['German', 'Central European'],
    'Italy': ['Pizza', 'Italian'],
    'Slovakia': ['Eastern European'],
    'Austria': ['Austrian'],
    'Spain': ['Spanish'],
    'Ireland': ['Irish'],
    'Belgium': ['Belgian'],
    'Switzerland': ['Swiss'],
    'Poland': ['Polish', 'Ukrainian'],
    'Hungary': ['Hungarian'],
    'Denmark': ['Scandinavian'],
    'Netherlands': ['Dutch'],
    'Portugal': ['Portuguese'],
    'Czech': ['Czech'],
    'Norway': ['Norwegian', 'Scandinavian'],
    'Finland': ['Central European'],
    'Schotland': ['Scottish'],
    'Slovenija': ['Slovenian'],
    'Greece': ['Greek'],
    'Luxembourg': ['Central European']
}

In [None]:
# Create a function for identification of local cuisine

def get_local_cuisine(row):
    '''Function called for identifying
    whether restaurant includes local
    cuisine or not
    INPUT: A cell from dataset
    OUTPUT: 1 - if includes
            0 - if does not include'''

    local_cuis = cuisine_region[row['country']]
    for i in local_cuis:
        if i in row['cuisine_style'] and i != '':
            return 1
    return 0

In [None]:
# create a column with an indicator of whether the cuisine is local
df['local_cuisine'] = df.apply(get_local_cuisine, axis=1)

# count them
df['local_cuisine'].value_counts()

Our dataset has 13455 restaurants whose cuisines include one local to their country.

Based on our EDA, we noticed that restaurants with Vegan and Gluten-Free Options are very likely to be reviewed by customers. Let's add a feature that indicates whether a restaurant offers vegan food.

In [None]:
# create a func to identify, whether restaurant includes vegan food

def get_vegan(row):
    vegan_cuis = ['Vegetarian Friendly', 'Vegan Options',  # all vegan food, in my opinion
                  'Gluten Free Options', 'Healthy', ]
    for i in vegan_cuis:
        if i in row['cuisine_style'] and i != '':
            return 1
    return 0

In [None]:
# create a new column with identification, whether restaurant includes a vegan food
df['vegan_include'] = df.apply(get_vegan, axis=1)

# Show Data
df.head(2)

Encode a cuisine column

In [None]:
#df_cuisine_encode = df['cuisine_style'].copy()

In [None]:
# # use a standard pandas method 'get dummies'
# df_cuisine_encod = pd.get_dummies(df_cuisine_encode, dummy_na=True)
# # show data
# df_cuisine_encode.head(2)

In [None]:
# OH_cuisine = pd.DataFrame(OH_encoder.fit_transform(df[['cuisine_style']]))

# # Adding column names to the encoded data set.
# OH_cuisine.columns = OH_encoder.get_feature_names(['cuisine_style'])

# # Show data
# OH_cuisine.head(2)

### 3.4 Ranking column

Following EDA section 2.5, let's create a column with an equivalent (normalized) ranking.

Step 1: Count the total number of restaurants in each city

In [None]:
city_restaurant = dict(df['city'].value_counts())
df['restaurant_qty'] = df['city'].map(city_restaurant)

Step 2: Create an equivalent ranking

In [None]:
# divide the ranking by the number of restaurants in the city
df['equiv_ranking'] = df['ranking']/df['restaurant_qty']
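A small sketch of why this normalization helps, using a hypothetical toy frame (`toy` and its values are made up for illustration). A raw rank of 2 means "last place" in a city with two restaurants, but "middle of the pack" in a city with four.

```python
import pandas as pd

# Hypothetical toy data: four restaurants in Paris, two in Oslo
toy = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Paris', 'Paris', 'Oslo', 'Oslo'],
    'ranking': [1, 2, 3, 4, 1, 2],
})

# Same two steps as above: count restaurants per city, then divide
toy['restaurant_qty'] = toy['city'].map(toy['city'].value_counts())
toy['equiv_ranking'] = toy['ranking'] / toy['restaurant_qty']

print(toy)
```

In this toy example rank 2 becomes 0.5 in Paris and 1.0 in Oslo. In the real data the ratio may exceed 1 if the dataset contains only a sample of a city's restaurants.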

Step 3: Check the distribution of the normalized ranking

In [None]:
plt.figure(figsize=(15, 7))

for city in (df['city'].value_counts())[0:10].index:
    sns.distplot(df['equiv_ranking'][df['city'] == city],
                 kde=False, label=city)

plt.legend(prop={'size': 10})
plt.title('Equivalent ranking Distribution among cities\n', fontsize=15)
plt.xlabel('Equivalent ranking')
plt.ylabel('Quantity (frequency)')

Now it looks better than in section 2.5: the per-city distributions are on a comparable scale and look normal.

Create a column with the mean number of people per restaurant in each city

In [None]:
df['people_per_restaur'] = df['city_population']/df['restaurant_qty']
# Show Data
df.head(2)

Create a column with the ranking normalized by the number of reviews in the city

In [None]:
# total number of reviews per city, broadcast back to every row
df['reviews_in_city'] = df.groupby('city')['reviews_number'].transform('sum')

In [None]:
df['equivalent_rank_reviews'] = df['ranking'] / df['reviews_in_city']

# Show Data
df.head(2)

### 3.5 Price range column

In [None]:
df['price_range'] = df.price_range.astype('category')

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # create an object
le.fit(df['price_range'])
df['price_range'] = le.transform(df['price_range'])

# show data
df.head(2)

### 3.6 Reviews number column

In [None]:
# df['reviews_number'] = df['reviews_number'].fillna(0)

Let's fill the missing values in a smarter way. As we saw in the EDA section, some rows have review text in the 'Last review' column while the number of reviews is empty, so we can assume the missing count is a data error.

Let's fill the NaN values as follows:

If there is a comment in 'Last review' or 'First review', put 1 into the reviews number column.

If neither the first nor the second comment is filled, put 0.

In [None]:
# This row-wise apply takes a long time to compute
# You may comment it out and just run the next one

df['reviews_number'] = df.apply(
    lambda row: 1 if np.isnan(row['reviews_number']) and (row['last_review_miss'] == 0 or row['first_review_miss'] == 0) else row['reviews_number'], axis=1
)
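The fill rule can also be written with boolean masks instead of a row-wise `apply`, which is usually much faster. This is only a sketch on hypothetical toy data. It assumes, as in the cell above, that a value of 0 in `last_review_miss` or `first_review_miss` means the review text is present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    'reviews_number':    [np.nan, np.nan, 12.0, np.nan],
    'last_review_miss':  [0, 1, 0, 1],
    'first_review_miss': [1, 1, 0, 0],
})

# At least one of the two displayed reviews is present
has_text = (toy['last_review_miss'] == 0) | (toy['first_review_miss'] == 0)

# Missing count but a review exists: there is at least one review
mask = toy['reviews_number'].isna() & has_text
toy.loc[mask, 'reviews_number'] = 1

# Everything still missing has no review text at all
toy['reviews_number'] = toy['reviews_number'].fillna(0)

print(toy['reviews_number'].tolist())  # [1.0, 0.0, 12.0, 1.0]
```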

In [None]:
np.isnan(df['reviews_number']).sum()

OK, we filled some of the missing review counts. Fill the rest with 0.

In [None]:
df['reviews_number'] = df['reviews_number'].fillna(0)

### 3.7 Diff_rev column

Simply replace the NaN values with zeros.

In [None]:
# Fill the NaN values, then check that none remain
df['diff_rev'] = df['diff_rev'].fillna(0)
df['days_from_last_rev'] = df['days_from_last_rev'].fillna(0)

display(df['diff_rev'].isna().sum())
df['days_from_last_rev'].isna().sum()

### 3.8 Map the dataframe to the model

In [None]:
cols_to_drop = ['restaurant_id', 'city', 'new_city', 'cuisine_style',
                'url_ta', 'reviews', 'first_review',
                'last_review', 'first_date', 'last_date', 'country']

In [None]:
df_to_model = df.drop(cols_to_drop, axis=1)

After running the code several times, it was determined that the columns listed in `cols_to_drop2` below have a negative effect on the final MAE, so we drop them.

The subjectivity column of the sentiment analysis dataframe also has a negative effect on MAE, so it is not included in the final dataset.

In [None]:
# Concat dataframes

df_to_model = pd.concat(
    [df_to_model, df_sentiment[['polarity_fst', 'polarity_snd']]], axis=1)

In [None]:
df_to_model = pd.concat([df_to_model, OH_city], axis=1)

In [None]:
cols_to_drop2 = ['new_city_Barcelona', 'cuisine_style_empty', 'new_city_Paris',
                 'last_review_miss', 'new_city_London', 'price_range',
                 'new_city_Lisbon', 'chained_rest', 'new_city_Berlin', 'rare_cuisine',
                 'first_review_miss', 'capital']

In [None]:
df_to_model = df_to_model.drop(cols_to_drop2, axis=1)

### 3.9 Resulting dataframe verification

In [None]:
# Plot missing values
cols = df_to_model.columns
fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(df_to_model[cols].isnull(), cmap=sns.color_palette(colors))

#Show in percents

for col in df_to_model.columns:
    pct_missing = np.mean(df_to_model[col].isnull())
    print(f'{col} - {round(pct_missing*100)}%')

In [None]:
df_to_model['id_ta'].duplicated().sum()

The 'id_ta' column has duplicates. However, the final result gets worse if we remove them, so we keep them.

Matrix verification

In [None]:
X = df_to_model.corr()
# fig, ax = plt.subplots(figsize=(50, 50))
# sns.heatmap(X, vmax=.7, square=True, annot=True)
print(f'Rank of Matrix: {np.linalg.matrix_rank(X)}')
print(f'Shape of matrix :{np.shape(X)}')

The correlation matrix has full rank, so no feature is an exact linear combination of the others. Feeding this data to a model should not be a problem.
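To illustrate what the rank check catches, here is a minimal sketch on hypothetical random data (`toy`, `a`, `b`, `c` are made-up names). Adding a column that is an exact multiple of another makes the correlation matrix rank-deficient.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two independent random columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})

# Rank of the 2x2 correlation matrix of independent columns
rank_before = np.linalg.matrix_rank(toy.corr())

# Add an exact linear copy of 'a'
toy['c'] = 2 * toy['a']
corr = toy.corr()
rank_after = np.linalg.matrix_rank(corr)

print(f'Shape: {corr.shape}, rank before: {rank_before}, rank after: {rank_after}')
```

A rank lower than the number of columns signals a redundant feature. Tree-based models such as random forests tolerate this, but it matters for linear models.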

### 4.  Model

### 4.1 Split data into X and y

In [None]:
RANDOM_SEED = 42

In [None]:
# X - restaurant features, y - target variable (restaurant ratings)
X = df_to_model.drop(['rating'], axis=1)
y = df_to_model['rating']

In [None]:
# Load the tool for splitting the data:
from sklearn.model_selection import train_test_split

In [None]:
# Datasets labeled "train" will be used to train the model, "test" to test it.
# For testing we will use 25% of the original dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_SEED)

In [None]:
# Import the required libraries:
# tool for creating and training the model
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics  # tools for evaluating model accuracy

In [None]:
# Create the model
regr = RandomForestRegressor(
    n_estimators=100, verbose=1, n_jobs=-1, random_state=RANDOM_SEED)

# Train the model on the training set
regr.fit(X_train, y_train)

# Use the trained model to predict restaurant ratings for the test set.
# The predicted values are stored in y_pred
y_pred = regr.predict(X_test)

In [None]:
# The real ratings are always multiples of 0.5
# Write a function to round the predicted ratings accordingly
def round_rating_pred(rating_pred):
    if rating_pred <= 0.5:
        return 0.0
    if rating_pred <= 1.25:
        return 1.0
    if rating_pred <= 1.75:
        return 1.5
    if rating_pred <= 2.25:
        return 2.0
    if rating_pred <= 2.75:
        return 2.5
    if rating_pred <= 3.25:
        return 3.0
    if rating_pred <= 3.75:
        return 3.5
    if rating_pred <= 4.25:
        return 4.0
    if rating_pred <= 4.75:
        return 4.5
    return 5.0

In [None]:
# Round it
for i in range(len(y_pred)):
    y_pred[i] = round_rating_pred(y_pred[i])
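As a compact alternative to the explicit thresholds and the loop, predictions can be rounded to the nearest multiple of 0.5 in one vectorized step. This is a sketch on hypothetical values (`toy_pred` is made up). Exact ties such as 2.75 follow NumPy's round-half-to-even rule, so they can differ from `round_rating_pred`.

```python
import numpy as np

# Hypothetical predictions (not model output)
toy_pred = np.array([1.1, 2.74, 3.3, 4.8])

# Double, round to the nearest integer, halve: nearest multiple of 0.5
rounded = np.round(toy_pred * 2) / 2

print(rounded.tolist())  # [1.0, 2.5, 3.5, 5.0]
```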

In [None]:
# Compare the predicted values (y_pred) with the actual ones (y_test) and see how much they differ on average
# The metric is called Mean Absolute Error (MAE); it shows the mean deviation of the predicted values from the actual ones.
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
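For intuition, MAE is the average of the absolute differences between actual and predicted values. A minimal hand computation on hypothetical ratings (`y_true_toy` and `y_pred_toy` are made up) looks like this:

```python
import numpy as np

# Hypothetical actual and predicted ratings
y_true_toy = np.array([4.0, 3.5, 5.0, 4.5])
y_pred_toy = np.array([4.5, 3.5, 4.0, 4.5])

# Absolute errors are 0.5, 0.0, 1.0, 0.0; their mean is the MAE
mae = np.mean(np.abs(y_true_toy - y_pred_toy))

print('MAE:', mae)  # MAE: 0.375
```

Since ratings move in steps of 0.5, an MAE of 0.375 means the predictions are off by less than one rating step on average.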

In [None]:
plt.rcParams['figure.figsize'] = (10, 10)
feat_importances = pd.Series(regr.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh')