# Predict TripAdvisor Rating
## In this competition we will predict a restaurant's rating on TripAdvisor
**Along the way we will:**
* Level up our pandas skills
* Learn to work with Kaggle Notebooks
* Understand how to preprocess different kinds of data
* Learn to handle missing data (NaN)
* Get to know different kinds of feature encoding
* Try a bit of [Feature Engineering](https://ru.wikipedia.org/wiki/Конструирование_признаков) (generating new features)
* Touch on ML just a little
* And much more...



### And most importantly, you can do all of this yourself!

*This notebook is an example/template (baseline) for this competition, not a finished solution!*   
You can use it as a starting point for your own solution.

> What a baseline solution is, why it is needed, and why providing a baseline with a competition has become an important standard on Kaggle and other platforms.   
**A baseline** is built mostly as a template that shows how the input data is handled and what the output should look like. The ML part can be quite simple, just for illustration. This helps you get to the actual ML sooner instead of spending valuable time on purely engineering tasks. 
A baseline is also a good reference point for the metric. If your solution scores worse than the baseline, you are clearly doing something wrong and should try another approach.

In the context of our competition, the baseline comes with small examples of what can be done with the data and with instructions on what to do next to improve the result. It can hardly be called a finished solution, since it uses only the 2 simplest features (and excludes the rest).

# import

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime
import re
import math

import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

# Load a handy tool for splitting the dataset:
from sklearn.model_selection import train_test_split

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [3]:
# pin the package versions so that experiments are reproducible:
!pip freeze > requirements.txt

In [4]:
# always fix RANDOM_SEED so that your experiments are reproducible!
RANDOM_SEED = 42

# DATA

In [5]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [6]:
DATA_DIR = '/kaggle/input/sf-dst-restaurant-rating/'
df_train = pd.read_csv(DATA_DIR+'main_task.csv')
df_test = pd.read_csv(DATA_DIR+'kaggle_task.csv')
sample_submission = pd.read_csv(DATA_DIR+'sample_submission.csv')

In [7]:
CITY_DIR = '/kaggle/input/world-cities/'
db_c = pd.read_csv(CITY_DIR+'worldcities.csv')

In [8]:
POP_DIR = '/kaggle/input/cities-population/'
db_pop_cities = pd.read_csv(POP_DIR+'csvData.csv')

## Lists and Dicts

In [9]:
country_list = ['Austria', 'Belgium', 'Czechia', 'Denmark', 'Finland', 'France', 'Germany', 'Greece',
                'Hungary', 'Ireland', 'Italy', 'Luxembourg', 'Netherlands', 'Poland', 'Portugal', 
                'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'United Kingdom', 'Norway', 'Switzerland']

# info from www.numbeo.com
country_index_dict = {'Austria': 176.36, 'Belgium': 148.18, 'Czechia': 157.49, 'Denmark': 186.25, 
                      'Finland': 178.95, 'France': 153.60, 'Germany': 175.24, 'Greece': 127.96, 
                      'Hungary': 134.54, 'Ireland': 150.54, 'Italy': 137.77, 'Luxembourg': 171.81,
                      'Netherlands': 180.27, 'Poland': 127.79, 'Portugal': 159.83, 'Slovakia': 147.09, 
                      'Slovenia': 165.74, 'Spain': 163.48, 'Sweden': 170.19, 'United Kingdom': 156.94, 
                      'Norway': 171.72, 'Switzerland': 188.36}

# info from www.numbeo.com
city_index_dict = {'Paris': 119.85, 'Helsinki': 173.28, 'Edinburgh': 175.31, 'London': 130.64, 
                   'Bratislava': 144.38, 'Lisbon': 147.90, 'Budapest': 123.59, 'Stockholm': 155.92, 
                   'Rome': 106.95, 'Milan': 117.35, 'Munich': 177.14, 'Hamburg': 164.86, 'Prague': 155.43,
                   'Vienna': 180.52, 'Dublin': 137.34, 'Barcelona': 133.91, 'Brussels': 139.01,
                   'Madrid': 148.59, 'Oslo': 162.11, 'Amsterdam': 166.83, 'Berlin': 157.91, 'Lyon': 148.10, 
                   'Athens': 119.44, 'Warsaw': 124.30, 'Porto': 153.46, 'Krakow': 114.99, 'Copenhagen': 177.54,
                   'Luxembourg': 178.62, 'Zurich': 194.41, 'Geneva': 180.97, 'Ljubljana': 162.95}

# info from www.numbeo.com
crime_index_dict = {'Paris': 54.96, 'Helsinki': 26.03, 'Edinburgh': 30.05, 'London': 53.13, 
                   'Bratislava': 30.41, 'Lisbon': 28.21, 'Budapest': 35.68, 'Stockholm': 45.30, 
                   'Rome': 52.51, 'Milan': 44.20, 'Munich': 17.39, 'Hamburg': 42.84, 'Prague': 24.20,
                   'Vienna': 25.83, 'Dublin': 50.84, 'Barcelona': 46.10, 'Brussels': 50.85,
                   'Madrid': 29.76, 'Oslo': 34.97, 'Amsterdam': 33.38, 'Berlin': 41.97, 'Lyon': 47.55, 
                   'Athens': 54.20, 'Warsaw': 26.60, 'Porto': 36.00, 'Krakow': 28.32, 'Copenhagen': 27.07,
                   'Luxembourg': 28.15, 'Zurich': 16.33, 'Geneva': 27.23, 'Ljubljana': 21.63}

# info from www.numbeo.com
rest_price_city_index_dict = {'Paris': 85.63, 'Helsinki': 90.47, 'Edinburgh': 89.20, 'London': 90.97, 
                   'Bratislava': 44.94, 'Lisbon': 50.49, 'Budapest': 37.35, 'Stockholm': 87.36, 
                   'Rome': 76.23, 'Milan': 82.88, 'Munich': 79.43, 'Hamburg': 71.31, 'Prague': 40.34,
                   'Vienna': 68.04, 'Dublin': 87.39, 'Barcelona': 64.99, 'Brussels': 81.08,
                   'Madrid': 65.56, 'Oslo': 108.53, 'Amsterdam': 90.25, 'Berlin': 63.83, 'Lyon': 73.05, 
                   'Athens': 57.51, 'Warsaw': 43.90, 'Porto': 45.67, 'Krakow': 38.61, 'Copenhagen': 108.85,
                   'Luxembourg': 95.81, 'Zurich': 130.84, 'Geneva': 133.10, 'Ljubljana': 54.51}

In [10]:
# Current date
cur_date = datetime.datetime.now().date()

# Foundation date (from 'About' from tripadvisor site)
foundation_date = datetime.datetime.strptime('02-2000', '%m-%Y').date()

## Functions

In [11]:
def parse_cuisine(x):
    '''Parse the cuisine column: drop the first and last characters (the brackets),
    split on commas and strip spaces and quotes around each cuisine'''
    if not pd.isna(x):
        if isinstance(x, str):
            if '[' in x:
                x = [i.strip().strip("'") for i in x[1:-1].split(', ')]
            else:
                x = [i.strip() for i in x.split(', ')]
    return x


def perscent_nans(x):
    '''Print the percentage of NaNs in a Series'''
    per_nans = (1 - x.count() / x.shape[0]) * 100
    print('Percent of missing values =', round(per_nans, 2), '%')


def parse_nan_rev(x):
    '''Return the number of reviews (0, 1 or 2) found in the raw reviews string'''
    if x == '[[], []]':
        return 0
    elif x.count("', '") == 2:
        # two texts and two dates: one "', '" separator in each inner list
        return 2
    else:
        return 1
        

def parse_reviews(x):
    '''Simple parser for the reviews column. Returns the comment string and 2 date strings'''
    if x == '[[], []]':
        return ('NO_COMMENT', 'NO_DATE', 'NO_DATE')
    else:
        txt, dates = x.split('], [')
        parse_dates = re.findall(r"\'([^\'\']+)\'", dates)
        if len(parse_dates) == 2:
            return (txt, parse_dates[0], parse_dates[1])
        else:
            return (txt, 'NO_DATE', parse_dates[0])
        
        
def calc_days(row):
    '''Return the number of days between the two reviews and between the latest review and the present date'''
    if row[0] == 'NO_DATE' and row[1] == 'NO_DATE':
        return (0, 0)
    elif row[0] == 'NO_DATE' or row[1] == 'NO_DATE':
        if row[0] == 'NO_DATE':
            return (0, (cur_date - datetime.datetime.strptime(row[1], "%m/%d/%Y").date()).days)
        else:
            return (0, (cur_date - datetime.datetime.strptime(row[0], "%m/%d/%Y").date()).days)
    first_date = datetime.datetime.strptime(row[0], "%m/%d/%Y").date()
    second_date = datetime.datetime.strptime(row[1], "%m/%d/%Y").date()
    if first_date < foundation_date:
        first_date = foundation_date
    if second_date < foundation_date:
        second_date = foundation_date
    max_date = max(first_date, second_date)
    return (abs((second_date - first_date).days), (cur_date - max_date).days)


def season(x):
    '''Return the season of the last review'''
    if x == 'NO_DATE':
        return 'NO_SEASON'
    month = datetime.datetime.strptime(x, "%m/%d/%Y").month
    if month in (12, 1, 2):
        return 'WINTER'
    elif 3 <= month <= 5:
        return 'SPRING'
    elif 6 <= month <= 8:
        return 'SUMMER'
    else:
        return 'AUTUMN'

    
def perc_iqr(col):
    '''Print the quartiles, IQR and outlier bounds; return (lower bound, upper bound, Q1, Q3, median)'''
    perc25 = col.quantile(0.25)
    perc75 = col.quantile(0.75)
    perc50 = col.quantile(0.50)
    iqr = perc75 - perc25
    print(
        'First quartile: {},'.format(perc25),
        'Third quartile: {},'.format(perc75),
        "IQR: {}, ".format(iqr),
        "Outlier bounds: [{f}, {l}].".format(f=perc25 - 1.5*iqr, l=perc75 + 1.5*iqr))
    return (perc25 - 1.5*iqr, perc75 + 1.5*iqr, perc25, perc75, perc50)


def distplot_for_onecolumn(col):
    '''Histogram plot func'''
    plt.rcParams['figure.figsize'] = (15,5)
    plt.xticks(rotation=90)
    sns.distplot(col)
    plt.show()
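
To make the review helpers above easier to follow, here is a minimal sketch of the same idea on a single made-up string: pull the dates out of the raw reviews value and compute the gap between them. The sample string and the helper name `extract_review_dates` are illustrative assumptions; they are not part of the dataset or of the notebook's own functions.

```python
import re
import pandas as pd

def extract_review_dates(raw):
    """Return the review dates found in the date part of a raw reviews string."""
    # The raw value looks like "[[texts], [dates]]"; keep only the part after '], ['
    date_part = raw.split('], [')[-1]
    found = re.findall(r"\d{2}/\d{2}/\d{4}", date_part)
    return [pd.to_datetime(d, format="%m/%d/%Y") for d in found]

# Hypothetical example value in the same layout as the reviews column
sample = "[['Good food', 'Nice staff'], ['12/31/2017', '11/20/2017']]"
dates = extract_review_dates(sample)
gap_days = abs((dates[0] - dates[1]).days)
print(gap_days)
```

An empty value such as `'[[], []]'` simply yields an empty list, so the caller can decide which placeholder to use.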

In [12]:
df_train.info()

In [13]:
df_train.head(5)

In [14]:
df_test.info()

In [15]:
df_test.head(5)

In [16]:
sample_submission.head(5)

In [17]:
sample_submission.info()


In [18]:
# IMPORTANT! To process the features consistently, combine train and test into one dataset
df_train['sample'] = 1 # mark the train rows
df_test['sample'] = 0 # mark the test rows
df_test['Rating'] = 0 # the test set has no Rating (we have to predict it), so fill it with zeros for now

db = pd.concat([df_test, df_train], sort=False).reset_index(drop=True) # combine (DataFrame.append was removed in pandas 2.0)
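
The combine-then-split pattern can be sketched on toy data as follows. The two small frames are made up for illustration; only the `sample` flag and the placeholder `Rating` mirror what the notebook does.

```python
import pandas as pd

# Toy stand-ins for the train and test files
train = pd.DataFrame({'City': ['Paris', 'Rome'], 'Rating': [4.5, 3.0]})
test = pd.DataFrame({'City': ['Oslo']})

train['sample'] = 1   # rows with a known target
test['sample'] = 0    # rows to predict
test['Rating'] = 0    # placeholder target

combined = pd.concat([test, train], sort=False).reset_index(drop=True)

# ... shared feature engineering on `combined` would go here ...

# Split back using the flag
train_back = combined[combined['sample'] == 1].drop(columns='sample')
test_back = combined[combined['sample'] == 0].drop(columns=['sample', 'Rating'])
```

Because the placeholder rating is not a real value, any statistic that involves the target should be computed on the `sample == 1` rows only.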

In [19]:
db.info()

More about the features:
* City: city
* Cuisine Style: cuisine
* Ranking: rank of the restaurant relative to other restaurants in the same city
* Price Range: restaurant prices in 3 categories
* Number of Reviews: number of reviews
* Reviews: the 2 latest reviews and their dates
* URL_TA: restaurant page on 'www.tripadvisor.com' 
* ID_TA: restaurant ID on TripAdvisor
* Rating: restaurant rating

In [20]:
db.sample(5)

# Cleaning and Prepping Data
Data usually contains a lot of junk that has to be cleaned up to bring it into an acceptable format. Data cleaning is a necessary step in almost any real-world task.   
![](https://analyticsindiamag.com/wp-content/uploads/2018/01/data-cleaning.png)

## 1. Prep dataframes

### Column names work

#### DB

In [21]:
db_columns = list(db.columns)
db_columns = [i.strip().lower() for i in db_columns] # lower all
db_columns = ['_'.join(i.split(' ')) for i in db_columns] # change ' ' to '_'
db.columns = db_columns # the inplace argument of set_axis was removed in pandas 2.0
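
As a hedged alternative, the same clean-up can be written with the vectorized string methods of the column index. The column names below are a made-up subset used only for illustration.

```python
import pandas as pd

# Empty frame with a few illustrative column names
df = pd.DataFrame(columns=['Restaurant_id', 'Cuisine Style', ' Price Range', 'ID_TA'])

# strip spaces, lower-case, then replace inner spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)
print(list(df.columns))
```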

#### DB Country

In [22]:
db_c = db_c[db_c['country'].isin(country_list)].copy() # keep only the countries present in my DB
db_c.drop(['city', 'iso2', 'iso3', 'id', 'admin_name'], axis=1, inplace=True)
db_c.rename(columns={'city_ascii': 'city', 'lat': 'lat_city', 'lng': 'lng_city'}, inplace=True)

#### DB Cities Population

In [23]:
db_pop_cities.drop(['country'], axis=1, inplace=True)
db_pop_cities.rename(columns={'name': 'city'}, inplace=True)
# add the two cities missing from the table (DataFrame.append was removed in pandas 2.0)
extra_cities = pd.DataFrame({'city': ['Luxembourg', 'Geneva'], 'pop': [613961, 196173]})
db_pop_cities = pd.concat([db_pop_cities, extra_cities], ignore_index=True)

#### City Column treat

In [24]:
db['city'].value_counts()

Rename Oporto to Porto

In [25]:
db['city'] = db['city'].replace('Oporto', 'Porto')

#### Let's see cities distribution

In [26]:
plt.rcParams['figure.figsize'] = (15,5)
plt.xticks(rotation=90)
sns.countplot(x=db[db['sample'] == 1]['city'])
plt.show()

#### Nice. Looks normal. London and Paris stand out, which is understandable because they are among the largest cities in Europe.

#### Let's merge 3 tables into 1

In [27]:
db = pd.merge(db, db_c, on='city', how='left')

In [28]:
db = pd.merge(db, db_pop_cities, on='city', how='left')
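
A left merge keeps the row count only if the key is unique in the right-hand table. Whether that holds for the lookup tables here is not checked in the notebook, so the sketch below is an assumption-driven illustration on made-up data: it shows how a duplicated city name inflates the result and how `validate='many_to_one'` guards against it.

```python
import pandas as pd

# Toy data: 'Rome' appears twice in the lookup table (numbers are made up)
restaurants = pd.DataFrame({'city': ['Paris', 'Paris', 'Rome'], 'ranking': [1, 2, 1]})
cities = pd.DataFrame({'city': ['Paris', 'Rome', 'Rome'],
                       'population': [11_000_000, 2_800_000, 36_000]})

# A plain left merge silently duplicates the Rome restaurant
naive = pd.merge(restaurants, cities, on='city', how='left')

# Keep one row per city (here: the most populous), then let pandas verify the key
cities_unique = (cities.sort_values('population', ascending=False)
                       .drop_duplicates('city'))
safe = pd.merge(restaurants, cities_unique, on='city', how='left',
                validate='many_to_one')

print(len(restaurants), len(naive), len(safe))
```

Comparing `len(db)` before and after each merge is a cheap additional check, since the submission must contain exactly one row per test restaurant.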

### Let's see NaN's around DF.

In [29]:
db[db['sample'] == 1].isna().sum()
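
Counts are easier to compare as shares. A minimal sketch on toy data (the values are made up) that reports the percentage of missing cells per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price_range': ['$', np.nan, '$$$$', np.nan],
                   'number_of_reviews': [10.0, 5.0, np.nan, 7.0]})

# isna() gives booleans; their mean is the fraction of missing cells
missing_share = df.isna().mean() * 100
print(missing_share)
```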

### Heatmap of NaNs

In [30]:
plt.figure(figsize=(10, 7))
sns.heatmap(db[['cuisine_style', 'price_range', 'number_of_reviews', 'reviews']].isnull());

The NaNs are spread evenly over the dataframe, with no grouping.
#### So I need to treat NaNs in 3 columns.

In [31]:
db.info()

## 2. NAN Treatment

#### 'Cuisine Style' column

#### First I flag the NaNs in 'cuisine_style', then parse the column and fill the remaining gaps with 'unknown'

In [32]:
db['cuisine_style_NAN'] = pd.isna(db['cuisine_style']).astype('uint8')

In [33]:
db['cuisine_style'] = db['cuisine_style'].apply(parse_cuisine)

In [34]:
db['cuisine_style'] = db['cuisine_style'].fillna('unknown')

In [35]:
plt.rcParams['figure.figsize'] = (15,5)
plt.xticks(rotation=90)
sns.countplot(x=db['cuisine_style'].explode())
plt.show()

#### 'Price Range' column

In [36]:
db['price_range'].value_counts(dropna=False)

In [37]:
perscent_nans(db[db['sample'] == 1]['price_range'])

The column has a lot of missing values that I have to treat. Before that I'll make a NaN flag column, but first let's look at the distribution.

In [38]:
db['price_range'] = db['price_range'].replace({'$': 'chip', '$$ - $$$': 'average', '$$$$': 'expensive'})  # 'chip' is the label used here for cheap restaurants

In [39]:
plt.rcParams['figure.figsize'] = (15,5)
plt.xticks(rotation=90)
sns.countplot(x='city', hue='price_range', data=db[db['sample'] == 1])
plt.show()

In [40]:
plt.rcParams['figure.figsize'] = (15,5)
sns.boxplot(x='price_range', y='rating', data=db[db['sample'] == 1])
plt.show()

Average-priced restaurants are the most common in every city and in the whole DF. Most restaurants have ratings of 3.5 - 4.5. There are a few outliers, but they are OK.

Before filling the NaNs I'm making a NaN flag column.

In [41]:
db['price_range_NAN'] = pd.isna(db['price_range']).astype('uint8')

Average-priced restaurants are the most frequent everywhere. Fill the NaNs with the per-city mode.

In [42]:
db['price_range'] = db.groupby('city')['price_range'].transform(lambda x: x.fillna(x.mode()[0]))
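
Per-group mode imputation can be sketched on toy data as below. The fallback to the overall mode for a city whose values are all missing is an added assumption (in that case `mode()` is empty and `[0]` would raise an error); the labels and rows are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Paris', 'Rome', 'Rome', 'Oslo'],
    'price_range': ['average', 'average', np.nan, 'cheap', np.nan, np.nan],
})

overall_mode = df['price_range'].mode()[0]

def fill_with_group_mode(s):
    # mode() is empty when every value in the group is NaN,
    # so fall back to the overall mode in that case
    modes = s.mode()
    return s.fillna(modes[0] if not modes.empty else overall_mode)

df['price_range'] = df.groupby('city')['price_range'].transform(fill_with_group_mode)
print(df)
```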

In [43]:
db.sample(5)

#### 'Reviews' column

In [44]:
db[db['sample'] == 1]['reviews'].value_counts()

There are 8112 empty values (empty in meaning, not by type).
Let's check for duplicated strings.

In [45]:
db[(db['sample'] == 1) & (db['reviews'].duplicated()) & (db['reviews'] != '[[], []]')]['reviews'].count()

Interesting. There are 30 duplicated reviews. I guess it's a site bug, so I'll leave them.

In [46]:
db[pd.isna(db['reviews'])]

There are only 2. Fill them with '[[], []]'

In [47]:
db['reviews'] = db['reviews'].fillna('[[], []]')

In [48]:
# db['reviews_NAN'] = db['reviews'].apply(lambda x: 1 if x == '[[], []]' else 0)

In [49]:
db[['comment', 'f_date', 's_date']] = db.apply(lambda x: parse_reviews(x['reviews']), axis=1, result_type='expand')

In [50]:
db[['dif_days', 'days_from_last']] = db.apply(lambda x: calc_days((x['f_date'], x['s_date'])), axis=1, result_type='expand')

In [51]:
distplot_for_onecolumn(db[db['sample'] == 1]['dif_days'])

In [52]:
distplot_for_onecolumn(db[db['sample'] == 1]['days_from_last'])

In [53]:
perc_iqr(db[db['sample'] == 1]['days_from_last'])

So, both distributions look normal. In 'days_from_last' I'll replace 0 with the median.

In [54]:
median_days_from_last = db[db['sample'] == 1]['days_from_last'].median()
db['days_from_last'] = db['days_from_last'].apply(lambda x: median_days_from_last if x == 0 else x)

In [55]:
distplot_for_onecolumn(db[db['sample'] == 1]['days_from_last'])

#### 'Number of Reviews' column

In [56]:
db[db['sample'] == 1]['number_of_reviews'].isna().sum()

In [57]:
perscent_nans(db[db['sample'] == 1]['number_of_reviews'])

The share of missing cells in the 'number_of_reviews' column is small, but I'll fill them with a function anyway.

In [58]:
sns.boxplot(x=db[db['sample'] == 1]['number_of_reviews'])
plt.show()

In [59]:
iqr_rev = perc_iqr(db[db['sample'] == 1]['number_of_reviews'])
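
For reference, the IQR rule behind these bounds can be sketched on a toy Series (the values are made up): anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is flagged as an outlier.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are treated as outliers
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, outliers.tolist())
```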

In [60]:
plt.rcParams['figure.figsize'] = (15,10)
plt.xticks(rotation=90)
sns.boxplot(x='city', y='number_of_reviews', data=db[db['sample'] == 1])
plt.show()

In [61]:
distplot_for_onecolumn(db[db['sample'] == 1]['number_of_reviews'].dropna())

There are many outliers, but I won't fix them: some restaurants really do have a lot of reviews and others only a few, which is fine. Still, let's look at the outliers by price range. Interesting...

In [62]:
db[db['number_of_reviews'] > 2000].groupby(['city', 'price_range'])['number_of_reviews'].count()
plt.rcParams['figure.figsize'] = (15,7)
plt.xticks(rotation=90)
sns.barplot(x='city', y='number_of_reviews', hue='price_range', data=db[(db['number_of_reviews'] > 300) & (db['sample'] == 1)])
plt.show()

The outliers are well distributed. Let them be; just fill the empty cells.
First, make the NaN flag column.

In [63]:
db['number_of_reviews_NAN'] = pd.isna(db['number_of_reviews']).astype('uint8')

In [64]:
db['number_of_reviews'] = db.apply(lambda x: parse_nan_rev(str(x['reviews'])) 
                                  if pd.isna(x['number_of_reviews']) else x['number_of_reviews'], axis=1)

In [65]:
distplot_for_onecolumn(db[db['sample'] == 1]['number_of_reviews'])

In [66]:
db.info()

## 3. Treatment of numeric columns

#### Ranking column

In [67]:
for i in (db['city'].value_counts())[0:10].index:
    plt.rcParams['figure.figsize'] = (15,7)
    plt.xticks(rotation=90)
    sns.distplot(db['ranking'][db['city'] == i], bins=100)
plt.show()

The per-city distributions look reasonable. So let's min-max normalize the ranking within each city into a new column.
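
The per-city min-max idea can be sketched on toy data with a vectorized `groupby(...).transform`, as a possible alternative to a row-wise `apply`. The frame below is made up; the formula is 1 - (ranking - min) / (max - min) within each city, so the best-ranked restaurant gets 1 and the worst gets 0.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Paris', 'Paris', 'Paris', 'Rome', 'Rome'],
                   'ranking': [1, 51, 101, 1, 11]})

grp = df.groupby('city')['ranking']
rank_min, rank_max = grp.transform('min'), grp.transform('max')

# Inverted min-max scaling within each city
df['weight_rank'] = 1 - (df['ranking'] - rank_min) / (rank_max - rank_min)
print(df)
```

Note that a city with a single restaurant would produce a 0/0 division and therefore NaN, which would need separate handling.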

In [68]:
rank_minmax = db.groupby('city')['ranking'].agg(['min', 'max']).to_dict()
db['weight_rank'] = db.apply(lambda x: 1 - (x['ranking'] - rank_minmax['min'][x['city']])/(rank_minmax['max'][x['city']] 
                                                                                           - rank_minmax['min'][x['city']]), axis=1)

In [69]:
for i in (db['city'].value_counts())[0:10].index:
    plt.rcParams['figure.figsize'] = (15,7)
    plt.xticks(rotation=90)
    sns.distplot(db['weight_rank'][db['city'] == i], bins=100)
plt.show()

Wonderful!

#### ID_TA column

Let's try converting this column to numeric.

In [70]:
db['id_ta'] = db['id_ta'].apply(lambda x: x[1:]).astype(int)

In [71]:
plt.rcParams['figure.figsize'] = (15,5)
plt.xticks(rotation=90)
sns.distplot(db['id_ta'])
plt.show()

#### Price Range column to numeric

In [72]:
db['price_range_num'] = db['price_range'].replace({'chip': 1, 'average': 2, 'expensive': 3})

In [73]:
# Column with mean price range by city
db['price_range_mean'] = db.groupby('city')['price_range_num'].transform('mean')

#### Cuisine style to numeric

In [74]:
# number of cuisines in restaurant ('unknown' is a plain string, so count it as one cuisine)
db['number_of_cuisine'] = db['cuisine_style'].apply(lambda x: len(x) if isinstance(x, list) else 1)

In [75]:
# Mean number of cuisines by city
db['cuisine_mean_by_city'] = db.groupby('city')['number_of_cuisine'].transform('mean')

### Index column block

In [76]:
# Adding country index column
db['country_idx'] = db['country'].map(country_index_dict)

In [77]:
# Adding city index column
db['city_idx'] = db['city'].map(city_index_dict)

In [78]:
# Adding Crime index column by city
db['crime_city_idx'] = db['city'].map(crime_index_dict)

In [79]:
# Adding restaurants price index column by city
db['price_city_idx'] = db['city'].map(rest_price_city_index_dict)

### Dummies block

#### Mono cuisine or multi (depends on more or less than mean by city)

In [80]:
db['ismono_cuisine'] = db.apply(lambda x: 'mono' if x['number_of_cuisine'] < x['cuisine_mean_by_city'] 
                               else 'multi', axis=1)
dumm_ismono_cuisine = pd.get_dummies(db['ismono_cuisine'], prefix='cuisine')
db = pd.concat([db, dumm_ismono_cuisine], axis=1, join='outer')

#### Price range dummies

In [81]:
# dumm_price_range = pd.get_dummies(db['price_range'])
# db = pd.concat([db, dumm_price_range], axis=1, join='outer')

#### City dummies

In [82]:
dumm_city = pd.get_dummies(db['city'], prefix='city')
db = pd.concat([db, dumm_city], axis=1, join='outer')

#### Cuisine dummies

In [83]:
dumm_cuisine_style = pd.get_dummies(db['cuisine_style'].explode()).groupby(level=0).sum()
db = pd.concat([db, dumm_cuisine_style], axis=1, join='outer')
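
The explode-then-dummies trick used for the cuisine lists can be sketched on a toy column (values made up): each list is expanded to one row per cuisine, one-hot encoded, and summed back per original row index. `dtype=int` is passed so the result holds 0/1 integers regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'cuisine_style': [['French', 'European'], ['Italian'], 'unknown']})

# explode() repeats the row index for every list element; plain strings stay as they are
exploded = df['cuisine_style'].explode()

# One-hot encode each element, then collapse back to one row per restaurant
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
print(dummies)
```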

In [84]:
city_counts_list = db.groupby('city')['ranking'].count().sort_values(ascending=False).head(10).index.to_list()
db['top10_cities'] = db['city'].apply(lambda x: x if x in city_counts_list else 'other')
dumm_top10_city = pd.get_dummies(db['top10_cities'], prefix='topcity')
db = pd.concat([db, dumm_top10_city], axis=1, join='outer')

### New features

In [85]:
# Restaurant ID: transform to a number (how many rows share this ID, i.e. chain size)
db['rest_id_num'] = db.groupby('restaurant_id')['ranking'].transform('count')

In [86]:
# Ranking relative to the number of restaurants in the city
db['rests_in_city'] = db.groupby('city')['ranking'].transform('count')
db['rank_by_citycount'] = db['ranking'] / db['rests_in_city']

In [87]:
# Number of restaurants in the city relative to the city population (per 1000 people)
db['rests_per_pop'] = round((db['rests_in_city'] / db['pop']) * 1000, 4)

In [88]:
# Rank by city count relative to the city population
db['wrank_by_pop'] = db['rank_by_citycount'] / db['pop']

In [89]:
db['review_to_pop'] = db['number_of_reviews'] / db['pop']
db['rank_with_review_to_pop'] = db['rank_by_citycount'] * db['review_to_pop']

In [90]:
db['rewies_per_city'] = db.groupby('city')['number_of_reviews'].transform('sum')
db['rank_to_reviews_per_city'] = db['ranking'] / db['rewies_per_city']

In [91]:
db['rank_to_review_to_rest_id'] = db['number_of_reviews'] * db['rest_id_num'] * db['ranking']

In [92]:
db['rank_city'] = db['city'].rank()

In [93]:
# How the length of the comment affects the rating
db['comment_len'] = db['comment'].apply(len)

### Yes-No columns

In [94]:
db['rest_id_isnet'] = db['rest_id_num'].apply(lambda x: 1 if x > 1 else 0)  # 1 if the restaurant ID occurs more than once (a chain)

In [95]:
db['is_capital'] = db['capital'].apply(lambda x: 1 if x == 'primary' else 0)

In [96]:
db.sample(3)

#### Now delete some columns: 'reviews' (I parsed it into 3 different columns) and 'url_ta' (I can't parse the URL yet). 

In [99]:
db.drop(['reviews', 'url_ta'], axis=1, inplace=True)

In [None]:
# train rows only: the test rows carry a placeholder rating of 0
heatmap = sns.heatmap(db[db['sample'] == 1][['rating', 'ranking', 'number_of_reviews']].corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Start column correlation', fontdict={'fontsize': 18}, pad=12)
plt.show()

So, not much. Let's add the new columns.

In [None]:
db_list = ['rating', 'ranking', 'number_of_reviews', 'dif_days', 'days_from_last', 'weight_rank', 'price_range_num', 
        'price_range_mean', 'number_of_cuisine', 'cuisine_mean_by_city', 'rest_id_num', 'rests_in_city', 'rank_by_citycount',  
        'rests_per_pop', 'wrank_by_pop', 'review_to_pop', 'rank_with_review_to_pop', 'rewies_per_city', 'rank_to_reviews_per_city', 
        'rank_to_review_to_rest_id', 'rank_city', 'comment_len']
plt.figure(figsize=(15, 15))
heatmap = sns.heatmap(db[db['sample'] == 1][db_list].corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')  # train rows only
heatmap.set_title('Correlation table', fontdict={'fontsize': 18}, pad=12)
plt.show()

# Data Preprocessing

#### Run it and check the result

In [None]:
df_preproc = db.copy()
df_preproc.sample(3)

In [None]:
object_columns = [s for s in df_preproc if df_preproc[s].dtypes == 'object']
df_preproc.drop(object_columns, axis = 1, inplace=True)

In [None]:
df_preproc.drop(['lat_city', 'lng_city', 'population', 'pop'], axis = 1, inplace=True)

In [None]:
df_preproc.info()

In [None]:
# Now split off the test part
train_data = df_preproc.query('sample == 1').drop(['sample'], axis=1)
test_data = df_preproc.query('sample == 0').drop(['sample'], axis=1)

y = train_data.rating.values            # our target
X = train_data.drop(['rating'], axis=1)

**Before sending our data for training, we split it once more into test and train parts for validation. 
This lets us check how well our model works before submitting to Kaggle.**

In [None]:
# Use the train_test_split function to split the training data
# hold out 20% of the data for validation (the test_size parameter)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

In [None]:
# check
test_data.shape, train_data.shape, X.shape, X_train.shape, X_test.shape

# Model 
The ML itself

In [None]:
# Import the required libraries:
from sklearn.ensemble import RandomForestRegressor # tool for creating and training the model
from sklearn import metrics # tools for evaluating model accuracy

In [None]:
# Create the model (DO NOT CHANGE THE SETTINGS)
model = RandomForestRegressor(n_estimators=100, verbose=1, n_jobs=-1, random_state=RANDOM_SEED)

In [None]:
# Train the model on the training set
model.fit(X_train, y_train)

# Use the trained model to predict restaurant ratings on the hold-out set (X_test).
# The predicted values are stored in y_pred
y_pred = model.predict(X_test)

In [None]:
# Compare the predicted values (y_pred) with the real ones (y_test) and see how much they differ on average
# The metric is called Mean Absolute Error (MAE); it shows the average deviation of the predictions from the actual values.
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
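
To make the metric concrete, here is a minimal sketch that computes MAE by hand with NumPy on made-up numbers: the mean of the absolute differences between actual and predicted values. `metrics.mean_absolute_error` is expected to give the same result for the same inputs.

```python
import numpy as np

# Hypothetical actual and predicted ratings
y_true = np.array([4.5, 4.0, 3.5, 5.0])
y_hat = np.array([4.0, 4.0, 4.5, 4.5])

# MAE = mean(|y_true - y_hat|)
mae = np.mean(np.abs(y_true - y_hat))
print(mae)
```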

In [None]:
# RandomForestRegressor can show the features that matter most to the model
plt.rcParams['figure.figsize'] = (10,10)
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(30).plot(kind='barh')

# Submission
If everything looks good, prepare the submission for Kaggle

In [None]:
test_data.sample(10)

In [None]:
test_data = test_data.drop(['rating'], axis=1)

In [None]:
sample_submission

In [None]:
predict_submission = model.predict(test_data)

In [None]:
predict_submission

In [None]:
sample_submission['Rating'] = predict_submission
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head(10)

# What's next?
Or what to do to improve the result:
* Convert the remaining features into a machine-readable format
* See what else can be extracted from the features
* Generate new features
* Load additional data, for example on city population or wealth
* Tune the set of features

All in all, the process is creative and a lot of fun! Good luck in the competition!

