# SLU12 - Feature Engineering (aka Real World Data): Exercises notebook

## 1 About the data

In this exercise we will be using a dataset from Zomato, adapted from [here](https://github.com/MehtaShruti/Zomato-Restaurants-Recommendations).

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('data/zomato.csv')
data.head()

The fields in this dataset have the following meaning:
* **Restaurant Name** - name of the restaurant.
* **City** - name of the city where the restaurant is located.
* **Cuisines** - type of cuisine served at the restaurant.
* **Average Cost for two** - self-explanatory (at the time of scraping).
* **Has Table booking** - Yes or No.
* **Has Online delivery** - Yes or No.
* **Price Range** - prices range from 1 to 4.
* **Aggregate rating** - overall user rating on the app (at the time of scraping).
* **Rating color** - rating in color: White / Red / Orange / Yellow / Green / Dark Green.
* **Rating text** - rating in text values: Not rated / Poor / Average / Good / Very Good / Excellent.
* **Votes** - number of user reviews for the restaurant (at the time of scraping).

The first thing we want to do is to check the dtypes of our features.

In [None]:
data.dtypes

## 2 Category dtype in pandas

### Exercise 1: Convert fields into category dtype (graded)

The fields `Cuisines` and `Rating text` are of dtype `object` but can be converted into dtype `category`, as explained in the Learning Notebook. Moreover:
* `Cuisines` is a *nominal* categorical field, that is, without any meaningful order;
* `Rating text` is an *ordinal* categorical field, as its values have a natural order.

In the following exercise, convert both fields into dtype `category` and, in the case of the field `Rating text`, assign a natural order for its categories.

_Note:_ Regarding the "natural order" for the field `Rating text`, use the order shown in the field meaning. 
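As a reminder of how the two kinds of categorical field differ in pandas, here is a minimal sketch on toy data. The DataFrame, its column names (`colour`, `size`) and its values are made up for illustration and are not part of the Zomato dataset.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy data; the column names and values are hypothetical.
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "size": ["medium", "small", "large", "small"],
})

# Nominal field: a plain conversion, no order is implied.
toy["colour"] = toy["colour"].astype("category")

# Ordinal field: list the categories from lowest to highest.
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
toy["size"] = toy["size"].astype(size_dtype)

# min() and max() are only defined because the categories are ordered.
print(toy["size"].min(), toy["size"].max())
```

Once a field is ordered, `min()`, `max()` and comparisons follow the order of the category list rather than alphabetical order.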

In [None]:
def convert_categorical_features(X, nominal_feat='Cuisines', ordinal_feat='Rating text'):

    X_s = X.copy()
    
    ## convert nominal feature to dtype 'category'
    # ...
    ## create list of ordered categories for ordinal feature
    # ordered_cats = ...
    ## convert ordinal feature to dtype 'category'
    # ...
    ## Assign natural order to ordinal feature
    # ...
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return X_s

In [None]:
"""Check that the solution is correct."""
X_cat_conv = convert_categorical_features(data)

assert X_cat_conv['Cuisines'].dtype == 'category'
assert X_cat_conv['Rating text'].dtype == 'category'
assert X_cat_conv['Rating text'].min() == 'Not rated'
assert X_cat_conv['Rating text'].max() == 'Excellent'

### Exercise 2: Encode binary field (graded)

In this exercise, use the `map` method to encode the `Has Table booking` field as `1` when a restaurant has table booking and as `0` when it doesn't.
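To recall how `Series.map` behaves with a dictionary, here is a small sketch on a made-up Series. The values `"on"` and `"off"` are hypothetical and were chosen so as not to mirror the dataset.

```python
import pandas as pd

# Hypothetical toy field with two text values.
answers = pd.Series(["on", "off", "off", "on"])

# Each value is looked up in the dictionary;
# a value that is not a key of the dictionary would become NaN.
encoded = answers.map({"off": 0, "on": 1})

print(encoded.tolist())  # [1, 0, 0, 1]
```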

In [None]:
def encode_binary_field(f):

    f_e = f.copy()
    
    ## create a dictionary mapping the current values to int values
    # encoding_map = ...
    ## change target using the mapping
    # f_e = ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return f_e

In [None]:
"""Check that the solution is correct."""
f_encoded = encode_binary_field(data['Has Table booking'])

assert f_encoded[123] == 0
assert f_encoded[2004] == 1
assert sum(f_encoded.fillna(0)) == 1111

### Exercise 3: Discretize `Votes` field (graded)

The field `Votes` is a continuous field, with a distribution which is, not surprisingly, very skewed to the right (remember *skewness* from SLU04?).


In [None]:
data['Votes'].plot.hist(bins=100, figsize=(10,6));
plt.xlim(0);
plt.xlabel('Votes');
print("The field 'Votes' ranges from", data['Votes'].min(), "to", data['Votes'].max())

We will deal with the skewness in a bit. Let's first discretize this field in two ways:
* create a new field called `discrete_votes` which is the discretization of the `Votes` field, such that the bins range from 0 to 49 and the original instances are distributed uniformly across them;
* create a new field called `binary_votes` which is the binarization of the `Votes` field, such that values smaller than `100` become `0` and values equal to or greater than `100` become `1`.

Use `sklearn` transformers in this exercise.
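The two transformers can be tried out on a handful of numbers first. The sketch below uses a made-up array and deliberately different settings (4 bins, a threshold of 9), so the parameter values are illustrative assumptions rather than the ones the exercise needs.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical right-skewed toy values, shaped as a single column.
values = np.array([[1], [2], [3], [4], [10], [20], [50], [1000]])

# Quantile strategy: bin edges are chosen so that each bin holds
# a similar number of instances; "ordinal" returns the bin index.
discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values).ravel()

# Binarizer maps values strictly greater than the threshold to 1 and
# the rest to 0, so for integers threshold=9 means "10 or more becomes 1".
binarizer = Binarizer(threshold=9)
flags = binarizer.fit_transform(values).ravel()

print(bins)
print(flags)
```

Note that `Binarizer` uses a strict comparison, which matters when the cut-off value itself should map to `1`.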

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import Binarizer

def discretize_votes(X):

    X_a = X.copy()
    
    ## create new column `discrete_votes` using suitable transformer
    # discretizer = ...
    # ...
    ## create new column `binary_votes` using suitable transformer
    # binarizer = ...
    # ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_a

In [None]:
"""Check that the solution is correct."""
X_votes = discretize_votes(data)

assert X_votes.discrete_votes.nunique() == 32
assert X_votes.discrete_votes.max() == 49
assert X_votes.loc[123, 'discrete_votes']  == 0
assert X_votes.binary_votes.nunique() == 2
assert X_votes.binary_votes.max() == 1
assert X_votes.loc[123, 'binary_votes'] == 0

Check the distribution of the two new fields you just calculated:

In [None]:
X_votes.discrete_votes.plot.hist(bins=40, figsize=(10,6));
plt.xlim(0,50);
plt.xlabel('discrete_votes');
plt.title('Votes after discretization');

In [None]:
X_votes.binary_votes.plot.hist(figsize=(4,4));
plt.xlim(0,1);
plt.xlabel('binary_votes');
plt.title('Votes after binarization');

### Exercise 4: Scale `Votes` field (graded)

In the Learning Notebook, you also learned that numerical data can be scaled. 

In this exercise, let's scale the field `Votes` in three different ways and compare the results:
* create a new field called `minmaxscaled_votes` which uniformly scales the `Votes` field such that the values range from 0 to 1;
* create a new field called `standardscaled_votes` which scales the `Votes` field such that the *mean* is 0 and the standard deviation is 1;
* create a new field called `robustscaled_votes` which scales the `Votes` field such that the *median* is 0 and it is scaled according to the Interquartile Range.
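The three scalings above can be compared on a tiny example. This is a sketch on a made-up column with one large outlier; the array is hypothetical, and the comments state the formula each scaler applies with its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical toy column with a single large outlier.
values = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]])

# (x - min) / (max - min): results lie in [0, 1].
minmax = MinMaxScaler().fit_transform(values).ravel()

# (x - mean) / std: mean 0 and standard deviation 1,
# up to floating-point rounding.
standard = StandardScaler().fit_transform(values).ravel()

# (x - median) / IQR: median 0, much less affected by the outlier.
robust = RobustScaler().fit_transform(values).ravel()

print(minmax)
print(standard)
print(robust)
```

With min-max scaling the outlier squeezes every other value close to 0, whereas robust scaling keeps the bulk of the values spread out.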

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

def scale_votes(X):

    X_s = X.copy()
    
    ## create new column `minmaxscaled_votes` using suitable transformer
    # ...
    ## create new column `standardscaled_votes` using suitable transformer
    # ...
    ## create new column `robustscaled_votes` using suitable transformer
    # ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_s

In [None]:
"""Check that the solution is correct."""
X_scaled = scale_votes(data)

assert X_scaled.minmaxscaled_votes.min() == 0
assert X_scaled.minmaxscaled_votes.max() == 1
assert math.isclose(X_scaled.minmaxscaled_votes.mean(), 0.0125, abs_tol = 0.0001)
assert math.isclose(X_scaled.loc[1234, 'minmaxscaled_votes'], 0.03576 , abs_tol = 0.00001)
assert math.isclose(X_scaled.standardscaled_votes.min(), -0.321, abs_tol = 0.001)
assert math.isclose(X_scaled.standardscaled_votes.max(), 25.23, abs_tol = 0.01)
assert math.isclose(X_scaled.standardscaled_votes.mean(), 0, abs_tol = 1e-9)
assert math.isclose(X_scaled.loc[1234, 'standardscaled_votes'], 0.592, abs_tol = 0.01)
assert math.isclose(X_scaled.robustscaled_votes.min(), -0.25, abs_tol = 0.0001)
assert math.isclose(X_scaled.robustscaled_votes.max(), 113.65, abs_tol = 0.01)
assert math.isclose(X_scaled.robustscaled_votes.mean(), 1.1793, abs_tol = 0.001)
assert math.isclose(X_scaled.loc[1234, 'robustscaled_votes'], 3.82292, abs_tol = 0.001)

Plot the distributions for the new fields you just calculated:

In [None]:
X_scaled.minmaxscaled_votes.plot.hist(bins=30, figsize=(10,6));
plt.xlim(0,1);
plt.xlabel('minmaxscaled_votes');
plt.title('Votes after min-max scaling');

In [None]:
X_scaled.standardscaled_votes.plot.hist(bins=30, figsize=(10,6));
plt.xlabel('standardscaled_votes');
plt.title('Votes after standard scaling');

In [None]:
X_scaled.robustscaled_votes.plot.hist(bins=30, figsize=(10,6));
plt.xlabel('robustscaled_votes');
plt.title('Votes after robust scaling');

### Exercise 5: Ordinal encode `Rating text` feature

Now, let's deal with the categorical features.

First, create a new field called `rating_text_encoded` which is the result of ordinal encoding of the `Rating text` feature.
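To illustrate what ordinal encoding produces, here is a pandas-only sketch on made-up data. It is a stand-in for the idea and not the `category_encoders` API; the names `levels` and `grades` are hypothetical.

```python
import pandas as pd

# Hypothetical ordered levels, from lowest to highest.
levels = ["low", "medium", "high"]
grades = pd.Series(["medium", "low", "high", "low"])

# Categorical codes start at 0, so add 1 to get integers starting at 1.
ordered = pd.Categorical(grades, categories=levels, ordered=True)
codes = ordered.codes.astype(int) + 1

print(codes.tolist())  # [2, 1, 3, 1]
```

A dedicated encoder may assign its integers differently, for example by order of first appearance, unless it is given an explicit mapping.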

In [None]:
import category_encoders as ce

def encode_rating_text(X):

    X_r = X.copy()
    
    # create new column using suitable transformer
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_r

In [None]:
"""Check that the solution is correct."""
X_rating_text = encode_rating_text(data)

assert X_rating_text.rating_text_encoded.dtype == int
assert X_rating_text.rating_text_encoded.min() == 1
assert X_rating_text.rating_text_encoded.max() == 6
assert X_rating_text.loc[1234, 'rating_text_encoded'] == 2

### Exercise 6: One-hot encode `Cuisines` feature

Finally, perform a one-hot encoding of the `Cuisines` feature. Pay attention to the following points:
* return the original DataFrame `X`, but with the `Cuisines` feature replaced by the new ones resulting from the one-hot encoding;
* make sure the new features have names of the form `Cuisines_<value>`, where `<value>` is the category being indicated by that feature.
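The shape of the expected result can be illustrated with `pandas.get_dummies` on made-up data. This is a swapped-in, pandas-only technique for illustration: the DataFrame and its columns (`name`, `kind`) are hypothetical, and it does not reproduce any extra columns that a dedicated encoder may add.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "kind": ["cafe", "bar", "cafe"],
})

# The encoded column is dropped and replaced by one indicator
# column per category, named <prefix>_<value>.
encoded = pd.get_dummies(toy, columns=["kind"], prefix="kind", dtype=int)

print(encoded.columns.tolist())  # ['name', 'kind_bar', 'kind_cafe']
```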

In [None]:
def encode_cuisines(X):

    X_t = X.copy()
    
    # perform one-hot encoding in X_t
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_t

In [None]:
"""Check that the solution is correct."""
X_cuisines = encode_cuisines(data)

assert X_cuisines.shape[1] > 10
assert X_cuisines.Cuisines_Mughlai.sum() == 103
assert X_cuisines['Cuisines_-1'].sum() == 0
assert X_cuisines.loc[1234, 'Cuisines_North Indian, Chinese, Continental'] == 0
assert X_cuisines.loc[4322, 'Cuisines_North Indian, European'] == 0

----