# Homework 03
---
[mlbookcamp 03-classification](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/03-classification/homework.md)

In [3]:
import numpy as np
import pandas as pd



In [4]:
DATAPATH = "/dataset/AB_NYC_2019.csv"

### Downloading data

In [5]:
%%bash -s "$DATAPATH"
# Downloads data if not available.
if [[ -f "$1" ]]
    then
        echo 'Data already there.';
    else
        echo 'Downloading data';
        wget -O "$1" https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv
fi

Data already there.


In [6]:
data = pd.read_csv(DATAPATH)

## Features
For the rest of the homework, use the features from the previous homework plus two additional ones, 'neighbourhood_group' and 'room_type'. The full feature set is:

- 'neighbourhood_group',
- 'room_type',
- 'latitude',
- 'longitude',
- 'price',
- 'minimum_nights',
- 'number_of_reviews',
- 'reviews_per_month',
- 'calculated_host_listings_count',
- 'availability_365'

Select only these columns and fill in the missing values with 0.

In [7]:
required_columns = [
    'neighbourhood_group', 
    'room_type', 
    'latitude', 
    'longitude', 
    'price', 
    'minimum_nights', 
    'number_of_reviews', 
    'reviews_per_month', 
    'calculated_host_listings_count', 
    'availability_365'
]

In [8]:
data = data[required_columns].fillna(0)

## Question 1
What is the most frequent observation (mode) for the column `'neighbourhood_group'`?

In [9]:
data['neighbourhood_group'].mode()

0    Manhattan
dtype: object

## Split the data
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value ('price') is not in your dataframe.

In [10]:
from sklearn.model_selection import train_test_split

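# First split: 60% train, 40% held out (named "test" until it is split again).
X_train, X_test, Y_train, Y_test = train_test_split(data.drop('price', axis=1), 
                                                    data['price'], 
                                                    test_size=0.4, 
                                                    shuffle=True, 
                                                    random_state=42)
# Second split: divide the held-out 40% evenly into test and validation.
# The rows are already shuffled, so shuffle=False just slices them in order
# (random_state has no effect when shuffle=False).
X_test, X_val, Y_test, Y_val = train_test_split(X_test, 
                                                Y_test, 
                                                test_size=0.5, 
                                                shuffle=False, 
                                                random_state=42)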
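
The two-step split above can be sanity-checked on a small synthetic frame. This is a minimal sketch, assuming a made-up 1000-row dataset with hypothetical values (not the Airbnb data), that confirms the 60%/20%/20% sizes and that `price` is absent from the feature frames:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the homework data (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=1000),
    "price": rng.integers(20, 500, size=1000),
})

# First split: 60% train, 40% held out.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    toy.drop(columns="price"), toy["price"], test_size=0.4, random_state=42
)
# Second split: halve the held-out part into test and validation.
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
print(sizes)
```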

## Question 2
- Create the correlation matrix for the numerical features of your train dataset.
  - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?

In [11]:
X_train.dtypes

neighbourhood_group                object
room_type                          object
latitude                          float64
longitude                         float64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [12]:
X_train.head().T

| | 1261 | 19170 | 45159 | 9085 | 20490 |
|---|---|---|---|---|---|
| neighbourhood_group | Manhattan | Brooklyn | Manhattan | Brooklyn | Manhattan |
| room_type | Private room | Entire home/apt | Entire home/apt | Entire home/apt | Private room |
| latitude | 40.72006 | 40.68048 | 40.75933 | 40.67886 | 40.72087 |
| longitude | -73.99579 | -73.99322 | -73.98751 | -73.96802 | -73.98079 |
| minimum_nights | 6 | 4 | 29 | 1 | 1 |
| number_of_reviews | 18 | 8 | 0 | 2 | 2 |
| reviews_per_month | 0.21 | 0.24 | 0.0 | 0.04 | 0.15 |
| calculated_host_listings_count | 1 | 1 | 327 | 1 | 1 |
| availability_365 | 0 | 0 | 336 | 0 | 0 |


In [13]:
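# Restrict to numeric columns; recent pandas versions raise on object columns.
X_train.select_dtypes(include='number').corr()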

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.087732 | 0.027252 | -0.01008 | -0.014699 | 0.019442 | -0.005975 |
| longitude | 0.087732 | 1.0 | -0.067251 | 0.058775 | 0.132226 | -0.116669 | 0.080776 |
| minimum_nights | 0.027252 | -0.067251 | 1.0 | -0.085092 | -0.127316 | 0.12406 | 0.141089 |
| number_of_reviews | -0.01008 | 0.058775 | -0.085092 | 1.0 | 0.581124 | -0.072687 | 0.176481 |
| reviews_per_month | -0.014699 | 0.132226 | -0.127316 | 0.581124 | 1.0 | -0.047254 | 0.166533 |
| calculated_host_listings_count | 0.019442 | -0.116669 | 0.12406 | -0.072687 | -0.047254 | 1.0 | 0.222986 |
| availability_365 | -0.005975 | 0.080776 | 0.141089 | 0.176481 | 0.166533 | 0.222986 | 1.0 |


In [14]:
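# Numeric columns only, flattened to (feature, feature) pairs, strongest first.
X_train.select_dtypes(include='number').corr().unstack().sort_values(ascending=False)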

latitude                        latitude                          1.000000
longitude                       longitude                         1.000000
calculated_host_listings_count  calculated_host_listings_count    1.000000
reviews_per_month               reviews_per_month                 1.000000
minimum_nights                  minimum_nights                    1.000000
number_of_reviews               number_of_reviews                 1.000000
availability_365                availability_365                  1.000000
number_of_reviews               reviews_per_month                 0.581124
reviews_per_month               number_of_reviews                 0.581124
availability_365                calculated_host_listings_count    0.222986
calculated_host_listings_count  availability_365                  0.222986
availability_365                number_of_reviews                 0.176481
number_of_reviews               availability_365                  0.176481
availability_365         
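
The sorted list is dominated by the self-correlations on the diagonal, and every pair appears twice. Here is a minimal sketch of one way to list each pair once and pick the strongest. It uses synthetic data with hypothetical column names `a`, `b` and `c` rather than the homework features:

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
# Keep the strict upper triangle: drops the diagonal and the mirrored duplicates.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative correlations are not missed.
top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs[top_pair], 3))
```

Applied to the matrix above, the same idea should point to `number_of_reviews` and `reviews_per_month` (0.581124), the first off-diagonal entry in the sorted output.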

## Make price binary
- We need to turn the price variable from numeric into binary.
- Let's create a variable above_average which is `1` if the price is above (or equal to) `152`.

In [15]:
Y_train_above_average = (Y_train >= 152) * 1
Y_test_above_average = (Y_test >= 152) * 1
Y_val_above_average = (Y_val >= 152) * 1
Y_train_above_average.head()

1261     0
19170    0
45159    1
9085     1
20490    1
Name: price, dtype: int64

## Question 3
- Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
- Which of these two variables has bigger score?
- Round it to 2 decimal digits using round(score, 2)

In [16]:
column_types = X_train.dtypes
cat_features = list(column_types[column_types == 'object'].index)
cat_features

['neighbourhood_group', 'room_type']

In [17]:
from sklearn.metrics import mutual_info_score

info_score_dict = {feature: mutual_info_score(X_train[feature], Y_train_above_average) 
    for feature in cat_features}
info_score_dict

{'neighbourhood_group': 0.04640789222639313, 'room_type': 0.14253822175954092}

In [18]:
max(info_score_dict.items(), key=lambda x: x[1])

('room_type', 0.14253822175954092)
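
To build intuition for the scores, here is a minimal sketch on toy data (hypothetical labels, not the homework columns). A categorical variable that fully determines a balanced binary target has mutual information ln 2 ≈ 0.69 nats, and an unrelated one scores 0. It also applies the `round(score, 2)` step the question asks for:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Balanced binary target.
y = pd.Series([0, 0, 1, 1])
# "informative" determines y exactly; "uninformative" is independent of y.
informative = pd.Series(["x", "x", "y", "y"])
uninformative = pd.Series(["x", "y", "x", "y"])

scores = {
    "informative": round(mutual_info_score(informative, y), 2),
    "uninformative": round(mutual_info_score(uninformative, y), 2),
}
print(scores)
```

Rounded the same way, the homework scores above become 0.05 for `neighbourhood_group` and 0.14 for `room_type`.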

## Question 4
- Now let's train a logistic regression
- Remember that we have two categorical variables in the data. Include them using one-hot encoding.
- Fit the model on the training dataset.
  - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  - `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [19]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


preprocessing = ColumnTransformer([
    ('cat', OneHotEncoder(), cat_features)], 
    remainder='passthrough'
    )
X_train_processed = preprocessing.fit_transform(X_train)
X_test_processed = preprocessing.transform(X_test)
X_val_processed = preprocessing.transform(X_val)

In [20]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train_processed, Y_train_above_average)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=42)
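
The warning above suggests scaling the data or raising `max_iter`. The homework fixes the model parameters, so the following is only a side sketch of the scaling option: it standardises the numeric columns inside the same `ColumnTransformer`. The data is synthetic and the column names `big_scale` and `small_scale` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with numeric features on very different scales.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "room_type": rng.choice(["Private room", "Entire home/apt"], size=n),
    "big_scale": rng.normal(scale=1000.0, size=n),
    "small_scale": rng.normal(scale=0.01, size=n),
})
target = (toy["big_scale"] > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    toy, target, test_size=0.2, random_state=42
)

# One-hot encode the categorical column and standardise the numeric ones.
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ("num", StandardScaler(), ["big_scale", "small_scale"]),
])
clf = make_pipeline(
    prep, LogisticRegression(solver="lbfgs", C=1.0, random_state=42)
)
clf.fit(X_tr, y_tr)

val_acc = accuracy_score(y_va, clf.predict(X_va))
print(round(val_acc, 2))
```

Scaling changes the fitted coefficients, so accuracies from such a pipeline are not directly comparable with the homework answers.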

In [21]:
from sklearn.metrics import accuracy_score

yhat_val = model.predict(X_val_processed)
acc = accuracy_score(Y_val_above_average, yhat_val)
round(acc, 2)

0.79

## Question 5
- We have 9 features: 7 numerical features and 2 categorical.
- Let's find the least useful one using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of the following features has the smallest difference?
    - neighbourhood_group
    - room_type
    - number_of_reviews
    - reviews_per_month
> Note: the difference doesn't have to be positive.

In [22]:
feature_selection_dict = {}
for del_feature in X_train.columns:
    selected_cat_features = [feature for feature in cat_features if feature != del_feature]
    x_query = X_train.drop(columns=del_feature)
    val_query = X_val.drop(columns=del_feature)
    preprocessing = ColumnTransformer([
        ('cat', OneHotEncoder(), selected_cat_features)], 
        remainder='passthrough'
        )
    x_query = preprocessing.fit_transform(x_query)
    val_query = preprocessing.transform(val_query)
    model.fit(x_query, Y_train_above_average)
    feature_selection_dict[del_feature] = accuracy_score(Y_val_above_average, model.predict(val_query))
feature_selection_dict

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

{'neighbourhood_group': 0.746702116780857,
 'room_type': 0.7177625524082217,
 'latitude': 0.7864812353001329,
 'longitude': 0.7866857551896922,
 'minimum_nights': 0.7855608957971163,
 'number_of_reviews': 0.7864812353001329,
 'reviews_per_month': 0.7866857551896922,
 'calculated_host_listings_count': 0.7864812353001329,
 'availability_365': 0.7850495960732181}

In [23]:
feature_selection_df = pd.DataFrame(feature_selection_dict, index=['accuracy']).T
(acc - feature_selection_df.accuracy).abs().sort_values(ascending=True)

minimum_nights                    0.000000
availability_365                  0.000511
latitude                          0.000920
number_of_reviews                 0.000920
calculated_host_listings_count    0.000920
longitude                         0.001125
reviews_per_month                 0.001125
neighbourhood_group               0.038859
room_type                         0.067798
Name: accuracy, dtype: float64
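
The question only asks about four candidate features, so the ranking can be restricted to them. This is a minimal sketch that reuses the absolute differences printed above (values copied from that output):

```python
import pandas as pd

# Absolute accuracy differences for the four candidates, copied from the output above.
diffs = pd.Series({
    "neighbourhood_group": 0.038859,
    "room_type": 0.067798,
    "number_of_reviews": 0.000920,
    "reviews_per_month": 0.001125,
})

# The feature whose removal changes validation accuracy the least.
answer = diffs.idxmin()
print(answer, diffs[answer])
```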

## Question 6
- For this question, we'll see how to use a linear regression model from Scikit-Learn
- We'll need to use the original column 'price'. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model on the training data.
- This model has a parameter alpha. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
- Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest alpha.

In [24]:
Y_train_log = np.log1p(Y_train)
Y_test_log = np.log1p(Y_test)
Y_val_log = np.log1p(Y_val)

In [29]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

alpha_values = [0, 0.01, 0.1, 1, 10]
score_dict = {'alpha': [], 'val_score': []}
for alpha in alpha_values:
    model = Ridge(alpha=alpha, random_state=42)
    model.fit(X_train_processed, Y_train_log)
    yhat = model.predict(X_val_processed)
    score_dict['alpha'].append(alpha)
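    # RMSE as the square root of MSE; the `squared=False` shortcut is
    # deprecated in recent scikit-learn releases.
    score_dict['val_score'].append(np.sqrt(mean_squared_error(Y_val_log, yhat)))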

In [30]:
score_df = pd.DataFrame(score_dict)
score_df['val_score'] = score_df['val_score'].round(3)
# Lowest RMSE first; ties are broken by the smallest alpha.
score_df.sort_values(by=['val_score', 'alpha'])

| | alpha | val_score |
|---|---|---|
| 0 | 0.0 | 0.492 |
| 1 | 0.01 | 0.492 |
| 2 | 0.1 | 0.492 |
| 3 | 1.0 | 0.492 |
| 4 | 10.0 | 0.492 |
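
All five rounded scores tie, so the tie-breaking rule decides the answer. Here is a minimal sketch of that rule, with the rounded values copied from the table above: keep the rows with the lowest RMSE, then take the smallest alpha among them.

```python
import pandas as pd

# Rounded validation RMSE per alpha, copied from the output above.
scores = pd.DataFrame({
    "alpha": [0, 0.01, 0.1, 1, 10],
    "val_score": [0.492, 0.492, 0.492, 0.492, 0.492],
})

# Rows that achieve the best (lowest) RMSE, then the smallest alpha among them.
best_rows = scores[scores["val_score"] == scores["val_score"].min()]
best_alpha = best_rows["alpha"].min()
print(best_alpha)
```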
