In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Features

For the rest of the homework, you'll need the features from the previous homework plus two additional ones: `neighbourhood_group` and `room_type`. The full feature set is:

* `neighbourhood_group`,
* `room_type`,
* `latitude`,
* `longitude`,
* `price`,
* `minimum_nights`,
* `number_of_reviews`,
* `reviews_per_month`,
* `calculated_host_listings_count`,
* `availability_365`

Select only them and fill in the missing values with 0.

In [2]:
cols = ['neighbourhood_group', 'room_type', 'latitude', 'longitude', 'price', 'minimum_nights', 
         'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
df = pd.read_csv('nyc_airbnb.csv', usecols=cols)
df.head()

,neighbourhood_group,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Manhattan,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,Manhattan,40.80902,-73.9419,Private room,150,3,0,,1,365
3,Brooklyn,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Manhattan,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0


In [3]:
df.isnull().sum()

neighbourhood_group                   0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [4]:
df = df.fillna(0)
df.isnull().sum()

neighbourhood_group               0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

## Question 1

What is the most frequent observation (mode) for the column 'neighbourhood_group'?

In [5]:
df.neighbourhood_group.mode()

0    Manhattan
dtype: object

### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`price`) is not in your dataframe.

In [6]:
# Binarize the target: 1 if the price is at least 152, otherwise 0
above_average = df.price >= 152
df.price = above_average.astype(int)

In [7]:
df.head()

,neighbourhood_group,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,40.64749,-73.97237,Private room,0,1,9,0.21,6,365
1,Manhattan,40.75362,-73.98377,Entire home/apt,1,1,45,0.38,2,355
2,Manhattan,40.80902,-73.9419,Private room,0,3,0,0.0,1,365
3,Brooklyn,40.68514,-73.95976,Entire home/apt,0,1,270,4.64,1,194
4,Manhattan,40.79851,-73.94399,Entire home/apt,0,10,9,0.1,1,0


In [8]:
df.price.value_counts()

0    33992
1    14903
Name: price, dtype: int64

In [9]:
df_full_train, df_test = train_test_split(df, test_size = 0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [10]:
len(df_train), len(df_val), len(df_test)

(29337, 9779, 9779)

In [11]:
len(df_val) / (len(df_train) + len(df_val) + len(df_test))

0.2
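
The two-step split gives 60%/20%/20% because 25% of the remaining 80% equals 20% of the whole. A minimal sketch of that arithmetic, using a plain index array as a stand-in for the dataframe (the row count 48895 is taken from the `value_counts()` output above; the variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataframe rows: one integer per listing.
rows = np.arange(48895)

# First hold out 20% of everything as the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Then hold out 25% of the remaining 80%, which is 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)  # (29337, 9779, 9779)
```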

In [12]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [13]:
y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

In [14]:
del df_train['price']
del df_val['price']
del df_test['price']

## Question 2

* Create the correlation matrix for the numerical features of your train dataset.
    * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?


In [15]:
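# Restrict to numeric columns; recent pandas versions raise an error when corr() meets string columns
df_corr = df_train.select_dtypes(include='number').corr()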
df_corr

,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
latitude,1.0,0.080301,0.027441,-0.006246,-0.007159,0.019375,-0.005891
longitude,0.080301,1.0,-0.06066,0.055084,0.134642,-0.117041,0.083666
minimum_nights,0.027441,-0.06066,1.0,-0.07602,-0.120703,0.118647,0.138901
number_of_reviews,-0.006246,0.055084,-0.07602,1.0,0.590374,-0.073167,0.174477
reviews_per_month,-0.007159,0.134642,-0.120703,0.590374,1.0,-0.048767,0.165376
calculated_host_listings_count,0.019375,-0.117041,0.118647,-0.073167,-0.048767,1.0,0.225913
availability_365,-0.005891,0.083666,0.138901,0.174477,0.165376,0.225913,1.0
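
One way to read the answer off the matrix programmatically is to keep only the upper triangle (so the diagonal and mirrored pairs are dropped) and take the entry with the largest absolute value. The sketch below rebuilds the matrix from the values printed above so that it runs on its own; in the notebook, `df_corr` could presumably be used in place of `corr`.

```python
import numpy as np
import pandas as pd

cols = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
        'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

# Correlation values copied from the output above.
values = [
    [1.0, 0.080301, 0.027441, -0.006246, -0.007159, 0.019375, -0.005891],
    [0.080301, 1.0, -0.06066, 0.055084, 0.134642, -0.117041, 0.083666],
    [0.027441, -0.06066, 1.0, -0.07602, -0.120703, 0.118647, 0.138901],
    [-0.006246, 0.055084, -0.07602, 1.0, 0.590374, -0.073167, 0.174477],
    [-0.007159, 0.134642, -0.120703, 0.590374, 1.0, -0.048767, 0.165376],
    [0.019375, -0.117041, 0.118647, -0.073167, -0.048767, 1.0, 0.225913],
    [-0.005891, 0.083666, 0.138901, 0.174477, 0.165376, 0.225913, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Strict upper triangle: each pair appears once and the diagonal of 1.0s is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest correlation in absolute value.
top_pair = pairs.abs().idxmax()
top_value = pairs.loc[top_pair]
print(top_pair, top_value)  # ('number_of_reviews', 'reviews_per_month') 0.590374
```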


### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is 1 if the price is above (or equal to) 152.

## Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using `round(score, 2)`

In [16]:
df_full_train.dtypes

neighbourhood_group                object
latitude                          float64
longitude                         float64
room_type                          object
price                               int32
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [17]:
df_full_train.columns

Index(['neighbourhood_group', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')

In [18]:
df_full_train.head()

,neighbourhood_group,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
32645,Brooklyn,40.71577,-73.9553,Entire home/apt,1,3,11,0.87,1,1
23615,Manhattan,40.84917,-73.94048,Private room,0,2,2,0.16,1,0
31183,Brooklyn,40.68993,-73.95947,Private room,0,2,0,0.0,2,0
29260,Brooklyn,40.68427,-73.93118,Entire home/apt,0,3,87,4.91,1,267
7275,Queens,40.74705,-73.89564,Private room,0,5,13,0.25,1,0


In [19]:
from sklearn.metrics import mutual_info_score

In [20]:
categorical = ['neighbourhood_group', 'room_type']

In [21]:
numerical = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
            'calculated_host_listings_count', 'availability_365']

In [22]:
mutual_info_score(df_full_train.neighbourhood_group, df_full_train.price)

0.0462226506346477

In [23]:
mutual_info_score(df_full_train.room_type, df_full_train.price)

0.14238980766429526

In [24]:
def mutual_info_price_score(series):
    return mutual_info_score(series, df_full_train.price)

In [25]:
df_full_train[categorical].apply(mutual_info_price_score).round(2)

neighbourhood_group    0.05
room_type              0.14
dtype: float64
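
The question asks for the training set only, whereas the cells above score against `df_full_train`, which still includes the validation rows. Since `price` was deleted from `df_train`, one option is to pass the separate target array as the second argument to `mutual_info_score`. The sketch below shows the pattern on a tiny made-up frame; `toy_train` and `toy_y` are hypothetical stand-ins for `df_train[categorical]` and `y_train`, and the numbers are not Airbnb results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy stand-ins for df_train[categorical] and y_train (made-up values).
toy_train = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Brooklyn', 'Manhattan'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room', 'Private room'],
})
toy_y = np.array([1, 1, 0, 0])


def mutual_info_with_target(series):
    # The target lives in a separate array, so it is passed explicitly.
    return mutual_info_score(series, toy_y)


scores = toy_train.apply(mutual_info_with_target).round(2)
print(scores)
# In this toy data room_type determines the target exactly (score ln 2, about 0.69),
# while neighbourhood_group is independent of it (score 0.0).
```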

## Question 4

* Now let's train a logistic regression model.
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
    * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [26]:
from sklearn.feature_extraction import DictVectorizer

In [27]:
train_dicts = df_train[categorical + numerical].to_dict(orient='records')

In [28]:
dv = DictVectorizer(sparse=False)

In [29]:
dv.fit(train_dicts)

DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
               sparse=False)

In [30]:
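# get_feature_names() was removed in scikit-learn 1.2; newer versions provide dv.get_feature_names_out()
dv.get_feature_names()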

['availability_365',
 'calculated_host_listings_count',
 'latitude',
 'longitude',
 'minimum_nights',
 'neighbourhood_group=Bronx',
 'neighbourhood_group=Brooklyn',
 'neighbourhood_group=Manhattan',
 'neighbourhood_group=Queens',
 'neighbourhood_group=Staten Island',
 'number_of_reviews',
 'reviews_per_month',
 'room_type=Entire home/apt',
 'room_type=Private room',
 'room_type=Shared room']
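
To see what `DictVectorizer` does with mixed records, here is a small sketch on made-up rows: numeric values pass through unchanged, while each string value becomes its own 0/1 column named `feature=value`. It reads the `feature_names_` attribute, which avoids the method rename between scikit-learn versions.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records with one categorical and one numerical feature.
records = [
    {'room_type': 'Private room', 'minimum_nights': 1},
    {'room_type': 'Entire home/apt', 'minimum_nights': 3},
]

dv_toy = DictVectorizer(sparse=False)
X_toy = dv_toy.fit_transform(records)

print(dv_toy.feature_names_)
# ['minimum_nights', 'room_type=Entire home/apt', 'room_type=Private room']
print(X_toy)
# [[1. 0. 1.]
#  [3. 1. 0.]]

# A category not seen during fit is ignored, so all room_type columns stay 0.
X_new = dv_toy.transform([{'room_type': 'Shared room', 'minimum_nights': 2}])
print(X_new)  # [[2. 0. 0.]]
```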

In [31]:
X_train = dv.transform(train_dicts)

In [32]:
val_dicts = df_val[categorical + numerical].to_dict(orient='records')

In [33]:
X_val = dv.transform(val_dicts)

In [34]:
from sklearn.linear_model import LogisticRegression

In [35]:
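# C=1.0 is the default, stated here to match the assignment; max_iter is raised above the default of 100 to give lbfgs more iterations
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42, max_iter=900)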

In [36]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=900,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [37]:
model.predict_proba(X_train)

array([[0.68041785, 0.31958215],
       [0.86002854, 0.13997146],
       [0.89521642, 0.10478358],
       ...,
       [0.90346622, 0.09653378],
       [0.98215859, 0.01784141],
       [0.39533994, 0.60466006]])

In [38]:
y_pred = model.predict_proba(X_val)[:, 1]

In [39]:
y_pred

array([0.02886042, 0.59559439, 0.42611385, ..., 0.11399177, 0.03462159,
       0.52933489])

In [40]:
price_decision = (y_pred > 0.5)

In [41]:
df_val[price_decision]

,neighbourhood_group,latitude,longitude,room_type,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
1,Brooklyn,40.68498,-73.96618,Entire home/apt,14,4,0.11,2,343
4,Manhattan,40.76075,-73.99893,Entire home/apt,30,0,0.00,18,365
6,Manhattan,40.73243,-74.00932,Entire home/apt,4,1,0.03,1,0
7,Manhattan,40.80630,-73.96268,Entire home/apt,7,2,0.05,1,0
9,Manhattan,40.80861,-73.94574,Entire home/apt,3,123,1.73,3,248
10,Manhattan,40.80549,-73.96393,Entire home/apt,3,10,0.38,1,0
12,Brooklyn,40.62795,-73.96281,Entire home/apt,3,6,0.71,3,173
13,Manhattan,40.77822,-73.98016,Entire home/apt,1,0,0.00,1,0
14,Manhattan,40.79633,-73.94819,Entire home/apt,30,3,0.16,1,212
19,Manhattan,40.77359,-73.94865,Entire home/apt,1,1,0.25,5,365


In [42]:
y_val

array([0, 0, 1, ..., 0, 0, 0])

In [43]:
price_decision.astype(int)

array([0, 1, 0, ..., 0, 0, 1])

In [44]:
(y_val == price_decision).mean()

0.790878412925657

In [45]:
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = price_decision.astype(int)
df_pred['actual'] = y_val

In [46]:
df_pred.head()

,probability,prediction,actual
0,0.02886,0,0
1,0.595594,1,0
2,0.426114,0,1
3,0.075114,0,0
4,0.811493,1,1


In [47]:
df_pred['correct'] = df_pred.prediction == df_pred.actual

In [48]:
df_pred.head()

,probability,prediction,actual,correct
0,0.02886,0,0,True
1,0.595594,1,0,False
2,0.426114,0,1,False
3,0.075114,0,0,True
4,0.811493,1,1,True


In [49]:
round(df_pred.correct.mean(), 2)

0.79
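
The manual accuracy calculation above should agree with Scikit-Learn's `accuracy_score`. A short sketch using only the five rows shown in `df_pred.head()` (so the result describes these five rows, not the full validation set):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Probabilities and actual labels copied from the first five rows of df_pred.
y_prob = np.array([0.02886, 0.595594, 0.426114, 0.075114, 0.811493])
y_true = np.array([0, 0, 1, 0, 1])

# Same 0.5 threshold as in the notebook.
y_hat = (y_prob > 0.5).astype(int)

manual = (y_true == y_hat).mean()          # share of matching predictions
library = accuracy_score(y_true, y_hat)    # same quantity via Scikit-Learn

print(manual, library)  # 0.6 0.6 (three of the five rows are correct)
```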

## Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the feature elimination technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
    * `neighbourhood_group`
    * `room_type`
    * `number_of_reviews`
    * `reviews_per_month`
    
Note: the difference doesn't have to be positive.
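
The elimination loop described above can be sketched as follows. This is a minimal illustration on synthetic data, not the Airbnb dataset: the helper `fit_and_score`, the `toy` frame, and its four columns are all hypothetical. Reading "smallest difference" as smallest in absolute value is also an assumption. In the notebook, `df_train`, `y_train`, `df_val`, `y_val` and `categorical + numerical` would presumably take the place of the toy objects.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(df_tr, y_tr, df_va, y_va, features):
    """Train on the given feature subset and return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[features].to_dict(orient='records'))
    X_va = dv.transform(df_va[features].to_dict(orient='records'))
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42, max_iter=900)
    model.fit(X_tr, y_tr)
    return (model.predict(X_va) == y_va).mean()


# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    'room_type': rng.choice(['Entire home/apt', 'Private room'], size=n),
    'neighbourhood_group': rng.choice(['Brooklyn', 'Manhattan'], size=n),
    'number_of_reviews': rng.integers(0, 50, size=n),
    'reviews_per_month': rng.random(n),
})
# The target follows room_type, with roughly 10% of the labels flipped as noise.
flipped = rng.random(n) < 0.1
target = ((toy.room_type == 'Entire home/apt') ^ flipped).astype(int).values

df_tr, df_va, y_tr, y_va = train_test_split(toy, target, test_size=0.25, random_state=42)

features = list(toy.columns)
baseline = fit_and_score(df_tr, y_tr, df_va, y_va, features)

# Drop one feature at a time and record how much the accuracy changes.
differences = {}
for feature in features:
    subset = [f for f in features if f != feature]
    differences[feature] = baseline - fit_and_score(df_tr, y_tr, df_va, y_va, subset)

# Assumption: "smallest" is taken to mean smallest in absolute value.
least_useful = min(differences, key=lambda f: abs(differences[f]))
print(baseline, differences, least_useful)
```

A difference can be negative when removing a feature happens to improve validation accuracy, which is why the absolute value is compared here.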