# Machine Learning

Now that all the variables have been identified, it is time to build models that predict whether a listing represents a good deal. But first, let's manually label the listings that meet a 'good deal' criterion.

## Preparing data for Machine Learning

After reading the listings file, which now includes a sentiment score for each listing, I add a new column with the average price per neighbourhood. On top of being rated in the top 25% in each category, I want a place to be priced no higher than its neighbourhood average. This will eliminate luxury places that might be perfect in every sense but are completely unaffordable.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# read listing data base with added sentiment score
df = pd.read_csv('entire_home_listings_sentiment.csv', header=0, parse_dates=['host_since'])
df.drop('Unnamed: 0', axis=1, inplace=True)

In [3]:
# add average per neighbourhood price column
mean_price_neigh = df.groupby('neighbourhood')['price'].mean().reset_index()
mean_price_neigh.columns = ['neighbourhood', 'mean_price']
df_1 = df.merge(mean_price_neigh, on='neighbourhood')

In [4]:
# define a function to mark each listing whether it represents a good deal or not
def good_deal(row, df=df_1):
    """Identify if listing represents a good deal"""
    if row['sentiment'] >= df.sentiment.quantile(0.75):
        if row['price'] <= row['mean_price']:
            if row['review_scores_rating'] >= df.review_scores_rating.quantile(0.75):
                if row['review_scores_accuracy'] >= df.review_scores_accuracy.quantile(0.75):
                    if row['review_scores_cleanliness'] >= df.review_scores_cleanliness.quantile(0.75):
                        if row['review_scores_checkin'] >= df.review_scores_checkin.quantile(0.75):
                            if row['review_scores_communication'] >= df.review_scores_communication.quantile(0.75):
                                if row['review_scores_location'] >= df.review_scores_location.quantile(0.75):
                                    if row['review_scores_value'] >= df.review_scores_value.quantile(0.75):
                                        return 1
    return 0

In [5]:
# for each listing apply good_deal function
a = df_1.apply(good_deal, axis=1).tolist()

In [6]:
print('Out of {} listings there are {} good deals.'.format(len(df_1), np.sum(a)))

Out of 25991 listings there are 1239 good deals.


In [7]:
# add a column to the database
df_1['good_deal'] = a
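
The row-by-row function above recomputes every quantile for every listing. As a minimal sketch of the same labelling rule, the thresholds can be computed once and compared column-wise. The helper name `label_good_deals` and the toy data are made up for illustration; only two score columns are used to keep the example short.

```python
import pandas as pd

def label_good_deals(listings, score_cols, q=0.75):
    """Return a 0/1 Series: 1 where every score is in the top quartile
    and the price does not exceed the neighbourhood mean."""
    # Compute each threshold once instead of once per row.
    thresholds = listings[score_cols].quantile(q)
    top_scores = (listings[score_cols] >= thresholds).all(axis=1)
    affordable = listings['price'] <= listings['mean_price']
    return (top_scores & affordable).astype(int)

# Toy data (made up for illustration, not taken from the listings file).
toy = pd.DataFrame({
    'sentiment':            [0.9, 0.2, 0.9, 0.5],
    'review_scores_rating': [100, 80, 100, 90],
    'price':                [50, 40, 300, 60],
    'mean_price':           [100, 100, 100, 100],
})
toy['good_deal'] = label_good_deals(toy, ['sentiment', 'review_scores_rating'])
print(toy)
```

Only the first toy listing is labelled 1: the third one has top scores but is priced above its neighbourhood mean.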

Now that the listings have been manually labeled, it's time to separate the data into independent predictors and a target variable. From exploratory data analysis, I found that the variables below influence occupancy rate, which means they might be among the deciding factors for renters choosing an apartment:
- host_response_rate
- host_is_superhost
- accommodates
- host_response_time
- cancellation_policy
- instant_bookable
- all review_scores variables

Most of these variables are continuous and can be added as is. However, the categorical variables need some transformation.

I am also including price, occupancy_rate and sentiment in the list of predictors.

In [8]:
# start building set of independent variables. First take out all numerical variables I will include
X = df_1.loc[:,['host_response_rate', 'host_is_superhost', 'accommodates', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'price', 'occupancy_rate', 'sentiment']]

In [9]:
# transform categorical variables
x = pd.get_dummies(df_1.host_response_time).drop('unknown', axis=1)
x.columns = ['rt_few_days', 'rt_day', 'rt_few_hours', 'rt_hour']
X = X.join(x)

x1 = df_1.instant_bookable.replace({'f':0, 't':1})
X = X.join(x1)

x2 = pd.get_dummies(df_1.cancellation_policy).drop('moderate', axis=1)
x2.columns = ['cp_flexible', 'cp_strict']
X = X.join(x2)
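
Renaming dummy columns by position, as in the cell above, relies on `get_dummies` returning the categories in sorted order. A possible alternative is to map each label to its short name explicitly. The category labels below are assumptions about what `host_response_time` contains; the real values may differ.

```python
import pandas as pd

# Hypothetical category labels; the real values in host_response_time may differ.
response_time = pd.Series(
    ['within an hour', 'unknown', 'within a day', 'a few days or more',
     'within a few hours', 'within an hour'],
    name='host_response_time')

# Map each label to its short column name explicitly.
short_names = {
    'a few days or more': 'rt_few_days',
    'within a day': 'rt_day',
    'within a few hours': 'rt_few_hours',
    'within an hour': 'rt_hour',
}
dummies = (pd.get_dummies(response_time)
             .drop(columns='unknown')      # 'unknown' becomes the reference level
             .rename(columns=short_names)
             .astype(int))
print(dummies)
```

With an explicit mapping, a new or misspelled category shows up as an unrenamed column instead of silently shifting the names.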

In [10]:
# build a target variable
Y = df_1.good_deal

Some additional transformations will be applied to the predictors before building models. I will transform price into log-price and scale all numerical variables so that each has a mean of 0 and a variance of 1.

In [11]:
# change price to log-price
X.price = np.log(X.price)

In [12]:
numerical = ['host_response_rate', 'accommodates',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'price', 'occupancy_rate', 'sentiment']
categorical = ['host_is_superhost', 'rt_few_days', 'rt_day', 'rt_few_hours', 'rt_hour', 'instant_bookable',
       'cp_flexible', 'cp_strict']

In [13]:
# scale numerical data
from sklearn.preprocessing import StandardScaler
num = pd.DataFrame(StandardScaler().fit_transform(X[numerical]),
                   columns=numerical, index=X.index)

In [14]:
X = num.join(X[categorical])

In [15]:
# save variables to files
X.to_csv('independent_variables.csv')
pd.DataFrame(Y).to_csv('dependent_variable.csv')

## Machine Learning

Since this is a classification problem, I will try a few classic approaches first. For all of the models I will split the data into train and test sets. The test set will be reserved for evaluating the model at the end. All training and resampling will be done on the train set.

In [16]:
# split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
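
In this notebook the scaler was fitted on the full data set before the split. A minimal sketch of the stricter alternative, where the scaling parameters are learned from the training rows only, could look like the following. The data is synthetic and the names `X_demo`, `y_demo`, `X_tr_s` and `X_te_s` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical predictors (not the real listings).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({'price': rng.lognormal(4.5, 0.5, 200),
                       'accommodates': rng.integers(1, 9, 200)})
y_demo = pd.Series(rng.integers(0, 2, 200), name='good_deal')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Learn mean and variance from the training rows only ...
scaler = StandardScaler().fit(X_tr)
# ... then apply the same transformation to both sets, keeping the index.
X_tr_s = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_s.mean().round(3))  # approximately 0 on the training set
```

The test set is then scaled with the training statistics, so its mean is close to, but not exactly, 0.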

## Logistic regression

In [17]:
# create simple logistic regression classifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
lgr = LogisticRegression(solver='lbfgs')
lgr.fit(X_train, y_train)
y_lgr_test = lgr.predict(X_test)
y_lgr_train = lgr.predict(X_train)

In [18]:
# check accuracy score
print('Accuracy score on test data: {}'.format(metrics.accuracy_score(y_test,y_lgr_test)))
print('Accuracy score on train data: {}'.format(metrics.accuracy_score(y_train,y_lgr_train)))

Accuracy score on test data: 0.9744806360605284
Accuracy score on train data: 0.9695487275325675


It looks like classic logistic regression delivered amazing accuracy scores. Is that really the case? Let's check some more metrics.

In [19]:
print('Confusion matrix for test data: \n', metrics.confusion_matrix(y_test,y_lgr_test))
print('Confusion matrix for train data: \n', metrics.confusion_matrix(y_train,y_lgr_train))
print('Report for test data: \n', metrics.classification_report(y_test,y_lgr_test))
print('Report for train data: \n', metrics.classification_report(y_train,y_lgr_train))

Confusion matrix for test data: 
 [[7379   67]
 [ 132  220]]
Confusion matrix for train data: 
 [[17104   202]
 [  352   535]]
Report for test data: 
               precision    recall  f1-score   support

           0       0.98      0.99      0.99      7446
           1       0.77      0.62      0.69       352

    accuracy                           0.97      7798
   macro avg       0.87      0.81      0.84      7798
weighted avg       0.97      0.97      0.97      7798

Report for train data: 
               precision    recall  f1-score   support

           0       0.98      0.99      0.98     17306
           1       0.73      0.60      0.66       887

    accuracy                           0.97     18193
   macro avg       0.85      0.80      0.82     18193
weighted avg       0.97      0.97      0.97     18193



It looks like I am dealing with imbalanced data here: only 352 of the 7,798 test listings (about 4.5%) are good deals, and 132 of them (about 38%) were classified as not representing a good deal. Some further manipulation of the training data is needed to counter the effects of the imbalance.
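
To see why accuracy alone is misleading here, the metrics can be recomputed by hand from the test confusion matrix printed above. The last line is the accuracy of a trivial model that never predicts a good deal; it is included only as a point of comparison.

```python
# Test-set confusion matrix reported above: rows = actual, columns = predicted.
tn, fp = 7379, 67
fn, tp = 132, 220

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # share of predicted good deals that are real
recall = tp / (tp + fn)             # share of real good deals that were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that never predicts a good deal.
majority_baseline = (tn + fp) / (tp + tn + fp + fn)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}')
```

The trivial baseline already reaches about 95% accuracy, so the logistic model's 97% is a much smaller gain than it first appears.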

I will also create a dataframe that will contain results for each classifier. It will help me to compare them later on.

In [20]:
# create a dictionary first
a = {'classifier_name':'Logistic', 'score':metrics.accuracy_score(y_test, y_lgr_test), 
     'f1_score': metrics.f1_score(y_test, y_lgr_test), 'precision': metrics.precision_score(y_test, y_lgr_test),
    'recall': metrics.recall_score(y_test, y_lgr_test)}
cls_comparison = pd.DataFrame([a])
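
Because the same four metrics are collected for every classifier, the dictionary could be built by a small helper. This is only a sketch: `score_row` is a hypothetical name and the labels are toy values.

```python
import pandas as pd
from sklearn import metrics

def score_row(name, y_true, y_pred):
    """Collect the four test metrics for one classifier in a dictionary."""
    return {'classifier_name': name,
            'score': metrics.accuracy_score(y_true, y_pred),
            'f1_score': metrics.f1_score(y_true, y_pred),
            'precision': metrics.precision_score(y_true, y_pred),
            'recall': metrics.recall_score(y_true, y_pred)}

# Toy labels (made up) to show the call pattern.
rows = [score_row('always_one', [0, 0, 1, 1], [1, 1, 1, 1]),
        score_row('one_mistake', [0, 0, 1, 1], [0, 1, 1, 1])]
demo_comparison = pd.DataFrame(rows)
print(demo_comparison)
```

Collecting the rows in a list and building the dataframe once also avoids the repeated index values that come from concatenating one-row dataframes.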

## Imbalanced data

There are a few techniques that can help to deal with imbalanced data:
- Undersampling - evening out the number of samples in each class by randomly reducing the number of samples in the majority class
- Oversampling - evening out the number of samples in each class by randomly sampling the minority class with replacement until it has as many samples as the majority class
- SMOTE - Synthetic Minority Over-sampling Technique - synthesises new minority instances between existing (real) minority instances

### Undersampling

I will apply this technique manually by sampling from the majority class to get as many samples as the minority class contains.

In [21]:
# combine predictor and target variables
all_imb = X_train.join(y_train)
class2 = all_imb[all_imb.good_deal == 1]
n = len(class2)
# select random samples from majority class
class1 = all_imb[all_imb.good_deal==0].sample(n)
all_bal = pd.concat([class1,class2])
X_train_under = all_bal.iloc[:,:-1]
y_train_under = all_bal.loc[:,'good_deal']

In [22]:
# train logistic classifier with balanced data
lgr1 = LogisticRegression(solver='lbfgs')
lgr1.fit(X_train_under, y_train_under)
y_lgu_test = lgr1.predict(X_test)
y_lgu_train = lgr1.predict(X_train_under)

In [23]:
# print new scores
print('Accuracy score on test data: {}'.format(metrics.accuracy_score(y_test,y_lgu_test)))
print('Accuracy score on train data: {}'.format(metrics.accuracy_score(y_train_under,y_lgu_train)))
print('Confusion matrix for test data: \n', metrics.confusion_matrix(y_test,y_lgu_test))
print('Confusion matrix for train data: \n', metrics.confusion_matrix(y_train_under,y_lgu_train))
print('Report for test data: \n', metrics.classification_report(y_test,y_lgu_test))
print('Report for train data: \n', metrics.classification_report(y_train_under,y_lgu_train))

Accuracy score on test data: 0.9157476275968197
Accuracy score on train data: 0.9526493799323562
Confusion matrix for test data: 
 [[6792  654]
 [   3  349]]
Confusion matrix for train data: 
 [[804  83]
 [  1 886]]
Report for test data: 
               precision    recall  f1-score   support

           0       1.00      0.91      0.95      7446
           1       0.35      0.99      0.52       352

    accuracy                           0.92      7798
   macro avg       0.67      0.95      0.73      7798
weighted avg       0.97      0.92      0.93      7798

Report for train data: 
               precision    recall  f1-score   support

           0       1.00      0.91      0.95       887
           1       0.91      1.00      0.95       887

    accuracy                           0.95      1774
   macro avg       0.96      0.95      0.95      1774
weighted avg       0.96      0.95      0.95      1774



In [24]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_under', 'score':metrics.accuracy_score(y_test, y_lgu_test), 
     'f1_score': metrics.f1_score(y_test, y_lgu_test), 'precision': metrics.precision_score(y_test, y_lgu_test),
    'recall': metrics.recall_score(y_test, y_lgu_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

In [25]:
cls_comparison

|   | classifier_name | f1_score | precision | recall | score |
|---|---|---|---|---|---|
| 0 | Logistic | 0.688576 | 0.766551 | 0.625 | 0.974481 |
| 0 | Logistic_under | 0.515129 | 0.347956 | 0.991477 | 0.915748 |


Undersampling significantly improved the recall score (about 99% of good deals were correctly identified), but the number of listings incorrectly labeled as good deals rose from 67 to 654, which lowered precision as well as overall accuracy and the F1 score.

Let's check how oversampling will change classifier performance.

## Oversampling

To oversample the training data I will use the imbalanced-learn Python package, which provides several approaches for rebalancing imbalanced data.

RandomOverSampler performs regular oversampling: it evens out the number of samples in each class by randomly sampling the minority class with replacement until it matches the majority class.

In [26]:
# balance training data
from imblearn import over_sampling
ros = over_sampling.RandomOverSampler(random_state=42)
X_train_b, y_train_b = ros.fit_resample(X_train, y_train)
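
To illustrate what random over-sampling does, here is a pandas-only sketch on a toy frame. It is not the imbalanced-learn implementation, and the frame and variable names are made up for the example.

```python
import pandas as pd

# Toy training frame (made up): 6 majority rows and 2 minority rows.
train = pd.DataFrame({'x': range(8),
                      'good_deal': [0, 0, 0, 0, 0, 0, 1, 1]})

majority = train[train.good_deal == 0]
minority = train[train.good_deal == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up], ignore_index=True)

print(balanced.good_deal.value_counts())
```

Every added row is an exact copy of an existing minority row, which is why this approach can encourage overfitting to the few real minority examples.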



In [27]:
# train model with oversampled data
lgr2 = LogisticRegression(solver='lbfgs')
lgr2.fit(X_train_b, y_train_b)
y_lgo_test = lgr2.predict(X_test)
y_lgo_train = lgr2.predict(X_train_b)

In [28]:
print('Accuracy score on test data: {}'.format(metrics.accuracy_score(y_test,y_lgo_test)))
print('Accuracy score on train data: {}'.format(metrics.accuracy_score(y_train_b,y_lgo_train)))
print('Confusion matrix for test data: \n', metrics.confusion_matrix(y_test,y_lgo_test))
print('Confusion matrix for train data: \n', metrics.confusion_matrix(y_train_b,y_lgo_train))
print('Report for test data: \n', metrics.classification_report(y_test,y_lgo_test))
print('Report for train data: \n', metrics.classification_report(y_train_b,y_lgo_train))

Accuracy score on test data: 0.9353680430879713
Accuracy score on train data: 0.9624696637004507
Confusion matrix for test data: 
 [[6948  498]
 [   6  346]]
Confusion matrix for train data: 
 [[16152  1154]
 [  145 17161]]
Report for test data: 
               precision    recall  f1-score   support

           0       1.00      0.93      0.96      7446
           1       0.41      0.98      0.58       352

    accuracy                           0.94      7798
   macro avg       0.70      0.96      0.77      7798
weighted avg       0.97      0.94      0.95      7798

Report for train data: 
               precision    recall  f1-score   support

           0       0.99      0.93      0.96     17306
           1       0.94      0.99      0.96     17306

    accuracy                           0.96     34612
   macro avg       0.96      0.96      0.96     34612
weighted avg       0.96      0.96      0.96     34612



In [29]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_over', 'score':metrics.accuracy_score(y_test, y_lgo_test), 
     'f1_score': metrics.f1_score(y_test, y_lgo_test), 'precision': metrics.precision_score(y_test, y_lgo_test),
    'recall': metrics.recall_score(y_test, y_lgo_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

In [30]:
cls_comparison

|   | classifier_name | f1_score | precision | recall | score |
|---|---|---|---|---|---|
| 0 | Logistic | 0.688576 | 0.766551 | 0.625 | 0.974481 |
| 0 | Logistic_under | 0.515129 | 0.347956 | 0.991477 | 0.915748 |
| 0 | Logistic_over | 0.578595 | 0.409953 | 0.982955 | 0.935368 |


Oversampling shows the same pattern: recall goes up, but precision and F1 fall below the results obtained from the imbalanced data. Let's now use the SMOTE technique to balance the data.

## SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is one of the most commonly used oversampling methods for the imbalance problem.
It balances the class distribution by adding new minority class examples; unlike random oversampling, it does not replicate existing ones.
Instead, SMOTE synthesises new minority instances between existing minority instances: for each selected minority example it picks one of its k nearest minority neighbours at random and creates a synthetic record by linear interpolation between the two.
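
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name `smote_like` and the toy points are assumptions made for the example.

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)               # cannot have more neighbours than points
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)         # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    lam = rng.random((n_new, 1))                               # position on the segment
    return minority[base] + lam * (minority[picked] - minority[base])

# Toy minority class (made up): the four corners of a unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(corners, n_new=10, k=2)
print(synthetic.round(2))
```

With `k=2` each corner's neighbours are the two adjacent corners, so every synthetic point falls on an edge of the square, somewhere between two real points.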

In [31]:
# balance train data using SMOTE
sm = over_sampling.SMOTE(random_state=42)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)



In [32]:
# train classifier with SMOTE balanced data
lgr3 = LogisticRegression(solver='lbfgs')
lgr3.fit(X_train_smote, y_train_smote)
y_lgs_test = lgr3.predict(X_test)
y_lgs_train = lgr3.predict(X_train)

In [33]:
print('Accuracy score on test data: {}'.format(metrics.accuracy_score(y_test,y_lgs_test)))
print('Accuracy score on train data: {}'.format(metrics.accuracy_score(y_train,y_lgs_train)))
print('Confusion matrix for test data: \n', metrics.confusion_matrix(y_test,y_lgs_test))
print('Confusion matrix for train data: \n', metrics.confusion_matrix(y_train,y_lgs_train))
print('Report for test data: \n', metrics.classification_report(y_test,y_lgs_test))
print('Report for train data: \n', metrics.classification_report(y_train,y_lgs_train))

Accuracy score on test data: 0.9425493716337523
Accuracy score on train data: 0.9426702577914583
Confusion matrix for test data: 
 [[7007  439]
 [   9  343]]
Confusion matrix for train data: 
 [[16284  1022]
 [   21   866]]
Report for test data: 
               precision    recall  f1-score   support

           0       1.00      0.94      0.97      7446
           1       0.44      0.97      0.60       352

    accuracy                           0.94      7798
   macro avg       0.72      0.96      0.79      7798
weighted avg       0.97      0.94      0.95      7798

Report for train data: 
               precision    recall  f1-score   support

           0       1.00      0.94      0.97     17306
           1       0.46      0.98      0.62       887

    accuracy                           0.94     18193
   macro avg       0.73      0.96      0.80     18193
weighted avg       0.97      0.94      0.95     18193



In [34]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_smote', 'score':metrics.accuracy_score(y_test, y_lgs_test), 
     'f1_score': metrics.f1_score(y_test, y_lgs_test), 'precision': metrics.precision_score(y_test, y_lgs_test),
    'recall': metrics.recall_score(y_test, y_lgs_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

In [35]:
cls_comparison

|   | classifier_name | f1_score | precision | recall | score |
|---|---|---|---|---|---|
| 0 | Logistic | 0.688576 | 0.766551 | 0.625 | 0.974481 |
| 0 | Logistic_under | 0.515129 | 0.347956 | 0.991477 | 0.915748 |
| 0 | Logistic_over | 0.578595 | 0.409953 | 0.982955 | 0.935368 |
| 0 | Logistic_smote | 0.604938 | 0.438619 | 0.974432 | 0.942549 |


It looks like SMOTE behaves like the other balancing techniques: it improves recall (it tries to identify as many minority class samples as possible). This may be beneficial in certain tasks, such as fraud detection or disease identification. However, when identifying good deal listings it is important not only to label all good deals correctly, but also to avoid flagging listings that are not the kind of deal a customer is after. It is therefore important to find a balance between precision and recall.
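
One way to look at the precision/recall balance, independent of resampling, is to vary the probability threshold at which a listing is labelled a good deal. The sketch below uses synthetic data from `make_classification` rather than the listings, and picks the threshold on a held-out validation split; it is an illustration of the trade-off, not a step taken in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), not the listings data.
X_syn, y_syn = make_classification(n_samples=4000, n_features=10,
                                   weights=[0.95], random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=42)

clf = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_fit, y_fit)
proba = clf.predict_proba(X_val)[:, 1]          # estimated P(class 1)

# Precision and recall for every candidate threshold on the validation set.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = np.argmax(f1[:-1])                       # the last point has no threshold
best_threshold = thresholds[best]

f1_default = f1_score(y_val, (proba >= 0.5).astype(int))
print(f'default 0.5: F1={f1_default:.3f}; '
      f'threshold {best_threshold:.2f}: F1={f1[best]:.3f}')
```

Lowering the threshold trades precision for recall, and raising it does the opposite, so the same fitted model can be moved along the curve without retraining.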

In an attempt to do so I will try a few more resampling techniques:

- SMOTENC - same as SMOTE, but it treats categorical variables differently from numeric ones
- ADASYN - an over-sampling technique that also uses interpolation to generate new samples, but focuses on generating them next to original samples that are wrongly classified by a k-nearest neighbours classifier
- NearMiss - under-sampling techniques that use a k-nearest neighbours algorithm to select samples from the over-represented class
- EditedNearestNeighbours - an under-sampling technique that also applies a nearest-neighbours algorithm
- SMOTEENN - a combination of over-sampling and under-sampling techniques

## SMOTENC


In [36]:
# balance data with SMOTENC
smote_nc = over_sampling.SMOTENC(random_state=42, categorical_features=[12,13,14,15,16,17,18,19])
X_train_smotenc, y_train_smotenc = smote_nc.fit_resample(X_train, y_train)
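
The hard-coded positions 12 to 19 depend on the column order of `X` (twelve numerical columns followed by eight categorical ones). As a small sketch, the same positions can be looked up by name, which keeps them correct if the column list changes. The two lists repeat the ones defined earlier in the notebook; `columns` and `categorical_idx` are names made up for the example.

```python
import pandas as pd

numerical = ['host_response_rate', 'accommodates', 'review_scores_rating',
             'review_scores_accuracy', 'review_scores_cleanliness',
             'review_scores_checkin', 'review_scores_communication',
             'review_scores_location', 'review_scores_value', 'price',
             'occupancy_rate', 'sentiment']
categorical = ['host_is_superhost', 'rt_few_days', 'rt_day', 'rt_few_hours',
               'rt_hour', 'instant_bookable', 'cp_flexible', 'cp_strict']

# Column index in the same order as X (numerical first, then categorical).
columns = pd.Index(numerical + categorical)

# Positions of the categorical columns, looked up by name.
categorical_idx = [columns.get_loc(name) for name in categorical]
print(categorical_idx)
```

In the notebook itself, `X_train.columns` would take the place of `columns`.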



In [37]:
lgr4 = LogisticRegression(solver='lbfgs')
lgr4.fit(X_train_smotenc, y_train_smotenc)
y_lgsn_test = lgr4.predict(X_test)

In [38]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_smotenc', 'score':metrics.accuracy_score(y_test, y_lgsn_test), 
     'f1_score': metrics.f1_score(y_test, y_lgsn_test), 'precision': metrics.precision_score(y_test, y_lgsn_test),
    'recall': metrics.recall_score(y_test, y_lgsn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## ADASYN

ADASYN is another over-sampling technique that uses interpolation to generate new samples, but it focuses on generating them next to original samples that are wrongly classified by a k-nearest neighbours classifier.

In [39]:
ada = over_sampling.ADASYN(random_state=42)
X_train_ada, y_train_ada = ada.fit_resample(X_train, y_train)



In [40]:
lgr5 = LogisticRegression(solver='lbfgs')
lgr5.fit(X_train_ada, y_train_ada)
y_lga_test = lgr5.predict(X_test)

In [41]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_adasyn', 'score':metrics.accuracy_score(y_test, y_lga_test), 
     'f1_score': metrics.f1_score(y_test, y_lga_test), 'precision': metrics.precision_score(y_test, y_lga_test),
    'recall': metrics.recall_score(y_test, y_lga_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## NearMiss

There are three versions of this under-sampling technique in imblearn. I will use version 3, as it gave the best results of the three. It is a two-step algorithm. First, for each minority sample (in our case a 'good deal' listing), its M nearest majority neighbours are shortlisted. Then, from that shortlist, the majority samples (non-good deals) selected are the ones whose average distance to their N nearest minority neighbours is the largest.

In [42]:
# resample imbalanced data
from imblearn import under_sampling
near_miss = under_sampling.NearMiss(version=3)
X_train_nm, y_train_nm = near_miss.fit_resample(X_train, y_train)

# train classifier
lgr6 = LogisticRegression(solver='lbfgs')
lgr6.fit(X_train_nm, y_train_nm)
y_nm_test = lgr6.predict(X_test)



In [43]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_nearmiss', 'score':metrics.accuracy_score(y_test, y_nm_test), 
     'f1_score': metrics.f1_score(y_test, y_nm_test), 'precision': metrics.precision_score(y_test, y_nm_test),
    'recall': metrics.recall_score(y_test, y_nm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Edited Nearest Neighbor

This under-sampling technique applies a nearest-neighbours algorithm and “edits” the dataset by removing samples that do not agree “enough” with their neighbourhood. For each sample in the class to be under-sampled, the nearest neighbours are computed, and if the selection criterion is not fulfilled, the sample is removed. Two selection criteria are available: (i) the majority (kind_sel='mode') or (ii) all (kind_sel='all') of the nearest neighbours have to belong to the same class as the inspected sample for it to be kept in the dataset. The 'mode' criterion resulted in a higher precision score (a better balance between precision and recall), so I am using it.
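
The editing rule can be sketched with NumPy for the two-class case. This is a simplified illustration, not the imbalanced-learn implementation: the function name is made up, distances are brute-force Euclidean, and the 'mode' rule is written as "more than half of the neighbours agree", which matches the mode for two classes and an odd `k`.

```python
import numpy as np

def edited_nearest_neighbours(X, y, majority_label=0, k=3, kind_sel='mode'):
    """Drop majority samples whose k nearest neighbours disagree with them."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # exclude the sample itself
    neighbour_labels = y[np.argsort(dist, axis=1)[:, :k]]
    same = (neighbour_labels == y[:, None]).sum(axis=1)   # agreeing neighbours

    if kind_sel == 'all':
        agrees = same == k                         # every neighbour must agree
    else:
        agrees = same > k / 2                      # most neighbours must agree
    keep = (y != majority_label) | agrees          # minority samples are never removed
    return X[keep], y[keep]

# Toy data (made up): one majority point at 10 sits among minority points.
X_toy = np.array([[0.], [1.], [2.], [3.], [10.], [9.], [11.], [12.]])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1])
X_kept, y_kept = edited_nearest_neighbours(X_toy, y_toy)
print(X_kept.ravel(), y_kept)
```

The majority point at 10 is removed because all three of its nearest neighbours belong to the minority class; the other samples are kept. Note that, unlike random under-sampling, this does not aim for equal class sizes.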

In [44]:
# resample imbalanced data
enn = under_sampling.EditedNearestNeighbours(kind_sel='mode')
X_train_enn, y_train_enn = enn.fit_resample(X_train, y_train)

# train classifier
lgr7 = LogisticRegression(solver='lbfgs')
lgr7.fit(X_train_enn, y_train_enn)
y_enn_test = lgr7.predict(X_test)



In [45]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_enn', 'score':metrics.accuracy_score(y_test, y_enn_test), 
     'f1_score': metrics.f1_score(y_test, y_enn_test), 'precision': metrics.precision_score(y_test, y_enn_test),
    'recall': metrics.recall_score(y_test, y_enn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## SMOTEENN

This combination technique first performs over-sampling using SMOTE and then cleans the data using edited nearest neighbours.

In [46]:
# resample imbalanced data
from imblearn import combine
smoteenn = combine.SMOTEENN(random_state=42)
X_train_smenn, y_train_smenn = smoteenn.fit_resample(X_train, y_train)

# train classifier
lgr8 = LogisticRegression(solver='lbfgs')
lgr8.fit(X_train_smenn, y_train_smenn)
y_smenn_test = lgr8.predict(X_test)



In [47]:
# add scores to comparison dataframe
a = {'classifier_name':'Logistic_smoteenn', 'score':metrics.accuracy_score(y_test, y_smenn_test), 
     'f1_score': metrics.f1_score(y_test, y_smenn_test), 'precision': metrics.precision_score(y_test, y_smenn_test),
    'recall': metrics.recall_score(y_test, y_smenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

In [48]:
cls_comparison = cls_comparison.reset_index(drop=True)
cls_comparison

|   | classifier_name | f1_score | precision | recall | score |
|---|---|---|---|---|---|
| 0 | Logistic | 0.688576 | 0.766551 | 0.625 | 0.974481 |
| 1 | Logistic_under | 0.515129 | 0.347956 | 0.991477 | 0.915748 |
| 2 | Logistic_over | 0.578595 | 0.409953 | 0.982955 | 0.935368 |
| 3 | Logistic_smote | 0.604938 | 0.438619 | 0.974432 | 0.942549 |
| 4 | Logistic_smotenc | 0.602941 | 0.445652 | 0.931818 | 0.944601 |
| 5 | Logistic_adasyn | 0.583193 | 0.414081 | 0.985795 | 0.936394 |
| 6 | Logistic_nearmiss | 0.633554 | 0.518051 | 0.815341 | 0.957425 |
| 7 | Logistic_enn | 0.707865 | 0.7 | 0.715909 | 0.973326 |
| 8 | Logistic_smoteenn | 0.555823 | 0.387458 | 0.982955 | 0.929084 |


From the above table one can conclude that all of the resampling techniques improved the recall score (the percentage of good deals identified); however, none improved the precision score. Among the resampled models, ENN, NearMiss and SMOTENC achieved the highest precision, and ENN is the only one whose F1 score (0.708) beats the imbalanced baseline (0.689).

For the further models I will use other classification algorithms, each trained on three data sets: the original imbalanced data, ENN under-sampled data and SMOTENC over-sampled data. I will continue adding results to the comparison dataframe.
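
Since every algorithm is trained on the same set of training data variants, the repeated cells could be expressed as a loop. The sketch below runs on synthetic data with a single training set; in the notebook the `train_sets` dictionary (a hypothetical name) would hold the original, ENN and SMOTENC training data.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the listings.
X_syn, y_syn = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

train_sets = {'original': (X_tr, y_tr)}            # resampled sets would be added here
models = {'Logistic': lambda: LogisticRegression(solver='lbfgs', max_iter=1000),
          'KNN': lambda: KNeighborsClassifier()}

rows = []
for model_name, make_model in models.items():
    for set_name, (X_fit, y_fit) in train_sets.items():
        # A fresh model per combination, always evaluated on the same test set.
        y_pred = make_model().fit(X_fit, y_fit).predict(X_te)
        rows.append({'classifier_name': f'{model_name}_{set_name}',
                     'score': metrics.accuracy_score(y_te, y_pred),
                     'f1_score': metrics.f1_score(y_te, y_pred),
                     'precision': metrics.precision_score(y_te, y_pred),
                     'recall': metrics.recall_score(y_te, y_pred)})
results = pd.DataFrame(rows)
print(results)
```

The model factories are lambdas so that each combination gets an unfitted estimator rather than reusing one that was already trained.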


## k-nearest Neighbors

KNN is another classification algorithm; it assigns a sample to the majority class among its k nearest neighbours, where k is a hyperparameter to be tuned.
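
The voting rule can be sketched with NumPy for 0/1 labels. This is a brute-force illustration with a made-up function name and toy data; `KNeighborsClassifier`, used below, defaults to `n_neighbors=5`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Predict the majority class among the k nearest training samples."""
    X_train, X_new = np.asarray(X_train, float), np.asarray(X_new, float)
    y_train = np.asarray(y_train)
    # Euclidean distance from every new point to every training point.
    dist = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest]                       # labels of the k neighbours
    return (votes.mean(axis=1) > 0.5).astype(int)  # majority vote for 0/1 labels

# Toy data (made up): class 0 around the origin, class 1 around (5, 5).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, [[0.5, 0.5], [5.5, 5.5]], k=3))
```

Because the vote is dominated by whichever class is more common nearby, a rare class can be outvoted even in regions where it is locally present, which is one reason class imbalance matters for KNN.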

In [49]:
# train classifier on imbalanced data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, np.ravel(y_train))
y_knn_test = knn.predict(X_test)

In [50]:
# add scores to comparison dataframe
a = {'classifier_name':'KNN', 'score':metrics.accuracy_score(y_test, y_knn_test), 
     'f1_score': metrics.f1_score(y_test, y_knn_test), 'precision': metrics.precision_score(y_test, y_knn_test),
    'recall': metrics.recall_score(y_test, y_knn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Using Edited Nearest Neighbours data for KNN

In [51]:
# train classifier with resampled data using ENN
knn2 = KNeighborsClassifier()
knn2.fit(X_train_enn, np.ravel(y_train_enn))
y_knnenn_test = knn2.predict(X_test)

In [52]:
# add scores to comparison dataframe
a = {'classifier_name':'KNN_ENN', 'score':metrics.accuracy_score(y_test, y_knnenn_test), 
     'f1_score': metrics.f1_score(y_test, y_knnenn_test), 'precision': metrics.precision_score(y_test, y_knnenn_test),
    'recall': metrics.recall_score(y_test, y_knnenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## KNN using SMOTENC resampling technique

In [53]:
# train classifier with resampled by SMOTENC data
knn3 = KNeighborsClassifier()
knn3.fit(X_train_smotenc, np.ravel(y_train_smotenc))
y_knnsm_test = knn3.predict(X_test)

In [54]:
# add scores to comparison dataframe
a = {'classifier_name':'KNN_SMOTENC', 'score':metrics.accuracy_score(y_test, y_knnsm_test), 
     'f1_score': metrics.f1_score(y_test, y_knnsm_test), 'precision': metrics.precision_score(y_test, y_knnsm_test),
    'recall': metrics.recall_score(y_test, y_knnsm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

In [55]:
cls_comparison = cls_comparison.reset_index(drop=True)
cls_comparison

|   | classifier_name | f1_score | precision | recall | score |
|---|---|---|---|---|---|
| 0 | Logistic | 0.688576 | 0.766551 | 0.625 | 0.974481 |
| 1 | Logistic_under | 0.515129 | 0.347956 | 0.991477 | 0.915748 |
| 2 | Logistic_over | 0.578595 | 0.409953 | 0.982955 | 0.935368 |
| 3 | Logistic_smote | 0.604938 | 0.438619 | 0.974432 | 0.942549 |
| 4 | Logistic_smotenc | 0.602941 | 0.445652 | 0.931818 | 0.944601 |
| 5 | Logistic_adasyn | 0.583193 | 0.414081 | 0.985795 | 0.936394 |
| 6 | Logistic_nearmiss | 0.633554 | 0.518051 | 0.815341 | 0.957425 |
| 7 | Logistic_enn | 0.707865 | 0.7 | 0.715909 | 0.973326 |
| 8 | Logistic_smoteenn | 0.555823 | 0.387458 | 0.982955 | 0.929084 |
| 9 | KNN | 0.632653 | 0.649701 | 0.616477 | 0.967684 |


## SVM - support vector machines

It is a supervised machine learning algorithm that finds a hyperplane in n-dimensional feature space that separates the two classes.

In [56]:
# train on imbalanced data
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
y_svm_train = svm.predict(X_train)
y_svm_test = svm.predict(X_test)

In [57]:
# add scores to comparison dataframe
a = {'classifier_name':'SVM', 'score':metrics.accuracy_score(y_test, y_svm_test), 
     'f1_score': metrics.f1_score(y_test, y_svm_test), 'precision': metrics.precision_score(y_test, y_svm_test),
    'recall': metrics.recall_score(y_test, y_svm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## SVM with ENN

In [58]:
# train with resampled by ENN data
svm2 = SVC()
svm2.fit(X_train_enn, y_train_enn)
y_svmenn_test = svm2.predict(X_test)

In [59]:
# add scores to comparison dataframe
a = {'classifier_name':'SVM_ENN', 'score':metrics.accuracy_score(y_test, y_svmenn_test), 
     'f1_score': metrics.f1_score(y_test, y_svmenn_test), 'precision': metrics.precision_score(y_test, y_svmenn_test),
    'recall': metrics.recall_score(y_test, y_svmenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## SVM with SMOTENC

In [60]:
# train with resampled by SMOTENC data
svm3 = SVC()
svm3.fit(X_train_smotenc, y_train_smotenc)
y_svmsm_test = svm3.predict(X_test)

In [61]:
# add scores to comparison dataframe
a = {'classifier_name':'SVM_SMOTENC', 'score':metrics.accuracy_score(y_test, y_svmsm_test), 
     'f1_score': metrics.f1_score(y_test, y_svmsm_test), 'precision': metrics.precision_score(y_test, y_svmsm_test),
    'recall': metrics.recall_score(y_test, y_svmsm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

From the above table it can be concluded that the KNN algorithm did not provide any improvement. The Support Vector Machine delivered better results than before: it had the highest precision when trained on the imbalanced data, and the most balanced and jointly highest precision and recall when trained on the ENN-resampled data. In that case the F1 score, 0.75, is also the largest so far.

## Decision Tree

It is a supervised learning algorithm that splits the data at each node based on criteria it learns from the training data.
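
As an illustration of how one such split could be chosen, the sketch below scores candidate thresholds on a single feature by weighted Gini impurity (the default criterion of `DecisionTreeClassifier`) and keeps the purest one. The function names and the toy data are made up for the example, and a real tree repeats this search over all features at every node.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 means perfectly pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data (made up, already sorted by price): cheap listings are good deals.
price = np.array([40, 55, 60, 90, 120, 150])
deal = np.array([1, 1, 1, 0, 0, 0])

# Candidate thresholds halfway between consecutive sorted values.
candidates = (price[:-1] + price[1:]) / 2
best = min(candidates, key=lambda t: split_impurity(price, deal, t))
print(best, split_impurity(price, deal, best))
```

Here the threshold 75 separates the two classes perfectly, so the weighted impurity after the split is 0.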

In [62]:
# train on imbalanced data
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_dt_test = dtc.predict(X_test)

In [63]:
# add scores to comparison dataframe
a = {'classifier_name':'Decision Tree', 'score':metrics.accuracy_score(y_test, y_dt_test), 
     'f1_score': metrics.f1_score(y_test, y_dt_test), 'precision': metrics.precision_score(y_test, y_dt_test),
    'recall': metrics.recall_score(y_test, y_dt_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Decision Tree with ENN

In [64]:
# train on ENN balanced data
dtc2 = DecisionTreeClassifier()
dtc2.fit(X_train_enn, y_train_enn)
y_dtenn_test = dtc2.predict(X_test)

In [65]:
# add scores to comparison dataframe
a = {'classifier_name':'Decision Tree ENN', 'score':metrics.accuracy_score(y_test, y_dtenn_test), 
     'f1_score': metrics.f1_score(y_test, y_dtenn_test), 'precision': metrics.precision_score(y_test, y_dtenn_test),
    'recall': metrics.recall_score(y_test, y_dtenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Decision Tree with SMOTENC

In [66]:
# train on SMOTENC balanced data
dtc3 = DecisionTreeClassifier()
dtc3.fit(X_train_smotenc, y_train_smotenc)
y_dtsm_test = dtc3.predict(X_test)

In [67]:
# add scores to comparison dataframe
a = {'classifier_name':'Decision Tree SMOTENC', 'score':metrics.accuracy_score(y_test, y_dtsm_test), 
     'f1_score': metrics.f1_score(y_test, y_dtsm_test), 'precision': metrics.precision_score(y_test, y_dtsm_test),
    'recall': metrics.recall_score(y_test, y_dtsm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Random Forest

A random forest is an ensemble algorithm that uses multiple decision trees (100 by default in scikit-learn's RandomForestClassifier) and combines their results to make a classification decision. Each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant, hence yielding an overall better model.
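
The two ideas in this paragraph, bootstrap sampling and variance reduction by averaging, can be sketched with NumPy alone. The numbers below are synthetic and the "trees" are idealised as independent noisy estimators, which real trees in a forest are not, so this is only an illustration of the mechanism. A bootstrap sample of size N is expected to contain a share of distinct rows close to 1 - 1/e ≈ 0.632.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# one bootstrap sample: draw n_samples row indices with replacement
bootstrap = rng.choice(np.arange(n_samples), size=n_samples, replace=True)
unique_fraction = np.unique(bootstrap).size / n_samples
print(f"share of distinct rows in the bootstrap sample: {unique_fraction:.3f}")

# idealised variance reduction: 1000 trials, each with 100 independent noisy
# estimates of the same quantity (a stand-in for 100 trees)
estimates = rng.normal(loc=0.7, scale=0.2, size=(1000, 100))
single_var = estimates[:, 0].var()           # variance of one estimator
averaged_var = estimates.mean(axis=1).var()  # variance of the 100-estimator average
print(f"single: {single_var:.4f}, averaged: {averaged_var:.4f}")
```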

In [68]:
# train on imbalanced data
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_rf_test = rfc.predict(X_test)

In [69]:
# add scores to comparison dataframe
a = {'classifier_name':'Random Forest', 'score':metrics.accuracy_score(y_test, y_rf_test), 
     'f1_score': metrics.f1_score(y_test, y_rf_test), 'precision': metrics.precision_score(y_test, y_rf_test),
    'recall': metrics.recall_score(y_test, y_rf_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Random Forest with ENN

In [70]:
# train on ENN balanced data
rfc2 = RandomForestClassifier()
rfc2.fit(X_train_enn, y_train_enn)
y_rfenn_test = rfc2.predict(X_test)

In [71]:
# add scores to comparison dataframe
a = {'classifier_name':'Random Forest ENN', 'score':metrics.accuracy_score(y_test, y_rfenn_test), 
     'f1_score': metrics.f1_score(y_test, y_rfenn_test), 'precision': metrics.precision_score(y_test, y_rfenn_test),
    'recall': metrics.recall_score(y_test, y_rfenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Random Forest with SMOTENC

In [72]:
# train on SMOTENC balanced data
rfc3 = RandomForestClassifier()
rfc3.fit(X_train_smotenc, y_train_smotenc)
y_rfsm_test = rfc3.predict(X_test)

In [73]:
# add scores to comparison dataframe
a = {'classifier_name':'Random Forest SMOTENC', 'score':metrics.accuracy_score(y_test, y_rfsm_test), 
     'f1_score': metrics.f1_score(y_test, y_rfsm_test), 'precision': metrics.precision_score(y_test, y_rfsm_test),
    'recall': metrics.recall_score(y_test, y_rfsm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## AdaBoost

It is also an ensemble method that fits a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. 

The data modifications at each so-called boosting iteration consist of applying weights to each of the training samples. Initially, those weights are all set to 1/N, so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.
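
One round of this reweighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost update with labels encoded as -1/+1; the labels and predictions are made up for illustration, and scikit-learn's AdaBoostClassifier may differ in its details.

```python
import numpy as np

# hypothetical true labels and weak-learner predictions, encoded as -1 / +1
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1, 1, -1])  # two mistakes

n = len(y_true)
weights = np.full(n, 1 / n)                 # initial weights: 1/N each

miss = y_pred != y_true
error = weights[miss].sum()                 # weighted error of the weak learner
alpha = 0.5 * np.log((1 - error) / error)   # this learner's weight in the final vote

# misclassified samples (y_true * y_pred == -1) are scaled up,
# correctly classified ones are scaled down, then weights are renormalised
weights = weights * np.exp(-alpha * y_true * y_pred)
weights = weights / weights.sum()

print(weights[miss])    # weights of the two misclassified samples
print(weights[~miss])   # weights of the eight correctly classified samples
```

After the update the misclassified samples together carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.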

In [74]:
# train on imbalanced data
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train,y_train)
y_ada_test = ada.predict(X_test)

In [75]:
# add scores to comparison dataframe
a = {'classifier_name':'AdaBoost', 'score':metrics.accuracy_score(y_test, y_ada_test), 
     'f1_score': metrics.f1_score(y_test, y_ada_test), 'precision': metrics.precision_score(y_test, y_ada_test),
    'recall': metrics.recall_score(y_test, y_ada_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## AdaBoost with ENN

In [76]:
# train on ENN balanced data
ada2 = AdaBoostClassifier()
ada2.fit(X_train_enn, y_train_enn)
y_adaenn_test = ada2.predict(X_test)

In [77]:
# add scores to comparison dataframe
a = {'classifier_name':'AdaBoost ENN', 'score':metrics.accuracy_score(y_test, y_adaenn_test), 
     'f1_score': metrics.f1_score(y_test, y_adaenn_test), 'precision': metrics.precision_score(y_test, y_adaenn_test),
    'recall': metrics.recall_score(y_test, y_adaenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## AdaBoost with SMOTENC

In [78]:
# train on SMOTENC balanced data
ada3 = AdaBoostClassifier()
ada3.fit(X_train_smotenc, y_train_smotenc)
y_adasm_test = ada3.predict(X_test)

In [79]:
# add scores to comparison dataframe
a = {'classifier_name':'AdaBoost SMOTENC', 'score':metrics.accuracy_score(y_test, y_adasm_test), 
     'f1_score': metrics.f1_score(y_test, y_adasm_test), 'precision': metrics.precision_score(y_test, y_adasm_test),
    'recall': metrics.recall_score(y_test, y_adasm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Gradient Boosting

It is another ensemble model that uses weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
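
The stage-wise idea can be sketched in a few lines: each stage fits a small regression tree to the negative gradient of the loss at the current predictions and adds it to the model. The example below uses the binary log-loss, whose negative gradient with respect to the raw score is simply `y - p`. The synthetic data, the learning rate and the number of stages are assumptions chosen for illustration, and the sketch omits refinements used by GradientBoostingClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def log_loss(y, raw_score):
    """Mean binary log-loss for labels in {0, 1} and raw (log-odds) scores."""
    return np.mean(np.logaddexp(0, raw_score) - y * raw_score)

# synthetic stand-in data, not the listings data set
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

learning_rate = 0.5
prior = y_demo.mean()
raw = np.full(len(y_demo), np.log(prior / (1 - prior)))  # stage 0: constant log-odds
losses = [log_loss(y_demo, raw)]

for _ in range(20):
    prob = 1 / (1 + np.exp(-raw))
    residual = y_demo - prob   # negative gradient of the log-loss
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, residual)
    raw = raw + learning_rate * stump.predict(X_demo)    # add the new weak learner
    losses.append(log_loss(y_demo, raw))

print(f"log-loss before boosting: {losses[0]:.3f}, after 20 stages: {losses[-1]:.3f}")
```

Swapping in a different differentiable loss only changes how `residual` is computed, which is the sense in which gradient boosting generalizes other boosting methods.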

In [80]:
# train on imbalanced data
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_gb_test = gbc.predict(X_test)

In [81]:
# add scores to comparison dataframe
a = {'classifier_name':'Gradient Boosting', 'score':metrics.accuracy_score(y_test, y_gb_test), 
     'f1_score': metrics.f1_score(y_test, y_gb_test), 'precision': metrics.precision_score(y_test, y_gb_test),
    'recall': metrics.recall_score(y_test, y_gb_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Gradient Boosting with ENN

In [82]:
# train on ENN balanced data
gbc2 = GradientBoostingClassifier()
gbc2.fit(X_train_enn, y_train_enn)
y_gbenn_test = gbc2.predict(X_test)

In [83]:
# add scores to comparison dataframe
a = {'classifier_name':'Gradient Boosting ENN', 'score':metrics.accuracy_score(y_test, y_gbenn_test), 
     'f1_score': metrics.f1_score(y_test, y_gbenn_test), 'precision': metrics.precision_score(y_test, y_gbenn_test),
    'recall': metrics.recall_score(y_test, y_gbenn_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

## Gradient Boosting with SMOTENC

In [84]:
# train on SMOTENC balanced data
gbc3 = GradientBoostingClassifier()
gbc3.fit(X_train_smotenc, y_train_smotenc)
y_gbsm_test = gbc3.predict(X_test)

In [85]:
# add scores to comparison dataframe
a = {'classifier_name':'Gradient Boosting SMOTENC', 'score':metrics.accuracy_score(y_test, y_gbsm_test), 
     'f1_score': metrics.f1_score(y_test, y_gbsm_test), 'precision': metrics.precision_score(y_test, y_gbsm_test),
    'recall': metrics.recall_score(y_test, y_gbsm_test)}
cls_comparison = pd.concat([cls_comparison, pd.DataFrame([a])])

In [86]:
cls_comparison.reset_index(inplace=True)
cls_comparison.drop('index', axis=1, inplace=True)
cls_comparison.sort_values(['precision', 'f1_score','recall'], ascending=False)

Unnamed: 0,classifier_name,f1_score,precision,recall,score
18,Random Forest,0.834286,0.83908,0.829545,0.985124
21,AdaBoost,0.86612,0.834211,0.900568,0.987433
24,Gradient Boosting,0.834512,0.830986,0.838068,0.984996
12,SVM,0.716561,0.815217,0.639205,0.977174
19,Random Forest ENN,0.847185,0.80203,0.897727,0.985381
15,Decision Tree,0.789773,0.789773,0.789773,0.981021
17,Decision Tree SMOTENC,0.815321,0.78628,0.846591,0.982688
22,AdaBoost ENN,0.848958,0.783654,0.926136,0.985124
16,Decision Tree ENN,0.826203,0.780303,0.877841,0.983329
25,Gradient Boosting ENN,0.840467,0.77327,0.920455,0.984227


I am sorting the scores in the following order: precision, F1 and recall. As discussed earlier, the main interest here is not only to correctly identify 'good deal' listings but also to avoid mistakenly labelling regular listings as 'good deals'. Therefore I am trying to get the highest precision score. The best results so far were achieved by the ensemble algorithms trained on imbalanced data. They can be ranked in the following order:

- Random Forest with a precision score of about 0.84
- AdaBoost with a precision score of just over 0.83
- Gradient Boosting with a precision score of about 0.83

The same algorithms trained on Edited Nearest Neighbours balanced data showed slightly lower precision scores but higher recall scores (they correctly identified a larger number of actual 'good deal' listings than the ones trained on imbalanced data).

All of the 'winning' algorithms are ensemble methods, in which the training data is resampled or reweighted within the algorithm to train separate models. This might explain why these algorithms perform better when working with the original imbalanced data.

Next, I will try to combine these top three algorithms in a voting classifier (with hard voting, i.e. the majority vote rule) to see whether there is any improvement.
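
The majority vote rule itself is simple enough to sketch with NumPy. The predictions below are made up for three unnamed classifiers and five listings; with an odd number of binary classifiers there are no ties to break.

```python
import numpy as np

# hypothetical 0/1 predictions of three classifiers for five listings
votes = np.array([
    [1, 0, 1, 0, 1],   # classifier A
    [1, 0, 0, 0, 1],   # classifier B
    [0, 0, 1, 1, 1],   # classifier C
])

# a listing is labelled 1 when more than half of the classifiers vote for it
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(majority)  # → [1 0 1 0 1]
```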

## Voting Classifier

In [87]:
from sklearn.ensemble import VotingClassifier

clf1 = RandomForestClassifier()
clf2 = AdaBoostClassifier()
clf3 = GradientBoostingClassifier()
eclf = VotingClassifier(estimators=[('rf', clf1), ('ada', clf2), ('gb', clf3)], voting='hard')
eclf.fit(X_train, y_train)
y_v_test = eclf.predict(X_test)

In [88]:
print(metrics.confusion_matrix(y_test, y_v_test))
print(metrics.classification_report(y_test,y_v_test))

[[7391   55]
 [  48  304]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      7446
           1       0.85      0.86      0.86       352

    accuracy                           0.99      7798
   macro avg       0.92      0.93      0.92      7798
weighted avg       0.99      0.99      0.99      7798



It yields about the same precision as the best single model, Random Forest (0.85 vs. 0.84), while slightly improving on its recall (0.86 vs. 0.83) and hence F1 score (0.86 vs. 0.83).
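
As a cross-check, the class 1 figures in the report can be recomputed from the confusion matrix printed above. In scikit-learn's layout the rows are actual classes and the columns are predicted classes, so the four counts are TN, FP, FN and TP.

```python
# counts taken from the confusion matrix printed above
tn, fp = 7391, 55   # actual class 0: correctly rejected, wrongly flagged as a good deal
fn, tp = 48, 304    # actual class 1: missed good deals, correctly identified good deals

precision = tp / (tp + fp)   # share of predicted good deals that really are good deals
recall = tp / (tp + fn)      # share of actual good deals that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# → 0.85 0.86 0.86 0.99
```

The 55 false positives are the regular listings mistakenly labelled as good deals, which is the error that sorting by precision is meant to minimise.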

For further improvement it might be worth experimenting with the features (for example, using fewer of them or adding interactions between them as separate variables) and/or tuning the model parameters.
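
One possible way to adjust model parameters is a cross-validated grid search that optimises precision directly. The sketch below is only an illustration under stated assumptions: it runs on synthetic imbalanced data rather than the listings data, and the parameter grid is hypothetical and deliberately small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic, imbalanced stand-in for the listings data (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# hypothetical grid; sensible ranges would have to be chosen for the real data
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="precision",   # optimise the metric this analysis sorts by
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

With the real data, `X_train` and `y_train` would take the place of the synthetic arrays, and the held-out test set would only be used once, after the search has finished.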
