# AirBNB Price Prediction with Logistic Regression

We will predict whether a listing's price is at least 150 (the binary `price_gte_150` column) using the AirBNB dataset from last week.

## 1. Setup

In [95]:
# Common imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
np.random.seed(1)

## 2. Load the data

We will use the AirBNB data that we cleaned in the last class (the original, not the version that you altered for last week's exercise).

In [96]:
# Print the current working directory to help debug problems with finding the .csv file paths.
import os
print(os.getcwd())

/Users/shambhavimishra/Downloads/DSP


In [97]:
X_train = pd.read_csv("/Users/shambhavimishra/Downloads/DSP/airbnb_train_X_price_gte_150.csv")
X_test = pd.read_csv("/Users/shambhavimishra/Downloads/DSP/airbnb_test_X_price_gte_150.csv")
y_train = pd.read_csv("/Users/shambhavimishra/Downloads/DSP/airbnb_train_y_price_gte_150.csv")
y_test = pd.read_csv("/Users/shambhavimishra/Downloads/DSP/airbnb_test_y_price_gte_150.csv")

## 3. Model the data

First, we will create a dataframe to hold all the results of our models.

In [98]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})
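Each model below repeats the same confusion-matrix arithmetic before appending a row to `performance`. As a minimal sketch, that step could be wrapped in a helper. The name `metrics_row` and the toy labels are hypothetical, and the sketch assumes the target is coded as 0/1.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix


def metrics_row(name, y_true, y_pred):
    """Return a one-row DataFrame with the four metrics used in this notebook."""
    # For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [(tp + tn) / (tp + tn + fp + fn)],
        "Precision": [tp / (tp + fp)],
        "Recall": [tp / (tp + fn)],
        "F1": [2 * tp / (2 * tp + fp + fn)],
    })


# Toy labels, made up for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Collect rows in a list and concatenate once; ignore_index gives a clean 0..n-1 index.
rows = [metrics_row("toy model", y_true, y_pred)]
performance_demo = pd.concat(rows, ignore_index=True)
print(performance_demo)
```

With a helper like this, each model section would reduce to one call such as `metrics_row("L2 logistic", y_test, model_preds)`.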

### 3.1 Fit and test a Logistic Regression model

In [99]:
# penalty='none' fits without regularization; scikit-learn 1.2+ spells this penalty=None.
log_reg_model = LogisticRegression(penalty='none', max_iter=900)
_ = log_reg_model.fit(X_train, np.ravel(y_train))



In [100]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
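For reference, the four metrics computed from the confusion matrix in the cell above are the standard definitions, written here with the same `TP`, `TN`, `FP`, `FN` names:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

As a rough check against the recorded row, taking Precision $= 0.852995$ and Recall $= 0.885122$ gives $F_1 \approx 2(0.852995)(0.885122) / (0.852995 + 0.885122) \approx 0.8688$, which agrees with the F1 column.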


### 3.2 Change to liblinear solver

In [101]:
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(X_train, np.ravel(y_train))

In [102]:
model_preds = log_reg_liblin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454


### 3.3 L2 Regularization

In [103]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [104]:
model_preds = log_reg_L2_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
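The L2 model above uses the default regularization strength. In scikit-learn, `C` is the inverse of that strength, so smaller `C` shrinks the coefficients more. The sketch below illustrates this on synthetic data; the dataset and the `C` values are assumptions chosen for illustration, not part of the AirBNB analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the AirBNB set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=1
)

# Fit the default (L2-penalized) model at three regularization strengths.
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector: smaller means more shrinkage.
    norms[C] = float(np.linalg.norm(model.coef_))

print(norms)
```

The coefficient norm should grow as `C` increases, since a larger `C` means a weaker penalty.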


### 3.4 L1 Regularization

In [105]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [106]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
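A practical difference between the L1 and L2 penalties is that L1 tends to set some coefficients exactly to zero, which acts as a form of feature selection. The sketch below compares the two on synthetic data with many uninformative features; the dataset and `C=0.05` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features and 17 pure-noise features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=1
)

# Same solver and strength, differing only in the penalty type.
l1_model = LogisticRegression(solver="liblinear", penalty="l1", C=0.05).fit(X_demo, y_demo)
l2_model = LogisticRegression(solver="liblinear", penalty="l2", C=0.05).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

Inspecting `log_reg_L1_model.coef_` in the same way would show which AirBNB features, if any, the L1 model dropped.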


### 3.5 Elastic Net Regularization

In [107]:
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=1000)
_ = log_reg_elastic_model.fit(X_train, np.ravel(y_train))

In [108]:
model_preds = log_reg_elastic_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elestic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
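The `saga` solver used for elastic net is documented to converge quickly only when features are on roughly the same scale. Whether the cleaned AirBNB features are already scaled is not stated here, so the following is only a sketch of one common arrangement: a pipeline that standardizes features before the elastic-net model. The synthetic data, including the deliberately rescaled column, is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature inflated to a much larger scale.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_clusters_per_class=1, class_sep=2.0, random_state=1,
)
X_demo[:, 0] *= 1000.0

# Standardize first, then fit the same elastic-net configuration as above.
elastic_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", penalty="elasticnet", l1_ratio=0.5, max_iter=1000),
)
elastic_pipeline.fit(X_demo, y_demo)

train_accuracy = elastic_pipeline.score(X_demo, y_demo)
print("training accuracy:", train_accuracy)
```

Here `l1_ratio=0.5` weights the L1 and L2 terms equally; `l1_ratio=0` corresponds to pure L2 and `l1_ratio=1` to pure L1.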


### 3.6 Fit SVM Classification Model using Linear Kernel

In [109]:
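# Fit on the training split and score on the test split, as for the logistic models.
# (The recorded table below came from a run that used `X` and `y`, which this notebook never defines.)
svm_lin_model = SVC(kernel="linear")
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [110]:
model_preds = svm_lin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)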
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"linear svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,linear svm,0.76,0.833333,0.714286,0.769231


### 3.7 Fit SVM Classification using RBF Kernel

In [111]:
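# Fit on the training split and score on the test split, as for the logistic models.
# (The recorded table below came from a run that used `X` and `y`, which this notebook never defines.)
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [112]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)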
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"rbf svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,linear svm,0.76,0.833333,0.714286,0.769231
0,rbf svm,0.76,1.0,0.571429,0.727273


### 3.8 Fit SVM Classification using Polynomial Kernel

In [113]:
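# Fit on the training split and score on the test split, as for the logistic models.
# (The recorded table below came from a run that used `X` and `y`, which this notebook never defines.)
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [114]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)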
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,linear svm,0.76,0.833333,0.714286,0.769231
0,rbf svm,0.76,1.0,0.571429,0.727273
0,poly svm,0.76,1.0,0.571429,0.727273
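When comparing kernels, it helps to look at held-out accuracy alongside training accuracy, because flexible kernels can fit the training data closely without generalizing as well. The sketch below does this for the three kernels on synthetic data. The dataset, the split, and the added `StandardScaler` step are assumptions for illustration; only `kernel` and `C=10` echo the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, split into training and held-out parts.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Record (training accuracy, test accuracy) for each kernel.
scores = {}
for kernel in ("linear", "rbf", "poly"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=10))
    clf.fit(X_tr, y_tr)
    scores[kernel] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for kernel, (train_acc, test_acc) in scores.items():
    print(f"{kernel}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the two numbers for a kernel is a hint that it is overfitting at the chosen `C`.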


## 4. Summary

Sorted by accuracy in ascending order, so the best models appear last:

In [115]:
performance.sort_values(by=['Accuracy'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,linear svm,0.76,0.833333,0.714286,0.769231
0,rbf svm,0.76,1.0,0.571429,0.727273
0,poly svm,0.76,1.0,0.571429,0.727273
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,default logistic,0.866917,0.852995,0.885122,0.868762


Sorted by precision in ascending order, so the best models appear last:

In [116]:
performance.sort_values(by=['Precision'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,linear svm,0.76,0.833333,0.714286,0.769231
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,rbf svm,0.76,1.0,0.571429,0.727273
0,poly svm,0.76,1.0,0.571429,0.727273


Sorted by recall in ascending order, so the best models appear last:

In [117]:
performance.sort_values(by=['Recall'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,rbf svm,0.76,1.0,0.571429,0.727273
0,poly svm,0.76,1.0,0.571429,0.727273
0,linear svm,0.76,0.833333,0.714286,0.769231
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,default logistic,0.866917,0.852995,0.885122,0.868762


Sorted by F1 in ascending order, so the best models appear last:

In [118]:
performance.sort_values(by=['F1'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,rbf svm,0.76,1.0,0.571429,0.727273
0,poly svm,0.76,1.0,0.571429,0.727273
0,linear svm,0.76,0.833333,0.714286,0.769231
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,default logistic,0.866917,0.852995,0.885122,0.868762
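`sort_values` sorts in ascending order by default, which is why the tables above list the weakest model first and every row carries index 0. A minimal sketch of a best-first ranking, using three rows copied from the results above (the name `performance_demo` is hypothetical):

```python
import pandas as pd

# Three rows copied from the results table above.
performance_demo = pd.DataFrame({
    "model": ["default logistic", "L1 logistic", "linear svm"],
    "Accuracy": [0.866917, 0.858482, 0.76],
    "F1": [0.868762, 0.860315, 0.769231],
}, index=[0, 0, 0])  # duplicate index, as produced by the repeated pd.concat calls

# Sort best-first and replace the duplicate index with a clean ranking.
ranked = (
    performance_demo
    .sort_values(by="Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```

The same two calls work for any of the metric columns by changing the `by` argument.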


### So which model is the 'best' and the one you wish to choose?

This depends heavily on the profit or loss associated with each FP, FN, TP and TN. We will discuss this in the next class.
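One way to make that dependence concrete is to attach a payoff to each confusion-matrix cell and total it per model. The sketch below does so for two made-up models under two made-up payoff schemes. All counts and payoffs are hypothetical and are not taken from the results above.

```python
def total_payoff(counts, payoff):
    """Sum the payoff over the four confusion-matrix cells."""
    return sum(counts[cell] * payoff[cell] for cell in ("TP", "TN", "FP", "FN"))


# Hypothetical confusion-matrix counts for two models (not from this notebook).
models = {
    "high precision": {"TP": 80, "TN": 110, "FP": 0, "FN": 60},
    "balanced": {"TP": 120, "TN": 90, "FP": 20, "FN": 20},
}

# Hypothetical payoffs per prediction outcome, in arbitrary units.
mild_fp = {"TP": 100, "TN": 0, "FP": -150, "FN": -50}
severe_fp = {"TP": 100, "TN": 0, "FP": -1000, "FN": -50}

for label, payoff in (("mild FP cost", mild_fp), ("severe FP cost", severe_fp)):
    totals = {name: total_payoff(counts, payoff) for name, counts in models.items()}
    best = max(totals, key=totals.get)
    print(label, totals, "->", best)
```

Under the mild scheme the balanced model wins; once a false positive becomes expensive enough, the high-precision model wins instead, even though its recall is lower.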

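Conclusion: The best-fitting model is the one that minimizes the difference between observed and predicted values. Judged by the confusion matrix, that means keeping both FP (false positives) and FN (false negatives) low. In the tables above, the RBF- and polynomial-kernel SVMs have the highest precision (1.0, so no false positives), but also the lowest recall (0.571) and an accuracy of only 0.76. The unregularized logistic regression (labelled "default logistic") has the highest accuracy (0.867), recall (0.885), and F1 (0.869). The recorded SVM rows were also produced by fitting and scoring on the same `X` and `y` rather than on the train/test split, so they are not directly comparable with the logistic rows. On these numbers, the polynomial-kernel SVM is the right choice only if false positives are far more costly than false negatives; otherwise the unregularized logistic regression is the better model.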