# Capstone Workbook 4: Initial Modelling

In [1]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Import data 
airbnb_ldn = pd.read_csv('airbnb_ldn_pp.csv')

In [3]:
# drop 'Unnamed: 0'
airbnb_ldn = airbnb_ldn.drop(columns = 'Unnamed: 0')

In [4]:
# View data:
airbnb_ldn.head().T

Unnamed: 0,0,1,2,3,4
Listing Title,Cozy 2BR house with a garden view,GuestReady - Amazing home with a private garden,Cosy cottage on Richmond Park,"Entire Flat. Free parking, Garden , Richmond park",Maisonette inbetween Richmond Park and Wimbledon
Property Type,Entire home,Entire home,Entire home,Entire rental unit,Private room in rental unit
City,Greater London,Greater London,Greater London,Greater London,Greater London
Zipcode,SW15 3,SW15 3,SW15 3,SW15 3,SW15 3
Number of Reviews,9,11,1,20,0
Bedrooms,2.0,2.0,1.0,2.0,1.0
Bathrooms,2,1,2,1,1
Max Guests,6,4,3,4,2
Airbnb Superhost,0,1,0,0,0
Cleaning Fee (Native),154.8,0.0,0.0,34.8,0.0


The data has now been cleaned, explored with some initial EDA, and preprocessed.

Some initial models will now be built, starting with a linear regression and working towards a regression model with an L1 penalty. This will help identify which columns are influential in predicting the target column.

First, the dataframe will be split into the independent and target variables, using just numerical variables for now:

In [5]:
X = airbnb_ldn.select_dtypes(exclude='object').drop(columns = 'Annual Revenue LTM (Native)')
y = airbnb_ldn['Annual Revenue LTM (Native)']

In [6]:
# drop all columns with null values:
X = X.drop(columns = ['Airbnb Communication Rating', 'Airbnb Accuracy Rating', 'Airbnb Cleanliness Rating', 'Airbnb Checkin Rating', 'Airbnb Location Rating', 'Airbnb Value Rating', 'Host Listing Count'])
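
The columns dropped above are listed by hand. A minimal sketch of finding them programmatically, using a small made-up frame (the values are invented and `X_demo` is a hypothetical stand-in for `X`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X; values are invented for illustration.
X_demo = pd.DataFrame({
    "Bedrooms": [2.0, 1.0, 3.0],
    "Host Listing Count": [1.0, np.nan, 4.0],
    "Airbnb Value Rating": [np.nan, 9.0, 10.0],
})

# Columns containing at least one null value.
null_cols = X_demo.columns[X_demo.isna().any()].tolist()
print(null_cols)

# Drop them in one step; equivalent to X_demo.dropna(axis=1).
X_demo_clean = X_demo.drop(columns=null_cols)
print(X_demo_clean.columns.tolist())
```

This avoids typing column names by hand, at the cost of silently dropping any column that gains a null later on.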

In [7]:
X.shape

(32686, 34)

The independent and target variables will now be split into train and test sets:

In [8]:
# import required sklearn functions:
from sklearn.model_selection import train_test_split

# splitting data into train and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=42)

In [9]:
print(X_train.shape)
print(y_train.shape)

(21899, 34)
(21899,)
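
The shapes above follow from the row count and `test_size`. A quick arithmetic check, assuming scikit-learn rounds the test share up to a whole number of rows and gives the remainder to the training set:

```python
import math

n_samples = 32686   # rows in X, from X.shape above
test_size = 0.33    # fraction passed to train_test_split

# Test rows are rounded up; the remainder goes to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 21899 10787
```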


As random samples have been taken, the indexes for the dataframes will be reset:

In [10]:
# reset the index for the X training set, discarding the old index:
X_train = X_train.reset_index(drop=True)


In [11]:
# reset the index for X_test, discarding the old index:
X_test = X_test.reset_index(drop=True)


The index for the target column must also be reset:

In [12]:
# reset the index for the y_train data, discarding the old index:
y_train = y_train.reset_index(drop=True)

Complete the same transformation for the y_test data:

In [13]:
# reset the index for the y_test data, discarding the old index:
y_test = y_test.reset_index(drop=True)

Now that the train and test datasets have been transformed, the first model can be built:

In [14]:
# import required libraries
from scipy import stats
import statsmodels.api as sm


In [15]:
# manually add the y-intercept (constant) column:
X_train_withconstant = sm.add_constant(X_train)
X_test_withconstant = sm.add_constant(X_test)

In [16]:
# instantiate model
myregression = sm.OLS(y_train, X_train_withconstant)

# fit model
myregression_results = myregression.fit()

# Looking at the summary
myregression_results.summary()

0,1,2,3
Dep. Variable:,Annual Revenue LTM (Native),R-squared:,0.692
Model:,OLS,Adj. R-squared:,0.692
Method:,Least Squares,F-statistic:,1641.0
Date:,"Sat, 23 Mar 2024",Prob (F-statistic):,0.0
Time:,19:44:44,Log-Likelihood:,-235910.0
No. Observations:,21899,AIC:,471900.0
Df Residuals:,21868,BIC:,472100.0
Df Model:,30,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0043,1.509,-0.003,0.998,-2.961,2.953
Number of Reviews,-24.5546,1.728,-14.213,0.000,-27.941,-21.168
Bedrooms,-295.4206,150.672,-1.961,0.050,-590.748,-0.093
Bathrooms,143.6861,141.546,1.015,0.310,-133.755,421.127
Max Guests,817.6309,69.164,11.822,0.000,682.064,953.198
Airbnb Superhost,455.5591,194.801,2.339,0.019,73.734,837.384
Cleaning Fee (Native),38.3364,2.470,15.524,0.000,33.496,43.177
Extra People Fee(Native),-17.2931,10.183,-1.698,0.089,-37.253,2.666
Minimum Stay,-4.2713,4.290,-0.996,0.319,-12.681,4.138

0,1,2,3
Omnibus:,11578.05,Durbin-Watson:,1.989
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3055889.355
Skew:,1.366,Prob(JB):,0.0
Kurtosis:,60.807,Cond. No.,4.44e+18


The first model has produced fairly positive results. An R^2 value of 0.692 indicates that approximately 69% of the variance in Annual Revenue can be explained by the numerical features in the training data.
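
For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch on made-up numbers (not the Airbnb data) that computes it by hand and compares the result with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented values purely for illustration.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

Here the residual sum of squares is 26 and the total sum of squares is 500, so both values come out at 0.948.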

Let's see how this model performs on the test data:

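A second linear model will now be produced with scikit-learn, using scaled features. The aim is a model with an L1 penalty; the cells below fit an unpenalised LinearRegression, which serves as a baseline for it.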

In [28]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [19]:
# scale the train and test variables:
X_train_scaled = StandardScaler().fit_transform(X_train_withconstant)
X_test_scaled = StandardScaler().fit_transform(X_test_withconstant)
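
One point to be aware of: fitting a second `StandardScaler` on the test set scales it with its own mean and standard deviation rather than the training set's. A common alternative, sketched here on made-up arrays (the `_demo` names are hypothetical), is to fit the scaler once on the training data and reuse it:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented arrays standing in for X_train and X_test.
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[2.0], [4.0]])

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train_demo)

X_train_demo_scaled = scaler.transform(X_train_demo)
# Reuse the training statistics so both sets share one scale.
X_test_demo_scaled = scaler.transform(X_test_demo)

print(scaler.mean_, X_test_demo_scaled.ravel())
```

Scaling also turns the constant column added by `sm.add_constant` into zeros. `LinearRegression` fits its own intercept by default, so the unscaled `X_train` and `X_test` could be passed to the scaler instead.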

In [20]:
# fit the model:
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train)

LinearRegression()

In [21]:
# Evaluate model
y_pred = linreg.predict(X_test_scaled)

In [23]:
mse = mean_squared_error(y_test, y_pred)

In [29]:
r2 = r2_score(y_test, y_pred)

In [30]:
r2

0.70855038540672
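
The cells above fit an unpenalised `LinearRegression`. A sketch of how the L1 penalty mentioned earlier could be applied with scikit-learn's `Lasso`, on synthetic data; the `alpha` value is an arbitrary choice for illustration, not a tuned setting, and the `_demo` names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Scale inside the pipeline so the penalty treats all features alike.
lasso_demo = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
lasso_demo.fit(X_demo, y_demo)

# The L1 penalty shrinks uninformative coefficients to exactly zero.
coefs = lasso_demo.named_steps["lasso"].coef_
print(np.round(coefs, 2))
```

Features whose coefficients stay non-zero are the ones the penalised model treats as influential. On the real data, `alpha` would normally be chosen by cross-validation, for example with `LassoCV`.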