# INFO370 Problem Set 6: Linear Regression

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.formula.api as smf

## 1 Data description (15pt)

#### 1. Load the data airbnb-seattle-listings-train.csv. Broadly describe the variables you see, their encoding, and discuss if these may be valuable in determining the price. For instance, you may want to tell that house_rules is text, and you may want to check if smoking allowed/not allowed is related to the price.

In [2]:
listings = pd.read_csv('airbnb-seattle-listings-train.csv', sep='\t')

After reading in the data, we found 106 variables for each AirBnB listing. Some of these, such as 'id' and 'listing_url', are only useful as backend identifiers, while others, such as 'description', 'review_scores', 'weekly_price', and 'monthly_price', may help determine price. One example of a useful variable is 'review_scores_cleanliness', a float that rates how clean the listing is. A high cleanliness score may allow the host to market the listing at a higher price.
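
A description like this can be backed up by tabulating each column's dtype, share of missing values, and number of distinct values. The sketch below is only an illustration: `describe_columns` is a hypothetical helper name (not a pandas function), and the toy frame holds made-up values rather than the real listings.

```python
import numpy as np
import pandas as pd

def describe_columns(df):
    """Return one row per column: dtype, percent missing, number of unique values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'pct_missing': df.isnull().mean() * 100,
        'n_unique': df.nunique(),
    })

# Made-up toy data standing in for a few listing columns (not real values).
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'house_rules': ['No smoking', None, 'Smoking allowed', 'No smoking'],
    'review_scores_cleanliness': [9.0, 10.0, np.nan, 8.0],
})
overview = describe_columns(toy)
print(overview)
```

On the real data, `describe_columns(listings)` would give one row for each of the 106 variables.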

#### 2. Consider how will you handle missing data. For instance, 95% of the 'square feet' observations are missing, 17% of 'security deposit' observations are missing. You lose too many observations if you just ignore those.

In [3]:
# Drop mostly-missing or irrelevant columns in a single call.
listings = listings.drop([
    "square_feet", "thumbnail_url", "medium_url", "xl_picture_url",
    "host_acceptance_rate", "license", "host_about",
    "weekly_price", "monthly_price",
], axis=1)

Our group decided to delete variables that are missing more than 20% of their values and do not seem useful to the project. 15 of the 106 variables are missing this much data, and we decided that 9 of these are unnecessary. We dropped 'square_feet' because it was missing 95% of its values, and we dropped the URL variables because they would not be useful for our analysis. We also decided that weekly and monthly price were unnecessary because we already have the daily price, which is the more common way of quoting an AirBnB price.
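
The 20% rule can be checked programmatically instead of by eye. This is a minimal sketch, assuming the threshold applies to the share of `NaN` values per column; `columns_above_missing` is a hypothetical helper name and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def columns_above_missing(df, threshold_pct=20.0):
    """Return the percent missing for columns that exceed threshold_pct."""
    pct_missing = df.isnull().mean() * 100
    return pct_missing[pct_missing > threshold_pct].sort_values(ascending=False)

# Made-up toy data: 8 rows, with square_feet mostly missing.
toy = pd.DataFrame({
    'square_feet': [np.nan] * 7 + [800.0],
    'security_deposit': [np.nan, 100.0, 200.0, 0.0, 300.0, 500.0, 150.0, 250.0],
    'price': [100.0, 150.0, 90.0, 120.0, 300.0, 80.0, 95.0, 110.0],
})
sparse = columns_above_missing(toy)
print(sparse)

# Drop every flagged column in one call.
trimmed = toy.drop(sparse.index, axis=1)
```

This only flags candidates. As described above, the decision to drop a flagged column still depends on whether it seems useful for the analysis.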

#### 3. Consider which variables you are going to use below. For all of these, create a summary table that contains relevant summary information. In particular pay attention to the missing values. Note that missings may not just be coded as such, they may also be empty strings and values like 'N/A.' You may return to this point repeatedly as you develop your model.

The variables that we chose were beds, security deposit, zipcode, and score rating.
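
Of these four, zipcode is the only categorical variable, so it has to be encoded before it can enter a linear regression. A minimal sketch, assuming one-hot encoding with `pd.get_dummies` is acceptable here; the toy frame and its values are illustrative, not taken from the listings data.

```python
import pandas as pd

# Made-up toy data: zipcodes are stored as strings, as in the listings file.
toy = pd.DataFrame({
    'beds': [1.0, 2.0, 1.0, 3.0],
    'zipcode': ['98122', '98103', '98122', '98105'],
})

# drop_first=True leaves one zipcode out as the reference category,
# which avoids perfect collinearity with the model intercept.
zip_dummies = pd.get_dummies(toy['zipcode'], prefix='zip', drop_first=True, dtype=float)

# Combine the numeric predictor with the dummy columns into one design matrix.
X_with_zip = pd.concat([toy[['beds']], zip_dummies], axis=1)
print(X_with_zip)
```

Each dummy coefficient would then be read as the price difference relative to the omitted zipcode, holding the other predictors fixed.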

In [5]:
listings['security'] = listings.security_deposit

In [25]:
# Strip the leading '$' and the thousands separators, then convert to float.
# Missing deposits stay NaN. Assigning the whole column at once avoids the
# SettingWithCopyWarning raised by element-wise chained assignment.
listings['security'] = (
    listings.security_deposit.str[1:]
    .str.replace(',', '', regex=False)
    .astype(float)
)


In [26]:
# Fill missing deposits with the median deposit.
median_deposit = listings.security.median()
listings.security = listings.security.fillna(value = median_deposit)
listings.security.head(10)

0    500.0
1    120.0
2    100.0
3      0.0
4    300.0
5    300.0
6    200.0
7    300.0
8    350.0
9    200.0
Name: security, dtype: float64

In [28]:
listings.security.describe()

count    7540.000000
mean      249.131432
std       354.117086
min         0.000000
25%         0.000000
50%       200.000000
75%       300.000000
max      5000.000000
Name: security, dtype: float64

In [29]:
listings.security.mode()

0    0.0
dtype: float64

In [61]:
# Fill missing bed counts with 1, the most common value.
listings.beds = listings.beds.fillna(value = 1.0)

In [31]:
listings.beds.describe()

count    7537.000000
mean        1.908850
std         1.561733
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        49.000000
Name: beds, dtype: float64

In [32]:
listings.beds.mode()

0    1.0
dtype: float64

In [33]:
listings['zipcode'].mode()

0    98122
dtype: object

In [66]:
review_median = listings.review_scores_rating.median()
listings.review_scores_rating = listings.review_scores_rating.fillna(value = review_median)

In [67]:
listings['review_scores_rating'].mode()

0    97.0
dtype: float64

The variables that we chose were Security Deposit Value, Number of Beds, Zipcode, and the Review Score Rating. We chose these because we believe that they are all useful in representing individual AirBnBs.<br/><br/>
We did not need to do much cleaning because we chose variables that were both representative of things we consider important for price (such as location, area, and rating) and fairly clean.
Most of the cleaning involved resolving NaN values to something usable and converting strings to numeric values (e.g., for security deposits: stripping the dollar sign, converting to a float, and filling NaN with the median deposit of $200).
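
The cleaning steps described above can be collected into one small function. This is a sketch under assumptions: `parse_currency` is a hypothetical helper name, the input strings are made up, and `errors='coerce'` is used so that placeholders such as 'N/A' or empty strings also become NaN.

```python
import pandas as pd

def parse_currency(series):
    """Convert strings such as '$1,200.00' to floats; unparsable values become NaN."""
    cleaned = (
        series.str.replace('$', '', regex=False)
        .str.replace(',', '', regex=False)
    )
    # errors='coerce' maps anything that is not a number (e.g. 'N/A', '') to NaN.
    return pd.to_numeric(cleaned, errors='coerce')

# Made-up examples, including a true missing value and an 'N/A' placeholder.
raw = pd.Series(['$500.00', '$1,200.00', None, 'N/A', '$0.00'])
deposit = parse_currency(raw)

# Median imputation, as used for the security deposit.
deposit_filled = deposit.fillna(deposit.median())
print(deposit_filled)
```

Coercing silently hides any value that fails to parse, so it is worth counting the NaN values before and after conversion.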

## 2 Model (60pt)

#### 1. Either split your data into training and validation sets, or just use cross validation below

In [153]:
import sklearn.linear_model as slm
import sklearn.model_selection
from sklearn.model_selection import train_test_split

In [37]:
# Strip the leading '$' and the thousands separators from the price strings.
# Assigning the whole column at once avoids the SettingWithCopyWarning.
listings['price'] = listings.price.str[1:].str.replace(',', '', regex=False)


In [38]:
listings.price = listings.price.astype('float')

In [70]:
# beds, security deposit, and review score rating
X = np.stack((listings.beds.values, listings.security.values, listings.review_scores_rating.values), axis = 1)
y = listings.price

In [104]:
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size = 0.2)

In [130]:
Xtr.shape

(6032, 3)

In [127]:
# Underfit or Overfit?
m = slm.LinearRegression().fit(Xtr, ytr)
m.intercept_, m.coef_

(274.49405151631765, array([27.19866135,  0.05092941, -1.78257493]))

#### 2. Develop the models. Report all the variables and how do you clean/encode those. While the exact details are visible in the code, explain the broad choices in text.

In [148]:
yhattr = m.predict(Xtr)

In [149]:
yhatte = m.predict(Xte)
yhatte

array([242.41574747, 135.91324418, 142.27919313, ..., 138.96882686,
       127.00036952, 137.18625193])

In [150]:
rmsetr = np.sqrt(np.mean((yhattr - ytr) ** 2))
rmsetr

193.31229089942624

In [151]:
rmsete = np.sqrt(np.mean((yhatte - yte) ** 2))
rmsete

168.3800820956932

In [156]:
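# R2 via cross-validation. Note that cross_val_score refits a fresh copy of the
# model on folds of the held-out data; it does not score the fit from above.
sklearn.model_selection.cross_val_score(m, Xte, yte, scoring = 'r2')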



array([0.05128614, 0.06958619, 0.08534699])

In [157]:
#Coefficients
m.coef_

array([27.19866135,  0.05092941, -1.78257493])

We fit a linear model of price on beds, security deposit, and review score rating for each AirBnB. The training RMSE (193.3) and the validation RMSE (168.4) are similar in size, and both are large. Because the validation error is no worse than the training error, the model does not appear to be overfit; because both errors are far from zero, the model is probably underfit. To fix this, we must add more variables to our model.
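
Whether an RMSE counts as large depends on the scale of the prices. One rough check, sketched below with made-up numbers, is to compare the model's validation RMSE with that of a baseline that always predicts the training mean. The helper names `rmse` and `baseline_rmse` are ours and the arrays are not the notebook's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def baseline_rmse(y_train, y_valid):
    """RMSE on y_valid when every prediction is the mean of y_train."""
    y_valid = np.asarray(y_valid, dtype=float)
    constant_pred = np.full(len(y_valid), np.mean(y_train))
    return rmse(y_valid, constant_pred)

# Made-up prices and predictions, for illustration only.
y_train = np.array([100.0, 200.0, 300.0])
y_valid = np.array([150.0, 250.0])
model_pred = np.array([160.0, 240.0])

model_error = rmse(y_valid, model_pred)
mean_error = baseline_rmse(y_train, y_valid)
print(model_error, mean_error)
```

A model whose validation RMSE is barely below the baseline explains very little of the variation in price, which is another sign of underfitting.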

#### 3. Report the final number of observations, the estimated coefficient values, adjusted R2, and RMSE on validation data for three models:

(a) a simple one that only contains a few most important variables/best predictors. What do you think are 2-3 best predictors in the data?

In our above model, we had 6032 observations in the training set. The estimated coefficients were 27.2 for beds, 0.051 for security deposit, and -1.78 for review score rating. On the validation data, our RMSE was 168.38, and our R2 was 0.0512 on the first cross-validation fold (0.051 to 0.085 across the three folds).
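
The question asks for adjusted R2, which penalizes plain R2 for the number of predictors: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch of that formula follows; `adjusted_r2` is a hypothetical helper name and the inputs are illustrative, not results from this notebook.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Illustrative inputs only: R2 = 0.5 from 101 observations and 3 predictors.
example = adjusted_r2(0.5, n_obs=101, n_predictors=3)
print(example)
```

With thousands of observations and only three predictors the adjustment is tiny, so adjusted R2 will sit just below plain R2. A model fitted through `smf.ols(...).fit()` also exposes this value directly as `rsquared_adj`.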

(b) the full model: everything you consider useful.

(c) something in between.

#### 4. Interpret the coefficients of the reported models. Again, only interpret the most interesting/important ones, not all of those! Do the coefficient values differ between the models? Can you explain why?

## 3 Think (15pt)

#### 1. does your model do a good job in predicting the prices?

#### 2. how will your model be useful to

#### (a) AirBnB hosts

#### (b) AirBnB customers

#### 3. Did you include any other price-related variables, such as "weekly price" or "security deposit" in your model? Do you think it is a good idea to use these attributes while trying to predict price?

ANSWER THIS ONE NOW

#### 4. Do you think this model can be used by Airbnb itself or the government?

#### 5. Do you see any ethical issues with this work?

In [None]:
# Price is the outcome we want to predict, so it belongs on the left-hand side.
m = smf.ols(formula = 'price ~ beds', data = listings).fit()
m.summary()

In [103]:
# Count the columns with more than 20% missing values.
(listings.isnull().mean() * 100 > 20).value_counts()

False    92
True      6
dtype: int64