## Key Takeaways from Initial Data Analysis
- There doesn't seem to be any pattern in the missing bathroom/bedroom data. I believe it was simply never entered into the Airbnb listing. Therefore, I chose to delete those listings.
- Almost all nulls (15,827 out of 15,835) in "last_review" and "first_review" occur because the listing has no reviews. These columns are dates. I cut them, since "host_since" has less missing data and serves essentially the same purpose: indicating how long the Airbnb has been around.
- All null "review_scores_rating" values are caused by listings having no reviews. Therefore, I've inserted a "no past ratings" category into the column.
- Almost all of the listings include a thumbnail image. Analyzing how thumbnail images affect price could make a cool project, but it is most likely not useful for ours.
- Created new columns based on text extracted from the "amenities" column.
- For the 180 NaNs in "host_has_profile_pic", I decided to assume there was no image and set the value to false.
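
The null checks behind these takeaways can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (`toy` and its values are made up for the example, not taken from `train.csv`): it counts nulls per column, then checks how many missing ratings coincide with listings that have zero reviews.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12, 5],
    "review_scores_rating": [np.nan, 95.0, np.nan, 88.0, np.nan],
    "first_review": [None, "2016-01-02", None, "2015-07-19", "2017-03-01"],
    "bathrooms": [1.0, np.nan, 2.0, 1.0, 1.5],
})

# Nulls per column, largest first
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)

# For one column: how many of its nulls coincide with zero reviews?
missing_rating = toy["review_scores_rating"].isnull()
no_reviews = toy["number_of_reviews"] == 0
explained = int((missing_rating & no_reviews).sum())
total_missing = int(missing_rating.sum())
print(f"{explained} of {total_missing} missing ratings have no reviews")
```

If the two counts match, the nulls are fully explained by the absence of reviews; any gap points to rows that need separate handling.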

In [10]:
# Import Statements
from datetime import datetime

import numpy as np
import pandas as pd
import requests
from PIL import Image

In [2]:
# Import files
train = pd.read_csv("archive-2/train.csv", index_col='id')
test = pd.read_csv("archive-2/test.csv", index_col='id')

In [13]:
train_before = train.shape

In [14]:
test_before = test.shape

In [15]:
def cleaned_dataframe(df):
    """
    1. Adds feature columns to df
    2. Deals with all null values
    3. Turns ratings into categorical column
    4. Converts "Host_since" into a measure of time
    """ 
    features = ['Wireless Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Family/kid friendly', 'Essentials', 'Hair dryer', 'Iron',
                'Smoke detector', 'Shampoo', 'Hangers', 'Fire extinguisher', 'Laptop friendly workspace', 'First aid kit', 'Indoor fireplace',
                'TV', 'Cable TV', 'Elevator in building']

    # create one 0/1 indicator column per amenity
    # regex=False matches each name as a literal substring; note that 'TV' also matches 'Cable TV'
    for item in features:
        df[item] = np.where(df['amenities'].str.contains(item, regex=False), 1, 0)
        
        
    # drop unnecessary columns
    # neighbourhood is already captured by zipcode and latitude/longitude
    df.drop(columns=['amenities', 'first_review', 'last_review', 'host_response_rate', 'neighbourhood'], inplace=True)

    # drop rows with null values in the room-count columns
    df = df.dropna(axis=0, subset=['bathrooms', 'bedrooms', 'beds'])

    # drop rows with no host information or no zip code
    df = df.dropna(axis=0, subset=['host_since', 'host_identity_verified', 'zipcode'])
    
    # Dealing with ratings column
    # Zero isn't a real rating in the columns
    # Temporarily assign rating_score with no previous reviews as 0 so it can later make it a category
    df['review_scores_rating']=np.where(df['number_of_reviews']==0, 0, df['review_scores_rating'])
    
    # drop remaining 800 rows with no values
    df = df.dropna(axis=0, subset=['review_scores_rating'])
    
    # change reviews into categories
    df['review_scores_rating'] = df['review_scores_rating'].round(-1).astype('int').astype('str')
    
    # reassign 0 ratings as "no past ratings" category
    df['review_scores_rating']=np.where(df['number_of_reviews']==0, 'no past ratings', df['review_scores_rating'])
    
    # Convert "host_since" into the number of days an individual has been a host
    # (one vectorized assignment instead of writing row by row through chained .iloc indexing)
    host_start = pd.to_datetime(df['host_since'], format='%Y-%m-%d')
    df['host_since'] = (pd.Timestamp.today() - host_start).dt.days.astype('int')
    
    
    #drop columns with low correlation
    df.drop(columns=['latitude', 'longitude', 'Smoke detector', 'number_of_reviews', 'Hangers','First aid kit', 'Elevator in building', 'Essentials', 'zipcode', 'thumbnail_url', 'description', 'name'], axis=1, inplace=True)
    return df

In [16]:
train = cleaned_dataframe(train)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
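
The message above is the tail of pandas' `SettingWithCopyWarning`. It is typically raised when a value is written through chained indexing (for example `df['col'].iloc[i] = value`) or into a frame that pandas treats as derived from another one. A minimal sketch of a pattern that sidesteps it, on hypothetical toy data: take an explicit `.copy()` after filtering and assign the whole column in one vectorized step. The frame `toy`, the column `host_days`, and the fixed reference date are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; the real frame has many more columns
toy = pd.DataFrame({
    "host_since": ["2015-01-01", None, "2016-06-15"],
    "beds": [1.0, 2.0, 1.0],
})

# An explicit copy makes it unambiguous that we own the filtered frame
clean = toy.dropna(subset=["host_since"]).copy()

# Fixed reference date (assumption) instead of today's date, for reproducibility
reference = pd.Timestamp("2021-01-01")

# One vectorized assignment instead of a row-by-row .iloc loop
started = pd.to_datetime(clean["host_since"], format="%Y-%m-%d")
clean["host_days"] = (reference - started).dt.days

print(clean)
```

Using a fixed reference date also keeps the feature stable between runs, whereas a value based on today's date changes every day the notebook is re-executed.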


In [17]:
test = cleaned_dataframe(test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [18]:
# Statement on cut data
train_after = train.shape
lost_rows = train_before[0]-train_after[0]
print("We cut", lost_rows, "data points when cleaning data leaving",train_after[0],"data points.")

We cut 2353 data points when cleaning data leaving 71758 data points.


In [20]:
# Each listing has a thumbnail image URL; this helper displays it
# (run before cleaning, since cleaned_dataframe drops 'thumbnail_url')

# def image_look(pick_a_row):
#     url = train['thumbnail_url'].iloc[pick_a_row]
#     im = Image.open(requests.get(url, stream=True).raw)
#     return im

# image_look(12304)

In [None]:
# Export CSV Files

In [None]:
train.to_csv('train_2.csv')

In [None]:
test.to_csv('test_2.csv')

## The Models

In [21]:
train.head(1)

id,log_price,property_type,room_type,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,host_has_profile_pic,...,Heating,Family/kid friendly,Hair dryer,Iron,Shampoo,Fire extinguisher,Laptop friendly workspace,Indoor fireplace,TV,Cable TV
6901257,5.010635,Apartment,Entire home/apt,3,1.0,Real Bed,strict,True,NYC,t,...,1,1,1,1,0,0,0,0,0,0


In [22]:
train.dtypes

log_price                    float64
property_type                 object
room_type                     object
accommodates                   int64
bathrooms                    float64
bed_type                      object
cancellation_policy           object
cleaning_fee                    bool
city                          object
host_has_profile_pic          object
host_identity_verified        object
host_since                     int64
instant_bookable              object
review_scores_rating          object
bedrooms                     float64
beds                         float64
Wireless Internet              int64
Air conditioning               int64
Kitchen                        int64
Heating                        int64
Family/kid friendly            int64
Hair dryer                     int64
Iron                           int64
Shampoo                        int64
Fire extinguisher              int64
Laptop friendly workspace      int64
Indoor fireplace               int64
T

In [None]:
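# correlation is only defined for numeric (and boolean) columns, so select those first
train.select_dtypes(include=['number', 'bool']).corr()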

In [None]:
train['review_scores_rating']

In [None]:
import seaborn as sns
#sns.pairplot(train)

In [None]:
test.columns

In [None]:
test.dtypes

In [23]:
# IMPORTS FOR MODEL
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

In [24]:
test.head(1)

id,property_type,room_type,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,host_has_profile_pic,host_identity_verified,...,Heating,Family/kid friendly,Hair dryer,Iron,Shampoo,Fire extinguisher,Laptop friendly workspace,Indoor fireplace,TV,Cable TV
3895911,Apartment,Private room,2,1.0,Real Bed,flexible,True,LA,t,f,...,1,0,1,1,1,1,1,0,1,1


In [25]:
# Split into X, y, train/test

target = 'log_price'
y = train[target]
X = train.drop(columns=target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [26]:
# Baseline: predict the mean of y_train for every listing
y_mean = [y_train.mean()] * len(y_train)

baseline_mae = mean_absolute_error(y_train, y_mean)

print('Baseline MAE:', baseline_mae)

Baseline MAE: 0.5596581463316458
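
Because the target is `log_price`, this MAE is an average error in log units rather than in currency. Assuming the log is natural (an assumption; the notebook does not state the base), exponentiating the MAE gives the typical multiplicative factor by which a prediction misses the actual price. A small sketch of that conversion:

```python
import math

baseline_mae = 0.5596581463316458  # value printed above

# exp(MAE) is the typical (geometric-mean) ratio between predicted and
# actual price, assuming log_price is a natural logarithm
typical_factor = math.exp(baseline_mae)
print(f"Baseline predictions miss by a factor of about {typical_factor:.2f}")
```

The same conversion can be applied to each model's validation MAE to compare them in more intuitive terms.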


In [27]:
# Categorical columns to OneHotEncode
cat = X_train.select_dtypes(include=['object', 'bool']).columns

# Integer/Float Columns to Scale
num = X_train.select_dtypes(include=['int', 'float']).columns

In [28]:
# sanity check
assert len(cat) + len(num) == train.shape[1]-1, 'Column categories don\'t equal df columns'

In [36]:
model_lr = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LinearRegression()
)

model_lr.fit(X_train, y_train);

In [37]:
train_mae = mean_absolute_error(y_train, model_lr.predict(X_train))
test_mae = mean_absolute_error(y_test, model_lr.predict(X_test))
score = model_lr.score(X_train, y_train)

print('Linear Regression Model')
print('Training MAE:', train_mae)
print('Validation MAE:', test_mae)
print('R^2 Score:', score)

Linear Regression Model
Training MAE: 0.32850936553567783
Validation MAE: 0.3515381903885942
R^2 Score: 0.6297182783267588


## Linear Regression

In [38]:
# define model
model = LinearRegression()
# define transform
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat),
                                             ('num', StandardScaler(), num)], remainder='passthrough')
# define pipeline
pipeline = Pipeline(steps=[('t', transformer), ('m',model)])
# fit the pipeline on the transformed data
pipeline.fit(X_train, y_train)
# make predictions
yhat = pipeline.predict(X_test)

In [39]:
# Linear Regression Model (ColumnTransformer pipeline)
train_mae = mean_absolute_error(y_train, pipeline.predict(X_train))
test_mae = mean_absolute_error(y_test, pipeline.predict(X_test))
score = pipeline.score(X_train, y_train)

print('Linear Regression Model')
print('Training MAE:', train_mae)
print('Validation MAE:', test_mae)
print('R^2 Score:', score)

Linear Regression Model
Training MAE: 0.34579558840364066
Validation MAE: 2121766.133196058
R^2 Score: 0.5910491227649668


## Gradient Boosting

In [40]:
# define model
model = GradientBoostingRegressor()
# define transform
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat),
                                             ('num', StandardScaler(), num)], remainder='passthrough')
# define pipeline
pipeline = Pipeline(steps=[('t', transformer), ('m',model)])
# fit the pipeline on the transformed data
pipeline.fit(X_train, y_train)
# make predictions
yhat = pipeline.predict(X_test)

In [41]:
g_train_mae = mean_absolute_error(y_train, pipeline.predict(X_train))
g_test_mae = mean_absolute_error(y_test, pipeline.predict(X_test))
g_score = pipeline.score(X_train, y_train)

print('Gradient Boosting Model')
print('Training MAE:', g_train_mae)
print('Validation MAE:', g_test_mae)
print('R^2 Score:', g_score)

Gradient Boosting Model
Training MAE: 0.33190747927388253
Validation MAE: 0.33467449801991134
R^2 Score: 0.6231389218388331


## Random Forest Regressor

In [53]:
# define model
model = RandomForestRegressor(n_estimators=200,
                              max_depth=20,
)

# define transform
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat),
                                             ('num', StandardScaler(), num)], remainder='passthrough')
# define pipeline
pipeline = Pipeline(steps=[('t', transformer), ('m',model)])
# fit the pipeline on the transformed data
pipeline.fit(X_train, y_train)
# make predictions
yhat = pipeline.predict(X_test)

In [54]:
r_train_mae = mean_absolute_error(y_train, pipeline.predict(X_train))
r_test_mae = mean_absolute_error(y_test, pipeline.predict(X_test))
r_score = pipeline.score(X_train, y_train)
r_score_t = pipeline.score(X_test, y_test)

print('Random Forest Regression Model')
print('Training MAE:', r_train_mae)
print('Validation MAE:', r_test_mae)
print('Training R^2:', r_score)
print('Validation R^2:', r_score_t)

Random Forest Regression Model
Training MAE: 0.18631484160120906
Validation MAE: 0.3274487835151864
Training R^2: 0.8833031167792813
Validation R^2: 0.6260824947154131


In [None]:
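from sklearn.model_selection import cross_val_score

# f1_macro is a classification metric; for a regressor, score with (negated) MAE instead
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')

In [None]:
# scikit-learn negates error metrics so that higher is better; flip the sign back
-scores

In [None]:
print("%0.2f MAE with a standard deviation of %0.2f" % (-scores.mean(), scores.std()))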

In [26]:
model_rr = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    RandomForestRegressor(),
)

model_rr.fit(X_train, y_train);