<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-We-Begin,-Remember-DS-Life-Cycle" data-toc-modified-id="Before-We-Begin,-Remember-DS-Life-Cycle-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before We Begin, Remember DS Life Cycle</a></span><ul class="toc-item"><li><span><a href="#Other-frameworks-to-consider" data-toc-modified-id="Other-frameworks-to-consider-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Other frameworks to consider</a></span></li></ul></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preparation</a></span><ul class="toc-item"><li><span><a href="#Convert-them-into-NumPy-Arrays" data-toc-modified-id="Convert-them-into-NumPy-Arrays-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Convert them into NumPy Arrays</a></span></li></ul></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature Engineering</a></span></li><li><span><a href="#Model-the-Data" data-toc-modified-id="Model-the-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model the Data</a></span></li><li><span><a href="#Evaluate-the-Model" data-toc-modified-id="Evaluate-the-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluate the Model</a></span><ul class="toc-item"><li><span><a href="#k-fold-validation" data-toc-modified-id="k-fold-validation-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>k-fold validation</a></span></li></ul></li><li><span><a href="#Note-to-Save-Your-Models!" 
data-toc-modified-id="Note-to-Save-Your-Models!-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Note to Save Your Models!</a></span><ul class="toc-item"><li><span><a href="#pickle" data-toc-modified-id="pickle-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span><code>pickle</code></a></span></li><li><span><a href="#joblib" data-toc-modified-id="joblib-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span><code>joblib</code></a></span></li></ul></li></ul></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

Data: https://www.kaggle.com/c/fis-pt012120-mod2-project-warmup

# Before We Begin, Remember DS Life Cycle 

![](https://lh3.googleusercontent.com/proxy/PYozaN6A4m2uZ4D3uWR0ORx1mi4qUq7FXb3UM8ybEYXkorNGsAf22cXaTUZ6vQpmzVMfokPMABo_NiFjl21xyx1wWIM0q7OoqrCStK4L5LnW-WHy4upFr-w60KebsxKKyJ4avYfXWRyMGxdWlYsjd2sBihqEfa6mcg)

## Other frameworks to consider

> In the future, we will talk about specific frameworks like CRISP-DM **(CRoss-Industry Standard Process for Data Mining)** and OSEMN **(Obtain, Scrub, Explore, Model, and iNterpret)**.

![](images/crisp-dm.png)

![](images/osemn.png)

# Data Preparation

In [None]:
my_data = pd.read_csv('data/kaggle_comp_ny_airbnb/train.csv')
my_data.head()

> https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn-datasets-fetch-california-housing

In [None]:
# Separate the target (price, the last column) from the other features
X_df = my_data.iloc[:,:-1]
display(X_df.head())

y = my_data.iloc[:,-1]
display(y.head())

In [None]:
# Split your data into train-test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=27)

## Convert them into NumPy Arrays

In [None]:
# #I like numpy arrays for this (we'll do it for just the y-target here)
# y = y.values
# display(y)

In [None]:
# # Reshape so that each row holds a single value
# y = y.reshape(-1,1)
# display(y)

In [None]:
# # Do the same for the X features
# X = X_df.values

# Feature Engineering

> We'd normally go back and forth between modeling and trying different features. You can think of this stage as the in-between of exploration and modeling.

In [None]:
# I'm going to make this simple and get rid of any non-numerical data
X_train.head()

In [None]:
X_train.columns

In [None]:
def get_numerical_features(feature_dataframe):
    '''
    Gets the numerical feature data from the original loaded data (expects
    the columns listed below to be present by name)
    '''
    # Keeps only these columns; drops e.g. 'id', 'host_name', 'room_type', 'last_review'
    columns_to_keep = ['host_id', 'latitude', 'longitude',
       'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365']
    return feature_dataframe[columns_to_keep]

In [None]:
X_train_numerics = get_numerical_features(X_train)
X_train_numerics.head()

In [None]:
# Check which columns have null values that we'll need to fill in
X_train_numerics.info()

In [None]:
# I'm lazy, let's just use the median values
def fill_null_values(feature_dataframe, train_dataframe):
    '''
    Fill in the null values with the median from the training data.
    '''
    # Compute the medians from the training data that was passed in,
    # not from a global variable
    values_to_fill = {
        col: train_dataframe[col].median()
        for col in train_dataframe.columns
    }
    return feature_dataframe.fillna(value=values_to_fill)

In [None]:
X_train_numerics_filled = fill_null_values(X_train_numerics, X_train_numerics)
X_train_numerics_filled.head()
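
Because the fill values come from the training data, the same step can be reused on held-out data without peeking at it. The following is a minimal sketch under stated assumptions: `fill_with_train_medians`, `train_toy`, and `test_toy` are made-up names and values for illustration, not part of the Airbnb data.

```python
import numpy as np
import pandas as pd

def fill_with_train_medians(feature_dataframe, train_dataframe):
    """Fill nulls in feature_dataframe with per-column medians of train_dataframe."""
    medians = train_dataframe.median().to_dict()
    return feature_dataframe.fillna(value=medians)

# Tiny made-up frames standing in for the numeric train/test features
train_toy = pd.DataFrame({
    'reviews_per_month': [1.0, np.nan, 3.0, 5.0],
    'minimum_nights': [1, 2, 2, 30],
})
test_toy = pd.DataFrame({
    'reviews_per_month': [np.nan, 2.0],
    'minimum_nights': [3, 4],
})

# The missing test value is filled with the *training* median
test_toy_filled = fill_with_train_medians(test_toy, train_toy)
print(test_toy_filled)
```

Using the training medians for both sets keeps the preprocessing consistent between what the model was fit on and what it is later asked to predict.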

In [None]:
# Final features
features = X_train_numerics_filled
labels = y_train

# Model the Data

In [None]:
from sklearn.linear_model import LinearRegression

my_model = LinearRegression()
my_model.fit(features, labels)

# Evaluate the Model

In [None]:
from sklearn.metrics import mean_squared_error

# Predicting on the same features we trained on, so this is the training RMSE
predictions = my_model.predict(features)
mse = mean_squared_error(labels, predictions)
rmse = np.sqrt(mse)

print(rmse)
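
For reference, RMSE is the square root of the average squared difference between the true and predicted values, so it is in the same units as the target. A minimal sketch with made-up numbers (`y_true` and `y_pred` are illustrative only) shows that the manual formula matches the `mean_squared_error` route used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up target values and predictions for illustration
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# Manual RMSE: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```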

In [None]:
display(my_model.coef_)
display(my_model.intercept_)


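# One-row table of the coefficients plus the intercept;
# reshape(1, -1) avoids hard-coding the number of columns
pd.DataFrame(
    data=np.append(my_model.coef_, my_model.intercept_).reshape(1, -1),
    columns=list(features.columns) + ['inter']
)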

## k-fold validation

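> We use k-fold cross-validation to estimate how well our model performs using just the "training set". The data is split into k folds; each fold takes one turn as the validation set while the model is fit on the remaining folds, which effectively creates a new train-validation split for each fold. We can use the resulting RMSE values to compare different models or different variations of our models.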
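
To make the idea concrete, here is a rough sketch of the loop that k-fold validation performs, written out by hand with `KFold`. The data (`X_toy`, `y_toy`) is randomly generated for illustration and is an assumption, not the Airbnb data. For a regressor with an integer `cv`, `cross_val_score` uses an unshuffled `KFold` split like this one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(27)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

kfold = KFold(n_splits=8)
fold_rmse = []
for train_idx, val_idx in kfold.split(X_toy):
    # Fit a fresh model on the training folds only
    fold_model = LinearRegression()
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])

    # Score it on the one fold that was held out
    val_predictions = fold_model.predict(X_toy[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_toy[val_idx], val_predictions)))

fold_rmse = np.array(fold_rmse)
print(f'Mean: {fold_rmse.mean()}')
print(f'Std Dev: {fold_rmse.std()}')
```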

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(
            my_model, 
            features,
            labels,
            cv=8,
            scoring="neg_mean_squared_error"
)

rmse_scores = np.sqrt(-scores)

In [None]:
display(rmse_scores)

print(f'Mean: {rmse_scores.mean()}')
print(f'Std Dev: {rmse_scores.std()}')

# Note to Save Your Models!

<img src='images/save_the_models.png' height=60%/>

## `pickle`

In [None]:
import pickle

# Use a context manager so the file is closed after writing
with open('my_model_pickle.pkl', 'wb') as model_file:
    pickle.dump(my_model, model_file)

In [None]:
# Load the model from earlier
with open('my_model_pickle.pkl', 'rb') as model_file:
    model_loaded = pickle.load(model_file)

## `joblib`

In [None]:
import joblib

joblib.dump(my_model, "my_model.pkl")

In [None]:
# Load the model from earlier
my_model_loaded = joblib.load("my_model.pkl")
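
A quick sanity check after saving is to confirm that the reloaded model predicts the same values as the original. This is a minimal sketch on a toy model; `toy_model`, the data, and the temporary file path are all made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 (illustrative only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary directory so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, model_path)
    toy_model_loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(
    toy_model.predict(X_toy), toy_model_loaded.predict(X_toy)
)
print(same_predictions)
```

Keep in mind that both `pickle` and `joblib` files should only be loaded from sources you trust, since loading can execute arbitrary code.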