<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-We-Begin,-Remember-DS-Life-Cycle" data-toc-modified-id="Before-We-Begin,-Remember-DS-Life-Cycle-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before We Begin, Remember DS Life Cycle</a></span><ul class="toc-item"><li><span><a href="#Other-frameworks-to-consider" data-toc-modified-id="Other-frameworks-to-consider-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Other frameworks to consider</a></span></li></ul></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature Engineering</a></span></li><li><span><a href="#Model-the-Data" data-toc-modified-id="Model-the-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model the Data</a></span></li><li><span><a href="#Evaluate-the-Model" data-toc-modified-id="Evaluate-the-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluate the Model</a></span><ul class="toc-item"><li><span><a href="#k-fold-validation" data-toc-modified-id="k-fold-validation-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>k-fold validation</a></span></li></ul></li><li><span><a href="#Note-to-Save-Your-Models!" data-toc-modified-id="Note-to-Save-Your-Models!-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Note to Save Your Models!</a></span><ul class="toc-item"><li><span><a href="#pickle" data-toc-modified-id="pickle-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span><code>pickle</code></a></span></li><li><span><a href="#joblib" data-toc-modified-id="joblib-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span><code>joblib</code></a></span></li></ul></li></ul></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

# Before We Begin, Remember DS Life Cycle 

![](https://lh3.googleusercontent.com/proxy/PYozaN6A4m2uZ4D3uWR0ORx1mi4qUq7FXb3UM8ybEYXkorNGsAf22cXaTUZ6vQpmzVMfokPMABo_NiFjl21xyx1wWIM0q7OoqrCStK4L5LnW-WHy4upFr-w60KebsxKKyJ4avYfXWRyMGxdWlYsjd2sBihqEfa6mcg)

## Other frameworks to consider

> In the future, we will talk about specific frameworks like CRISP-DM **(CRoss-Industry Standard Process for Data Mining)** and OSEMN **(Obtain, Scrub, Explore, Model, and iNterpret)**.

![](images/crisp-dm.png)

![](images/osemn.png)

# Data Preparation

In [None]:
import sklearn.datasets

my_data = sklearn.datasets.fetch_california_housing()

> https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn-datasets-fetch-california-housing

In [None]:
my_data.keys()

In [None]:
print(my_data.DESCR)

In [None]:
X = my_data.data
y = my_data.target

In [None]:
# Wrap the features in a DataFrame, something we're a little more familiar with
import pandas as pd

df = pd.DataFrame(data=X, columns=my_data.feature_names)
df.head()

In [None]:
# Split your data into train-test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)

# Feature Engineering

> We'd normally go back and forth between modeling and trying different features. You can think of this stage as the in-between of exploration and modeling.

In [None]:
features = X_train
labels = y_train
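
> The cell above passes the raw columns straight through. As a minimal sketch of what a feature-engineering step could look like, the snippet below derives two ratio features from columns that exist in the California housing data (`AveRooms`, `AveBedrms`, `AveOccup`). The `toy` frame, its values, and the new column names `BedrmsPerRoom` and `RoomsPerOccupant` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows using three of the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "AveRooms": [6.0, 5.0, 8.0],
    "AveBedrms": [1.2, 1.0, 2.0],
    "AveOccup": [3.0, 2.5, 4.0],
})

# Derived feature: share of rooms that are bedrooms.
toy["BedrmsPerRoom"] = toy["AveBedrms"] / toy["AveRooms"]

# Derived feature: rooms available per occupant.
toy["RoomsPerOccupant"] = toy["AveRooms"] / toy["AveOccup"]

print(toy)
```

> Whether a derived feature like this helps is something you would check by refitting the model and comparing scores.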

# Model the Data

In [None]:
from sklearn.linear_model import LinearRegression

my_model = LinearRegression()
my_model.fit(features, labels)

# Evaluate the Model

In [None]:
from sklearn.metrics import mean_squared_error

# Predict on the same data the model was fit on, so this is the training RMSE
predictions = my_model.predict(features)
mse = mean_squared_error(labels, predictions)
rmse = np.sqrt(mse)

print(rmse)
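
> To make the metric concrete, here is a small sketch of the same RMSE arithmetic done by hand with NumPy. The arrays `y_true` and `y_pred` are made-up numbers, not values from the housing data.

```python
import numpy as np

# Made-up targets and predictions, only to show the arithmetic.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred            # [1, 0, -2]
mse_by_hand = np.mean(errors ** 2)  # (1 + 0 + 4) / 3
rmse_by_hand = np.sqrt(mse_by_hand)

print(rmse_by_hand)
```

> Because of the square root, RMSE is in the same units as the target, which makes it easier to read than the raw MSE.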

In [None]:
display(my_model.coef_)
display(my_model.intercept_)

# One-row table pairing each coefficient (and the intercept) with its name
pd.DataFrame(
    data=np.append(my_model.coef_, my_model.intercept_).reshape(1, -1),
    columns=list(my_data.feature_names) + ['inter'],
)

## k-fold validation

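> We use k-fold cross-validation to estimate how well our model generalizes using just the training set. The training data is split into k folds; each fold takes one turn as the validation set while the model is fit on the remaining k-1 folds, which effectively creates a new train-validation split for each fold. We can use the resulting RMSE scores to compare different models or different variations of our models.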

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
            my_model, 
            features,
            labels,
            cv=8,
            scoring="neg_mean_squared_error"
)

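# scikit-learn reports negative MSE (so that higher is better); flip the sign before the square root
rmse_scores = np.sqrt(-scores)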

In [None]:
display(rmse_scores)
display(rmse_scores.mean())
display(rmse_scores.std())
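
> To see what `cross_val_score` is doing under the hood, the sketch below writes the fold loop out by hand with `KFold` and checks that it matches the shortcut. It uses synthetic data from `make_regression` as a stand-in for the training set (an assumption, so the cell runs without downloading anything); `X_demo`, `y_demo`, and `fold_model` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set (an assumption, so this runs offline).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=27)

# For a regressor, an integer cv means plain KFold without shuffling.
kfold = KFold(n_splits=8)

fold_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    # Fit a fresh model on 7 folds, then score it on the held-out fold.
    fold_model = LinearRegression()
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    val_pred = fold_model.predict(X_demo[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], val_pred)))
fold_rmse = np.array(fold_rmse)

# The one-call version used in the notebook, for comparison.
shortcut = np.sqrt(-cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=8, scoring="neg_mean_squared_error"
))

print(np.allclose(fold_rmse, shortcut))
```

> The mean of the fold scores summarizes expected error, and the standard deviation shows how much that estimate moves from one split to another.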

# Note to Save Your Models!

<img src='images/save_the_models.png' height=60%/>

## `pickle`

In [None]:
import pickle

# Use a context manager so the file is closed once the model is written
with open('my_model_pickle.pkl', 'wb') as f:
    pickle.dump(my_model, f)

In [None]:
# Load the model from earlier
with open('my_model_pickle.pkl', 'rb') as f:
    model_loaded = pickle.load(f)

## `joblib`

In [None]:
import joblib

joblib.dump(my_model, "my_model.pkl")

In [None]:
# Load the model from earlier
my_model_loaded = joblib.load("my_model.pkl")
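
> After loading, it is worth confirming that the restored model behaves like the original. A minimal sketch of that round-trip check follows. It fits a small model on synthetic data and writes to a temporary directory, both assumptions made so the cell is self-contained; `demo_model` and `demo_model.joblib` are illustrative names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small synthetic problem standing in for the housing data (an assumption).
X_demo, y_demo = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=27)
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save and reload inside a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same_predictions = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

> The same check works for the `pickle` version: compare `my_model.predict(...)` with `model_loaded.predict(...)` on a few rows.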