# Project Part II: Predicting Housing Prices - Build Your Own Model (50 pts)

 

### Grading Scheme

Your grade for the project will be based on your training RMSE and test RMSE. The thresholds are as follows:

Points | 50 | 40 | 30 | 20
--- | --- | --- | --- | ---
Test RMSE | Top 20% | (20%, 40%] | (40%, 70%] | Last 30%


In [105]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [151]:
from proj import *


In [152]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import dump, load

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()
    
    
# Note: we filtered the data in cook_county_contest_data,
# so please use this dataset instead of the old one.

### Note

This notebook is specifically designed to guide you through the process of exporting your model's predictions on the test dataset for submission so you can see how your model performs.

Most of what you did in Project Part I should be transferable here.

## Step 1. Set up all the helper functions for your `create_pipeline` function.

You can do that in `proj.py`.
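As a rough starting point, `create_pipeline` could be built from scikit-learn's `Pipeline` and `ColumnTransformer`. This is only a sketch under assumptions: the column lists `NUMERIC_COLS` and `CATEGORICAL_COLS`, the `"Property Class"` feature, and the choice of `LinearRegression` are hypothetical placeholders, not the required design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists; replace them with the columns your model uses.
NUMERIC_COLS = ["Building Square Feet"]
CATEGORICAL_COLS = ["Property Class"]


def create_pipeline():
    """Return an unfitted pipeline: preprocess features, then fit a regressor."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # fill missing numbers
        ("scale", StandardScaler()),                    # zero mean, unit variance
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),  # tolerate unseen categories
    ])
    features = ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])
    # Any scikit-learn regressor can replace LinearRegression here.
    return Pipeline([("features", features), ("model", LinearRegression())])
```

Keeping all preprocessing inside the pipeline means the exported file carries every transformation with it, so the same steps run on the training data and on the test data.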

## Step 2. Initiate a pipeline

Create a pipeline instance by calling `create_pipeline()`:


In [153]:
pipeline = create_pipeline()

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **You can use any regression model; the following is just an example.**

**As usual**, your model will predict the log-transformed sale price, and our grading will transform your predictions back to the normal values.
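To make the log transform concrete, here is a small sketch of how log-scale predictions might be mapped back to dollars and scored. It assumes the grader inverts the transform with `np.exp` and computes RMSE on the original scale; the `rmse` helper and the sample prices are made up for illustration.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up sale prices, for illustration only.
sale_price = np.array([150_000.0, 220_000.0, 310_000.0])

# The model is trained on the log of the sale price ...
y_log = np.log(sale_price)

# ... so its predictions are on the log scale (pretend values here).
log_pred = y_log + np.array([0.05, -0.02, 0.01])

# np.exp undoes np.log, returning predictions in dollars.
price_pred = np.exp(log_pred)

print("RMSE (log scale):", rmse(y_log, log_pred))
print("RMSE (dollars):  ", rmse(sale_price, price_pred))
```

The two numbers are not comparable: a small error on the log scale can correspond to a large dollar error for expensive properties.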

In [None]:
train_data = pd.read_csv('cook_county_contest_data/cook_county_contest_train.csv')
train_data = remove_outliers(train_data, "Sale Price")
y_train = np.log(train_data['Sale Price'])
train_data = Preprocess(train_data.drop(columns=['Sale Price']))
#train_data, y_train = preprocess_train(train_data)

# You can use any model in scikit-learn
pipeline.fit(train_data, y_train);

# Export your pipeline

dump(pipeline, '../../model/pipeline.joblib.gz', compress=('gzip', 3))

# This saves the pipeline to a compressed file.
# The compress parameter takes a tuple of (compression method, compression level),
# which in this case is ('gzip', 3).
# The compression level ranges from 0 to 9, with 0 being no compression
# and 9 being the highest level of compression.
# A higher compression level results in a smaller file, but takes longer
# to compress and decompress.
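
Before sending the file to the server, it can help to check that the exported model reloads and predicts the same values. The sketch below does that round trip with a stand-in `LinearRegression` on synthetic data and a temporary directory; the data and path are illustrative assumptions, not the project's pipeline or model location.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = X @ np.array([1.5, -0.5]) + 2.0
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib.gz")

    # Same compression settings as the export cell: gzip at level 3.
    dump(model, path, compress=("gzip", 3))

    # load() detects the compression from the file contents.
    restored = load(path)

    same_predictions = np.allclose(model.predict(X), restored.predict(X))
    size = os.path.getsize(path)

print("Predictions match after reload:", same_predictions)
print("File size in bytes:", size)
```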

## Step 4. Cross validation and push your code

Run cross-validation on the training set to estimate the performance of your model. Then push your code to Gitea and send your model to the server.
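
One possible way to run the cross-validation is sketched here on synthetic data, so it runs without the project files. The features, target, and `LinearRegression` model are placeholders; with the real project you would pass your own pipeline, preprocessed features, and log sale prices instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features and a log-price-like target, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 12.0 + rng.normal(scale=0.1, size=100)

# Shuffle before splitting so each fold mixes rows from the whole dataset.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that larger scores are always better.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_mean_squared_error"
)

# Flip the sign and take the square root to get a per-fold RMSE.
fold_rmse = np.sqrt(-neg_mse)

print("Per-fold RMSE:", fold_rmse)
print("Mean RMSE:", fold_rmse.mean())
```

Note that `cross_val_score` clones the estimator and refits it on each training fold, so passing an already fitted model does not reuse its fitted parameters.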

In [146]:
# You can do cross-validation here
# The first 20 rows of the training set serve as a small sample for a quick check
test_data = pd.read_csv('cook_county_contest_data/cook_county_contest_train.csv').head(20)

In [147]:
m = load(os.path.join("../../model/", 'pipeline.joblib.gz'))

In [148]:
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=5)
# The model predicts the log sale price, so score it against the log-transformed target
scores = cross_val_score(m, Preprocess(test_data), np.log(test_data["Sale Price"]), cv=kfold, scoring="neg_mean_squared_error")
print(scores)

    Building Square Feet                                        Description  \
0                 2568.0  This property, sold on 11/08/2018, is a two-st...   
1                 1040.0  This property, sold on 12/29/2017, is a one-st...   
2                 1188.0  This property, sold on 01/27/2017, is a one-st...   
3                 2252.0  This property, sold on 07/03/2018, is a one-st...   
4                  787.0  This property, sold on 05/26/2017, is a one-st...   
5                 1040.0  This property, sold on 11/05/2013, is a one-st...   
6                 1404.0  This property, sold on 02/02/2017, is a one-st...   
7                 1450.0  This property, sold on 04/07/2015, is a one-st...   
8                  912.0  This property, sold on 06/13/2017, is a one-st...   
9                  894.0  This property, sold on 12/30/2014, is a one-st...   
10                1879.0  This property, sold on 09/06/2019, is a one-st...   
11                1083.0  This property, sold on 11/

In [150]:
m.predict(Preprocess(test_data))


array([12.65649414, 11.94396973, 12.05163574, 12.5501709 , 11.7376709 ,
       11.96325684, 12.16760254, 12.19372559, 11.85693359, 11.82141113,
       12.42285156, 11.99609375, 12.31164551, 12.44018555, 11.79101562,
       12.1706543 , 12.53527832, 12.14123535, 12.54577637, 12.09484863])