# Project Part II: Predicting Housing Prices - Build Your Own Model (50 pts)

 

### Grading Scheme

Your grade for the project will be based on your training RMSE and test RMSE. The thresholds are as follows:

Points | 50 | 40 | 30 | 20
--- | --- | --- | --- | ---
Test RMSE | Top 20% | (20%, 40%] | (40%, 70%] | Last 30%


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from proj import *


In [3]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import dump, load

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()
    
    
# Note: the data in cook_county_contest_data has been filtered,
# so please use this dataset instead of the old one.

### Note

This notebook is specifically designed to guide you through the process of exporting your model's predictions on the test dataset for submission so you can see how your model performs.

Most of what you did in Project Part I should be transferable here.

## Step 1. Set up all the helper functions for your `create_pipeline` function.

You can do that in `proj.py`.
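As a rough illustration of what a `create_pipeline` function could look like, here is a minimal sketch. It assumes a scikit-learn `Pipeline` that scales numeric features and fits a linear regression. The step names, the choice of `StandardScaler`, and the synthetic data are all assumptions made for this example; your own `proj.py` will likely need different preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def create_pipeline():
    """Hypothetical example: scale numeric features, then fit a linear model."""
    return Pipeline([
        ("scale", StandardScaler()),    # assumed preprocessing step
        ("model", LinearRegression()),  # any scikit-learn regressor could go here
    ])


# Quick check on made-up data (not the Cook County dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0  # exact linear target

demo_pipeline = create_pipeline()
demo_pipeline.fit(X_demo, y_demo)
demo_r2 = demo_pipeline.score(X_demo, y_demo)  # R^2 on the training data
```

Because the function returns a single estimator object, the fitting, exporting, and cross-validation steps can all treat it like any other scikit-learn model.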

## Step 2. Initiate a pipeline

Create a pipeline instance by calling `create_pipeline()`.


In [11]:
pipeline = create_pipeline()

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **You can use any regression model; the following is just an example.**

**As usual**, your model will predict the log-transformed sale price, and our grading will transform your predictions back to the original values.
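The round trip between log-scale predictions and dollar values can be sketched as follows. This example assumes the natural logarithm (`np.log` / `np.exp`) and uses made-up prices and made-up prediction errors; it is only meant to show where the back-transform sits relative to the RMSE computation.

```python
import numpy as np

# Made-up sale prices, for illustration only.
sale_price = np.array([120_000.0, 250_000.0, 480_000.0])

# Target a model would be trained on (assumption: natural log).
log_price = np.log(sale_price)

# Pretend predictions on the log scale: the truth plus a small error.
log_pred = log_price + np.array([0.05, -0.02, 0.01])

# Back-transform to dollars before computing RMSE on the original scale.
pred_price = np.exp(log_pred)
rmse_dollars = np.sqrt(np.mean((pred_price - sale_price) ** 2))
```

A small error on the log scale is a percentage error in dollars, so the same log error costs more RMSE on expensive houses than on cheap ones.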

In [12]:
# model 1
train_data = pd.read_csv('cook_county_contest_data/cook_county_contest_train.csv')
train_data = remove_outliers(train_data, "Sale Price", degree=1)
y_train = train_data['Sale Price']
train_data = Preprocess(train_data.drop(columns=['Sale Price']))
#train_data, y_train = preprocess_train(train_data)

# You can use any scikit-learn regression model here
pipeline.fit(train_data, y_train);

# Export your pipeline.
# This saves the pipeline to a compressed file. The compress parameter takes a
# tuple of (compression method, compression level), here ('gzip', 3).
# The compression level ranges from 0 to 9, with 0 being no compression
# and 9 being the highest level of compression.
# A higher compression level gives a smaller file, but takes longer to
# compress and decompress.
dump(pipeline, '../../model/m1.joblib.gz', compress=('gzip', 3))

['../../model/m1.joblib.gz']
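To confirm that an exported file can be read back, a save-and-load round trip like the one below may help. It is a minimal sketch on a toy model: the temporary directory, the file name `demo_model.joblib.gz`, and the synthetic data are all made up for the example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Toy model fitted on made-up data.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib.gz")  # hypothetical file name
    written = dump(model, path, compress=("gzip", 3))      # list of file names written
    restored = load(path)                                  # read it back while the file exists

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
```

Checking predictions after loading is a cheap way to catch a broken export before sending the file anywhere.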

In [6]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(y_train,pipeline.predict(train_data)))
rmse

59772.02732480719

## Step 4. Cross validation and push your code

Run cross-validation on the training set to estimate your model's performance. Then push your code to Gitea and send your model to the server.

In [7]:
# You can do cross-validation here.
# This example uses the first 20 rows of the training file as a small sample.
test_data = pd.read_csv('cook_county_contest_data/cook_county_contest_train.csv').head(20)

In [8]:
m = load(os.path.join("../../model/", 'm1.joblib.gz'))

In [9]:
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=5)
# Scores are negative MSE values (scikit-learn maximizes scores), one per fold
scores = cross_val_score(m, Preprocess(test_data), test_data["Sale Price"], cv=kfold, scoring="neg_mean_squared_error")
print(scores)

[-9.77421483e+09 -5.35395148e+09 -1.70586962e+10 -9.25664240e+09
 -4.38231712e+09]
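The scores printed above are negated mean squared errors, so they are not directly comparable to the RMSE used for grading. Negating each value and taking the square root gives a per-fold RMSE; for instance, the first fold's -9.77421483e+09 corresponds to an RMSE of roughly 98,900. The sketch below shows the conversion on synthetic data with a plain linear regression, both of which are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up regression problem with noise of standard deviation 0.5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=100)

neg_mse = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5),
    scoring="neg_mean_squared_error",
)

fold_rmse = np.sqrt(-neg_mse)  # flip the sign, then take the square root
mean_rmse = fold_rmse.mean()   # single summary number across folds
```

scikit-learn also accepts `scoring="neg_root_mean_squared_error"`, which returns the negated RMSE directly.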


In [10]:
# Compare predictions with the actual sale prices (used as the index)
pd.DataFrame({"prediction": m.predict(Preprocess(test_data))}, index=test_data["Sale Price"])

Sale Price | prediction
--- | ---
451400 | 567446.022296
121000 | 214029.407636
65900 | 160472.574781
482000 | 489198.619684
62500 | 81237.389488
80000 | 136722.793385
300000 | 306697.737105
271250 | 247927.306615
410000 | 324409.349268
94900 | 74561.819167
