# Project Part II: Predicting Housing Prices - Build Your Own Model (50 pts)

 

Now it's your turn to train a better model for predicting house prices! Please use only models from sklearn. Do not introduce other models by uploading extra libraries to the server.

### Grading Scheme

Your grade for the project will be based on your test RMSE and your readme.md. The breakdown is as follows:

1. Readme.md + completeness of your ipynb (10 pts)

2. Test RMSE ranking (up to 40 pts):

Points | 40 | 30 | 25 | 20
--- | --- | --- | --- | ---
Test RMSE | Top 20% | (20%, 40%] | (40%, 70%] | Last 30%
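
For reference, RMSE is the square root of the mean squared difference between predicted and actual sale prices. A minimal sketch, assuming a hypothetical helper `rmse` and made-up prices (neither is part of the project code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up sale prices, for illustration only
actual = [200_000, 350_000, 120_000]
predicted = [210_000, 340_000, 150_000]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single badly mispredicted house can dominate the score.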


In [15]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [16]:
from proj import *

In [17]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import dump, load

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()

### Note: we filtered the data in cook_county_contest_data, so please use this dataset instead of the old one.

## Step 1. `create_pipeline` function.

See proj.py
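
Since `proj.py` is not shown here, the following is only a guess at the general shape such a function might take. The parameter names used later in this notebook (`lin-reg__n_estimators`, `lin-reg__max_depth`, and so on) suggest a tree ensemble in a step named `lin-reg`; the function name `create_pipeline_sketch`, the choice of `RandomForestRegressor`, the preprocessing, and the column names are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def create_pipeline_sketch(numeric_cols, categorical_cols):
    """Hypothetical pipeline: impute/encode features, then a tree ensemble."""
    preprocess = ColumnTransformer([
        ('num', SimpleImputer(strategy='median'), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ])
    # The step is named 'lin-reg' only to match the parameter names used later
    return Pipeline([
        ('preprocess', preprocess),
        ('lin-reg', RandomForestRegressor(n_estimators=20, random_state=42)),
    ])

# Toy frame; 'Roof Material' is an assumed categorical column
toy = pd.DataFrame({
    'Building Square Feet': [800.0, 1200.0, None, 2000.0],
    'Roof Material': ['shingle', 'tile', 'shingle', 'slate'],
})
toy_price = [150_000, 220_000, 180_000, 400_000]

sketch = create_pipeline_sketch(['Building Square Feet'], ['Roof Material'])
sketch.fit(toy, toy_price)
print(sketch.predict(toy))
```

Keeping preprocessing inside the pipeline means the exported object can be applied to raw test rows without repeating any cleaning steps by hand.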

## Step 2. Initiate a pipeline

Create a pipeline instance with `pipeline = create_pipeline()`:


In [18]:
pipeline = create_pipeline()

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **You can use any regression model; the following is just an example.**

Your model must predict the **original sale price**. If you take the log at an intermediate step, please **transform back** to the original values.
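
One way to keep the log transform and its inverse together is scikit-learn's `TransformedTargetRegressor`, which applies `func` to the target during `fit` and `inverse_func` to the output of `predict`. The sketch below uses synthetic data and a plain `LinearRegression`; it is an illustration of the idea, not the approach taken in `proj.py`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: log1p(price) is exactly linear in the single feature
X = np.linspace(0.0, 4.0, 50).reshape(-1, 1)
price = np.expm1(10.0 + 0.5 * X.ravel())

# Fit on log1p(price); predict() applies expm1 so outputs are in original units
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, price)
pred = model.predict(X)
print(pred[:3], price[:3])
```

Because the inverse is part of the estimator, the exported model returns sale prices directly and there is no separate back-transformation step to forget.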

In [19]:
train_data = pd.read_csv('cook_county_contest_train.csv')

In [20]:
# # Resample the data
# from sklearn.utils import resample
# train_data_resample = pd.DataFrame()
# for i in range(9):
#     train_data_inrange = train_data[(train_data['Building Square Feet']>=i*1000) & (train_data['Building Square Feet'] < (i+1)*1000)]
#     df_temp = resample(train_data_inrange,
#         replace=True,
#         n_samples=(i+1)*len(train_data_inrange),
#         random_state=4710)
#     train_data_resample = pd.concat([train_data_resample, df_temp])
# train_data = train_data_resample
# train_data.shape

In [21]:
train_data['Building Square Feet']

0         2568.0
1         1040.0
2         1188.0
3         2252.0
4          787.0
           ...  
138212     882.0
138213    1004.0
138214    1085.0
138215    1494.0
138216     864.0
Name: Building Square Feet, Length: 138217, dtype: float64

In [22]:
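# Separate the target from the features
y_train = train_data['Sale Price']
train_data = train_data.drop(columns=['Sale Price'])
# Train on price per square foot. Predictions in this unit must be multiplied by
# 'Building Square Feet' to recover the original sale price.
y_train = y_train / train_data['Building Square Feet']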
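
Dividing the target by `Building Square Feet` means the model learns price per square foot, so its raw predictions are not sale prices. Whether `proj.py` already converts back is not shown here; the sketch below, on made-up numbers with a plain `LinearRegression`, shows the conversion that would otherwise be needed.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: price per square foot is linear in size (made-up numbers)
toy = pd.DataFrame({'Building Square Feet': [800.0, 1000.0, 1500.0, 2000.0, 2500.0]})
price_per_sqft = 100.0 + 0.01 * toy['Building Square Feet']
sale_price = price_per_sqft * toy['Building Square Feet']

# Train on price per square foot, mirroring the cell above
model = LinearRegression().fit(toy, sale_price / toy['Building Square Feet'])

# Convert predictions back to sale price before scoring or submitting
pred_sale_price = model.predict(toy) * toy['Building Square Feet'].to_numpy()
print(pred_sale_price)
```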

In [23]:
# for parameter in pipeline.get_params():
#     print(parameter)

In [24]:
# from sklearn.model_selection import KFold
# from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import RandomizedSearchCV

# # Define hyperparameter distributions to search over
# param_distributions = {
#     'lin-reg__max_depth': [5, 10, 20, 50, 100, 200, 500, 1000],
#     'lin-reg__max_features': [1.0, 0.5, 0.2, 0.1, 'sqrt', 'log2'],
#     'lin-reg__max_leaf_nodes': [5, 10, 20, 50, 100, 200, 500, 1000],
#     'lin-reg__n_estimators': [20, 50, 100, 200, 500, 1000],
#     'lin-reg__ccp_alpha': [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5],
#     'lin-reg__min_impurity_decrease': [0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1],
#     'lin-reg__random_state': [22, 42, 4710],
# }

# # Perform randomized search
# random_search = RandomizedSearchCV(
#     pipeline,
#     param_distributions = param_distributions,
#     n_iter=10,
#     scoring='neg_mean_squared_error',
#     cv=5,
#     n_jobs=-1,
#     verbose=1,
#     random_state=4710
# )

# # Fit the model on training data
# random_search.fit(train_data, y_train)

# # Evaluate the best model on test data
# print("Best score:", random_search.best_score_)
# print("Best params:", random_search.best_params_)

In [25]:
# You can use any model in sklearn
pipeline.set_params(**{
    'lin-reg__max_depth': 20,
    'lin-reg__max_features': 'log2',
    'lin-reg__max_leaf_nodes': 1000,
    'lin-reg__n_estimators': 500,
    'lin-reg__ccp_alpha': 0.01,
    'lin-reg__min_impurity_decrease': 0.01,
    'lin-reg__random_state': 42,
})
pipeline.fit(train_data, y_train)

# Export your pipeline to a compressed file
dump(pipeline, '519370910113-2.gz', compress=('gzip', 6))

# The compress parameter takes a tuple of (compression method, compression level),
# here ('gzip', 6). The level ranges from 0 (no compression) to 9 (highest compression).
# A higher level gives a smaller file but takes longer to compress and decompress.

TypeError: Pipeline.set_params() takes 1 positional argument but 2 were given
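
The `TypeError` above is the message `Pipeline.set_params` produces when the parameter dict is passed positionally, as in `pipeline.set_params({...})`. The cell as shown already unpacks the dict with `**`, so the traceback may be left over from an earlier run. A small sketch of both calls, using a hypothetical one-step pipeline in place of `create_pipeline()`:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for create_pipeline(): one step named 'lin-reg'
pipeline = Pipeline([('lin-reg', RandomForestRegressor())])
params = {'lin-reg__max_depth': 20, 'lin-reg__n_estimators': 500}

try:
    pipeline.set_params(params)      # dict passed positionally
except TypeError as err:
    error_message = str(err)
    print(error_message)

pipeline.set_params(**params)        # dict unpacked into keyword arguments
print(pipeline.get_params()['lin-reg__n_estimators'])
```

Unpacking a dict is needed here because keys such as `lin-reg__max_depth` contain a hyphen and cannot be written as literal keyword arguments.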

## Step 4. Cross validation and push your code

Do cross-validation on the train set to test the performance of your model. **Push your code to Gitea** and send your model to the server.
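
Before sending the model, it may be worth confirming that the exported file loads and predicts the same values as the in-memory pipeline. The sketch below does this round trip with a small stand-in pipeline and a temporary file; the file name and data are placeholders.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline and data (not the project pipeline)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
pipeline = Pipeline([('scale', StandardScaler()), ('model', LinearRegression())]).fit(X, y)

# Write to a temporary directory so no real submission file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.gz')
    dump(pipeline, path, compress=('gzip', 6))
    restored = load(path)

same_predictions = np.allclose(pipeline.predict(X), restored.predict(X))
print(same_predictions)
```

Pickled objects refer to their classes by import path, so any custom transformer defined in `proj.py` has to be importable wherever the file is loaded.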

In [None]:
train_data.shape

(138217, 63)

In [None]:
y_train.shape

(138217,)

In [None]:
## You can do cross-validation here
# from sklearn.model_selection import KFold
# from sklearn.model_selection import cross_val_score
# pipeline = create_pipeline()
# cv = KFold(n_splits=3, random_state=4710, shuffle=True)
# scores = cross_val_score(pipeline, train_data, y_train, cv=cv, scoring='neg_mean_squared_error')
# scores = np.sqrt(np.abs(scores))
# scores
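
The commented-out cell above takes the square root of the absolute negative MSE. The same per-fold RMSE can also be requested directly with the `neg_root_mean_squared_error` scorer (available in scikit-learn 0.22 and later). A runnable sketch on synthetic data, with `Ridge` standing in for the project pipeline:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the training set
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=4710)

cv = KFold(n_splits=3, shuffle=True, random_state=4710)
# scikit-learn scorers are "higher is better", so RMSE comes back negated
neg_rmse = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                           scoring='neg_root_mean_squared_error')
rmse_per_fold = -neg_rmse
print(rmse_per_fold, rmse_per_fold.mean())
```

Note that the scores are in the units of whatever target is passed in: if `y_train` is price per square foot, these values are not directly comparable to a test RMSE computed on sale prices.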