# Homework 7: Predicting Housing Prices - Build Your Own Model

## Due Date: 11:59pm Thursday, March 25th

## You are not required to complete this notebook, and it will not be graded. It only serves as a guide to simplify the process of submitting your model predictions for the contest.

In [None]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()

### Note

This notebook is specifically designed to guide you through the process of exporting your model's predictions on the test dataset for submission so you can see how your model stacks up against others'. 

Most of what you did in question 8 of Homework 7 should be transferable here. However, you will need to make a few small changes so that the function meets our requirements, so **please read the following instructions very carefully**:

## Step 1. Set up all the helper functions for your `process_data_fm` function.

**Copy-paste all of the helper functions your `process_data_fm` needs into the following cell**. Note that we have provided skeletons below for some of the feature engineering functions we asked you to implement in the assignment, but feel free to add more of your own. You **do not** have to fill out all of the functions in the cell below -- only fill out those that are actually useful to your feature engineering pipeline.

In [None]:
def add_total_bedrooms(data):
    """
    Input:
      data (data frame): a data frame containing at least the Description column.
    """
    with_rooms = data.copy()
    ...
    return with_rooms

def ohe_roof_material(data):
    """
    One-hot-encodes roof material.  New columns are of the form 0x_QUALITY.
    """
    ...
    
def process_data_gm(data, pipeline_functions, prediction_col):
    """Process the data for a guided model."""
    for function, arguments, keyword_arguments in pipeline_functions:
        # Treat a missing (None or empty) entry as "no arguments of that kind",
        # so a step may supply positional arguments, keyword arguments, or both.
        data = data.pipe(function, *(arguments or ()), **(keyword_arguments or {}))
    X = data.drop(columns=[prediction_col]).to_numpy()
    y = data.loc[:, prediction_col].to_numpy()
    return X, y

def select_columns(data, *columns):
    """Select only columns passed as arguments."""
    return data.loc[:, columns]
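
Each pipeline step is a `(function, arguments, keyword_arguments)` triple whose arguments are forwarded through `DataFrame.pipe`. The following is a minimal sketch of that forwarding idea on a toy data frame; `keep_columns`, `scale_column`, and the column names `a`, `b`, `c` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical helpers, used only to show how arguments reach a pipeline step.
def keep_columns(data, *columns):
    """Keep only the named columns."""
    return data.loc[:, list(columns)]

def scale_column(data, column, factor=1.0):
    """Return a copy with one column multiplied by `factor`."""
    scaled = data.copy()
    scaled[column] = scaled[column] * factor
    return scaled

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Each entry is (function, positional arguments, keyword arguments).
pipeline = [
    (keep_columns, ["a", "b"], None),
    (scale_column, None, {"column": "a", "factor": 10}),
]

result = toy
for function, arguments, keyword_arguments in pipeline:
    # A None entry means "no arguments of that kind" for this step.
    result = result.pipe(function, *(arguments or ()), **(keyword_arguments or {}))

print(result)
```

Because every step returns a new data frame, `toy` itself is left unchanged, which makes it easy to re-run the pipeline while experimenting.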

## Step 2. Set up your `process_data_fm` function

**Copy-paste your implementation of `process_data_fm` from question 8 of Homework 7 into the following cell.**

Here are a few additional requirements **your `process_data_fm` function must satisfy**:
- Unlike in the homework, your `process_data_fm` function should not return both the design matrix `X` and the observed target vector `y`; it should now **only return X**.
- In addition, you **may NOT incorporate the `Sale Price` column in your feature engineering process** (so steps such as removing outliers in Sale Price, which worked for question 8, no longer apply here).
- We understand that the original training and test data have a lot of illegitimate prices that detract quite a bit from your model's performance. To help you focus on **engineering the best features and looking for patterns within your data**, we have prefiltered the data a bit more and removed those outliers in advance.
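
The requirements above can be sketched roughly as follows. This is not a recommended feature set: the column names `Building Square Feet` and `Bedrooms` are assumed for illustration, and filling missing values with 0 is only a placeholder. The point is the shape of the function: it takes one data frame, never touches `Sale Price`, and returns only `X`.

```python
import numpy as np
import pandas as pd

def process_data_fm(data):
    """Return only the design matrix X; never reads a 'Sale Price' column."""
    # Hypothetical feature columns, chosen for illustration only.
    features = data.loc[:, ["Building Square Feet", "Bedrooms"]].copy()
    # Log-transform a skewed feature; the new column name is arbitrary.
    features["Log Building Square Feet"] = np.log(features["Building Square Feet"])
    features = features.drop(columns=["Building Square Feet"])
    # Fill gaps rather than dropping rows, so every input row keeps a prediction.
    features = features.fillna(0)
    return features

# Toy data standing in for the contest files.
toy = pd.DataFrame({
    "Building Square Feet": [1000.0, 1500.0, 2000.0],
    "Bedrooms": [2.0, np.nan, 4.0],
})
X = process_data_fm(toy)
print(X.shape)
```

Since the same function is applied to both the training and the test data, it helps to avoid any step that removes rows or that produces different columns depending on which values happen to appear in the input.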

In [None]:
# Please include all of your feature engineering process inside this function.
# Do not modify the parameters of the function below. 
# Note that data will no longer have the column Sale Price in it directly, so plan your feature engineering process around that.
def process_data_fm(data):
    # Replace the following line with your own feature engineering pipeline
    X = data
    ...
    return X

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **No coding is required from you for this part**. If your `process_data_fm` satisfies all the specified requirements, the cell should run without any error.

**As usual**, your model will predict the log-transformed sale price, and our autograder will handle transforming your predictions back to the normal values.
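
As a small illustration of why the log transform loses no information, the sketch below applies `np.log` to some made-up prices and inverts it with `np.exp`. The autograder's actual code is not shown here, so treat the use of `np.exp` as an assumption about how the back-transformation works.

```python
import numpy as np

sale_prices = np.array([150_000.0, 300_000.0, 450_000.0])  # made-up prices
log_prices = np.log(sale_prices)   # the scale the model is trained to predict
recovered = np.exp(log_prices)     # inverse transform back to dollars

print(np.allclose(recovered, sale_prices))
```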

In [None]:
train_data = pd.read_csv('cook_county_contest_train.csv', index_col='Unnamed: 0')
y_train = np.log(train_data['Sale Price'])
train_data = train_data.drop(columns=['Sale Price'])
X_train = process_data_fm(train_data)
model = lm.LinearRegression(fit_intercept=True)
model.fit(X_train, y_train);

## Step 4. Make Predictions on the Test Dataset

Run the following cell to estimate the sale price on the test dataset and export your model's predictions as a CSV file called `predictions.csv`. Download the file from the same directory as the current notebook and submit it to the Gradescope assignment **Homework 7 - Build Your Own Model**, where you should be able to see your model's rank.

In [None]:
test_data = pd.read_csv('cook_county_contest_test.csv', index_col='Unnamed: 0')
X_test = process_data_fm(test_data)
y_test_predicted = model.predict(X_test)
predictions = pd.DataFrame({'Sale Price': y_test_predicted})
predictions.to_csv('predictions.csv')
print('Your predictions have been exported as predictions.csv. Please download the file and submit it to Gradescope. ')
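
Before uploading, it may be worth sanity-checking the exported predictions. The sketch below assumes that the autograder expects one finite prediction per test row; that expectation is not stated in this notebook, and `check_predictions` is a hypothetical helper rather than part of the assignment.

```python
import numpy as np
import pandas as pd

def check_predictions(predictions, n_test_rows):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if len(predictions) != n_test_rows:
        problems.append("row count differs from the test set")
    if not np.isfinite(predictions["Sale Price"]).all():
        problems.append("non-finite prediction present")
    return problems

# Toy predictions with two deliberate problems: a missing row and a NaN.
toy_predictions = pd.DataFrame({"Sale Price": [11.9, 12.6, np.nan]})
print(check_predictions(toy_predictions, n_test_rows=4))
```

In the notebook itself, the equivalent call would pass `predictions` and `len(test_data)`.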