# Project 2: Ames Housing Data

## Kaggle Submission

### Importing and Cleaning the Validation Dataset

In [1]:
import pandas as pd
import numpy as np
import csv
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
import pickle

%matplotlib inline

To get started, I'll bring in the model and scaler created in the previous phase of the project.

In [2]:
with open('../assets/lasso_lin_reg.pkl', 'rb') as f:
    lasso = pickle.load(f)
with open('../assets/scaler.pkl', 'rb') as f:
    ss = pickle.load(f)
kaggle = pd.read_csv('../datasets/test.csv', index_col='Id')
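
As a minimal sketch of why loading the pickled scaler is enough (no refitting needed), the example below round-trips a fitted `StandardScaler` through `pickle` in memory. The toy matrix `X_toy` is a hypothetical stand-in for the training features, not the Ames data.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix; a stand-in for the dummied training features.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler().fit(X_toy)

# Round-trip through pickle in memory instead of a file on disk.
restored = pickle.loads(pickle.dumps(scaler))

# The restored scaler reuses the fitted mean and scale; it is not refit.
X_new = np.array([[2.0, 40.0]])
same = np.allclose(scaler.transform(X_new), restored.transform(X_new))
```

The fitted statistics travel with the object, so new data is scaled with the training set's mean and standard deviation.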

Next I'll apply to the validation set all the transformations that I applied to the training dataset in the cleaning and EDA notebooks. Note that `Lot Frontage` will be filled with the value 69.059406, which was the mean value used to fill that feature's null values in the training set.

In [3]:
kaggle.drop(columns='PID', inplace=True)
# Assign results back rather than using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the dataframe.
kaggle['Pool QC'] = kaggle['Pool QC'].fillna('NA')
kaggle['Misc Feature'] = kaggle['Misc Feature'].fillna('NA')

# Encode Fence as 1 when any fence type is recorded, 0 when missing
fence_quality = ['MnPrv', 'GdPrv', 'GdWo', 'MnWw']
for quality in fence_quality:
    kaggle['Fence'] = kaggle['Fence'].str.replace(quality, '1')
kaggle['Fence'] = kaggle['Fence'].fillna('0').astype(int)

# Encode Alley as 1 when any alley type is recorded, 0 when missing
alley_quality = ['Grvl', 'Pave']
for quality in alley_quality:
    kaggle['Alley'] = kaggle['Alley'].str.replace(quality, '1')
kaggle['Alley'] = kaggle['Alley'].fillna('0').astype(int)

kaggle.drop(columns='Fireplace Qu', inplace=True)
kaggle.drop(columns = 'Garage Yr Blt', inplace=True)

# A missing garage field means the property has no garage
garage_cols = ['Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond']
kaggle[garage_cols] = kaggle[garage_cols].fillna('NA')

def fill_basement_nulls(data):
    for row in data.index:
        if data.loc[row, 'Total Bsmt SF'] == 0:
            data.loc[row, 'Bsmt Qual'] = 'NA'
            data.loc[row, 'Bsmt Cond'] = 'NA'
            data.loc[row, 'Bsmt Exposure'] = 'NA'
            data.loc[row, 'BsmtFin Type 1'] = 'NA'
            data.loc[row, 'BsmtFin Type 2'] = 'NA'
    return
fill_basement_nulls(kaggle)

kaggle['Mas Vnr Area'] = kaggle['Mas Vnr Area'].fillna(0)
kaggle['Mas Vnr Type'] = kaggle['Mas Vnr Type'].fillna('None')
# 69.059406 is the training-set mean of Lot Frontage
kaggle['Lot Frontage'] = kaggle['Lot Frontage'].fillna(69.059406)

# BELOW THIS LINE IS TAKEN FROM EDA NOTEBOOK

def scale_10_rewrite(column):
    for row in kaggle.index:
        if kaggle.loc[row, column] >= 9:
            kaggle.loc[row, column] = str('Excellent')
        elif kaggle.loc[row, column] >= 7:
            kaggle.loc[row, column] = str('Good')
        elif kaggle.loc[row, column] >= 4:
            kaggle.loc[row, column] = str('Average')
        elif kaggle.loc[row, column] >= 1:
            kaggle.loc[row, column] = str('Fair')
        else:
            kaggle.loc[row, column] = str('Poor')
    return
scale_10_rewrite('Overall Qual')
scale_10_rewrite('Overall Cond')
kaggle.drop(columns = 'Garage Area', inplace=True)
kaggle.drop(columns='Garage Cond', inplace=True)
kaggle.drop(columns = 'Pool Area', inplace=True)
kaggle.drop(columns = 'Garage Finish', inplace = True)

kaggle.drop(columns = 'Heating', inplace=True)
kaggle['Heating QC'] = kaggle['Heating QC'].map({'Ex': 'Gd', 'Gd': 'Gd',
                                                 'TA': 'TA',
                                                 'Fa': 'Po', 'Po': 'Po'})
kaggle['Central Air'] = kaggle['Central Air'].map({'Y': int(1), 'N': int(0)})
kaggle['Garage Qual'] = kaggle['Garage Qual'].map({'Ex': 'Gd', 'Gd': 'Gd',
                                                   'TA': 'TA',
                                                   'Fa': 'Po', 'Po': 'Po',
                                                   'NA': 'NA'})
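
The `Lot Frontage` fill above reuses a statistic computed on the training data. A minimal sketch of that idea on toy data follows; `train_toy` and `test_toy` are hypothetical frames, not the Ames datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the training and Kaggle sets.
train_toy = pd.DataFrame({'Lot Frontage': [60.0, 80.0, np.nan, 70.0]})
test_toy = pd.DataFrame({'Lot Frontage': [np.nan, 50.0, np.nan]})

# Compute the fill value from the training data only (NaNs are skipped).
train_mean = train_toy['Lot Frontage'].mean()

# Apply the same value to both frames so no test-set statistic is used.
train_toy['Lot Frontage'] = train_toy['Lot Frontage'].fillna(train_mean)
test_toy['Lot Frontage'] = test_toy['Lot Frontage'].fillna(train_mean)
```

Filling the test set with its own mean would give the model inputs prepared differently from the ones it was trained on.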

With all the transformations applied, I'll check for any remaining null values that hadn't occurred in the training set.

In [4]:
kaggle.isnull().sum().sort_values(ascending=False).head()

Electrical        1
Sale Type         0
Mas Vnr Area      0
Year Remod/Add    0
Roof Style        0
dtype: int64

In [5]:
kaggle.Electrical.value_counts()

SBrkr    814
FuseA     48
FuseF     15
FuseP      1
Name: Electrical, dtype: int64

With about 93% of the non-null values in `Electrical` (814 of 878) being "SBrkr", I feel comfortable imputing that mode for the single missing value.

In [6]:
kaggle['Electrical'] = kaggle['Electrical'].fillna('SBrkr')
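
The share behind that decision can be checked directly from the counts printed above. This is a small sketch that rebuilds the counts by hand rather than reading them from the dataframe.

```python
import pandas as pd

# Counts copied from the value_counts() output above (nulls excluded).
counts = pd.Series({'SBrkr': 814, 'FuseA': 48, 'FuseF': 15, 'FuseP': 1})

mode_value = counts.idxmax()              # most frequent category
mode_share = counts.max() / counts.sum()  # its share of non-null rows
```

Here `mode_share` is 814 / 878, which is a little under 0.93.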

The `Electrical` null has been filled, so I'll verify that the dataframe is clean before moving on to column transformations.

In [7]:
kaggle.isnull().sum().sum()

0

### Matching Columns

Before applying the scaler and model, I have to make sure the validation set has the same columns that were used to create the model with the training set. First I'll create a new dataframe with all of the validation set's columns dummied out.

In [8]:
kaggle_dummies = pd.get_dummies(kaggle)

The properties in the validation set likely have some categorical values that didn't appear in the training set, and vice versa, so the two dummied dataframes won't have identical columns. Next I'll import the dummied training dataset and drop `SalePrice` from it so I can match the validation set's columns to it.

In [9]:
ames_dummies = '../datasets/train_clean_dummies.csv'
X_train = pd.read_csv(ames_dummies, index_col='Id')

X_train.drop(columns='SalePrice', inplace=True)

Now I'll create a set containing the names of the training columns that don't appear in the validation dataframe, and I'll iterate through that set with a for loop to add each column to the validation dataframe with values of 0.

In [10]:
missing_cols = set(X_train).difference(kaggle_dummies)

for col in missing_cols:
    kaggle_dummies[col] = 0

In [11]:
kaggle_dummies.shape

(879, 290)

Now that the validation set contains all of the training set's columns, I'll select its columns using the training set's column list. This drops any extra features that appear only in the validation set and puts the remaining columns in the same order as the training set's, which the scaler and model both require.

In [12]:
kaggle_dummies = kaggle_dummies[X_train.columns]
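
The two alignment steps above (adding missing columns as zeros, then selecting the training columns in order) can also be expressed in a single call with `DataFrame.reindex`. A minimal sketch on hypothetical toy frames, whose column names are illustrative only:

```python
import pandas as pd

# Hypothetical toy frames; 'Grvl' appears only in train, 'Dirt' only in test.
train_toy = pd.DataFrame({'Area': [1000, 1500], 'Street': ['Pave', 'Grvl']})
test_toy = pd.DataFrame({'Area': [1200, 900], 'Street': ['Pave', 'Dirt']})

train_dummies = pd.get_dummies(train_toy)
test_dummies = pd.get_dummies(test_toy)

# reindex adds missing training columns filled with 0, drops columns the
# training set never saw, and matches the training column order.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

After this call `test_aligned` has exactly the training columns: `Street_Grvl` is present and all zeros, and `Street_Dirt` is gone.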

The validation dataset is ready for the scaler and model.

### Scaling and Predicting the Validation Dataset

In [13]:
kaggle_sc = ss.transform(kaggle_dummies)


In [14]:
preds = lasso.predict(kaggle_sc)

### Exporting the Submission

I will create a `submission` dataframe from the predictions. The dataframe will have an index column of the `Id` for each property, and a column of `SalePrice` with the corresponding price predictions for each property.

In [15]:
submission = pd.DataFrame(preds, index=kaggle.index, columns=['SalePrice'])
submission.head()

| Id | SalePrice |
|---|---|
| 2658 | 128071.379916 |
| 2718 | 180666.35259 |
| 2414 | 212790.302083 |
| 1989 | 118975.317471 |
| 625 | 187679.981451 |


Before exporting I'll sort the dataframe in ascending order by its `Id` index.

In [16]:
submission.sort_index(inplace=True)
submission.head()

| Id | SalePrice |
|---|---|
| 2 | 124359.402799 |
| 4 | 280730.486957 |
| 6 | 198731.172974 |
| 7 | 217359.073376 |
| 17 | 196407.901522 |


Finally, I will export this sorted dataset of predictions to a new CSV file for submission to Kaggle. I've left the command for this export commented out to protect against an accidental overwrite of earlier predictions.

In [17]:
# submission.to_csv('../datasets/kaggle_submission.csv')
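
An alternative to commenting the export out is to refuse to write when the target file already exists. The helper `export_if_new` below is a hypothetical sketch, not part of this project, shown on a toy dataframe in a temporary directory.

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_if_new(df, path):
    """Write df to path unless a file already exists; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    df.to_csv(path)
    return True


# Toy demonstration; the values are placeholders, not real predictions.
toy = pd.DataFrame({'SalePrice': [1.0, 2.0]}, index=pd.Index([2, 4], name='Id'))
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'toy_submission.csv'
    first = export_if_new(toy, target)   # file is created
    second = export_if_new(toy, target)  # existing file is left untouched
```

With a guard like this the export cell can stay live, and earlier predictions are kept unless the old file is removed on purpose.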

This concludes my report. Thank you for reading.