# Predicting Ames - Additional Engineering

#### Jump To:
- [Feature Engineering](#fut_eng)
- [EDA](#eda)
    - [Correlation](#corr)
    - [Distribution](#dist)
- [Additional Engineering](#more)

### Imports 

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


### Reading in our datasets

In [2]:
# reading in dataset
df = pd.read_csv('datasets/train_clean.csv')
test_df = pd.read_csv('datasets/test_clean.csv')

<a id='more'></a>
### Additional Engineering
- Once I used One Hot Encoder (in the Predictions notebook) I got a pretty good RMSE of 21016, however the test and train datasets ended up with different numbers of columns (320 and 300)
- Initially I removed the columns that differed between the two datasets, however my RMSE increased and my model became overfit
- Then I added zero-filled columns to each dataset so the column counts matched. This gave me my worst Kaggle score at 39k, even though my train RMSE was the best yet at 20k
- I will now look at the categorical columns and see if I can shrink them even more
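
The zero-column padding described above can be done in one step with `DataFrame.reindex`. This is a minimal sketch on toy data, not the code used in the Predictions notebook; the frames `train`, `test`, `train_d` and `test_d` are hypothetical names. It only makes the shapes match and says nothing about how the model will score.

```python
import pandas as pd

# Toy stand-ins for the train and test sets (hypothetical data)
train = pd.DataFrame({'Heating QC': ['Gd', 'Avg', 'Po'],
                      'Lot Area': [8000, 9500, 7200]})
test = pd.DataFrame({'Heating QC': ['Gd', 'Avg'],
                     'Lot Area': [8100, 9900]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# Give test exactly the train columns, in the same order:
# categories missing from test become all-zero columns,
# categories that appear only in test are dropped
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(train_d.shape[1], test_d.shape[1])
```

Reindexing against the train columns keeps the column order identical as well, which matters when the model is fit on a plain array.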

In [3]:
# make a function that iterates through the columns and returns a list of columns with selected name for both datasets
def iter_col(column):
    print([col for col in df.columns if column in col])
    print([col for col in test_df.columns if column in col])

In [4]:
# function that takes column names and drops those columns from both datasets
# I used *cols (variable positional arguments) instead of a single column argument,
# because the number of columns to drop can be 1 or more
def drop_cols(*cols):
    df.drop(columns=list(cols), inplace=True)
    test_df.drop(columns=list(cols), inplace=True)

In [5]:
# function that takes a quality rating column and shrinks it down to 3-4 levels for both datasets
def re_val(column):
    rating_map = {'Ex': 'Gd',            # Excellent becomes Good
                  'Gd': 'Gd',            # Good remains
                  'TA': 'Avg',           # TA renamed to Avg for consistency
                  'Fa': 'Po',            # Fair becomes Poor
                  'Po': 'Po',            # Poor remains
                  'NotApp.': 'NotApp.'}  # Feature is absent, so the label stays as is

    # Apply the same mapping to both datasets
    df[column] = df[column].map(rating_map)
    test_df[column] = test_df[column].map(rating_map)
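
One thing to keep in mind with `Series.map` and a dictionary: any label that is not a key in the dictionary comes back as `NaN`. The sketch below, on a toy column with a made-up label `'Typ'`, shows one way to list such labels before they turn into nulls. The names `rating_map`, `ratings`, `mapped` and `unmapped` are illustrative only.

```python
import pandas as pd

# Same mapping as in re_val
rating_map = {'Ex': 'Gd', 'Gd': 'Gd', 'TA': 'Avg',
              'Fa': 'Po', 'Po': 'Po', 'NotApp.': 'NotApp.'}

# Toy column; 'Typ' is a made-up label that is not in the mapping
ratings = pd.Series(['Ex', 'TA', 'Fa', 'Typ'])
mapped = ratings.map(rating_map)

# Labels that failed to map show up as NaN in the result
unmapped = ratings[mapped.isna()].unique()
print(unmapped)
```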

In [6]:
# threshold checks this time instead of a .map dictionary, because a dictionary would need 10 separate entries

def re_cat(column):
    # bring a 10 point rating down to 4 levels, for both datasets
    def to_level(score):
        if score >= 8:
            return 'Ex'
        elif score >= 6:
            return 'Gd'
        elif score >= 3:
            return 'Avg'
        else:
            return 'Po'

    df[column] = df[column].apply(to_level)
    test_df[column] = test_df[column].apply(to_level)
            

#### Garage

In [11]:
iter_col('Garage') # gives me a list of all the Garage columns
# Dropping the majority of the Garage columns; will keep Garage Cars & Garage Cond
drop_cols('Garage Type', 'Garage Finish','Garage Area', 'Garage Qual')

['Garage Type', 'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual', 'Garage Cond']
['Garage Type', 'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual', 'Garage Cond']


In [12]:
iter_col('Garage') # What's left in each dataset

['Garage Cars', 'Garage Cond']
['Garage Cars', 'Garage Cond']


In [13]:
re_val('Garage Cond') # Re evaluate the values into smaller bins

In [14]:
df['Garage Cond'].value_counts() #What's left

Avg        1868
NotApp.     114
Po           55
Gd           14
Name: Garage Cond, dtype: int64

In [15]:
# # creating dummy vars for both datasets
# df = pd.concat([df, pd.get_dummies(df[['Garage Cond']], drop_first=True)], axis=1)
# test_df = pd.concat([test_df, pd.get_dummies(test_df[['Garage Cond']], drop_first=True)], axis=1)

#### Overall Columns
I'll keep both columns, but I want to shrink the 1 - 10 rating to a smaller scale so that when I dummify the columns later there are only 4 new columns instead of 10
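
To make the effect on the dummy count concrete, here is a small sketch on a toy series covering every rating from 1 to 10. It uses `pd.cut` with bin edges chosen to mirror the thresholds in `re_cat`; this is an alternative way to bin, not the code this notebook runs, and `scores`, `binned`, `n_raw` and `n_binned` are illustrative names.

```python
import pandas as pd

# Toy ratings 1..10 (hypothetical data)
scores = pd.Series(range(1, 11), name='Overall Qual')

# Bin edges mirror the thresholds in re_cat:
# 1-2 -> Po, 3-5 -> Avg, 6-7 -> Gd, 8-10 -> Ex
binned = pd.cut(scores, bins=[0, 2, 5, 7, 10],
                labels=['Po', 'Avg', 'Gd', 'Ex'])

# Number of dummy columns before and after binning
n_raw = pd.get_dummies(scores.astype(str)).shape[1]
n_binned = pd.get_dummies(binned).shape[1]
print(n_raw, n_binned)  # 10 4
```

With `drop_first=True`, as in the dummy cells in this notebook, each count drops by one.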

In [16]:
iter_col('Overall') # One rates the home's overall condition, the other the overall quality of its materials

['Overall Qual', 'Overall Cond']
['Overall Qual', 'Overall Cond']


In [7]:
re_cat('Overall Qual')
re_cat('Overall Cond')


In [10]:
test_df['Overall Qual'].value_counts() # A view of what it looks like now

Gd     397
Avg    340
Ex     137
Po       4
Name: Overall Qual, dtype: int64

In [17]:
# # Dummies of Overall Quality and Condition columns
# df = pd.concat([df, pd.get_dummies(df[['Overall Qual']], drop_first=True)], axis=1)
# test_df = pd.concat([test_df, pd.get_dummies(test_df[['Overall Qual']], drop_first=True)], axis=1)

# df = pd.concat([df, pd.get_dummies(df[['Overall Cond']], drop_first=True)], axis=1)
# test_df = pd.concat([test_df, pd.get_dummies(test_df[['Overall Cond']], drop_first=True)], axis=1)

In [18]:
# Drop the original non-dummied columns
drop_cols('Overall Qual', 'Overall Cond')

#### Pool
- During cleanup we dummied this into 'Has Pool' to show whether the house has a pool or not. I am going to drop the Pool Area column, as it was pretty closely correlated with 'Has Pool', and I'll dummy the Pool Quality column
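
A quick way to check a correlation like the one mentioned above is `Series.corr`. The sketch below uses made-up numbers, not the Ames data, and the frame name `pools` is hypothetical. It only illustrates why an area column that is zero for almost every home tends to track a has/has-not flag closely.

```python
import pandas as pd

# Toy data (hypothetical): most homes have no pool
pools = pd.DataFrame({'Pool Area': [0, 0, 0, 0, 0, 0, 512, 648],
                      'Has Pool':  [0, 0, 0, 0, 0, 0, 1, 1]})

# Pearson correlation between the area and the indicator
r = pools['Pool Area'].corr(pools['Has Pool'])
print(round(r, 3))
```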

In [19]:
iter_col('Pool')

['Pool Area', 'Pool QC', 'Has Pool']
['Pool Area', 'Pool QC', 'Has Pool']


In [21]:
re_val('Pool QC')

In [22]:
df['Pool QC'].value_counts()

NotApp.    2042
Gd            5
Avg           2
Po            2
Name: Pool QC, dtype: int64

In [None]:
# df = pd.concat([df, pd.get_dummies(df[['Pool QC']], drop_first=True)], axis=1)
# test_df = pd.concat([test_df, pd.get_dummies(test_df[['Pool QC']], drop_first=True)], axis=1)

In [23]:
drop_cols('Pool Area')

#### Heating
- I will go with the condition of the heating instead of the type; most times when purchasing a home, the question is how old the furnace is

In [24]:
iter_col('Heating')

['Heating', 'Heating QC']
['Heating', 'Heating QC']


In [25]:
re_val('Heating QC')
df['Heating QC'].value_counts()

Gd     1384
Avg     597
Po       70
Name: Heating QC, dtype: int64

In [26]:
drop_cols('Heating')

#### Electrical

In [29]:
test_df['Electrical'].value_counts()

SBrkr    814
FuseA     48
FuseF     15
FuseP      1
Name: Electrical, dtype: int64

In [None]:
# df.to_csv('./datasets/train_clean.csv', index = False)
# test_df.to_csv('./datasets/test_clean.csv', index = False)

In [None]:
# Just to make sure nothing came back
print('null values left on Train', df.isnull().sum().sum())
print('null values left on Test', test_df.isnull().sum().sum())