# Pre-Processing and Feature Engineering

### Contents:
- [Setup](#Setup)
- [Features](#Features)
- [Train/Test/Split](#Train/Test/Split)
- [Feature Engineering Details](#Feature-Engineering-Details)
- [Data Transformations](#Data-Transformations)

### Setup
---

In [1]:
#Library Imports
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

In [2]:
#Read in relevant csvs
train_clean = pd.read_csv('../datasets/train_clean.csv')
validate_clean = pd.read_csv('../datasets/validate_clean.csv')

### Features

Overall quality, kitchen quality, and exterior quality of the home may be related, so I've combined them into a single interaction term (their product).

In [3]:
#Interaction terms code

train_clean['kitchen_qual * overall_qual * exter_qual'] = train_clean['kitchen_qual'] * train_clean['overall_qual'] * train_clean['exter_qual']
validate_clean['kitchen_qual * overall_qual * exter_qual'] = validate_clean['kitchen_qual'] * validate_clean['overall_qual'] * validate_clean['exter_qual']
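
The same product can be built by one function and applied to both frames, so the two lines cannot drift apart. A minimal sketch, assuming a hypothetical helper `add_interaction` (not part of this notebook) and a made-up frame standing in for the cleaned CSVs:

```python
import pandas as pd

def add_interaction(df, cols):
    """Return a copy of df with one extra column holding the row-wise product of cols.

    Hypothetical helper for illustration; the new column name joins cols with ' * '.
    """
    out = df.copy()
    out[' * '.join(cols)] = out[cols].prod(axis=1)
    return out

# Made-up values standing in for train_clean
toy = pd.DataFrame({'kitchen_qual': [2, 1, 0],
                    'overall_qual': [5, 6, 6],
                    'exter_qual': [1, 1, 2]})

qual_cols = ['kitchen_qual', 'overall_qual', 'exter_qual']
toy = add_interaction(toy, qual_cols)
print(toy)
```

Because `kitchen_qual` is coded from 0, any row with a 'Fa' kitchen gets an interaction value of 0 regardless of the other two scores.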


In [4]:
#Features in use
features = ['neighborhood',
            'overall_cond',
            'bldg_type',
            'kitchen_qual',
            'central_air',
            'gr_liv_area',
            'garage_area',
            'total_bsmt_sf',
            '1st_flr_sf',
            'kitchen_qual * overall_qual * exter_qual',
            'bedroom_abvgr',
            'overall_qual',
            'exter_qual',
            'year_built']

### Train/Test/Split
___

In [5]:
#Test/Train Data
X = train_clean[features]
y = train_clean['saleprice']

#Validate Data
val = validate_clean[features]

#Train/Test/Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 24)
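
Since no `test_size` is passed, `train_test_split` falls back to its default and holds out 25% of the rows. A small sketch on made-up data shows the resulting sizes, and that a fixed `random_state` reproduces the same split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, one feature
X_toy = pd.DataFrame({'gr_liv_area': np.arange(100)})
y_toy = pd.Series(np.arange(100) * 1000, name='saleprice')

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=24)
print(len(X_tr), len(X_te))

# Same seed, same rows in the held-out set
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, random_state=24)
print(X_te.index.equals(X_te2.index))
```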

### Feature Engineering Details
---

*Why encoders were chosen per feature*

#### Neighborhoods are string values. They will be one-hot encoded.

In [6]:
train_clean['neighborhood'].unique()

array(['Sawyer', 'SawyerW', 'NAmes', 'Timber', 'Edwards', 'OldTown',
       'BrDale', 'CollgCr', 'Somerst', 'Mitchel', 'StoneBr', 'NridgHt',
       'Gilbert', 'Crawfor', 'IDOTRR', 'NWAmes', 'Veenker', 'MeadowV',
       'SWISU', 'NoRidge', 'ClearCr', 'Blmngtn', 'BrkSide', 'NPkVill',
       'Blueste', 'GrnHill', 'Greens', 'Landmrk'], dtype=object)

#### Since the Overall Condition and Overall Quality ordinal data are already integers in an order that makes sense, I will leave them as-is.
*Might be good candidates for polynomial features with kitchen quality*

In [7]:
print(train_clean['overall_cond'].unique())
validate_clean['overall_cond'].unique()


[8 5 7 6 3 9 2 4 1]


array([8, 4, 5, 6, 7, 9, 3, 2, 1], dtype=int64)

#### Building Type is a good candidate for one-hot encoding.

In [8]:
print(train_clean['bldg_type'].unique())
validate_clean['bldg_type'].unique()

['1Fam' 'TwnhsE' 'Twnhs' '2fmCon' 'Duplex']


array(['2fmCon', 'Duplex', '1Fam', 'TwnhsE', 'Twnhs'], dtype=object)

#### Kitchen Quality is a good candidate for ordinal encoding. I tried to set that up, but ultimately mapped the values to integers directly: 0 represents 'Fa' (Fair) and 3 represents 'Ex' (Excellent).

In [9]:
print(train_clean['kitchen_qual'].unique())
validate_clean['kitchen_qual'].unique()

[2 1 0 3]


array([0, 1, 2, 3], dtype=int64)
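
An integer mapping of this kind can be written as a plain dictionary, or handed to `OrdinalEncoder` with an explicit category order. The sketch below assumes the raw column used the codes `Fa`, `TA`, `Gd`, `Ex`; only the endpoints (0 for `Fa`, 3 for `Ex`) are stated in this notebook, so the two middle values are an assumption:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering; only 'Fa' -> 0 and 'Ex' -> 3 come from the notebook
qual_order = ['Fa', 'TA', 'Gd', 'Ex']
qual_map = {label: rank for rank, label in enumerate(qual_order)}

raw = pd.Series(['Gd', 'TA', 'Fa', 'Ex'], name='kitchen_qual')  # made-up sample

# Option 1: dictionary lookup with pandas
mapped = raw.map(qual_map)
print(mapped.tolist())

# Option 2: OrdinalEncoder with the order spelled out (returns floats)
oe = OrdinalEncoder(categories=[qual_order])
oe_out = oe.fit_transform(raw.to_frame())
print(oe_out.ravel())
```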

#### For Central Air, I already changed the Ys to 1s and Ns to 0s in cleaning, so I will leave it as-is here.

In [10]:
print(train_clean['central_air'].unique())
validate_clean['central_air'].unique()

[1 0]


array([0, 1], dtype=int64)

#### Number of Bedrooms had some values set to zero, so these will be replaced with the most frequently reported number of bedrooms using SimpleImputer.
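
By default `SimpleImputer` only fills `NaN`, so zeros pass through unchanged unless `missing_values=0` is set. A minimal sketch on made-up bedroom counts, assuming zero is the only placeholder for a missing value:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

beds = pd.DataFrame({'bedroom_abvgr': [3, 0, 2, 3, 0, 4]})  # made-up values

# Default missing_values is NaN, so the zeros are left alone
default_si = SimpleImputer(strategy='most_frequent')
default_out = default_si.fit_transform(beds).ravel()
print(default_out)

# Treat 0 as the missing marker; the mode is taken over the remaining values
zero_si = SimpleImputer(missing_values=0, strategy='most_frequent')
zero_out = zero_si.fit_transform(beds).ravel()
print(zero_out)
```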

### Data Transformations
---

In [11]:
# Simple Imputing
# Work on explicit copies so the assignments below do not write into slices of the source frames
X_train, X_test, val = X_train.copy(), X_test.copy(), val.copy()

# missing_values = 0 makes the imputer treat zero bedrooms as missing;
# the default (NaN) would leave the zeros untouched
si = SimpleImputer(missing_values = 0, strategy = 'most_frequent').set_output(transform = 'pandas')
imputefeatures = ['bedroom_abvgr']

X_train[imputefeatures] = si.fit_transform(X_train[imputefeatures])
X_test[imputefeatures] = si.transform(X_test[imputefeatures])
val[imputefeatures] = si.transform(val[imputefeatures])


In [12]:
#Transform the data with ColumnTransformer
ohe = OneHotEncoder(drop = 'first',
                    handle_unknown = 'ignore',
                    sparse_output = False)

ctx = ColumnTransformer(
    transformers =[
        ('one_hot', ohe, ['neighborhood', 'bldg_type']),
        ('ss', StandardScaler(), ['bedroom_abvgr', '1st_flr_sf', 'garage_area', 'total_bsmt_sf'])
    ], remainder = 'passthrough',
    verbose_feature_names_out = False
)

In [13]:
#Fit on the training set only, then reuse the fitted transformer on the test set
X_train_ctx = pd.DataFrame(ctx.fit_transform(X_train),
                           columns = ctx.get_feature_names_out())

X_test_ctx = pd.DataFrame(ctx.transform(X_test),
                          columns = ctx.get_feature_names_out())

#Transform the validation data with the same fitted transformer
val_enc = pd.DataFrame(ctx.transform(val),
                       columns = ctx.get_feature_names_out())

In [16]:
X_train_ctx.head()

| | neighborhood_Blueste | neighborhood_BrDale | neighborhood_BrkSide | neighborhood_ClearCr | neighborhood_CollgCr | neighborhood_Crawfor | neighborhood_Edwards | neighborhood_Gilbert | neighborhood_Greens | neighborhood_GrnHill | ... | garage_area | total_bsmt_sf | overall_cond | kitchen_qual | central_air | gr_liv_area | kitchen_qual * overall_qual * exter_qual | overall_qual | exter_qual | year_built |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.682676 | 0.804782 | 5.0 | 2.0 | 1.0 | 1430.0 | 10.0 | 5.0 | 1.0 | 2004.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | -0.162486 | -0.663091 | 5.0 | 2.0 | 1.0 | 1504.0 | 24.0 | 6.0 | 2.0 | 2007.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.241722 | -0.489891 | 3.0 | 1.0 | 1.0 | 1338.0 | 6.0 | 6.0 | 1.0 | 1915.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.546211 | 1.084067 | 6.0 | 1.0 | 1.0 | 1559.0 | 5.0 | 5.0 | 1.0 | 1948.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | -0.608034 | -0.771341 | 5.0 | 1.0 | 1.0 | 1481.0 | 6.0 | 6.0 | 1.0 | 1994.0 |


In [17]:
X_test_ctx.head()

| | neighborhood_Blueste | neighborhood_BrDale | neighborhood_BrkSide | neighborhood_ClearCr | neighborhood_CollgCr | neighborhood_Crawfor | neighborhood_Edwards | neighborhood_Gilbert | neighborhood_Greens | neighborhood_GrnHill | ... | garage_area | total_bsmt_sf | overall_cond | kitchen_qual | central_air | gr_liv_area | kitchen_qual * overall_qual * exter_qual | overall_qual | exter_qual | year_built |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.039618 | 0.179096 | 5.0 | 2.0 | 1.0 | 1141.0 | 28.0 | 7.0 | 2.0 | 2006.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.39215 | -0.715051 | 5.0 | 2.0 | 1.0 | 1456.0 | 24.0 | 6.0 | 2.0 | 2005.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.903153 | -0.507211 | 3.0 | 1.0 | 1.0 | 3082.0 | 7.0 | 7.0 | 1.0 | 1920.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.030431 | -0.28205 | 5.0 | 1.0 | 1.0 | 1629.0 | 5.0 | 5.0 | 1.0 | 1997.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.580474 | 0.176931 | 5.0 | 1.0 | 1.0 | 1696.0 | 5.0 | 5.0 | 1.0 | 1962.0 |


In [18]:
val_enc.head()

| | neighborhood_Blueste | neighborhood_BrDale | neighborhood_BrkSide | neighborhood_ClearCr | neighborhood_CollgCr | neighborhood_Crawfor | neighborhood_Edwards | neighborhood_Gilbert | neighborhood_Greens | neighborhood_GrnHill | ... | garage_area | total_bsmt_sf | overall_cond | kitchen_qual | central_air | gr_liv_area | kitchen_qual * overall_qual * exter_qual | overall_qual | exter_qual | year_built |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.162486 | -0.08287 | 8.0 | 0.0 | 0.0 | 1928.0 | 0.0 | 6.0 | 1.0 | 1910.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.480572 | 1.967389 | 4.0 | 1.0 | 1.0 | 1967.0 | 5.0 | 5.0 | 1.0 | 1977.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | -0.226792 | -0.875261 | 5.0 | 2.0 | 1.0 | 1496.0 | 28.0 | 7.0 | 2.0 | 2006.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.021245 | -0.19545 | 6.0 | 1.0 | 1.0 | 968.0 | 10.0 | 5.0 | 2.0 | 1923.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.177416 | 0.726842 | 5.0 | 1.0 | 1.0 | 1394.0 | 6.0 | 6.0 | 1.0 | 1963.0 |
