# Project 2: Ames Housing Saleprice Prediction

---

#### 03: <b>Cleaning and Transforming Test Data</b>

In this notebook, we will clean and transform the `test` csv using the steps that were previously done in **01 : EDA and cleaning** & **02 : Preprocessing and Feature Engineering**.

We will also observe the difference in features (i.e. columns) between the final test data and the final train data generated.

### Contents:
- [Imports and functions](#Library-and-data-import)
- [Data Cleaning](#Data-cleaning---Test)
- [Data Transformation](#Feature-Engineering-&-Preprocessing---test)
- [Sanity Checks](#Sanity-checks)
- [Export](#Export)

## Library and data import

In [1]:
# import libraries
import numpy as np
import pandas as pd

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score, RepeatedKFold

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

sns.set_style('whitegrid')

In [2]:
# import test data
test = pd.read_csv('../datasets/test.csv')

# import final train data
train_final_dummy = pd.read_csv('../datasets/train_final_dummy.csv')

pd.set_option('display.max_columns', 200)

## Functions used in:
- Data cleaning
- Feature engineering & Preprocessing

In [3]:
# function for fillna
def fillna(df,values):
    df.fillna(value = values, inplace = True)    

# function to drop columns from dataframe
def drop_cols(df, columns):
    df.drop(columns= columns, inplace = True)

# map ordinal categories to numerical values
def map_new_vals(df,col, dictionary):
    df[col] = df[col].map(dictionary)

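One behaviour of `map_new_vals` worth keeping in mind: `Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so an unexpected category in the test data would silently become a null. A minimal sketch on a made-up frame (the `toy` data and the `'XX'` category are illustrative only):

```python
import pandas as pd

def map_new_vals(df, col, dictionary):
    df[col] = df[col].map(dictionary)

# toy frame with one category ('XX') that the dictionary does not cover
toy = pd.DataFrame({'exter_qual': ['Gd', 'TA', 'XX']})
qual_cond = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}

# the helper mutates the frame in place; 'XX' has no key, so it becomes NaN
map_new_vals(toy, 'exter_qual', qual_cond)
unmapped = int(toy['exter_qual'].isnull().sum())
print(toy)
print('unmapped values:', unmapped)
```

This is one reason the null check in the sanity-check section is useful after the transformation step.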
## Data cleaning - Test

In [4]:
# function to clean test dataset
def clean_test(df):
# ------- replace NaN values with 'None' based on data description ---------
    # Alley
    test_cleaned = df.fillna(value = {'Alley': 'None'})
    
    # Basement
    values = {'Bsmt Qual': 'None' ,
          'Bsmt Cond': 'None',
          'Bsmt Exposure': 'None',
          'BsmtFin Type 1': 'None',
          'BsmtFin Type 2': 'None'}
    fillna(test_cleaned,values)
    
    # Fireplace Qu
    values = {'Fireplace Qu':'None'}
    fillna(test_cleaned, values)
    
    # Garage
    values = {'Garage Type': 'None',
          'Garage Finish': 'None',
          'Garage Qual': 'None',
          'Garage Cond': 'None'}
    fillna(test_cleaned, values)
    
    # Pool QC
    values = {'Pool QC': 'None'}
    fillna(test_cleaned, values)
    
    # Fence
    values = {'Fence': 'None'}
    fillna(test_cleaned, values)
    
    # Misc Feature
    values = {'Misc Feature': 'None'}
    fillna(test_cleaned, values)
    
    # Lot Frontage
    new_lot_df = test_cleaned[['Lot Frontage','Lot Area','Lot Config']]
    config_dummies_df = pd.get_dummies(data = new_lot_df, columns = ['Lot Config'], drop_first = True)
    it_imputer = IterativeImputer(estimator = LinearRegression())
    frontage_impute = it_imputer.fit_transform(config_dummies_df)
    frontage_impute = pd.DataFrame(frontage_impute, columns = config_dummies_df.columns)
    test_cleaned['Lot Frontage'] = frontage_impute['Lot Frontage']

# ------- replace NaN values with other values based on EDA -----------
    # Mas Vnr Type
    values = {'Mas Vnr Type':'CBlock'}
    fillna(test_cleaned, values)
    
    # Mas Vnr Area
    mean_impute = SimpleImputer(strategy = 'mean')
    test_cleaned['Mas Vnr Area'] = mean_impute.fit_transform(test_cleaned[['Mas Vnr Area']]).ravel()
    
    # Bsmt SF
    values = {'BsmtFin SF 1': 0, 
              'BsmtFin SF 2': 0,
              'Bsmt Unf SF':0,
              'Total Bsmt SF':0}
    fillna(test_cleaned,values)
    
    # Basement Bath
    values = {'Bsmt Full Bath': 0, 'Bsmt Half Bath': 0}
    fillna(test_cleaned,values)
    
    # Garage Yr Blt
    values = {'Garage Yr Blt': 0}
    fillna(test_cleaned,values)
    
    # Garage Cars , Garage Area
    values = {'Garage Cars': 0,'Garage Area': 0}
    fillna(test_cleaned, values)

# ------- standardize column title naming  -------------------
    test_cleaned.columns = test_cleaned.columns.str.lower()
    test_cleaned.columns = test_cleaned.columns.str.strip()
    test_cleaned.columns = test_cleaned.columns.str.replace(' ','_')
    
    return test_cleaned

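The `Lot Frontage` step inside `clean_test` can be hard to follow in isolation, so here is a small sketch of the same idea on made-up lots: one-hot encode `Lot Config`, then let `IterativeImputer` with a `LinearRegression` estimator predict the missing frontage from the other columns. The `toy_lots` values are invented, and `dtype=float` is an addition here that only keeps every column numeric.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# made-up lots: frontage is missing for two rows
toy_lots = pd.DataFrame({
    'Lot Frontage': [60.0, np.nan, 80.0, 50.0, np.nan, 70.0],
    'Lot Area': [8000, 9500, 12000, 6000, 10000, 9000],
    'Lot Config': ['Inside', 'Corner', 'Inside', 'Corner', 'Inside', 'Corner'],
})

# one-hot encode the nominal column so the regression can use it
encoded = pd.get_dummies(toy_lots, columns=['Lot Config'],
                         drop_first=True, dtype=float)

# impute missing frontage from lot area and lot configuration
imputer = IterativeImputer(estimator=LinearRegression())
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)

# observed values are kept; only the missing entries are filled in
toy_lots['Lot Frontage'] = imputed['Lot Frontage']
print(toy_lots)
```

The final assignment relies on both frames sharing the same default row index, which is also the case for `test_cleaned` in `clean_test`.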
In [5]:
test_cleaned = clean_test(test)
test_cleaned.head(10)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,67.357671,9662,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,67.176422,9500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,,,,0,7,2009,WD
5,333,923228370,160,RM,21.0,1890,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,4,6,1972,1972,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,294,Unf,0,252,546,GasA,TA,Y,SBrkr,546,546,0,1092,0,0,1,1,3,1,TA,5,Typ,0,,Attchd,1972.0,Unf,1,286,TA,TA,Y,0,0,64,0,0,0,,,,0,6,2010,WD
6,1327,902427150,20,RM,52.0,8516,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,4,6,1958,2006,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,869,869,GasA,TA,Y,SBrkr,1093,0,0,1093,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1959.0,Unf,1,308,TA,TA,Y,0,0,0,0,0,0,,,,0,5,2008,WD
7,858,907202130,20,RL,55.994319,9286,Pave,,IR1,Lvl,AllPub,CulDSac,Mod,CollgCr,Norm,Norm,1Fam,1Story,5,7,1977,1989,Gable,CompShg,HdBoard,Plywood,,0.0,TA,TA,CBlock,Gd,Gd,Av,ALQ,196,Unf,0,1072,1268,GasA,TA,Y,SBrkr,1268,0,0,1268,0,0,1,1,3,1,Gd,5,Typ,0,,Detchd,1978.0,Unf,1,252,TA,TA,Y,173,0,0,0,0,0,,,,0,10,2009,WD
8,95,533208090,160,FV,39.0,3515,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,TwnhsE,2Story,7,5,2004,2004,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,TA,No,Unf,0,Unf,0,840,840,GasA,Ex,Y,SBrkr,840,840,0,1680,0,0,2,1,2,1,Gd,3,Typ,0,,Attchd,2004.0,RFn,2,588,TA,TA,Y,0,111,0,0,0,0,,,,0,1,2010,WD
9,1568,914476010,20,RL,75.0,10125,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1Story,6,6,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,TA,TA,No,ALQ,641,LwQ,279,276,1196,GasA,TA,Y,SBrkr,1279,0,0,1279,0,1,2,0,3,1,TA,6,Typ,2,Fa,Detchd,1980.0,Unf,2,473,TA,TA,Y,238,83,0,0,0,0,,MnPrv,,0,2,2008,WD

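The column titles in the output above come from the three string operations at the end of `clean_test`. A short sketch of their effect on a few names (the padded `' 3Ssn Porch '` is a made-up example to show what stripping does):

```python
import pandas as pd

raw_cols = pd.Index(['Lot Frontage', 'Year Remod/Add', ' 3Ssn Porch '])

# strip surrounding whitespace, lower-case, then replace inner spaces
clean_cols = raw_cols.str.strip().str.lower().str.replace(' ', '_')
print(list(clean_cols))
```

Characters other than spaces are left alone, which is why the slash survives in `year_remod/add`.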

In [6]:
# export cleaned test data to csv
test_cleaned.to_csv('../datasets/test_cleaned.csv', index = False)

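One thing to watch when round-tripping through CSV: the cleaning step stores the literal string `'None'`, and depending on the pandas version, `read_csv` may treat that string as a missing-value marker when the file is read back. The sketch below shows one possible guard, using an in-memory buffer and made-up rows. Passing `keep_default_na=False` together with `na_values=['']` keeps `'None'` as text while still reading empty fields as `NaN`.

```python
import io

import numpy as np
import pandas as pd

# made-up rows: 'None' is a real category, NaN is a genuinely missing number
toy = pd.DataFrame({'alley': ['None', 'Grvl'],
                    'lot_frontage': [69.0, np.nan]})

# write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# only the empty string is treated as missing when reading back
restored = pd.read_csv(buffer, keep_default_na=False, na_values=[''])
print(restored)
```

If the null check in the sanity-check section ever reports unexpected nulls after re-importing the cleaned data, this is one place to look.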
## Feature Engineering & Preprocessing - test

In [7]:
# import cleaned test data
test_cleaned = pd.read_csv('../datasets/test_cleaned.csv')

# function to transform test dataset and export to csv format
def transform_test(df):
# ------- feature selection for numerical columns ----------------
    # drop id and pid columns
    columns = ['id','pid']
    drop_cols(df, columns)
    
    # drop near zero variance predictors
    columns = ['low_qual_fin_sf','bsmt_half_bath','3ssn_porch',
               'screen_porch','pool_area','misc_val']
    drop_cols(df, columns)
    drop_cols(df, ['kitchen_abvgr'])
    
    # drop features with collinearity
    drop_cols(df, ['lot_frontage'])
    drop_cols(df, ['1st_flr_sf','2nd_flr_sf'])
    drop_cols(df, ['bsmt_full_bath'])
    drop_cols(df, ['garage_cars'])
    drop_cols(df, ['totrms_abvgrd'])
    drop_cols(df, ['bsmtfin_sf_1','bsmtfin_sf_2','bsmt_unf_sf'])

# ------- feature engineering for numerical columns --------------
    # create new feature: house_sold_age
    df['house_sold_age'] = df['yr_sold'] - df['year_remod/add']
    # drop year_remod/add, yr_sold and year_built
    drop_cols(df,['year_remod/add','yr_sold', 'year_built'])
    
    # create new feature: garage_age
    df['garage_age'] = 2010 - df['garage_yr_blt']
    # drop garage_yr_blt
    drop_cols(df,['garage_yr_blt'])
    
    # convert mo_sold to encoded ordinal based on saleprice
    df['mo_sold'] = df['mo_sold'].astype(str)
    ordinal_values = {'4':0, '3':1, '10':1, '2':1, '5':2, '6':2, '12':2, '11':2, '8':3,
                      '7':3, '9':3, '1':4}
    map_new_vals(df,'mo_sold',dictionary=ordinal_values)
    
    # drop ms_subclass due to no correlation to saleprice
    drop_cols(df, ['ms_subclass'])

# ------- feature selection for categorical columns ----------------
    # drop columns based on :
    # 1. low variance (i.e. near zero)
    # 2. how important feature is in affecting saleprice
    cols = ['street','utilities','land_slope','condition_2',
            'roof_matl','heating','central_air','electrical','pool_qc']
    drop_cols(df, cols)
    drop_cols(df, ['sale_type'])
    
# ------- feature engineering for categorical columns --------------
    # binarize 'exterior_2nd_present'
    for index, val in enumerate(df['exterior_2nd']):
        if val == df.loc[index,'exterior_1st']:
            df.loc[index, 'exterior_2nd_present'] = 0
        else:
            df.loc[index, 'exterior_2nd_present'] = 1
    
    # drop `exterior_1st` following further analysis
    drop_cols(df, ['exterior_1st'])
    
    # binarize 'alley_present'
    binary_dict = {'Grvl':1, 'Pave':1, 'None':0}
    df['alley_present'] = df['alley'].map(binary_dict)
    
    # binarize 'paved_drive_present'
    binary_dict = {'Y':1, 'P':1, 'N':0}
    df['paved_drive_present'] = df['paved_drive'].map(binary_dict)
    
    # drop old features
    drop_cols(df, ['exterior_2nd'])
    drop_cols(df, ['alley','paved_drive'])
    
    # encode ordinal variables
    lotshape = {'Reg': 0,
                'IR1': 1,
                'IR2': 2,
                'IR3': 3}

    qual_cond = {'Ex': 5,
                 'Gd': 4,
                 'TA': 3,
                 'Fa': 2,
                 'Po': 1,
                 'None': 0}

    bsmtexposure = {'None': 0,
                    'No': 1,
                    'Mn': 2,
                    'Av': 3,
                    'Gd': 4}

    bsmtfintype = {'GLQ': 6,
                  'ALQ': 5,
                  'BLQ': 4,
                  'Rec': 3,
                  'LwQ': 2,
                  'Unf': 1,
                  'None': 0}

    garagefinish = {'Fin': 3,
                   'RFn': 2,
                   'Unf': 1,
                   'None': 0}

    fence = {'GdPrv': 4,
            'GdWo': 3,
            'MnPrv': 2,
            'MnWw': 1,
            'None': 0}
    
    neighbor = {'MeadowV': 0,
              'IDOTRR': 0,
              'BrDale': 0,
              'OldTown': 0,
              'BrkSide': 0,
              'Edwards': 0,
              'SWISU': 0,
              'Landmrk': 0,
              'Sawyer': 0,
              'NPkVill': 0,
              'Blueste': 0,
              'NAmes': 0,
              'Mitchel': 1,
              'SawyerW': 1,
              'Greens': 1,
              'Gilbert': 1,
              'NWAmes': 1,
              'Blmngtn': 2,
              'CollgCr': 2,
              'Crawfor': 2,
              'ClearCr': 2,
              'Somerst': 2,
              'Timber': 3,
              'Veenker': 3,
              'GrnHill': 3,
              'NoRidge': 4,
              'NridgHt': 4,
              'StoneBr': 4}
    
    map_new_vals(df, 'lot_shape', lotshape)
    map_new_vals(df, 'exter_qual', qual_cond)
    map_new_vals(df, 'exter_cond', qual_cond)
    map_new_vals(df, 'bsmt_qual', qual_cond)
    map_new_vals(df, 'bsmt_cond', qual_cond)
    map_new_vals(df, 'bsmt_exposure', bsmtexposure)
    map_new_vals(df, 'bsmtfin_type_1', bsmtfintype)
    map_new_vals(df, 'bsmtfin_type_2', bsmtfintype)
    map_new_vals(df, 'heating_qc', qual_cond)
    map_new_vals(df, 'kitchen_qual', qual_cond)
    map_new_vals(df, 'fireplace_qu', qual_cond)
    map_new_vals(df, 'garage_finish', garagefinish)
    map_new_vals(df, 'garage_qual', qual_cond)
    map_new_vals(df, 'garage_cond', qual_cond)
    map_new_vals(df, 'fence', fence)
    map_new_vals(df, 'neighborhood', neighbor)
    
    # encode nominal variables (i.e. One-Hot Encoding)
    nom_cols = ['ms_zoning','land_contour','lot_config','condition_1',
                'bldg_type','house_style','roof_style','mas_vnr_type','foundation',
                'functional','garage_type','misc_feature']
    test_final_dummy = pd.get_dummies(data = df, columns = nom_cols, drop_first = True)
    
    # recursive feature elimination to further reduce features
    list_to_drop = ['heating_qc','open_porch_sf','lot_area','wood_deck_sf','lot_shape','garage_cond',
                    'bsmtfin_type_2','alley_present','enclosed_porch','fence','garage_age']
    drop_cols(test_final_dummy, list_to_drop)
    
    return test_final_dummy

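The row-by-row loop that builds `exterior_2nd_present` can also be written as a single column comparison. A minimal sketch on made-up rows follows. One difference to note: this version yields integers, whereas the loop in `transform_test` produces floats such as `0.0` and `1.0`.

```python
import pandas as pd

toy = pd.DataFrame({'exterior_1st': ['VinylSd', 'HdBoard', 'CemntBd'],
                    'exterior_2nd': ['VinylSd', 'Plywood', 'CmentBd']})

# 1 when the second exterior differs from the first, 0 when they match
toy['exterior_2nd_present'] = (
    toy['exterior_2nd'] != toy['exterior_1st']
).astype(int)
print(toy)
```

The comparison is a plain string match, so differently spelled labels such as `CemntBd` and `CmentBd` count as two different exteriors, in both the loop and this sketch.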
In [8]:
test_final_dummy = transform_test(test_cleaned)
test_final_dummy.head(10)

Unnamed: 0,neighborhood,overall_qual,overall_cond,mas_vnr_area,exter_qual,exter_cond,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,total_bsmt_sf,gr_liv_area,full_bath,half_bath,bedroom_abvgr,kitchen_qual,fireplaces,fireplace_qu,garage_finish,garage_area,garage_qual,mo_sold,house_sold_age,exterior_2nd_present,paved_drive_present,ms_zoning_FV,ms_zoning_I (all),ms_zoning_RH,ms_zoning_RL,ms_zoning_RM,land_contour_HLS,land_contour_Low,land_contour_Lvl,lot_config_CulDSac,lot_config_FR2,lot_config_FR3,lot_config_Inside,condition_1_Feedr,condition_1_Norm,condition_1_PosA,condition_1_PosN,condition_1_RRAe,condition_1_RRAn,condition_1_RRNe,condition_1_RRNn,bldg_type_2fmCon,bldg_type_Duplex,bldg_type_Twnhs,bldg_type_TwnhsE,house_style_1.5Unf,house_style_1Story,house_style_2.5Fin,house_style_2.5Unf,house_style_2Story,house_style_SFoyer,house_style_SLvl,roof_style_Gable,roof_style_Gambrel,roof_style_Hip,roof_style_Mansard,roof_style_Shed,mas_vnr_type_BrkFace,mas_vnr_type_CBlock,mas_vnr_type_None,mas_vnr_type_Stone,foundation_CBlock,foundation_PConc,foundation_Slab,foundation_Stone,foundation_Wood,functional_Maj2,functional_Min1,functional_Min2,functional_Mod,functional_Typ,garage_type_Attchd,garage_type_Basment,garage_type_BuiltIn,garage_type_CarPort,garage_type_Detchd,garage_type_None,misc_feature_None,misc_feature_Othr,misc_feature_Shed
0,0,6,8,0.0,3,2,2,3,1,1,1020,1928,2,0,4,2,0,0,1,440,1,0,56,0.0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0
1,0,5,4,0.0,3,3,4,3,1,1,1967,1967,2,0,6,3,0,0,3,580,3,3,29,0.0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0
2,1,7,5,0.0,4,3,4,4,3,6,654,1496,2,1,3,4,1,4,2,426,3,3,0,0.0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0
3,0,5,6,0.0,4,3,3,3,1,1,968,968,1,0,2,3,0,0,1,480,2,3,1,0.0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0
4,0,6,5,247.0,3,3,4,3,1,4,1394,1394,1,1,3,3,2,4,2,514,3,3,46,0.0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0
5,0,4,6,0.0,3,3,3,3,1,3,546,1092,1,1,3,3,0,0,1,286,3,2,38,1.0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0
6,0,4,6,0.0,3,3,3,3,1,1,869,1093,1,0,2,3,0,0,1,308,3,2,2,0.0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0
7,2,5,7,0.0,3,3,4,4,3,5,1268,1268,1,1,3,4,0,0,1,252,3,1,20,1.0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0
8,2,7,5,0.0,4,3,4,3,1,1,840,1680,2,1,2,4,0,0,2,588,3,4,6,0.0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0
9,1,6,6,0.0,3,3,3,3,1,5,1196,1279,2,0,3,3,2,2,1,473,3,1,31,0.0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0

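As a quick check of the `house_sold_age` column, it can be recomputed by hand from the `yr_sold` and `year_remod/add` values of rows 0, 2 and 3 in the cleaned data shown earlier. As defined in `transform_test`, the feature measures years since the last remodel rather than years since construction.

```python
import pandas as pd

# values copied from rows 0, 2 and 3 of the cleaned test data
rows = pd.DataFrame({'yr_sold': [2006, 2006, 2007],
                     'year_remod/add': [1950, 2006, 2006]},
                    index=[0, 2, 3])

rows['house_sold_age'] = rows['yr_sold'] - rows['year_remod/add']
print(rows)
```

The results (56, 0 and 1) agree with the `house_sold_age` values in the transformed output for the same rows.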

## Sanity checks

In [9]:
# check for any columns with null values remaining
if test_final_dummy.isnull().sum().sum() == 0:
    print('No columns with null values remain')
else:
    null_counts = test_final_dummy.isnull().sum()
    for col_name, count in null_counts[null_counts > 0].items():
        print(f"Column '{col_name}' has {count} null value(s) left.")


No columns with null values remain


In [10]:
# see shape for final train and test data
print(train_final_dummy.shape)
print(test_final_dummy.shape)

(2045, 89)
(878, 84)


As shown above, even without counting `saleprice`, the final train data has more features than the final test data (88 vs. 84).

In [11]:
# check for features in final train data not in test data
[col for col in train_final_dummy if col not in test_final_dummy]

['saleprice',
 'ms_zoning_C (all)',
 'functional_Sal',
 'functional_Sev',
 'misc_feature_TenC']

In [12]:
# check for features in final test data not in train data
[col for col in test_final_dummy if col not in train_final_dummy]

[]

Excluding the target `saleprice`, there are 4 features in the final train data that are missing from the final test data.

Each of these corresponds to a category value that appears in the train data but not in the test data, so one-hot encoding never creates the matching dummy column for the test data.
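
The effect can be reproduced on a tiny made-up example. The sketch also shows one common way to bring the test columns in line with the train columns, by reindexing and filling the absent dummies with 0. This is an illustration under assumed toy data (`train_toy`, `test_toy`), not a step taken in this notebook.

```python
import pandas as pd

# made-up data: the level 'Sev' occurs only in the train frame
train_toy = pd.DataFrame({'functional': ['Typ', 'Min1', 'Sev']})
test_toy = pd.DataFrame({'functional': ['Typ', 'Min1']})

train_dummy = pd.get_dummies(train_toy, columns=['functional'],
                             drop_first=True, dtype=int)
test_dummy = pd.get_dummies(test_toy, columns=['functional'],
                            drop_first=True, dtype=int)

# dummy columns created for train but not for test
missing_in_test = [col for col in train_dummy if col not in test_dummy]
print(missing_in_test)

# add the absent dummy columns to the test frame, filled with 0
test_aligned = test_dummy.reindex(columns=train_dummy.columns, fill_value=0)
print(test_aligned)
```

With real data, the target column would be left out of the column list before reindexing, since the test set has no `saleprice`.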

## Export

In [13]:
# export final test data as csv
test_final_dummy.to_csv('../datasets/test_final_dummy.csv', index = False)