# Task 3: Prediction with Machine Learning 

### Table of Contents

1. [Introduction](#introduction)
2. [Data Preprocessing](#data-preprocessing)
3. [Machine Learning](#machine-learning)

## Introduction <a class="anchor" id="introduction"></a>

Machine learning will be applied to predict house prices with the Ames Housing dataset, which was cleaned during data wrangling and exploratory data analysis (EDA). The first step of this project is to preprocess the data based on the insights discovered during EDA. Subsequently, the preprocessed data will be used to train various machine learning models, such as a neural network, XGBoost and linear regression, in order to predict house prices.

## Data Preprocessing <a class="anchor" id="data-preprocessing"></a>

To preprocess the data, the following steps will be carried out in a pipeline.

1. Split data into training data and validation data with a ratio of 80:20
2. Use ordinal encoding to encode ordinal variables
3. Use one-hot encoding to encode nominal variables with small cardinality, as well as those that seem to have a substantial impact on house price (in order to retain all the information)
4. Use hash encoding to encode nominal variables with high cardinality
5. Perform standardization
6. Use k-nearest neighbors (KNN) imputer to impute missing values
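
The ordering of steps 2, 5 and 6 can be sketched on a toy example. The data frame `toy`, its values and the choice of `n_neighbors=2` are assumptions made purely for illustration; only the column names and the quality levels come from the dataset. The sketch relies on `StandardScaler` ignoring NaNs when fitting and passing them through, so that the KNN imputer can fill them afterwards on standardized distances.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one ordinal and one continuous column, each with a missing value
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", None, "Ex", "Fa"],
    "GrLivArea": [1500.0, 1800.0, 1700.0, np.nan, 900.0],
})

# Ordinal encoding: map each level to its 1-based rank, leaving missing values as NaN
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
rank = {level: i + 1 for i, level in enumerate(quality_order)}
toy["ExterQual"] = toy["ExterQual"].map(rank)

# Standardize first, then impute the remaining NaNs with KNN
toy_pipeline = Pipeline(steps=[
    ("standard_scaler", StandardScaler()),
    ("imputer", KNNImputer(n_neighbors=2, weights="distance")),
])
toy_out = toy_pipeline.fit_transform(toy)
```

Standardizing before imputation keeps features with large units, such as areas in square feet, from dominating the nearest-neighbor distances.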

In [None]:
import numpy as np 
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer

In [None]:
# Define file path of training set
dirname = '/kaggle/input'
subdirname = 'dataset'
train_filename = 'train_clean_EDA.csv'
train_filepath = os.path.join(dirname, subdirname, train_filename)

# Load training set
df = pd.read_csv(train_filepath)

# Drop ID column
df.drop(['Id'], axis=1, inplace=True)

# Split data into training and validation data with the ratio of 80:20
df_train, df_val = train_test_split(df, train_size=0.8, random_state=0)

# Define discrete(dis), continuous(con), nominal(nom), ordinal(ord) variables
data_types_dict = {'nom': ['MSSubClass', 'MSZoning', 'Street', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 
                           'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation',
                           'Heating', 'CentralAir', 'Electrical', 'GarageType', 'PavedDrive', 'SaleType', 'SaleCondition'],
                   
                   'ord': ['LotShape', 'LandContour', 'LandSlope', 'OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 
                           'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'Functional', 
                           'GarageFinish', 'GarageQual', 'GarageCond'],
                   
                   'dis': ['YearBuilt', 'YearRemodAdd', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
                           'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'MoSold', 'YrSold'],
                   
                   'con': ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', 
                           '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 
                           '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']
                  }

# Define order of categorical values in each ordinal variable
ordinal_var_dict = {'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],
                    'LandContour': ['Lvl', 'Bnk', 'HLS', 'Low'],
                    'LandSlope': ['Sev', 'Mod', 'Gtl'],
                    'OverallQual': list(range(0,11)),
                    'OverallCond': list(range(0,11)),
                    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'BsmtQual': ['NoBsmt', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'BsmtCond': ['NoBsmt', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'BsmtExposure': ['NoBsmt', 'No', 'Mn', 'Av', 'Gd'],
                    'BsmtFinType1': ['NoBsmt', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
                    'BsmtFinType2': ['NoBsmt', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
                    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
                    'GarageFinish': ['NoGarage', 'Unf', 'RFn', 'Fin'],
                    'GarageQual': ['NoGarage', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                    'GarageCond': ['NoGarage', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                   }

for key in ordinal_var_dict:
    # Remove categorical values from ordinal_var_dict that do not exist in training set to avoid data leakage
    train_unique = df_train[key].unique()
    ordinal_var_dict[key] = [cat_value for cat_value in ordinal_var_dict[key] if cat_value in train_unique]
    
    # Convert variables in training set into ordered categorical types
    ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = ordinal_var_dict[key])
    # (assign the column directly; setting through .loc[:, key] may keep the original dtype)
    df_train[key] = df_train[key].astype(ordered_var)
    

# Define discrete(dis), continuous(con), nominal(nom), ordinal(ord) variables excluding target variable, which is SalePrice
data_types_dict = {'nom': ['MSSubClass', 'MSZoning', 'Street', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 
                           'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation',
                           'Heating', 'CentralAir', 'Electrical', 'GarageType', 'PavedDrive', 'SaleType', 'SaleCondition'],
                   
                   'ord': ['LotShape', 'LandContour', 'LandSlope', 'OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 
                           'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'Functional', 
                           'GarageFinish', 'GarageQual', 'GarageCond'],
                   
                   'dis': ['YearBuilt', 'YearRemodAdd', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
                           'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'MoSold', 'YrSold'],
                   
                   'con': ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', 
                           '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 
                           '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']
                  }

# Break nominal variables into two groups: one-hot encoding or hash encoding
data_types_dict['nom_one_hot'] = []
data_types_dict['nom_hash'] = []

for var in data_types_dict['nom']:
    
    # Based on EDA conducted previously, Neighborhood, MSZoning, MasVnrType and Foundation seem to have 
    # a substantial impact on SalePrice. Thus, one-hot encoding will be chosen over hash encoding for these
    # variables, regardless of cardinality, because hash encoding will cause a loss of information
    if var in ['Neighborhood', 'MSZoning', 'MasVnrType', 'Foundation']:
        data_types_dict['nom_one_hot'].append(var)
        
    else:
        if len(df_train[var].unique()) <= 5 :
            # use one-hot encoding if the cardinality is small
            data_types_dict['nom_one_hot'].append(var)
        else:
            # use hash encoding if the cardinality is big
            data_types_dict['nom_hash'].append(var) 
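
To see why hash encoding can lose information, the sketch below assigns category labels to a fixed number of buckets with a hash function. It only illustrates the hashing trick and is not the `category_encoders` implementation; the helper `hash_encode` and the use of MD5 are assumptions made for this example.

```python
import hashlib

import numpy as np


def hash_encode(values, n_components=8):
    """Hypothetical helper: mark the hash bucket of each label with a 1."""
    out = np.zeros((len(values), n_components))
    for row, value in enumerate(values):
        # Hash the label and fold the digest into one of n_components buckets
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        out[row, int(digest, 16) % n_components] = 1.0
    return out


# Identical labels always land in the same bucket
encoded = hash_encode(["VinylSd", "MetalSd", "VinylSd", "Wd Sdng"], n_components=8)
```

Because the number of buckets is fixed, two different labels may share a bucket once the cardinality grows, and the model can no longer tell them apart. One-hot encoding avoids this at the cost of one column per label.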

In [None]:
class myOrdinalEncoder(BaseEstimator, TransformerMixin):
    """
    Encode ordinal features, which may contain NaNs, as a 2D array
        
    Parameters
    ----------
    categories: dict
        A dictionary of unique categorical values for each ordinal variable
        
    unknown_value: int, default=0
        Unknown value to use for unknown categorical feature which is not seen in training data
    """
    
    
    def __init__( self, categories={}, unknown_value=0):  
        self.categories = categories
        self.unknown_value = unknown_value

        
    def fit( self, X, y = None ):
        """
        Fit myOrdinalEncoder to X.
    
        Parameters
        ----------
        X : DataFrame, shape [n_samples, n_features]
            The data to determine the categories of each feature.
        y : None
            Ignored. This parameter exists only for compatibility with
            :class:`~sklearn.pipeline.Pipeline`.
        Returns
        -------
        self
        """
        if not self.categories:
            # Build a new dict instead of mutating the default argument, which is shared between instances
            self.categories = {col: list(X[col].cat.categories) for col in X}
                
        return self

    def transform(self, X): 
        """
        Transform X using ordinal encoding.
        
        Parameters
        ----------
        X : DataFrame, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_copy : DataFrame
            Transformed input.
        """
        
        # Create a copy of X
        X_copy = X.copy()
        
        for col in X_copy:
            
            # Convert CategoricalDtype back to the original data type
            X_copy[col] = np.where(X_copy[col].isnull(), np.nan, X_copy[col].astype(type(self.categories[col][0])))

            # Set unknown categorical feature (except NaN) to unknown_value
            X_copy.loc[~X_copy[col].isin(self.categories[col]) & X_copy[col].notnull(), col] = self.unknown_value
 
            # Transform each feature to ordinal codes (1-based; unknown_value is kept for unseen categories)
            codes = {category: i + 1 for i, category in enumerate(self.categories[col])}
            X_copy[col] = X_copy[col].replace(codes)
    
        return X_copy
    
    
class myOneHotEncoder(BaseEstimator, TransformerMixin):
    """
    Encode categorical features, which may contain NaNs, as a one-hot numeric array
    """
    def __init__(self):
        # Initialize one-hot encoder
        self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
        self.nan_replacement = None
        self.one_hot_length = []

    def fit(self, X, y = None ):
        """
        Fit myOneHotEncoder to X.
    
        Parameters
        ----------
        X : DataFrame, shape [n_samples, n_features]
            The data to train the encoder
        y : None
            Ignored. This parameter exists only for compatibility with
            :class:`~sklearn.pipeline.Pipeline`.
        Returns
        -------
        self
        """
        
        # Create a copy of X
        X_copy = X.copy()
        
        # Replace NaNs with replacement before fitting to avoid errors
        self.nan_replacement = X_copy.mode(dropna=True).iloc[0, :]
        X_copy = X_copy.fillna(self.nan_replacement)
        
        # Fit one-hot encoder
        self.one_hot_encoder.fit(X_copy)
        
        # Get the length of one-hot encoding for each feature
        # (rebuilt on every fit so that refitting does not append duplicates)
        self.one_hot_length = [len(category) for category in self.one_hot_encoder.categories_]
            
        return self

    
    def transform(self, X):
        """
        Transform X using one-hot encoding.
        
        Parameters
        ----------
        X : DataFrame, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : 2D array
            Transformed input.
        """
        
        # Create a copy of X
        X_copy = X.copy()
        
        # Create a numpy array that defines the locations of NaNs in the dataframe after one-hot encoding
        nan_location_arr = X.to_numpy()
        nan_location_arr = np.repeat(nan_location_arr, repeats=self.one_hot_length, axis=1)

        # Replace NaNs with replacement before transforming to avoid errors
        X_out = X_copy.fillna(self.nan_replacement)
        
        # Transform each feature
        X_out = self.one_hot_encoder.transform(X_out).toarray()
        
        # Reconvert back values into NaNs
        X_out[pd.isnull(nan_location_arr)] = np.nan
        
        return X_out
    
    
class myHashingEncoder(BaseEstimator, TransformerMixin):
    """
    Encode nominal features, which may contain NaNs, as a 2D array
        
    Parameters
    ----------
    n_bits: int, default=8
        Number of bits used to represent each nominal feature
    """
    
    
    def __init__(self, n_bits=8):
        self.n_bits = n_bits
        
        # Initialize hashing encoder
        self.hashing_encoder = ce.hashing.HashingEncoder(return_df=False, n_components=n_bits)
        

    def fit(self, X, y = None ):
        """
        Ignored. This exists only for compatibility with 
            :class:`~sklearn.pipeline.Pipeline`.
        Returns
        -------
        self
        """
        
        return self
    

    def transform(self, X):
        """
        Transform X using hash encoding.
        
        Parameters
        ----------
        X : DataFrame, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : 2D array
            Transformed input.
        """
        
        # Create a numpy array that defines the locations of NaNs in the dataframe after hash encoding
        nan_location_arr = X.to_numpy()
        nan_location_arr = np.repeat(nan_location_arr, repeats=self.n_bits, axis=1)
        
        X_out = np.empty((X.shape[0],0))
        
        for col in X:
            # Convert the data type of column into str if int or float,
            # working on a local Series so that the caller's DataFrame is not modified
            col_values = X[col]
            if col_values.dtype == np.float64 or col_values.dtype == np.int64:
                col_values = col_values.astype(str)
                
            # Transform each feature and convert the data type into float for NaN
            X_out = np.concatenate((X_out, self.hashing_encoder.fit_transform(col_values.to_numpy()).astype('float')), axis=1)
            
        # As hashing encoder will turn NaN into numeric array, reconvert back these values into NaNs
        X_out[pd.isnull(nan_location_arr)] = np.nan
        
        return X_out
    
    
class hashAggregator(BaseEstimator, TransformerMixin):
    """
    Combine all the 2D arrays of features encoded using hash encoding after KNN imputation 
        
    Parameters
    ----------
    n_feas: int
        Number of features encoded using hash encoding 
    n_bits: int, default=8
        Number of bits used to represent each nominal feature
    """
    
    
    def __init__(self, n_feas, n_bits=8):
        self.n_feas = n_feas
        self.n_bits = n_bits
        
        
    def fit(self, X, y = None ):
        """
        Ignored. This exists only for compatibility with 
            :class:`~sklearn.pipeline.Pipeline`.
        Returns
        -------
        self
        """
        
        return self
    

    def transform(self, X):
        """
        Combine all the 2D arrays of features encoded using hash encoding
        
        Parameters
        ----------
        X : array, shape [n_samples, n_features*n_bits]
            The data to combine.
        Returns
        -------
        X_out : 2D array
            Combined input.
        """
        X_copy = np.copy(X)

        # Split input array into two arrays: one containing data encoded using hash encoding, another containing the rest
        X_copy_non_hash, X_copy_hash = np.split(X_copy, [-self.n_feas*self.n_bits,], axis=1)
        
        # Combine all the 2D arrays of features encoded using hash encoding
        X_copy_hash = X_copy_hash.reshape(X_copy_hash.shape[0], -1, self.n_bits).sum(1)
        
        # Concatenate hash and non-hash arrays
        X_out = np.concatenate((X_copy_non_hash, X_copy_hash), axis=1)
        
        return X_out
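
The NaN handling used by the one-hot encoder above can be sketched in isolation: fill the NaNs so that the encoder can run, then use a mask widened with `np.repeat` to put the NaNs back into every one-hot column of the affected feature. The toy frame `X_toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value per column
X_toy = pd.DataFrame({
    "Street": ["Pave", np.nan, "Grvl"],
    "CentralAir": ["Y", "N", np.nan],
})

# Temporarily fill NaNs with the per-column mode so the encoder can be fitted
fill = X_toy.mode(dropna=True).iloc[0, :]
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_toy.fillna(fill))

# Repeat each original column once per one-hot column it produces
widths = [len(categories) for categories in encoder.categories_]
nan_mask = pd.isnull(np.repeat(X_toy.to_numpy(), repeats=widths, axis=1))

# Encode, then restore NaNs wherever the original value was missing
X_toy_out = encoder.transform(X_toy.fillna(fill)).toarray()
X_toy_out[nan_mask] = np.nan
```

Keeping the NaNs in the encoded array leaves the actual imputation to the KNN imputer later in the pipeline, rather than committing to the mode.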

In [None]:
# Define number of bits for hash encoding
n_bits = 8

# Define ColumnTransformer to transform columns with different methods
column_transformer = ColumnTransformer(transformers=[('dis_transformer', 'passthrough', data_types_dict['dis']),
                                                     ('con_transformer', 'passthrough', data_types_dict['con']),
                                                     ('ord_transformer', myOrdinalEncoder(), data_types_dict['ord']),
#                                                      ('nom_one_hot_transformer', myOneHotEncoder(), data_types_dict['nom_one_hot']),
                                                     ('nom_hash_transformer', myHashingEncoder(n_bits = n_bits), data_types_dict['nom_hash']),
                                                    ])


# Define pipeline for pre-processing data
preprocessor = Pipeline(steps=[('column_transformer', column_transformer),
                               ('standard_scaler', StandardScaler()),
                               ('imputer', KNNImputer(n_neighbors=5, weights='distance')),
                               ('hash_aggregator', hashAggregator(n_feas=len(data_types_dict['nom_hash']), n_bits=n_bits)),
                              ])


transformed_data = preprocessor.fit_transform(df_train)
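
The last pipeline step sums the per-feature hash blocks into a single block of `n_bits` columns. The sketch below reproduces the split, reshape and sum from `hashAggregator.transform` on a single toy row; the array `X_demo` and the small values of `n_bits_demo` and `n_feas_demo` are assumptions chosen so that the result is easy to follow.

```python
import numpy as np

n_bits_demo = 2   # hypothetical: 2 hash columns per feature
n_feas_demo = 3   # hypothetical: 3 hash-encoded features

# One sample: 1 non-hash column followed by 3 blocks of 2 hash columns
X_demo = np.array([[9.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]])

# Split off the trailing hash columns
non_hash, hashed = np.split(X_demo, [-n_feas_demo * n_bits_demo], axis=1)

# View the hash columns as (samples, features, bits) and sum over the features
combined = hashed.reshape(hashed.shape[0], -1, n_bits_demo).sum(1)

# Re-attach the non-hash columns
X_demo_out = np.concatenate((non_hash, combined), axis=1)
```

Here the three blocks `[1, 0]`, `[0, 1]` and `[1, 0]` collapse into `[2, 1]`, so the row shrinks from seven columns to three. This only works because the hash transformer is the last entry in the `ColumnTransformer`, which places its columns at the end of the array.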

## Machine Learning <a class="anchor" id="machine-learning"></a>