# **Data cleaning**

## Objectives

**Perform Business requirement 2 user story task: Data cleaning and preparation ML tasks**
* Find and correct if necessary invalid data.
* Handle outliers.
* Split dataset into train and test subsets.
* Impute missing data.
* Save cleaned train and test datasets.
* Determine some of the steps of the data cleaning and feature engineering pipeline.

## Inputs
* house prices dataset: outputs/datasets/collection/house_prices.csv

## Outputs
* cleaned train set: outputs/datasets/ml/cleaned/train_set.csv
* cleaned test set: outputs/datasets/ml/cleaned/test_set.csv
* Pickled outlier indices list: outputs/ml/outlier_indices.pkl

---

## Change working directory

Working directory changed to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load house price dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/collection/house_prices.csv')
house_prices_df.dtypes

We know from the data collection notebook that there are no duplicates in the dataset.

---

## Invalid data

### Data types

Inspection of the data types for each variable, except for shows no discrepancies from the expectation for each variable's suitable data type.

### Value ranges

Checking that the values of each variable are within the valid numeric range, or equal to one of the categorical options, as indicated in the dataset's metadata.

**First for numeric variables**.

In [None]:
numeric_house_prices_df = house_prices_df.select_dtypes(exclude=['object'])
numeric_house_prices_df.columns.tolist()

In [None]:

def check_value_ranges(variable, value_range):
    """
    Checks whether the non-missing values for a 'house_prices_df' numeric variable are in the valid variable range.

    Args:
        variable (str): name of variable.
        value_range (list): [minimum value, maximum value].
    
    Returns a boolean indicating whether all values of the variable are in the valid range.

    """
    # drop missing data without modifying the original dataframe
    variable_series = house_prices_df[variable].dropna()
    # between() is inclusive at both ends, so both the minimum and the maximum are checked
    return bool(variable_series.between(value_range[0], value_range[1]).all())


In [None]:
print('|Variable|Valid range|Data in valid range|')
variable_value_ranges = {'1stFlrSF': [334, 4692], '2ndFlrSF': [0, 2065], 'BedroomAbvGr': [0, 8], 'BsmtFinSF1': [0, 5644],
                         'BsmtUnfSF': [0, 2336], 'TotalBsmtSF': [0, 6110], 'GarageArea': [0, 1418], 'GarageYrBlt': [1900, 2010],
                         'GrLivArea': [334, 5642], 'LotArea': [1300, 215245], 'LotFrontage': [21, 313], 'MasVnrArea': [0, 1600],
                         'EnclosedPorchSF': [0, 286], 'OpenPorchSF': [0, 547], 'OverallCond': [1, 10], 'OverallQual': [1,10],
                         'WoodDeckSF': [0, 736], 'YearBuilt': [1872, 2010], 'YearRemodAdd': [1950, 2010], 'SalePrice': [34900, 755000]}

for variable in numeric_house_prices_df.columns:
    print(f'{variable}|', f'{variable_value_ranges[variable]}|', check_value_ranges(variable, variable_value_ranges[variable]))


All non-missing values are in the valid range for each numeric variable.
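
As a small illustration of the range check, the sketch below applies `Series.between` to made-up values. The series `toy_series` and the bounds are hypothetical and are not taken from the dataset. Missing values are dropped first so that they are not counted as out of range.

```python
import numpy as np
import pandas as pd

# hypothetical values for illustration only
toy_series = pd.Series([0, 3, np.nan, 8, 11])
valid_min, valid_max = 0, 8

# drop missing values so they are not treated as invalid
non_missing = toy_series.dropna()
# between() is inclusive at both ends by default
in_range = non_missing.between(valid_min, valid_max)

all_in_range = bool(in_range.all())
out_of_range_values = non_missing[~in_range].tolist()
print(all_in_range, out_of_range_values)
```

Listing the offending values, rather than returning only a boolean, can make it easier to decide whether a value is genuinely invalid or the metadata range needs revisiting.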

**Now for categorical variables**.

In [None]:
categorical_house_prices_df = house_prices_df.select_dtypes(include=['object'])
categorical_house_prices_df.columns.tolist()

In [None]:
import numpy as np

# include NaN as a valid value 
result_df = categorical_house_prices_df.isin({'BsmtExposure': ['Gd', 'Av', 'Mn', 'No', 'None', np.nan], 'BsmtFinType1': ['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'None', np.nan],
                                  'GarageFinish': ['Fin', 'RFn', 'Unf', 'None', np.nan], 'KitchenQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po', np.nan]})
for col in result_df.columns:
    print(result_df[col].value_counts())

All values are valid for the categorical variables (allowing for missing data).

---

## Outliers

The significant features EDA notebook showed that, for the continuous numeric features most significant in relation to sale price, several instances had component values that were outliers in more than 50% of those features, making them more likely to be multivariate outliers. Moreover, the components of these instances corresponded to the most extreme outliers for multiple features, which supports the idea of a correlation between the number of features for which an instance's component is an outlier and the extremity of those outliers.

Outliers for each feature (using the whole dataset) were determined using the IQR method, and the indices of the outliers were tracked and counted to determine whether the same instance gave rise to outliers for other features. It is common to split the dataset into train and test sets before handling outliers, perhaps using a transformer such as a winsorizer, in order to minimise the risk of data leakage. However, an outlier in the whole dataset (at least among the most extreme ones) will arguably still be an outlier in any sample of the distribution that contains it. It could also be argued that such values, if particularly extreme, and depending on the context of the dataset and the business aims, offer no value and potentially harm the ML algorithms. Therefore, for this dataset the outliers will be trimmed from the whole dataset.
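
The IQR rule and the per-instance outlier count can be sketched on a toy dataframe as follows. The data, the column names `a`, `b`, `c` and the helper `iqr_outlier_flags` are hypothetical and only illustrate the idea; they are not the functions used in this notebook.

```python
import pandas as pd

# hypothetical data: the last row is extreme in columns 'a' and 'b'
toy_df = pd.DataFrame({
    'a': [10, 12, 13, 14, 15, 16, 18, 60],
    'b': [1, 2, 2, 3, 3, 4, 4, 40],
    'c': [5, 6, 7, 8, 9, 10, 11, 12],
})

def iqr_outlier_flags(df):
    """Return a boolean dataframe marking values outside the 1.5*IQR fences of each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

flags = iqr_outlier_flags(toy_df)
# number of features in which each instance is an outlier
outlier_counts = flags.sum(axis=1)
# instances that are outliers in more than 50% of the features
n_features = toy_df.shape[1]
multi_feature_outliers = outlier_counts[outlier_counts > 0.5 * n_features].index.tolist()
print(multi_feature_outliers)
```

In this toy case only the last row is flagged, because it lies beyond the upper fence in two of the three columns.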

**Adding the outlier related functions from the significant feature EDA notebook.**

In [None]:
# taken from significant_feature_EDA.ipynb
def locate_single_feature_outliers(feature, df):
    """
    Locates outliers for a feature in a dataframe (containing only numeric features) using the IQR method.

    Args:
        feature (str): the feature name.
        df: dataframe containing the feature.

    Returns a list of indices corresponding to the dataframe indices of the outliers.
    """
    sample = df[feature]
    Q1 = sample.quantile(q=0.25)
    Q3 = sample.quantile(q=0.75)
    IQR = Q3 - Q1
    # an outlier lies more than 1.5*IQR above Q3 or below Q1; missing values are never flagged
    is_outlier = (sample > Q3 + 1.5*IQR) | (sample < Q1 - 1.5*IQR)
    return sample[is_outlier].index.tolist()

In [None]:
# taken from significant_feature_EDA.ipynb
def locate_all_feature_outliers(df):
    """
    Amalgamates into a single list the dataframe (containing only numeric features) indices corresponding to all outliers of features in a dataframe.

    Args:
        df: dataframe containing numeric features.

    Returns a list. It contains a series whose index corresponds to the index of an outlier instance, and whose values
    correspond to the number of features for which the instance is an outlier. Also contains
    a value_counts series for the series; finally contains an integer for the number of features in the dataframe.
    """
    outlier_indices = []
    for col in df.columns:
        found_outliers = locate_single_feature_outliers(col, df)
        outlier_indices.extend(found_outliers)
    index_freq = np.array(outlier_indices)
    # count how many times each index appears, i.e. in how many features it is an outlier
    index_count = np.unique(index_freq, return_counts=True)
    index_count_series = pd.Series(data=index_count[1], index=index_count[0]).sort_values(ascending=False)
    return [index_count_series, index_count_series.value_counts().sort_values(), df.columns.size]

Rediscovering the outlier instances

In [None]:
continuous_numeric_features = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'GarageArea', 'GrLivArea',
                               'LotArea',
                               'LotFrontage',
                               'MasVnrArea',
                               'OpenPorchSF',
                               'TotalBsmtSF']
outlier_series, outlier_series_unique_count, total_feature_num = locate_all_feature_outliers(house_prices_df[continuous_numeric_features])
print('\n','Instances whose component values correspond to potential outliers in more than 50% of continuous numeric features:')
house_prices_df[continuous_numeric_features].loc[outlier_series[outlier_series > 5].index.tolist()]   

**Removing these instances from the whole dataset**

In [None]:
outlier_indices = outlier_series[outlier_series > 5].index.tolist()
outlier_indices

Saving the outlier_indices list

In [None]:
try:
    path = os.path.join(os.getcwd(), 'outputs/ml/')
    # exist_ok=True avoids an error when the folder already exists
    os.makedirs(path, exist_ok=True)
except Exception as e:
    print(e)

In [None]:
try:
    import joblib
    path = os.path.join(os.getcwd(), 'outputs/ml/outlier_indices.pkl')
    joblib.dump(outlier_indices, path)
except Exception as e:
    print(e)

Dropping the instances corresponding to the outliers from the whole dataset

In [None]:
house_prices_df.drop(labels=outlier_indices, inplace=True)

---

## Split dataset

In [None]:
from sklearn.model_selection import train_test_split

(train_set_df, test_set_df) = train_test_split(house_prices_df, test_size=0.25, random_state=30)

---

## Handling missing data

During the sale price correlation study, missing values in the whole dataset were imputed using a combination of the KNN (k-nearest neighbors) imputer for numeric features and an equal value frequency imputation method for categorical features. The aim was to replicate real data as realistically as possible without distorting the distributions. Comparing the distributions before and after imputation revealed no significant distortions.

Therefore the same methods will be used again, but this time applied separately to the train and test subsets.

### Identifying missing values in the test and train subsets.

In [None]:
train_set_missing_data_df = train_set_df.loc[:, train_set_df.isna().any()]
print(train_set_missing_data_df)

In [None]:
test_set_missing_data_df = test_set_df.loc[:, test_set_df.isna().any()]
print(test_set_missing_data_df)

* The target has no missing values in either the train or the test subset.
* As was already known, there are missing values in the features of both the train and test subsets.

**Imputing missing values in train and test subsets**

In [None]:
train_set_numeric_df = train_set_df.select_dtypes(exclude='object')
train_set_non_numeric_df = train_set_df.select_dtypes(include='object')
test_set_numeric_df = test_set_df.select_dtypes(exclude='object')
test_set_non_numeric_df = test_set_df.select_dtypes(include='object')

First for numeric features:

Creating function to carry out imputation

In [None]:
from sklearn.impute import KNNImputer
import seaborn as sns
import matplotlib.pyplot as plt

def numeric_feature_missing_value_KNNImputer(numeric_df, transform_df, missing_data_df):
    """
    fit_transforms numeric features of a dataset using the KNNImputer. Produces before and after histograms.

    Args:
        numeric_df: dataframe containing all of the numeric features only.
        transform_df: dataframe which is updated with imputed values, and from which the numeric features originate.
        missing_data_df: dataframe containing columns (all types) with missing data only.
    """
    # plotting distributions for numeric variables with missing data
    imputed_columns = [col for col in numeric_df.columns if col in missing_data_df.columns]
    axes_by_column = {}
    for col in imputed_columns:
        fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(10,4))
        sns.histplot(x=numeric_df[col], ax=ax[0])
        # keep the axes so the 'after' plot is drawn on the same figure
        axes_by_column[col] = ax
    
    # Imputing the missing values for required columns
    imputer = KNNImputer()
    imputer.set_output(transform='pandas')
    numeric_df = imputer.fit_transform(numeric_df)
    # check missing values have been replaced
    print('|feature|Number of missing values|')
    print(numeric_df.isna().sum())
    # rounding values to whole numbers for integer features (KNN imputing takes the mean of the neighbors)
    for col in ['BedroomAbvGr','GarageYrBlt','OverallCond','OverallQual','YearBuilt', 'YearRemodAdd']:
        numeric_df[col] = numeric_df[col].round()

    # transforming parent dataframe
    transform_df[numeric_df.columns.values] = numeric_df

    # plotting distributions after imputation on same figures, for visual comparison
    print('\n', 'Distributions before and after missing value imputation')
    for col in imputed_columns:
        sns.histplot(x=numeric_df[col], ax=axes_by_column[col][1])


train set missing values imputation

In [None]:
numeric_feature_missing_value_KNNImputer(train_set_numeric_df.drop('SalePrice', axis=1), train_set_df, train_set_missing_data_df)

test set missing values imputation

In [None]:
numeric_feature_missing_value_KNNImputer(test_set_numeric_df.drop('SalePrice', axis=1), test_set_df, test_set_missing_data_df)

The imputer was fitted to the train and test sets separately, rather than being fitted only to the train set and then used to transform both sets. The rationale for the alternative is that only information learned from the train set should be used to transform the train and test sets in the same way. Whilst it would in many cases be a clear example of data leakage if the test set were used to influence the train set, it is less clear whether modifying each subset independently impacts the model's performance, at least for transformers that are sample specific (e.g. those that use the location of nearest neighbors).

The KNN imputer uses the k-nearest neighbors to calculate the missing values. When fitted, scikit-learn's `KNNImputer` stores the dataset it was given. When transforming, it measures the distance from each instance with a missing value to the stored instances, using the features that are present in both, and replaces the missing value with the mean of that feature over the nearest stored neighbors. So if the imputer is fitted on a different dataset from the one that is transformed, the neighbors of a missing value in the transformed dataset are instances of the fitted dataset.

Consequently, if the test set were transformed using the train-set-fitted imputer, the replacement values would be drawn from train set instances, whereas fitting on the test set draws them from the test set's own instances. The goal is to predict the sale price based on an assumed relationship to the features, which themselves may be related. The choice made here is therefore to impute each subset from its own neighbors, in the hope of preserving within the test set the same kind of relationships that independent imputation hopefully preserves within the train set. This departs from the usual practice of fitting on the train set only, described above.
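
The following is a minimal sketch, on made-up numbers, of what scikit-learn's `KNNImputer` does when it is fitted on one array and then used to transform another. The arrays `fit_data` and `new_data` are hypothetical, and `n_neighbors=1` is chosen only to make the result easy to follow (the notebook uses the default settings).

```python
import numpy as np
from sklearn.impute import KNNImputer

# hypothetical 'fitted' dataset: the second feature is ten times the first
fit_data = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [10.0, 100.0]])
# hypothetical instance to transform, with its second feature missing
new_data = np.array([[9.0, np.nan]])

imputer = KNNImputer(n_neighbors=1)
imputer.fit(fit_data)
# distances are computed on the features present in both rows (here the first feature),
# and the missing value is taken from the nearest row of the fitted dataset
imputed = imputer.transform(new_data)
print(imputed)
```

Here the missing value is filled from the fitted row whose first feature (10) is closest to the new row's first feature (9), so the donor values always come from whichever dataset the imputer was fitted on.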

For the categorical/non-numeric features:

Creating function to carry out imputation

In [None]:
def equal_frequency_imputer_categorical_features(categorical_df, transform_df, missing_data_df):
    """
    Imputes missing values for categorical features using an equal frequency value replacement method. Produces before and after count plots.

    The missing values are individually replaced in sequence by repeatedly cycling through the possible values.

    Args:
        categorical_df: dataframe containing all of the categorical features only.
        transform_df: dataframe which is updated with imputed values, and from which the categorical features originate.
        missing_data_df: dataframe containing columns (all types) with missing data only.
        
    """
    # plotting distributions for categorical variables with missing data
    imputed_columns = [col for col in categorical_df.columns if col in missing_data_df.columns]
    axes_by_column = {}
    for col in imputed_columns:
        fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(10,4))
        sns.countplot(x=categorical_df[col], ax=ax[0])
        # keep the axes so the 'after' plot is drawn on the same figure
        axes_by_column[col] = ax

    # imputing the missing values
    for col in imputed_columns:
        missing_mask = categorical_df[col].isna()
        # possible values, excluding NaN itself
        unique_values = categorical_df[col].dropna().unique()
        # cycle through the possible values, assigning one to each missing entry in turn
        fill_values = [unique_values[i % unique_values.size] for i in range(int(missing_mask.sum()))]
        categorical_df.loc[missing_mask, col] = fill_values
        transform_df.loc[missing_mask, col] = fill_values

    # checking the missing values have been replaced
    print('|feature|Number of missing values|')
    print(categorical_df.isna().sum())

    # plotting the distributions on the same figures after imputation for comparison
    print('\n', 'Distributions before and after missing value imputation')
    for col in imputed_columns:
        sns.countplot(x=categorical_df[col], ax=axes_by_column[col][1])
            


Imputing the missing values for the categorical features

train set:

In [None]:
equal_frequency_imputer_categorical_features(train_set_non_numeric_df, train_set_df, train_set_missing_data_df)

test set:

In [None]:
equal_frequency_imputer_categorical_features(test_set_non_numeric_df, test_set_df, test_set_missing_data_df)

For the categorical features it is more difficult to replace a missing value with something approximating the true value, so it was decided instead to preserve the shape of the distribution. By cycling through the missing values and replacing them in sequence with each of the possible values in turn, the hope is to disrupt the relationships between these features and the target only somewhat randomly, causing some noise without altering the direction of any correlation. Fortunately, only a small number of values were missing for the categorical features.
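
The cycling idea can be sketched on a toy series as follows. The series `toy_feature` and its values are made up for illustration, and `itertools.cycle` is used here as one possible way of repeating the observed values; it is not necessarily how the notebook's function does it.

```python
from itertools import cycle, islice

import numpy as np
import pandas as pd

# hypothetical categorical feature with four missing entries
toy_feature = pd.Series(['Gd', np.nan, 'TA', np.nan, 'Fa', np.nan, np.nan], dtype='object')

# observed values in order of first appearance, excluding NaN
observed_values = toy_feature.dropna().unique()
missing_mask = toy_feature.isna()

# take one value per missing entry, cycling back to the start when the values run out
fill_values = list(islice(cycle(observed_values), int(missing_mask.sum())))

imputed = toy_feature.copy()
imputed.loc[missing_mask] = fill_values
print(imputed.tolist())
```

Each observed value is used about equally often as a replacement, so no single category is inflated by the imputation.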

**Final check that all missing values have been replaced**

In [None]:
print('train_set')
print('|feature|Number of missing values|')
print(train_set_df.isna().sum())

In [None]:
print('test_set')
print('|feature|Number of missing values|')
print(test_set_df.isna().sum())

---

## Saving cleaned datasets

In [None]:
try:
    path = os.path.join(os.getcwd(), 'outputs/datasets/ml/cleaned/')
    # exist_ok=True avoids an error when the folder already exists
    os.makedirs(path, exist_ok=True)
except Exception as e:
    print(e)

In [None]:
try:
    train_set_df.to_csv(os.path.join(path, 'train_set.csv'), index=False)
except Exception as e:
    print(e)

In [None]:
try:
    test_set_df.to_csv(os.path.join(path, 'test_set.csv'), index=False)
except Exception as e:
    print(e)

---

## Creating custom transformers for pipelines

Creating transformer for imputing the missing values of categorical features

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class EqualFrequencyImputer(BaseEstimator, TransformerMixin):
    """
    Imputes missing values for categorical features using an equal frequency replacement with possible feature values.

    The missing values are individually replaced in sequence by repeatedly cycling through the possible values.
    """
    def fit(self, x, y=None):
        """
        No fitting is performed.
        """
        return self

    def transform(self, x):
        """
        Transform the features by applying the equal frequency imputation to each categorical feature with missing values.
        """
        # work on a copy so the caller's dataframe is not modified
        x = x.copy()
        for col in x.select_dtypes(include='object').columns:
            missing_mask = x[col].isna()
            # possible values, excluding NaN itself
            unique_values = x[col].dropna().unique()
            if not missing_mask.any() or unique_values.size == 0:
                continue
            # cycle through the possible values, assigning one to each missing entry in turn
            fill_values = [unique_values[i % unique_values.size] for i in range(int(missing_mask.sum()))]
            x.loc[missing_mask, col] = fill_values

        return x
        

Creating custom transformer for imputing missing values for numeric features

The transformer is trivial, but needed because of the desire to fit and transform train and test sets independently with the KNNImputer. The motivation for this was discussed earlier.

In [None]:
class IndependentKNNImputer(BaseEstimator, TransformerMixin):
    """
    Imputes missing values for numerical features using a KNNImputer.
    
    Allows independent fitting and transforming for train and test sets.
    """
    def fit(self, x, y=None):
        """
        No fitting is performed here; the KNNImputer is fitted in transform, on whichever dataset is being transformed.
        """
        return self

    def transform(self, x):
        """
        Fits a KNNImputer to the numeric features of x and transforms them.
        """
        # work on a copy so the caller's dataframe is not modified
        x = x.copy()
        numeric_df = x.select_dtypes(exclude='object')
        imputer = KNNImputer()
        imputer.set_output(transform='pandas')
        x[numeric_df.columns] = imputer.fit_transform(numeric_df)
        return x

---

## Conclusions

**Data cleaning and feature engineering pipeline steps**

* Will impute missing values for the numeric features using a custom transformer: IndependentKNNImputer.
* Will impute missing values for the categorical features using a custom transformer: EqualFrequencyImputer.
