# Data Handling

In its current state, the data will be a challenging element of this project. The features are very weakly correlated and the classes are imbalanced, so the methodology requires careful thought. As our exploratory data analysis concluded, the data needs significant manipulation before we can start to apply models for prediction.

In [13]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

train = pd.read_csv("Dataset/new_train.csv")

# Missing values

The greatest initial obstacle facing our models will be the large amount of missing data.
We can run an analysis to look into this further. In this dataset, missing values are recorded as -1.

In [14]:
vars_with_missing = []

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings / train.shape[0]

        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))

Variable ps_ind_02_cat has 154 records (0.03%) with missing values
Variable ps_ind_04_cat has 62 records (0.01%) with missing values
Variable ps_ind_05_cat has 4344 records (0.97%) with missing values
Variable ps_reg_03 has 80734 records (18.09%) with missing values
Variable ps_car_01_cat has 78 records (0.02%) with missing values
Variable ps_car_02_cat has 2 records (0.00%) with missing values
Variable ps_car_03_cat has 308561 records (69.12%) with missing values
Variable ps_car_05_cat has 200191 records (44.84%) with missing values
Variable ps_car_07_cat has 8563 records (1.92%) with missing values
Variable ps_car_09_cat has 420 records (0.09%) with missing values
Variable ps_car_11 has 2 records (0.00%) with missing values
Variable ps_car_12 has 1 records (0.00%) with missing values
Variable ps_car_14 has 31843 records (7.13%) with missing values
In total, there are 13 variables with missing values
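
The same tally can also be computed without an explicit loop. The following is a minimal sketch on a toy frame rather than the real `train` data; the helper name `missing_summary` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

# Toy frame standing in for `train`; -1 marks a missing value, as in the dataset.
toy = pd.DataFrame({
    "ps_ind_02_cat": [1, -1, 2, 1],
    "ps_reg_03": [0.5, -1.0, -1.0, 0.7],
    "ps_car_13": [0.8, 0.9, 0.6, 0.7],
})

def missing_summary(df, missing_marker=-1):
    """Return count and share of marker values per column, largest first."""
    counts = (df == missing_marker).sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    # Keep only the columns that actually contain the marker
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Sorting by the number of missing records puts the columns that need the most attention at the top.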


Variables ps_car_03_cat and ps_car_05_cat stand out as having a very large proportion of missing data (69.12% and 44.84% respectively).
It therefore seems justified to remove these columns: with so many values missing, they are unlikely to contain much meaningful information, so we do not expect removing them to bias our results. Further analysis could check whether a missing value in these columns is itself correlated with our target, but at this stage it seems best to remove the columns.
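
The further analysis mentioned above could start by comparing the target rate between rows where a column is missing and rows where it is present. This is only a sketch on made-up data: the helper `target_rate_by_missingness` and the toy values are hypothetical, and a real check would also need to account for sample size and class imbalance.

```python
import pandas as pd

# Toy data; -1 marks a missing value, `target` is the binary label.
toy = pd.DataFrame({
    "ps_car_03_cat": [-1, -1, 0, 1, -1, 1, 0, -1],
    "target":        [ 0,  1, 0, 0,  1, 0, 0,  0],
})

def target_rate_by_missingness(df, column, target="target", missing_marker=-1):
    """Mean target and row count for missing vs. present values of `column`."""
    status = (df[column] == missing_marker).map({True: "missing", False: "present"})
    return df.groupby(status.rename("status"))[target].agg(["mean", "count"])

rates = target_rate_by_missingness(toy, "ps_car_03_cat")
print(rates)
```

A clear gap between the two rates would suggest that missingness itself carries signal, in which case keeping a missing-value indicator might be preferable to dropping the column outright.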

In [15]:
def dropmissingcol(pdData):
    vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
    pdData.drop(vars_to_drop, inplace=True, axis=1)
    return pdData

For the remaining missing values, removing the affected rows is generally considered bad practice unless there is strong justification that doing so does not introduce bias. Instead, we can use our existing data to impute values for the missing data.
Imputing missing values involves replacing them with estimates derived from the data. Popular methods include mean, median, mode and regression imputation. More advanced methods, such as ANN or KNN imputation, can also be used.
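
As an illustration of one of the more advanced options, the sketch below applies scikit-learn's `KNNImputer` to a toy frame. It is not part of our pipeline. The toy values and the choice of `n_neighbors=2` are assumptions, and the -1 markers are converted to NaN first so that the imputer's default settings apply.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data; -1 marks the missing value in column "b".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, -1.0, 40.0, 50.0],
})

# KNNImputer treats NaN as missing by default, so convert the marker first.
X = toy.replace(-1, np.nan)

# Each missing entry is replaced by the mean of that column over the
# n_neighbors rows that are closest on the observed columns.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(filled)
```

On a dataset of our size, nearest-neighbour search is likely to be far more expensive than mean or regression imputation, which is one reason to treat it as an optional extension.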

We decided to test and compare the results of mean/mode imputation against a slightly more advanced regression imputation algorithm.
Regression imputation is not always better than mean imputation; which one is more appropriate depends on the specific dataset and the nature of the missing values. Mean imputation is a simple and commonly used method, but it struggles when the data contains extreme values or outliers. Regression imputation can be more effective when the values are not missing at random and the variable with missing values is related to other variables in the dataset.
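
One way to compare the two approaches is to hide values that are actually known, impute them, and measure the error against the truth. The sketch below does this on synthetic data and is not a result from our dataset: the data is deliberately built with a strong linear relationship, so regression wins by construction, and with weakly correlated features the gap may be much smaller. All names and values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Synthetic data: y depends strongly on x (an assumption of this toy example).
toy = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.1, size=n)})

# Hide roughly 20% of the known y values to act as artificial missing data.
mask = rng.random(n) < 0.2
truth = toy.loc[mask, "y"]
observed = toy.loc[~mask]

# Mean imputation: every hidden value gets the mean of the observed values.
mean_pred = np.full(mask.sum(), observed["y"].mean())

# Regression imputation: predict the hidden values from x.
model = LinearRegression().fit(observed[["x"]], observed["y"])
reg_pred = model.predict(toy.loc[mask, ["x"]])

def rmse(a, b):
    """Root mean squared error between two equally sized sequences."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

rmse_mean = rmse(truth, mean_pred)
rmse_reg = rmse(truth, reg_pred)
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"regression imputation RMSE: {rmse_reg:.3f}")
```

The same masking procedure could be applied to the complete rows of a real column such as `ps_reg_03` to see which method recovers the hidden values more accurately.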

Mean / Mode imputation

In [16]:
def missingvalues(pdData):
    # Missing values are coded as -1 in this dataset
    mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
    mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
    features = ['ps_reg_03', 'ps_car_12', 'ps_car_14', 'ps_car_11']
    for i in features:
        if i == 'ps_car_11':
            # ps_car_11 is filled with its most frequent value (mode)
            pdData[i] = mode_imp.fit_transform(pdData[[i]]).ravel()
        else:
            # The remaining features are filled with their column mean
            pdData[i] = mean_imp.fit_transform(pdData[[i]]).ravel()
    return pdData

Linear Regression Imputation

In [17]:
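def missingvalues(pdData):
    features = ['ps_reg_03', 'ps_car_12', 'ps_car_14', 'ps_car_11']
    # Columns excluded from the predictors: target, id and the imputed features
    non_predictors = ['target', 'id', 'ps_car_14', 'ps_reg_03', 'ps_car_11', 'ps_car_12']
    pdData00 = pdData.copy()  # copy that receives the imputed values
    pdData0 = pdData.copy()   # predictors for every row
    pdData1 = pdData.copy()   # rows with all four features observed

    for i in features:
        pdData1 = pdData1[pdData1[i] != -1]
    X_train = pdData1.drop(non_predictors, axis=1)
    pdData0 = pdData0.drop(non_predictors, axis=1)
    for i in features:
        # Fit one linear model per feature on the complete rows
        l_model = LinearRegression()
        y_train = pdData1[i].values
        l_model.fit(X_train, y_train)
        # Predictions are floats, so make sure the column can hold them
        pdData00[i] = pdData00[i].astype(float)
        # Assumes the default RangeIndex, so that label j is also row j
        for j in range(pdData00.shape[0]):
            if pdData00.at[j, i] == -1:
                X = pdData0.loc[[j]]  # one-row DataFrame of predictors
                # predict returns a length-1 array; store its scalar value
                pdData00.at[j, i] = l_model.predict(X)[0]
    return pdData00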

# Encoding data

For some models it may be necessary to encode our categorical variables. Many of them take more than two values, and a model that reads the integer codes as numbers may assume that category "3" is closer to category "2" than to category "0", which may not be the case. We therefore use one-hot encoding to expand each categorical feature into new feature columns that take only binary values: 1 for rows in that specific category and 0 otherwise. This substantially increases the number of columns in our dataframe.

In [18]:
def encodecat(train, test):
    cat_features = [col for col in train.columns if '_cat' in col]
    for column in cat_features:
        temp = pd.get_dummies(pd.Series(train[column]), prefix=column)
        train = pd.concat([train, temp], axis=1)
        train = train.drop([column], axis=1)

    for column in cat_features:
        temp = pd.get_dummies(pd.Series(test[column]), prefix=column)
        test = pd.concat([test, temp], axis=1)
        test = test.drop([column], axis=1)
    return train, test
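
Encoding the two frames separately can produce different columns when a category appears in only one of them. The sketch below shows one possible way to keep the columns aligned by reindexing the encoded test frame to the encoded train columns. The function name `encode_aligned` and the toy frames are hypothetical, and the sketch assumes both frames hold the same columns before encoding (for example, no `target` column in only one of them).

```python
import pandas as pd

# Toy frames: category 2 of ps_car_01_cat appears in train but not in test.
toy_train = pd.DataFrame({"ps_car_01_cat": [0, 1, 2], "ps_car_13": [0.8, 0.9, 0.6]})
toy_test = pd.DataFrame({"ps_car_01_cat": [0, 1, 1], "ps_car_13": [0.7, 0.5, 0.4]})

def encode_aligned(train, test):
    """One-hot encode the *_cat columns and align test to the train columns."""
    cat_features = [col for col in train.columns if '_cat' in col]
    train_enc = pd.get_dummies(train, columns=cat_features)
    test_enc = pd.get_dummies(test, columns=cat_features)
    # Categories unseen in test become all-zero columns;
    # categories seen only in test are dropped.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

train_enc, test_enc = encode_aligned(toy_train, toy_test)
print(list(test_enc.columns))
```

Aligning to the train columns means a model fitted on the encoded train frame can be applied to the encoded test frame without a column mismatch.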

# Rescaling data

Rescaling data is important because many algorithms make predictions using distance metrics such as the Euclidean or Manhattan distance. For such algorithms, all features must be on the same scale; otherwise features with large numeric ranges dominate the distance. This is a simple step that can be applied in almost all cases in ML, unless it would amplify the effect of outliers or break a relationship between two variables by scaling them differently.
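
The effect of scale on a distance metric can be seen on a small example. The sketch below uses made-up numbers, not our data: it compares the Euclidean distance between two rows before and after standardisation, and checks that `StandardScaler` computes (x - mean) / std per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values only).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Before scaling, the distance is dominated by the second feature.
raw_distance = euclidean(X[0], X[1])

# StandardScaler subtracts the column mean and divides by the column
# standard deviation (population std, i.e. ddof=0).
X_scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled_distance = euclidean(X_scaled[0], X_scaled[1])
print(raw_distance, scaled_distance)
```

After scaling, both features contribute on a comparable footing, which is what distance-based models generally need.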

In [19]:
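def RescaleData(train, test):
    # Fit the scaler on the training features only, then apply the same
    # transformation to the test features so that both share one scale.
    # Assumes train and test hold the same feature columns (no id/target).
    scaler = StandardScaler()
    train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
    test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)
    return train_scaled, test_scaled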

# Drop calc_ columns

From our EDA we see that the calc_ columns in our dataset contribute very little to predictions.
It could therefore be beneficial to remove these columns from our models.

In [20]:
def DropCalcCol(train, test):
    col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]
    train = train.drop(col_to_drop, axis=1)
    test = test.drop(col_to_drop, axis=1)
    return train, test

# Generating the linear regression imputed data

In [22]:
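def missingvalues(pdData):
    # Drop the two mostly-missing columns and the calc_ columns first
    pdData.drop(['ps_car_03_cat', 'ps_car_05_cat'], inplace=True, axis=1)
    col_to_drop = pdData.columns[pdData.columns.str.startswith('ps_calc_')]
    pdData = pdData.drop(col_to_drop, axis=1)
    features = ['ps_reg_03', 'ps_car_12', 'ps_car_14', 'ps_car_11']
    # Columns excluded from the predictors: target, id and the imputed features
    non_predictors = ['target', 'id', 'ps_car_14', 'ps_reg_03', 'ps_car_11', 'ps_car_12']
    pdData00 = pdData.copy()  # copy that receives the imputed values
    pdData0 = pdData.copy()   # predictors for every row
    pdData1 = pdData.copy()   # rows with all four features observed

    for i in features:
        pdData1 = pdData1[pdData1[i] != -1]
    X_train = pdData1.drop(non_predictors, axis=1)
    pdData0 = pdData0.drop(non_predictors, axis=1)
    for i in features:
        # Fit one linear model per feature on the complete rows
        l_model = LinearRegression()
        y_train = pdData1[i].values
        l_model.fit(X_train, y_train)
        # Predictions are floats, so make sure the column can hold them
        pdData00[i] = pdData00[i].astype(float)
        # Assumes the default RangeIndex, so that label j is also row j
        for j in range(pdData00.shape[0]):
            if pdData00.at[j, i] == -1:
                X = pdData0.loc[[j]]  # one-row DataFrame of predictors
                # predict returns a length-1 array; store its scalar value
                pdData00.at[j, i] = l_model.predict(X)[0]
    return pdData00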

train = pd.read_csv("Dataset/new_train.csv")
test = pd.read_csv("Dataset/new_test.csv")
imputetrain = missingvalues(train)
imputetest = missingvalues(test)
imputetrain.to_csv("Dataset/ImputeTrain.csv", index=False)
imputetest.to_csv("Dataset/ImputeTest.csv", index=False)


This code is considerably more time-consuming than a simple mean imputation: it takes roughly 5 minutes to run.
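
Much of that run time likely comes from calling `predict` once per missing row. A possible alternative is to predict all missing rows of a column in a single call, as in the sketch below. This is an untimed sketch on toy data, not the code used to generate the files above: the function name `regression_impute` and its parameters are assumptions, and it is intended to give the same imputed values as the row-by-row loop.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(df, features, exclude=("target", "id"), missing_marker=-1):
    """Impute marker values in `features` with one linear model per feature."""
    imputed = df.copy()
    # Predictors: everything except the excluded columns and the imputed features
    predictors = df.drop(columns=list(exclude) + list(features), errors="ignore")
    # Train only on rows where every imputed feature is observed
    complete = (df[features] != missing_marker).all(axis=1)
    for col in features:
        missing = df[col] == missing_marker
        if not missing.any():
            continue
        model = LinearRegression().fit(predictors[complete], df.loc[complete, col])
        # Predictions are floats, so make sure the column can hold them
        imputed[col] = imputed[col].astype(float)
        # One vectorised predict call for all missing rows of this column
        imputed.loc[missing, col] = model.predict(predictors[missing])
    return imputed

# Toy data in which b = 2 * a on the complete rows; -1 marks missing values.
toy = pd.DataFrame({
    "id": range(6),
    "target": [0, 1, 0, 0, 1, 0],
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, -1.0, 8.0, 10.0, -1.0],
})
out = regression_impute(toy, ["b"])
print(out)
```

Because the input frame is copied rather than modified, the original data remains available for comparing imputation methods.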