# Attempts At Improvement

In this document we try a few methods of improving the performance of the logistic regression model.

### Table of Contents
1. [Dealing With Missing Data](#Dealing-With-Missing-Data)
2. [Tuning the Parameters of the Model](#Tuning-the-Parameters-of-the-Model)
3. [Conclusion](#Conclusion)
4. [References](#References)

We first get our data back into an analysable state.

In [11]:
# pip install -r requirements.txt # This can be used to install the necessary modules if needed.

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,auc
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

In [13]:
X_train = pd.read_csv("../data/X_train.csv", index_col=0)  # Use the first column as index
y_train = pd.read_csv("../data/y_train.csv", index_col=0)  # Use the first column as index
X_test = pd.read_csv("../data/X_test.csv", index_col=0)    # Use the first column as index
y_test = pd.read_csv("../data/y_test.csv", index_col=0)    # Use the first column as index

In [14]:
X_all = pd.concat({'X_train':X_train, 'X_test':X_test})  # Stack train and test so both share one encoding
objects = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']
keys = []  # Category labels for each encoded column, in the same order as `objects`

for col in objects:
    X_all[col], uniques = pd.factorize(X_all[col])  # NaN is encoded as -1
    keys.append(uniques)

X_train = X_all.loc['X_train']
X_test = X_all.loc['X_test']

y_all = pd.concat({'y_train':y_train, 'y_test':y_test})
y_all['income'], income_key = pd.factorize(y_all['income'])

y_train = y_all.loc['y_train']
y_test = y_all.loc['y_test']

In [15]:
pipe = make_pipeline(StandardScaler(), linear_model.LogisticRegression())

pipe.fit(X_train, y_train.values.ravel())
y_pred2 = pipe.predict(X_test)

trainscore = pipe.score(X_train, y_train)
testscore = pipe.score(X_test, y_test)

# Dealing With Missing Data

Another thing that may be affecting our model is missing data. Our data contains missing entries only in the training set and none in the test set. This is good, since we won't be surprised by missing values when testing, but also bad, since these missing values could be hurting the accuracy of the model. We will test two approaches: removing the rows with missing data, and imputing the missing values.

First we check to see where the missing data is:

In [16]:
X_train2 = pd.read_csv("../data/X_train.csv", index_col=0) # Re-import the dataset so we have it in its original state
X_train2.isnull().sum()

age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
dtype: int64

There are 3 columns with missing data: 'workclass', 'occupation', and 'native-country'. These are all originally object features, so when we factorised them the NaN values were converted to `-1`, as we can see below. This could cause problems for our data, especially once it is scaled, since the mean and variance of each column are skewed by these values. This motivates us to investigate how handling the missing data affects our accuracy.

In [17]:
print(X_train['workclass'].unique())

[ 0  1  2  3  4 -1  5  6  7]
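
The `-1` above comes from how `pd.factorize` treats missing entries by default. As a minimal sketch on a made-up toy column (the values are illustrative and not taken from the census data), the sentinel behaves like any other number once the column is numeric:

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (illustrative values only)
toy = pd.Series(["Private", np.nan, "State-gov", "Private"])

codes, uniques = pd.factorize(toy)
print(codes)          # the missing entry is encoded as -1
print(list(uniques))  # NaN does not appear among the category labels

# The sentinel is an ordinary number as far as a scaler is concerned,
# so it shifts the column mean that StandardScaler would subtract.
mean_with_sentinel = codes.mean()
mean_observed_only = codes[codes != -1].mean()
print(mean_with_sentinel, mean_observed_only)
```

Here the mean over all four codes differs from the mean over the three observed codes, which is the skew described above.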


One easy way to deal with the missing data is to simply **remove the rows with missing values**. This can be dangerous for smaller datasets, but since only around 10% of our rows have missing data, this could be a valid approach.

In [18]:
missingcols = ['workclass','occupation','native-country'] # Columns that contain missing values
X_train_nan = X_train.copy() # We no longer want to alter the original dataframe, so we make a copy for our tests

# Convert the missing values from -1 back to NaN so that we can then use the .dropna method.
# Series.replace returns a float column, which avoids assigning NaN into an integer column in place.
for col in missingcols:
    X_train_nan[col] = X_train_nan[col].replace(-1, np.nan)

X_y_All = pd.concat({'X_train':X_train_nan,'y_train':y_train},axis=1)
X_y_All = X_y_All.dropna()

X_train2 = X_y_All.loc[:,'X_train']
y_train2 = X_y_All.loc[:,'y_train']
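
An alternative way to drop incomplete rows while keeping features and labels aligned is a boolean mask. This is only a sketch on hypothetical toy frames (`X_toy` and `y_toy` stand in for `X_train_nan` and `y_train`); it is not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train_nan and y_train
X_toy = pd.DataFrame(
    {"age": [25, 38, 52, 41], "workclass": [0.0, np.nan, 2.0, 1.0]},
    index=[10, 11, 12, 13],
)
y_toy = pd.DataFrame({"income": [0, 0, 1, 1]}, index=[10, 11, 12, 13])

# True for rows in which every feature is present
complete = X_toy.notna().all(axis=1)

X_kept = X_toy.loc[complete]
y_kept = y_toy.loc[complete]  # same mask, so labels stay aligned with features

print(X_kept.index.tolist())
print(y_kept["income"].tolist())
```

Because the same mask indexes both frames, there is no need to concatenate them first.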

In [19]:
pipe_rem = make_pipeline(StandardScaler(), linear_model.LogisticRegression())
pipe_rem.fit(X_train2, y_train2.values.ravel())

y_pred_rem = pipe_rem.predict(X_test)

trainscore_rem = pipe_rem.score(X_train, y_train) # Using the model trained on the removed rows back on the dataset with missing values
testscore_rem = pipe_rem.score(X_test, y_test)
print(f'The training score with removed rows: {trainscore_rem}')
print(f'The training score without removed rows: {trainscore}\n')
print(f'The test score with removed rows: {testscore_rem}')
print(f'The test score without removed rows: {testscore}')

The training score with removed rows: 0.8268466837632624
The training score without removed rows: 0.8258912857645698

The test score with removed rows: 0.8198059108954565
The test score without removed rows: 0.8201367445963829


We see that removing the rows has in fact increased our training score but decreased the test score. We can't conclude much from this, so we will continue by investigating imputing before comparing all of the methods at the end.

We will now investigate a few different methods of **imputing**, namely mean, mode, and median imputation. Imputing is the act of filling in missing data by estimating what the values could be from the other data. Mean, mode, and median imputing are 3 very simple ways of doing this. More complex methods are available, such as Multiple Imputation, but they are beyond the scope of this simple digression. We will investigate now:

In [54]:
X_train_mean = X_train_nan.copy()
X_train_mode = X_train_nan.copy()
X_train_medi = X_train_nan.copy()

for col in missingcols:
    col_mean = X_train_nan[col].mean()
    col_mode = (X_train_nan[col].mode())[0]
    col_medi = X_train_nan[col].median()
    X_train_mean.fillna({col: col_mean}, inplace=True)
    X_train_mode.fillna({col: col_mode}, inplace=True)
    X_train_medi.fillna({col: col_medi}, inplace=True)

pipe_mean = make_pipeline(StandardScaler(), linear_model.LogisticRegression())
pipe_mode = make_pipeline(StandardScaler(), linear_model.LogisticRegression())
pipe_medi = make_pipeline(StandardScaler(), linear_model.LogisticRegression())

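# NOTE: all three pipelines are fitted on X_train (missing values encoded as -1), not on the imputed frames.
# To train on imputed data instead, pass X_train_mean / X_train_mode / X_train_medi to fit().
pipe_mean.fit(X_train, y_train.values.ravel())
pipe_mode.fit(X_train, y_train.values.ravel())
pipe_medi.fit(X_train, y_train.values.ravel())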

y_pred_mean = pipe_mean.predict(X_test)
y_pred_mode = pipe_mode.predict(X_test)
y_pred_medi = pipe_medi.predict(X_test)

trainscore_mean = pipe_mean.score(X_train_mean,y_train)
trainscore_mode = pipe_mode.score(X_train_mode,y_train)
trainscore_medi = pipe_medi.score(X_train_medi,y_train)
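
To make the three fill strategies concrete, here is a small sketch on a made-up numeric column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry
toy = pd.Series([0.0, 1.0, 1.0, np.nan, 4.0])

mean_filled = toy.fillna(toy.mean())      # mean of the observed values
mode_filled = toy.fillna(toy.mode()[0])   # most frequent observed value
median_filled = toy.fillna(toy.median())  # median of the observed values

print(mean_filled[3], mode_filled[3], median_filled[3])
```

Note that for a factorised categorical column the mean can be a non-integer, which does not correspond to any actual category, whereas the mode is always one of the observed codes.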

In [45]:
print(f'Normal Model Score          = {trainscore}')
print(f'Removed rows Model Score    = {trainscore_rem}')
print(f'Mean Imputing Model Score   = {trainscore_mean}')
print(f'Mode Imputing Model Score   = {trainscore_mode}')
print(f'Median Imputing Model Score = {trainscore_medi}')

Normal Model Score          = 0.8258912857645698
Removed rows Model Score    = 0.8268466837632624
Mean Imputing Model Score   = 0.8254638708704178
Mode Imputing Model Score   = 0.8254638708704178
Median Imputing Model Score = 0.8255141549756122


We see that the training scores for all of these imputing methods are worse than both the normal model and the model where we removed the rows. Note that the three imputation pipelines above were fitted on `X_train`, so these scores measure the baseline model evaluated on imputed inputs rather than models retrained on imputed data. With that caveat, this suggests to us that the missing data did not play a strong part in our model, which is reinforced by the fact that mean and mode imputing end up giving us the same score. In our normal model we were, in a sense, encoding the missing data as its own category: the method we used to encode the columns encodes `NaN` data as `-1`, and so it is treated as its own category. From our EDA in document `02.1`, we saw that the majority of the missing data came from individuals earning less than 50k a year, so treating this as its own category may in fact help our model identify the less-than-50k cases.
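
For comparison, one way to train and score on imputed inputs is to put the imputation step inside the pipeline. The sketch below uses scikit-learn's `SimpleImputer` on synthetic data (`X_demo` and `y_demo` are made-up stand-ins, not the census data), so its scores say nothing about our dataset; it only illustrates the mechanics:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data (not the census data)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries in one column
missing = rng.random(200) < 0.1
X_demo[missing, 2] = np.nan

scores = {}
for strategy in ("mean", "most_frequent", "median"):
    # The imputer learns its fill value during fit, then the scaler and
    # classifier are trained on the filled-in data.
    pipe_demo = make_pipeline(
        SimpleImputer(strategy=strategy),
        StandardScaler(),
        LogisticRegression(),
    )
    pipe_demo.fit(X_demo, y_demo)
    scores[strategy] = pipe_demo.score(X_demo, y_demo)

print(scores)
```

Keeping the imputer in the pipeline means the same fill values would be reused automatically on any later data passed to `predict` or `score`.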

# Tuning the Parameters of the Model

The `LogisticRegression` class in sklearn has a parameter known simply as `C`, set to `C = 1.0` by default. This parameter is the inverse of the regularisation strength, so smaller values mean stronger regularisation. In our case, the training and test scores are fairly similar, so we don't need to worry much about overfitting, but we may have some underfitting. A good way to tune parameters like this is cross-validation, which repeatedly holds out part of the training data to evaluate the model for each candidate parameter value. Scikit-learn has a built-in estimator for this, `LogisticRegressionCV`, which we will implement here... TBC
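
As a rough sketch of what such a search could look like, the block below runs `LogisticRegressionCV` on synthetic data from `make_classification`. The data, the grid `candidate_Cs`, and the choice of 5 folds are all assumptions made for illustration; they are not results or settings from this project:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the census data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values of C to compare (an illustrative grid)
candidate_Cs = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

pipe_cv = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=candidate_Cs, cv=5, max_iter=1000),
)
pipe_cv.fit(X_demo, y_demo)

# C_ holds the selected value; np.ravel copes with scalar or array form
best_C = float(np.ravel(pipe_cv[-1].C_)[0])
train_acc = pipe_cv.score(X_demo, y_demo)
print(f"Selected C: {best_C}")
print(f"Training accuracy: {train_acc:.3f}")
```

The selected `C` is always one of the supplied candidates, so the grid should span several orders of magnitude on either side of the default.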

# Conclusion

In conclusion, this model does not seem to be easily improvable, and our initial score was quite good given the limitations of the model. The reasons for this, and a more in-depth discussion, can be found in the next document.

# References