# Attempts At Improvement

In this document we attempt a few methods at improving the performance of the Logistic Regression model. These methods are as follows:

- Modifying the C Value of the model
- Dealing With Missing Data
- Cross Validation

We first get our data back into an analysable state.

In [1]:
# pip install -r requirements.txt # This can be used to install the necessary modules if needed.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,auc
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

In [3]:
X_train = pd.read_csv("../data/X_train.csv", index_col=0)  # Use the first column as index
y_train = pd.read_csv("../data/y_train.csv", index_col=0)  # Use the first column as index
X_test = pd.read_csv("../data/X_test.csv", index_col=0)    # Use the first column as index
y_test = pd.read_csv("../data/y_test.csv", index_col=0)    # Use the first column as index

In [4]:
X_all = pd.concat({'X_train':X_train, 'X_test':X_test})
objects = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']
keys = [0]*len(objects)

for i in range(len(objects)):
    X_all[objects[i]], keys[i] = pd.factorize(X_all[objects[i]])

X_train = X_all.loc['X_train']
X_test = X_all.loc['X_test']

y_all = pd.concat({'y_train':y_train, 'y_test':y_test})
y_all['income'], income_key = pd.factorize(y_all['income'])

y_train = y_all.loc['y_train']
y_test = y_all.loc['y_test']

## Modifying the C Value of the model

The LogisticRegression function in sklearn has a parameter known simply as C, set to C=1 by default. This parameter represents the inverse of the regularisation strength, where smaller values represent stronger regularisation. In our case, the training and testing scores are fairly similar and so we don't need to worry about potential overfitting, but we may have some underfitting. We try a few values of C, higher values produce a more flexible model, whilst lower values produce a more regularised model.

In [5]:
pipe001 = make_pipeline(StandardScaler(), linear_model.LogisticRegression(C=0.1))
pipe = make_pipeline(StandardScaler(), linear_model.LogisticRegression())
pipe100 = make_pipeline(StandardScaler(), linear_model.LogisticRegression(C=10))

pipe001.fit(X_train, y_train.values.ravel())
pipe.fit(X_train, y_train.values.ravel())
pipe100.fit(X_train, y_train.values.ravel())

#y_pred001 = pipe001.predict(X_test)
#y_pred100 = pipe100.predict(X_test)

trainscore001 = pipe001.score(X_train, y_train)
trainscore = pipe.score(X_train, y_train)
trainscore100 = pipe100.score(X_train, y_train)

testscore001 = pipe001.score(X_test, y_test)
testscore = pipe.score(X_test, y_test)
testscore100 = pipe100.score(X_test, y_test)

print('Our training score with C=0.01: {0:0.6f}'.format(trainscore001))
print('Our training score with C=1:    {0:0.6f}'.format(trainscore))
print('Our training score with C=100:  {0:0.6f}'.format(trainscore100))
print('Our testing score with C=0.01:  {0:0.6f}'.format(testscore001))
print('Our testing score with C=1:     {0:0.6f}'.format(testscore))
print('Our testing score with C=100:   {0:0.6f}'.format(testscore100))

Our training score with C=0.01: 0.825942
Our training score with C=1:    0.825891
Our training score with C=100:  0.825866
Our testing score with C=0.01:  0.820357
Our testing score with C=1:     0.820137
Our testing score with C=100:   0.820247


We see that for the training data, C=1 remain the best value, but both of the other values for the testing data provide a slight improvement over the default, with C=0.01 having a larger increase. However, this increase is not largely notable. It does tell us that the default value of C was not optimal though, and as such our model could be refined if we desired that small increase.

## Dealing With Missing Data

Another thing that may be affecting our model is the missing data. Our data only contains missing entries in the training data and none in the testing data. This is good since we wont be surprised with missing values when testing, but also bad since these missing values could be affecting the efficiency of the model. We will test a couple different methods, namely removing the rows with missing data, or methods of imputing the missing data.

First we check to see where the missing data is:

In [6]:
X_train2 = pd.read_csv("../data/X_train.csv", index_col=0) # Reimport dataset so we have it in it's original state
X_train2.isnull().sum()

age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
dtype: int64

There are 3 columns with missing data, namely 'workclass', 'occupation', and 'native-country'. These are all originally object features, and so during our factorising, the NaN values would have been converted to '-1', which we can see below. This could cause problems for our data, especially when we scaled it as the mean and variance will be skewed by these values, and so this motivates us to investigate how fixing this problem will impact our accuracy.

In [7]:
print(X_train['workclass'].unique())

[ 0  1  2  3  4 -1  5  6  7]


One easy way to deal with the data is to simply **remove the rows with missing values**. This can be dangerous for smaller datasets but since for our group the number of rows with missing data is around 10%, this could be a valid way to deal with this.

In [8]:
missingcols = ['workclass','occupation','native-country'] # collect missing columns
X_train_nan = X_train.copy() # We no longer want to alter the original dataframe so we make a copy for our tests

# convert the missing values from -1 back to nan, so that we can then use the .dropna method
for col in missingcols:
    X_train_nan.loc[X_train[col] == -1,col] = np.nan

X_y_All = pd.concat({'X_train':X_train_nan,'y_train':y_train},axis=1)
X_y_All = X_y_All.dropna()

X_train2 = X_y_All.loc[:,'X_train']
y_train2 = X_y_All.loc[:,'y_train']

In [9]:
pipe_rem = make_pipeline(StandardScaler(), linear_model.LogisticRegression())
pipe_rem.fit(X_train2, y_train2.values.ravel())

trainscore_rem = pipe_rem.score(X_train, y_train) # Using the model trained on the removed rows back on the dataset with missing values
testscore_rem = pipe_rem.score(X_test, y_test)
print(f'The training score with removed rows: {trainscore_rem}')
print(f'The training score without removed rows: {trainscore}\n')
print(f'The test score with removed rows: {testscore_rem}')
print(f'The test score without removed rows: {testscore}')

The training score with removed rows: 0.8268466837632624
The training score without removed rows: 0.8258912857645698

The test score with removed rows: 0.8198059108954565
The test score without removed rows: 0.8201367445963829


Unfortunately we see that removing the rows has infact decreased our score. This suggests that the missing data does not have much of an impact on our performance, and the reason that our score suffered after removing the rows was due to the decreased amount of data. This reassures us that the missing data is not having a negative affect on our models performance, and discourages us from attempting further methods of dealing with missing data such as imputing.

## Cross-Validation

The third and final method of improvement that we will try is using cross-validation, specifically K-fold Cross-Validation.

In [11]:
print('The score from our initial model: {:.4f}'.format(trainscore))

for x in [2,5,10,25,50]:
    scores = cross_val_score(pipe, X_train, y_train.values.ravel(), cv = x, scoring='accuracy')
    print('Average cross-validation score: {:.4f} for cv = {}'.format(scores.mean(),x))

The score from our initial model: 0.8259
Average cross-validation score: 0.8257 for cv = 2
Average cross-validation score: 0.8259 for cv = 5
Average cross-validation score: 0.8254 for cv = 10
Average cross-validation score: 0.8257 for cv = 25
Average cross-validation score: 0.8257 for cv = 50


We see that yet again, this method does not improve our score, and in fact all tests are the same or lower.

## Conclusion

Thus, in conclusion we have that this model does not seem to be easily improvable, and our initial score was quite good for the limitations of the model.