<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Preprocessing" data-toc-modified-id="1.-Preprocessing-1">1. Preprocessing</a></span></li><li><span><a href="#2.-Model-Generation" data-toc-modified-id="2.-Model-Generation-2">2. Model Generation</a></span></li></ul></div>

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib.cm as cm
from matplotlib.colors import Normalize
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import sklearn.metrics as metrics

Load and display the unemployed_swdf dataframe, which stores the records of all respondents who were unemployed in April 2020.

In [2]:
unemployed_swdf = pd.read_csv("unemployed_swdf.csv")
display(unemployed_swdf)

Unnamed: 0,job_in_apr,work_in_apr,src_income_before_apr,lost_income_in_apr,total_income_in_apr,uif_in_apr,grant_from_gov_ques,grant_from_gov
0,No,No,Income from a business,No,3532.0,No,Yes,Disability Grant
1,No,No,Government grants,Yes,1500.0,No,Yes,Child Support Grant (CSG)
2,No,No,Government grants,Yes,4200.0,Yes,Yes,Child Support Grant (CSG)
3,No,No,Income from employment,No,3532.0,No,Yes,Child Support Grant (CSG)
4,No,No,Government grants,No,4200.0,Yes,Yes,Old Age Pension Grant (OAP)
...,...,...,...,...,...,...,...,...
930,No,No,Government grants,Yes,2500.0,Yes,Yes,Old Age Pension Grant (OAP)
931,No,No,Income from employment,Yes,3532.0,No,Yes,Child Support Grant (CSG)
932,No,No,Government grants,Yes,1700.0,Yes,Yes,Child Support Grant (CSG)
933,No,No,Income from employment,No,3532.0,Yes,Yes,Other (specify)


## 1. Preprocessing 

Encode every categorical answer as an integer code so that the columns can be used by the model.

In [3]:
# Employment Information: binary answers appear in several columns, so they are replaced frame-wide
unemployed_swdf = unemployed_swdf.replace({"No": 0, "Yes": 1})

# Income Information
income_codes = {
    "Government grants": 1,
    "Income from employment": 2,
    "Household had no income in February": 3,
    "Income from a business": 4,
    "Money from friends or family": 5,
    "Other (specify)": 6,
}

# Grant Information
grant_codes = {
    "Child Support Grant (CSG)": 1,
    "Old Age Pension Grant (OAP)": 2,
    "Disability Grant": 3,
    "R350 COVID-19 Social Relief of Distress Grant": 4,
    "Foster Child Grant": 5,
    "Social relief or distress grant": 6,
    "Care Dependency Grant": 7,
    "War veterans grant": 8,
}

# Both tables are applied to the whole frame, so "Other (specify)" becomes 6 in
# grant_from_gov as well, the same code as "Social relief or distress grant".
unemployed_swdf = unemployed_swdf.replace(income_codes).replace(grant_codes)
display(unemployed_swdf)

Unnamed: 0,job_in_apr,work_in_apr,src_income_before_apr,lost_income_in_apr,total_income_in_apr,uif_in_apr,grant_from_gov_ques,grant_from_gov
0,0,0,4,0,3532.0,0,1,3
1,0,0,1,1,1500.0,0,1,1
2,0,0,1,1,4200.0,1,1,1
3,0,0,2,0,3532.0,0,1,1
4,0,0,1,0,4200.0,1,1,2
...,...,...,...,...,...,...,...,...
930,0,0,1,1,2500.0,1,1,2
931,0,0,2,1,3532.0,0,1,1
932,0,0,1,1,1700.0,1,1,1
933,0,0,2,0,3532.0,1,1,6
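In the output above, row 933 shows that "Other (specify)" in grant_from_gov is encoded as 6, the same code assigned to "Social relief or distress grant", because the replacements act on the whole frame. One way to keep the codes distinct is to map each column with its own table. The following is a minimal sketch on a toy frame; the dictionary names and the code 9 for "Other (specify)" are assumptions made for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy frame standing in for unemployed_swdf (values are illustrative only).
toy = pd.DataFrame({
    "src_income_before_apr": ["Government grants", "Other (specify)"],
    "grant_from_gov": ["Other (specify)", "Social relief or distress grant"],
})

# Hypothetical per-column code tables; code 9 is an arbitrary, assumed choice.
income_codes = {"Government grants": 1, "Other (specify)": 6}
grant_codes = {"Social relief or distress grant": 6, "Other (specify)": 9}

# Series.map only touches the column it is called on, so codes cannot leak
# from one column's table into another column.
encoded = toy.assign(
    src_income_before_apr=toy["src_income_before_apr"].map(income_codes),
    grant_from_gov=toy["grant_from_gov"].map(grant_codes),
)
print(encoded)
```

Note that Series.map turns any answer missing from the table into NaN, which makes unmapped categories easy to spot.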


Convert all columns to their appropriate data type. 

In [4]:
unemployed_swdf.job_in_apr = unemployed_swdf.job_in_apr.astype('category') 
unemployed_swdf.work_in_apr = unemployed_swdf.work_in_apr.astype('category')
unemployed_swdf.src_income_before_apr = unemployed_swdf.src_income_before_apr.astype('category')
unemployed_swdf.lost_income_in_apr = unemployed_swdf.lost_income_in_apr.astype('category')
unemployed_swdf.total_income_in_apr = unemployed_swdf.total_income_in_apr.astype('float64')
unemployed_swdf.uif_in_apr = unemployed_swdf.uif_in_apr.astype('category')
unemployed_swdf.grant_from_gov_ques = unemployed_swdf.grant_from_gov_ques.astype('category')
unemployed_swdf.grant_from_gov = unemployed_swdf.grant_from_gov.astype('category')
unemployed_swdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 935 entries, 0 to 934
Data columns (total 8 columns):
job_in_apr               935 non-null category
work_in_apr              935 non-null category
src_income_before_apr    935 non-null category
lost_income_in_apr       935 non-null category
total_income_in_apr      935 non-null float64
uif_in_apr               935 non-null category
grant_from_gov_ques      935 non-null category
grant_from_gov           935 non-null category
dtypes: category(7), float64(1)
memory usage: 14.8 KB
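The same conversion can be written as a single astype call that takes a column-to-dtype dictionary. This is only a sketch on a toy frame with two of the columns; the toy values are made up.

```python
import pandas as pd

# Toy frame with two of the columns (values are illustrative only).
toy = pd.DataFrame({
    "uif_in_apr": [0, 1, 1],
    "total_income_in_apr": [1500, 4200, 3532],
})

# One dictionary lists the target dtype of every column to convert.
toy = toy.astype({"uif_in_apr": "category", "total_income_in_apr": "float64"})
print(toy.dtypes)
```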


## 2. Model Generation

**Question: Given that you were unemployed, were you likely to receive the UIF reduced work-time benefit from the government during April 2020?**

Separate the dataset into features and a target variable. For this model, the 'src_income_before_apr', 'lost_income_in_apr' and 'total_income_in_apr' columns are selected as features and 'uif_in_apr' is selected as the target variable. The 'job_in_apr' and 'work_in_apr' columns are excluded because all of their values are 0, so they carry no information that the model could use.

In [5]:
X = unemployed_swdf[['src_income_before_apr', 'lost_income_in_apr', 'total_income_in_apr']]
y = unemployed_swdf[['uif_in_apr']]
y = np.ravel(y)
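The integer codes in 'src_income_before_apr' label unordered categories, so feeding them to the model as a single numeric column implies an ordering between income sources. A possible alternative, not used in this notebook, is to one-hot encode that column. The sketch below uses pd.get_dummies on a toy frame; the toy values and the choice of drop_first=True are assumptions.

```python
import pandas as pd

# Toy feature frame (values are illustrative only).
toy_X = pd.DataFrame({
    "src_income_before_apr": pd.Categorical([1, 2, 4, 1]),
    "lost_income_in_apr": [1, 0, 0, 1],
    "total_income_in_apr": [1500.0, 3532.0, 3532.0, 4200.0],
})

# Replace the nominal column with one indicator column per category.
# drop_first=True drops the first category to avoid redundant columns.
X_onehot = pd.get_dummies(toy_X, columns=["src_income_before_apr"], drop_first=True)
print(X_onehot.columns.tolist())
```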

Split the unemployed_swdf dataframe into training and testing sets: 80% of the records make up the training set and 20% make up the testing set. Both sets are then standardized with the StandardScaler object from the sklearn library. The scaler is fitted on the training set only and then applied to both sets, so no information from the testing set leaks into training.

In [6]:
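# 80/20 split with a fixed seed so the split is reproducible
train_data_X, test_data_X, train_data_y, test_data_y = train_test_split(X, y, test_size=0.2, random_state=101)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(train_data_X)
train_data_X = scaler.transform(train_data_X)
test_data_X = scaler.transform(test_data_X)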
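The fit-on-train, transform-both pattern can also be bundled with the classifier in a scikit-learn pipeline, which applies the scaler automatically at fit and predict time. The sketch below runs on synthetic data, not the survey data; the names X_demo, y_demo and pipe are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

# The pipeline fits the scaler on the training data only and reuses
# those statistics whenever it scores or predicts on new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print("Accuracy on the synthetic test set:", demo_accuracy)
```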

Train the model.

In [7]:
model = LogisticRegression(solver='liblinear')
model.fit(train_data_X, train_data_y)
parameters = np.append(model.intercept_, model.coef_)
display("Parameters after training the model: ", parameters)

'Parameters after training the model: '

array([ 0.78514827,  0.02839737, -0.29616191, -0.00260391])
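The first value is the intercept and the remaining three are the coefficients, in the column order of X. Because the features were standardized, the intercept is roughly the log-odds of receiving UIF for a record with average feature values, and each coefficient is the change in log-odds per one standard deviation of its feature. A rough reading of the printed values, as a sketch (the liblinear solver also regularizes the intercept, so the figures are approximate):

```python
import numpy as np

# Parameters copied from the output above: intercept first, then coefficients.
parameters = np.array([0.78514827, 0.02839737, -0.29616191, -0.00260391])
intercept, coefs = parameters[0], parameters[1:]
feature_names = ["src_income_before_apr", "lost_income_in_apr", "total_income_in_apr"]

# Sigmoid of the intercept: predicted probability at average feature values.
baseline_prob = 1.0 / (1.0 + np.exp(-intercept))
print("Predicted probability at average feature values:", round(baseline_prob, 3))

# exp(coefficient): multiplicative change in the odds per one standard deviation.
odds_ratios = np.exp(coefs)
for name, ratio in zip(feature_names, odds_ratios):
    print(name, round(ratio, 3))
```

An odds ratio below 1 (as for 'lost_income_in_apr') means higher values of that feature lower the predicted odds; ratios close to 1 indicate almost no effect.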

Make predictions on the testing set with the trained model and compute its accuracy.

In [8]:
predictions = model.predict(test_data_X)
accuracy = model.score(test_data_X, test_data_y)
print("Accuracy of the model: ", accuracy)

Accuracy of the model:  0.7219251336898396


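The model has a classification accuracy of about 72% on the testing set. Whether that is good depends on the class balance: the positive intercept learned during training (about 0.79 on standardized features) suggests that most respondents in the training set reported receiving UIF, so the accuracy should be compared against a majority-class baseline before drawing conclusions.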
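An accuracy figure is easier to judge next to a trivial baseline and per-class metrics. The sketch below compares a logistic regression with a classifier that always predicts the most frequent class, then prints a confusion matrix and a classification report. It runs on synthetic, imbalanced data rather than the survey data, and all names ending in _demo are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (most labels are 1).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = (0.5 * X_demo[:, 1] + rng.normal(size=500) > -0.6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

# Baseline: always predict the most frequent class seen in training.
baseline_demo = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model_demo = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

baseline_acc = baseline_demo.score(X_te, y_te)
model_acc = model_demo.score(X_te, y_te)
print("Baseline accuracy:", baseline_acc)
print("Model accuracy:   ", model_acc)

# Per-class view: rows are true classes, columns are predicted classes.
predictions_demo = model_demo.predict(X_te)
cm_demo = confusion_matrix(y_te, predictions_demo)
print(cm_demo)
print(classification_report(y_te, predictions_demo, zero_division=0))
```

If the model's accuracy is only slightly above the baseline, most of its score comes from the class imbalance rather than from the features.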