<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1">Imports</a></span></li><li><span><a href="#Read-data" data-toc-modified-id="Read-data-2">Read data</a></span></li><li><span><a href="#Demographics-and-IncomeModel" data-toc-modified-id="Demographics-and-IncomeModel-3">Demographics and IncomeModel</a></span><ul class="toc-item"><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-3.1">Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Fix-some-variables-and-values-that-will-cause-problems-in-the-modelling-process" data-toc-modified-id="Fix-some-variables-and-values-that-will-cause-problems-in-the-modelling-process-3.1.1">Fix some variables and values that will cause problems in the modelling process</a></span></li><li><span><a href="#Select-only-necessary-variables" data-toc-modified-id="Select-only-necessary-variables-3.1.2">Select only necessary variables</a></span></li><li><span><a href="#Drop-all-nans" data-toc-modified-id="Drop-all-nans-3.1.3">Drop all nans</a></span></li><li><span><a href="#Remove-outliers" data-toc-modified-id="Remove-outliers-3.1.4">Remove outliers</a></span></li><li><span><a href="#Select-only-people-who-earned-a-salary,-and-drop-all-variables-a-tiny-number-of-occurances-(eg.-Province-=-Outside-of-South-Africa)" data-toc-modified-id="Select-only-people-who-earned-a-salary,-and-drop-all-variables-a-tiny-number-of-occurances-(eg.-Province-=-Outside-of-South-Africa)-3.1.5">Select only people who earned a salary, and drop all variables a tiny number of occurances (eg. 
Province = Outside of South Africa)</a></span></li><li><span><a href="#Get-Dummy-Variables" data-toc-modified-id="Get-Dummy-Variables-3.1.6">Get Dummy Variables</a></span></li></ul></li><li><span><a href="#Build-the-model" data-toc-modified-id="Build-the-model-3.2">Build the model</a></span><ul class="toc-item"><li><span><a href="#Split-into-training-and-testing-data" data-toc-modified-id="Split-into-training-and-testing-data-3.2.1">Split into training and testing data</a></span></li><li><span><a href="#Model-All-Variables-as-explanatory-variables" data-toc-modified-id="Model-All-Variables-as-explanatory-variables-3.2.2">Model All Variables as explanatory variables</a></span></li><li><span><a href="#Model--Provinces-as-explanatory-variables" data-toc-modified-id="Model--Provinces-as-explanatory-variables-3.2.3">Model  Provinces as explanatory variables</a></span></li><li><span><a href="#Model-Occupations-as-explanatory-variables" data-toc-modified-id="Model-Occupations-as-explanatory-variables-3.2.4">Model Occupations as explanatory variables</a></span></li><li><span><a href="#Final-and-best-model" data-toc-modified-id="Final-and-best-model-3.2.5">Final and best model</a></span></li><li><span><a href="#Analyse-on-Testing-Data" data-toc-modified-id="Analyse-on-Testing-Data-3.2.6">Analyse on Testing Data</a></span></li></ul></li></ul></li><li><span><a href="#Hypothesis-Testing" data-toc-modified-id="Hypothesis-Testing-4">Hypothesis Testing</a></span></li></ul></div>

## Imports

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 

import matplotlib.cm as cm

from matplotlib.colors import Normalize

import statsmodels.formula.api as smf
import statsmodels.api as sm

from IPython.display import display_html,display, Markdown

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler, LabelEncoder

from scipy import stats


np.set_printoptions(suppress=True, precision=7)
pd.set_option("display.max_columns",55)


print("Successfully imported all")

Successfully imported all


## Read data

In [2]:
dem = pd.read_csv("Demographics.csv",)
lab = pd.read_csv("Labour.csv")

df = pd.concat([dem,lab],axis=1)
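
`pd.concat(..., axis=1)` aligns rows on the index, so the join above is only meaningful if both files list respondents in the same order. A minimal sketch of a sanity check follows; `dem_toy` and `lab_toy` are made-up stand-ins for the two CSV files, not the real data.

```python
import pandas as pd

# Toy stand-ins for Demographics.csv and Labour.csv (values are made up).
dem_toy = pd.DataFrame({"age": [39.0, 72.0, 30.0],
                        "gender": ["Man", "Woman", "Woman"]})
lab_toy = pd.DataFrame({"take_home_pay_feb": [180.0, 0.0, 0.0]})

# axis=1 concatenation aligns on the index, so both frames should share it.
same_index = dem_toy.index.equals(lab_toy.index)
joined = pd.concat([dem_toy, lab_toy], axis=1)

print(same_index)
print(joined.shape)
```

If the indexes differ, `concat` pads the unmatched rows with NaN rather than raising an error, so an explicit check like this can catch a silent misalignment.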

## Demographics and IncomeModel

**Question: Which factors contribute most significantly to a person's earnings in February?**

We can reasonably expect that women, and people without a matric or tertiary education, earn a lower income.

In [3]:
dftemp = pd.concat([dem,lab],axis=1)
display(dftemp)

Unnamed: 0,age,gender,race,highest_grade,tertiary_edu,province_current,province_moved,province_before,province_during,labour_in_feb,work_days_feb,work_hours_feb,take_home_pay_feb,labour_in_apr,work_days_apr,work_hours_apr,take_home_pay_apr,lost_labour,kept_labour,return_to_work,usual_work
0,39.0,Man,White,Grade 9,No,Gauteng,No,,,1,,,180.0,0,0.0,0.0,0.0,1,0,0,Unknown
1,72.0,Woman,Asian/Indian,Grade 0,,KwaZulu-Natal,No,,,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed
2,30.0,Woman,African/Black,Grade 10,Yes,Gauteng,No,,,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed
3,48.0,Woman,African/Black,Grade 10,No,Gauteng,No,,,1,3.0,5.0,900.0,0,0.0,0.0,0.0,1,0,1,Elementary occupations
4,49.0,Woman,African/Black,Grade 9,No,Gauteng,No,,,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7068,22.0,Man,African/Black,Grade 11,No,Northern Cape,No,,,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed
7069,24.0,Woman,Coloured,Grade 10,No,Western Cape,No,,,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed
7070,36.0,Woman,African/Black,NTC 3,No,North West,No,,,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed
7071,20.0,Man,African/Black,Grade 12,No,North West,Yes,Gauteng,Gauteng,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0,0,Not Employed


In [4]:
df_main = dftemp[["age","gender","race","highest_grade", "tertiary_edu","province_current", "province_moved", "province_before","province_during","take_home_pay_feb","usual_work"]]
display(df_main)

Unnamed: 0,age,gender,race,highest_grade,tertiary_edu,province_current,province_moved,province_before,province_during,take_home_pay_feb,usual_work
0,39.0,Man,White,Grade 9,No,Gauteng,No,,,180.0,Unknown
1,72.0,Woman,Asian/Indian,Grade 0,,KwaZulu-Natal,No,,,0.0,Not Employed
2,30.0,Woman,African/Black,Grade 10,Yes,Gauteng,No,,,0.0,Not Employed
3,48.0,Woman,African/Black,Grade 10,No,Gauteng,No,,,900.0,Elementary occupations
4,49.0,Woman,African/Black,Grade 9,No,Gauteng,No,,,0.0,Not Employed
...,...,...,...,...,...,...,...,...,...,...,...
7068,22.0,Man,African/Black,Grade 11,No,Northern Cape,No,,,0.0,Not Employed
7069,24.0,Woman,Coloured,Grade 10,No,Western Cape,No,,,0.0,Not Employed
7070,36.0,Woman,African/Black,NTC 3,No,North West,No,,,0.0,Not Employed
7071,20.0,Man,African/Black,Grade 12,No,North West,Yes,Gauteng,Gauteng,0.0,Not Employed


### Preprocessing

#### Fix some variables and values that will cause problems in the modelling process

In [5]:
df_prepro = df_main.copy()

def fix_race(race):
    if race == "African/Black": return "Black"
    elif race == "Asian/Indian": return "AsInd"
    else: return race
    
    
def fix_province(provinceCurrent, provinceMoved, provinceBefore):
    if provinceMoved == "No":
        province = provinceCurrent
    else:
        province = provinceBefore
        
    return province

def fix_punc(province):
    prov = province.replace(" ", "_")
    prov = prov.replace("-", "_")

    return prov

def fix_work(work):
    new_work = work.replace(" ", "_")
    new_work = new_work.replace("-", "_")
    new_work = new_work.replace(",", "")
    return new_work


#Fix race
df_prepro["race"] = np.vectorize(fix_race)(df_main["race"])

#Fix province and punctuation
df_prepro["province_before"] = np.vectorize(fix_province)(df_main["province_current"],df_main["province_moved"],df_main["province_before"])

# province_during: same rule, but movers take their province_during value
df_prepro["province_during"] = np.vectorize(fix_province)(df_main["province_current"],df_main["province_moved"],df_main["province_during"])

df_prepro["province_before"] = np.vectorize(fix_punc)(df_prepro["province_before"])

#Fix punctuation in usual work
df_prepro["usual_work"] = np.vectorize(fix_work)(df_main["usual_work"])
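
The same cleanup can be written with pandas' own column operations instead of `np.vectorize`. The sketch below is one possible alternative, not the notebook's method; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the notebook's columns (values are made up).
toy = pd.DataFrame({
    "race": ["African/Black", "Asian/Indian", "White"],
    "province_current": ["Gauteng", "KwaZulu-Natal", "North West"],
    "province_moved": ["No", "No", "Yes"],
    "province_before": [np.nan, np.nan, "Gauteng"],
    "usual_work": ["Unknown", "Not Employed",
                   "Skilled agricultural, forestry and fishery workers"],
})

# Same mapping as fix_race: unlisted labels pass through unchanged.
toy["race"] = toy["race"].replace({"African/Black": "Black",
                                   "Asian/Indian": "AsInd"})

# Same rule as fix_province: non-movers keep their current province,
# movers keep the province_before value.
province = toy["province_before"].where(toy["province_moved"] != "No",
                                        toy["province_current"])

# Same cleanup as fix_punc / fix_work: spaces and hyphens become
# underscores, commas are removed.
toy["province_before"] = province.str.replace(r"[ -]", "_", regex=True)
toy["usual_work"] = (toy["usual_work"]
                     .str.replace(r"[ -]", "_", regex=True)
                     .str.replace(",", "", regex=False))

print(toy[["race", "province_before", "usual_work"]])
```

One difference worth noting: missing values stay as real NaN here, whereas `np.vectorize` on a string column turns them into the literal string `"nan"`, which is why the later cell filters on `!= "nan"`.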

#### Select only necessary variables

In [6]:
df_selected = df_prepro[["age","gender","race","highest_grade", "tertiary_edu", "province_before", "take_home_pay_feb","usual_work"]]
# display(df_selected)

#### Drop all nans

In [7]:
df_dropped = df_selected.dropna()
df_dropped = df_dropped[(df_dropped["province_before"] != "nan") & (df_dropped["race"] != "nan")]
# display(df_dropped)

#### Remove outliers

In [8]:
print("Min pay before: ",min(df_dropped['take_home_pay_feb']))
print("Max pay before: ", max(df_dropped['take_home_pay_feb']))

# Keep only people whose income is below the 95th percentile (upper tail only)
df_no_outliers = df_dropped[df_dropped["take_home_pay_feb"] < np.percentile(a = df_dropped["take_home_pay_feb"], q =95)]

print("\nMin pay after: ",min(df_no_outliers['take_home_pay_feb']))
print("Max pay after: ",max(df_no_outliers['take_home_pay_feb']))
# display(df_no_outliers)

Min pay before:  0.0
Max pay before:  200000.0

Min pay after:  0.0
Max pay after:  14700.0
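
The filter trims only the upper tail, which is why the minimum stays at 0.0 while the maximum drops. A small sketch with made-up pay values shows the same behaviour:

```python
import numpy as np
import pandas as pd

# Made-up pay values with one extreme earner.
pay = pd.Series([0.0, 0.0, 900.0, 3500.0, 4600.0, 6400.0, 14000.0, 200000.0])

# Upper cut-off: the 95th percentile of the sample.
cutoff = np.percentile(pay, 95)

# Strict "<" drops every value at or above the cut-off; low values are kept.
trimmed = pay[pay < cutoff]

print("cut-off:", cutoff)
print("min after:", trimmed.min(), "max after:", trimmed.max())
```

A two-sided trim would need a second condition on a lower percentile; the notebook does not apply one.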


In [9]:

def check_youth(age):
    if age<25: return 1
    else: return 0 

def check_male(gender): 
    if gender == "Man": return 1
    else: return 0

def check_matric(highest_grade):
    if highest_grade == "Grade 12": return 1
    else: return 0

def check_tertedu(tertiary_edu):
    if tertiary_edu == "Yes" : return 1 
    else: return 0 
    
def check_work(work):
    new_work = work.replace(" ", "_")
    new_work = new_work.replace("-", "_")
    new_work = new_work.replace(",", "")
    return new_work


df_pre_dummy = pd.DataFrame()
df_pre_dummy["Age"] = df_no_outliers["age"]
df_pre_dummy["Youth"] = np.vectorize(check_youth)(df_no_outliers["age"])

df_pre_dummy["Race"] = df_no_outliers["race"]

df_pre_dummy["Male"] = np.vectorize(check_male)(df_no_outliers["gender"])

df_pre_dummy["Matric"] = np.vectorize(check_matric)(df_no_outliers["highest_grade"])
df_pre_dummy["Tertiary"] = np.vectorize(check_tertedu)(df_no_outliers["tertiary_edu"])

df_pre_dummy["Province_Before"] = df_no_outliers["province_before"]

df_pre_dummy["Usual_Work"] = np.vectorize(check_work)(df_no_outliers['usual_work']) 

df_pre_dummy["Pay_Feb"] = df_no_outliers["take_home_pay_feb"]

display(df_pre_dummy)

Unnamed: 0,Age,Youth,Race,Male,Matric,Tertiary,Province_Before,Usual_Work,Pay_Feb
0,39.0,0,White,1,0,0,Gauteng,Unknown,180.0
2,30.0,0,Black,0,0,1,Gauteng,Not_Employed,0.0
3,48.0,0,Black,0,0,0,Gauteng,Elementary_occupations,900.0
4,49.0,0,Black,0,0,0,Gauteng,Not_Employed,0.0
5,34.0,0,Black,1,0,0,Gauteng,Not_Employed,0.0
...,...,...,...,...,...,...,...,...,...
7068,22.0,1,Black,1,0,0,Northern_Cape,Not_Employed,0.0
7069,24.0,1,Coloured,0,0,0,Western_Cape,Not_Employed,0.0
7070,36.0,0,Black,0,0,0,North_West,Not_Employed,0.0
7071,20.0,1,Black,1,1,0,Gauteng,Not_Employed,0.0


#### Select only people who earned a salary, and drop all variables a tiny number of occurances (eg. Province = Outside of South Africa)

In [10]:
df_earned = df_pre_dummy[df_pre_dummy["Pay_Feb"] > 0]
df_earned = df_earned[(df_earned["Usual_Work"] != "Not_Employed") & (df_earned["Usual_Work"] !=  "Not_applicable") & (df_earned["Usual_Work"] !=  "Armed_forces_occupations") & (df_earned["Usual_Work"] !=  "Unknown") & (df_earned["Province_Before"] != "Outside_of_South_Africa")]
display(df_earned)

Unnamed: 0,Age,Youth,Race,Male,Matric,Tertiary,Province_Before,Usual_Work,Pay_Feb
3,48.0,0,Black,0,0,0,Gauteng,Elementary_occupations,900.0
10,48.0,0,Black,0,1,1,KwaZulu_Natal,Self_Employed,4600.0
20,40.0,0,Black,0,0,0,Gauteng,Technicians_and_associate_professionals,3500.0
22,54.0,0,Black,0,1,0,Gauteng,Elementary_occupations,4500.0
28,43.0,0,Black,0,1,0,Mpumalanga,Self_Employed,450.0
...,...,...,...,...,...,...,...,...,...
7001,43.0,0,Black,0,0,0,Eastern_Cape,Elementary_occupations,3200.0
7016,29.0,0,Black,1,0,0,North_West,Service_and_sales_workers,4400.0
7038,25.0,0,Black,0,1,1,Gauteng,Clerical_support_workers,5430.0
7043,43.0,0,Black,1,0,0,Gauteng,Craft_and_related_trades_workers,6400.0


#### Get Dummy Variables

In [11]:
df_dummies = pd.get_dummies(df_earned, columns=["Race","Usual_Work", "Province_Before"], drop_first=True)
display(df_dummies)

Unnamed: 0,Age,Youth,Male,Matric,Tertiary,Pay_Feb,Race_Black,Race_Coloured,Race_White,Usual_Work_Craft_and_related_trades_workers,Usual_Work_Elementary_occupations,Usual_Work_Managers,Usual_Work_Plant_and_machine_operators_and_assemblers,Usual_Work_Professionals,Usual_Work_Self_Employed,Usual_Work_Service_and_sales_workers,Usual_Work_Skilled_agricultural_forestry_and_fishery_workers,Usual_Work_Technicians_and_associate_professionals,Province_Before_Free_State,Province_Before_Gauteng,Province_Before_KwaZulu_Natal,Province_Before_Limpopo,Province_Before_Mpumalanga,Province_Before_North_West,Province_Before_Northern_Cape,Province_Before_Western_Cape
3,48.0,0,0,0,0,900.0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
10,48.0,0,0,1,1,4600.0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
20,40.0,0,0,0,0,3500.0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
22,54.0,0,0,1,0,4500.0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
28,43.0,0,0,1,0,450.0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7001,43.0,0,0,0,0,3200.0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7016,29.0,0,1,0,0,4400.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
7038,25.0,0,0,1,1,5430.0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7043,43.0,0,1,0,0,6400.0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
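
With `drop_first=True`, the alphabetically first level of each categorical column is dropped and becomes the baseline against which the remaining dummies are measured. A minimal sketch on a made-up frame; `dtype=int` is an assumption added here to get 0/1 columns like those shown above, since recent pandas versions default to booleans.

```python
import pandas as pd

# Made-up rows; only the Race labels match the notebook.
toy = pd.DataFrame({
    "Race": ["Black", "AsInd", "White", "Coloured"],
    "Pay_Feb": [900.0, 4600.0, 3500.0, 4500.0],
})

# Levels are sorted, so "AsInd" is dropped and acts as the baseline.
dummies = pd.get_dummies(toy, columns=["Race"], drop_first=True, dtype=int)

print(list(dummies.columns))
print(dummies)
```

The row whose race is the baseline level has zeros in every `Race_*` column, so each coefficient on a `Race_*` dummy is a difference from that baseline group.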


### Build the model

#### Split into training and testing data

In [12]:
train_df, test_df = train_test_split(df_dummies,test_size=0.2, random_state=100)
print("Training Data Shape: ", train_df.shape)
print("Testing Data Shape: ", test_df.shape)

Training Data Shape:  (958, 26)
Testing Data Shape:  (240, 26)


#### Model All Variables as explanatory variables

In [13]:
All_Vars = smf.ols(formula = "Pay_Feb ~ Youth + Male + Matric + Tertiary + Race_Black + Usual_Work_Craft_and_related_trades_workers + Usual_Work_Elementary_occupations + Usual_Work_Managers + Usual_Work_Plant_and_machine_operators_and_assemblers + Usual_Work_Professionals + Usual_Work_Self_Employed + Usual_Work_Service_and_sales_workers + Usual_Work_Skilled_agricultural_forestry_and_fishery_workers + Usual_Work_Technicians_and_associate_professionals + Province_Before_Free_State + Province_Before_Gauteng + Province_Before_KwaZulu_Natal + Province_Before_Limpopo + Province_Before_Mpumalanga + Province_Before_North_West + Province_Before_Northern_Cape + Province_Before_Western_Cape", data=train_df).fit()
display(All_Vars.summary())
display(Markdown("## All explanatory variables with p-values less than 0.05"))
display(All_Vars.pvalues <  0.05)

0,1,2,3
Dep. Variable:,Pay_Feb,R-squared:,0.305
Model:,OLS,Adj. R-squared:,0.289
Method:,Least Squares,F-statistic:,18.69
Date:,"Mon, 21 Jun 2021",Prob (F-statistic):,1.14e-59
Time:,19:56:50,Log-Likelihood:,-8966.3
No. Observations:,958,AIC:,17980.0
Df Residuals:,935,BIC:,18090.0
Df Model:,22,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3910.4374,506.175,7.725,0.000,2917.066,4903.809
Youth,-1365.4925,325.479,-4.195,0.000,-2004.246,-726.739
Male,1259.9308,198.015,6.363,0.000,871.325,1648.537
Matric,1072.7549,201.036,5.336,0.000,678.220,1467.289
Tertiary,1357.9444,209.197,6.491,0.000,947.394,1768.495
Race_Black,-992.8102,312.951,-3.172,0.002,-1606.978,-378.643
Usual_Work_Craft_and_related_trades_workers,-630.1977,443.329,-1.422,0.156,-1500.233,239.838
Usual_Work_Elementary_occupations,-1183.9805,342.024,-3.462,0.001,-1855.204,-512.757
Usual_Work_Managers,896.0229,539.980,1.659,0.097,-163.691,1955.737

0,1,2,3
Omnibus:,83.063,Durbin-Watson:,1.989
Prob(Omnibus):,0.0,Jarque-Bera (JB):,103.748
Skew:,0.755,Prob(JB):,2.96e-23
Kurtosis:,3.564,Cond. No.,18.6


## All explanatory variables with p-values less than 0.05

Intercept                                                        True
Youth                                                            True
Male                                                             True
Matric                                                           True
Tertiary                                                         True
Race_Black                                                       True
Usual_Work_Craft_and_related_trades_workers                     False
Usual_Work_Elementary_occupations                                True
Usual_Work_Managers                                             False
Usual_Work_Plant_and_machine_operators_and_assemblers            True
Usual_Work_Professionals                                         True
Usual_Work_Self_Employed                                         True
Usual_Work_Service_and_sales_workers                            False
Usual_Work_Skilled_agricultural_forestry_and_fishery_workers    False
Usual_Work_Technicia
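
Long formulas like the one above are easy to mistype. One option is to assemble the formula string from the dummy column names. The helper below, `build_formula`, is hypothetical and not part of statsmodels; it only joins names into the `response ~ a + b` form that `smf.ols` accepts.

```python
# Hypothetical helper: build a formula string from column-name prefixes.
def build_formula(response, columns, prefixes):
    # Keep only the columns that start with one of the given prefixes.
    predictors = [c for c in columns if c.startswith(tuple(prefixes))]
    return response + " ~ " + " + ".join(predictors)

# A short, illustrative subset of the dummy column names.
columns = ["Age", "Youth", "Male", "Pay_Feb",
           "Province_Before_Free_State", "Province_Before_Gauteng",
           "Usual_Work_Managers"]

formula = build_formula("Pay_Feb", columns, ["Province_Before_"])
print(formula)
```

In the notebook this would be called with `df_dummies.columns`, and the resulting string passed as the `formula` argument.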

#### Model  Provinces as explanatory variables

In [14]:
Provinces = smf.ols(formula = "Pay_Feb ~ Province_Before_Free_State + Province_Before_Gauteng + Province_Before_KwaZulu_Natal + Province_Before_Limpopo + Province_Before_Mpumalanga + Province_Before_North_West + Province_Before_Northern_Cape + Province_Before_Western_Cape", data=train_df).fit()
display(Provinces.summary())
display(Markdown("## Province explanatory variables with p-values less than 0.05"))
display(Provinces.pvalues <  0.05)

0,1,2,3
Dep. Variable:,Pay_Feb,R-squared:,0.029
Model:,OLS,Adj. R-squared:,0.021
Method:,Least Squares,F-statistic:,3.511
Date:,"Mon, 21 Jun 2021",Prob (F-statistic):,0.00052
Time:,19:56:50,Log-Likelihood:,-9126.9
No. Observations:,958,AIC:,18270.0
Df Residuals:,949,BIC:,18320.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4333.5949,375.439,11.543,0.000,3596.809,5070.381
Province_Before_Free_State,-426.5235,522.990,-0.816,0.415,-1452.874,599.827
Province_Before_Gauteng,393.2365,453.536,0.867,0.386,-496.813,1283.286
Province_Before_KwaZulu_Natal,-922.1222,437.687,-2.107,0.035,-1781.068,-63.177
Province_Before_Limpopo,-978.2428,545.702,-1.793,0.073,-2049.165,92.679
Province_Before_Mpumalanga,378.6617,489.385,0.774,0.439,-581.740,1339.063
Province_Before_North_West,196.9606,671.022,0.294,0.769,-1119.898,1513.819
Province_Before_Northern_Cape,333.5459,545.702,0.611,0.541,-737.376,1404.468
Province_Before_Western_Cape,288.6372,490.283,0.589,0.556,-673.527,1250.801

0,1,2,3
Omnibus:,151.932,Durbin-Watson:,2.075
Prob(Omnibus):,0.0,Jarque-Bera (JB):,225.866
Skew:,1.137,Prob(JB):,8.99e-50
Kurtosis:,3.701,Cond. No.,11.7


## Province explanatory variables with p-values less than 0.05

Intercept                         True
Province_Before_Free_State       False
Province_Before_Gauteng          False
Province_Before_KwaZulu_Natal     True
Province_Before_Limpopo          False
Province_Before_Mpumalanga       False
Province_Before_North_West       False
Province_Before_Northern_Cape    False
Province_Before_Western_Cape     False
dtype: bool

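**Interpretation:** KwaZulu-Natal is the only province dummy that is statistically significant at the 5% level (p = 0.035), measured against the omitted baseline province (Eastern Cape, dropped by `drop_first=True`). Province alone explains very little of the variation in pay (R-squared = 0.029).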

####  Model Occupations as explanatory variables

In [15]:
Occupation = smf.ols(formula = "Pay_Feb ~ Usual_Work_Craft_and_related_trades_workers + Usual_Work_Elementary_occupations + Usual_Work_Managers + Usual_Work_Plant_and_machine_operators_and_assemblers + Usual_Work_Professionals + Usual_Work_Self_Employed + Usual_Work_Service_and_sales_workers + Usual_Work_Skilled_agricultural_forestry_and_fishery_workers + Usual_Work_Technicians_and_associate_professionals ", data=train_df).fit()
display(Occupation.summary())
display(Markdown("## Occupation explanatory variables with p-values less than 0.05"))
display(Occupation.pvalues <  0.05)

0,1,2,3
Dep. Variable:,Pay_Feb,R-squared:,0.178
Model:,OLS,Adj. R-squared:,0.171
Method:,Least Squares,F-statistic:,22.88
Date:,"Mon, 21 Jun 2021",Prob (F-statistic):,1.74e-35
Time:,19:56:50,Log-Likelihood:,-9046.7
No. Observations:,958,AIC:,18110.0
Df Residuals:,948,BIC:,18160.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4782.3051,282.680,16.918,0.000,4227.554,5337.056
Usual_Work_Craft_and_related_trades_workers,-784.7337,463.260,-1.694,0.091,-1693.867,124.400
Usual_Work_Elementary_occupations,-2292.4838,354.202,-6.472,0.000,-2987.595,-1597.373
Usual_Work_Managers,1173.5057,578.575,2.028,0.043,38.069,2308.942
Usual_Work_Plant_and_machine_operators_and_assemblers,1337.3919,471.989,2.834,0.005,411.128,2263.656
Usual_Work_Professionals,1507.5810,446.390,3.377,0.001,631.554,2383.608
Usual_Work_Self_Employed,-2401.4251,394.133,-6.093,0.000,-3174.900,-1627.951
Usual_Work_Service_and_sales_workers,-30.6672,361.768,-0.085,0.932,-740.626,679.292
Usual_Work_Skilled_agricultural_forestry_and_fishery_workers,-1782.3051,897.344,-1.986,0.047,-3543.315,-21.295

0,1,2,3
Omnibus:,127.409,Durbin-Watson:,1.986
Prob(Omnibus):,0.0,Jarque-Bera (JB):,179.559
Skew:,0.983,Prob(JB):,1.02e-39
Kurtosis:,3.797,Cond. No.,11.0


## Occupation explanatory variables with p-values less than 0.05

Intercept                                                        True
Usual_Work_Craft_and_related_trades_workers                     False
Usual_Work_Elementary_occupations                                True
Usual_Work_Managers                                              True
Usual_Work_Plant_and_machine_operators_and_assemblers            True
Usual_Work_Professionals                                         True
Usual_Work_Self_Employed                                         True
Usual_Work_Service_and_sales_workers                            False
Usual_Work_Skilled_agricultural_forestry_and_fishery_workers     True
Usual_Work_Technicians_and_associate_professionals              False
dtype: bool

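**Interpretation:** Relative to the omitted baseline occupation, Elementary_occupations, Managers, Plant_and_machine_operators_and_assemblers, Professionals, Self_Employed and Skilled_agricultural_forestry_and_fishery_workers are statistically significant at the 5% level. Managers (p = 0.043) and Skilled_agricultural_forestry_and_fishery_workers (p = 0.047) are only marginally so.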

####  Final and best model

$$Pay\_Feb = \beta_0 + \beta_1Youth + \beta_2Male + \beta_3Matric + \beta_4Tertiary + \beta_5Race\_Black$$

In [16]:
Best_Model = smf.ols(formula = "Pay_Feb ~ Youth + Male + Matric + Tertiary + Race_Black", data=train_df).fit()
display(Best_Model.summary())
display(Markdown("## Reduced model explanatory variables with p-values less than 0.05"))
display(Best_Model.pvalues <  0.05)

0,1,2,3
Dep. Variable:,Pay_Feb,R-squared:,0.204
Model:,OLS,Adj. R-squared:,0.2
Method:,Least Squares,F-statistic:,48.78
Date:,"Mon, 21 Jun 2021",Prob (F-statistic):,5.21e-45
Time:,19:56:50,Log-Likelihood:,-9031.6
No. Observations:,958,AIC:,18080.0
Df Residuals:,952,BIC:,18100.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3279.3654,283.067,11.585,0.000,2723.858,3834.873
Youth,-1708.2855,338.974,-5.040,0.000,-2373.508,-1043.063
Male,1487.0163,196.627,7.563,0.000,1101.143,1872.889
Matric,1412.7818,207.475,6.809,0.000,1005.621,1819.942
Tertiary,1639.3535,214.662,7.637,0.000,1218.088,2060.619
Race_Black,-1235.0670,257.388,-4.798,0.000,-1740.181,-729.953

0,1,2,3
Omnibus:,82.603,Durbin-Watson:,2.082
Prob(Omnibus):,0.0,Jarque-Bera (JB):,102.248
Skew:,0.775,Prob(JB):,6.27e-23
Kurtosis:,3.397,Cond. No.,6.05


## Reduced model explanatory variables with p-values less than 0.05

Intercept     True
Youth         True
Male          True
Matric        True
Tertiary      True
Race_Black    True
dtype: bool

**Interpretation:** All the explanatory variables in our model are statistically significant at the 5% level.

#### Analyse on Testing Data

In [17]:
Test_Model = smf.ols(formula = "Pay_Feb ~ Youth + Male + Matric + Tertiary + Race_Black", data=test_df).fit()
display(Test_Model.summary())
display(Markdown("## Testing model explanatory variables with p-values less than 0.05"))
display(Test_Model.pvalues <  0.05)

0,1,2,3
Dep. Variable:,Pay_Feb,R-squared:,0.19
Model:,OLS,Adj. R-squared:,0.173
Method:,Least Squares,F-statistic:,10.98
Date:,"Mon, 21 Jun 2021",Prob (F-statistic):,1.64e-09
Time:,19:56:50,Log-Likelihood:,-2279.9
No. Observations:,240,AIC:,4572.0
Df Residuals:,234,BIC:,4593.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2303.8925,614.011,3.752,0.000,1094.196,3513.589
Youth,-2921.1902,913.721,-3.197,0.002,-4721.361,-1121.019
Male,1302.4129,431.304,3.020,0.003,452.678,2152.148
Matric,1720.6010,460.504,3.736,0.000,813.337,2627.865
Tertiary,1671.5427,492.289,3.395,0.001,701.657,2641.428
Race_Black,-267.0754,549.084,-0.486,0.627,-1348.856,814.705

0,1,2,3
Omnibus:,30.862,Durbin-Watson:,1.689
Prob(Omnibus):,0.0,Jarque-Bera (JB):,38.674
Skew:,0.914,Prob(JB):,4e-09
Kurtosis:,3.727,Cond. No.,6.75


## Testing model explanatory variables with p-values less than 0.05

Intercept      True
Youth          True
Male           True
Matric         True
Tertiary       True
Race_Black    False
dtype: bool

In [18]:
display(Markdown("## Training model regression coefficients"))
print(Best_Model.params)

display(Markdown("## Testing model regression coefficients"))
print(Test_Model.params)

## Training model regression coefficients

Intercept     3279.365432
Youth        -1708.285500
Male          1487.016267
Matric        1412.781768
Tertiary      1639.353454
Race_Black   -1235.067023
dtype: float64


## Testing model regression coefficients

Intercept     2303.892486
Youth        -2921.190151
Male          1302.412928
Matric        1720.600988
Tertiary      1671.542746
Race_Black    -267.075393
dtype: float64
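
Cell 17 refits the regression on the test rows, which compares coefficients between the two samples. A complementary check is to score the model fitted on the training rows against rows it has not seen. The sketch below illustrates that idea on synthetic data; the frame `toy`, its coefficients and the 240/60 split are made up and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for df_dummies; the coefficients below are made up.
toy = pd.DataFrame({
    "Male": rng.integers(0, 2, n),
    "Matric": rng.integers(0, 2, n),
})
toy["Pay_Feb"] = (3000 + 1500 * toy["Male"] + 1400 * toy["Matric"]
                  + rng.normal(0, 2000, n))

train_toy, test_toy = toy.iloc[:240], toy.iloc[240:]

# Fit on the training rows only.
model = smf.ols("Pay_Feb ~ Male + Matric", data=train_toy).fit()

# Score the already-fitted model on the held-out rows.
pred = model.predict(test_toy)
resid = test_toy["Pay_Feb"] - pred
rmse = float(np.sqrt(np.mean(resid ** 2)))
ss_tot = float(np.sum((test_toy["Pay_Feb"] - test_toy["Pay_Feb"].mean()) ** 2))
r2_out = 1.0 - float(np.sum(resid ** 2)) / ss_tot

print("held-out RMSE:", round(rmse, 1))
print("held-out R-squared:", round(r2_out, 3))
```

In the notebook, the equivalent call would be `Best_Model.predict(test_df)`, with the residuals taken against `test_df["Pay_Feb"]`.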


## Hypothesis Testing

**Null Hypothesis**: The average salary of men is equal to that of women.
<br><br>
**Alternative Hypothesis**: The average salary of men is higher than that of women.
<br><br>
$$H_0: \mu_1 = \mu_2 $$
$$H_1: \mu_1 > \mu_2 $$

Here $\mu_1$ is the mean February pay of men and $\mu_2$ is the mean February pay of women.

The data is split into 2 groups, men and women.

In [19]:
sample1 = train_df[(train_df["Male"] == 1) ]["Pay_Feb"]
sample2 = train_df[(train_df["Male"] == 0) ]["Pay_Feb"]

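Perform a two-sample t-test on these two samples. `stats.ttest_ind` returns a two-sided p-value, so it is halved to obtain the one-sided p-value for $H_1$; this is valid here because the t-statistic is positive.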

In [20]:
t_value, p_value = stats.ttest_ind(sample1, sample2)

print("t-value = ", t_value)
print("p-value = ", p_value/2)

t-value =  5.8574570821930285
p-value =  3.230102638963616e-09


The one-sided p-value is approximately 3.23e-09, which is less than 0.05, so we reject the null hypothesis.
At the 5% significance level, we conclude that the average salary for men is higher than the average salary for women.
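
For comparison, the sketch below runs the one-sided t-test directly through the `alternative` argument of `stats.ttest_ind` (available in SciPy 1.6 and later) and also a simple label-shuffling permutation test. The samples `men` and `women` are synthetic, with made-up means and spread, so the numbers will not match the notebook's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic pay samples; the means and spread are made up for illustration.
men = rng.normal(5000, 2500, 200)
women = rng.normal(4000, 2500, 200)

# One-sided pooled t-test: H1 is mean(men) > mean(women).
t_stat, p_one_sided = stats.ttest_ind(men, women, alternative="greater")

# Halving the two-sided p-value gives the same number when t is positive.
_, p_two_sided = stats.ttest_ind(men, women)

# Permutation test: shuffle the group labels and recompute the mean difference.
observed = men.mean() - women.mean()
pooled = np.concatenate([men, women])
n_perm = 5000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(men)].mean() - shuffled[len(men):].mean()
    if diff >= observed:
        count += 1

# Add-one correction keeps the estimated p-value strictly above zero.
p_perm = (count + 1) / (n_perm + 1)

print("t =", t_stat)
print("one-sided p =", p_one_sided, "| two-sided p / 2 =", p_two_sided / 2)
print("permutation p =", p_perm)
```

The permutation test makes no normality assumption, which may be useful here given the right-skewed pay distribution reported in the regression diagnostics.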