In [21]:
# Initialisation Cell
from __future__ import print_function, division
from IPython.display import display, HTML, Javascript
from matplotlib import pyplot as plt
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline

sns.set_context("talk")
sns.set_style('darkgrid', {'figure.facecolor': '(0,0,0,0)'}) 
#'axes.facecolor': '(0,0,0,0)'


In [22]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Relationship between education and income level in South Africa
 
## Executive Summary
The purpose of this notebook is to analyse census and income data from the [UCT NIDS database](http://www.nids.uct.ac.za/nids-data/program-library/derived-files) to determine whether there is a correlation between **how much people earn** and their **highest education level** (i.e. primary, secondary or tertiary education). The models are regression charts of earnings over the decade, graduation rates over the decade, and GDP and economic expenditure for citizens. Thus far, we have determined that:

  1) The largest group of participants were young (in their 20s).
  
  2) The majority was African/Black.
  
  3) Few participants completed tertiary education.
  
  4) Most people were in the R6 001 to R8 000 income bracket.
  
*Reaching this stage had its challenges, including:*

  1) Formulating a pertinent question.
  
  2) Finding an extensive dataset in the South African context.
  
  3) Dealing with and sorting out opaque variable names.

## Introduction
**Problem Context**: South Africa is known to be one of the most unequal societies in the world, with the World Bank reporting that 20% of people in South Africa control almost 70% of the resources.

**Motivation**: The financial gap is widening, and we aim to see whether the people of South Africa, specifically the previously disadvantaged, have amassed more financial power as a result of greater access to education and employment opportunities.

**Questions include**: Is there a significant relationship between education and income level? How has this changed over time? What are the provincial differences? Are there any anomalies and, if so, how do they affect the analysis?

**Methodology**: We will take the data, normalise it to third normal form, and then graph a set of regression models that indicate the change over the years in both education and income levels. We will then use the insights gained to formulate a response.

**Section Contents Overview**: What follows is a description of the original datasets, the changes performed on these datasets, and initial visualisations of correlations in the data.

### Research question

Does education level correlate with the accumulation of wealth and employment opportunities in South Africa?

### Hypothesis
Education level does not correlate with the accumulation of wealth because it usually comes with a large amount of debt (student loans). Wealth is dependent on the type of degree obtained. Wealth accumulation does not correlate with employment opportunities and education level.

In order to test this hypothesis, variables of interest have been extracted and used to fit regression models to establish whether the variables are significantly related to monthly earnings. The following section describes the data used in this project.
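
A minimal sketch of the kind of test described above, assuming an ordinary least squares fit through the `statsmodels` formula interface. The data is synthetic, and the column names `monthly_income` and `education_level` and the effect sizes are illustrative placeholders, not values from the NIDS dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, illustrative data only: these are NOT NIDS values.
rng = np.random.default_rng(0)
n = 300
education_level = rng.choice(["primary", "secondary", "tertiary"], size=n)
# Assumed effect sizes, chosen purely so the example has a signal to detect.
assumed_effect = {"primary": 0.0, "secondary": 2000.0, "tertiary": 6000.0}
monthly_income = (
    3000.0
    + np.array([assumed_effect[level] for level in education_level])
    + rng.normal(0.0, 1000.0, size=n)
)
toy = pd.DataFrame(
    {"education_level": education_level, "monthly_income": monthly_income}
)

# C(...) tells the formula interface to treat education as categorical;
# "primary" becomes the reference level because it sorts first.
model = smf.ols("monthly_income ~ C(education_level)", data=toy).fit()

print(model.params)   # estimated income difference relative to the reference level
print(model.pvalues)  # small p-values suggest a significant relationship
```

If the p-values for the education terms were large, the data would give no evidence against the hypothesis of no relationship; small p-values would count against it.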

### Methodology
A brief overview of the methodology is as follows:<br/>
1. Read in data <br/>
2. Encode variables <br/>
3. ...
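
Step 2 could, for example, use an ordered categorical so that education levels keep their natural order when converted to integer codes. This is a sketch under assumptions: the level labels and the column name `education` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical labels; the real survey categories may differ.
levels = ["primary", "secondary", "tertiary"]
toy = pd.DataFrame({"education": ["secondary", "primary", "tertiary", "secondary"]})

# An ordered categorical maps each label to an integer code that respects the order.
toy["education"] = pd.Categorical(toy["education"], categories=levels, ordered=True)
toy["education_code"] = toy["education"].cat.codes

print(toy)
```

Integer codes suit ordinal variables such as education level; for unordered variables such as population group, one indicator column per category is usually the safer encoding.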

## Data Description

The data was originally sourced from the [UCT NIDS database](https://www.datafirst.uct.ac.za/dataportal/index.php/catalog/712/datafile/F2/?offset=100&limit=100).
The dataset was downloaded on the 22nd of March. It was originally created in August 2018 after 10 years of repeated panel surveys conducted in 7 300 homes across South Africa. *The Southern Africa Labour and Development Research Unit* of UCT led the investigation and was therefore at the forefront of data collection. The data focuses on wealth creation in terms of income and expenditure patterns and asset endowments, education and employment dynamics, the impact of life events, social capital and intergenerational developments. The data collected is as follows:

### Reading in the data

In [24]:
#importing data into dataframe 'df'
df = pd.read_stata('DAE-Data/adult_dataset.dta')

#selecting specific columns from "df"
df_filtered = df[
    ['w5_hhid','w5_a_sample','w5_a_dob_m','w5_a_dob_y','w5_a_gen','w5_a_popgrp','w5_a_mar','w5_a_mary_m','w5_a_mary_l','w5_a_curmarst',
     'w5_a_bhbrth','w5_a_bhcnt1con','w5_a_bhlive','w5_a_mthtertyn','w5_a_mthtert','w5_a_fthtert_o','w5_a_em1','w5_a_em1strty','w5_a_em1inc','w5_a_em1pay',
     'w5_a_em1inc_cat','w5_a_em1hrs','w5_a_em1prf','w5_a_em1prf_a','w5_a_em1prflm','w5_a_em1prflm_a','w5_a_em1bon','w5_a_em1bon_a','w5_a_em1bonlm','w5_a_em1bonlm_a',
     'w5_a_em1pcrt','w5_a_em1pcrt_a','w5_a_em1pcrtlm','w5_a_em1pcrtlm_a','w5_a_em2','w5_a_em2inc','w5_a_em2pay','w5_a_em2inc_cat','w5_a_ems','w5_a_emssll',
     'w5_a_emslft','w5_a_emsincfr_a','w5_a_incgovpen','w5_a_incgovpen_v','w5_a_incdis','w5_a_incdis_v','w5_a_incchld','w5_a_incchld_v','w5_a_incfos','w5_a_incfos_v',
     'w5_a_inccare','w5_a_inccare_v','w5_a_incaid','w5_a_incaid_v','w5_a_incwar','w5_a_incwar_v','w5_a_incuif','w5_a_incuif_v','w5_a_incwc','w5_a_incwc_v',
     'w5_a_incpfnd','w5_a_incpfnd_v','w5_a_incret','w5_a_incret_v','w5_a_incretp','w5_a_incretp_v','w5_a_incrnt','w5_a_incrnt_v','w5_a_incint','w5_a_incint_v',
     'w5_a_incretr','w5_a_incretr_v','w5_a_incinh','w5_a_incinh_v','w5_a_inclob','w5_a_inclob_v','w5_a_incgif','w5_a_incgif_v','w5_a_incloan',
     'w5_a_incloan_v','w5_a_incsale','w5_a_incsale_v','w5_a_inco','w5_a_inco_o','w5_a_inco_v','w5_a_cr',
     'w5_a_edschgrd','w5_a_edschyr','w5_a_edschage','w5_a_ednsc','w5_a_edexemp','w5_a_edschmth','w5_a_edschmth_o','w5_a_edter',
     'w5_a_edterlev','w5_a_edterlev_o','w5_a_edteryr','w5_a_edrep','w5_a_ed17cur',
     'w5_a_ed17curlev_o','w5_a_edlitcomp','w5_a_edlitrden','w5_a_edlitwrten','w5_a_fwbrelinc',
     'w5_a_fwbstp15','w5_a_fwbstp5yr','w5_a_fwbinc5yr','w5_a_recinh','w5_a_recjob','w5_a_recprof','w5_a_recfin',
     'w5_a_reclob','w5_a_recoth','w5_a_ownveh','w5_a_ownveh_v','w5_a_ownmot',
     'w5_a_ownmot_v','w5_a_dtbnd','w5_a_dtbnd_b','w5_a_dtbnd_joint','w5_a_ownoth_ind','w5_a_ownowdtot_indshare','w5_a_dtveh','w5_a_dtveh_b',
     'w5_a_dtveh_joint','w5_a_dtbnk','w5_a_dtbnk_b','w5_a_dtmic','w5_a_dtmic_b',
     'w5_a_dtstubnk','w5_a_dtstubnk_b','w5_a_dtstuo','w5_a_dtstuo_b','w5_a_dtcre','w5_a_dtcre_b','w5_a_dtstr','w5_a_dtstr_b','w5_a_dthp','w5_a_dthp_b',
     'w5_a_dtflloan','w5_a_dtflloan_b','w5_a_dtfrloan','w5_a_dtfrloanbal','w5_a_dtmsh','w5_a_dtmsh_b','w5_a_dtemploan','w5_a_dtemploan_b','w5_a_dtunpdtax','w5_a_dtunpdtax_b',
     'w5_a_dtserarr','w5_a_dtserarr_b','w5_a_dtoth1','w5_a_dtoth1_o','w5_a_dtoth1_b',
     'w5_a_aspen','w5_a_aspen_v','w5_a_aspen_cat','w5_a_asfin','w5_a_asfin_v','w5_a_asfin_cat','w5_a_asacc','w5_a_asacc_v',
     'w5_a_asacc_cat','w5_a_dtacc_cat','w5_a_assell','w5_a_assell_v']]

#df_filtered

FileNotFoundError: [Errno 2] No such file or directory: 'adult_dataset.dta'
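
The error above indicates that the `.dta` file was not found relative to the notebook's working directory. One way to make this failure easier to diagnose, and to avoid loading unused variables, is sketched below. The helper `load_stata_columns` is a hypothetical name introduced here, and the small temporary file only stands in for the real dataset.

```python
import os
import tempfile

import pandas as pd


def load_stata_columns(path, columns):
    """Read only the requested columns from a Stata file, failing early with a clear message."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            "Stata file not found at {!r} (working directory: {})".format(
                path, os.getcwd()
            )
        )
    # `columns` restricts the read to a subset of the variables in the file.
    return pd.read_stata(path, columns=columns)


# Demonstration on a small temporary file standing in for the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.dta")
    pd.DataFrame(
        {"w5_hhid": [1, 2], "w5_a_gen": [1, 2], "unused": [0.5, 1.5]}
    ).to_stata(demo_path, write_index=False)
    demo = load_stata_columns(demo_path, ["w5_hhid", "w5_a_gen"])

print(demo)
```

Passing the column list to `pd.read_stata` means the selection happens at read time rather than on a full in-memory copy of the dataset.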

In [None]:
#column renaming
pd.set_option('mode.chained_assignment', None)
#dictionary for the new column names: key = old name & value = new name
dict={'w5_hhid':'Household_Identifier','w5_a_sample':'Sample_Origin','w5_a_dob_m':'Month_DOB','w5_a_dob_y':'Year_DOB','w5_a_gen':'Gender',
      'w5_a_popgrp':'Population_Group','w5_a_mar':'Married_Cohabitation','w5_a_mary_m':'Years_Married','w5_a_mary_l':'Years_Cohabiting',
      'w5_a_curmarst':'Current_Relationship_Status','w5_a_bhbrth':'Given_Birth','w5_a_bhcnt1con':'Birth_Count','w5_a_bhlive':'Biological_Children_Living',
      'w5_a_mthtertyn':'Mother_Degrees','w5_a_mthtert':'Mother_Highest_Tertiary','w5_a_fthtert_o':'Father_Highest_Tertiary','w5_a_em1':'Employment_Payment',
      'w5_a_em1strty':'Primary_Occupation','w5_a_em1inc':'Primary_Gross_Income_Month','w5_a_em1pay':'Primary_Net_Income_Month','w5_a_em1inc_cat':'Main_Job_Income_Category',
      'w5_a_em1hrs':'Work_Week_Hours','w5_a_em1prf':'Rec_Share_Profit_Year','w5_a_em1prf_a':'Share_Profit_Year','w5_a_em1prflm':'Rec_Share_Profit_Month',
      'w5_a_em1prflm_a':'Share_Profit_Month','w5_a_em1bon':'Rec_Bonus_Year','w5_a_em1bon_a':'Other_Bonus_Year','w5_a_em1bonlm':'Rec_Bonus_Month',
      'w5_a_em1bonlm_a':'Other_Bonus_Month','w5_a_em1pcrt':'Rec_Extra_Income_Year','w5_a_em1pcrt_a':'Extra_Income_Year','w5_a_em1pcrtlm':'Rec_Extra_Income_Month',
      'w5_a_em1pcrtlm_a':'Extra_Income_Month','w5_a_em2':'Have_Secondary_Occupation','w5_a_em2inc':'Secondary_Gross_Income','w5_a_em2pay':'Secondary_Net_Income',
      'w5_a_em2inc_cat':'Secondary_Income_Category','w5_a_ems':'Is_Self_Employed','w5_a_emssll':'Net_After_Liabilities','w5_a_emslft':'Amount_Left_Over',
      'w5_a_emsincfr_a':'Month_Take_Home_Salary','w5_a_incgovpen':'Pension','w5_a_incgovpen_v':'Pension_Amount','w5_a_incdis':'Disability_Grant',
      'w5_a_incdis_v':'Disability_Grant_Amount','w5_a_incchld':'Child_Support','w5_a_incchld_v':'Child_Support_Amount','w5_a_incfos':'Foster_Care_Grant',
      'w5_a_incfos_v':'Foster_Care_Grant_Amount','w5_a_inccare':'Dependency_Grant','w5_a_inccare_v':'Dependency_Grant_Amount','w5_a_incaid':'Grant_In_Aid',
      'w5_a_incaid_v':'Grant_In_Aid_Amount','w5_a_incwar':'War_Veterans_Pension','w5_a_incwar_v':'War_Veterans_Pension_Amount','w5_a_incuif':'UIF',
      'w5_a_incuif_v':'UIF_Amount','w5_a_incwc':'Workers_Compensation','w5_a_incwc_v':'Workers_Compensation_Amount','w5_a_incpfnd':'Provident_Fund',
      'w5_a_incpfnd_v':'Provident_Fund_Amount','w5_a_incret':'Private_Retirement_Annuity','w5_a_incret_v':'Private_Retirement_Annuity_Amount',
      'w5_a_incretp':'Retirement_Package','w5_a_incretp_v':'Retirement_Package_Amount','w5_a_incrnt':'Rental_Income','w5_a_incrnt_v':'Rental_Income_Amount',
      'w5_a_incint':'Interest_Earnings','w5_a_incint_v':'Interest_Earnings_Amount','w5_a_incretr':'Retrenchment_Package','w5_a_incretr_v':'Retrenchment_Package_Amount',
      'w5_a_incinh':'Inheritances','w5_a_incinh_v':'Inheritances_Amount','w5_a_inclob':'Lobola','w5_a_inclob_v':'Lobola_Amount','w5_a_incgif':'Gifts',
      'w5_a_incgif_v':'Gifts_Amount','w5_a_incloan':'Loan_Repayments','w5_a_incloan_v':'Loan_Repayments_Amount','w5_a_incsale':'Sale_Household_Goods',
      'w5_a_incsale_v':'Sale_Household_Goods_Amount','w5_a_inco':'Other_Income','w5_a_inco_o':'Other_Income_Recipient','w5_a_inco_v':'Other_Income_Value',
      'w5_a_cr':'Non_Household_Residents_Contributions','w5_a_edschgrd':'Highest_Grade_Completed','w5_a_edschyr':'Year_Highest_Grade_Completed',
      'w5_a_edschage':'Age_Highest_Grade_Completed','w5_a_ednsc':'Highest_Grade_Completed_Pass_Type','w5_a_edexemp':'Matric_University_Exemption',
      'w5_a_edschmth':'Math_Highest_Grade_Completed','w5_a_edschmth_o':'Other_Math_Highest_Grade_Completed','w5_a_edter':'Tertiary_Completed',
      'w5_a_edterlev':'Highest_Tertiary_Completed','w5_a_edterlev_o':'Other_Highest_Tertiary_Completed','w5_a_edteryr':'Year_Tertiary_Completed',
      'w5_a_edrep':'Repeated_School_Grades','w5_a_ed17cur':'Currently_Enrolled_School','w5_a_ed17curlev_o':'Other_Currently_Enrolled_School',
      'w5_a_edlitcomp':'Computer_Literate','w5_a_edlitrden':'English_Reading_Level','w5_a_edlitwrten':'English_Writing_Level','w5_a_fwbrelinc':'Household_Income_Classification',
      'w5_a_fwbstp15':'Household_Income_Step_15_Years','w5_a_fwbstp5yr':'Household_Income_Step_In_5_Years','w5_a_fwbinc5yr':'Household_Expected_Income_In_5_Years',
      'w5_a_recinh':'Income_Inheritance','w5_a_recjob':'Income_Job_Payout','w5_a_recprof':'Income_Property_Sale','w5_a_recfin':'Income_Financial_Product',
      'w5_a_reclob':'Income_Lobola','w5_a_recoth':'Income_Other_Payout','w5_a_ownveh':'Vehicle_Owner','w5_a_ownveh_v':'Resale_Vehicle','w5_a_ownmot':'Motorcycle_Owner',
      'w5_a_ownmot_v':'Resale_Motorcycle','w5_a_dtbnd':'Has_Home_Loan','w5_a_dtbnd_b':'Home_Loan_Balance','w5_a_dtbnd_joint':'Home_Loan_Joint_Or_Sole',
      'w5_a_ownoth_ind':'Other_Property','w5_a_ownowdtot_indshare':'Other_Property_Balance','w5_a_dtveh':'Vehicle_Payment','w5_a_dtveh_b':'Vehicle_Payment_Balance',
      'w5_a_dtveh_joint':'Vehicle_Payment_Joint_Or_Sole','w5_a_dtbnk':'Bank_Personal_Loan','w5_a_dtbnk_b':'Bank_Personal_Loan_Balance',
      'w5_a_dtmic':'Micro_Lender_Loan','w5_a_dtmic_b':'Micro_Lender_Loan_Balance','w5_a_dtstubnk':'Bank_Study_Loan','w5_a_dtstubnk_b':'Bank_Study_Loan_Balance',
      'w5_a_dtstuo':'Other_Study_Loan','w5_a_dtstuo_b':'Other_Study_Loan_Balance','w5_a_dtcre':'Credit_Card','w5_a_dtcre_b':'Credit_Card_Balance',
      'w5_a_dtstr':'Store_Card','w5_a_dtstr_b':'Store_Card_Balance','w5_a_dthp':'Hire_Purchase_Agreement','w5_a_dthp_b':'Hire_Purchase_Agreement_Balance',
      'w5_a_dtflloan':'Family_Member_Loan','w5_a_dtflloan_b':'Family_Member_Loan_Balance','w5_a_dtfrloan':'Friend_Loan','w5_a_dtfrloanbal':'Friend_Loan_Balance',
      'w5_a_dtmsh':'Mashonisa_Loan','w5_a_dtmsh_b':'Mashonisa_Loan_Balance','w5_a_dtemploan':'Employer_Loan','w5_a_dtemploan_b':'Employer_Loan_Balance',
      'w5_a_dtunpdtax':'Unpaid_Tax','w5_a_dtunpdtax_b':'Unpaid_Tax_Balance','w5_a_dtserarr':'Monthly_Arrears','w5_a_dtserarr_b':'Monthly_Arrears_Balance',
      'w5_a_dtoth1':'Other_Debts','w5_a_dtoth1_o':'Other_Other_Debts','w5_a_dtoth1_b':'Other_Debts_Balance','w5_a_aspen':'Pension_Annuity',
      'w5_a_aspen_v':'Pension_Annuity_Amount','w5_a_aspen_cat':'Pension_Annuity_Category','w5_a_asfin':'Shares','w5_a_asfin_v':'Shares_Amount',
      'w5_a_asfin_cat':'Shares_Category','w5_a_asacc':'Bank_Account','w5_a_asacc_v':'Bank_Account_Balance','w5_a_asacc_cat':'Bank_Account_Category',
      'w5_a_dtacc_cat':'Bank_Account_Overdraft_Category','w5_a_assell':'Possessions_Net_Value','w5_a_assell_v':'Possessions_Net_Value_Balance'}
# rename() returns a new frame; reassigning avoids modifying a slice of df in place
df_filtered = df_filtered.rename(columns=dict)
display(df_filtered)
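
With a mapping this long, a mistyped key is easy to miss because `rename` silently ignores keys that match no column. The sketch below shows one way to check a mapping before applying it. The frame and the mapping are toy stand-ins, and `w5_a_typo` is a deliberately invalid, hypothetical key.

```python
import pandas as pd

# Toy frame and mapping; the names mirror the pattern used above but the data is made up.
toy = pd.DataFrame({"w5_hhid": [101, 102], "w5_a_gen": [1, 2]})
new_names = {
    "w5_hhid": "Household_Identifier",
    "w5_a_gen": "Gender",
    "w5_a_typo": "Not_Present",  # hypothetical key that matches no column
}

# Keys that match no column would be ignored silently by rename(), so list them first.
unmatched = sorted(set(new_names) - set(toy.columns))
# Two old names mapped to the same new name would create duplicate columns.
duplicated_targets = pd.Series(list(new_names.values())).duplicated().any()

# rename() returns a new frame, so the original is left untouched.
renamed = toy.rename(columns=new_names)

print(unmatched)
print(list(renamed.columns))
```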

As can be seen above, the dataset has 30 110 entries, each with at most 180 filled columns. The pertinent information captured in the dataframe is as follows:

#### Population Group
Categories of race groups in South Africa including:
- Missing
- African(Black)
- Coloured
- Asian/Indian
- White
- Other

#### Gender
Categories include:
- Male
- Female

#### Married Cohabitation
Participants fall into the following categories:
- Don't Know
- Refused (to give answer)
- Missing
- Formally Married
- Living Together
- Not Living Together
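
Categories such as "Don't Know", "Refused" and "Missing" are non-responses rather than answers, so it may help to convert them to missing values before counting or modelling. The sketch below assumes the labels appear as plain strings; the exact label text in the dataset may differ, and the sample values are invented.

```python
import numpy as np
import pandas as pd

# Labels adapted from the category list above; the sample values are invented.
non_response = ["Don't Know", "Refused", "Missing"]
toy = pd.Series(
    ["Formally Married", "Missing", "Living Together", "Refused", "Formally Married"],
    name="Married_Cohabitation",
)

# Replace non-response labels with NaN so they are excluded from counts and models.
cleaned = toy.where(~toy.isin(non_response), np.nan)

print(cleaned.value_counts())  # substantive answers only
print(cleaned.isna().sum())    # number of non-responses
```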

#### Number of years Married
A value for how long participants have been married.

*(Others will be added as necessary.)*

After analysing the whole dataset, we judge the data to be of good quality against five criteria: whether it is **Valid**, **Accurate**, **Complete**, **Consistent** and **Uniform**.
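
Of these criteria, completeness is the simplest to quantify. A minimal sketch, assuming missing entries are stored as NaN, computes the share of filled values per column. The frame below uses invented values and only borrows a few of the renamed column names.

```python
import numpy as np
import pandas as pd

# Invented values standing in for a few renamed columns.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Female", np.nan],
        "Year_DOB": [1990, 1985, np.nan, np.nan],
        "Household_Identifier": [1, 2, 3, 4],
    }
)

# Share of non-missing values per column, from least to most complete.
completeness = toy.notna().mean().sort_values()

print(completeness)
```

Columns with a low share could then be reviewed individually before being used in a model.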
