# Data Insights

This notebook contains a basic statistical analysis of the data.

It should help to understand what the data contains and how the variables interact at a basic level.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [57]:
## Import the training and test data
col_names = ['age','workclass','fnlwgt','education','education_num','marital_status','occupation','relationship','race',
             'sex','capital_gain','capital_loss','hours_per_week','native_country', 'class']

train = pd.read_csv('../data/adult.data', sep=', ', names=col_names, engine='python')
test = pd.read_csv('../data/adult.test', sep=', ', names=col_names, engine='python')
test = test.iloc[1:, :]  # The first line of adult.test carries no information, so drop it

# Replace '?' by missing values
train = train.replace('?', np.nan)
test = test.replace('?', np.nan)

# Print information
train.info(); print(""); test.info()
train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         30725 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        30718 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    31978 non-null object
class             32561 non-null object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16281 entries, 1 to 16281
Data columns (total 15 columns):
age               16281 non-null object
workclass         15318 non-null object
fnlwgt            16281 non-null float64
education       

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Both datasets have missing values in the variables `workclass`, `occupation`, and `native_country`.  
Only 7.4% of the training observations have at least one missing value. That is low, so I might simply train my model on the complete cases.
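
As a quick check of that kind of figure, here is a minimal sketch of how the share of incomplete rows can be computed. The helper `missing_summary` and the toy frame are hypothetical stand-ins (they are not part of the original notebook); on the real data the same function would be called on `train`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the per-column NaN counts and the share of rows with at least one NaN."""
    per_column = df.isna().sum()
    row_share = df.isna().any(axis=1).mean()
    return per_column, row_share

# Toy stand-in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['Private', np.nan, 'State-gov', 'Private'],
    'occupation': ['Sales', np.nan, 'Adm-clerical', np.nan],
})

per_column, row_share = missing_summary(toy)
complete = toy.dropna()  # the "complete subset" that would be used for training
```

Calling `missing_summary(train)` should give the per-variable counts and the overall share of rows that `dropna()` would discard.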

However, I need to decide on a method to impute missing data for new input. Here are three methods I am considering:
- Impute by the mode/average,
- Impute by random assignment following the variable distribution,
- Impute by regression on the other variables (that is, given the non-missing values of the observation, which value is the most likely?).

The last one should give the best results, but it is more complicated to implement. The other two should perform about the same.  
Since only three variables have missing values, I will decide how much effort to put into NA imputation according to those variables' importance for the prediction. If the variables have little predictive power, I won't use a fancy imputation method.
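
To make the first two methods from the list concrete, here is a rough sketch for a single categorical column. The function names `impute_mode` and `impute_random` are my own (hypothetical), and the toy series only mimics a column like `workclass`. The regression-based method is not covered here.

```python
import numpy as np
import pandas as pd

def impute_mode(series):
    """Fill missing values with the most frequent observed value."""
    return series.fillna(series.mode().iloc[0])

def impute_random(series, rng):
    """Fill missing values by sampling from the observed values,
    so that the fills follow the variable's empirical distribution."""
    out = series.copy()
    mask = out.isna()
    observed = out.dropna().to_numpy()
    out[mask] = rng.choice(observed, size=mask.sum())
    return out

# Toy column (made-up values); the seed is fixed only for reproducibility
rng = np.random.default_rng(0)
toy = pd.Series(['Private', 'Private', np.nan, 'State-gov', np.nan])

by_mode = impute_mode(toy)
by_random = impute_random(toy, rng)
```

Mode imputation always returns the same value, which slightly inflates the majority category, while random assignment preserves the observed proportions on average at the cost of adding noise to individual predictions.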

---

*Note to myself*: I should raise a warning when the input contains missing values or values of an unexpected data type. Then, give a choice: go back and change the value, or continue knowing the value will be imputed.
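
A possible shape for that check, as a sketch only: `check_input` and the `expected` mapping are hypothetical names, and the input is assumed to arrive as a plain dict with one value per variable. The function only reports problems; the choice between correcting the value and continuing with imputation is left to the caller.

```python
import numbers
import warnings

import numpy as np
import pandas as pd

def check_input(row, expected_types):
    """Warn about missing or wrongly typed values in one observation.

    `row` is a dict {column: value}; `expected_types` maps columns to types.
    Returns the list of problem messages so the caller can decide what to do.
    """
    problems = []
    for col, kind in expected_types.items():
        value = row.get(col)
        if value is None or pd.isna(value):
            problems.append(f"{col}: missing value, it will be imputed")
        elif not isinstance(value, kind):
            problems.append(
                f"{col}: expected {kind.__name__}, got {type(value).__name__}"
            )
    for message in problems:
        warnings.warn(message)
    return problems

# Assumed types for two of the variables (illustration only)
expected = {'age': numbers.Integral, 'workclass': str}

problems = check_input({'age': '39', 'workclass': np.nan}, expected)
```

An empty list means the observation can go straight to the model; otherwise the messages can be shown to the user before asking whether to edit the input or proceed.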