# US Census Task

The following link lets you download an archive containing an “exercise” US Census dataset: http://thomasdata.s3.amazonaws.com/ds/us_census_full.zip
This US Census dataset contains detailed but anonymized information for approximately 300,000 people.

The archive contains 3 files: 
1. A large training file (csv)
2. Another test file (csv)
3. A metadata file (txt) describing the columns of the two csv files (identical for both)

The goal of this exercise is to model the information contained in the last column (42nd), i.e., whether a person makes more or less than $50,000 per year, from the information contained in the other columns. The exercise here consists of modeling a binary variable.
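
Because the target is binary, one simple preparation step is to map the raw income label to 0/1. The sketch below assumes two labels: `- 50000.`, which appears in the data preview, and `50000+.`, which is an assumption and is not shown in this notebook. The helper name `encode_income` is hypothetical.

```python
# "- 50000." appears in the data preview; "50000+." is an assumed label.
LOW_LABEL = "- 50000."
HIGH_LABEL = "50000+."


def encode_income(label: str) -> int:
    """Map a raw income label to 1 (above $50,000) or 0 (below)."""
    cleaned = label.strip()  # tolerate leading/trailing whitespace
    if cleaned == HIGH_LABEL:
        return 1
    if cleaned == LOW_LABEL:
        return 0
    # Fail loudly on anything unexpected instead of guessing a class
    raise ValueError(f"Unexpected income label: {label!r}")


labels = [" - 50000.", " 50000+.", "- 50000."]
encoded = [encode_income(x) for x in labels]
print(encoded)
```

With pandas, the same function could be applied to the last column through `Series.map`.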

Work with Python (or R) to carry out the following steps:
1. Load the train and test files.
2. Perform an exploratory analysis on the data and create some relevant visualisations.
3. Clean, preprocess, and engineer features in the training data, with the aim of building a data set that a model will perform well on.
4. Create a model using these features to predict whether a person earns more or less than $50,000 per year. Here, the idea is for you to test a few different models, and see whether there are any techniques you can apply to improve performance over your first results.
5. Choose the model that appears to have the highest performance based on a comparison between reality (the 42nd variable) and the model’s prediction.
6. Apply your model to the test file and measure its real performance on it (same method as above).
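
Steps 5 and 6 compare the model's predictions with reality. A minimal sketch of such a comparison is shown here, assuming labels are already encoded as 0/1. The function name `binary_metrics` and the toy label lists are hypothetical. Reporting precision and recall alongside accuracy is a design choice: accuracy alone can look good when one class dominates.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels, only to illustrate the call
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The same function can be called once on a validation split and once on the test file, so both measurements use the same method.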

The goal of this exercise is not to create the best or the purest model, but rather to describe the steps you took to accomplish it.
Explain areas that may have been the most challenging for you.
Find clear insights on the profiles of the people that make more than $50,000 / year. For example, which variables seem to be the most correlated with this phenomenon?
Finally, push your code to GitHub to share it with me, or send it via email.
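
One way to look for profile insights is to compute the share of high earners within each category of a variable. The sketch below uses a made-up toy frame: the column names `education` and `income` mirror the census file, but the values and the `50000+.` label are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; not taken from the census file
toy = pd.DataFrame({
    "education": [
        "Children", "High school graduate", "Masters degree",
        "Masters degree", "High school graduate", "Children",
    ],
    "income": [
        "- 50000.", "- 50000.", "50000+.",
        "- 50000.", "50000+.", "- 50000.",
    ],
})

# 1 for the assumed high-income label, 0 otherwise
toy["high_income"] = (toy["income"].str.strip() == "50000+.").astype(int)

# Mean of a 0/1 column is the share of high earners in each group
share_by_education = (
    toy.groupby("education")["high_income"].mean().sort_values(ascending=False)
)
print(share_by_education)
```

Repeating this for each categorical column gives a quick, if rough, view of which variables separate the two income groups.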

Once again, the goal of this exercise is not to solve this problem, but rather to spend a few hours on it and to thoroughly explain your approach.

# Imports

In [24]:
import requests, zipfile, os
from collections import deque
import pandas as pd

# Download the Data

In [15]:
# Direct link to the archive (same destination as the link in the task description)
zipurl = 'http://thomasdata.s3.amazonaws.com/ds/us_census_full.zip'

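# Download the archive and stop early if the request failed
response = requests.get(zipurl)
response.raise_for_status()

# Save the archive to disk
zname = "us_census_full.zip"
with open(zname, 'wb') as zfile:
    zfile.write(response.content)

# Unzip the file that was just downloaded
with zipfile.ZipFile(zname, 'r') as zip_ref:
    zip_ref.extractall('.')

# List the working directory to confirm the extraction
os.listdir('.')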

['.ipynb_checkpoints',
 'Untitled.ipynb',
 'us_census.zip',
 'us_census_full',
 'us_census_full.zip',
 '__MACOSX']

# 1. Load the train and test files.

In [102]:
train_df = pd.read_csv('us_census_full/census_income_learn.csv', header=None)
test_df = pd.read_csv('us_census_full/census_income_test.csv', header=None)

print(train_df.shape, test_df.shape)

(199523, 42) (99762, 42)


In [103]:
train_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.


In [99]:
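# Keep only the last 42 lines of the metadata file (one name per column)
with open('us_census_full/census_income_metadata.txt') as meta_file:
    last42 = deque(meta_file, 42)

# The column name is the text before the first colon on each line
col_names = [line.split(':')[0] for line in last42]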
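
The parsing step above can be wrapped in a small helper that also checks the number of names against the number of columns. This is only a sketch: the helper name `parse_column_names` and the sample lines are hypothetical, and it assumes each metadata line has the form `name: description`.

```python
from collections import deque


def parse_column_names(lines, n_columns):
    """Take the last n_columns lines and keep the text before the first colon."""
    tail = deque(lines, maxlen=n_columns)
    names = [line.split(":", 1)[0].strip() for line in tail]
    if len(names) != n_columns:
        raise ValueError(f"Expected {n_columns} names, found {len(names)}")
    return names


# Hypothetical sample lines in the assumed "name: description" format
sample_lines = [
    "some header text\n",
    "age: continuous.\n",
    "class of worker: Private, Not in universe.\n",
    "year: 94, 95.\n",
]
names = parse_column_names(sample_lines, 3)
print(names)
```

A matching count does not guarantee that each name lines up with the right column, so spot-checking a few columns against `head()` is still worthwhile.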

In [101]:
train_df.shape

(199522, 42)

In [None]:
train_df.columns = col_names
test_df.columns = col_names

In [87]:
print('Number of duplicated training samples: ', train_df.duplicated().sum())
print('Number of duplicated test samples: ', test_df.duplicated().sum())

print(train_df.shape, test_df.shape)

train_df.drop_duplicates(inplace=True)
test_df.drop_duplicates(inplace=True)

print(train_df.shape, test_df.shape)

Number of duplicated training samples:  3229
Number of duplicated test samples:  883
(199522, 42) (99761, 42)
(196293, 42) (98878, 42)


In [88]:
# Count rows that contain at least one NaN value
print('Number of nan training samples: ', train_df.isna().any(axis=1).sum())
print('Number of nan test samples: ', test_df.isna().any(axis=1).sum())

Number of nan training samples:  0
Number of nan test samples:  0
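
Besides true `NaN` values, categorical files sometimes mark missing entries with a placeholder string, which a NaN check does not catch. A minimal sketch follows, assuming the placeholder is `?` (an assumption to confirm against the metadata file) and using a made-up toy frame.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the census file, values are made up
toy = pd.DataFrame({
    "age": [73, 58, 18],
    "country of birth self": ["United-States", " ?", "Vietnam"],
    "class of worker": ["Private", None, " ?"],
})

# True missing values, counted per column
nan_counts = toy.isna().sum()

# Placeholder strings, counted per text column (whitespace stripped first)
text_cols = ["country of birth self", "class of worker"]
placeholder_counts = {
    col: int((toy[col].str.strip() == "?").sum()) for col in text_cols
}

print(nan_counts)
print(placeholder_counts)
```

If placeholders turn up, they can be kept as their own category or replaced with `NaN` before imputation, depending on the model.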


In [67]:
train_df.drop(columns='| instance weight', inplace=True)
test_df.drop(columns='| instance weight', inplace=True)

# 2. Perform an exploratory analysis on the data and create some relevant visualisations.

In [89]:
tr = train_df.copy()
te = test_df.copy()

In [90]:
train_df.dtypes

age                                             int64
class of worker                                object
detailed industry recode                        int64
detailed occupation recode                      int64
education                                      object
wage per hour                                   int64
enroll in edu inst last wk                     object
marital stat                                   object
major industry code                            object
major occupation code                          object
race                                           object
hispanic origin                                object
sex                                            object
member of a labor union                        object
reason for unemployment                        object
full or part time employment stat              object
capital gains                                   int64
capital losses                                  int64
dividends from stocks       

## Continuous Data

In [94]:
train_df.describe()

Unnamed: 0,age,detailed industry recode,detailed occupation recode,wage per hour,capital gains,capital losses,dividends from stocks,| instance weight,migration prev res in sunbelt,citizenship,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year
count,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0,196293.0
mean,34.929274,15.603267,11.490527,56.336792,441.872288,37.927787,200.723408,1743.267804,1.988115,0.178305,1.53818,23.554009,94.499325
std,22.209891,18.106413,14.498142,277.055009,4735.688985,274.081859,2000.13566,996.948519,2.371019,0.55774,0.836814,24.428593,0.500001
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.87,0.0,0.0,0.0,0.0,94.0
25%,16.0,0.0,0.0,0.0,0.0,0.0,0.0,1061.53,0.0,0.0,2.0,0.0,94.0
50%,34.0,1.0,2.0,0.0,0.0,0.0,0.0,1620.17,1.0,0.0,2.0,12.0,94.0
75%,50.0,33.0,26.0,0.0,0.0,0.0,0.0,2194.06,4.0,0.0,2.0,52.0,95.0
max,90.0,51.0,46.0,9999.0,99999.0,4608.0,99999.0,18656.3,6.0,2.0,2.0,52.0,95.0


In [91]:
def objs_to_cats(df):
    objs = df.select_dtypes('object')
    for col in objs.columns:
        df[col] = df[col].astype('category')
    return df

train_df = objs_to_cats(train_df)
test_df = objs_to_cats(test_df)

In [95]:
train_df.head()

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,family members under 18,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year
0,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
1,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
2,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
3,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,48,Private,40,10,Some college but no degree,1200,Not in universe,Married-civilian spouse present,Entertainment,Professional specialty,...,Philippines,United-States,United-States,Native- Born in the United States,2,Not in universe,2,52,95,- 50000.
