# Census Income
This is a scikit-learn + pandas example of a classification problem. The dataset comes from http://archive.ics.uci.edu/.

Data extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`. The data was also preprocessed for the purpose of this example.

The prediction task is to determine whether a person makes over 50K a year.


### List of attributes:

##### Features
- age: continuous. 
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous. 
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
- sex: Female, Male. 
- hours-per-week: continuous. 
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



##### Labels
- income: >50K, <=50K.

In [103]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

### Load dataset

In [104]:
df = pd.read_csv("./census.csv")

print (df.shape)
print (df.columns)
df.head()

(32561, 12)
Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'hours-per-week',
       'native-country', 'income'],
      dtype='object')


Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K


## Task 1 - Initial analysis
Perform initial analysis to understand the data.

`df.describe()` is a quick way to summarize the continuous (numeric) columns: count, mean, standard deviation, minimum, quartiles, and maximum.

In [133]:
df.describe()

Unnamed: 0,age,education-num,hours-per-week
count,32561.0,32561.0,32561.0
mean,38.581647,10.080679,40.437456
std,13.640433,2.57272,12.347429
min,17.0,1.0,1.0
25%,28.0,9.0,40.0
50%,37.0,10.0,40.0
75%,48.0,12.0,45.0
max,90.0,16.0,99.0


Let's check whether any column contains missing (NaN) values.

In [106]:
df.isnull().any()

age               False
workclass         False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
hours-per-week    False
native-country    False
income            False
dtype: bool
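
`isnull()` only detects real NaN values. The workclass and occupation counts later in this notebook show that unknown entries are stored as the string `?`, which `isnull()` does not flag. Below is a minimal sketch of counting such placeholders per column. The frame `toy` is a hypothetical stand-in for `census.csv`, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the census data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "?"],
    "occupation": ["Adm-clerical", "?", "Sales", "Sales"],
})

# isnull() finds nothing, because "?" is an ordinary string.
nan_counts = toy.isnull().sum()

# Compare every cell with "?" and count the matches per column.
placeholder_counts = (toy == "?").sum()

print(nan_counts.to_dict())
print(placeholder_counts.to_dict())
```

On the real data the same comparison, `(df == "?").sum()`, should reveal which columns carry unknown values.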

## Task 2 - Preparing data
- Select features `X` and labels `y`. Make sure that your selection makes sense.
- Convert the data into numerical form so that your algorithm (logistic regression) can process it.
- Perform one-hot encoding if necessary.
- Split your data into train and test subsets. Make sure that your split is reasonable. Use `stratify` if you consider it helpful.

In [107]:
Xy = df[['age', 'workclass', 'education-num', 'occupation', 
         'sex', 'hours-per-week', 'income']]

Xy.head()

Unnamed: 0,age,workclass,education-num,occupation,sex,hours-per-week,income
0,39,State-gov,13,Adm-clerical,Male,40,<=50K
1,50,Self-emp-not-inc,13,Exec-managerial,Male,13,<=50K
2,38,Private,9,Handlers-cleaners,Male,40,<=50K
3,53,Private,7,Handlers-cleaners,Male,40,<=50K
4,28,Private,13,Prof-specialty,Female,40,<=50K


### Preprocessing data

Let's check the labels first. We expect exactly two label classes (`<=50K` and `>50K`).

In [108]:
Xy.income.unique()

array(['<=50K', '>50K'], dtype=object)

There are indeed two classes, so we can encode them as 0/1. Let's assume that income `>50K` corresponds to 1 and `<=50K` to 0.

In [109]:
Xy.income = (df.income == '>50K').astype(int)
print (Xy.income.value_counts())
Xy.head()

0    24720
1     7841
Name: income, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Unnamed: 0,age,workclass,education-num,occupation,sex,hours-per-week,income
0,39,State-gov,13,Adm-clerical,Male,40,0
1,50,Self-emp-not-inc,13,Exec-managerial,Male,13,0
2,38,Private,9,Handlers-cleaners,Male,40,0
3,53,Private,7,Handlers-cleaners,Male,40,0
4,28,Private,13,Prof-specialty,Female,40,0
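
The `SettingWithCopyWarning` in the output above appears because `Xy` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to modify `df` as well. One common way to avoid it, sketched here on a hypothetical toy frame (`df_toy` and `Xy_toy` are illustrative names), is to take an explicit `.copy()` and assign with bracket notation:

```python
import pandas as pd

# Hypothetical toy frame standing in for the census data.
df_toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Male", "Female"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# .copy() makes the selection an independent DataFrame,
# so later assignments are unambiguous and raise no warning.
Xy_toy = df_toy[["age", "sex", "income"]].copy()

# Bracket assignment is the usual way to overwrite a column.
Xy_toy["income"] = (Xy_toy["income"] == ">50K").astype(int)
Xy_toy["sex"] = (Xy_toy["sex"] == "Male").astype(int)

print(Xy_toy)
```

The original `df_toy` keeps its string labels, which is the behaviour the notebook relies on when it later filters with `df.workclass` and `df.occupation`.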


In [110]:
Xy.sex = (df.sex == 'Male').astype(int)
print (Xy.sex.value_counts())
Xy.head()

1    21790
0    10771
Name: sex, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Unnamed: 0,age,workclass,education-num,occupation,sex,hours-per-week,income
0,39,State-gov,13,Adm-clerical,1,40,0
1,50,Self-emp-not-inc,13,Exec-managerial,1,13,0
2,38,Private,9,Handlers-cleaners,1,40,0
3,53,Private,7,Handlers-cleaners,1,40,0
4,28,Private,13,Prof-specialty,0,40,0


#### Workclass analysis

In [111]:
Xy.workclass.value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [112]:
Xy.workclass.unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

The `Without-pay` and `Never-worked` categories are very rare (14 and 7 records), so we merge them into the unknown category `?`.

In [113]:
Xy.loc[df.workclass.isin(['Without-pay', 'Never-worked']), 'workclass'] = '?'
Xy.workclass.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1857
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Name: workclass, dtype: int64

Perform the same analysis for occupation, replacing every category that has fewer than 700 instances with `?`.

In [114]:
Xy.occupation.value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

In [115]:
Xy.occupation.unique()

array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',
       'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',
       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',
       'Tech-support', '?', 'Protective-serv', 'Armed-Forces',
       'Priv-house-serv'], dtype=object)

In [116]:
Xy.loc[df.occupation.isin(['Protective-serv', 'Priv-house-serv', 'Armed-Forces']), 'occupation'] = '?'
Xy.occupation.value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
?                    2650
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Name: occupation, dtype: int64
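
Instead of listing the rare categories by hand, the threshold rule can be derived from the counts. The following is a sketch under the assumption that every category below the threshold should be merged into `?`. The helper `merge_rare_categories` is a hypothetical name, not part of pandas.

```python
import pandas as pd

def merge_rare_categories(series, min_count, placeholder="?"):
    """Replace categories seen fewer than min_count times with placeholder."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the placeholder.
    return series.where(~series.isin(rare), placeholder)

# Hypothetical toy column standing in for Xy.occupation.
toy = pd.Series(
    ["Sales"] * 5 + ["Tech-support"] * 3 + ["Armed-Forces"] * 1 + ["?"] * 2
)
merged = merge_rare_categories(toy, min_count=3)
print(merged.value_counts())
```

On the notebook's data the equivalent call would use `min_count=700`, which selects the same three occupations that are merged manually above.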

### Finally, let's do one-hot encoding

In [117]:
Xy = pd.get_dummies(Xy, columns=['workclass', 'occupation'])
print (Xy.shape)
print (Xy.columns)
Xy.head()

(32561, 24)
Index(['age', 'education-num', 'sex', 'hours-per-week', 'income',
       'workclass_?', 'workclass_Federal-gov', 'workclass_Local-gov',
       'workclass_Private', 'workclass_Self-emp-inc',
       'workclass_Self-emp-not-inc', 'workclass_State-gov', 'occupation_?',
       'occupation_Adm-clerical', 'occupation_Craft-repair',
       'occupation_Exec-managerial', 'occupation_Farming-fishing',
       'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct',
       'occupation_Other-service', 'occupation_Prof-specialty',
       'occupation_Sales', 'occupation_Tech-support',
       'occupation_Transport-moving'],
      dtype='object')


Unnamed: 0,age,education-num,sex,hours-per-week,income,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,...,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,39,13,1,40,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,50,13,1,13,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,38,9,1,40,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
3,53,7,1,40,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
4,28,13,0,40,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
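
To make the effect of `pd.get_dummies` easier to see, here is a small sketch on a hypothetical three-row frame. Each distinct value of `workclass` becomes its own 0/1 column named `workclass_<value>`, and the numeric column is left untouched. Passing `dtype=int` is an assumption made here so that the output is 0/1 regardless of the pandas version, since newer versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# One indicator column is created per category, sorted by category name.
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)
print(encoded)
```

Every row has exactly one `1` among the `workclass_*` columns, which is why the encoded census frame grows from 7 to 24 columns.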


## Splitting dataset

Let's split the dataset into features and labels first.
- `income` is the label (`y`)
- all other columns are features (`X`)

In [118]:
y = Xy.income
X = Xy.drop('income', axis=1)

In [119]:
X.head()

Unnamed: 0,age,education-num,sex,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,...,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,39,13,1,40,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,50,13,1,13,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
2,38,9,1,40,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
3,53,7,1,40,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,28,13,0,40,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0


### Train test split

In [120]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify = y, random_state=1)

print ('X train shape:', X_train.shape)
print ('X test shape:', X_test.shape)
print ('y train shape:', y_train.shape)
print ('y test shape:', y_test.shape)

X train shape: (24420, 23)
X test shape: (8141, 23)
y train shape: (24420,)
y test shape: (8141,)
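
Because only about a quarter of the records belong to the `>50K` class, `stratify=y` is used so that both subsets keep roughly the same class proportions. A quick way to check this, sketched on hypothetical synthetic labels rather than the census data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% zeros, 25% ones.
y_toy = np.array([0] * 750 + [1] * 250)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# With stratification the share of positives is nearly identical everywhere.
print("positive share, full :", y_toy.mean())
print("positive share, train:", y_tr.mean())
print("positive share, test :", y_te.mean())
```

In the notebook the same check would be `y_train.mean()` and `y_test.mean()`, since the labels are already encoded as 0/1.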


## Task 4 - Logistic Regression
Train and test a logistic regression model. To get the maximum score, make sure that your model:
- Does not overfit
- Does not underfit
- Achieves at least 80% accuracy on the test subset.

In [121]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

print ('Model trained!')

Model trained!


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
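
The warning above means that the solver stopped at its iteration limit (`max_iter`, 100 by default) before converging. As the message suggests, two possible remedies are scaling the features and raising `max_iter`. The sketch below combines both in a pipeline. It runs on synthetic data from `make_classification` as a stand-in, so the scores it prints say nothing about the census model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the census features).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=1
)

# Standardize the features, then fit; max_iter is raised above the default.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

print("train accuracy:", scaled_model.score(X_tr, y_tr))
print("test accuracy:", scaled_model.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is fitted on the training subset only, so no information from the test subset leaks into the preprocessing.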


### Let's check the accuracy
Train data first...

In [122]:
model.score(X_train, y_train)

0.8110974610974611

And test data...

In [123]:
model.score(X_test, y_test)

0.814641935880113
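
Train and test accuracy are almost equal, which suggests the model is not overfitting. To judge whether about 81% is a good score, it helps to compare it with the accuracy of always predicting the majority class. A short calculation using the class counts printed earlier in this notebook:

```python
# Class counts taken from the value_counts() output earlier in this notebook.
n_low = 24720   # income <= 50K (label 0)
n_high = 7841   # income > 50K (label 1)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_low, n_high) / (n_low + n_high)
print("majority-class baseline: {:.2%}".format(baseline_accuracy))
```

The baseline is roughly 76%, so the logistic regression improves on it by about five percentage points.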

## Task 5 - Precision and recall
- Compute precision and recall for your model on both the train and test subsets.
- Make sure that you understand these metrics; you may be asked to explain their meaning.

In [132]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

print ('Precision (train set): {:.2f}%'.format(100*precision_score(y_train, pred_train)))
print ('Precision (test set): {:.2f}%'.format(100*precision_score(y_test, pred_test)))

print ('Recall (train set): {:.2f}%'.format(100*recall_score(y_train, pred_train)))
print ('Recall (test set): {:.2f}%'.format(100*recall_score(y_test, pred_test)))

print ('Accuracy (train set): {:.2f}%'.format(100*accuracy_score(y_train, pred_train)))
print ('Accuracy (test set): {:.2f}%'.format(100*accuracy_score(y_test, pred_test)))




Precision (train set): 66.47%
Precision (test set): 67.88%
Recall (train set): 43.51%
Recall (test set): 43.67%
Accuracy (train set): 81.11%
Accuracy (test set): 81.46%
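
Precision is the share of predicted positives that are truly positive, TP / (TP + FP). Recall is the share of actual positives that the model finds, TP / (TP + FN). In the results above, a test precision of 67.88% means that about two thirds of the people predicted to earn `>50K` really do, while a recall of 43.67% means the model finds fewer than half of the actual high earners. The definitions can be checked by hand on a small hypothetical example:

```python
# Hypothetical true labels and predictions (1 = income > 50K).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print("precision:", precision)
print("recall:", recall)
```

Here two of the three positive predictions are correct (precision 2/3) and two of the four actual positives are found (recall 1/2).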


## Task 6 - Applying the model
Use your model to check whether you will earn more than $50,000 per year. Check both the model's response (true/false) and the probability that the response is true. Use data about yourself:
- right now
- two years from now
- ten years from now

In [125]:
print (X_train.columns)
new_individual = np.array([26, 15,  1, 40,  
                  0,  1,  0,  0,  0,  0,  0,  
                  0,  1,  0,  0,  0,  0, 0,  0,  0,  0,  0,  0]).reshape(1, -1)


Index(['age', 'education-num', 'sex', 'hours-per-week', 'workclass_?',
       'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private',
       'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',
       'workclass_State-gov', 'occupation_?', 'occupation_Adm-clerical',
       'occupation_Craft-repair', 'occupation_Exec-managerial',
       'occupation_Farming-fishing', 'occupation_Handlers-cleaners',
       'occupation_Machine-op-inspct', 'occupation_Other-service',
       'occupation_Prof-specialty', 'occupation_Sales',
       'occupation_Tech-support', 'occupation_Transport-moving'],
      dtype='object')


In [126]:
print (model.predict(new_individual))

[0]


In [127]:
print (model.predict_proba(new_individual))

[[0.52003355 0.47996645]]
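
The two numbers returned by `predict_proba` follow the order of `model.classes_`, so here the second value (about 0.48) is the estimated probability of earning `>50K`. Since it is below 0.5, `predict` returns 0.

Writing the feature vector by position is easy to get wrong with 23 columns. A possible alternative, sketched below, builds the row by column name and fills all remaining dummy columns with 0. The helper `make_row` is hypothetical, and the column list is copied from the `X_train.columns` output above.

```python
import pandas as pd

# Column order copied from the X_train.columns output above.
feature_columns = [
    "age", "education-num", "sex", "hours-per-week", "workclass_?",
    "workclass_Federal-gov", "workclass_Local-gov", "workclass_Private",
    "workclass_Self-emp-inc", "workclass_Self-emp-not-inc",
    "workclass_State-gov", "occupation_?", "occupation_Adm-clerical",
    "occupation_Craft-repair", "occupation_Exec-managerial",
    "occupation_Farming-fishing", "occupation_Handlers-cleaners",
    "occupation_Machine-op-inspct", "occupation_Other-service",
    "occupation_Prof-specialty", "occupation_Sales",
    "occupation_Tech-support", "occupation_Transport-moving",
]

def make_row(age, education_num, sex, hours, workclass, occupation):
    """Build a one-row feature frame; unspecified dummy columns stay 0."""
    values = {
        "age": age,
        "education-num": education_num,
        "sex": sex,
        "hours-per-week": hours,
        "workclass_" + workclass: 1,
        "occupation_" + occupation: 1,
    }
    # Guard against typos in the category names.
    unknown = set(values) - set(feature_columns)
    if unknown:
        raise ValueError("unknown feature columns: {}".format(sorted(unknown)))
    return pd.DataFrame([values]).reindex(columns=feature_columns, fill_value=0)

# Same individual as in the positional array above.
row = make_row(26, 15, 1, 40, "Federal-gov", "Adm-clerical")
print(row.iloc[0].tolist())
```

With the trained model from this notebook, `model.predict_proba(row)` could then be evaluated for each scenario, for example by calling `make_row` again with `age` set to 28 and 36 for the two-year and ten-year cases.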
