# Predicting income from census data

In this challenge you will work with a popular machine learning dataset drawn from the 1994 Adult Census database. Given a set of socio-economic attributes, you will try to predict whether an individual's annual income is greater than $50,000 or not.

The following attributes are present in the dataset: 
<ol> 
<li><b>age</b>: continuous.</li>
<li><b>workclass</b>: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.</li>
<li><b>fnlwgt</b>: continuous.</li>
<li><b>education</b>: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.</li>
<li><b>education-num</b>: numerical version of education.</li>
<li><b>marital-status</b>: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.</li>
<li><b>occupation</b>: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.</li>
<li><b>relationship</b>: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.</li>
<li><b>race</b>: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.</li>
<li><b>sex</b>: Female, Male.</li>
<li><b>capital-gain</b>: continuous.</li>
<li><b>capital-loss</b>: continuous.</li>
<li><b>hours-per-week</b>: continuous.</li>
<li><b>native-country</b>: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.</li>
<li><b> income </b>: >50K, <=50K.</li>
</ol>

Citation:
<ul>
<li>This dataset was taken from the UCI Machine Learning Repository: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.</li>
<li>Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.</li>
</ul>



In [None]:
# Feel free to import more packages (e.g., numpy, sklearn packages) as required.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Data

In [None]:
path_ = 'https://s3-eu-west-1.amazonaws.com/fellowship-teaching-materials/data-practical/adult.csv' 

In [None]:
# Read the csv file into a DataFrame
df = pd.read_csv(path_, header='infer', index_col=None)
df.head(10)

## Exploratory data analysis (EDA)

It is good practice to perform some quality checks on the data, e.g., checking for missing values and duplicate rows. Try to find out some basic insights about the data so that you can make more informed decisions about the machine learning task.

The Pandas data analysis library has many built-in functions that facilitate faster and easier data manipulation and exploration. 

For creating plots you can use any Python plotting library that you are familiar with, e.g., matplotlib or seaborn. Pandas also has its own built-in plotting functions.

In [None]:
# Check for missing values or duplicate rows using pandas and remove accordingly if any are found
print(f'Null values: {df.isnull().values.sum()}')
print(f'Number of duplicated rows: {sum(df.duplicated(keep="first"))}')
df.drop_duplicates(keep="first",inplace=True)
print(len(df))

In [None]:
# get some basic statistics on your continuous variables using pandas
df.describe()

In [None]:
# plot histograms for your continuous variables
# You can use any python plot library for this
# Below we make use of the built-in Pandas plotting functions
cont_cols = ['age', 'fnlwgt', 'educationNum', 'capitalGain', 'capitalLoss','hoursPerWeek']
fig, axes = plt.subplots(ncols=2,nrows=3, figsize=[20,20])
axes = np.ravel(axes)
for idx, col in enumerate(cont_cols):
    df[col].plot(kind='hist', density=True, ax=axes[idx], title=col)

In [None]:
# plot bar charts of your categorical variables
# It might be useful to compare the values of each category given the target variable (income)

# set as categories
cat_cols = ['workclass', 'education', 'maritalStatus', 'occupation', 'relationship', 'race', 'sex', 'nativeCountry']
df[cat_cols] = df[cat_cols].astype("category")

# this time we plot with seaborn to try an alternative approach
fig, axes = plt.subplots(ncols=2,nrows=4, figsize=[20,20])
axes = np.ravel(axes)

for idx, ax in enumerate(axes):
    sns.countplot(x=cat_cols[idx], hue='income', data=df, ax=ax)
    if cat_cols[idx] != 'sex':
        ax.xaxis.set_tick_params(rotation=70)
plt.tight_layout()

What have you learned from these plots? Do you already have some insights about which demographics are more likely to earn over $50k? Are there any features that seem redundant, uninformative or unusable for any other reason? What about the target variable, income?

In [None]:
# check the class balance of your data. 
df.income.value_counts().plot(kind='bar')
plt.xticks(rotation=0);
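
To put a number on the class balance, it can help to look at proportions rather than raw counts. The sketch below uses a small made-up series as a stand-in for `df['income']` (the toy values and the `majority_baseline` name are assumptions for illustration only). The share of the largest class is the accuracy you would get by always predicting that class, which is a useful floor to compare models against.

```python
import pandas as pd

# Toy stand-in for df['income']; the real values come from adult.csv.
income = pd.Series([' <=50K'] * 3 + [' >50K'])

# normalize=True returns fractions of the total instead of raw counts.
class_share = income.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(f'Majority-class baseline accuracy: {majority_baseline:.2f}')
```

On the real data you would replace `income` with `df['income']`.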

## Preprocessing

Now that we know a little more about our data it is time to preprocess it for our classification task. Consider which feature engineering steps you will need to take to ensure that the data is in the right format, for example, how should categorical variables be treated?

In [None]:
# create new dataframe that will be used for training ML model
df_data = pd.DataFrame()

In [None]:
# copy the continuous variables that you wish to keep as features into the new dataframe
# consider if you would like to threshold any of these into binary variables
df_data[cont_cols] = df[cont_cols].copy()
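
If a continuous variable is mostly zero with a few large values, one option is to threshold it into a binary indicator. The following is a minimal sketch on a toy frame; the column name `hasCapitalGain` and the threshold of zero are hypothetical choices, not something prescribed by the dataset.

```python
import pandas as pd

# Toy frame; the column name follows the camelCase used in cont_cols.
toy = pd.DataFrame({'capitalGain': [0, 0, 5000, 0, 12000]})

# 1 if any capital gain was reported, 0 otherwise (hypothetical feature name).
toy['hasCapitalGain'] = (toy['capitalGain'] > 0).astype(int)
print(toy)
```

Whether the indicator replaces the original column or sits alongside it is a modelling decision you can test with cross validation.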

In [None]:
# copy the categorical variables you want to the new dataframe
# they need to be converted into numerical values and one-hot-encoded (again, pandas has built in functions for this)
df_data['sex'] = df['sex'].copy()
df_data = pd.get_dummies(df_data, columns=['sex'], drop_first=True)
cat_cols.remove('sex')
df_data.head(5)
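
To see what `pd.get_dummies` does with and without `drop_first=True`, here is a small sketch on made-up data (the toy values are assumptions). Without `drop_first`, every category level gets its own column; with it, the first level is dropped, which leaves a single 0/1 column for a two-level variable such as `sex`.

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female'],
                    'race': ['White', 'Black', 'Other']})

# One indicator column per category level; other columns are kept as they are.
full = pd.get_dummies(toy, columns=['race'])

# drop_first=True drops the first (alphabetically sorted) level.
binary = pd.get_dummies(toy[['sex']], columns=['sex'], drop_first=True)

print(full.columns.tolist())
print(binary.columns.tolist())
```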

In [None]:
# educationNum seems to be a numeric encoding of the education category
# we can therefore drop the latter.
print(df[['educationNum','education']].value_counts())
cat_cols.remove('education')

In [None]:
# copy and encode the rest of the categorical columns
df_data[cat_cols] = df[cat_cols].copy()
df_data = pd.get_dummies(df_data, columns=cat_cols)

In [None]:
# finally, add the output variable with one-hot-encoding. 
df_data['income'] = df['income'].copy()
# make sure to set drop_first=True to avoid creating unnecessary columns.
df_data = pd.get_dummies(df_data, columns=['income'], drop_first=True)
df_data.head()

### Train, test, validation split

Before you begin selecting and optimising a machine learning model, you should split your data into train, test (and maybe validation) sets. 

In some cases, you may only need a training and a validation set. For example, perhaps the test data has been held out from the beginning, as in some competitions. You may also choose to just use a train/test split and utilise cross validation methods on your training data. 

The exact ratios for each dataset will depend on the amount of available data and specifics of the problem but an 80/20 train/test split is a good rule of thumb. 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# split the data into train/test sets and separate the features from the target. 
X = df_data.drop('income_ >50K', axis=1)
y = df_data['income_ >50K']
# fixing your random state ensures you will get the same split every time you run this line.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
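
If you also want a separate validation set, one possible approach is to call `train_test_split` twice. The sketch below uses synthetic data (`X_toy`, `y_toy` and the 60/20/20 ratios are assumptions for illustration) and passes `stratify` so that each subset keeps roughly the same class balance, which can matter when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.25).astype(int)  # roughly 25% positives

# Hold out 20% as the test set, preserving the class ratio.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Take 25% of the remainder (20% of the total) as the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

print(len(X_tr), len(X_val), len(X_te))
print(y_tr.mean(), y_val.mean(), y_te.mean())
```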


In [None]:
y_train.value_counts()

In [None]:
X_train.head()

## Model selection and tuning

There are many classification algorithms that could be used for this problem. It is up to you to decide which methods are most suitable for this binary classification task given what you have learned about the data so far.

In general [sklearn](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) can be used to quickly test different types of model. We suggest using cross validation to compare the performance of a few classifiers on the training data, without worrying too much about hyperparameter tuning at this stage. 

Try to pick at least 3 models that are different in some significant way. Depending on which models you choose, you may need some extra preprocessing steps, e.g., normalising the data.

You will need to consider what the important performance metrics are for a classification problem, and use these to decide which model is best for the task. 

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# train different models using cross validation
# model 1
y_svc = cross_val_predict(SVC(), X_train, y_train, cv=5, n_jobs=4)

In [None]:
# model 2
y_dtc = cross_val_predict(DecisionTreeClassifier(), X_train, y_train, cv=5, n_jobs=4)

In [None]:
# model 3
y_rfc = cross_val_predict(RandomForestClassifier(), X_train, y_train, cv=5, n_jobs=4)
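
Some models, such as SVC, are sensitive to the scale of the features, while tree-based models are not. One way to add normalisation without leaking information between folds is to wrap the scaler and the model in a pipeline. This is a minimal sketch on synthetic data (`X_toy`, `y_toy` and `scaled_svc` are hypothetical names); on the census data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is re-fitted on the training part of each fold only,
# so the held-out fold never influences the scaling statistics.
scaled_svc = make_pipeline(StandardScaler(), SVC())
y_scaled_svc = cross_val_predict(scaled_svc, X_toy, y_toy, cv=5)
print(y_scaled_svc[:10])
```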

In [None]:
def get_performance(y_pred, y_true):
    """Returns a dict containing the accuracy, precision, recall, and f1 score of 
    a model prediction."""
    # compute performance metrics of model
    performance = {}

    performance['accuracy'] = round(metrics.accuracy_score(y_true, y_pred) * 100, 2)
    performance['precision'] = round(metrics.precision_score(y_true, y_pred), 2)
    performance['recall'] = round(metrics.recall_score(y_true, y_pred), 2)
    performance['f1score'] = round(metrics.f1_score(y_true, y_pred), 2)
    
    return performance

In [None]:
# compare the performance of the models
performance_dict = {}
performance_dict['SVC'] = get_performance(y_svc, y_train)
performance_dict['DecisionTree'] = get_performance(y_dtc, y_train)
performance_dict['RandomForest'] = get_performance(y_rfc, y_train)
df_performance = pd.DataFrame(performance_dict)
df_performance
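
When comparing these numbers, keep in mind that accuracy alone can be misleading on imbalanced data. The toy example below (made-up labels, not results from the census data) shows a "model" that always predicts the negative class: its accuracy looks reasonable, but its recall on the positive class is zero.

```python
from sklearn import metrics

# 8 negatives and 2 positives; every prediction is the negative class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = metrics.accuracy_score(y_true, y_pred)
rec = metrics.recall_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f'accuracy={acc}, recall={rec}')
print(cm)
```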

Looking at these initial results, which model do you think is best to proceed with? 

Do you have any thoughts about why a certain model might be performing better at this problem than another?

What are the limitations of each model?

### Hyperparameter tuning

Select your best model from the above and see if you can increase its performance using hyperparameter tuning. You may find this [link](https://scikit-learn.org/stable/modules/grid_search.html) helpful. Depending on your model, an exhaustive grid search might take a very long time. Consider limiting the size of your grid, either by selecting the one or two hyperparameters that you think are most important or by searching over a small range of values for each hyperparameter. Alternatively, you could try a randomised search to speed things up.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Do a grid search on your hyperparameter space.
param_grid = {'n_estimators': [100, 500], 
              'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
              'min_samples_split': [5, 10]} # probably too small a grid for any significant improvements

clf = RandomForestClassifier()
tuned_model = GridSearchCV(clf, param_grid, cv=5, n_jobs=4)
tuned_model.fit(X_train, y_train)
print(f'Best Parameter values: {tuned_model.best_params_}')
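
As an alternative to the exhaustive grid, a randomised search samples a fixed number of hyperparameter combinations. The sketch below runs on synthetic data so that it finishes quickly; the distributions, `n_iter=5` and the toy dataset are all assumptions chosen for illustration, not recommended settings for the census data.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions are sampled from; lists are sampled uniformly.
param_distributions = {
    'n_estimators': randint(10, 50),
    'min_samples_split': randint(2, 11),
    'max_features': ['sqrt', 'log2'],
}

# n_iter controls how many combinations are tried in total.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=5, cv=3, random_state=0)
search.fit(X_toy, y_toy)
print(f'Best parameter values: {search.best_params_}')
```

The cost scales with `n_iter` times the number of folds, regardless of how large the search space is.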

## Model Evaluation

Now compare the performance of your baseline model and the tuned model on the test set. Why is it important to compare performance on held-out data?

In [None]:
clf_baseline = RandomForestClassifier()
clf_baseline.fit(X_train, y_train)

In [None]:
# compare performance metrics
y_test_baseline = clf_baseline.predict(X_test)
y_test_tuned = tuned_model.predict(X_test)
performance_dict['RFC_test_baseline'] = get_performance(y_test_baseline, y_test)
performance_dict['RFC_test_tuned'] = get_performance(y_test_tuned, y_test)
df_performance = pd.DataFrame(performance_dict)
df_performance

### ROC vs precision-recall

Draw the precision-recall curve and ROC curve for the classifiers and calculate the area under the curve in both cases. Which curve do you think is more appropriate for this problem and how might the choice affect your evaluation of the model? (<b>Hint</b>: consider your class balance).

In [None]:
y_proba_baseline = clf_baseline.predict_proba(X_test)[:,1]
y_proba_tuned = tuned_model.predict_proba(X_test)[:,1]

In [None]:
precision_b, recall_b, thresholds_b = metrics.precision_recall_curve(y_test, y_proba_baseline)
precision_t, recall_t, thresholds_t = metrics.precision_recall_curve(y_test, y_proba_tuned)
fpr_b, tpr_b, thresholds = metrics.roc_curve(y_test, y_proba_baseline)
fpr_t, tpr_t, thresholds = metrics.roc_curve(y_test, y_proba_tuned)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=[20,10])
axes[0].plot(recall_t, precision_t, "-r", label='tuned')
axes[0].plot(recall_b, precision_b, "-g", label='baseline')
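# no-skill baseline: a random classifier's precision equals the positive-class prevalence
axes[0].axhline(y_test.mean(), linestyle='--')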
axes[0].set_xlabel('Recall')
axes[0].set_ylabel('Precision')
axes[0].set_title('Precision recall curve on test data')
axes[0].legend()

axes[1].plot(fpr_t, tpr_t, "-r", label='tuned')
axes[1].plot(fpr_b, tpr_b, "-g", label='baseline')
axes[1].plot([0, 1], [0, 1],'--')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC curve on test data')
axes[1].legend();

In [None]:
print(f'Area under precision-recall curve: {metrics.auc(recall_t, precision_t)}')
print(f'Area under ROC curve: {metrics.auc(fpr_t, tpr_t)}')
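
To build some intuition for the hint about class balance, the toy example below (made-up labels and scores, not results from the census data) computes the ROC AUC and the average precision, a common summary of the precision-recall curve, for the same predictions. A single high-scoring negative barely moves the ROC AUC when negatives are plentiful, but it lowers precision noticeably. The relevant no-skill reference also differs: 0.5 for ROC AUC, and the positive-class prevalence for precision.

```python
import numpy as np
from sklearn import metrics

# 8 negatives and 2 positives; one negative (score 0.9) outranks a positive (score 0.8).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.8, 0.95])

roc_auc = metrics.roc_auc_score(y_true, y_score)
avg_precision = metrics.average_precision_score(y_true, y_score)
no_skill = y_true.mean()  # precision of a random classifier

print(f'ROC AUC: {roc_auc:.4f}')
print(f'Average precision: {avg_precision:.4f}')
print(f'No-skill precision: {no_skill:.2f}')
```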