# Predicting income from census data

In this challenge you will be working with the popular machine learning dataset from the 1994 Adult Census database. Given a set of socio-economic attributes, you will try to predict whether an individual's annual income is greater than $50,000 or not.

The following attributes are present in the dataset: 
<ol> 
<li><b>age</b>: continuous.</li>
<li><b>workclass</b>: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.</li>
<li><b>fnlwgt</b>: continuous.</li>
<li><b>education</b>: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.</li>
<li><b>education-num</b>: continuous.</li>
<li><b>marital-status</b>: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.</li>
<li><b>occupation</b>: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.</li>
<li><b>relationship</b>: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.</li>
<li><b>race</b>: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.</li>
<li><b>sex</b>: Female, Male.</li>
<li><b>capital-gain</b>: continuous.</li>
<li><b>capital-loss</b>: continuous.</li>
<li><b>hours-per-week</b>: continuous.</li>
<li><b>native-country</b>: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.</li>
<li><b>income</b>: >50K, <=50K.</li>
</ol>

Citation:
<ul>
<li>This dataset is taken from the UCI Machine Learning Repository: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.</li>
<li>Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.</li>
</ul>



In [None]:
# Feel free to import more packages (e.g., sklearn packages) as required.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Data 

In [None]:
path_ = 'https://s3-eu-west-1.amazonaws.com/fellowship-teaching-materials/data-practical/adult.csv'

In [None]:
# Read the csv file into a DataFrame
df = pd.read_csv(path_, header='infer', index_col=None)
df.head(10)

## Exploratory data analysis (EDA)

It is good practice to perform some quality checks on the data, e.g., for missing values and duplicated rows. Try to gain some basic insights into the data so that you can make more informed decisions about the machine learning task. 
The Pandas data analysis library has many built-in functions that make data manipulation and exploration faster and easier. 
For creating plots you can use any Python plotting library that you are familiar with, e.g., matplotlib or seaborn. Pandas also has its own built-in plotting functions. 

In [None]:
# Check for missing values or duplicate rows using pandas and remove accordingly if any are found
print(f'Null values: {df.isnull().values.sum()}')
print(f'Number of duplicated rows: {df.duplicated(keep="first").sum()}')
df.drop_duplicates(keep="first", inplace=True)
print(len(df))

In [None]:
# get some basic statistics on your continuous variables using pandas
df.describe()

In [None]:
# plot histograms for your continuous variables
# You can use any python plot library for this
# Below we make use of the built-in Pandas plotting functions
cont_cols = ['age', 'fnlwgt', 'educationNum', 'capitalGain', 'capitalLoss','hoursPerWeek']
fig, axes = plt.subplots(ncols=2,nrows=3, figsize=[20,20])
axes = np.ravel(axes)
for idx, col in enumerate(cont_cols):
    df[col].plot(kind='hist', density=True, ax=axes[idx], title=col)

In [None]:
# plot bar charts of your categorical variables
# It might be useful to compare the values of each category given the target variable (income)


What have you learned from the above sets of plots? Do you already have some insights about which demographics are more likely to earn over $50k? Are there any features that seem redundant, uninformative or unusable for any other reason? What about the target variable, income?
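As a starting point for these questions, the sketch below computes the class balance of the target and the share of each income class within every category of one feature. It uses a small hand-made DataFrame called `toy` as a stand-in for the census data, so the column names and values are assumptions and may need adjusting to match the real file.

```python
import pandas as pd

# Toy stand-in for the census data; the real column names and values may differ.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Male"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"],
})

# Class balance: the share of rows in each income class.
class_balance = toy["income"].value_counts(normalize=True)

# Share of each income class within every category of a feature (each row sums to 1).
income_by_sex = pd.crosstab(toy["sex"], toy["income"], normalize="index")

print(class_balance)
print(income_by_sex)
```

Calling `income_by_sex.plot(kind="bar", stacked=True)` would draw the same comparison as a bar chart.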

In [None]:
# check the class balance of your data. 


## Preprocessing

Now that we know a little more about our data it is time to preprocess it for our classification task. Consider which feature engineering steps you will need to take to ensure that the data are in the right format, for example, how should categorical variables be treated?
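One common treatment for nominal categorical variables is one-hot encoding, while a binary target needs only a single 0/1 column. The sketch below illustrates both on a hand-made DataFrame; `toy` and the column name `income_gt_50k` are hypothetical, and the labels in the real file may contain extra whitespace that needs stripping first.

```python
import pandas as pd

# Toy stand-in for a few census rows; real column names and values may differ.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp-inc"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# One-hot encode a nominal feature; drop_first removes one redundant column.
encoded = pd.get_dummies(toy[["age", "workclass"]], columns=["workclass"], drop_first=True)

# Encode the binary target as a single 0/1 column (1 means income above 50K).
encoded["income_gt_50k"] = (toy["income"] == ">50K").astype(int)

print(encoded)
```

Dropping the first level is mainly useful for linear models, where a full set of dummy columns is perfectly collinear with the intercept. Tree-based models are usually indifferent to it.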

In [None]:
# create new dataframe that will be used for training ML model
df_data = pd.DataFrame()

In [None]:
# copy the continuous variables that you wish to keep as features into the new dataframe
# consider if you would like to threshold any of these into binary variables
df_data[cont_cols] = df[cont_cols].copy()

In [None]:
# copy the categorical variables you want to the new dataframe
# they need to be converted into numerical values and one-hot-encoded (again, pandas has built in functions for this)
df_data['sex'] = df['sex'].copy()
df_data = pd.get_dummies(df_data, columns=['sex'], drop_first=True)
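# cat_cols was never defined above; here it is assumed to hold every non-continuous column that still needs encoding
cat_cols = [col for col in df.columns if col not in cont_cols and col != 'sex']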
df_data.head(5)

In [None]:
# repeat this process for the rest of the variables you want to keep. 
# for each one consider what type of encoding you think is most appropriate.

In [None]:
# finally, add the output variable with one-hot-encoding. 


### Train, test, validation split

Before you begin selecting and optimising a machine learning model, you should split your data into train, test (and maybe validation) sets. 

In some cases, you may only need a training and a validation set. For example, perhaps the test data has been held out from the beginning. You may also choose to just use a train/test split and utilise cross validation methods on your training data. 

The exact ratios for each dataset will depend on the amount of available data and specifics of the problem but an 80/20 train/test split is a good rule of thumb. 
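A minimal sketch of an 80/20 split is shown below on synthetic arrays (`X` and `y` are made up for illustration). Passing `stratify=y` is an optional choice, not something the exercise requires. It keeps the class proportions the same in both sets, which can help when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 25% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 75 + [1] * 25)

# 80/20 split; stratify=y preserves the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```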

In [None]:
# split the data into train/test sets and separate the features from the target. 
from sklearn.model_selection import train_test_split


## Model selection and tuning

There are many classification algorithms that could be used for this problem. It is up to you to decide which methods are most suitable for this binary classification task given what you have learned about the data so far.

In general [sklearn](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) can be used to quickly test different types of model. We suggest using cross validation to compare the performance of a few classifiers on the training data, without worrying too much about hyperparameter tuning at this stage. 

Try to pick at least 3 models that are different in some significant way. Depending on which models you choose, you may need some extra preprocessing steps, e.g., normalising the data.

You will need to consider what the important performance metrics are for a classification problem, and use these to decide which model is best for the task. 
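The comparison described above could be sketched as follows. The three models are only an illustrative choice, and the data is synthetic (`make_classification`), so the printed scores say nothing about the census problem. The scale-sensitive models are wrapped in a pipeline so that scaling is fitted inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the training data.
X_train, y_train = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Three models that differ in kind: linear, distance-based, tree ensemble.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k_nearest_neighbours": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Out-of-fold predictions: each row is predicted by a model that never saw it.
    y_pred = cross_val_predict(model, X_train, y_train, cv=5)
    results[name] = {
        "accuracy": accuracy_score(y_train, y_pred),
        "precision": precision_score(y_train, y_pred, zero_division=0),
        "recall": recall_score(y_train, y_pred, zero_division=0),
    }
    print(name, results[name])
```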

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# import the sklearn models that you want to try (recommend 3 models max for time)

In [None]:
# train different models using cross validation
# model 1


# model 2


# model 3

In [None]:
# compare the performance of the models
# you will want to consider metrics like accuracy, precision, recall.


Looking at these initial results, which model do you think is best to proceed with? 

Do you have any thoughts about why a certain model might be performing better at this problem than another?

What are the limitations of each model?

### Hyperparameter tuning

Select your best model from the above and see if you can improve its performance using hyperparameter tuning. You may find this [link](https://scikit-learn.org/stable/modules/grid_search.html) helpful. Depending on your model, an exhaustive grid search might take a very long time. Consider limiting the size of your grid, either by selecting only the one or two hyperparameters that you think are most important or by searching over a small range of values for each hyperparameter. Alternatively, you could try a randomised search to speed things up. 
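A deliberately small grid search might look like the sketch below. The random forest, the two hyperparameters and the `f1` scoring are assumptions chosen for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

# Two hyperparameters with two values each: 4 candidates x 3 folds = 12 fits.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid=param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
```

`RandomizedSearchCV` from the same module takes parameter distributions and an `n_iter` budget instead of a full grid, which caps the number of fits regardless of how many hyperparameters are searched.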

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Do a grid search on your hyperparameter space.


## Model Evaluation

Now compare the performance of your baseline model and the tuned model on the test set. Why is it important to compare performance on held-out data? 
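The sketch below hints at why held-out data matters. It scores two decision trees on both the training and the test set, using synthetic data. The "tuned" model here is a hypothetical stand-in (a depth-limited tree), not the result of an actual search.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the census features and target.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: an unconstrained tree. "Tuned": a depth-limited tree (illustrative only).
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tuned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

scores = {}
for name, model in {"baseline": baseline, "tuned": tuned}.items():
    scores[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

An unconstrained tree can memorise its training data, so its training accuracy overstates how well it generalises. Only the test score gives a fair comparison between the two models.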

In [None]:
# compare performance metrics


### ROC vs Precision-Recall

Draw the precision-recall curve and ROC curve for the classifiers and calculate the area under the curve in both cases. Which curve do you think is more appropriate for this problem and how might the choice affect your evaluation of the model? (<b>Hint</b>: consider your class balance).
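The curve values and areas can be obtained from `sklearn.metrics`, as sketched below. The data is synthetic and the logistic regression is only a placeholder classifier. The important detail is that both curves need predicted scores or probabilities, not hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    auc,
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the census data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Points on each curve.
fpr, tpr, _ = roc_curve(y_test, scores)
precision, recall, _ = precision_recall_curve(y_test, scores)

# Areas under the curves (trapezoidal rule), plus average precision as a
# step-wise summary of the precision-recall curve.
roc_auc = auc(fpr, tpr)
pr_auc = auc(recall, precision)
avg_precision = average_precision_score(y_test, scores)

# A no-skill classifier scores 0.5 ROC AUC, but its precision equals the positive rate.
baseline_precision = y_test.mean()

print(roc_auc, pr_auc, avg_precision, baseline_precision)
```

The curves themselves can be drawn with `plt.plot(fpr, tpr)` and `plt.plot(recall, precision)`.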

In [None]:
# get roc values and precision recall values using sklearn


In [None]:
# plot the curves


In [None]:
# calculate area under the curve. 
