# Title

## Introduction

### Authored by:
Team Name:
Team Members: Tim Smith, John Jones, etc.

### Description of the analysis


In this project, we will be using a dataset containing census information from [UCI's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census+income).

Our prediction task is to determine whether a person makes more than 50K a year. We are given input variables that include measures of age, working class, education, educational number, marital status, occupation, relationship, race, sex, capital-gain, capital-loss, and hours worked per week. We have been directed to use a subset of these input variables: age, sex, capital-gain, capital-loss, and hours worked per week.

To conduct our analysis, we will utilize both a k-NN model and a Decision Tree model.

## Step 1: Install and/or import necessary packages

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

random_seed = 1
np.random.seed(random_seed)  # seed NumPy's global random number generator

## Step 2: Preliminary (Business) Problem Scoping

We are developing a binary classifier to identify whether a given person in the dataset has a salary above 50K. Our positive case will therefore be >50K. The costs of a false positive (FP) and a false negative (FN) are roughly equal. It is not yet known whether the classes are imbalanced. If they are, we will rebalance them using an oversampling technique.
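
Once the target column is available, class balance can be checked with a few lines of pandas. The sketch below uses a small hypothetical Series (`income`, with made-up values) rather than the real data, and the names `class_share` and `imbalance_ratio` are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels standing in for the income column (not the real data).
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# Share of each class; a large gap between the shares signals imbalance.
class_share = income.value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print("majority/minority ratio:", imbalance_ratio)
```

A ratio close to 1 would suggest that no rebalancing is needed; a larger ratio is the situation the oversampling step is meant to address.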

## Step 3: Load, clean and prepare data for analysis

### Load data from source

In [2]:
income_df = pd.read_csv("../../Data/income.csv")
income_df.head(5)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Explore the data

In [3]:
income_df.columns 

Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' income'],
      dtype='object')

In [4]:
income_df.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [5]:
income_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlwgt          32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Our findings from the data exploration indicate that the data requires a renaming/cleanup of column names. We also note that sex and income are categorical data. We will encode this data using OrdinalEncoder for the input variable sex, and LabelEncoder for the y (target) variable income. (Remember our in-class discussion on these: they do the same thing, but LabelEncoder accepts only one variable, since it is used for encoding target variables.)
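
The practical difference between the two encoders is the shape of the input they expect. The following is a minimal sketch on a hypothetical toy frame (`toy_df`, with made-up values), not the notebook's actual data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame; the values are illustrative only.
toy_df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K"],
})

# OrdinalEncoder expects 2-D input (one or more feature columns).
sex_encoded = OrdinalEncoder().fit_transform(toy_df[["sex"]])

# LabelEncoder expects 1-D input (a single target column).
label_enc = LabelEncoder()
income_encoded = label_enc.fit_transform(toy_df["income"])

print(sex_encoded.shape)         # 2-D: rows x encoded columns
print(income_encoded.shape)      # 1-D: one code per row
print(list(label_enc.classes_))  # class labels in the order of their codes
```

Note the double brackets in `toy_df[["sex"]]`, which keep the selection as a one-column DataFrame, versus the single brackets in `toy_df["income"]`, which return a Series.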

## Step 4: Clean and transform data

Clean up column names

In [6]:
income_df.columns = [s.strip().replace('-', '_') for s in income_df.columns] 
income_df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')

Drop columns that will not be used in our analysis

In [7]:
income_df = income_df.drop(columns=['workclass', 'fnlwgt', 'education', 'education_num', 'occupation', 'relationship', 'race', 'marital_status', 'native_country'])
income_df

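| | age | sex | capital_gain | capital_loss | hours_per_week | income |
|---|---|---|---|---|---|---|
| 0 | 39 | Male | 2174 | 0 | 40 | <=50K |
| 1 | 50 | Male | 0 | 0 | 13 | <=50K |
| 2 | 38 | Male | 0 | 0 | 40 | <=50K |
| 3 | 53 | Male | 0 | 0 | 40 | <=50K |
| 4 | 28 | Female | 0 | 0 | 40 | <=50K |
| ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Female | 0 | 0 | 38 | <=50K |
| 32557 | 40 | Male | 0 | 0 | 40 | >50K |
| 32558 | 58 | Female | 0 | 0 | 40 | <=50K |
| 32559 | 22 | Male | 0 | 0 | 20 | <=50K |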


Check for any need to address missing values

In [8]:
income_df.isnull().sum()

age               0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
income            0
dtype: int64

Encode sex and income (first check to ensure there aren't any typos that could trigger more than two classes being recognized)

In [9]:
print(income_df.sex.unique())
print(income_df.income.unique())

[' Male' ' Female']
[' <=50K' ' >50K']


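Since it looks like there are no typos in these two columns, we will now encode them. Note that each value carries a leading space (for example `' Male'`), which must be matched exactly when categories are listed explicitly.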
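
An optional cleanup, not performed in this notebook, is to strip the leading spaces from the string values before encoding. The sketch below assumes a hypothetical toy frame (`toy_df`) that reproduces the spacing seen in the output above.

```python
import pandas as pd

# Hypothetical toy frame reproducing the leading spaces seen in the real values.
toy_df = pd.DataFrame({
    "sex": [" Male", " Female", " Male"],
    "income": [" <=50K", " >50K", " <=50K"],
})

# Strip surrounding whitespace from each categorical column.
for col in ["sex", "income"]:
    toy_df[col] = toy_df[col].str.strip()

print(toy_df.sex.unique())
print(toy_df.income.unique())
```

If this cleanup were applied, any explicit category list passed to an encoder would have to be written without the leading space as well.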

In [10]:
income_df.sex

0           Male
1           Male
2           Male
3           Male
4         Female
          ...   
32556     Female
32557       Male
32558     Female
32559       Male
32560     Female
Name: sex, Length: 32561, dtype: object

In [11]:
enc = OrdinalEncoder(categories=[[' Male', ' Female']]) 
income_df.sex = enc.fit_transform(income_df[['sex']])
income_df.head(5)

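| | age | sex | capital_gain | capital_loss | hours_per_week | income |
|---|---|---|---|---|---|---|
| 0 | 39 | 0.0 | 2174 | 0 | 40 | <=50K |
| 1 | 50 | 0.0 | 0 | 0 | 13 | <=50K |
| 2 | 38 | 0.0 | 0 | 0 | 40 | <=50K |
| 3 | 53 | 0.0 | 0 | 0 | 40 | <=50K |
| 4 | 28 | 1.0 | 0 | 0 | 40 | <=50K |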


> NOTE: By default, OrdinalEncoder sorts the unique values of each column and maps them to 0, 1, etc. in that sorted order. If you wish to change or control this order, you can specify it in the categories parameter. For an OrdinalEncoder, since we can have multiple columns, we specify the categories for each column by creating a list of lists. In the specific case above, we are only encoding one column, so it is a list with one list inside.
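
The effect of the `categories` parameter can be seen by encoding the same column twice. This is a small sketch on a hypothetical toy frame (`toy_df`); the variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column; values are illustrative only.
toy_df = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# Default (categories='auto'): unique values are sorted, so Female -> 0, Male -> 1.
default_codes = OrdinalEncoder().fit_transform(toy_df[["sex"]]).ravel()

# Explicit order: one inner list per encoded column, so Male -> 0, Female -> 1.
explicit_enc = OrdinalEncoder(categories=[["Male", "Female"]])
explicit_codes = explicit_enc.fit_transform(toy_df[["sex"]]).ravel()

print(default_codes)
print(explicit_codes)
```

After fitting, the mapping actually in use can be inspected through the encoder's `categories_` attribute.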

In [12]:
enc = LabelEncoder()
income_df.income = enc.fit_transform(income_df['income'])
income_df.head(5)

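| | age | sex | capital_gain | capital_loss | hours_per_week | income |
|---|---|---|---|---|---|---|
| 0 | 39 | 0.0 | 2174 | 0 | 40 | 0 |
| 1 | 50 | 0.0 | 0 | 0 | 13 | 0 |
| 2 | 38 | 0.0 | 0 | 0 | 40 | 0 |
| 3 | 53 | 0.0 | 0 | 0 | 40 | 0 |
| 4 | 28 | 1.0 | 0 | 0 | 40 | 0 |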


## Step 5: Partition data into training and test sets

We've decided to utilize a training/test split of the data at 70% training and 30% testing. This split ratio is in line with common practice for small to medium sized datasets, which this data represents. Moreover, we have decided not to do a three-way data split, as we are only testing two models and we wish to allocate as much data as possible to the training and validation steps.
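
As a point of reference, a 70/30 split can be expressed with `test_size=0.3` (equivalently `train_size=0.7`). The sketch below uses a hypothetical toy frame (`toy_df`), and the `stratify` argument is an optional addition, not something used in this notebook, that keeps the class proportions equal in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 200 rows with a 3:1 class imbalance.
toy_df = pd.DataFrame({
    "age": np.arange(200),
    "income": [0] * 150 + [1] * 50,
})

# test_size=0.3 leaves 70% of the rows for training;
# stratify keeps the class mix the same in both parts.
train_toy, test_toy = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["income"], random_state=1
)

print(len(train_toy), len(test_toy))
print(train_toy.income.mean(), test_toy.income.mean())
```

Checking `len()` of both parts right after the split is a cheap way to confirm that the ratio is the one intended.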

In [13]:
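# NOTE: train_size=0.3 assigns 30% of the rows to training (the class counts recorded below reflect this);
# the 70/30 split described above corresponds to train_size=0.7.
train_df, test_df = train_test_split(income_df, train_size=0.3, random_state=random_seed)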

In [14]:
X_train = train_df.drop(columns=['income'])
y_train = train_df.income
X_test = test_df.drop(columns=['income'])
y_test = test_df.income

## Step 6: Address any data imbalances

We will utilize an oversampling technique to address any necessary data balancing.

Let's check the count of each class.

In [15]:
gt_fiftyK_count = (train_df.income==1).sum()
gt_fiftyK_count

2455

In [16]:
lte_fiftyK_count = (train_df.income==0).sum()
lte_fiftyK_count

7313

We find that the <=50K class outnumbers the >50K class by approximately 3 to 1. This is significant enough to warrant a rebalancing.

In [17]:
lte_fiftyK_count-gt_fiftyK_count

4858

In [18]:
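fiftyK_plus_df = train_df.loc[train_df.income==1]
# Sample minority-class rows with replacement until both classes have equal counts
df_oversampled = fiftyK_plus_df.sample(n=lte_fiftyK_count-gt_fiftyK_count, replace=True, random_state=random_seed)

# NOTE: X_train and y_train were created in In [14], before this step, so they do not
# contain the oversampled rows unless they are re-derived from train_df.
train_df = pd.concat([train_df, df_oversampled], ignore_index=True)
train_df.income.value_counts()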

0    7313
1    7313
Name: income, dtype: int64
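
For the oversampled rows to reach the models, the predictors and target have to be taken from the balanced frame. The following is a minimal sketch of that sequence on a hypothetical toy frame (`toy_train`); names such as `balanced_train`, `X_bal` and `y_bal` are illustrative and do not appear in the notebook.

```python
import pandas as pd

random_seed = 1

# Hypothetical toy training frame: six majority rows, two minority rows.
toy_train = pd.DataFrame({
    "age": [25, 32, 41, 38, 29, 50, 45, 52],
    "income": [0, 0, 0, 0, 0, 0, 1, 1],
})

counts = toy_train.income.value_counts()
n_extra = int(counts.loc[0] - counts.loc[1])  # minority rows needed to reach parity

# Sample minority rows with replacement; random_state makes the draw repeatable.
minority = toy_train.loc[toy_train.income == 1]
extra_rows = minority.sample(n=n_extra, replace=True, random_state=random_seed)
balanced_train = pd.concat([toy_train, extra_rows], ignore_index=True)

# Rebuild predictors and target from the balanced frame so a model sees the new rows.
X_bal = balanced_train.drop(columns=["income"])
y_bal = balanced_train.income

print(y_bal.value_counts())
```

Only the training data is resampled here; the test set keeps its original class mix so that the reported metrics reflect the real distribution.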

## Step 7: Train our models

### Train a default decision tree

Since a decision tree is not sensitive to differences in scale, we do not need to rescale our variables.
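
This property can be illustrated on synthetic data: a tree fitted on standardized predictors makes the same splits, in terms of which rows go left or right, as one fitted on the raw values. The sketch below is a toy demonstration under that assumption, using made-up data and hypothetical names (`raw_tree`, `scaled_tree`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two predictors on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) * np.array([1.0, 100.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

# Standardizing is a monotonic transform, so the ordering of values is unchanged.
X_scaled = StandardScaler().fit_transform(X)

raw_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_scaled, y)

same_predictions = np.array_equal(raw_tree.predict(X), scaled_tree.predict(X_scaled))
print("identical predictions:", same_predictions)
```

Because tree splits depend only on the ordering of each predictor's values, rescaling changes the thresholds but not the resulting partitions.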

In [19]:
dtree=DecisionTreeClassifier(random_state=1)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print('Accuracy', accuracy_score(y_test, y_pred))
print('Precision', precision_score(y_test, y_pred))
print('Recall', recall_score(y_test, y_pred))
print('F1 Score', f1_score(y_test, y_pred))


Accuracy 0.806870530426008
Precision 0.6513846153846153
Recall 0.3930560712959525
F1 Score 0.49027327466419635


### Train a k-NN model 

Since we know the k-NN models are very sensitive to differences in scale, we will rescale our variables before fitting the model.
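
The effect of scale on k-NN can be seen on synthetic data where the informative predictor has a small range and an uninformative one has a very large range. This is a toy sketch with made-up data; `make_pipeline` is used here as one convenient way to bundle scaling with the model, which is an alternative to the manual scaling done in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 'signal' determines the class, 'noise' is irrelevant
# but has a far larger scale, so it dominates unscaled Euclidean distances.
rng = np.random.default_rng(1)
signal = rng.normal(size=300)
noise = rng.normal(scale=10_000.0, size=300)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_fit, y_fit, X_new, y_new = X[:200], y[:200], X[200:], y[200:]

unscaled_knn = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_fit, y_fit)

unscaled_acc = unscaled_knn.score(X_new, y_new)
scaled_acc = scaled_knn.score(X_new, y_new)
print("unscaled accuracy:", unscaled_acc)
print("scaled accuracy:  ", scaled_acc)
```

In both approaches the scaler is fitted on the training rows only and then applied to new rows, which avoids leaking information from the held-out data.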

In [20]:
# create a standard scaler and fit it to the training set of predictors
scaler = StandardScaler()
scaler.fit(X_train)

# Transform the predictors of the training and test sets
X_train = scaler.transform(X_train) # X_train is now a NumPy array

X_test = scaler.transform(X_test)


In [21]:
results = []
for k in range(1,int(len(y_train)**0.5),2):
    knn = KNeighborsClassifier(n_neighbors=k,  metric='euclidean')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    results.append ({
        'k': k,
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred)
    })

results_df = pd.DataFrame(results)

Best accuracy score...

In [22]:
results_df.loc[[results_df.accuracy.idxmax()]]

| | k | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|
| 24 | 49 | 0.812179 | 0.673688 | 0.397883 | 0.500292 |


Best precision score...

In [23]:
results_df.loc[[results_df.precision.idxmax()]]

| | k | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|
| 29 | 59 | 0.811916 | 0.677201 | 0.3899 | 0.494875 |


Best recall score...

In [24]:
results_df.loc[[results_df.recall.idxmax()]]

| | k | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|
| 0 | 1 | 0.735138 | 0.445669 | 0.49573 | 0.469368 |


Best f1 score...

In [25]:
results_df.loc[[results_df.f1.idxmax()]]

| | k | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|
| 9 | 19 | 0.80845 | 0.637763 | 0.438359 | 0.519586 |
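
The search above scores each k on the test set. An alternative, offered only as a sketch and not as the method used in this notebook, is to choose k by cross-validation on the training data so that the test set is touched once at the end. The example assumes synthetic data from `make_classification` and hypothetical step names (`scale`, `knn`).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced toy data standing in for the training set.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, weights=[0.75, 0.25], random_state=1
)

# Same candidate grid as the loop above: odd k from 1 up to sqrt(n).
max_k = int(len(y_toy) ** 0.5)
param_grid = {"knn__n_neighbors": list(range(1, max_k, 2))}

# Scaling lives inside the pipeline, so it is re-fitted within each CV fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(metric="euclidean")),
])
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_toy, y_toy)

best_k = search.best_params_["knn__n_neighbors"]
print("best k:", best_k, "mean CV f1:", round(search.best_score_, 3))
```

The fitted `search` object can then be evaluated once on the held-out test rows, which gives a less optimistic estimate than picking k by its test score.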


## Step 8: Discussion of Results and Conclusion

Our objective was to create a binary classifier that predicts whether a person's income is >50K or not. We applied two modeling techniques to accomplish this. First, we fit a k-NN model for each odd value of k from 1 up to the square root of the number of training observations. Second, we fit a decision tree to the data.

As discussed in Step 2, we do not find reason to identify significant differences in the cost/benefit of a FN over a FP (and vice versa). Therefore, our best metric choices for determining the best performing model based on this criterion would be accuracy or F1 score. Since the data has a significant imbalance (with one class outnumbering the other by approximately 3 to 1), accuracy can look high simply by favoring the majority class, which indicates that our best metric to use is F1 score.
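
As a quick check on how the F1 score relates to the other reported metrics, it can be recomputed as the harmonic mean of precision and recall. The sketch below uses the precision and recall values reported above; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported above for each model.
knn_f1 = f1_from(0.637763, 0.438359)                    # k-NN at k=19
tree_f1 = f1_from(0.6513846153846153, 0.3930560712959525)  # decision tree

print(round(knn_f1, 4), round(tree_f1, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, the relatively low recall of both models is what holds their F1 scores near 0.5.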

Of our two models, the decision tree produced an F1 score of 0.49, and the k-NN model with the best F1 score was at k=19, with a value of 0.52.

Considering this, the best model we have produced is a k-NN model at k=19 with an F1 score of 0.52 (at this value, accuracy was 0.81, precision was 0.64, and recall was 0.44).

Since these models scored rather poorly, we recommend further exploration of different models to develop a better predictive model. 