- University: University of São Paulo (USP) 

- Class: PMR3508 (2021) - Fundamentals of Machine Learning

- Kaggle Competition: Adult

# Table of contents
1. [Setup and imports](#Setup-and-imports)
    1. [Libraries](##Libraries)
    2. [Setup](##Setup)
2. [EDA (Exploratory Data Analysis)](#EDA(Exploratory-Data-Analysis))
    1. [Glance at data](##Glance-at-data)
        1. [Train dataset](###Train-dataset)
        2. [Test dataset](###Test-dataset)
    2. [Summary statistics](##Summary-statistics)
    3. [Target histogram](##Target-histogram)
    4. [Non zero counts](##Non-zero-counts)
    5. [Empirical distribution of features](##Empirical-distribution-of-features)
        1. [Train dataset](###Train-dataset)
            1. [Histograms of numerical features](####Histograms-of-numerical-features)
            2. [Bar plots for categorical features](####Barplots-for-categorical-features)
        2. [Test dataset](###Test-dataset)
            1. [Histograms of numerical features](####Histograms-of-numerical-features)
            2. [Bar plots for categorical features](####Barplots-for-categorical-features)
        3. [Plots of target vs features](###Plots-of-target-vs-features)
            1. [Numerical features](####Numerical-features)
            2. [Categorical features](####Categorical-features)
        4. [Pairwise plots](###Pairwise-plots)
            1. [Scatter plot](####Numerical-vs-numerical)
            2. [Correlation heatmap](####Correlation-heatmap)
            3. [Categorical heatmap](####Categorical-heatmap)
3. [Data engineering](#Data-engineering)
    1. [Divide dataset into numerical and categorical subdatasets](##Divide-dataset-into-numerical-and-categorical-subdatasets)
    2. [Normalize features](##Normalize-features)
    3. [Treat categorical features](##Treat-categorical-features)
    4. [Joining numerical and categorical dfs back](##Joining-numerical-and-categorical-dfs-back)
    5. [Treat missing values](##Treat-missing-values)
    6. [Treat outliers](##Treat-outliers)
    7. [Feature transformations](##Feature-transformations)
    8. [Mirror on test dataset](##Mirror-on-test-dataset)
4. [Feature Engineering](#Feature-Engineering)
    1. [Importance sampling](##Importance-sampling)
    2. [Select features](##Select-features)
    3. [Create new features](##Create-new-features)
5. [Experiments](#Experiments)
    1. [Base dataset](##Base-dataset)
    2. [Baseline (KNN)](##Baseline-(KNN))
    3. [4 Classifiers](##4-Classifiers)
        1. [RF](###RF)
        2. [XGBoost](###XGBoost)
        3. [SVM](###SVM)
        4. [NN](###NN)
    4. [Engineered datasets](##Engineered-datasets)
6. [Final model](#Final-model)
7. [Submission](#Submission)

# Setup and imports
### Setup environment and import libraries.

## Libraries

In [None]:
import sys
import copy

# Just because I have a conflicting Python 3.6 installation at root
# Comment this when uploading to kaggle
# try:
#     sys.path.remove('C:/Python36/Lib/site-packages')
#     sys.path.remove('C:/Python36/Lib')
# except:
#     print("py36 not influencing")

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing

from sklearn.model_selection import RandomizedSearchCV

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

## Setup

In [None]:
%matplotlib inline
sns.set()

# Loading data
# adultTrain = pd.read_csv(
#     "C:/Users/bruno/Desktop/kaggle-adult-comp-knn/data/train_data.csv",
#     sep=r'\s*,\s*',
#     engine='python',
#     na_values="?",
# )

# # For uploading to kaggle
adultTrain = pd.read_csv(
    "/kaggle/input/adult-pmr3508/train_data.csv",
    sep=r'\s*,\s*',
    engine='python',
    na_values="?",
)

# adultTest = pd.read_csv(
#     "C:/Users/bruno/Desktop/kaggle-adult-comp-knn/data/test_data.csv",
#     sep=r'\s*,\s*',
#     engine='python',
#     na_values="?",
# )

# # For uploading to kaggle
adultTest = pd.read_csv(
    "/kaggle/input/adult-pmr3508/test_data.csv",
    sep=r'\s*,\s*',
    engine='python',
    na_values="?",
)

modifyNames = {
    "fnlwgt": "weight",
    "education.num": "educationNum", 
    "marital.status": "maritalStatus",
    "capital.gain": "capitalGain", 
    "capital.loss": "capitalLoss",
    "hours.per.week": "hoursPerWeek", 
    "native.country": "country",
    "income": "target"
}

# Changing columns names
adultTrain.rename(columns=modifyNames, inplace=True)
adultTest.rename(columns=modifyNames, inplace=True)

# Casting appropriate datatypes
dtypes = {
    "age": int,
    "workclass": str,
    "weight": int,             
    "education": str,
    "educationNum": int,
    "maritalStatus": str,
    "occupation": str,
    "relationship": str,
    "race": str,
    "sex": str,
    "capitalGain": int,
    "capitalLoss": int,
    "hoursPerWeek": int,
    "country": str,
    "target": str
}

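# astype returns a new DataFrame, so the result has to be assigned back.
# Only the numeric columns are cast: depending on the pandas version, casting the
# string columns to str can turn NaN into the text "nan" and hide the missing values.
numDtypes = {col: dtype for col, dtype in dtypes.items() if dtype is int}
adultTrain = adultTrain.astype(numDtypes)
adultTest = adultTest.astype(numDtypes)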

# Id is not relevant, so it is dropped
adultTrain.pop("Id")
idTest = adultTest.pop("Id")

# weight is not important for testing
weightTrain = adultTrain["weight"]
adultTest.pop("weight")

print("\n\n#### TRAIN DATASET ####")
# (32560, 16)
print('\nshape: ', adultTrain.shape)
# all as objects, need to change some datatypes
print('\ndata types:\n', adultTrain.dtypes)
# max of 4000 datapoints with some nan entry -> treat them
print('\nNumber of null entries:\n', adultTrain.isnull().sum())
# No duplicated data points
print('\nDuplicated data points:\n', adultTrain.duplicated().sum()) 

print("\n\n#### TEST DATASET ####")
# (16280, 15)
print('\nshape: ', adultTest.shape)
# all as objects, need to change some datatypes
print('\ndata types:\n', adultTest.dtypes)
# max of approx. 2000 datapoints with some nan entry -> treat them
print('\nNumber of null entries:\n', adultTest.isnull().sum())
# No duplicated data points
print('\nDuplicated data points:\n', adultTest.duplicated().sum()) 
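
A note on the dtype casting in the cell above: `DataFrame.astype` returns a new frame instead of modifying in place. The sketch below shows this on a made-up toy frame (the values are illustrative, not taken from the competition data).

```python
import pandas as pd

# Toy frame standing in for the real data (made-up values).
toy = pd.DataFrame({"age": [39.0, 50.0], "hoursPerWeek": [40.0, 13.0]})

# The cast result is discarded here, so toy keeps its float column.
toy.astype({"age": int})
unchanged = toy["age"].dtype

# Assigning the result back is what actually changes the dtype.
toy = toy.astype({"age": int})
changed = toy["age"].dtype

print(unchanged, changed)
```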

# EDA (Exploratory Data Analysis)
### Get to know data and draw insights on the problem of classifying income as > 50K.

## Glance at data

### Train dataset

In [None]:
# education can be dropped, since educationNum already gives all the information we want
# there is nothing specific about a certain degree that will affect the target
adultTrain.head(20)

### Test dataset

In [None]:
adultTest.head(10)

## Summary statistics

In [None]:
adultTrain.describe()

## Target histogram

In [None]:
# approx. 25 000 datapoints <=50K and 7 500 >50K -> relatively imbalanced dataset
# simplest baseline is always predicting <=50K -> gives approx. 76% accuracy
counts = adultTrain["target"].value_counts().values
imbalanceRatio = counts[0]/counts[1]
print(imbalanceRatio)
adultTrain["target"].value_counts().plot(kind="bar")
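
The majority-class baseline mentioned in the comments can be checked with a few lines. This is a minimal sketch that uses the rounded counts from the comment (25 000 and 7 500) as hypothetical inputs, not the real label column.

```python
import pandas as pd

# Hypothetical labels built from the rounded counts in the comment above.
target = pd.Series(["<=50K"] * 25000 + [">50K"] * 7500)

counts = target.value_counts()
# Accuracy obtained by always predicting the most frequent class.
majority_accuracy = counts.max() / counts.sum()
# Ratio between the majority and minority class sizes.
imbalance_ratio = counts.max() / counts.min()

print(round(majority_accuracy, 3), round(imbalance_ratio, 2))
```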


## Non zero counts

In [None]:
# capitalGain and capitalLoss have very few non-zero examples
# ideas
    # 1. exclude these features
    # 2. cluster them in two bins -> will become boolean variables
print(adultTrain.astype(bool).sum(axis=0))

## Plot empirical distribution of each feature

### Train dataset

#### Histograms of numerical features

In [None]:
# hoursPerWeek could be divided into three bins: <30, 30-50, >50
# educationNum could be divided into four bins: <8, 8-10, 10-12, >13
# capitalGain and capitalLoss actually need to form only one feature,
# that is capitalLiquid = capitalGain - capitalLoss.
# The effect of this feature will be almost that of an imbalanced binary variable, since almost all values are zero
# and the others are in a small range
adultTrain.hist(bins=30, figsize=(15, 10))
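
The binning and the `capitalLiquid` idea from the comments could look roughly like the sketch below. The toy values are made up, and how the bin edges at 30 and 50 are assigned is an assumption, since the comments do not say which side the boundaries belong to.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column names as the notebook.
toy = pd.DataFrame({
    "hoursPerWeek": [20, 40, 60],
    "capitalGain": [0, 5000, 0],
    "capitalLoss": [0, 0, 1500],
})

# Three bins: <30, 30-50, >50 (intervals are closed on the right,
# so 29 falls in "<30" and 50 falls in "30-50"; this is an assumption).
toy["hoursBin"] = pd.cut(
    toy["hoursPerWeek"],
    bins=[-np.inf, 29, 50, np.inf],
    labels=["<30", "30-50", ">50"],
)

# Single feature combining gains and losses.
toy["capitalLiquid"] = toy["capitalGain"] - toy["capitalLoss"]

print(toy)
```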

#### Bar plots for categorical features

In [None]:
# Private is much larger than the rest (therefore the other classes have little data)
# Without-pay and Never-worked have very few examples, but these examples practically guarantee we know the target
# Ideas:
    # 1. Cluster into 3 bins: Private, {Without-pay + Never-worked}, and rest ->
    # but need to see if Private and rest have distinct relationships with the target
# isin matches both categories; ("Without-pay" or "Never-worked") evaluates to "Without-pay" only
print('"Without-pay" or "Never-worked" datapoints: ', adultTrain["workclass"].isin(["Without-pay", "Never-worked"]).sum())
adultTrain["workclass"].value_counts().plot(kind="bar")
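
When counting rows that belong to one of several categories, note that Python's `or` between two non-empty strings returns the first string, so comparing against `("Without-pay" or "Never-worked")` only matches `"Without-pay"`. A small sketch on a made-up series shows the difference to `isin`:

```python
import pandas as pd

# Made-up workclass values for illustration.
workclass = pd.Series(["Private", "Without-pay", "Never-worked", "Private"])

# The parenthesised expression evaluates to "Without-pay" before the comparison.
only_first = int((workclass == ("Without-pay" or "Never-worked")).sum())

# isin tests membership in the whole list.
both = int(workclass.isin(["Without-pay", "Never-worked"]).sum())

print(only_first, both)
```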

In [None]:
# This feature will be excluded, educationNum already gives us the info we need. There is nothing specific to a 
# certain category that would be relevant for predicting the target
adultTrain["education"].value_counts().plot(kind="bar")

In [None]:
# A priori I would think that only having a present spouse or not is important
# So this could be clustered into two groups: present spouse and no present spouse
adultTrain["maritalStatus"].value_counts().plot(kind="bar")

In [None]:
# Each of the categories seem to be very important
adultTrain["occupation"].value_counts().plot(kind="bar")

In [None]:
# This feature seems a little weird: it doesn't provide much new info,
# and the categories don't seem to be mutually exclusive
# Idea: exclude this feature
adultTrain["relationship"].value_counts().plot(kind="bar")

In [None]:
# this could be divided into two bins: white and black
# because the rest have little data, and my guess is that they would be very similar to white
adultTrain["race"].value_counts().plot(kind="bar")

In [None]:
# a priori seems to be important
adultTrain["sex"].value_counts().plot(kind="bar")

In [None]:
# Surely maintaining all these low-data categories will fit statistical noise and hurt the accuracy
# Ideas: 
    # 1. divide in two bins: developed and not developed countries
    # 2. divide in two bins: USA and rest
adultTrain["country"].value_counts().plot(kind="bar")
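
Idea 2 (USA versus rest) could be sketched as below. The series is made up, the label `"Other"` is a hypothetical name, and keeping missing values as missing is a choice of this sketch rather than something the notebook specifies.

```python
import pandas as pd

# Made-up country values, including one missing entry.
country = pd.Series(["United-States", "Mexico", "United-States", None, "India"])

# Keep "United-States" and missing values as they are; map everything else to "Other".
keep = country.isna() | (country == "United-States")
grouped = country.where(keep, "Other")

print(grouped.value_counts(dropna=False))
```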

### Test dataset

#### Histograms of numerical features

In [None]:
# no surprises here
adultTest.hist(bins=30, figsize=(15, 10))

#### Bar plots for categorical features

In [None]:
# no surprises here
adultTest["workclass"].value_counts().plot(kind="bar")

In [None]:
# no surprises here
adultTest["maritalStatus"].value_counts().plot(kind="bar")

In [None]:
# no surprises here
adultTest["occupation"].value_counts().plot(kind="bar")

In [None]:
# no surprises here
adultTest["race"].value_counts().plot(kind="bar")

In [None]:
# no surprises here
adultTest["country"].value_counts().plot(kind="bar")

## Plots of target vs features

### Numerical features

In [None]:
## OBS: the plots below don't account for the dataset imbalance; therefore, all ratios are essentially multiplied
# by a factor of imbalanceRatio (printed above) in favour of <=50K.

In [None]:
# for <30 it is almost certain that wage is <=50K; 30-40 roughly the same; for 40-50, >50K has a good advantage
# <=50K decays linearly with age, while >50K looks like a normal distribution centered at 43
sns.catplot(x="target", y="age", kind="violin", inner=None, data=adultTrain)

In [None]:
# for <10 <50K has a good advantage; 10-12.5 same; >12.5 >50K has very good advantage
sns.catplot(x="target", y="educationNum", kind="violin", inner=None, data=adultTrain)

In [None]:
# for <40 <50K has a good advantage; for >40 >50K has a good advantage
sns.catplot(x="target", y="hoursPerWeek", kind="violin", inner=None, data=adultTrain)

In [None]:
sns.catplot(x="target", y="capitalGain", kind="violin", inner=None, data=adultTrain)

In [None]:
sns.catplot(x="target", y="capitalLoss", kind="violin", inner=None, data=adultTrain)

### Categorical features

In [None]:
# OBS: I am multiplying the counts of >50K by imbalanceRatio to decouple
# the fact that the dataset is imbalanced from differences in the distribution of the feature
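
An alternative way to remove the class imbalance from such comparisons is to look at the share of each category within each target class. This is a sketch of a different normalization than the multiplication used in the cells below, shown on a made-up frame:

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "target": ["<=50K", "<=50K", "<=50K", ">50K"],
    "race": ["White", "Black", "White", "White"],
})

# normalize="index" divides each row by its total, so every target class sums to 1.
shares = pd.crosstab(toy["target"], toy["race"], normalize="index")

print(shares)
```

With this table, `shares.plot(kind="bar", stacked=True)` would give bars of equal height per class.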

In [None]:
# Private & Self-emp-inc differ a little, the rest are roughly the same, so they can be grouped
# into a single category called other
countsDf = adultTrain[["target","workclass"]].value_counts().unstack()
countsDf.loc[">50K", :] = countsDf.loc[">50K", :]*imbalanceRatio
countsDf.plot(kind="bar", stacked=True,  figsize=(10, 7))

In [None]:
# all categories are different, thus maintaining all of them seems the way to go
countsDf = adultTrain[["target","maritalStatus"]].value_counts().unstack()
countsDf.loc[">50K", :] = countsDf.loc[">50K", :]*imbalanceRatio
countsDf.plot(kind="bar", stacked=True,  figsize=(10, 7))

In [None]:
# Transport-moving, Tech-support, Sales and Craft-repair don't seem to help distinguish, so they could be grouped
# into a single category named rest
countsDf = adultTrain[["target","occupation"]].value_counts().unstack()
countsDf.loc[">50K", :] = countsDf.loc[">50K", :]*imbalanceRatio
countsDf.plot(kind="bar", stacked=True,  figsize=(10, 7))

In [None]:
# black and other diminish for >50K but white dominates in both
# I think grouping into white and non-white is a valid approach here
countsDf = adultTrain[["target","race"]].value_counts().unstack()
countsDf.loc[">50K", :] = countsDf.loc[">50K", :]*imbalanceRatio
countsDf.plot(kind="bar", stacked=True,  figsize=(10, 7))

In [None]:
# I think this can be mexico and non-mexico because the rest of the categories have so little data
# that it is likely that we are fitting statistical noise
countsDf = adultTrain[["target","country"]].value_counts().unstack()
countsDf.loc[">50K", :] = countsDf.loc[">50K", :]*imbalanceRatio
countsDf.plot(kind="bar", stacked=True,  figsize=(15, 10))

## Pairwise plots

### Numerical vs numerical

In [None]:
# age limits >50K even with high education and hours per week
# age < 35 seems to be a good indicator -> could maybe be a binary variable

# capitalGain > approx. 5 000 seems to be a great separator 
# capitalGain > 50 000 guarantees >50K 
# could be categorical variable

# educationNum > 10 seems to be good indicator also
 
# 1 000 < capital loss < 3 000 can be good 

# hours per week < 50 good
sns.pairplot(adultTrain, hue="target")

## Correlation plot

In [None]:
# all numerical features with very low correlation
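# restrict to numeric columns: recent pandas versions raise an error when corr() meets string columns
sns.heatmap(adultTrain.select_dtypes(include="number").corr())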

## Categorical heatmap

In [None]:
adultTrainDummies = pd.get_dummies(adultTrain[["workclass", "maritalStatus", "occupation", "race", "country"]])
dummy_features = adultTrainDummies.columns.values
pivots = []
for feature in dummy_features:
    rest_of_features = dummy_features[dummy_features != feature]
    new_pivot = adultTrainDummies.groupby(feature)[rest_of_features].sum().fillna(0)
    pivots.append(new_pivot)

fullPivot = pd.concat(pivots)[dummy_features]
fullPivotOnes = fullPivot.iloc[lambda x: x.index > 0]
fullPivotOnes.set_index(adultTrainDummies.columns, inplace=True)

def normalize_pivot_tables(fullPivot):
    # Row totals of the co-occurrence counts, one per dummy feature
    vec = fullPivot.sum(axis=1).to_numpy()
    # normMatrix[i, j] = vec[i] + vec[j], built by broadcasting instead of a double loop
    normMatrix = vec[:, None] + vec[None, :]
    normDf = pd.DataFrame(normMatrix, index=fullPivot.columns, columns=fullPivot.columns)
    fullPivotNorm = fullPivot.div(normDf)
    return fullPivotNorm

fullPivotNorm = normalize_pivot_tables(fullPivotOnes) # P(X1 = 1, X2 = 1)

# dataset is too big, so will divide in two for plotting heatmaps
#fullPivot2 = fullPivot.iloc[37:, :37] # down left -> not useful
#fullPivot4 = fullPivot.iloc[37:, 37:] # down right -> country vs country -> not useful
fullPivotNorm1 = fullPivotNorm.iloc[:37, :37] # top left
fullPivotNorm3 = fullPivotNorm.iloc[:37, 37:] # top right

# OBS: dummy features with the same prefix are mutually exclusive,
# therefore they will have joint prob equal to zero 

# max joint probability is approx. 0.1 over all categorical combinations,
# therefore all categorical features are relatively independent from each other


In [None]:
fullPivotNorm.describe()

In [None]:
_, ax = plt.subplots(figsize=(10,7))
sns.heatmap(fullPivotNorm1,ax=ax)

In [None]:
_, ax = plt.subplots(figsize=(10,7))
sns.heatmap(fullPivotNorm3, ax=ax)
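
For comparison, pairwise co-occurrence of dummy columns can also be computed with a single matrix product. The sketch below is a simpler variant than the pivot-based code above, on a made-up frame: it divides by the number of rows, so each entry is an empirical estimate of P(X1 = 1, X2 = 1), which is not the same normalization as `normalize_pivot_tables`.

```python
import pandas as pd

# Made-up categorical rows for illustration.
toy = pd.DataFrame({
    "race": ["White", "White", "Black"],
    "sex": ["Male", "Female", "Male"],
})

# One-hot encode and cast to int so the product counts rows.
dummies = pd.get_dummies(toy).astype(int)

# Entry (i, j) counts the rows where dummy i and dummy j are both 1.
cooccurrence = dummies.T @ dummies

# Dividing by the number of rows gives an empirical joint probability.
joint_prob = cooccurrence / len(dummies)

print(joint_prob)
```

Dummies that share a prefix come out with a joint probability of zero off the diagonal, as noted above.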

# Data engineering
### Prepare data for algorithm.

## Divide dataset into numerical and categorical subdatasets

In [None]:
numColumns = ["age", "capitalGain", "capitalLoss", "educationNum", "hoursPerWeek"] # obs: left weight out
catColumns = ["country", "education", "maritalStatus", "occupation", "race", "relationship", "sex", "workclass"] # obs: left target out
targetTrain = adultTrain["target"]
adultTrainNum = adultTrain[numColumns]
adultTrainCat = adultTrain[catColumns]

In [None]:
adultTrainNum.head()

In [None]:
adultTrainCat.head()

## Normalize features

In [None]:
adultTrainNum = (adultTrainNum-adultTrainNum.mean())/adultTrainNum.std()
adultTrainNum.head()

## Treat categorical features

In [None]:
# Target-encoding 
# encoder = TargetEncoder()
# encoder.fit_transform(adultTrainCat, adultTrain["target"])
# Simple one-hot encoding (this will be chosen one for now)
adultTrainCat = pd.get_dummies(adultTrainCat)
adultTrainCat.head()

## Joining numerical and categorical dfs back

In [None]:
adultTrain = pd.concat([adultTrainNum, adultTrainCat], axis=1)
adultTrain.head()

## Treat missing values

In [None]:
# Two main options
# 1. Just throw away rows with missing values
# 2. Replace with mean of column (this will be the chosen one for now)
adultTrain.fillna(adultTrain.mean(), inplace=True)
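
One detail worth checking for the categorical columns: `pd.get_dummies` ignores NaN by default, so a row with a missing category becomes all zeros in that block of dummies. The sketch below, on a made-up column, shows that behaviour and two possible alternatives (an explicit missing indicator via `dummy_na=True`, or filling with the most frequent value). Neither alternative is what the notebook does; they are illustrations only.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with one missing value.
cat = pd.DataFrame({"workclass": ["Private", np.nan, "State-gov"]})

# Default: the NaN row gets zeros in every workclass_* column.
plain = pd.get_dummies(cat)

# Alternative 1: keep missingness as its own indicator column.
with_na = pd.get_dummies(cat, dummy_na=True)

# Alternative 2: fill with the most frequent value before encoding.
mode_filled = cat.fillna(cat.mode().iloc[0])

print(plain, with_na, mode_filled, sep="\n\n")
```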

## Treat outliers

In [None]:
# todo later

## Feature transformations

In [None]:
# todo later

## Mirror on test dataset

In [None]:
adultTestNum = adultTest[numColumns]
adultTestCat = adultTest[catColumns]

# NOTE: this uses the test set's own mean/std rather than the statistics computed on train
adultTestNum = (adultTestNum-adultTestNum.mean())/adultTestNum.std() # broadcasts to columns by default

adultTestCat = pd.get_dummies(adultTestCat)
adultTestCat = adultTestCat.reindex(columns = adultTrainCat.columns, fill_value=0) # aligns the test columns with the train dummies

adultTest = pd.concat([adultTestNum, adultTestCat], axis=1)

adultTest.fillna(adultTest.mean(), inplace=True)

adultTest.head()
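
A common variant of this mirroring step standardizes the test set with the mean and standard deviation computed on the train set, so both sets go through the exact same transformation. The sketch below illustrates it on made-up numbers; the names `train_num` and `test_num` are hypothetical stand-ins for the raw (not yet normalized) numerical frames.

```python
import pandas as pd

# Made-up raw numerical data.
train_num = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
test_num = pd.DataFrame({"age": [30.0, 50.0]})

# Statistics are computed once, on train only.
train_mean = train_num.mean()
train_std = train_num.std()

# The same statistics are applied to both sets.
train_scaled = (train_num - train_mean) / train_std
test_scaled = (test_num - train_mean) / train_std

print(test_scaled)
```

This requires keeping the train statistics before the train frame is overwritten by its normalized version.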

# Feature Engineering
### Select and/or create new features. Non-linear transformations have a larger effect on KNN performance.

In [None]:
# Promising Engineered Datasets (v0)
# Dict which will hold different feature-engineered candidate datasets
promisingDatasets = {}

## Importance sampling

In [None]:
############### NOTEBOOK WAS TOO SLOW WITH THIS, DECIDED TO COMMENT OUT ###################

# # with the given weights, the rows can be resampled according to their weight
# # number of rows = weight_factor*minMaxNormalized_weight where the weight factor is large enough so that the sampling
# # can give different integer values for most of the rows, 
# # but if it is too large becomes computationally heavy
# weightTrainNorm = ((weightTrain) - weightTrain.min())/(weightTrain.max() - weightTrain.min())
# weightFactor = 50 # (can be altered later)
# knnNeighboursFactor = weightTrainNorm.mean()*weightFactor # expected number of replicated rows added per original row
# print('knnNeighboursFactor:', knnNeighboursFactor)

# # adultTrain70Importance will contain replicas only of its own rows and will be used to train KNN,
# # which will be tested on adultTrain30Importance, which wasn't replicated
# adultTrainShuffled = adultTrain.sample(frac=1)
# adultTrain70Importance, adultTrain30Importance = \
#     np.split(adultTrainShuffled, [int(.7*len(adultTrain))])
# adultTrain30ImportanceCopy = adultTrain30Importance.copy()
# # putting back target I removed earlier
# adultTrain70Importance["target"] = targetTrain[:len(adultTrain70Importance)]

# for idx, row in adultTrain70Importance.iterrows():
#     numReplicatedRows = int(weightTrainNorm[idx]*weightFactor)
#     df = row.to_frame().T
#     adultTrain70Importance = adultTrain70Importance.append([df]*numReplicatedRows, ignore_index=True)
    
# for idx, row in adultTrain30ImportanceCopy.iterrows():
#     numReplicatedRows = int(weightTrainNorm[idx]*weightFactor)
#     df = row.to_frame().T
#     adultTrain30ImportanceCopy = adultTrain30ImportanceCopy.append([df]*numReplicatedRows, ignore_index=True)

# #### IMPORTANT: this is 100% of the actual training dataset -> only used if survived CV
# adultTrainImportance = pd.concat([adultTrain70Importance, adultTrain30ImportanceCopy])
# promisingDatasets["importanceSampling"] = adultTrainImportance
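
The row replication above is slow mainly because it appends inside a Python loop. A vectorized sketch of the same idea (one original row plus `int(normalized_weight * weight_factor)` replicas) is given below on made-up data; `weight_factor = 4` and the toy weights are hypothetical values chosen for illustration.

```python
import pandas as pd

# Made-up rows and sampling weights sharing the same index.
toy = pd.DataFrame({"age": [25, 40, 60]}, index=[10, 11, 12])
weight = pd.Series([100.0, 300.0, 500.0], index=toy.index)

weight_factor = 4  # hypothetical value

# Min-max normalize the weights to [0, 1].
weight_norm = (weight - weight.min()) / (weight.max() - weight.min())

# Number of extra copies for each row.
n_copies = (weight_norm * weight_factor).astype(int)

# Repeat each index label (1 original + n_copies replicas) and select in one step.
repeated_index = toy.index.repeat((1 + n_copies).to_numpy())
replicated = toy.loc[repeated_index].reset_index(drop=True)

print(replicated["age"].value_counts().sort_index())
```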

## Select features

In [None]:
# todo later

## Create new features

In [None]:
# todo later

# Experiments
### Tune and compare 4 different classifiers. These are: tree ensembles (RF and XGBoost), SVM and NN.

In [None]:
# Label encoder
le = preprocessing.LabelEncoder()
# Test data
Xtest = adultTest.values

SEED = 10

## Base dataset

In [None]:
#### Baseline dataset
Xtrain = adultTrain.values
Ytrain = le.fit_transform(targetTrain)

# shape check
print(Xtrain.shape)
print(Xtest.shape)
print(Ytrain.shape)

## Baseline (KNN)

In [None]:
baselineKnn = KNeighborsClassifier()
baselineKnnAcc = cross_val_score(baselineKnn, Xtrain, Ytrain, cv=5, scoring='accuracy')
baselineKnnAccMean = baselineKnnAcc.mean()
print('mean accuracy for baseline knn: ', baselineKnnAccMean)

baselineKnn.fit(Xtrain, Ytrain)

currentBestModel = {
    'model': baselineKnn,
    'modelFamily': 'KNN',
    'cv': baselineKnnAccMean, 
    'X': Xtrain,
    'Y': Ytrain
}

## 4 Classifiers

### RF

In [None]:
rf = RandomForestClassifier(random_state=SEED)

rfConfig = {
    'n_estimators': np.arange(10, 50),
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(5, 50)
}

rfRandomSearch = (
    RandomizedSearchCV(
        rf, 
        rfConfig, 
        verbose=False, 
        scoring='accuracy', 
        cv=5, 
        n_iter=3, 
        n_jobs=-1, # all cores
        random_state=SEED
    )
)

rfRandomSearch.fit(Xtrain, Ytrain)

rfMean = rfRandomSearch.best_score_
print('mean accuracy for rf: ', rfMean)

rfTuned = rfRandomSearch.best_estimator_
print('tuned RF: ', rfTuned)

if rfMean > currentBestModel['cv']:
    currentBestModel = {
        'model': rfTuned,
        'modelFamily': 'RF',
        'cv': rfMean
    }

### XGBoost

In [None]:
xgb = XGBClassifier(random_state=SEED, use_label_encoder=False)

xgbConfig = {
    'n_estimators': np.arange(10, 50),
    'learning_rate': np.logspace(-3, 0, 20), # np.arange(1e-3, 1) would yield the single value 0.001
    'max_depth': np.arange(5, 50),
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
    'reg_lambda': [1e-5, 1e-2, 0.1, 1, 100]
}

xgbRandomSearch = (
    RandomizedSearchCV(
        xgb, 
        xgbConfig, 
        verbose=False, 
        cv=5, 
        n_iter=3, 
        n_jobs=-1, # all cores
        random_state=SEED
    )
)

xgbRandomSearch.fit(Xtrain, Ytrain)

xgbMean = xgbRandomSearch.best_score_
print('mean accuracy for xgb: ', xgbMean)

xgbTuned = xgbRandomSearch.best_estimator_
print('tuned XGB: ', xgbTuned)

if xgbMean > currentBestModel['cv']:
    currentBestModel = {
        'model': xgbTuned,
        'modelFamily': 'XGB',
        'cv': xgbMean
    }
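
When building search grids with float bounds, keep in mind that `np.arange` uses a step of 1 unless told otherwise. The sketch below shows what the two float ranges used in this section evaluate to, and a log-spaced grid as one possible alternative (the choice of 4 points is arbitrary).

```python
import numpy as np

# Step defaults to 1, so only the start value fits below the stop value.
single = np.arange(1e-3, 1)

# Same call with stop=10: ten values, 0.001, 1.001, ..., 9.001.
stepped = np.arange(1e-3, 10)

# A log-spaced grid between 1e-3 and 1.
log_grid = np.logspace(-3, 0, num=4)

print(single, stepped, log_grid, sep="\n")
```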

### SVM

In [None]:
svm = SVC(random_state=SEED, probability=True)

svmConfig = {
    'C': np.arange(1e-3, 10),
    'gamma': ['scale', 'auto']
}

svmRandomSearch = (
    RandomizedSearchCV(
        svm, 
        svmConfig, 
        verbose=False, 
        scoring='accuracy', 
        cv=5, 
        n_iter=3, 
        n_jobs=-1, # all cores
        random_state=SEED
    )
)

svmRandomSearch.fit(Xtrain, Ytrain)

svmMean = svmRandomSearch.best_score_
print('mean accuracy for svm: ', svmMean)

svmTuned = svmRandomSearch.best_estimator_
print('tuned SVM: ', svmTuned)

if svmMean > currentBestModel['cv']:
    currentBestModel = {
        'model': svmTuned,
        'modelFamily': 'SVM',
        'cv': svmMean
    }

### NN

In [None]:
# model definition
nn = MLPClassifier(random_state=SEED, early_stopping=True)

# RandomizedSearchCV
nnConfig = {
    'hidden_layer_sizes': [(2 ** i,) for i in np.arange(2, 7)], # just one hidden layer
    'alpha': [1e-10, 1e-8, 1e-6, 1e-4, 1e-2, 1e-0, 1e2],
    'learning_rate': ['constant', 'adaptive']
}

nnRandomSearch = (
    RandomizedSearchCV(
        nn, 
        nnConfig, 
        verbose=False, 
        scoring='accuracy', 
        cv=5, 
        n_iter=3, 
        n_jobs=-1, 
        random_state=SEED
    )
)

nnRandomSearch.fit(Xtrain, Ytrain)

nnMean = nnRandomSearch.best_score_
print('mean accuracy for nn: ', nnMean)

nnTuned = nnRandomSearch.best_estimator_
print('tuned NN: ', nnTuned)

if nnMean > currentBestModel['cv']:
    currentBestModel = {
        'model': nnTuned,
        'modelFamily': 'NN',
        'cv': nnMean
    }
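
The four cells above repeat the same search-and-compare pattern. As a sketch, the pattern can be written once as a loop over candidate models. The example below runs on synthetic data from `make_classification` with two small candidates, so the data, grids and `cv=3` are assumptions made to keep it fast, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

SEED = 10

# Synthetic stand-in for Xtrain / Ytrain.
X, y = make_classification(n_samples=200, n_features=8, random_state=SEED)

# Each family maps to (estimator, parameter grid).
candidates = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "RF": (
        RandomForestClassifier(random_state=SEED),
        {"n_estimators": np.arange(10, 50), "max_depth": np.arange(5, 50)},
    ),
}

best = {"modelFamily": None, "model": None, "cv": -np.inf}
for family, (estimator, config) in candidates.items():
    search = RandomizedSearchCV(
        estimator, config, scoring="accuracy", cv=3, n_iter=3, random_state=SEED
    )
    search.fit(X, y)
    # Keep the family with the highest mean cross-validated accuracy.
    if search.best_score_ > best["cv"]:
        best = {
            "modelFamily": family,
            "model": search.best_estimator_,
            "cv": search.best_score_,
        }

print(best["modelFamily"], best["cv"])
```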

## Engineered datasets

In [None]:
############### NOTEBOOK WAS TOO SLOW WITH THIS, DECIDED TO COMMENT OUT ###################

# # 1. Importance Sampling dataset
# Ytrain70ImportanceNotEncoded = adultTrain70Importance.pop("target").values
# print('Ytrain70ImportanceNotEncoded:', Ytrain70ImportanceNotEncoded) 
# Ytrain70Importance = le.fit_transform(Ytrain70ImportanceNotEncoded)
# print('Ytrain70Importance:', Ytrain70Importance) ## remove afterwards
# Xtrain70Importance = adultTrain70Importance.values

# # for cross validation 
# # k-fold cross validation is not done here because, the duplicated rows would leak to the cv sets
# #Ytrain30ImportanceNotEncoded = adultTrain30Importance.pop("target")
# Ytrain30Importance = le.fit_transform(Ytrain30ImportanceNotEncoded)
# Xtrain30Importance = adultTrain30Importance.values

# # shape check
# print(Xtrain70Importance.shape)
# print(Xtest.shape) 
# print(Ytrain70Importance.shape)

# # 5 is the default n_neighbors
# importanceSamplingKnn = KNeighborsClassifier(n_neighbors=int(5*knnNeighboursFactor))
# importanceSamplingKnn.fit(Xtrain70Importance, Ytrain70Importance)
# Ytrain30Prediction = importanceSamplingKnn.predict(Xtrain30Importance)
# print('Ytrain30Prediction', Ytrain30Prediction) ### remove afterwards
# importanceSamplingKnnAccMean = accuracy_score(Ytrain30Importance, Ytrain30Prediction)

# print('mean accuracy for importanceSamplingKnnAccMean: ', importanceSamplingKnnAccMean)
# if importanceSamplingKnnAccMean > currentBestModel['cv']:  
#     # get whole dataset from promisingDatasets
#     adultTrainImportance = promisingDatasets["importanceSampling"] 
    
#     YtrainImportance = le.fit_transform(adultTrainImportance.pop("target").values)
#     XtrainImportance = adultTrainImportance.values
        
#     currentBestModel = {
#         'model': importanceSamplingKnn,
#         'cv': importanceSamplingKnnAccMean,
#         'X': XtrainImportance,
#         'Y': YtrainImportance
#     }

# Final model
### Trained on entire train dataset.

In [None]:
bestFamily = currentBestModel['modelFamily']
trainedBestModel = currentBestModel['model']
meanAcc = currentBestModel['cv']

print('best family: ', bestFamily)
print('best model: ', trainedBestModel)
print('best cv mean accuracy: ', meanAcc)

predictions = trainedBestModel.predict(Xtest) # numpy array

# Submission
### Save to csv in the required format.

In [None]:
# going back to an array of strings: <=50K and >50K
predictions = le.inverse_transform(predictions)
submissionDf = pd.DataFrame({'Id': idTest.values, 'income': predictions})

submissionDf.to_csv("submission.csv", index=False)