# Data Preprocessing
First, the training data is preprocessed. Missing values are filled with the most frequent value of the corresponding feature (column).
Then all categorical features are encoded as integers using a LabelEncoder.
Finally, the data is split into X (the features) and Y (the target label).

In [1]:
# DATA
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import cross_validate, GridSearchCV

existing_customers = pd.read_excel("data/existing-customers.xlsx", engine="openpyxl")
existing_customers = existing_customers.fillna(existing_customers.mode().iloc[0])
to_label = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country", "class"]
label_encoder = LabelEncoder()
for label in to_label:
    existing_customers[label] = label_encoder.fit_transform(existing_customers[label])

X = existing_customers.drop(["class", "RowID"], axis=1)
Y = existing_customers["class"]

  warn("Workbook contains no default style, apply openpyxl's default")
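
To make the two preprocessing steps concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` data and its values are invented for illustration and are not taken from the customer file). It shows what `fillna(df.mode().iloc[0])` and `LabelEncoder` do to a categorical and a numeric column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "workclass": ["Private", None, "Private", "State-gov"],
    "age": [39, 50, None, 50],
})

# mode() ignores missing values and returns one row per tied mode;
# iloc[0] keeps the first mode of every column.
filled = toy.fillna(toy.mode().iloc[0])

# LabelEncoder sorts the distinct values and maps them to 0..n-1.
encoder = LabelEncoder()
filled["workclass"] = encoder.fit_transform(filled["workclass"])

print(filled)
print(list(encoder.classes_))
```

The integer codes are assigned in sorted order of the distinct values, so they carry no meaning beyond identifying the category.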


# Scoring

Here a profit function is defined that uses the numbers of true positives and false positives to estimate the profit a particular classifier makes. A dataframe is created to store the metrics of each classifier. Classifiers are ranked by profit rather than accuracy: the data is highly imbalanced (75% of the data is <=50k), so accuracy is not a good measure here, and the highest profit is all we care about.

In [2]:
from sklearn.metrics import confusion_matrix, make_scorer
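def profit(y_true, y_pred):
    # For a binary problem the flattened confusion matrix is (TN, FP, FN, TP).
    cfm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cfm.ravel()
    # TP term weighted 0.1 * 980, FP term weighted 0.05 * (-310),
    # and a cost of 10 for every predicted positive (TP + FP).
    return 0.1*TP*(980) + 0.05*FP*(-310) + (TP+FP)*(-10)
# Wrap the function so cross_validate and GridSearchCV can use it as a metric.
profit_scorer = make_scorer(profit)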

scoring = {
    'acc': 'accuracy',
    'prec': 'precision',
    'rec': 'recall',
    'prof': profit_scorer
}
results = pd.DataFrame(columns=['Algorithm', 'Accuracy', 'Precision', 'Recall', 'Profit'])
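
As a quick sanity check of the profit formula, the sketch below evaluates it on a hypothetical set of eight labels (the arrays `y_true` and `y_pred` are made up for illustration). Passing `labels=[0, 1]` is an addition here so that the confusion matrix stays 2x2 even when one class is never predicted.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def profit(y_true, y_pred):
    # labels=[0, 1] keeps the matrix 2x2 even if a class is absent.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return 0.1 * tp * 980 + 0.05 * fp * (-310) + (tp + fp) * (-10)

# Hypothetical labels: 3 actual positives, 5 actual negatives.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# TP = 2, FP = 1, so the formula gives 196 - 15.5 - 30, about 150.5.
example_profit = profit(y_true, y_pred)
# Predicting nobody as positive yields no revenue and no cost.
no_contact_profit = profit(y_true, np.zeros_like(y_true))
print(example_profit, no_contact_profit)
```

False negatives do not appear in the formula, so a classifier is only rewarded or penalised for the customers it flags as positive.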

# KNeighbors Classifier
First, a KNeighbors classifier is tried, with a grid search (5-fold cross-validation) to find the best number of neighbors. 5 neighbors gives the highest mean profit.

In [3]:
from sklearn.neighbors import KNeighborsClassifier

parameters = {
    "n_neighbors": [1, 5, 11, 25, 51, 101]
}
GCV = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=parameters, scoring=scoring, refit='prof', cv=5)
GCV.fit(X, Y)
# Mean scores for the best parameter
mean_acc = GCV.cv_results_['mean_test_acc'][GCV.best_index_]
mean_prec = GCV.cv_results_['mean_test_prec'][GCV.best_index_]
mean_rec = GCV.cv_results_['mean_test_rec'][GCV.best_index_]
mean_prof = GCV.cv_results_['mean_test_prof'][GCV.best_index_]
results.loc[len(results.index)] = ['KNeighborsClassifier', mean_acc, mean_prec, mean_rec, mean_prof]
print(GCV.best_params_)

{'n_neighbors': 5}
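
The cell above only reports the best parameter. To compare all candidates, `cv_results_` can be loaded into a DataFrame. The following is a minimal sketch on synthetic data from `make_classification` (the data, the reduced grid, and the use of recall as the refit metric are assumptions made so the example runs on its own; they are not the notebook's setup).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the customer data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 11]},
    scoring={"acc": "accuracy", "rec": "recall"},
    refit="rec",  # with several metrics, refit names the one used for ranking
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, with the mean test score of every metric.
summary = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_acc", "mean_test_rec", "rank_test_rec"]
]
print(summary)
```

With multiple metrics, `best_index_` and `best_params_` refer to the metric named in `refit`, which is why the notebook passes `refit='prof'`.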


# Decision Tree Classifier

For the Decision Tree classifier I wanted to find out which splitting criterion works best. In a grid search (with 5-fold cross-validation) the entropy criterion gives the best results.

In [4]:
from sklearn.tree import DecisionTreeClassifier
# Decision Tree Classifier
parameters = {
    "criterion": ['gini', 'entropy', 'log_loss']
}
GCV = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=parameters, scoring=scoring, refit='prof', cv=5)
GCV.fit(X, Y)
# Mean scores for the best parameter
mean_acc = GCV.cv_results_['mean_test_acc'][GCV.best_index_]
mean_prec = GCV.cv_results_['mean_test_prec'][GCV.best_index_]
mean_rec = GCV.cv_results_['mean_test_rec'][GCV.best_index_]
mean_prof = GCV.cv_results_['mean_test_prof'][GCV.best_index_]
results.loc[len(results.index)] = ['DecisionTreeClassifier', mean_acc, mean_prec, mean_rec, mean_prof]
print(GCV.best_params_)

{'criterion': 'entropy'}


# Categorical Naive Bayes Classifier
Since most of the features are categorical, I wanted to try this classifier and see how it performs. For this I ran a 5-fold cross-validation with the default parameters and saved the average metrics.

In [5]:
from sklearn.naive_bayes import CategoricalNB
# Categorical Naive Bayes
scores = cross_validate(CategoricalNB(), X, Y, scoring=scoring, cv=5)
# Mean scores
mean_acc = np.average(scores['test_acc'])
mean_prec = np.average(scores['test_prec'])
mean_rec = np.average(scores['test_rec'])
mean_prof = np.average(scores['test_prof'])
results.loc[len(results.index)] = ['CategoricalNB', mean_acc, mean_prec, mean_rec, mean_prof]

# AdaBoost Classifier
Here I wanted to boost the Decision Tree classifier with the AdaBoost classifier. 50, 100 and 1000 estimators are tested and, as I expected, a higher number of estimators gave a better result. I capped it at 1000 estimators because the computing time increases a lot with more estimators.

In [6]:
# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
parameters = {
    'n_estimators': [50, 100, 1000]
}
GCV = GridSearchCV(estimator=AdaBoostClassifier(), param_grid=parameters, scoring=scoring, refit='prof', cv=5)
GCV.fit(X, Y)
# Mean scores for the best parameter
mean_acc = GCV.cv_results_['mean_test_acc'][GCV.best_index_]
mean_prec = GCV.cv_results_['mean_test_prec'][GCV.best_index_]
mean_rec = GCV.cv_results_['mean_test_rec'][GCV.best_index_]
mean_prof = GCV.cv_results_['mean_test_prof'][GCV.best_index_]
results.loc[len(results.index)] = ['AdaBoostClassifier', mean_acc, mean_prec, mean_rec, mean_prof]
print(GCV.best_params_)

{'n_estimators': 1000}


# RandomForest Classifier
For this last classifier I again wanted to see if we could get better results using multiple Decision Trees, which is exactly what Random Forest does. I tested three different numbers of trees (100, 200 and 1000). Again, as I expected, a higher number of estimators gave better results, at the cost of higher computation times.

In [7]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
parameters = {
    'n_estimators': [100, 200, 1000]
}
GCV = GridSearchCV(estimator=RandomForestClassifier(), param_grid=parameters, scoring=scoring, refit='prof', cv=5)
GCV.fit(X, Y)
# Mean scores for the best parameter
mean_acc = GCV.cv_results_['mean_test_acc'][GCV.best_index_]
mean_prec = GCV.cv_results_['mean_test_prec'][GCV.best_index_]
mean_rec = GCV.cv_results_['mean_test_rec'][GCV.best_index_]
mean_prof = GCV.cv_results_['mean_test_prof'][GCV.best_index_]
results.loc[len(results.index)] = ['RandomForestClassifier', mean_acc, mean_prec, mean_rec, mean_prof]
print(GCV.best_params_)

{'n_estimators': 1000}


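Displaying the average metrics of each classifier (with the best parameters) during the cross-validation. Each value, including Profit, is the mean over the five test folds, so Profit refers to a fold holding roughly one fifth of the data.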

In [8]:
display(results)

| | Algorithm | Accuracy | Precision | Recall | Profit |
|---|---|---|---|---|---|
| 0 | KNeighborsClassifier | 0.838887 | 0.680704 | 0.623775 | 74372.0 |
| 1 | DecisionTreeClassifier | 0.818372 | 0.624951 | 0.614593 | 70065.2 |
| 2 | CategoricalNB | 0.858727 | 0.736408 | 0.643924 | 79641.6 |
| 3 | AdaBoostClassifier | 0.870582 | 0.783908 | 0.638696 | 81097.7 |
| 4 | RandomForestClassifier | 0.850158 | 0.716068 | 0.626197 | 76481.2 |


Preprocessing potential customers data

In [9]:
# Preprocess potential customers
potential_customers = pd.read_excel("data/potential-customers.xlsx", engine="openpyxl")
potential_customers = potential_customers.fillna(potential_customers.mode().iloc[0])
to_label = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
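# NOTE: a fresh LabelEncoder is fitted on this file, so the integer codes match those of the
# training data only if each column contains the same set of values in both files.
for label in to_label:
    potential_customers[label] = LabelEncoder().fit_transform(potential_customers[label])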

X_ = potential_customers.drop(["RowID"], axis=1)

  warn("Workbook contains no default style, apply openpyxl's default")
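
Fitting a separate `LabelEncoder` on each file can assign different integers to the same category when the two files do not contain exactly the same values. One possible way to avoid this, sketched below on hypothetical toy frames (`train`, `new` and the `encoders` dictionary are illustrative names, not part of the notebook), is to keep one fitted encoder per column and reuse it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for the two customer files.
train = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                      "race": ["White", "Black", "Other"]})
new = pd.DataFrame({"sex": ["Male", "Male"],
                    "race": ["Other", "White"]})

# Fitting independently on the new file: "Male" is the only class, so it becomes 0.
independent_codes = LabelEncoder().fit_transform(new["sex"]).tolist()

# Fitting once on the training data and reusing the encoder keeps codes aligned.
encoders = {}
for col in ["sex", "race"]:
    encoders[col] = LabelEncoder().fit(train[col])
    train[col] = encoders[col].transform(train[col])
    new[col] = encoders[col].transform(new[col])

print(independent_codes)    # codes from the independent fit
print(new["sex"].tolist())  # codes consistent with the training data
```

Note that `LabelEncoder.transform` raises a `ValueError` for values it did not see during `fit`, so categories that occur only in the new file would need separate handling.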


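Predicting labels for the potential customers using each classifier (with the best parameters) and storing the estimated profit for each. Since the true labels are unknown, the numbers of true and false positives are estimated from the average cross-validation precision.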

In [11]:
# Defining all models with best parameters
models = [KNeighborsClassifier(), DecisionTreeClassifier(criterion='entropy'), CategoricalNB(), AdaBoostClassifier(n_estimators=1000), RandomForestClassifier(n_estimators=1000)]
profits_df = pd.DataFrame({
    'Algorithm': ['KNeighborsClassifier', 'DecisionTreeClassifier', 'CategoricalNB', 'AdaBoostClassifier', 'RandomForestClassifier'],
})
profits = []
# Training each one on the full dataset and predicting the labels for potential customers.
# Using average precision from the evaluation to calculate an estimation of the profit.
for i in range(len(models)):
    model = models[i]
    precision = results.loc[i]['Precision']
    model.fit(X, Y)
    potential_customers["class"] = model.predict(X_)
    high_income = potential_customers[potential_customers["class"] == 1]
    low_income = potential_customers[potential_customers["class"] == 0]
    TP = precision*len(high_income)     # Estimated TP
    FP = (1-precision)*len(high_income) # Estimated FP
    estimated_profit = 0.1*TP*(980) + 0.05*FP*(-310) + (TP+FP)*(-10)
    profits.append(estimated_profit)

profits_df['Estimated Profit'] = profits

display(profits_df)

| | Algorithm | Estimated Profit |
|---|---|---|
| 0 | KNeighborsClassifier | 183695.943313 |
| 1 | DecisionTreeClassifier | 171551.137685 |
| 2 | CategoricalNB | 169716.462628 |
| 3 | AdaBoostClassifier | 201972.915798 |
| 4 | RandomForestClassifier | 181710.83676 |
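
The estimate used in the loop can be simplified. Substituting TP = p·N and FP = (1 − p)·N, with precision p and N predicted positives, gives a profit of N·(113.5·p − 25.5). The sketch below restates this as two small helper functions (their names are hypothetical) and derives the break-even precision. It assumes, as the loop does, that the cross-validation precision carries over to the potential customers.

```python
def profit_per_predicted_positive(precision):
    # 0.1 * 980 * p  -  0.05 * 310 * (1 - p)  -  10, which equals 113.5 * p - 25.5
    return 0.1 * 980 * precision - 0.05 * 310 * (1 - precision) - 10

def estimated_profit(precision, n_predicted_positive):
    # Same quantity as in the loop: TP = p * N, FP = (1 - p) * N.
    return n_predicted_positive * profit_per_predicted_positive(precision)

# Precision at which contacting a predicted positive breaks even (about 0.22).
break_even_precision = 25.5 / 113.5

print(break_even_precision)
print(estimated_profit(1.0, 10))  # every contacted customer is a true positive
print(estimated_profit(0.0, 10))  # every contacted customer is a false positive
```

Under this formula the estimate grows linearly with the number of predicted positives, so a classifier with lower precision can still rank higher if it flags more customers.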


Finally, we save the row IDs of the people we are going to send the promotion to, using the best-performing classifier (AdaBoost with 1000 estimators).

In [None]:
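model = AdaBoostClassifier(n_estimators=1000)
model.fit(X, Y)
# Predict with this model; otherwise "class" still holds the predictions
# of the last model fitted in the loop above (RandomForest).
potential_customers["class"] = model.predict(X_)
high_income = potential_customers[potential_customers["class"] == 1]
np.savetxt('rows.txt', high_income['RowID'].values, fmt='%s')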