# "Data Science (Marketing)" Simple Challenge

We found a company's publicly-hosted GitHub Account with simple pre-interview challenges for several Data-Science-related roles. We removed the company information and created this notebook to practice the challenges.

The data is theirs, but all code is our own. The repository was published under the MIT License; strictly speaking, omitting the license notice breaks its terms, but we would rather keep the company anonymous and disassociated, and the MIT License is otherwise free and permissive.

Most dependencies (`import` calls) for this notebook are in the first code cell; `xgboost` is imported in the cell that builds the classifier.

Alongside this notebook is a lock file generated by `poetry`, which pins the exact dependencies used to generate and run this notebook. The `pyproject.toml` file is the project configuration file generated by `poetry`, which is itself installable with `pip`.

## Simple Challenge

<style type="text/css">
    ol ol { list-style-type: lower-alpha; }
</style>

_(The following has been extracted from a PDF in the originating company's GitHub repository of challenges. The data associated with the challenge are in the `data-science-marketing_data` directory.)_

Success in the simple challenge leads to the final two steps of the interview process:

1. Informal chat with [Company] founders
1. Full technical challenge

For the simple challenge, use `train.csv` to predict the likelihood of the `outcome` variable being equal to `1`.

__Requirements:__

1. All code must be written in Python and must be in a Jupyter notebook
1. The first cell in the notebook must include:
   1. Your last name (please don’t include any other identifying information)
   1. The date
   1. A one sentence description of your approach
   1. The estimated AUC you would expect to get on the `test.csv` data. If your estimated AUC is less than `.825`, your submission will not be reviewed.
1. Your code _must_ be able to predict _all_ observations in the test dataset. The last cell in the notebook must output the first five predicted values of the `outcome` variable for `test.csv`.

If you are spending more than an hour on this simple challenge because there are so many things you want to demonstrate, you are spending too much time on it. If you are spending more than an hour on it because you don’t know where to start, please be warned that the full technical challenge will be considerably more difficult.

## Estimated AUC

A __Receiver Operating Characteristic__ (ROC) curve plots the __True Positive Rate__ (TPR) as a function of the __False Positive Rate__ (FPR) as the "threshold" of a binary classification is varied; each threshold contributes one point on the curve.

In a binary-classification, a __Confusion Matrix__ is a `2x2` array of the __True Positive__ (TP), __True Negative__ (TN), __False Positive__ (FP), and __False Negative__ (FN) counts.

The __True Positive Rate__ (TPR) is formulated as: `TP / (TP + FN)`.

The __False Positive Rate__ (FPR) is formulated as: `FP / (FP + TN)`.
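As a minimal sketch of these two definitions, the rates can be computed directly from the four confusion-matrix counts. The helper name `rates_from_counts` is our own illustration; the counts are those of the training confusion matrix printed further down in this notebook, which `sklearn` lays out as `[[TN, FP], [FN, TP]]` for `0`/`1` labels.

```python
def rates_from_counts(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return tpr, fpr


# Counts read off the training confusion matrix [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 9018, 0, 17, 965

tpr, fpr = rates_from_counts(tp=tp, fn=fn, fp=fp, tn=tn)
print("TPR = {0:.4f}, FPR = {1:.4f}".format(tpr, fpr))
```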

Per the publication "An introduction to ROC analysis" by Tom Fawcett (Pattern Recognition Letters, 2006), all ROC curves start at `(FPR, TPR) = (0, 0)` and end at `(1.0, 1.0)`. Any binary-classification "threshold" (decision point) `t` adds a point `(FPR(t), TPR(t))` between them. With only these 3 points, connected by straight lines, we can calculate a rough __Area Under the Curve__ (AUC) estimate as the sum of the triangle under the first segment, the triangle under the second segment, and the rectangle beneath that second triangle:

`AUC = ((TPR(t) * FPR(t)) / 2) + (((1 - TPR(t)) * (1 - FPR(t))) / 2) + (TPR(t) * (1 - FPR(t)))`

Which can be rewritten as:

`AUC = (TPR(t) * FPR(t) * (1/2 + 1/2 - 1)) + 1/2 - (TPR(t)/2) - (FPR(t)/2) + TPR(t)`

And that simplifies to:

`AUC = (TPR(t) - FPR(t) + 1) / 2`
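The simplification can be checked numerically. The sketch below (function names are ours and purely illustrative) computes the area under the three-point curve with the trapezoid rule and compares it with the closed form, using the `TPR` and `FPR` implied by the training confusion matrix shown later in this notebook.

```python
def three_point_auc(tpr, fpr):
    """Closed form for the ROC through (0, 0), (fpr, tpr), (1, 1)."""
    return (tpr - fpr + 1) / 2


def trapezoid_auc(points):
    """Trapezoid-rule area under a piecewise-linear curve of (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# TPR = TP / (TP + FN), FPR = FP / (FP + TN), from the training confusion matrix
tpr, fpr = 965 / 982, 0 / 9018

closed_form = three_point_auc(tpr, fpr)
numeric = trapezoid_auc([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])
print(closed_form, numeric)
```

Both values come out at about `0.9913`, in line with the `sklearn.metrics.auc` result reported for the training data.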

### Estimate

> _Filled in retroactively after building the notebook. We don't have the "actual" classifications for `test.csv`, so this AUC estimate is based on how the classifier performed on `train.csv` in a self-analysis after classification, with more discussion below._

For our trained classifier, our estimate of AUC is at least `0.88`, and up to `0.99`, most likely around `0.95`.

In [3]:
# Python Standard Library
import pathlib;
import sys;

# Third-Party Packages
import pandas;

import plotly;
import plotly.express as plotly_express;

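import sklearn;
# Submodules used below; `import sklearn` alone does not guarantee they are loaded
import sklearn.metrics;
import sklearn.preprocessing;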

print(">> Python v{0:s}".format(sys.version));
print("");
print(">> Loading: pandas v{0:s}".format(pandas.__version__));
print(">> Loading: plotly v{0:s}".format(plotly.__version__));
print(">> Loading: sklearn v{0:s}".format(sklearn.__version__));
print("");
print(">> Dependencies Loaded.");

>> Python v3.8.2 (default, Apr 13 2020, 19:02:26) 
[Clang 11.0.3 (clang-1103.0.32.29)]

>> Loading: pandas v1.1.0
>> Loading: plotly v4.9.0
>> Loading: sklearn v0.23.2

>> Dependencies Loaded.


In [4]:
# Load and Parse Training Data

train_file = pathlib.Path("data-science-marketing_data/train.csv");

train_df = pandas.read_csv(train_file);

print("Training Data:");
print(train_df.head());
print("");

Training Data:
   age  cost_of_ad device_type gender  in_initial_launch_location  income  \
0   56    0.005737      iPhone      M                           0   62717   
1   50    0.004733     desktop      F                           0   64328   
2   54    0.004129      laptop      M                           0   83439   
3   16    0.005117     Android      F                           0   30110   
4   37    0.003635     desktop      M                           0   76565   

   n_drivers  n_vehicles  prior_ins_tenure  outcome  
0          2           1                 4        0  
1          2           3                 2        0  
2          1           3                 7        0  
3          2           3                 0        0  
4          2           1                 5        0  



In [130]:
# Create Gradient-Boosted Classifier

import xgboost;

# Encode the categorical columns as integer labels
device_sr = train_df["device_type"];
device_enc = sklearn.preprocessing.LabelEncoder();
device_enc.fit(device_sr.unique());
device_lbl = device_enc.transform(device_sr);

# Cast to str so that any missing values (NaN) are encoded as the label "nan"
gender_sr = train_df["gender"].apply(str);
gender_enc = sklearn.preprocessing.LabelEncoder();
gender_enc.fit(gender_sr.unique());
gender_lbl = gender_enc.transform(gender_sr);

train_enc_df = train_df.copy(deep= True,);
train_enc_df["device_type"] = device_lbl;
train_enc_df["gender"] = gender_lbl;

outcome_sr = train_enc_df["outcome"];
train_enc_df.drop(
    columns= ["outcome",],
    axis= 1,
    inplace= True,
);

train_dm = xgboost.DMatrix(
    data= train_enc_df,
    label= outcome_sr,
);

crossval_params = {
    "booster": "gbtree",
    "max_depth": 25,
    "eta": 0.5,
    "objective": "binary:hinge",
};

classifier_params = {
    "booster": "gbtree",
    "max_depth": 25,
    "eta": 0.5,
    "objective": "binary:hinge",
};
num_round = 20;

cross_val = xgboost.cv(
    crossval_params,
    train_dm,
    20, # nrounds
    20, # nfold
);

classifier_obj = xgboost.train(
    params= classifier_params,
    dtrain= train_dm,
    num_boost_round= num_round,
);

print("Trained Classifier... done!");

Trained Classifier... done!


In [131]:
test_file = pathlib.Path("data-science-marketing_data/test.csv");

test_df = pandas.read_csv(test_file);

test_device_lbl = device_enc.transform(test_df["device_type"]);
test_gender_lbl = gender_enc.transform(test_df["gender"].apply(str));

test_enc_df = test_df.copy(deep= True,);
test_enc_df["device_type"] = test_device_lbl;
test_enc_df["gender"] = test_gender_lbl;

test_dm = xgboost.DMatrix(test_enc_df);

print(classifier_obj)

train_classified = classifier_obj.predict(train_dm);
test_classified = classifier_obj.predict(test_dm);

classes = sorted(outcome_sr.unique()); # sorted so classes[0] is the negative label

print("'False' Outcome:", classes[0]);
print(" 'True' Outcome:", classes[1]);
print("");

train_conmat = sklearn.metrics.confusion_matrix(outcome_sr, train_classified);
train_fpr, train_tpr, train_thresholds = sklearn.metrics.roc_curve(
    outcome_sr,
    train_classified,
    pos_label= classes[1],
);
train_auc = sklearn.metrics.auc(train_fpr, train_tpr);

print("Classified Training Data AUC Thresholds:");
print(train_thresholds);
print("");
print("Classified Training Data Confusion-Matrix:");
print(train_conmat);
print("");
print("Classified Training Data AUC:");
print(train_auc);
print("");
print("Classified Training Data Cross-Validation:");
print(cross_val);
print("");

<xgboost.core.Booster object at 0x11ba9b790>
'False' Outcome: 0
 'True' Outcome: 1

Classified Training Data AUC Thresholds:
[2. 1. 0.]

Classified Training Data Confusion-Matrix:
[[9018    0]
 [  17  965]]

Classified Training Data AUC:
0.9913441955193483

Classified Training Data Cross-Validation:
    train-error-mean  train-error-std  test-error-mean  test-error-std
0           0.901800         0.000863           0.9018        0.016394
1           0.014947         0.001180           0.1230        0.011841
2           0.014942         0.001179           0.1230        0.011841
3           0.014942         0.001179           0.1230        0.011841
4           0.014942         0.001179           0.1230        0.011841
5           0.014832         0.001174           0.1227        0.011336
6           0.014516         0.001088           0.1223        0.012054
7           0.011884         0.001266           0.1211        0.011322
8           0.010863         0.001383           0.1214      

## Classifier

For classification we used Gradient-Boosted Trees with the `binary:hinge` "objective", provided by `xgboost`'s Python API.

After setting up the classifier and doing a naïve evaluation, we found that our training AUC was only `0.50`, which meant our classifier was no better than random guessing.

We made a few changes to the `classifier_params` dictionary and the `num_round` value, and settled on the current values.

This was an _ad hoc_ approach, though not without reasonably well-educated guesses.

We increased `num_round` and `max_depth` to give the algorithm more boosting rounds and deeper trees to fit with. For the sake of brevity we removed the plots, but we did a preliminary evaluation of the histograms of each column in the training dataset, and most of the "distributions" were relatively flat. A flat distribution has high (Shannon) Entropy on its own, but no single column showed an obvious structure that would separate the two `outcome` classes, so we expected each column individually to provide minimal Information for training a classifier. Of course, a combination of "features" could still carry enough Information to discriminate between the classes, which is why we chose the `xgboost` library.
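Rather than judging histograms by eye, one way to quantify how much a single column tells us about `outcome` is the mutual information between the two. The following is a minimal sketch in plain Python for discrete (or pre-binned) values; the function name `mutual_information` and the toy columns are our own illustration, not part of the original analysis.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy data (hypothetical): one column independent of the outcome,
# and one column that fully determines it
outcome = [0, 0, 1, 1]
uninformative = ["a", "b", "a", "b"]
informative = ["a", "a", "b", "b"]

mi_uninformative = mutual_information(uninformative, outcome)
mi_informative = mutual_information(informative, outcome)
print(mi_uninformative, mi_informative)
```

A column can score near zero on its own and still matter in combination with others, which is the situation tree ensembles are meant to handle.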

Using the `binary:hinge` "objective" was a choice of convenience, to simply get `0` or `1` labels that easily correspond to the `outcome` labels.

We also found that moving `eta` away from `0.5` in either direction seemed to have a detrimental effect, with lower values being worse.

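As you can see above, we achieved a training AUC of `0.99`, which agrees with the Confusion Matrix: a TPR of `965 / 982` (about `0.983`) with an FPR of `0` gives `(0.983 - 0 + 1) / 2`, or about `0.99`.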

We produced `test_classified` (the classifier run on the `test` data), but we could not evaluate its AUC or Confusion Matrix because `test.csv` does not come with ground-truth `outcome` values.

Other classification approaches could have worked. Even the same approach with a `binary:logistic` "objective", which returns probabilities rather than hard `0`/`1` labels, would likely have been more informative had we gone more in-depth about the potential accuracy of our predictions.
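To illustrate why scores can be more informative than hard labels, the sketch below computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, with ties counting half. This pairwise form is equivalent to the area under the ROC curve; the function name `pairwise_auc` and the toy scores are hypothetical. With probability-like scores every pair can be ranked, whereas thresholded `0`/`1` labels tie every pair that lands on the same side of the threshold.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values)
labels = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.55, 0.7, 0.3, 0.9, 0.4]
hard = [1 if s > 0.5 else 0 for s in scores]  # hard 0/1 labels at a 0.5 threshold

auc_scores = pairwise_auc(labels, scores)
auc_hard = pairwise_auc(labels, hard)
print(auc_scores, auc_hard)
```

On this toy data the scores give an AUC of `0.875`, while the hard labels give `0.75`, which equals `(TPR - FPR + 1) / 2` with `TPR = 0.75` and `FPR = 0.25`.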

We also used the XGBoost __Cross Validation__ routine to run a K-Fold __Cross Validation__ with `K = 20` on the classifier (same configuration, at least, if not technically the same exact classifier). After 20 rounds, the mean test error was about `0.12`, so the expected Test Data __Accuracy__ would be around `0.88`, which we treat as a rough stand-in for the AUC. Since K-Fold __Cross Validation__ randomly partitions the original data and trains on the smaller subsets, we also see the training accuracy improve to `0.997`, compared with our training AUC of `0.99`. So there was potentially an outlier-rejection benefit to the partitioning, which may be misleading. Again, the __AUC__ and the complement of the average error in __Cross Validation__ aren't the same values, but they're certainly correlated.
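For reference, the arithmetic behind these figures is sketched below. `xgboost.cv` reports a classification error per round, so accuracy is its complement; the values are copied from the outputs shown above (cross-validation round 7, the last row that is fully visible, and the training confusion matrix). As an addition of our own, the sketch also computes the accuracy of always predicting the majority class, a common reference point when classes are imbalanced.

```python
# Values copied from the notebook output above
test_error_mean = 0.1211     # xgboost.cv, round 7, test-error-mean
train_error_mean = 0.011884  # xgboost.cv, round 7, train-error-mean
n_negative, n_positive = 9018, 982  # row sums of the training confusion matrix

cv_test_accuracy = 1 - test_error_mean
cv_train_accuracy = 1 - train_error_mean
majority_accuracy = max(n_negative, n_positive) / (n_negative + n_positive)

print("CV test accuracy:  {0:.4f}".format(cv_test_accuracy))
print("CV train accuracy: {0:.4f}".format(cv_train_accuracy))
print("Majority baseline: {0:.4f}".format(majority_accuracy))
```

Accuracy depends on the class balance while ROC AUC does not, so the two are best read side by side rather than treated as interchangeable.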

As such, and as stated above, we expect the AUC of our predictor on the Test Data to bottom out around `0.88`, provided that the training data is representative of the test data. Realistically, under that same assumption, we'd expect an AUC closer to `0.95`, given the very high accuracy of our trained classifier here.