# "Data Science (Insurance)" Simple Challenge

We found a company's publicly-hosted GitHub Account with simple pre-interview challenges for several Data-Science-related roles. We removed the company information and created this notebook to practice the challenges.

The data is theirs, but all code is our creation. The repository was provided under the MIT License, so while technically we've broken the license by not including it, we'd rather keep the anonymity and disassociation since the MIT License is free and permissive.

All dependencies (`import` calls) for this notebook are in the first code-block.

Alongside this notebook is a lock-file generated using the `poetry` package, which has all the exact dependencies used to generate and run this Notebook. The `pyproject.toml` file is the project configuration file generated by `poetry` -- which is installable with `pip`.

## Simple Challenge

<style type="text/css">
    ol ol { list-style-type: lower-alpha; }
</style>

_(The following has been extracted from a PDF at the originating company's GitHub repository of challenges. The associated data with the challenge are in the `data-science-insurance_data` directory.)_

Success in the simple challenge leads to the final two steps of the interview process:

1. Informal chat with [Company] founders
1. Full technical challenge

For the simple challenge, use "train.csv" to predict the `outcome` variable using a __Generalised Linear Model__ (GLM) with a _log_ Link Function and Poisson Distribution. The category named `categorical` is a categorical column and the column named `numeric` is a numeric column. Both should be used as independent variables.

__Requirements:__

1. All code must be written in Python and must be in a Jupyter notebook.
1. The first cell in the notebook must include:
   1. Your last name (please don’t include any other identifying information)
   1. The date
1. You must output the GLM's parameter estimates.
1. Your code must be able to predict all five observations in the "test.csv" dataset. The last cell in the notebook must output the five predicted values of the `outcome` variable for "test.csv".
1. A key point of evaluation is how well written the code is. Please write the code as if you are writing it for a production setting. No need to wrap the code in a service or write Dockerfiles. Just ensure the code you write is not hacked together.

If you are spending more than an hour on this simple challenge because there are so many things you want to demonstrate, you are spending too much time on it. If you are spending more than an hour on it because you don’t know where to start, please be warned that the full technical challenge will be considerably more difficult.

In [2]:
# Python Standard Library
import pathlib;
import sys;

# Third-Party Packages
import pandas;

import plotly;
import plotly.express as plotly_express;

import sklearn;

print(">> Python v{0:s}".format(sys.version));
print("");
print(">> Loading: pandas v{0:s}".format(pandas.__version__));
print(">> Loading: plotly v{0:s}".format(plotly.__version__));
print(">> Loading: sklearn v{0:s}".format(sklearn.__version__));
print("");
print(">> Dependencies Loaded.");

>> Python v3.8.2 (default, Apr 13 2020, 19:02:26) 
[Clang 11.0.3 (clang-1103.0.32.29)]

>> Loading: pandas v1.1.0
>> Loading: plotly v4.9.0
>> Loading: sklearn v0.23.2

>> Dependencies Loaded.


In [11]:
# Load and Parse Training Data

train_file = pathlib.Path("data-science-insurance_data/train.csv");

train_df = pandas.read_csv(train_file);

print("Training Data:");
print(">  Size:", train_df.shape);
print(train_df.head(20));
print("");

Training Data:
>  Size: (195, 3)
    outcome  categorical  numeric
0         0          3.0     41.0
1         0          1.0     41.0
2         0          3.0     44.0
3         0          3.0      NaN
4         0          NaN     40.0
5         0          1.0     42.0
6         0          3.0     46.0
7         0          NaN     40.0
8         0          3.0     33.0
9         0          3.0     46.0
10        0          3.0     40.0
11        0          2.0     38.0
12        0          3.0     44.0
13        0          NaN     37.0
14        0          3.0     40.0
15        0          1.0     39.0
16        0          1.0     43.0
17        0          3.0     38.0
18        0          2.0      NaN
19        0          3.0     39.0



In [10]:
# Load and Parse Test Data

test_file = pathlib.Path("data-science-insurance_data/test.csv");

test_df = pandas.read_csv(test_file);

print("Testing Data:");
print(">  Size:", test_df.shape);
print(test_df.head(20));
print("");

Testing Data:
>  Size: (5, 2)
   categorical  numeric
0          NaN     71.0
1          3.0     75.0
2          NaN     71.0
3          1.0      NaN
4          2.0     73.0



## Generalised Linear Model (GLM)

A __Generalised Linear Model__ (GLM) is a regression approach that relates the expected value of an output (dependent) variable to a linear combination of the input (independent) variables through a Link Function. This is also referred to as [Generalised Linear Regression](https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression) in some libraries, like `sklearn` (`scikit-learn`).

To discuss this in context, let's first establish the algebraic symbols for the data that we have (from the previous cell).

Independent Variables:

- $x_{0}$: `categorical`
- $x_{1}$: `numeric`

Dependent Variable:

- $y$: `outcome`

$y = f\left(x_{0}, x_{1}\right)$

So, from the above equation, we have our task at hand. We need to figure out what the relationship is between the $x_{i}$ variables and the `outcome` ($y$) variable. We don't have any other information, but we were directed to use a __GLM__ approach. In general, we would plot data in 2D or 3D, if possible, and try to recognise some shape to the spread of the data. If there seemed to be something line-like or plane-like, we could then justify the use of a linear model. But, since we were told what approach to take, we'll leave the justification as being outside the scope of this work.

A __Generalised Linear Model__ (GLM) is typically written through a Classical Linear Algebra relationship between all the samples, but we can be a bit more clear and write out the full equation for one sample (row $i$ of the `train_df` DataFrame):

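$\hat{y}_{i} = g^{-1}\left(\beta_{0} + \beta_{1}{\cdot}x_{i,0} + \beta_{2}{\cdot}x_{i,1}\right)$

Here $g$ is the Link Function, so for the _log_ link $g^{-1}$ is the exponential function, and $\hat{y}_{i}$ is the estimated mean of the Poisson-distributed `outcome` for sample $i$. Unlike ordinary linear regression, there is no additive error term inside the equation: the randomness is carried by the Poisson Distribution around that mean.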
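To make the equation concrete, here is a minimal sketch of the prediction for a single sample under a _log_ link. The coefficient values are made up for illustration (they are not fitted estimates), and `predict_mean` is a hypothetical helper name of ours.

```python
import math


def predict_mean(beta0, beta1, beta2, x0, x1):
    """Predicted mean for one sample under a log link."""
    eta = beta0 + beta1 * x0 + beta2 * x1  # linear predictor
    return math.exp(eta)                   # inverse of the log link


# Hypothetical coefficients, chosen only for illustration.
BETA0, BETA1, BETA2 = -3.0, 0.1, 0.05

y_hat = predict_mean(BETA0, BETA1, BETA2, x0=3.0, x1=41.0)
y_hat_next = predict_mean(BETA0, BETA1, BETA2, x0=3.0, x1=42.0)

print("predicted mean:", y_hat)
print("ratio after +1 in x1:", y_hat_next / y_hat)
```

Because of the exponential, a one-unit increase in an input multiplies the predicted mean by $e^{\beta}$ rather than adding $\beta$ to it, and the prediction can never be negative.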

In [22]:
# In sklearn a Tweedie Regressor with `link="log"`
#  and `power=1` is equivalent to a log-Link-Function
#  Poisson-distribution Generalised Linear Model.
# https://scikit-learn.org/stable/modules/linear_model.html#usage

from sklearn.linear_model import TweedieRegressor;

regressor_obj = TweedieRegressor(
    power= 1,              # 1: Poisson Distribution
    link= "log",           # Link Function
    alpha= 1.0,            # Default: 1.0
    tol= 1e-4,             # Default: 1e-4
    fit_intercept= True,   # Default: True
    max_iter= 100,         # Default: 100
    verbose= 0,            # Default: 0
    warm_start= False,     # Default: False
);

# Regressor doesn't support `NaN`, so we need to
#  either drop or replace the values. As a first
#  pass, let's just do something "naïve" like
#  replace the `NaN` values with the Arithmetic
#  Mean value from their respective column.

cleaned_train_df = train_df.fillna(
    train_df.mean(),
    inplace= False,
);

# Since we don't have anything to say that we
#  should weight any of the samples differently
#  from one another, we can just set the
#  `sample_weight` parameter to the default
#  value of `None`, so that they're all weighted
#  equally.

regressor_obj.fit(
    cleaned_train_df[["categorical", "numeric",]],
    cleaned_train_df["outcome"],
    sample_weight= None,
);

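# NOTE: `get_params()` returns the constructor settings
#  (hyperparameters), not the fitted estimates; those
#  live in `regressor_obj.intercept_` and
#  `regressor_obj.coef_` after calling `fit()`.

print("> GLM Parameters:");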

regressor_params_dct = regressor_obj.get_params();

for key in regressor_params_dct:
    print("  -", key, ":", regressor_params_dct[key]);
# rof

print("");
print("> Done!")

> GLM Parameters:
  - alpha : 1.0
  - fit_intercept : True
  - link : log
  - max_iter : 100
  - power : 1
  - tol : 0.0001
  - verbose : 0
  - warm_start : False

> Done!
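Note that the values printed above are the settings passed to the constructor. As a hedged sketch of how the fitted estimates could be displayed instead, the snippet below fits the same kind of model on synthetic stand-in data (an assumption, so the numbers are not the challenge's) and reads scikit-learn's `intercept_` and `coef_` attributes.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

# Synthetic stand-in data (an assumption; NOT the challenge data).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 4, size=200),       # stand-in for `categorical`
    rng.uniform(30.0, 50.0, size=200),  # stand-in for `numeric`
]).astype(float)
y = rng.poisson(lam=np.exp(-3.0 + 0.1 * X[:, 0] + 0.08 * X[:, 1]))

model = TweedieRegressor(power=1, link="log", alpha=1.0, max_iter=1000)
model.fit(X, y)

# Fitted parameter estimates (beta_0, then one beta per input column).
print("intercept (beta_0):", model.intercept_)
for name, beta in zip(["categorical", "numeric"], model.coef_):
    print("coefficient for", name, ":", beta)

# With a log link, predictions are exp(intercept + X @ coef).
manual = np.exp(model.intercept_ + X @ model.coef_)
print("matches predict():", np.allclose(manual, model.predict(X)))
```

In the notebook itself, the same two attributes could be read from `regressor_obj` after its `fit()` call.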


In [26]:
# Range of Outcome Values

print("Unique `outcome` Values:");
print(train_df["outcome"].unique());

Unique `outcome` Values:
[0 1 3 2 5 4 6]


In [29]:
# Outcome Prediction using the "fit"-ed GLM

predicted_df = test_df.copy(deep= True,);
predicted_df["outcome_raw"] = -1;
predicted_df["outcome_rounded"] = -1;

# Again, sklearn cannot handle `NaN` values, so
#  we can again use the "naïve" approach, but
#  instead of using the Arithmetic Means of the
#  `test_df` columns, we still want to use the
#  Means from the `train_df` columns, so that
#  we don't add extra bias on top of whatever
#  bias we're already injecting. Again, this is
#  a non-ideal approach, but it gives us a
#  way to get a quick output to review the
#  predictions and see if we need to tweak the
#  parameters of our GLM.

cleaned_test_df = test_df.fillna(
    train_df.mean(),
    inplace= False,
);

predicted_df["outcome_raw"] = regressor_obj.predict(
    cleaned_test_df[["categorical", "numeric",]],
);

predicted_df["outcome_rounded"] = predicted_df["outcome_raw"].round(
    decimals= 0,
).astype(int);

print("Predictions for `outcome`:");
print(">  Size:", predicted_df.shape);
print(predicted_df.head(20));
print("");

Predictions for `outcome`:
>  Size: (5, 4)
   categorical  numeric  outcome_raw  outcome_rounded
0          NaN     71.0     2.431978                2
1          3.0     75.0     3.592392                4
2          NaN     71.0     2.431978                2
3          1.0      NaN     0.424782                0
4          2.0     73.0     2.919786                3



In [33]:
# Great! We got an output, but let's try again
#  with some tweaks to the GLM parameters and
#  see how accurate/stable it is...

regressor_obj = TweedieRegressor(
    power= 1,              # 1: Poisson Distribution
    link= "log",           # Link Function
    alpha= 0.15,           # Default: 1.0
    tol= 1e-6,             # Default: 1e-4
    fit_intercept= True,   # Default: True
    max_iter= 50000,       # Default: 100
    verbose= 0,            # Default: 0
    warm_start= False,     # Default: False
);

regressor_obj.fit(
    cleaned_train_df[["categorical", "numeric",]],
    cleaned_train_df["outcome"],
    sample_weight= None,
);

print("> GLM Parameters:");

regressor_params_dct = regressor_obj.get_params();

for key in regressor_params_dct:
    print("  -", key, ":", regressor_params_dct[key]);
# rof

print("");

predicted_df["outcome_raw"] = regressor_obj.predict(
    cleaned_test_df[["categorical", "numeric",]],
);

predicted_df["outcome_rounded"] = predicted_df["outcome_raw"].round(
    decimals= 0,
).astype(int);

print("Predictions for `outcome`:");
print(">  Size:", predicted_df.shape);
print(predicted_df.head(20));
print("");

> GLM Parameters:
  - alpha : 0.15
  - fit_intercept : True
  - link : log
  - max_iter : 50000
  - power : 1
  - tol : 1e-06
  - verbose : 0
  - warm_start : False

Predictions for `outcome`:
>  Size: (5, 4)
   categorical  numeric  outcome_raw  outcome_rounded
0          NaN     71.0     2.492728                2
1          3.0     75.0     3.962121                4
2          NaN     71.0     2.492728                2
3          1.0      NaN     0.393089                0
4          2.0     73.0     3.004424                3



### Results Discussion

So, we fitted a default __GLM__ using a __Tweedie Regressor__ configured with the _log_ Link Function and Poisson Distribution, as directed, and got `[2, 4, 2, 0, 3]` for the predicted `outcome`. We then refitted with a weaker L2 regularisation penalty (`alpha` lowered from `1.0` to `0.15`; in `sklearn` this is the regularisation strength, not a step size), raised the iteration limit to `50,000`, and tightened the tolerance to `1e-6`.

When we compare the `outcome_raw` values between the two runs, they shift only slightly, towards roughly `[2.5, 4.0, 2.5, 0.4, 3.0]`. Two runs aren't enough to demonstrate convergence, but the rounded predictions are the same in both.

This may mean that our answer of `[2, 4, 2, 0, 3]` is only marginally preferred over `[3, 4, 3, 0, 3]`, `[2, 4, 3, 0, 3]`, or `[3, 4, 2, 0, 3]`, since raw values so close to `2.5` sit almost on the rounding boundary between `2` and `3`.
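To put rough numbers on "slightly more likely", we can treat each raw prediction as the mean of a Poisson Distribution and look at the probability of each count. This is a minimal sketch using only the standard library; `poisson_pmf` and `most_likely_count` are hypothetical helper names of ours, and the two means are the raw predictions from the second run.

```python
import math


def poisson_pmf(k, mu):
    """Probability of observing count k when the Poisson mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def most_likely_count(mu, max_count=50):
    """Count with the highest Poisson probability (searched up to max_count)."""
    return max(range(max_count + 1), key=lambda k: poisson_pmf(k, mu))


for mu in (2.492728, 3.962121):
    probabilities = {k: round(poisson_pmf(k, mu), 3) for k in range(7)}
    print("mean:", mu)
    print("  P(count):", probabilities)
    print("  most likely count:", most_likely_count(mu))
```

One caveat this exposes: rounding gives the integer nearest to the predicted mean, which isn't always the single most probable count. For a non-integer Poisson mean, the most probable count is the mean rounded down.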

Since we don't know the true `outcome` values for the `test.csv` data, we can instead compute the $D^{2}$ score on the `train.csv` data to estimate how well the model fits. $D^{2}$ is the deviance-based analogue of $R^{2}$: the fraction of the "null deviance" that the model explains. Since we're scoring on the same data that the model was fitted to, these are strictly residuals rather than errors, and the score is an optimistic (in-sample) estimate.

Note that if $d(\dots)$ is the ["Deviance" Function](https://en.wikipedia.org/wiki/Deviance_(statistics)), $D^{2}$ is defined as:

$$
D^{2} = 1 - \frac{d\left(y, \hat{y}\right)}{d_{null}}
$$

This gives the optimal value of $1$ when the deviance between the actual and predicted values is zero, and is otherwise less than $1$ (it can be negative). The amount by which it falls short of $1$ is the ratio between the deviance of the predicted values ($\hat{y}$) from the real values ($y$) and the "null deviance" ($d_{null}$), which is the deviance of an intercept-only model that predicts the average of $y$ for every sample.
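As a worked illustration of the formula, here is a small standard-library sketch that computes $D^{2}$ by hand, assuming the usual Poisson deviance $d = 2\sum\left(y\log(y/\hat{y}) - (y - \hat{y})\right)$ with the log term taken as zero when $y = 0$. The toy values and the helper names are ours, not the challenge's.

```python
import math


def poisson_deviance(y_true, y_pred):
    """Total Poisson deviance between observed counts and predicted means."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total


def d2_score(y_true, y_pred):
    """1 - deviance(model) / deviance(intercept-only model)."""
    mean_y = sum(y_true) / len(y_true)
    null_deviance = poisson_deviance(y_true, [mean_y] * len(y_true))
    return 1.0 - poisson_deviance(y_true, y_pred) / null_deviance


# Toy data, made up for illustration.
observed = [0, 1, 3, 2, 5]
predicted = [0.5, 1.0, 2.5, 2.0, 4.5]
average_only = [sum(observed) / len(observed)] * len(observed)

print("D^2 of toy predictions:", d2_score(observed, predicted))
print("D^2 of average-only model:", d2_score(observed, average_only))
```

Predicting the average everywhere scores exactly zero, which is why $D^{2}$ reads as "improvement over the intercept-only model".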

So, let's see what the `train.csv` $D^{2}$ scores are using the second set of parameters for the regression model.

In [37]:
d2_score = regressor_obj.score(
    cleaned_train_df[["categorical", "numeric",]],
    cleaned_train_df["outcome"],
    sample_weight= None,
);

print(
    "D^2 Score:",
    "{0:0.4f}".format(d2_score),
);

D^2 Score: 0.2905


### The $D^{2}$ Score

A score of `0.29` means that the ratio $\frac{d\left(y, \hat{y}\right)}{d_{null}}$ is about `0.71`: our fitted model's deviance is about 71% of the deviance of the simple (y-intercept-only) Model that predicts the average `outcome` everywhere. In other words, the regression explains only about 29% of the null deviance.

Ideally, we'd want a value much closer to `1`, so this isn't a great score. It suggests that a __Generalised Linear Model__ (GLM), at least as we've set it up here, probably isn't very accurate for this data.

Obviously, we've also skewed the data through our "imputations" by filling in all the `NaN` values with training data averages.

A follow-up to this could be to see how much the predictions and $D^{2}$ score change if we dropped all the samples with `NaN` values, and then only used the averages to fill in the `test.csv` data.
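That follow-up could start from something like the sketch below. The tiny DataFrames are made-up stand-ins for `train.csv` and `test.csv` (an assumption), and the variable names are ours.

```python
import pandas as pd

# Made-up stand-ins for the challenge data.
train = pd.DataFrame({
    "outcome": [0, 0, 1, 2, 3],
    "categorical": [3.0, None, 1.0, 2.0, 3.0],
    "numeric": [41.0, 40.0, None, 38.0, 44.0],
})
test = pd.DataFrame({
    "categorical": [None, 3.0, 1.0],
    "numeric": [71.0, 75.0, None],
})

feature_columns = ["categorical", "numeric"]

# Keep only the training rows where both inputs are present.
complete_train = train.dropna(subset=feature_columns)

# Averages come from the complete training rows only.
fill_values = complete_train[feature_columns].mean()

# The test rows still need filling, since we must predict all of them.
filled_test = test.fillna(fill_values)

print("Training rows kept:", len(complete_train), "of", len(train))
print(filled_test)
```

The model would then be refitted on `complete_train` and scored the same way as before, so the two approaches can be compared like for like.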

We should also point out that the predicted values with fractional parts near `0.4` or `0.5` (rows 0, 2, and 3) were exactly the rows of `test.csv` that had `NaN` values. This may mean that our Arithmetic Mean fill-ins distorted those predictions, since the other two samples landed almost exactly on `4.0` and `3.0`. That said, a raw value close to an integer isn't by itself evidence of a better prediction. Through better inspection of the data (plotting, simple statistics, Kernel Density Estimation, _etc._), we could probably come up with a better "imputing" (missing-data filling) approach that would hopefully move our predictions towards more "realistic" outcomes.

As we mentioned above, values like `2.5` could either be `2` or `3`, with no real indication towards one or the other. In this case, we would __want__ to have biasing towards `2` or `3`, to make the prediction more "self-assured". The trouble with injecting/forcing the bias is that we'd need to have a good rationale behind it. That's partly why we start with the "naïve" approach, because we want to just see what happens when we do something simple. The sample Arithmetic Mean is an "unbiased" estimator of the true (population) Mean: on average, it neither over- nor under-estimates it. As such, filling in data with the Arithmetic Mean shouldn't systematically shift a column's average one way or the other, although it does shrink the column's variance, and for the `categorical` column it can produce a value that isn't one of the actual categories.

However, the `numeric` and `categorical` means and distributions are likely very different, along with their ranges. So a filled-in value helps or hurts the prediction by a different amount for each column, depending on how well our predictor performs for that independent variable. And that's where the deeper introspection would come into play. We'd want to figure out how good our linear fit is (our regression model) and adjust it as necessary, to move (bias) our predictor towards "better" predictions. And, as such, we'd want to make sure that values we fill in don't overwhelm the pre-existing data. So, if one independent variable is more variant than the other, we'd want to bias our "imputations" (filled-in values) to fight against that variance so that we get something more realistic.
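One option we haven't tried here, sketched under our own assumptions, is to treat `categorical` as a true category: give missing entries their own "missing" level and one-hot encode the column, while filling `numeric` with a training-set statistic (the median here, as one possibility). The level list `1, 2, 3` is assumed from the values visible in the printed rows, the DataFrames are made-up stand-ins, and `encode_features` is a hypothetical helper name.

```python
import pandas as pd

# Made-up stand-ins for the challenge data.
train = pd.DataFrame({
    "categorical": [3.0, 1.0, None, 2.0, 3.0],
    "numeric": [41.0, None, 40.0, 38.0, 44.0],
})
test = pd.DataFrame({
    "categorical": [None, 3.0, 1.0],
    "numeric": [71.0, 75.0, None],
})

# Assumed category levels, plus an explicit level for missing values.
LEVELS = ["1", "2", "3", "missing"]


def encode_features(df, numeric_fill):
    """One-hot encode `categorical` and fill missing `numeric` values."""
    labels = df["categorical"].map(
        lambda value: "missing" if pd.isna(value) else str(int(value))
    )
    # A fixed category list keeps train and test columns identical.
    labels = pd.Series(
        pd.Categorical(labels, categories=LEVELS), index=df.index
    )
    encoded = pd.get_dummies(labels, prefix="categorical", dtype=float)
    encoded["numeric"] = df["numeric"].fillna(numeric_fill)
    return encoded


numeric_fill = train["numeric"].median()  # computed from training data only
train_X = encode_features(train, numeric_fill)
test_X = encode_features(test, numeric_fill)

print(train_X)
print(test_X)
```

Fixing the level list up front means a level that happens to be absent from `test.csv` still gets its (all-zero) column, so the fitted model sees the same inputs at prediction time.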