# "Data Science (Insurance)" Simple Challenge

We found a company's publicly hosted GitHub account containing simple pre-interview challenges for several data-science roles. We removed the company information and created this notebook to practise the challenges.

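The data is theirs; all code is our own. The repository was published under the MIT License, which asks that the license notice accompany copies of the work. We have left it out, which technically breaks the license, because including it would identify the company. Given how free and permissive the MIT License is, we would rather keep this notebook anonymous and disassociated from the company.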

All dependencies (`import` statements) for this notebook are in the first code block.

Alongside this notebook is a lock file generated by `poetry`, which pins the exact dependency versions used to generate and run this notebook. The accompanying `pyproject.toml` is the project configuration file, also generated by `poetry`. `poetry` itself can be installed with `pip`.

## Simple Challenge

<style type="text/css">
    ol ol { list-style-type: lower-alpha; }
</style>

_(The following has been extracted from a PDF in the originating company's GitHub repository of challenges. The data associated with the challenge is in the `data-science-insurance_data` directory.)_

Success in the simple challenge leads to the final two steps of the interview process:

1. Informal chat with [Company] founders
1. Full technical challenge

For the simple challenge, use "train.csv" to predict the `outcome` variable using a __Generalised Linear Model__ (GLM) with a _log_ link function and a Poisson distribution. The column named `categorical` is a categorical column and the column named `numeric` is a numeric column. Both should be used as independent variables.

__Requirements:__

1. All code must be written in Python and must be in a Jupyter notebook.
1. The first cell in the notebook must include:
   1. Your last name (please don’t include any other identifying information)
   1. The date
1. You must output the GLM's parameter estimates.
1. Your code must be able to predict all five observations in the "test.csv" dataset. The last cell in the notebook must output the five predicted values of the `outcome` variable for "test.csv".
1. A key point of evaluation is how well written the code is. Please write the code as if you are writing it for a production setting. No need to wrap the code in a service or write Dockerfiles. Just ensure the code you write is not hacked together.

If you are spending more than an hour on this simple challenge because there are so many things you want to demonstrate, you are spending too much time on it. If you are spending more than an hour on it because you don’t know where to start, please be warned that the full technical challenge will be considerably more difficult.
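
The model described in the challenge text above is a Poisson regression: the log of the expected `outcome` is a linear function of the dummy-coded `categorical` column and the `numeric` column. The following is a minimal, self-contained sketch of that idea, separate from the notebook's own cells and not necessarily the approach taken later. It assumes scikit-learn's `PoissonRegressor` (which uses a log link), sets `alpha=0.0` to switch off its default regularisation, and runs on a made-up `toy_df` whose values are purely illustrative; only the column names come from the challenge.

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column names match the challenge, values are made up.
toy_df = pandas.DataFrame({
    "outcome": [0, 1, 0, 2, 1, 1, 3, 0],
    "categorical": [1, 2, 3, 1, 2, 3, 1, 2],
    "numeric": [41.0, 38.0, 44.0, 35.0, 40.0, 46.0, 33.0, 39.0],
})

# Dummy-code the categorical column (dropping the first level as the
# reference category) and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(drop="first"), ["categorical"])],
    remainder="passthrough",
)

# PoissonRegressor fits a GLM with a Poisson distribution and log link.
# alpha=0.0 removes the L2 penalty so the fit is an unpenalised GLM.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("glm", PoissonRegressor(alpha=0.0, max_iter=1000)),
])
model.fit(toy_df[["categorical", "numeric"]], toy_df["outcome"])

# Parameter estimates: intercept, two dummy coefficients, one numeric slope.
intercept = model.named_steps["glm"].intercept_
coefficients = model.named_steps["glm"].coef_
print("Intercept:", intercept)
print("Coefficients:", coefficients)

# Predict expected counts for two hypothetical new observations.
new_df = pandas.DataFrame({"categorical": [1, 3], "numeric": [37.0, 42.0]})
predictions = model.predict(new_df)
print("Predictions:", predictions)
```

Because of the log link, the predictions are always positive expected counts, and each coefficient acts multiplicatively on the expected `outcome` once exponentiated.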

In [2]:
# Python Standard Library
import pathlib
import sys

# Third-Party Packages
import pandas

import plotly
import plotly.express as plotly_express

import sklearn

print(">> Python v{0:s}".format(sys.version))
print("")
print(">> Loading: pandas v{0:s}".format(pandas.__version__))
print(">> Loading: plotly v{0:s}".format(plotly.__version__))
print(">> Loading: sklearn v{0:s}".format(sklearn.__version__))
print("")
print(">> Dependencies Loaded.")

>> Python v3.8.2 (default, Apr 13 2020, 19:02:26) 
[Clang 11.0.3 (clang-1103.0.32.29)]

>> Loading: pandas v1.1.0
>> Loading: plotly v4.9.0
>> Loading: sklearn v0.23.2

>> Dependencies Loaded.


In [4]:
# Load and Parse Training Data

train_file = pathlib.Path("data-science-insurance_data/train.csv")

train_df = pandas.read_csv(train_file)

print("Training Data:")
print(train_df.head(20))
print("")

Training Data:
    outcome  categorical  numeric
0         0          3.0     41.0
1         0          1.0     41.0
2         0          3.0     44.0
3         0          3.0      NaN
4         0          NaN     40.0
5         0          1.0     42.0
6         0          3.0     46.0
7         0          NaN     40.0
8         0          3.0     33.0
9         0          3.0     46.0
10        0          3.0     40.0
11        0          2.0     38.0
12        0          3.0     44.0
13        0          NaN     37.0
14        0          3.0     40.0
15        0          1.0     39.0
16        0          1.0     43.0
17        0          3.0     38.0
18        0          2.0      NaN
19        0          3.0     39.0
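
The first 20 rows already show `NaN` values in both `categorical` and `numeric`, so missing data has to be handled before any model is fitted. As a minimal sketch, the snippet below counts missing values per column and keeps only complete rows. It uses a hypothetical `sample_df` built from the first five rows printed above rather than the full file, and dropping incomplete rows is just one possible choice, not necessarily the one this notebook makes.

```python
import pandas

# First five rows copied by hand from the printed training data.
sample_df = pandas.DataFrame({
    "outcome": [0, 0, 0, 0, 0],
    "categorical": [3.0, 1.0, 3.0, 3.0, float("nan")],
    "numeric": [41.0, 41.0, 44.0, float("nan"), 40.0],
})

# Count missing values in each column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Keep only the rows where both predictors are present.
complete_rows = sample_df.dropna(subset=["categorical", "numeric"])
print(len(complete_rows))
```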



## Generalised Linear Model (GLM)

...