This notebook is designed to assess some of your data science skills. Explain your answers as clearly as you can. _Every_ step is important (programming, data handling, machine learning, critical thinking and written explanations).

You will work on the following project:

# Project Description

You will analyze simulated claims and property data and build a model that predicts claim frequency from property attributes.

# Setup

## 1. Create a git repo

You should submit your work in the form of a git repository that contains:
1. your source code
2. a `.yml` file specifying your conda environment
3. One or more Jupyter notebooks
4. a README file

Requirements:
- **NO** data should be committed to the repo, not even pickled files. Only text files and Jupyter notebooks.
- The README file must contain a summary of your methodology and results. 
- Your code and notebook(s) must be commented so that someone else can understand what you did
- All code should be written in Python 3.7 or greater

Advice:
- If you know how to set annotated git tags, showcase it with one or more tags
- Divide your work into different notebooks that represent your main activities. For example, data exploration can happen in one notebook, data preprocessing in a second one, and model training and evaluation in a third one. This is just a suggestion. What matters is that your work can be quickly grasped by another data scientist.
- Commit frequently. In particular, avoid changing two different notebooks in a single commit
- The person who will review your work will: 1. copy your repo on their computer, 2. reproduce your conda environment, 3. run the notebook in which you train your final model, 4. evaluate your model on held-out data. So make sure all of this works on your end. 

## 2. Create a conda environment

- All the notebooks in your repo must use the same conda environment.
- Your conda environment must be specified with a platform-agnostic `.yml` file. ([This resource](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#exporting-an-environment-file-across-platforms) can help)
- When you have the option to install a library with either `conda` or `pip`, use `conda`.
- Make sure that you add the `ipykernel` library as one of your conda dependencies. This will allow us to register your environment as a Jupyter kernel when evaluating your work.

## 3. Inspect the data

You are free to inspect the data in any way you wish. Plots and comments are welcome. However, to avoid overfitting, we recommend you perform your exploratory data analysis on a random subset of the data only. The performance of your model will be measured on our end with held-out data that you don't have access to, so overfitting is a risk.

### Description of the data

The data for this project is contained in three distinct files: `policies.xlsx`, `claims.parquet`, `properties.csv`.

Each row in the `policies.xlsx` file corresponds to a single policy term (or contract). A policy term is uniquely identified by its policy number `pol` **and** its start date `start` (a term is a time period that falls between a start date and an end date). 

Each policy covers a set of properties (or buildings). Each property is listed as a row of the `properties.csv` dataset (`prop_id` is a primary key). To know which policy a property belongs to, use the `pol` column. The three property attributes used in this project are `age` (the age of the building in years), `state` (a US state) and `sqft` (the square footage of the building).

Finally, during each policy term, and for each property, claims can occur. Each claim is recorded in the `claims.parquet` dataset. Use `pol` and `start_date` to match a claim back to its policy term; the `property` column is a foreign key referencing `prop_id` in the `properties.csv` dataset.
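
The relationships described above can be sketched with pandas on a few made-up rows. This is only an illustration under stated assumptions: the `end` column name in the policies table is hypothetical, all values are invented, and every property of a policy is assumed to be covered during every term of that policy. The real files would be loaded with the appropriate pandas readers instead of being built in memory.

```python
import pandas as pd

# Tiny in-memory stand-ins for the three files (all values are made up).
policies = pd.DataFrame({
    "pol": ["A", "A", "B"],
    "start": pd.to_datetime(["2020-01-01", "2021-01-01", "2020-06-01"]),
    # "end" is an assumed column name; the brief only says a term has an end date.
    "end": pd.to_datetime(["2021-01-01", "2022-01-01", "2021-06-01"]),
})
properties = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "pol": ["A", "A", "B"],
    "age": [10, 35, 5],
    "state": ["NY", "NY", "CA"],
    "sqft": [2000, 1000, 1500],
})
claims = pd.DataFrame({
    "pol": ["A", "A", "B"],
    "start_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-06-01"]),
    "property": [1, 1, 3],
})

# One row per (property, policy term). Assumption: each property of a policy
# is covered during every term of that policy.
prop_terms = properties.merge(policies, on="pol", how="inner")

# Number of claims per (policy, term start, property), renamed to the
# key names used in the other two tables.
claim_counts = (
    claims.groupby(["pol", "start_date", "property"])
    .size()
    .reset_index(name="n_claims")
    .rename(columns={"start_date": "start", "property": "prop_id"})
)

# Left join so that property-terms without any claim are kept with a count of 0.
dataset = prop_terms.merge(claim_counts, on=["pol", "start", "prop_id"], how="left")
dataset["n_claims"] = dataset["n_claims"].fillna(0).astype(int)
```

The left join matters here: an inner join would silently drop property-terms with no claims, which would bias any frequency estimate upward.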

# Tasks

**Deliverable:** Your README file should contain detailed explanations of how to do the following:
1. retrain your final model from scratch
2. run your model on held-out data (our held-out data will be in exactly the same format as the data you are given here)

## Modeling

Your task is to build a claim frequency model that takes as input the attributes of a single property (state, square footage and building age) and outputs its expected claim frequency. _Claim frequency_ is defined as the number of claims _per unit of exposure_. We would like you to use "year times 1,000 square feet" as your unit of exposure. Here are two examples.

_Example 1_: A building is 2,000 square feet and is covered for 1 year; this corresponds to 2 units of exposure.
If your model outputs 3.23 for this building, you predict that, on average, this building will generate 3.23 × 2 = 6.46 claims per year (it accrues 2 units of exposure per year).

_Example 2_: A building is 1,000 square feet and is covered for 2 years; this also corresponds to 2 units of exposure.
If your model outputs 3.23 for this building, you predict that, on average, this building will generate 3.23 × 1 = 3.23 claims per year (it accrues 1 unit of exposure per year).
 
In other words, your model always outputs the predicted number of claims per thousand square feet for 1 year of coverage of that building. 
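
The arithmetic of the two examples can be written as a short sketch. The function names `exposure_units` and `expected_claims_per_year` are hypothetical helpers introduced only for illustration.

```python
def exposure_units(sqft: float, years: float) -> float:
    """Exposure in units of 'year times 1,000 square feet'."""
    return (sqft / 1000.0) * years


def expected_claims_per_year(frequency: float, sqft: float) -> float:
    """Expected yearly claim count implied by a predicted claim frequency."""
    # One year of coverage accrues sqft / 1000 units of exposure.
    return frequency * exposure_units(sqft, years=1.0)


# Example 1: 2,000 sq ft covered for 1 year.
example_1_exposure = exposure_units(2000, 1)
example_1_claims = expected_claims_per_year(3.23, 2000)

# Example 2: 1,000 sq ft covered for 2 years.
example_2_exposure = exposure_units(1000, 2)
example_2_claims = expected_claims_per_year(3.23, 1000)
```

Both buildings have the same total exposure, but the predicted yearly claim count differs because the larger building accrues exposure twice as fast.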

**Tips**: You are free to pick your modeling approach. We suggest you take inspiration from [here](https://en.wikipedia.org/wiki/Poisson_regression#%22Exposure%22_and_offset) and [here](https://scikit-learn.org/stable/auto_examples/linear_model/plot_poisson_regression_non_normal_loss.html#sphx-glr-auto-examples-linear-model-plot-poisson-regression-non-normal-loss-py). We also recommend you use the [pandas](https://pandas.pydata.org/docs/index.html) and [scikit-learn](https://scikit-learn.org/) libraries.
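
One common way to account for exposure, and the one used in the linked scikit-learn example, is to fit a Poisson regressor on the observed frequency (claims divided by exposure) and pass the exposure as `sample_weight`. The sketch below illustrates this on synthetic data; the single `age` feature, the coefficients of the data-generating process and the hyperparameters are all assumptions, and this is not meant as the expected solution.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic data (made up): one numeric feature and a random exposure per row.
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 50, size=n)          # building age in years
exposure = rng.uniform(0.5, 5.0, size=n)  # years x thousands of square feet
true_freq = np.exp(-1.0 + 0.02 * age)     # assumed data-generating frequency
n_claims = rng.poisson(true_freq * exposure)

# Target is the observed frequency; exposure enters as a sample weight.
X = (age / 10.0).reshape(-1, 1)           # crude rescaling to help the solver
y = n_claims / exposure

model = PoissonRegressor(alpha=1e-4, max_iter=300)
model.fit(X, y, sample_weight=exposure)

# Predictions are claim frequencies (claims per unit of exposure), always positive.
freq_hat = model.predict(X)
```

With this setup `model.predict` directly returns a frequency per unit of exposure, which matches the output convention requested above. A categorical feature such as `state` would need to be encoded (for example one-hot) before being passed to the regressor.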

### Performance evaluation

To assess the performance of your model on held-out data, we will plot a "lift chart", which is a graph that shows your predicted claim frequency on the x-axis and the ground-truth (empirical) claim frequency on the y-axis. More precisely:
1. On the x-axis, your model output (over the test set) will be binned into deciles (so there will be 10 points).
2. On the y-axis, the true average claim count (per unit of exposure) is displayed, where the average is taken within each of the bins from the previous bullet point.

The more the curve aligns with the identity line, the better your model performs.
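
A minimal sketch of how the points of such a chart could be computed is shown below. The function name `lift_table` and its arguments are hypothetical, and the exposure-weighted averaging within each bin is an assumption: the exact computation used for the evaluation is not specified here.

```python
import numpy as np
import pandas as pd


def lift_table(freq_hat, n_claims, exposure, n_bins=10):
    """Predicted vs. observed claim frequency per bin of predicted frequency."""
    df = pd.DataFrame({
        "freq_hat": np.asarray(freq_hat, dtype=float),
        "n_claims": np.asarray(n_claims, dtype=float),
        "exposure": np.asarray(exposure, dtype=float),
    })
    # Rank first so that tied predictions cannot produce duplicate bin edges.
    df["bin"] = pd.qcut(df["freq_hat"].rank(method="first"), q=n_bins, labels=False)
    # Predicted claim count per row, used for the exposure-weighted average.
    df["pred_claims"] = df["freq_hat"] * df["exposure"]
    sums = df.groupby("bin")[["pred_claims", "n_claims", "exposure"]].sum()
    return pd.DataFrame({
        "predicted": sums["pred_claims"] / sums["exposure"],
        "observed": sums["n_claims"] / sums["exposure"],
    })


# Toy check: if observed claims equal predicted claims exactly,
# every point lies on the identity line.
toy_pred = np.linspace(0.1, 1.0, 100)
toy_exposure = np.full(100, 2.0)
toy_table = lift_table(toy_pred, toy_pred * toy_exposure, toy_exposure)
```

Plotting `predicted` against `observed` together with the identity line then gives a chart of the kind described above.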

### Additional Analysis

Any additional analysis, conclusions or questions that you wish to present in your notebook or README file are welcome. In particular, you could mention (with quantitative arguments):
1. the different impact that each feature may have on the claim frequency
2. the caveats with which one should trust your model's output (if it were to be used for business purposes, for instance)

## Processing

As mentioned earlier, we would like you to write Python code that will read the raw data and transform it in an appropriate way to train and test your model.

To this end, we need from you:
- A function (`function_1` below) that takes as input the three file paths and returns the X dataframe that gets fed to the final model, together with the corresponding target `y` (as in the snippet below). We will run this function on our held-out data.
- Another function (`function_2` below) that returns the fitted model, ready to be tested. So model training and model selection all happen inside `function_2`. If running `function_2` takes a long time, please mention it somewhere. We will run `function_2` on the data we sent you.

This way, when assessing your work, we can do:
```
>>> X, y = function_1(<heldout path1>, <heldout path2>, <heldout path3>)
>>> model = function_2(<original path1>, <original path2>, <original path3>)
>>> y_hat = model.predict(X)
>>> lift_chart(y, y_hat)
```