# Machine Learning Master Class

## Classification example

In this notebook we address the problem of predicting the probability of insurance claims, a critical task for risk assessment and policy pricing. Based on features such as the type of insurance policy, the length of time the policy has been active, and the ages of the policyholder and the vehicle, we try to predict whether each policy is likely to result in a claim (1) or not (0).

To tackle this problem we will follow the typical Machine Learning pipeline, in which the following steps are performed:
1. Data loading
2. Exploratory analysis of the data
3. Preprocessing of the data
4. Model training
5. Model evaluation and benchmarking
6. Final model selection

In [1]:
SEED = 1

## 1. Loading the data

Read the data stored in ``data/Insurance claims data.csv``. Inspect it and determine how many samples the dataset contains.
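
One possible way to approach this step is sketched below with pandas. The helper names `load_claims` and `describe_shape` are illustrative and not part of the notebook; the path is the one given in the statement and may need adjusting to your local layout.

```python
from pathlib import Path

import pandas as pd

# Path taken from the exercise statement; adjust it if your folder layout differs.
DATA_PATH = Path("data/Insurance claims data.csv")


def load_claims(path=DATA_PATH):
    """Read the CSV file into a pandas DataFrame."""
    return pd.read_csv(path)


def describe_shape(df):
    """Return (number of samples, number of columns) of the DataFrame."""
    n_samples, n_columns = df.shape
    return n_samples, n_columns
```

Typical usage would be `df = load_claims()`, followed by `describe_shape(df)` and `df.head()` to look at the first rows.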

## 2. Exploratory Data Analysis
Analyze the main statistics of the dataset and plot the estimated distribution of the following variables: ``subscription_length``, ``vehicle_age``, ``customer_age``, ``region_code``, ``segment`` and ``fuel_type``. What are the main differences between these variables?

### Statistics
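
A minimal sketch of how the main statistics could be collected, assuming the DataFrame contains both numeric and non-numeric columns. The function name `summarize` is hypothetical.

```python
import pandas as pd


def summarize(df):
    """Return summary tables for numeric and non-numeric columns.

    Assumes the DataFrame has at least one column of each kind.
    """
    numeric = df.select_dtypes(include="number")
    other = df.select_dtypes(exclude="number")
    # For numeric data describe() gives count, mean, std, min, quartiles and max.
    numeric_stats = numeric.describe().T
    # For non-numeric data describe() gives count, unique, top (mode) and freq.
    other_stats = other.describe().T
    # Missing values per column, worth checking before any preprocessing.
    missing = df.isna().sum()
    return numeric_stats, other_stats, missing
```

Calling `summarize(df)` on the loaded data returns three tables that can be displayed one after another in the notebook.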

### Distributions

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

**Distribution of numerical variables**

**Distribution of categorical variables**
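
For categorical variables a bar chart of category counts is usually more informative than a density estimate. The following is a sketch under that assumption; `plot_categorical` is a hypothetical helper name.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_categorical(df, columns):
    """Draw one bar chart of category counts per categorical column."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(5 * len(columns), 4), squeeze=False
    )
    for ax, col in zip(axes.ravel(), columns):
        # Order the bars from the most to the least frequent category.
        order = df[col].value_counts().index
        sns.countplot(data=df, x=col, order=order, ax=ax)
        ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return fig
```

A call such as `plot_categorical(df, ["region_code", "segment", "fuel_type"])` would produce one panel per variable.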

**Distribution of target variable**

Plot the distribution of the target variable. What can we conclude from it? Is that a problem for a classification model? If so, how can we solve it?
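
As a numeric complement to the plot, the class counts and their shares can be tabulated directly. This is only a sketch; `class_balance` is a hypothetical helper, and it expects the target column as a pandas Series.

```python
import pandas as pd


def class_balance(y):
    """Return a table of class counts and shares, plus the imbalance ratio."""
    counts = y.value_counts().sort_index()
    share = counts / counts.sum()
    table = pd.DataFrame({"count": counts, "share": share})
    # Ratio between the largest and the smallest class (1.0 means balanced).
    imbalance_ratio = counts.max() / counts.min()
    return table, imbalance_ratio
```

For example, `class_balance(df["claim_status"])` returns the table and a single number summarizing how far the classes are from being balanced.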

## 3. Data transformations / pre-processing
In this section we transform the data to make it suitable for the ML models. This step usually involves processes such as balancing the dataset, encoding, feature selection, or normalization.

### Balancing the dataset
Balance the dataset so that it contains the same number of samples with ``claim_status=1`` as with ``claim_status=0``. Use upsampling to increase the number of samples of the minority class. Plot the previous distributions again using the balanced dataset.
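
A minimal sketch of random upsampling with `sklearn.utils.resample`, assuming a binary and imbalanced target. The helper name `upsample_minority` is hypothetical, and `SEED = 1` simply mirrors the value set in the first cell of the notebook.

```python
import pandas as pd
from sklearn.utils import resample

SEED = 1  # mirrors the seed defined at the top of the notebook


def upsample_minority(df, target="claim_status", seed=SEED):
    """Upsample the minority class until both classes have the same size.

    Assumes a binary target whose two classes differ in size.
    """
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]
    # Draw minority rows with replacement until they match the majority size.
    minority_up = resample(
        minority, replace=True, n_samples=len(majority), random_state=seed
    )
    balanced = pd.concat([majority, minority_up])
    # Shuffle so the two classes are not stored as contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

After `balanced = upsample_minority(df)`, `balanced["claim_status"].value_counts()` should show two equal counts. Note that upsampling duplicates existing rows rather than creating new information.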

### Data Transformation

To make the data suitable for the ML models that will be trained and evaluated, transform the categorical variables into numerical variables using ``LabelEncoder``.
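
The sketch below applies one `LabelEncoder` per categorical column and keeps the fitted encoders so that the integer codes can be mapped back to the original labels. The helper name `encode_categoricals` is an assumption, as is the rule used to detect categorical columns (everything that is neither numeric nor boolean); it also assumes there are no missing values in those columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(df):
    """Return an encoded copy of df and a dict of fitted encoders per column."""
    encoded = df.copy()
    encoders = {}
    # Assumption: every non-numeric, non-boolean column is categorical.
    categorical = encoded.select_dtypes(exclude=["number", "bool"]).columns
    for col in categorical:
        encoder = LabelEncoder()
        # Classes are sorted, then mapped to the integers 0..n_classes-1.
        encoded[col] = encoder.fit_transform(encoded[col])
        encoders[col] = encoder
    return encoded, encoders
```

The codes assigned by `LabelEncoder` follow the sorted order of the labels, so they carry no meaningful ranking. The scikit-learn documentation describes `LabelEncoder` as intended for target values and offers `OrdinalEncoder` and `OneHotEncoder` for input features.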

### Feature Selection
To avoid the *curse of dimensionality*, feature selection allows us to keep only the most relevant features for predicting the target variable. Select the top 5 most relevant features.
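
The statement does not fix a selection method, so the following sketch picks one possible option: univariate selection with `SelectKBest` and the ANOVA F-score (`f_classif`). The helper name `select_top_features` is hypothetical, and `X` is assumed to be a DataFrame of already encoded features.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def select_top_features(X, y, k=5):
    """Return the names of the k best features and all scores, sorted."""
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns)
    scores = scores.sort_values(ascending=False)
    # get_support() is a boolean mask over the original columns.
    selected = list(X.columns[selector.get_support()])
    return selected, scores
```

Other scoring functions such as `mutual_info_classif`, or the feature importances of a tree-based model, are reasonable alternatives and may rank the features differently.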

## 4. Model Training

Split the dataset into training and test sets with a 70-30 proportion, respectively. Train a Random Forest classifier on the training set.
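
A minimal sketch of this step with scikit-learn. The helper name `split_and_train` is hypothetical, `SEED = 1` mirrors the first cell of the notebook, and stratifying the split on the target is an assumption that the statement does not require.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 1  # mirrors the seed defined at the top of the notebook


def split_and_train(X, y, seed=SEED):
    """Split 70/30 and fit a Random Forest on the training part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.30,   # 70% train, 30% test
        stratify=y,       # assumption: keep class proportions in both parts
        random_state=seed,
    )
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model, (X_train, X_test, y_train, y_test)
```

The hyperparameters are left at their scikit-learn defaults here; only the random seed is fixed so that the results are reproducible.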

## 5. Model Evaluation
Evaluate the previously trained model by determining:
1. Precision
2. Recall
3. F1-score
4. Confusion Matrix: TP, FP, TN, FN
5. ROC Curve

### Obtaining the predictions
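
Two kinds of output are useful for the metrics that follow: hard class labels and the predicted probability of the positive class. The sketch below assumes a fitted scikit-learn classifier whose classes are 0 and 1; `get_predictions` is a hypothetical helper name.

```python
def get_predictions(model, X_test):
    """Return predicted labels and the probability of the positive class."""
    labels = model.predict(X_test)
    # Column 1 of predict_proba corresponds to model.classes_[1],
    # which is the positive class when the classes are 0 and 1.
    proba = model.predict_proba(X_test)[:, 1]
    return labels, proba
```

The labels feed the classification report and the confusion matrix, while the probabilities are what the ROC curve needs.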

### Classification report (Precision, Recall, F1-score, Accuracy)
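
A short sketch that produces both the text report and the individual metrics for the positive class (label 1). The helper name `report_metrics` is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)


def report_metrics(y_true, y_pred):
    """Return the text report and a dict of metrics for the positive class."""
    text = classification_report(y_true, y_pred, digits=3)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    metrics = {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "accuracy": float(accuracy_score(y_true, y_pred)),
    }
    return text, metrics
```

Printing `text` shows the per-class breakdown, which is worth reading alongside the accuracy when the classes are imbalanced.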

### Confusion Matrix
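
For a binary problem with labels 0 and 1, scikit-learn returns the confusion matrix with rows as true classes and columns as predicted classes, so flattening it yields TN, FP, FN, TP in that order. The helper name `confusion_counts` is hypothetical.

```python
from sklearn.metrics import confusion_matrix


def confusion_counts(y_true, y_pred):
    """Return TP, FP, TN and FN for a binary problem with labels 0 and 1."""
    # Fixing the label order guarantees the layout [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"TP": int(tp), "FP": int(fp), "TN": int(tn), "FN": int(fn)}
```

To draw the matrix instead of listing the counts, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` is one option.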

### ROC Curve

In [3]:
from sklearn.metrics import roc_curve, auc
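
Using the two functions imported above, the ROC curve and its area can be sketched as follows. The helper name `plot_roc` is an assumption; `y_score` is expected to hold the predicted probability of the positive class, not the hard labels.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve


def plot_roc(y_true, y_score):
    """Plot the ROC curve and return the figure and the area under it."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.3f})")
    # The diagonal is the expected curve of a random classifier.
    ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return fig, roc_auc
```

With a fitted model this would be called as `plot_roc(y_test, model.predict_proba(X_test)[:, 1])`.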

Do you consider the evaluation methodology followed to be rigorous and reliable? Why? What aspects could be improved?
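
As one possible direction, and not necessarily the intended answer, the sketch below replaces the single 70-30 split with stratified k-fold cross-validation and upsamples the minority class inside each training fold only. When upsampling with replacement is applied before splitting, copies of the same row can end up in both the training and the test set, which tends to inflate the scores. The helper name `cross_validated_f1` is hypothetical, and the code assumes that class 1 is the minority class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def cross_validated_f1(X, y, n_splits=5, seed=1):
    """Return one F1-score per fold, upsampling only the training folds."""
    X, y = np.asarray(X), np.asarray(y)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        minority = y_tr == 1  # assumption: class 1 is the minority class
        n_majority = int((~minority).sum())
        # Upsample the minority rows of the training fold with replacement.
        X_up, y_up = resample(
            X_tr[minority], y_tr[minority],
            replace=True, n_samples=n_majority, random_state=seed,
        )
        X_bal = np.vstack([X_tr[~minority], X_up])
        y_bal = np.concatenate([y_tr[~minority], y_up])
        model = RandomForestClassifier(random_state=seed).fit(X_bal, y_bal)
        # The test fold keeps its original, imbalanced distribution.
        scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
    return np.array(scores)
```

Reporting the mean and standard deviation of the returned scores gives an idea of how much the result depends on the particular split.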