# Datacamp: Claims scoring for automatic acceptance for BNP Paribas Cardif

## 1. Introduction

The dataset is stored in the `DataFrame_Claims_Spain.csv` file.

This datacamp is dedicated to developing a machine learning pipeline that automates claim processing. In short, automatic acceptance will speed up the procedure, improve the customer experience, and reduce management costs. However, these gains will be offset by additional payments caused by false-positive acceptances, which create new costs. Therefore, the pipeline should output probabilities, so that an appropriate trade-off can be found between accepting claims automatically and incurring these new costs.

## 2. Available features

The provided dataset contains the following features, which can be used to design the machine learning pipeline.

Features linked to the customer:

* Insured NIF: customer ID
* Age at signature: customer's age at insurance contract signature
* Sex
* Policy Holder date of birth: customer's date of birth

Features linked to insurance product:

* Risk code: coverage to be activated for a specific claim
* Initial coverage date: insurance coverage starting date

Features linked to claims:

* Claim number: ID of the claim
* Claim Incident date: date on which the incident occurred
* Declaration date: date on which the customer declared the claim

Features linked to the amount to be paid in case of a claim:

* Insured Amount: variable for the classical credit modality
* Initial_Instalment_Amount: monthly amount paid by the customer to repay the credit
    
Target to be predicted:

* Refusal_Flag: Yes / No

Additional information that only becomes available once a decision has been made, and that you should therefore not include in your pipeline:

* Claims cause 1
* Refused decision reason code
* Claim_Status_Level_0
* Refusal_Category

## 3. Datacamp organization

* Morning: data exploration using pandas
* Afternoon: development of a predictive model

## 4. Guidelines for developing a predictive model

* Use a `DummyClassifier` to get a dummy baseline.
* Evaluate this model using cross-validation. You can use a `StratifiedKFold` strategy.
* Check the accuracy score on both the training and testing sets. Is it an appropriate metric? Compare it with the ROC AUC.
* Develop a predictive model using a linear model. Use the appropriate preprocessing for this model.
* Compare this model with the dummy baseline.
* Optimize the hyperparameters of your model using `RandomizedSearchCV` or `GridSearchCV`. Does this improve the results?
* Can you think of additional preprocessing steps to add to your linear model, such as `SplineTransformer` or `PolynomialFeatures`?
* Evaluate such a model.
* Develop a predictive model based on `HistGradientBoostingClassifier`. Choose an appropriate preprocessing.
* Evaluate this model.
* Fine-tune the model and evaluate it again.
* Split the dataset into a training and a testing set. Compute the feature importances of this model using the function `permutation_importance`. Check the importances on both the training and testing sets.
* Instead of computing the ROC AUC, let's derive a business metric by assigning a cost or a gain to each entry of the confusion matrix (i.e. TP, TN, FP, FN). By default, scikit-learn classifiers use a cut-off point at a probability of 0.5. Vary this cut-off point and compute the business metric for the different thresholds. Which cut-off point gives the best business metric?
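As a starting point for the first steps of these guidelines, here is a minimal sketch that cross-validates a `DummyClassifier` and a linear model with `StratifiedKFold`, reporting both accuracy and ROC AUC. It runs on synthetic imbalanced data from `make_classification` rather than on the claims file, and the choice of `LogisticRegression` with `StandardScaler` is only one possible linear pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the claims data (roughly 90% / 10%).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
}

results = {}
for name, model in models.items():
    scores = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["accuracy", "roc_auc"],
        return_train_score=True,
    )
    # Average each metric over the folds.
    results[name] = {
        key: scores[key].mean()
        for key in ("train_accuracy", "test_accuracy", "test_roc_auc")
    }
    print(name, results[name])
```

On imbalanced data like this, the dummy model already reaches a high accuracy while its ROC AUC stays at chance level, which is one reason to look beyond accuracy.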
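The business metric described in the last guideline can be sketched as follows. The gain and cost amounts in `GAIN`, the `business_score` helper, and the convention that class 1 means "refused" are all hypothetical; the data is synthetic, so the resulting cut-off says nothing about the real claims.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical gain (> 0) or cost (< 0) per claim, with class 1 = refused.
# These amounts are placeholders, not figures from the business.
GAIN = {
    "tn": 10.0,    # valid claim accepted automatically: handling cost saved
    "fp": -5.0,    # valid claim flagged: sent to manual review
    "fn": -100.0,  # claim that should be refused is accepted: undue payment
    "tp": 0.0,     # claim correctly flagged for refusal
}


def business_score(y_true, proba_refused, threshold):
    """Sum the gains and costs of the confusion matrix at a given cut-off."""
    y_pred = (proba_refused >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn * GAIN["tn"] + fp * GAIN["fp"] + fn * GAIN["fn"] + tp * GAIN["tp"]


# Synthetic stand-in for the claims data.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_refused = model.predict_proba(X_test)[:, 1]

# Evaluate the metric on a grid of cut-off points: 0.01, 0.02, ..., 0.99.
thresholds = np.arange(1, 100) / 100
scores = np.array([business_score(y_test, proba_refused, t) for t in thresholds])
best_threshold = thresholds[scores.argmax()]
print(f"best cut-off: {best_threshold:.2f}, score: {scores.max():.0f}")
print(f"score at 0.50: {business_score(y_test, proba_refused, 0.5):.0f}")
```

Because the cut-off is selected on the same set that is used to report the score, the best score is optimistic; choosing the threshold on a validation split or inside cross-validation would give a fairer estimate.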