### This notebook provides a general overview of the activities performed

### Data and main goal of the analysis

The following <b>dataset</b> was used for the analysis:
https://www.kaggle.com/datasets/floser/french-motor-claims-datasets-fremtpl2freq

This dataset contains the losses from a French motor third-party liability portfolio.
    
The <b>main goal of the analysis</b> was to:
- find the best model to describe the <b>frequency of claim occurrence</b>, i.e. the probability that an individual policy had at least one claim in the observed time frame.
- accordingly, the problem was framed as a <b>binary classification</b>
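
As a minimal sketch of this framing, the claim count can be turned into a binary label. The column name `HasClaim` and the toy rows are assumptions made for illustration; only `ClaimNb` and `Exposure` are taken from the data set description.

```python
import pandas as pd

# Toy stand-in for the policy table (values are made up for illustration).
policies = pd.DataFrame(
    {"ClaimNb": [0, 1, 0, 3, 0], "Exposure": [1.0, 0.5, 0.2, 1.0, 0.8]}
)

# Hypothetical target column: 1 if the policy had at least one claim, else 0.
policies["HasClaim"] = (policies["ClaimNb"] > 0).astype(int)

# Share of policies with at least one claim in the toy data.
claim_share = policies["HasClaim"].mean()
print(policies)
print(f"Share of policies with a claim: {claim_share:.1%}")
```

Note that a label built this way ignores how many claims occurred and how long the policy was exposed.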

### 01) Initial data analysis

<b>Main Findings and Observations:</b>
- Data quality is good:
    - each policy ID appears only once in the data set
    - there are no missing values
    - there are no duplicate records<br><br>
    
- Findings and Decisions:
    - ClaimNb: there are some records for which the number of claims observed is far above 5
    - in my experience that seems a bit unrealistic, but as there is no other way to assess the quality now and the number of such records is small, the data are kept
    - Exposure: there are also some records where the exposure is > 1, i.e. more than 1 year
    - again, as the number of such records is small, the data are kept
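
The checks above could be reproduced roughly as follows. This is a sketch on a toy frame, not the notebook's actual code; the ID column name `IDpol` is an assumption, and the thresholds simply mirror the bullets above.

```python
import pandas as pd

# Toy stand-in for the policy table; "IDpol" is an assumed name for the policy ID.
toy = pd.DataFrame(
    {
        "IDpol": [1, 2, 3, 4],
        "ClaimNb": [0, 1, 11, 0],
        "Exposure": [1.0, 0.3, 1.0, 1.5],
    }
)

# One counter per data-quality question raised above.
checks = {
    "duplicate_ids": int(toy["IDpol"].duplicated().sum()),
    "duplicate_rows": int(toy.duplicated().sum()),
    "missing_values": int(toy.isna().sum().sum()),
    "claims_above_5": int((toy["ClaimNb"] > 5).sum()),
    "exposure_above_1": int((toy["Exposure"] > 1).sum()),
}
print(checks)
```

Counting the suspicious records first makes it easier to justify keeping them when their number is small.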

### 02) Visualisation and descriptive Analysis

<b>Main Findings and Observations:</b>
- Univariate analysis of numerical data:
    - most of the Exposure is 1 -> full 1y history of the policy
    - VehAge: there are quite a few outliers with old ages -> oldtimers?
    - BonusMalus: most of the portfolio is in the highest bonus class<br><br>
    
- Bivariate analysis of numerical data:
    - no clear dependencies are visible<br><br>
    
- Correlation analysis of numerical data:
    - the correlation analysis basically confirms that there are only weak linear dependencies in the data<br><br>
    

- Univariate analysis of claim occurrence frequency:
    - <b>overall, the frequency of a claim occurring is 9.5%</b>
    - Area: frequency worsens from A to F -> this suggests an ordinal meaning, so Area probably encodes some risk scoring
    - VehPower: moderate volatility; 4 classes dominate in terms of volume
    - VehAge: high frequency at age 0 -> volume is lower here, but the effect still looks material
    - Driver age: young drivers cause more claims; the biggest part of the portfolio is between 30 and 60 years
    - BonusMalus: positive correlation visible -> higher malus class -> higher frequency
    - Brand: B12 is a markedly worse segment and also has high volume
    - Density: some classes with outstandingly high frequency, but partially driven by low exposure
    - Region: quite volatile, with Region 24 showing by far the highest volume in the portfolio
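
A per-segment view like the one summarised above can be sketched with a grouped aggregation. The toy values and the `HasClaim` column are assumptions; the notebook may compute its frequency differently (for example, weighted by exposure).

```python
import pandas as pd

# Toy data: segment label, exposure in years, and a hypothetical binary claim flag.
toy = pd.DataFrame(
    {
        "Area": ["A", "A", "B", "B", "B", "C"],
        "Exposure": [1.0, 0.5, 1.0, 1.0, 0.5, 0.1],
        "HasClaim": [0, 0, 1, 0, 0, 1],
    }
)

# Volume (policies, exposure) next to the share of policies with a claim,
# so that a high frequency on a tiny segment is easy to spot.
by_area = toy.groupby("Area").agg(
    policies=("HasClaim", "size"),
    exposure=("Exposure", "sum"),
    claim_share=("HasClaim", "mean"),
)
print(by_area)
```

In the toy output, segment C shows the highest claim share but rests on a single policy with very little exposure, which is the pattern noted for Density above.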

### 03) Feature Preparation

- Main Activities:
    - an 80/20 train/test split is applied
    - as "Area" is expected to represent some kind of risk scoring, it is encoded ordinally
    - numerical features are scaled in two ways: using MinMaxScaler() and StandardScaler()
    - categorical features are one-hot encoded
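
The preparation steps above could be wired together roughly as shown below. This is a sketch under assumptions: the toy rows, the `HasClaim` target name, the choice of columns per transformer, and the `random_state` are all illustrative. Only the StandardScaler variant is shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy stand-in for the policy table (values are made up).
toy = pd.DataFrame(
    {
        "Area": ["A", "B", "C", "D", "E", "F", "A", "C", "E", "F"],
        "VehBrand": ["B1", "B12", "B2", "B12", "B1", "B2", "B12", "B1", "B2", "B12"],
        "VehAge": [0, 3, 10, 1, 7, 15, 2, 4, 6, 9],
        "BonusMalus": [50, 50, 60, 95, 50, 72, 50, 100, 54, 50],
        "HasClaim": [0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    }
)
X = toy.drop(columns="HasClaim")
y = toy["HasClaim"]

# 80/20 split; the seed is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocess = ColumnTransformer(
    [
        # Area is treated as ordered A < B < ... < F.
        ("area", OrdinalEncoder(categories=[["A", "B", "C", "D", "E", "F"]]), ["Area"]),
        ("num", StandardScaler(), ["VehAge", "BonusMalus"]),
        # Unknown brands in the test set are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["VehBrand"]),
    ]
)

# Fit on the training part only, then apply the same transformation to the test part.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the scalers and encoders on the training split only avoids leaking test-set statistics into the model.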

### 04) Modelling and final Conclusion

- Main Activities:
 
    - Besides a random baseline model, the following models are trained and optimised:
        - Logistic Regression
        - Random Forest Classifier
        - AdaBoost Classifier
        - Decision Tree Model using cost complexity pruning<br><br>
    - Data used: the StandardScaler version of the prepared numerical features
    - Target Metric: AUC<br><br>
- Final Conclusions:
    - The random forest with hyperparameters optimised by grid search seems to be the best model
    - it shows some slight overfitting, which is acceptable
    - Total <b>AUC on test data is almost 70%</b>
    - This score shows that claim occurrence is actually quite random (also supported by the results of the classification report)
    - Given that the main target of the exercise was to calculate the claim frequency, e.g. for pricing, it is a reasonable basis
    - applying the final model to the total data set, the AUC was 80%; as this includes the training data, it is more optimistic than the test AUC
    - the most important features of the model are Exposure, followed by BonusMalus, VehAge and DriverAge
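
The selection procedure for the random forest can be illustrated as follows. This is a sketch on synthetic data, not the notebook's actual grid: the parameter values, the stratified split, and the data from `make_classification` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the prepared features (about 10% positives).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # stratification is an assumption
)

# Small illustrative grid; AUC is the selection metric, as in the text above.
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [5, 20]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
train_auc = roc_auc_score(y_train, best_model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
importances = best_model.feature_importances_

print("best params:", search.best_params_)
print(f"train AUC: {train_auc:.3f}, test AUC: {test_auc:.3f}")
print("feature importances:", importances.round(3))
```

Comparing the train and test AUC side by side is one simple way to judge the amount of overfitting mentioned above.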

### Outlook and possible next steps

- Potential next steps:
    - try the MinMax scaled data and see the differences
    - go deeper into optimisation of hyperparameters for the selected model and see whether further improvements are possible
    - try models from other packages, e.g. LGBM or deep learning, and see whether the AUC can be improved even further
    - compare with a classical GLM that can directly model the claim frequency
    - as claims are relatively rare events, undersampling techniques could also be applied to put more weight on the positive labels
    - use better hardware to optimise run-time and test more parameters
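
For the GLM comparison, one possible starting point is a Poisson regression on the claim frequency with exposure as weight. The sketch below uses scikit-learn's `PoissonRegressor` on simulated data; the features, coefficients, and regularisation strength are assumptions, and a classical GLM package with an explicit exposure offset would be an equally valid choice.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Simulated portfolio: two numeric features, exposure in years, Poisson claim counts.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
exposure = rng.uniform(0.1, 1.0, size=n)
true_rate = np.exp(-2.0 + 0.5 * X[:, 0])  # assumed annual claim rate
claim_nb = rng.poisson(true_rate * exposure)

# Model the annualised frequency and weight each policy by its exposure.
frequency = claim_nb / exposure
glm = PoissonRegressor(alpha=1e-4, max_iter=300)
glm.fit(X, frequency, sample_weight=exposure)

# Predicted claims per year, and expected claims for the observed exposure.
predicted_frequency = glm.predict(X)
expected_claims = predicted_frequency * exposure
print("coefficients:", glm.coef_.round(3), "intercept:", round(glm.intercept_, 3))
```

Unlike the binary classification setup, this formulation uses the full claim count and accounts for policies that were observed for less than a year.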

### Appendix: main libraries used

pandas==2.0.0<br>
numpy==1.24.2<br>
matplotlib==3.7.1<br>
seaborn==0.12.2<br>
scikit-learn==1.2.2<br>