# Coding Challenge HUK-COBURG

## Business Understanding

* Given: a dataset of French car insurance policies and the damage claims filed against them
* Time budget: 5h
* Goal: predict the yearly damage claims for an insurance holder
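
To make the prediction target concrete, here is a minimal sketch. It assumes that "yearly damage claims" means the total claim amount of a policy scaled to one full year of exposure. That is an interpretation, not something the task states, and the column `YearlyClaimAmount` and the toy values are made up for illustration.

```python
import pandas as pd

# Toy policies; column names follow the freMTPL2 datasets, values are made up.
toy = pd.DataFrame({
    "IDpol": [1, 2, 3],
    "Exposure": [0.5, 1.0, 0.25],        # share of a year the policy was active
    "ClaimAmount": [500.0, 0.0, 100.0],  # total claim amount in that period
})

# Assumed target: claim amount extrapolated to one full year of exposure.
toy["YearlyClaimAmount"] = toy["ClaimAmount"] / toy["Exposure"]
print(toy)
```

Under this reading, policies with very short exposure can produce very large yearly values, which is worth keeping in mind when checking the data.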

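Questions:

* What is BonusMalus? It corresponds to the German Schadenfreiheitsrabatt (no-claims discount): a premium discount for driving without any claims for a certain period. In a bonus-malus system, the malus side is a surcharge after claims.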

## Data Understanding

In [None]:
import pandas as pd 
import arff

In [None]:
# Load features and frequency
data_freq = arff.load('data/freMTPL2freq.arff')
df_freq = pd.DataFrame(data_freq, columns=["IDpol", "ClaimNb", "Exposure", "Area", "VehPower", 
                                           "VehAge","DrivAge", "BonusMalus", "VehBrand", "VehGas", 
                                           "Density", "Region"])

In [None]:
# Load Claim Amount
data_sev = arff.load('data/freMTPL2sev.arff')
df_sev = pd.DataFrame(data_sev, columns=["IDpol", "ClaimAmount"])

#### Features

In [None]:
df_freq.describe()

In [None]:
# Test that IDpol is unique
assert df_freq["IDpol"].is_unique

In [None]:
df_freq.head(5)

In [None]:
# Show categorical variables
print("Area", list(df_freq["Area"].unique()))
print("VehBrand", list(df_freq["VehBrand"].unique()))
print("VehGas", list(df_freq["VehGas"].unique()))
print("Region", list(df_freq["Region"].unique()))

#### Claims

In [None]:
df_sev.describe()

In [None]:
# Sum the claim amounts per policy (a policy can have several claims)
df_total_claim = df_sev.groupby("IDpol").sum()

In [None]:
df_total_claim.plot(kind='hist', bins=100, logy=True)

#### Merge DataFrames

In [None]:
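# Index by policy ID so the summed claim amounts align on IDpol
df = df_freq.set_index("IDpol")
df["ClaimAmount"] = df_total_claim["ClaimAmount"]
# Policies without an entry in the severity data have no recorded claim amount
df["ClaimAmount"] = df["ClaimAmount"].fillna(0)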
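
The merge relies on pandas aligning the assigned column on the `IDpol` index. The sketch below shows that behaviour on toy data. The frames `toy_freq` and `toy_sev` and their values are made up, and filling only `ClaimAmount` with 0 is an assumption about what a missing amount should mean.

```python
import pandas as pd

# Toy stand-ins for df_freq and df_sev; values are made up.
toy_freq = pd.DataFrame({"IDpol": [1, 2, 3], "ClaimNb": [2, 0, 1]})
toy_sev = pd.DataFrame({"IDpol": [1, 1, 4], "ClaimAmount": [100.0, 50.0, 70.0]})

# One total per policy: policy 1 has two claims that are summed.
toy_total = toy_sev.groupby("IDpol")["ClaimAmount"].sum()

toy = toy_freq.set_index("IDpol")
# Assignment aligns on the IDpol index: policies missing from toy_sev get NaN,
# and policy 4, which is not in toy_freq, is dropped without a warning.
toy["ClaimAmount"] = toy_total
toy["ClaimAmount"] = toy["ClaimAmount"].fillna(0)
print(toy)
```

Policy 3 ends up with `ClaimNb` 1 but a `ClaimAmount` of 0, and policy 4 disappears entirely. Both cases are easy to miss after the merge.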

## Data Quality

#### How many insurance holders have claims but no recorded amount?