# Analyze the report of Swedish Motor Insurance

### Question

The data give details of third-party motor insurance claims in Sweden for the year 1977. In Sweden, all motor insurance companies apply identical risk arguments to classify customers, so their portfolios and claims statistics can be combined. The data were compiled by a Swedish Committee on the Analysis of Risk Premium in Motor Insurance. The Committee was asked to analyze the real influence of the risk arguments on the claims and to compare this structure with the actual tariff.


The insurance dataset holds 7 variables, which are described below:<br>
**Variable Description**<br>
Kilometers travelled per year<br>
            1: < 1000<br>
            2: 1000-15000<br>
            3: 15000-20000<br> 
            4: 20000-25000<br> 
            5: > 25000<br>
Geographical zone<br> 
            1: Stockholm, Göteborg, and Malmö with surroundings<br> 
            2: Other large cities with surroundings<br> 
            3: Smaller cities with surroundings in southern Sweden<br> 
            4: Rural areas in southern Sweden<br> 
            5: Smaller cities with surroundings in northern Sweden<br> 
            6: Rural areas in northern Sweden<br> 
            7: Gotland<br>
Bonus :     No claims bonus; equal to the number of years, plus one, since the last claim<br>
Make :      1-8 represents eight different common car models. All other models are combined in class 9.<br>
Insured:    Number of insured in policy-years<br>
Claims:     Number of claims<br>
Payment :   Total value of payments in Skr (Swedish Krona)<br>
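
Because Kilometres and Zone are stored as integer codes, it can help to attach the labels from the description above when reading tables. The following is a minimal sketch on a tiny made-up frame; the dictionary names and the `sample` frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Labels copied from the variable description; the dictionary names are illustrative.
KILOMETRES_LABELS = {
    1: "< 1000",
    2: "1000-15000",
    3: "15000-20000",
    4: "20000-25000",
    5: "> 25000",
}
ZONE_LABELS = {
    1: "Stockholm, Göteborg, and Malmö with surroundings",
    2: "Other large cities with surroundings",
    3: "Smaller cities with surroundings in southern Sweden",
    4: "Rural areas in southern Sweden",
    5: "Smaller cities with surroundings in northern Sweden",
    6: "Rural areas in northern Sweden",
    7: "Gotland",
}

# Made-up rows standing in for the real data (assumption: same column names).
sample = pd.DataFrame({"Kilometres": [1, 2, 5], "Zone": [1, 4, 7]})

# Series.map looks up each code in the dictionary and returns the label.
sample["KilometresLabel"] = sample["Kilometres"].map(KILOMETRES_LABELS)
sample["ZoneLabel"] = sample["Zone"].map(ZONE_LABELS)
print(sample)
```

The same `map` calls should work on `insurance_df` once it is loaded, assuming its codes match the description.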

### Import Libraries

In [1]:
import numpy as np
import pandas as pd

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

#Suppress warnings
import warnings
warnings.filterwarnings('ignore')

### Data Importation

In [2]:
insurance_df = pd.read_csv('Downloads/Projects/Insurance/Insurance/SwedishMotorInsurance.csv', low_memory=False)

In [3]:
#first five rows
insurance_df.head()

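| | Kilometres | Zone | Bonus | Make | Insured | Claims | Payment |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 455.13 | 108 | 392491 |
| 1 | 1 | 1 | 1 | 2 | 69.17 | 19 | 46221 |
| 2 | 1 | 1 | 1 | 3 | 72.88 | 13 | 15694 |
| 3 | 1 | 1 | 1 | 4 | 1292.39 | 124 | 422201 |
| 4 | 1 | 1 | 1 | 5 | 191.01 | 40 | 119373 |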


In [4]:
insurance_df.shape

(2182, 7)

The DataFrame consists of 2182 rows and 7 columns

In [5]:
insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2182 entries, 0 to 2181
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Kilometres  2182 non-null   int64  
 1   Zone        2182 non-null   int64  
 2   Bonus       2182 non-null   int64  
 3   Make        2182 non-null   int64  
 4   Insured     2182 non-null   float64
 5   Claims      2182 non-null   int64  
 6   Payment     2182 non-null   int64  
dtypes: float64(1), int64(6)
memory usage: 119.5 KB


In [6]:
insurance_df.isnull().sum()

Kilometres    0
Zone          0
Bonus         0
Make          0
Insured       0
Claims        0
Payment       0
dtype: int64

This indicates that there are no missing values in the dataset

In [7]:
insurance_df.corr()

| | Kilometres | Zone | Bonus | Make | Insured | Claims | Payment |
|---|---|---|---|---|---|---|---|
| Kilometres | 1.0 | -0.014045 | 0.007226 | -0.002671 | -0.11299 | -0.128452 | -0.120886 |
| Zone | -0.014045 | 1.0 | 0.011752 | -0.005216 | -0.058232 | -0.114687 | -0.102695 |
| Bonus | 0.007226 | 0.011752 | 1.0 | 0.00215 | 0.165425 | 0.105102 | 0.118033 |
| Make | -0.002671 | -0.005216 | 0.00215 | 1.0 | 0.185642 | 0.253212 | 0.243539 |
| Insured | -0.11299 | -0.058232 | 0.165425 | 0.185642 | 1.0 | 0.910348 | 0.933217 |
| Claims | -0.128452 | -0.114687 | 0.105102 | 0.253212 | 0.910348 | 1.0 | 0.9954 |
| Payment | -0.120886 | -0.102695 | 0.118033 | 0.243539 | 0.933217 | 0.9954 | 1.0 |


**1. Summary statistics for each variable**

In [8]:
#Perform descriptive statistics
insurance_df.describe()

| | Kilometres | Zone | Bonus | Make | Insured | Claims | Payment |
|---|---|---|---|---|---|---|---|
| count | 2182.0 | 2182.0 | 2182.0 | 2182.0 | 2182.0 | 2182.0 | 2182.0 |
| mean | 2.985793 | 3.970211 | 4.015124 | 4.991751 | 1092.19527 | 51.86572 | 257007.6 |
| std | 1.410409 | 1.988858 | 2.000516 | 2.586943 | 5661.156245 | 201.710694 | 1017283.0 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 0.01 | 0.0 | 0.0 |
| 25% | 2.0 | 2.0 | 2.0 | 3.0 | 21.61 | 1.0 | 2988.75 |
| 50% | 3.0 | 4.0 | 4.0 | 5.0 | 81.525 | 5.0 | 27403.5 |
| 75% | 4.0 | 6.0 | 6.0 | 7.0 | 389.7825 | 21.0 | 111953.8 |
| max | 5.0 | 7.0 | 7.0 | 9.0 | 127687.27 | 3338.0 | 18245030.0 |


The minimum of Claims (and of Payment) is 0, so some rows record insured policy-years with no claims at all.
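
To quantify how common the zero-claim case is, one option is to count the rows where `Claims` equals 0. The sketch below uses a small made-up frame (`toy`, with illustrative values) rather than the real data, so the numbers say nothing about the actual dataset.

```python
import pandas as pd

# Made-up rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "Insured": [455.13, 0.01, 21.61, 81.53],
    "Claims": [108, 0, 0, 5],
    "Payment": [392491, 0, 0, 27403],
})

# Boolean mask selecting rows with no claims.
zero_claim_rows = toy[toy["Claims"] == 0]

n_zero = len(zero_claim_rows)    # number of rows without any claim
share_zero = n_zero / len(toy)   # fraction of all rows
print(f"{n_zero} of {len(toy)} rows have no claims ({share_zero:.0%})")
```

Replacing `toy` with `insurance_df` would give the corresponding count for the full dataset.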

**2. Find whether payment is related to number of claims and the number of insured policy years**

In [9]:
import statsmodels.formula.api as sm

In [10]:
formula_str = insurance_df.columns[6]+' ~ '+'+'.join(insurance_df.columns[4:6])
formula_str

'Payment ~ Insured+Claims'

In [11]:
model=sm.ols(formula=formula_str, data=insurance_df)

In [12]:
fitted = model.fit()

In [13]:
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                Payment   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                 2.211e+05
Date:                Sat, 12 Feb 2022   Prob (F-statistic):               0.00
Time:                        08:12:13   Log-Likelihood:                -27477.
No. Observations:                2182   AIC:                         5.496e+04
Df Residuals:                    2179   BIC:                         5.498e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3250.7447   1582.708      2.054      0.0

H0: There is no relationship between payment and number of claims and number of insured policy years<br>
H1: There is a relationship between payment and number of claims and number of insured policy years

From the above results, we reject the null hypothesis (H0) because the p-value of the F-statistic (reported as 0.00) is below the significance level, and conclude that payment is related to the number of claims and the number of insured policy-years.
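
The decision above can also be read programmatically from a fitted statsmodels result instead of the printed summary. The following is a minimal sketch on synthetic data: the frame `toy`, the coefficients used to generate it, and the significance level `alpha = 0.05` are all assumptions made for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): Payment depends linearly on Claims and Insured.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Insured": rng.uniform(1, 1000, n),
    "Claims": rng.poisson(20, n),
})
toy["Payment"] = 5000 * toy["Claims"] + 30 * toy["Insured"] + rng.normal(0, 1000, n)

toy_fit = smf.ols("Payment ~ Insured + Claims", data=toy).fit()

alpha = 0.05  # assumed significance level

# f_pvalue is the p-value of the overall F-test (all slopes equal to zero).
reject_h0 = toy_fit.f_pvalue < alpha
print("F-test p-value:", toy_fit.f_pvalue, "-> reject H0:", reject_h0)

# pvalues holds one t-test p-value per coefficient.
print(toy_fit.pvalues)
```

The overall F-test addresses H0 as stated (no relationship with either predictor), while the per-coefficient p-values indicate which predictor carries the relationship.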

**3. Find out whether distance, location, bonus, make, insured amount or claims affect payment**

In [14]:
formula_str = insurance_df.columns[6]+' ~ '+'+'.join(insurance_df.columns[0:6])
formula_str

'Payment ~ Kilometres+Zone+Bonus+Make+Insured+Claims'

In [15]:
model=sm.ols(formula=formula_str, data=insurance_df)

In [16]:
fitted = model.fit()

In [17]:
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                Payment   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                 7.462e+04
Date:                Sat, 12 Feb 2022   Prob (F-statistic):               0.00
Time:                        08:12:13   Log-Likelihood:                -27461.
No. Observations:                2182   AIC:                         5.494e+04
Df Residuals:                    2175   BIC:                         5.498e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2.173e+04   6338.112     -3.429      0.0

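Based on the coefficient p-values, Bonus and Make are not statistically significant predictors of Payment once the other variables are in the model.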

**4. Find out at which location, kilometre and bonus level the insured amount, claims and payment are highest**

In [23]:
#Mean Insured, Claims and Payment per (Zone, Kilometres, Bonus) group, highest Payment first
payment_increase = insurance_df.groupby(['Zone','Kilometres','Bonus'])[['Insured','Claims','Payment']].mean().sort_values(by='Payment',ascending=False)
payment_increase.head(10)

| Zone | Kilometres | Bonus | Insured | Claims | Payment |
|---|---|---|---|---|---|
| 4 | 2 | 7 | 19543.612222 | 530.333333 | 2917260.0 |
| 4 | 1 | 7 | 19143.095556 | 433.777778 | 2357821.0 |
| 4 | 3 | 7 | 12956.287778 | 404.111111 | 2169422.0 |
| 2 | 2 | 7 | 8518.242222 | 322.555556 | 1637884.0 |
| 1 | 2 | 7 | 6918.635556 | 327.333333 | 1612848.0 |
| 3 | 2 | 7 | 9576.966667 | 314.111111 | 1581017.0 |
| 3 | 1 | 7 | 9719.787778 | 271.222222 | 1407028.0 |
| 1 | 1 | 7 | 7007.348889 | 263.444444 | 1262024.0 |
| 2 | 1 | 7 | 8056.955556 | 253.777778 | 1225560.0 |
| 2 | 3 | 7 | 5407.417778 | 241.666667 | 1222893.0 |


The highest average Insured, Claims and Payment occur in Zone 4 (rural areas in southern Sweden), Kilometres class 2 (1000-15000 km per year) and Bonus level 7.
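
Instead of reading the top row off a sorted table, the leading group can be picked out directly with `idxmax`. This is a sketch on a small made-up frame (`toy`, illustrative values only); on a grouped result with a three-level index, `idxmax` returns the (Zone, Kilometres, Bonus) tuple of the maximum.

```python
import pandas as pd

# Made-up rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "Zone":       [4, 4, 1, 1],
    "Kilometres": [2, 2, 2, 1],
    "Bonus":      [7, 7, 7, 7],
    "Insured":    [100.0, 300.0, 50.0, 70.0],
    "Claims":     [10, 30, 5, 4],
    "Payment":    [1000, 3000, 500, 400],
})

# Mean of each measure per (Zone, Kilometres, Bonus) group.
group_means = (
    toy.groupby(["Zone", "Kilometres", "Bonus"])[["Insured", "Claims", "Payment"]]
    .mean()
)

# Index label (a tuple) of the group with the highest mean Payment.
top_group = group_means["Payment"].idxmax()
print("Group with highest mean Payment:", top_group)

# The same lookup for every measure at once.
print(group_means.idxmax())
```

Checking all three columns this way shows whether a single group leads on Insured, Claims and Payment, or whether the maxima fall in different groups.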

**5. Find whether the insured amount, zone, kilometer, bonus or make affect the claim rates and to what extent**

In [19]:
formula_str = insurance_df.columns[5]+' ~ '+'+'.join(insurance_df.columns[:5])
formula_str

'Claims ~ Kilometres+Zone+Bonus+Make+Insured'

In [20]:
model=sm.ols(formula=formula_str, data=insurance_df)

In [21]:
fitted = model.fit()

In [22]:
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                 Claims   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.842
Method:                 Least Squares   F-statistic:                     2328.
Date:                Sat, 12 Feb 2022   Prob (F-statistic):               0.00
Time:                        08:12:13   Log-Likelihood:                -12659.
No. Observations:                2182   AIC:                         2.533e+04
Df Residuals:                    2176   BIC:                         2.536e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     37.1230      7.127      5.209      0.0

From the coefficient p-values, Kilometres, Zone, Bonus, Make and Insured are all statistically significant predictors of the number of claims.
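
The regression above models the number of claims, whereas the question asks about claim rates. One way to look at rates directly is to divide claims by exposure (policy-years) within each level of a factor. The sketch below does this by Zone on a made-up frame; the column name `ClaimRate` and the `toy` values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "Zone":    [1, 1, 4, 4],
    "Insured": [100.0, 300.0, 200.0, 200.0],
    "Claims":  [10, 30, 8, 12],
})

# Sum claims and exposure per zone first, then divide. This weights each row
# by its policy-years instead of averaging per-row rates.
totals = toy.groupby("Zone")[["Claims", "Insured"]].sum()
totals["ClaimRate"] = totals["Claims"] / totals["Insured"]
print(totals)
```

Summing before dividing keeps rows with very small exposure (such as `Insured` = 0.01 in the real data) from dominating the result.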