## Hypothesis Testing
- In this notebook we test a set of null hypotheses and, for each one, either reject it or fail to reject it.
- The null hypotheses are based on the insurance dataset. Knowing which parameters have a significant effect on claims helps our risk assessment.

## Importing necessary libraries and modules

In [2]:
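import os
import warnings

import pandas as pd

# Move to the parent directory so that `scripts/` and `data/` resolve
os.chdir('..')
from scripts.hypo import *

# Silence mixed-dtype warnings raised while reading the dataset
warnings.filterwarnings('ignore', category=pd.errors.DtypeWarning)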

## Loading the dataset

In [3]:
df=pd.read_csv('data/MachineLearningRating_v3.txt',sep='|')
df.head()

| | UnderwrittenCoverID | PolicyID | TransactionMonth | IsVATRegistered | Citizenship | LegalType | Title | Language | Bank | AccountType | ... | ExcessSelected | CoverCategory | CoverType | CoverGroup | Section | Product | StatutoryClass | StatutoryRiskType | TotalPremium | TotalClaims |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 145249 | 12827 | 2015-03-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Windscreen | Windscreen | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 21.929825 | 0.0 |
| 1 | 145249 | 12827 | 2015-05-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Windscreen | Windscreen | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 21.929825 | 0.0 |
| 2 | 145249 | 12827 | 2015-07-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Windscreen | Windscreen | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 0.0 | 0.0 |
| 3 | 145255 | 12827 | 2015-05-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Metered Taxis - R2000 | Own damage | Own Damage | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 512.84807 | 0.0 |
| 4 | 145255 | 12827 | 2015-07-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Metered Taxis - R2000 | Own damage | Own Damage | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 0.0 | 0.0 |


## Generating some important columns

In [4]:
# Add a binary column derived from TotalClaims: 1 if the policy has a claim, 0 otherwise
df['Claimed']=df['TotalClaims'].apply(lambda x: 0 if x==0 else 1)

In [11]:
# Add a gender column derived from the Title column. This significantly reduces the number of 'Not Specified' rows
df['Gender_new'] = df['Title'].apply(
    lambda x: 'Male' if x == 'Mr' 
    else 'Female' if x in ['Miss', 'Mrs', 'Ms'] 
    else 'Not Specified')
# Derive the profit columns and show the columns needed for the tests
df['Profit']=df['TotalPremium']-df['TotalClaims']
df['Profit_cat']=df['Profit'].apply(lambda x: 1 if x>0 else 0)
df[['Province','Gender_new','PostalCode','Claimed','Profit','Profit_cat']]

| | Province | Gender_new | PostalCode | Claimed | Profit | Profit_cat |
|---|---|---|---|---|---|---|
| 0 | Gauteng | Male | 1459 | 0 | 21.929825 | 1 |
| 1 | Gauteng | Male | 1459 | 0 | 21.929825 | 1 |
| 2 | Gauteng | Male | 1459 | 0 | 0.000000 | 0 |
| 3 | Gauteng | Male | 1459 | 0 | 512.848070 | 1 |
| 4 | Gauteng | Male | 1459 | 0 | 0.000000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1000093 | Western Cape | Male | 7493 | 0 | 347.235175 | 1 |
| 1000094 | Western Cape | Male | 7493 | 0 | 347.235175 | 1 |
| 1000095 | Western Cape | Male | 7493 | 0 | 347.235175 | 1 |
| 1000096 | Western Cape | Male | 7493 | 0 | 2.315000 | 1 |
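
The derived columns can also be built without row-wise `apply`. The following is a minimal sketch on a small toy frame (the rows and the `title_to_gender` mapping name are made up for illustration, not taken from the dataset); it is meant to produce the same values as the cells above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    'Title': ['Mr', 'Mrs', 'Dr', None],
    'TotalPremium': [100.0, 50.0, 0.0, 20.0],
    'TotalClaims': [0.0, 80.0, 0.0, 20.0],
})

# 1 if any claim amount was recorded, 0 otherwise
toy['Claimed'] = toy['TotalClaims'].ne(0).astype(int)

# Map known titles to a gender; everything else becomes 'Not Specified'
title_to_gender = {'Mr': 'Male', 'Miss': 'Female', 'Mrs': 'Female', 'Ms': 'Female'}
toy['Gender_new'] = toy['Title'].map(title_to_gender).fillna('Not Specified')

# Profit is premium minus claims; Profit_cat flags strictly positive profit
toy['Profit'] = toy['TotalPremium'] - toy['TotalClaims']
toy['Profit_cat'] = toy['Profit'].gt(0).astype(int)

print(toy)
```

On a dataset with about a million rows, the vectorized form is usually noticeably faster than `apply` with a lambda.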


## Hypothesis Testing

- The four null hypotheses we have to analyze are:
    1. There are no risk differences across provinces.
    2. There are no risk differences between zip codes.
    3. There is no significant margin (profit) difference between zip codes.
    4. There is no significant risk difference between women and men.
- The decision rule, at a significance level of **0.05**, is:
    - If the p-value is less than **0.05**, we **reject the null hypothesis**.
    - If the p-value is greater than or equal to **0.05**, we **fail to reject the null hypothesis**.

Let's start with the first one:

1. There are no risk differences across provinces.
    - For this null hypothesis we analyze the relationship between whether a policy has a claim (`Claimed`) and its province. Since both variables are categorical, we use the chi-square test of independence to calculate the p-value.
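
The helper `catagorical_claims` comes from `scripts.hypo` and its source is not shown in this notebook. Based only on how it is called and on the shape of the table it returns, it may look roughly like the sketch below. The function name `categorical_claims_sketch`, the `target` parameter, and the toy data are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_claims_sketch(df, column, target='Claimed'):
    """Hypothetical stand-in for scripts.hypo.catagorical_claims."""
    # Rows are the claim flag, columns are the categories being compared
    table = pd.crosstab(df[target], df[column])
    # chi2_contingency also returns the degrees of freedom and expected counts
    chi_stat, p_value, dof, expected = chi2_contingency(table)
    return chi_stat, p_value, table


# Toy data: two made-up provinces with different claim frequencies
toy = pd.DataFrame({
    'Province': ['A'] * 6 + ['B'] * 6,
    'Claimed': [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
})
chi_stat, p_value, table = categorical_claims_sketch(toy, 'Province')
print(table)
print(chi_stat, p_value)
```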

In [17]:
chi_stat,p_value,contingency_table=catagorical_claims(df,'Province')
contingency_table.T

| Province | Claimed = 0 | Claimed = 1 |
|---|---|---|
| Eastern Cape | 30286 | 50 |
| Free State | 8088 | 11 |
| Gauteng | 392541 | 1324 |
| KwaZulu-Natal | 169298 | 483 |
| Limpopo | 24769 | 67 |
| Mpumalanga | 52588 | 130 |
| North West | 142938 | 349 |
| Northern Cape | 6372 | 8 |
| Western Cape | 170425 | 371 |


In [11]:
print(f"The chi-squared statistic is {round(chi_stat,3)}")
print(f"The p-value for the province parameter is {p_value}")

The chi-squared statistic is 103.883
The p-value for the province parameter is 6.852117855844585e-19


---
* The p-value is left unrounded to show how small it is: it is extremely close to 0 and far below the 0.05 significance level. So we **reject the null hypothesis**.
* This means that **there are risk differences across provinces**, which is consistent with the previous EDA notebook. From the contingency table above, `Gauteng` (about 0.34% of rows with a claim) and `KwaZulu-Natal` (about 0.28%) have the highest claim rates, while `Free State` and `Northern Cape` (about 0.14% and 0.13%) have the lowest. `Western Cape` has a large number of claims but a claim rate of only about 0.22%.
---
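
To compare provinces on an equal footing, the claim rate per province can be computed directly from the contingency table above. This is a small sketch that re-enters the printed counts by hand; the column names `no_claim`, `claim` and `claim_rate_pct` are chosen here for illustration.

```python
import pandas as pd

# Counts copied from the province contingency table above
counts = pd.DataFrame(
    {
        'no_claim': [30286, 8088, 392541, 169298, 24769, 52588, 142938, 6372, 170425],
        'claim': [50, 11, 1324, 483, 67, 130, 349, 8, 371],
    },
    index=['Eastern Cape', 'Free State', 'Gauteng', 'KwaZulu-Natal', 'Limpopo',
           'Mpumalanga', 'North West', 'Northern Cape', 'Western Cape'],
)

# Share of rows with a claim, as a percentage
counts['claim_rate_pct'] = 100 * counts['claim'] / (counts['no_claim'] + counts['claim'])

# Highest claim rate first
ranked = counts.sort_values('claim_rate_pct', ascending=False)
print(ranked.round(3))
```

Ranking by rate rather than by raw claim count avoids favouring provinces that simply have more policies.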



2. There are no risk differences between zip codes.
    - For this null hypothesis we analyze the relationship between whether a policy has a claim and its postal code. Since both variables are categorical, we use the chi-square test to calculate the p-value.

In [23]:
# Since there is no column that shows zip codes, we'll use Postal codes
chi_stat,p_value,contingency_table=catagorical_claims(df,'PostalCode')
contingency_table.T

| PostalCode | Claimed = 0 | Claimed = 1 |
|---|---|---|
| 1 | 5329 | 12 |
| 2 | 1482 | 6 |
| 4 | 77 | 0 |
| 5 | 396 | 4 |
| 6 | 438 | 2 |
| ... | ... | ... |
| 9781 | 640 | 3 |
| 9830 | 56 | 0 |
| 9868 | 100 | 0 |
| 9869 | 1414 | 1 |


In [24]:
print(f"The chi-squared statistic is {round(chi_stat,3)}")
print(f"The p-value for the postal code parameter is {p_value}")

The chi-squared statistic is 1450.142
The p-value for the postal code parameter is 7.370378491766577e-30


---
* The p-value here is even smaller than the one for province, so we again **reject the null hypothesis**. This shows that **risk varies strongly with postal code**. Certain postal codes, such as `2000`, file a large number of claims, while others have never filed a claim, so such a small p-value is understandable.
---
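
Many postal codes have very few claims, and the chi-square approximation is usually considered less reliable when expected cell counts are small (a common rule of thumb is at least 5). The sketch below checks this on only the nine postal codes visible in the truncated output above, so it illustrates the check rather than giving a result for the full table.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Visible rows of the postal code table: [no claim, claim]
observed = np.array([
    [5329, 12],
    [1482, 6],
    [77, 0],
    [396, 4],
    [438, 2],
    [640, 3],
    [56, 0],
    [100, 0],
    [1414, 1],
])

# Expected counts under independence of postal code and claim status
expected = expected_freq(observed)

# Number and share of cells below the rule-of-thumb threshold of 5
n_sparse = int((expected < 5).sum())
share_sparse = n_sparse / expected.size
print(n_sparse, round(share_sparse, 2))
```

If the full table shows a similar pattern, one option is to group sparse postal codes into larger areas before testing.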

3. There is no significant margin (profit) difference between zip codes.
    - I assumed that profit is the premium a customer pays minus the claims they make. The KPI used for this hypothesis is `Profit_cat`, a binary column that is 1 when `Profit` is positive and 0 otherwise. Since both variables are categorical, we use the chi-square test.

In [18]:
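# Explicit import in case the star import from scripts.hypo does not expose it
from scipy.stats import chi2_contingency

# Contingency table of postal code against the profit flag
con=pd.crosstab(df['PostalCode'],df['Profit_cat'])
chi,p_value,_,_=chi2_contingency(con)
con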

| PostalCode | Profit_cat = 0 | Profit_cat = 1 |
|---|---|---|
| 1 | 2503 | 2838 |
| 2 | 841 | 647 |
| 4 | 0 | 77 |
| 5 | 44 | 356 |
| 6 | 222 | 218 |
| ... | ... | ... |
| 9781 | 214 | 429 |
| 9830 | 0 | 56 |
| 9868 | 0 | 100 |
| 9869 | 571 | 844 |


In [20]:
print(f"The chi-squared statistic is {round(chi,3)}")
print(f"The p-value for the postal code parameter using the profit difference is {p_value}")

The chi-squared statistic is 112594.706
The p-value for the postal code parameter using the profit difference is 0.0


---
- For this test the p-value is reported as 0.0, which means it is too small to represent as a floating-point number rather than exactly zero. Since it is below 0.05, we **reject the null hypothesis**. This indicates that **there is a significant difference in profit across postal codes**. Looking at the company's total profit, the insurance company is losing money, since the sum of the `Profit` column is negative. Measures should be taken to fix this, either by increasing the premiums customers pay or by reducing the amount paid out in claims.
---
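
The total profit and the profit per postal code can be checked with a simple aggregation. The sketch below uses a few made-up rows (the values are hypothetical and only mimic the column layout of the dataset) to show the shape of that check.

```python
import pandas as pd

# Hypothetical rows; amounts are invented for illustration
toy = pd.DataFrame({
    'PostalCode': [1459, 1459, 2000, 2000, 7493],
    'TotalPremium': [21.93, 512.85, 100.0, 50.0, 347.24],
    'TotalClaims': [0.0, 0.0, 1200.0, 0.0, 0.0],
})
toy['Profit'] = toy['TotalPremium'] - toy['TotalClaims']

# Profit summary per postal code
by_code = toy.groupby('PostalCode')['Profit'].agg(['sum', 'mean', 'count'])

# Overall profit and the postal codes that lose money
total_profit = toy['Profit'].sum()
loss_making = by_code[by_code['sum'] < 0]

print(by_code)
print(total_profit)
print(loss_making)
```

Applied to the real `df`, the same `groupby` would show which postal codes drive the negative total.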

4. There is no significant risk difference between women and men.
    - This is the last null hypothesis to test. I significantly reduced the number of unspecified gender values by deriving gender from the `Title` column, and then compared the result with the `Claimed` column. The chi-square test is used for this one as well.

In [21]:
chi_stat,p_value,contingency_table=catagorical_claims(df,'Gender_new')
contingency_table

| Claimed | Female | Male | Not Specified |
|---|---|---|---|
| 0 | 65602 | 930895 | 808 |
| 1 | 131 | 2660 | 2 |


In [22]:
print(f"The chi-squared statistic is {round(chi_stat,3)}")
print(f"The p-value for the gender parameter is {p_value}")

The chi-squared statistic is 16.203
The p-value for the gender parameter is 0.00030304328290309377


---
* The p-value is about 0.0003, which is still less than the 0.05 significance level, so in this case as well we **reject the null hypothesis**. We can say that **there is a statistically significant risk difference between males and females**, with the caveat that the test as run also includes the `Not Specified` group. From the contingency table, males file far more claims than females in absolute terms (2660 vs. 131) and also have a higher claim rate (about 0.28% vs. about 0.20%).
---
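
With about a million rows, even small differences produce small p-values, so it can help to look at the claim rates and an effect size next to the test. The sketch below re-enters the gender contingency table above and computes Cramér's V by hand; this effect-size step is an addition for illustration and is not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the gender contingency table above
table = pd.DataFrame(
    {'Female': [65602, 131], 'Male': [930895, 2660], 'Not Specified': [808, 2]},
    index=pd.Index([0, 1], name='Claimed'),
)

chi_stat, p_value, dof, expected = chi2_contingency(table)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi_stat / (n * (min(table.shape) - 1)))

# Percentage of rows with a claim in each gender group
claim_rate_pct = 100 * table.loc[1] / table.sum(axis=0)

print(round(chi_stat, 3), p_value, dof)
print(round(cramers_v, 4))
print(claim_rate_pct.round(3))
```

Dropping the `Not Specified` column before calling `chi2_contingency` would turn this into a direct male versus female comparison.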

In [30]:
df.to_csv('C:/Users/abenet/Desktop/data/Week 3/dataset.csv')
