# Hypothesis Testing for AlphaCare Insurance Solutions

This notebook performs hypothesis testing on the insurance claim data to help optimize the marketing strategy and discover 'low-risk' targets for premium reduction.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add the project root to the Python path so the `src` package can be imported
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, project_root)

from src.data_loader import DataLoader
from src.hypothesis_testing import HypothesisTesting

## Load and Prepare Data

In [None]:
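# Use forward slashes throughout; the original 'data\M...' contained an invalid escape sequence
data_loader = DataLoader('C:/Users/Lidya/Downloads/Week-3/alpha-insurance-analysis/data/MachineLearningRating_v3.txt')
data = data_loader.load_data()
print(data.head())
# info() prints its summary directly and returns None, so it is not wrapped in print()
data.info()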

## Perform Hypothesis Testing

In [None]:
hypothesis_tester = HypothesisTesting(data)
results = hypothesis_tester.run_all_tests()

for result in results:
    print(f"Test: {result['test']}")
    print(f"Statistic: {result['statistic']:.4f}")
    print(f"P-value: {result['p_value']:.4f}")
    print(f"Reject null hypothesis: {result['reject_null']}\n")

## Visualizations

In [None]:
# Risk distribution across provinces
province_risk = data.groupby('Province').apply(hypothesis_tester.calculate_risk_ratio).sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=province_risk.index, y=province_risk.values)
plt.title('Risk Ratio by Province')
plt.xlabel('Province')
plt.ylabel('Risk Ratio')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Margin distribution across postal codes (top 20)
postal_margin = data.groupby('PostalCode').apply(lambda x: (x['TotalPremium'] - x['TotalClaims']).mean()).sort_values(ascending=False).head(20)

plt.figure(figsize=(12, 6))
sns.barplot(x=postal_margin.index, y=postal_margin.values)
plt.title('Average Margin by Postal Code (Top 20)')
plt.xlabel('Postal Code')
plt.ylabel('Average Margin')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Risk comparison between genders
gender_risk = data.groupby('Gender').apply(hypothesis_tester.calculate_risk_ratio)

plt.figure(figsize=(8, 6))
sns.barplot(x=gender_risk.index, y=gender_risk.values)
plt.title('Risk Ratio by Gender')
plt.xlabel('Gender')
plt.ylabel('Risk Ratio')
plt.show()
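
The plots above depend on `hypothesis_tester.calculate_risk_ratio`, whose definition is not shown in this notebook. As a rough sketch only, one plausible reading is a claims-to-premium ratio per group. The helper name `claims_to_premium_ratio` and the toy data below are assumptions made for illustration, not the project's actual implementation.

```python
import pandas as pd

# Toy data invented for illustration; not the ACIS dataset
toy = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Northern Cape", "Northern Cape"],
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [50.0, 150.0, 0.0, 50.0],
})


def claims_to_premium_ratio(group: pd.DataFrame) -> float:
    """Hypothetical risk ratio: total claims divided by total premium."""
    total_premium = group["TotalPremium"].sum()
    if total_premium == 0:
        # Avoid division by zero for groups that paid no premium
        return float("nan")
    return group["TotalClaims"].sum() / total_premium


# Select the two numeric columns explicitly before applying the helper per group
ratios = (
    toy.groupby("Province")[["TotalPremium", "TotalClaims"]]
    .apply(claims_to_premium_ratio)
    .sort_values(ascending=False)
)
print(ratios)
```

Under this assumed definition, a ratio above 1 would mean a group claimed more than it paid in premiums.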

# Observations and Conclusions

## 1. Risk differences across provinces

**Test results:**
```python
print("Test: Risk differences across provinces")
print("Statistic: 104.1909")
print("P-value: 0.0000")
print("Reject null hypothesis: True")
```

- The chi-square test (statistic 104.19, p < 0.0001) provides strong evidence against the null hypothesis of no risk difference across provinces.
- Risk differs by a statistically significant amount across the provinces.
- From the "Risk Ratio by Province" graph, we note the following:
  - Gauteng shows the highest risk ratio, with KwaZulu-Natal and Western Cape following closely behind.
  - Northern Cape has the lowest risk ratio, which is notably lower than the other provinces.
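
For readers who want to see the mechanics of such a test, here is a minimal sketch of a chi-square test of independence using `scipy.stats.chi2_contingency`. It assumes that "risk" is encoded as whether a policy has any claim at all; the notebook does not show how `HypothesisTesting` builds its table, so the `HasClaim` column and the toy data are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data invented for illustration; not the ACIS dataset
toy = pd.DataFrame({
    "Province": ["Gauteng"] * 6 + ["KwaZulu-Natal"] * 6 + ["Northern Cape"] * 6,
    "TotalClaims": [
        0, 1200, 0, 850, 300, 0,   # Gauteng
        0, 0, 500, 0, 700, 0,      # KwaZulu-Natal
        0, 0, 0, 0, 400, 0,        # Northern Cape
    ],
})

# Assumption: a policy counts as a risk event if it has any claim
toy["HasClaim"] = toy["TotalClaims"] > 0

# Rows are provinces, columns are counts of policies without / with a claim
contingency = pd.crosstab(toy["Province"], toy["HasClaim"])

chi2, p_value, dof, expected = chi2_contingency(contingency)
reject_null = p_value < 0.05  # conventional 5% significance level

print(f"Statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Reject null hypothesis: {reject_null}")
```

With so few rows the expected cell counts are too small for the test to be reliable; the sketch only shows the shape of the computation.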

**Implications for marketing strategy:**
- Consider revising premiums or underwriting policies based on varying provincial risk levels.
- Implement focused risk management initiatives in high-risk regions such as Gauteng.
- Look into opportunities for growth in lower-risk provinces like Northern Cape.

## 2. Risk differences between postal codes

**Test results:**
```python
print("Test: Risk differences between postal codes")
print("Statistic: 1454.4676")
print("P-value: 0.0000")
print("Reject null hypothesis: True")
```

- The chi-square test (statistic 1454.47, p < 0.0001) provides strong evidence to reject the null hypothesis.
- There are statistically significant differences in risk between postal codes.
- This suggests that location, at a finer granularity than province, is an important factor in risk evaluation.

**Potential targeted marketing opportunities:**
- Design postal code-specific marketing strategies, targeting areas with lower risk ratios.
- Offer more competitive rates in low-risk postal codes to attract potential customers.
- Apply stricter underwriting criteria or higher premiums in high-risk postal codes to mitigate risk exposure.

## 3. Margin differences between postal codes

**Test results:**
```python
print("Test: Margin differences between postal codes")
print("Statistic: 0.8707")
print("P-value: 0.9977")
print("Reject null hypothesis: False")
```

- The ANOVA test (statistic 0.8707, p = 0.9977) does not allow us to reject the null hypothesis.
- The data provide no evidence of statistically significant differences in margins across postal codes.
- The "Average Margin by Postal Code (Top 20)" graph shows some variation in margins among the highest-margin postal codes, but across all postal codes these differences are not statistically significant.

**Implications for premium adjustments:**
- Although certain postal codes may have different average margins, the overall pricing strategy appears consistent across regions.
- Focus on factors other than postal code when considering potential premium adjustments.
- Investigate why specific postal codes (such as 3887 and 4016) have higher average margins, even though the differences are not statistically significant overall.

## 4. Risk differences between Women and Men

**Test results:**
```python
print("Test: Risk differences between Women and Men")
print("Statistic: nan")
print("P-value: nan")
print("Reject null hypothesis: False")
```

- The t-test returned NaN for both the statistic and the p-value, which points to a problem with the data (for example, missing values or an empty group) or with the implementation of the test.
- We cannot confidently draw conclusions about risk differences between genders from this result.
- However, the "Risk Ratio by Gender" graph reveals some key observations:
  - Males appear to have a slightly higher risk ratio than females.
  - The "Not specified" category shows the highest risk ratio, indicating the need for further investigation.
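
One common way to get a NaN result like the one above is a missing value in either sample: by default `scipy.stats.ttest_ind` propagates NaN inputs to the output. The sketch below reproduces that behaviour on invented numbers and shows the `nan_policy="omit"` option. Whether this is the actual cause in `HypothesisTesting` is an assumption that would need to be checked against the data; an empty group after filtering on the gender labels is another possibility.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy claim amounts invented for illustration; one value is missing
women = np.array([0.0, 120.0, np.nan, 0.0, 300.0])
men = np.array([0.0, 450.0, 80.0, 0.0, 0.0])

# Default nan_policy="propagate": a single NaN makes the whole result NaN
res_nan = ttest_ind(women, men)
stat_nan, p_nan = float(res_nan[0]), float(res_nan[1])

# nan_policy="omit" drops missing values before computing the test
res_ok = ttest_ind(women, men, nan_policy="omit")
stat_ok, p_ok = float(res_ok[0]), float(res_ok[1])

print(f"With NaN propagated: statistic={stat_nan}, p-value={p_nan}")
print(f"With NaN omitted:    statistic={stat_ok:.4f}, p-value={p_ok:.4f}")
```

Counting missing values and group sizes before running the test (for example with `isna().sum()` and `value_counts()`) is a quick way to tell these causes apart.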

**Potential gender-based marketing strategies:**
- Additional analysis is required to determine whether the observed gender differences are statistically significant.
- Consider creating marketing campaigns tailored to different genders, while ensuring that discriminatory practices are avoided.
- Investigate the elevated risk ratio in the "Not specified" category to determine whether it reflects a specific group or a data quality issue.

## Overall conclusions and recommendations for ACIS marketing strategy

- Geographic factors, especially at the province and postal code levels, are important in assessing risk.
- Implement location-specific marketing and underwriting strategies to manage risks effectively and attract lower-risk customers.
- While margins show no significant differences across postal codes, there may still be opportunities to optimize premiums in specific regions.
- Gender-based risk differences require further investigation due to inconclusive statistical findings.

## Suggested areas for further investigation or data collection

1. Explore other demographic factors (e.g., age, occupation) to uncover additional risk patterns.
2. Investigate the causes of higher risk ratios in Gauteng, KwaZulu-Natal, and Western Cape.
3. Conduct a detailed analysis of the "Not specified" gender category to understand its elevated risk ratio.
4. Perform time-series analysis to detect any seasonal or long-term trends in risk ratios and margins.
5. Collect more detailed data on vehicle types and usage patterns to enhance risk assessment accuracy.
6. Investigate the relationship between customer loyalty (policy duration) and risk ratios to inform retention strategies.

By leveraging these insights and conducting further analyses, ACIS can fine-tune its marketing strategy, optimize pricing, and improve risk management across various customer segments and geographical regions.