# Task 1: EDA with a Modular Approach

**Objective:** Perform exploratory data analysis using the custom-built modules from the `/src` directory.

## 1. Import Modules and Load Data

In [None]:
import sys
import os
# Add the parent directory to the path to import our modules
sys.path.append(os.path.abspath('..'))

from src.data_loader import load_data
from src.eda import EDAAnalyzer
from src.visualizer import Visualizer

%matplotlib inline

In [None]:
# --- Load the Data ---
# Replace 'your_data_file.xlsx' with the actual filename.
file_path = '../data/your_data_file.xlsx'
df = load_data(file_path)

if df is not None:
    display(df.head())
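
The `if df is not None` guard suggests that `load_data` returns `None` when the file cannot be read. The actual implementation in `src/data_loader.py` is not shown here, so the following is only a minimal sketch of a loader with that behavior. The name `load_data_sketch` and the extension-based dispatch between Excel and CSV are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_data_sketch(file_path: str) -> Optional[pd.DataFrame]:
    """Hypothetical stand-in for src.data_loader.load_data (not the real module)."""
    path = Path(file_path)
    if not path.is_file():
        print(f"File not found: {path}")
        return None
    try:
        # Assumption: pick the reader from the file extension.
        if path.suffix.lower() in {".xlsx", ".xls"}:
            return pd.read_excel(path)
        return pd.read_csv(path)
    except Exception as exc:  # e.g. parse errors or a missing Excel engine
        print(f"Could not load {path}: {exc}")
        return None
```

Returning `None` instead of raising keeps later cells runnable behind a simple guard, at the cost of hiding the original traceback.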

## 2. Data Quality and Summarization

Run initial checks for data types, missing values, and descriptive statistics.

In [None]:
if df is not None:
    analyzer = EDAAnalyzer(df)
    
    print('--- Data Info ---')
    analyzer.get_data_info()
    
    print('\n--- Summary Statistics ---')
    display(analyzer.get_summary_statistics())
    
    print('\n--- Missing Values ---')
    display(analyzer.check_missing_values())
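
For readers without access to `src/eda.py`, the three checks above can be approximated with plain pandas calls. This is a sketch on a made-up toy frame, not the `EDAAnalyzer` implementation. The layout of the missing-value report (count plus percentage) is an assumption.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "TotalPremium": [100.0, 250.0, None, 80.0],
    "TotalClaims": [0.0, 500.0, 0.0, None],
    "Province": ["A", "A", None, "B"],
})

# Data types and non-null counts (prints directly, returns None).
toy.info()

# Descriptive statistics for the numeric columns.
summary = toy.describe()

# Missing values per column, as a count and as a percentage of rows.
missing = toy.isnull().sum()
missing_report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})
# Keep only the columns that actually have gaps.
missing_report = missing_report[missing_report["missing"] > 0]
print(missing_report)
```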

## 3. Answering Guiding Questions

The cells below use `EDAAnalyzer` and `Visualizer` to address the guiding questions from the project brief.

### Guiding Question 1: What is the overall Loss Ratio and how does it vary?

In [None]:
if df is not None:
    # Calculate overall loss ratio
    overall_loss_ratio = analyzer.calculate_loss_ratio()
    print(f'Overall Loss Ratio: {overall_loss_ratio:.2%}\n')

    # Calculate loss ratio by Province
    print('--- Loss Ratio by Province ---')
    display(analyzer.calculate_loss_ratio(group_by='Province'))

    # Calculate loss ratio by Gender
    print('\n--- Loss Ratio by Gender ---')
    display(analyzer.calculate_loss_ratio(group_by='Gender'))
    
    # Calculate loss ratio by VehicleType
    print('\n--- Loss Ratio by VehicleType ---')
    display(analyzer.calculate_loss_ratio(group_by='VehicleType'))
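
The notebook does not show how `calculate_loss_ratio` works internally. A common definition, assumed here, is total claims divided by total premium, computed either over the whole dataset or within each group. The function name `loss_ratio_sketch` is hypothetical. The column names `TotalClaims` and `TotalPremium` are taken from the cells in this notebook.

```python
from typing import Optional, Union

import pandas as pd


def loss_ratio_sketch(
    df: pd.DataFrame, group_by: Optional[str] = None
) -> Union[float, pd.Series]:
    """Assumed definition: sum(TotalClaims) / sum(TotalPremium)."""
    if group_by is None:
        return df["TotalClaims"].sum() / df["TotalPremium"].sum()
    # Sum claims and premium per group first, then divide the totals.
    totals = df.groupby(group_by)[["TotalClaims", "TotalPremium"]].sum()
    ratio = totals["TotalClaims"] / totals["TotalPremium"]
    return ratio.sort_values(ascending=False)


# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "Province": ["A", "A", "B", "B"],
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [50.0, 0.0, 300.0, 0.0],
})

overall = loss_ratio_sketch(toy)
by_province = loss_ratio_sketch(toy, group_by="Province")
print(f"Overall: {overall:.2%}")
print(by_province)
```

Dividing the summed totals weights each policy by its premium. Averaging per-policy ratios instead would give small policies the same weight as large ones and would fail on rows with zero premium.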

### Guiding Question 2: What are the distributions and are there outliers?

In [None]:
if df is not None:
    visualizer = Visualizer(df)
    numerical_cols = ['TotalPremium', 'TotalClaims', 'CustomValueEstimate', 'CalculatedPremiumPerTerm']
    
    # Plot distributions
    visualizer.plot_distributions(numerical_cols)
    
    # Check for outliers
    visualizer.plot_boxplots(['TotalClaims', 'CustomValueEstimate'])
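
Boxplots show outliers visually. To count or inspect them, the conventional 1.5 × IQR rule can be applied directly. The sketch below is not part of the `Visualizer` module; `iqr_outlier_mask` is a hypothetical helper, and the multiplier `k=1.5` is the usual default rather than something specified in the project.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy data for illustration only; the values are invented.
claims = pd.Series([10, 12, 11, 13, 12, 11, 500], name="TotalClaims")
mask = iqr_outlier_mask(claims)
print(claims[mask])
```

In claims data most values are often zero, which can collapse the IQR and flag every non-zero claim, so the rule is best read alongside the distribution plots rather than on its own.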

### Guiding Question 3: Which vehicle makes are associated with the highest claims?

In [None]:
if df is not None:
    # Analyze claims by vehicle make
    print('--- Average Claim Amount by Top 10 Vehicle Makes ---')
    display(analyzer.analyze_claims_by_vehicle(top_n=10))
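
As a rough illustration of what a "top N makes by average claim" analysis involves, the sketch below groups by a make column and ranks the mean of `TotalClaims`. It is not the `analyze_claims_by_vehicle` implementation. The function name and the column name `make` are assumptions, so adjust them to match the dataset.

```python
import pandas as pd


def avg_claims_by_make_sketch(
    df: pd.DataFrame, make_col: str = "make", top_n: int = 10
) -> pd.Series:
    """Mean TotalClaims per make, highest first (make_col is an assumed name)."""
    return (
        df.groupby(make_col)["TotalClaims"]
        .mean()
        .sort_values(ascending=False)
        .head(top_n)
    )


# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "make": ["X", "X", "Y", "Y", "Z"],
    "TotalClaims": [100.0, 300.0, 0.0, 0.0, 50.0],
})

top_makes = avg_claims_by_make_sketch(toy, top_n=2)
print(top_makes)
```

The mean here includes policies with zero claims. Filtering to rows with `TotalClaims > 0` first would instead measure claim severity for policies that did claim.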

## 4. Creative Visualizations

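Plot the loss ratio by province and by vehicle type, and the claim amounts for the top vehicle makes, to summarize the findings from the guiding questions.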

In [None]:
if df is not None:
    # Plot 1: Loss Ratio by Province
    visualizer.plot_loss_ratio_by_category('Province')
    
    # Plot 2: Loss Ratio by Vehicle Type
    visualizer.plot_loss_ratio_by_category('VehicleType')
    
    # Plot 3: Claim amounts by top vehicle makes
    visualizer.plot_claims_by_make()
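
The plotting methods live in `src/visualizer.py` and are not shown in this notebook. As a minimal sketch of what a loss-ratio-by-category chart could look like, the function below draws a bar chart with matplotlib and marks the break-even line at 1.0. The function name, the styling, and the definition of loss ratio as summed claims over summed premium are all assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_loss_ratio_sketch(df: pd.DataFrame, category: str):
    """Bar chart of loss ratio per category (hypothetical helper)."""
    totals = df.groupby(category)[["TotalClaims", "TotalPremium"]].sum()
    ratio = (totals["TotalClaims"] / totals["TotalPremium"]).sort_values(
        ascending=False
    )

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(ratio.index.astype(str), ratio.values)
    # A loss ratio above 1.0 means claims exceeded premium for that group.
    ax.axhline(1.0, color="red", linestyle="--", label="Break-even (claims = premium)")
    ax.set_xlabel(category)
    ax.set_ylabel("Loss ratio")
    ax.set_title(f"Loss ratio by {category}")
    ax.legend()
    fig.tight_layout()
    return ax


# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "Province": ["A", "A", "B", "B"],
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [50.0, 0.0, 300.0, 0.0],
})
```

Calling `plot_loss_ratio_sketch(toy, "Province")` in a notebook cell renders the chart inline and returns the `Axes` for further customization.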