Perform Exploratory Data Analysis (EDA) on the following:
    Data Summarization:
        Descriptive Statistics: Calculate the variability for numerical features such as TotalPremium, TotalClaims, etc.
        Data Structure: Review the dtype of each column to confirm if categorical variables, dates, etc. are properly formatted.
    Data Quality Assessment:
        Check for missing values.
    Univariate Analysis:
        Distribution of Variables: Plot histograms for numerical columns and bar charts for categorical columns to understand distributions.
    Bivariate or Multivariate Analysis:
        Correlations and Associations: Explore relationships between the monthly changes in TotalPremium and TotalClaims as a function of ZipCode, using scatter plots and correlation matrices.
    Data Comparison:
        Trends Over Geography: Compare how insurance cover type, premium, auto make, etc. change across geographic areas.
    Outlier Detection:
        Use box plots to detect outliers in numerical data.
    Visualization:
        Produce 3 creative and beautiful plots that capture the key insights you gained from your EDA.


In [1]:
import os
os.chdir("../")


The data path was changed to data/data.csv.

In [2]:
import pandas as pd
df = pd.read_csv('data/data.csv')

In [None]:
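# Parse the date column (TransactionMonth in this dataset) into datetime64
df['TransactionMonth'] = pd.to_datetime(df['TransactionMonth'], format='ISO8601')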

Descriptive Statistics: Calculate the variability for numerical features such as TotalPremium, TotalClaims, etc.


In [None]:
from scripts.eda import descriptive_statistics
# Columns to summarize with descriptive statistics.
# A longer candidate list is kept for reference; only the active list below is used.
# columns_num = ['TotalPremium', 'TotalClaims','SumInsured', 'CalculatedPremiumPerTerm', 'ExcessSelected', 'NumberOfVehiclesInFleet','Age', 'YearsInsured', 'DrivingExperience', 'ClaimFrequency', 'ClaimSeverity']
columns_num = ['TotalPremium', 'TotalClaims','SumInsured', 'CalculatedPremiumPerTerm', 'ExcessSelected', 'NumberOfVehiclesInFleet']


descriptive_statistics(df, columns_num)
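The body of descriptive_statistics lives in scripts/eda.py and is not shown here. As a rough sketch of what a variability summary could compute, assuming nothing about that helper, the function below (variability_summary is a hypothetical name) reports standard deviation, variance, range, interquartile range, and coefficient of variation. The toy frame is made up for illustration and is not taken from data/data.csv.

```python
import pandas as pd

def variability_summary(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return spread measures for each requested column that exists in frame."""
    present = [c for c in columns if c in frame.columns]
    # Coerce to numeric so stray strings become NaN instead of raising.
    numeric = frame[present].apply(pd.to_numeric, errors="coerce")
    summary = pd.DataFrame({
        "std": numeric.std(),
        "variance": numeric.var(),
        "range": numeric.max() - numeric.min(),
        "iqr": numeric.quantile(0.75) - numeric.quantile(0.25),
    })
    # Coefficient of variation; left as NaN when the mean is zero.
    mean = numeric.mean()
    summary["cv"] = summary["std"] / mean.where(mean != 0)
    return summary

# Toy data for illustration only.
toy = pd.DataFrame({
    "TotalPremium": [10.0, 20.0, 30.0, 40.0],
    "TotalClaims": [0.0, 0.0, 5.0, 15.0],
})
toy_summary = variability_summary(toy, ["TotalPremium", "TotalClaims", "SumInsured"])
print(toy_summary)
```

Columns that are absent from the frame (SumInsured in the toy call) are skipped rather than raising, and a column that is entirely missing would simply yield NaN in every measure.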

Data Structure: Review the dtype of each column to confirm if categorical variables, dates, etc. are properly formatted.


In [None]:
df.info()

The columns have object, float64, and int64 data types.

In [None]:
object_cols = df.select_dtypes(include='object')
print(object_cols.columns)

In [None]:
df.describe()

In [None]:
for col in object_cols:
    print("######### ", col)
    print(df[col].value_counts())
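The value counts above show how many distinct values each object column holds. One possible follow-up, sketched here as an assumption rather than a step the notebook performs, is to cast low-cardinality object columns to the pandas category dtype. The helper name to_category and the max_unique threshold are hypothetical, and the toy values are made up.

```python
import pandas as pd

def to_category(frame: pd.DataFrame, max_unique: int = 50) -> pd.DataFrame:
    """Return a copy with low-cardinality object columns cast to 'category'."""
    out = frame.copy()
    for col in out.select_dtypes(include="object").columns:
        if out[col].nunique(dropna=True) <= max_unique:
            out[col] = out[col].astype("category")
    return out

# Toy data for illustration only; dtype is forced to object to mirror df.info().
toy = pd.DataFrame({
    "AccountType": pd.Series(["A", "B", "A"], dtype="object"),
    "TotalPremium": [1.0, 2.0, 3.0],
})
converted = to_category(toy)
print(converted.dtypes)
```

Categorical columns typically use less memory and make the intent explicit, but the threshold is a judgment call that depends on the dataset.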

In [None]:
num_data_points = len(df)
print("Number of data points:", num_data_points)

Check for missing values.


In [None]:
from scripts.preprocessing import missing_values, count_missing_values
# mis, misp = count_missing_values(df)
# print(mis, misp)
missing = missing_values(df)
print(missing)

Most of the columns are complete, but some have missing values (percentage missing):
    NumberOfVehiclesInFleet     100.000000
    CrossBorder                  99.930207
    CustomValueEstimate          77.956560
    Converted                    64.183810
    Rebuilt                      64.183810
    WrittenOff                   64.183810
    NewVehicle                   15.327998
    Bank                         14.594670
    AccountType                   4.022806
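The missing_values helper is imported from scripts/preprocessing.py and its body is not shown. A minimal sketch of how a percentage table like the one above could be produced with plain pandas follows; missing_percent is a hypothetical name and the toy frame is made up.

```python
import pandas as pd

def missing_percent(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first, zeros omitted."""
    pct = frame.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)

# Toy data for illustration only.
toy = pd.DataFrame({
    "Bank": ["x", None, "y", None],
    "TotalPremium": [1.0, 2.0, 3.0, 4.0],
    "CrossBorder": [None, None, None, None],
})
toy_missing = missing_percent(toy)
print(toy_missing)
```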

Distribution of Variables: Plot histograms for numerical columns and bar charts for categorical columns to understand distributions.

In [None]:
numerical_cols = df.select_dtypes(include=['int64', 'float64'])
categorical_cols = df.select_dtypes(include='object')

print("Numerical columns:", numerical_cols.columns)
print("Categorical columns:", categorical_cols.columns)

In [None]:
from scripts.eda import show_histograms
show_histograms(df, numerical_cols)

In [None]:
from scripts.eda import show_bar_chart
show_bar_chart(df,categorical_cols)
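show_histograms and show_bar_chart are defined in scripts/eda.py and are not reproduced here. The sketch below shows one way such plots could be drawn with matplotlib, under the assumption that column names are passed as plain lists. plot_distributions and top_n are hypothetical names, and the toy data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_distributions(frame, numeric_cols, categorical_cols, top_n=10):
    """Draw one histogram per numeric column and one bar chart per categorical column."""
    cols = list(numeric_cols) + list(categorical_cols)
    fig, axes = plt.subplots(len(cols), 1, figsize=(6, 3 * len(cols)), squeeze=False)
    for ax, col in zip(axes[:, 0], cols):
        if col in numeric_cols:
            ax.hist(frame[col].dropna(), bins=30)
        else:
            # Show only the most frequent categories to keep the axis readable.
            counts = frame[col].value_counts().head(top_n)
            ax.bar(counts.index.astype(str), counts.values)
            ax.tick_params(axis="x", rotation=45)
        ax.set_title(col)
    fig.tight_layout()
    return fig

# Toy data for illustration only.
toy = pd.DataFrame({
    "TotalPremium": [10.0, 12.0, 15.0, 40.0],
    "AccountType": ["A", "B", "A", "A"],
})
fig = plot_distributions(toy, ["TotalPremium"], ["AccountType"])
```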

Correlations and Associations: Explore relationships between the monthly changes in TotalPremium and TotalClaims as a function of ZipCode, using scatter plots and correlation matrices.

In [3]:
df['TransactionMonth'] = pd.to_datetime(df['TransactionMonth'], format='ISO8601')


# Sort by TransactionMonth
df = df.sort_values('TransactionMonth')

# Row-to-row differences in date order; these equal monthly changes only if there is one row per month
df['TotalPremium_Change'] = df['TotalPremium'].diff()
df['TotalClaims_Change'] = df['TotalClaims'].diff()

In [None]:
from scripts.eda import corr_trotalpremium_totalclaim_postalcode

corr_trotalpremium_totalclaim_postalcode(df)
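Because diff() on the raw frame compares consecutive rows, a true month-over-month change per area needs an aggregation step first. The sketch below is one way to do that and is not the implementation in scripts/eda.py. It assumes the geographic column is named PostalCode (suggested by the helper's name, but not confirmed by the notebook) and that TransactionMonth is already a datetime column; monthly_changes_by_area is a hypothetical name and the toy data is made up.

```python
import pandas as pd

def monthly_changes_by_area(frame: pd.DataFrame, area_col: str = "PostalCode") -> pd.DataFrame:
    """Sum premium and claims per area and month, then difference within each area."""
    keyed = frame.assign(Month=frame["TransactionMonth"].dt.to_period("M"))
    monthly = (
        keyed.groupby([area_col, "Month"])[["TotalPremium", "TotalClaims"]]
        .sum()
        .sort_index()
    )
    # diff() within each area; the first month of each area has no predecessor.
    return monthly.groupby(level=area_col).diff().dropna()

# Toy data for illustration only.
toy = pd.DataFrame({
    "PostalCode": [1000, 1000, 1000, 1000, 2000, 2000],
    "TransactionMonth": pd.to_datetime([
        "2015-01-01", "2015-01-01", "2015-02-01", "2015-03-01",
        "2015-01-01", "2015-02-01",
    ]),
    "TotalPremium": [60.0, 40.0, 120.0, 150.0, 50.0, 40.0],
    "TotalClaims": [10.0, 0.0, 30.0, 60.0, 5.0, 0.0],
})
changes = monthly_changes_by_area(toy)
corr_matrix = changes.corr()
print(changes)
print(corr_matrix)
```

The resulting changes frame can feed a scatter plot of TotalPremium against TotalClaims, and changes.corr() gives the corresponding correlation matrix.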

In [7]:
# Caution: this keeps only the first 100 rows, so every later cell runs on that subset.
df = df.head(100)

Data Comparison:
    Trends Over Geography: Compare the change in insurance cover type, premium, auto make, etc.


In [None]:
from scripts.eda import analyze_geographic_trends
analyze_geographic_trends(df)
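analyze_geographic_trends is defined in scripts/eda.py and its logic is not shown. A minimal sketch of a geographic comparison follows, assuming hypothetical column names Province and CoverType (neither is confirmed by this notebook). It returns premium statistics per area and the share of each cover type within each area; geographic_summary is a hypothetical name and the toy data is made up.

```python
import pandas as pd

def geographic_summary(frame, area_col="Province", category_col="CoverType"):
    """Return (premium stats per area, category mix per area as row proportions)."""
    premium = frame.groupby(area_col)["TotalPremium"].agg(["mean", "sum", "count"])
    mix = pd.crosstab(frame[area_col], frame[category_col], normalize="index")
    return premium, mix

# Toy data for illustration only; column names are assumptions.
toy = pd.DataFrame({
    "Province": ["A", "A", "B"],
    "CoverType": ["X", "Y", "X"],
    "TotalPremium": [10.0, 30.0, 50.0],
})
premium_by_area, cover_mix = geographic_summary(toy)
print(premium_by_area)
print(cover_mix)
```

Passing a different category_col, such as the vehicle make column, would compare that attribute across areas in the same way.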

Outlier Detection:
    Use box plots to detect outliers in numerical data.

In [None]:
from scripts.eda import outlier_detection

outlier_detection(df, numerical_cols)
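Box plots conventionally draw whiskers at 1.5 times the interquartile range beyond the quartiles and mark points outside them as outliers. The sketch below applies the same rule numerically so that flagged rows can be counted or inspected. It is an illustration, not the body of outlier_detection; iqr_outliers is a hypothetical name and the toy values are made up.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data for illustration only.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="TotalClaims")
mask = iqr_outliers(toy)
print(toy[mask])
```

Missing values compare as False on both sides, so they are never flagged as outliers.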

Visualization:
    Produce 3 creative and beautiful plots that capture the key insights you gained from your EDA.


In [None]:

from scripts.eda import visualize_plots


visualize_plots(df)
