# Car Insurance Data Analysis

## Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential early phase of any data engineering or analytics project. It builds an understanding of the dataset before more detailed analysis begins: summarizing the dataset's key features, identifying patterns, detecting outliers, and observing trends.

**Step 1: Examine the Data Structure**

- **Load the Data:** Begin by loading the dataset and analyzing its structure.
- **Check Dimensions:** Determine the number of rows and columns present in the dataset.
- **Preview the Data:** Review a sample of the initial records in the dataset.

In [10]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import os
import sys
from importlib import reload

sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))

In [14]:
df = pd.read_csv('../data/data.csv', low_memory=False)

In [17]:
df.sample(5)

| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52283 | TransactionId_13109 | BatchId_103464 | AccountId_4407 | SubscriptionId_3655 | CustomerId_4864 | UGX | 256 | ProviderId_6 | ProductId_10 | airtime | ChannelId_3 | 3000.0 | 3000 | 2019-01-07T16:39:30Z | 2 | 0 |
| 73066 | TransactionId_49408 | BatchId_5088 | AccountId_4841 | SubscriptionId_3829 | CustomerId_1656 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -5000.0 | 5000 | 2019-01-25T18:23:34Z | 2 | 0 |
| 1469 | TransactionId_89325 | BatchId_109466 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4920 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -200.0 | 200 | 2018-11-16T12:32:29Z | 2 | 0 |
| 70552 | TransactionId_47489 | BatchId_56223 | AccountId_2470 | SubscriptionId_364 | CustomerId_2886 | UGX | 256 | ProviderId_5 | ProductId_15 | financial_services | ChannelId_3 | 10000.0 | 10000 | 2019-01-24T22:11:07Z | 2 | 0 |
| 3405 | TransactionId_133180 | BatchId_94736 | AccountId_4841 | SubscriptionId_3829 | CustomerId_2845 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -180.0 | 180 | 2018-11-19T11:46:39Z | 2 | 0 |


In [15]:
print(df.dtypes)

TransactionId            object
BatchId                  object
AccountId                object
SubscriptionId           object
CustomerId               object
CurrencyCode             object
CountryCode               int64
ProviderId               object
ProductId                object
ProductCategory          object
ChannelId                object
Amount                  float64
Value                     int64
TransactionStartTime     object
PricingStrategy           int64
FraudResult               int64
dtype: object


In [20]:
# Import the class CreditRiskEDA
from credit_risk_EDA import CreditRiskEDA
# Initialize the class
cr_eda = CreditRiskEDA(df)

# Overview of the dataset
if not df.empty:
    cr_eda.data_overview()


Dataset contains 95662 rows and 16 columns.

Data Types:
TransactionId            object
BatchId                  object
AccountId                object
SubscriptionId           object
CustomerId               object
CurrencyCode             object
CountryCode               int64
ProviderId               object
ProductId                object
ProductCategory          object
ChannelId                object
Amount                  float64
Value                     int64
TransactionStartTime     object
PricingStrategy           int64
FraudResult               int64
dtype: object

Sample Data:
         TransactionId         BatchId       AccountId       SubscriptionId  \
0  TransactionId_76871   BatchId_36123  AccountId_3957   SubscriptionId_887   
1  TransactionId_73770   BatchId_15642  AccountId_4841  SubscriptionId_3829   
2  TransactionId_26203   BatchId_53941  AccountId_4229   SubscriptionId_222   
3    TransactionId_380  BatchId_102363   AccountId_648  SubscriptionId_2185   
4  Trans

In [21]:
# Convert the TransactionStartTime to appropriate datetime format
df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'])
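
The timestamps in this dataset are ISO 8601 strings ending in `Z`, so they parse as timezone-aware UTC values. As a minimal sketch on two values copied from the sample above, the conversion can be checked and simple time-based features derived as follows. The `hour` and `weekday` column names are illustrative assumptions, not columns from the notebook.

```python
import pandas as pd

# Two timestamps copied from the sample rows shown earlier
raw = pd.Series(["2019-01-07T16:39:30Z", "2018-11-16T12:32:29Z"])

# utc=True makes the UTC interpretation explicit; malformed strings
# raise an error by default instead of being silently coerced
ts = pd.to_datetime(raw, utc=True)

# Illustrative derived features (hypothetical column names)
features = pd.DataFrame({
    "TransactionStartTime": ts,
    "hour": ts.dt.hour,           # hour of day, 0-23
    "weekday": ts.dt.day_name(),  # weekday name as a string
})
print(features.dtypes)
print(features)
```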

In [22]:
# Identify duplicates based on specified columns
duplicate_rows = df[df.duplicated(subset=['BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 
                                           'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory', 
                                           'ChannelId', 'Amount', 'Value', 'TransactionStartTime', 
                                           'PricingStrategy', 'FraudResult'], keep=False)]
duplicate_rows.shape

(317, 16)

In [23]:
# Keep the first occurrence of duplicates
df_cleaned = df.drop_duplicates(subset=['BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 
                                          'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId', 
                                          'ProductCategory', 'ChannelId', 'Amount', 'Value', 
                                          'TransactionStartTime', 'PricingStrategy', 'FraudResult'], 
                                 keep='first')

In [24]:
df_cleaned.shape

(95475, 16)
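
The two counts above measure different things. `duplicated(..., keep=False)` flags every member of a duplicate group (317 rows), whereas `drop_duplicates(..., keep='first')` removes only the repeats (95662 - 95475 = 187 rows) and leaves one row per group. A minimal sketch on a toy frame illustrates the difference. The data is made up, and building the subset as "every column except `TransactionId`" is an assumption that happens to match the column list used above.

```python
import pandas as pd

# Made-up rows: T1-T3 are identical apart from their transaction ID
toy = pd.DataFrame({
    "TransactionId": ["T1", "T2", "T3", "T4", "T5"],
    "AccountId": ["A1", "A1", "A1", "A2", "A3"],
    "Amount": [100.0, 100.0, 100.0, 50.0, 75.0],
})

# Compare rows on everything except the transaction identifier
subset = [c for c in toy.columns if c != "TransactionId"]

# keep=False marks all members of each duplicate group
flagged = toy[toy.duplicated(subset=subset, keep=False)]
# keep="first" retains one row per group and drops the rest
deduped = toy.drop_duplicates(subset=subset, keep="first")

n_flagged = len(flagged)
n_removed = len(toy) - len(deduped)
n_groups = n_flagged - n_removed  # each group keeps exactly one row
print(n_flagged, n_removed, n_groups)
```

Applied to the figures above, 317 flagged rows minus 187 removed rows implies 130 distinct duplicate groups.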

In [25]:
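# Confirm that no duplicates remain. Compare on the same columns used above
# (all except TransactionId); a check across every column would come back empty
# whenever each TransactionId is unique, whether or not duplicates were dropped.
df_cleaned[df_cleaned.duplicated(subset=df_cleaned.columns.drop('TransactionId').tolist())]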

| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [26]:
df = df_cleaned.copy()
df.shape

(95475, 16)

In [27]:
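# Statistical summary
# Note: cr_eda was created from df before the duplicates were dropped, so the
# counts below (95662) still describe the original data. Re-create it with
# CreditRiskEDA(df) to summarize the cleaned frame instead.
summary_stats = cr_eda.summary_statistics()
display(summary_stats.T)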

Summary Statistics:



| | CountryCode | Amount | Value | PricingStrategy | FraudResult |
|---|---|---|---|---|---|
| count | 95662.0 | 95662.0 | 95662.0 | 95662.0 | 95662.0 |
| mean | 256.0 | 6717.846 | 9900.584 | 2.255974 | 0.002018 |
| std | 0.0 | 123306.8 | 123122.1 | 0.732924 | 0.044872 |
| min | 256.0 | -1000000.0 | 2.0 | 0.0 | 0.0 |
| 25% | 256.0 | -50.0 | 275.0 | 2.0 | 0.0 |
| 50% | 256.0 | 1000.0 | 1000.0 | 2.0 | 0.0 |
| 75% | 256.0 | 2800.0 | 5000.0 | 2.0 | 0.0 |
| max | 256.0 | 9880000.0 | 9880000.0 | 4.0 | 1.0 |
| median | 256.0 | 1000.0 | 1000.0 | 2.0 | 0.0 |
| mode | 256.0 | 1000.0 | 1000.0 | 2.0 | 0.0 |
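
The summary shows heavy right tails: for `Value`, the 75th percentile is 5,000 while the maximum is 9,880,000. One common way to quantify such outliers is the 1.5 * IQR rule, which is an assumption of this sketch rather than a method the notebook itself applies. With the quartiles from the table (Q1 = 275, Q3 = 5,000), the upper fence for `Value` would be 5,000 + 1.5 * 4,725 = 12,087.5. The helper name `iqr_bounds` and the toy series below are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data standing in for df["Value"]; the last entry is an extreme value
values = pd.Series([200, 275, 1000, 1000, 3000, 5000, 9_880_000])

lower, upper = iqr_bounds(values)
# Keep only the observations that fall outside the fences
outliers = values[(values < lower) | (values > upper)]
print(f"fences: ({lower}, {upper})")
print(outliers)
```

On the real data the same helper could be applied per numeric column, for example `iqr_bounds(df["Amount"])`.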


In [28]:
# List of numeric columns
numeric_cols = df.select_dtypes(include='number').columns

# Plot distribution
cr_eda.plot_numerical_distribution(numeric_cols)

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
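
This error is raised whenever a pandas `Index` is evaluated in a boolean context, for example `if not columns:`. The source of `plot_numerical_distribution` is not shown here, so the following is only a guess at the cause. The hypothetical `plot_columns` function stands in for the method and reproduces the same error when handed the `Index` returned by `select_dtypes(...).columns`. Converting the `Index` to a list before the call is one possible workaround.

```python
import pandas as pd

# Small made-up frame with two numeric columns and one text column
df_demo = pd.DataFrame({
    "Amount": [1000.0, -50.0],
    "Value": [1000, 50],
    "ProductCategory": ["airtime", "financial_services"],
})

def plot_columns(columns):
    """Hypothetical stand-in for a method that truth-tests its argument."""
    if not columns:  # ambiguous for a pandas Index, fine for a list
        return []
    return list(columns)

numeric_index = df_demo.select_dtypes(include="number").columns

# Passing the Index directly triggers the ValueError
try:
    plot_columns(numeric_index)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Workaround: pass a plain list, whose truthiness is well defined
numeric_cols = numeric_index.tolist()
result = plot_columns(numeric_cols)
print(result)
```

If the method is under your control, testing `len(columns) == 0` inside it would avoid the problem at the source.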