## CAR INSURANCE ANALYSIS

# Introduction 
This report presents an analysis of the factors influencing insurance premiums, specifically focusing on driver age and car age. The analysis employs Ordinary Least Squares (OLS) regression to quantify the relationships between these variables and the insurance premium.



Table of contents
* Data Wrangling
* Descriptive Statistics
* Correlation Analysis
* Regression Modeling

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 

# Data Wrangling

In this step we do the following:
* Load the dataset
* Check for datatypes and the summary statistics
* Check for outliers
* Check for missing values and Duplicated values
* Data Transformation

In [3]:
#load the dataset

car_data = pd.read_csv(r'c:\Users\User\Downloads\car_insurance_premium_dataset.csv')

car_data.head()

| | Driver Age | Driver Experience | Previous Accidents | Annual Mileage (x1000 km) | Car Manufacturing Year | Car Age | Insurance Premium ($) |
|---|---|---|---|---|---|---|---|
| 0 | 56 | 32 | 4 | 17 | 2002 | 23 | 488.35 |
| 1 | 46 | 19 | 0 | 21 | 2025 | 0 | 486.15 |
| 2 | 32 | 11 | 4 | 15 | 2020 | 5 | 497.55 |
| 3 | 60 | 0 | 4 | 19 | 1991 | 34 | 498.35 |
| 4 | 25 | 7 | 0 | 13 | 2005 | 20 | 495.55 |


In [4]:
#check the datatypes of the columns
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Driver Age                 1000 non-null   int64  
 1   Driver Experience          1000 non-null   int64  
 2   Previous Accidents         1000 non-null   int64  
 3   Annual Mileage (x1000 km)  1000 non-null   int64  
 4   Car Manufacturing Year     1000 non-null   int64  
 5   Car Age                    1000 non-null   int64  
 6   Insurance Premium ($)      1000 non-null   float64
dtypes: float64(1), int64(6)
memory usage: 54.8 KB


In [5]:
#display the summary statistics of the dataset
car_data.describe()

| | Driver Age | Driver Experience | Previous Accidents | Annual Mileage (x1000 km) | Car Manufacturing Year | Car Age | Insurance Premium ($) |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 41.575 | 14.759 | 2.568 | 17.933 | 2007.637 | 17.363 | 493.74225 |
| std | 13.765677 | 10.544292 | 1.6989 | 4.410665 | 10.363331 | 10.363331 | 5.909689 |
| min | 18.0 | 0.0 | 0.0 | 11.0 | 1990.0 | 0.0 | 477.05 |
| 25% | 30.0 | 6.0 | 1.0 | 14.0 | 1999.0 | 8.0 | 489.4875 |
| 50% | 42.0 | 13.0 | 3.0 | 18.0 | 2008.0 | 17.0 | 493.95 |
| 75% | 53.0 | 23.0 | 4.0 | 22.0 | 2017.0 | 26.0 | 498.3125 |
| max | 65.0 | 40.0 | 5.0 | 25.0 | 2025.0 | 35.0 | 508.15 |


* Before continuing, we check for outliers in the data using visualizations.

In [40]:
import plotly.express as px

# Create box plots for the numerical columns.
# Note: 'Insurance Premium' is the renamed 'Insurance Premium ($)' column, so the renaming cell (In [25]) must be run first.
fig = px.box(car_data, y=["Driver Age", "Driver Experience", "Previous Accidents", "Annual Mileage (x1000 km)", "Car Age", "Insurance Premium"], title="Boxplot for Numerical Columns")
fig.show()

In [30]:
import plotly.express as px
fig = px.histogram(car_data, x="Insurance Premium", marginal="violin", nbins=30, title="Distribution of Insurance Premium")
fig.show()

In [28]:
fig = px.histogram(car_data, x="Driver Age", marginal="violin", nbins=30, title="Distribution of Driver Age")
fig.show()

In [33]:
fig = px.histogram(car_data, x="Car Age", marginal="violin", nbins=30, title="Distribution of Car Age")
fig.show()
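
The visual outlier check can be complemented with a numeric rule. The sketch below applies the common 1.5 × IQR rule, the same fences a box plot draws by default. The helper name `iqr_outliers` and the toy column are assumptions for illustration; the first five values come from the `head()` output above and `650.0` is a deliberately extreme, made-up value.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy stand-in for a numeric column; 650.0 is a hypothetical extreme value.
toy = pd.DataFrame(
    {"Insurance Premium": [488.35, 486.15, 497.55, 498.35, 495.55, 650.0]}
)

flagged = iqr_outliers(toy["Insurance Premium"])
print(flagged)
```

On the real data the same helper could be applied to each numeric column of `car_data`; an empty result would agree with box plots that show no points beyond the whiskers.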

In [6]:
#check for missing values
car_data.isnull().sum()

Driver Age                   0
Driver Experience            0
Previous Accidents           0
Annual Mileage (x1000 km)    0
Car Manufacturing Year       0
Car Age                      0
Insurance Premium ($)        0
dtype: int64

In [7]:
#check for duplicates
car_data.duplicated().sum()

0

In [25]:
#data transformation by renaming a column
car_data.rename(columns={'Insurance Premium ($)': 'Insurance Premium'}, inplace=True)
car_data.head()

| | Driver Age | Driver Experience | Previous Accidents | Annual Mileage (x1000 km) | Car Manufacturing Year | Car Age | Insurance Premium |
|---|---|---|---|---|---|---|---|
| 0 | 56 | 32 | 4 | 17 | 2002 | 23 | 488.35 |
| 1 | 46 | 19 | 0 | 21 | 2025 | 0 | 486.15 |
| 2 | 32 | 11 | 4 | 15 | 2020 | 5 | 497.55 |
| 3 | 60 | 0 | 4 | 19 | 1991 | 34 | 498.35 |
| 4 | 25 | 7 | 0 | 13 | 2005 | 20 | 495.55 |


In [None]:
#check the distribution of the variables

#histogram driver age
import plotly.express as px

# Create a histogram with KDE for 'Driver Age'
fig = px.histogram(car_data, x="Driver Age", marginal="violin", nbins=30, title="Histogram with KDE for Driver Age")
fig.show()


In [36]:
#histogram car age
import plotly.express as px
# Create a histogram with KDE for 'Car Age'
fig = px.histogram(car_data, x="Car Age", marginal="violin", nbins=30, title="Histogram with KDE for Car Age")
fig.show()

In [39]:
# Create a histogram with KDE for 'Insurance Premium'
fig = px.histogram(car_data, x="Insurance Premium", marginal="violin", nbins=30, title="Histogram with KDE for Insurance Premium")
fig.show()

In [41]:
#create a scatterplot for driver age vs insurance premium
import plotly.express as px
# Scatter plot: Driver Age vs. Insurance Premium
fig = px.scatter(car_data, x="Driver Age", y="Insurance Premium", title="Driver Age vs. Insurance Premium", labels={"Driver Age": "Driver Age", "Insurance Premium": "Insurance Premium"})
fig.show()


In [None]:
# Scatter plot: Car Age vs. Insurance Premium using Plotly
fig = px.scatter(car_data, x="Car Age", y="Insurance Premium", 
                 title="Car Age vs. Insurance Premium", 
                 labels={"Car Age": "Car Age", "Insurance Premium": "Insurance Premium"})
fig.show()



In [47]:
import plotly.express as px

# Create bins for Driver Age
car_data['Driver Age Group'] = pd.cut(car_data['Driver Age'], bins=[0, 20, 40, 60, 80], labels=['0-20', '21-40', '41-60', '61-80'])

# Box plot: Insurance Premium by Driver Age Group
fig = px.box(car_data, x='Driver Age Group', y='Insurance Premium', title='Insurance Premium by Driver Age Group', labels={'Driver Age Group': 'Driver Age Group', 'Insurance Premium': 'Insurance Premium'})
fig.show()

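# Create bins for Car Age
# Note: pd.cut uses right-closed intervals, so cars aged 0 or older than 20 years fall outside these bins and become NaN.
car_data['Car Age Group'] = pd.cut(car_data['Car Age'], bins=[0, 5, 10, 15, 20], labels=['0-5', '6-10', '11-15', '16-20'])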

# Box plot: Insurance Premium by Car Age Group
fig = px.box(car_data, x='Car Age Group', y='Insurance Premium', title='Insurance Premium by Car Age Group', labels={'Car Age Group': 'Car Age Group', 'Insurance Premium': 'Insurance Premium'})
fig.show()
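
A small sketch of how `pd.cut` treats values at and beyond the bin edges may help when reading the grouped plots. The toy series and the wider `21-35` bin are assumptions for illustration, not part of the original analysis; the upper edge of 35 matches the maximum Car Age reported by `describe()`.

```python
import pandas as pd

# Toy car ages covering the edges of the observed range (0 to 35).
ages = pd.Series([0, 3, 5, 20, 23, 35], name="Car Age")

# Bins as used in the cell above: (0, 5], (5, 10], (10, 15], (15, 20].
narrow = pd.cut(ages, bins=[0, 5, 10, 15, 20],
                labels=["0-5", "6-10", "11-15", "16-20"])

# Hypothetical alternative: add a final bin and include the lowest edge.
wide = pd.cut(ages, bins=[0, 5, 10, 15, 20, 35],
              labels=["0-5", "6-10", "11-15", "16-20", "21-35"],
              include_lowest=True)

print(pd.DataFrame({"Car Age": ages, "narrow": narrow, "wide": wide}))
print("Rows left unbinned (narrow):", int(narrow.isna().sum()))
print("Rows left unbinned (wide):  ", int(wide.isna().sum()))
```

With the narrow edges, ages 0, 23, and 35 receive no group, so any plot or table built on that grouping silently omits them.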


# Descriptive statistics

In [48]:
# Descriptive statistics for Driver Age and Insurance Premium
driver_age_stats = car_data['Driver Age'].describe()
insurance_premium_stats = car_data['Insurance Premium'].describe()

print("Descriptive Statistics for Driver Age:")
print(driver_age_stats)

print("\nDescriptive Statistics for Insurance Premium:")
print(insurance_premium_stats)

Descriptive Statistics for Driver Age:
count    1000.000000
mean       41.575000
std        13.765677
min        18.000000
25%        30.000000
50%        42.000000
75%        53.000000
max        65.000000
Name: Driver Age, dtype: float64

Descriptive Statistics for Insurance Premium:
count    1000.000000
mean      493.742250
std         5.909689
min       477.050000
25%       489.487500
50%       493.950000
75%       498.312500
max       508.150000
Name: Insurance Premium, dtype: float64
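
Means and medians for several columns can also be computed in a single call, which keeps each statistic tied to its column label instead of a reused variable name. This is a minimal sketch on a toy frame built from the five `head()` rows shown earlier; the name `summary` is an assumption for illustration.

```python
import pandas as pd

# Toy frame: the five rows shown by car_data.head().
toy = pd.DataFrame({
    "Insurance Premium": [488.35, 486.15, 497.55, 498.35, 495.55],
    "Driver Age": [56, 46, 32, 60, 25],
})

# One row per statistic, one column per variable.
summary = toy[["Insurance Premium", "Driver Age"]].agg(["mean", "median"])
print(summary)

# Individual values are then looked up by (statistic, column).
print("Median Driver Age:", summary.loc["median", "Driver Age"])
```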


In [49]:
# Use separate variables so the second mean does not overwrite the first
average_premium = car_data['Insurance Premium'].mean()
average_driver_age = car_data['Driver Age'].mean()

print("Average Insurance Premium: ", average_premium)
print("Average Driver Age: ", average_driver_age)

Average Insurance Premium:  493.74225
Average Driver Age:  41.575


In [50]:
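# Use separate variables so the second median does not overwrite the first
median_premium = car_data['Insurance Premium'].median()
median_driver_age = car_data['Driver Age'].median()

print("Median Insurance Premium: ", median_premium)
print("Median Driver Age: ", median_driver_age)

Median Insurance Premium:  493.95
Median Driver Age:  42.0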


In [51]:
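# Use separate variables so the second mode does not overwrite the first
premium_mode = car_data['Insurance Premium'].mode()
driver_age_mode = car_data['Driver Age'].mode()

print("Mode Insurance Premium: ", premium_mode)
print("Mode Driver Age: ", driver_age_mode)

Mode Driver Age:  0    43
Name: Driver Age, dtype: int64

* Note : The Insurance Premium mode was not captured in the recorded output; only the Driver Age mode is shown.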


In [57]:
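# Create bins for Driver Age and Car Age
# Note: with these edges, cars aged 0 or older than 20 years fall outside every bin (NaN) and are excluded from the pivot table.
driver_age_bins = [0, 20, 40, 60, 80]
car_age_bins = [0, 5, 10, 15, 20]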

# Add grouped columns for Driver Age and Car Age
car_data['Driver Age Group'] = pd.cut(car_data['Driver Age'], bins=driver_age_bins, labels=['0-20', '21-40', '41-60', '61-80'])
car_data['Car Age Group'] = pd.cut(car_data['Car Age'], bins=car_age_bins, labels=['0-5', '6-10', '11-15', '16-20'])

# Create the pivot table
pivot_table = car_data.pivot_table(values='Insurance Premium', index='Driver Age Group', columns='Car Age Group', aggfunc='mean')

print(pivot_table)

Car Age Group            0-5        6-10       11-15       16-20
Driver Age Group                                                
0-20              498.645000  502.137500  501.380000  503.025000
21-40             496.202830  496.600000  497.277451  498.120833
41-60             489.587000  489.955556  490.888060  490.599180
61-80             485.492857  486.970000  484.587500  486.938462






# Correlation Analysis

* We are going to check the correlation between variables

In [58]:
# Calculate the correlation matrix
correlation_matrix = car_data[['Driver Age', 'Car Age', 'Insurance Premium']].corr()

print(correlation_matrix)

                   Driver Age   Car Age  Insurance Premium
Driver Age           1.000000 -0.008187          -0.776848
Car Age             -0.008187  1.000000           0.171829
Insurance Premium   -0.776848  0.171829           1.000000


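* Insight : There is a strong negative correlation between Driver Age and Insurance Premium (r ≈ -0.777) and a weak positive correlation between Car Age and Insurance Premium (r ≈ 0.172). Driver Age and Car Age are essentially uncorrelated (r ≈ -0.008).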

In [63]:
# Creating a heatmap
import plotly.graph_objects as go

heatmap = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.index,
    colorscale=[[0.0, '#440154'], [0.1111111111111111, '#482878'],
                [0.2222222222222222, '#3e4989'], [0.3333333333333333, '#31688e'],
                [0.4444444444444444, '#26828e'], [0.5555555555555556, '#1f9e89'],
                [0.6666666666666666, '#35b779'], [0.7777777777777778, '#6ece58'],
                [0.8888888888888888, '#b5de2b'], [1.0, '#fde725']],
    zmin=-1,
    zmax=1,
    colorbar=dict(title="Correlation")
))

heatmap.update_layout(
    title="Correlation Heatmap",
    xaxis_title="Variables",
    yaxis_title="Variables"
)

heatmap.show()

In [66]:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(car_data['Insurance Premium'], car_data['Driver Age'])
print("Pearson Correlation Coefficient: ", pearson_coef)

Pearson Correlation Coefficient:  -0.7768483315311161


* Insight : The value of -0.777 indicates a strong negative correlation: as Driver Age increases, the Insurance Premium tends to decrease.


In [67]:
pearson_coef, p_value = stats.pearsonr(car_data['Insurance Premium'], car_data['Car Age'])
print("Pearson Correlation Coefficient: ", pearson_coef)

Pearson Correlation Coefficient:  0.17182867618496808


* Insight : The value of 0.172 indicates a weak positive correlation: as Car Age increases, the Insurance Premium tends to increase slightly, but the relationship is not strong.
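
`stats.pearsonr` also returns a p-value, which the cells above compute but do not print. The sketch below reports both numbers and checks the coefficient against the textbook formula. It runs on synthetic data: the generating slope, noise level, and seed are arbitrary assumptions, not estimates from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumed values, not the real dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 66, size=200).astype(float)
premium = 500.0 - 0.3 * age + rng.normal(0.0, 3.0, size=200)

# Library result: correlation coefficient and two-sided p-value.
r, p_value = stats.pearsonr(premium, age)

# Same coefficient from the definition:
# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x = age - age.mean()
y = premium - premium.mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(f"r = {r:.3f}, manual r = {r_manual:.3f}, p-value = {p_value:.3g}")
```

A small p-value indicates the correlation is unlikely to be zero in the population; it does not say anything about how strong the relationship is, which is what `r` itself measures.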

In [69]:
import plotly.express as px

# Scatter plot: Driver Age vs. Insurance Premium
fig = px.scatter(car_data, x='Driver Age', y='Insurance Premium', 
                 title='Scatter Plot: Driver Age vs. Insurance Premium', 
                 labels={'Driver Age': 'Driver Age', 'Insurance Premium': 'Insurance Premium'})
fig.update_layout(xaxis_title='Driver Age', yaxis_title='Insurance Premium')
fig.show()

In [71]:
#Scatter plot: Car Age vs. Insurance Premium
fig = px.scatter(car_data, x='Car Age', y='Insurance Premium', 
                 title='Scatter Plot: Car Age vs. Insurance Premium', 
                 labels={'Car Age': 'Car Age', 'Insurance Premium': 'Insurance Premium'})
fig.update_layout(xaxis_title='Car Age', yaxis_title='Insurance Premium')
fig.show()

# Regression Modeling

* We conduct a regression analysis with an OLS model to quantify the relationship and to understand how much of the variation in Insurance Premium can be explained by Driver Age and Car Age.
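
To make the two quantities concrete, here is a rough sketch of what an OLS fit computes: the coefficients that minimize the sum of squared residuals, and R-squared as the share of variance explained. It uses NumPy on synthetic data rather than `statsmodels` on the real dataset; the generating coefficients (500, -0.3, 0.1), noise level, and seed are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-in data (assumed generating process, not the real dataset).
rng = np.random.default_rng(1)
n = 300
driver_age = rng.integers(18, 66, size=n).astype(float)
car_age = rng.integers(0, 36, size=n).astype(float)
premium = 500.0 - 0.3 * driver_age + 0.1 * car_age + rng.normal(0.0, 3.0, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), driver_age, car_age])

# Least-squares solution: minimizes ||premium - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, premium, rcond=None)

# R-squared = 1 - SS_residual / SS_total.
fitted = X @ beta
ss_res = np.sum((premium - fitted) ** 2)
ss_tot = np.sum((premium - premium.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept, driver age, car age:", np.round(beta, 3))
print("R-squared:", round(float(r_squared), 3))
```

Each slope is read as the expected change in premium for a one-unit increase in that variable with the other held fixed, and R-squared is the fraction of premium variance the two predictors account for together.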

In [74]:
import statsmodels.api as sm

#independent variables (X) and dependent variable (y)
X = car_data[['Driver Age', 'Car Age']]
y = car_data['Insurance Premium']

#constant to the model (intercept)
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:      Insurance Premium   R-squared:                       0.631
Model:                            OLS   Adj. R-squared:                  0.630
Method:                 Least Squares   F-statistic:                     852.0
Date:                Thu, 20 Mar 2025   Prob (F-statistic):          1.72e-216
Time:                        16:23:27   Log-Likelihood:                -2696.7
No. Observations:                1000   AIC:                             5399.
Df Residuals:                     997   BIC:                             5414.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        505.9451      0.410   1233.502      0.0