# Car Insurance Claim Data Analysis

**Aviva** offers multiple products and services spanning pensions, insurance and investments. One of the insurance products it offers is car insurance. To help model insurance premiums and offers, data is collated to help estimate the potential "risk" of each customer.

This notebook will:

1. Upload the relevant dataset
2. Clean the data
3. Profile the data
4. Create multiple visualisations

Possible questions for the dataset:


*   Does a more powerful car mean the owner is more likely to claim?
*   Does age affect the risk of a claim?



# Setup
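Import the relevant libraries: `pandas` for tabular data handling, `numpy` for numerical operations, `google.colab.files` for uploading files into the Colab session, and `skimpy` for a one-call summary of a DataFrame.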

In [None]:
# Only run this once
!pip -q install skimpy -U

In [None]:
# Run this every time
import pandas as pd
import numpy as np
from google.colab import files
from skimpy import skim

## Upload Dataset

Why manual upload like this?

Once the dataset is uploaded, the filename is printed to confirm a successful upload. The head (top 5 rows) is displayed in the Sample Data section below.

In [None]:
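print('Upload the relevant dataset')
uploaded = files.upload()  # opens the Colab file picker; returns {filename: file contents}

# Stop early if nothing was uploaded (for example, the dialog was cancelled)
if len(uploaded) == 0:
    raise SystemExit('File not uploaded.')
filename = next(iter(uploaded))  # use the first uploaded file

# Colab saves the upload to the working directory, so it can be read by name
data = pd.read_csv(filename)
print(f'Loaded `{filename}`')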


Upload the relevant dataset


Saving data.csv to data (2).csv
Loaded `data (2).csv`


# Preprocessing

Purpose of preprocessing?

## Sample Data

In [None]:
print("First five rows:")
data.head()

First five rows:


Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [None]:
print("Last five rows:")
data.tail()

Last five rows:


Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
11909,Acura,ZDX,2012,premium unleaded (required),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,46120
11910,Acura,ZDX,2012,premium unleaded (required),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,56670
11911,Acura,ZDX,2012,premium unleaded (required),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,50620
11912,Acura,ZDX,2013,premium unleaded (recommended),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,50920
11913,Lincoln,Zephyr,2006,regular unleaded,221.0,6.0,AUTOMATIC,front wheel drive,4.0,Luxury,Midsize,Sedan,26,17,61,28995


In [None]:
print('Random data sample')
data.sample(5)

Random data sample


Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
6837,Mitsubishi,Mirage,2017,regular unleaded,78.0,3.0,MANUAL,front wheel drive,4.0,Hatchback,Compact,4dr Hatchback,41,33,436,14795
4391,Ford,Explorer,2015,flex-fuel (unleaded/E85),290.0,6.0,AUTOMATIC,front wheel drive,4.0,"Crossover,Flex Fuel,Performance",Midsize,4dr SUV,24,17,5657,33000
10520,Pontiac,Torrent,2008,regular unleaded,185.0,6.0,AUTOMATIC,front wheel drive,4.0,Crossover,Compact,4dr SUV,24,17,210,23520
1189,Plymouth,Acclaim,1993,regular unleaded,100.0,4.0,MANUAL,front wheel drive,4.0,,Compact,Sedan,29,22,535,2000
2412,Volkswagen,CC,2016,premium unleaded (recommended),200.0,4.0,AUTOMATED_MANUAL,front wheel drive,4.0,Performance,Midsize,Sedan,31,22,873,34475


In [None]:
print("Data dimensions: ", data.shape)

Data dimensions:  (11914, 16)


In [None]:
print("Dataset Information:")
data.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5), object(8)

## Data Profiling using skimpy

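`skim` from skimpy prints a summary of every column (data type, missing values and descriptive statistics) in a single call, which gives a quick profile of the dataset before cleansing.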

In [None]:
# The DataFrame created in the upload cell is named `data`
if 'data' in globals():
    skim(data)  # prints the summary table directly
else:
    print('No dataset found. Please run the upload cell first.')


## Data cleansing

why?

Remove unnecessary columns first, and explain why



In [None]:
# code for removing columns
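
A minimal sketch of how the column removal could look, using `DataFrame.drop`. The choice of `Market Category` and `Popularity` is only an assumption for illustration (the notebook has not yet decided which columns are unnecessary), and the small `sample` frame stands in for `data`, with values copied from rows shown in the sample output.

```python
import pandas as pd

# Small stand-in for `data`; values copied from rows in the sample output above.
sample = pd.DataFrame({
    "Make": ["BMW", "Plymouth", "Lincoln"],
    "Model": ["1 Series M", "Acclaim", "Zephyr"],
    "Engine HP": [335.0, 100.0, 221.0],
    "Market Category": ["Factory Tuner,Luxury,High-Performance", None, "Luxury"],
    "Popularity": [3916, 535, 61],
    "MSRP": [46135, 2000, 28995],
})

# Hypothetical choice of columns to drop; adjust to the questions being asked.
columns_to_drop = ["Market Category", "Popularity"]

# errors="ignore" skips names that are not present, so the cell can be re-run safely.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

Assigning the result to a new name keeps the original frame intact, which makes it easier to compare before and after.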

Talk about why I am checking null

In [None]:
data.isnull().sum()

Unnamed: 0,0
Make,0
Model,0
Year,0
Engine Fuel Type,3
Engine HP,69
Engine Cylinders,30
Transmission Type,0
Driven_Wheels,0
Number of Doors,6
Market Category,3742
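
The raw counts are easier to judge as a share of all rows: for example, `Market Category` is missing in 3742 of 11914 rows, roughly 31%. The sketch below shows one way to tabulate both figures. The `sample` frame is toy data with deliberately inserted gaps, not values from the dataset, and the names `missing_count` and `missing_pct` are illustrative.

```python
import pandas as pd

# Toy data with deliberately inserted gaps (not taken from the real dataset).
sample = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Engine HP": [335.0, None, 221.0, 300.0],
    "Market Category": ["Luxury", None, "Luxury", None],
})

# isnull().sum() counts missing values per column;
# isnull().mean() gives the fraction of missing values per column.
missing = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing)
```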


Now see and handle duplicates
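
One possible way to see and handle duplicates with pandas is sketched below. It assumes that a duplicate means a row identical in every column. The `sample` frame is illustrative: its first two rows are an intentionally repeated record.

```python
import pandas as pd

# Illustrative frame; the second row deliberately repeats the first.
sample = pd.DataFrame({
    "Make": ["Acura", "Acura", "Acura", "Lincoln"],
    "Model": ["ZDX", "ZDX", "ZDX", "Zephyr"],
    "Year": [2012, 2012, 2012, 2006],
    "MSRP": [46120, 46120, 56670, 28995],
})

# duplicated() marks every repeat of an earlier, fully identical row.
dup_mask = sample.duplicated(keep="first")
print("Duplicate rows:", int(dup_mask.sum()))

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = sample.drop_duplicates(keep="first").reset_index(drop=True)
print("Rows after de-duplication:", len(deduped))
```

Rows that share a model but differ in price (different trims, for instance) are not flagged, because the comparison covers all columns.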

Then incorrect values

Fix datatypes?
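
In the `data.info()` output, `Engine Cylinders` and `Number of Doors` are stored as `float64` even though they are counts, most likely because they contain missing values. A possible fix, sketched here on toy data, is pandas' nullable `Int64` dtype, which keeps whole numbers and missing values together. Converting `Transmission Type` to `category` is a further assumption about how that column will be used.

```python
import pandas as pd

# Toy data with gaps, mimicking the float columns seen in data.info().
sample = pd.DataFrame({
    "Engine Cylinders": [6.0, None, 4.0],
    "Number of Doors": [2.0, 4.0, None],
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL"],
})

# "Int64" (capital I) is the nullable integer dtype: missing values become <NA>.
typed = sample.astype({
    "Engine Cylinders": "Int64",
    "Number of Doors": "Int64",
    "Transmission Type": "category",
})

print(typed.dtypes)
```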

Handle outliers: use box plots
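
Box plot whiskers are commonly drawn at 1.5 times the interquartile range (IQR) beyond the quartiles, so the same rule can flag outliers numerically. The sketch below assumes that convention. The helper name `iqr_outlier_mask` is hypothetical, the first six prices are MSRP values from the sample rows above, and the final value is an invented extreme added only to trigger the rule.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Six MSRP values from the sample rows, plus one invented extreme value.
msrp = pd.Series(
    [14795, 23520, 28995, 33000, 34475, 46135, 2_000_000], name="MSRP"
)

mask = iqr_outlier_mask(msrp)
print(msrp[mask])
```

Flagged rows are candidates for review rather than automatic removal: a very expensive car may be a genuine record.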

# EDA

Section to include ML

# Visualisations

At least 3 visualisations

# Reflection

* Discuss how the findings from the data can inform business intelligence.
* Reflect on the ethical considerations related to the data handling, analysis, and visualisation process.
