# Exploratory Analysis of datasets
This Jupyter Notebook contains an exploratory analysis of a dataset from an insurance company based in the USA.

First, we import the libraries and the dataset that we will need.

In [190]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Default figure size for every plot, in inches (width, height).
plt.rcParams["figure.figsize"] = (10, 10)

## Insurance file
We import the file.

In [191]:
ins = pd.read_csv('../Data/Clean_data/Clean_Insurance_USA.csv', index_col = 0)

In [192]:
# Collapse the complaint count into a 0/1 flag: 1 if the client has any open complaint.
ins['Number_Open_Complaints'] = ins['Number_Open_Complaints'].apply(lambda x: 0 if x == 0 else 1)
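
The 0/1 flag matters for everything that follows: the sum of the flag counts the clients with at least one open complaint, and its mean is their share. A minimal sketch on made-up values (the `toy` frame is hypothetical, not the insurance data) shows the same rule written as a vectorized comparison:

```python
import pandas as pd

# Hypothetical complaint counts; not taken from the insurance file.
toy = pd.DataFrame({"Number_Open_Complaints": [0, 2, 0, 1, 5]})

# Same rule as the lambda in the notebook: 0 stays 0, anything else becomes 1.
via_apply = toy["Number_Open_Complaints"].apply(lambda x: 0 if x == 0 else 1)

# Vectorized form: compare the whole column at once, then cast bool to int.
via_compare = (toy["Number_Open_Complaints"] != 0).astype(int)

print(via_compare.tolist())   # flags per client
print(via_compare.sum())      # number of clients with an open complaint
print(via_compare.mean())     # share of clients with an open complaint
```

Both forms should give the same flags; the comparison avoids a Python-level call per row.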

Let's see what percentage of customers comes from each state.

In [193]:
pd.DataFrame(ins['State'].value_counts(ascending = False, normalize = True)*100)

State,Share of clients (%)
California,34.486534
Oregon,28.476024
Arizona,18.644624
Nevada,9.656229
Washington,8.736589


Approximately 82% of the clients come from California, Oregon and Arizona.
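
The combined share of the largest states can be checked with a running total. The sketch below rebuilds the State table from the percentages printed above (typed in by hand, so it does not need the CSV file) and accumulates them:

```python
import pandas as pd

# Percentages copied from the State table above.
state_share = pd.Series(
    {
        "California": 34.486534,
        "Oregon": 28.476024,
        "Arizona": 18.644624,
        "Nevada": 9.656229,
        "Washington": 8.736589,
    }
)

# Running total of the shares, largest state first.
cumulative = state_share.sort_values(ascending=False).cumsum()
print(cumulative.round(1))
```

The third entry, Arizona, is the combined share of the top three states. On the notebook's own frame, `ins['State'].value_counts(normalize=True).cumsum() * 100` should give the same running total, since `value_counts` sorts in descending order by default.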

In [25]:
pd.DataFrame(ins['Location'].value_counts(ascending = False, normalize = True)*100)

Location,Share of clients (%)
Suburban,63.269104
Rural,19.410992
Urban,17.319904


63% of clients live in Suburban areas.

In [26]:
pd.DataFrame(ins['Gender'].value_counts(ascending = False, normalize = True)*100)

Gender,Share of clients (%)
F,50.996278
M,49.003722


51% of the clients are female.

In [54]:
ins[['Gender', 'Number_Open_Complaints']].groupby('Gender').sum()*100/len(ins)

Gender,Number_Open_Complaints (% of all clients)
F,10.652507
M,9.951828


In [91]:
gender = ins[['Gender', 'Number_Open_Complaints']].groupby('Gender').sum()

There is little difference between female and male clients in terms of open complaints.
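
The percentages in the complaints-by-gender table are shares of all clients, so they partly reflect how large each group is. A rough check, dividing them by each gender's share of clients (both sets of figures are copied from the outputs above), gives the complaint rate within each gender:

```python
import pandas as pd

# Figures copied from the two Gender tables above (percent of all clients).
gender_share = pd.Series({"F": 50.996278, "M": 49.003722})
with_complaint = pd.Series({"F": 10.652507, "M": 9.951828})

# Share of each gender that has at least one open complaint.
rate_within_gender = with_complaint / gender_share * 100
print(rate_within_gender.round(2))
```

Assuming `Number_Open_Complaints` is still the 0/1 flag set earlier, `ins.groupby('Gender')['Number_Open_Complaints'].mean() * 100` should produce the same rates directly from the data.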

In [32]:
pd.DataFrame(ins['Coverage'].value_counts(ascending = False, normalize = True)*100)

Coverage,Share of clients (%)
Basic,60.959054
Extended,30.019707
Premium,9.021239


About 61% of clients use Basic coverage, whereas 30% and 9% use Extended and Premium respectively.

In [49]:
# Are clients with Basic coverage more likely to have an open complaint than those with Premium?
ins.groupby('Coverage')['Number_Open_Complaints'].sum()*100/ins.groupby('Coverage')['Number_Open_Complaints'].count()

Coverage
Basic       20.761494
Extended    20.860686
Premium     18.689320
Name: Number_Open_Complaints, dtype: float64

Is income related to the type of coverage?

In [189]:
# Percentage of clients in each income band
bins = [0,15000,30000,45000,60000,75000,90000,105000] # Edges of the income bands
pd.DataFrame(pd.cut(ins['Income'], bins = bins).value_counts(ascending = False, normalize = True)*100)

Income band,Share of clients (%)
"(15000, 30000]",24.292211
"(30000, 45000]",19.29001
"(60000, 75000]",16.634883
"(45000, 60000]",16.400176
"(75000, 90000]",12.732874
"(90000, 105000]",7.026551
"(0, 15000]",3.623295

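The income table shows how clients are spread across income bands, but not how coverage varies between them. One way to look at that is a row-normalised cross-tabulation. The sketch below uses the notebook's column names `Income` and `Coverage` and the same bin edges, but the rows in `toy` and the `Income_Band` column are hypothetical, made up for illustration:

```python
import pandas as pd

# Hypothetical rows for illustration only; not the insurance data.
toy = pd.DataFrame(
    {
        "Income": [0, 12000, 28000, 40000, 52000, 70000, 88000, 99000],
        "Coverage": ["Basic", "Basic", "Basic", "Extended",
                     "Basic", "Extended", "Premium", "Premium"],
    }
)

bins = [0, 15000, 30000, 45000, 60000, 75000, 90000, 105000]

# pd.cut builds right-closed intervals, so an income of exactly 0 falls
# outside (0, 15000] and becomes NaN unless include_lowest=True is set.
toy["Income_Band"] = pd.cut(toy["Income"], bins=bins, include_lowest=True)

# Within each income band, the percentage of clients on each coverage level.
coverage_by_income = pd.crosstab(
    toy["Income_Band"], toy["Coverage"], normalize="index"
) * 100
print(coverage_by_income.round(1))
```

Each row sums to 100, so the bands can be compared even though they hold different numbers of clients. Whether the real data contains zero incomes is not shown in this notebook; if it does, the `pd.cut` call without `include_lowest=True` leaves those clients out of the percentages.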

In [188]:
# Percentage of clients with an open complaint, by car type
pd.DataFrame(ins.groupby('Car_Type')['Number_Open_Complaints'].sum()*100/ins.groupby('Car_Type')['Number_Open_Complaints'].count())

Car_Type,Clients with an open complaint (%)
Four-Door Car,21.380654
Luxury Car,22.08589
Luxury SUV,23.369565
SUV,19.54343
Sports Car,19.008264
Two-Door Car,19.724284


In [107]:
pd.DataFrame(ins.groupby('Policy_Type')['Number_Open_Complaints'].sum()*100/ins.groupby('Policy_Type')['Number_Open_Complaints'].count())

Policy_Type,Clients with an open complaint (%)
Corporate Auto,20.426829
Personal Auto,20.713023
Special Auto,19.57672
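
The same sum-over-count expression is repeated for coverage, car type and policy type. Because the complaint column is a 0/1 flag, the group mean gives the same percentage, so the pattern can be wrapped in a small helper. `complaint_rate` and the `toy` frame are hypothetical names introduced here, not part of the original notebook:

```python
import pandas as pd


def complaint_rate(df, by, flag="Number_Open_Complaints"):
    """Percentage of rows in each `by` group whose 0/1 `flag` equals 1."""
    return df.groupby(by)[flag].mean() * 100


# Hypothetical rows for illustration only; not the insurance data.
toy = pd.DataFrame(
    {
        "Policy_Type": ["Personal Auto", "Personal Auto",
                        "Corporate Auto", "Corporate Auto", "Special Auto"],
        "Number_Open_Complaints": [1, 0, 0, 0, 1],
    }
)

rates = complaint_rate(toy, "Policy_Type")

# The notebook's longer form, for comparison.
grouped = toy.groupby("Policy_Type")["Number_Open_Complaints"]
long_form = grouped.sum() * 100 / grouped.count()

print(rates)
```

With the notebook's frame, calls such as `complaint_rate(ins, 'Car_Type')` or `complaint_rate(ins, 'Coverage')` should reproduce the tables above, assuming the flag column has already been binarized.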
