<a href="https://colab.research.google.com/github/27136thapelo/Insurance-Claim-Analysis/blob/main/Insurance_claim_analysis_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Insurance Claim Analysis

___________________________________________________________________________

Contributors:

1. Nokukhanya Magagaula
2. Thapelo Nkhumishe

____________________________________________________________________________

<a id="cont"></a>
# Table of Contents

<a href=#one>1. Problem Statement</a>

<a href=#two>2. Importing Packages</a>

<a href=#three>3. Loading the Data</a>

<a href=#four>4. Data Description</a>

<a href=#five>5. Data Pre-processing</a>

<a href=#six>6. Exploratory Data Analysis</a>



 <a id="one"></a>
## 1. Problem Statement
<a href=#cont>Back to Table of Contents</a>

---

 <a id="two"></a>
## 2. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---

In [2]:
import pandas as pd
import numpy as np
import re
import math 


 <a id="three"></a>
## 3. Loading the Data
<a href=#cont>Back to Table of Contents</a>

---

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/27136thapelo/Insurance-Claim-Analysis/main/insurance_data.csv")

#brief look at the data 
df.head(10)

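| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
| 1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
| 2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
| 3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.4 |
| 4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
| 5 | 5 | 6 | NaN | male | 34.4 | 96 | Yes | 0 | No | northwest | 1137.47 |
| 6 | 6 | 7 | NaN | male | 37.3 | 86 | Yes | 0 | No | northwest | 1141.45 |
| 7 | 7 | 8 | 19.0 | male | 41.1 | 100 | No | 0 | No | northwest | 1146.8 |
| 8 | 8 | 9 | 20.0 | male | 43.0 | 86 | No | 0 | No | northwest | 1149.4 |
| 9 | 9 | 10 | 30.0 | male | 53.1 | 97 | No | 0 | No | northwest | 1163.46 |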


 <a id="four"></a>
## 4. Data Description
<a href=#cont>Back to Table of Contents</a>

---

This dataset describes insurance claims together with the demographic and health profile of each claimant: patient age, gender, BMI (Body Mass Index), blood pressure, diabetic status, number of children, smoking status and region.

Comparing these factors across regions and across demographic groups such as age or gender shows how claims vary between groups. That understanding can inform our decisions when considering potential customers for our services.

On a broader scale, it can inform public policy by allowing more targeted support for those who are most in need and vulnerable.

(Data source: https://www.kaggle.com/datasets/thedevastator/insurance-claim-analysis-demographic-and-health)

The dataset consists of 11 columns, namely:

- index - row number (dropped later in pre-processing)
- PatientID - unique identifier of the patient
- age - age of the patient
- gender - gender of the patient
- bmi - patient's body mass index
- bloodpressure - patient's blood pressure
- diabetic - whether the insured person is diabetic or not (Yes/No)
- children - number of children of the insured person (integer)
- smoker - whether the insured person is a smoker or not (Yes/No)
- region - patient's region (four distinct values)
- claim - amount of the insurance claim (float)
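
The `diabetic` and `smoker` columns hold the text values Yes and No rather than true booleans. The sketch below uses a small hand-made frame (the rows and the `as_bool` name are illustrative, not part of the notebook) to show how those columns load as text and how they could be mapped to booleans if a later step needs them.

```python
import pandas as pd

# Toy rows mimicking the layout of the dataset (values are illustrative).
sample = pd.DataFrame({
    "age": [39.0, 24.0, None],
    "gender": ["male", "male", "female"],
    "diabetic": ["Yes", "No", "Yes"],
    "smoker": ["No", "No", "Yes"],
    "children": [0, 0, 2],
    "claim": [1121.87, 1131.51, 1135.94],
})

# Yes/No columns are stored as text, not as booleans.
flag_cols = ["diabetic", "smoker"]
as_bool = sample[flag_cols].apply(lambda col: col.map({"Yes": True, "No": False}))

print(sample.dtypes)
print(as_bool.dtypes)
```

Keeping the original text columns and deriving booleans separately leaves the raw values available for plotting labels.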

 <a id="five"></a>
## 5. Data Pre-processing
<a href=#cont>Back to Table of Contents</a>

---

In [None]:
df.shape
#Checking the number of columns and rows 

(1340, 11)

In [None]:
df.isna().sum()
#Checking data for null values 

index            0
PatientID        0
age              5
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           3
claim            0
dtype: int64

The output above shows that `age` has 5 missing values and `region` has 3.
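
With 5 and 3 gaps out of 1340 rows, each column is missing well under 1% of its values. A minimal sketch of how counts and percentages could be reported side by side is shown below; the `toy` frame and the `missing` table are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with gaps in two columns.
toy = pd.DataFrame({
    "age": [39.0, 24.0, np.nan, np.nan],
    "region": ["southeast", None, "northwest", "northwest"],
    "claim": [1121.87, 1131.51, 1135.94, 1136.40],
})

# isna().sum() counts gaps; isna().mean() gives the share of rows that are missing.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps.
missing = missing[missing["n_missing"] > 0]
print(missing)
```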

In [4]:
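cat_cols = ['region']
for col in cat_cols:
    # Fill missing categories with the most frequent value (the mode).
    # Assigning back avoids the chained inplace fillna, which newer pandas warns about.
    df[col] = df[col].fillna(df[col].mode()[0])
df.isna().sum()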

index            0
PatientID        0
age              5
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           0
claim            0
dtype: int64

In [14]:
num_cols = ['age']
for col in num_cols:
    # Fill missing ages with the column mean, then round to whole years.
    df[col] = df[col].fillna(df[col].mean()).round(0)
df.isna().sum()

index            0
PatientID        0
age              0
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           0
claim            0
dtype: int64
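
The mode fill for `region` and the mean fill for `age` can also be folded into one helper that returns a new frame and leaves its input untouched. This is only a sketch: `impute_simple` and the `toy` frame are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def impute_simple(frame, cat_cols, num_cols):
    """Return a copy with categorical gaps filled by the mode
    and numeric gaps filled by the mean, rounded to whole numbers."""
    out = frame.copy()
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    for col in num_cols:
        out[col] = out[col].fillna(out[col].mean()).round(0)
    return out

# Illustrative data with one gap per column.
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 41.0],
    "region": ["northwest", None, "northwest", "southeast"],
})

filled = impute_simple(toy, cat_cols=["region"], num_cols=["age"])
print(filled)
```

Because the helper works on a copy, the cell can be re-run without changing the result.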

In [15]:
df.head(10)

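| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
| 1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
| 2 | 2 | 3 | 38.0 | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
| 3 | 3 | 4 | 38.0 | male | 33.7 | 80 | No | 0 | No | northwest | 1136.4 |
| 4 | 4 | 5 | 38.0 | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
| 5 | 5 | 6 | 38.0 | male | 34.4 | 96 | Yes | 0 | No | northwest | 1137.47 |
| 6 | 6 | 7 | 38.0 | male | 37.3 | 86 | Yes | 0 | No | northwest | 1141.45 |
| 7 | 7 | 8 | 19.0 | male | 41.1 | 100 | No | 0 | No | northwest | 1146.8 |
| 8 | 8 | 9 | 20.0 | male | 43.0 | 86 | No | 0 | No | northwest | 1149.4 |
| 9 | 9 | 10 | 30.0 | male | 53.1 | 97 | No | 0 | No | northwest | 1163.46 |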


In [None]:
df.describe(include='all').round(0)
#Getting a deeper look into the features of the dataset


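| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1340.0 | 1340.0 | 1340.0 | 1340 | 1340.0 | 1340.0 | 1340 | 1340.0 | 1340 | 1340 | 1340.0 |
| unique | NaN | NaN | NaN | 2 | NaN | NaN | 2 | NaN | 2 | 4 | NaN |
| top | NaN | NaN | NaN | male | NaN | NaN | No | NaN | No | southeast | NaN |
| freq | NaN | NaN | NaN | 678 | NaN | NaN | 698 | NaN | 1066 | 446 | NaN |
| mean | 670.0 | 670.0 | 38.0 | NaN | 31.0 | 94.0 | NaN | 1.0 | NaN | NaN | 13253.0 |
| std | 387.0 | 387.0 | 11.0 | NaN | 6.0 | 11.0 | NaN | 1.0 | NaN | NaN | 12110.0 |
| min | 0.0 | 1.0 | 18.0 | NaN | 16.0 | 80.0 | NaN | 0.0 | NaN | NaN | 1122.0 |
| 25% | 335.0 | 336.0 | 29.0 | NaN | 26.0 | 86.0 | NaN | 0.0 | NaN | NaN | 4720.0 |
| 50% | 670.0 | 670.0 | 38.0 | NaN | 30.0 | 92.0 | NaN | 1.0 | NaN | NaN | 9370.0 |
| 75% | 1004.0 | 1005.0 | 47.0 | NaN | 35.0 | 99.0 | NaN | 2.0 | NaN | NaN | 16604.0 |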


In [None]:
df['diabetic'] = df['diabetic'].replace({'Yes': 'diabetic', 'No': 'non-diabetic'})
df['smoker'] = df['smoker'].replace({'Yes': 'smoker', 'No': 'non-smoker'})
df.head()
#Here replace() swaps the 'Yes'/'No' values in the smoker and diabetic columns for descriptive labels, because the bare 'Yes'/'No' values would be a problem in the EDA plots.


Now the smoker and diabetic features contain smoker/non-smoker and diabetic/non-diabetic instead of Yes and No.
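
The same relabelling can be written as a single call by passing `DataFrame.replace` a nested dictionary of the form `{column: {old: new}}`. The sketch below runs on a hypothetical `toy` frame rather than the real data, and the `labels` and `relabelled` names are introduced here for illustration only.

```python
import pandas as pd

# Illustrative Yes/No columns.
toy = pd.DataFrame({
    "diabetic": ["Yes", "No", "Yes"],
    "smoker": ["No", "No", "Yes"],
})

# One mapping per column: {column: {old value: new value}}.
labels = {
    "diabetic": {"Yes": "diabetic", "No": "non-diabetic"},
    "smoker": {"Yes": "smoker", "No": "non-smoker"},
}

# replace() returns a new frame; the original keeps its Yes/No values.
relabelled = toy.replace(labels)
print(relabelled["smoker"].value_counts())
```

Checking `value_counts()` afterwards is a quick way to confirm that no unexpected spellings were left unmapped.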



In [None]:
df.info()
#Looking at the dataframe before exploring the data for analysis.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          1340 non-null   int64  
 1   PatientID      1340 non-null   int64  
 2   age            1340 non-null   float64
 3   gender         1340 non-null   object 
 4   bmi            1340 non-null   float64
 5   bloodpressure  1340 non-null   int64  
 6   diabetic       1340 non-null   object 
 7   children       1340 non-null   int64  
 8   smoker         1340 non-null   object 
 9   region         1340 non-null   object 
 10  claim          1340 non-null   float64
dtypes: float64(3), int64(4), object(4)
memory usage: 115.3+ KB


In [16]:
# Dropping index column to remove noise

df = df.drop("index", axis=1)
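
Re-running the drop cell in the same session raises a `KeyError`, because the `index` column is already gone. One possible way to make the step safe to repeat, sketched on a hypothetical `toy` frame, is to pass `errors="ignore"`:

```python
import pandas as pd

# Illustrative frame with a redundant index column.
toy = pd.DataFrame({
    "index": [0, 1],
    "PatientID": [1, 2],
    "claim": [1121.87, 1131.51],
})

# errors="ignore" makes the drop a no-op when the column is already gone.
once = toy.drop(columns=["index"], errors="ignore")
twice = once.drop(columns=["index"], errors="ignore")
print(list(twice.columns))
```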

In [17]:
df.head()

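| | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
| 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
| 2 | 3 | 38.0 | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
| 3 | 4 | 38.0 | male | 33.7 | 80 | No | 0 | No | northwest | 1136.4 |
| 4 | 5 | 38.0 | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |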


 <a id="six"></a>
## 6. Exploratory Data Analysis
<a href=#cont>Back to Table of Contents</a>