# Mediclaim prediction project

## PACKAGES USED

In [2]:
import pandas as pd    #importing the pandas library and aliasing as pd

## DATASET USED

In [7]:
df=pd.read_csv("insurance.csv")  #insurance.csv name of our dataset which is in the form of a csv file
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


The CSV file is loaded into a DataFrame with **read_csv()**, and **head()** shows its first five rows.
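
As a small sketch of what read_csv() does, the snippet below parses the five rows shown above from an in-memory string rather than from the file on disk. The `sample_csv` text and the choice of the `category` dtype for the text columns are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# First five rows as shown by df.head(); kept in a string so the sketch
# runs without insurance.csv on disk.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""

# Assumption: storing the text columns as 'category' is optional and only
# saves memory; the notebook itself keeps the default dtype.
sample = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"sex": "category", "smoker": "category", "region": "category"},
)
print(sample.shape)
print(sample.dtypes)
```

read_csv() accepts a path or any file-like object, which is why the same call works on a string buffer here.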

## CHECKING NULL VALUE

In [45]:
df.info()  #NO NULL VALUES FOUND

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [98]:
df.isnull().sum().sum()  # total number of null values across all columns

0

Clearly, there are no null values in the dataset.
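
A count taken from a single column can hide gaps in the others, so it may help to look at the per-column counts and the overall total side by side. The sketch below uses a hypothetical toy frame (`toy`) with two deliberately missing values; it is not the insurance data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two deliberately missing values.
toy = pd.DataFrame({
    "age": [19, 18, np.nan],
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", "no"],
})

nulls_per_column = toy.isnull().sum()        # one count per column
total_nulls = int(nulls_per_column.sum())    # single number for the whole frame
any_nulls = bool(toy.isnull().values.any())  # quick yes/no check

print(nulls_per_column)
print(total_nulls, any_nulls)
```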

## DATASET DESCRIPTION

In [51]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

- Age: Age of primary beneficiary
- Sex: Primary beneficiary’s gender
- BMI: Body mass index (weight in kilograms divided by the square of height in metres; indicates whether weight is high or low relative to height)
- Children: Number of children covered by health insurance / Number of dependents
- Smoker: Smoking (yes, no)
- Region: Beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
- Charges: Individual medical costs billed by health insurance

In [53]:
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


The describe() method computes summary statistics such as count, mean, standard deviation, minimum, maximum and percentiles for a Series or DataFrame. By default it summarises only the numeric columns, which is why sex, smoker and region are absent from the output above; passing include='all' also covers object columns in a DataFrame of mixed data types.
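
To illustrate the difference between the default call and `include="all"`, here is a minimal sketch on a hypothetical four-row frame (`toy`), not the insurance data.

```python
import pandas as pd

# Hypothetical mini-sample with one numeric and one text column.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "smoker": ["yes", "no", "no", "no"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # adds unique/top/freq for text columns

print(numeric_summary.columns.tolist())
print(full_summary.loc[["unique", "top", "freq"], "smoker"])
```

For text columns the useful rows are `unique`, `top` (most frequent value) and `freq` (how often it occurs); the numeric rows are NaN there.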

## UNDERSTANDING THE DATA

In [162]:
location=df.groupby(by='region').size()  
location                              

region
northeast    324
northwest    325
southeast    364
southwest    325
dtype: int64

A **groupby** operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

The Northeast region has the **fewest** patients (324), whereas the Southeast region has the **most** patients (364).
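
Reading the largest and smallest group off a table by eye is error-prone, so one option is to let pandas pick them. The sketch below rebuilds the counts from the output above as a Series and uses `idxmax()` and `idxmin()`; in the notebook the same calls could be made on `location` directly.

```python
import pandas as pd

# Counts copied from the groupby output above.
location = pd.Series(
    {"northeast": 324, "northwest": 325, "southeast": 364, "southwest": 325},
    name="patients",
)

most = location.idxmax()     # label of the largest group
fewest = location.idxmin()   # label of the smallest group

print(most, location.max())
print(fewest, location.min())
```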

In [55]:
s = df.groupby(by='sex').size()
print(s)

sex
female    662
male      676
dtype: int64


There are **662** female patients and **676** male patients.

In [56]:
smoker=df.groupby(by='smoker').size()
smoker

smoker
no     1064
yes     274
dtype: int64

Patients who smoke: **274**

Patients who do not smoke: **1064**

In [75]:
female_smoker=len(df.loc[(df['sex']=='female') & (df['smoker']=='yes')])  # number of rows matching both conditions
male_smoker=len(df.loc[(df['sex']=='male') & (df['smoker']=='yes')])
print(female_smoker)
print(male_smoker)

115
159


Female patients who smoke: **115**

Male patients who smoke: **159**
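
The two filtered counts can also be read from a single cross-tabulation. The sketch below uses `pd.crosstab` on a hypothetical six-row sample (`toy`); in the notebook the inputs would be `df["sex"]` and `df["smoker"]`.

```python
import pandas as pd

# Hypothetical six-row sample, not the insurance data.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "yes", "no", "yes", "no"],
})

# Rows are sexes, columns are smoker status, cells are row counts.
table = pd.crosstab(toy["sex"], toy["smoker"])
female_smokers = table.loc["female", "yes"]
male_smokers = table.loc["male", "yes"]

print(table)
```

A cross-tabulation also gives the non-smoker counts for each sex at no extra cost, which makes it easy to compare proportions rather than raw counts.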

In [149]:
cost_smoking=df.groupby(by='smoker').mean(numeric_only=True)  # numeric_only avoids errors on the text columns in recent pandas
cost_smoking.charges

smoker
no      8434.268298
yes    32050.231832
Name: charges, dtype: float64

Non-smokers have a mean charge of only about **8400**, while smokers have a much higher mean charge of about **32050**.
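
To put the gap in proportion, the two means from the output above can be compared directly. This is plain arithmetic on the printed values; the variable names are only illustrative.

```python
# Mean charges copied from the groupby output above.
mean_charge_non_smoker = 8434.268298
mean_charge_smoker = 32050.231832

ratio = mean_charge_smoker / mean_charge_non_smoker
difference = mean_charge_smoker - mean_charge_non_smoker

print(f"ratio: {ratio:.2f}")            # ratio: 3.80
print(f"difference: {difference:.2f}")  # difference: 23615.96
```

On these figures, smokers are billed about 3.8 times as much as non-smokers on average.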


In [86]:
df.groupby(by='sex').mean(numeric_only=True)

sex,age,bmi,children,charges
female,39.503021,30.377749,1.074018,12569.578844
male,38.91716,30.943129,1.115385,13956.751178


Male patients have higher mean charges than female patients (about 13,957 versus 12,570), a difference of roughly 11%.

In [87]:
df.groupby(by='region').mean(numeric_only=True)

region,age,bmi,children,charges
northeast,39.268519,29.173503,1.046296,13406.384516
northwest,39.196923,29.199785,1.147692,12417.575374
southeast,38.93956,33.355989,1.049451,14735.411438
southwest,39.455385,30.596615,1.141538,12346.937377


In [95]:
children = df.groupby(by='children').size()
print(children)

children
0    574
1    324
2    240
3    157
4     25
5     18
dtype: int64


This shows the number of patients for each count of children (dependents): most patients have none, and very few have four or five.

## ADDING ANOTHER COLUMN TO THE DATA SET

In [137]:
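age_dict = {
    0: '0-9',
    1: '10-19',
    2: '20-29',
    3: '30-39',
    4: '40-49',
    5: '50-59',
    6: '60-69',
    7: '70-79',
    8: '80-89',
    9: '90-99',
    10: '100-200'
}

# Assumed helper: the cell calls age_category() but its definition is not shown,
# so this version maps an age to its decade label through age_dict.
def age_category(age):
    return age_dict[min(age // 10, 10)]  # ages of 100 and above share the last bucket

# Apply the helper to every age; this replaces the row-by-row iterrows() loop.
df['age_category'] = df['age'].apply(age_category)
df
# Adding age_category makes it easier to analyse the dataset by age group.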

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,cage,age_category
0,19,female,27.900,0,yes,southwest,16884.92400,10-19,10-19
1,18,male,33.770,1,no,southeast,1725.55230,10-19,10-19
2,28,male,33.000,3,no,southeast,4449.46200,20-29,20-29
3,33,male,22.705,0,no,northwest,21984.47061,30-39,30-39
4,32,male,28.880,0,no,northwest,3866.85520,30-39,30-39
...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,50-59,50-59
1334,18,female,31.920,0,no,northeast,2205.98080,10-19,10-19
1335,18,female,36.850,0,no,southeast,1629.83350,10-19,10-19
1336,21,female,25.800,0,no,southwest,2007.94500,20-29,20-29
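
An alternative way to build the same kind of decade labels is `pd.cut`, which bins a numeric column in one call. The sketch below is an assumption-laden variant rather than the notebook's method: the `ages` values are hypothetical, and unlike the dictionary above it has no '100-200' bucket, so ages of 100 or more would come out as NaN.

```python
import pandas as pd

# Hypothetical ages; in the notebook this would be df["age"].
ages = pd.Series([19, 18, 28, 33, 50, 64])

# Bin edges every 10 years; right=False makes each bin [lower, upper),
# so an age of 20 falls in '20-29' rather than '10-19'.
edges = list(range(0, 101, 10))
labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]

age_labels = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_labels.tolist())
```

The result is a categorical column whose categories keep their natural order, which keeps grouped output sorted from youngest to oldest.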


In [147]:
age_charge=df.groupby(by='age_category').mean(numeric_only=True)
age_charge.charges

age_category
10-19     8407.349242
20-29     9561.751018
30-39    11738.784117
40-49    14399.203564
50-59    16495.232665
60-69    21248.021885
Name: charges, dtype: float64

Clearly, the **60-69** age category has the **highest mean charges**, while the **10-19** age category has the **lowest**. Mean charges increase with every age category in between, which indicates that medical expenses rise with age. Hence age is an important factor in healthcare insurance.
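
The claim that charges rise with age can be checked mechanically rather than by eye. The sketch below rebuilds the means from the output above as a Series (the variable names are illustrative) and tests whether they increase from one category to the next.

```python
import pandas as pd

# Mean charges per age category, copied from the output above.
age_charge = pd.Series({
    "10-19": 8407.349242,
    "20-29": 9561.751018,
    "30-39": 11738.784117,
    "40-49": 14399.203564,
    "50-59": 16495.232665,
    "60-69": 21248.021885,
})

# True only if the values never decrease from one category to the next.
rises_every_decade = bool(age_charge.is_monotonic_increasing)
oldest_vs_youngest = age_charge["60-69"] / age_charge["10-19"]

print(rises_every_decade)
print(f"{oldest_vs_youngest:.2f}")
```

On these figures the check passes, and the oldest group is billed roughly 2.5 times as much as the youngest on average.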

## SUMMARY

1. The Northeast region has the **fewest** patients, whereas the Southeast region has the **most** patients.

2. Patients who smoke: **274**.

3. Patients who do not smoke: **1064**.

4. Female patients who smoke: **115**.

5. Male patients who smoke: **159**.

6. Non-smokers have a mean charge of only about **8400**, while smokers have a much higher mean charge of about **32050**.

7. Male patients have higher mean charges than female patients, though the difference is modest.

8. The 60-69 age category has the highest mean charges, while the 10-19 age category has the lowest. This indicates that **medical expenses rise with age**. Hence age is an important factor in healthcare insurance.