#### Performing Regression on Health Insurance Data
This notebook implements regression on health insurance data and tries to improve model accuracy through feature transformation, feature engineering, clustering, and boosting.

Our main objective is to predict insurance charges. We will use regression because our target variable is numeric.

In [1]:
# Importing libraries
import pandas as pd

In [91]:
# Loading the dataset
df_insure = pd.read_csv("C:/Users/Nkululeko Cyril Cele/source/Machine-Learning/insurance.csv")

In [93]:
# Checking the shape of the df
print(f"Rows, columns: {df_insure.shape}")

# Showing the first five rows
print(df_insure.head())

Rows, columns: (1338, 7)
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


The dataset has 1338 rows, 6 features, and 1 target variable (charges). Three of the features (sex, smoker, region) are categorical and the other three (age, bmi, children) are numerical.
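The split between categorical and numerical features can also be checked programmatically. The sketch below is a minimal illustration: it rebuilds the five rows printed above as a small frame (the name `sample` is hypothetical and not part of the notebook) and uses `select_dtypes` to separate the two groups.

```python
import pandas as pd

# The five rows shown by df_insure.head(), rebuilt by hand for illustration
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Numerical columns, with the target removed so only features remain
numerical = sample.select_dtypes(include="number").columns.drop("charges").tolist()
# Every column that is not numeric is treated as categorical
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical features:", numerical)
print("Categorical features:", categorical)
```

On the full dataset the same two calls could be applied to `df_insure` directly.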
#### Handling Duplicated Values

In [94]:
df_insure.duplicated().sum()

1

There is one row that is duplicated. We will delete it and check the shape again.

In [95]:
df_insure = df_insure.drop_duplicates()
print(f"Rows, columns: {df_insure.shape}")

Rows, columns: (1337, 7)
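Before dropping duplicates it can help to look at the repeated rows themselves. The following sketch uses a small made-up frame (`toy` and all of its values are assumptions for illustration, not rows from the insurance data) to show that `duplicated(keep=False)` marks every copy of a repeated row, while `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Made-up data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "age": [25, 31, 25, 47],
    "bmi": [22.0, 27.5, 22.0, 31.0],
    "charges": [2000.0, 3500.0, 2000.0, 9000.0],
})

# keep=False flags all copies, so both the original and the repeat are shown
copies = toy[toy.duplicated(keep=False)]
print(copies)

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates()
print(f"Rows, columns: {deduped.shape}")
```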


#### Handling Missing Values

In [96]:
df_insure.isnull().values.sum()

0

In [97]:
df_insure.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

There are no missing values.

#### Detecting the Outliers

In [98]:
df_insure.describe()

               age          bmi     children       charges
count  1337.000000  1337.000000  1337.000000   1337.000000
mean     39.222139    30.663452     1.095737  13279.121487
std      14.044333     6.100468     1.205571  12110.359656
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.290000     0.000000   4746.344000
50%      39.000000    30.400000     1.000000   9386.161300
75%      51.000000    34.700000     2.000000  16657.717450
max      64.000000    53.130000     5.000000  63770.428010


In [99]:
class OutlierBoundary:
    """Prints the 1.5 * IQR outlier boundaries of the numerical columns."""

    def __init__(self, dataset):
        self.dataset = dataset

    def outlier(self):
        column_list = ["age", "bmi", "children", "charges"]
        for column in column_list:
            # Read the quartiles directly instead of indexing describe() by position
            Q1 = self.dataset[column].quantile(0.25)
            Q3 = self.dataset[column].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            print(f"The upper boundary for the {column} column is {upper} and the lower boundary is {lower}.")

In [100]:
p = OutlierBoundary(df_insure)

In [101]:
# outlier() prints its results and returns nothing, so call it without print()
p.outlier()

The upper boundary for the age column is 87.0 and the lower boundary is -9.0.
The upper boundary for the bmi column is 47.31500000000001 and the lower boundary is 13.674999999999994.
The upper boundary for the children column is 5.0 and the lower boundary is -3.0.
The upper boundary for the charges column is 34524.777625 and the lower boundary is -13120.716174999998.
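These boundaries follow the 1.5 * IQR rule: lower = Q1 - 1.5 * IQR and upper = Q3 + 1.5 * IQR, where IQR = Q3 - Q1. As a quick check, the sketch below recomputes them from the 25% and 75% values in the `describe()` output. The helper name `iqr_bounds` is hypothetical and only used here for illustration.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier boundaries for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% values copied from the describe() output
quartiles = {
    "age": (27.0, 51.0),
    "bmi": (26.29, 34.7),
    "children": (0.0, 2.0),
    "charges": (4746.344, 16657.71745),
}

bounds = {col: iqr_bounds(q1, q3) for col, (q1, q3) in quartiles.items()}
for col, (lower, upper) in bounds.items():
    print(f"{col}: lower = {lower:.3f}, upper = {upper:.3f}")
```

For bmi, for example, IQR = 34.7 - 26.29 = 8.41, so the boundaries are 26.29 - 12.615 = 13.675 and 34.7 + 12.615 = 47.315, matching the printed values up to floating-point rounding.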


In [103]:
# Count the values outside the boundaries printed above, column by column.
# A row that is an outlier in more than one column is counted once per column.
n_outliers = (
    len(df_insure[df_insure.age > 87.0]) +
    len(df_insure[df_insure.age < -9.0]) +
    len(df_insure[df_insure.bmi > 47.315]) +
    len(df_insure[df_insure.bmi < 13.675]) +
    len(df_insure[df_insure.children > 5.0]) +
    len(df_insure[df_insure.children < -3.0]) +
    len(df_insure[df_insure.charges > 34524.78]) +
    len(df_insure[df_insure.charges < -13120.72])
)
print("The total number of outliers that can be removed is:", n_outliers)
print("This is about", (n_outliers / len(df_insure)) * 100, "% of the dataset.")

The total number of outliers that can be removed is: 148
This is about 11.06955871353777 % of the dataset.


About 11.07 percent of the dataset is flagged as outliers. We will perform the analyses on the data with and without the outliers and compare the results before drawing conclusions.
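Summing per-column counts, as in the cell above, counts a row once for every column in which it is out of range. If the goal is to know how many rows would actually be dropped, a row-level mask avoids that. The sketch below is one possible way to do it, shown on a small made-up frame; `flag_outlier_rows`, `toy`, and the values are assumptions for illustration and not the notebook's own method.

```python
import pandas as pd

def flag_outlier_rows(frame, columns, k=1.5):
    """Return a boolean Series that is True for rows with at least one
    value outside the k * IQR boundaries of the given columns."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own boundaries
    below = frame[columns].lt(lower, axis="columns")
    above = frame[columns].gt(upper, axis="columns")
    return (below | above).any(axis=1)

# Made-up data: the last row is extreme in both columns
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 30.0, 60.0],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0, 50000.0],
})

mask = flag_outlier_rows(toy, ["bmi", "charges"])
print("Rows flagged:", int(mask.sum()))
print(f"Share of dataset: {mask.mean() * 100:.2f}%")
```

Here the last row is out of range for both `bmi` and `charges`, so a per-column sum would count it twice while the mask counts it once. A frame without the flagged rows could then be obtained with `toy[~mask]`.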
#### Data Visualisation

In [17]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline