## Notebook Overview

For Use Case 4:
How AI can help in analysing claims and utilization data to identify anomalies, trends, and fraud, waste and abuse (FW&A). The notebook uses publicly available data and works through the following steps:

* Data pre-processing and preparation
* Exploratory Data analysis
* Preparing data for Data modeling
* Data Modelling
* Model Evaluation
* Predictions


In [None]:
## Import relevant libraries for data processing & visualisation 

import numpy as np              # linear algebra
import pandas as pd             # data processing, dataset file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization & graphical plotting
import seaborn as sns           # to visualize random distributions
%matplotlib inline

## Add additional libraries to prepare and run the model
import sklearn
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split 
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import IsolationForest
import xgboost as xgb
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import ExtraTreesRegressor

import warnings                 # to deal with warning messages
warnings.filterwarnings('ignore')

In [None]:
## Import the dataset to read and analyse
df_ins = pd.read_csv("insurance_data.csv")

## Data Processing & Data Preparation for EDA 

In [None]:
# checking the dataset contents with the head() function
df_ins.head()

#### Checking the null values, and filling them appropriately

In [None]:
## Checking the null values with isna() function
df_ins.isna().sum()

The age feature has 5 records with null values, and the region feature has 3 records with null values.

In [None]:
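## filling the null values (work on a copy so df_ins stays untouched)
df = df_ins.copy()
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].interpolate()   ## numerical features: linear interpolation
df = df.fillna(df.mode().iloc[0])           ## categorical features: most frequent value
df.isna().sum()                             ## check for any null values, after modifying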
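
The two filling strategies used above can be seen on a small example. This is a minimal sketch on a made-up frame (the values and the name `toy` are hypothetical, not taken from `insurance_data.csv`): numeric gaps are filled by linear interpolation, and the remaining categorical gaps take the column mode.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical gap
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 50.0],
    "region": ["southeast", "southeast", None, "northwest"],
})

filled = toy.copy()
num_cols = filled.select_dtypes(include="number").columns
# Linear interpolation: the missing age becomes the midpoint of its neighbours
filled[num_cols] = filled[num_cols].interpolate()
# Mode imputation: the remaining (categorical) gap takes the most frequent label
filled = filled.fillna(filled.mode().iloc[0])

print(filled)
```

Note that interpolation depends on row order, so it is only meaningful when neighbouring rows are related; for an unordered claims table a median fill may be an equally reasonable choice.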

In [None]:
## Taking a deeper look into the data, with descriptive statistics for each feature
df.describe(include='all').round(0)

In [None]:
## Checking the shape of the dataset
print("The number of rows and number of columns are ", df.shape)

In [None]:
## Checking the labels in categorical features
for col in df.columns:
    if df[col].dtype=='object':
        print()
        print(col)
        print(df[col].unique())

In [None]:
## Relabeling the categories in 'diabetic', 'smoker' variables appropriately with .replace() function
## This makes the contents of charts & plots easier to understand
df['diabetic'] = df['diabetic'].replace({'Yes': 'diabetic', 'No': 'non-diabetic'})
df['smoker'] = df['smoker'].replace({'Yes': 'smoker', 'No': 'non-smoker'})

In [None]:
# Before proceeding to EDA, see the information about the DataFrame with .info() function
df.info()

## Exploratory Data Analysis (EDA)

#### Countplot By Region, Gender

In [None]:
## First we will use pd.crosstab() to check the data in tabular format
pd.crosstab(df['region'], df['gender'], margins = True, margins_name = "Total").sort_values(by="Total", ascending=True)

- Since we have only 4 categories, we can quickly make out some information from the table
- However, when the number of categories is high, it is difficult to gain insights from a table
- That's where visualising is the better option

In [None]:
## Now we use countplot() to visualise the data
sns.countplot(x='region', hue='gender', palette="Set2", data=df).set(title='Number of Insurance Claimants by Gender, by Region')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)

- The above plot reveals that southeast has the most claimants overall
- southeast and southwest have more female claimants; northwest and northeast have more male claimants

#### Boxplot by Gender vs Age

In [None]:
## Boxen plot (letter-value plot) of age by gender of insurance claimants
sns.boxenplot(x='gender', y='age', palette="Set2", data=df).set(title='Age of Insurance Claimants by Gender')

- The plot shows that female insurance claimants tend to be older, with a higher median age than males

#### Boxplot By Region, Claim Value, Gender

In [None]:
sns.boxplot(x="region", y="claim",hue="gender", palette="Set2",data=df).set(title='Claim Value by Region, by Gender')
sns.despine(offset=10, trim=True)
plt.legend(bbox_to_anchor=(1.02, 1), loc='best', borderaxespad=0)

- The plot reveals that the median claim value lies in the range of around 10,000-15,000 for all regions and both genders
- Claim value outliers are common in all regions, for both genders

#### Histograms for numerical variables

In [None]:
## Generating histograms for numerical variables –– age, bmi, bloodpressure, claim
fig, axes = plt.subplots(1, 4, figsize=(14,3))
age = df.age.hist(ax=axes[0], color="#32B5C9", ec="white", grid=False).set_title('age')
bmi = df.bmi.hist(ax=axes[1], color="#32B5C9", ec="white", grid=False).set_title('bmi')
bloodpressure = df.bloodpressure.hist(ax=axes[2], color="#32B5C9", ec="white", grid=False).set_title('bloodpressure')
claim = df.claim.hist(ax=axes[3], color="#32B5C9", ec="white", grid=False).set_title('claim')

The histograms show that
   - age is roughly uniformly distributed
   - bmi is approximately normally distributed
   - bloodpressure and claim are positively skewed
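
The visual impression of skewness can be backed by a number. The sketch below uses synthetic columns (hypothetical stand-ins, not the notebook's data) to show how `DataFrame.skew()` separates a bell-shaped variable from one with a long right tail; on the real data the same call could be applied to `df[['age', 'bmi', 'bloodpressure', 'claim']]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # bell-shaped column, similar in spirit to bmi
    "symmetric": rng.normal(loc=30, scale=5, size=2000),
    # long right tail, similar in spirit to claim
    "right_skewed": rng.lognormal(mean=9, sigma=0.8, size=2000),
})

# Sample skewness per column: near 0 for symmetric data, positive for a right tail
skew = toy.skew()
print(skew.round(2))
```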

#### Scatterplots

In [None]:
## Scatterplots help in understanding the impact of habits & health conditions on insurance claim value
## Let us analyse the impact of smoking habit and age on claim value
sns.scatterplot(x='age', y='claim', hue='smoker', palette="Set2", data=df).set(title='Impact of Age & Smoking Habit on Claim Value')
plt.legend(bbox_to_anchor=(1.02, 1), loc='best', borderaxespad=0)
plt.show()

 - The plot reveals that claim value is typically higher for people who smoke

In [None]:
## Impact of diabetes disease and age on claim value
sns.scatterplot(x='age', y='claim', hue='diabetic', palette="Set2", data=df).set(title='Impact of Age & Diabetes Disease on Claim Value')
plt.legend(bbox_to_anchor=(1.02, 1), loc='best', borderaxespad=0)
plt.show()

 - The plot reveals that there is no significant correlation between claim value and prevalence of diabetes 

In [None]:
## Impact of no. of children and age on claim value 
sns.scatterplot(x='age', y='claim', hue='children', palette="Set2", data=df).set(title='Impact of Age & Children on Claim Value')
plt.legend(bbox_to_anchor=(1.02, 1), loc='best', borderaxespad=0)
plt.show()

 - The plot reveals that there is no significant correlation between claim value and number of children the claim holder has

In [None]:
## Impact of bmi on claim value, by gender
sns.scatterplot(x='bmi', y='claim', hue='gender', palette="Set2", data=df).set(title='Impact of BMI & Gender on Claim Value')
plt.legend(bbox_to_anchor=(1.02, 1), loc='best', borderaxespad=0)
plt.show()

 - The plot reveals that there is certain degree of correlation between claim value and bmi, in both male & female groups

In [None]:
## Impact of bloodpressure on claim value, by gender
sns.scatterplot(x='bloodpressure', y='claim', hue='gender', palette="Set2", data=df).set(title='Impact of BP & Gender on Claim Value')
plt.legend(bbox_to_anchor=(1.02, 1), loc='best', borderaxespad=0)
plt.show()

 - The plot reveals that there is some correlation between claim value and bloodpressure, in both male & female groups
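
Readings such as "some correlation" or "no significant correlation" can be checked numerically with a Pearson coefficient. This is a minimal sketch on synthetic data in which, by construction, the hypothetical `claim` column depends on `bmi` but not on `children`; the coefficients and names are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
bmi = rng.normal(30, 6, n)
children = rng.integers(0, 5, n)
# Synthetic claim depends on bmi only; children is unrelated by construction
claim = 400 * bmi + rng.normal(0, 3000, n)
toy = pd.DataFrame({"bmi": bmi, "children": children, "claim": claim})

# Pearson correlation of every feature with the claim column
corr_with_claim = toy.corr(method="pearson")["claim"].drop("claim")
print(corr_with_claim.round(2))
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship.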

#### Pie Charts

In [None]:
## Pie charts help in determining the % share of each category in a feature variable
## First we will define colors for Pie chart (about 6 colors are sufficient here)
colors = ({'custom': 'turquoise', 'silver': 'silver', 'grey': 'grey', 'blue': 'blue', 'lightskyblue': 'lightskyblue', 'white': 'antiquewhite'})

In [None]:
## Total claims by region
regions = df[['region', 'claim']].groupby('region').sum().sort_values(by="claim", ascending=True)
regions

In [None]:
regions.plot(kind='pie', subplots=True, figsize=(10,6), fontsize = 14, colors = colors.values(), title='Total claims by region in % value', autopct='%1.0f%%')

In [None]:
## Total claims by gender
gender = df[['gender', 'claim']].groupby('gender').sum().sort_values(by="claim", ascending=True)
gender

In [None]:
gender.plot(kind='pie', subplots=True, figsize=(10,6), fontsize = 14, colors = colors.values(), title='Total claims by gender in % value (male & female)', autopct='%1.0f%%')

In [None]:
## Total claims by smoking habit
smokers = df[['smoker', 'claim']].groupby('smoker').sum().sort_values(by="claim", ascending=True)
smokers

In [None]:
smokers.plot(kind='pie', subplots=True, figsize=(10,6), fontsize = 14, colors = colors.values(), title='Total claims by smokers & non-smokers in % value', autopct='%1.0f%%')

In [None]:
## Total claims by diabetes prevalence
diabetic = df[['diabetic', 'claim']].groupby('diabetic').sum().sort_values(by="claim", ascending=True)
diabetic

In [None]:
diabetic.plot(kind='pie', subplots=True, figsize=(10,6),fontsize = 14, colors = colors.values(), title='Total claims by diabetics & non-diabetics in % value', autopct='%1.0f%%')

In [None]:
## Total claims by number of children
children = df[['children', 'claim']].groupby('children').sum().sort_values(by="claim", ascending=True)
children

In [None]:
children.plot(kind='pie', subplots=True, figsize=(10,6), fontsize = 14, colors = colors.values(), title='Total claims by No. of children in % value', autopct='%1.0f%%')

#### Building a Pie Chart with Age Groups

 - We have the age of individuals in our dataset, but we do not have an age group.
 - We create one by binning with the pd.cut() function.
 - Before proceeding, we build a distribution plot to see the age distribution in the dataset.
 - Then we check the minimum, maximum and average ages of the individuals, for a better understanding of the ages.
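
One detail of `pd.cut()` is worth seeing on a small example before binning: by default each interval is open on the left, so a value equal to the first bin edge is left unbinned. The sketch below uses a handful of made-up ages with bin edges starting at 18 and compares the default call with `include_lowest=True`.

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 45, 60])
bins = [18, 25, 32, 39, 46, 53, 60]

# Default: intervals are (left, right], so age 18 is not in (18, 25]
default_cut = pd.cut(ages, bins=bins)
# include_lowest=True extends the first interval to cover its left edge
inclusive_cut = pd.cut(ages, bins=bins, include_lowest=True)

print("unbinned with default:", default_cut.isna().sum())
print("unbinned with include_lowest:", inclusive_cut.isna().sum())
```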

In [None]:
## age distribution plot
sns.displot(df.age, color="r", kde=True).set(title='Age Distribution Chart')

In [None]:
## min, max, mean ages
df['age'].agg(['min', 'max', 'mean']).round(0)

In [None]:
## Build a new ageGroup feature, with 6 age bands, of 7 years each
## include_lowest=True keeps age 18 in the first band (pd.cut intervals are otherwise open on the left)
age_band = [18,25,32,39,46,53,60]
df['age_group'] = pd.cut(df['age'], bins=age_band, include_lowest=True)
ageGroup = df[['age_group', 'claim']].groupby('age_group').sum().sort_values(by="claim", ascending=False)
ageGroup

In [None]:
## Total claims by age group 
ageGroup.plot(kind='pie', subplots=True, figsize=(10,6), fontsize = 14, colors = colors.values(), title='Total claims by age group in % value', autopct='%1.0f%%')

## Preparing the Data for Data Modeling

In [None]:
## Next come data modeling, model evaluation, and some predictions.
## First we prepare the data and do some feature engineering as needed
## (the modeling libraries were already imported in the first cell)

## splitting Categorical and Numerical data
cat_df = df[['gender', 'diabetic', 'children', 'smoker', 'region']]
num_df = df[['age', 'bmi', 'bloodpressure', 'claim']]

In [None]:
## label encoding
le = LabelEncoder()

# select categorical columns (copy to avoid modifying a view of df)
cat_df = df.select_dtypes(exclude="number").copy()

for i in cat_df:
    # cast to str so the interval-valued age_group (and any NaN) can be encoded
    cat_df[i] = le.fit_transform(cat_df[i].astype(str))

# joining the encoded data to the numeric data
num_df = df.select_dtypes(include="number")
main_df = pd.concat([num_df, cat_df], axis=1)
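
To read encoded values back later, it helps to know how `LabelEncoder` assigns its integers. A minimal sketch on a hypothetical list of region labels: the encoder sorts the distinct labels and uses each label's position as its code.

```python
from sklearn.preprocessing import LabelEncoder

regions = ["southeast", "northwest", "southwest", "northeast", "southeast"]
le = LabelEncoder()
codes = le.fit_transform(regions)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(le.inverse_transform(codes))
```

Because the loop above refits a single encoder for every column, only the last column's mapping survives in `le.classes_`; keeping one encoder per column would be needed if the codes have to be decoded afterwards. The integer codes also imply an ordering that unordered categories such as region do not have.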

In [None]:
## EDA-Univariate analysis to check "claim" feature, before proceeding with machine learning
## (histplot replaces the deprecated sns.distplot)
sns.histplot(main_df.claim, color="r", kde=True).set(title='Univariate Analysis : Claim Feature')

 - The plot reveals that outliers exist on both the higher and the lower side; we remove them in the next step

In [None]:
## Removing the outliers from claim feature (1.5 * IQR rule)
## `method=` replaces the deprecated `interpolation=` keyword (NumPy >= 1.22)
Q1 = np.percentile(main_df['claim'], 25, method='midpoint')
Q3 = np.percentile(main_df['claim'], 75, method='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", main_df.shape)

# Keep rows strictly inside the lower and upper bounds; a boolean mask
# avoids mixing positional indices from np.where with label-based drop()
inside = main_df['claim'].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR, inclusive='neither')
main_df = main_df[inside].copy()

print("New Shape: ", main_df.shape)
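
The IQR rule applied above can be traced on a few numbers. This is a minimal sketch with made-up claim values and a hypothetical helper `iqr_bounds`; it uses NumPy's default linear percentile interpolation rather than the midpoint variant, so the fences can differ slightly from the cell above.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Seven ordinary claims and one extreme value (hypothetical figures)
claims = pd.Series([1000, 1200, 1300, 1500, 1600, 1800, 2000, 50000])
lower, upper = iqr_bounds(claims)

# Keep only the rows strictly inside the fences
kept = claims[(claims > lower) & (claims < upper)]
print(lower, upper, len(kept))
```

For an anomaly-detection use case, the rows outside the fences may be the ones of most interest, so it can be worth saving them for inspection before dropping them from the regression data.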

In [None]:
## Re-running the Univariate analysis on revised "claim" feature
## (histplot replaces the deprecated sns.distplot)
sns.histplot(main_df.claim, color="r", kde=True).set(title='Univariate Analysis : Revised Claim Feature')

In [None]:
## EDA-Bivariate Analysis (Insurance Claim vs Age of Claimant)
sns.jointplot(data=main_df, x="age", y="claim", hue="gender", palette="Set2")

In [None]:
## Correlation map
corr = main_df.corr(method='pearson').round(3)
plt.figure(figsize=(11,5))
sns.heatmap(corr, annot=True, cmap="YlOrRd_r")

## Data Modeling

In [None]:
# Separating the independent variables (features) in X and the dependent variable (target) in y
X = main_df.drop(columns=["claim"])
y = main_df["claim"]

## standardize the feature values so that no feature dominates because of its scale
scaler = StandardScaler()
x_scaled=scaler.fit_transform(X)
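
The cell above fits the scaler on all rows before the train/test split. An alternative ordering, shown here only as a sketch on synthetic data (the arrays `X_toy` and `y_toy` are hypothetical), is to split first and fit the scaler on the training rows alone, so that no statistics from the test rows reach the model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))   # hypothetical feature matrix
y_toy = rng.normal(size=200)                          # hypothetical target

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_tr)     # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)    # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

With a dataset of this size the difference in scores is usually small, but the split-then-scale order gives a cleaner estimate of out-of-sample performance.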

In [None]:
## split the data

X_train, X_test, y_train, y_test = train_test_split(x_scaled ,y, test_size=0.2, random_state=0)

## create function to fit models

model_preds = []

def fit_model(model, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = round(r2_score(y_test, y_pred),4)
    # adjusted R2 uses the size of the test set, since r2 is computed on the test set
    adj_r2 = round(1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1),4)
    mse = round(mean_squared_error(y_test, y_pred),4)
    mae = round(mean_absolute_error(y_test, y_pred),4)
    rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)),4)
    model_preds.append([model_name, r2, adj_r2, mse, mae, rmse])
    print ("The R-Squared Value is: ", r2)
    print ("Adjusted R-Squared Value is: ", adj_r2)
    print("The Mean Squared error (MSE) is: ", mse)
    print("Root Mean Squared Error (RMSE): ", rmse)
    print("Mean Absolute Error (MAE) is: ", mae)

## model evaluation function
def model_eval():
    preds = pd.DataFrame(model_preds)
    # column order must match the order used in model_preds.append() above
    preds.columns = ["Mod_Name", "R2 Value", "adj_R2", "MSE", "MAE", "RMSE"]
    return preds.sort_values(by="R2 Value", ascending=False)
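
The adjusted R-squared reported by `fit_model` penalises R-squared for the number of features: adj_R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples the score was computed on and p the number of features. A minimal sketch of that formula, with hypothetical values for n and p:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical example: R2 of 0.80 on 268 test rows with 10 features
print(round(adjusted_r2(0.80, n_samples=268, n_features=10), 4))   # 0.7922
```

The penalty grows as p approaches n, which is why the sample count should be the size of the set on which R-squared was measured.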

In [None]:
## Linear Regression

lr_model = LinearRegression()
fit_model(lr_model, "Linear Regression")   # fit_model() fits the model, so no separate fit() call is needed

In [None]:
## Decision Trees

dectree_model = DecisionTreeRegressor()
fit_model(dectree_model, "Decision Tree Regressor")

In [None]:
## Isolation Forest for Anomaly detection on training data
iso_clf = IsolationForest(max_samples=100, random_state=0)
iso_clf.fit(X_train)
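
Fitting the forest does not by itself flag anything; the anomaly labels and scores come from `predict()` and `score_samples()`. The sketch below uses synthetic two-dimensional data with one planted unusual point (all values hypothetical) to show how the outputs are read.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0]])                 # an obviously unusual point
X_demo = np.vstack([inliers, planted])

forest = IsolationForest(max_samples=100, random_state=0).fit(X_demo)
labels = forest.predict(X_demo)          # +1 = inlier, -1 = anomaly
scores = forest.score_samples(X_demo)    # lower score = more anomalous

print("planted point label:", labels[-1])
print("planted point score:", round(float(scores[-1]), 3))
print("median score:", round(float(np.median(scores)), 3))
```

On the notebook's data the same two calls on `iso_clf` with `X_test` would give a label and a score per claim, which could then be joined back to the claim records for review.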

In [None]:
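## Plot the IsolationForest decision boundary
## DecisionBoundaryDisplay needs exactly two features, so a separate forest is
## fitted on the first two scaled columns, purely for visualisation
from sklearn.inspection import DecisionBoundaryDisplay

X_2d = X_train[:, :2]
iso_2d = IsolationForest(max_samples=100, random_state=0).fit(X_2d)
labels_2d = iso_2d.predict(X_2d)          # +1 = inlier, -1 = outlier

disp = DecisionBoundaryDisplay.from_estimator(
    iso_2d,
    X_2d,
    response_method="predict",
    alpha=0.5,
)
scatter = disp.ax_.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_2d, s=20, edgecolor="k")
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
handles, _ = scatter.legend_elements()    # one handle per predicted class, in sorted order (-1, +1)
plt.legend(handles=handles, labels=["outliers", "inliers"], title="predicted class")
plt.show()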

In [None]:
## Random Forest

randfor_model = RandomForestRegressor()
fit_model(randfor_model, "Random Forest Regressor")

In [None]:
## XG Boost

XGB_model = xgb.XGBRFRegressor()
fit_model(XGB_model, "XG Boost")

In [None]:
## KNN

knn_model = KNeighborsRegressor(n_neighbors=6)
fit_model(knn_model, "K-Neighbors Regressor")

## Model Evaluation

In [None]:
model_eval()

## Predictions

In [None]:
# Training the Model

modelETR = ExtraTreesRegressor()
modelETR.fit(X_train, y_train)

# Predict on the test data with the model trained above

y_pred = modelETR.predict(X_test)

In [None]:
out=pd.DataFrame({'Price_actual':y_test,'Price_pred':y_pred})
result=main_df.merge(out,left_index=True,right_index=True)

In [None]:
result[['PatientID', 'age','gender','Price_actual','Price_pred']].sample(25)