# This is a classification project on the Medical Insurance Claim dataset.

In [21]:
# Loading the claim_data dataframe

import pandas as pd, numpy as np
df = pd.read_csv(r"C:\Users\hemed\Downloads\ALX - DATA SCIENCE\ALX Explore AI\Exam 1\Exam_Part_I_data 5\claims_data.csv")
df.head()

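|   | age | sex | bmi | steps | children | smoker | region | insurance_claim | claim_amount |
|---|-----|-----|-----|-------|----------|--------|--------|-----------------|--------------|
| 0 | 19 | female | 27.9 | 3009 | 0 | yes | southwest | yes | 16884.924 |
| 1 | 18 | male | 33.77 | 3008 | 1 | no | southeast | yes | 1725.5523 |
| 2 | 28 | male | 33.0 | 3009 | 3 | no | southeast | no | 0.0 |
| 3 | 33 | male | 22.705 | 10009 | 0 | no | northwest | no | 0.0 |
| 4 | 32 | male | 28.88 | 8010 | 0 | no | northwest | yes | 3866.8552 |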


#### This dataset contains information about patients with or without an insurance claim.
#### We will train a model to predict whether or not a patient will be eligible for an insurance claim (insurance_claim). This will be done without taking claim_amount into consideration, since the claim status determines whether the person will receive an amount, not the other way round.
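The reason for excluding claim_amount can be checked directly. The sketch below rebuilds only the five rows shown by df.head() (the variable names `sample` and `leaks_target` are illustrative, not part of the notebook) and tests whether a positive amount always coincides with a "yes" claim. If it does, the amount would reveal the target and must not be used as a feature.

```python
import pandas as pd

# The five rows displayed by df.head() above (target and amount columns only)
sample = pd.DataFrame({
    "insurance_claim": ["yes", "yes", "no", "no", "yes"],
    "claim_amount": [16884.924, 1725.5523, 0.0, 0.0, 3866.8552],
})

# A positive amount that always coincides with "yes" would leak the target
has_amount = sample["claim_amount"] > 0
is_claim = sample["insurance_claim"] == "yes"
leaks_target = (has_amount == is_claim).all()

print(leaks_target)
```

On the full dataframe the same comparison could be run on df itself; the five rows here only illustrate the idea.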

## 1. Data Cleaning

In [22]:
# Checking the dataset for possible missing values or inconsistencies before preprocessing

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              1338 non-null   int64  
 1   sex              1338 non-null   object 
 2   bmi              1338 non-null   float64
 3   steps            1338 non-null   int64  
 4   children         1338 non-null   int64  
 5   smoker           1338 non-null   object 
 6   region           1338 non-null   object 
 7   insurance_claim  1338 non-null   object 
 8   claim_amount     1338 non-null   float64
dtypes: float64(2), int64(3), object(4)
memory usage: 94.2+ KB


#### From the above output, we can see that there are no missing values: every column has 1338 non-null entries and an appropriate data type.
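df.info() reports non-null counts and dtypes, but it does not reveal inconsistent category labels. A minimal sketch of two complementary checks follows, using a small hypothetical frame (`toy`, with a deliberately missing age and a badly formatted "Male " label that are invented for illustration and do not come from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing age and one inconsistently written label
toy = pd.DataFrame({
    "age": [19, 18, np.nan],
    "sex": ["female", "male", "Male "],
    "smoker": ["yes", "no", "no"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Distinct labels before and after normalising whitespace and case
raw_sex_labels = sorted(toy["sex"].unique())
clean_sex_labels = sorted(toy["sex"].str.strip().str.lower().unique())

print(missing_per_column)
print(raw_sex_labels, clean_sex_labels)
```

Applied to the real dataframe, the same calls would be df.isna().sum() and df['sex'].unique() (and likewise for smoker and region).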

## 2. Splitting and training data

In [23]:
# splitting data into features and target

x = df.drop(columns = ['insurance_claim', 'claim_amount'])
y = df['insurance_claim']

In [24]:
# Converting the categorical variables (columns with dtype object) into numerical dummy variables

X = pd.get_dummies(x, drop_first = True).astype(int)
y = y.apply(lambda label: 1 if label == 'yes' else 0)
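One side effect of calling .astype(int) on the whole encoded frame is that float columns such as bmi are truncated to whole numbers. The sketch below shows this on three bmi values from the table above, next to one possible alternative that casts only the dummy columns via the dtype argument of pd.get_dummies. The names `head`, `cast_all` and `cast_dummies_only` are illustrative. Switching to the alternative would change the features, so the scores reported in this notebook would not necessarily be reproduced.

```python
import pandas as pd

# Three bmi and smoker values taken from the df.head() output
head = pd.DataFrame({
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["yes", "no", "no"],
})

# As in the cell above: every column is cast, so bmi loses its decimals
cast_all = pd.get_dummies(head, drop_first=True).astype(int)

# Alternative: integer dummies only, numeric columns left untouched
cast_dummies_only = pd.get_dummies(head, drop_first=True, dtype=int)

print(cast_all)
print(cast_dummies_only)
```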

In [25]:
# Splitting into train and test set with 30% for test size and random state of 42

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [26]:
# Training, predicting with and evaluating a logistic regression model using the accuracy score

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Instantiate the model and fit it to the training data
lr = LogisticRegression()
model = lr.fit(X_train, y_train)

# Predict the insurance claim status for X_test
y_pred = lr.predict(X_test)

# Evaluate the model using the accuracy score and a classification report
accuracy = accuracy_score(y_test, y_pred)

print(f'The accuracy of this model is {accuracy}')
print(classification_report(y_test, y_pred))

The accuracy of this model is 0.8606965174129353
              precision    recall  f1-score   support

           0       0.86      0.78      0.82       161
           1       0.86      0.91      0.89       241

    accuracy                           0.86       402
   macro avg       0.86      0.85      0.85       402
weighted avg       0.86      0.86      0.86       402



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


#### An accuracy of 86% means that the model predicted the insurance claim status correctly for 86% of the cases in the test set.
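The solver message in the output above says the iteration limit was reached and suggests scaling the data or increasing max_iter. A minimal sketch of that suggestion is given below, on synthetic stand-in data with one small-scale and one large-scale feature (roughly like bmi and steps). The data, the variable names and the value max_iter=1000 are assumptions for illustration, not the notebook's settings.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features on very different scales (hypothetical stand-ins)
X_demo = np.column_stack([
    rng.normal(30, 6, 500),       # small-scale feature
    rng.normal(6000, 2500, 500),  # large-scale feature
])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 + rng.normal(0, 2, 500) > 36).astype(int)

# Standardise the features, then fit logistic regression with a higher iteration cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # Turn a convergence warning into an error so a stalled solver is not missed
    warnings.simplefilter("error", ConvergenceWarning)
    scaled_lr.fit(X_demo, y_demo)

demo_accuracy = scaled_lr.score(X_demo, y_demo)
print(demo_accuracy)
```

In the notebook the same pipeline could be fitted on X_train and scored on X_test. Putting the scaler inside the pipeline means it is fitted on the training data only.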

## 3. Feature selection

#### Using the statsmodels library with default parameters, we will select the significant features.

In [27]:
import statsmodels.api as sm

# Add a constant to the training data
X_train_const = sm.add_constant(X_train)

# Fit the logistic regression model
model = sm.Logit(y_train, X_train_const)
result = model.fit()

# Output the summary of the model
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.367735
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:        insurance_claim   No. Observations:                  936
Model:                          Logit   Df Residuals:                      926
Method:                           MLE   Df Model:                            9
Date:                Fri, 09 Aug 2024   Pseudo R-squ.:                  0.4597
Time:                        14:53:07   Log-Likelihood:                -344.20
converged:                       True   LL-Null:                       -637.04
Covariance Type:            nonrobust   LLR p-value:                2.489e-120
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -9.3946      1.344     -6.991      0.000     -12.029      -6.761
age        

#### The statsmodels summary shows that age, bmi, children, smoker and region affect the likelihood of an insurance claim, whereas the number of steps and sex are not significant. The regional effects are present but not strongly significant.

#### Training a new logistic model with the most significant features

In [28]:
x_new = df[['age', 'bmi', 'children', 'smoker', 'region']]

X_new = pd.get_dummies(x_new, drop_first = True).astype(int)
X_new_train, X_new_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.3, random_state = 42)

lm = LogisticRegression()
lm.fit(X_new_train, y_train)

y_new_pred = lm.predict(X_new_test)

acc = accuracy_score(y_test, y_new_pred)
acc

0.8805970149253731

#### An accuracy of 88% means that the reduced model predicted the insurance claim status correctly for 88% of the test-set cases, which is higher than the 86% accuracy of the full model.
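On a test set of 402 rows, the gap between 88% and 86% amounts to 8 cases (354 versus 346 correct predictions), so a single split may not be enough to rank the two models. One hedged way to compare them more robustly is k-fold cross-validation, sketched below on synthetic data from make_classification. The dataset, the choice of the first four columns as the "reduced" feature set and cv=5 are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the claims data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=42
)

# Five-fold accuracy for the full feature set
full_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

# Five-fold accuracy for a hypothetical reduced feature set (first four columns)
reduced_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo[:, :4], y_demo, cv=5, scoring="accuracy"
)

print(f"full:    {full_scores.mean():.3f} +/- {full_scores.std():.3f}")
print(f"reduced: {reduced_scores.mean():.3f} +/- {reduced_scores.std():.3f}")
```

If the fold-to-fold spread is larger than the difference between the two means, the models are hard to tell apart on accuracy alone.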

### Fit Support Vector Machine models to the training data, using the radial, sigmoid and linear kernels in turn with default parameters, to find which kernel is the best predictor.

In [29]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1: Initialize the SVM models with different kernels
svm_radial = SVC(kernel='rbf', random_state=101)
svm_sigmoid = SVC(kernel='sigmoid', random_state=101)
svm_linear = SVC(kernel='linear', random_state=101)

# Step 2: Fit the models on the training data
svm_radial.fit(X_new_train, y_train)
svm_sigmoid.fit(X_new_train, y_train)
svm_linear.fit(X_new_train, y_train)

# Step 3: Make predictions on the test data
y_pred_radial = svm_radial.predict(X_new_test)
y_pred_sigmoid = svm_sigmoid.predict(X_new_test)
y_pred_linear = svm_linear.predict(X_new_test)

# Step 4: Calculate the accuracy for each model
accuracy_radial = accuracy_score(y_test, y_pred_radial)
accuracy_sigmoid = accuracy_score(y_test, y_pred_sigmoid)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Output the accuracies
print(f'Accuracy for radial kernel model = {accuracy_radial}')
print(f'Accuracy for sigmoid kernel model = {accuracy_sigmoid}')
print(f'Accuracy for linear kernel model = {accuracy_linear}')

Accuracy for radial kernel model = 0.7437810945273632
Accuracy for sigmoid kernel model = 0.4253731343283582
Accuracy for linear kernel model = 0.8905472636815921


#### The Support Vector Machine model with the linear kernel (89% accuracy) outperformed the radial (74%) and sigmoid (43%) kernels, making it the best predictor of the three.
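The radial and sigmoid kernels are sensitive to feature scale, so part of the gap between the kernels may come from fitting them on unscaled features. A minimal sketch of the same three-kernel comparison with a StandardScaler step in front is shown below. It runs on synthetic data with hypothetical age-like, bmi-like and binary columns, so its scores say nothing about the claims dataset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(101)

# Hypothetical features: age-like, bmi-like and a binary flag
X_demo = np.column_stack([
    rng.normal(40, 14, 600),
    rng.normal(30, 6, 600),
    rng.integers(0, 2, 600),
])
score = (X_demo[:, 0] - 40) / 14 + (X_demo[:, 1] - 30) / 6 + 2 * X_demo[:, 2]
y_demo = (score + rng.normal(0, 0.5, 600) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit each kernel behind a scaler and record its test accuracy
scaled_scores = {}
for kernel in ("rbf", "sigmoid", "linear"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=101))
    clf.fit(X_tr, y_tr)
    scaled_scores[kernel] = clf.score(X_te, y_te)

print(scaled_scores)
```

Running the same loop on X_new_train and X_new_test would show whether the ranking of the kernels changes once the features are standardised.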