# Project Overview

The project objective is to develop a predictive model that accurately classifies the risk level of life insurance applicants based on a set of attributes provided in the dataset. The "Response" variable, which represents the risk level, has 8 ordinal levels. By leveraging machine learning techniques, such as predictive modeling, the goal is to expedite and streamline the insurance application process, thereby increasing the efficiency and accessibility of life insurance services while maintaining privacy boundaries. Ultimately, the successful implementation of the predictive model will not only enhance the customer experience but also improve the public perception of the insurance industry.

## Data Preprocessing

Data preprocessing steps include concatenation of data, scaling, encoding, and imputation, critical to handling the complexity of the dataset.



In [175]:
# Importing necessary libraries and modules for data preprocessing and modeling

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

In [176]:
# access the datasets and loading them into pandas DataFrames

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [178]:
combined_data = pd.concat([train_data, test_data], axis=0)

In [179]:
X = combined_data.drop(columns=['Id', 'Response'])
y = combined_data['Response']

In [180]:


# Assume X contains both continuous and categorical columns
# Define your continuous columns
continuous_cols = ['Product_Info_4', 'Ins_Age', 'Ht', 'Wt', 'BMI', 'Employment_Info_1', 
                    'Employment_Info_4', 'Employment_Info_6', 'Insurance_History_5', 
                    'Family_Hist_2', 'Family_Hist_3', 'Family_Hist_4', 'Family_Hist_5']

# Define your categorical columns
categorical_cols = ['Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 
                    'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 
                    'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 
                    'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 
                    'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 
                    'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 
                    'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 
                    'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 
                    'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 
                    'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 
                    'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 
                    'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 
                    'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 
                    'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 
                    'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 
                    'Medical_History_40', 'Medical_History_41']

# Assume X contains both continuous and categorical columns
# Scale the continuous columns
scaler = StandardScaler()
X[continuous_cols] = scaler.fit_transform(X[continuous_cols])

# One-hot encode the categorical columns
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_feature_names)

# Reset the index of X_encoded_df
X_encoded_df.reset_index(drop=True, inplace=True)

# Drop the categorical columns from X
X.drop(columns=categorical_cols, inplace=True)

# Reset the index of X
X.reset_index(drop=True, inplace=True)

# Concatenate X with X_encoded_df
X = pd.concat([X, X_encoded_df], axis=1)

In [181]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [182]:
train_data = train_data.fillna(X_train.median())
test_data=test_data.fillna(X_test.median())

In [183]:
X.head()

Unnamed: 0,Product_Info_4,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_4,Employment_Info_6,Insurance_History_5,Family_Hist_2,...,Medical_History_38_2,Medical_History_39_1,Medical_History_39_2,Medical_History_39_3,Medical_History_40_1,Medical_History_40_2,Medical_History_40_3,Medical_History_41_1,Medical_History_41_2,Medical_History_41_3
0,-0.891949,1.803488,0.514174,0.003371,-0.277392,-0.393833,2.559719,-0.405924,1.156639,-0.858012,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,-0.891949,-0.467237,-0.955296,-1.288935,-1.111675,-0.454442,-0.410119,-0.62616,1.156639,1.268046,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
2,0.559979,-0.997073,0.514174,3.010921,3.273777,-0.212005,-0.410119,-1.21405,1.156639,0.938644,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
3,-0.347476,0.289671,0.024351,-1.288935,-1.592976,-0.515051,-0.410119,-1.21405,1.156639,1.048444,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
4,-0.347476,-0.921382,-0.710384,0.426308,1.10847,-0.648392,-0.410119,-1.21405,1.156639,-0.858012,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0


In [184]:
# Setting up the parameter grid for hyperparameter tuning using GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}


In [185]:
# Initializing the RandomForestClassifier with a random state for reproducibility

rf_classifier = RandomForestClassifier(random_state=42)

In [167]:
# Executing GridSearchCV to find the best hyperparameters based on the defined parameter grid and using cross-validation

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Note: Fork warnings are due to incompatibility between JAX's multithreading and joblib's multiprocessing. 
# This is just a warning and typically does not impact the execution of the code.

In [168]:
# Extracting the best hyperparameters from the grid search

best_params = grid_search.best_params_

# Initializing RandomForestClassifier with the best parameters and fitting it to the training data

best_rf_classifier = RandomForestClassifier(**best_params, random_state=42)
best_rf_classifier.fit(X_train, y_train)

# Making predictions on the test set with the trained model

test_predictions = best_rf_classifier.predict(X_test)

In [169]:
# Retrieving the best score from the grid search and printing the best accuracy and hyperparameters

best_accuracy = grid_search.best_score_
print("Best Accuracy:", best_accuracy)
print("Best Hyperparameters:", best_params)

Best Accuracy: 0.5438278881778377
Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}


In [170]:
from sklearn.ensemble import AdaBoostClassifier

In [171]:
# Defining the parameter grid for AdaBoostClassifier to be used in grid search
# The grid specifies a range for 'n_estimators' and 'learning_rate' to find the best combination

ada_param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.1, 0.01]
}

In [172]:
# Initializing AdaBoostClassifier with a random state for reproducibility
# The random state ensures the results are the same each time the classifier is run

ada_classifier = AdaBoostClassifier(random_state=42)

In [173]:
ada_grid_search = GridSearchCV(estimator=ada_classifier, param_grid=ada_param_grid, cv=2, scoring='accuracy', n_jobs=-1)


In [174]:
# Performing GridSearchCV to tune AdaBoostClassifier with the defined parameter grid
# This process will find the best hyperparameters based on cross-validation for the highest accuracyada_grid_search = GridSearchCV(estimator=ada_classifier, param_grid=ada_param_grid, cv=2, scoring='accuracy', n_jobs=-1)

ada_grid_search.fit(X_train, y_train)

In [135]:
# Retrieving the best hyperparameters for AdaBoost after GridSearchCV

best_ada_params = ada_grid_search.best_params_

In [136]:
# Training the AdaBoost classifier with the best parameters obtained from the grid search

best_ada_classifier = AdaBoostClassifier(**best_ada_params, random_state=42)
best_ada_classifier.fit(X_train, y_train)


In [137]:
# Using the optimized AdaBoost classifier to make predictions on the test set

ada_test_predictions = best_ada_classifier.predict(X_test)

In [25]:
# Displaying the best accuracy and hyperparameters for the AdaBoost model after grid search

best_ada_accuracy = ada_grid_search.best_score_
print("Best AdaBoost Accuracy:", best_ada_accuracy)
print("Best AdaBoost Hyperparameters:", best_ada_params)

Best AdaBoost Accuracy: 0.4873063320983496
Best AdaBoost Hyperparameters: {'learning_rate': 0.1, 'n_estimators': 100}


In [186]:
# Loading the dataset from Google Drive and importing necessary functions for data preprocessing

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

from sklearn.impute import SimpleImputer
# Separating features and target variable from the training data
X = train_data.drop(columns=['Id','Response'])  # Separate features and target variable
y = train_data['Response']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identifying numerical and categorical columns for imputation
numeric_cols = X.select_dtypes(include=['float64', 'int64']).columns  # Separating numerical and categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Imputing missing values in numerical columns with median and in categorical columns with the most frequent value
imputer_numeric = SimpleImputer(strategy='median')  # Filling missing values with median for numerical columns and mode for categorical columns
imputer_categorical = SimpleImputer(strategy='most_frequent')

# Applying the imputation to the train and test sets
X_train[numeric_cols] = imputer_numeric.fit_transform(X_train[numeric_cols])
X_train[categorical_cols] = imputer_categorical.fit_transform(X_train[categorical_cols])

X_test[numeric_cols] = imputer_numeric.transform(X_test[numeric_cols])
X_test[categorical_cols] = imputer_categorical.transform(X_test[categorical_cols])

# Encoding categorical variables using label encoders
label_encoders = {} 
for column in categorical_cols:
    label_encoders[column] = LabelEncoder()
    X_train[column] = label_encoders[column].fit_transform(X_train[column])
    X_test[column] = label_encoders[column].transform(X_test[column])

# Normalizing the features with StandardScaler to ensure that each feature contributes equally to the distance computation in KNN    
scaler = StandardScaler() # Normalizing the features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# K-Nearest Neighbors (KNN) Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train_scaled, y_train)
knn_predictions = knn_classifier.predict(X_test_scaled)
knn_accuracy = accuracy_score(y_test, knn_predictions)
print("KNN Accuracy:", knn_accuracy)


KNN Accuracy: 0.3922744974213241


In [187]:
from sklearn.linear_model import LogisticRegression

logistic_classifier = LogisticRegression()
logistic_classifier.fit(X_train_scaled, y_train)
logistic_predictions = logistic_classifier.predict(X_test_scaled)
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
print("Logistic Regression Accuracy:", logistic_accuracy)

Logistic Regression Accuracy: 0.5044732133459636


In [28]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam


# Identify categorical and numerical columns
categorical_cols = train_data.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_cols.remove('Id')  # Removing 'Id' if it's part of numerical columns
numerical_cols.remove('Response')  # Removing 'Response' as it's the target

# Ensuring all categorical columns are type 'category' for correct processing
train_data[categorical_cols] = train_data[categorical_cols].apply(lambda x: x.astype('category'))
test_data[categorical_cols] = test_data[categorical_cols].apply(lambda x: x.astype('category'))

# Setup the OneHotEncoder and StandardScaler
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Create the preprocessor with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Preprocess data
X_train = preprocessor.fit_transform(train_data.drop(['Id', 'Response'], axis=1))
y_train = train_data['Response']
X_test = preprocessor.transform(test_data.drop(['Id', 'Response'], axis=1))

# Convert labels to categorical
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(test_data['Response'])

# Define the neural network model
model = Sequential([
    Dense(64, activation='relu', input_dim=X_train.shape[1]),
    Dense(64, activation='relu'),
    Dense(y_train_encoded.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
history = model.fit(X_train, y_train_encoded, epochs=50, batch_size=32, validation_split=0.2)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test_encoded, verbose=0)

print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

# Check for unique predictions (debugging step)
predictions = model.predict(X_test)
predicted_classes = np.argmax(predictions, axis=1)
unique_classes = np.unique(predicted_classes)
print(f"Unique classes predicted: {unique_classes}")

# Print early stopping if NaN loss is detected
if np.isnan(history.history['loss']).any():
    print("Training stopped due to NaN loss.")


Epoch 1/50


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m1188/1188[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 470us/step - accuracy: 0.4149 - loss: 1.5830 - val_accuracy: 0.5082 - val_loss: 1.3199
Epoch 2/50
[1m1188/1188[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 403us/step - accuracy: 0.5351 - loss: 1.2612 - val_accuracy: 0.5302 - val_loss: 1.2798
Epoch 3/50
[1m1188/1188[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 403us/step - accuracy: 0.5576 - loss: 1.2137 - val_accuracy: 0.5368 - val_loss: 1.2679
Epoch 4/50
[1m1188/1188[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 401us/step - accuracy: 0.5731 - loss: 1.1752 - val_accuracy: 0.5324 - val_loss: 1.2662
Epoch 5/50
[1m1188/1188[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 404us/step - accuracy: 0.5817 - loss: 1.1473 - val_accuracy: 0.5352 - val_loss: 1.2656
Epoch 6/50
[1m1188/1188[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 416us/step - accuracy: 0.5837 - loss: 1.1456 - val_accuracy: 0.5363 - val_loss: 1.2717
Epoch 7/50
[1m

In [30]:
# Prediction Set for Neural Network
predictions = model.predict(X_test)
predicted_classes = np.argmax(predictions, axis=1)
actual_classes = np.argmax(y_test_encoded, axis=1)

[1m372/372[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 243us/step


In [128]:
import numpy as np
from sklearn.metrics import confusion_matrix

def quadratic_weighted_kappa(y_true, y_pred):
    """
    Calculates the quadratic weighted kappa
    kappa = 1 - (sum(w * O) / sum(w * E))
    where O is the confusion matrix and E is the matrix of expected counts under independence.
    """
    O = confusion_matrix(y_true, y_pred)
    N = O.shape[0]
    w = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            w[i][j] = ((i - j) ** 2) / (N - 1) ** 2

    # Compute the matrix of expected counts
    E = np.outer(np.sum(O, axis=1), np.sum(O, axis=0)) / np.sum(O)

    # Calculate kappa
    kappa = 1 - (np.sum(w * O) / np.sum(w * E))
    return kappa

# Predict classes on the test set using the trained model
#predictions = model.predict(X_test)
#predicted_classes = np.argmax(predictions, axis=1)
#actual_classes = np.argmax(y_test_encoded, axis=1)  # Make sure this matches the format of your actual test labels

# Calculate the kappa score using the defined function for agreement between actual and predicted classes
#kappa_score = quadratic_weighted_kappa(actual_classes, predicted_classes)
#print("Quadratic Weighted Kappa:", kappa_score)


In [34]:
# Calculating the Kappa score for the K-Nearest Neighbors (KNN) model predictions
knn_kappa = quadratic_weighted_kappa(y_test, nn_predictions)
print("KNN Kappa:", knn_kappa)


KNN Kappa: 0.38561988466173935


In [129]:
# Generating predictions with the Random Forest model and calculating its Kappa score
#rf_predictions = best_rf_classifier.predict(X_test)
rf_kappa = quadratic_weighted_kappa(y_test, test_predictions)
print("Random Forest Kappa:", rf_kappa)


Random Forest Kappa: 0.5290627898581586


In [27]:
# Calculating the Kappa score for Logistic Regression model predictions
logistic_kappa = quadratic_weighted_kappa(y_test, logistic_predictions)
print("Logistic Regression Kappa:", logistic_kappa)

Logistic Regression Kappa: 0.5085070500173755


In [138]:
#from sklearn.ensemble import AdaBoostClassifier

# Ensuring we are using the correct features in training

ada_kappa=quadratic_weighted_kappa(y_test,ada_test_predictions)

print("AdaBoost Kappa:", ada_kappa)


AdaBoost Kappa: 0.4309616538345524


In [33]:
# Calculate the kappa score using the defined function for agreement between actual and predicted classes
kappa_score = quadratic_weighted_kappa(actual_classes, predicted_classes)
print("NeuralNetworkKappa:", kappa_score)

NeuralNetworkKappa: 0.5105409774280318


## Detailed Analysis

#### Weak Models:

##### KNN Model:

###### Kappa Value: 0.3856

###### Accuracy: 0.3923

###### Reasons for Weakness:

>KNN relies heavily on the local structure of the data and the choice of distance metric, which might not effectively capture complex relationships in the data.

>It can be computationally expensive during inference, especially with large datasets.

###### Impact on Performance:

>With a Kappa value of 0.3856, the KNN model might struggle to generalize well to unseen data, especially in scenarios where the relationships between attributes and the target variable are complex or non-linear.

##### AdaBoost Model:

###### Kappa Value: 0.4309

###### Accuracy: 0.4873

###### Reasons for Weakness:

>AdaBoost is sensitive to noisy data and outliers, which can lead to overfitting and suboptimal generalization.
It may struggle to capture complex relationships in the data that cannot be represented by simple weak learners.

###### Impact on Performance:

>With a Kappa value of 0.4309, the AdaBoost model's performance might be limited by its sensitivity to noise and its inability to capture complex patterns effectively.

#### Strong Models:

##### Random Forest Model:

###### Kappa Value: 0.5291

###### Accuracy:  0.5464

###### Reasons for Strength:

>Random Forests are robust to overfitting and can handle complex relationships in the data effectively due to their ensemble nature.

>They can capture non-linear interactions between features, making them suitable for a wide range of datasets.

###### Impact on Performance:

>With a Kappa value of 0.5291, the Random Forest model demonstrates strong performance in capturing complex patterns and relationships in the data.

##### Neural Network Classifier:

###### Kappa Value: 0.5105

###### Accuracy Value: 00.5063


###### Reasons for Strength:

>Neural networks have the capacity to learn complex non-linear relationships in the data through multiple layers of neurons and activation functions.

>They can automatically extract features from raw data, making them suitable for tasks with high-dimensional inputs or unstructured data.

###### Impact on Performance:

>With a Kappa value of 0.5105, the Neural Network Classifier excels in capturing intricate patterns and interactions among features, contributing to its strong performance.

#### Medium Model:

##### Logistic Regression Model:

###### Kappa Value: 0.5085

###### Accuracy: 0.5045

###### Reasons for Medium Performance:

>Logistic Regression assumes a linear relationship between the independent variables and the log-odds of the target variable, which might not capture the full complexity of the data.

>It is less flexible compared to Random Forests and Neural Networks, especially in scenarios with non-linear relationships or interactions.

###### Impact on Performance:

>With a Kappa value of 0.5085, the Logistic Regression model demonstrates moderate performance, performing well when the relationship between predictors and the target is approximately linear. However, it may struggle in capturing non-linear relationships present in the data.