# Introduction
The goal of this project is to classify whether an insurance claim will be made based on various features of the insured individual. We will explore and implement three machine learning models: Logistic Regression, Random Forest, and Support Vector Machine (SVM). The dataset used for this project contains information such as age, BMI, steps, number of children, sex, smoking status, and region.

## Data Preprocessing
We start by loading and preprocessing the dataset. This involves encoding the target as 0/1, converting categorical variables to dummy variables, dropping rows with missing values, and splitting the data into training (67%) and test (33%) sets.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
df = pd.read_csv('claims_data.csv')

# Convert target variable to binary (1 if there was a claim, 0 otherwise)
df['insurance_claim'] = (df['insurance_claim'] == 'yes').astype(int)

# Create dummy variables for categorical features and drop the first in each instance
df_dummies = pd.get_dummies(df.drop(columns=['claim_amount']), drop_first=True)

# Define features and target variable
X = df_dummies.drop(columns=['insurance_claim'])
y = df_dummies['insurance_claim']

# Convert boolean columns to integers
bool_columns = X.select_dtypes(include=['bool']).columns
X[bool_columns] = X[bool_columns].astype(int)

# Ensure all data is numeric
X = X.apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(y, errors='coerce')

# Drop any rows with NaN values that could not be converted
X = X.dropna()
y = y.loc[X.index]  # Ensure y has the same indices as X after dropping NaNs

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
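
To make the dummy-variable step concrete, here is a minimal sketch on a toy frame. The rows are hypothetical and are not taken from `claims_data.csv`; the column names only mirror the features listed in the introduction.

```python
import pandas as pd

# Hypothetical toy rows, not taken from claims_data.csv
toy = pd.DataFrame({
    "age": [23, 45, 31],
    "sex": ["male", "female", "male"],
    "smoker": ["no", "yes", "no"],
})

# drop_first=True drops the first level of each categorical column in sorted
# order ("female" and "no" here); that level becomes the reference category
encoded = pd.get_dummies(toy, drop_first=True).astype(int)

print(encoded)
```

Dropping one level per category avoids perfectly collinear columns, which matters for the logistic regression fitted with an intercept.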


## Logistic Regression
Next, we fit a logistic regression model with statsmodels and evaluate its accuracy on the test set, predicting a claim when the estimated probability exceeds 0.5.

In [None]:
# Add an intercept column; keep separate names so that later models
# still receive the original feature matrices
X_train_sm = sm.add_constant(X_train, has_constant='add')
X_test_sm = sm.add_constant(X_test, has_constant='add')

# Fit the logistic regression model
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit()

# Print the summary of the model
print(result.summary())

# Predict claim probabilities on the test set, then threshold at 0.5
y_prob = result.predict(X_test_sm)
y_pred = (y_prob > 0.5).astype(int)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Proportion of correctly predicted claim indicators: {accuracy:.2f}')
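
The step from predicted probabilities to class labels can be sketched in plain Python. The scores below are hypothetical values of the linear predictor, not output from the fitted model, and `sigmoid` and `classify` are illustrative helper names.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probabilities, threshold=0.5):
    """Label an observation 1 when its probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


# Hypothetical linear scores (intercept plus weighted features)
scores = [-2.0, 0.0, 1.5]
probs = [sigmoid(z) for z in scores]

labels = classify(probs)               # default 0.5 cut-off
lenient_labels = classify(probs, 0.1)  # lower cut-off flags more claims

print(labels, lenient_labels)
```

The 0.5 cut-off is a convention rather than a requirement; lowering it trades more false positives for fewer false negatives.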


## Random Forest
We then fit a Random Forest model with 100 trees and evaluate its accuracy, confusion matrix, and counts of false positives and false negatives on the test set.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create and fit the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=101)
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(cm)

# Calculate false positives and false negatives
tn, fp, fn, tp = cm.ravel()

print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
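
The four confusion-matrix counts also give precision and recall, which can be more informative than accuracy when the two error types have different costs. The sketch below uses hypothetical counts, not results from this dataset, and `summarize_confusion` is an illustrative helper name.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive common metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,  # share of predicted claims that were real
        "recall": recall,        # share of real claims that were caught
        "f1": f1,
    }


# Hypothetical counts for illustration only
metrics = summarize_confusion(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the counts from the cell above, the same helper could be called as `summarize_confusion(tn, fp, fn, tp)`.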


## Support Vector Machine (SVM)
Finally, we fit SVM models with different kernels and evaluate their performance.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Kernels to compare (an illustrative selection)
kernels = ['linear', 'poly', 'rbf']

for kernel in kernels:
    # SVMs are sensitive to feature scale, so standardize before fitting
    svm_model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=101))
    svm_model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = svm_model.predict(X_test)

    # Calculate accuracy and the confusion matrix
    accuracy = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    # Extract false positives and false negatives
    tn, fp, fn, tp = cm.ravel()

    print(f"Kernel: {kernel}")
    print(f"Accuracy: {accuracy:.2f}")
    print("Confusion Matrix:")
    print(cm)
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")


## Conclusion
In this project, we implemented and compared three classification models for predicting insurance claims. The Random Forest model performed best, with the highest accuracy and the fewest false positives and false negatives. Further hyperparameter tuning and model optimization could improve performance.
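
One way to approach the tuning mentioned above is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification` rather than the claims dataset, and the parameter grid is an assumption chosen to keep the run short, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small hypothetical grid; a real search would likely cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# 3-fold cross-validation over every parameter combination
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

On the real data, the search would be fitted on `X_train` and `y_train` only, keeping the test set for a final evaluation.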