## Model Training and Evaluation

After preprocessing the data and saving it to a CSV file, the next step is to train and evaluate several machine learning models. This section outlines the approach to model training, evaluation, and feature importance analysis, with the goal of selecting the best-performing model for predicting insurance claims.

### Objectives
The main objectives of this report are to:
- Document the complete process of training, evaluating, and interpreting machine learning models for predicting car insurance claims.
- Utilize the preprocessed dataset to train various models.
- Evaluate each model based on relevant metrics to identify the best-performing one.
- Conduct SHAP (SHapley Additive exPlanations) analysis to explain feature importance and model predictions.
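To make the idea behind SHAP concrete, here is a minimal sketch for the simplest case: a linear model with independent features, where the SHAP value of feature j for one row reduces to `coef_j * (x_j - mean(x_j))`. The synthetic data, coefficients, and variable names are assumptions made for illustration; they are not taken from the insurance dataset, and tree-based models would normally be explained with the `shap` library instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 200 rows, 3 independent features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Baseline: the average model output over the data.
baseline = model.predict(X).mean()

# Closed-form SHAP values for a linear model with independent features:
# each feature's contribution is its coefficient times its deviation from the mean.
shap_values = model.coef_ * (X - X.mean(axis=0))  # shape (200, 3)

# Additivity: baseline plus a row's SHAP values reproduces that row's prediction.
reconstructed = baseline + shap_values.sum(axis=1)

# Global importance: mean absolute SHAP value per feature.
global_importance = np.abs(shap_values).mean(axis=0)
print(global_importance)
```

The additivity check is the property that makes SHAP values readable as per-feature contributions to an individual prediction.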

### Model Training Approach
1. **Data Preparation**: Load the preprocessed dataset and split it into training and testing sets.
2. **Model Selection**: Choose various machine learning algorithms (e.g., Linear Regression, Random Forest, Gradient Boosting).
3. **Training**: Fit the models on the training set.
4. **Evaluation**: Assess model performance using regression metrics such as MAE, RMSE, and R². Accuracy applies only if the task is framed as classification.
5. **Feature Importance**: Use SHAP analysis to interpret model outputs and understand the impact of different features on predictions.
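The steps above can be sketched end to end as follows. This is an illustrative outline, not the notebook's actual pipeline: it uses synthetic data from `make_regression` in place of the preprocessed CSV, and the model settings and the choice of RMSE as the selection criterion are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Step 1: data preparation (synthetic stand-in for the preprocessed dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: model selection.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

# Steps 3 and 4: fit each model on the training set, score it on the test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

# Pick the model with the lowest test RMSE (one possible selection rule).
best = min(results, key=lambda n: results[n]["RMSE"])
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print("Lowest RMSE:", best)
```

Keeping the models in a dictionary makes it easy to add another estimator, such as an XGBoost regressor, without changing the evaluation loop.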

### Conclusion
This approach is intended not only to identify the most effective model for predicting insurance claims but also to clarify which underlying factors drive those predictions.

In [17]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb  # Ensure XGBoost is installed for advanced gradient boosting
import os
import sys

In [18]:
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))

In [19]:
# Set options for pandas to display more columns and rows
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

In [20]:
# Suppress all warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

In [21]:
# Load the preprocessed data from CSV file
df = pd.read_csv('../data/preprocessed_data.csv')
print("Data loaded successfully.")

Data loaded successfully.
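Before fitting any model, it can help to confirm that the loaded frame is in the shape scikit-learn estimators expect, namely numeric columns with no missing values. The helper below is a hypothetical sketch: `summarize_frame` and the toy column names (`age`, `claim_amount`, `region`) are invented for illustration and are not taken from `preprocessed_data.csv`.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic facts that affect whether df can be fed to a regressor."""
    return {
        "shape": df.shape,
        # Total number of missing cells across all columns.
        "missing_cells": int(df.isna().sum().sum()),
        # Columns that would still need encoding before model training.
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy frame with hypothetical columns, standing in for the real dataset.
toy = pd.DataFrame(
    {
        "age": [25, 40, None],
        "claim_amount": [100.0, 0.0, 250.0],
        "region": ["N", "S", "N"],
    }
)
summary = summarize_frame(toy)
print(summary)
```

In the notebook, the same call would be `summarize_frame(df)` on the frame loaded above.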
