# Model Improvement - Part 2

## Introduction
In this section, we will improve the baseline model developed in Part 1. We'll analyze the model's errors, implement various improvements, and compare the performance of the improved model to the baseline model.


## Data Loading and Preprocessing



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load data
data = pd.read_csv("data/car_insurance.csv")

# Fill missing numeric values with the column mean
# (means are computed on the full dataset, before the train-test split)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# Encode categorical features as integer codes, one fresh encoder per column
for col in ['Gender', 'Location.Code', 'Marital.Status']:
    data[col] = LabelEncoder().fit_transform(data[col])

# Split data into features and target
X = data.drop(columns=['Total.Claim.Amount'])
y = data['Total.Claim.Amount']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Baseline Model for Comparison



In [2]:
import xgboost as xgb

# Baseline Model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42, n_estimators=100, max_depth=3)
xgb_model.fit(X_train, y_train)

# Baseline Predictions
y_pred_baseline = xgb_model.predict(X_test)


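Note: the cell above failed with `ModuleNotFoundError: No module named 'xgboost'` because the `xgboost` package was not installed in the environment. Install it (for example with `pip install xgboost`) and rerun the cell before continuing; the later cells depend on `xgb`.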

## Error Analysis and Conclusions

The error analysis in Part 1 identified the following issues:
- **Overestimation and Underestimation**: The model's errors are not uniform: it overestimates the claim amount in some cases and underestimates it in others.
- **Feature Importance**: Features like `Monthly.Premium.Auto` and `Location.Code` are significant contributors to prediction errors.
- **Outliers**: Many errors are caused by outliers in the data.
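
The issues listed above can be quantified with a small residual summary. The following is a minimal sketch, not the analysis used in Part 1: the function name `summarize_residuals`, the sign convention (prediction minus actual), and the IQR rule with a factor of 1.5 are all assumptions made for illustration.

```python
import numpy as np

def summarize_residuals(y_true, y_pred, iqr_factor=1.5):
    """Summarize signed errors (prediction minus actual). Hypothetical helper."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true

    # Flag unusually large errors with the IQR rule applied to the residuals
    # (the 1.5 factor is a common convention, assumed here)
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    is_outlier = (residuals < q1 - iqr_factor * iqr) | (residuals > q3 + iqr_factor * iqr)

    return {
        "mean_error": float(residuals.mean()),                   # > 0: overestimates on average
        "share_overestimated": float((residuals > 0).mean()),    # fraction of rows predicted too high
        "share_underestimated": float((residuals < 0).mean()),   # fraction of rows predicted too low
        "n_outlier_errors": int(is_outlier.sum()),               # rows with unusually large errors
    }
```

With the variables from the baseline cell, this could be called as `summarize_residuals(y_test, y_pred_baseline)`. Grouping the same summary by a feature such as `Location.Code` would show where the errors concentrate.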

### Work Plan:
1. Perform hyperparameter tuning to optimize the model.
2. Feature engineering:
   - Create new features or transformations.
   - Remove or handle outliers.
3. Handle missing data:
   - Improve the imputation of missing values.
4. Balance data:
   - Check for data imbalances and address them if needed.
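
Steps 2 and 3 of the plan can be sketched as a pair of helpers that learn their statistics from the training split only and then apply them to any split, which avoids leaking test-set information into the preprocessing. This is one possible approach under stated assumptions, not the notebook's implementation: the names `fit_numeric_cleaner` and `apply_numeric_cleaner`, median imputation, and IQR clipping with a factor of 1.5 are all illustrative choices.

```python
import pandas as pd

def fit_numeric_cleaner(train_df, columns, iqr_factor=1.5):
    """Learn per-column medians and IQR clipping bounds from training data only."""
    stats = {}
    for col in columns:
        # quantile() and median() skip missing values by default
        q1, q3 = train_df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        stats[col] = {
            "median": train_df[col].median(),
            "lower": q1 - iqr_factor * iqr,
            "upper": q3 + iqr_factor * iqr,
        }
    return stats

def apply_numeric_cleaner(df, stats):
    """Impute missing values with the learned median, then clip to the learned bounds."""
    out = df.copy()  # leave the caller's frame untouched
    for col, s in stats.items():
        out[col] = out[col].fillna(s["median"]).clip(lower=s["lower"], upper=s["upper"])
    return out
```

A possible usage is `stats = fit_numeric_cleaner(X_train, ['Monthly.Premium.Auto'])` followed by `apply_numeric_cleaner(X_train, stats)` and `apply_numeric_cleaner(X_test, stats)`. Clipping keeps the rows and caps extreme values; dropping outlier rows instead is an alternative that changes the sample size.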


## Improving Model Performance

We will implement various techniques to enhance model performance.


In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid (3 x 3 x 3 = 27 combinations)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Initialize model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Grid search: 5-fold CV over every combination in the grid, scored by negative MAE (higher is better)
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='neg_mean_absolute_error', cv=5, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_search.best_params_)

# best_estimator_ is already refit on the full training set with the best parameters (refit=True by default)
best_model = grid_search.best_estimator_
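
To compare the tuned model with the baseline, both sets of predictions can be scored with the same metrics on the held-out test set. The sketch below is an assumption about how that comparison might look: the helper names `regression_metrics` and `compare_models` are hypothetical, and the choice of MAE, RMSE, and R² is illustrative (MAE matches the grid search scoring above).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }

def compare_models(y_true, predictions):
    """Score several models; `predictions` maps a label to predicted values."""
    return {name: regression_metrics(y_true, y_pred)
            for name, y_pred in predictions.items()}
```

With the notebook's variables, a call could look like `compare_models(y_test, {"baseline": y_pred_baseline, "tuned": best_model.predict(X_test)})`. Lower MAE and RMSE and higher R² for the tuned model would indicate an improvement on this split.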
