In [2]:
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, f1_score

# Load the data 

- The original data have 800 cases and 40 variables of dtypes float64(2), int64(17) and object(21).

- Deciding features and target:
  - features: variables describing *the insurance policy*, *demographics of the insured*, *incident details*, *claims*, and *the automobile* are potential predictors
  - target: `fraud_reported` (N/Y).

- Necessary preprocessing:
  - `_c39` contains only nulls, so drop it; after that, the only float variable is `policy_annual_premium`. 

  - Only `capital-gains` and `capital-loss` use a hyphen as the word separator; for consistency, *standardize* all column names to "snake_case". 

  - Calculate `policy_age_days` as *the time* between `policy_bind_date` and `incident_date`, because claims filed on very new policies are often suspicious. 

  - *Drop* some variables that are either unique identifiers, have too many unique values (high cardinality), are irrelevant, or have already had their useful information extracted. 

  - The *processed data* have 28 variables of dtypes float64(1), int64(15) and object(12).
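
As a rough illustration of the preprocessing steps listed above, the sketch below applies them to a tiny made-up frame. Only the column names mirror the dataset; the values and the `toy` variable are assumptions for demonstration, not rows from `train_claims.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "policy_bind_date": ["2014-03-04", "2015-01-10"],
    "incident_date": ["2015-01-25", "2015-02-01"],
    "capital-gains": [53300, 0],
    "capital-loss": [0, -62400],
    "_c39": [np.nan, np.nan],
})

# Drop the all-null column, then convert hyphens to underscores.
toy = toy.drop(columns=["_c39"])
toy.columns = [c.lower().replace("-", "_") for c in toy.columns]

# Days elapsed between binding the policy and the incident.
bind = pd.to_datetime(toy["policy_bind_date"])
incident = pd.to_datetime(toy["incident_date"])
toy["policy_age_days"] = (incident - bind).dt.days

print(toy[["capital_gains", "capital_loss", "policy_age_days"]])
```

A small `policy_age_days` value (such as the second toy row) is the kind of "very new policy" the feature is meant to surface.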

In [4]:
# Load the training data
df = pd.read_csv('data/train_claims.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 40 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   months_as_customer           800 non-null    int64  
 1   age                          800 non-null    int64  
 2   policy_number                800 non-null    int64  
 3   policy_bind_date             800 non-null    object 
 4   policy_state                 800 non-null    object 
 5   policy_csl                   800 non-null    object 
 6   policy_deductable            800 non-null    int64  
 7   policy_annual_premium        800 non-null    float64
 8   umbrella_limit               800 non-null    int64  
 9   insured_zip                  800 non-null    int64  
 10  insured_sex                  800 non-null    object 
 11  insured_education_level      800 non-null    object 
 12  insured_occupation           800 non-null    object 
 13  insured_hobbies     

In [11]:
df.head()

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,...,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported,_c39
0,241,45,596785,2014-03-04,IL,500/1000,2000,1104.5,0,432211,...,NO,91650,14100,14100,63450,Accura,TL,2011,N,
1,65,23,876699,1999-12-12,OH,250/500,1000,1099.95,0,473109,...,YES,52400,6550,6550,39300,Accura,MDX,2005,Y,
2,289,45,943425,1999-10-28,OH,250/500,2000,1221.41,0,466289,...,NO,2700,300,300,2100,Honda,Accord,2006,N,
3,63,26,550930,1995-10-12,IL,500/1000,500,1500.04,6000000,613826,...,YES,5160,860,860,3440,Accura,TL,2004,N,
4,257,43,797636,1992-05-19,IN,100/300,1000,974.84,0,468984,...,YES,85320,21330,7110,56880,Nissan,Pathfinder,2006,N,


# 1. Splitting Data to Train/Test
We split the data *before* any preprocessing or analysis to prevent **data leakage**, i.e. using information from the test set to guide model training.

- training and testing 80/20 split
- proportion of fraud cases is roughly the same in both datasets 

In [16]:
# Define features (X) and target (y)
TARGET = 'fraud_reported'
X = df.drop(columns=[TARGET])
y = df[TARGET]

# Split the data into training and testing sets (80/20 split)
# We use stratify=y to keep the proportion of fraud cases roughly the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Data successfully split into training and testing sets.")
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
print("\nFraud distribution in training set:")
print(y_train.value_counts(normalize=True))
print("\nFraud distribution in testing set:")
print(y_test.value_counts(normalize=True))

Data successfully split into training and testing sets.
Training set shape: (640, 39)
Testing set shape: (160, 39)

Fraud distribution in training set:
fraud_reported
N    0.759375
Y    0.240625
Name: proportion, dtype: float64

Fraud distribution in testing set:
fraud_reported
N    0.7625
Y    0.2375
Name: proportion, dtype: float64


# 2. Build the Preprocessing and Modeling Pipeline

Now, we'll create a pipeline to handle all preprocessing steps. This is the best practice as it prevents data leakage and bundles our entire workflow into a single, reusable object.

The pipeline will perform the following steps in order:
1.  **Clean and Engineer Features**: A custom transformer will handle replacing `'?'`, creating `policy_age_days`, and dropping unnecessary columns.
2.  **Impute Missing Values**: It will fill missing numerical values with the median and categorical values with the mode.
3.  **Scale and Encode**: It will apply `StandardScaler` to numerical features and `OneHotEncoder` to categorical features.
4.  **Train the Model**: Finally, it will train our `AdaBoostClassifier`.

First, **create a custom preprocessing step** that can be used inside a `scikit-learn` `Pipeline`. 

This step handles all the initial data cleaning and feature creation/deletion steps, such as:
- Standardizing column names.
- Replacing '?' with NaN.
- Creating the policy_age_days feature.
- Dropping the unnecessary columns.

Specifically, create a new class named `FeatureEngineer` and, to make it compatible with `scikit-learn`, inherit from two of its built-in classes: 
   - `BaseEstimator`: the "base plate". It gives our class `get_params()` and `set_params()`, which `GridSearchCV` relies on to clone and reconfigure pipeline steps during tuning.
   - `TransformerMixin`: a "helper" class. We define our own `.fit()` and `.transform()` methods, and it provides `.fit_transform()` automatically. Inheriting from it is the convention for any pipeline step that transforms data.
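
To make the role of the two base classes concrete, here is a minimal sketch using a hypothetical `ColumnDropper` transformer (not part of this notebook). It assumes the usual scikit-learn convention that constructor arguments are stored unchanged on `self`, which is what lets `get_params()` and `clone()` work.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops a configurable list of columns."""

    def __init__(self, columns=None):
        # Store constructor arguments unchanged so get_params()/clone() work.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self by convention.
        return self

    def transform(self, X):
        # Ignore names that are not present instead of raising a KeyError.
        to_drop = [c for c in (self.columns or []) if c in X.columns]
        return X.drop(columns=to_drop)

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dropper = ColumnDropper(columns=["b", "missing"])

params = dropper.get_params()        # provided by BaseEstimator
result = dropper.fit_transform(toy)  # provided by TransformerMixin
copy = clone(dropper)                # unfitted copy with the same parameters

print(params, list(result.columns))
```

Exposing the column list as a constructor parameter, rather than reading a module-level variable, is one possible design choice: it would make the list visible to `get_params()` and therefore tunable, at the cost of a slightly longer class.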

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin

# Define the list of columns to drop
# We define this here so it can be used in our custom transformer
columns_to_drop = [
    'policy_bind_date', 'incident_date', 'policy_number', 'policy_state',
    'insured_zip', 'insured_hobbies', 'incident_location', 'incident_city',
    'incident_state', 'auto_model', 'auto_year', 'auto_make', '_c39'
]

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_transformed = X.copy()
        
        # Standardize column names
        X_transformed.columns = [col.lower().replace('-', '_') for col in X_transformed.columns]
        
        # Replace '?' with NaN
        X_transformed.replace('?', np.nan, inplace=True)
        
        # Feature Engineering: policy_age_days
        X_transformed['policy_bind_date'] = pd.to_datetime(X_transformed['policy_bind_date'])
        X_transformed['incident_date'] = pd.to_datetime(X_transformed['incident_date'])
        X_transformed['policy_age_days'] = (X_transformed['incident_date'] - X_transformed['policy_bind_date']).dt.days
        
        # Drop specified columns
        # Keep only the columns that are actually present, so drop() does not raise a KeyError
        cols_to_drop_in_df = [col for col in columns_to_drop if col in X_transformed.columns]
        X_transformed.drop(columns=cols_to_drop_in_df, inplace=True)
        
        return X_transformed

**Complete Pipeline Cell**: builds and runs the entire `Pipeline`. This cell:

- Defines the preprocessing steps for numerical features (imputation with median, then scaling).
- Defines the preprocessing steps for categorical features (imputation with mode, then one-hot encoding).
- Combines these steps using `ColumnTransformer`.
- Integrates the custom `FeatureEngineer` and the preprocessor into a final `Pipeline` with the `AdaBoostClassifier`.
- Trains the entire pipeline on `X_train` and `y_train`.
- Evaluates the model on `X_test` and prints the classification report.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define the initial feature engineering step
feature_engineering_step = ('feature_engineering', FeatureEngineer())

# Define preprocessing for numerical and categorical features
# Note: We are defining this based on expected dtypes AFTER the FeatureEngineer runs.
# Let's get the dtypes first by fitting and transforming a sample.
temp_df = FeatureEngineer().fit_transform(X_train)
numerical_features = temp_df.select_dtypes(include=np.number).columns.tolist()
categorical_features = temp_df.select_dtypes(include='object').columns.tolist()

# Create the preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'
)

# Create the final model pipeline
model_pipeline = Pipeline(steps=[
    feature_engineering_step,
    ('preprocessor', preprocessor),
    ('classifier', AdaBoostClassifier(random_state=42))
])

# Train the model
print("Training the complete model pipeline...")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")

# Evaluate the model
print("\nEvaluating model performance on the test set...")
y_pred = model_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

Training the complete model pipeline...
Model training complete.

Evaluating model performance on the test set...
              precision    recall  f1-score   support

           N       0.88      0.87      0.88       122
           Y       0.60      0.63      0.62        38

    accuracy                           0.81       160
   macro avg       0.74      0.75      0.75       160
weighted avg       0.82      0.81      0.81       160
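
The per-class rows of the report can also be computed directly, which is handy when only the fraud class (`Y`) matters. The sketch below uses made-up labels (`y_true`, `y_hat`), not this notebook's predictions; with string labels, `pos_label` has to be set explicitly.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration only; not the notebook's predictions.
y_true = ["N", "N", "N", "N", "Y", "Y", "Y", "N"]
y_hat = ["N", "N", "Y", "N", "Y", "N", "Y", "N"]

# Metrics for the fraud class only.
precision_y = precision_score(y_true, y_hat, pos_label="Y")
recall_y = recall_score(y_true, y_hat, pos_label="Y")
f1_y = f1_score(y_true, y_hat, pos_label="Y")

# Unweighted mean of the per-class F1 scores ("macro avg" in the report).
f1_macro = f1_score(y_true, y_hat, average="macro")

print(f"precision(Y)={precision_y:.2f} recall(Y)={recall_y:.2f} "
      f"f1(Y)={f1_y:.2f} f1_macro={f1_macro:.2f}")
```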



# 3. Hyperparameter Tuning with GridSearchCV

The initial model provides a reasonable baseline, with a fraud-class (`Y`) recall of 0.63 on the test set. Now, we'll use `GridSearchCV` to systematically search for better hyperparameters for our `AdaBoostClassifier`. This fine-tunes the model and may improve its performance, especially the recall for fraud detection.

We will search over:
- `n_estimators`: the maximum number of weak learners (boosting rounds) to fit.
- `learning_rate`: the weight applied to each weak learner's contribution; smaller values usually need more estimators to reach the same fit.

By default (`refit=True`), `GridSearchCV` performs one final step after cross-validation: it retrains the entire pipeline on all of the training data (`X_train` and `y_train`) using the best parameters it found. 

`grid_search.best_estimator_` is the resulting fully trained pipeline, so it is the only object we need to save.
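
The refit behaviour can be checked on a small scale. The following is a sketch on synthetic data from `make_classification`, with a two-step pipeline standing in for the real one; the names `demo_pipeline`, `demo_grid`, and `demo_search` are illustrative, and the grid values are chosen only to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the claims dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", AdaBoostClassifier(random_state=0)),
])

# "<step name>__<parameter>" routes each value to the matching pipeline step.
demo_grid = {
    "classifier__n_estimators": [10, 20],
    "classifier__learning_rate": [0.1, 1.0],
}

demo_search = GridSearchCV(
    demo_pipeline, demo_grid, cv=3, scoring="f1_macro", refit=True
)
demo_search.fit(X_demo, y_demo)

# best_estimator_ is a fitted Pipeline, retrained on all of X_demo and y_demo.
best_demo = demo_search.best_estimator_
print(demo_search.best_params_)
print(best_demo.predict(X_demo[:5]))
```

Because the whole pipeline is passed to the search, the scaler is refit on the training portion of each fold, so the held-out portion never influences preprocessing.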

In [19]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
# These are parameters for the 'classifier' step of our pipeline
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__learning_rate': [0.01, 0.1, 1.0]
}

# Create the GridSearchCV object
# We pass the entire model_pipeline so that preprocessing is refit inside each CV fold
# We focus on 'f1_macro' as it's a good overall metric for imbalanced classes
# cv=3 means 3-fold cross-validation
grid_search = GridSearchCV(
    model_pipeline, 
    param_grid, 
    cv=3, 
    scoring='f1_macro', 
    verbose=1
)

# Fit the grid search to the training data
print("Starting GridSearchCV...")
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("\nGridSearchCV complete.")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best F1 score on cross-validation: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
print("\nEvaluating the best model on the test set...")
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
print(classification_report(y_test, y_pred_best))

Starting GridSearchCV...
Fitting 3 folds for each of 9 candidates, totalling 27 fits

GridSearchCV complete.
Best parameters found: {'classifier__learning_rate': 0.01, 'classifier__n_estimators': 50}
Best F1 score on cross-validation: 0.7584

Evaluating the best model on the test set...
              precision    recall  f1-score   support

           N       0.91      0.84      0.88       122
           Y       0.60      0.74      0.66        38

    accuracy                           0.82       160
   macro avg       0.75      0.79      0.77       160
weighted avg       0.84      0.82      0.82       160



# 4. Save the Best Model

Now that we have a well-performing, tuned model, the final step is to save it to a file. This process, called **serialization**, allows us to load the entire pipeline (the feature engineering, the preprocessing, and the trained classifier) into another environment, such as a web application or an API, to make predictions on new, unseen data without having to retrain it.

We will use the `joblib` library, which is efficient for saving scikit-learn models. Here we save `best_model`, the entire, fitted `Pipeline` object.

In [20]:
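import os

# Define the file path for the saved model
model_path = 'models/fraud_detection_pipeline.pkl'

# Make sure the target directory exists; joblib.dump does not create it
os.makedirs(os.path.dirname(model_path), exist_ok=True)

# Save the best_model pipeline using joblib
joblib.dump(best_model, model_path)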

print(f"Model pipeline saved successfully to: {model_path}")

# You can also load it back to verify
# loaded_model = joblib.load(model_path)
# print("\nModel loaded successfully.")
# print(loaded_model)

Model pipeline saved successfully to: models/fraud_detection_pipeline.pkl
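
A save-and-reload round trip is a cheap way to confirm that the serialized pipeline behaves like the original. The sketch below uses a small synthetic pipeline and a temporary directory rather than the real `best_model`; the names `demo_model` and `demo_path` are illustrative. One caveat applies to the real pipeline: a pickle stores only a reference to custom classes, so `FeatureEngineer` (and the `columns_to_drop` list it reads) must be defined or importable in whatever environment loads the file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic pipeline standing in for best_model.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", AdaBoostClassifier(n_estimators=10, random_state=0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "models", "demo_pipeline.pkl")
    # joblib.dump does not create missing directories.
    os.makedirs(os.path.dirname(demo_path), exist_ok=True)
    joblib.dump(demo_model, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded.predict(X_demo)
)
print(same_predictions)
```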
