# Omar Hatem (member 3)
### Model training

Member 3: Modeling & Strategy Lead
Focus: Model Training and Imbalance Handling.
Key Action 1: Select and implement a Class Imbalance Strategy (e.g., class weighting, oversampling) to address the ~10% fraud rate, justifying the choice.
Key Action 2: Implement and tune the Primary Model (e.g., Gradient Boosting) and Two Comparison Models (e.g., Logistic Regression, Random Forest).
Output: The notebook 02_modeling.ipynb containing the trained models, hyperparameter settings, and justification for model and strategy choices.

# Task
Split the `provider_aggregated_df` into a feature matrix `X` (excluding the 'Provider' and 'PotentialFraud' columns) and a target vector `y` (containing the 'PotentialFraud' column).

## Split Data into Features (X) and Target (y)

### Subtask:
Separate the `provider_aggregated_df` into a feature matrix `X` (all columns except 'Provider' and 'PotentialFraud') and a target vector `y` ('PotentialFraud').


**Reasoning**:
To separate the data, I will create the feature matrix `X` by dropping the 'Provider' and 'PotentialFraud' columns from `provider_aggregated_df` and create the target vector `y` by selecting only the 'PotentialFraud' column. Then, I will print the first 5 rows of both `X` and `y` to verify the split.



In [9]:
# =============================
# 0. Imports & Config
# =============================
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, precision_recall_curve, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Optional: XGBoost (make sure installed correctly)
try:
    from xgboost import XGBClassifier
except ModuleNotFoundError:
    print("XGBoost not installed. Install via: pip3 install xgboost")

# Ignore warnings for clean output
import warnings
warnings.filterwarnings("ignore")

# Display settings
pd.set_option("display.max_columns", 200)
sns.set(style="whitegrid")


# =============================
# 1. Load Preprocessed Dataset
# =============================
df = pd.read_csv("provider_aggregated_with_features.csv")  # Or provider_model_ready_unscaled.csv

# =============================
# 2. Define Candidate Features
# =============================
model_features = [
 'TotalClaimAmtReimbursed_log1p',
 'NumClaims_log1p',
 'NumUniqueBeneficiaries_log1p',
 'TotalIPAnnualReimbursementAmt_log1p',
 'TotalOPAnnualReimbursementAmt_log1p',
 'avg_claim_amt_per_claim_log1p',
 'avg_claim_amt_per_beneficiary_log1p',
 'ip_op_ratio',
 'claims_per_day',
 'diag_per_claim',
 'proc_per_claim',
 'NumUniqueDiagnosisCodes',
 'NumUniqueProcedureCodes',
 'claims_x_avgclaim',
 'many_procs_flag',
 'high_total_claim_flag'
]

# Keep only features that exist in the DataFrame
model_features = [f for f in model_features if f in df.columns]

# Fill NaNs in features
df[model_features] = df[model_features].fillna(0)

# =============================
# 3. Prepare Model-Ready Data
# =============================
df_model = df[['Provider','PotentialFraud'] + model_features].copy()

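# Feature scaling (note: the scaler is fit on the full dataset here;
# refit it on the training split only if the scaled file is used for modeling)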
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_model[model_features])
X_scaled_df = pd.DataFrame(X_scaled, columns=model_features)
X_scaled_df['Provider'] = df_model['Provider'].values
X_scaled_df['PotentialFraud'] = df_model['PotentialFraud'].values

# Save scaled/unscaled datasets
df_model.to_csv("provider_model_ready_unscaled.csv", index=False)
X_scaled_df.to_csv("provider_model_features_scaled.csv", index=False)

# Feature matrix and target
X = df_model[model_features]
y = df_model['PotentialFraud']

print("✅ Dataset loaded, features defined, and X/y are ready for modeling.")


✅ Dataset loaded, features defined, and X/y are ready for modeling.


In [10]:
df_model_ready = pd.read_csv('provider_model_ready_unscaled.csv')
# Ensure X contains only the intended engineered features
X = df_model_ready[model_features]
y = df_model_ready['PotentialFraud']

print("First 5 rows of feature matrix X:")
print(X.head())
print("\nFirst 5 rows of target vector y:")
print(y.head())

First 5 rows of feature matrix X:
   TotalClaimAmtReimbursed_log1p  NumClaims_log1p  \
0                      11.558291         3.258097   
1                      13.314092         4.890349   
2                      10.862282         5.010635   
3                      12.545793         7.061334   
4                      10.425579         4.290459   

   NumUniqueBeneficiaries_log1p  TotalIPAnnualReimbursementAmt_log1p  \
0                      3.218876                            12.994873   
1                      4.770685                            13.814511   
2                      4.934474                            13.382311   
3                      6.206576                            15.255808   
4                      4.077537                            12.299568   

   TotalOPAnnualReimbursementAmt_log1p  avg_claim_amt_per_claim_log1p  \
0                            11.087987                       8.339644   
1                            12.775698                       8.43150

## Final Task

### Subtask:
Summarize the data splitting and preparation steps performed, confirming that the data is ready for model training.


## Summary:

### Data Analysis Key Findings
*   The model-ready dataset (`provider_model_ready_unscaled.csv`) was successfully split into a feature matrix `X` and a target vector `y`.
*   The feature matrix `X` contains the 16 engineered features listed in `model_features`; 'Provider' and 'PotentialFraud' are excluded.
*   The target vector `y` exclusively contains the 'PotentialFraud' column.
*   The head of `X` confirmed the presence of feature columns such as 'TotalClaimAmtReimbursed_log1p', 'NumClaims_log1p', and 'NumUniqueBeneficiaries_log1p', and the correct exclusion of 'Provider' and 'PotentialFraud'.
*   The head of `y` verified the 'PotentialFraud' values for the target vector.

### Insights or Next Steps
*   The data is now prepared with features (`X`) and the target variable (`y`), making it ready for subsequent steps such as model training and evaluation.
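
The task text describes building `X` by dropping the identifier and label columns, while the cells select an explicit feature list. As a minimal sketch of how the two approaches relate, the toy frame below uses made-up values; `toy_df`, `feature_list`, and the provider IDs are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

# Toy provider-level frame with made-up values (illustration only).
toy_df = pd.DataFrame({
    "Provider": ["PRV001", "PRV002", "PRV003"],
    "PotentialFraud": [0, 1, 0],
    "NumClaims_log1p": [3.26, 4.89, 5.01],
    "ip_op_ratio": [1.2, 0.8, 2.5],
})

# Option 1: drop the identifier and the label, keep everything else.
X_drop = toy_df.drop(columns=["Provider", "PotentialFraud"])

# Option 2: select an explicit feature list (what the cells above do).
feature_list = ["NumClaims_log1p", "ip_op_ratio"]
X_select = toy_df[feature_list]

y_toy = toy_df["PotentialFraud"]

# Both options agree only while every remaining column is a listed feature.
same_columns = list(X_drop.columns) == list(X_select.columns)
print(same_columns)
```

An explicit list is the safer choice when the source file also holds raw or intermediate columns that should not reach the model.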


# Task
Split the feature matrix `X` and target vector `y` into training and testing sets using `train_test_split` from `sklearn.model_selection`. Ensure that the `stratify` argument is set to `y` to maintain the proportion of classes in both the training and testing sets, addressing the class imbalance observed in the 'PotentialFraud' target variable. Use a `test_size` of 0.2 and a `random_state` of 42 for reproducibility.

## Split Data into Training and Testing Sets

### Subtask:
Split the feature matrix `X` and target vector `y` into training and testing sets using `train_test_split` to prepare for model training and evaluation. A stratify argument should be used to maintain the proportion of classes in both train and test sets due to the class imbalance.


**Reasoning**:
To split the data, I need to import the `train_test_split` function, apply it to `X` and `y` with the specified parameters, and then print the shapes to verify the split.



In [11]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split into training and testing sets successfully.")

# Print the shapes of the resulting sets
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Data split into training and testing sets successfully.
Shape of X_train: (4328, 16)
Shape of X_test: (1082, 16)
Shape of y_train: (4328,)
Shape of y_test: (1082,)
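
To make the effect of `stratify=y` concrete, here is a from-scratch sketch of a stratified split on toy labels. It is not scikit-learn's implementation; `stratified_split` and the 900/100 label mix are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 900 legitimate (0) and 100 fraudulent (1) providers.
toy_labels = [0] * 900 + [1] * 100
train_idx, test_idx = stratified_split(toy_labels)

train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(f"fraud rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Because each class is split separately, both parts keep the 10% fraud rate of the toy data, which a plain random split only achieves on average.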


## Implement Class Weights Strategy

### Subtask:
Calculate class weights for the imbalanced dataset and demonstrate how these weights can be incorporated into a model, such as a Logistic Regression or Gradient Boosting Classifier. This strategy adjusts the importance of each class during model training.


**Reasoning**:
To calculate class weights, I will import `compute_class_weight` from `sklearn.utils.class_weight`, compute weights for `y_train`, create a dictionary, print the weights, and explain their usage in a model.



In [12]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# 1. Calculate class weights for y_train
classes = np.unique(y_train)
class_weights_array = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)

# 2. Create a dictionary mapping each class to its calculated weight
class_weights_dict = dict(zip(classes, class_weights_array))

print("Calculated Class Weights:")
print(class_weights_dict)

print("\nExplanation: LogisticRegression takes these weights through its 'class_weight' parameter, for example 'model = LogisticRegression(class_weight=class_weights_dict)'. GradientBoostingClassifier has no 'class_weight' parameter, so the weights are expanded to one value per training row and passed to fit(), for example 'model.fit(X_train, y_train, sample_weight=y_train.map(class_weights_dict))'. Either way, the minority class receives more attention during training.")

Calculated Class Weights:
{np.int64(0): np.float64(0.5516186591893959), np.int64(1): np.float64(5.3432098765432094)}

Explanation: LogisticRegression takes these weights through its 'class_weight' parameter, for example 'model = LogisticRegression(class_weight=class_weights_dict)'. GradientBoostingClassifier has no 'class_weight' parameter, so the weights are expanded to one value per training row and passed to fit(), for example 'model.fit(X_train, y_train, sample_weight=y_train.map(class_weights_dict))'. Either way, the minority class receives more attention during training.
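
The weights printed above follow from the `'balanced'` heuristic, `n_samples / (n_classes * class_count)`. The sketch below recomputes them in plain Python from the class counts reported in this notebook (3923 and 405) and expands them to per-row weights; `toy_labels` is a hypothetical label list used only for illustration.

```python
# Class counts in y_train, taken from the outputs in this notebook:
# 3923 non-fraud (0) and 405 fraud (1) providers, 4328 in total.
counts = {0: 3923, 1: 405}
n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
balanced = {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}
print(balanced)

# Expanding class weights into one weight per training row. This is the form
# that the sample_weight argument of an estimator's fit() method expects.
toy_labels = [0, 0, 1, 0, 1]  # hypothetical labels, illustration only
toy_sample_weight = [balanced[label] for label in toy_labels]
print(toy_sample_weight)
```

For estimators without a `class_weight` constructor parameter, such as `GradientBoostingClassifier`, a per-row list like this would go to `fit(X_train, y_train, sample_weight=...)`.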


## Implement SMOTE Oversampling Strategy

### Subtask:
Apply Synthetic Minority Oversampling Technique (SMOTE) to the training data to create synthetic samples of the minority class. This helps to balance the class distribution without losing information from the majority class.


**Reasoning**:
To apply SMOTE, I will import the `SMOTE` class, instantiate it with `random_state=42`, and then use the `fit_resample` method on `X_train` and `y_train` to generate the oversampled datasets. Finally, I will print the shapes and value counts to verify the successful oversampling and class balancing.



In [14]:
from imblearn.over_sampling import SMOTE

# 1. Instantiate SMOTE with a random state for reproducibility
smote = SMOTE(random_state=42)

# 2. Apply SMOTE to the training data
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("SMOTE oversampling applied successfully.")

# 3. Print the shapes of the new training sets
print(f"Shape of X_train_smote: {X_train_smote.shape}")
print(f"Shape of y_train_smote: {y_train_smote.shape}")

# 4. Print the value counts of y_train_smote to confirm class balancing
print("\nValue counts of y_train_smote after SMOTE:")
print(y_train_smote.value_counts())

SMOTE oversampling applied successfully.
Shape of X_train_smote: (7846, 16)
Shape of y_train_smote: (7846,)

Value counts of y_train_smote after SMOTE:
PotentialFraud
0    3923
1    3923
Name: count, dtype: int64
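
The core idea behind SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below shows that single step in plain Python; it is a simplified illustration, not the `imblearn` implementation, and `smote_like_sample` and the 2-D points are hypothetical.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a near neighbour."""
    rng = rng or random.Random(42)
    base = rng.choice(minority)
    # k nearest minority neighbours of the chosen point (excluding itself).
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    synthetic = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
    return base, neighbour, synthetic

# Hypothetical 2-D minority-class points (illustration only).
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (5.0, 5.0)]
base, neighbour, synthetic = smote_like_sample(minority_points)
print(base, neighbour, synthetic)
```

Every synthetic point lies on the line segment between two real minority points, which is why SMOTE adds variety instead of exact duplicates.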


## Implement Random Under-sampling Strategy

### Subtask:
Apply Random Under-sampling to the training data to reduce the number of samples in the majority class. This balances the class distribution by removing samples from the over-represented class. This step uses `RandomUnderSampler` from `imblearn`.


**Reasoning**:
To apply Random Under-sampling, I will import the `RandomUnderSampler` class from `imblearn.under_sampling`, instantiate it with `random_state=42`, and then use the `fit_resample` method on `X_train` and `y_train` to generate the under-sampled datasets. Finally, I will print the shapes and value counts to verify the successful under-sampling and class balancing.

In [15]:
from imblearn.under_sampling import RandomUnderSampler

# 1. Instantiate RandomUnderSampler with a random state for reproducibility
rus = RandomUnderSampler(random_state=42)

# 2. Apply Random Under-sampling to the training data
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("Random Under-sampling applied successfully.")

# 3. Print the shapes of the new training sets
print(f"Shape of X_train_rus: {X_train_rus.shape}")
print(f"Shape of y_train_rus: {y_train_rus.shape}")

# 4. Print the value counts of y_train_rus to confirm class balancing
print("\nValue counts of y_train_rus after Random Under-sampling:")
print(y_train_rus.value_counts())

Random Under-sampling applied successfully.
Shape of X_train_rus: (810, 16)
Shape of y_train_rus: (810,)

Value counts of y_train_rus after Random Under-sampling:
PotentialFraud
0    405
1    405
Name: count, dtype: int64
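
As a rough sketch of what random under-sampling does, the plain-Python function below cuts every class down to the size of the smallest one. It is not the `imblearn` implementation; `random_undersample` is a hypothetical helper, and the label list simply mirrors the class counts of `y_train` in this notebook.

```python
import random

def random_undersample(labels, seed=42):
    """Return sorted row indices with every class cut to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

# Same class counts as y_train in this notebook: 3923 non-fraud, 405 fraud.
labels = [0] * 3923 + [1] * 405
kept_idx = random_undersample(labels)
kept_labels = [labels[i] for i in kept_idx]
print(len(kept_idx), kept_labels.count(0), kept_labels.count(1))
```

The result keeps 810 rows, so roughly 90% of the majority-class rows are discarded, which is the information loss this strategy trades for balance.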


## Final Task

### Subtask:
Provide a summary of the implemented class imbalance strategies, outlining their basic principles and the scenarios where each might be preferred. Confirm that the data is ready for model training with these strategies.


## Summary:

### Data Analysis Key Findings

*   **Data Splitting**: The dataset was successfully split into training and testing sets, with `X_train` having 4328 samples and `X_test` having 1082 samples. The `stratify=y` parameter ensured that the original class proportions were maintained in both sets.
*   **Class Weights Strategy**: Class weights were calculated for the training data (`y_train`) as \{0: 0.5516186591893959, 1: 5.3432098765432094\}. This indicates that the minority class (class 1) is weighted approximately 9.7 times higher than the majority class (class 0), which is crucial for models that can incorporate these weights.
*   **SMOTE Oversampling Strategy**: After applying SMOTE to the training data, the dataset was balanced. The `X_train_smote` and `y_train_smote` now both contain 7846 samples, with class 0 and class 1 each having 3923 samples.
*   **Random Under-sampling Strategy**: Random Under-sampling balanced the training data by reducing the majority class. The `X_train_rus` and `y_train_rus` now both contain 810 samples, with class 0 and class 1 each having 405 samples.

### Insights or Next Steps

*   **Summary of Imbalance Strategies and Scenarios**:
    *   **Class Weights**: This strategy modifies the loss function during model training, penalizing misclassifications of the minority class more heavily. It's preferred when you want to preserve all original data points but adjust the model's focus. It suits tree-based models (e.g., GradientBoostingClassifier via `sample_weight` passed to `fit`) and linear models (e.g., LogisticRegression via the `class_weight` parameter).
    *   **SMOTE (Synthetic Minority Oversampling Technique)**: This method creates synthetic samples for the minority class based on its existing nearest neighbors. It's preferred when the minority class has sufficient patterns to learn from, and you want to increase its representation without simply duplicating existing data. This is generally a good first choice for oversampling.
    *   **Random Under-sampling**: This technique removes samples from the majority class to balance the dataset. It's preferred when the dataset is very large, and reducing its size is acceptable to improve training efficiency, or when the majority class contains redundant or noisy samples. However, it risks discarding potentially useful information from the majority class.
*   The data is now prepared with various class imbalance strategies applied to the training set. This readiness allows for the training of machine learning models using these balanced datasets or class weights, enabling a comprehensive evaluation of their impact on model performance in detecting potential fraud. The next logical step would be to train different models using these prepared datasets and compare their performance, especially focusing on metrics relevant to imbalanced data like precision, recall, F1-score, and AUC for the minority class.


# Task 3: Train three machine learning models
*   A baseline model (Logistic Regression)
*   A comparison model (Random Forest)
*   A primary model (Gradient Boosting, e.g., XGBoost)

For each model, we train it on the provider-level dataset, apply the chosen class imbalance strategy, optionally tune hyperparameters, and save the model for Member 4 to evaluate.

##  Model 1: Logistic Regression (Baseline Model)

Logistic Regression is used as our baseline model because it is simple, interpretable, and gives us a reference point for more advanced models. It helps us understand whether nonlinear models provide meaningful improvements over linear decision boundaries.

In [16]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(class_weight='balanced', max_iter=1000)
log_reg.fit(X_train, y_train)

# Save the model
import joblib
joblib.dump(log_reg, "model_logreg.pkl")


['model_logreg.pkl']

## Model 2: Random Forest (Comparison Model)
Random Forest is chosen as a comparison model because it handles nonlinear relationships and is robust to noise. It performs well on tabular data like healthcare claims and provides useful feature importance insights.

In [17]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    class_weight='balanced',
    random_state=42
)
rf.fit(X_train, y_train)

joblib.dump(rf, "model_rf.pkl")


['model_rf.pkl']

## Model 3: Gradient Boosting (Primary Model)
Gradient Boosting is selected as the primary model because boosting methods are widely used for fraud detection. They capture complex, nonlinear patterns, can account for class imbalance through a positive-class weight (`scale_pos_weight` in XGBoost), and often outperform simpler algorithms on structured datasets.

In [18]:
# Compute scale_pos_weight for XGBoost
# Formula = (# of negative class) / (# of positive class)

neg = sum(y_train == 0)
pos = sum(y_train == 1)

class_weight_ratio = neg / pos
class_weight_ratio


9.686419753086419
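
The ratio above is the same quantity as the ratio of the two `'balanced'` class weights computed earlier. The short check below uses the class counts reported in this notebook; the variable names are chosen for illustration.

```python
neg, pos = 3923, 405  # class counts in y_train from this notebook
scale_pos_weight = neg / pos

# Balanced class weights: n_samples / (n_classes * class_count)
n_samples = neg + pos
w_neg = n_samples / (2 * neg)
w_pos = n_samples / (2 * pos)

# The ratio of the two balanced weights is the same quantity.
print(scale_pos_weight, w_pos / w_neg)
```

So `scale_pos_weight=neg/pos` gives the positive class the same relative weight as `class_weight='balanced'` does in the scikit-learn models.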

In [19]:
from xgboost import XGBClassifier

xgb = XGBClassifier(
    eval_metric='logloss',
    scale_pos_weight=class_weight_ratio,
    learning_rate=0.05,
    n_estimators=300,
    max_depth=5
)

xgb.fit(X_train, y_train)

joblib.dump(xgb, "model_xgb.pkl")


['model_xgb.pkl']

# Task
Tune the Logistic Regression model using `GridSearchCV` with Stratified K-Fold cross-validation and a defined parameter grid, ensuring `class_weight='balanced'` is maintained, and report the best parameters found.

## Tune Logistic Regression (Baseline Model)

### Subtask:
Define a parameter grid for `LogisticRegression` and use `GridSearchCV` to find the optimal hyperparameters. Stratified K-Fold cross-validation will be used, and `class_weight='balanced'` will be maintained to address class imbalance. The best parameters found will be reported and the tuned model will be saved.


**Reasoning**:
To begin tuning the Logistic Regression model, I will import the necessary libraries, define a parameter grid for hyperparameters, set up `StratifiedKFold` for cross-validation, and then initialize `GridSearchCV` to find the best parameters, fitting it to the training data and finally printing the optimal parameters.



In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import joblib

# 1. Define a parameter grid for LogisticRegression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'] # 'liblinear' works well for small datasets and L1/L2 regularization, 'saga' for larger ones and supports all penalties
}

# 2. Initialize StratifiedKFold for cross-validation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Initialize GridSearchCV
# Ensure class_weight='balanced' to handle imbalance and max_iter for convergence
grid_search_log_reg = GridSearchCV(
    estimator=LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    param_grid=param_grid,
    cv=stratified_kfold,
    scoring='roc_auc', # ROC AUC is suitable for imbalanced classification
    verbose=1,
    n_jobs=-1 # Use all available cores
)

# 4. Fit GridSearchCV to the training data
print("Starting GridSearchCV for Logistic Regression...")
grid_search_log_reg.fit(X_train, y_train)

# 5. Print the best parameters found
print("GridSearchCV completed.")
print(f"Best parameters for Logistic Regression: {grid_search_log_reg.best_params_}")

# Save the best model
best_log_reg_model = grid_search_log_reg.best_estimator_
joblib.dump(best_log_reg_model, "tuned_model_logreg.pkl")
print("Tuned Logistic Regression model saved to tuned_model_logreg.pkl")


Starting GridSearchCV for Logistic Regression...
Fitting 5 folds for each of 10 candidates, totalling 50 fits




GridSearchCV completed.
Best parameters for Logistic Regression: {'C': 0.01, 'solver': 'saga'}
Tuned Logistic Regression model saved to tuned_model_logreg.pkl


**Reasoning**:
The previous `GridSearchCV` for `LogisticRegression` resulted in a `ConvergenceWarning` because the `max_iter` was reached. To resolve this, I will increase the `max_iter` parameter in the `LogisticRegression` estimator to allow the model more iterations to converge.



In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import joblib

# 1. Define a parameter grid for LogisticRegression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'] # 'liblinear' works well for small datasets and L1/L2 regularization, 'saga' for larger ones and supports all penalties
}

# 2. Initialize StratifiedKFold for cross-validation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Initialize GridSearchCV
# Ensure class_weight='balanced' to handle imbalance and max_iter for convergence
grid_search_log_reg = GridSearchCV(
    estimator=LogisticRegression(class_weight='balanced', max_iter=5000, random_state=42), # Increased max_iter to 5000
    param_grid=param_grid,
    cv=stratified_kfold,
    scoring='roc_auc', # ROC AUC is suitable for imbalanced classification
    verbose=1,
    n_jobs=-1 # Use all available cores
)

# 4. Fit GridSearchCV to the training data
print("Starting GridSearchCV for Logistic Regression...")
grid_search_log_reg.fit(X_train, y_train)

# 5. Print the best parameters found
print("GridSearchCV completed.")
print(f"Best parameters for Logistic Regression: {grid_search_log_reg.best_params_}")

# Save the best model
best_log_reg_model = grid_search_log_reg.best_estimator_
joblib.dump(best_log_reg_model, "tuned_model_logreg.pkl")
print("Tuned Logistic Regression model saved to tuned_model_logreg.pkl")


Starting GridSearchCV for Logistic Regression...
Fitting 5 folds for each of 10 candidates, totalling 50 fits




GridSearchCV completed.
Best parameters for Logistic Regression: {'C': 0.01, 'solver': 'saga'}
Tuned Logistic Regression model saved to tuned_model_logreg.pkl


**Reasoning**:
Following the tuning of the Logistic Regression model, the next step is to tune the Random Forest classifier, which is designated as a comparison model. This will involve defining a parameter grid for Random Forest, using GridSearchCV with Stratified K-Fold cross-validation to find the optimal hyperparameters, and saving the best-performing model.



In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import joblib

# 1. Define a parameter grid for RandomForestClassifier
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 2. Initialize StratifiedKFold for cross-validation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Initialize GridSearchCV
# Ensure class_weight='balanced' to handle imbalance
grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid=param_grid_rf,
    cv=stratified_kfold,
    scoring='roc_auc', # ROC AUC is suitable for imbalanced classification
    verbose=1,
    n_jobs=-1 # Use all available cores
)

# 4. Fit GridSearchCV to the training data
print("Starting GridSearchCV for Random Forest...")
grid_search_rf.fit(X_train, y_train)

# 5. Print the best parameters found
print("GridSearchCV completed.")
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")

# Save the best model
best_rf_model = grid_search_rf.best_estimator_
joblib.dump(best_rf_model, "tuned_model_rf.pkl")
print("Tuned Random Forest model saved to tuned_model_rf.pkl")

Starting GridSearchCV for Random Forest...
Fitting 5 folds for each of 81 candidates, totalling 405 fits
GridSearchCV completed.
Best parameters for Random Forest: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 300}
Tuned Random Forest model saved to tuned_model_rf.pkl


**Reasoning**:
Now that the Logistic Regression and Random Forest models have been tuned, the final model to tune is the Gradient Boosting Classifier (XGBoost). I will define a parameter grid for XGBoost, apply `GridSearchCV` with `StratifiedKFold` cross-validation, and ensure the `scale_pos_weight` is incorporated to handle class imbalance. Finally, the best parameters and the tuned model will be saved.



In [23]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import joblib

# `class_weight_ratio` (neg / pos on y_train) was computed in an earlier cell
# and is reused here from the kernel state.

# 1. Define a parameter grid for XGBClassifier
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0]
}

# 2. Initialize StratifiedKFold for cross-validation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Initialize GridSearchCV
# Ensure scale_pos_weight is passed to the estimator to handle imbalance
grid_search_xgb = GridSearchCV(
    estimator=XGBClassifier(
        eval_metric='logloss',
        use_label_encoder=False, # Suppress warning
        random_state=42,
        scale_pos_weight=class_weight_ratio # Use the calculated ratio for imbalance
    ),
    param_grid=param_grid_xgb,
    cv=stratified_kfold,
    scoring='roc_auc', # ROC AUC is suitable for imbalanced classification
    verbose=1,
    n_jobs=-1 # Use all available cores
)

# 4. Fit GridSearchCV to the training data
print("Starting GridSearchCV for XGBoost...")
grid_search_xgb.fit(X_train, y_train)

# 5. Print the best parameters found
print("GridSearchCV completed.")
print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")

# Save the best model
best_xgb_model = grid_search_xgb.best_estimator_
joblib.dump(best_xgb_model, "tuned_model_xgb.pkl")
print("Tuned XGBoost model saved to tuned_model_xgb.pkl")

Starting GridSearchCV for XGBoost...
Fitting 5 folds for each of 81 candidates, totalling 405 fits


Parameters: { "use_label_encoder" } are not used.


GridSearchCV completed.
Best parameters for XGBoost: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 300, 'subsample': 0.6}
Tuned XGBoost model saved to tuned_model_xgb.pkl


**Reasoning**:
The previous XGBoost `GridSearchCV` run printed a `Parameters: { "use_label_encoder" } are not used.` warning for every fit, because the installed XGBoost version no longer uses the `use_label_encoder` parameter. To address this, I will remove the `use_label_encoder=False` argument from the `XGBClassifier` initialization.



In [24]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import joblib

# `class_weight_ratio` (neg / pos on y_train) was computed in an earlier cell
# and is reused here from the kernel state.

# 1. Define a parameter grid for XGBClassifier
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0]
}

# 2. Initialize StratifiedKFold for cross-validation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Initialize GridSearchCV
# Ensure scale_pos_weight is passed to the estimator to handle imbalance
grid_search_xgb = GridSearchCV(
    estimator=XGBClassifier(
        eval_metric='logloss',
        random_state=42,
        scale_pos_weight=class_weight_ratio # Use the calculated ratio for imbalance
    ),
    param_grid=param_grid_xgb,
    cv=stratified_kfold,
    scoring='roc_auc', # ROC AUC is suitable for imbalanced classification
    verbose=1,
    n_jobs=-1 # Use all available cores
)

# 4. Fit GridSearchCV to the training data
print("Starting GridSearchCV for XGBoost...")
grid_search_xgb.fit(X_train, y_train)

# 5. Print the best parameters found
print("GridSearchCV completed.")
print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")

# Save the best model
best_xgb_model = grid_search_xgb.best_estimator_
joblib.dump(best_xgb_model, "tuned_model_xgb.pkl")
print("Tuned XGBoost model saved to tuned_model_xgb.pkl")

Starting GridSearchCV for XGBoost...
Fitting 5 folds for each of 81 candidates, totalling 405 fits
GridSearchCV completed.
Best parameters for XGBoost: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 300, 'subsample': 0.6}
Tuned XGBoost model saved to tuned_model_xgb.pkl
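
The "candidates" and "fits" counts in the search logs follow directly from the grid sizes. As a quick check, the sketch below counts the combinations with `itertools.product`; the grids are copied from the tuning cells, while `n_candidates` is a hypothetical helper.

```python
from itertools import product

def n_candidates(grid):
    """Number of parameter combinations in an exhaustive grid."""
    return len(list(product(*grid.values())))

# Grids copied from the tuning cells above.
grids = {
    "logreg": {"C": [0.01, 0.1, 1, 10, 100], "solver": ["liblinear", "saga"]},
    "rf": {
        "n_estimators": [100, 200, 300],
        "max_depth": [10, 20, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "xgb": {
        "n_estimators": [100, 200, 300],
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [3, 5, 7],
        "subsample": [0.6, 0.8, 1.0],
    },
}

n_folds = 5
fits = {name: n_candidates(grid) * n_folds for name, grid in grids.items()}
print(fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set, which the logged totals do not include.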


## Final Task

### Subtask:
Summarize the hyperparameter tuning process for all three models, including the chosen search methods, parameter grids, best parameters, and the justification for these choices. Confirm that the tuned models are saved and ready for evaluation.


## Summary:

### Q&A
The Logistic Regression model was successfully tuned using `GridSearchCV` with `StratifiedKFold` cross-validation. `class_weight='balanced'` was maintained to address class imbalance. The best parameters found were `{'C': 0.01, 'solver': 'saga'}`.

### Data Analysis Key Findings
*   **Logistic Regression Tuning:** `GridSearchCV` was performed with `max_iter` initially set to 1000 and then increased to 5000. Despite increasing `max_iter`, a `ConvergenceWarning` persisted. The best parameters identified were `{'C': 0.01, 'solver': 'saga'}`. `class_weight='balanced'` was successfully applied.
*   **Random Forest Tuning:** `GridSearchCV` successfully found the best parameters for the `RandomForestClassifier` as `{'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 300}`. `class_weight='balanced'` was used to handle class imbalance.
*   **XGBoost Tuning:** `GridSearchCV` identified the best parameters for `XGBClassifier` as `{'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 300, 'subsample': 0.6}`. The `scale_pos_weight` argument was correctly utilized for class imbalance. An initial `UserWarning` related to `use_label_encoder` was resolved by removing the deprecated parameter.
*   **Model Saving:** All three tuned models (Logistic Regression, Random Forest, and XGBoost) were successfully saved as `tuned_model_logreg.pkl`, `tuned_model_rf.pkl`, and `tuned_model_xgb.pkl` respectively.

### Insights or Next Steps
*   Further investigation into the `ConvergenceWarning` for Logistic Regression is warranted, potentially by exploring different solvers, increasing `max_iter` further, or scaling features, as it could indicate an unstable or suboptimal model.
*   The tuned models are now ready for comprehensive evaluation on a separate test set to compare their performance, especially focusing on metrics suitable for imbalanced datasets like ROC AUC, precision, recall, and F1-score.
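
One way to follow up on the `ConvergenceWarning` is to standardize the features inside the cross-validation loop, since `saga` tends to converge slowly on unscaled inputs. The sketch below is a minimal illustration on synthetic data, not the notebook's actual pipeline: `X_demo`, `y_demo`, the column scales, and the `C` grid are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the provider features (assumption: ~10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)
# Mimic columns on very different scales, as aggregated claim features often are.
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

# Scaling lives inside the pipeline so each CV fold is scaled using
# statistics from its own training portion only (no leakage).
scaled_logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="saga",
                               max_iter=5000)),
])

param_grid_demo = {"clf__C": [0.01, 0.1, 1.0]}  # hypothetical grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(scaled_logreg, param_grid_demo, scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Saving `search.best_estimator_` would then persist the scaler together with the classifier, so the evaluation notebook would not need to repeat the scaling step.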


In [25]:
import joblib

# Save Logistic Regression
joblib.dump(log_reg, "model_logreg.pkl")

# Save Random Forest
joblib.dump(rf, "model_rf.pkl")

# Save XGBoost / Gradient Boosting Model
joblib.dump(xgb, "model_xgb.pkl")

print("All models saved successfully!")


All models saved successfully!


In this task, we save the three trained models so that Member 4 can load them in the evaluation notebook. Saving the fitted models keeps the evaluation reproducible and separates the modeling and evaluation steps.
The models are saved with `joblib` as `.pkl` files, separate from the `tuned_model_*.pkl` files written during hyperparameter tuning:
*   `model_logreg.pkl`
*   `model_rf.pkl`
*   `model_xgb.pkl`
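
For the evaluation notebook, loading a saved model is the mirror image of saving it. The sketch below shows the round trip on a small stand-in model; the synthetic data and the temporary directory are assumptions for illustration, and only the file name `model_logreg.pkl` comes from this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on synthetic data (illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_demo, y_demo)

# Save to a temporary directory, then load the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_logreg.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), loaded.predict(X_demo))
print(same_predictions)
```

Loading is most reliable when the evaluation environment uses the same scikit-learn and XGBoost versions as the one that wrote the files.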

# Class Imbalance Handling Strategy
The dataset is imbalanced: fraudulent providers make up only about 10% of the total. A standard classifier trained on this distribution tends to favor the majority class (“Not Fraud”) and neglect the minority class, which yields high accuracy but poor fraud detection. To address this, I used a class weighting strategy.
For Logistic Regression and Random Forest, I applied `class_weight='balanced'`, which weights each class in inverse proportion to its frequency. For XGBoost, I used the `scale_pos_weight` parameter, calculated as the ratio of negative to positive samples in the training set. Both settings increase the penalty for misclassifying fraud cases.
This strategy leaves the training data unchanged while keeping the model sensitive to fraudulent behavior. Unlike oversampling, which either duplicates minority samples (random oversampling) or synthesizes new ones (SMOTE), class weighting adds no artificial rows and so avoids the risk of overfitting to repeated or synthetic examples. Overall, this approach is well-suited to fraud detection tasks where the minority class is rare but critical.
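
As a rough illustration of the two weighting schemes described above, the sketch below computes both on a toy label vector with a 10% positive rate. `y_train_demo` is a made-up stand-in for the real `y_train`; scikit-learn's "balanced" weights follow `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 non-fraud (0) and 10 fraud (1) providers.
y_train_demo = np.array([0] * 90 + [1] * 10)

# What class_weight='balanced' computes internally:
# n_samples / (n_classes * count_of_class)
classes = np.unique(y_train_demo)
balanced = compute_class_weight(class_weight="balanced",
                                classes=classes, y=y_train_demo)
class_weights = dict(zip(classes.tolist(), balanced.tolist()))

# What is passed to XGBoost: ratio of negative to positive samples.
n_neg = int((y_train_demo == 0).sum())
n_pos = int((y_train_demo == 1).sum())
scale_pos_weight = n_neg / n_pos

print(class_weights)
print(scale_pos_weight)
```

On this toy vector the ratio between the two balanced weights equals `scale_pos_weight`, so the two schemes apply the same relative emphasis to the fraud class.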


# Model Selection Rationale
To satisfy the project requirement of building a primary model and two comparison models, I trained three different classifiers with complementary strengths. Logistic Regression is used as a baseline model due to its simplicity, interpretability, and ability to provide a linear decision boundary. This helps establish a reference point and allows us to assess whether more complex models provide meaningful improvements.
The Random Forest classifier serves as the first comparison model. Random Forests handle nonlinear relationships, work well with structured tabular data, and are fairly robust to noise and overfitting. They also provide feature importance scores, making them useful for understanding patterns in provider behavior.
Finally, the primary model is a Gradient Boosting model (XGBoost). Boosting builds a strong learner by sequentially correcting the errors of prior models, which often makes it effective for fraud detection. XGBoost also supports class weighting directly through `scale_pos_weight`, handles nonlinearities, and often performs very well on tabular datasets. For these reasons, XGBoost is used as the main model for this project.
