# Fraud Detection Model Training with CatBoost

This notebook trains a CatBoost model to detect fraudulent ad clicks. It loads the preprocessed data, splits it into training and validation sets, trains the model with early stopping on the validation AUC, and saves it for later use.


## 1. Importing Required Libraries
Libraries used in this notebook include:
- **pandas, numpy**: For data manipulation.
- **scikit-learn**: For dataset splitting and evaluation metrics.
- **CatBoostClassifier**: To train the CatBoost model.


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from catboost import CatBoostClassifier
# import training_preprocess



## 2. Loading Preprocessed Data
The preprocessed dataset is loaded using the `train_to_df` function. The first few rows are displayed to verify the data structure.


In [None]:
# Load the preprocessed data
from data_reprocessing import train_to_df
train_file = 'new_train.csv'
dataset = train_to_df(train_file)
# Display the first few rows of the dataset to verify the preprocessing
dataset.head()

## 3. Splitting the Data
- **Features (`X`)**: All columns except the target column (`is_attributed`).
- **Target (`y`)**: The `is_attributed` column, used here as the fraud label (1: fraud, 0: non-fraud).
- **Split**: 80% training and 20% validation. Passing `stratify=y` keeps the class distribution the same in both sets, which matters when one class is rare.


In [None]:
# Split the data into features and target
X = dataset.drop('is_attributed', axis=1)
y = dataset['is_attributed']

# Split the dataset into 80% training and 20% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split into training and validation sets.")
print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
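
To see what `stratify` does, the following minimal sketch repeats the same split on a small synthetic dataset. The column names `feature_a` and `feature_b` and the roughly 2% positive rate are assumptions made for illustration; they are not taken from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced stand-in for the real dataset (illustration only).
rng = np.random.default_rng(42)
n_rows = 10_000
demo = pd.DataFrame({
    "feature_a": rng.normal(size=n_rows),
    "feature_b": rng.integers(0, 100, size=n_rows),
    "is_attributed": (rng.random(n_rows) < 0.02).astype(int),  # ~2% positives
})

X_demo = demo.drop("is_attributed", axis=1)
y_demo = demo["is_attributed"]

# Same call as in the notebook: 80/20 split, stratified on the target.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate should be nearly identical in both splits.
train_rate = y_tr.mean()
val_rate = y_va.mean()
print(f"Positive rate (train):      {train_rate:.4f}")
print(f"Positive rate (validation): {val_rate:.4f}")
```

Without `stratify`, a rare positive class can end up over- or under-represented in the validation set purely by chance, which makes the validation AUC noisier.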


## 4. Initializing CatBoost Classifier
The CatBoost model is configured with the following parameters:
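- **iterations**: Maximum number of boosting iterations (trees).
- **learning_rate**: Step size applied to each tree's contribution.
- **depth**: Depth of each tree.
- **eval_metric**: AUC (area under the ROC curve), the metric reported on the validation set and used for early stopping. It suits imbalanced classification, but it is not the loss the model is trained on.
- **random_seed**: Fixes the random seed for reproducibility.
- **verbose**: Prints training progress every 100 iterations.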


## 5. Training the CatBoost Model
The model is trained on the training data (`X_train`, `y_train`) and evaluated on the validation set (`X_val`, `y_val`) after each iteration. To limit overfitting, training stops early if the validation AUC has not improved for 50 consecutive iterations.


## 6. Saving the Trained Model
The trained CatBoost model is saved as a JSON file (`catboost_model.json`). This allows the model to be reused for future predictions or analysis.


In [None]:
# Initialize CatBoost Classifier
catboost_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    eval_metric='AUC',
    random_seed=42,
    verbose=100
)

print("Starting model training...")

# Train the CatBoost model
catboost_model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)

print("Model training completed.")

# Save the trained CatBoost model to a JSON file
catboost_model.save_model('catboost_model.json', format='json')

print("Model saved as 'catboost_model.json'.")
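
The metric functions imported in section 1 can be used to summarize validation performance. The helper below is one possible sketch; the function name `summarize_validation`, the 0.5 threshold, and the hand-made labels and scores are assumptions for illustration, not results from this model.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


def summarize_validation(y_true, y_score, threshold=0.5):
    """Return AUC, confusion matrix, and a text report for binary scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Turn probabilities into hard 0/1 predictions at the chosen threshold.
    y_pred = (y_score >= threshold).astype(int)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "report": classification_report(
            y_true, y_pred, labels=[0, 1], zero_division=0
        ),
    }


# Tiny hand-made example (illustration only): labels and predicted probabilities.
demo_labels = [0, 0, 0, 0, 1, 1]
demo_scores = [0.10, 0.20, 0.60, 0.30, 0.80, 0.40]

summary = summarize_validation(demo_labels, demo_scores)
print(f"AUC: {summary['auc']:.3f}")
print(summary["confusion_matrix"])
print(summary["report"])
```

In this notebook the scores would come from the trained model, for example `summarize_validation(y_val, catboost_model.predict_proba(X_val)[:, 1])`. With a rare positive class, the 0.5 threshold is only a starting point and may need tuning.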


## Conclusion
This notebook successfully demonstrates the training and saving of a CatBoost model for fraud detection. The saved model can be used for further evaluation or deployment.
