# Fraud Detection using XGBoost

This notebook implements an XGBoost model for detecting fraudulent ad clicks using a preprocessed dataset. It includes loading data, handling class imbalance, training the model, and saving it for future use.


## 1. Importing Libraries
This section imports required libraries:
- `numpy`, `pandas`: For data manipulation.
- `xgboost`: To train the classification model.
- `matplotlib`: For visualizing feature importance.
- `train_test_split`: For splitting the dataset into training and validation sets.


In [13]:
# Import required libraries
import numpy as np
import pandas as pd
import gc
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from xgboost import plot_importance
# Assuming the training_preprocess.py file is in the same directory
# from training_preprocess import dataset

## 2. Loading Preprocessed Data
The preprocessed dataset is loaded using a custom function (`train_to_df`), and the first few rows are displayed to confirm proper loading.


In [None]:
# using the preprocessed data
from data_reprocessing import train_to_df
train_file = 'new_train.csv'  
dataset = train_to_df(train_file)
# Display the first few rows of the dataset to verify the preprocessing
dataset.head()

## 3. Analyzing Data Distribution
Fraudulent and non-fraudulent clicks are counted, and their percentages are calculated to understand the class imbalance in the dataset.


In [None]:
# Count the number of fraudulent and non-fraudulent clicks
fraud_counts = dataset['is_attributed'].value_counts()
fraud_percentage = fraud_counts[0] / len(dataset) * 100
non_fraud_percentage = fraud_counts[1] / len(dataset) * 100

print(f"Fraudulent clicks: {fraud_counts[0]} ({fraud_percentage:.2f}%)")
print(f"Non-fraudulent clicks: {fraud_counts[1]} ({non_fraud_percentage:.2f}%)")

## 4. Splitting Data into Features and Target
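- `X`: Contains all features except the target column (`is_attributed`).
- `y`: The target variable. Consistent with the code in this notebook, `is_attributed = 1` marks a real (attributed) click, which is the minority class, and `is_attributed = 0` marks a click treated as fraudulent.

The data is split into training (80%) and validation (20%) sets.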


In [None]:
# Split the dataset into features (X) and target (y)
X = dataset.drop(columns=['is_attributed'])  # Features
y = dataset['is_attributed']  # Target variable

# Split the data into training and validation sets (80% training, 20% validation)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of training and validation sets
print("Training data shape: ", X_train.shape)
print("Validation data shape: ", X_valid.shape)
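
With imbalanced classes, a purely random split can leave the validation set with a different share of positives than the training set. The sketch below shows a stratified split on synthetic data standing in for `dataset`. The `stratify` argument is an addition that the cell above does not use, and the 10% positive rate is an assumption chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 10% positives (assumed rate)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Valid positive rate:", y_va.mean())
```

Whether stratification changes results on the real dataset depends on its size; on very large datasets a random split usually lands close to the same ratio anyway.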

## 5. Defining XGBoost Model Parameters
The XGBoost parameters are configured for efficient training, class imbalance handling, and generalization. Key parameters:
- `eta`: Controls learning rate.
- `scale_pos_weight`: Scales the weight of the positive class to compensate for class imbalance.
- `max_depth`: Limits the depth of trees to prevent overfitting.


In [5]:
# Define the parameters for the XGBoost model
params = {
    'eta': 0.3,  # Learning rate; faster training, potential overfitting risk
    'tree_method': 'hist',  # Faster tree-building for large datasets
    'max_depth': 6,  # Limits tree depth to prevent overfitting
    'subsample': 0.9,  # Uses 90% of data per tree to improve generalization
    'colsample_bytree': 0.7,  # Uses 70% of features to reduce overfitting
    'objective': 'binary:logistic',  # Binary classification (fraud detection)
    'eval_metric': 'auc',  # AUC metric for imbalanced data performance
    'scale_pos_weight': 9,  # Increases weight for minority class (real clicks)
    'nthread': 8,  # Parallel training for faster computation
    'random_state': 42  # Ensures reproducibility of results
}
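
A typical starting value for `scale_pos_weight`, suggested in the XGBoost parameter documentation, is the number of negative examples divided by the number of positive examples. The sketch below computes that ratio from a small made-up label series. The `labels` values are illustrative only, and this notebook does not show whether the hard-coded `9` matches the ratio in the real data.

```python
import pandas as pd

# Illustrative labels only: 9 negatives (0) for every positive (1)
labels = pd.Series([0] * 90 + [1] * 10, name="is_attributed")

counts = labels.value_counts()
n_negative = counts.get(0, 0)
n_positive = counts.get(1, 0)

# Ratio of negatives to positives; fall back to 1.0 if no positives exist
scale_pos_weight = n_negative / n_positive if n_positive else 1.0
print(f"scale_pos_weight = {scale_pos_weight:.2f}")  # prints "scale_pos_weight = 9.00"
```

Computing the value from `y_train` rather than hard-coding it keeps the parameter in sync if the dataset changes.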


## 6. Training the XGBoost Model
The model is trained with early stopping to prevent overfitting. Training and validation datasets are converted to DMatrix format for optimized computation. The best iteration is printed.

After training, feature importance is plotted to identify the most significant predictors contributing to the classification task.


In [None]:
# Convert the training and validation datasets to DMatrix format (optimized for XGBoost)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# Train the model with early stopping
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
model = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist, early_stopping_rounds=10, verbose_eval=10)

# Display the best iteration
print(f"Best iteration: {model.best_iteration}")

# Plot feature importance to see the most significant features
plot_importance(model)
plt.show()

## 7. Making Predictions and Saving the Model
- Predictions are generated on the validation set.
- The trained model is saved as `xgboost_model.json`.
- The first few predictions are displayed for verification.


In [None]:
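# Predict on the validation set.
# `ntree_limit` and `best_ntree_limit` were deprecated in XGBoost 1.4 and
# removed in later releases; `iteration_range` selects the trees instead.
if hasattr(model, 'best_iteration'):
    # Use trees up to and including the best iteration found by early stopping
    y_pred = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
else:
    # Use all trees if early stopping did not record a best iteration
    y_pred = model.predict(dvalid)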

# Save the trained model to a file
model.save_model('xgboost_model.json')

# Display the first few predictions
print(y_pred[:10])
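
The cell above prints raw probabilities but does not score them. As a hedged sketch, the validation AUC could be computed with scikit-learn's `roc_auc_score`. The toy arrays below stand in for `y_valid` and `y_pred`, and the 0.5 threshold is an arbitrary assumption rather than a tuned value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy stand-ins for y_valid (true labels) and y_pred (predicted probabilities)
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is computed from the probabilities, not from thresholded labels
auc = roc_auc_score(y_true_demo, y_score_demo)
print(f"Validation AUC: {auc:.2f}")  # prints "Validation AUC: 0.75"

# Hard labels require a threshold; 0.5 is an assumed default
y_label_demo = (y_score_demo >= 0.5).astype(int)
print(y_label_demo)  # prints "[0 0 0 1]"
```

In the notebook itself, the equivalent call would be `roc_auc_score(y_valid, y_pred)`.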


## Conclusion
This notebook loads the preprocessed click data, trains an XGBoost classifier with class weighting and early stopping, and saves the result as `xgboost_model.json`. The saved model can be used for deployment or further testing.
