# Fraud Detection with XGBoost

**What is XGBoost?**

* eXtreme Gradient Boosting (XGBoost) is a gradient-boosted decision tree (GBDT) machine learning library and one of the most popular supervised machine learning algorithms. It is typically used to solve classification and regression problems.
* XGBoost is an _ensemble learning algorithm_. "Boosting" refers to combining multiple weak models so that together they form a stronger one; "gradient" refers to fitting each new model to the gradient of the loss, which for squared error is simply the residual. The process starts by training a first decision tree on the training data, then iteratively adds shallow decision trees, each fitted to the residuals left by the ensemble so far. The final prediction is a weighted sum of all the tree predictions.
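
The residual-fitting loop described above can be sketched in a few lines. This is a simplified illustration for squared-error regression using scikit-learn trees, not XGBoost's actual implementation: the synthetic data, `learning_rate`, `n_trees`, and the constant starting prediction are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # weight given to each tree's contribution
n_trees = 50

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # shallow "weak" learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add the weighted tree output
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after {n_trees} trees: {final_mse:.4f}")
```

Each tree on its own is a poor model, but the weighted sum of all of them tracks the target much more closely than the starting constant.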


## Dataset Overview

This project uses the IEEE-CIS Fraud Detection dataset downloaded from Kaggle. Below are the column descriptions posted by the original competition host: https://www.kaggle.com/competitions/ieee-fraud-detection/discussion/101203

* TransactionDT: timedelta from a given reference datetime
* TransactionAMT: transaction amount in USD
* ProductCD: type of payment product used for transaction
* card1 - card6: payment card information, such as card issuer, type, etc.
* addr1 - addr2: address
* dist1 - dist2: distance (e.g. between billing address and mailing address)
* P_emaildomain and R_emaildomain: purchaser and recipient email domain
* C1-C14: count
* D1-D15: timedelta
* M1-M9: match, such as names on card and address, etc.
* Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

### Step 0: Load the dataset

In [None]:
import pandas as pd


# Load the dataset
file_path = r''
df = pd.read_csv(file_path)


In [None]:
# Check the number of rows and columns
df.shape

### Dataset Description
Using df.head() we can observe the first few rows of the dataset. 

In [None]:
# Set maximum number of columns to display
pd.set_option("display.max_columns", 50)

# Explore the first few rows
df.head()

In [None]:
# Set TransactionID as the index, since it uniquely identifies each row
df.set_index('TransactionID', inplace=True)

### Step 1: Understand the structure of the dataset

Before cleaning the data, it is important to first get an idea of what is included in the dataset and what is missing from it. Because TransactionID is now the index, we should expect df.shape to report 393 columns instead of 394 this time.

In [None]:
# Check the number of rows and columns
df.shape
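
The drop in column count after moving a column into the index can be checked on a small toy frame. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    "TransactionID": [1001, 1002, 1003],
    "TransactionAmt": [59.0, 29.0, 117.5],
    "isFraud": [0, 0, 1],
})

cols_before = toy.shape[1]            # shape is an attribute, not a method
toy = toy.set_index("TransactionID")  # the ID column moves into the index
cols_after = toy.shape[1]

print(cols_before, cols_after)
```

The index is not counted as a column, which is why the second number is one lower.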

Using df.dtypes.value_counts(), we assess the data type distribution.

In [None]:
df.dtypes.value_counts() # count the number of types in the dataset

To get a quick overview of the data, including non-null counts, types, and memory usage, we can use:

In [None]:
# Summary of the dataset
df.info()

### Step 2: Data Cleaning and Preprocessing

Before training any model, cleaning and preprocessing data ensures consistency and efficiency in downstream tasks. Here, we will handle missing values, reduce memory usage, and convert data types.

We will first identify columns with significant missingness using our custom check_missing function. 

In [None]:
# Total number of rows
total_rows = df.shape[0]
total_rows

In [None]:
def check_missing(df, dtype='object'): # df is the dataframe; dtype selects which column types to check (default: 'object')
    total_rows = df.shape[0]
    # Check missing values and their percentage for specified datatype
    missing_ = df.select_dtypes(include=dtype).isnull().sum()
    missing_percentage = (missing_ / total_rows) * 100

    # Combine into a df for better readability
    missing_summary = pd.DataFrame({
        'Missing Values' : missing_,
        'Missing Percentage (%)' : missing_percentage})

    # Filter out columns without missing values
    missing_summary = missing_summary[missing_summary['Missing Values'] > 0]

    # Sort by missing percentage
    return missing_summary.sort_values(by='Missing Percentage (%)', ascending=False)

In [None]:
# For numeric features
check_missing(df, dtype='number') # 'number' covers both float and int columns, i.e. all numerical data

In [None]:
# For categorical features
check_missing(df, dtype='object')

### 2.1 Handle Missing Values

To address missing data robustly: for numerical features, we will use median imputation to fill $\text{NaN}$ values, as the median is less sensitive to outliers than the mean. For categorical features, we will apply mode imputation to replace $\text{NaN}$ values with the most frequently occurring category.
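
For numerical features, median imputation might look like the sketch below. It runs on a toy frame with invented values; the column names are hypothetical. On the real dataframe the same two lines would apply to `df`, and the target column would need to be excluded if it were among the numeric columns with missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values and hypothetical column names
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 1000.0],  # 1000.0 acts as an outlier
    "count": [1.0, 2.0, np.nan, 3.0],
    "domain": ["a.com", None, "b.com", "a.com"],
})

# Fill each numeric column with its own median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

print(toy)
```

The missing `amount` is filled with 30.0 rather than a mean pulled upward by the outlier, and the non-numeric column is left untouched for the mode imputation step.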


For categorical features:

In [None]:
# Fill categorical columns with most frequent value (mode)
cat_cols = df.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
  mode = df[col].mode(dropna=True)
  if not mode.empty:
    df.fillna({col:mode[0]}, inplace=True)

**Confirm the Result**

In [None]:
df.isnull().sum()

### 2.2 Handle Highly Correlated Features

In statistics, correlation is a term that indicates the degree to which two variables move in relation to each other. Highly correlated features are variables that have a strong linear relationship to each other. In other words, if two features are highly correlated, it is likely that they carry similar information. 

Handling highly correlated features is important to:

* Remove redundant features
* Reduce multicollinearity
* Improve model efficiency

**Compute correlation matrix**

To calculate a correlation matrix, we can use the corr() function from the Pandas library.

In [None]:
# Compute the correlation matrix (for numerical columns only)
corr_matrix = df.select_dtypes(include=['number']).corr()
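
To see what a highly correlated pair looks like in such a matrix, here is a small sketch on synthetic data. The feature names and noise levels are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=500)

toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + rng.normal(scale=0.05, size=500),  # near-duplicate of "a"
    "noise": rng.normal(size=500),                          # unrelated feature
})

# Pearson correlation between every pair of columns
toy_corr = toy.corr()
print(toy_corr.round(2))
```

`a` and `a_scaled` carry almost the same information, so their correlation is close to 1, while `noise` is close to 0 against both.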

In [None]:
import numpy as np

def get_high_correlations(corr_matrix, threshold=0.9):
    """Return (feature1, feature2, correlation) tuples whose absolute correlation exceeds threshold."""
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once;
    # every other cell becomes NaN and fails the comparison below
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Collect feature pairs whose absolute correlation exceeds the threshold
    high_corr = [
        (col, row, upper.loc[row, col])
        for col in upper.columns
        for row in upper.index
        if abs(upper.loc[row, col]) > threshold
    ]

    # Strongest correlations first
    return sorted(high_corr, key=lambda x: -abs(x[2]))

In [None]:
# Then call it
high_corr_pairs = get_high_correlations(corr_matrix, threshold=0.9)
for feature1, feature2, corr_value in high_corr_pairs[:15]:
  print(f"{feature1} ↔ {feature2} = {corr_value:.2f}")

**Drop Redundant Features (Optional)**

If feature1 and feature2 are highly correlated, drop one of them:

In [None]:
# Uncomment and run the code below to drop one feature from each highly correlated pair
"""
to_drop = set()
for feature1, feature2, _ in high_corr_pairs:
    if feature1 not in to_drop:
        to_drop.add(feature2)

df.drop(columns=list(to_drop), inplace=True)
"""

### 2.3 Encoding Categorical Features

Encoding categorical features means converting each category to a numerical value, because most machine learning models are designed to take only numerical data as input.

- Checking cardinality

In [None]:
# Select categorical columns
cat_cols = df.select_dtypes(include=['object']).columns

# Count unique values in each categorical column
cardinality = df[cat_cols].nunique().sort_values(ascending=False)

# Display the result
print(cardinality)

In [None]:
print(df['P_emaildomain'].dtype)

In [None]:
from sklearn.preprocessing import LabelEncoder

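# Label encoding for the high-cardinality features.
# A separate encoder per column keeps each fitted mapping available for inverse_transform.
le_p = LabelEncoder()
le_r = LabelEncoder()
df['P_emaildomain'] = le_p.fit_transform(df['P_emaildomain'])
df['R_emaildomain'] = le_r.fit_transform(df['R_emaildomain'])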

df['P_emaildomain'].values

In [None]:
print(df['P_emaildomain'].dtype)

In [None]:
# One-Hot Encoding for low-cardinality features
df = pd.get_dummies(df, columns=['ProductCD','card4', 'card6', 'M4', 'M1', 'M2', 'M3', 'M5', 'M6', 'M7', 'M8', 'M9'], drop_first=True)
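
The difference between the two encodings is easier to see on a tiny example. The sketch below uses invented values under two of the dataset's column names; it only illustrates what `LabelEncoder` and `pd.get_dummies(..., drop_first=True)` produce.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented values
toy = pd.DataFrame({
    "P_emaildomain": ["gmail.com", "yahoo.com", "gmail.com", "aol.com"],
    "card4": ["visa", "mastercard", "visa", "discover"],
})

# Label encoding: one integer per category, assigned in sorted order
encoder = LabelEncoder()
toy["P_emaildomain"] = encoder.fit_transform(toy["P_emaildomain"])

# One-hot encoding: one indicator column per category;
# drop_first=True drops the first category ("discover") as the implicit baseline
toy = pd.get_dummies(toy, columns=["card4"], drop_first=True)

print(list(encoder.classes_))
print(toy.columns.tolist())
```

Label encoding keeps a single column, which suits features with many categories, while one-hot encoding adds one column per remaining category and avoids implying an order between them.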

In [None]:
print(df.columns)

In [None]:
# Explore the first few rows again
df.head(10)

In [None]:
df.shape

**Check class imbalance**

Check the distribution of the target variable, `isFraud`.

In [None]:
# Visualize class imbalance
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the class distribution
sns.countplot(x='isFraud', data=df)
plt.title('Distribution of Fraudulent vs Non-Fraudulent Transactions')
plt.xlabel('Fraud or Not')
plt.ylabel('Count')
plt.xticks([0, 1], ['Not Fraud', 'Fraud'])
plt.show()

# Print percentage distribution
fraud_rate = df['isFraud'].value_counts(normalize=True) * 100
print(fraud_rate)

### Step 3: Model Training and Evaluation

**1: Prepare the Feature Data**

In [None]:
# Separate features and target
X = df.drop(columns=['isFraud'])
y = df['isFraud']


**2: Train-Test Split**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
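
With a heavily imbalanced target, a plain random split can leave the test set with a slightly different fraud rate than the full data. One common option, shown here as a sketch on synthetic labels rather than as part of the notebook's own pipeline, is the `stratify` argument of `train_test_split`. The 5% positive rate is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 950 negatives and 50 positives (5% positive rate, made up)
y_toy = np.array([0] * 950 + [1] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Plain random split
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Stratified split: class proportions are preserved in both parts
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positive rate, plain split:      {y_test_plain.mean():.3f}")
print(f"Positive rate, stratified split: {y_test_strat.mean():.3f}")
```

The stratified test set keeps exactly the 5% rate of the full data, while the plain split can drift from it depending on the random seed.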

**3: Model Pipeline with XGBoost**

Train the XGBoost model using the training data.

In [None]:
# Install and import the required libraries (imblearn is provided by the imbalanced-learn package)
! pip install xgboost imbalanced-learn
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier


In [None]:
# Define the Classifier
classifier = XGBClassifier(eval_metric='logloss', random_state=42)


**Create the Training Pipeline**

Now, we construct our machine learning pipeline using `ImbPipeline` from the `imblearn` library.

In [None]:
# Create training pipeline
pipeline = ImbPipeline(steps=[
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', classifier)
])
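
The `smote` step oversamples the minority class by creating synthetic points between a minority sample and one of its nearest minority neighbours, and an imblearn pipeline applies this resampling only while fitting, not when predicting. The sketch below illustrates just the interpolation idea with NumPy. It is not the imblearn implementation, and the helper `smote_like_sample`, the toy points, and `k=3` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in two dimensions (invented values)
minority = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(10, 2))

def smote_like_sample(points, k=3, rng=rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.uniform(0.0, 1.0)                  # position along the connecting segment
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority) for _ in range(20)])
print(synthetic[:3])
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.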

**Train the Model**

With our pipeline fully defined, we now train the model using the training data:

In [None]:
pipeline.fit(X_train, y_train) # This initiates a complete training workflow


### Step 4: Predict and Evaluate

Now that the model is trained, we predict on the test set.

In [None]:
y_pred = pipeline.predict(X_test)


**Evaluate the model performance**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compute the confusion matrix from the test-set predictions
cm = confusion_matrix(y_test, y_pred)

# Labels for display
labels = ["Non-Fraud", "Fraud"]

# Plot the confusion matrix
plt.figure(figsize=(8, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.tight_layout()
plt.show()


In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred))
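
The precision and recall figures in the classification report follow directly from the four cells of the confusion matrix. A minimal sketch on toy labels (invented for illustration, not model output) shows the arithmetic for the positive class:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration (1 = fraud, 0 = non-fraud)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of predicted frauds that are real frauds
recall = tp / (tp + fn)     # share of real frauds that were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision:.2f} Recall={recall:.2f}")
```

For fraud detection, recall on the fraud class is often the figure to watch, since a missed fraud (false negative) is usually costlier than a false alarm.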