<a href="https://www.kaggle.com/code/drkaggle22/credit-card-fraud-detection-solution-top-40?scriptVersionId=172155688" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Exploratory Data Analysis

### 1.1 Dataset Overview:

This dataset comprises credit card transactions and is designed for fraud detection purposes. With the proliferation of digital payment systems, the incidence of fraudulent activities has become a significant concern for financial institutions and consumers alike. Hence, this dataset serves as a valuable resource for discerning transaction patterns and identifying potential indicators of fraudulent behavior, thereby bolstering fraud detection mechanisms.

**Dataset Features:**

1. **Time:** A float variable representing the timestamp of each transaction.
2. **feat1 - feat28:** Float variables denoting various features extracted from the transactions. These features may encompass transaction characteristics, customer behavior patterns, or other relevant attributes.
3. **Transaction_Amount:** A float variable indicating the monetary value of each transaction.
4. **IsFraud:** A binary float variable (0 or 1) serving as the target variable. It denotes whether a transaction is fraudulent (1) or not (0).

**Dataset Size and Completeness:**

- The dataset contains a total of `219,129` observations.
- All features from `Time` to `feat28`, and `Transaction_Amount`, have complete data with 219,129 non-null entries.
- The `IsFraud` variable has 150,000 non-null entries: these rows make up the training set, while the remaining observations constitute the testing set.

**Training and Testing Sets:**

- ***Training Set***: The training set consists of 150,000 observations, which include complete data for all features and the target variable `IsFraud`. This subset is intended for training machine learning models to detect fraudulent transactions.

- ***Testing Set***: The testing set encompasses the remaining observations, amounting to 69,129 records. Similar to the training set, it contains complete data for all features except for the 'IsFraud' variable. This subset serves as a benchmark for evaluating the performance of trained models on unseen data.

**Key Insights:**

- The dataset offers a comprehensive view of credit card transactions, providing insights into transaction timing, features, transaction amounts, and fraud labels.
- With a sizable number of observations in both the training and testing sets, it presents ample opportunities for analyzing transaction patterns and building robust fraud detection models.
- The missing values in the `IsFraud` variable correspond to the unlabeled testing rows, so during preprocessing these rows need to be separated from the training rows rather than imputed.

**Conclusion:**

This dataset serves as a valuable asset for exploring credit card transaction data and devising effective fraud detection strategies. By leveraging the insights gleaned from this dataset, financial institutions can enhance their ability to detect and prevent fraudulent activities, safeguarding the interests of both businesses and consumers.


### 1.2 Load Data

In [None]:
df_train = pd.read_csv('/kaggle/input/credit-card-fraud-prediction/train.csv', index_col="id")
df_test = pd.read_csv('/kaggle/input/credit-card-fraud-prediction/test.csv', index_col="id")
df = pd.concat([df_train, df_test])

In [None]:
df.info()

# 2. Feature Utility Scores

## 2.1 Mutual Information (MI)

`Mutual information` is a measure of the mutual dependence between two random variables. It quantifies how much knowing one variable reduces uncertainty about the other variable.

#### Pearson correlation coefficient

The `Pearson correlation coefficient`, also known as Pearson's r, measures the linear relationship between two continuous variables by calculating the covariance of the variables divided by the product of their standard deviations. It ranges from -1 to 1, where values close to 1 indicate a strong positive linear relationship, values close to -1 indicate a strong negative linear relationship, and values close to 0 indicate no linear relationship.
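
The difference between the two measures can be sketched on synthetic data (not the competition data). Here `y` depends on `x` only through `x ** 2`, so the linear correlation is close to zero while the mutual information is clearly positive. Because this toy target is continuous, the sketch uses `mutual_info_regression`; the notebook's own cell uses `mutual_info_classif` for the binary `IsFraud` target. All names and values below are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic example: y depends on x in a symmetric, purely nonlinear way
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(0.0, 0.01, size=2000)

# Pearson's r only captures the linear part of the relationship
pearson_r = np.corrcoef(x, y)[0, 1]

# Mutual information also captures the nonlinear dependence
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r: {pearson_r:.3f}")
print(f"Mutual information: {mi:.3f}")
```

This is why a feature with a near-zero correlation coefficient can still carry information about the target.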

In [None]:
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# X contains the feature columns of the training set, and y contains the target column
X = df_train.drop(columns=['IsFraud'])
y = df_train['IsFraud']

# Calculate mutual information scores (random_state is fixed so the estimates are reproducible)
mi_scores = mutual_info_classif(X, y, random_state=42)

# Create a DataFrame to store the scores along with the feature names
mi_scores_df = pd.DataFrame({'Feature': X.columns, 'Mutual_Information_Score': mi_scores})

# Sort the DataFrame by mutual information scores (descending order)
mi_scores_df = mi_scores_df.sort_values(by='Mutual_Information_Score', ascending=False).reset_index(drop=True)

# Display the DataFrame
print(mi_scores_df)


## 2.2 Correlation Matrix

Correlation measures the strength and direction of the linear relationship between two variables. It indicates how much one variable changes when the other variable changes, and the direction of this change (positive or negative).

Correlation coefficients range from -1 to 1:

- A coefficient close to 1 indicates a strong positive linear relationship.
- A coefficient close to -1 indicates a strong negative linear relationship.
- A coefficient close to 0 indicates a weak or no linear relationship.

Correlation is sensitive only to linear relationships and may not capture nonlinear dependencies between variables.

The correlation coefficients between each feature and the target variable ('IsFraud'):

- **Time**: The correlation coefficient is approximately -0.004. This suggests a very weak negative correlation between the time of the transaction and the likelihood of fraud.

- **feat1 - feat28**: These are the anonymized features derived from the transaction data. The correlation coefficients range from approximately -0.035 to 0.028. They indicate the strength and direction of the linear relationship between each feature and the target variable. For example:
  - Negative coefficients `(e.g., feat3, feat8)` suggest a negative correlation, meaning as the feature value increases, the likelihood of fraud decreases, and vice versa.
  - Positive coefficients `(e.g., feat4, feat11, feat28)` suggest a positive correlation, meaning as the feature value increases, the likelihood of fraud increases, and vice versa.
  - Coefficients close to zero `(e.g., feat2, feat6, feat13)` indicate weak or no correlation between the feature and the target variable.

- **Transaction_Amount**: The correlation coefficient is approximately 0.019. This suggests a very weak positive correlation between the transaction amount and the likelihood of fraud.

Overall, correlation coefficients close to zero indicate weak linear relationships between the features and the target variable. It's important to note that correlation does not imply causation, and other factors not captured by these features may influence fraud detection.

In [None]:
corr = df_train.corr()

correlation_with_target = corr['IsFraud'].drop('IsFraud')

print("Correlation with target variable (IsFraud):")
print(correlation_with_target)



## 2.3 Weaker Correlations

In [None]:
threshold = 0.005  # Adjust the threshold as needed
weak_corr_features = correlation_with_target[abs(correlation_with_target) < threshold]

print("Features with weak correlation (|Correlation| < {}):".format(threshold))
print(weak_corr_features)


## 2.4 Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
correlation_matrix = df_train.corr()

# Plot correlation heatmap
plt.figure(figsize=(20, 12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()


In [None]:
df_train

In [None]:
df_test

## 2.5 Drop Columns

The features with the weakest correlation with the target were dropped, i.e. `Time`, `feat2`, `feat6`, `feat13`, `feat25` and `feat27`.

In [None]:
# Drop the specified columns 'Time', 'feat2', 'feat6', 'feat13', 'feat25', 'feat27' from the DataFrame
df.drop(columns=['Time', 'feat2', 'feat6', 'feat13', 'feat25', 'feat27'], inplace=True)


# 3. Feature Engineering

## 3.1 Feature Engineering for Transaction Amount


**In data analysis and machine learning, feature engineering plays a pivotal role in extracting valuable insights from raw data. Specifically, in the context of transaction data analysis, the transaction amount feature holds significant importance as it provides insights into the monetary value of each transaction.**

**Upon visualizing this feature, I observed that there were relatively rare transactions occurring beyond the 1000 mark. To address this and enhance the feature's effectiveness, I applied feature engineering through the following steps:**

In [None]:
import matplotlib.pyplot as plt

# Plotting genuine transactions
plt.figure(figsize=(6, 6))
plt.hist(df_train[df_train.IsFraud == 0].Transaction_Amount, bins=50, color='red', alpha=0.5, label='Genuine', log=True)
plt.title('Transaction Amount Distribution')
plt.xlabel('Transaction Amount')
plt.ylabel('Number of Transactions')
plt.legend()

# Plotting fraud transactions
plt.figure(figsize=(6, 6))
plt.hist(df_train[df_train.IsFraud == 1].Transaction_Amount, bins=50, color='green', alpha=0.5, label='Fraud', log=True)
plt.title('Transaction Amount Distribution')
plt.xlabel('Transaction Amount')
plt.ylabel('Number of Transactions')
plt.legend()

plt.show()



- **Binning:**
   
   I categorized transaction amounts into three intervals (Low, Medium, High) by binning them based on their value range.

- **Transaction Amount to Mean Ratio:**
   
   I calculated the ratio of each transaction amount to the mean transaction amount within its corresponding group (identified by the `id` index). Note that if every `id` is unique, each group holds a single row, so this ratio is 1 for every non-zero amount.

- **Difference from Mean:**
   
   I computed the difference between each transaction amount and the mean transaction amount within its corresponding group (identified by the `id` index). Note that if every `id` is unique, each group holds a single row, so this difference is always 0.

- **Logarithmic Transformation:**
   
   I applied a logarithmic transformation to the transaction amount feature to address skewness and heteroscedasticity.
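
As a rough sketch of how two of these steps behave on skewed data, the example below uses synthetic log-normal amounts (an assumption, not the competition data). It shows that equal-width binning with `pd.cut(..., bins=3)` places almost every value in the `Low` bin when a few large amounts stretch the range, and that `np.log1p` reduces the skewness considerably.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed "amounts" (illustrative only)
rng = np.random.default_rng(42)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000),
                    name="Transaction_Amount")

# Equal-width binning: the three bins split the [min, max] range evenly
equal_width = pd.cut(amounts, bins=3, labels=["Low", "Medium", "High"])
low_share = (equal_width == "Low").mean()

# Skewness before and after the log1p transformation
raw_skew = amounts.skew()
log_skew = np.log1p(amounts).skew()

print(f"Share of rows in 'Low' bin: {low_share:.3f}")
print(f"Skewness (raw): {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
```

If more evenly populated bins are wanted, quantile-based binning with `pd.qcut` is one alternative worth trying.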



In [None]:
# Create a new feature 'Transaction_Amount_Bin' by binning the 'Transaction_Amount' into three categories: Low, Medium, High
df['Transaction_Amount_Bin'] = pd.cut(df['Transaction_Amount'], bins=3, labels=['Low', 'Medium', 'High'])

# Create a new feature 'Transaction_Amount_to_Mean' by dividing each transaction amount by the mean transaction amount for the corresponding 'id' group
df['Transaction_Amount_to_Mean'] = df['Transaction_Amount'] / df.groupby('id')['Transaction_Amount'].transform('mean')

# Create a new feature 'Transaction_Amount_Diff' by subtracting the mean transaction amount for the corresponding 'id' group from each transaction amount
df['Transaction_Amount_Diff'] = df['Transaction_Amount'] - df.groupby('id')['Transaction_Amount'].transform('mean')

# Create a new feature 'Log_Transaction_Amount' by applying a logarithmic transformation to the transaction amount using np.log1p
df['Log_Transaction_Amount'] = np.log1p(df['Transaction_Amount'])


## 3.2 Standard Scaler Technique

### Scaling Engineered Features Derived from Transaction Amount

In this code snippet, I utilize the StandardScaler from scikit-learn to standardize selected features that are engineered from the transaction amount. Specifically, I apply this technique to normalize the following engineered features:

1. **Transaction_Amount**: The original transaction amount feature.
2. **Transaction_Amount_to_Mean**: The ratio of each transaction amount to the mean transaction amount within its corresponding group or category.
3. **Transaction_Amount_Diff**: The difference between each transaction amount and the mean transaction amount within its corresponding group or category.
4. **Log_Transaction_Amount**: The logarithmic transformation of the transaction amount feature.

Here's how the process unfolds:

1. **Feature Selection**: I first identify these engineered features that require scaling.

2. **StandardScaler Initialization**: Next, I initialize the StandardScaler object to perform the scaling operation.

3. **Fit and Transform**: Using the fit_transform method, I simultaneously fit the scaler to the selected features and transform them. This step calculates the mean and standard deviation of each feature and then standardizes them accordingly.

4. **Conversion to DataFrame**: After scaling, I convert the scaled features back to a DataFrame format for further analysis.

5. **Concatenation**: Finally, I concatenate the scaled features with the remaining columns of the original DataFrame, `df`, so that the scaled versions replace the unscaled versions of these four features.

By applying the StandardScaler to these engineered features, I ensure that they are standardized, removing any discrepancies in scale and facilitating the training of machine learning models.
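
To make the standardization step concrete, the sketch below uses a tiny made-up array to check that `StandardScaler` computes `(x - mean) / std` with the population standard deviation. It also fits the scaler on a "train" array and only transforms a separate "test" array, which is a common way to keep test statistics out of the fit; this is shown as an option and is not what the cell below does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays (hypothetical values)
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Fit on the training values only, then transform both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Manual standardization: np.std defaults to ddof=0 (population std)
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(test_scaled)
```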


In [None]:
from sklearn.preprocessing import StandardScaler

# Select relevant features for scaling
features_to_scale = ['Transaction_Amount', 'Transaction_Amount_to_Mean', 'Transaction_Amount_Diff', 'Log_Transaction_Amount']

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the selected features
df_scaled = scaler.fit_transform(df[features_to_scale])

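# Convert the scaled features back to a DataFrame, keeping the original 'id' index
# so that the concatenation below aligns rows by label
df_scaled = pd.DataFrame(df_scaled, columns=features_to_scale, index=df.index)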

# Concatenate the scaled features with the original DataFrame
df = pd.concat([df.drop(columns=features_to_scale), df_scaled], axis=1)


In [None]:
df

## Checking Dataset Balance

To assess the balance of the dataset, it's essential to examine the distribution of the class labels. One way to do this is to count the training rows in each class:

In [None]:
# Count the training rows in each class of the 'IsFraud' label
samples_with_prob_1 = df_train[df_train['IsFraud'] == 1].shape[0]
samples_with_prob_0 = df_train[df_train['IsFraud'] == 0].shape[0]

print("Number of fraudulent samples (IsFraud == 1):", samples_with_prob_1)
print("Number of genuine samples (IsFraud == 0):", samples_with_prob_0)


## 3.3 Random Undersampling


This dataset has a significant class imbalance: only `269 training samples are labeled as fraud (IsFraud = 1)` compared to `149,731 samples labeled as genuine (IsFraud = 0)`. In such scenarios, `random undersampling` is a viable solution. It randomly selects a subset of the majority class that matches the size of the minority class, which yields a balanced set of both classes.

The essence of random undersampling lies in its effectiveness in mitigating the adverse effects of class imbalance, thereby enhancing the performance of machine learning models. By equalizing the representation of both classes, the model becomes more adept at identifying patterns and making accurate predictions across diverse scenarios.

In [None]:
# Index labels (ids) of the fraudulent and genuine transactions
fraud_ind = np.array(df_train[df_train.IsFraud == 1].index)
gen_ind = df_train[df_train.IsFraud == 0].index
n_fraud = len(fraud_ind)

# Randomly draw as many genuine transactions as there are fraudulent ones
random_gen_ind = np.random.choice(gen_ind, n_fraud, replace=False)
under_sample_ind = np.concatenate([fraud_ind, random_gen_ind])

# The indices are labels, so select with .loc rather than the positional .iloc
undersample_df = df_train.loc[under_sample_ind, :]
y_undersample = undersample_df['IsFraud'].values  # target only
X_undersample = undersample_df.drop(columns=['IsFraud']).values  # features

print("# transactions in undersampled data: ", len(undersample_df))
print("IsFraud == 0: ", len(undersample_df[undersample_df.IsFraud == 0]) / len(undersample_df))
print("IsFraud == 1: ", y_undersample.sum() / len(undersample_df))

In [None]:
df_train

# 4. Model Evaluation

##  4.1 Validation

###   K-Fold Validation

In `k-fold cross-validation`, the dataset is divided into k subsets (folds) of approximately equal size.

The model is trained and evaluated k times, each time using a different fold as the validation set and the remaining folds as the training set.

The final performance metric is computed by averaging the performance of the model across all k folds.

This approach provides a more robust estimate of the model's performance compared to a single train-validation split, as it evaluates the model on multiple train-validation splits.
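
A minimal sketch of this procedure is shown below. It runs on synthetic imbalanced data with a logistic regression stand-in rather than the notebook's model, and it uses the stratified variant (`StratifiedKFold`) so that every fold contains some minority-class rows. The ROC AUC scoring choice is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: roughly 5% positives (illustrative only)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)

# Five folds; each fold serves once as the validation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=cv, scoring="roc_auc")

# The final estimate is the average over the folds
print("Per-fold ROC AUC:", np.round(scores, 3))
print("Mean ROC AUC:", scores.mean())
```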

###  Holdout Validation

In `holdout validation`, the dataset is partitioned into two subsets: a training set and a testing set, typically with a predefined ratio.

The model is trained once using the training set and then evaluated once using the testing set.

The performance metric is computed based on the model's performance on the testing set.

While holdout validation provides a quick and straightforward assessment of the model's performance, it may be prone to variability due to the random split of the data. 

In this dataset, both `holdout validation and k-fold cross-validation` yielded comparable results, with no significant divergence between the two approaches.

Despite the anticipation of potential differences in performance evaluation, both methodologies produced consistent outcomes. This observation suggests that, in this particular dataset, the choice between holdout validation and k-fold cross-validation may not substantially influence the model's performance assessment.

While both techniques offer distinct advantages, including simplicity in the case of holdout validation and thoroughness in the case of k-fold cross-validation, their similarity in outcomes underscores the stability and reliability of model evaluation in this context.

## 4.2 XGBoost and Hyperparameter Tuning

**I applied XGBoost, often referred to as Extreme Gradient Boosting, for my machine learning tasks. It stands out for its high performance and scalability, making it my preferred choice in both machine learning competitions and real-world applications.
To fine-tune my model and optimize its performance, I manually adjusted the hyperparameters of the XGBoost algorithm. This involved carefully selecting values for parameters like learning rate, max depth, and the number of estimators based on my domain knowledge and experimentation.**

In [None]:
# Separate the features (X) and target variable (y)
X = df_train.drop(columns=['IsFraud'])  # Features are all columns except 'IsFraud'
y = df_train['IsFraud']  # Target variable is the 'IsFraud' column


In [None]:
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost classifier
xgb_model = xgb.XGBClassifier(objective='binary:logistic',
    learning_rate=0.1,  
    max_depth=5,        
    n_estimators=100,  
    random_state=42
)

# Fit the model to the training data
xgb_model.fit(X_train, y_train)

# Predictions on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
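
With classes as imbalanced as the counts reported above (269 fraudulent vs. 149,731 genuine rows), accuracy on its own can look very high even for a model that never predicts fraud. The sketch below illustrates this with a constant "always genuine" predictor built from those counts. ROC AUC is used here as an example of a ranking-based metric; that choice is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts taken from the training set described in this notebook
n_fraud, n_genuine = 269, 149_731
y_true = np.concatenate([np.ones(n_fraud, dtype=int),
                         np.zeros(n_genuine, dtype=int)])

# A trivial baseline that labels every transaction as genuine
always_genuine = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true))

baseline_accuracy = accuracy_score(y_true, always_genuine)
baseline_auc = roc_auc_score(y_true, constant_scores)

print("Baseline accuracy:", baseline_accuracy)  # about 0.998
print("Baseline ROC AUC:", baseline_auc)        # 0.5, no ranking ability
```

Comparing the model's accuracy against this baseline, or reporting a ranking metric alongside it, gives a clearer picture of how well fraud is actually detected.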


## 4.3 Predictions on Test dataset

I applied `XGBoost` with manually tuned hyperparameters to optimize the model's performance. Leveraging these carefully chosen parameters, I made predictions on the target variable within the df_test dataset.

In [None]:
# Assuming df_test is your DataFrame


# Predict probabilities of the positive class (IsFraud = 1)
probabilities = xgb_model.predict_proba(df_test)

# Extract the probability of the positive class (class 1)
fraud_probabilities = probabilities[:, 1]

# Assuming you want to create a DataFrame with IDs and predicted probabilities
results_df = pd.DataFrame({'id': df_test.index, 'IsFraud': fraud_probabilities})

# Save the results DataFrame to a CSV file
results_df.to_csv('predictions.csv', index=False)

print("Predicted probabilities saved to predictions.csv")


In [None]:
print(results_df.head(10))

# 5. Conclusion

# My Journey to Success: Enhancing Fraud Detection

In my pursuit of optimizing fraud detection models, I embarked on a journey characterized by experimentation, refinement, and ultimately, exceptional results.

---

## Initial Exploration

I began by conducting a thorough analysis of the dataset, identifying crucial features and areas for improvement. 

In my initial attempt, I employed logistic regression and standard scaling techniques. 

By focusing solely on dropping the 'Time' column and scaling the transaction amount, I achieved respectable results. 

My model yielded a public leaderboard score of `0.66416` and a private leaderboard score of `0.55539`.

---

## Feature Engineering

Undeterred by the initial outcome, I delved deeper into feature engineering, recognizing the importance of selecting relevant features for model training. 

Extending the set of dropped columns to include `'feat2', 'feat6', 'feat13', 'feat25', and 'feat27', in addition to 'Time'`, I applied standard scaling to the transaction amount columns. 

This strategic decision led to a significant boost in performance, with a public leaderboard score of `0.7105` and a private leaderboard score of `0.62753`.

Applying the feature engineering techniques to the transaction amount feature resulted in a private leaderboard score of `0.61844` and a public leaderboard score of `0.69152`, slightly below the scores reported above.

---

## Hyperparameter Tuning with XGBoost

Eager to further elevate my model's performance, I turned to the power of XGBoost and embarked on a journey of hyperparameter tuning. 

Leveraging the RandomizedSearchCV technique, I meticulously fine-tuned my XGBoostClassifier model, optimizing parameters for maximum efficacy. 

This improved both scores: my model achieved a public leaderboard score of `0.73187` and a private leaderboard score of `0.66366`.

The XGBoost model was fine-tuned using hyperparameter optimization, incorporating the following parameters:

- **Learning Rate**: 0.1
- **Max Depth**: 5
- **Number of Estimators**: 100
- **Random State**: 42

These parameters were selected because they performed best during hyperparameter tuning. With them, the model achieved a public leaderboard score of `0.72924` and a private leaderboard score of `0.69817`. This private score was the best I obtained.
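
For readers who want to see what a randomized search looks like in code, here is a minimal sketch. It is not the search that produced the scores above: it runs on synthetic data, uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost classifier, and the parameter ranges, `n_iter`, `cv`, and `scoring` values are all illustrative assumptions.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic imbalanced dataset (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

# Hypothetical search ranges for the three tuned hyperparameters
param_distributions = {
    "learning_rate": uniform(0.01, 0.29),  # samples from [0.01, 0.30]
    "max_depth": randint(2, 7),            # integers 2..6
    "n_estimators": randint(20, 81),       # integers 20..80
}

# Sample five random configurations and score each with 3-fold CV
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)
```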






---

## Conclusion

Through my journey of continuous experimentation and optimization, I successfully navigated the complex landscape of fraud detection. 

