# UPI Number fraud detection

## Brief Introduction

# Contents <a id='10'></a>

1. [Importing dependencies](#1)
2. [Data preprocessing](#2)
3. [Statistical Data Analysis](#3)
4. [Machine Learning Model building](#4)
5. [Training and Evaluation](#5)
6. [Hyper-parameter Tuning](#6)
7. [Key Observations of SHAP and LIME images](#7)
8. [Summary](#8)
9. [Conclusion](#9)

## Importing dependencies<a id='1'></a>
[back to contents](#10)

In [9]:
!pip install pystan

Collecting pystan
  Using cached pystan-3.10.0-py3-none-any.whl.metadata (3.7 kB)
Collecting clikit<0.7,>=0.6 (from pystan)
  Using cached clikit-0.6.2-py2.py3-none-any.whl.metadata (1.6 kB)
INFO: pip is looking at multiple versions of pystan to determine which version is compatible with other requirements. This could take a while.
Collecting pystan
  Using cached pystan-3.9.1-py3-none-any.whl.metadata (3.7 kB)
  Using cached pystan-3.9.0-py3-none-any.whl.metadata (3.7 kB)
  Using cached pystan-3.8.0-py3-none-any.whl.metadata (3.8 kB)
  Using cached pystan-3.7.0-py3-none-any.whl.metadata (3.7 kB)
  Using cached pystan-3.6.0-py3-none-any.whl.metadata (3.7 kB)
  Using cached pystan-3.5.0-py3-none-any.whl.metadata (3.7 kB)
  Using cached pystan-3.4.0-py3-none-any.whl.metadata (3.7 kB)
INFO: pip is still looking at multiple versions of pystan to determine which version is compatible with other requirements. This could take a while.
  Using cached pystan-3.3.0-py3-none-any.whl.metadata (3.6

  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [4 lines of output]
    self.version = node.value.s
  Cython>=0.22 and NumPy are required.
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pystan
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pystan)


In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # aliased as plt, which the plotting cells rely on
import seaborn as sns
# The Stan cell uses the PyStan 2 API (pystan.StanModel); PyStan 3 is imported as `stan` instead
import pystan
# The PyMC cell refers to `pm`; it uses PyMC3-style arguments, so pymc3 is assumed here
import pymc3 as pm
from bayesian_testing.experiments import BinaryDataTest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression, RidgeCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
import plotly.colors as colors
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import pickle
import shap
shap.initjs()
import lime
from lime import lime_tabular

ModuleNotFoundError: No module named 'pystan'

## Data Preprocessing<a id='2'></a> 
[back to contents](#10)

In [None]:
df = pd.read_csv('upi_fraud_dataset.csv')
df

In [None]:
df.isnull().sum()

In [None]:
df.info()

## Statistical Data Analysis<a id='3'></a> 
[back to contents](#10)

In [None]:
df.describe()

### Correlation analysis

In [None]:
correlation_matrix = df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

### Finding Outliers

In [None]:
# Function to detect outliers using Z-score
def detect_outliers_zscore(data, threshold=3):
    z_scores = (data - data.mean()) / data.std()
    outliers = data[np.abs(z_scores) > threshold]
    return outliers

# Function to detect outliers using IQR
def detect_outliers_iqr(data):
    q1, q3 = data.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers

outliers_zscore = detect_outliers_zscore(df['trans_amount'])
outliers_iqr = detect_outliers_iqr(df['trans_amount'])

print("Outliers using Z-score:", outliers_zscore)
print("Outliers using IQR:", outliers_iqr)
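# Overlay a transparent box plot on the violin plot, sharing one Axes
fig, ax = plt.subplots()
sns.violinplot(y=df['trans_amount'], ax=ax)
sns.boxplot(y=df['trans_amount'], whis=np.inf, boxprops={'facecolor': 'none'}, ax=ax)
ax.set_title('Combined Box and Violin Plot')
plt.show()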
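
As a rough illustration of how the two rules can disagree, the sketch below applies the same Z-score and IQR logic to a small made-up series (the values are hypothetical, not taken from the dataset). With only ten points, a single extreme value inflates the standard deviation so much that its Z-score stays below 3, while the IQR rule still flags it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy amounts (not from the dataset): nine ordinary values and one extreme one
amounts = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 500])

# Z-score rule: flag points more than 3 sample standard deviations from the mean
z_scores = (amounts - amounts.mean()) / amounts.std()
z_outliers = amounts[np.abs(z_scores) > 3]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers.tolist())
print("IQR outliers:", iqr_outliers.tolist())
```

On this toy series the Z-score rule flags nothing and the IQR rule flags only 500, so it can be worth comparing both results before deciding how to treat extreme transaction amounts.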

### Bayesian Hypothesis testing

The goal is to test whether the average transaction amount for fraudulent transactions differs significantly from the average transaction amount for non-fraudulent transactions.

**Hypotheses:**

**Null Hypothesis (H₀):** The average transaction amount for fraudulent and non-fraudulent transactions is the same.

**Alternative Hypothesis (H₁):** The average transaction amount for fraudulent and non-fraudulent transactions is different.
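
Before fitting a full model, the comparison can be sketched with a simple large-sample approximation. The example below is only an illustration under stated assumptions: flat priors, each group mean treated as approximately normal around its sample mean, and synthetic stand-in data rather than the UPI dataset. The helper name `posterior_mean_difference` is hypothetical.

```python
import numpy as np

def posterior_mean_difference(group_a, group_b, draws=10_000, seed=0):
    """Draw from an approximate posterior of mean(group_a) - mean(group_b).

    Assumes flat priors and large samples, so each group mean is treated as
    Normal(sample mean, sample std / sqrt(n)).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    mean_a = rng.normal(a.mean(), a.std(ddof=1) / np.sqrt(a.size), draws)
    mean_b = rng.normal(b.mean(), b.std(ddof=1) / np.sqrt(b.size), draws)
    return mean_a - mean_b

# Synthetic stand-in data (hypothetical, not the UPI dataset)
rng = np.random.default_rng(42)
fraud_amounts = rng.normal(120, 30, size=200)
honest_amounts = rng.normal(100, 30, size=800)

diff = posterior_mean_difference(fraud_amounts, honest_amounts)
low, high = np.percentile(diff, [2.5, 97.5])
print(f"P(difference > 0) = {np.mean(diff > 0):.3f}")
print(f"95% credible interval: [{low:.1f}, {high:.1f}]")
```

If the credible interval excludes zero, the data favour H₁ under these assumptions. The models in the cells that follow replace the approximation with explicit priors and likelihoods.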

In [None]:
fraudulent_transactions = df[df['fraud_risk'] == 1]['trans_amount']
non_fraudulent_transactions = df[df['fraud_risk'] == 0]['trans_amount']

# trans_amount is continuous, so a normal-data test is used here;
# BinaryDataTest expects 0/1 outcomes
from bayesian_testing.experiments import NormalDataTest

test = NormalDataTest()
test.add_variant_data("A", fraudulent_transactions.values)
test.add_variant_data("B", non_fraudulent_transactions.values)

# Evaluate the test and print one column per variant
results = test.evaluate()
print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))

In [None]:
fraudulent_transactions = df[df['fraud_risk'] == 1]['trans_amount']
non_fraudulent_transactions = df[df['fraud_risk'] == 0]['trans_amount']

with pm.Model() as model:
    # Priors for the means of the two groups
    mu_fraud = pm.Normal('mu_fraud', mu=0, sd=10)
    mu_non_fraud = pm.Normal('mu_non_fraud', mu=0, sd=10)

    # Shared standard deviation for both groups
    sigma = pm.HalfCauchy('sigma', beta=10)

    # Likelihood for fraudulent transactions
    fraud_likelihood = pm.Normal('fraud_likelihood', mu=mu_fraud, sd=sigma, observed=fraudulent_transactions)

    # Likelihood for non-fraudulent transactions
    non_fraud_likelihood = pm.Normal('non_fraud_likelihood', mu=mu_non_fraud, sd=sigma, observed=non_fraudulent_transactions)

    # Sample from the posterior distribution
    trace = pm.sample(1000, tune=1000, cores=1)

# Analyze the posterior distribution
pm.traceplot(trace)

# Calculate the probability that the difference in means is greater than a certain threshold
diff_means = trace['mu_fraud'] - trace['mu_non_fraud']
prob_diff_greater_than_5 = np.mean(diff_means > 5)  # Adjust the threshold as needed

print(f"Probability that the difference in means is greater than 5: {prob_diff_greater_than_5:.2f}")

In [None]:
fraudulent_transactions = df[df['fraud_risk'] == 1]['trans_amount'].values
non_fraudulent_transactions = df[df['fraud_risk'] == 0]['trans_amount'].values

# Define the Stan model
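stan_model = """
data {
  int<lower=0> N1;
  int<lower=0> N2;
  vector[N1] y1;
  vector[N2] y2;
}
parameters {
  real mu;               // mean of the non-fraudulent group (implicit flat prior)
  real mu_diff;          // fraudulent mean minus non-fraudulent mean
  real<lower=0> sigma;
  real<lower=0> nu;
}
model {
  // Priors
  mu_diff ~ normal(0, 10);
  sigma ~ cauchy(0, 10);
  nu ~ exponential(1.0 / 29);  // 1/29 would be integer division (0) in Stan

  // Likelihood
  y1 ~ student_t(nu, mu + mu_diff, sigma);
  y2 ~ student_t(nu, mu, sigma);
}
"""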

# Compile the Stan model
model = pystan.StanModel(model_code=stan_model)

# Prepare data for Stan
data = {'N1': len(fraudulent_transactions),
        'N2': len(non_fraudulent_transactions),
        'y1': fraudulent_transactions,
        'y2': non_fraudulent_transactions}

# Sample from the posterior distribution
fit = model.sampling(data=data, iter=1000, chains=4)

# Extract posterior samples
diff_means = fit['mu_diff']

# Posterior probability that mu_diff is positive (one-sided evidence for H1)
prob_h1 = np.mean(diff_means > 0)
print(f"Posterior probability that mu_diff > 0: {prob_h1:.2f}")

## Machine Learning Model building<a id='4'></a>
[back to contents](#10)

## Training and Evaluation<a id='5'></a>
[back to contents](#10)

In [None]:
features =  df[['trans_hour','trans_day', 'trans_month', 'trans_year', 'category', 'upi_number', 'age', 'trans_amount', 'state', 'zip']]
target = df['fraud_risk']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(features, target, random_state = 200, test_size = 0.15)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [None]:
def train_and_evaluate_models(x_train_scaled, y_train, x_test_scaled, y_test):
    models_list = [
        LogisticRegression(),
        DecisionTreeClassifier(),
        HistGradientBoostingClassifier(),
        RandomForestClassifier(),
        GradientBoostingClassifier(),
        AdaBoostClassifier(),
        GaussianNB()
    ]

    for model in models_list:
        model.fit(x_train_scaled, y_train)
        y_pred = model.predict(x_test_scaled)

        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        print(f"Accuracy of the {model} model: {accuracy:.2f}")
        print(f"Precision of the {model} model: {precision:.2f}")
        print(f"Recall of the {model} model: {recall:.2f}")
        print(f"F1-score of the {model} model: {f1:.2f}")
        print('\n ________________________________________________ \n')

In [None]:
train_and_evaluate_models(x_train_scaled, y_train, x_test_scaled, y_test)

### **From the above results it is clear that**
- **DecisionTreeClassifier**
- **HistGradientBoostingClassifier**
- **RandomForestClassifier**
- **GradientBoostingClassifier**
- **AdaBoostClassifier**

**are more accurate than the other models, each with an accuracy above 90%. I will therefore try to improve their performance through hyperparameter tuning.**
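
Accuracy on its own can hide weak fraud detection when one class dominates. The class balance of this dataset is not shown here, so the sketch below uses made-up labels (an assumption for illustration only): a model that never predicts fraud still reaches 95% accuracy but has zero recall, which is why precision, recall and F1 are worth reading alongside accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 95 honest transactions (0) and 5 fraudulent ones (1)
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts honest
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when there are no positive predictions
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```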

## Hyper-parameter Tuning <a id='6'></a>
[back to contents](#10)

In [None]:
# Define the parameter grids for each model
param_grid_decision_tree = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_hist_gradient_boosting = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_leaf_nodes':[10, 20, 15],
}

param_grid_random_forest = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

param_grid_gradient_boosting = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [3, 5, 7]
}

param_grid_ada_boost = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.1, 0.05, 0.01]
}

# Create a dictionary to store the models and their parameter grids
model_param_grid = {
    DecisionTreeClassifier(): param_grid_decision_tree,
    HistGradientBoostingClassifier(): param_grid_hist_gradient_boosting,
    RandomForestClassifier(): param_grid_random_forest,
    GradientBoostingClassifier(): param_grid_gradient_boosting,
    AdaBoostClassifier(): param_grid_ada_boost,
}
results = []
# Perform hyperparameter tuning and evaluate models
for model, param_grid in model_param_grid.items():
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(x_train_scaled, y_train)
    best_model = grid_search.best_estimator_
    model_name = type(model).__name__  # e.g. 'DecisionTreeClassifier', safe to use in a file name
    with open(f'best_{model_name}.pkl', 'wb') as f:  # Open in binary write mode
        pickle.dump(best_model, f)
    print(f"Best model for {model_name} saved successfully!")
    y_pred = best_model.predict(x_test_scaled)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results.append({'Model': model, 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1})
    print(f"Best parameters for {model}: {grid_search.best_params_}")
    print(f"Accuracy of the {model} model: {accuracy:.2f}")
    print(f"Precision of the {model} model: {precision:.2f}")
    print(f"Recall of the {model} model: {recall:.2f}")
    print(f"F1-score of the {model} model: {f1:.2f}")
    print('\n _______________________________________________________________________________________________________________ \n')
    # SHAP explanation: TreeExplainer for tree-based models, KernelExplainer as a model-agnostic fallback.
    # The models were fitted on scaled features, so the scaled arrays are explained here.
    feature_names = list(x_train.columns)
    try:
        if isinstance(best_model, (DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier)):
            explainer = shap.TreeExplainer(best_model)
        else:
            # A small background sample keeps KernelExplainer tractable
            explainer = shap.KernelExplainer(best_model.predict, shap.sample(x_train_scaled, 100))
        shap_values = explainer.shap_values(x_test_scaled)
        print("Variable Importance Plot - UPI fraudDetection")
        shap.summary_plot(shap_values, x_test_scaled, feature_names=feature_names)
    except Exception as exc:
        # Report why SHAP was skipped instead of silently ignoring the error
        print(f"SHAP explanation skipped for {type(best_model).__name__}: {exc}")

    print('\n _______________________________________________________________________________________________________________ \n')
    # LIME explanation of a single prediction
    if isinstance(best_model, (DecisionTreeClassifier, RandomForestClassifier, HistGradientBoostingClassifier, GradientBoostingClassifier, AdaBoostClassifier)):
        # Class names follow the label order: 0 = honest, 1 = fraudulent
        class_names = ['Honest', 'Fraudulent']
        feature_names = list(x_train.columns)

        # Create the LIME explainer on the scaled training data the models were fitted on
        explainer = lime_tabular.LimeTabularExplainer(x_train_scaled, feature_names=feature_names, class_names=class_names, mode='classification')

        # Explain the first test instance (change the index to explain another one)
        instance = x_test_scaled[0]
        explanation = explainer.explain_instance(instance, best_model.predict_proba)

        # Visualize the explanation
        explanation.show_in_notebook(show_table=True, show_all=False)


## Summary of Hyperparameter Tuning Results
[back to contents](#10)

Hyperparameter tuning optimizes a machine learning model by systematically exploring combinations of hyperparameters. Here, five classification models were tuned with a 5-fold cross-validated grid search scored on accuracy: Decision Tree, HistGradientBoosting, Random Forest, Gradient Boosting, and AdaBoost.
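
The idea can be sketched on a small scale as follows. The data is synthetic (generated with `make_classification`, not the UPI dataset) and the grid is deliberately tiny, so this is an illustration of the mechanics, not a reproduction of the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the UPI dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Every combination in the grid is scored with 5-fold cross-validation
param_grid = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")
print(f"Held-out accuracy: {search.score(X_test, y_test):.2f}")
```

The cross-validated score is used to pick the parameters, while the held-out score gives a separate estimate on data the search never saw.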

Model Performance:

1. Decision Tree:
    - Best Parameters: max_depth=10, min_samples_leaf=1, min_samples_split=5
    - Performance: Achieved an accuracy of 0.95, indicating strong performance.
    
2. HistGradientBoosting:
    - Best Parameters: learning_rate=0.2, max_depth=7, max_leaf_nodes=10
    - Performance: Outperformed other models with an accuracy of 0.97 and excellent recall of 0.99.
    
3. Random Forest:
    - Best Parameters: max_depth=None, min_samples_split=2, n_estimators=100
    - Performance: Achieved an accuracy of 0.95 and a strong recall of 0.98.

4. Gradient Boosting:
    - Best Parameters: learning_rate=0.05, max_depth=5, n_estimators=100
    - Performance: Showed solid performance with an accuracy of 0.96 and a good recall of 0.98.
    
5. AdaBoost:
    - Best Parameters: learning_rate=0.1, n_estimators=200
    - Performance: Achieved an accuracy of 0.94, slightly lower than other models.

### **Overall, the HistGradientBoosting model performed best in terms of accuracy and recall, which suggests it is the most suitable model for this classification task.**

## Key Observations <a id="7"></a>
[back to contents](#10)

### Important features
The top features affecting the risk of fraud are:
- Transaction Hour
- Transaction Day
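
A SHAP summary plot orders features by the magnitude of their SHAP values across samples (the mean absolute value). The sketch below reproduces that ranking with NumPy on a small hand-written array; the feature names and numbers are hypothetical placeholders, not values computed from the models above.

```python
import numpy as np

# Hypothetical SHAP values: rows are samples, columns are features (placeholder numbers)
feature_names = ['feature_a', 'feature_b', 'feature_c']
shap_values = np.array([
    [0.30, -0.05, 0.10],
    [-0.40, 0.02, 0.15],
    [0.25, -0.01, -0.20],
])

# Global importance of a feature = mean absolute SHAP value over all samples
importance = np.abs(shap_values).mean(axis=0)

# Sort features from most to least important
order = np.argsort(importance)[::-1]
ranking = [(feature_names[i], float(importance[i])) for i in order]
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Applied to the real `shap_values` array and the training column names, the same two lines of NumPy would give a numeric ranking to quote next to the plots.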

In [None]:
df_results = pd.DataFrame(results)
#df_results['Model'] = df_results['Model'].str.replace(r'\([^()]*\)', '', regex=True).str.strip()
df_results

In [None]:
# Accuracy of each tuned model, kept in its own DataFrame so the dataset in `df` is not overwritten
accuracy_data = {'Model': ['DecisionTreeClassifier', 'HistGradientBoostingClassifier', 'RandomForestClassifier', 'GradientBoostingClassifier', 'AdaBoostClassifier'],
                 'Accuracy': [0.955, 0.97, 0.95, 0.9575, 0.935]}
df_accuracy = pd.DataFrame(accuracy_data)

# Create the bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=df_accuracy)
plt.title('Model Accuracy Comparison')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()

## Summary <a id='8'></a>
[back to contents](#10)

## Conclusion <a id ="9"></a>
[back to contents](#10)