## **This notebook includes preprocessing, model selection and evaluation**

In [1]:
import numpy as np 
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score

In [2]:
df=pd.read_csv("cleaned.csv")

### **Data Preprocessing**

In [3]:
X=df.drop('isFraud',axis=1)
y=df['isFraud']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)

### **Split Features into Numeric and Categorical Columns**

We separate the numeric features from the categorical feature ('type') for both the training and test sets. This allows us to apply the appropriate preprocessing steps to each type of data.

In [4]:
X_num = X_train.drop('type', axis=1)
X_cat = X_train['type']
X_test_num = X_test.drop('type', axis=1)
X_test_cat = X_test['type']

### **Scale Numeric Features Using StandardScaler**

We fit the StandardScaler on the training numeric features and use it to transform both the training and test numeric features. This ensures that the scaling parameters are learned only from the training data, preventing data leakage.

In [5]:
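scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_scaled = scaler.fit_transform(X_num)
# Reuse the training statistics on the test rows
X_test_scaled = scaler.transform(X_test_num)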
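
To make the leakage point concrete, the sketch below reproduces by hand what `StandardScaler` computes: the mean and population standard deviation come from the training rows only, and the same statistics are reused on the test rows. The toy arrays are made up for illustration and are not taken from `cleaned.csv`.

```python
import numpy as np

# Toy numeric feature (illustrative values, not from the dataset)
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0]])

# Statistics learned from the training rows only (StandardScaler uses ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction, while the test value is simply expressed in training-set units.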

### **One-Hot Encode the Categorical Feature**

We fit the OneHotEncoder on the training categorical feature and use it to transform both the training and test categorical features. The encoder is set to ignore unknown categories in the test set.

In [6]:
encode=OneHotEncoder(handle_unknown="ignore")
X_train_encoded=encode.fit_transform(X_cat.to_frame())
X_test_encoded=encode.transform(X_test_cat.to_frame())
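
As a small illustration of what `handle_unknown="ignore"` does, the sketch below fits an encoder on two categories and then transforms a value it has never seen. The category strings are hypothetical stand-ins, not a claim about the values present in `cleaned.csv`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values used only for illustration
train_types = np.array([["CASH_OUT"], ["TRANSFER"], ["CASH_OUT"]])
test_types = np.array([["TRANSFER"], ["PAYMENT"]])  # "PAYMENT" is unseen in training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_types)

encoded = enc.transform(test_types).toarray()
print(enc.categories_[0])  # learned categories, in sorted order
print(encoded)             # the unseen "PAYMENT" row is all zeros
```

An all-zero row keeps the feature width stable at prediction time, at the cost of treating every unseen category identically.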

### **Concatenate Scaled Numeric and One-Hot Encoded Categorical Features**

We combine the processed numeric and categorical features for both the training and test sets. The resulting arrays are ready to be used for model training and evaluation.

In [7]:
X_train_encoded = X_train_encoded.toarray()
X_train_combined=np.concatenate([X_train_scaled,X_train_encoded],axis=1)

In [8]:
# The encoder output is positional (a sparse matrix, not index-aligned),
# so the result from the previous step can be densified directly without
# resetting the index or re-encoding.
X_test_encoded = X_test_encoded.toarray()

# Now concatenate
X_test_combined = np.concatenate([X_test_scaled, X_test_encoded], axis=1)
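
The scale, encode, and concatenate steps can also be expressed as a single `ColumnTransformer`, which keeps the fit-on-train, transform-on-test discipline in one object. This is a minimal sketch on a tiny stand-in frame: the `amount` column and the category strings are assumptions for illustration, and only the `type` column name comes from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in frames; "amount" is a hypothetical numeric column
train_df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "type": ["CASH_OUT", "TRANSFER", "CASH_OUT", "TRANSFER"],
})
test_df = pd.DataFrame({"amount": [25.0], "type": ["TRANSFER"]})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["amount"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

train_combined = preprocess.fit_transform(train_df)  # fit on training rows only
test_combined = preprocess.transform(test_df)        # reuse the fitted transformers
print(train_combined.shape, test_combined.shape)
```

The output columns follow the order of the `transformers` list, so the scaled numeric column comes first and the one-hot columns follow, matching the manual concatenation.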

### **Handle Class Imbalance with SMOTE**

We use SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes in the training data by generating synthetic samples for the minority class (fraud), which gives the models more fraud examples to learn from. Resampling is applied to the training set only, so the test set keeps its original class distribution.

In [9]:
smote=SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_combined,y_train)
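
SMOTE places each synthetic sample on the line segment between a minority-class row and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two made-up rows; it is a simplified illustration, not the `imblearn` implementation. It also shows a side effect worth keeping in mind: when the two rows differ in category, the one-hot columns of the synthetic row become fractional. `imblearn` provides `SMOTENC` for mixed numeric and categorical data, which may be worth comparing.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class rows: [scaled_amount, type_A, type_B]
x_i = np.array([0.5, 1.0, 0.0])
x_neighbor = np.array([1.5, 0.0, 1.0])

# SMOTE-style interpolation: pick a random point between the two rows
lam = rng.uniform(0.0, 1.0)
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(lam, x_synthetic)  # the one-hot entries are 1 - lam and lam
```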

### **Train the Logistic Regression Model**

We initialize the Logistic Regression model with balanced class weights and up to 1,000 iterations, then fit it to the SMOTE-resampled training data.

In [10]:
logreg=LogisticRegression(class_weight='balanced',max_iter=1000,random_state=42)
logreg.fit(X_resampled,y_resampled)
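
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to made-up label counts. On labels that are already balanced, as they should be after SMOTE, every weight comes out as 1, so the option is expected to have little additional effect here.

```python
import numpy as np

def balanced_weights(y):
    """Per-class weights following the class_weight='balanced' formula."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Made-up label vectors for illustration
y_imbalanced = np.array([0] * 990 + [1] * 10)
y_balanced = np.array([0] * 990 + [1] * 990)

print(balanced_weights(y_imbalanced))  # rare class gets a much larger weight
print(balanced_weights(y_balanced))    # both classes get weight 1
```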

### **Train the Random Forest Classifier**

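We initialize the Random Forest Classifier with 20 trees, a maximum depth of 10, the entropy criterion, and a class weight of 50 on the fraud class, then fit it to the SMOTE-resampled training data. Note that `warm_start=True` has no effect when `fit` is called only once.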

In [11]:
rfc=RandomForestClassifier(class_weight={0:1,1:50},n_estimators=20,warm_start=True,max_depth=10,n_jobs=-1, random_state=7,criterion='entropy')
rfc.fit(X_resampled,y_resampled)

### **Evaluate Model Performance on Test Set**

We use the trained models to predict on the test set and evaluate their performance using accuracy, recall, precision, and F1 score. Because only 1,643 of the 1,272,524 test transactions are fraudulent, accuracy alone is not informative; recall and precision on the fraud class show how well the models generalize to unseen data.

In [12]:
y_pred_logreg = logreg.predict(X_test_combined)
y_pred_rf = rfc.predict(X_test_combined)

In [13]:
models = {
    "Logistic Regression": y_pred_logreg,
    "Random Forest": y_pred_rf
}

for name, y_pred in models.items():
    print(f"\n=== {name} ===")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))


=== Logistic Regression ===
Confusion Matrix:
 [[1204175   66706]
 [     78    1565]]
Accuracy: 0.9475184750935935
Recall: 0.9525258673158856
Precision: 0.02292334959206691
F1 Score: 0.04476928798237835

=== Random Forest ===
Confusion Matrix:
 [[1262037    8844]
 [      2    1641]]
Accuracy: 0.9930484611685123
Recall: 0.9987827145465612
Precision: 0.15650929899856938
F1 Score: 0.27061345646437995
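
As a cross-check, the reported metrics can be re-derived from the confusion-matrix counts alone. The helper name `metrics_from_confusion` is introduced here for illustration; the counts are copied from the output above.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the four reported metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]]
logreg_metrics = metrics_from_confusion(tn=1204175, fp=66706, fn=78, tp=1565)
rf_metrics = metrics_from_confusion(tn=1262037, fp=8844, fn=2, tp=1641)

print(logreg_metrics)
print(rf_metrics)
```

Both models catch most frauds, but the low precision comes from the false positives: 66,706 for Logistic Regression and 8,844 for Random Forest, against only 1,643 true fraud cases.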


### **Adjust Classification Threshold and Re-evaluate**

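By default, `predict` labels a transaction as fraud when the predicted fraud probability exceeds 0.5. We raise the threshold for the Random Forest model to 0.7 to trade a small amount of recall for higher precision (fewer false positives), then re-evaluate the model on the test set at this new threshold.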

In [14]:
y_score_logreg = logreg.predict_proba(X_test_combined)[:, 1]
y_score_rfc = rfc.predict_proba(X_test_combined)[:, 1]
threshold = 0.7
y_pred_custom = (y_score_rfc >= threshold).astype(int)

   
print(f"\n🔍 Evaluation at threshold = {threshold}")
print(confusion_matrix(y_test, y_pred_custom))
print("Accuracy:", accuracy_score(y_test, y_pred_custom))
print("Recall:", recall_score(y_test, y_pred_custom))
print("Precision:", precision_score(y_test, y_pred_custom))
print("F1 Score:", f1_score(y_test, y_pred_custom))


🔍 Evaluation at threshold = 0.7
[[1270585     296]
 [      3    1640]]
Accuracy: 0.9997650339011288
Recall: 0.9981740718198417
Precision: 0.8471074380165289
F1 Score: 0.9164571109248394
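
Only a single threshold is evaluated above. A sketch of how several candidate thresholds could be compared is shown below, using synthetic labels and scores rather than the notebook's data; the helper `precision_recall_at` is a hypothetical name. In practice, such a sweep would ideally run on a held-out validation split so the test set stays untouched for the final estimate.

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall of the positive class at a given score threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Synthetic labels and scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.55, 0.65, 0.72, 0.9, 0.2, 0.6])

results = {t: precision_recall_at(y_true, y_score, t) for t in (0.5, 0.6, 0.7)}
for t, (p, r) in results.items():
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and eventually costs recall, which is the same trade-off observed between the 0.5 and 0.7 results above.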


## **Conclusion and Next Steps**

In this notebook, we built and evaluated two models, Logistic Regression and Random Forest, for online payment fraud detection. After preprocessing, handling class imbalance with **SMOTE**, and raising the classification threshold, the **Random Forest** model with a threshold of **0.7** achieved the best balance between recall and precision among the configurations evaluated.

**Best Model:** Random Forest Classifier  
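**Selected Threshold:** 0.7 (the only alternative to the default 0.5 that was evaluated)  
**Performance:** Recall of 0.998 with precision raised from 0.157 to 0.847 (F1 score of 0.916). False positives fell from 8,844 to 296, while 1,640 of the 1,643 frauds in the test set were still caught.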

### Next Steps:
- Deploy the model for real-time fraud detection.
- Monitor model performance on new data and retrain as needed.
- Explore additional features or advanced algorithms for further improvement.

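This workflow provides a foundation for practical fraud detection in financial transactions. Because the 0.7 threshold was chosen using test-set results, the reported metrics should be confirmed on a separate validation split before deployment.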
