In [7]:
import joblib
dataset= joblib.load(r'C:\Users\MAJID KHAN\Data Science\Projects\CreditCard Fraud\Data\processed_data.pkl')


In [9]:
dataset.head(5)

Unnamed: 0,V17,V14,V12,V10,V3,V16,V7,V11,V4,V18,Scaled_Amount,Class
0,0.207971,-0.311169,-0.617801,0.090794,2.536347,-0.470401,0.239599,-0.5516,1.378155,0.025791,1.731174,0.0
1,-0.114805,-0.143772,1.065235,-0.166974,0.16648,0.463917,-0.078803,1.612727,0.448154,-0.183361,-0.276889,0.0
2,1.109969,-0.165946,0.066084,0.207643,1.773209,-2.890083,0.791461,0.624501,0.37978,-0.121359,4.861419,0.0
3,-0.684093,-0.287924,0.178228,-0.054952,1.792993,-1.059647,0.237609,-0.226487,-0.863291,1.965775,1.374197,0.0
4,-0.237033,-1.11967,0.538196,0.753074,1.548718,-0.451449,0.592941,-0.822843,0.403034,-0.038195,0.642886,0.0


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, recall_score, f1_score

In [13]:
X= dataset.iloc[:,:-1]
y=dataset['Class']

In [15]:
x_train, x_test, y_train, y_test= train_test_split(X,y, test_size=0.2, random_state=42, stratify=y) 

In [19]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((154791, 11), (154791,), (38698, 11), (38698,))

In [None]:
# Oversampling

**❓Why Not Oversample Only the Class Feature?**
- Because the goal of oversampling is not just to increase the number of “fraud” labels (i.e., the Class column).
- 🔁 It is to add complete minority-class rows, with features and label kept together. `RandomOverSampler` does this by duplicating existing fraud rows; SMOTE-style methods instead synthesize new rows by interpolating between existing ones.
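
As a rough illustration of the point above, here is a minimal pure-Python sketch of random oversampling. It is not how `imblearn` implements it; the function name `random_oversample` and the toy rows are made up for this example. The thing to notice is that whole rows get copied, never just the label.

```python
import random
from collections import Counter

def random_oversample(rows, label_index=-1, seed=42):
    """Duplicate whole minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra rows with replacement; each copy keeps features AND label together.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Toy rows with made-up values: (feature_1, feature_2, label)
toy_rows = [(0.1, 1.2, 0), (0.3, 0.9, 0), (0.2, 1.1, 0), (0.4, 1.0, 0), (2.5, -3.1, 1)]
balanced_rows = random_oversample(toy_rows)
class_counts = Counter(row[-1] for row in balanced_rows)
print(class_counts)
```

Every row in `balanced_rows` is an exact copy of a row that was already in `toy_rows`, so the fraud class grows without inventing any new feature values.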

**✅ Why Oversample Only Training Data?**
- If you apply random oversampling (or SMOTE, or any other resampling) before splitting:
  - Copies of the same fraud rows (or synthetic rows derived from them) end up in both the training set and the test set.
  - This is data leakage, and it gives you an unrealistically high score.
  - The test score no longer tells you how well the model generalizes to real, unseen fraud.
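
The leakage described above can be sketched on toy data with only the standard library. The helpers `stratified_split` and `oversample_minority` are hypothetical stand-ins for `train_test_split(..., stratify=y)` and `RandomOverSampler`, and the rows are just `(row_id, label)` pairs invented for the demo.

```python
import random

def stratified_split(rows, test_fraction, rng):
    """Split each class separately so both sides keep some fraud rows."""
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def oversample_minority(rows, rng):
    """Append random copies of fraud rows until both classes have equal counts."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

# Toy data: 95 legit rows and 5 fraud rows, each with a unique id.
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

# Wrong order: oversample first, then split.
rng = random.Random(0)
train_bad, test_bad = stratified_split(oversample_minority(rows, rng), 0.2, rng)
leaked_ids = {r[0] for r in train_bad} & {r[0] for r in test_bad}

# Right order: split first, then oversample the training part only.
rng = random.Random(0)
train, test = stratified_split(rows, 0.2, rng)
train_balanced = oversample_minority(train, rng)
overlap_ids = {r[0] for r in train_balanced} & {r[0] for r in test}

print("ids shared by train and test (oversample, then split):", sorted(leaked_ids))
print("ids shared by train and test (split, then oversample):", sorted(overlap_ids))
```

With only 5 distinct fraud rows copied many times, the first set will almost certainly be non-empty: the model is tested on rows it has already seen. In the second ordering the overlap is empty by construction, because only training rows are ever duplicated.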

In [45]:
from imblearn.over_sampling import RandomOverSampler

In [47]:
sampling= RandomOverSampler(random_state=42)

In [49]:
x_train, y_train= sampling.fit_resample(x_train,y_train)

In [51]:
# modeling

In [53]:
model = LogisticRegression()

In [55]:
model.fit(x_train,y_train)

In [57]:
y_predict= model.predict(x_test)

In [67]:
print("🔍 Logistic Regression Report: AFTER OverSampling")
print(classification_report(y_test, y_predict, digits=4))

🔍 Logistic Regression Report: AFTER OverSampling
              precision    recall  f1-score   support

         0.0     0.9999    0.9744    0.9870     38625
         1.0     0.0645    0.9315    0.1206        73

    accuracy                         0.9744     38698
   macro avg     0.5322    0.9530    0.5538     38698
weighted avg     0.9981    0.9744    0.9854     38698



### Before OverSampling:

|Class |Precision |Recall |F1-score |Support |
|------|----------|-------|---------|--------|
|0.0 ✅ Legitimate |0.9992 |0.9998 |0.9995 |38625 |
|1.0 ⚠️ Fraud |0.8571 |0.5753 |0.6885 |73 |
|accuracy | | |0.9990 |38698 |
|macro avg |0.9282 |0.7876 |0.8440 | |
|weighted avg |0.9989 |0.9990 |0.9989 | |


In [70]:
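# (test accuracy %, train accuracy %); note that x_train is the oversampled, balanced training set here
model.score(x_test,y_test)*100, model.score(x_train,y_train)*100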

(97.43656002894207, 93.84987799431718)

## 📊 Comparison Table
|Metric |Before Oversampling |After Oversampling |
|-------|--------------------|-------------------|
|Class 0 (Legit) Precision |0.9992 |0.9999 ✅ |
|Class 0 (Legit) Recall |0.9998 ✅ |0.9744 ⬇️ |
|Class 1 (Fraud) Precision |0.8571 ✅ |0.0645 ⬇️ |
|Class 1 (Fraud) Recall |0.5753 ⬇️ |0.9315 ✅⬆️ |
|Accuracy |0.9990 ✅ |0.9744 ⬇️ |
|Macro Avg F1 |0.8440 ✅ |0.5538 ⬇️ |
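
To make the fraud-class numbers in the table more concrete, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are back-computed from the reported precision, recall, accuracy, and supports (38625 legit, 73 fraud) and are not printed anywhere in the notebook. The helper name `binary_report` is made up for this example.

```python
def binary_report(tp, fp, fn, tn):
    """Precision, recall, and F1 for the positive (fraud) class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "accuracy": round(accuracy, 4),
    }

# Inferred counts (assumption): chosen so that they reproduce the reported metrics.
before = binary_report(tp=42, fp=7, fn=31, tn=38618)
after = binary_report(tp=68, fp=987, fn=5, tn=37638)

print("before oversampling:", before)
print("after oversampling: ", after)
```

Read this way, oversampling catches 26 more frauds (68 instead of 42 out of 73), but the number of legitimate transactions flagged as fraud rises from about 7 to nearly 1,000, which is what drags precision down to 0.0645.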

### 🧠 What’s Happening?
**🧪 BEFORE Oversampling:**

Your model is trained on the original, imbalanced dataset:

- Legit (Class 0): 99.8% of data
- Fraud (Class 1): ~0.2%

So:

- The model is heavily biased toward Class 0
- The high accuracy (99.9%) is misleading
- It catches only ~58% of frauds (recall = 0.5753)
- Precision for fraud is good (0.8571), but the model still misses many actual frauds
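
One way to see why the accuracy figure is misleading is to compare it with a trivial baseline that labels every transaction as legitimate. This is only arithmetic on the test-set supports from the classification report (38625 legit, 73 fraud); no model is involved.

```python
n_legit, n_fraud = 38625, 73  # test-set supports from the classification report

# A "model" that always predicts legit gets every legit row right
# and every fraud row wrong.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_fraud_recall = 0 / n_fraud

print(f"baseline accuracy:     {baseline_accuracy:.4f}")
print(f"baseline fraud recall: {baseline_fraud_recall:.4f}")
```

The baseline reaches about 99.81% accuracy while catching no fraud at all, so 99.9% accuracy on its own says very little. Fraud recall and precision are the numbers to watch.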

**🧪 AFTER Random Oversampling:**

You've duplicated the minority (fraud) class in the training set to make it balanced.

Now:

- The model is forced to learn what frauds look like
- Recall improves dramatically (it now catches ~93% of frauds ✅)
- But precision drops sharply to 0.0645, meaning only about 6% of the transactions it flags are actually fraud
- The model is overpredicting fraud, labeling too many legitimate transactions as fraud

**🚦 So… Which One Is Better?**

It depends on business goals:

|Scenario |Focus |Recommended |
|---------|------|------------|
|💳 Credit Card Company |Catch as many frauds as possible |✅ High recall is crucial → Accept oversampling and false positives |
|⚖️ Legal Risk or Costly Investigations |Avoid false accusations |✅ Higher precision is needed → May prefer original model or tune threshold |
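
The "tune threshold" option can be sketched as follows. The scores and labels below are made-up toy values, and `precision_recall_at` is a hypothetical helper. In this notebook the scores would presumably come from `model.predict_proba(x_test)[:, 1]`, and `model.predict` corresponds to a cut-off of 0.5.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag a transaction as fraud when its score is at or above the threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up fraud probabilities and true labels, for illustration only.
toy_scores = [0.05, 0.10, 0.35, 0.55, 0.60, 0.70, 0.80, 0.95]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]

for threshold in (0.5, 0.75, 0.9):
    precision, recall = precision_recall_at(threshold, toy_scores, toy_labels)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold trades recall for precision: on the toy data, 0.5 catches every fraud but also flags two legitimate transactions, while 0.75 removes the false alarms at the cost of missing one fraud. Which point on that curve is right depends on the business goal in the table above.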

