In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('creditcard.csv')

In [3]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
df.shape

(284807, 31)

In [5]:
df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In the credit card dataset, the Class column marks each transaction as fraudulent (1) or not fraudulent (0). The counts confirm that the dataset is highly imbalanced:
<li>0 (Non-fraudulent transactions): 284,315 instances
<li>1 (Fraudulent transactions): 492 instances</li>
This kind of imbalance is typical in fraud detection datasets, where fraudulent transactions are much rarer than non-fraudulent ones. The key challenge here is to build a model that can accurately detect the rare fraudulent cases without being biased toward the majority class (non-fraudulent transactions).
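To put the imbalance in numbers, the short calculation below works only from the two counts printed by `value_counts()`. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the df['Class'].value_counts() output
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the whole dataset
fraud_rate = n_fraud / n_total

# Number of legitimate transactions per fraudulent one
imbalance_ratio = n_legit / n_fraud

print(f"total transactions: {n_total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"legitimate per fraud: {imbalance_ratio:.0f}")
```

Fewer than two transactions in a thousand are fraudulent, which is why the later sections look at recall and precision for class 1 rather than accuracy alone.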

In [6]:
# Independent and dependent features
# Note: 'Unnamed: 0' (a saved row index) and 'Time' stay in X as features
X = df.drop('Class',axis = 1)
y = df['Class']

## Before Applying any method

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

In [8]:
log_class = LogisticRegression(solver='liblinear', max_iter=1000)
# Define the parameter grid
grid = {'C': 10.0 ** np.arange(-2, 3), 'penalty': ['l1', 'l2']}
# Set up KFold cross-validation
cv = KFold(n_splits=5, random_state=42, shuffle=True)

<b>Purpose:</b> The grid dictionary defines the hyperparameter values to search over in order to find the best combination for the model.
##### C:
<li>The C parameter in Logistic Regression is the inverse of the regularization strength. Regularization is a technique used to prevent overfitting by penalizing large coefficients in the model.
<li>A lower value of C implies stronger regularization, and a higher value implies weaker regularization.
<li>10.0 ** np.arange(-2, 3) creates a range of values for C by exponentiating 10 to the power of -2, -1, 0, 1, and 2, resulting in values [0.01, 0.1, 1, 10, 100].

##### penalty:
<li>The penalty parameter specifies the type of regularization to use:
<li>'l1': Lasso (L1) regularization, which can lead to sparse models where some coefficients are exactly zero.
<li>'l2': Ridge (L2) regularization, which tends to shrink coefficients but usually does not make them zero.
<li>The grid defines two options: ['l1', 'l2'], so the model will try both regularization types.
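As a quick check of what the grid contains, the sketch below enumerates every (C, penalty) pair and counts the cross-validation fits implied by 5 folds. The names `candidates` and `n_fits` are illustrative; the count excludes the final refit on the full training set that GridSearchCV performs by default.

```python
import itertools
import numpy as np

# Same grid as in the notebook cell
grid = {'C': 10.0 ** np.arange(-2, 3), 'penalty': ['l1', 'l2']}

# Every (C, penalty) pair that the search would evaluate
candidates = list(itertools.product(grid['C'], grid['penalty']))
for C, penalty in candidates:
    print(f"C={C:g}, penalty={penalty}")

n_splits = 5  # number of folds in the KFold object
n_fits = len(candidates) * n_splits  # cross-validation fits only
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```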

##### KFold:
<li>KFold is a cross-validation technique that splits the data into k subsets (folds). The model is trained k times, each time using a different fold as the validation set and the remaining folds as the training set.
<li>n_splits=5 means the data will be split into 5 folds, so the model will be trained and validated 5 times, each time using a different fold as the validation set.
<li>random_state=42 fixes the seed used for shuffling, so the same folds are produced on every run.
<li>shuffle=True means the rows are shuffled before being split into folds, instead of being split in their original order.
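The fold logic can be sketched with NumPy alone. This is a simplified illustration of shuffled k-fold splitting on 10 toy samples, not scikit-learn's implementation, and it will not reproduce the exact index order that `KFold(random_state=42)` generates.

```python
import numpy as np

n_samples, n_splits = 10, 5

# Shuffle the row indices once with a fixed seed (analogous to shuffle=True)
rng = np.random.default_rng(42)
indices = rng.permutation(n_samples)

# Cut the shuffled indices into n_splits folds
folds = np.array_split(indices, n_splits)

splits = []
for i in range(n_splits):
    val_idx = folds[i]  # fold i is the validation set
    # all remaining folds form the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))
    print(f"fold {i}: train={sorted(train_idx.tolist())} val={sorted(val_idx.tolist())}")
```

Each sample lands in the validation set exactly once, and in the training set for the other four rounds.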

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size = 0.7)

In [10]:
# Set up GridSearchCV with f1_macro scoring
clf = GridSearchCV(log_class, grid, cv=cv, n_jobs=-1, scoring='f1_macro')
# Fit the model
clf.fit(X_train, y_train)

In [11]:
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85275    22]
 [   53    93]]
0.9991222218320986
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.81      0.64      0.71       146

    accuracy                           1.00     85443
   macro avg       0.90      0.82      0.86     85443
weighted avg       1.00      1.00      1.00     85443



<li>85,275 (True Negatives - TN): The model correctly predicted 85,275 instances as class 0 (the majority class). These are cases where the actual value was 0, and the model also predicted 0.
<li>22 (False Positives - FP): The model incorrectly predicted 22 instances as class 1 when they were actually class 0. These are errors where the model thought something belonged to class 1, but it actually belonged to class 0.
<li>53 (False Negatives - FN): The model incorrectly predicted 53 instances as class 0 when they were actually class 1. These are errors where the model missed something that belonged to class 1.
<li>93 (True Positives - TP): The model correctly predicted 93 instances as class 1. These are cases where the actual value was 1, and the model also predicted 1.
    
### Accuracy
<li>Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total instances.
<li>Formula:</li>

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{93 + 85275}{93 + 85275 + 22 + 53} = \frac{85368}{85443} \approx 0.9991 \text{ or } 99.91\%$$
<li>The model has a very high accuracy of 99.91%, indicating that it predicts correctly almost all the time. However, because class 0 makes up almost all of the test set, accuracy alone says little about how well the fraud class is detected.
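The accuracy figure can be recomputed directly from the four confusion-matrix entries. The sketch below also computes, for comparison, the accuracy of a hypothetical model that labels every test transaction as class 0; that baseline is an illustration and is not something the notebook trains.

```python
# Confusion-matrix entries from the logistic regression output
tn, fp, fn, tp = 85275, 22, 53, 93
total = tn + fp + fn + tp

# Accuracy of the fitted model
accuracy = (tp + tn) / total

# Hypothetical baseline: predict class 0 for every test transaction,
# which gets all actual-0 rows right and all actual-1 rows wrong
baseline_accuracy = (tn + fp) / total

print(f"model accuracy:    {accuracy:.4f}")
print(f"always-0 accuracy: {baseline_accuracy:.4f}")
```

The gap between the two numbers is small, which is the practical reason accuracy is a weak yardstick on this dataset.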

### Classification Report
##### For Class 0 (The majority class):
<li>Precision: 1.00 means that when the model predicted 0, it was correct 100% of the time.
<li>Recall: 1.00 means that the model correctly identified 100% of the actual 0 cases.
<li>F1-Score: 1.00 is a combined measure of precision and recall, indicating perfect performance for class 0.
    
##### For Class 1 (The minority class):
<li>Precision: 0.81 means that when the model predicted 1, it was correct 81% of the time.
<li>Recall: 0.64 means that the model correctly identified 64% of the actual 1 cases. In other words, it missed 36% of the true 1 cases.
<li>F1-Score: 0.71 is the harmonic mean of precision and recall. It shows that the model is fairly precise when it predicts 1, but it still misses 53 of the 146 actual fraud cases.
    
##### Macro Avg
Precision: 0.90, Recall: 0.82, F1-Score: 0.86
##### Weighted Avg
Precision: 1.00, Recall: 1.00, F1-Score: 1.00
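The per-class and averaged scores in the report follow from the confusion matrix. The sketch below recomputes them by hand; `prf` is a helper name introduced here for illustration. The macro average weights both classes equally, while the weighted average weights each class by its support, which is why it stays near 1.00.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class treated as 'positive'."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion-matrix entries from the logistic regression output
tn, fp, fn, tp = 85275, 22, 53, 93

# Class 1 as the positive class
p1, r1, f1_1 = prf(tp, fp, fn)
# Class 0 as the positive class: the roles of the entries swap
p0, r0, f1_0 = prf(tn, fn, fp)

support0, support1 = tn + fp, tp + fn

# Macro average: plain mean over classes
macro_f1 = (f1_0 + f1_1) / 2
# Weighted average: mean weighted by class support
weighted_f1 = (f1_0 * support0 + f1_1 * support1) / (support0 + support1)

print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"macro f1={macro_f1:.2f} weighted f1={weighted_f1:.2f}")
```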

### Conclusion
<li>High Accuracy: Your model is really good at predicting the majority class (0), which is why the accuracy is so high (99.91%).
<li>Challenge with Class 1: The model is less accurate when it comes to the minority class (1). It catches 64% of the cases where it should predict 1, but misses 36% of them. This is a common issue when you have a lot more examples of one class than the other.
<li>What to Watch For: Even though the accuracy is very high, it’s important to look at how well the model is doing with the minority class. Sometimes, accuracy alone can be misleading, especially in cases where one class is much more common than the other.

In [17]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [23]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)

In [15]:
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85286    11]
 [   34   112]]
0.9994733330992591
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.91      0.77      0.83       146

    accuracy                           1.00     85443
   macro avg       0.96      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



## Under Sampling
Undersampling is a technique used to handle imbalanced datasets, where one class significantly outnumbers another. The goal is to balance the dataset by reducing the number of instances in the majority class so that it matches the size of the minority class. This can help prevent the model from being biased towards the majority class.

In [31]:
from collections import Counter
Counter(y_train)

Counter({0: 199018, 1: 346})

In [38]:
from imblearn.under_sampling import NearMiss
from collections import Counter
# Initialize NearMiss with the desired sampling strategy
ns = NearMiss(sampling_strategy=0.8)
X_train_ns, y_train_ns = ns.fit_resample(X_train, y_train)
# Print the class distribution before and after applying NearMiss
print("The number of classes before fit():", Counter(y_train))
print("The number of classes after fit():", Counter(y_train_ns))

The number of classes before fit(): Counter({0: 199018, 1: 346})
The number of classes after fit(): Counter({0: 432, 1: 346})


<li>For under-sampling with two classes, sampling_strategy=0.8 sets the target ratio of minority to majority instances after resampling: N_minority / N_majority = 0.8. The minority class is left untouched and the majority class is cut down until that ratio is reached. For example, if the minority class has 400 instances, the majority class will be reduced to 500 instances.
<li>This output shows that the majority class (0) was reduced from 199,018 to 432 instances, because 346 / 0.8 = 432.5 is rounded down to 432, while the minority class (1) kept all 346 instances.
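The printed counts are consistent with reading sampling_strategy as the ratio of minority to majority samples after resampling. The arithmetic below reproduces the 432 figure; truncating to an integer is an assumption that matches the printed output, and the variable names are illustrative.

```python
# Training-set class counts from Counter(y_train)
n_majority, n_minority = 199018, 346
sampling_strategy = 0.8  # target N_minority / N_majority after under-sampling

# Majority samples to keep so that the ratio is (approximately) met
target_majority = int(n_minority / sampling_strategy)

# Ratio actually achieved after truncation
achieved_ratio = n_minority / target_majority

print(f"majority kept: {target_majority} of {n_majority}")
print(f"achieved ratio: {achieved_ratio:.3f}")
```

Only a tiny fraction of the legitimate transactions survive this step, which matters for the results that follow.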

In [39]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

In [40]:
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[10866 74431]
 [    3   143]]
0.12884613133902134
              precision    recall  f1-score   support

           0       1.00      0.13      0.23     85297
           1       0.00      0.98      0.00       146

    accuracy                           0.13     85443
   macro avg       0.50      0.55      0.11     85443
weighted avg       1.00      0.13      0.23     85443



#### Advantages :
<li>By making the classes equal in size, the model is less likely to be biased towards the majority class, improving its ability to detect the minority class.
<li>Since the dataset size is reduced, training the model can be faster and require less computational resources.
    
#### Disadvantages :
<li>By discarding a significant portion of the majority class data, you might lose important information that could help the model make better predictions. This can lead to a less accurate model overall.
<li>With fewer instances of the majority class, the model might overfit to the minority class, leading to poor generalization to new, unseen data.
<li>If the dataset is already small, further reducing the number of instances might result in too little data to effectively train the model.

### Summary :
<li>Severe Impact on Accuracy: After undersampling, accuracy fell to 12.9%. The model catches 143 of the 146 fraud cases, but it labels 74,431 of the 85,297 legitimate transactions as fraud.
<li>Overfitting to the Minority Class: The model became too focused on the minority class due to the undersampling, so nearly all predictions are class 1. This is why recall for class 1 is high (0.98) while precision is almost zero (143 / 74,574 ≈ 0.002, shown as 0.00).
<li>Imbalanced Performance: While the recall for class 1 improved dramatically, the overall model performance is poor because it misclassifies the majority of instances as class 1.
    
#### Why This Happened:
Undersampling Overdone: The majority class was cut from 199,018 to 432 training instances, far too few to describe what legitimate transactions look like. NearMiss also does not pick those instances at random: it keeps majority samples that lie close to minority samples, so the retained points are likely the most fraud-like legitimate transactions rather than a representative sample. A model trained on them overfits the minority class at the expense of the majority class.
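To see which majority samples a NearMiss-style rule retains, here is a simplified NumPy sketch of the NearMiss-1 idea: keep the majority points with the smallest mean distance to their k nearest minority points. It is an illustration on toy data under that assumption, not imbalanced-learn's implementation, and `nearmiss1_indices` is a hypothetical helper name.

```python
import numpy as np

def nearmiss1_indices(X_major, X_minor, n_keep, k=3):
    """Indices of the n_keep majority rows closest to the minority class."""
    # Pairwise Euclidean distances, shape (n_major, n_minor)
    d = np.linalg.norm(X_major[:, None, :] - X_minor[None, :, :], axis=2)
    # Mean distance from each majority point to its k nearest minority points
    k = min(k, X_minor.shape[0])
    mean_k = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # Keep the majority points with the smallest mean distance
    return np.argsort(mean_k, kind="stable")[:n_keep]

# Toy data: three minority points near the origin, five majority points
X_minor = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
X_major = np.array([[5.0, 5.0], [0.2, 0.2], [4.0, 4.0], [0.3, 0.1], [6.0, 5.0]])

kept = nearmiss1_indices(X_major, X_minor, n_keep=2)
print("kept majority rows:", X_major[kept].tolist())
```

The two majority points that sit next to the minority cluster are kept and the three distant ones are discarded, so the classifier never sees what a typical majority sample looks like.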

## Over Sampling
Oversampling involves increasing the number of instances of the minority class by creating duplicates or synthetic data points until the class distribution is more balanced.

In [41]:
from imblearn.over_sampling import RandomOverSampler

In [42]:
os = RandomOverSampler(sampling_strategy=0.5)
X_train_os, y_train_os = os.fit_resample(X_train, y_train)
print("The number of classes before fit():", Counter(y_train))
print("The number of classes after fit():", Counter(y_train_os))

The number of classes before fit(): Counter({0: 199018, 1: 346})
The number of classes after fit(): Counter({0: 199018, 1: 99509})


In your case, the minority class (1) was increased from 346 instances to 99,509 instances. This is approximately 50% of the majority class size (199,018 instances).
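Random oversampling itself is simple: minority rows are drawn with replacement until the target ratio is met. The NumPy sketch below shows the idea on toy data; `random_oversample` is a hypothetical helper written for illustration and is not the RandomOverSampler implementation.

```python
import numpy as np

def random_oversample(X, y, ratio, seed=0):
    """Duplicate minority rows until n_minority is about ratio * n_majority."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    n_majority, n_minority = counts.max(), counts.min()
    # Number of extra minority rows needed to reach the target ratio
    n_extra = int(n_majority * ratio) - n_minority
    if n_extra <= 0:
        return X, y
    minority_idx = np.flatnonzero(y == minority)
    # Sample existing minority rows with replacement (pure duplication)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.concatenate([X, X[extra_idx]]), np.concatenate([y, y[extra_idx]])

# Toy data: 10 majority rows and 2 minority rows
X_toy = np.arange(24, dtype=float).reshape(12, 2)
y_toy = np.array([0] * 10 + [1] * 2)

X_res, y_res = random_oversample(X_toy, y_toy, ratio=0.5)
print("before:", np.bincount(y_toy).tolist(), "after:", np.bincount(y_res).tolist())
```

Every added row is an exact copy of an existing minority row, so the resampled set contains no new information about the minority class.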

In [43]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train_os,y_train_os)

In [44]:
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85286    11]
 [   34   112]]
0.9994733330992591
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.91      0.77      0.83       146

    accuracy                           1.00     85443
   macro avg       0.96      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



## Summary 
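<li>Minority Class Detection: With oversampling, the random forest reaches a recall of 0.77 and an F1-score of 0.83 on the minority class (1), clearly better than the logistic regression baseline (recall 0.64, F1-score 0.71). However, the confusion matrix and report shown here are identical to those of the random forest trained on the original, imbalanced data, so this output does not demonstrate a gain from oversampling itself. Re-running both cells with a fixed random_state would confirm whether the two models really behave the same.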
<li>High Overall Accuracy: The overall accuracy is very high, which is expected given the large number of majority class instances that the model can easily classify correctly.

#### Advantages:
<li>Increased Representation: By adding more instances of the minority class, the model gets more examples to learn from, which can improve its ability to detect and classify the minority class.
<li> Oversampling can create a more balanced dataset, which helps the model learn to make predictions for both classes more effectively.

#### Disadvantages:
<li>Duplicated Data: Simply duplicating existing minority class instances can lead to overfitting, where the model learns to memorize these instances rather than generalizing well.
<li>Larger Dataset: Oversampling increases the size of the training dataset, which can lead to longer training times and higher computational costs.

## SMOTETomek

In [51]:
from imblearn.combine import SMOTETomek

In [53]:
os = SMOTETomek(sampling_strategy=0.5)
X_train_os, y_train_os = os.fit_resample(X_train, y_train)
print("The number of classes before fit():", Counter(y_train))
print("The number of classes after fit():", Counter(y_train_os))

The number of classes before fit(): Counter({0: 199018, 1: 346})
The number of classes after fit(): Counter({0: 199018, 1: 99509})


In [54]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train_os,y_train_os)

In [55]:
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85273    24]
 [   29   117]]
0.9993797034280163
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.83      0.80      0.82       146

    accuracy                           1.00     85443
   macro avg       0.91      0.90      0.91     85443
weighted avg       1.00      1.00      1.00     85443



#### Summary
<li>The model reaches very high accuracy (99.94%), and precision, recall, and F1-score for the majority class (0) all round to 1.00, with 24 of the 85,297 legitimate transactions flagged as fraud.
<li>For the minority class (1), the precision (83%), recall (80%), and F1-score (82%) are strong. Compared with the earlier random forest results, recall rises from 0.77 to 0.80 while precision falls from 0.91 to 0.83, leaving the F1-score almost unchanged (0.83 versus 0.82).
    
#### Impact of SMOTETomek:
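<li>Balanced Classes: SMOTETomek raised minority recall (117 of 146 fraud cases caught, versus 112 before) at the cost of more false positives (24 versus 11), while performance on the majority class stayed high.
<li>Reduced Noise: The Tomek Links component is meant to clean up ambiguous borderline samples. In this run, however, the printed class counts (199,018 and 99,509) are exactly what SMOTE alone would produce at sampling_strategy=0.5, which suggests that few or no samples were removed, so the change in results is more likely due to the synthetic samples.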

#### Advantages 
<li>By combining SMOTE and Tomek Links, SMOTETomek addresses both class imbalance and noise, creating a more balanced dataset.
<li>It can improve model performance by reducing class imbalance and removing ambiguous or noisy data points.
<li>Data Cleaning: By removing Tomek Links instances, it helps to reduce noise in the dataset, which can improve the overall quality and reliability of the model.
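A Tomek link is a pair of samples from different classes that are each other's nearest neighbor. The NumPy sketch below finds such pairs on toy data using a brute-force distance matrix; `tomek_links` is a hypothetical helper for illustration, not the imbalanced-learn implementation, and it would not scale to the full training set.

```python
import numpy as np

def tomek_links(X, y):
    """Return (i, j) index pairs that are mutual nearest neighbors with different labels."""
    # Brute-force pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    nn = d.argmin(axis=1)        # nearest neighbor of every point
    links = []
    for i, j in enumerate(nn):
        # Mutual nearest neighbors from opposite classes form a Tomek link
        if i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy data on a line: points 2 and 3 are close together but have different labels
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0], [3.0, 0.0]])
y_toy = np.array([0, 0, 0, 1, 1])

links = tomek_links(X_toy, y_toy)
print("Tomek links:", links)
```

Removing one or both members of each link is what pulls the two classes slightly apart at the decision boundary.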
    
#### Disadvantages
<li>Increased Processing Time: Combining SMOTE and Tomek Links can be more computationally intensive compared to using each technique separately. This may lead to increased processing time and resource usage.
<li>Synthetic Samples: SMOTE introduces synthetic samples, which can sometimes lead to overfitting, especially if the synthetic samples do not represent realistic variations of the minority class.
<li>Larger Dataset: SMOTE increases the size of the dataset by creating synthetic samples, which can lead to longer training times and potentially require more memory.
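The synthetic samples mentioned above are interpolations between a minority point and one of its nearest minority neighbors. The NumPy sketch below shows that core step on toy data; `smote_sample` and its parameters are hypothetical names for illustration, and the real SMOTE implementation handles neighbor search and per-class sample counts differently.

```python
import numpy as np

def smote_sample(X_minor, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Brute-force distances among minority points only
    d = np.linalg.norm(X_minor[:, None, :] - X_minor[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbors per point
    # Pick a random base point and one of its neighbors for each new sample
    base = rng.integers(0, len(X_minor), size=n_new)
    pick = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between them
    lam = rng.random((n_new, 1))
    return X_minor[base] + lam * (X_minor[pick] - X_minor[base])

# Toy minority class: the four corners of the unit square
X_minor = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(X_minor, n_new=6)
print(synthetic.round(2))
```

Because each new point lies on a line segment between two real minority points, the synthetic data can only fill in the region the minority class already occupies; whether that region is realistic depends on the data.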