In [1]:
from random import randrange, uniform
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score


In [2]:
df = pd.read_csv("datasets/creditcard.csv")

In [3]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


The dataset contains far more negative (legitimate) samples than positive (fraudulent) samples, as the class counts show.

In [4]:
df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

The **class imbalance** is clearly visible here: only 492 of the 284,807 transactions (about 0.17%) are fraudulent.

For simplicity, we remove the time dimension.

In [5]:
df = df.drop(['Time'],axis=1)

Split the dataset into features and target.

In [6]:
X = df.drop(['Class'], axis=1)
y = df['Class']

Train-test split.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train a Random Forest Classifier and predict on the test set.

In [8]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train,y_train)

y_pred= rf.predict(X_test)

Suppose our dataset contained 100 fraudulent and 9,900 regular transactions. If we used accuracy to measure the model's performance, it could reach 99% accuracy simply by predicting "not fraudulent" every time, without detecting a single fraud. For this reason, we use a confusion matrix to evaluate the model's performance.
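The arithmetic in this example can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels (not the credit card data), with an assumed "model" that always predicts the negative class.

```python
# Toy labels matching the example: 100 fraudulent (1) and 9,900 regular (0).
y_true = [1] * 100 + [0] * 9900
# An assumed "model" that predicts the negative class for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of truly fraudulent transactions that were detected.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

Accuracy looks excellent while recall shows that no fraud was caught at all.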

In [9]:
confusion_matrix(y_test, y_pred)

array([[56862,     2],
       [   23,    75]], dtype=int64)

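As we can see, our model classified 23 samples as non-fraudulent when, in fact, they were fraudulent (false negatives). It also flagged 2 legitimate transactions as fraudulent (false positives).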

For ease of comparison, if we wanted a single number to gauge the model’s performance, we could use recall.

**Recall** is equal to the number of true positives divided by the sum of true positives and false negatives. In other words, it is the share of actual frauds that the model detects.

In [10]:
recall_score(y_test, y_pred)

0.7653061224489796
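The same value can be recovered by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout for binary labels, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the baseline model: [[TN, FP], [FN, TP]]
cm = [[56862, 2], [23, 75]]
(tn, fp), (fn, tp) = cm

# Recall = TP / (TP + FN)
recall = tp / (tp + fn)
print(round(recall, 4))  # 0.7653
```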

In [11]:
accuracy_score(y_test, y_pred)

0.9995611109160493

**Accuracy** is very high but **not reliable** because of the **class imbalance**.

# SMOTE from scratch

The next code cell defines the `SMOTE` function; the paragraphs below describe it step by step.

The algorithm assumes that if N > 100, then it is a multiple of 100. If N is less than 100, we use only a subset of the samples. For example, if N were 50 (i.e. 50%), the synthetic array would have 50 / 100 * T = 0.5T rows, where T is the number of rows belonging to the minority class. In that case we also set N to 100 so that it becomes 1 after the division on the following line, since we only want to generate one synthetic sample, from one nearest neighbor, for each point in the subset.

We use k+1, here, because the implementation of NearestNeighbors considers the point itself as one of the neighbors. In other words, we want the k nearest neighbors excluding the point itself.
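A small check of this behaviour, as a sketch on four made-up, distinct points: when the fitted points are passed back to `kneighbors`, the first index returned for each point is the point itself, at distance 0. With duplicate rows the order of tied neighbors is not guaranteed, so this is an illustration rather than a general rule.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four distinct toy points on a line (illustrative data only).
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
k = 2

# Ask for k + 1 neighbors because each query point matches itself first.
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(points)
distances, indices = nbrs.kneighbors(points)

print(indices[:, 0])   # the point itself: [0 1 2 3]
print(indices[:, 1:])  # the k neighbors excluding the point
```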

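In Python, `nonlocal` binds a name to the variable with the same name in the nearest enclosing function scope (here, the body of `SMOTE`). Without it, the assignment `new_index += 1` inside `populate` would refer to a new local variable instead of updating the outer counter.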

We select a nearest neighbor at random by drawing an index from 1 to k inclusive (`randrange(1, k+1)` excludes its upper bound). Index 0 is skipped because, as mentioned before, the implementation of NearestNeighbors returns the point itself as one of the nearest neighbors.

We create a new example by taking the difference between a random neighbor and the point being considered, multiplying it by a random number between 0 and 1, and adding the result to the point being considered.
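The interpolation step can be sketched in isolation on one made-up point and neighbor. As in the `SMOTE` function in this notebook, a fresh `gap` is drawn for every attribute, so each synthetic coordinate lies between the corresponding coordinates of the point and its neighbor.

```python
from random import uniform

# Illustrative values only: one minority point and one of its neighbors.
point = [1.0, 4.0, -2.0]
neighbor = [3.0, 0.0, -2.5]

synthetic_point = []
for p, q in zip(point, neighbor):
    gap = uniform(0, 1)          # random fraction of the way to the neighbor
    dif = q - p                  # neighbor minus point, for this attribute
    synthetic_point.append(p + gap * dif)

print(synthetic_point)
```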

We move on to the next available index in the synthetic array and decrement N to indicate that one of the N/100 synthetic samples for this point has been generated.

We obtain the nearest neighbors for a point in the original array and call the populate function.

In [12]:

def SMOTE(sample: pd.DataFrame, N: int, k: int) -> np.ndarray:
    """Generate synthetic minority rows; N is the amount of SMOTE in percent."""
    T, num_attrs = sample.shape

    # If N is less than 100%, shuffle the minority rows so that a random
    # N% of them is SMOTEd rather than always the first rows
    if N < 100:
        sample = sample.sample(frac=1)
        T = round(N / 100 * T)
        N = 100
    # The amount of SMOTE is assumed to be an integral multiple of 100
    N = int(N / 100)
    synthetic = np.zeros([T * N, num_attrs])
    new_index = 0
    # k + 1 because each point is returned as its own nearest neighbor
    nbrs = NearestNeighbors(n_neighbors=k+1).fit(sample.values)

    def populate(N, i, nnarray):
        # Only new_index is reassigned here, so only it needs nonlocal
        nonlocal new_index
        while N != 0:
            # Index 0 is the point itself, so choose among indices 1..k
            nn = randrange(1, k+1)
            for attr in range(num_attrs):
                # Positional [row, column] indexing; sample.iloc[i][attr] looks up
                # an integer key on a Series, which pandas has deprecated
                dif = sample.iloc[nnarray[nn], attr] - sample.iloc[i, attr]
                gap = uniform(0, 1)
                synthetic[new_index][attr] = sample.iloc[i, attr] + gap * dif
            new_index += 1
            N = N - 1

    for i in range(T):
        nnarray = nbrs.kneighbors(sample.iloc[i].values.reshape(1, -1), return_distance=False)[0]
        populate(N, i, nnarray)

    return synthetic
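The attribute-by-attribute loop above is easy to follow but slow on wide data. As an alternative, here is a hedged sketch of a vectorized variant. The name `smote_vectorized` and the `seed` parameter are our own assumptions, it only handles N as a multiple of 100, and it assumes the rows are distinct so that each point is its own first neighbor.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_vectorized(sample, N=100, k=5, seed=None):
    """Hypothetical vectorized sketch; assumes N is a multiple of 100."""
    rng = np.random.default_rng(seed)
    data = np.asarray(sample, dtype=float)
    T, num_attrs = data.shape
    per_point = N // 100  # synthetic rows generated per minority row

    # Column 0 is assumed to be the point itself (distinct rows), so drop it.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(data)
    neighbor_idx = nbrs.kneighbors(data, return_distance=False)[:, 1:]

    # Repeat each base row per_point times and pick one of its k neighbors.
    base_idx = np.repeat(np.arange(T), per_point)
    choice = rng.integers(0, k, size=base_idx.size)
    chosen = neighbor_idx[base_idx, choice]

    # One gap per attribute, mirroring the loop-based function.
    gaps = rng.random((base_idx.size, num_attrs))
    return data[base_idx] + gaps * (data[chosen] - data[base_idx])


# Usage on made-up data: 20 rows, 3 attributes, 200% oversampling.
demo = np.random.default_rng(0).normal(size=(20, 3))
out = smote_vectorized(demo, N=200, k=3, seed=0)
print(out.shape)  # (40, 3)
```

Because every synthetic value is an interpolation between two existing rows, the output never leaves the per-attribute range of the input.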

Prior to running the algorithm, we select all the fraudulent rows within our dataset.

In [13]:
minority = df[df['Class'] == 1].drop(['Class'], axis=1)

We set k to 5, meaning that for each row we randomly pick N/100 of its k = 5 nearest neighbors to use in our calculations (assuming N ≥ 100; the same neighbor may be picked more than once). We set N to 200, meaning that we generate two synthetic examples per fraudulent row, i.e. 200% more fraudulent examples.
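To make the role of N concrete, the sketch below reproduces only the sizing logic of the `SMOTE` function. The helper name `synthetic_row_count` is hypothetical and exists just for this illustration.

```python
def synthetic_row_count(T, N):
    """Hypothetical helper: rows generated for T minority rows at N percent."""
    if N < 100:
        # Only N% of the minority rows are used, one synthetic row each.
        T = round(N / 100 * T)
        N = 100
    return T * int(N / 100)


print(synthetic_row_count(492, 200))  # 984
print(synthetic_row_count(492, 100))  # 492
print(synthetic_row_count(492, 50))   # 246
```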

In [14]:
synthetic = SMOTE(minority, N=200, k=5)


In [15]:
synthetic.shape

(984, 29)

As we can see, the array of synthetic examples has twice as many rows as the original minority class (984 = 2 × 492), with 29 columns (V1 to V28 plus Amount).

Next, we concatenate the original samples with the samples we just generated and set the label to 1 for a total of 984 + 492 = 1476 samples.

In [16]:
synthetic_df = pd.DataFrame(synthetic, columns=minority.columns)
combined_minority_df = pd.concat([minority, synthetic_df])
combined_minority_df["Class"] = 1

Finally, we combine the fraudulent and non-fraudulent samples into a single DataFrame.

In [17]:
new_df = pd.concat([combined_minority_df, df[df['Class'] == 0]])

Like we did before, we split the data into training and testing datasets, train the model and classify the rows in the testing dataset.

In [18]:
X = new_df.drop(['Class'], axis=1)
y = new_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [19]:
confusion_matrix(y_test, y_pred)

array([[56834,    10],
       [   59,   256]], dtype=int64)

The number of false negatives rose from 23 to 59, but the test set now contains 315 positive samples instead of 98 (roughly 3 times more), so a smaller share of the positive samples is missed.

In [20]:
recall_score(y_test, y_pred)

0.8126984126984127

The recall score (about 0.813) is higher than that of the model trained on the dataset without SMOTE (about 0.765).

In [21]:
accuracy_score(y_test, y_pred)

0.9987928410224112

We can see that **accuracy** has **decreased** slightly (from 0.99956 to 0.99879) after dealing with class imbalance with the **manually coded SMOTE**.
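Both metrics can be recomputed side by side from the two confusion matrices printed in this notebook, which shows how accuracy can fall while recall rises. The helper name `accuracy_and_recall` is our own; the matrices are copied from the outputs above.

```python
def accuracy_and_recall(cm):
    """Hypothetical helper for a binary matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    return accuracy, recall


baseline = [[56862, 2], [23, 75]]        # model trained without SMOTE
manual_smote = [[56834, 10], [59, 256]]  # model trained with the manual SMOTE

for name, cm in (("baseline", baseline), ("manual SMOTE", manual_smote)):
    accuracy, recall = accuracy_and_recall(cm)
    print(f"{name}: accuracy={accuracy:.5f}, recall={recall:.4f}")
```

The total number of errors grows from 25 to 69 on a test set of similar size, which lowers accuracy, while the share of detected frauds grows from 75 of 98 to 256 of 315.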

# SMOTE using a library

In [22]:
from imblearn.over_sampling import SMOTE

In [23]:
df = pd.read_csv("datasets/creditcard.csv")
df = df.drop(['Time'], axis=1)
X = df.drop(['Class'], axis=1)
y = df['Class']

In [24]:
sm = SMOTE(random_state=42, k_neighbors=5)

In [25]:
X_res, y_res = sm.fit_resample(X, y)

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)


In [27]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [28]:
confusion_matrix(y_test, y_pred)

array([[56737,    13],
       [    0, 56976]], dtype=int64)

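If we look at the confusion matrix, we can see that the test set holds a roughly equal number of positive and negative samples (56,976 versus 56,750) and that the model didn't produce any false negatives. So, the recall is 1. Because SMOTE was applied before the split, most of the positive test samples are synthetic points interpolated from minority rows that may also appear in the training set.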

In [29]:
recall_score(y_test, y_pred)

1.0

In [30]:
accuracy_score(y_test, y_pred)

0.9998856901675958

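Here **accuracy** has **increased** slightly relative to the baseline (0.99989 versus 0.99956) after dealing with class imbalance with the **SMOTE library**. Note that this test set was itself balanced by resampling, so the figure is not directly comparable with the baseline.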