## **OVERVIEW**

### **Knowing The Dataset**

* contains transactions made by credit cards in September 2013 by European cardholders
* presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions
* characterised by extreme class imbalance; the positive class (frauds) accounts for only 0.172% of all transactions
* number of rows    : 284,807
  number of columns : 31
* contains only numerical input variables which are the result of a PCA transformation
* due to confidentiality issues, the original features and further background information about the data are not provided
* features V1, V2, … V28 are the principal components obtained with PCA
* features which have not been transformed with PCA are 'Time' and 'Amount'. 
* feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset
* feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning
* feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
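
As a quick check of the imbalance figure quoted above, the fraud share can be recomputed from the two counts given in the list. This is only a sketch; the variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the dataset description above
n_transactions = 284_807
n_frauds = 492

fraud_share = n_frauds / n_transactions                    # fraction of rows in the positive class
legit_per_fraud = (n_transactions - n_frauds) / n_frauds   # legitimate rows per fraud row

print(f"fraud share: {fraud_share:.4%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

Roughly one transaction in several hundred is fraudulent, which is why the resampling step later in the notebook matters.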

### **Problem Statement**

Develop a classification algorithm, such as logistic regression or a random forest, on the creditcard dataset to
differentiate between fraudulent and legitimate transactions.

### **Approach**

1. Create a Random Forest classifier from scratch, reusing the decision tree class created for a previous project
2. Load the creditcard dataset, analyse it, and preprocess it
3. Train the model on the data
4. Evaluate the model using relevant metrics
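
The first step of the approach can be sketched as follows. The internals of the author's `DecisionTree_Catogerizer` and `RandomForest_Catogerizer` are not shown in this notebook, so this sketch assumes scikit-learn's `DecisionTreeClassifier` as a stand-in tree and uses a hypothetical `TinyForest` class. It illustrates the general idea (bootstrap the rows, fit one tree per sample, take a majority vote) and is not the author's implementation.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # stand-in for the custom tree class


class TinyForest:
    """Hypothetical minimal forest: bootstrap rows, fit one tree per sample, vote."""

    def __init__(self, n_trees=5, max_depth=4, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        self.trees = []
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n, size=n)    # sample rows with replacement
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features="sqrt",                   # random feature subset at each split
                random_state=int(self.rng.integers(1_000_000)),
            )
            tree.fit(X[rows], y[rows])
            self.trees.append(tree)
        return self

    def predict(self, X):
        votes = np.array([t.predict(X) for t in self.trees])   # shape (n_trees, n_samples)
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])


# Toy usage on synthetic, well-separated data (not the creditcard dataset)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])
y_demo = np.array([0] * 50 + [1] * 50)
forest = TinyForest().fit(X_demo, y_demo)
demo_pred = forest.predict(X_demo)
print("training accuracy:", (demo_pred == y_demo).mean())
```

An odd number of trees is used here so that a binary majority vote cannot tie.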

### **Learning Outcomes**

1. concepts in random forest classifier
2. evaluating binary classification model
3. dealing with class imbalance data
4. data preprocessing
5. feature importance in classification problem

## **SOLUTION**

### **STEP 0** : Importing Required Dependencies

In [24]:
import pandas as pd
import numpy as np
from random import choice
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score

# importing the classes I created

from DecisionTree import DecisionTree_Catogerizer
from Forest import RandomForest_Catogerizer

### **STEP 1 :** Loading the Data

In [2]:
data = pd.read_csv("creditcard.csv")

In [3]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
data.shape

(284807, 31)

In [5]:
data.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

### **STEP 2** : Data Preprocessing

#### 1. **Missing Values**

In [6]:
print(data.isnull().sum())

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64


No missing values in the dataset

#### 2. **Duplicate Records**

In [7]:
data.duplicated(keep='first').sum()

np.int64(1081)

In [8]:
# to retain duplicate records of fraudulent transactions, store them in a separate data frame
essential = data[data.duplicated()].loc[(data['Class'] == 1)]

Of the 1081 duplicated rows, 19 contain information about fraudulent transactions. These 19 records should be retained in the dataset because there is very little data on fraudulent transactions.

In [9]:
# drop all duplicated records from the dataframe and store it 
data_1 = data.drop_duplicates()

In [10]:
# concatenate the data without duplicates and data of duplicated fraudulent transactions
data = pd.concat([data_1, essential], axis = 0)

In [11]:
# whether the dataset has duplicate records of legitimate transactions
data[data.duplicated()].loc[(data['Class'] == 0)]

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class


Successfully removed all duplicated records regarding legitimate transactions
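
The duplicate-handling steps above can be replayed on a tiny frame to see which rows survive. The toy data and the names `toy`, `dup_mask`, `fraud_dups` and `deduped` are made up for illustration.

```python
import pandas as pd

# Toy frame: rows 0/1 are duplicate legitimate rows, rows 2/3 are duplicate frauds
toy = pd.DataFrame({
    "Amount": [10.0, 10.0, 99.0, 99.0, 5.0],
    "Class":  [0,    0,    1,    1,    0],
})

dup_mask = toy.duplicated(keep="first")            # True for repeats after the first copy
fraud_dups = toy[dup_mask & (toy["Class"] == 1)]   # repeated fraud rows we want to keep

# Drop every repeat, then add the repeated fraud rows back
deduped = pd.concat([toy.drop_duplicates(), fraud_dups], axis=0)

print(deduped)
print(deduped["Class"].value_counts())
```

On this toy frame the same rows could also be selected with a single mask, `toy[~dup_mask | (toy["Class"] == 1)]`, which avoids the intermediate frames; the two-step version mirrors the cells above.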

#### 3. **Resampling (Dealing with Class Imbalance)**

**Effects of Class Imbalance on Classification**

* **Increased false positives** : Since fraudulent transactions are very rare in the dataset, even a small error rate on the majority class means that many legitimate transactions get tagged as fraudulent. This results in customer dissatisfaction.
* **Detection Difficulty** : With so few fraudulent transactions in the dataset, the model may not learn the patterns that characterise fraud. It may then classify fraudulent transactions as legitimate, which is dangerous because it leads to customers losing money.
* **Misleading Metrics** : Traditional evaluation metrics such as accuracy can be misleading on class-imbalanced data, because a model can achieve high accuracy by simply classifying every instance as the majority class (legitimate) while completely failing to detect fraudulent transactions.
* **Overfitting to majority class**
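
The misleading-metrics point can be made concrete with a baseline that never predicts fraud. This sketch builds labels with the same class counts as the dataset description (492 frauds in 284,807 rows); the arrays are synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels with the class counts from the dataset description
y_true = np.zeros(284_807, dtype=int)
y_true[:492] = 1

# A "model" that always predicts the majority class (legitimate)
y_always_legit = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_legit)
baseline_recall = recall_score(y_true, y_always_legit, zero_division=0)

print(f"accuracy: {baseline_accuracy:.4f}")
print(f"recall on fraud: {baseline_recall:.4f}")
```

The accuracy comes out above 99.8% even though not a single fraud is detected, which is why precision, recall and F1 are reported later in the notebook.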


**Solution**

**Resampling Techniques** : Oversample the minority class or undersample the majority class to balance the dataset. One of these techniques should be applied before building the model so that its performance is not compromised.



In [12]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values.reshape(-1,1)

To fix the class imbalance before building our model, we undersample the majority class so that the number of genuine transactions equals the number of fraudulent transactions.

Oversampling the minority class is also an option. It gives the model more examples to learn from, but it would increase the size of the training set to around 400,000 records, which would be computationally expensive. For the purpose of this learning assignment, I lean towards undersampling.

In [13]:
# Create a RandomUnderSampler object
rus = RandomUnderSampler(random_state=42)

# Fit and transform the data
X, y = rus.fit_resample(X, y)

# Print class distribution
print(pd.Series(y).value_counts())

0    492
1    492
Name: count, dtype: int64
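
For intuition, random undersampling can be written by hand in a few lines of NumPy. The helper `undersample_majority` below is hypothetical and only mirrors the idea (keep every minority row, draw an equally sized random sample from each larger class); it is not a description of imblearn's internals.

```python
import numpy as np


def undersample_majority(X, y, seed=42):
    """Hypothetical helper: keep all minority rows and an equal-sized random
    sample (without replacement) from every larger class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).ravel()
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()                          # size of the smallest class
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)               # row positions of class c
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]


# Toy data: 95 legitimate rows, 5 fraud rows
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(np.bincount(y_bal))
```

The cost of this approach is that most majority-class rows are discarded, which is the trade-off accepted above in exchange for a small training set.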


#### **Standardization**

In [30]:
data.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,...,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0,283745.0
mean,0.548698,0.005141,-0.003529,0.000636,-0.002546,0.001293,-0.001175,0.000773,-0.001454,-0.001867,...,8.9e-05,-0.000135,0.000289,0.00021,-0.000228,0.000155,0.001702,0.000548,-1.642729e-17,0.001734
std,0.27478,1.951316,1.648781,1.514705,1.415196,1.379224,1.332279,1.236046,1.190455,1.096214,...,0.733504,0.725583,0.624098,0.605618,0.521219,0.482067,0.39658,0.328085,1.000002,0.041605
min,0.0,-56.40751,-72.715728,-48.325589,-5.683171,-113.743307,-26.160506,-43.557242,-73.216718,-13.434066,...,-34.830382,-10.933144,-44.807735,-2.836627,-10.295397,-2.604551,-22.565679,-15.430084,-0.3533331,0.0
25%,0.313718,-0.916191,-0.600272,-0.889801,-0.850082,-0.690003,-0.769102,-0.552676,-0.208838,-0.64431,...,-0.228302,-0.542715,-0.161705,-0.354468,-0.317476,-0.326759,-0.070642,-0.052821,-0.3309683,0.0
50%,0.490156,0.020241,0.063994,0.179928,-0.022112,-0.053571,-0.275211,0.040783,0.021903,-0.052596,...,-0.029436,0.006675,-0.011154,0.041015,0.016303,-0.052173,0.00148,0.011288,-0.2654712,0.0
75%,0.806154,1.316034,0.800414,1.026882,0.739927,0.612182,0.396794,0.570453,0.325737,0.595876,...,0.18622,0.52826,0.147765,0.43971,0.350669,0.24027,0.091217,0.078281,-0.04373993,0.0
max,1.0,2.45493,22.057729,9.382558,16.875344,34.801666,73.301626,120.589494,20.007208,15.594995,...,27.202839,10.50309,22.528412,4.584549,7.519589,3.517346,31.612198,33.847808,102.25,1.0


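The PCA components V1 to V28 are already centred: their means are all close to 0. Their standard deviations are not equal to 1, though; in the columns shown they range from about 1.95 (V1) to about 0.33 (V28). 'Time' and 'Amount' are the two features that were not produced by PCA, so they are the ones scaled below. This summary appears to have been generated after the scaling cells were run, which is why 'Time' already lies between 0 and 1 and 'Amount' already has a standard deviation of 1.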

##### **How to Choose between Scaling Methods?**

|  | Standard Scaler | Min-max Scaler |
|----------|----------|----------|
| **Scaling Method**    | Scales data so that its mean is 0 and its standard deviation is 1  | Scales data so that it lies in a fixed range, 0 to 1 by default   |
| **Change in Data Distribution**    | Linear transform, so the shape of the distribution is preserved  | Also a linear transform, but with skewed data or extreme values most points end up compressed into a narrow part of the range  |
| **Outliers**    | Still influenced by outliers, since they shift the mean and standard deviation, but the output range is not fixed by them   | More sensitive to outliers, since the minimum and maximum define the range  |
| **Interpretability**   | Values are expressed in standard deviations from the mean, so the original scale is lost | Values can be read as a position between the observed minimum and maximum   |
| **Usage**  | Suitable for features whose outliers would squash a min-max range, and for algorithms that rely on distance calculation   | Suitable for bounded data, algorithms that are sensitive to the scale of features, and when you want to interpret the scaled values as proportions or percentages   |


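**Time** is a bounded feature and has contextual importance when it comes to interpretability. Therefore, Time is scaled using the min-max scaler. The **Amount** feature has exceptionally low as well as exceptionally high values, i.e. outliers, which would squeeze most values into a narrow band under min-max scaling. Hence, Amount is scaled using the standard scaler. Note that X and y were extracted from data in an earlier cell, so the scaled columns reach the model only if X is re-extracted after these cells; threshold-based decision trees are unaffected by this kind of monotonic rescaling either way.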

In [14]:
m_scale = MinMaxScaler()
data['Time'] = m_scale.fit_transform(pd.DataFrame(data['Time']))

In [15]:
s_scale = StandardScaler()
data['Amount'] = s_scale.fit_transform(pd.DataFrame(data['Amount']))
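
The two transforms used above reduce to one-line formulas. The sketch below applies them by hand to a small made-up column and compares the result with scikit-learn's scalers; the array `values` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[0.0], [2.0], [4.0], [10.0]])   # one feature, four samples

# Min-max: (x - min) / (max - min), so the result lies in [0, 1]
manual_minmax = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
manual_standard = (values - values.mean()) / values.std()

# Compare with the library implementations
same_minmax = np.allclose(manual_minmax, MinMaxScaler().fit_transform(values))
same_standard = np.allclose(manual_standard, StandardScaler().fit_transform(values))

print(manual_minmax.ravel(), same_minmax)
print(manual_standard.ravel(), same_standard)
```

Both are linear rescalings of the column; they differ only in which two statistics (minimum and maximum, or mean and standard deviation) fix the new scale.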

#### **Train - Test Split**

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=692024)

### **STEP 3 :** Implementing the Random Forest Classifier Model

#### 1. **Create the model instance**

In [17]:
model = RandomForest_Catogerizer(6)

#### 2. **Train the Model using data**

In [18]:
y_train = y_train.reshape(-1,1)
y_train.shape

(787, 1)

In [19]:
model.fit_to_forest(X_train, y_train)

**Important features for classification** 

In [39]:
most_imp_features = model.find_xfactors().most_common(15)


The feature importance is calculated by counting how many times each feature is used for splitting in the decision trees.
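
A split-counting importance of this kind can be sketched as below. The code behind `find_xfactors` is not shown in this notebook, so the sketch assumes scikit-learn trees as a stand-in and a hypothetical helper `count_split_features`; it illustrates the counting idea rather than the author's implementation.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def count_split_features(trees):
    """Count how often each feature index is used as a split across fitted trees."""
    counts = Counter()
    for tree in trees:
        split_features = tree.tree_.feature    # one entry per node; negative for leaves
        counts.update(int(f) for f in split_features if f >= 0)
    return counts


# Toy data: only feature 2 carries the class signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 2] > 0).astype(int)

trees = [DecisionTreeClassifier(max_depth=2, random_state=s).fit(X_toy, y_toy) for s in range(3)]
split_counts = count_split_features(trees)
print(split_counts.most_common(3))
```

Counting splits treats a split near the root the same as one deep in a tree, so it is a coarser measure than impurity-based importance; it is, however, simple to compute from a from-scratch forest.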

In [47]:
columns = data.columns.to_list()
print("The top 15 most important features for this classification problems are :\n")
# each entry of most_imp_features is a (column index, split count) pair
for idx, _count in most_imp_features:
    print(columns[idx], end = " ")


The top 15 most important features for this classification problems are :

V14 V11 V17 V3 V4 V12 V19 Amount V16 V10 V20 V1 V21 V23 V18 

#### 3. **Test the Model on unseen data**

In [21]:
y_pred = model.Predict(X_test)

#### 4. **Evaluate model performance**

**Accuracy Score**

In [28]:
round(accuracy_score(y_test, y_pred),4)

0.9239

**Precision Score**

In [25]:
precision_score(y_test, y_pred)

np.float64(0.9375)

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

**Recall Score**

In [27]:
round(recall_score(y_test, y_pred),4)

np.float64(0.9091)

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

**F1-Score**

In [None]:
f1_score(y_test, y_pred)

The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal.
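
The four metrics can be tied together through a confusion matrix. The notebook does not print one, so the counts below are hypothetical: they were chosen because they sum to the 197 test rows and reproduce the accuracy, precision and recall reported above. Treat them as an assumption, not as notebook output.

```python
# Hypothetical confusion-matrix counts for the 197-row test set (inferred, not notebook output)
tp, fp, fn, tn = 90, 6, 9, 92

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

Under these assumed counts the F1 score works out to about 0.9231, between the precision and the recall and closer to the lower of the two, as expected for a harmonic mean.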
