# Overview of the project

I started the project with data visualization and **univariate and bivariate analysis**, which surfaced some important insights. Since the class distribution was highly skewed, I performed **undersampling** on the data before creating and training the models. I then trained several machine learning models (**Logistic Regression, k-Nearest Neighbours, Decision Tree, and Random Forest**), achieving a classification recall of 93%. I benchmarked the models on the basis of their F1 scores and **Confusion Matrix**, plotted the **ROC curves**, and concluded that Logistic Regression gives the best results.

[Dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv)


# Exploratory Data Analysis

***UNIVARIATE ANALYSIS***

Univariate analysis refers to data analysis that examines one variable at a time. Its main goal is to summarize the data. We can easily identify measures of central tendency (mean, median, mode) and of spread (the quartiles and the standard deviation).
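As a minimal sketch of such a summary, the snippet below uses Python's standard `statistics` module on a small list of made-up transaction amounts. The values are hypothetical and are not taken from the dataset.

```python
import statistics

# Hypothetical transaction amounts, used only for illustration.
amounts = [12.5, 3.0, 99.9, 3.0, 45.0, 7.25, 3.0, 150.0]

summary = {
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "mode": statistics.mode(amounts),
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    "quartiles": statistics.quantiles(amounts, n=4),
    # stdev() is the sample standard deviation (divides by n - 1).
    "std_dev": statistics.stdev(amounts),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

On a full table, the same numbers are usually obtained per column with a dataframe library rather than by hand.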


***BIVARIATE ANALYSIS***

Bivariate analysis examines two variables together to identify the relationship between them. There are three types of bivariate analysis:
1. Numerical-Numerical (scatter plot, linear correlation)
2. Categorical-Categorical (Chi-Square)
3. Numerical-Categorical (Z-test, T-test)
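The numerical-numerical and numerical-categorical cases can be sketched in plain Python as follows. This is only an illustration: the helper names and the sample values are hypothetical, and the t statistic shown is Welch's variant (unequal variances), which is one of several possible choices.

```python
import math


def pearson_correlation(xs, ys):
    """Linear correlation between two numerical variables (range -1 to 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def welch_t_statistic(group_a, group_b):
    """t statistic comparing the means of a numerical variable in two categories."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (len(group_a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (len(group_b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


# Hypothetical data: two perfectly linearly related features.
r = pearson_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Hypothetical amounts split by a binary category (e.g. fraud vs non-fraud).
t = welch_t_statistic([10, 20, 30], [40, 50, 60])

print(f"correlation: {r:.3f}, t statistic: {t:.3f}")
```

A correlation near 1 or -1 indicates a strong linear relationship, and a t statistic far from 0 suggests the two categories have different means.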


[Exploratory Data Analysis Reference Article](https://www.analyticsvidhya.com/blog/2021/04/exploratory-analysis-using-univariate-bivariate-and-multivariate-analysis-techniques/)

[Exploratory Data Analysis Reference Video](https://www.youtube.com/watch?v=-o3AxdVcUtQ)


***CONCLUSIONS FROM EDA***
1. The data consisted of around 285,000 data points, 30 features including time and amount, and a class label indicating whether a transaction is actually fraudulent or not.
2. There were no null values present in the original dataset, but the data was highly skewed, with 99.83% of the data points being non-fraudulent transactions.
3. The time feature had a bimodal distribution, i.e. two peaks separated by dips. I concluded that the dips might be due to fewer transactions during nighttime.
4. A very small proportion of transactions had amounts > 10,000, hence they were removed from the dataset.
5. Most of the fraudulent transactions were of small amounts (< 1000 units, since we don't know the currency of the amounts).
6. The occurrence of fraudulent transactions was independent of the time of the day.





# Data Preprocessing

SCALING

[Reference Video 1](https://www.youtube.com/watch?v=goMoUHl8q6c)

[Reference Video 2](https://www.youtube.com/watch?v=mnKm3YP56PY)

[Reference Article](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)

SPLITTING

[Reference Article 1](https://towardsdatascience.com/data-splitting-technique-to-fit-any-machine-learning-model-c0d7f3f1c790)

[Reference Article 2](https://towardsdatascience.com/splitting-a-dataset-e328dab2760a)

UNDERSAMPLING

Undersampling is a technique designed to balance the class distribution of highly skewed classification data. An imbalanced dataset is one in which one or more classes have few examples (the minority class) and one or more classes have many examples (the majority class). Undersampling involves removing majority-class examples from the training dataset in order to balance the class distribution. In my case, I reduced it to a 1:1 class distribution of fraudulent to non-fraudulent transactions by randomly selecting examples from the majority class and deleting them from the training dataset (Random Undersampling).
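A minimal sketch of random undersampling to a 1:1 ratio, using only the standard library. The function name, the `label` key, and the toy rows are assumptions made for illustration; this is not the exact code used in the project.

```python
import random


def random_undersample(rows, label_of, seed=0):
    """Return a class-balanced copy of rows by randomly dropping majority-class rows."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    # Every class is cut down to the size of the smallest class.
    minority_size = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, minority_size))

    rng.shuffle(balanced)  # avoid having the classes in contiguous blocks
    return balanced


# Toy training data (hypothetical): 95 non-fraudulent (0) and 5 fraudulent (1) rows.
rows = [{"amount": float(i), "label": 0} for i in range(95)]
rows += [{"amount": float(i), "label": 1} for i in range(5)]

balanced = random_undersample(rows, label_of=lambda row: row["label"])

counts = {0: 0, 1: 0}
for row in balanced:
    counts[row["label"]] += 1
print(counts)  # {0: 5, 1: 5}
```

The sketch is applied to the training rows only, so that the test set keeps the original class distribution.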

[Handling Imbalanced Datasets](https://www.youtube.com/watch?v=JnlM4yLFNuo)

[Oversampling Undersampling Code Understanding](https://www.youtube.com/watch?v=HtBDg619ozg)

[SMOTE](https://www.youtube.com/watch?v=U3X98xZ4_no)

[Understanding Under and Oversampling Article](https://www.mastersindatascience.org/learning/statistics-data-science/undersampling/#:~:text=Undersampling%20is%20a%20technique%20to,information%20from%20originally%20imbalanced%20datasets.)

[Undersampling Algorithms for imbalanced classification](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/)

DETECTING AND REMOVING OUTLIERS

[Outlier detection and removal using IQR](https://www.youtube.com/watch?v=A3gClkblXK8)

[Outlier detection and removal using Percentile](https://www.youtube.com/watch?v=7sJaRHF03K8)

[Outlier Detection and removal: z score and standard deviation](https://www.youtube.com/watch?v=KFuEAGR3HS4)

DIMENSIONALITY REDUCTION

[PCA Main Ideas](https://www.youtube.com/watch?v=HMOI_lkzW08)

[PCA: step by step](https://www.youtube.com/watch?v=FgakZw6K1QQ)

[Introduction: Dimensionality Reduction](https://www.geeksforgeeks.org/dimensionality-reduction/)

[More on Dimensionality Reduction](https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/)

[11 techniques you should know (Optional)](https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b)

# Modelling



1.   **Logistic Regression**: It is one of the most common methods employed for binary classification. It makes use of the sigmoid or logistic function, an S-shaped curve that maps any real value to a value in the interval (0,1) without ever touching the limits. The sigmoid function is: **f(x) = 1/(1+exp(-x))**.
But in most cases our input values are combined linearly using weights or coefficients to predict values, i.e. we will have **exp(-(b0+b1x))** in the denominator. Here b0 is the bias or intercept term and b1 is the coefficient for the input variable x. b0 and b1 are learned from the training data. The trained model only needs to store these coefficients (b0, b1, and so on) in memory.
Note: The best coefficients would result in a model that predicts a value very close to 1 for the default class and a value very close to 0 for the other class.

  [Exhaustive Playlist (first 4 videos are to be watched)](https://www.youtube.com/playlist?list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe)
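The sigmoid and the linear combination described above can be sketched as follows. The coefficient values `b0` and `b1` and the 0.5 decision threshold are assumptions chosen for illustration, not values learned from the dataset.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, b0, b1):
    """Probability of the default class for input x: 1 / (1 + exp(-(b0 + b1*x)))."""
    return sigmoid(b0 + b1 * x)


# Hypothetical coefficients; in practice they are learned from the training data.
b0, b1 = -4.0, 2.0

for x in (0.0, 2.0, 5.0):
    probability = predict_probability(x, b0, b1)
    predicted_class = 1 if probability >= 0.5 else 0  # assumed threshold of 0.5
    print(f"x={x}: probability={probability:.4f}, class={predicted_class}")
```

With these coefficients the decision boundary sits at x = 2, where b0 + b1x = 0 and the predicted probability is exactly 0.5.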

2.   **k-Nearest Neighbours**: KNN is a supervised machine learning algorithm that predicts the label of an unknown data point on the basis of the votes of its k nearest neighbors. We draw a circular boundary around the unknown data point such that it encloses its k nearest data points, and then we predict the majority class within that boundary. In principle, k can vary from 1 up to the number of training points, but in practical cases k is usually taken to be less than 30. A lower k fails to generalize, leading to overfitting, while a higher k makes the entire process costlier, requiring either more processing time or a costlier processor. With higher k, the decision boundaries become smoother, but some of the training points then fall in regions labeled with the opposite class, which is collateral damage. This accounts for a loss in training accuracy in exchange for better generalization and, usually, higher test accuracy.
  
  [Reference Video](https://www.youtube.com/watch?v=HVXime0nQeI)

  [Reference Article 1](https://www.analyticsvidhya.com/blog/2021/04/simple-understanding-and-implementation-of-knn-algorithm/)

  [Reference Article 2](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)

  [Reference Article 3](https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/)
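A minimal sketch of the k-nearest-neighbours vote in plain Python, assuming Euclidean distance and a simple majority vote. The function name and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query from the majority label of its k nearest points."""
    # Pair every training point's distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D training data (hypothetical): two well-separated clusters.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, query=(2, 2), k=3))  # 0
print(knn_predict(train_points, train_labels, query=(6, 5), k=3))  # 1
```

Because every prediction scans all training points, the cost of this brute-force version grows with the size of the training set.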


3. **Decision Trees**: They are used for predictive modeling in machine learning. Decision trees use a cost function in order to choose the best split. At each node the algorithm tries to find the split/attribute that performs best at classifying the training data, and this is repeated until a leaf node is reached. Since the algorithm repeatedly partitions the data into smaller subsets, the final subsets (leaf nodes) may consist of only a few data points, or even one. This causes the algorithm to have low bias and high variance. It works in such a way as to maximize the purity of the classes when making splits. Entropy is calculated before and after each split. If a split does not reduce the entropy, another split is tried or the branch of the tree is stopped (when the current tree has the lowest entropy). Entropy measures the impurity of the classes at a node: the higher the entropy, the lower the predictive power of that node.

4. **Random Forest**: Random forest is a popular supervised machine learning algorithm that follows the principle of ensemble learning: it combines multiple classifiers, in our case decision trees, to solve complex problems and to improve the performance of the model.
Working:
* Choose the number of decision trees, N, you want to build
* Take k random points from the training dataset
* Build a decision tree on the basis of the selected points
* Repeat the previous two steps till N trees are created

  An unknown data point is then predicted using all the decision trees, and its class is assigned based on the majority vote across all the decision trees.
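The building blocks behind the decision tree and random forest descriptions above can be sketched as follows: entropy and the entropy reduction of a split (information gain), drawing k points for one tree, and the final majority vote. This is not a full tree or forest implementation. The function names are hypothetical, and sampling with replacement is an assumption on my part.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent)
    after = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - after


def bootstrap_sample(rows, k, rng):
    """Draw k training rows at random (with replacement) for one tree."""
    return rng.choices(rows, k=k)


def majority_vote(predictions):
    """Class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


parent = [0, 0, 1, 1]
good_split = information_gain(parent, [0, 0], [1, 1])  # children are pure
bad_split = information_gain(parent, [0, 1], [0, 1])   # children stay mixed
print(good_split, bad_split)  # 1.0 0.0

sample = bootstrap_sample(list(range(10)), k=5, rng=random.Random(0))
print(len(sample))                      # 5
print(majority_vote([1, 0, 1, 1, 0]))   # 1
```

A tree keeps the candidate split with the highest information gain at each node, and a forest repeats the sample-then-build step once per tree before voting.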






# Testing and Benchmarking the Models

CLASSIFICATION REPORT
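
As a minimal sketch of the numbers a classification report is built from, the snippet below counts the confusion matrix entries and derives precision, recall, and the F1 score. The labels and predictions are made up for illustration and are not the project's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the fraud class."""
    tp = fp = fn = tn = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive and truth == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = fraud) and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is the share of actual fraud cases that the model catches, which is why it matters most for this kind of problem.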

ROC-AUC CURVE
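
The area under the ROC curve can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen non-fraudulent one. The sketch below computes it directly from that definition. The function name and the scores are hypothetical, and the pairwise loop is meant for small examples only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for truth, s in zip(y_true, scores) if truth == 1]
    negatives = [s for truth, s in zip(y_true, scores) if truth == 0]

    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted fraud probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.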