# Setup and Downloading the Dataset

### Setup Kaggle Environment

1. Create a directory named `.kaggle` in your home directory:
   ```bash
   !mkdir ~/.kaggle
   ```
2. Copy the Kaggle API key file, `kaggle.json`, to the newly created directory:
   ```bash
   !cp kaggle.json ~/.kaggle/
   ```
3. Set the appropriate permissions for the API key file:
   ```bash
   !chmod 600 ~/.kaggle/kaggle.json
   ```

### Download and Extract Dataset

1. Use the Kaggle CLI to download the dataset titled "Credit Card Fraud Detection" by mlg-ulb:
   ```bash
   !kaggle datasets download -d mlg-ulb/creditcardfraud
   ```
2. Unzip the downloaded file to extract its contents:
   ```bash
   !unzip creditcardfraud.zip
   ```


In [None]:
!mkdir ~/.kaggle
!cp /content/drive/MyDrive/Colab\ Notebooks/kaggle.json /root/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d mlg-ulb/creditcardfraud -p /content/drive/MyDrive/Colab\ Notebooks

import zipfile
# Extract the archive into the same Drive folder; the context manager closes the file
with zipfile.ZipFile("/content/drive/MyDrive/Colab Notebooks/creditcardfraud.zip") as zipfile_obj:
    zipfile_obj.extractall('/content/drive/MyDrive/Colab Notebooks')
# !unzip creditcardfraud.zip

### Read and Display Dataset
The code uses the pandas library to read a CSV file named 'creditcard.csv' and displays the resulting dataset.

In [None]:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/creditcard.csv')
# Display the DataFrame
df

### Class Distribution

The output shows the count of each unique value in the 'Class' column.

- Class 0: 284,315 occurrences
- Class 1: 492 occurrences

This information is valuable, especially in tasks like credit card fraud detection, where Class 1 often represents fraudulent transactions, and Class 0 represents non-fraudulent transactions. The imbalanced nature of the classes (with significantly more non-fraudulent transactions) is common in fraud detection datasets.



In [None]:
# Count occurrences of each unique value in the 'Class' column
df['Class'].value_counts()
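
To make the imbalance concrete, the counts reported above can be turned into a fraud rate and a class ratio. This is a small arithmetic sketch that hard-codes the two counts from the output rather than reading the DataFrame; the variable names are illustrative.

```python
# Counts taken from the value_counts() output reported above.
n_legit = 284_315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total        # share of fraudulent transactions
imbalance_ratio = n_legit / n_fraud   # legitimate transactions per fraud case

print(f"total rows: {n_total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"legitimate transactions per fraud: {imbalance_ratio:.0f}")
```

Fewer than two transactions in a thousand are fraudulent, which is why the per-class metrics later in the notebook matter more than overall accuracy.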

### Histograms for Numerical Columns

The code generates histograms for each numerical column in the dataset.

In [None]:
# Create histograms for each numerical column
df.hist(bins=30, figsize=(30, 30))

### Descriptive Statistics

The code generates descriptive statistics for the numerical columns in the dataset.

- `Amount` ranges from 0 to 25,691.16 with a mean of about 88.35, which means the distribution is right-skewed: most transactions are small and only a few have very high values.
- `Class` is either 0 or 1.

In [None]:
df.describe()

### Feature Scaling with RobustScaler and Min-Max Scaling

The code uses the RobustScaler from scikit-learn to scale the 'Amount' column, addressing outliers. Additionally, it applies min-max scaling to the 'Time' column to normalize its values.

In [None]:
from sklearn.preprocessing import RobustScaler
# Create a copy of the original DataFrame
new_df = df.copy()
# Scale the 'Amount' column using RobustScaler
new_df['Amount'] = RobustScaler().fit_transform(new_df['Amount'].to_numpy().reshape(-1, 1))
# Apply min-max scaling to the 'Time' column
time = new_df['Time']
new_df['Time'] = (time - time.min()) / (time.max() - time.min())
# Display the resulting DataFrame
new_df
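
The two scalings can be illustrated on a handful of numbers. The sketch below uses only the standard library and hypothetical amounts with one outlier. `robust_scale` and `min_max_scale` are illustrative helpers, not scikit-learn functions. The robust version follows the default behaviour of `RobustScaler`, which centers on the median and divides by the interquartile range; exact values can differ slightly depending on the quantile interpolation convention.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

def min_max_scale(values):
    """Map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

amounts = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical amounts with one outlier
robust = robust_scale(amounts)
minmax = min_max_scale(amounts)

print("robust :", robust)
print("min-max:", [round(v, 3) for v in minmax])
```

With min-max scaling the single outlier pushes all ordinary values close to 0, whereas robust scaling keeps them spread around 0 and leaves only the outlier far away. That is the motivation for using it on `Amount`, while `Time` has no comparable outliers.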

### Shuffling Rows in the DataFrame

The code shuffles the rows of the DataFrame 'new_df' to introduce randomness.

In [None]:
new_df = new_df.sample(frac=1, random_state=1)
new_df

### Train-Test-Validation Split and Class Distribution

The code splits the shuffled DataFrame 'new_df' into three sets: training, testing, and validation.


The class distribution is as follows:

Training Set:

- Class 0: 239,589 occurrences
- Class 1: 411 occurrences

Testing Set:

- Class 0: 21,955 occurrences
- Class 1: 45 occurrences

Validation Set:

- Class 0: 22,771 occurrences
- Class 1: 36 occurrences

These counts provide insights into the distribution of the target variable ('Class') in each set, which is crucial for understanding the balance between the two classes.

In [None]:
# Split the DataFrame into training, testing, and validation sets
train, test, val = new_df[:240000], new_df[240000:262000], new_df[262000:]
# Display the class distribution for each set
train["Class"].value_counts(), test["Class"].value_counts(), val["Class"].value_counts()
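
Because the split is a plain slice of the shuffled rows rather than a stratified split, the fraud share differs slightly between the three sets. The following sketch recomputes it from the counts listed above; the counts are hard-coded and the dictionary names are illustrative.

```python
# (legitimate, fraud) counts per split, as reported above.
splits = {
    "train": (239_589, 411),
    "test": (21_955, 45),
    "val": (22_771, 36),
}

fraud_rates = {
    name: fraud / (legit + fraud) for name, (legit, fraud) in splits.items()
}
total_rows = sum(legit + fraud for legit, fraud in splits.values())
total_fraud = sum(fraud for _, fraud in splits.values())

for name, rate in fraud_rates.items():
    print(f"{name:>5}: {rate:.3%} fraud")
print(f"rows: {total_rows}, fraud cases: {total_fraud}")
```

The validation set contains only 36 fraud cases, so every validation metric for the 'Fraud' class moves by roughly three percentage points per misclassified case.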

### Converting DataFrames to NumPy Arrays

The code converts the DataFrames 'train', 'test', and 'val' into NumPy arrays and prints their shapes.


In [None]:
# Convert DataFrames to NumPy arrays
train_np, test_np, val_np = train.to_numpy(), test.to_numpy(), val.to_numpy()
# Display the shapes of the resulting arrays
train_np.shape, test_np.shape, val_np.shape

### Separating Features and Target Variable

The code separates the input features and target variable from the NumPy arrays representing the training, testing, and validation sets.


In [None]:
# Separate features and target variable for the training set
x_train, y_train = train_np[:, :-1], train_np[:, -1]
# Separate features and target variable for the testing set
x_test, y_test = test_np[:, :-1], test_np[:, -1]
# Separate features and target variable for the validation set
x_val, y_val = val_np[:, :-1], val_np[:, -1]
# Display the shapes of the resulting arrays
x_train.shape, y_train.shape, x_test.shape, y_test.shape, x_val.shape, y_val.shape

### Logistic Regression Model Training and Evaluation

The code trains a logistic regression model on the training data and evaluates its accuracy on the same training set.


In [None]:
# Import the Logistic Regression model from scikit-learn
from sklearn.linear_model import LogisticRegression
# Create a logistic regression model
logistic_model = LogisticRegression()
# Fit the model to the training data
logistic_model.fit(x_train, y_train)
# Evaluate the accuracy of the model on the training set
logistic_model.score(x_train, y_train)

#### Result of the Logistic regression on validation

- Precision (Fraud): Precision is 0.73 for 'Fraud,' meaning that out of all instances predicted as 'Fraud,' only 73% are true 'Fraud.' The impact of class imbalance is evident here; the model tends to be more conservative in predicting 'Fraud' to avoid false positives.

- Recall (Fraud): Recall is 0.53 for 'Fraud,' indicating that only 53% of actual 'Fraud' instances were correctly predicted. The class imbalance affects the model's ability to capture all positive instances.

- F1-Score (Fraud): The F1-Score for 'Fraud' is 0.61, providing a balance between precision and recall. However, it reflects the challenge posed by the class imbalance.

- Support (Fraud): The number of instances for 'Fraud' is small (36), highlighting the class imbalance issue.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_val, logistic_model.predict(x_val), target_names=['Not Fraud', 'Fraud']))
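
The figures in the report follow directly from the confusion-matrix counts. The notebook does not print the confusion matrix, so the counts below are an assumption: they are the integer counts that reproduce the rounded 'Fraud' metrics for a support of 36. `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts for the 'Fraud' class (inferred from the rounded report):
# 19 frauds caught, 7 false alarms, 17 frauds missed (19 + 17 = 36 = support).
tp, fp, fn = 19, 7, 17
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under these assumed counts, the model misses almost half of the fraud cases in the validation set even though overall accuracy is very high.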

### Shallow Neural Net
This code defines a neural network with one hidden layer (2 units) that uses the ReLU activation function. Batch normalization is applied to the output of the hidden layer before it reaches the output layer. The output layer has 1 unit with a sigmoid activation function for binary classification. The model is compiled using the Adam optimizer and binary cross-entropy loss.

During training, a model checkpoint is employed to save the weights of the model with the best performance on the validation set. The training is done for 5 epochs.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, BatchNormalization
from tensorflow.keras.callbacks import ModelCheckpoint

# Create a shallow neural network
shallow_nn = Sequential()
# Add an input layer with the number of features in the input data
shallow_nn.add(InputLayer((x_train.shape[1],)))
# Add a dense layer with 2 units and ReLU activation function
shallow_nn.add(Dense(2, 'relu'))
# Add Batch Normalization layer
shallow_nn.add(BatchNormalization())
# Add the output layer with 1 unit and sigmoid activation function for binary classification
shallow_nn.add(Dense(1, 'sigmoid'))

# Define a model checkpoint to save the best weights during training
checkpoint = ModelCheckpoint('shallow_nn', save_best_only=True)
# Compile the model using the Adam optimizer and binary crossentropy loss and use accuracy as a metric
shallow_nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Display a summary of the model architecture
shallow_nn.summary()
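
The parameter counts shown by `summary()` can be checked by hand. The sketch below assumes 30 input features (`Time`, `V1` to `V28`, and `Amount`), which is the usual layout of this dataset but is not stated explicitly in the notebook; adjust `n_features` if `x_train.shape[1]` differs.

```python
n_features = 30   # assumption: Time, V1..V28, Amount
hidden_units = 2

# Dense(2): one weight per input per unit, plus one bias per unit.
dense_hidden = n_features * hidden_units + hidden_units

# BatchNormalization keeps four values per unit: gamma and beta are trainable,
# the moving mean and moving variance are not.
bn_trainable = 2 * hidden_units
bn_non_trainable = 2 * hidden_units

# Dense(1): one weight per hidden unit, plus one bias.
dense_out = hidden_units * 1 + 1

trainable = dense_hidden + bn_trainable + dense_out
total = trainable + bn_non_trainable

print(f"trainable={trainable}, non-trainable={bn_non_trainable}, total={total}")
```

With only a few dozen trainable parameters, the network is comparable in capacity to the logistic regression model.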

In [None]:
# Train the model on the training set, validate on the validation set, and save the best weights
shallow_nn.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, callbacks=[checkpoint])

In [None]:
# Function to make predictions using the trained model
def neural_net_predictions(model, x):
  return (model.predict(x).flatten() > 0.5).astype(int)

# neural_net_predictions(shallow_nn, x_val)

#### Result Metric

- Precision (Fraud): Precision is 0.61 for 'Fraud,' meaning that out of all instances predicted as 'Fraud,' 61% are true 'Fraud.' This is lower than the logistic regression model's 0.73, so the network trades some precision for recall.

- Recall (Fraud): Recall is 0.78 for 'Fraud,' indicating that 78% of actual 'Fraud' instances were correctly predicted. This is also an improvement compared to the logistic regression model.

- F1-Score (Fraud): The F1-Score for 'Fraud' is 0.68, providing a balance between precision and recall.

- Support (Fraud): The number of instances for 'Fraud' is small (36), and the model correctly predicts a substantial portion of them.


The shallow neural network improves recall and F1-Score for the 'Fraud' class relative to logistic regression, at the cost of lower precision (0.61 versus 0.73). It catches more of the fraud cases but raises more false alarms.

In [None]:
print(classification_report(y_val, neural_net_predictions(shallow_nn, x_val), target_names=['Not Fraud', 'Fraud']))

### Random Forest Classifier

- Precision (Fraud): 0.81 - Out of all instances predicted as 'Fraud,' 81% are true 'Fraud.' This indicates the accuracy of positive predictions.

- Recall (Fraud): 0.47 - Only 47% of actual 'Fraud' instances were correctly predicted. This indicates the ability of the model to capture positive instances.

- F1-Score (Fraud): 0.60 - The harmonic mean of precision and recall. It provides a balance between precision and recall.

The Random Forest classifier shows good overall performance but has a trade-off between precision and recall for the 'Fraud' class. The model tends to be more conservative in predicting 'Fraud,' resulting in a lower recall.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier with a maximum depth of 2 and parallel processing
rf = RandomForestClassifier(max_depth=2, n_jobs=-1)
# Fit the model to the training data
rf.fit(x_train, y_train)

print(classification_report(y_val, rf.predict(x_val), target_names=['Not Fraud', 'Fraud']))

### Gradient Boosting Classifier

- Precision (Fraud): 0.67 - Out of all instances predicted as 'Fraud,' 67% are true 'Fraud.'

- Recall (Fraud): 0.67 - 67% of actual 'Fraud' instances were correctly predicted. This indicates the ability of the model to capture positive instances.

- F1-Score (Fraud): 0.67 - The harmonic mean of precision and recall. It provides a balance between precision and recall.

The model performs exceptionally well on the majority class ('Not Fraud'), where precision and recall both round to 1.00.

For the minority class ('Fraud'), precision and recall are balanced but lower than the majority class.

The overall high accuracy might be influenced by the class imbalance, as predicting the majority class accurately contributes significantly.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
# Create a Gradient Boosting Classifier with specific hyperparameters
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0)
# Fit the model to the training data
gbc.fit(x_train, y_train)

print(classification_report(y_val, gbc.predict(x_val), target_names=['Not Fraud', 'Fraud']))

### Linear Support Vector Machine classifier

- Precision (Fraud): 0.68 - Out of all instances predicted as 'Fraud,' 68% are true 'Fraud.'

- Recall (Fraud): 0.78 - 78% of actual 'Fraud' instances were correctly predicted.

- F1-Score (Fraud): 0.73 - The harmonic mean of precision and recall. It provides a balance between precision and recall.

The SVM with a linear kernel and balanced class weights achieved high precision for 'Not Fraud' but relatively lower precision for 'Fraud.' The recall for 'Fraud' is decent, contributing to a balanced F1-Score.

In [None]:
from sklearn.svm import LinearSVC
# Create a Linear Support Vector Machine classifier with balanced class weights
svc = LinearSVC(class_weight='balanced')
# Fit the model to the training data
svc.fit(x_train, y_train)

print(classification_report(y_val, svc.predict(x_val), target_names=['Not Fraud', 'Fraud']))
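
The effect of `class_weight='balanced'` can be worked out from the training counts. scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`; the sketch below applies that formula by hand to the training-set counts reported earlier, without calling scikit-learn.

```python
# Training-set class counts reported in the split section.
class_counts = {0: 239_589, 1: 411}
n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: weight is inversely proportional to class frequency.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.4f}")
print(f"ratio: {class_weights[1] / class_weights[0]:.0f}x")
```

Each fraud example is weighted several hundred times more heavily than a legitimate one during training, which pushes the decision boundary towards higher recall on 'Fraud'.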

## Dealing with Class Imbalance in Data

### Overview
In classification problems, class imbalance can pose challenges as the model may be biased towards the majority class. This imbalance can affect the model's ability to accurately predict the minority class.
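
The validation results reported in the previous sections illustrate this. The sketch below collects the 'Fraud' precision, recall, and F1-score of the five models trained on the imbalanced data and ranks them by F1. The numbers are copied from the reports above; the dictionary and its labels are illustrative.

```python
# (precision, recall, f1) for the 'Fraud' class on the imbalanced validation set.
fraud_metrics = {
    "Logistic Regression": (0.73, 0.53, 0.61),
    "Shallow Neural Net": (0.61, 0.78, 0.68),
    "Random Forest": (0.81, 0.47, 0.60),
    "Gradient Boosting": (0.67, 0.67, 0.67),
    "LinearSVC (balanced)": (0.68, 0.78, 0.73),
}

# Rank models by F1-score, best first.
ranked = sorted(fraud_metrics.items(), key=lambda item: item[1][2], reverse=True)
best_recall = max(recall for _, recall, _ in fraud_metrics.values())

print(f"{'model':<22}{'precision':>10}{'recall':>8}{'f1':>6}")
for name, (p, r, f1) in ranked:
    print(f"{name:<22}{p:>10.2f}{r:>8.2f}{f1:>6.2f}")
```

No model exceeds 0.78 recall on 'Fraud', and the only model that used class weights has the highest F1-score.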

### Handling Class Imbalance in the Dataset

In order to address the issue of class imbalance in the dataset, a strategic approach is taken to create a balanced dataset with equal representation of both 'Fraud' and 'Not Fraud' cases.

#### Dataset Division

The initial step involves dividing the dataset into two subsets based on the class labels:


In [None]:
# Code to divide the dataset
not_frauds = new_df.query("Class == 0")
frauds = new_df.query("Class == 1")
# Checking it worked or not
not_frauds['Class'].value_counts(), frauds['Class'].value_counts()

### Creating a Balanced Dataset

To address the class imbalance in the original dataset, a balanced dataset is constructed by combining equal instances of 'Fraud' and 'Not Fraud' cases.

In [None]:
# Code to create a balanced dataset
balanced_df = pd.concat([frauds, not_frauds.sample(len(frauds), random_state=1)])
# Check class distribution in the balanced dataset
balanced_df['Class'].value_counts()
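
This approach is commonly called random undersampling. The sketch below shows the same idea with the standard library on a hypothetical label list, and then computes how much of the original data survives, using the class counts reported at the start of the notebook. `undersample` is an illustrative helper, not part of pandas or scikit-learn.

```python
import random

def undersample(labels, seed=1):
    """Keep every minority row (label 1) and an equal-size random sample of majority rows."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)

labels = [0] * 95 + [1] * 5          # hypothetical toy labels
kept = undersample(labels)
kept_labels = [labels[i] for i in kept]
print(f"kept {len(kept)} rows, {sum(kept_labels)} of them fraud")

# Applied to the real class counts reported earlier.
n_legit, n_fraud = 284_315, 492
balanced_rows = 2 * n_fraud
kept_fraction = balanced_rows / (n_legit + n_fraud)
print(f"balanced dataset: {balanced_rows} rows ({kept_fraction:.2%} of the original)")
```

The balanced dataset is far smaller than the original, so the models in the following sections train on far fewer legitimate transactions than before.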

### Shuffling the Balanced Dataset

To ensure randomness in the balanced dataset, shuffling is performed. This step is crucial to eliminate any potential ordering bias in the data.

In [None]:
# To shuffle the balanced dataset
balanced_df = balanced_df.sample(frac=1, random_state=1)
# Display the shuffled balanced dataset
balanced_df

### Creating Train, Test, and Validation Sets from the Shuffled Balanced Dataset

To further prepare the data, subsets for training, testing, and validation are created from the shuffled balanced dataset.


In [None]:
# Convert the balanced DataFrame to a NumPy array
balanced_df_np = balanced_df.to_numpy()
# To split the shuffled balanced dataset into train, test, and validation sets
x_train_b, y_train_b = balanced_df_np[:700, :-1], balanced_df_np[:700, -1].astype(int)
x_test_b, y_test_b = balanced_df_np[700:842, :-1], balanced_df_np[700:842, -1].astype(int)
x_val_b, y_val_b = balanced_df_np[842:, :-1], balanced_df_np[842:, -1].astype(int)
# Display the shapes of the subsets
x_train_b.shape, y_train_b.shape, x_test_b.shape, y_test_b.shape, x_val_b.shape, y_val_b.shape

### Class Distribution in Balanced Train, Test, and Validation Sets

Let's examine the class distribution in the balanced training, testing, and validation sets.

The output shows the count of instances for each class in the respective sets:

Training Set:

- Class 0: 347 instances
- Class 1: 353 instances

Testing Set:

- Class 0: 73 instances
- Class 1: 69 instances

Validation Set:

- Class 0: 72 instances
- Class 1: 70 instances

These counts reflect the balanced nature of the subsets, with a near-equal representation of both classes in each.

In [None]:
pd.Series(y_train_b).value_counts(), pd.Series(y_test_b).value_counts(), pd.Series(y_val_b).value_counts()
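
As a quick consistency check, the subset sizes and the fraud share in each subset can be recomputed from the counts listed above. The counts are hard-coded and the names are illustrative.

```python
# (not fraud, fraud) counts per balanced subset, as reported above.
balanced_splits = {"train": (347, 353), "test": (73, 69), "val": (72, 70)}

sizes = {name: sum(counts) for name, counts in balanced_splits.items()}
fraud_share = {
    name: fraud / (legit + fraud)
    for name, (legit, fraud) in balanced_splits.items()
}

for name in balanced_splits:
    print(f"{name:>5}: {sizes[name]} rows, {fraud_share[name]:.1%} fraud")
print(f"total rows: {sum(sizes.values())}")
```

The sizes match the slice boundaries used in the split (700, 142, and 142 rows), and each subset stays within about two percentage points of a 50/50 balance.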

### Logistic Regression on Balanced Data

Let's train a Logistic Regression model on the balanced training set (`x_train_b`, `y_train_b`) and evaluate its performance on the balanced validation set (`x_val_b`, `y_val_b`).

#### Classification Report for Logistic Regression on Balanced Validation Set

The classification report for the Logistic Regression model on the balanced validation set shows promising results:

Precision:

- For 'Not Fraud': 96%
- For 'Fraud': 93%

Precision measures the accuracy of the positive predictions made by the model.

Recall:

- For 'Not Fraud': 93%
- For 'Fraud': 96%

Recall (Sensitivity) measures the ability of the model to capture all positive instances.

F1-Score:

- For 'Not Fraud': 94%
- For 'Fraud': 94%

The F1-score is the harmonic mean of precision and recall, providing a balanced measure.

Accuracy:

- Overall accuracy of 94%

#### Comparison with the Imbalanced Dataset:

When comparing these results with those obtained from the imbalanced dataset, the key improvement is seen in the model's ability to correctly predict instances of the minority class ('Fraud'). In imbalanced datasets, models tend to be biased towards the majority class, leading to lower performance on the minority class. In the balanced dataset, precision, recall, and F1-score for both classes are more balanced.

It's important to note that the accuracy metric alone might not be a reliable indicator of model performance, especially in imbalanced datasets. The classification report provides a more comprehensive view by considering precision and recall for each class.

In [None]:
# Train Logistic Regression on the balanced data
logistic_model_b = LogisticRegression()
logistic_model_b.fit(x_train_b, y_train_b)
# Accuracy on the balanced training set
logistic_model_b.score(x_train_b, y_train_b)

# Display the classification report for the balanced validation set
print(classification_report(y_val_b, logistic_model_b.predict(x_val_b), target_names=['Not Fraud', 'Fraud']))
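
The point about accuracy can be made concrete with a trivial baseline. The sketch below evaluates a hypothetical model that labels every transaction 'Not Fraud' on the imbalanced validation set, using the class counts reported for that set earlier in the notebook.

```python
# Imbalanced validation set counts reported earlier.
n_legit_val, n_fraud_val = 22_771, 36

# A hypothetical model that always predicts 'Not Fraud':
tp, fp = 0, 0                        # it never flags anything
tn, fn = n_legit_val, n_fraud_val    # all legitimate rows correct, all frauds missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
fraud_recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.4%}")
print(f"fraud recall: {fraud_recall:.0%}")
```

This baseline is almost perfectly accurate yet detects no fraud at all, which is why the per-class precision and recall are the figures to compare.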

### Shallow Neural Network on Balanced Data

A shallow neural network is trained on the balanced training set (`x_train_b`, `y_train_b`) and evaluated on the balanced validation set (`x_val_b`, `y_val_b`).

#### Classification Report for Shallow Neural Network on Balanced Validation Set

The classification report for the shallow neural network on the balanced validation set is as follows:

#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 95%
  - For 'Fraud': 96%

- **Recall:**
  - For 'Not Fraud': 96%
  - For 'Fraud': 94%

- **F1-Score:**
  - For 'Not Fraud': 95%
  - For 'Fraud': 95%

- **Accuracy:**
  - Overall accuracy of 95%

These metrics indicate a balanced performance, with a notable ability to correctly identify instances of both classes. Compared to the logistic regression model, the shallow neural network shows competitive results and may offer advantages in capturing complex patterns in the data.

It's important to consider the trade-offs between precision and recall based on the specific goals of the classification task. Further fine-tuning or exploration of more complex models may lead to even better performance.

In [None]:
# To create and train a shallow neural network on balanced data
shallow_nn_b = Sequential()
shallow_nn_b.add(InputLayer((x_train_b.shape[1],)))
shallow_nn_b.add(Dense(2, 'relu'))
shallow_nn_b.add(BatchNormalization())
shallow_nn_b.add(Dense(1, 'sigmoid'))

checkpoint_b = ModelCheckpoint('shallow_nn_b', save_best_only=True)
shallow_nn_b.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the summary of the model architecture
shallow_nn_b.summary()

# Train the model on balanced data and save the best weights
shallow_nn_b.fit(x_train_b, y_train_b, validation_data=(x_val_b, y_val_b), epochs=40, callbacks=[checkpoint_b])


print(classification_report(y_val_b, neural_net_predictions(shallow_nn_b, x_val_b), target_names=['Not Fraud', 'Fraud']))

### Random Forest Classifier on Balanced Data

A Random Forest classifier is trained on the balanced training set (`x_train_b`, `y_train_b`).

#### Classification Report for Random Forest Classifier on Balanced Validation Set

The classification report for the Random Forest classifier on the balanced validation set is as follows:


#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 93%
  - For 'Fraud': 97%

- **Recall:**
  - For 'Not Fraud': 97%
  - For 'Fraud': 93%

- **F1-Score:**
  - For 'Not Fraud': 95%
  - For 'Fraud': 95%

- **Accuracy:**
  - Overall accuracy of 95%

The Random Forest classifier demonstrates a strong ability to correctly classify both classes. The precision, recall, and F1-score for both 'Not Fraud' and 'Fraud' are balanced, contributing to an overall accuracy of 95%.

Comparing these results with the Shallow Neural Network and Logistic Regression, the Random Forest model showcases competitive performance and is well-suited for the balanced dataset.



In [None]:
# To create and train a Random Forest Classifier on balanced data
rf_b = RandomForestClassifier(max_depth=2, n_jobs=-1)
rf_b.fit(x_train_b, y_train_b)

print(classification_report(y_val_b, rf_b.predict(x_val_b), target_names=['Not Fraud', 'Fraud']))

### Gradient Boosting Classifier on Balanced Data

A Gradient Boosting Classifier is trained on the balanced training set (`x_train_b`, `y_train_b`).

#### Classification Report for Gradient Boosting Classifier on Balanced Validation Set

The classification report for the Gradient Boosting Classifier on the balanced validation set is as follows:


#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 98%
  - For 'Fraud': 87%

- **Recall:**
  - For 'Not Fraud': 86%
  - For 'Fraud': 99%

- **F1-Score:**
  - For 'Not Fraud': 92%
  - For 'Fraud': 93%

- **Accuracy:**
  - Overall accuracy of 92%

These metrics indicate a well-performing Gradient Boosting Classifier on the balanced dataset. The model catches nearly all fraud cases (99% recall) but with lower 'Fraud' precision (87%), so its errors are mostly legitimate transactions flagged as fraud. The F1-scores of 92% and 93% remain close for the two classes.



In [None]:
# To create and train a Gradient Boosting Classifier on balanced data
gbc_b = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0)
gbc_b.fit(x_train_b, y_train_b)

print(classification_report(y_val_b, gbc_b.predict(x_val_b), target_names=['Not Fraud', 'Fraud']))

### Linear Support Vector Machine classifier on Balanced Data

A Linear Support Vector Machine classifier is trained on the balanced training set (`x_train_b`, `y_train_b`).

#### Classification Report for Linear Support Vector Machine classifier on Balanced Validation Set

The classification report for the Linear Support Vector Machine classifier on the balanced validation set is as follows:


#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 96%
  - For 'Fraud': 93%

- **Recall:**
  - For 'Not Fraud': 93%
  - For 'Fraud': 96%

- **F1-Score:**
  - For 'Not Fraud': 94%
  - For 'Fraud': 94%

- **Accuracy:**
  - Overall accuracy of 94%

These metrics indicate a strong performance by the Linear SVM classifier on the balanced dataset. The model demonstrates a balanced ability to correctly classify instances of both 'Not Fraud' and 'Fraud', contributing to a high F1-score and accuracy.


In [None]:
# To create and train a Linear Support Vector Machine classifier on balanced data
svc_b = LinearSVC()
svc_b.fit(x_train_b, y_train_b)

print(classification_report(y_val_b, svc_b.predict(x_val_b), target_names=['Not Fraud', 'Fraud']))

# Test Metrics for Logistic Regression on Balanced Test Set

The test metrics for the Logistic Regression model on the balanced test set are as follows:


#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 92%
  - For 'Fraud': 95%

- **Recall:**
  - For 'Not Fraud': 96%
  - For 'Fraud': 91%

- **F1-Score:**
  - For 'Not Fraud': 94%
  - For 'Fraud': 93%

- **Accuracy:**
  - Overall accuracy of 94%

These metrics indicate that the Logistic Regression model generalizes well to the balanced test set, maintaining a balanced performance in correctly classifying instances of both 'Not Fraud' and 'Fraud'.

In [None]:
#Test
print("Logistic Metric on Test")
print(classification_report(y_test_b, logistic_model_b.predict(x_test_b), target_names=['Not Fraud', 'Fraud']))

# Test Metrics for Shallow Neural Network on Balanced Test Set

The test metrics for the Shallow Neural Network on the balanced test set are as follows:


#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 92%
  - For 'Fraud': 97%

- **Recall:**
  - For 'Not Fraud': 97%
  - For 'Fraud': 91%

- **F1-Score:**
  - For 'Not Fraud': 95%
  - For 'Fraud': 94%

- **Accuracy:**
  - Overall accuracy of 94%

These metrics indicate the performance of the Shallow Neural Network on the balanced test set. The model demonstrates a reasonable ability to correctly classify instances of both 'Not Fraud' and 'Fraud', contributing to a balanced F1-score and accuracy.

In [None]:
print("Shallow Neural Net Metric on Test")
print(classification_report(y_test_b, neural_net_predictions(shallow_nn_b ,x_test_b), target_names=['Not Fraud', 'Fraud']))

# Test Metrics for Random Forest Classifier on Balanced Test Set

The test metrics for the Random Forest Classifier on the balanced test set are as follows:


#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 90%
  - For 'Fraud': 100%

- **Recall:**
  - For 'Not Fraud': 100%
  - For 'Fraud': 88%

- **F1-Score:**
  - For 'Not Fraud': 95%
  - For 'Fraud': 94%

- **Accuracy:**
  - Overall accuracy of 94%

These metrics demonstrate the performance of the Random Forest Classifier on the balanced test set. The model exhibits a balanced ability to correctly classify instances of both 'Not Fraud' and 'Fraud', contributing to a high F1-score and accuracy.


In [None]:
print("RandomForest Classifier Metric on Test")
print(classification_report(y_test_b, rf_b.predict(x_test_b), target_names=['Not Fraud', 'Fraud']))

# Test Metrics for Gradient Boosting Classifier on Balanced Test Set

The test metrics for the Gradient Boosting Classifier on the balanced test set are as follows:

#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 92%
  - For 'Fraud': 90%

- **Recall:**
  - For 'Not Fraud': 90%
  - For 'Fraud': 91%

- **F1-Score:**
  - For 'Not Fraud': 91%
  - For 'Fraud': 91%

- **Accuracy:**
  - Overall accuracy of 91%

These metrics illustrate the performance of the Gradient Boosting Classifier on the balanced test set. The model displays a balanced ability to correctly classify instances of both 'Not Fraud' and 'Fraud', resulting in a high F1-score and accuracy.

In [None]:
print("Gradient Boosting Classifier Metric on Test")
print(classification_report(y_test_b, gbc_b.predict(x_test_b), target_names=['Not Fraud', 'Fraud']))

# Test Metrics for Linear Support Vector Machine (LinearSVC) on Balanced Test Set

The test metrics for the Linear Support Vector Machine (LinearSVC) on the balanced test set are as follows:

#### Interpretation:

- **Precision:**
  - For 'Not Fraud': 92%
  - For 'Fraud': 94%

- **Recall:**
  - For 'Not Fraud': 95%
  - For 'Fraud': 91%

- **F1-Score:**
  - For 'Not Fraud': 93%
  - For 'Fraud': 93%

- **Accuracy:**
  - Overall accuracy of 93%

These metrics showcase the performance of the Linear Support Vector Machine (LinearSVC) on the balanced test set. The model demonstrates a balanced ability to correctly classify instances of both 'Not Fraud' and 'Fraud', resulting in a high F1-score and accuracy.


In [None]:
print("LinearSVC Metric on Test")
print(classification_report(y_test_b, svc_b.predict(x_test_b), target_names=['Not Fraud', 'Fraud']))

# Overall Observations

- The models demonstrate competitive performance on the balanced dataset, with balanced precision, recall, and F1-scores for both classes.
- Random Forest and Shallow Neural Network exhibit excellent precision for the 'Fraud' class.
- Logistic Regression and LinearSVC show strong overall accuracy and balanced performance.
- The choice of the best model depends on the specific goals and trade-offs between precision and recall.
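
These observations can be read off a side-by-side view of the 'Fraud' metrics on the balanced test set. The sketch below hard-codes the percentages from the test reports above as fractions and sorts the models by precision; the names are illustrative.

```python
# (precision, recall, f1) for the 'Fraud' class on the balanced test set.
test_fraud_metrics = {
    "Logistic Regression": (0.95, 0.91, 0.93),
    "Shallow Neural Net": (0.97, 0.91, 0.94),
    "Random Forest": (1.00, 0.88, 0.94),
    "Gradient Boosting": (0.90, 0.91, 0.91),
    "LinearSVC": (0.94, 0.91, 0.93),
}

# Sort by 'Fraud' precision, highest first.
by_precision = sorted(
    test_fraud_metrics.items(), key=lambda item: item[1][0], reverse=True
)

print(f"{'model':<22}{'precision':>10}{'recall':>8}{'f1':>6}")
for name, (p, r, f1) in by_precision:
    print(f"{name:<22}{p:>10.2f}{r:>8.2f}{f1:>6.2f}")

# The balanced test set has 69 fraud cases, so one case is worth this much recall.
recall_points_per_case = 100 / 69
print(f"one fraud case = {recall_points_per_case:.1f} recall points")
```

Since a single fraud case shifts recall by more than one percentage point, differences of one or two points between models on this test set should be read with caution.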

# Factors Influencing Model Selection

The following considerations apply when choosing the best model based on the specific goals and trade-offs between precision and recall:

## Precision and Recall:
- **Precision:** Precision is the ratio of true positive predictions to the total predicted positives. It focuses on minimizing false positives. In the context of fraud detection, a high precision means that when the model predicts a transaction as fraudulent, it is highly likely to be accurate.

- **Recall:** Recall is the ratio of true positive predictions to the total actual positives. It focuses on minimizing false negatives. In fraud detection, high recall implies that the model is effective in identifying most of the actual fraudulent transactions.
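
Written in terms of true positives ($TP$), false positives ($FP$), and false negatives ($FN$) for the 'Fraud' class, the two definitions above, together with the F1-score used throughout the reports, are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

As a hypothetical example, a model that flags 10 transactions of which 8 are truly fraudulent, while 4 frauds go undetected, has a precision of $8/10 = 0.80$ and a recall of $8/12 \approx 0.67$.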

## Trade-offs:
### 1. Precision-Focused Approach:
   - **Scenario:** If the business priority is to minimize the number of false positives (genuine transactions misclassified as fraud), a precision-focused approach is appropriate.
   - **Implication:** This ensures that when the model flags a transaction as fraudulent, it is very likely to be an actual case of fraud. However, it may result in missing some actual fraud cases (lower recall).

### 2. Recall-Focused Approach:
   - **Scenario:** If the business is more concerned about capturing as many fraud cases as possible, even at the cost of a higher false positive rate, a recall-focused approach is suitable.
   - **Implication:** This maximizes the identification of actual fraud cases but may lead to more false positives, as the model might be less stringent in its criteria for flagging transactions.
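
One common way to move between the two approaches is to change the decision threshold applied to the model's predicted probability, which `neural_net_predictions` fixes at 0.5. The sketch below uses hypothetical scores and labels, not output from any model in this notebook, to show how precision and recall shift with the threshold.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall for class 1 when predicting 1 for score > threshold."""
    predicted = [score > threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

results = {t: precision_recall_at(t, scores, labels) for t in (0.3, 0.5, 0.8)}
for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, raising the threshold increases precision and lowers recall, so a single trained model can serve either approach depending on where the threshold is set.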

## Model Considerations:
### 1. Precision-Focused Models:
   - **Logistic Regression:** Known for its simplicity and interpretability, logistic regression can be tuned to prioritize precision.
   - **Shallow Neural Network:** With careful parameter tuning, a neural network can be designed to emphasize precision, leveraging its ability to capture complex patterns.

### 2. Recall-Focused Models:
   - **Linear Support Vector Machine (LinearSVC):** Linear models like LinearSVC can be effective in recall-focused scenarios while maintaining interpretability.
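   - **Ensemble Models (Random Forest, Gradient Boosting):** These models, especially Gradient Boosting, can capture complex relationships, which may translate into higher recall. In this notebook, Gradient Boosting reached 99% 'Fraud' recall on the balanced validation set, while Random Forest had the lowest 'Fraud' recall on the balanced test set (88%).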

## Business Decision:
- **Precision-Recall Trade-off:** The business decision hinges on the acceptable trade-off between precision and recall. It's about finding the right balance based on the business's risk tolerance and the cost associated with false positives and false negatives.

- **Consideration of Business Goals:** The choice of the best model should align with the overarching business goals. For instance, a financial institution might prioritize minimizing false positives to avoid inconveniencing legitimate customers, while an e-commerce platform might prioritize recall to prevent fraudulent transactions.
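
One way to make this trade-off explicit is to attach a cost to each type of error and compare models by total cost. The sketch below is illustrative only: the cost values are hypothetical, and the error counts are assumptions inferred from the rounded balanced test reports (69 fraud and 73 legitimate transactions), since the notebook does not print confusion matrices.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's errors given a cost per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn

# Assumed error counts on the balanced test set, inferred from the rounded reports.
errors = {
    "Random Forest": {"fp": 0, "fn": 8},       # 100% precision, 88% recall
    "Gradient Boosting": {"fp": 7, "fn": 6},   # 90% precision, 91% recall
}

def cheapest(cost_fp, cost_fn):
    """Name of the model with the lowest total error cost."""
    return min(
        errors,
        key=lambda m: total_cost(errors[m]["fp"], errors[m]["fn"], cost_fp, cost_fn),
    )

# Hypothetical cost scenarios: (cost per false positive, cost per false negative).
for cost_fp, cost_fn in [(1, 10), (5, 10)]:
    print(f"cost_fp={cost_fp}, cost_fn={cost_fn}: {cheapest(cost_fp, cost_fn)}")
```

When false alarms are cheap relative to missed fraud, the higher-recall model comes out ahead; when false alarms become more expensive, the higher-precision model does.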

In essence, the decision-making process involves a thoughtful evaluation of the specific business context, risks, and priorities. It's not a one-size-fits-all scenario, and the chosen model should align with the unique requirements and constraints of the business at hand.