# Sub-task 4


To create a prototype system that can identify fraudulent payments using machine learning, we follow the procedure below:

## Step 1: Data Preprocessing

Cleaning the Data: Address any missing information.

Creating Informative Features: Extract useful characteristics from the available data.

Transforming the Data: Transform the data and scale it as needed.

Encoding Categorical Variables: Convert non-numerical columns (such as customer ID, gender, and category) into a format suitable for machine learning models.
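The steps above can be sketched on a tiny hand-made table. The column names mirror those used in this section, but the values, the median fill, and the `log_amount` feature are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy transactions (made-up values, not taken from the real dataset)
toy = pd.DataFrame({
    'gender': ['M', 'F', None, 'F'],
    'category': ['food', 'travel', 'food', 'health'],
    'amount': [12.5, np.nan, 40.0, 7.5],
    'fraud': [0, 1, 0, 0],
})

# Cleaning: fill missing values (median for numbers, a placeholder for text)
toy['amount'] = toy['amount'].fillna(toy['amount'].median())
toy['gender'] = toy['gender'].fillna('unknown')

# Feature creation: a log-transformed amount (hypothetical feature)
toy['log_amount'] = np.log1p(toy['amount'])

# Encoding: expand the categorical columns into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=['gender', 'category'])
print(encoded.columns.tolist())
```

Which cleaning and feature steps are appropriate depends on what the real data looks like; the point here is only the order of operations.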


In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Load the dataset
data = pd.read_csv('bs140513_032310.csv')

# Drop non-varying string columns
data = data.drop(['zipcodeOri', 'zipMerchant'], axis=1)

# Identify categorical columns for encoding
categorical_cols = ['customer', 'age', 'gender', 'merchant', 'category']

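# Apply OneHotEncoder for categorical columns; handle_unknown='ignore' keeps
# transform() from failing on categories that appear only in the test set
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)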

# Prepare features and target
X = data.drop('fraud', axis=1)
y = data['fraud']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply transformations
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)


To make categorical variables usable for machine learning algorithms, we convert them into numeric values using one-hot encoding (the OneHotEncoder in the code above), which creates one 0/1 indicator column per category. This conversion is crucial for processing non-numeric data.

To predict whether a transaction is fraudulent, the dataset is divided into two parts: the features (X) and the target variable (y, the fraud column).

To evaluate the model's performance on unseen data, it is customary in machine learning to split the data into training and testing sets. Here, 20% of the data is held out for testing.
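With a rare positive class, a random split can leave the training and test sets with slightly different fraud rates. One optional variation, not used in the split above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels, and the 2% fraud rate is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 samples with 2% positives (fraud)
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```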

By applying feature scaling, we normalize the values of the features, which helps improve the performance of machine learning algorithms (Chawla et al., 2002).


## Step 2: Feature Scaling (Adjusted for Sparse Data)

In [15]:
# Initialize the StandardScaler with with_mean set to False
scaler = StandardScaler(with_mean=False)

# Apply scaling to the transformed data
X_train_scaled = scaler.fit_transform(X_train_transformed)
X_test_scaled = scaler.transform(X_test_transformed)
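The reason for `with_mean=False` is that the one-hot encoded matrix is sparse: subtracting the mean would turn most zeros into non-zero values and destroy the sparsity. The sketch below shows this behaviour on a small hand-made sparse matrix, which is an assumed stand-in for the real encoded features.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# A small sparse matrix standing in for one-hot encoded features
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 4.0]]))

# Centering would turn the zeros into non-zeros, so scikit-learn refuses it
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_failed = False
except ValueError:
    centering_failed = True

# Scaling by the standard deviation alone keeps the matrix sparse
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(centering_failed, sparse.issparse(X_scaled))
```

Note that scaling mainly matters for the logistic regression model; tree-based models such as random forests and gradient boosting are insensitive to monotonic rescaling of individual features.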



## Step 3: Model Selection

In [22]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Initialize models
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=5000, solver='saga')
}


**Random Forest:** Chosen for its robustness in handling high-dimensional data and its ability to model complex, non-linear relationships. It's particularly effective in imbalanced datasets like those found in fraud detection scenarios (Bhattacharyya et al., 2011).

**Gradient Boosting:** An ensemble learning method known for its high accuracy. Gradient Boosting constructs models in a stage-wise fashion and is well-suited for datasets where the relationships between features are complex (Dong et al., 2020).

**Logistic Regression:** A simpler, linear model used as a baseline. Despite its simplicity, Logistic Regression can be very effective, especially when the underlying relationships in the data are not highly complex (Dal Pozzolo et al., 2015).

## Step 4: Model Training and Evaluation

Each model is trained using the scaled training data. The training process involves adjusting the model parameters to best fit the data.

Post-training, each model is evaluated on the test data. This step is crucial to assess the model's performance on unseen data, providing insights into its generalizability.

The classification_report and confusion_matrix from scikit-learn are used for evaluation. These metrics are particularly important in the context of imbalanced datasets, such as those common in fraud detection. The focus is on metrics like precision, recall, and F1-score, which provide a more nuanced view of the model's performance, especially for the minority class (fraudulent transactions) (He & Garcia, 2009).

High recall for the fraudulent class is critical in fraud detection to minimize false negatives. However, maintaining a balance with precision is necessary to avoid excessive false positives (Ngai et al., 2011).
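To make the relationship between these metrics and the confusion matrix concrete, the sketch below recomputes precision, recall, and F1 for the fraud class from raw counts. The helper `fraud_metrics` is a hypothetical name introduced here, and the counts are those of the RandomForest confusion matrix reported in this section's output.

```python
def fraud_metrics(tp, fp, fn):
    """Return precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)   # share of fraud alerts that are real fraud
    recall = tp / (tp + fn)      # share of real fraud that is caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the RandomForest confusion matrix [[117391, 121], [268, 1149]]
precision, recall, f1 = fraud_metrics(tp=1149, fp=121, fn=268)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
# prints: precision=0.90 recall=0.81 f1=0.86
```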

In [23]:
from sklearn.metrics import classification_report, confusion_matrix

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f'{name} Classification Report:')
    print(classification_report(y_test, y_pred))
    print(f'{name} Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('------------------------------------------------------')



RandomForest Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    117512
           1       0.90      0.81      0.86      1417

    accuracy                           1.00    118929
   macro avg       0.95      0.90      0.93    118929
weighted avg       1.00      1.00      1.00    118929

RandomForest Confusion Matrix:
[[117391    121]
 [   268   1149]]
------------------------------------------------------
GradientBoosting Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    117512
           1       0.89      0.76      0.82      1417

    accuracy                           1.00    118929
   macro avg       0.95      0.88      0.91    118929
weighted avg       1.00      1.00      1.00    118929

GradientBoosting Confusion Matrix:
[[117383    129]
 [   341   1076]]
------------------------------------------------------
LogisticRegression Class

Based on the classification reports and confusion matrices for the RandomForest, GradientBoosting, and LogisticRegression models, let's analyze their performance in the context of fraud detection:

**RandomForest Model**

Precision: High precision for both classes, including 0.90 for fraudulent transactions (1).

Recall: High recall for non-fraudulent transactions (0) and good recall for fraudulent transactions (1) at 0.81.

F1-Score: An F1-score of 0.86 for fraudulent transactions, reflecting a good balance between precision and recall.

Confusion Matrix: 1149 out of 1417 fraudulent transactions correctly identified, with 268 false negatives.

**GradientBoosting Model**

Precision: Similar to RandomForest, with slightly lower precision for fraudulent transactions at 0.89.

Recall: Lower recall for fraudulent transactions at 0.76, indicating more false negatives compared to RandomForest.

F1-Score: Lower than RandomForest at 0.82, indicating a weaker balance between precision and recall for fraudulent transactions.

Confusion Matrix: 1076 out of 1417 fraudulent transactions correctly identified, with 341 false negatives.

**LogisticRegression Model**

Precision: Comparable to GradientBoosting for fraudulent transactions at 0.88.

Recall: Close to RandomForest for fraudulent transactions at 0.80, and higher than GradientBoosting (0.76).

F1-Score: Slightly lower than RandomForest, indicating a reasonable balance between precision and recall.

Confusion Matrix: 1134 out of 1417 fraudulent transactions correctly identified, with 283 false negatives.

**Summary and Best Model Selection**

RandomForest appears to be the best model among the three for this specific task. It has the highest recall for fraudulent transactions (0.81), which is crucial in fraud detection to minimize false negatives. Its precision is also the highest (0.90), indicating fewer false positives.

LogisticRegression comes close, with a recall of 0.80 and a precision of 0.88, while GradientBoosting has the lowest recall of the three at 0.76.

**Overall Performance:** Considering the importance of correctly identifying fraudulent transactions (high recall) while maintaining a reasonable level of precision, RandomForest stands out as the most suitable model for this dataset and task.

## Key Findings and Conclusion

Based on the classification reports and confusion matrices of the RandomForest, GradientBoosting, and LogisticRegression models, we can draw the following findings and conclusions:

**RandomForest Model**

Precision for Fraudulent Transactions: 90%

Recall for Fraudulent Transactions: 81%

Overall Accuracy: 99.67% (117391 true non-fraudulent + 1149 true fraudulent out of 118929 total transactions)

**Conclusion:** The RandomForest model achieved the highest precision and recall in detecting fraudulent transactions. With an 81% recall rate, it identified 81% of fraud cases, which limits the instances where fraud goes undetected (Bhattacharyya et al., 2011). It also reached a precision of 90%, meaning that when it predicted a transaction as fraud it was highly likely to be correct, which reduces false alarms.

**GradientBoosting Model**

Precision for Fraudulent Transactions: 89%

Recall for Fraudulent Transactions: 76%

Overall Accuracy: 99.60% (117383 true non-fraudulent + 1076 true fraudulent out of 118929 total transactions)

**Conclusion:** The GradientBoosting model showed weaker performance than RandomForest, particularly in terms of recall. Recall plays a critical role in scenarios where failing to identify fraud can have serious consequences (Dong et al., 2020).

**LogisticRegression Model**

Precision for Fraudulent Transactions: 88%

Recall for Fraudulent Transactions: 80%

Overall Accuracy: 99.64% (117364 true non-fraudulent + 1134 true fraudulent out of 118929 total transactions)

**Conclusion:** LogisticRegression matched GradientBoosting on precision (88% versus 89%) and exceeded it on recall (80% versus 76%), indicating its effectiveness as a more straightforward model. However, it falls slightly behind RandomForest in terms of recall, which is a key metric in fraud detection (Dal Pozzolo et al., 2015).
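The accuracy figures above are all very close to 100%, which is largely a consequence of class imbalance. As a rough check, the sketch below recomputes each accuracy from the counts quoted above and compares it with a trivial baseline that labels every transaction as legitimate. The baseline is an illustrative assumption and was not part of the original evaluation.

```python
# Test-set class counts from the classification reports
n_legit, n_fraud = 117512, 1417
total = n_legit + n_fraud  # 118929

# Correct predictions (true negatives + true positives) quoted above
correct = {
    'RandomForest': 117391 + 1149,
    'GradientBoosting': 117383 + 1076,
    'LogisticRegression': 117364 + 1134,
}

# A trivial model that labels every transaction as legitimate
baseline_accuracy = n_legit / total

for name, n_correct in correct.items():
    print(f'{name}: {n_correct / total:.2%}')
print(f'Always-legitimate baseline: {baseline_accuracy:.2%}')
```

Because the baseline already reaches about 98.8% accuracy while catching no fraud at all, precision and recall on the fraud class are the more informative basis for comparing the models.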

## Overall Conclusion
In conclusion, after analyzing the three models, the RandomForest model performed best for this dataset. It demonstrated the best balance between precision and recall when identifying fraudulent transactions. It produced the fewest false negatives (268) of the models compared while keeping false positives low (121), making it well suited to fraud detection applications.

However, it is important to note that while RandomForest emerged as the top performer in this analysis, the selection of a model should always consider the specific requirements and characteristics of the dataset at hand. In fraud detection scenarios, maximizing recall (correctly identifying as many fraudulent transactions as possible) while maintaining a reasonable level of precision to avoid false positives is often prioritized (He & Garcia, 2009; Ngai et al., 2011).
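One common way to trade precision for recall is to lower the decision threshold applied to the predicted fraud probability. This was not done in the analysis above. The sketch below shows the idea on synthetic data, where the dataset, the logistic regression model, and the thresholds 0.5 and 0.2 are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), not the fraud dataset
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_proba = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more transactions: recall can only rise or
# stay equal, usually at the cost of precision
results = {}
for threshold in (0.5, 0.2):
    y_hat = (fraud_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(threshold, results[threshold])
```

The appropriate threshold depends on the relative cost of a missed fraud case versus a false alarm, which is a business decision rather than a modelling one.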

## References
1.  Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique.
2.  Bhattacharyya, S., et al. (2011). Data mining for credit card fraud: A comparative study.
3.  Dong, X., et al. (2020). A survey on ensemble learning.
4.  Dal Pozzolo, A., et al. (2015). Calibrating Probability with Undersampling for Unbalanced Classification.
5.  He, H., & Garcia, E.A. (2009). Learning from imbalanced data.
6.  Ngai, E. W. T., et al. (2011). The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature.