# Task - 2

### 1. Load the Processed Data for Model Retraining
We will start by loading the processed training and testing datasets from the `Data/Processed` folder. These datasets were created during the data preparation phase and are ready for model training.

In [2]:
# Import necessary libraries
import pandas as pd

# Load the processed datasets
X_train_balanced = pd.read_csv('../Data/Processed/X_train_balanced.csv')
y_train_balanced = pd.read_csv('../Data/Processed/y_train_balanced.csv')
X_test = pd.read_csv('../Data/Processed/X_test.csv')
y_test = pd.read_csv('../Data/Processed/y_test.csv')

# Check the shapes of the loaded datasets
print("Shape of X_train_balanced:", X_train_balanced.shape)
print("Shape of y_train_balanced:", y_train_balanced.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train_balanced: (454902, 30)
Shape of y_train_balanced: (454902, 1)
Shape of X_test: (56962, 30)
Shape of y_test: (56962, 1)
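
The shapes above show that each target file loads as a single-column DataFrame rather than a 1-D vector. The following is a minimal sketch of how such a frame could be flattened and its class balance checked before training. The column name `Class` and the six toy labels are assumptions made for illustration, not values taken from the processed files.

```python
import pandas as pd

# Hypothetical stand-in for y_train_balanced as returned by pd.read_csv:
# one column, one row per transaction. The name "Class" is an assumption.
y_frame = pd.DataFrame({"Class": [0, 1, 0, 1, 1, 0]})

# squeeze("columns") turns a single-column DataFrame into a 1-D Series.
y_series = y_frame.squeeze("columns")

# value_counts() shows whether both classes are equally represented.
class_counts = y_series.value_counts().sort_index()

print(y_frame.shape, "->", y_series.shape)
print(class_counts.to_dict())
```

In the notebook itself the same two calls could be applied to `y_train_balanced` to confirm that the balancing step produced equal class counts.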


### 2. Retrain the Random Forest Model
We will retrain the **Random Forest Classifier** on the balanced training data (`X_train_balanced`, `y_train_balanced`) and evaluate it on the test set (`X_test`, `y_test`).

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize the Random Forest model with balanced class weights
random_forest = RandomForestClassifier(class_weight='balanced', random_state=42, n_estimators=100)

# Train the model on the balanced training set
# (ravel() flattens the single-column target to the 1-D shape scikit-learn expects)
random_forest.fit(X_train_balanced, y_train_balanced.values.ravel())

# Make predictions on the test set
y_pred_forest = random_forest.predict(X_test)

# Evaluate the model performance
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_forest))
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_forest))
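# Note: this ROC-AUC is computed from hard 0/1 predictions, so it equals the balanced accuracy.
# A threshold-independent AUC would use the class-1 scores from predict_proba instead.
print(f"Random Forest ROC-AUC Score: {roc_auc_score(y_test, y_pred_forest)}")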


Random Forest Confusion Matrix:
[[56848    16]
 [   17    81]]
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.84      0.83      0.83        98

    accuracy                           1.00     56962
   macro avg       0.92      0.91      0.92     56962
weighted avg       1.00      1.00      1.00     56962

Random Forest ROC-AUC Score: 0.913124619572083
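
The figures in the classification report can be reproduced by hand from the confusion matrix, which also explains the ROC-AUC value. The sketch below uses only the four counts printed above. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 56848, 16   # actual class 0: predicted 0, predicted 1
fn, tp = 17, 81      # actual class 1: predicted 0, predicted 1

precision = tp / (tp + fp)     # share of flagged transactions that are fraud
recall = tp / (tp + fn)        # share of fraud cases that were caught (TPR)
specificity = tn / (tn + fp)   # share of legitimate transactions left alone (TNR)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so its area reduces to the mean of TPR and TNR (the balanced accuracy).
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"ROC-AUC from hard labels={auc_from_labels:.6f}")
```

The near-perfect accuracy is driven by the 56,864 legitimate transactions. The fraud-class precision and recall are the more informative numbers for this imbalanced test set.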


### 3. Save the Trained Random Forest Model
Once the model has been retrained, we will save it using the `joblib` library to the `models/` folder. This saved model can then be used later for predictions.

In [4]:
import joblib

# Save the trained Random Forest model
joblib.dump(random_forest, '../models/random_forest_model.pkl')

print("Random Forest model saved successfully!")

Random Forest model saved successfully!


### 4. Load and Test the Saved Model
To ensure the saved model works correctly, we will load it back and test it by making a prediction on a sample input. This step will verify that the model is correctly saved and loaded.

In [9]:
# Load the saved Random Forest model
loaded_model = joblib.load('../models/random_forest_model.pkl')
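
One way to confirm that a save and load cycle preserves the model is to compare predictions from the original and the reloaded estimator. The sketch below does this on a small synthetic dataset and a temporary directory so that it runs on its own. The data, the file name `demo_model.pkl`, and the reduced `n_estimators` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic dataset, a stand-in for the real training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary folder and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", round_trip_ok)
```

In the notebook, the equivalent check would compare `random_forest.predict(X_test)` with `loaded_model.predict(X_test)`.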

#### 4.1. Inspect the Training Data Columns
Let's first check the columns used in the training data (`X_train_balanced`). This will help ensure that the input data for the prediction matches the expected feature structure.

In [6]:
# Display the columns of X_train_balanced to understand the structure
print("Training Data Columns:", X_train_balanced.columns)

Training Data Columns: Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'],
      dtype='object')
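
Knowing the column order makes it possible to align incoming records before prediction. The following is a minimal sketch, assuming a new record arrives as a dictionary whose keys are in arbitrary order. The record values are made up for illustration.

```python
import pandas as pd

# Column order as printed above: Time, V1..V28, Amount.
expected_columns = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

# Hypothetical incoming record with keys in a different order.
record = {"Amount": 12.5, "Time": 100.0, **{f"V{i}": 0.0 for i in range(1, 29)}}
incoming = pd.DataFrame([record])

# Fail early if any expected feature is absent.
missing = [col for col in expected_columns if col not in incoming.columns]
if missing:
    raise ValueError(f"Missing features: {missing}")

# Reorder the columns to match the layout used during training.
aligned = incoming[expected_columns]
print(aligned.shape)
```

When a scikit-learn estimator (version 1.0 or later) is fitted on a DataFrame, it records the training column names in its `feature_names_in_` attribute, which could replace the hard-coded list here.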


#### 4.2. Use Real Data from X_test for Prediction
We will take a sample from the test set (`X_test`) to ensure the input data has the correct structure and number of features for making a prediction with the saved model.


In [7]:
# Take the first row of X_test as a one-row DataFrame so the feature names are preserved
sample_input = X_test.iloc[[0]]

# Make a prediction using the loaded Random Forest model
prediction = loaded_model.predict(sample_input)
print(f"Prediction (0=non-fraud, 1=fraud): {prediction[0]}")

Prediction (0=non-fraud, 1=fraud): 0




#### 4.3. Provide Manually Constructed Input Data for Prediction
We will manually construct an input data array that matches the structure and number of features in the training data.


In [8]:
# Construct a sample input with the correct number of features (30 in this case)
# The values below are arbitrary placeholders in column order (Time, V1..V28, Amount);
# replace them with realistic values when testing an actual transaction
sample_input = [0.5, -0.3, 0.8, 1.2, -0.1, 0.6, -0.7, 1.1, 0.9, 0.4, -0.2, 1.3, 0.7, -0.6, 0.3, 0.5, -1.2, 0.8, 0.2, -0.9, 1.0, 0.4, 0.9, 0.1, -0.3, 1.1, 0.2, 0.7, -0.5, 0.6]
sample_input = pd.DataFrame([sample_input], columns=X_train_balanced.columns)

# Make a prediction using the loaded Random Forest model
prediction = loaded_model.predict(sample_input)
print(f"Prediction (0=non-fraud, 1=fraud): {prediction[0]}")


Prediction (0=non-fraud, 1=fraud): 0


## Conclusion


The output `Prediction (0=non-fraud, 1=fraud): 0` means that the Random Forest model classified the manually constructed input as non-fraudulent. Because those input values were arbitrary placeholders, the result mainly confirms that the saved model loads and accepts correctly shaped input.

The two possible outcomes are:

- **Prediction = 0:** based on the 30 feature values provided, the model considers the transaction legitimate.
- **Prediction = 1:** the model would have flagged the transaction as fraudulent.

### Why Does This Matter?

The prediction reflects patterns the model learned from the training data. It evaluated the 30 input features (`Time`, `Amount`, and `V1` through `V28`) and found that they do not match the characteristics of a fraudulent transaction.

In a real-world setting, a prediction of 0 means the transaction would not be flagged for review. It is not a guarantee of legitimacy: on the test set the model's recall for the fraud class was 0.83, so 17 of the 98 fraudulent transactions were also predicted as 0.
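
A hard 0/1 label hides how confident the model is. The sketch below shows one way to read the estimated fraud probability and apply a custom review threshold. It trains a small model on synthetic data so that it runs on its own. The data, the model size, and the threshold of 0.3 are assumptions made for illustration, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the notebook would use X_test and loaded_model instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 1.0).astype(int)  # minority class plays the role of fraud

demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1.
fraud_probability = demo_model.predict_proba(X_demo[:5])[:, 1]

# Hypothetical review threshold, lower than the default 0.5 used by predict().
review_threshold = 0.3
needs_review = fraud_probability >= review_threshold

for prob, flag in zip(fraud_probability, needs_review):
    print(f"fraud probability={prob:.2f} flag for review={flag}")
```

Lowering the threshold trades more false alarms for fewer missed fraud cases. The right balance depends on the relative cost of each error.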