## Task 4
- Model Selection and Training
- Model Evaluation

In [9]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
import joblib

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from pyngrok import ngrok

In [7]:
# pip install joblib
# !pip install fastapi uvicorn pyngrok




In [6]:
model_data = pd.read_csv('model_dataset.csv')
model_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Data columns (total 50 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   TransactionId                95662 non-null  int64  
 1   BatchId                      95662 non-null  int64  
 2   AccountId                    95662 non-null  int64  
 3   SubscriptionId               95662 non-null  int64  
 4   CustomerId                   95662 non-null  int64  
 5   ProviderId                   95662 non-null  int64  
 6   ProductId                    95662 non-null  int64  
 7   ProductCategory              95662 non-null  int64  
 8   ChannelId                    95662 non-null  int64  
 9   Amount                       95662 non-null  float64
 10  Value                        95662 non-null  int64  
 11  TransactionStartTime         95662 non-null  object 
 12  PricingStrategy              95662 non-null  int64  
 13  FraudResult     

From the previous task the important features are:
- Amount_woe
- Value_woe
- Amount_Total_woe
- TransactionId_frequency_woe
- ProviderId_woe
- ProductId_woe
- ProductCategory_woe
- ChannelId_woe
- PricingStrategy_woe
- Month_woe
- Day_woe
- Hour_woe

In [11]:
# Keep only the important features plus the target column
features = ['Amount_woe', 'Value_woe', 'Amount_Total_woe', 'TransactionId_frequency_woe',
            'ProviderId_woe', 'ProductId_woe', 'ProductCategory_woe', 'ChannelId_woe',
            'PricingStrategy_woe', 'Month_woe', 'Day_woe', 'Hour_woe']
df = model_data[features + ['Cluster']]

# training and test sets
X = df.drop('Cluster', axis = 1)
y = df['Cluster']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
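
When positive rows are rare, a purely random split can leave the test set with an unrepresentative share of class 1. The following is a minimal sketch, on synthetic data rather than `model_dataset.csv`, of passing `stratify=` to `train_test_split` so that both partitions keep the same class ratio. The array names and the 1% positive rate are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 10,000 rows, exactly 1% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Applied to this notebook, the same idea would mean adding `stratify=y` to the existing call; the reported results were produced without it.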


In [19]:
# Random forest
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Define the Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Setup GridSearchCV with StratifiedKFold
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=StratifiedKFold(n_splits=3), scoring='accuracy', n_jobs=-1)

# Train the model
grid_search_rf.fit(x_train, y_train)

# Best params and results
print("Best Random Forest Params:", grid_search_rf.best_params_)
print("Random Forest Best Accuracy:", grid_search_rf.best_score_)

y_pred = grid_search_rf.predict(x_test)
print(classification_report(y_test, y_pred))

# Get predicted probabilities
y_pred_proba = grid_search_rf.predict_proba(x_test)[:, 1]  # Take probabilities for the positive class (class '1')

# ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

Best Random Forest Params: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Random Forest Best Accuracy: 0.9984450301517448
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19069
           1       0.92      0.75      0.83        64

    accuracy                           1.00     19133
   macro avg       0.96      0.87      0.91     19133
weighted avg       1.00      1.00      1.00     19133

ROC-AUC Score: 0.9993
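
The search above selects hyperparameters by accuracy, which is dominated by class 0 on data this imbalanced. As a hedged sketch of one alternative, the same kind of search can be driven by a minority-sensitive metric such as `scoring='average_precision'`, optionally with `class_weight='balanced'`. The synthetic data from `make_classification`, the reduced grid, and the 5% positive rate are assumptions chosen to keep the example small; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (assumption): 12 features, about 5% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=12, weights=[0.95, 0.05], random_state=42
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid={"n_estimators": [25, 50], "min_samples_leaf": [1, 2]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="average_precision",  # rewards ranking the rare class well
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best average precision:", round(search.best_score_, 4))
```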


In [20]:
# Create parameter grid for Gradient Boosting Machine
param_grid_gbm = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5]
}

# Define the Gradient Boosting classifier
gbm = GradientBoostingClassifier(random_state=42)

# Setup GridSearchCV
grid_search_gbm = GridSearchCV(estimator=gbm, param_grid=param_grid_gbm, cv=StratifiedKFold(n_splits=3), scoring='accuracy', n_jobs=-1)

# Train the model
grid_search_gbm.fit(x_train, y_train)

# Best params and results
print("Best GBM Params:", grid_search_gbm.best_params_)
print("GBM Best Accuracy:", grid_search_gbm.best_score_)

y_pred = grid_search_gbm.predict(x_test)
print(classification_report(y_test, y_pred))


# Get predicted probabilities
y_pred_proba = grid_search_gbm.predict_proba(x_test)[:, 1]  # Take probabilities for the positive class (class '1')

# ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

Best GBM Params: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
GBM Best Accuracy: 0.9978831420861357
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19069
           1       0.56      0.78      0.65        64

    accuracy                           1.00     19133
   macro avg       0.78      0.89      0.82     19133
weighted avg       1.00      1.00      1.00     19133

ROC-AUC Score: 0.9207


Comparison and Insights

**Overall Accuracy:**

Both models reach a cross-validated accuracy above 0.997 (Random Forest 0.9984, GBM 0.9979), so the Random Forest is only marginally ahead. Accuracy says little here: a model that always predicted class 0 would already score about 0.997 on the test set (19069 of 19133 rows).

**Class Imbalance:**

Class 1 has only 64 test instances against 19069 for class 0 (about 0.3% of the rows), so the data is severely imbalanced. The class 1 metrics are therefore the ones that separate the two models.

**Precision and Recall:**

The Random Forest has much higher precision for class 1 (0.92) than the GBM (0.56): when the Random Forest predicts class 1, it is correct far more often. Recall for class 1 is similar in both models, with the GBM slightly ahead (0.78 against 0.75), so the GBM finds marginally more of the actual positives at the cost of many more false positives.

**F1-Score:**

The F1-score for class 1 is substantially higher for the Random Forest (0.83) than for the GBM (0.65), reflecting its better balance between precision and recall on this class.

**ROC-AUC Score:**

The Random Forest has an excellent ROC-AUC score (0.9993), so it ranks positive cases above negative ones almost perfectly. The GBM's score (0.9207) shows that its predicted probabilities separate the classes less cleanly, which is consistent with its lower precision for class 1.
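
The figures quoted in this comparison can be cross-checked with a few lines of arithmetic. The sketch below recomputes the class 1 F1-scores from the reported precision and recall, the accuracy of an always-predict-0 baseline, and rough false-positive counts. The counts are back-calculated from rounded report values, so they are approximations rather than numbers taken from a confusion matrix.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

support_neg, support_pos = 19069, 64  # test-set support from the reports

# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = support_neg / (support_neg + support_pos)

# Class 1 precision and recall as printed in the classification reports
reported = {"Random Forest": (0.92, 0.75), "GBM": (0.56, 0.78)}

summary = {}
for name, (precision, recall) in reported.items():
    true_pos = round(recall * support_pos)       # approximate, from rounded recall
    predicted_pos = round(true_pos / precision)  # approximate, from rounded precision
    summary[name] = {
        "f1": round(f1(precision, recall), 2),
        "approx_false_pos": predicted_pos - true_pos,
    }

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
for name, values in summary.items():
    print(name, values)
```

Under these approximations the GBM raises roughly ten times as many false positives as the Random Forest, which is what its lower precision expresses.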

In [21]:
# Saving the best performing model
joblib.dump(grid_search_rf.best_estimator_, 'best_random_forest_model.joblib')

['best_random_forest_model.joblib']
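
Before serving a saved model, it can be worth confirming that the file round-trips cleanly. The sketch below trains a small stand-in classifier on synthetic data, saves it with `joblib.dump`, reloads it with `joblib.load`, and checks that both objects give identical predictions. The data, the file name `demo_model.joblib`, and the temporary directory are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (assumption): not the tuned model from this notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```

A joblib file is generally only safe to load with a compatible scikit-learn version, so the serving environment should match the training one.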

## Task 5
- Choose a framework
- Load the model
- Define API endpoints
- Handle requests
- Return predictions
- Deployment

In [10]:
# Load the model
model = joblib.load('best_random_forest_model.joblib')  # Ensure this matches the uploaded filename

# Create the FastAPI app
app = FastAPI()

class InputData(BaseModel):
    feature1: float
    feature2: float
    feature3: float
    feature4: float
    feature5: float
    feature6: float
    feature7: float
    feature8: float
    feature9: float
    feature10: float
    feature11: float
    feature12: float

@app.post('/predict')
def predict(data: InputData):
    # feature1..feature12 must follow the training column order (Amount_woe ... Hour_woe)
    input_data = [[data.feature1, data.feature2, data.feature3, data.feature4,
                   data.feature5, data.feature6, data.feature7, data.feature8,
                   data.feature9, data.feature10, data.feature11, data.feature12]]
    try:
        prediction = model.predict(input_data)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    # Cast the NumPy integer to a built-in int so the response can be JSON-encoded
    return {"prediction": int(prediction[0])}
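
Two details matter in an endpoint of this kind: the values must reach the model in the same column order used for training, and scikit-learn returns NumPy scalars, which the standard `json` encoder rejects. The sketch below illustrates both points without FastAPI. `FEATURE_ORDER` repeats the feature list from Task 4, while `payload_to_row` and the sample payload are hypothetical helpers, and `np.int64(1)` stands in for a real `model.predict(...)[0]` value.

```python
import json

import numpy as np

# Training column order, as listed under Task 4
FEATURE_ORDER = [
    "Amount_woe", "Value_woe", "Amount_Total_woe", "TransactionId_frequency_woe",
    "ProviderId_woe", "ProductId_woe", "ProductCategory_woe", "ChannelId_woe",
    "PricingStrategy_woe", "Month_woe", "Day_woe", "Hour_woe",
]

def payload_to_row(payload: dict) -> list:
    """Hypothetical helper: order a name -> value payload like the training columns."""
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [[float(payload[name]) for name in FEATURE_ORDER]]

# Illustrative payload, deliberately supplied in reverse order
payload = {name: float(i) for i, name in enumerate(reversed(FEATURE_ORDER))}
row = payload_to_row(payload)

raw_prediction = np.int64(1)  # stand-in for model.predict(row)[0]
try:
    json.dumps({"prediction": raw_prediction})
    numpy_is_serializable = True
except TypeError:
    numpy_is_serializable = False

# Casting to a built-in int makes the value JSON-safe
response_body = json.dumps({"prediction": int(raw_prediction)})
print(numpy_is_serializable, response_body)
```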


In [11]:
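# "app:app" means module app.py, attribute app; the FastAPI code must be saved as app.py in the working directory
!uvicorn app:app --host 0.0.0.0 --port 8000


ERROR:    Error loading ASGI app. Could not import module "app".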


In [12]:
# Set up ngrok: open a public tunnel to local port 8000
public_url = ngrok.connect(8000)
print("Public URL:", public_url)  # public URL to access FastAPI app
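
The `Could not import module "app"` error arises because `uvicorn app:app` looks for a file named `app.py` that exposes an `app` object, and code defined only in a notebook cell is not importable as a module. One way around this, sketched below under assumptions, is to write the service code to `app.py` and then start uvicorn from the directory that contains it. The abbreviated `APP_SOURCE`, the helper `write_app_module`, and the temporary `target_dir` are illustrative; in practice the file would sit next to `best_random_forest_model.joblib`.

```python
import tempfile
from pathlib import Path

# Abbreviated module source (assumption): the full InputData class and
# /predict endpoint from the notebook cell would be pasted in here
APP_SOURCE = '''\
import joblib
from fastapi import FastAPI

model = joblib.load("best_random_forest_model.joblib")
app = FastAPI()
# ... InputData and the /predict endpoint go here ...
'''

def write_app_module(directory: str) -> Path:
    """Hypothetical helper: save the service code as app.py in `directory`."""
    compile(APP_SOURCE, "app.py", "exec")  # syntax check only, nothing is executed
    path = Path(directory) / "app.py"
    path.write_text(APP_SOURCE)
    return path

target_dir = tempfile.mkdtemp()  # stand-in for the notebook's working directory
module_path = write_app_module(target_dir)

# Command to run from target_dir, for example with subprocess.Popen(command, cwd=target_dir)
command = ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
print("Wrote", module_path.name, "| run:", " ".join(command))
```

The ngrok tunnel only forwards traffic, so the public URL will respond only once the server is actually listening on port 8000.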