# Model Evaluation

In [2]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from helper_functions.utils import evaluate_fairness_by_group

In [3]:
results_xgb = pd.read_csv('../data/3_evaluation/XGBClassifier_results.csv')
results_xgb_engineered = pd.read_csv('../data/3_evaluation/XGBClassifier_engineered_results.csv')
results_xgb_tunned = pd.read_csv('../data/3_evaluation/XGBClassifier_tunned_results.csv')

In [4]:
# Calculate accuracy, precision, recall, F1-score, and ROC-AUC for each model
results = {
    'Model': ["XGB", "XGB_engineered", "XGB_tunned"],
    'Accuracy':  [],
    'Precision': [],
    'Recall':    [],
    'F1-Score':  [],
    'ROC-AUC':   [],
}

for df in [results_xgb, results_xgb_engineered, results_xgb_tunned]:
    results['Accuracy'].append(accuracy_score(df['CLASS'], df['CLASS_pred']))
    results['Precision'].append(precision_score(df['CLASS'], df['CLASS_pred']))
    results['Recall'].append(recall_score(df['CLASS'], df['CLASS_pred']))
    results['F1-Score'].append(f1_score(df['CLASS'], df['CLASS_pred']))
    # Note: CLASS_pred holds hard labels, so this is ROC-AUC at a single decision threshold
    results['ROC-AUC'].append(roc_auc_score(df['CLASS'], df['CLASS_pred']))

# Create a DataFrame
results_df = pd.DataFrame(results)

In [5]:
results_df

| | Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|---|
| 0 | XGB | 0.820717 | 0.606006 | 0.808536 | 0.692772 | 0.816656 |
| 1 | XGB_engineered | 0.760977 | 0.514855 | 0.760887 | 0.614147 | 0.760947 |
| 2 | XGB_tunned | 0.834499 | 0.631023 | 0.813915 | 0.710894 | 0.827638 |


## Results Discussion

The tuned XGBoost model trained on all initial features performs best, improving on the baseline across all metrics. Modifying the feature set (XGB_engineered) lowered every metric, even relative to the untuned baseline XGBoost.

The tuned XGBoost model correctly classified 83% of the instances. Its precision of 63% is the highest of the three models, meaning it produces the fewest false positives per predicted conversion, which is critical if we want to optimize ad spend. Its recall of 81% is also the highest, meaning it identifies the largest share of actual conversions, which is essential if missing a conversion is costly.
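
Precision and recall combine into the F1 score as their harmonic mean, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes F1 for the tuned model from the precision and recall values in the results table. The variable names are illustrative and not part of the notebook.

```python
# Precision and recall of XGB_tunned, copied from the results table
precision = 0.631023
recall = 0.813915

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")
```

The result should agree, to within rounding of the inputs, with the F1-Score column reported for XGB_tunned.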

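The 71% F1 score, the highest of the three, confirms that the tuned model balances precision and recall best. Its ROC-AUC of 83% is also the highest. Note that ROC-AUC is computed here from the hard predicted labels (CLASS_pred) rather than from predicted probabilities, so it reflects how well the model separates the positive and negative classes at a single decision threshold, not across all thresholds.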
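
When ROC-AUC is computed from 0/1 predictions instead of scores, it reduces to the average of the true positive rate and the true negative rate (balanced accuracy). The sketch below illustrates this on a small made-up example using only the standard library. `pairwise_auc` and `balanced_accuracy` are hypothetical helpers written for this illustration, not functions from the notebook or from scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy labels and hard predictions (made up for illustration)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

auc_from_labels = pairwise_auc(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(auc_from_labels, bal_acc)
```

If the results files also stored predicted probabilities, passing those as the score would give a threshold-independent ROC-AUC. Whether such a column exists is not shown here.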


## Bias Detection

Let's check the model for bias. We'll group the data by a potentially problematic feature, such as OS_FAMILY_NAME, and measure each group's **accuracy** and **selection rate**.

Accuracy is the share of correct predictions within each group. Selection rate is the proportion of data points in each group that we predicted as conversions.
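
To make the two metrics concrete, here is a minimal sketch of how they could be computed per group. It is not the implementation of `evaluate_fairness_by_group`, whose internals are not shown in this notebook. `group_metrics` is a hypothetical helper and the data is made up.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return {group: (accuracy, selection_rate)} for binary 0/1 labels."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "selected": 0})
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["correct"] += int(yt == yp)   # prediction matches the true label
        c["selected"] += int(yp == 1)   # predicted as a conversion
    return {
        g: (c["correct"] / c["n"], c["selected"] / c["n"])
        for g, c in counts.items()
    }


# Toy example (made-up values, not taken from test.csv)
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0]
groups = ["Windows", "Windows", "Windows", "iOS", "iOS", "iOS"]

metrics = group_metrics(y_true, y_pred, groups)
for name, (acc, sel) in metrics.items():
    print(f"{name}: accuracy={acc:.3f}, selection rate={sel:.3f}")
```

In this toy data both groups have the same accuracy but different selection rates, which is why the two metrics are reported side by side.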

In [5]:
# Load the data before encoding
test_data = pd.read_csv('../data/1_processed/test.csv')

In [6]:
test_data[:3]

| | SITE | AD_FORMAT | BROWSER_NAME | SUPPLY_VENDOR | METRO | OS_FAMILY_NAME | USER_HOUR_OF_WEEK | CLASS |
|---|---|---|---|---|---|---|---|---|
| 0 | mail.yahoo.com | 300x50 | Chrome | google | 528.0 | Windows | 131.0 | 1 |
| 1 | ebay.com | 160x600 | Chrome | Xandr – Monetize SSP (AppNexus) | 501.0 | Windows | 59.0 | 1 |
| 2 | logicaldollar.com | 640x360 | Chrome | yieldmo | 505.0 | OS X | 40.0 | 1 |


In [7]:
# Check that CLASS labels match
(results_xgb_tunned['CLASS'] == test_data['CLASS']).all()

np.True_

In [8]:
X_test = test_data.drop(columns=['CLASS'])
y_test = test_data['CLASS']
y_pred = results_xgb_tunned['CLASS_pred'] 

In [9]:
sorted_metrics, overall_metrics = evaluate_fairness_by_group(
    y_true=y_test,
    y_pred=y_pred,
    sensitive_feature_column=X_test['OS_FAMILY_NAME'],
    metric="Accuracy"
)

print("Group-wise Metrics (Sorted by Accuracy):")
print(sorted_metrics)

print("\nOverall Metrics:")
print(overall_metrics)


Group-wise Metrics (Sorted by Accuracy):
                Accuracy  Selection Rate
OS_FAMILY_NAME                          
Other           0.894737        0.230263
iOS             0.883759        0.199480
Android         0.845995        0.280338
Linux           0.827757        0.293061
Windows         0.818847        0.378442
OS X            0.803459        0.373690

Overall Metrics:
Accuracy          0.834499
Selection Rate    0.322459
dtype: float64


In [10]:
sorted_metrics, overall_metrics = evaluate_fairness_by_group(
    y_true=y_test,
    y_pred=y_pred,
    sensitive_feature_column=X_test['BROWSER_NAME'],
    metric="Accuracy"
)

print("Group-wise Metrics (Sorted by Accuracy):")
print(sorted_metrics)

print("\nOverall Metrics:")
print(overall_metrics)


Group-wise Metrics (Sorted by Accuracy):
                      Accuracy  Selection Rate
BROWSER_NAME                                  
Internet Explorer 7   1.000000        0.000000
Internet Explorer 11  0.991150        0.000000
WebView               0.972979        0.036134
Other                 0.922414        0.172414
Edge                  0.880898        0.211200
Firefox               0.865714        0.200000
Chrome                0.802948        0.391944
Safari                0.798089        0.351274
Opera                 0.783019        0.339623

Overall Metrics:
Accuracy          0.834499
Selection Rate    0.322459
dtype: float64


## Bias Analysis

The Windows group has the highest selection rate (0.378) and the second-lowest accuracy (0.819). This suggests the model may be biased toward Windows users, predicting conversions for them more often than it should.
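
One way to probe whether a group is over-selected is to compare its selection rate with its actual conversion rate. The sketch below does this on made-up data. `selection_vs_base_rate` is a hypothetical helper, and the actual per-group conversion rates for this dataset are not reported in the notebook.

```python
from collections import defaultdict


def selection_vs_base_rate(y_true, y_pred, groups):
    """Per group: actual conversion rate, selection rate, and their ratio."""
    totals = defaultdict(lambda: [0, 0, 0])  # [n, actual positives, predicted positives]
    for yt, yp, g in zip(y_true, y_pred, groups):
        t = totals[g]
        t[0] += 1
        t[1] += int(yt == 1)
        t[2] += int(yp == 1)
    report = {}
    for g, (n, actual, predicted) in totals.items():
        report[g] = {
            "base_rate": actual / n,
            "selection_rate": predicted / n,
            # ratio > 1 means more predicted conversions than actual ones
            "ratio": predicted / actual if actual else float("nan"),
        }
    return report


# Toy example (made-up values, not taken from test.csv)
y_true = [1, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["Windows"] * 4 + ["iOS"] * 4

report = selection_vs_base_rate(y_true, y_pred, groups)
for name, row in report.items():
    print(name, row)
```

A ratio well above 1 for a group would support the over-prediction reading. A ratio near 1 would suggest that the high selection rate simply tracks a high conversion rate in that group.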

The Chrome group has the highest selection rate (0.392) and the third-lowest accuracy (0.803). However, we know that Chrome users dominate our dataset along with Windows users (the two sets of users largely overlap), so these groups contribute the most errors in absolute terms and weigh most heavily on the overall metrics. (See output_images/OS_FAMILY_NAME_categorical.png and output_images/BROWSER_NAME_categorical.png)

Accuracy does not change dramatically across the OS_FAMILY_NAME groups (0.80 to 0.89) or the BROWSER_NAME groups (0.78 to 1.00), which suggests the model is relatively fair with respect to these features.
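
To put numbers on the spread between groups, the sketch below computes the largest gap (maximum minus minimum) in accuracy and in selection rate, using the group-wise values printed above. The `gap` helper and the dictionary layout are assumptions made for this illustration.

```python
# (accuracy, selection rate) per group, copied from the outputs above
os_metrics = {
    "Other": (0.894737, 0.230263),
    "iOS": (0.883759, 0.199480),
    "Android": (0.845995, 0.280338),
    "Linux": (0.827757, 0.293061),
    "Windows": (0.818847, 0.378442),
    "OS X": (0.803459, 0.373690),
}
browser_metrics = {
    "Internet Explorer 7": (1.000000, 0.000000),
    "Internet Explorer 11": (0.991150, 0.000000),
    "WebView": (0.972979, 0.036134),
    "Other": (0.922414, 0.172414),
    "Edge": (0.880898, 0.211200),
    "Firefox": (0.865714, 0.200000),
    "Chrome": (0.802948, 0.391944),
    "Safari": (0.798089, 0.351274),
    "Opera": (0.783019, 0.339623),
}


def gap(metrics, index):
    """Largest difference between any two groups for one metric."""
    values = [row[index] for row in metrics.values()]
    return max(values) - min(values)


for label, metrics in [("OS_FAMILY_NAME", os_metrics), ("BROWSER_NAME", browser_metrics)]:
    print(f"{label}: accuracy gap={gap(metrics, 0):.3f}, "
          f"selection rate gap={gap(metrics, 1):.3f}")
```

Gaps driven by very small groups should be read with caution, since the metrics for those groups rest on few data points.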
