# Lab 3: Contextual Bandit-Based News Article Recommendation

**`Course`:** Reinforcement Learning Fundamentals  
**`Student Name`:** Tanu Adhikari  
**`Roll Number`:** U20230115  
**`GitHub Branch`:** tanu_U20230115

# Imports and Setup

In [47]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

from rlcmab_sampler import sampler


# Load Datasets

In [48]:
# Load training and test datasets
train_users = pd.read_csv("data/train_users.csv")
test_users = pd.read_csv("data/test_users.csv")

# Separate input features and target variable
X_train_full = train_users.drop(columns=['label', 'user_id'])
y_train_full = train_users['label']

# Test set has no labels
X_test = test_users.drop(columns=['user_id'])

## Data Preprocessing

In this section:
- Handle missing values
- Encode categorical features
- Prepare data for user classification

In [49]:
# Fill missing values in 'age' using the median. Assign the result back
# rather than calling fillna(inplace=True) on a column selection, which
# pandas deprecates because the selection behaves as a copy.
X_train_full['age'] = X_train_full['age'].fillna(X_train_full['age'].median())
X_test['age'] = X_test['age'].fillna(X_test['age'].median())

# Store label encoders for categorical columns
label_encoders = {}
categorical_cols = ['browser_version', 'region_code']

# Encode categorical features
for col in categorical_cols:
    le = LabelEncoder()

    # Fit encoder on training data
    X_train_full[col] = le.fit_transform(X_train_full[col].astype(str))

    # Map test values through the fitted classes; unseen categories get -1
    class_to_index = {cls: idx for idx, cls in enumerate(le.classes_)}
    X_test[col] = X_test[col].astype(str).map(class_to_index).fillna(-1).astype(int)
    label_encoders[col] = le


In [50]:
# Convert boolean 'subscriber' column to integer
X_train_full['subscriber'] = X_train_full['subscriber'].astype(int)
X_test['subscriber'] = X_test['subscriber'].astype(int)

# Encode target labels (user categories)
le_target = LabelEncoder()
y_train_encoded = le_target.fit_transform(y_train_full)

## User Classification

Train a classifier to predict the user category (`user_1`, `user_2`, `user_3`),
which serves as the **context** for the contextual bandit.


In [51]:
# Split training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full,
    y_train_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_train_encoded
)

# Print dataset shapes
print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nTarget classes: {le_target.classes_}")

Training set shape: (1600, 31)
Validation set shape: (400, 31)
Test set shape: (2000, 31)

Target classes: ['user_1' 'user_2' 'user_3']


In [52]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Initialize Gradient Boosting Classifier with tuned hyperparameters
gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    subsample=0.8,
    random_state=42,
    verbose=1
)

# Train the model
print("Training Gradient Boosting Classifier...")
gb_clf.fit(X_train, y_train)

Training Gradient Boosting Classifier...
      Iter       Train Loss      OOB Improve   Remaining Time 
         1           0.9499           0.1341            5.05s
         2           0.8395           0.1091            5.72s
         3           0.7528           0.1026            5.42s
         4           0.6788           0.0741            5.75s
         5           0.6157           0.0561            5.93s
         6           0.5615           0.0420            5.74s
         7           0.5225           0.0689            5.92s
         8           0.4761           0.0203            6.34s
         9           0.4451           0.0359            6.28s
        10           0.4160           0.0268            6.31s
        20           0.2493          -0.0370            5.70s
        30           0.1712          -0.0295            5.09s
        40           0.1292          -0.0134            4.38s
        50           0.0981          -0.0283            3.57s
        60           0.0773 

In [53]:
# Predict on validation set
y_val_pred = gb_clf.predict(X_val)

# Compute validation accuracy
val_accuracy = accuracy_score(y_val, y_val_pred)
print("=" * 60)
print(f"Validation Accuracy: {val_accuracy:.4f}")
print("=" * 60)

# Detailed performance metrics
print("\nClassification Report (Validation Set):")
print(classification_report(y_val, y_val_pred, target_names=le_target.classes_))

print("\nConfusion Matrix (Validation Set):")
print(confusion_matrix(y_val, y_val_pred))

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': gb_clf.feature_importances_
}).sort_values(by='importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

Validation Accuracy: 0.9075

Classification Report (Validation Set):
              precision    recall  f1-score   support

      user_1       0.90      0.86      0.88       142
      user_2       0.99      0.88      0.93       142
      user_3       0.84      1.00      0.91       116

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400


Confusion Matrix (Validation Set):
[[122   1  19]
 [ 14 125   3]
 [  0   0 116]]

Top 10 Most Important Features:
                  feature  importance
4        session_duration    0.393034
29            region_code    0.333421
0                     age    0.055386
15  preferred_price_range    0.020425
5         content_variety    0.017644
13           time_on_site    0.014131
12        scroll_activity    0.011889
25        browser_version    0.010574
14      interaction_count    0.009706
20       churn_risk_score    0.009429
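
As a sanity check, the reported validation metrics can be recomputed by hand from the confusion matrix printed above. The sketch below copies that matrix verbatim (rows are true classes, columns are predicted classes, in the order `user_1`, `user_2`, `user_3`); the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the validation output above
cm = np.array([[122,   1,  19],
               [ 14, 125,   3],
               [  0,   0, 116]])

# Accuracy: correctly classified users (the diagonal) over all users
accuracy = np.trace(cm) / cm.sum()

# Recall: diagonal over row sums (true counts per class)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: diagonal over column sums (predicted counts per class)
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy  = {accuracy:.4f}")
print(f"recall    = {np.round(recall, 2)}")
print(f"precision = {np.round(precision, 2)}")
```

The diagonal sums to 363 out of 400 users, which gives the 0.9075 accuracy reported above. Most errors are `user_1` users predicted as `user_3` (19) and `user_2` users predicted as `user_1` (14), so those are the cases where the bandit would receive the wrong context.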


In [54]:
# Predict categories for test users
y_test_pred = gb_clf.predict(X_test)
test_users['predicted_user_category'] = le_target.inverse_transform(y_test_pred)

print("Test predictions completed!")
print("\nPredicted distribution:")
print(test_users['predicted_user_category'].value_counts())

print("\nFirst 5 test predictions:")
print(test_users[['user_id', 'predicted_user_category']].head())

Test predictions completed!

Predicted distribution:
predicted_user_category
user_2    712
user_1    674
user_3    614
Name: count, dtype: int64

First 5 test predictions:
  user_id predicted_user_category
0   U4058                  user_2
1   U1118                  user_1
2   U6555                  user_1
3   U9170                  user_1
4   U3348                  user_1


In [55]:
# Reload original test data to ensure correct user IDs
test_users_original = pd.read_csv("data/test_users.csv")

# Convert encoded predictions back to labels
predicted_labels = le_target.inverse_transform(y_test_pred)

# Save final predictions
output_df = pd.DataFrame({
    'user_id': test_users_original['user_id'],
    'predicted_user_category': predicted_labels
})

output_df.to_csv('test_predictions.csv', index=False)

print("✓ Predictions saved to 'test_predictions.csv'")
print("=" * 60)
print("Prediction Complete!")
print("=" * 60)

✓ Predictions saved to 'test_predictions.csv'
Prediction Complete!


# `Contextual Bandit`

## Reward Sampler Initialization

The sampler is initialized with the student's roll number `i`.
A reward for arm `j` is then drawn with `sampler.sample(j)`, where `j` is an arm index from 0 to 11 (see the Arm Mapping table).
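
The internals of the course-provided `rlcmab_sampler` module are not shown in this notebook. For developing and testing the bandit code offline, a stand-in exposing the same `sample(j)` call can be handy. Everything in the sketch below is hypothetical: the class name `StandInSampler`, the Gaussian reward model, and the seed value are assumptions, not the real sampler's behaviour.

```python
import numpy as np


class StandInSampler:
    """Hypothetical offline replacement for the course sampler (NOT the real one)."""

    def __init__(self, seed, n_arms=12):
        self.rng = np.random.default_rng(seed)
        # Assumed reward model: a fixed hidden mean per arm, unit-variance noise
        self.means = self.rng.uniform(0.0, 1.0, size=n_arms)

    def sample(self, j):
        """Return one noisy reward for arm index j (0 .. n_arms - 1)."""
        if not 0 <= j < len(self.means):
            raise IndexError(f"arm index {j} out of range")
        return float(self.rng.normal(self.means[j], 1.0))


stand_in = StandInSampler(seed=115)  # arbitrary illustrative seed
rewards = [stand_in.sample(0) for _ in range(5)]
```

Swapping the stand-in for the real sampler should only require replacing the object, provided the real one also exposes `sample(j)` as described above.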


## Arm Mapping

| Arm Index (j) | News Category | User Context |
|--------------|---------------|--------------|
| 0–3          | Entertainment, Education, Tech, Crime | `user_1` |
| 4–7          | Entertainment, Education, Tech, Crime | `user_2` |
| 8–11         | Entertainment, Education, Tech, Crime | `user_3` |
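
The table implies that the arm index is `j = 4 * context_index + category_index`, assuming the categories within each block follow the order listed. A minimal sketch of that mapping, using the classifier's label names (`user_1`, `user_2`, `user_3`) and hypothetical helper names:

```python
# Order matters: it must match the arm-mapping table
CONTEXTS = ["user_1", "user_2", "user_3"]
CATEGORIES = ["Entertainment", "Education", "Tech", "Crime"]


def arm_index(context, category):
    """Global arm index j for a (user context, news category) pair."""
    return CONTEXTS.index(context) * len(CATEGORIES) + CATEGORIES.index(category)


def arms_for_context(context):
    """The four global arm indices available to one user context."""
    start = CONTEXTS.index(context) * len(CATEGORIES)
    return list(range(start, start + len(CATEGORIES)))


print(arm_index("user_2", "Tech"))   # user_2 block starts at 4, Tech is offset 2
print(arms_for_context("user_3"))
```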

## Epsilon-Greedy Strategy

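This section implements the epsilon-greedy contextual bandit algorithm: for the current user context, a random arm is chosen with probability $\epsilon$, and otherwise the arm with the highest estimated mean reward is chosen.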
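
A minimal sketch of one way to write this, keeping a separate row of estimates for each user context. The class name, the default `epsilon=0.1`, and the noise-free `toy_means` rewards are assumptions for illustration; in the lab the reward would come from `sampler.sample(j)`.

```python
import numpy as np

N_CONTEXTS, N_ARMS = 3, 4  # three user categories, four news categories each


class EpsilonGreedy:
    """Epsilon-greedy with one row of value estimates per context."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.counts = np.zeros((N_CONTEXTS, N_ARMS), dtype=int)
        self.values = np.zeros((N_CONTEXTS, N_ARMS))

    def select(self, context):
        # Explore: uniformly random arm with probability epsilon
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(N_ARMS))
        # Exploit: arm with the highest estimated mean for this context
        return int(np.argmax(self.values[context]))

    def update(self, context, arm, reward):
        # Incremental sample mean of the rewards seen for (context, arm)
        self.counts[context, arm] += 1
        n = self.counts[context, arm]
        self.values[context, arm] += (reward - self.values[context, arm]) / n


# Toy check with made-up, noise-free rewards (illustrative values only)
toy_means = np.array([[0.1, 0.9, 0.2, 0.3],
                      [0.8, 0.1, 0.4, 0.2],
                      [0.3, 0.2, 0.1, 0.7]])
agent = EpsilonGreedy(epsilon=0.1, seed=0)
for t in range(3000):
    context = t % N_CONTEXTS
    arm = agent.select(context)
    agent.update(context, arm, toy_means[context, arm])

best_arms = agent.values.argmax(axis=1)  # learned best local arm per context
```

`select` returns a local arm (0 to 3); the global index passed to the sampler would be `4 * context + arm`.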


## Upper Confidence Bound (UCB)

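This section implements the UCB strategy for contextual bandits: each arm's estimated mean reward is raised by an exploration bonus that shrinks as the arm is pulled more often, and the arm with the highest resulting bound is chosen.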
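
A minimal sketch in the style of UCB1, run independently for each user context. The bonus form `c * sqrt(ln(t) / n)`, the default `c=2.0`, the class name, and the noise-free `toy_means` rewards are assumptions for illustration, not values fixed by the lab.

```python
import numpy as np


class UCB:
    """UCB1-style policy with separate statistics per context."""

    def __init__(self, n_contexts=3, n_arms=4, c=2.0):
        self.c = c  # exploration weight (assumed value)
        self.counts = np.zeros((n_contexts, n_arms), dtype=int)
        self.values = np.zeros((n_contexts, n_arms))

    def select(self, context):
        counts = self.counts[context]
        # Pull every arm once before using the confidence bound
        untried = np.flatnonzero(counts == 0)
        if untried.size > 0:
            return int(untried[0])
        total = counts.sum()  # pulls made so far in this context
        bonus = self.c * np.sqrt(np.log(total) / counts)
        return int(np.argmax(self.values[context] + bonus))

    def update(self, context, arm, reward):
        # Incremental sample mean of the rewards seen for (context, arm)
        self.counts[context, arm] += 1
        n = self.counts[context, arm]
        self.values[context, arm] += (reward - self.values[context, arm]) / n


# Toy check with made-up, noise-free rewards (illustrative values only)
toy_means = np.array([[0.1, 0.9, 0.2, 0.3],
                      [0.8, 0.1, 0.4, 0.2],
                      [0.3, 0.2, 0.1, 0.7]])
agent = UCB(c=2.0)
for t in range(3000):
    context = t % 3
    arm = agent.select(context)
    agent.update(context, arm, toy_means[context, arm])

most_pulled = agent.counts.argmax(axis=1)  # arm pulled most often per context
```

Unlike epsilon-greedy, this policy uses no random numbers: exploration comes entirely from the bonus term, so a larger `c` means more pulls of apparently worse arms.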

## SoftMax Strategy

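This section implements the SoftMax strategy with temperature $\tau = 1$: for the current context, each arm is sampled with probability proportional to $\exp(Q / \tau)$, where $Q$ is the arm's estimated mean reward.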
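
A minimal sketch of SoftMax (Boltzmann) action selection with per-context estimates. The function and class names and the example value estimates are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the probabilities.

```python
import numpy as np


def softmax_probs(q_values, tau=1.0):
    """Boltzmann probabilities over arms for one context."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / tau  # shift by the max to avoid overflow in exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


class SoftmaxPolicy:
    """SoftMax action selection with one row of estimates per context."""

    def __init__(self, n_contexts=3, n_arms=4, tau=1.0, seed=0):
        self.tau = tau
        self.rng = np.random.default_rng(seed)
        self.counts = np.zeros((n_contexts, n_arms), dtype=int)
        self.values = np.zeros((n_contexts, n_arms))

    def select(self, context):
        probs = softmax_probs(self.values[context], self.tau)
        return int(self.rng.choice(len(probs), p=probs))

    def update(self, context, arm, reward):
        # Incremental sample mean of the rewards seen for (context, arm)
        self.counts[context, arm] += 1
        n = self.counts[context, arm]
        self.values[context, arm] += (reward - self.values[context, arm]) / n


# Example with made-up value estimates for a single context
probs = softmax_probs([0.1, 0.9, 0.2, 0.3], tau=1.0)
```

With $\tau = 1$ and estimates that differ by less than 1, the distribution stays fairly flat: in the example above the best arm receives a probability of only about 0.40, so this setting explores heavily.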


## Reinforcement Learning Simulation

We simulate each bandit algorithm for $T = 10{,}000$ steps and record the reward received at every step.

Note: adjust $T$ as needed.
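
A minimal sketch of the simulation loop, written against any policy object that offers `select(context)` and `update(context, arm, reward)`. The `RandomPolicy` baseline, the uniformly drawn contexts, and the Gaussian `toy_means` reward function are assumptions used only to make the example run on its own; in the lab the contexts would come from the classifier and the rewards from `sampler.sample(j)`.

```python
import numpy as np

N_CONTEXTS, N_ARMS = 3, 4


def run_simulation(policy, sample_reward, contexts):
    """Run one bandit episode and return the reward received at each step.

    policy        -- object with select(context) and update(context, arm, reward)
    sample_reward -- callable j -> reward, standing in for sampler.sample(j)
    contexts      -- sequence of context indices (0, 1, 2), one per step
    """
    rewards = np.empty(len(contexts))
    for t, context in enumerate(contexts):
        arm = policy.select(context)
        j = context * N_ARMS + arm  # global arm index from the arm-mapping table
        reward = sample_reward(j)
        policy.update(context, arm, reward)
        rewards[t] = reward
    return rewards


class RandomPolicy:
    """Uniform-random baseline, used here only to exercise the loop."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def select(self, context):
        return int(self.rng.integers(N_ARMS))

    def update(self, context, arm, reward):
        pass  # a random policy learns nothing


T = 10_000
rng = np.random.default_rng(42)
contexts = rng.integers(N_CONTEXTS, size=T)            # assumed: uniform contexts
toy_means = rng.uniform(0, 1, size=N_CONTEXTS * N_ARMS)  # made-up arm means


def sample_reward(j):
    # Hypothetical reward model: Gaussian noise around a fixed mean per arm
    return float(rng.normal(toy_means[j], 1.0))


rewards = run_simulation(RandomPolicy(), sample_reward, contexts)
```

Running the same loop once per algorithm, with the same `contexts`, keeps the comparison between policies fair.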


## Results and Analysis

This section presents:
- Average Reward vs Time
- Hyperparameter comparisons
- Observations and discussion
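
The "Average Reward vs Time" curve is the running mean of the rewards up to each step. A minimal sketch of that computation, with a hypothetical helper name and a made-up reward sequence:

```python
import numpy as np


def running_average(rewards):
    """Average reward over steps 1..t, for every t."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards) / np.arange(1, rewards.size + 1)


# Made-up reward sequence for illustration
avg = running_average([1.0, 0.0, 1.0, 0.0])
print(avg)
```

Passing the result to `plt.plot` for each algorithm, on shared axes, gives the comparison plot; the curve of a policy that learns should rise toward the mean reward of its best arms.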


## Final Observations

- Comparison of Epsilon-Greedy, UCB, and SoftMax
- Effect of hyperparameters
- Strengths and limitations of each approach
