## RL-based Recommendation System

### Part 2: Environment Setup & Training
In this part we will:

1. Create a training environment
2. Create a training agent
3. Train the agent

In [60]:
#importing libraries
import pandas as pd
import numpy as np
import gym
from gym import spaces
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from collections import deque
import torch.optim as optim
import math

### Step 1: Load the preprocessed data
<p>We will use the preprocessed data from the previous notebook, which we had saved in a pickle file.<br> We will also extract the key metrics, such as the number of unique users and items, from the preprocessed data.</p>

In [2]:
# Load preprocessed data
df_full = pd.read_pickle('df_full.pkl')
df_train = df_full[df_full['set'] == 'train']

# Calculate number of unique users and items
num_users = df_train['user_idx'].nunique()
num_items = df_train['item_idx'].nunique()
print(f"Number of users: {num_users}")
print(f"Number of items: {num_items}")

Number of users: 192403
Number of items: 62993
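
Because `num_users` and `num_items` later size the embedding tables in the DQN, every index fed to the network has to lie in `[0, n)`. The following is a minimal sketch of a sanity check, assuming the indices are meant to be contiguous integers starting at 0. The helper name `indices_are_contiguous` is hypothetical, and the toy lists stand in for a column such as `df_train['user_idx']`.

```python
def indices_are_contiguous(values):
    """Return True if the distinct values are exactly 0, 1, ..., k-1."""
    distinct = set(values)
    return distinct == set(range(len(distinct)))


# Toy stand-ins for an index column (hypothetical data)
contiguous = [0, 2, 1, 1, 0]
with_gap = [0, 1, 3, 3]

print(indices_are_contiguous(contiguous))  # True
print(indices_are_contiguous(with_gap))    # False: 2 is missing, so 3 distinct values but max index 3
```

If the check fails on the real columns, sizing the tables with the maximum index plus one instead of `nunique()` is one possible workaround.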


### Step 2: Prepare User Interaction Data
<p>In this step, we will prepare the user interaction data for training the model.<br> We will compute the following:</p>

1. `user_interactions`: dict mapping each `user_idx` to the set of `item_idx` values that user has interacted with
2. `user_ratings`: dict mapping `(user_idx, item_idx)` tuples to the corresponding rating

In [3]:
# Create user_interactions: {user_idx: set of item_idx}
user_interactions = df_train.groupby('user_idx')['item_idx'].apply(set).to_dict()

# Create user_ratings: {(user_idx, item_idx): rating}
user_ratings = df_train.set_index(['user_idx', 'item_idx'])['overall'].to_dict()
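
To make the shape of these two lookups concrete, here is a small sketch that builds the same kind of structures from a handful of hypothetical rows, without pandas. The names `toy_rows`, `toy_interactions` and `toy_ratings` are illustrative only and are not used elsewhere in the notebook.

```python
# Hypothetical (user_idx, item_idx, rating) rows standing in for df_train
toy_rows = [(0, 10, 5.0), (0, 11, 3.0), (1, 10, 4.0)]

toy_interactions = {}  # user_idx -> set of item_idx
toy_ratings = {}       # (user_idx, item_idx) -> rating
for user_idx, item_idx, rating in toy_rows:
    toy_interactions.setdefault(user_idx, set()).add(item_idx)
    toy_ratings[(user_idx, item_idx)] = rating

print(toy_interactions[0])       # the two items rated by user 0
print(toy_ratings[(0, 11)])      # 3.0
print((1, 11) in toy_ratings)    # False: user 1 never rated item 11
```

The membership test on the last line is the same check the environment later uses to decide between a rating reward and a penalty.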

### Step 3: Set up the training environment class
<p>We will set up a class <code>AmazonEnv</code> that inherits from <code>gym.Env</code> and implements the <code>step</code> and <code>reset</code> methods. <br>The <code>step</code> method takes the agent's action (a recommended item) and returns the next state, the reward, and whether the episode is done.</p>
<p>We will initialize the class with training data, history length <code>N</code> and episode length <code>M</code>. We will also set up the <code>action_space</code> and <code>observation_space</code> attributes.</p>

In [81]:
class AmazonEnv(gym.Env):
    def __init__(self, df_train, N=5, M=10):
        super(AmazonEnv, self).__init__()
        self.df_train = df_train
        self.user_interactions = user_interactions
        self.user_ratings = user_ratings
        self.num_users = num_users
        self.num_items = num_items
        self.N = N
        self.M = M
        self.current_user = None
        self.history = []

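        # Rating slots can hold the -0.2 penalty for unseen items, so their lower bound is -0.2 rather than 0
        low = np.array([0] + [0] * N + [-0.2] * N, dtype=np.float32)
        high = np.array([self.num_users - 1] + [self.num_items - 1] * N + [5] * N, dtype=np.float32)
        self.observation_space = spaces.Box(low=low, high=high, shape=(1 + 2 * N,), dtype=np.float32)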
        self.action_space = spaces.Discrete(self.num_items)

    def reset(self):
        """Initialize the environment for a new episode.
        1. Randomly select a user
        2. Clear their recommendation history
        3. Return an initial state vector with the user index and zeros for the history

        An empty history simulates the start of a recommendation sequence.
        """
        self.current_user = np.random.choice(self.df_train['user_idx'].unique())
        self.history = []
        state = np.array([self.current_user] + [0] * self.N + [0] * self.N, dtype=np.float32)
        return state

    def step(self, action):
        """Define the environment's response to an agent's action (item recommendation).
        1. Check if the recommended item (action) is in the user's interaction set
        2. If yes, set the reward to the rating from user_ratings (the reward reflects the quality of the recommendation based on historical data)
        3. If no, set the reward to -0.2 (a small penalty for items the user has not rated)
        4. Append the item and reward to the history
        5. Update the state with the last N items and rewards, padding with zeros if the history is shorter than N.
        6. Set done to True if the episode reaches M steps
        """
        key = (self.current_user, action)
        if key in self.user_ratings:
            reward = self.user_ratings[key]
            # print(f"Step - User: {self.current_user}, Action: {action}, Reward: {reward} (in history)")
        else:
            reward = -0.2  # Small penalty for unseen items
            # print(f"Step - User: {self.current_user}, Action: {action}, Reward: {reward} (not in history)")
        self.history.append((action, reward))
        if len(self.history) < self.N:
            state_items = [0] * (self.N - len(self.history)) + [item for item, _ in self.history]
            state_ratings = [0.0] * (self.N - len(self.history)) + [rating for _, rating in self.history]
        else:
            state_items = [item for item, _ in self.history[-self.N:]]
            state_ratings = [rating for _, rating in self.history[-self.N:]]
        state = np.array([self.current_user] + state_items + state_ratings, dtype=np.float32)
        done = len(self.history) >= self.M
        return state, reward, done, {}
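
The state layout built in `step` can be isolated into a few lines. The sketch below uses a hypothetical helper `build_state` (not part of the notebook) and plain Python lists to show how the last `N` (item, reward) pairs are left-padded with zeros.

```python
def build_state(user_idx, history, n):
    """Return [user_idx, item_1..item_n, reward_1..reward_n], left-padded with zeros."""
    recent = history[-n:]   # keep at most the n most recent (item, reward) pairs
    pad = n - len(recent)   # number of empty slots at the front
    items = [0] * pad + [item for item, _ in recent]
    rewards = [0.0] * pad + [reward for _, reward in recent]
    return [user_idx] + items + rewards


# Two steps into an episode for user 12935, both recommendations missed
example_history = [(59267, -0.2), (24414, -0.2)]
example_state = build_state(12935, example_history, n=5)
print(example_state)
# [12935, 0, 0, 0, 59267, 24414, 0.0, 0.0, 0.0, -0.2, -0.2]
```

One thing this makes visible: the padding value 0 is also a valid item index, so the network cannot tell an empty slot from item 0. Reserving a dedicated padding index would be one way to separate the two, if that distinction turns out to matter.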

### Step 4 (Optional): Test the environment
<p>Ensure the environment functions correctly before training the agent.</p>

In [82]:
# Instantiate and test the environment
env = AmazonEnv(df_train, N=5, M=10)
state = env.reset()
print("Initial state:", state)
# Test the environment
for _ in range(10):
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    print(f"State: {state}, Reward: {reward}, Done: {done}")
    if done:
        break

Initial state: [12935.     0.     0.     0.     0.     0.     0.     0.     0.     0.
     0.]
State: [ 1.2935e+04  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  5.9267e+04
  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 -2.0000e-01], Reward: -0.2, Done: False
State: [ 1.2935e+04  0.0000e+00  0.0000e+00  0.0000e+00  5.9267e+04  2.4414e+04
  0.0000e+00  0.0000e+00  0.0000e+00 -2.0000e-01 -2.0000e-01], Reward: -0.2, Done: False
State: [ 1.2935e+04  0.0000e+00  0.0000e+00  5.9267e+04  2.4414e+04  4.1395e+04
  0.0000e+00  0.0000e+00 -2.0000e-01 -2.0000e-01 -2.0000e-01], Reward: -0.2, Done: False
State: [ 1.2935e+04  0.0000e+00  5.9267e+04  2.4414e+04  4.1395e+04  2.0875e+04
  0.0000e+00 -2.0000e-01 -2.0000e-01 -2.0000e-01 -2.0000e-01], Reward: -0.2, Done: False
State: [ 1.2935e+04  5.9267e+04  2.4414e+04  4.1395e+04  2.0875e+04  1.2319e+04
 -2.0000e-01 -2.0000e-01 -2.0000e-01 -2.0000e-01 -2.0000e-01], Reward: -0.2, Done: False
State: [ 1.2935e+04  2.4414e+04  4.1395e+04  2.0875e+04  1.2

<p>The environment seems to be working correctly, and is now ready. Let's move on to the training process.</p>
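
Every sampled action above received the -0.2 penalty, which is what the catalogue size would lead us to expect. As a rough back-of-the-envelope sketch (the user with 10 rated items and a mean rating of 4.0 is hypothetical, and `expected_random_reward` is an illustrative helper), the chance that a uniformly random item is one the user has rated is tiny:

```python
def expected_random_reward(num_rated, mean_rating, num_items=62993, miss_penalty=-0.2):
    """Expected one-step reward when the action is drawn uniformly from all items."""
    p_hit = num_rated / num_items
    return p_hit * mean_rating + (1.0 - p_hit) * miss_penalty


# Hypothetical user with 10 rated items and a mean rating of 4.0
hit_probability = 10 / 62993
print(hit_probability)                     # roughly 1.6e-4
print(expected_random_reward(10, 4.0))     # very close to the -0.2 penalty
print(10 * -0.2)                           # -2.0: total for an episode of M=10 misses
```

An episode total of -2.0 therefore corresponds to ten consecutive misses.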

### Step 5: Design and implement DQN Network Architecture
<p>We will build a neural network that takes the environment’s state (<code>1 + 2 * N</code> dimensions) as input and outputs Q-values for all <code>num_items</code> actions.</p>

- Input layer: Process the state vector (`[user_idx, item1, ..., itemN, rating1, ..., ratingN]`).
    - Embedding layer for `user_idx`: Map `user_idx` to a vector of size `user_emb_dim`.
    - Embedding layer for `item_idx`: Map each `item_idx` in the history to a vector of size `item_emb_dim`.
    - Pass rewards directly as floats.
- Concatenate the embeddings and rewards, then process them through two hidden layers (128 units each) with ReLU activations, followed by a linear layer that produces a state embedding.
- Use a separate action embedding matrix (`num_items` × `state_emb_dim`, i.e. `num_items` × 128 by default) and compute Q-values via a dot product between the state embedding and all action embeddings.
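
The sizes implied by this design can be worked out by hand. The sketch below plugs in the dataset sizes printed in Step 1 and the default dimensions of the `DQN` class in this notebook; the variable names ending in `_table` and the `arch_` prefix are illustrative only.

```python
# Values taken from this notebook: dataset sizes from Step 1, defaults from the DQN class
arch_num_users, arch_num_items = 192403, 62993
arch_history_length = 5
arch_user_emb_dim, arch_item_emb_dim, arch_state_emb_dim = 50, 50, 128

# One user embedding, plus one item embedding and one scalar reward per history slot
input_dim = arch_user_emb_dim + arch_history_length * (arch_item_emb_dim + 1)

# Parameter counts of the three embedding tables
user_table = arch_num_users * arch_user_emb_dim
item_table = arch_num_items * arch_item_emb_dim
action_table = arch_num_items * arch_state_emb_dim

print(input_dim)     # 305
print(user_table)    # 9620150
print(item_table)    # 3149650
print(action_table)  # 8063104
```

Multiplying a `(batch, 128)` state embedding by the transposed `(62993, 128)` action table gives one Q-value per item, so the network output has shape `(batch, 62993)`.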

In [83]:
class DQN(nn.Module):
    def __init__(self, num_users, num_items, history_length, user_emb_dim=50, item_emb_dim=50, state_emb_dim=128):
        super(DQN, self).__init__()
        self.history_length = history_length
        self.user_embedding = nn.Embedding(num_users, user_emb_dim)
        self.item_embedding = nn.Embedding(num_items, item_emb_dim)
        input_dim = user_emb_dim + history_length * (item_emb_dim + 1)
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.state_embedding = nn.Linear(128, state_emb_dim)
        self.action_embedding = nn.Embedding(num_items, state_emb_dim)

    def forward(self, state):
        user_idx = state[:, 0].long()
        items = state[:, 1:1+self.history_length].long()
        rewards = state[:, 1+self.history_length:].float()
        user_emb = self.user_embedding(user_idx)
        item_embs = self.item_embedding(items).view(items.size(0), -1)
        state_input = torch.cat([user_emb, item_embs, rewards], dim=1)
        x = F.relu(self.fc1(state_input))
        x = F.relu(self.fc2(x))
        state_emb = self.state_embedding(x)
        action_emb = self.action_embedding.weight
        q_values = torch.matmul(state_emb, action_emb.T)
        return q_values

### Step 6: Implement the Replay Buffer
<p>We will create a buffer to store transitions <code>(state, action, reward, next_state, done)</code> from <code>AmazonEnv</code>.</p>

In [79]:
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)
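
The buffer's behaviour comes almost entirely from `deque(maxlen=...)` and `random.sample`. Here is a small illustration with a deliberately tiny capacity and placeholder transition values (all values are made up for the demonstration):

```python
import random
from collections import deque

demo_buffer = deque(maxlen=3)  # tiny capacity to make eviction visible
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    demo_buffer.append((step, step * 10, -0.2, step + 1, step == 4))

print(len(demo_buffer))             # 3
print([t[0] for t in demo_buffer])  # [2, 3, 4]: the two oldest transitions were dropped

demo_batch = random.sample(list(demo_buffer), 2)  # uniform sampling without replacement
print(len(demo_batch))              # 2
```

`random.sample` raises `ValueError` when asked for more items than the buffer holds, which is why the agent's `update` method returns early until at least `batch_size` transitions have been collected.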

### Step 7: Implementing the DQN Agent
Define a DQN agent with:
1. `policy_net` and `target_net` as DQN instances, with `target_net` initialized as a copy of `policy_net`
2. `optimizer` as Adam with a learning rate of 0.001
3. A `ReplayBuffer` instance to store the experiences
4. Hyperparameters: `gamma` (discount factor), `epsilon` (exploration rate), `batch_size`, etc.

5. `select_action`: Use an epsilon-greedy policy: take a random action with probability `epsilon`; otherwise exploit. In the implementation below, the exploit branch picks a random item from the user's known interactions 75% of the time and the item with the highest Q-value otherwise.
6. `update`: Sample a batch, compute the TD loss (MSE between predicted and target Q-values), and update `policy_net`.
7. `decay_epsilon`: Reduce `epsilon` exponentially after each episode.

In [84]:
# DQN Agent
class DQNAgent:
    def __init__(self, env, num_users, num_items, history_length, buffer_capacity=100000, batch_size=64, 
                 gamma=0.99, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.999, learning_rate=0.001):
        self.env = env
        self.num_items = num_items
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.policy_net = DQN(num_users, num_items, history_length)
        self.target_net = DQN(num_users, num_items, history_length)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()
        self.optimizer = torch.optim.Adam(self.policy_net.parameters(), lr=learning_rate)
        self.replay_buffer = ReplayBuffer(buffer_capacity)
    
    def select_action(self, state):
        user_idx = int(state[0])
        user_items = self.env.user_interactions.get(user_idx, set())
        if random.random() < self.epsilon:
            action = random.randint(0, self.num_items - 1)
            # print(f"Select - Explored Action: {action}, Epsilon: {self.epsilon:.4f}")
            return action
        else:
            with torch.no_grad():
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
                q_values = self.policy_net(state_tensor)
                max_q = q_values.max().item()
                if user_items and random.random() < 0.75:  # 75% chance to pick known item
                    action = random.choice(list(user_items))
                    # print(f"Select - Exploited Known Action: {action}, Max Q-Value: {max_q:.4f}")
                else:
                    action = q_values.argmax().item()
                    # print(f"Select - Exploited Action: {action}, Max Q-Value: {max_q:.4f}")
                return action
    
    def update(self):
        if len(self.replay_buffer) < self.batch_size:
            return
        batch = self.replay_buffer.sample(self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # Stack the per-step NumPy states into one array first; building a tensor from a tuple of arrays is slow
        states = torch.tensor(np.array(states), dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.long)
        rewards = torch.tensor(rewards, dtype=torch.float32)
        next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.float32)
        q_values = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            next_q_values = self.target_net(next_states).max(1)[0]
            targets = rewards + (1 - dones) * self.gamma * next_q_values
        loss = F.mse_loss(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    
    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
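
The target used in `update` is easier to follow on a single transition. The following is a scalar sketch of the same formula, `reward + (1 - done) * gamma * max(next_q_values)`, using plain Python and hypothetical Q-values for a 3-item catalogue; `td_target` is an illustrative helper, not part of the agent.

```python
def td_target(reward, done, next_q_values, gamma=0.99):
    """One-step TD target; the bootstrap term is dropped when the episode has ended."""
    return reward + (1.0 - float(done)) * gamma * max(next_q_values)


# Hypothetical target-network outputs for a 3-item catalogue
example_next_q = [0.5, 2.0, -1.0]

mid_episode = td_target(4.0, False, example_next_q)  # 4.0 + 0.99 * 2.0, about 5.98
final_step = td_target(-0.2, True, example_next_q)   # -0.2, no bootstrapping on the last step
print(mid_episode, final_step)
```

The loss is then the mean squared difference between these targets and the policy network's Q-values for the actions actually taken.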

### Step 8: Set up and execute the training loop
<p>Now that we have everything (finally), we can set up and execute the training loop.</p>

1. Hyperparameters
    - `num_episodes=2000`
    - `target_update_frequency=10`
    - `buffer_capacity=100000`
    - `batch_size=64`
    - `gamma=0.99`
    - `epsilon_start=1.0`
    - `epsilon_end=0.01`
    - `epsilon_decay=0.9995`
    - `learning_rate=0.001`
2. For each episode, we will:
    - Reset AmazonEnv to get an initial state
    - While not `done` (up to M=10 steps):
        - Select an action using `select_action`.
        - Step through the environment to get `next_state`, `reward`, `done`.
        - Store the transition in the replay buffer.
        - Update the policy network.
        - Set `state = next_state`.
    - Decay `epsilon`
    - Update `target_net` every `target_update_frequency` episodes
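
It is worth checking how far the exploration rate actually falls over a run. The sketch below assumes the values used in the training cell (`epsilon_start=1.0`, `epsilon_end=0.01`, `epsilon_decay=0.9995`, 2000 episodes) and one decay per episode; the helper names are hypothetical.

```python
import math


def epsilon_after(episodes, start=1.0, end=0.01, decay=0.9995):
    """Epsilon after `episodes` decay steps (closed form of the multiplicative schedule)."""
    return max(end, start * decay ** episodes)


def episodes_to_reach(target, start=1.0, decay=0.9995):
    """Smallest number of decay steps after which epsilon <= target."""
    return math.ceil(math.log(target / start) / math.log(decay))


print(round(epsilon_after(2000), 3))  # about 0.368
print(episodes_to_reach(0.01))        # a little over 9,200 episodes
```

Under these settings roughly a third of the actions are still random at the end of a 2000-episode run, and `epsilon_end` is never reached. Whether that is desirable depends on how much exploration the sparse reward needs.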

In [85]:
# Load data and initialize environment
# Raw string so the backslashes in the path are not treated as escape sequences
df_full = pd.read_pickle(r'/Ankit\Reposetories\RL_Based_Reccomendation_System\df_full.pkl')
df_train = df_full[df_full['set'] == 'train']
env = AmazonEnv(df_train, N=5, M=10)

# Hyperparameters
num_episodes = 2000
target_update_frequency = 10
buffer_capacity = 100000
batch_size = 64
gamma = 0.99
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 0.9995 
learning_rate = 0.001

# Initialize agent
num_users = env.num_users
num_items = env.num_items
history_length = env.N
agent = DQNAgent(env, num_users, num_items, history_length, buffer_capacity, batch_size, 
                    gamma, epsilon_start, epsilon_end, epsilon_decay, learning_rate)

# Training loop
for episode in range(num_episodes):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.replay_buffer.add(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        agent.update()
    agent.decay_epsilon()
    if episode % target_update_frequency == 0:
        agent.target_net.load_state_dict(agent.policy_net.state_dict())  # Sync the target network with the policy network
    print(f"Episode {episode}, Total Reward: {total_reward:.1f}, Epsilon: {agent.epsilon:.4f}")

# Save model
torch.save(agent.policy_net.state_dict(), 'dqn_model.pth')
print("Model saved to 'dqn_model.pth'")

Episode 0, Total Reward: -2.0, Epsilon: 0.9995
Episode 1, Total Reward: -2.0, Epsilon: 0.9990
Episode 2, Total Reward: -2.0, Epsilon: 0.9985
Episode 3, Total Reward: -2.0, Epsilon: 0.9980
Episode 4, Total Reward: -2.0, Epsilon: 0.9975
Episode 5, Total Reward: -2.0, Epsilon: 0.9970
Episode 6, Total Reward: -2.0, Epsilon: 0.9965
Episode 7, Total Reward: 2.2, Epsilon: 0.9960
Episode 8, Total Reward: -2.0, Epsilon: 0.9955
Episode 9, Total Reward: -2.0, Epsilon: 0.9950
Episode 10, Total Reward: -2.0, Epsilon: 0.9945
Episode 11, Total Reward: -2.0, Epsilon: 0.9940
Episode 12, Total Reward: -2.0, Epsilon: 0.9935
Episode 13, Total Reward: -2.0, Epsilon: 0.9930
Episode 14, Total Reward: -2.0, Epsilon: 0.9925
Episode 15, Total Reward: 3.2, Epsilon: 0.9920
Episode 16, Total Reward: -2.0, Epsilon: 0.9915
Episode 17, Total Reward: -2.0, Epsilon: 0.9910
Episode 18, Total Reward: -2.0, Epsilon: 0.9905
Episode 19, Total Reward: -2.0, Epsilon: 0.9900
Episode 20, Total Reward: -2.0, Epsilon: 0.9896
Epis

### Hurray!
<p>We have successfully trained our model. Let's see how it performs on the test set.</p>