## RL-based Recommendation System

### Part 2: Environment Setup & Training
In this part we will:

1. Create a training environment
2. Create a training agent
3. Train the agent

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import gym
from gym import spaces

### Step 1: Load the preprocessed data
<p>We will use the preprocessed data from the previous notebook, which we saved to a pickle file.<br> We will also extract key metrics from it, such as the number of unique users and items in the training split.</p>

In [2]:
# Load preprocessed data
df_full = pd.read_pickle('df_full.pkl')
df_train = df_full[df_full['set'] == 'train']

# Calculate number of unique users and items
num_users = df_train['user_idx'].nunique()
num_items = df_train['item_idx'].nunique()
print(f"Number of users: {num_users}")
print(f"Number of items: {num_items}")

Number of users: 192403
Number of items: 62993


### Step 2: Prepare User Interaction Data
<p>In this step, we will prepare the user interaction data for training the model.<br> We will compute the following:</p>

1. `user_interactions`: Dict mapping each `user_idx` to the set of `item_idx` values that user has interacted with
2. `user_ratings`: Dict mapping `(user_idx, item_idx)` tuples to the corresponding rating

In [3]:
# Create user_interactions: {user_idx: set of item_idx}
user_interactions = df_train.groupby('user_idx')['item_idx'].apply(set).to_dict()

# Create user_ratings: {(user_idx, item_idx): rating}
user_ratings = df_train.set_index(['user_idx', 'item_idx'])['overall'].to_dict()
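
To make the shape of these two lookups concrete, here is a minimal sketch that builds the same structures in plain Python from a handful of made-up rows. The `rows` list, its values, and the `toy_` names are hypothetical stand-ins for `df_train`; only the resulting dictionary layout mirrors the cell above.

```python
from collections import defaultdict

# Hypothetical (user_idx, item_idx, overall) rows standing in for df_train
rows = [
    (0, 10, 5.0),
    (0, 11, 3.0),
    (1, 10, 4.0),
]

# Same layout as user_interactions: {user_idx: set of item_idx}
toy_interactions = defaultdict(set)
# Same layout as user_ratings: {(user_idx, item_idx): rating}
toy_ratings = {}

for user_idx, item_idx, rating in rows:
    toy_interactions[user_idx].add(item_idx)
    toy_ratings[(user_idx, item_idx)] = rating

toy_interactions = dict(toy_interactions)

print(sorted(toy_interactions[0]))  # [10, 11]
print(toy_ratings[(0, 11)])         # 3.0
print((1, 11) in toy_ratings)       # False: user 1 never rated item 11
```

The tuple-keyed dictionary is what lets the environment answer "did this user rate this item, and how?" with a single lookup per step.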

### Step 3: Set up the training environment class
<p>We will set up a class <code>AmazonEnv</code> that inherits from <code>gym.Env</code> and implements the <code>step</code> and <code>reset</code> methods. <br>The <code>step</code> method takes the agent's action (an item recommendation) and returns the next state, the reward, whether the episode is done, and an empty info dictionary.</p>
<p>We will initialize the class with the training data, the history length <code>N</code>, and the maximum episode length <code>M</code>. We will also set up the <code>action_space</code> and <code>observation_space</code> attributes.</p>

In [4]:
class AmazonEnv(gym.Env):
    def __init__(self, df_train, N=5, M=10):
        super(AmazonEnv, self).__init__()
        self.df_train = df_train
        self.user_interactions = user_interactions
        self.user_ratings = user_ratings
        self.num_users = num_users
        self.num_items = num_items
        self.N = N      #Length of history in state
        self.M = M      #Maximum steps per episode
        self.current_user = None
        self.history = []   #List of (item_idx, reward) tuples
        
        #Observation space: [user_idx, item1, ..., itemN, rating1, ..., ratingN]
        high = np.array([num_users - 1] + [num_items - 1] * N + [5] * N, dtype=np.float32) #Because we have num_users total users, num_items total items and 5 possible ratings
        self.observation_space = spaces.Box(low=0, high=high, shape=(1 + 2 * N,), dtype=np.float32)
        
        # Action space: Recommend any item
        self.action_space = spaces.Discrete(self.num_items) #Because we have the number of possible actions as the number of items
        
    #Implement reset method
    def reset(self):
        """Initialize the environment for a new episode.
        1: Randomly select a user
        2: Clear their recommendation history
        3: Return an initial state vector with the user index and zeros for the history.
        
        An empty history simulates the start of a recommendation sequence.
        """
        #Choose a random user
        self.current_user = np.random.choice(self.df_train['user_idx'].unique())
        self.history = []
        #Initial state: [user_idx, 0, ..., 0]
        state = np.array([self.current_user] + [0] * self.N + [0] * self.N, dtype=np.float32)
        return state
    
    #Implement step method
    def step(self, action):
        """Define the environment's response to an agent's action (item recommendation).
        1. Check if the recommended item (action) is in the user's interaction set
        2. If yes, set reward to the rating from user_ratings (The reward reflects the quality of the recommendation based on historical data.)
        3. If no, reward is 0
        4. Append the item and reward to the history
        5. Update the state with the last N items and rewards, padding with zeros if the history is shorter than N.
        6. Set done to True if the episode reaches M steps
        """
        #Compute the reward based on user's interaction history
        if (self.current_user, action) in self.user_ratings:
            reward = self.user_ratings[(self.current_user, action)]
        else:
            reward = 0.0
        
        #Update the history with the recommendation
        self.history.append((action, reward))
        
        #Extract the last N items and ratings from the history
        if len(self.history) < self.N: #We apply padding to the history if it is not long enough
            state_items = [0] * (self.N - len(self.history)) + [item for item, _ in self.history]
            state_ratings = [0] * (self.N - len(self.history)) + [rating for _, rating in self.history]
        else:
            state_items = [item for item, _ in self.history[-self.N:]]
            state_ratings = [rating for _, rating in self.history[-self.N:]]
            
        #Construct the state vector
        state = np.array([self.current_user] + state_items + state_ratings, dtype=np.float32)
        
        #Check if the episode is done
        done = len(self.history) >= self.M
        
        return state, reward, done, {}
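
The padding and sliding-window logic in `step` can be checked in isolation. The following is a minimal sketch, not part of the environment itself: `build_state` is a hypothetical helper that reproduces the same layout as a plain list, assuming `n >= 1`.

```python
def build_state(user_idx, history, n):
    """Return [user_idx, item_1..item_n, rating_1..rating_n] as a flat list.

    Hypothetical helper mirroring the state layout used in step(); assumes n >= 1.
    """
    recent = history[-n:]            # at most the last n (item, reward) pairs
    pad = [0] * (n - len(recent))    # left-pad while the history is still short
    items = pad + [item for item, _ in recent]
    ratings = pad + [rating for _, rating in recent]
    return [user_idx] + items + ratings

# Short history: one padding slot on the left of each block
print(build_state(7, [(43354, 0.0), (58598, 0.0)], n=3))
# [7, 0, 43354, 58598, 0, 0.0, 0.0]

# Long history: only the last n entries survive
print(build_state(7, [(1, 0.0), (2, 5.0), (3, 0.0), (4, 4.0)], n=3))
# [7, 2, 3, 4, 5.0, 0.0, 4.0]
```

One consequence of this encoding, assuming item indices start at 0: a padding slot and a recommendation of item 0 look identical in the state vector, and a padded rating is indistinguishable from a reward of 0.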
    
    

### Step 4 (Optional): Test the environment
<p>Ensure the environment functions correctly before training the agent.</p>

In [5]:
# Instantiate and test the environment
env = AmazonEnv(df_train, N=5, M=10)
state = env.reset()
print("Initial state:", state)

# Run a few steps with random actions
for _ in range(10):
    action = env.action_space.sample()  # Random item recommendation
    state, reward, done, _ = env.step(action)
    print(f"State: {state}, Reward: {reward}, Done: {done}")
    if done:
        break

Initial state: [184761.      0.      0.      0.      0.      0.      0.      0.      0.
      0.      0.]
State: [184761.      0.      0.      0.      0.  43354.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.      0.      0.      0.  43354.  58598.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.      0.      0.  43354.  58598.  24137.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.      0.  43354.  58598.  24137.   4957.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.  43354.  58598.  24137.   4957.  24293.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.  58598.  24137.   4957.  24293.  31586.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.  24137.   4957.  24293.  31586.  47172.      0.      0.      0.
      0.      0.], Reward: 0.0, Done: False
State: [184761.   4957.  2

<p>The environment appears to work correctly: the state keeps the user index in the first slot and the recommended items slide through the history window. Let's move on to the training process.</p>
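
One observation before moving on: every reward visible in the output above is 0.0. That is expected rather than a sign of a bug, because a uniformly random recommendation from 62,993 items rarely lands on an item the sampled user has rated. The sketch below estimates that per-step hit probability for a random policy. The function name `random_hit_probability` and the toy data are hypothetical; on the real data you would pass `user_interactions` and `num_items`.

```python
def random_hit_probability(user_interactions, num_items):
    """Average chance that a uniformly random item is one the sampled user has rated.

    Users are weighted equally, matching reset(), which picks a user uniformly
    at random from the unique user indices.
    """
    per_user = [len(items) / num_items for items in user_interactions.values()]
    return sum(per_user) / len(per_user)

# Hypothetical toy data: 3 users and a catalogue of 1000 items
toy_interactions = {0: {1, 2, 3, 4}, 1: {5}, 2: {6, 7, 8, 9, 10}}
p_hit = random_hit_probability(toy_interactions, num_items=1000)
print(f"Per-step hit probability for a random policy: {p_hit:.4f}")
```

A very small value here means the reward signal is sparse, which is worth keeping in mind when judging early training progress.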