### `---------------Mandatory Information to fill------------`

### Group ID:
### Group Members Name with Student ID:
1. Student 1
2. Student 2
3. Student 3
4. Student 4


`-------------------Write any remarks that you want considered at the time of evaluation---------------`

Remarks: ##Add here

# Background

In the world of online streaming, user satisfaction and engagement are critical metrics for the success of a movie recommendation system. A well-designed recommendation algorithm can significantly enhance user experience by suggesting movies that align with their preferences, leading to higher platform retention and usage. Recommendation systems face the challenge of balancing exploration (discovering new movies) with exploitation (recommending known favourites) to maximize user satisfaction over time.


# Scenario

Imagine a leading online movie streaming platform, TrendMovie Inc., that aims to become the go-to destination for personalized movie recommendations. The platform features a vast collection of movies catering to diverse audiences. TrendMovie Inc. wants to optimize its recommendation strategy to deliver maximum user satisfaction while maintaining a high level of engagement. Each movie recommendation is treated as an interaction with the user, and their feedback is used to refine the recommendation strategy dynamically.


# Objective

Your objective is to design and implement a recommendation system using Multi-Armed Bandit (MAB) algorithms to maximize cumulative user satisfaction. The system should dynamically allocate recommendations by learning user preferences in real-time, striking the right balance between exploration and exploitation.


# Dataset

The dataset contains user ratings for a variety of movies. Key columns in the dataset include:
*   **User ID:** A unique identifier for each user.
*   **Movie ID:** A unique identifier for each movie.
*   **Rating:** A score provided by the user for a movie (nominally on a scale of 1 to 5; the loaded file contains values from 0.5 to 5.0).
*   **Timestamp:** The time when the rating was given (optional for this assignment).

***Link for accessing dataset:***
https://drive.google.com/file/d/1gfobhqlVCw8Oo52JCiYpEBGhG5k7cWBr/view?usp=drive_link


# Environment Details

**Arms:** Each movie represents an "arm" in the MAB framework. The probability of a movie being liked by a user is initially unknown and will be estimated based on user feedback during the interactions.
For example:

Arm 1: Movie A

Arm 2: Movie B

Arm 3: Movie C

... and so on, for all movies in the dataset.

**Reward Function:**
The reward function is defined based on user ratings:

***Reward = 1:*** The user gives the movie a high rating (e.g., 4 or 5 stars).

***Reward = 0:*** The user gives the movie a low rating (e.g., 1, 2, or 3 stars).
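
The mapping from a star rating to a binary reward can be sketched as below. This is a minimal illustration, assuming a cut-off of 4.0 stars; the name `rating_to_reward` and the `threshold` parameter are hypothetical and do not appear elsewhere in the notebook.

```python
def rating_to_reward(rating, threshold=4.0):
    """Map a star rating to a binary reward: 1 for a high rating, 0 otherwise."""
    return 1 if rating >= threshold else 0


# Ratings of 4 or 5 stars count as "liked"; anything lower does not.
sample_ratings = [5.0, 4.0, 3.0, 1.0]
sample_rewards = [rating_to_reward(r) for r in sample_ratings]
print(sample_rewards)
```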


**Assumptions:**

Run each policy for 1000 iterations.


# Requirements and Deliverables:
Implement the Multi-Armed Bandit problem for the scenario above using each of the policies listed below.

### Initialize constants

In [79]:
!pip install gym pandas





In [80]:
import random
import numpy as np
import pandas as pd
import gym

# Constants
no_of_iterations = 1000

# Initialize value function and policy

# Load Dataset (0.5M)

In [81]:
# Code for Dataset loading and print dataset statistics
#-----write your code below this line---------


# Data Exploration

import pandas as pd

# Load the dataset
dataset = pd.read_csv("TrendMovie.csv")

# Display basic information about the dataset
print("Dataset Info:")
dataset.info()

print("Total Movie count:")
print(dataset['movieId'].nunique())

# Display the first few rows of the dataset
print("\nDataset Preview:")
print(dataset.head())

# Check for missing values
print("\nMissing Values:")
print(dataset.isnull().sum())

# Summary statistics of the ratings
print("\nSummary Statistics:")
print(dataset["rating"].describe())

# Preprocessing

# Convert timestamp to a readable format
dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], unit="s")

# Sort by user and timestamp for consistent processing
dataset = dataset.sort_values(by=["userId", "timestamp"]).reset_index(drop=True)

print("\nProcessed Dataset Preview:")
print(dataset.head())



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
Total Movie count:
9724

Dataset Preview:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

Missing Values:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Summary Statistics:
count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, d

# Design a Movie Environment (0.5M)

In [82]:
import random
import pandas as pd
import gym
import math

# Define the Movie Recommendation Environment
class MovieRecommendationEnv(gym.Env):
    def __init__(self, data):
        """
        Initialize the environment.

        Parameters:
        - data: DataFrame containing user ratings with columns: 'userId', 'movieId', 'rating'.
        """
        self.data = data
        self.movies = data['movieId'].unique()  # Unique movie IDs
        self.state = {movie_id: [] for movie_id in self.movies}  # Tracks user feedback for each movie
        self.reset()  # Initialize the state

    def step(self, action):
        """
        Simulate a user's interaction with a selected movie (arm).

        Parameters:
        - action: The selected movie ID (arm).

        Returns:
        - state: Updated state with rewards for each movie.
        - reward: 1 if the movie's average rating is at least 4.0, 0 otherwise.
        - done: Always False, as this is a continuous environment.
        - debug: None for now (can be extended for additional info).
        """
        # Filter ratings for the selected movie
        movie_ratings = self.data[self.data['movieId'] == action]['rating']
        
        # Threshold the movie's average rating. The reward is therefore
        # deterministic: a given movie returns the same value on every pull.
        if not movie_ratings.empty:
            avg_rating = movie_ratings.mean()
            reward = 1 if avg_rating >= 4.0 else 0  # Average of 4.0 or more gives a reward of 1
        else:
            reward = 0  # Default to 0 if no ratings are available

        self.state[action].append(reward)  # Update state with reward
        done = False
        debug = None
        return self.state, reward, done, debug

    def reset(self):
        """
        Reset the environment state. Each movie starts with an empty feedback list.

        Returns:
        - state: The initial state of the environment.
        """
        self.state = {movie_id: [] for movie_id in self.movies}
        return self.state

    def render(self, mode="ascii"):
        """
        Display statistics of user feedback for all movies.

        Parameters:
        - mode: The rendering mode (default is "ascii").
        """
        returns = {movie_id: sum(rewards) for movie_id, rewards in self.state.items()}
        trials = {movie_id: len(rewards) for movie_id, rewards in self.state.items()}
        print(f'===== Total Trials: {sum(trials.values())} =====')
        for movie_id, total_return in returns.items():
            total_trials = trials[movie_id]
            print(f'Movie {movie_id} | Rewards: {total_return}, Trials: {total_trials}')
        print(f'===== Total Rewards: {sum(returns.values())} =====')
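
The `step` method above thresholds each movie's average rating, so an arm always pays the same reward. A stochastic alternative, closer to the "probability of a movie being liked" described under Environment Details, is to sample one logged rating per pull. The sketch below is only an illustration of that idea, not the notebook's environment: the class `SampledRewardArms`, its methods, and the `ratings_by_movie` dictionary are all hypothetical names.

```python
import random


class SampledRewardArms:
    """Hypothetical helper: each arm holds the list of ratings logged for one movie."""

    def __init__(self, ratings_by_movie, threshold=4.0, seed=None):
        self.ratings_by_movie = ratings_by_movie  # {movie_id: [rating, ...]}
        self.threshold = threshold
        self.rng = random.Random(seed)

    def pull(self, movie_id):
        """Sample one logged rating for the movie and convert it to a 0/1 reward."""
        ratings = self.ratings_by_movie.get(movie_id, [])
        if not ratings:
            return 0  # No ratings logged: treat as not liked
        rating = self.rng.choice(ratings)
        return 1 if rating >= self.threshold else 0

    def like_probability(self, movie_id):
        """Empirical fraction of logged ratings at or above the threshold."""
        ratings = self.ratings_by_movie.get(movie_id, [])
        if not ratings:
            return 0.0
        return sum(r >= self.threshold for r in ratings) / len(ratings)


# Toy data for illustration only (not taken from the dataset).
arms = SampledRewardArms({1: [5.0, 4.0, 2.0, 4.5], 2: [1.0, 3.0]}, seed=0)
print(arms.like_probability(1), arms.like_probability(2))
```

With the notebook's DataFrame, the dictionary could be built with `dataset.groupby('movieId')['rating'].apply(list).to_dict()`.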



In [83]:

# Main Execution
# Load the dataset

# Initialize the environment with the dataset
env = MovieRecommendationEnv(dataset)

# Using Random Policy (0.5M)
Implement a random policy for movie recommendations and print each iteration. (Mandatory)

In [84]:
#  run the environment with an agent that is guided by a random policy
#-----write your code below this line---------

# Random Policy Implementation
def random_policy(env, iterations=1000):
    """
    Implement the random policy to select movies randomly.

    Parameters:
    - env: The movie recommendation environment.
    - iterations: Number of iterations for the simulation.

    Returns:
    - None
    """
    for _ in range(iterations):
        # Select a random movie
        action = random.choice(env.movies)

        # Interact with the environment using the selected action
        _, reward, _, _ = env.step(action)

    # Render final statistics
    env.render()
    env.close()

# Run the policies
print("Running Random Policy:")
random_policy(env)

Running Random Policy:
===== Total Trials: 1000 =====
Movie 804 | Rewards: 0, Trials: 0
Movie 1210 | Rewards: 0, Trials: 0
Movie 2018 | Rewards: 0, Trials: 0
Movie 2628 | Rewards: 0, Trials: 0
Movie 2826 | Rewards: 0, Trials: 0
Movie 3578 | Rewards: 0, Trials: 0
Movie 3617 | Rewards: 0, Trials: 0
Movie 3744 | Rewards: 0, Trials: 0
Movie 101 | Rewards: 0, Trials: 0
Movie 441 | Rewards: 0, Trials: 0
Movie 2858 | Rewards: 0, Trials: 0
Movie 1473 | Rewards: 0, Trials: 0
Movie 2997 | Rewards: 0, Trials: 0
Movie 235 | Rewards: 0, Trials: 0
Movie 1060 | Rewards: 0, Trials: 0
Movie 356 | Rewards: 0, Trials: 0
Movie 223 | Rewards: 0, Trials: 0
Movie 1500 | Rewards: 0, Trials: 0
Movie 2700 | Rewards: 0, Trials: 0
Movie 2395 | Rewards: 0, Trials: 0
Movie 3243 | Rewards: 0, Trials: 1
Movie 1517 | Rewards: 0, Trials: 0
Movie 3253 | Rewards: 0, Trials: 0
Movie 1580 | Rewards: 0, Trials: 0
Movie 1732 | Rewards: 0, Trials: 0
Movie 2596 | Rewards: 0, Trials: 0
Movie 3450 | Rewards: 0, Trials: 0
Movie 2

# Using Greedy Policy (1M)
Implement a greedy policy that always recommends the movie with the highest estimated reward and print each iteration. (Mandatory)
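
The "estimated reward" of a movie can be maintained as a running mean that is updated after every recommendation, without storing past rewards. The sketch below illustrates that incremental update in isolation; the helper name `update_mean` is an assumption made for this example.

```python
def update_mean(estimate, count, reward):
    """Return the new (estimate, count) after observing one more reward."""
    count += 1
    estimate += (reward - estimate) / count  # Move the estimate toward the new reward
    return estimate, count


# Feeding rewards one at a time reproduces the ordinary average.
observed = [1, 0, 1, 1]
estimate, count = 0.0, 0
for reward in observed:
    estimate, count = update_mean(estimate, count, reward)

print(estimate, sum(observed) / len(observed))
```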

In [85]:
#  run the environment with an agent that is guided by a greedy policy
#-----write your code below this line---------

# Greedy Policy Implementation
def greedy_policy(env, iterations=1000):
    """
    Implement the greedy policy to select movies based on estimated probabilities.

    Parameters:
    - env: The movie recommendation environment.
    - iterations: Number of iterations for the simulation.

    Returns:
    - None
    """
    estimated_values = {movie_id: 0 for movie_id in env.movies}  # Initialize estimated values
    counts = {movie_id: 0 for movie_id in env.movies}  # Track the number of times each movie is selected

    for _ in range(iterations):
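        # Select the movie with the highest estimated value (greedy choice).
        # max() returns the first key among ties, so while every estimate is 0
        # the first movie in env.movies is selected.
        action = max(estimated_values, key=estimated_values.get)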

        # Interact with the environment using the selected action
        _, reward, _, _ = env.step(action)

        # Update counts and estimated values
        counts[action] += 1
        estimated_values[action] += (reward - estimated_values[action]) / counts[action]

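    # Render final statistics. The environment is not reset here, so the
    # totals also include trials recorded by policies that ran earlier.
    env.render()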
    env.close()
    
print("Running Greedy Policy:")
greedy_policy(env)


Running Greedy Policy:
===== Total Trials: 2000 =====
Movie 804 | Rewards: 0, Trials: 1000
Movie 1210 | Rewards: 0, Trials: 0
Movie 2018 | Rewards: 0, Trials: 0
Movie 2628 | Rewards: 0, Trials: 0
Movie 2826 | Rewards: 0, Trials: 0
Movie 3578 | Rewards: 0, Trials: 0
Movie 3617 | Rewards: 0, Trials: 0
Movie 3744 | Rewards: 0, Trials: 0
Movie 101 | Rewards: 0, Trials: 0
Movie 441 | Rewards: 0, Trials: 0
Movie 2858 | Rewards: 0, Trials: 0
Movie 1473 | Rewards: 0, Trials: 0
Movie 2997 | Rewards: 0, Trials: 0
Movie 235 | Rewards: 0, Trials: 0
Movie 1060 | Rewards: 0, Trials: 0
Movie 356 | Rewards: 0, Trials: 0
Movie 223 | Rewards: 0, Trials: 0
Movie 1500 | Rewards: 0, Trials: 0
Movie 2700 | Rewards: 0, Trials: 0
Movie 2395 | Rewards: 0, Trials: 0
Movie 3243 | Rewards: 0, Trials: 1
Movie 1517 | Rewards: 0, Trials: 0
Movie 3253 | Rewards: 0, Trials: 0
Movie 1580 | Rewards: 0, Trials: 0
Movie 1732 | Rewards: 0, Trials: 0
Movie 2596 | Rewards: 0, Trials: 0
Movie 3450 | Rewards: 0, Trials: 0
Movi

# Using Epsilon-Greedy Policy (1.5M)
Implement the epsilon-greedy policy: with probability ε, explore (recommend a random movie); with probability 1 − ε, exploit (recommend the best-known movie). Try ε = 0.1, 0.2, and 0.5, and print each iteration. Which value of ε yields the best performance? (Mandatory)

In [86]:
#  run the environment with an agent that is guided by an epsilon-greedy policy
#-----write your code below this line---------
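
One possible shape for this cell is sketched below. It is an illustration under stated assumptions rather than the expected solution: the function `epsilon_greedy`, its `pull` callback, and the three toy Bernoulli arms are hypothetical and are used only so the example runs on its own.

```python
import random


def epsilon_greedy(pull, arms, epsilon=0.1, iterations=1000, seed=None):
    """Run epsilon-greedy over `arms`; `pull(arm)` must return a 0/1 reward.

    Returns the list of rewards, one per iteration.
    """
    rng = random.Random(seed)
    estimates = {arm: 0.0 for arm in arms}  # Running mean reward per arm
    counts = {arm: 0 for arm in arms}       # Number of pulls per arm
    rewards = []
    for _ in range(iterations):
        if rng.random() < epsilon:
            arm = rng.choice(arms)                   # Explore: uniform random arm
        else:
            arm = max(estimates, key=estimates.get)  # Exploit: best estimate so far
        reward = pull(arm)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        rewards.append(reward)
    return rewards


# Toy check with three hypothetical Bernoulli arms (not taken from the dataset).
toy_rng = random.Random(0)
true_p = {"A": 0.2, "B": 0.5, "C": 0.8}
toy_pull = lambda arm: 1 if toy_rng.random() < true_p[arm] else 0
for eps in (0.1, 0.2, 0.5):
    total = sum(epsilon_greedy(toy_pull, list(true_p), epsilon=eps, seed=1))
    print(f"epsilon={eps}: total reward {total}")
```

With the notebook's environment, `pull` could be `lambda movie_id: env.step(movie_id)[1]` and `arms` could be `list(env.movies)`.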

# Using UCB (1M)
Implement the UCB algorithm for movie recommendations and print each iteration. (Mandatory)
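
For reference, the selection rule that `ucb_policy` applies at step $t$ can be written as follows, where $Q_t(a)$ is the running mean reward of movie $a$ (`estimated_values`), $N_t(a)$ is the number of times it has been recommended so far (`counts`), and $c$ is the exploration parameter:

$$
a_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
$$

Movies with $N_t(a) = 0$ are given an infinite score so that each is tried once before the bound is used. Because the dataset contains 9724 movies, a 1000-iteration run spends every step on a previously untried movie and never reaches the confidence-bound term.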

In [87]:
#  run the environment with an agent that is guided by UCB
#-----write your code below this line---------

# Upper Confidence Bound (UCB) Policy Implementation
def ucb_policy(env, iterations=1000, c=2):
    """
    Implement the UCB policy to select movies based on estimated values and confidence bounds.

    Parameters:
    - env: The movie recommendation environment.
    - iterations: Number of iterations for the simulation.
    - c: Exploration parameter for UCB.

    Returns:
    - None
    """
    estimated_values = {movie_id: 0 for movie_id in env.movies}  # Initialize estimated values
    counts = {movie_id: 0 for movie_id in env.movies}  # Track the number of times each movie is selected

    for t in range(1, iterations + 1):
        # Select the movie based on UCB formula
        ucb_values = {}
        for movie_id in env.movies:
            if counts[movie_id] == 0:
                ucb_values[movie_id] = float('inf')  # Select untried movies first
            else:
                ucb_values[movie_id] = estimated_values[movie_id] + c * math.sqrt(math.log(t) / counts[movie_id])

        action = max(ucb_values, key=ucb_values.get)

        # Interact with the environment using the selected action
        _, reward, _, _ = env.step(action)

        # Update counts and estimated values
        counts[action] += 1
        estimated_values[action] += (reward - estimated_values[action]) / counts[action]

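    # Render final statistics. The environment is not reset here, so the
    # totals also include trials recorded by policies that ran earlier.
    env.render()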
    env.close()

print("Running UCB Policy:")
ucb_policy(env)

Running UCB Policy:
===== Total Trials: 3000 =====
Movie 804 | Rewards: 0, Trials: 1001
Movie 1210 | Rewards: 1, Trials: 1
Movie 2018 | Rewards: 0, Trials: 1
Movie 2628 | Rewards: 0, Trials: 1
Movie 2826 | Rewards: 0, Trials: 1
Movie 3578 | Rewards: 0, Trials: 1
Movie 3617 | Rewards: 0, Trials: 1
Movie 3744 | Rewards: 0, Trials: 1
Movie 101 | Rewards: 0, Trials: 1
Movie 441 | Rewards: 0, Trials: 1
Movie 2858 | Rewards: 1, Trials: 1
Movie 1473 | Rewards: 1, Trials: 1
Movie 2997 | Rewards: 0, Trials: 1
Movie 235 | Rewards: 0, Trials: 1
Movie 1060 | Rewards: 0, Trials: 1
Movie 356 | Rewards: 1, Trials: 1
Movie 223 | Rewards: 0, Trials: 1
Movie 1500 | Rewards: 0, Trials: 1
Movie 2700 | Rewards: 0, Trials: 1
Movie 2395 | Rewards: 0, Trials: 1
Movie 3243 | Rewards: 0, Trials: 2
Movie 1517 | Rewards: 0, Trials: 1
Movie 3253 | Rewards: 0, Trials: 1
Movie 1580 | Rewards: 0, Trials: 1
Movie 1732 | Rewards: 0, Trials: 1
Movie 2596 | Rewards: 0, Trials: 1
Movie 3450 | Rewards: 0, Trials: 1
Movie 2

# Plot the cumulative rewards for all policies on a single graph to compare their performance. (0.5M)

In [88]:
#-----write your code below this line---------
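
A comparison plot could be built along the following lines. This sketch assumes each policy is changed to return its list of per-iteration rewards (the policy functions in this notebook currently return `None`); the helpers `cumulative_rewards` and `plot_cumulative` and the `reward_histories` dictionary are hypothetical names, and the plotting step assumes `matplotlib` is installed.

```python
from itertools import accumulate


def cumulative_rewards(rewards):
    """Return the running total of a list of per-iteration rewards."""
    return list(accumulate(rewards))


def plot_cumulative(reward_histories, path=None):
    """Plot one cumulative-reward curve per policy on a single set of axes.

    reward_histories maps a policy name to its list of per-iteration rewards.
    """
    import matplotlib.pyplot as plt  # Imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    for name, rewards in reward_histories.items():
        ax.plot(cumulative_rewards(rewards), label=name)
    ax.set_xlabel("Iteration")
    ax.set_ylabel("Cumulative reward")
    ax.set_title("Cumulative reward by policy")
    ax.legend()
    if path is not None:
        fig.savefig(path)  # Optionally write the figure to disk
    return fig


# The running total itself needs no plotting library.
print(cumulative_rewards([1, 0, 1, 1]))
```

A call such as `plot_cumulative({"Random": random_rewards, "Greedy": greedy_rewards})` would then draw one line per policy, where the reward lists are whatever the modified policy functions return.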

# Conclusion (0.5M)

Determine which policy performs the best based on cumulative reward. Provide a concise conclusion (250 words) summarizing the decision-making process and the trade-offs between exploration and exploitation.

`----write below this line------`