# 1.0 Use Case
In this notebook, I will implement a contextual bandit algorithm for an online shopping platform. The *contexts* are derived from a B2C business model but can easily be adapted to other business models. In a bandit model for an online ecommerce platform, the contexts could be various characteristics of the user or the item. Here are a few examples:

1. User demographics: Age, gender, location, etc.
2. User behavior: Past purchases, browsing history, click patterns, etc.
3. Time: Time of day, day of the week, season, etc.
4. Item characteristics: Category, price, brand, ratings, etc.
5. Current context: What page the user is on, what they searched for, etc.

These contexts can be used to personalize the recommendations made by the bandit algorithm. For example, the algorithm might recommend different products to a user who is browsing in the morning compared to the evening, or to a user who has a history of purchasing electronics compared to a user who typically buys books.
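
As a minimal sketch of how such contexts might be turned into model input, the snippet below one-hot encodes a few categorical fields and appends one scaled numeric field. The field names (`age_group`, `time_of_day`, `last_category`, `price`), their vocabularies, and `MAX_PRICE` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical vocabularies for three categorical context fields.
VOCAB = {
    "age_group": ["18-24", "25-34", "35-54", "55+"],
    "time_of_day": ["morning", "afternoon", "evening"],
    "last_category": ["electronics", "books", "clothing"],
}
MAX_PRICE = 500.0  # Assumed upper bound used to scale price into [0, 1].


def encode_context(context):
    """Encodes a context dict as a flat feature vector."""
    parts = []
    for field, values in VOCAB.items():
        one_hot = np.zeros(len(values))
        one_hot[values.index(context[field])] = 1.0
        parts.append(one_hot)
    # The numeric feature is clipped, scaled, and appended last.
    parts.append(np.array([min(context["price"], MAX_PRICE) / MAX_PRICE]))
    return np.concatenate(parts)


morning_reader = {"age_group": "25-34", "time_of_day": "morning",
                  "last_category": "books", "price": 20.0}
features = encode_context(morning_reader)
print(features.shape)  # (11,)
```

The code later in this notebook uses a much simpler context, a single channel index, but the same idea applies: whatever the bandit conditions on has to be encoded numerically first.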

There are several algorithms used in bandit models for online ecommerce platforms. Here are a few examples:

1. **Epsilon-Greedy Algorithm**: This is a simple method where the algorithm explores (picks a random option) with probability epsilon and otherwise exploits the option with the best estimated reward.

2. **Upper Confidence Bound (UCB) Algorithm**: This algorithm balances exploration and exploitation by choosing the option with the highest upper confidence bound.

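3. **Thompson Sampling**: This is a probabilistic algorithm that samples a plausible reward for each option from its posterior distribution and picks the option with the highest sample, so each option is chosen in proportion to the probability that it is the best option.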

4. **Contextual Bandit Algorithms**: These algorithms take into account the context (user demographics, time of day, etc.) when choosing an option.

5. **Gradient Bandit Algorithms**: These algorithms use a gradient ascent method to update the preference for each action based on the received reward.

Each of these algorithms has its own strengths and weaknesses, and the choice of algorithm depends on the specific requirements of the ecommerce platform.
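
To make the first three selection rules concrete, here is a minimal sketch of each one for a plain (non-contextual) bandit. The function names and the toy statistics are illustrative assumptions; the Thompson sampling variant assumes binary (success or failure) rewards with a uniform Beta(1, 1) prior.

```python
import math
import numpy as np

rng = np.random.default_rng(0)


def epsilon_greedy(mean_rewards, epsilon=0.1):
    """Explores uniformly with probability epsilon, otherwise exploits."""
    if rng.random() < epsilon:
        return int(rng.integers(len(mean_rewards)))
    return int(np.argmax(mean_rewards))


def ucb1(mean_rewards, counts, step):
    """Picks the arm with the highest mean reward plus exploration bonus."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))  # Try untried arms first.
    bonus = np.sqrt(2.0 * math.log(step) / counts)
    return int(np.argmax(np.asarray(mean_rewards) + bonus))


def thompson_sampling(successes, failures):
    """Samples each arm's success rate from its Beta posterior."""
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))


# Toy statistics for three arms (made-up numbers for illustration).
means = [0.20, 0.50, 0.35]
counts = [10, 10, 2]
print(epsilon_greedy(means, epsilon=0.0))  # With epsilon=0 this always exploits.
print(ucb1(means, counts, step=22))        # The rarely tried arm gets a large bonus.
print(thompson_sampling([2, 5, 1], [8, 5, 1]))
```

With these numbers, epsilon-greedy without exploration picks the arm with the best mean, while UCB prefers the arm that has only been tried twice because its confidence bound is still wide.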

## 1.1 Algorithm Selection
I will be implementing a contextual bandit model that uses a *Gradient Bandit Algorithm* to select a model for a given context.
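
Before the neural-network version, the core idea can be sketched in tabular form: keep one row of action preferences per context, turn it into a softmax policy, and nudge the preferences by gradient ascent on the reward. The class name `GradientBandit`, the step size, and the toy reward table are assumptions made for illustration, not part of the implementation below.

```python
import numpy as np


class GradientBandit:
    """Tabular gradient bandit with one preference row per context."""

    def __init__(self, num_contexts, num_actions, alpha=0.1, seed=0):
        self.prefs = np.zeros((num_contexts, num_actions))  # Preferences H(s, a).
        self.baseline = np.zeros(num_contexts)  # Running mean of past rewards.
        self.visits = np.zeros(num_contexts)
        self.alpha = alpha
        self.rng = np.random.default_rng(seed)

    def policy(self, context):
        """Returns the softmax policy for one context."""
        z = self.prefs[context] - self.prefs[context].max()  # Numerical stability.
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

    def act(self, context):
        """Samples an action from the softmax policy."""
        probs = self.policy(context)
        return int(self.rng.choice(len(probs), p=probs))

    def update(self, context, action, reward):
        """Takes one gradient ascent step on the expected reward."""
        probs = self.policy(context)
        advantage = reward - self.baseline[context]
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        # Raise the chosen action's preference if the reward beat the baseline,
        # and lower the others in proportion to their probability.
        self.prefs[context] += self.alpha * advantage * (one_hot - probs)
        # Fold the new reward into the running mean for this context.
        self.visits[context] += 1
        self.baseline[context] += (reward - self.baseline[context]) / self.visits[context]


# Toy environment: mean reward of each action (columns) in each context (rows).
true_means = np.array([[0.1, 0.9, 0.3],
                       [0.8, 0.2, 0.4]])
bandit = GradientBandit(num_contexts=2, num_actions=3)
env_rng = np.random.default_rng(1)
for _ in range(5000):
    s = int(env_rng.integers(2))
    a = bandit.act(s)
    r = true_means[s, a] + env_rng.normal(scale=0.1)
    bandit.update(s, a, r)
print(bandit.policy(0).round(2), bandit.policy(1).round(2))
```

The running-mean baseline is one common choice; it reduces the variance of the update without changing its expected direction.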

In [None]:
import math
import numpy as np
import tensorflow as tf

class ContextualBandit:
    """A class that defines the contextual bandit environment.

    Attributes:
        state (int): The current state of the environment.
        channels (list): A list of the channels in the environment.
        models (list): A list of the models in the environment.
        rewards (numpy.ndarray): A 2D array of the rewards for each model and channel.
        num_channels (int): The number of channels (contexts) in the environment.
        num_models (int): The number of models in the environment.
        total_reward (float): The total reward obtained by the agent.
        action_count (numpy.ndarray): A 2D array of the number of times each action has been taken for each state.
        total_reward_per_model_per_channel (numpy.ndarray): A 2D array of the total reward obtained by each model for each channel.
    """
    def __init__(self):
        """Initializes the ContextualBandit class."""
        self.state = 0
        # Define the four channels
        self.channels = ['Channel 1', 'Channel 2', 'Channel 3', 'Channel 4']
        # Define the nine models
        self.models = ['Model 1', 'Model 2', 'Model 3', 'Model 4', 'Model 5', 'Model 6', 'Model 7', 'Model 8', 'Model 9']
        # Define the mean reward of each model (rows) in each channel (columns)
        self.rewards = np.array([[0.1, 0.2, 0.3, 0.4],
                                 [0.2, 0.3, 0.4, 0.1],
                                 [0.3, 0.4, 0.1, 0.2],
                                 [0.4, 0.1, 0.2, 0.3],
                                 [0.5, 0.6, 0.7, 0.8],
                                 [0.6, 0.7, 0.8, 0.5],
                                 [0.7, 0.8, 0.5, 0.6],
                                 [0.8, 0.5, 0.6, 0.7],
                                 [0.9, 0.9, 0.9, 0.9]])
        self.num_channels = len(self.channels)
        self.num_models = len(self.models)
        self.total_reward = 0
        self.action_count = np.zeros((self.num_channels, self.num_models))
        self.total_reward_per_model_per_channel = np.zeros((self.num_channels, self.num_models))

    def get_state(self):
        """Gets a random state from the environment.

        Returns:
            int: A random state from the environment.
        """
        # Randomly select a channel. For deployment, this would be the channel that the user is currently on.
        self.state = np.random.randint(0, self.num_channels)
        return self.state
    
    def get_reward(self, action):
        """Gets the reward for a given action in the current state.

        Args:
            action (int): The action to take in the current state.

        Returns:
            float: The reward for the given action in the current state.
        """
        # Get the reward for the selected model and channel
        reward = self.rewards[action, self.state]
        # Draw scalar Gaussian noise (standard deviation 0.1) so the reward stays a float, not an array
        noise = np.random.randn() / 10.0
        # Add the noise to the reward
        reward += noise
        # Update the total reward
        self.total_reward += reward
        # Update the reward per model per channel
        self.total_reward_per_model_per_channel[self.state, action] += reward
        # Increment the action count for the current state and action
        self.action_count[self.state, action] += 1
        return reward

    def get_action(self, model, step):
        """Gets an action for the current state using the Upper Confidence Bound algorithm.

        Args:
            model: Unused by the UCB rule; kept so existing callers keep working.
            step (int): The current step of the algorithm.

        Returns:
            int: The index of the model (action) with the highest upper confidence bound.
        """
        counts = self.action_count[self.state]
        # Calculate the Upper Confidence Bound for each action
        ucb = np.zeros(self.num_models)
        for i in range(self.num_models):
            if counts[i] == 0:
                # Try every untried action first; this also avoids dividing by zero
                ucb[i] = np.inf
            else:
                mean_reward = self.total_reward_per_model_per_channel[self.state, i] / counts[i]
                # max(step, 1) guards against log(0) on the very first step
                bonus = math.sqrt(2 * math.log(max(step, 1)) / counts[i])
                ucb[i] = mean_reward + bonus
        # Select the action with the highest UCB
        return int(np.argmax(ucb))


class Agent:
    """A class that defines the agent that interacts with the contextual bandit environment.

    Attributes:
        model (tf.keras.Model): The neural network model used by the agent.
        state_size (int): The size of the state space.
        action_size (int): The size of the action space.
    """
    def __init__(self, lr, state_size, action_size):
        """Initializes the Agent class.

        Args:
            lr (float): The learning rate for the neural network model.
            state_size (int): The size of the state space.
            action_size (int): The size of the action space.
        """
        self.state_size = state_size
        self.action_size = action_size
        # The network takes a one-hot encoded state and outputs a softmax policy over actions.
        # Softmax (not sigmoid) is needed so the outputs form a valid probability distribution.
        self.state_in = tf.keras.layers.Input(shape=(state_size,), dtype='float32')
        output = tf.keras.layers.Dense(units=action_size, activation='softmax', kernel_initializer='ones')(self.state_in)
        self.model = tf.keras.models.Model(inputs=self.state_in, outputs=output)
        self.model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=lr))

    def _one_hot(self, state):
        """One-hot encodes integer state indices into an array of shape (batch, state_size)."""
        return np.eye(self.state_size, dtype=np.float32)[np.asarray(state, dtype=int).reshape(-1)]

    def get_action(self, state, bandit=None, step=None):
        """Gets an action for a given state.

        By default the action is sampled from the policy network. If both
        `bandit` and `step` are given, the bandit's Upper Confidence Bound
        rule is used instead.

        Args:
            state (numpy.ndarray): The state to select an action for.
            bandit (ContextualBandit, optional): The contextual bandit environment.
            step (int, optional): The current step of the algorithm.

        Returns:
            int: The action to take for the given state.
        """
        probs = self.model.predict(self._one_hot(state), verbose=0)[0].astype(np.float64)
        if bandit is not None and step is not None:
            return bandit.get_action(probs, step)
        # Renormalize to absorb float32 rounding before sampling
        return int(np.random.choice(len(probs), p=probs / probs.sum()))

    def train(self, state, action, reward):
        """Trains the neural network model.

        Args:
            state (numpy.ndarray): The state to train the model on.
            action (int): The action taken in the given state.
            reward (float): The reward received for taking the given action in the given state.
        """
        action_one_hot = np.eye(self.action_size, dtype=np.float32)[[action]]
        # Weighting the cross-entropy loss by the reward yields a policy-gradient style update
        sample_weight = np.asarray(reward, dtype=np.float32).reshape(-1)
        self.model.train_on_batch(self._one_hot(state), action_one_hot, sample_weight=sample_weight)


def train_bandit(agent, bandit, num_episodes):
    """Trains the agent on the contextual bandit environment.

    Args:
        agent (Agent): The agent to train.
        bandit (ContextualBandit): The contextual bandit environment to train the agent on.
        num_episodes (int): The number of episodes to train the agent on.

    Returns:
        numpy.ndarray: A 2D array of the total rewards for each channel and model.
    """
    total_reward = np.zeros([bandit.num_channels, bandit.num_models])  # Set scoreboard for bandit (4x9).
    e = 0.1  # Set the chance of taking a random action.

    # Keras models run eagerly in TensorFlow 2, so no tf.Session is needed.
    for i in range(num_episodes):
        s = bandit.get_state()  # Get a state from the environment.
        # Choose either a random action or one from our network.
        if np.random.rand() < e:
            a = np.random.randint(bandit.num_models)
        else:
            a = agent.get_action(np.array([s]))
        # Get our reward for taking an action given a bandit, coerced to a plain float.
        r = float(np.squeeze(bandit.get_reward(a)))
        # Update the network.
        agent.train(np.array([s]), a, r)
        # Update our running tally of scores.
        total_reward[s, a] += r
        if i % 100 == 0:
            print(f"Mean reward for each of the {bandit.num_channels} channels: {np.mean(total_reward, axis=1)}")
    return total_reward

