# Model 1_1
## Dataset: Gym

## Programmer: Giovanni Vecchione
## Date: 4/17/24
## Subject: Machine Learning 2 - Project 6
Use Reinforcement Learning (RL) to build the project. Submit your project as a Jupyter notebook.

In [189]:
import matplotlib as mtp
import torch
import numpy as np

In [190]:
#Checks if GPU is being used
if torch.cuda.is_available():
    device = torch.device("cuda")  # Use the GPU
    print("Using GPU:", torch.cuda.get_device_name(0)) 
else:
    device = torch.device("cpu")  # Fallback to CPU
    print("GPU not available, using CPU.")

#Using GPU: NVIDIA GeForce GTX 1660 SUPER - Successful
#NOTE: This took some time to set up: the CUDA toolkit v12.4 and the right supplemental packages had to be installed and added to the path.
#This drastically improved training time.

Using GPU: NVIDIA GeForce GTX 1660 SUPER


In [191]:
import gym

env = gym.make("CartPole-v1", render_mode="rgb_array")
#render_mode="rgb_array" doesn't seem to display anything when using the render call (in this mode render() returns the frame as an array instead of opening a window)

CartPole-v1 is a classic control problem where the goal is to balance a pole on a cart by applying forces left or right.

In [192]:
print(env.observation_space)    # See what kind of data the environment provides
print(env.action_space)         # See the agent's possible actions 

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Discrete(2)


In [193]:
obs, info = env.reset(seed=42)
obs

array([ 0.0273956 , -0.00611216,  0.03585979,  0.0197368 ], dtype=float32)
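The four numbers in the observation are easier to read with labels. The sketch below assumes the standard CartPole layout (cart position, cart velocity, pole angle in radians, pole angular velocity); the `CartPoleObs` name is hypothetical and not part of Gym.

```python
import math
from collections import namedtuple

# Hypothetical helper type; field order follows the CartPole observation layout.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Values copied from the reset(seed=42) output above.
labelled = CartPoleObs(0.0273956, -0.00611216, 0.03585979, 0.0197368)

# The angle is stored in radians; convert to degrees for intuition.
angle_deg = math.degrees(labelled.pole_angle)
print(labelled)
print(f"pole angle: {angle_deg:.2f} degrees")

# The observation-space bound on the angle (0.41887903 rad) in degrees.
bound_deg = math.degrees(0.41887903)
print(f"angle bound: {bound_deg:.1f} degrees")
```

The pole starts almost upright, tilted about two degrees to the right.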

In [194]:
info

{}

In [195]:
img = env.render()
img.shape #only works when using the rgb_array mode from earlier; the frame is returned as an array (height, width, channels) rather than displayed

(400, 600, 3)

In [196]:
action = 1  # accelerate right
obs, reward, done, truncated, info = env.step(action)



## The step() method executes the desired action and returns five values:

### *1. obs*
    This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

### *2. reward*
    In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep the episode running for as long as possible.

### *3. done*
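    This value will be True when the episode is over. This happens when the pole tilts too much or the cart goes off the screen. After that, the environment must be reset before it can be used again. Note that CartPole-v1 also caps episodes at 500 steps (200 is the CartPole-v0 limit); in recent Gym versions, reaching that cap (which means you have won) is reported through truncated rather than done.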

### *4. truncated*
    This value will be True when an episode is interrupted early, for example by an environment wrapper that imposes a maximum number of steps per episode (see Gym's documentation for more details on environment wrappers). Some RL algorithms treat truncated episodes differently from episodes finished normally (i.e., when done is True), but in this notebook we treat them identically.

### *5. info*
    This environment-specific dictionary may provide extra information, just like the one returned by the reset() method.
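The loop below sketches how these five values are typically consumed in an episode loop. It runs against a stand-in environment so it works without Gym: `ToyEnv` and `run_episode` are hypothetical names that only mimic the shape of the `reset()` and `step()` return values.

```python
import random

class ToyEnv:
    """Hypothetical stand-in that mimics the five-value step() API (not a Gym class)."""

    def __init__(self, fail_proba=0.1, max_steps=20, seed=0):
        self.fail_proba = fail_proba
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0], {}  # (obs, info)

    def step(self, action):
        self.steps += 1
        done = self.rng.random() < self.fail_proba               # episode failed
        truncated = (not done) and self.steps >= self.max_steps  # time limit hit
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, truncated, {}

def run_episode(env):
    obs, info = env.reset()
    total_reward = 0.0
    while True:
        action = 1  # fixed action; a real policy would choose based on obs
        obs, reward, done, truncated, info = env.step(action)
        total_reward += reward
        if done or truncated:  # stop on either signal; reset before reuse
            break
    return total_reward

toy_env = ToyEnv()
total = run_episode(toy_env)
print("episode return:", total)
```

Because the reward is 1.0 per step, the episode return equals the number of steps survived.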

In [197]:
n_iterations = 150
n_episodes_per_update = 10
n_max_steps = 50
discount_factor = 0.95

## Neural Network Policies

Here a neural network replaces a hand-written policy function: it takes an observation as input and outputs the probability of taking an action.

*Typical Work Flow:* 

Define Neural Network (model1_1) and loss function (loss_fn)

For each iteration:

1. Run multiple episodes using play_multiple_episodes
2. Discount and Normalize the rewards using the helper functions
3. Update the model1_1 parameters based on the collected gradients (this would need an optimizer and a gradient application step)

In [198]:
#Here is the code to build a basic neural network policy using Keras:
import tensorflow as tf
from tensorflow import keras

model1_1 = keras.Sequential([
    keras.layers.Dense(32, activation='relu'), 
    keras.layers.Dense(32, activation='relu'),  # Another hidden layer
    keras.layers.Dense(1, activation='sigmoid')   # Output layer: a single probability (of moving left); P(right) = 1 - P(left)
])



## Policy Gradients:
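A neural network policy cannot be trained with ordinary supervised learning, because we do not know the correct action at each step. In this case we're using a Policy Gradient method, which nudges the network toward actions that led to higher returns.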

In [199]:
"""
Core Action: This function takes a single step within the environment.

Neural Network Interaction: It receives the current observation (obs), 
passes it through the model (model1_1) to get the probability of moving left (left_proba).

Action Selection: An action is sampled based on that probability.

Environment Update: The action is executed, and the function gets the next observation, reward, and 'done' flag from the environment.

Loss Calculation: It prepares data for calculating the policy gradient loss, using a loss_fn (binary cross-entropy, defined further down).

Gradient Calculation: Uses a tf.GradientTape to record the operations, enabling the calculation of the policy gradient.

NOTE: Returns the new observation, reward, done and truncated flags, and the gradients (the action itself is not returned in this version).
"""

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    
    grads = tape.gradient(loss, model.trainable_variables)
    # Extract the single boolean from the [1, 1] tensor before converting to int
    obs, reward, done, truncated, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, truncated, grads
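Why binary cross-entropy works as the loss here: with the target set to the action actually taken, the loss is the negative log-probability of that action, and its gradient with respect to the pre-sigmoid output (the logit) is simply `p - y`. The check below verifies this numerically in plain Python, independent of TensorFlow; the helper names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p):
    # Binary cross-entropy for one sample: -log p if y == 1, -log(1 - p) if y == 0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

z = 0.3              # example logit (pre-sigmoid output)
p_left = sigmoid(z)  # probability of going left
eps = 1e-6
checks = []
for y in (1.0, 0.0):  # y = 1: left was taken, y = 0: right was taken
    # Central finite difference of the loss with respect to the logit
    numeric = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
    analytic = p_left - y  # closed-form gradient with respect to the logit
    checks.append((numeric, analytic))
    print(f"y={y:.0f}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```

A gradient-descent step on this loss therefore raises the probability of the action that was taken. In the training loop these gradients are multiplied by the normalized return, so actions followed by above-average returns are reinforced and the others are discouraged.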

In [200]:
""" 
Episode Loop: This function is responsible for running n_episodes.

Data Collection: It collects the rewards (all_rewards) and gradients (all_grads) produced by play_one_step during each episode.
*New* improvements: added an observation collection and an action collection as well.
"""

def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs, info = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, truncated, grads = play_one_step(
                env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done or truncated:
                break

        all_rewards.append(current_rewards)
        all_grads.append(current_grads)

    return all_rewards, all_grads

In [201]:
"""
Advantage-based update

Advantage function attempts to answer the question: "Was this action better or worse than expected in the current situation?" 
It provides a more refined learning signal compared to just using the raw rewards.

"""
"""
def calculate_advantage(all_final_rewards, baseline=None):
    if baseline is None:
        baseline = np.mean(all_final_rewards)  # Simple average baseline

    advantages = np.array([reward - baseline for reward in all_final_rewards])
    return advantages
"""

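The disabled calculate_advantage above applies np.mean to a list of per-episode arrays, which is fragile when episodes have different lengths. One possible workaround, sketched here in plain Python, is to compute a single baseline over all steps and subtract it episode by episode. The function name and the sample numbers are made up for illustration.

```python
def calculate_advantages(all_returns):
    """Subtract one mean baseline from the per-step returns of ragged episodes."""
    flat = [r for episode in all_returns for r in episode]
    baseline = sum(flat) / len(flat)  # simple average baseline over every step
    return [[r - baseline for r in episode] for episode in all_returns]

# Two episodes of different lengths (made-up discounted returns).
sample_returns = [[3.0, 2.0, 1.0], [2.0, 1.0]]
advantages = calculate_advantages(sample_returns)
print(advantages)
```

Note that discount_and_normalize_rewards already subtracts the mean over all steps, which has the same effect as this mean baseline.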

In [202]:
""" 
Discounted Returns: This straightforward function takes a list of rewards from a single episode and calculates the 
discounted cumulative rewards, with future rewards being weighted less by the discount_factor.
"""
def discount_rewards(rewards, discount_factor):
    discounted = np.array(rewards, dtype=np.float64)  # float dtype so integer rewards are not truncated
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

""" 
Normalization: This function applies the discount_rewards to each episode's rewards and then normalizes them 
(subtracting the mean and dividing by the standard deviation). Normalization can often improve stability during learning.

"""

def discount_and_normalize_rewards(all_rewards, discount_factor):
    all_discounted_rewards = [discount_rewards(rewards, discount_factor)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
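A small worked example makes the two helpers concrete. The sketch below re-implements them in plain Python (`discount_plain` is a hypothetical name, and the reward values are made up) and uses the population standard deviation, which matches NumPy's default `.std()`.

```python
import statistics

def discount_plain(rewards, discount_factor):
    # Backward pass: G_t = r_t + discount_factor * G_{t+1}
    discounted = [float(r) for r in rewards]
    for step in range(len(discounted) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

# Single episode: the last reward propagates backward, shrinking by 0.8 each step.
example = discount_plain([10, 0, -50], discount_factor=0.8)
print(example)

# Normalize across two episodes of different lengths.
all_discounted = [example, discount_plain([10, 20], discount_factor=0.8)]
flat = [g for episode in all_discounted for g in episode]
mean, std = statistics.fmean(flat), statistics.pstdev(flat)
normalized_demo = [[(g - mean) / std for g in episode] for episode in all_discounted]
print(normalized_demo)
```

For the first episode the returns work out to roughly -22, -40 and -50: the first action looks bad despite its immediate reward of 10, because of what followed it.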

*NOTE:* Since we are sampling a single action based on the probability, the y_target in the play_one_step function represents which action was actually taken.

In [203]:
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)  #Nadam is Adam with Nesterov momentum; used here to apply the policy gradient updates
loss_fn = tf.keras.losses.binary_crossentropy #You can replace this with a custom loss function

In [204]:
for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model1_1, loss_fn)
    all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                       discount_factor)
    # Advantage Calculation 
    #all_advantages = calculate_advantage(all_final_rewards)  

    all_mean_grads = []
    for var_index in range(len(model1_1.trainable_variables)):

        # Get gradients from 'play_one_step' for the variable 
        #grads_from_play_one_step = [grads[var_index] for episode_rewards, grads in zip(all_rewards, all_grads)]

    
        # Calculate the mean
        #mean_grad = tf.reduce_mean(grads_from_play_one_step, axis=0)
                                   
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        
        #Implement the advanatage-based factor to the mean_grads
        #modified_grad = mean_grad * all_advantages 

        all_mean_grads.append(mean_grads)

    optimizer.apply_gradients(zip(all_mean_grads, model1_1.trainable_variables))

    """ 
    NOTE: The commented out code lines above are meant to implement a custom advantage-based function.
    However sequential models are very limited it seems in being able to customize with variable length of data.
    A solution to this could be to implement an RNN however this would take some more time.
    """

KeyboardInterrupt: 

In [205]:
def evaluate_model(model, n_test_episodes=100):
    total_rewards = []
    for _ in range(n_test_episodes):
        obs, info = env.reset()  # reset() returns (obs, info), as used earlier
        done = truncated = False
        total_reward = 0
        while not (done or truncated):  # stop on failure or on the time limit
            # The model outputs P(left); act greedily: 0 = left, 1 = right
            left_proba = float(model(obs[np.newaxis])[0, 0])
            action = 0 if left_proba > 0.5 else 1
            obs, reward, done, truncated, info = env.step(action)
            total_reward += reward
        total_rewards.append(total_reward)

    avg_reward = np.mean(total_rewards)
    return avg_reward

In [None]:
# Evaluate the trained model
avg_reward = evaluate_model(model1_1)
print("Average reward after training:", avg_reward)

""" 
This is by no means the standard way to evaluate how a model performs in the game.
I couldn't find a specific way to do this, so I went with a general avg_reward evaluation.
We could add this to the training loop to track the average over time.
"""

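One lightweight way to track progress during training, sketched under the assumption that each iteration yields the same nested list of rewards that play_multiple_episodes returns, is to record the mean undiscounted return per batch. The helper name and the stand-in batches are hypothetical.

```python
def mean_episode_return(all_rewards):
    """Average undiscounted return per episode for one batch of episodes."""
    return sum(sum(episode) for episode in all_rewards) / len(all_rewards)

history = []
# Stand-in batches; in the notebook these would come from play_multiple_episodes.
fake_batches = [
    [[1.0] * 12, [1.0] * 9],
    [[1.0] * 30, [1.0] * 26],
]
for iteration, batch_rewards in enumerate(fake_batches):
    history.append(mean_episode_return(batch_rewards))
    print(f"iteration {iteration}: mean return {history[-1]:.1f}")
```

Keep in mind that with n_max_steps = 50 the returns seen during training are capped at 50, so this curve saturates earlier than a full-length evaluation would.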
# Debugging

### ISSUE 1: Custom Loss Function not working
### 1: SOLVED
model1_1 = keras.Sequential([
    keras.layers.Dense(32, activation='relu'), 
    keras.layers.Dense(32, activation='relu'),  # Another hidden layer
    keras.layers.Dense(1, activation='sigmoid')   # Output layer: a single probability (of moving left); P(right) = 1 - P(left)
])

    The issue here isn't the code or the model; a choice just needs to be made. When using sequential models, the data fed into training must be consistent in shape. In my case I also wanted to add a Policy Gradient loss calculation, which means the shapes in "all_final_rewards" need to be made uniform by either enforcing a max size through early termination, padding shorter sequences to a fixed length, OR switching to an RNN, since it can handle inconsistent lengths of data (though it adds more complexity).

Tried to use early termination but that didn't seem to work.

Okay, so the issue is that we are trying to convert data of inconsistent lengths (which I assume is because episodes take a different number of steps to succeed or fail) into a TensorFlow tensor. This is what breaks when implementing a calc_gradient_loss function. Although this is not explicitly in this section, it appears to be good practice to integrate it. The options are:
1. Early Termination
2. Padding
3. RNN

Even when adding a padding function to help with shape, it still does not accept the shape. Probably due to the 0s being interpreted as T/F or something.
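For reference, padding usually needs a companion mask so the padded zeros do not count as real data. The sketch below shows the idea in plain Python; it is not the padding function that was tried here, and `pad_with_mask` and `masked_mean` are hypothetical names.

```python
def pad_with_mask(sequences, pad_value=0.0):
    """Pad ragged sequences to a common length and return a 1/0 validity mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1.0] * len(seq) + [0.0] * n_pad)
    return padded, mask

def masked_mean(padded, mask):
    # Only positions where the mask is 1 contribute to the mean.
    total = sum(v * m for row, mrow in zip(padded, mask) for v, m in zip(row, mrow))
    count = sum(m for mrow in mask for m in mrow)
    return total / count

episodes = [[1.0, 2.0, 3.0], [4.0]]
padded, mask = pad_with_mask(episodes)
print(padded)
print(mask)
print(masked_mean(padded, mask))
```

The rectangular `padded` and `mask` lists can then be converted to tensors of identical shape, and the mask multiplied into any per-step loss.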


*FOUND IT*: I tried to implement multiple loss functions at once. I removed my custom one and am instead using an advantage-based function as a helper.

### ISSUE 2: Advantage-Based Function not working
### 2: SOLVED

Sequential models are very limited it seems in being able to customize with variable length of data.
A solution to this could be to implement an RNN however this would take some more time.

### MODEL DECISION FOR TESTING : Focus on the parameters of the actual model. We'll implement custom functions another time.

### Future Directions

* RNN Exploration: While you're focusing on sequential model optimization now, keep the RNN option in mind.  If you hit limitations with the sequential approach, the time investment to migrate to an RNN model could pay off in terms of performance.

* Advanced Advantage:  If your current advantage calculation is working well, consider investigating Generalized Advantage Estimation (GAE)  later on for potentially even better learning stability.
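As a starting point for that future work, here is a minimal sketch of the GAE recursion. It assumes per-step value estimates from a critic, which this notebook does not have yet, so `values`, `last_value`, and the `gae` function itself are hypothetical.

```python
def gae(rewards, values, last_value=0.0, gamma=0.95, lam=0.9):
    """Generalized Advantage Estimation for one episode.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        # After the final step, fall back to last_value (0 for a terminal state).
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# With lam = 1 and zero value estimates, GAE reduces to plain discounted returns.
adv_mc = gae([10.0, 0.0, -50.0], [0.0, 0.0, 0.0], gamma=0.8, lam=1.0)
# With lam = 0, GAE reduces to the one-step TD errors.
adv_td = gae([1.0, 1.0, 1.0], [1.0, 2.0, 3.0], gamma=0.5, lam=0.0)
print(adv_mc)
print(adv_td)
```

The `lam` parameter trades bias for variance: values near 1 behave like the discounted returns used in this notebook, while values near 0 lean on the critic's estimates.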


# TESTING:

### TEST #1: 
