# Model 1_1
## Dataset: Gym

## Programmer: Giovanni Vecchione
## Date: 4/17/24
## Subject: Machine Learning 2 - Project 6
Use Reinforcement Learning (RL) to build the project. Submit your project as a Jupyter notebook.

In [424]:
import matplotlib as mtp
import torch
import numpy as np

In [425]:
#Checks if GPU is being used
if torch.cuda.is_available():
    device = torch.device("cuda")  # Use the GPU
    print("Using GPU:", torch.cuda.get_device_name(0)) 
else:
    device = torch.device("cpu")  # Fallback to CPU
    print("GPU not available, using CPU.")

#Using GPU: NVIDIA GeForce GTX 1660 SUPER - Successful
#NOTE: This took some time to set up by installing and pathing the cuda toolkit v.12.4 and the right supplemental packages. This drastically improved
#training time

Using GPU: NVIDIA GeForce GTX 1660 SUPER


In [426]:
import gym

env = gym.make("CartPole-v1", render_mode="rgb_array")
#render_mode="rgb_array" doesn't open a display window when render() is called; it returns the frame as an image array instead

In [427]:
print(env.observation_space)    # See what kind of data the environment provides
print(env.action_space)         # See the agent's possible actions 

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Discrete(2)


In [428]:
obs, info = env.reset(seed=42)
obs

array([ 0.0273956 , -0.00611216,  0.03585979,  0.0197368 ], dtype=float32)

In [429]:
info

{}

In [430]:
img = env.render()
img.shape #(height, width, channels); only works with the rgb_array render mode set earlier, and nothing is displayed on its own

(400, 600, 3)

In [431]:
action = 1  # accelerate right
obs, reward, done, truncated, info = env.step(action)

## The step() method executes the desired action and returns five values:

### *1. obs*
    This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

### *2. reward*
    In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep the episode running for as long as possible.

### *3. done*
    This value will be True when the episode is over because the agent failed: the pole tilts too much (more than 12° from vertical) or the cart goes off the screen. After that, the environment must be reset before it can be used again. Reaching the step limit (500 steps for CartPole-v1, in which case you have won) is reported through truncated instead.

### *4. truncated*
    This value will be True when an episode is interrupted early, for example by an environment wrapper that imposes a maximum number of steps per episode (see Gym's documentation for more details on environment wrappers). Some RL algorithms treat truncated episodes differently from episodes finished normally (i.e., when done is True), but in this notebook we will treat them identically.

### *5. info*
    This environment-specific dictionary may provide extra information, just like the one returned by the reset() method.
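
To make the five return values concrete, here is a minimal sketch of an episode loop. It uses a hypothetical stand-in environment (`ToyEnv`, which is not part of Gym) that returns the same five-tuple, so the loop can run without Gym installed. With the real environment, the object returned by `gym.make("CartPole-v1")` would take its place. The names `run_episode`, `fail_at`, and `max_steps` are made up for this illustration.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the reset()/step() signatures used above."""

    def __init__(self, fail_at=None, max_steps=5):
        self.fail_at = fail_at      # step at which the "pole falls" (None = never)
        self.max_steps = max_steps  # step limit, reported through `truncated`
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        self.t += 1
        obs = [0.0, 0.0, 0.01 * self.t, 0.0]
        done = self.fail_at is not None and self.t >= self.fail_at
        truncated = (not done) and self.t >= self.max_steps
        return obs, 1.0, done, truncated, {}


def run_episode(env, policy, seed=None):
    """Play until the episode ends, either by failure (done) or by the step limit (truncated)."""
    obs, info = env.reset(seed=seed)
    total_reward, steps = 0.0, 0
    while True:
        obs, reward, done, truncated, info = env.step(policy(obs))
        total_reward += reward
        steps += 1
        if done or truncated:
            return total_reward, steps, done, truncated


def random_policy(obs):
    return random.randint(0, 1)


failed = run_episode(ToyEnv(fail_at=3), random_policy)
timed_out = run_episode(ToyEnv(), random_policy)
print(failed)     # (3.0, 3, True, False)
print(timed_out)  # (5.0, 5, False, True)
```

Either flag ends the episode, which is why the loops later in this notebook check `done or truncated`.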

In [432]:
n_iterations = 150
n_episodes_per_update = 10
n_max_steps = 50
discount_factor = 0.95

## Neural Network Policies

A neural network replaces the hard-coded policy function: it takes an observation as input and outputs the probability of each action.
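
For comparison, a "basic policy function" could look like the sketch below. The name `basic_policy` is hypothetical and does not appear elsewhere in this notebook; the rule simply pushes the cart toward the side the pole is leaning to, using the CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity).

```python
def basic_policy(obs):
    """Hard-coded rule: push the cart toward the side the pole is leaning to."""
    angle = obs[2]                # obs[2] is the pole angle (negative = leaning left)
    return 0 if angle < 0 else 1  # 0 = push left, 1 = push right


print(basic_policy([0.0, 0.0, -0.05, 0.0]))  # pole leans left  -> 0
print(basic_policy([0.0, 0.0, 0.03, 0.0]))   # pole leans right -> 1
```

A neural network policy has the same input and output roles, but learns the mapping instead of having it written by hand.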

*Typical Workflow:*

Define Neural Network (model1_1) and loss function (loss_fn)

For each iteration:

1. Run multiple episodes using play_multiple_episodes
2. Discount and Normalize the rewards using the helper functions
3. Update the model1_1 parameters based on the collected gradients (this would need an optimizer and a gradient application step)

In [433]:
#Here is the code to build a basic neural network policy using Keras:
import tensorflow as tf
from tensorflow import keras

model1_1 = keras.Sequential([
    keras.layers.Dense(32, activation='relu'), 
    keras.layers.Dense(32, activation='relu'),  # Another hidden layer
    keras.layers.Dense(1, activation='sigmoid')   # Output layer: probability of action 0 (left); right is 1 - p
])



## Policy Gradients:
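The neural network is the policy, but it still needs an algorithm that tells it how to improve. In this case we're using Policy Gradients: actions that led to higher discounted rewards are made more likely, and actions that led to lower ones are made less likely.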
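
For reference, this is the standard REINFORCE-style estimate that the training loop further down approximates. The notation is introduced here and is not part of the original notebook: $\pi_\theta$ is the policy network, $N$ is the number of episodes per update, $T_i$ is the length of episode $i$, and $A_{i,t}$ is the discounted, normalized reward of step $t$ in episode $i$.

$$
\nabla_\theta J(\theta) \approx \frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} A_{i,t} \, \nabla_\theta \log \pi_\theta\left(a_{i,t} \mid s_{i,t}\right)
$$

Gradient ascent on this quantity is the same as gradient descent on the mean of $-A_{i,t} \log \pi_\theta(a_{i,t} \mid s_{i,t})$ over all steps.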

In [434]:
"""
Core Action: This function takes a single step within the environment.

Neural Network Interaction: It receives the current observation (obs), 
passes it through the model (model1_1) to get the probability of moving left (left_proba).

Action Selection: An action is sampled based on that probability.

Environment Update: The action is executed, and the function gets the next observation, reward, and 'done' flag from the environment.

Loss Calculation: It computes the loss between the sampled action (as y_target) and the predicted probability, using loss_fn (binary cross-entropy, defined in a later cell).

Gradient Calculation: Uses a tf.GradientTape to record the operations, enabling the calculation of the policy gradient.

NOTE: Returning actions to track
"""

def play_one_step(env, obs, model1_1, loss_fn):
    with tf.GradientTape() as tape:
        # Probability of going left (action 0), predicted by the network
        left_proba = model1_1(obs[np.newaxis])
        # Sample the action: False (0, left) with probability left_proba, else True (1, right)
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Target is 1. if the sampled action was left, 0. if it was right
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        # Loss whose minimization would make the sampled action more likely
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))

    # Compute the gradients only; they are weighted and applied later in the training loop
    grads = tape.gradient(loss, model1_1.trainable_variables)
    obs, reward, done, truncated, info = env.step(int(action[0, 0]))
    return obs, reward, done, truncated, grads, action
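
The sampling rule used in play_one_step can be checked in isolation. The sketch below reproduces it with NumPy instead of TensorFlow; `sample_action` is a hypothetical helper written for this illustration, and the probability 0.8 is an arbitrary example value.

```python
import numpy as np


def sample_action(left_proba, rng):
    """Return (action, y_target): action 0 (left) with probability left_proba, else 1 (right)."""
    action = int(rng.uniform() > left_proba)  # same comparison as in play_one_step
    y_target = 1.0 - action                   # 1.0 if left was taken, 0.0 if right was taken
    return action, y_target


rng = np.random.default_rng(42)
actions = [sample_action(0.8, rng)[0] for _ in range(10_000)]
left_fraction = 1.0 - np.mean(actions)
print(f"fraction of left moves: {left_fraction:.2f}")  # should be close to 0.80
```

Because the action is sampled rather than picked greedily, the agent keeps exploring the less likely action some of the time.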

In [435]:
""" 
Episode Loop: This function is responsible for running n_episodes.

Data Collection: It collects the rewards (all_rewards) and gradients (all_grads) produced by play_one_step during each episode.
*New* improvements: the observations and the actions of each episode are now collected as well.
"""

def play_multiple_episodes(env, n_episodes, n_max_steps, model1_1, loss_fn):
    all_rewards = []
    all_grads = []
    all_observations = []  # Added for storing observations
    all_actions = [] #Added for storing actions

    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        current_observations = []  # Store observations for each episode
        current_actions = []  # Actions for each episode
        obs, info = env.reset()
        current_observations.append(obs)  # Store the initial observation

        for step in range(n_max_steps):
            obs, reward, done, truncated, grads, action = play_one_step(
                env, obs, model1_1, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            current_actions.append(action)  # Store the action 
            current_observations.append(obs) # Store each subsequent observation 

            if done or truncated:
                break

        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
        all_observations.append(current_observations) 
        all_actions.append(current_actions)

    return all_rewards, all_grads, all_observations, all_actions

In [436]:
""" 
Discounted Returns: This straightforward function takes a list of rewards from a single episode and calculates the 
discounted cumulative rewards, with future rewards being weighted less by the discount_factor.
"""
def discount_rewards(rewards, discount_factor):
    discounted = np.array(rewards)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

""" 
Normalization: This function applies the discount_rewards to each episode's rewards and then normalizes them 
(subtracting the mean and dividing by the standard deviation). Normalization can often improve stability during learning.

Update:

Normalization Scope: The mean and standard deviation are now calculated for each individual episode's discounted rewards, 
and each episode is normalized separately.

Preserving Episode Structure: The output will maintain the structure of nested lists, where each inner list represents a 
single episode's normalized rewards.
"""
def discount_and_normalize_rewards(all_rewards, discount_factor):
    all_discounted_rewards = []

    #Normalize
    for rewards in all_rewards:
        discounted_rewards = discount_rewards(rewards, discount_factor)
        reward_mean = discounted_rewards.mean()
        reward_std = discounted_rewards.std()
        normalized_rewards = (discounted_rewards - reward_mean) / reward_std  # Normalize here
        all_discounted_rewards.append(normalized_rewards) 
    return all_discounted_rewards
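
A small worked example helps to check the discounting by hand. With rewards [10, 0, -50] and a discount factor of 0.8, the last step keeps -50, the middle step becomes 0 + 0.8 × (-50) = -40, and the first step becomes 10 + 0.8 × (-40) = -22. The sketch below repeats discount_rewards so that it runs on its own. The `normalize` helper and its `eps` guard against a zero standard deviation are assumptions added for this illustration; the notebook's own function divides by the standard deviation directly.

```python
import numpy as np


def discount_rewards(rewards, discount_factor):
    discounted = np.array(rewards, dtype=float)  # float dtype so the in-place sums are not truncated
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted


def normalize(discounted, eps=1e-8):
    """Per-episode normalization; eps (an assumption) avoids dividing by zero."""
    return (discounted - discounted.mean()) / (discounted.std() + eps)


returns = discount_rewards([10, 0, -50], discount_factor=0.8)
print(returns)  # approximately [-22, -40, -50]

normalized = normalize(returns)
print(normalized.mean(), normalized.std())  # close to 0 and 1

single_step = normalize(discount_rewards([1.0], discount_factor=0.95))
print(single_step)  # a one-step episode gives 0 instead of NaN
```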

In [437]:
"""
Tensor Conversion: Convert all_final_rewards to a TensorFlow tensor for efficient calculation.

GradientTape: Essential for tracking the operations needed to calculate gradients later.

Replaying Episodes: Since the action probabilities from the model are needed, the stored observations are fed 
through the model again, this time keeping the log probabilities of the actions that were taken.
(NOTE: the initial observation and the actions come from the all_observations and all_actions lists collected above.)

Log Probabilities: We calculate the log probabilities of the chosen actions.
"""
def calculate_policy_gradient_loss(model, all_final_rewards, all_grads, all_observations, all_actions):
    print(all_final_rewards)  # Inspect the structure thoroughly
    print([len(episode_rewards) for episode_rewards in all_final_rewards])  # Check lengths
    
    # NOTE: this conversion raises the ValueError shown below when the episodes have different lengths
    all_final_rewards = tf.convert_to_tensor(all_final_rewards, dtype=tf.float32)

    with tf.GradientTape() as tape:  
        # Simulate the episodes again to get action probabilities 
        all_log_probs = [] 
        for episode_index, final_rewards in enumerate(all_final_rewards):
            obs = all_observations[episode_index][0]  # Access initial observation
            episode_log_probs = []
            
            for step, reward in enumerate(final_rewards):
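                left_proba = model(obs[tf.newaxis])  # shape [1, 1]: probability of action 0 (left)
                selected_action = all_actions[episode_index][step]  # Access stored action (bool tensor, True = right)
                went_right = tf.cast(selected_action, tf.float32)
                # Probability the model assigns to the action that was actually taken
                taken_proba = went_right * (1. - left_proba) + (1. - went_right) * left_proba
                log_prob = tf.math.log(taken_proba)
                episode_log_probs.append(log_prob)
                obs = all_observations[episode_index][step + 1]  # Update obs for the next step
                # No `done` check here: the stored rewards already stop where the episode ended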

            all_log_probs.append(episode_log_probs) 
            
        all_log_probs = tf.stack(all_log_probs, axis=0) 
        
        # Calculate the losses, weigh them, and take the mean across episodes  
        losses = -all_log_probs * all_final_rewards 
        loss = tf.reduce_mean(losses)

    return loss

*NOTE:* Since we are sampling a single action based on the probability, the y_target in the play_one_step function represents which action was actually taken.
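
This choice of y_target is what ties the cross-entropy loss to the policy gradient: with y_target set to the action taken, binary cross-entropy equals the negative log probability of that action. The sketch below checks this numerically with plain NumPy. `binary_crossentropy` here is the textbook formula written out by hand (the Keras version also clips probabilities by a small epsilon), and `log_prob_of_action` is a hypothetical helper.

```python
import numpy as np


def binary_crossentropy(y_target, left_proba):
    """Textbook formula, without the epsilon clipping that Keras applies."""
    return -(y_target * np.log(left_proba) + (1.0 - y_target) * np.log(1.0 - left_proba))


def log_prob_of_action(action, left_proba):
    """Log probability the policy assigns to action 0 (left) or 1 (right)."""
    return np.log(left_proba if action == 0 else 1.0 - left_proba)


left_proba = 0.7
for action in (0, 1):
    y_target = 1.0 - action  # same convention as play_one_step
    bce = binary_crossentropy(y_target, left_proba)
    neg_log_prob = -log_prob_of_action(action, left_proba)
    print(action, bce, neg_log_prob)  # the last two numbers match
```

This is why the training loop can multiply the stored cross-entropy gradients by the normalized rewards: each one is the gradient of the negative log probability of the action taken.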

In [438]:
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)  #Nadam is Adam with Nesterov momentum
loss_fn = tf.keras.losses.binary_crossentropy

In [439]:
for iteration in range(n_iterations):
    all_rewards, all_grads, all_observations, all_actions = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model1_1, loss_fn)
    all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                       discount_factor)
    all_mean_grads = []

    loss = calculate_policy_gradient_loss(model1_1, all_final_rewards, all_grads, all_observations, all_actions) 

    for var_index in range(len(model1_1.trainable_variables)):
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)

    optimizer.apply_gradients(zip(all_mean_grads, model1_1.trainable_variables))

[array([ 1.28468712,  1.23873344,  1.19036115,  1.13944295,  1.08584484,
        1.02942578,  0.9700373 ,  0.90752311,  0.84171869,  0.77245089,
        0.69953741,  0.62278638,  0.54199582,  0.45695313,  0.36743451,
        0.27320438,  0.17401476,  0.06960465, -0.04030074, -0.15599062,
       -0.27776944, -0.40595767, -0.54089265, -0.68292947, -0.83244192,
       -0.98982344, -1.1554882 , -1.32987216, -1.51343422, -1.70665744,
       -1.9100503 , -2.12414805]), array([ 1.43765107,  1.28945721,  1.13346366,  0.96925994,  0.79641391,
        0.61447072,  0.42295158,  0.22135248,  0.0091429 , -0.21423561,
       -0.44937087, -0.69688168, -0.95741937, -1.23166958, -1.520354  ,
       -1.82423234]), array([ 1.45052841,  1.27120908,  1.0824519 ,  0.88376012,  0.67461089,
        0.45445379,  0.22270949, -0.02123189, -0.27801228, -0.54830743,
       -0.83282864, -1.13232466, -1.44758362, -1.77943516]), array([ 1.33595584,  1.27245261,  1.20560709,  1.1352434 ,  1.06117635,
        0.9832110

ValueError: Can't convert non-rectangular Python sequence to Tensor.

# Debugging

### Version 1: 
model1_1 = keras.Sequential([
    keras.layers.Dense(32, activation='relu'), 
    keras.layers.Dense(32, activation='relu'),  # Another hidden layer
    keras.layers.Dense(1, activation='sigmoid')   # Output layer: probability of action 0 (left); right is 1 - p
])

    So the issue here isn't the model itself; a design choice has to be made. TensorFlow tensors must be rectangular, so the data fed into the loss calculation has to be consistent in shape. In my case I also wanted to add a Policy Gradient Loss Calculation, which means the per-episode lists in "all_final_rewards" need to be made uniform in length: either by enforcing a maximum size through early termination, by padding shorter sequences to a fixed length, OR by switching to an RNN, since it can handle sequences of varying length (though that adds more complexity).

Tried to use early termination but that didn't seem to work.

Okay, so the issue is that we are trying to convert lists of inconsistent length (which I assume is because episodes take a different number of steps before they succeed or fail) into a TensorFlow tensor. This comes up when implementing the calculate_policy_gradient_loss function. Although that function is not explicitly required in this section, it appears to be good practice to integrate it. The options are:
1. Early Termination
2. Padding
3. RNN

Even after adding a padding function to help with the shape, the conversion still does not accept it. Probably due to the 0s being interpreted as T/F or something.
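
Whatever the cause was in this notebook, padding generally has to go together with a mask, so that the padded positions do not count toward the loss. The sketch below is one way this could look in NumPy, under the assumption that padded steps should simply be ignored. `pad_episodes` and `masked_mean` are hypothetical helpers, and the reward and log-probability values are made up for illustration.

```python
import numpy as np


def pad_episodes(episodes, pad_value=0.0):
    """Pad variable-length 1-D sequences into a rectangular array plus a 0/1 mask."""
    max_len = max(len(ep) for ep in episodes)
    padded = np.full((len(episodes), max_len), pad_value, dtype=np.float32)
    mask = np.zeros((len(episodes), max_len), dtype=np.float32)
    for i, ep in enumerate(episodes):
        padded[i, :len(ep)] = ep
        mask[i, :len(ep)] = 1.0  # 1 marks a real step, 0 marks padding
    return padded, mask


def masked_mean(values, mask):
    """Mean over the real (unpadded) entries only."""
    return (values * mask).sum() / mask.sum()


# Two episodes of different lengths (made-up numbers)
all_final_rewards = [np.array([1.0, 0.0, -1.0]), np.array([0.5, -0.5])]
all_log_probs = [np.log([0.6, 0.5, 0.4]), np.log([0.7, 0.3])]

rewards, mask = pad_episodes(all_final_rewards)
log_probs, _ = pad_episodes(all_log_probs)

loss = masked_mean(-log_probs * rewards, mask)
print(rewards.shape, mask.sum(), loss)
```

The rewards and the log probabilities must be padded the same way so that their shapes match. Since the loss is a mean over steps, concatenating all episodes into one flat array gives the same value and avoids padding altogether.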
