# Introduction
In this notebook, we will learn how to apply reinforcement learning to train an agent to play Super Mario Bros. We will build up the entire project step by step, and you will get a chance to explore the packages and functions behind it.

## Imports
First, let's import the packages we need. We will be using the `gym_super_mario_bros` package, which is an OpenAI Gym environment for Super Mario Bros.

In [None]:
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY,SIMPLE_MOVEMENT,COMPLEX_MOVEMENT
from nes_py.wrappers import JoypadSpace
import numpy as np
from gym import Wrapper
from gym.wrappers import GrayScaleObservation, ResizeObservation, FrameStack
import os
from PIL import Image
from wrappers import *
from agent import *


## Basic Functions
The Super Mario Gym API gives us a lot of functions that we can use right off the bat. These are the same functions that any OpenAI Gym environment exposes.

Let's check what actions we can take with Mario. According to the documentation, "gym_super_mario_bros.actions provides three actions lists (RIGHT_ONLY, SIMPLE_MOVEMENT, and COMPLEX_MOVEMENT)".

In [None]:
print("The actions in RIGHT_ONLY are: ", RIGHT_ONLY)
print("The actions in SIMPLE_MOVEMENT are: ", SIMPLE_MOVEMENT)
print("The actions in COMPLEX_MOVEMENT are: ", COMPLEX_MOVEMENT)

If you're familiar with the NES controller, you might recognize some of these actions. For example, pressing `right` and `A` together performs a jump while moving right. One more note: `"NOOP"` means no operation, so Mario does nothing.

![alt text](NES.jpg "NES Controller")
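
To make the link between button combinations and the numbers the agent actually sends more concrete, here is a small sketch. It uses a local copy of the `RIGHT_ONLY` list (named `RIGHT_ONLY_COPY`, my own name) so it runs without the package installed; `describe_actions` is a hypothetical helper, not part of `gym_super_mario_bros`.

```python
# Local copy of the RIGHT_ONLY action list, reproduced here as an assumption
# so the snippet runs on its own.
RIGHT_ONLY_COPY = [
    ["NOOP"],
    ["right"],
    ["right", "A"],
    ["right", "B"],
    ["right", "A", "B"],
]


def describe_actions(action_list):
    """Map each discrete action index to the buttons it presses (hypothetical helper)."""
    return {index: " + ".join(buttons) for index, buttons in enumerate(action_list)}


action_table = describe_actions(RIGHT_ONLY_COPY)
for index, buttons in action_table.items():
    print(index, buttons)

# The index the agent would send to press "right" and "A" together.
jump_right = RIGHT_ONLY_COPY.index(["right", "A"])
```

The agent never sees button names. It only picks an index, and the position in the list decides which buttons get pressed.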

Now, let's define our environment. We will use the variable `ENV_NAME` to store the string identifying the level of the game. According to the documentation, we specify the environment in the form
```python
SuperMarioBros-<world>-<stage>-v<version>
```
where:

- `<world>` is a number in {1, 2, 3, 4, 5, 6, 7, 8} indicating the world
- `<stage>` is a number in {1, 2, 3, 4} indicating the stage within a world
- `<version>` is a number in {0, 1, 2, 3} specifying the ROM mode to use
    - 0: standard ROM
    - 1: downsampled ROM
    - 2: pixel ROM
    - 3: rectangle ROM

Thus, if you want to play 3-1 on the standard ROM, you would use the environment id `SuperMarioBros-3-1-v0`. We will keep it basic and play 1-1 on the standard ROM.
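
If you plan to switch levels often, a tiny helper that builds the id from the rules above can save you from typos. This is only a sketch: `make_env_id` is a hypothetical function of my own, not something the package provides.

```python
def make_env_id(world, stage, version=0):
    """Build an id such as 'SuperMarioBros-3-1-v0' (hypothetical helper)."""
    if world not in range(1, 9):
        raise ValueError(f"world must be in 1..8, got {world}")
    if stage not in range(1, 5):
        raise ValueError(f"stage must be in 1..4, got {stage}")
    if version not in range(0, 4):
        raise ValueError(f"version must be in 0..3, got {version}")
    return f"SuperMarioBros-{world}-{stage}-v{version}"


# World 3, stage 1, standard ROM.
example_id = make_env_id(3, 1)
print(example_id)
```

The resulting string can be passed wherever `ENV_NAME` is used.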

In [None]:
ENV_NAME = "SuperMarioBros-1-1-v0"
env = gym_super_mario_bros.make(ENV_NAME, render_mode = 'human', apply_api_compatibility=True)
env = JoypadSpace(env, RIGHT_ONLY)

As with any Gym environment, we call the `.make()` function and pass in our environment name, along with some parameters:
- `render_mode`: tells the environment how it should be rendered. `'human'` opens a window showing the game; check the documentation for the other modes.
- `apply_api_compatibility`: makes the environment follow the newer Gym API (for example, `step()` returning five values), which allows us to use recent versions of OpenAI Gym.

Then, we wrap it with `JoypadSpace`, which lets the code control and play as Mario using the action list we pass in (`RIGHT_ONLY`). The cell below simply holds `right` until the episode ends.

In [None]:
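env.reset()
action = RIGHT_ONLY.index(['right'])  # always walk right

done = False
while not done:
    # step() returns (observation, reward, terminated, truncated, info)
    _, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    env.render()

env.close()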

We will then write a function that applies the wrappers to the environment. We will apply:
- `SkipFrame`: repeat the same action for 4 frames and sum the rewards
- `ResizeObservation`: resize each frame from 240x256 to 84x84
- `GrayScaleObservation`: convert the frame to grayscale
- `FrameStack`: stack consecutive frames into a single observation so the agent can see motion
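
To get a feel for what frame stacking does, here is a minimal pure-Python sketch. It is not Gym's `FrameStack`; `FrameStackSketch` is a hypothetical class, the stack size of 4 is an assumption, and strings stand in for image frames.

```python
from collections import deque


class FrameStackSketch:
    """Keep the most recent `num_stack` frames (illustration only, not gym's FrameStack)."""

    def __init__(self, num_stack=4):
        self.num_stack = num_stack
        self.frames = deque(maxlen=num_stack)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame so the shape is always the same.
        self.frames.clear()
        for _ in range(self.num_stack):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStackSketch(num_stack=4)
stacked = stack.reset("f0")
for name in ["f1", "f2"]:
    stacked = stack.push(name)
print(stacked)
```

After two steps the observation holds `f0, f0, f1, f2`: the agent always sees a short history rather than a single still image, which is what lets it infer direction and speed.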

TODO: In `wrappers.py`, complete the `step` function in the `SkipFrame` class, which inherits from the `Wrapper` class.

## Solution

```python
def step(self, action):
    total_reward = 0.0
    done = False
    for _ in range(self.skip):
        next_state, reward, done, trunc, info = self.env.step(action)
        total_reward += reward
        if done:
            break
    return next_state, total_reward, done, trunc, info
```

Then run this cell below to check if your implementation is correct. 

In [None]:
import numpy as np

class MockEnv:
    def __init__(self):
        self.current_state = 0
        self.step_count = 0

    def reset(self):
        self.current_state = 0
        self.step_count = 0
        return self.current_state

    def step(self, action):
        # Simulate the environment's response to an action
        self.current_state += 1
        reward = 1  # Fixed reward for simplicity
        self.step_count += 1
        done = self.step_count >= 10  # Terminal state after 10 steps
        trunc = False  # For simplicity, we won't use truncation in this mock
        info = {}
        return self.current_state, reward, done, trunc, info


def test_skip_frame():
    env = MockEnv()
    skip_env = SkipFrame(env, skip=4)
    total_rewards = []
    done = False

    while not done:
        next_state, total_reward, done, trunc, info = skip_env.step(action=0)
        total_rewards.append(total_reward)

    # The first two calls each cover 4 frames; the third stops after 2 frames
    # because the mock episode ends at step 10
    expected_rewards = [4, 4, 2]
    assert total_rewards == expected_rewards, "Test failed :("

    print("Test passed! :D.")

test_skip_frame()


Now complete the `apply_wrappers()` function in `wrappers.py`, which applies all of the wrappers above. You just need to add some parameters to make sure it fits the description!

## Agent NN Class
Now, we will implement the `AgentNN` class, which outlines the architecture of the CNN. We will implement this with PyTorch. This PyTorch tutorial is a great resource: [Link](https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html)

TODO: In `agent_nn.py`, within the `AgentNN` class, implement the CNN architecture. Be sure to include
- Convolution Layers
- Linear Layers

**Make sure to check the input and output sizes for each layer!**
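
One way to check the sizes by hand is the standard convolution output formula, `floor((size + 2 * padding - kernel) / stride) + 1`. The sketch below applies it to an 84x84 input, assuming the three-convolution layout from the linked PyTorch tutorial (kernel sizes 8, 4, 3 with strides 4, 2, 1, and 64 output channels on the last layer). Your own `AgentNN` may use different layers, so treat the numbers as an example only.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride) pairs; assumed to match the linked tutorial, not necessarily agent_nn.py.
assumed_layers = [(8, 4), (4, 2), (3, 1)]
assumed_last_channels = 64

size = 84  # frames are resized to 84x84 by the wrappers
sizes = []
for kernel, stride in assumed_layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

# Number of inputs the first linear layer needs after flattening.
flattened = assumed_last_channels * size * size
print(sizes, flattened)
```

With these assumptions the spatial size shrinks 84 → 20 → 9 → 7, so the first linear layer would take 64 * 7 * 7 = 3136 inputs.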

## Agent Class

Now that we've implemented our network architecture, let's implement the agent class, which does the learning (it estimates Q-values from the rewards it collects). Look through the file and make sure you understand how the class is structured. You will see two networks: an online network and a target network. Let's see what the difference is.

### Online vs. Target Network

The target and online networks serve different roles in the learning process, helping to stabilize the training of the agent.

### Online Network

- **Role**: The online network, also known as the policy network, is directly involved in the decision-making process. It evaluates the current state and predicts the Q-values for each action. The action with the highest Q-value (or a random action, depending on the epsilon-greedy strategy) is selected by the agent to perform in the environment.
- **Training**: The weights of the online network are updated continuously at every learning step based on the loss calculated between its predicted Q-values and the target Q-values derived from the target network.
- **Purpose**: The main purpose of the online network is to learn and improve the policy the agent follows by minimizing the difference between its predicted Q-values and the target Q-values.
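
The epsilon-greedy strategy mentioned above can be sketched in a few lines of plain Python. This is an illustration with a list of numbers standing in for the network output; `epsilon_greedy` is a hypothetical function and not the `choose_action()` from `agent.py`.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-looking one."""
    if rng.random() < epsilon:
        # Explore: any action index, uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda i: q_values[i])


example_q_values = [0.1, 0.9, 0.3]  # made-up Q-values for three actions
greedy_choice = epsilon_greedy(example_q_values, epsilon=0.0)
random_choice = epsilon_greedy(example_q_values, epsilon=1.0)
print(greedy_choice, random_choice)
```

With `epsilon=0.0` the agent always exploits, and with `epsilon=1.0` it always explores. Training usually starts near the exploring end and decays epsilon over time.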

### Target Network

- **Role**: The target network has a similar architecture to the online network but serves a different purpose. It is used to generate the target Q-values for the next state when calculating the loss during training. These target Q-values are used to provide a stable goal for the online network to achieve.
- **Training**: The weights of the target network are updated less frequently, often by copying the weights from the online network at regular intervals. This infrequent update schedule helps to stabilize the training process.
- **Purpose**: Its primary purpose is to stabilize learning by providing consistent targets for the online network's updates. Without it, the training process can become unstable due to the constantly shifting targets, as the same network would be generating the predictions and also providing the targets for those predictions.

### Why the Distinction Matters

The use of separate target and online networks addresses the problem of moving targets in Q-learning. Because the policy and the value estimates are updated at the same time, targets computed by the online network itself would shift with every update. The target network avoids this: its parameters are updated less frequently, so the targets stay fixed between syncs and training becomes more stable and reliable.

This separation essentially helps mitigate the risk of positive feedback loops where the network's predictions can become overly optimistic, leading to poor policy decisions. By decoupling the generation of target Q-values from the online network's predictions, it ensures that the agent learns from a slightly out-of-date, but more stable, version of its own value estimates, leading to more robust learning outcomes.
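
The "slightly out-of-date copy" idea can be seen in a toy example. Here a dictionary with one number stands in for the network weights, the "gradient step" is just adding 0.1, and the sync interval of 3 is an arbitrary assumption. `TinyAgent` is a hypothetical class and not the agent from `agent.py`.

```python
import copy


class TinyAgent:
    """Toy stand-in: dicts play the role of network weights (illustration only)."""

    def __init__(self, sync_every=3):
        self.online = {"w": 0.0}
        self.target = copy.deepcopy(self.online)
        self.sync_every = sync_every
        self.learn_steps = 0

    def learn_step(self):
        # Pretend gradient update: only the online weights change.
        self.online["w"] += 0.1
        self.learn_steps += 1
        # Every `sync_every` steps, copy the online weights into the target.
        if self.learn_steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)


toy_agent = TinyAgent(sync_every=3)
history = []
for _ in range(7):
    toy_agent.learn_step()
    history.append((round(toy_agent.online["w"], 1), round(toy_agent.target["w"], 1)))
print(history)
```

The online value moves on every step, while the target value only jumps at steps 3 and 6 and stays frozen in between.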

Now that we've seen what both networks are used for, implement the `choose_action()` and `learn()` functions in `agent.py`, ensuring that you pass in the correct inputs! If you need a peek at the solution, feel free to take one!

## Solution

```python
predicted_q_values = self.online_network(states)
predicted_q_values = predicted_q_values[np.arange(self.batch_size), actions.squeeze()]
target_q_values = self.target_network(next_states).max(dim=1)[0]
target_q_values = rewards + self.gamma * target_q_values * (1 - dones.float())
```
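
To see what the last line of the solution computes, here is the same target calculation written out in plain Python for a batch of two made-up transitions. `td_targets` is a hypothetical helper for illustration only; the real code works on PyTorch tensors.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """Compute reward + gamma * max(next Q) for each transition, with no bootstrap when done."""
    targets = []
    for reward, next_qs, done in zip(rewards, next_q_values, dones):
        # A finished episode has no future, so its target is just the reward.
        bootstrap = 0.0 if done else gamma * max(next_qs)
        targets.append(reward + bootstrap)
    return targets


# Made-up numbers: the second transition ends the episode.
targets = td_targets(
    rewards=[1.0, -15.0],
    next_q_values=[[2.0, 5.0], [3.0, 4.0]],
    dones=[False, True],
    gamma=0.9,
)
print(targets)
```

The first target is 1.0 + 0.9 * 5.0 = 5.5. The second is just -15.0, because the `(1 - dones.float())` factor zeroes out the future term once the episode has ended.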

# Putting it all together

At this point, we have gone through all of the parts we need to put everything together. Run `main.py` to train the model, and look over the code to make sure you understand what it is doing. Since we are running for 50,000 episodes, this will take a long while, depending on your machine. Try running it for 5,000 episodes; the model will save itself in the same folder. Don't worry if you're not able to train it for the whole 50,000 episodes. It is a lot. Play around with the code. Some things you can play around with are:

### 1. **Learning Rate (`lr`)**:
- The rate at which the agent learns from each batch of experiences. Adjusting the learning rate can have a significant impact on the convergence and stability of training.

### 2. **Discount Factor (`gamma`)**:
- Determines the importance of future rewards. A higher value places more emphasis on future rewards, which can affect the agent's strategy significantly.
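
A quick way to build intuition for `gamma` is to compute how much a reward far in the future still counts. The sketch below is plain Python; both function names are my own.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps ahead is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def weight_of_future_reward(gamma, steps_ahead):
    """How much a reward `steps_ahead` steps away contributes today."""
    return gamma ** steps_ahead


small_example = discounted_return([1, 1, 1], gamma=0.5)  # 1 + 0.5 + 0.25
far_sighted = weight_of_future_reward(0.99, 100)
short_sighted = weight_of_future_reward(0.90, 100)
print(small_example, far_sighted, short_sighted)
```

With `gamma = 0.99`, a reward 100 steps away still counts for roughly 0.37 of its value, while with `gamma = 0.90` it counts for about 0.00003, which is practically nothing. That is why a lower discount factor makes the agent short-sighted.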

### 3. **Batch Size**:
- The number of experiences sampled from the replay buffer for each learning step. Changing the batch size can affect learning dynamics and computational efficiency.
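
For reference, a replay buffer with random batch sampling can be sketched with the standard library alone. `ReplayBufferSketch` is a hypothetical class; the agent in `agent.py` may store and sample its experiences differently.

```python
import random
from collections import deque


class ReplayBufferSketch:
    """Fixed-capacity memory of transitions with uniform random sampling (illustration only)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement; the batch cannot exceed what is stored.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBufferSketch(capacity=5)
for step in range(8):
    buffer.store(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=3)
print(len(buffer), batch)
```

A larger batch averages over more transitions per update, which tends to smooth the gradient at the cost of more computation per step.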

### 4. **Action Space**:
- Choosing different sets of actions (e.g., `RIGHT_ONLY` vs. `SIMPLE_MOVEMENT` vs. `COMPLEX_MOVEMENT`) changes the complexity of the decision-making problem and can lead to different behaviors.

## Further exploration
Once you are able to train the model to some extent, try exploring different parameters. Below, I vary the learning rate and the discount factor and record the total reward of each episode, so the settings can be compared or plotted against each other.

In [None]:
# Example parameters to explore
learning_rates = [0.001, 0.0005, 0.0001]
discount_factors = [0.99, 0.95, 0.90]
NUM_OF_EPISODES = 500
ENV_NAME = 'SuperMarioBros-1-1-v0'

# Store results, keyed by (lr, gamma)
experiment_results = {}

for lr in learning_rates:
    for gamma in discount_factors:
        # Declare your environment
        env = gym_super_mario_bros.make(ENV_NAME, render_mode='human', apply_api_compatibility=True)
        env = JoypadSpace(env, RIGHT_ONLY)

        # Apply the wrappers
        env = apply_wrappers(env)

        # Instantiate an agent of the Agent class
        agent = Agent(input_dims=env.observation_space.shape, num_actions=env.action_space.n, lr=lr, gamma=gamma)

        total_rewards = []

        for episode in range(NUM_OF_EPISODES):
            done = False
            total_reward = 0
            # With apply_api_compatibility=True, reset() returns (observation, info)
            state, _ = env.reset()

            while not done:
                action = agent.choose_action(state)
                # step() returns five values under the newer Gym API
                new_state, reward, done, truncated, info = env.step(action)
                agent.store_in_memory(state, action, reward, new_state, done)
                agent.learn()
                state = new_state
                total_reward += reward

            total_rewards.append(total_reward)

        # Store the results and close the emulator before the next setting
        experiment_results[(lr, gamma)] = total_rewards
        env.close()
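
Once `experiment_results` is filled, one simple way to compare settings is to average the last few episodes of each run. The sketch below does that with the standard library. `summarize` and `best_setting` are hypothetical helpers, and the numbers in `toy_results` are made up purely to exercise the functions; they are not training results.

```python
from statistics import mean


def summarize(results, last_n=100):
    """Average the final `last_n` episode rewards for each (lr, gamma) setting."""
    summary = {}
    for params, rewards in results.items():
        tail = rewards[-last_n:]
        if tail:  # skip settings that have no episodes recorded
            summary[params] = mean(tail)
    return summary


def best_setting(results, last_n=100):
    """Return the (lr, gamma) pair with the highest late-training average."""
    summary = summarize(results, last_n)
    return max(summary, key=summary.get)


# Placeholder data in the same shape as experiment_results (NOT real results).
toy_results = {
    (0.001, 0.99): [100, 200, 300, 400],
    (0.0001, 0.90): [100, 150, 150, 250],
}
toy_summary = summarize(toy_results, last_n=2)
toy_best = best_setting(toy_results, last_n=2)
print(toy_summary, toy_best)
```

Looking only at the tail of each run matters because early episodes are dominated by random exploration. The same `summary` dictionary is also a convenient input for whatever plotting library you prefer.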



# Conclusion
Great job! I hope you learned something over these 10 weeks, even if it was not always well structured. I hope this showed you the tip of the iceberg of reinforcement learning and how it can be applied. Training takes a while, but the finished product is pretty cool!