# Week 8, Day 7: Review and Feedback Session

## Session Overview
This session reviews the key concepts from Week 8 and reinforces them with short practice exercises:

1. Basic RL Concepts
2. Value-Based Methods
3. Policy-Based Methods
4. Advanced RL Topics

## Learning Objectives
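- Reinforce core RL concepts (environment, state, action, reward)
- Practice choosing between value-based and policy-based methods
- Implement tabular Q-learning and a basic policy gradient
- Prepare for advanced topics (multi-agent and hierarchical RL)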

In [None]:
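# Import required libraries
# Note: the cells below use the classic Gym API (gym < 0.26), where
# env.reset() returns only the observation and env.step() returns a 4-tuple.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gym
import tensorflow as tf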

## 1. Basic RL Review

In [None]:
def basic_rl_review():
    # Create simple environment
    env = gym.make('CartPole-v1')
    
    # Show environment info
    print("Action Space:", env.action_space)
    print("State Space:", env.observation_space)
    
    # Run one episode with random actions
    state = env.reset()
    done = False
    total_reward = 0
    
    while not done:
        action = env.action_space.sample()
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        state = next_state
    
    env.close()  # release environment resources
    print("\nEpisode finished with reward:", total_reward)

basic_rl_review()
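
The interaction loop above does not depend on CartPole itself. As a minimal sketch, the same loop can be wrapped in a reusable function and averaged over several episodes to get a random-policy baseline. `CountdownEnv`, `run_episode`, and `average_return` are hypothetical names introduced here for illustration; the toy environment only mimics the classic `reset()`/`step()` interface so the example runs without Gym.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: lasts `horizon` steps, reward 1.0 per step."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is just the step counter

    def step(self, action):
        # The action is ignored; only the interface matters for this sketch
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


def run_episode(env, policy):
    """Run one episode and return the undiscounted sum of rewards."""
    state = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        state, reward, done, _ = env.step(policy(state))
        total_reward += reward
    return total_reward


def average_return(env, policy, episodes=10):
    """Average the episode return over several episodes."""
    return sum(run_episode(env, policy) for _ in range(episodes)) / episodes


# Random policy over two actions, analogous to env.action_space.sample()
baseline = average_return(CountdownEnv(horizon=5), lambda s: random.choice([0, 1]))
print("Random-policy baseline:", baseline)
```

Any learned policy should be compared against a baseline like this one before concluding that training helped.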

## 2. Value-Based Methods Review

In [None]:
def value_based_review():
    # Q-Learning example
    env = gym.make('FrozenLake-v1')
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    
    # Training parameters
    alpha = 0.1
    gamma = 0.95
    epsilon = 0.1
    episodes = 100
    
    # Training loop
    rewards = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
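            # Epsilon-greedy action selection: explore with probability epsilon
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                # Break ties at random; np.argmax alone always returns the first
                # maximum, so an all-zero Q-table would always pick action 0
                best_actions = np.flatnonzero(Q[state] == np.max(Q[state]))
                action = np.random.choice(best_actions)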
            
            next_state, reward, done, _ = env.step(action)
            
            # Q-Learning update
            Q[state, action] = Q[state, action] + alpha * (
                reward + gamma * np.max(Q[next_state]) - Q[state, action]
            )
            
            state = next_state
            total_reward += reward
        
        rewards.append(total_reward)
    
    # Plot results
    plt.figure(figsize=(10, 5))
    # 10-episode moving average (the first 9 values are NaN and are not drawn)
    plt.plot(pd.Series(rewards).rolling(10).mean())
    plt.title('Q-Learning: 10-Episode Moving Average Reward')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.show()

value_based_review()
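
Q-learning and SARSA differ only in the value they bootstrap from. The sketch below puts the two targets side by side, assuming a tabular `Q` array indexed as `Q[state, action]`. The function names `q_learning_target` and `sarsa_target` and the `done` flag are illustrative and not part of the cell above.

```python
import numpy as np


def q_learning_target(Q, reward, next_state, gamma, done):
    """Off-policy target: bootstrap from the greedy action in next_state."""
    bootstrap = 0.0 if done else np.max(Q[next_state])
    return reward + gamma * bootstrap


def sarsa_target(Q, reward, next_state, next_action, gamma, done):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    bootstrap = 0.0 if done else Q[next_state, next_action]
    return reward + gamma * bootstrap


# Two states, two actions; in state 1 the greedy action is action 1
Q = np.array([[0.0, 0.0],
              [0.2, 0.8]])

q_target = q_learning_target(Q, reward=1.0, next_state=1, gamma=0.5, done=False)
s_target = sarsa_target(Q, reward=1.0, next_state=1, next_action=0, gamma=0.5, done=False)
print(q_target, s_target)  # approximately 1.4 and 1.1
```

When the next action is exploratory (action 0 here), SARSA's target is lower than Q-learning's because it accounts for the behavior policy's actual choice. When the next action is greedy, the two targets coincide.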

## 3. Policy-Based Methods Review

In [None]:
def policy_based_review():
    # Simple policy gradient example
    env = gym.make('CartPole-v1')
    
    # Build policy network: observation -> action probabilities
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(env.observation_space.shape[0],)),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dense(env.action_space.n, activation='softmax')
    ])
    
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    
    # Training loop
    episodes = 100
    rewards_history = []
    
    for episode in range(episodes):
        state = env.reset()
        episode_states = []
        episode_actions = []
        episode_rewards = []
        done = False
        
        while not done:
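            # Get action probabilities
            state_input = tf.convert_to_tensor(state[None, :], dtype=tf.float32)
            action_probs = model(state_input)
            # Cast to float64 and renormalize: float32 softmax outputs can miss
            # np.random.choice's sum-to-one tolerance
            probs = action_probs[0].numpy().astype(np.float64)
            probs /= probs.sum()
            action = np.random.choice(env.action_space.n, p=probs)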
            
            # Take action
            next_state, reward, done, _ = env.step(action)
            
            # Store experience
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)
            
            state = next_state
        
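        # Calculate discounted returns (discount factor 0.99), working backwards
        returns = []
        G = 0
        for r in reversed(episode_rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        # float32 matches the dtype of the network outputs used in the loss
        returns = np.array(returns, dtype=np.float32)
        
        # Normalize returns
        returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-8)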
        
        # Update policy
        with tf.GradientTape() as tape:
            # Stack observations into one array first; converting a list of arrays is slow
            states = tf.convert_to_tensor(np.array(episode_states), dtype=tf.float32)
            actions = tf.convert_to_tensor(episode_actions, dtype=tf.int32)
            action_probs = model(states)
            
            # Calculate loss
            action_masks = tf.one_hot(actions, env.action_space.n)
            selected_action_probs = tf.reduce_sum(action_probs * action_masks, axis=1)
            loss = -tf.reduce_mean(tf.math.log(selected_action_probs) * returns)
        
        # Apply gradients
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        
        rewards_history.append(sum(episode_rewards))
    
    # Plot results
    plt.figure(figsize=(10, 5))
    # 10-episode moving average (the first 9 values are NaN and are not drawn)
    plt.plot(pd.Series(rewards_history).rolling(10).mean())
    plt.title('Policy Gradient: 10-Episode Moving Average Reward')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.show()

policy_based_review()
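
The "Calculate returns" step is easier to check in isolation. The following is a small sketch of the same backward recursion, `G_t = r_t + gamma * G_{t+1}`, on a three-step episode. The helper names `discounted_returns` and `normalize` are introduced here for illustration, and `gamma=0.5` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step t, computed backwards from the episode end."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


def normalize(x, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / (x.std() + eps)


example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)             # values: 1.75, 1.5, 1.0
print(normalize(example))  # same ordering, zero mean
```

Working backwards: the last step has return 1.0, the middle step 1 + 0.5 * 1.0 = 1.5, and the first step 1 + 0.5 * 1.5 = 1.75. Writing into a preallocated array also avoids the repeated `insert(0, ...)` calls, each of which shifts the whole list.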

## Week 8 Review Quiz

### Multiple Choice Questions

1. What is reinforcement learning?
   - a) Supervised learning
   - b) Learning from interaction
   - c) Unsupervised learning
   - d) Transfer learning

2. What is Q-learning?
   - a) Policy gradient method
   - b) Value-based method
   - c) Model-based method
   - d) Supervised learning

3. What is policy gradient?
   - a) Value method
   - b) Direct policy optimization
   - c) Model-based method
   - d) Supervised learning

4. What is DQN?
   - a) Policy method
   - b) Deep Q-Network
   - c) Model-based method
   - d) Supervised learning

5. What is SARSA?
   - a) Off-policy method
   - b) On-policy method
   - c) Model-based method
   - d) Supervised learning

6. What is actor-critic?
   - a) Value method
   - b) Hybrid approach
   - c) Model-based method
   - d) Supervised learning

7. What is exploration vs exploitation?
   - a) Learning rate
   - b) Action selection tradeoff
   - c) Model architecture
   - d) Loss function

8. What is experience replay?
   - a) Policy method
   - b) Memory mechanism
   - c) Model architecture
   - d) Loss function

9. What is multi-agent RL?
   - a) Single agent
   - b) Multiple agents
   - c) Model-based
   - d) Supervised

10. What is hierarchical RL?
    - a) Flat policy
    - b) Nested policies
    - c) Single policy
    - d) Random policy

Answers: 1-b, 2-b, 3-b, 4-b, 5-b, 6-b, 7-b, 8-b, 9-b, 10-b
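
Question 8 describes experience replay as a memory mechanism. As a rough sketch of what that means in practice, a replay buffer stores past transitions in a fixed-size container and hands back random minibatches for training. The class name `ReplayBuffer` and its methods are hypothetical and are not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal correlation
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Dummy transitions: state is the step index, episode ends at the last step
    buffer.push(step, 0, 1.0, step + 1, step == 4)

states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
print(len(buffer), states)
```

After five pushes into a buffer of capacity 3, only the transitions from steps 2, 3, and 4 remain, so every sampled state comes from that set.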

## Week 8 Summary

### Key Concepts Covered:
1. Basic RL concepts and algorithms
2. Value-based methods (Q-Learning, SARSA, DQN)
3. Policy-based methods (Policy Gradients, Actor-Critic)
4. Advanced topics (Multi-Agent, Hierarchical RL)

### Preparation for Advanced Topics:
- Review challenging concepts
- Practice implementation
- Study real-world applications
- Explore latest research

### Additional Resources:
- OpenAI Spinning Up: https://spinningup.openai.com/
- RL Course by David Silver: https://www.youtube.com/watch?v=2pWv7GOvuf0
- Stable Baselines3: https://stable-baselines3.readthedocs.io/