# Week 8, Day 1: Introduction to Reinforcement Learning

## Learning Objectives
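- Understand the agent-environment interaction loop (states, actions, rewards, episodes)
- Learn key RL concepts: policies, value functions, the discount factor, and exploration vs. exploitation
- Implement tabular Q-learning with epsilon-greedy action selection
- Practice evaluating a fixed policy on a small grid world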

## Topics Covered
1. RL Basics
2. Markov Decision Processes
3. Value Functions
4. Basic Algorithms

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gym
import random
from collections import defaultdict

## 1. Basic Concepts

In [None]:
def rl_concepts_example():
    # Create simple environment
    env = gym.make('FrozenLake-v1')
    
    # Show environment info
    print("Action Space:", env.action_space)
    print("State Space:", env.observation_space)
    
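    # Run one episode
    # reset() returns obs in gym < 0.26 and (obs, info) from 0.26 onward
    reset_out = env.reset()
    state = reset_out[0] if isinstance(reset_out, tuple) else reset_out
    done = False
    total_reward = 0
    
    while not done:
        # Random action
        action = env.action_space.sample()
        
        # Take action; step() returns (obs, reward, done, info) in gym < 0.26
        # and (obs, reward, terminated, truncated, info) from 0.26 onward
        next_state, reward, *end_flags, info = env.step(action)
        done = any(end_flags)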
        total_reward += reward
        
        # Update state
        state = next_state
    
    print("\nEpisode finished with reward:", total_reward)

rl_concepts_example()
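
To make the roles of `reset` and `step` concrete without depending on a particular gym version, here is a minimal sketch of a hand-written environment driven by the same kind of loop. `CorridorEnv`, its five-cell layout, its reward of 1 at the right end, and the helper `run_random_episode` are illustrative assumptions, not part of gym.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, goal in the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Every episode starts in the leftmost cell
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the move
        move = 1 if action == 1 else -1
        self.state = min(self.length - 1, max(0, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_random_episode(env, rng, max_steps=1000):
    """Roll out a uniformly random policy and return (total_reward, steps)."""
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = rng.choice([0, 1])
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total, steps = run_random_episode(CorridorEnv(), random.Random(0))
print("reward:", total, "steps:", steps)
```

The agent only ever sees what `reset` and `step` return; everything else about the environment stays hidden behind those two calls.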

## 2. Q-Learning Implementation

In [None]:
def q_learning_example():
    # Initialize environment
    env = gym.make('FrozenLake-v1')
    
    # Q-learning parameters
    learning_rate = 0.1
    discount_factor = 0.99
    epsilon = 0.1
    episodes = 1000
    
    # Initialize Q-table
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    # Training loop
    rewards = []
    for episode in range(episodes):
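        # reset() returns obs in gym < 0.26 and (obs, info) from 0.26 onward
        reset_out = env.reset()
        state = reset_out[0] if isinstance(reset_out, tuple) else reset_out
        total_reward = 0
        done = False
        
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            
            # Take action; step() returns 4 values in gym < 0.26 and
            # (obs, reward, terminated, truncated, info) from 0.26 onward
            next_state, reward, *end_flags, info = env.step(action)
            done = any(end_flags)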
            
            # Update Q-value
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_error = td_target - Q[state][action]
            Q[state][action] += learning_rate * td_error
            
            # Update state and reward
            state = next_state
            total_reward += reward
        
        rewards.append(total_reward)
    
    # Plot results
    plt.figure(figsize=(10, 5))
    plt.plot(pd.Series(rewards).rolling(100).mean())
    plt.title('Rolling Mean Reward (100-episode window)')
    plt.xlabel('Episode')
    plt.ylabel('Mean Reward over Last 100 Episodes')
    plt.show()
    
    return Q

Q = q_learning_example()
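
The update inside the training loop is the tabular Q-learning rule $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. The sketch below isolates a single update on a plain dictionary so the arithmetic can be checked by hand. The helper name `q_update`, its `terminal` flag, the table `Q_demo`, and the toy numbers are illustrative assumptions.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state,
             alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Do not bootstrap from a terminal state: its future value is zero
    max_next = 0.0 if terminal else max(Q[next_state])
    td_target = reward + gamma * max_next
    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error


# Two actions per state, every value starts at zero
Q_demo = defaultdict(lambda: [0.0, 0.0])
Q_demo[1] = [0.0, 2.0]  # pretend state 1 already has a learned value

td_error = q_update(Q_demo, state=0, action=1, reward=1.0, next_state=1)
print("TD error:", td_error)         # 1.0 + 0.99 * 2.0 - 0.0
print("Q_demo[0][1]:", Q_demo[0][1])  # 0.0 + 0.1 * TD error
```

In the FrozenLake loop above, the hole and goal states are never acted from, so their rows stay at zero and an explicit `terminal` flag would not change the result there. It matters in environments where a terminal state can also be visited mid-episode.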

## 3. Policy Evaluation

In [None]:
def policy_evaluation_example():
    # Create simple grid world
    grid_size = 4
    states = [(i, j) for i in range(grid_size) for j in range(grid_size)]
    actions = ['up', 'right', 'down', 'left']
    
    # Random policy
    policy = {state: {action: 0.25 for action in actions} for state in states}
    
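    # Value function, initialised to zero for every state
    V = {state: 0.0 for state in states}
    
    # Reward function: -1 for each step taken from a non-goal state
    R = {state: -1 for state in states}
    R[(grid_size-1, grid_size-1)] = 0  # Goal state (terminal)
    
    # Iterative policy evaluation
    theta = 0.0001  # Stop when the largest update in a sweep falls below this
    gamma = 0.9     # Discount factor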
    
    while True:
        delta = 0
        for state in states:
            if state == (grid_size-1, grid_size-1):
                continue
                
            v = V[state]
            new_v = 0
            
            for action in actions:
                # Get next state
                i, j = state
                if action == 'up':
                    next_state = (max(0, i-1), j)
                elif action == 'right':
                    next_state = (i, min(grid_size-1, j+1))
                elif action == 'down':
                    next_state = (min(grid_size-1, i+1), j)
                else:  # left
                    next_state = (i, max(0, j-1))
                
                new_v += policy[state][action] * (R[state] + gamma * V[next_state])
            
            V[state] = new_v
            delta = max(delta, abs(v - V[state]))
        
        if delta < theta:
            break
    
    # Visualize value function
    value_grid = np.zeros((grid_size, grid_size))
    for i in range(grid_size):
        for j in range(grid_size):
            value_grid[i, j] = V[(i, j)]
    
    plt.figure(figsize=(8, 8))
    plt.imshow(value_grid)
    plt.colorbar()
    plt.title('State Value Function')
    for i in range(grid_size):
        for j in range(grid_size):
            plt.text(j, i, f'{value_grid[i,j]:.2f}',
                     ha='center', va='center')
    plt.show()

policy_evaluation_example()
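
The inner loop above applies the Bellman expectation backup for a fixed policy. Written out under the code's own assumptions (deterministic moves, a reward $R(s)$ that depends only on the state being left, and a terminal goal state $g$ whose value is held at 0), one sweep computes:

$$
V_{k+1}(s) \;=\; \sum_{a \in \mathcal{A}} \pi(a \mid s)\,\bigl[\, R(s) + \gamma\, V_k(s'_a) \,\bigr],
\qquad V_k(g) = 0,
$$

where $s'_a$ is the cell reached from $s$ by action $a$ (or $s$ itself when the move hits a wall). With the uniform policy $\pi(a \mid s) = 0.25$, $R(s) = -1$, and $\gamma = 0.9$, this reduces to

$$
V_{k+1}(s) \;=\; -1 + 0.225 \sum_{a \in \mathcal{A}} V_k(s'_a).
$$

For example, in the first sweep every $V_0(s'_a)$ is 0, so the first state visited gets $V_1(s) = -1$. The code updates `V` in place, so within a sweep some of the $V(s'_a)$ terms already hold values from the current sweep; this changes the intermediate values but not the fixed point the iteration converges to.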

## Practical Exercises

In [None]:
# Exercise 1: Simple RL Agent

def rl_agent_exercise():
    print("Task: Implement a basic RL agent")
    print("1. Create environment")
    print("2. Define agent behavior")
    print("3. Train agent")
    print("4. Evaluate performance")
    
    # Your code here

rl_agent_exercise()

In [None]:
# Exercise 2: Value Function Implementation

def value_function_exercise():
    print("Task: Implement value function estimation")
    print("1. Define state space")
    print("2. Implement value updates")
    print("3. Run iterations")
    print("4. Visualize results")
    
    # Your code here

value_function_exercise()

## MCQ Quiz

1. What is reinforcement learning?
   - a) Supervised learning
   - b) Learning from interaction
   - c) Unsupervised learning
   - d) Transfer learning

2. What is a state in RL?
   - a) Action
   - b) Environment description
   - c) Reward
   - d) Policy

3. What is a policy?
   - a) Reward function
   - b) Action selection strategy
   - c) State space
   - d) Value function

4. What is Q-learning?
   - a) Policy evaluation
   - b) Action-value (Q) function learning
   - c) State estimation
   - d) Reward calculation

5. What is exploration vs exploitation?
   - a) Learning rate
   - b) Action selection tradeoff
   - c) State transition
   - d) Reward function

6. What is a value function?
   - a) Action selection
   - b) State/action evaluation
   - c) Policy definition
   - d) Reward calculation

7. What is temporal difference learning?
   - a) Policy definition
   - b) Incremental value updates from bootstrapped estimates
   - c) State estimation
   - d) Action selection

8. What is the discount factor?
   - a) Learning rate
   - b) Future reward weight
   - c) State value
   - d) Action probability

9. What is an episode?
   - a) State transition
   - b) Complete interaction sequence
   - c) Reward calculation
   - d) Policy update

10. What is the Bellman equation?
    - a) Policy definition
    - b) Value recursion
    - c) Action selection
    - d) State transition

Answers: 1-b, 2-b, 3-b, 4-b, 5-b, 6-b, 7-b, 8-b, 9-b, 10-b