## **Value-Based Methods (Q-Learning)**

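Value-based methods learn a value function that estimates the expected long-term return of each state-action pair. Q-Learning is one of the most popular of these algorithms: the agent keeps a Q-table and, after every step, moves the entry for the state-action pair it just tried toward the reward it received plus the discounted value of the best action available in the next state.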
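
To make the update concrete, here is a minimal sketch that applies a single Q-learning step to a toy table with 2 states and 2 actions. The helper name `q_update`, the table `Q_toy`, and all numeric values are illustrative assumptions and are separate from the training code later in this section.

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Target: immediate reward plus discounted value of the best next action
    td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state, action]
    # Move the current estimate a fraction alpha of the way toward the target
    Q[state, action] += alpha * td_error
    return td_error

# Toy table: 2 states x 2 actions (illustrative values only)
Q_toy = np.zeros((2, 2))
Q_toy[1] = [0.5, 1.0]

# Take action 1 in state 0, receive reward 1.0, and land in state 1
td_error = q_update(Q_toy, state=0, action=1, reward=1.0, next_state=1)
print(Q_toy[0, 1])
```

Only the entry for the visited state-action pair changes; every other entry of the table is left untouched until the agent visits it.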



**Imports**

In [3]:
import numpy as np
import gym


**Data Loading**

In [None]:
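# Create an environment
env = gym.make('CartPole-v1')

# CartPole observations are continuous, so each of the 4 state variables is
# binned before it can index a Q-table. Bin count and ranges are assumptions.
n_bins = 6
ranges = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
bins = [np.linspace(lo, hi, n_bins - 1) for lo, hi in ranges]

def discretize(obs):
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

# Initialize Q-table: one axis per binned state variable, plus one for actions
n_actions = env.action_space.n
Q = np.zeros((n_bins,) * len(bins) + (n_actions,))


**Model Building**

In [None]:
# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 0.1  # Exploration rate

def reset_env(env):
    # gym >= 0.26 returns (obs, info); older versions return obs only
    out = env.reset()
    return out[0] if isinstance(out, tuple) else out

def step_env(env, action):
    # gym >= 0.26 returns (obs, reward, terminated, truncated, info);
    # older versions return (obs, reward, done, info)
    out = env.step(action)
    return out[0], out[1], any(out[2:-1])

# Q-Learning Algorithm
def q_learning(env, n_episodes=1000):
    for episode in range(n_episodes):
        state = discretize(reset_env(env))
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = env.action_space.sample()  # Explore
            else:
                action = int(np.argmax(Q[state]))  # Exploit

            obs, reward, done = step_env(env, action)
            next_state = discretize(obs)
            # Drop the bootstrap term once the episode has ended
            # (time-limit truncation is treated as an ending here, for simplicity)
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state

q_learning(env)


**Predictions**

In [None]:
# After training, use the learned Q-table for predictions
state = discretize(reset_env(env))
action = int(np.argmax(Q[state]))  # Select the best action based on the learned Q-values


**Performance Metrics**

In [None]:
# Evaluate the agent's performance after training (greedy policy, no exploration)
n_eval_episodes = 10
total_rewards = 0
for _ in range(n_eval_episodes):
    state = discretize(reset_env(env))
    done = False
    while not done:
        action = int(np.argmax(Q[state]))  # Use the policy derived from the Q-table
        obs, reward, done = step_env(env, action)
        state = discretize(obs)
        total_rewards += reward
print(f"Average Reward per Episode: {total_rewards / n_eval_episodes}")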
