Reinforcement learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward.

Key concepts:
- Agent: The learner or decision maker.
- Environment: The world in which the agent operates.
- Action: A decision made by the agent.
- State: A situation in which the agent finds itself.
- Reward: Feedback from the environment based on the action taken by the agent.
- Policy: A strategy used by the agent to determine actions based on states.
- Value Function: A function that estimates the expected cumulative reward from a state or state-action pair.
- Exploration vs Exploitation: The dilemma of choosing between exploring new actions or exploiting known rewarding actions.
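
The concepts above can be sketched as a small interaction loop. This is a minimal illustration, not the notebook's own implementation: `RingEnv`, `random_policy`, and `run_episode` are hypothetical names, and the 5-state ring with a goal at state 0 is assumed to mirror the environment used later in this notebook.

```python
import numpy as np


class RingEnv:
    """Hypothetical 5-state ring: action 0 moves +1, action 1 moves -1 (mod n)."""

    def __init__(self, n_states=5, goal=0):
        self.n_states = n_states
        self.goal = goal
        self.state = None

    def reset(self, rng):
        # Start anywhere except the goal so every episode needs at least one step
        self.state = int(rng.integers(1, self.n_states))
        return self.state

    def step(self, action):
        delta = 1 if action == 0 else -1
        self.state = (self.state + delta) % self.n_states
        done = self.state == self.goal
        reward = 1 if done else -1  # reward signal from the environment
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state entirely
    return int(rng.integers(0, 2))


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset(rng)
    total_reward, steps, done = 0, 0, False
    while not done and steps < max_steps:
        action = policy(state, rng)             # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # cumulative reward
        steps += 1
    return total_reward, steps, state


rng = np.random.default_rng(0)
env = RingEnv()
total_reward, steps, final_state = run_episode(env, random_policy, rng)
print(f"steps={steps}, cumulative reward={total_reward}, final state={final_state}")
```

A learning agent would replace `random_policy` with one that improves from the rewards it observes.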


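Key characteristics of reinforcement learning:
1. No labeled data: the agent learns from reward feedback rather than from labeled examples.
2. Involves a trade-off between exploration and exploitation.
3. Learning is driven by a reward signal.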

Reinforcement learning algorithms
1. Q-learning: A model-free, off-policy reinforcement learning algorithm that learns the value of taking an action in a particular state.
2. SARSA: An on-policy reinforcement learning algorithm that updates the action-value function using the action actually taken in the next state.
3. Deep Q-Networks (DQN): Combines Q-learning with deep neural networks to handle high-dimensional state spaces.
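
The difference between the Q-learning and SARSA updates can be sketched as follows. The function names and the numbers in `Q_init` are illustrative assumptions, not values from a trained agent; the only point is which next-state value each rule bootstraps from.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy target: bootstrap from the best action in the next state
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: bootstrap from the action actually chosen next
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])


# Illustrative values only (not from a trained agent)
Q_init = np.array([[0.0, 0.0],
                   [2.0, 1.0]])
Q_ql, Q_sarsa = Q_init.copy(), Q_init.copy()

q_learning_update(Q_ql, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=1, alpha=0.5, gamma=0.9)

print(Q_ql[0, 0])     # bootstraps from max(2.0, 1.0) = 2.0
print(Q_sarsa[0, 0])  # bootstraps from Q[1, 1] = 1.0
```

When the next action happens to be the greedy one, the two updates coincide; they differ only when the agent explores.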

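Q-learning algorithm
Environment: 5 positions arranged in a ring, with position 0 as the goal. The reward is +1 for a move that reaches the goal and -1 for every other move.
actions = [0, 1], where action 0 moves to the next position (+1) and action 1 moves to the previous position (-1)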

In [3]:
#import necessary libraries
import numpy as np
import pandas as pd


In [4]:
# Define the environment
position = 5  # number of states (positions 0..4 on a ring)
actions = 2   # number of actions (0: move +1, 1: move -1)

In [5]:
# Initialize the Q-table (one row per state, one column per action)
Q = np.zeros((position, actions))

In [6]:
#Define parameters
episodes = 1000
alpha = 0.8  # learning rate 
gamma = 0.9  # discount factor for future rewards
epsilon = 0.3  # exploration rate 

In [7]:
#Training loop
for episode in range(episodes):
    state = np.random.randint(0, position)  # Random initial state
    done = False
    
    while not done:
        # Choose action using epsilon-greedy policy
        if np.random.rand() < epsilon:
            action = np.random.randint(0, actions)  # Explore
        else:
            action = np.argmax(Q[state])  # Exploit
        
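        # Simulate environment response: action 0 moves +1, action 1 moves -1 (wrapping around)
        next_state = (state + 1) % position if action == 0 else (state - 1) % position
        # +1 for reaching the goal (state 0), -1 for every other move
        reward = 1 if next_state == 0 else -1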
        
        # Update Q-value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        
        # Move to the next state
        state = next_state
        
        # Check if the episode is done
        if state == 0:
            done = True

In [8]:
# Convert Q-table to DataFrame for better visualization
Q_df = pd.DataFrame(Q, columns=['Action 0', 'Action 1'], index=[f'State {i}' for i in range(position)])
# Print the Q-table
print("Q-table after training:")
print(Q_df)

Q-table after training:
         Action 0  Action 1
State 0 -0.526315 -0.526315
State 1 -1.473684  0.526316
State 2 -1.473684 -0.526315
State 3 -0.526315 -1.473684
State 4  0.526316 -1.473684
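
The printed values can be cross-checked without any sampling. Because the environment is deterministic, the Q-learning update settles where Q(s, a) = r + gamma * max Q(s', a') holds for every pair. For example, Q[1, 1] = 1 + 0.9 * (-1 + 0.9 * Q[1, 1]), which gives Q[1, 1] = 10/19 ≈ 0.526316. The sketch below iterates that equation directly; `step` and `Q_star` are names introduced here for illustration, and the code assumes, as the training loop does, that the update also bootstraps from state 0.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9


def step(state, action):
    # Same dynamics and rewards as the training loop
    next_state = (state + 1) % n_states if action == 0 else (state - 1) % n_states
    reward = 1 if next_state == 0 else -1
    return next_state, reward


# Repeatedly apply Q(s, a) <- r + gamma * max_a' Q(s', a') until it stops changing
Q_star = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_new = np.empty_like(Q_star)
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = step(s, a)
            Q_new[s, a] = r + gamma * Q_star[s_next].max()
    if np.max(np.abs(Q_new - Q_star)) < 1e-12:
        break
    Q_star = Q_new

print(np.round(Q_star, 6))
```

The result should agree with the trained Q-table to the printed precision, which suggests 1000 episodes were enough for the table to converge in this small environment.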


In [9]:
#Testing the learned policy
test_episodes = 10
for episode in range(test_episodes):
    state = np.random.randint(0, position)  # Random initial state
    done = False
    print(f"Episode {episode + 1}:")
    
    while not done:
        action = np.argmax(Q[state])  # Choose the best action
        print(f"State: {state}, Action: {action}")
        
        # Simulate environment response
        next_state = (state + 1) % position if action == 0 else (state - 1) % position
        
        # Move to the next state
        state = next_state
        
        # Check if the episode is done
        if state == 0:
            done = True
            print("Reached goal state!")
        else:
            print(f"Moved to State: {state}")
# End of the script
# The script implements a simple reinforcement learning algorithm using Q-learning.
# It initializes a Q-table, trains it over multiple episodes, and tests the learned policy.
# The Q-table is printed at the end to show the learned values for each state-action pair.
# The testing phase simulates episodes where the agent follows the learned policy to reach the goal state.
# The script is a basic example of reinforcement learning and can be extended or modified for more complex environments.

Episode 1:
State: 1, Action: 1
Reached goal state!
Episode 2:
State: 1, Action: 1
Reached goal state!
Episode 3:
State: 4, Action: 0
Reached goal state!
Episode 4:
State: 2, Action: 1
Moved to State: 1
State: 1, Action: 1
Reached goal state!
Episode 5:
State: 4, Action: 0
Reached goal state!
Episode 6:
State: 2, Action: 1
Moved to State: 1
State: 1, Action: 1
Reached goal state!
Episode 7:
State: 1, Action: 1
Reached goal state!
Episode 8:
State: 0, Action: 1
Moved to State: 4
State: 4, Action: 0
Reached goal state!
Episode 9:
State: 1, Action: 1
Reached goal state!
Episode 10:
State: 0, Action: 1
Moved to State: 4
State: 4, Action: 0
Reached goal state!
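
The behaviour in the test episodes follows from taking the row-wise argmax of the Q-table. A minimal sketch, assuming the values below are copied from the printed table (rounded to 6 decimals); `Q_printed` and `greedy` are names introduced here. State 0 is left out because its two values tie at the printed precision, so the action chosen there depends on digits that are not shown (the trace above shows action 1 was picked).

```python
import numpy as np

# Values copied from the printed Q-table (rounded to 6 decimals)
Q_printed = np.array([
    [-0.526315, -0.526315],
    [-1.473684,  0.526316],
    [-1.473684, -0.526315],
    [-0.526315, -1.473684],
    [ 0.526316, -1.473684],
])

# Greedy policy: the highest-valued action in each state
greedy = np.argmax(Q_printed, axis=1)
for s in range(1, len(greedy)):
    print(f"State {s}: action {greedy[s]}")
```

States 1 and 2 move backward (action 1) and states 3 and 4 move forward (action 0), so each takes the shorter way around the ring to state 0.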


In [10]:
# Convert Q-table to DataFrame for better visualization
Q_df = pd.DataFrame(Q, columns=['Action 0', 'Action 1'], index=[f'State {i}' for i in range(position)])
# Print the Q-table
print("Q-table after training:")
print(Q_df)
# Save the Q-table to a CSV file
Q_df.to_csv('q_table.csv', index=True)

Q-table after training:
         Action 0  Action 1
State 0 -0.526315 -0.526315
State 1 -1.473684  0.526316
State 2 -1.473684 -0.526315
State 3 -0.526315 -1.473684
State 4  0.526316 -1.473684
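
A saved Q-table can be loaded back for later use. The sketch below is a round trip under stated assumptions: it writes a stand-in table (`Q_demo`, made-up values) to a temporary directory rather than touching the real `q_table.csv`, and reads it with `pd.read_csv(..., index_col=0)` so the state labels come back as the index.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in Q-table with the same shape and labels as the one saved above
Q_demo = np.arange(10, dtype=float).reshape(5, 2)
Q_demo_df = pd.DataFrame(
    Q_demo,
    columns=['Action 0', 'Action 1'],
    index=[f'State {i}' for i in range(5)],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'q_table.csv')
    Q_demo_df.to_csv(path, index=True)
    # index_col=0 restores the 'State i' labels as the index
    Q_loaded_df = pd.read_csv(path, index_col=0)

# Back to a plain array that can be indexed as Q[state, action]
Q_loaded = Q_loaded_df.to_numpy()
print(Q_loaded.shape)
```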
