<a href="https://colab.research.google.com/github/EngrIBGIT/Reinforcment_Learning/blob/main/Introduction_to_Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Reinforcement Learning

Reinforcement Learning (RL) is a type of machine learning where an agent learns how to interact with an environment in order to achieve a goal. The agent receives feedback in the form of rewards or penalties for its actions. Through trial and error, the agent learns the best sequence of actions to maximize its total rewards. This technique is widely used in applications such as game playing (e.g., AlphaGo), robotic control, and autonomous systems.

**Key Concepts**

1. `Agent:` The entity that interacts with the environment (e.g., a robot or a game player).

2. `Environment:` The system with which the agent interacts (e.g., the game board, the physical world).

3. `Action (A):` The choices the agent can make (e.g., moving a piece in a game, turning left or right).

4. `State (S):` The current situation of the agent in the environment (e.g., the current configuration of the game board).

5. `Reward (R):` The feedback the agent receives for its action. Positive rewards encourage the agent to take similar actions in the future, while penalties discourage them.
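
These five pieces fit together in a single interaction loop. The sketch below is only an illustration and is separate from the grid world built later: the `CorridorEnv` class, its five-cell layout, and the always-move-right `agent_policy` are hypothetical names chosen to show where the state, action, and reward appear.

```python
class CorridorEnv:
    """Hypothetical 1-D environment: cells 0..4, goal at cell 4."""

    def __init__(self):
        self.goal = 4
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action is 'left' or 'right'; the walls clamp the position.
        if action == 'right':
            self.state = min(self.state + 1, self.goal)
        elif action == 'left':
            self.state = max(self.state - 1, 0)
        done = self.state == self.goal
        reward = 1 if done else -0.1
        return self.state, reward, done


def agent_policy(state):
    # Placeholder agent: always moves right (no learning yet).
    return 'right'


env_demo = CorridorEnv()
state = env_demo.reset()
done = False
history = []  # (state, action, reward, next_state) tuples

while not done:
    action = agent_policy(state)                      # agent picks an action
    next_state, reward, done = env_demo.step(action)  # environment responds
    history.append((state, action, reward, next_state))
    state = next_state

for transition in history:
    print(transition)
```

Each printed tuple is one pass through the loop: the state the agent saw, the action it chose, the reward it received, and the state it ended up in.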

## Basic Example: Reinforcement Learning in Python

In this section we build a simple RL environment in plain Python: a small grid in which an agent tries to reach a goal by choosing one move at a time. The model is deliberately basic so that a beginner can follow every step.

### Step 1: Install Required Libraries

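`pip install gym numpy`

`gym` is a toolkit for building RL environments, and `numpy` helps with numerical computations. The grid world in this example is written from scratch, so only `numpy` is actually used by the code that follows.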

In [1]:
pip install gym numpy



### Step 2: Create the Environment

Define a simple grid world environment where the agent tries to reach a goal.

In [2]:
import numpy as np
import random

class SimpleGridWorld:
    def __init__(self):
        self.grid_size = 5
        self.state = (0, 0)  # Start position at top-left corner
        self.goal_state = (4, 4)  # Goal at bottom-right corner

    def reset(self):
        self.state = (0, 0)  # Reset the agent to the start position
        return self.state

    def step(self, action):
        # Action can be up, down, left, right
        next_state = list(self.state)

        if action == 'up' and self.state[0] > 0:
            next_state[0] -= 1
        elif action == 'down' and self.state[0] < self.grid_size - 1:
            next_state[0] += 1
        elif action == 'left' and self.state[1] > 0:
            next_state[1] -= 1
        elif action == 'right' and self.state[1] < self.grid_size - 1:
            next_state[1] += 1

        self.state = tuple(next_state)

        # Check if the agent reached the goal
        if self.state == self.goal_state:
            reward = 1  # Reward for reaching the goal
            done = True
        else:
            reward = -0.1  # Penalty for every step taken
            done = False

        return self.state, reward, done

    def get_possible_actions(self):
        return ['up', 'down', 'left', 'right']

# Create an instance of the environment
env = SimpleGridWorld()

# Test the environment
state = env.reset()
print("Initial State:", state)

actions = env.get_possible_actions()
print("Possible Actions:", actions)

# Take a step in the environment
new_state, reward, done = env.step('down')
print("New State:", new_state, "Reward:", reward, "Done:", done)


Initial State: (0, 0)
Possible Actions: ['up', 'down', 'left', 'right']
New State: (1, 0) Reward: -0.1 Done: False
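
Before any learning, it can help to see how an untrained agent behaves. The sketch below re-implements the same movement and reward rules as a standalone function (`grid_step`, a hypothetical helper so the block runs on its own) and lets a purely random policy wander until it reaches the goal or hits an assumed cap of 10,000 steps.

```python
import random

GRID_SIZE = 5
GOAL = (4, 4)
ACTIONS = ['up', 'down', 'left', 'right']
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def grid_step(state, action):
    """Same rules as SimpleGridWorld.step: moves off the grid are ignored."""
    row = state[0] + MOVES[action][0]
    col = state[1] + MOVES[action][1]
    if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
        state = (row, col)
    if state == GOAL:
        return state, 1, True
    return state, -0.1, False


def random_rollout(seed, max_steps=10_000):
    rng = random.Random(seed)  # local generator so the run is repeatable
    state, total_reward, steps = (0, 0), 0.0, 0
    done = False
    while not done and steps < max_steps:
        state, reward, done = grid_step(state, rng.choice(ACTIONS))
        total_reward += reward
        steps += 1
    return steps, total_reward, done


steps, total_reward, reached_goal = random_rollout(seed=0)
print(f"Random agent: {steps} steps, total reward {total_reward:.1f}, "
      f"reached goal: {reached_goal}")
```

The shortest route from (0, 0) to (4, 4) takes 8 moves, so a step count well above that shows how much room learning has to improve on random behaviour.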


### Step 3: Implement Q-Learning Algorithm

Now that we have a simple environment, let’s train the agent using a basic RL technique called **Q-Learning**. Q-Learning keeps a table (the Q-table) with one entry per state-action pair. Each entry estimates the total discounted reward the agent can expect if it takes that action in that state and acts greedily afterwards.

#### Q-Learning Steps:

1. **Initialize** a Q-table with zeros. It holds one value for every state-action pair; in this example each state is a grid cell (row, column), so the table is indexed by row, column, and action.
2. **Choose an action** based on an exploration strategy (e.g., epsilon-greedy).
3. **Take the action** and observe the reward and new state.
4. **Update the Q-value** for the action using the Q-learning update rule, which is derived from the Bellman equation:

   \[
   Q(s, a) \leftarrow Q(s, a) + \alpha \times \left[ r + \gamma \times \max_{a'} Q(s', a') - Q(s, a) \right]
   \]

   Where:

   - \( s \) = current state
   - \( a \) = action taken
   - \( r \) = reward
   - \( s' \) = new state
   - \( a' \) = any action available in the new state (the maximum is taken over these)
   - \( \alpha \) = learning rate
   - \( \gamma \) = discount factor (importance of future rewards)
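
To make the update concrete, here is a small worked sketch of a single update in plain Python. The helper name `q_update` and the example numbers are illustrative assumptions; the default `alpha` and `gamma` match the hyperparameters used in the implementation that follows.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) after one Q-learning update."""
    td_target = reward + gamma * max_q_next  # r + gamma * max over a' of Q(s', a')
    td_error = td_target - q_sa              # how far off the old estimate was
    return q_sa + alpha * td_error


# Ordinary step: penalty of -0.1, nothing learned yet about the next state.
print(q_update(q_sa=0.0, reward=-0.1, max_q_next=0.0))

# Step that enters the goal: reward 1, and the goal state's values stay at 0.
print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))

# Step into a state whose best action is already valued at 0.1:
# that future value is discounted by gamma before it is added to the reward.
print(q_update(q_sa=0.0, reward=-0.1, max_q_next=0.1))
```

Up to floating-point rounding, the three results are about -0.01, 0.1, and -0.001. The last case shows how value learned near the goal gradually propagates backwards to earlier states.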

#### Implementing Q-Learning:

In [3]:
# Initialize the Q-table with zeros
q_table = np.zeros((env.grid_size, env.grid_size, len(actions)))

# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.1  # Exploration factor
episodes = 1000

def choose_action(state):
    if random.uniform(0, 1) < epsilon:
        # Explore: Choose a random action
        return random.choice(actions)
    else:
        # Exploit: Choose the action with the highest Q-value
        state_index = (state[0], state[1])
        return actions[np.argmax(q_table[state_index])]

# Train the agent
for episode in range(episodes):
    state = env.reset()
    done = False

    while not done:
        action = choose_action(state)
        next_state, reward, done = env.step(action)

        state_index = (state[0], state[1])
        next_state_index = (next_state[0], next_state[1])
        action_index = actions.index(action)

        # Update the Q-value using the Q-learning formula:
        # move Q(s, a) a fraction alpha toward r + gamma * max Q(s', a')
        td_target = reward + gamma * np.max(q_table[next_state_index])
        q_table[state_index][action_index] += alpha * (td_target - q_table[state_index][action_index])

        state = next_state

# Display the trained Q-table
print("Trained Q-table:")
print(q_table)

Trained Q-table:
[[[-0.14809391 -0.12695022 -0.15755067 -0.0434062 ]
  [-0.064028    0.02561415 -0.15082672  0.062882  ]
  [ 0.05092854  0.18098    -0.06990217  0.10336519]
  [-0.11205739  0.29095655 -0.09818734 -0.10482227]
  [-0.06707876 -0.05198497 -0.07259832 -0.06793465]]

 [[-0.22431599 -0.21689332 -0.2280373   0.03388275]
  [-0.15483082 -0.10978417 -0.17145275  0.18067415]
  [ 0.0482231   0.3122      0.0325186   0.27452431]
  [-0.04974876  0.45785633 -0.0187411  -0.02156011]
  [-0.04944662  0.60208779 -0.04096347  0.00250655]]

 [[-0.16050408 -0.15717974 -0.15684744 -0.09140406]
  [-0.10354154 -0.12266686 -0.12508875  0.28297861]
  [ 0.13855179  0.20689614  0.07804196  0.458     ]
  [ 0.26118165  0.47677064  0.27757373  0.62      ]
  [ 0.38036298  0.8         0.40947797  0.56320893]]

 [[-0.12125297 -0.11546065 -0.11361513 -0.11749482]
  [-0.08704582 -0.07356103 -0.0954861  -0.08217558]
  [-0.04039525  0.47980313 -0.05667937 -0.03668682]
  [ 0.01834739 -0.019      -0.0109      0
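
In the printed array, the first two indices are the row and column of a state, and the innermost four numbers are that state's Q-values in the order `up`, `down`, `left`, `right`. The sketch below copies three rows from the output above and picks the highest-valued action for each; the helper name `greedy_action` is illustrative.

```python
actions = ['up', 'down', 'left', 'right']

# Q-values copied from the printed table above for three states in grid row 0.
q_rows = {
    (0, 0): [-0.14809391, -0.12695022, -0.15755067, -0.0434062],
    (0, 1): [-0.064028, 0.02561415, -0.15082672, 0.062882],
    (0, 2): [0.05092854, 0.18098, -0.06990217, 0.10336519],
}


def greedy_action(q_values):
    # Index of the largest Q-value, mapped back to its action name.
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return actions[best_index]


greedy = {state: greedy_action(q_values) for state, q_values in q_rows.items()}
for state, action in greedy.items():
    print(state, "->", action)
```

With these values the greedy choice is `right` at (0, 0) and (0, 1), and `down` at (0, 2).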

### Step 4: Test the Trained Agent

After training, we can test how well the agent learned by letting it navigate the environment greedily: it always takes the action with the highest Q-value and no longer explores.

In [4]:
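state = env.reset()
done = False
total_reward = 0
max_steps = 50  # Safety cap: a poorly trained greedy policy could loop forever
steps = 0

while not done and steps < max_steps:
    # Always exploit: pick the action with the highest Q-value
    action = actions[np.argmax(q_table[state[0], state[1]])]
    state, reward, done = env.step(action)
    total_reward += reward
    steps += 1
    print(f"State: {state}, Action: {action}, Reward: {reward}")

print("Total Reward:", total_reward)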

State: (0, 1), Action: right, Reward: -0.1
State: (0, 2), Action: right, Reward: -0.1
State: (1, 2), Action: down, Reward: -0.1
State: (2, 2), Action: down, Reward: -0.1
State: (2, 3), Action: right, Reward: -0.1
State: (2, 4), Action: right, Reward: -0.1
State: (3, 4), Action: down, Reward: -0.1
State: (4, 4), Action: down, Reward: 1
Total Reward: 0.30000000000000004
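
A quick sanity check on this result: the run above reaches the goal in 8 moves, which is the fewest possible when only up, down, left, and right are allowed. The sketch below (variable names are illustrative) recomputes the best achievable total from the reward rules; the trailing digits in `0.30000000000000004` are ordinary floating-point rounding, so the comparison uses `math.isclose` rather than `==`.

```python
import math

start, goal = (0, 0), (4, 4)

# Fewest moves between two cells when only up/down/left/right are allowed.
shortest_path_length = abs(goal[0] - start[0]) + abs(goal[1] - start[1])

# Every move except the last costs -0.1; the last one earns +1.
best_total_reward = (shortest_path_length - 1) * -0.1 + 1

print("Shortest path length:", shortest_path_length)
print("Matches the 0.3 reported above:", math.isclose(best_total_reward, 0.3))
```

Because the trained agent's total matches this upper bound, the greedy policy it learned follows a shortest path for this start position.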


## Summary of Steps:

`Environment Setup:` We created a simple grid world.

`Q-Learning Algorithm:` We implemented Q-Learning to teach the agent to navigate the grid.

`Training:` The agent learned by exploring and updating its Q-values based on the rewards.

`Testing:` We evaluated the agent’s performance in the environment.

This example shows the fundamental principles of reinforcement learning.

It can be extended to more complex environments, and other techniques like deep Q-networks (DQN) can be introduced for environments with larger state spaces.