# Environment, Agent, Reward, Action

In reinforcement learning (RL), the environment, agent, action, and reward interact to enable learning through trial and error.

**Environment in RL**

The environment is the external system or context in which the agent operates. It could be a physical space, a simulated world, or an abstract problem. The environment responds to the agent's actions by changing state and returning observations and feedback.

**Agent in RL**

The agent is the decision-making entity that interacts with the environment. It perceives the state of the environment, selects actions, and receives feedback in the form of rewards or penalties from the environment based on those actions.

**Action in RL**

Actions are the decisions or moves the agent makes within the environment. Each action can change the state of the environment and lead to a new observation.

**Reward in RL**

Rewards are numerical values provided by the environment to the agent after it takes an action. These rewards serve as feedback to the agent, indicating how good or bad the action was with respect to the agent's goal. The agent's objective is typically to maximize the cumulative reward it receives over time.
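As a minimal sketch of what "cumulative reward" means, the snippet below sums a sequence of rewards. The function name `cumulative_reward` and the optional `discount` parameter are assumptions made for illustration; the text above describes only the plain sum, which corresponds to `discount=1.0`.

```python
from typing import Sequence


def cumulative_reward(rewards: Sequence[float], discount: float = 1.0) -> float:
    """Sum rewards over time; the reward at step t is weighted by discount**t.

    With the default discount of 1.0 this is the plain sum of rewards.
    """
    total = 0.0
    weight = 1.0
    for reward in rewards:
        total += weight * reward
        weight *= discount
    return total


rewards = [1.0, 0.0, 2.0]
print(cumulative_reward(rewards))       # 3.0 (plain sum)
print(cumulative_reward(rewards, 0.5))  # 1.5 (later rewards count for less)
```

A discount below 1.0 is a common way to make near-term rewards matter more than distant ones, but it is optional for short, finite episodes like the 20-step game used later in this notebook.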

The goal of the agent in reinforcement learning is to learn a policy, a rule that maps each state (or observation) to an action, that maximizes the expected cumulative reward over time. The agent learns by exploring different actions, observing the consequences (rewards) in the environment, and updating its decision-making strategy based on these experiences.
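The explore-then-update cycle described above can be sketched with an epsilon-greedy agent. This is one simple, commonly used strategy, not the method used by the agent later in this notebook (which acts purely at random). The class `EpsilonGreedyAgent`, the function `toy_reward`, and the reward rule inside it are all hypothetical and exist only for this illustration.

```python
import random
from typing import Dict, List


class EpsilonGreedyAgent:
    """Hypothetical agent that keeps a running average reward per action."""

    def __init__(self, actions: List[int], epsilon: float = 0.1):
        self.actions = actions
        self.epsilon = epsilon  # probability of exploring a random action
        self.counts: Dict[int, int] = {a: 0 for a in actions}
        self.values: Dict[int, float] = {a: 0.0 for a in actions}

    def choose_action(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best estimate
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action: int, reward: float) -> None:
        # Incremental mean: move the estimate toward the observed reward
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def toy_reward(action: int) -> float:
    # Assumed reward rule for this sketch: action 1 pays 1.0, action 0 pays 0.0
    return 1.0 if action == 1 else 0.0


random.seed(0)
agent = EpsilonGreedyAgent(actions=[0, 1], epsilon=0.2)
for _ in range(200):
    action = agent.choose_action()
    agent.update(action, toy_reward(action))

print(agent.values)
```

Here the "policy" is simply "pick the action with the highest estimated value, except for occasional random exploration". Without the exploration step the agent would never try action 1 and would never discover that it pays better.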

In essence, reinforcement learning involves the agent learning to make better decisions by interacting with an environment, receiving feedback (rewards), and adjusting its behavior to maximize long-term reward.

In [1]:
# `typing` is part of the Python standard library (3.5+), so no installation is needed.

In [8]:
import random
from typing import List

class Environment:
    # Initializing the environment
    def __init__(self):
        # Total steps to complete the game
        self.steps_left = 20

    # Observation of the environment
    def get_observation(self) -> List[float]:
        # For this example, a list of three float values for observation
        return [0.0, 0.0, 0.0]

    # Possible actions the agent can take
    def get_actions(self) -> List[int]:
        # The agent can take two actions: 0 and 1
        return [0, 1]

    # Check if the game is over
    def is_done(self) -> bool:
        # The game ends when the steps are completed
        return self.steps_left == 0

    # Implementing an action and providing a reward
    def action(self, action: int) -> float:
        # If the game is already over, raise an exception
        if self.is_done():
            raise Exception("Game is over")

        # Decrement the number of steps left
        self.steps_left -= 1

        # Return a random reward for the action taken
        return random.random()


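Together, the `Environment` above and the `Agent` below implement the basic interaction loop. On each step, the agent:

Observes the environment

Chooses an action (here, at random) from those the environment offers

Sends the action to the environment

Receives a reward and adds it to its running total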

In [10]:
class Agent:
    def __init__(self):
        # Starting score point for the agent is 0.0
        self.total_reward = 0.0

    def step(self, env: Environment):
        # Getting the current observation from the environment
        current_obs = env.get_observation()
        print("Observation {}".format(current_obs))

        # Getting available actions from the environment
        actions = env.get_actions()
        print(actions)

        # Taking a random action and receiving a reward from the environment
        reward = env.action(random.choice(actions))

        # Accumulating the reward obtained from the action
        self.total_reward += reward
        print("Total Reward {}".format(self.total_reward))



In [11]:
if __name__ == "__main__":
    env = Environment()
    agent = Agent()
    step = 0

    # Run the episode until the environment reports that it is done
    while not env.is_done():
        step += 1
        print("Steps {}".format(step))
        agent.step(env)

    print("Total reward got: %.4f" % agent.total_reward)


Steps 1
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 0.38418183995796384
Steps 2
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 1.2233808814117078
Steps 3
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 1.9314806451884
Steps 4
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 2.5265748958486545
Steps 5
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 3.4167193819439925
Steps 6
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 3.866090981480984
Steps 7
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 4.862868630313203
Steps 8
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 4.8685887542693225
Steps 9
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 5.857826599098511
Steps 10
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 6.721536992530635
Steps 11
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 6.9009392356067805
Steps 12
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 7.323914682859593
Steps 13
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 7.583330917203812
Steps 14
Observation [0.0, 0.0, 0.0]
...

The example above is deliberately loose: the reward is random and the observation never changes, so there is nothing for the agent to learn. Here are two ways to tighten it:

**Strict Reward Points**

Define a specific reward system in which each correct step yields a fixed number of points. For instance, you might assign a reward of +1 for every correct step the agent takes towards the goal. This makes the reward deterministic and consistent, unlike the random value returned by `Environment.action`.
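A minimal sketch of such a fixed reward scheme follows. The class name `StrictRewardEnvironment` and the `correct_action` parameter are hypothetical; in particular, treating one fixed action as "correct" is an assumption made only to keep the example small.

```python
from typing import List


class StrictRewardEnvironment:
    """Hypothetical variant of the example environment with fixed rewards."""

    def __init__(self, steps: int = 20, correct_action: int = 1):
        self.steps_left = steps
        self.correct_action = correct_action  # assumed definition of "correct"

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise RuntimeError("Game is over")
        self.steps_left -= 1
        # +1 for the correct action, 0 otherwise: no randomness involved
        return 1.0 if action == self.correct_action else 0.0


env = StrictRewardEnvironment(steps=3)
rewards = [env.action(a) for a in (1, 0, 1)]
print(rewards)  # [1.0, 0.0, 1.0]
```

Because the same action now always earns the same reward, the total reward becomes a meaningful measure of how well the agent behaved rather than a sum of random numbers.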


**Define Steps and Goals**

Establish the sequence of actions the agent must take to reach a goal or complete the environment. This means setting specific objectives or milestones inside the environment that guide the agent toward a defined goal state.

With both strategies in place, the code becomes more structured: the reward system and the steps required to reach the objective are explicit, which gives clearer control over the agent's behavior and its interactions with the environment.
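One way to combine a goal state with fixed rewards is sketched below. The class `GoalEnvironment`, its one-dimensional track, and the meaning given to actions 0 and 1 are all assumptions chosen for illustration, not part of the original example.

```python
from typing import List


class GoalEnvironment:
    """Hypothetical 1-D track: start at position 0, reach `goal` to finish."""

    def __init__(self, goal: int = 3, max_steps: int = 20):
        self.goal = goal
        self.position = 0
        self.steps_left = max_steps

    def get_observation(self) -> List[float]:
        # The observation now reflects the state: current position and goal
        return [float(self.position), float(self.goal)]

    def get_actions(self) -> List[int]:
        # 0 = stay in place, 1 = move one cell toward the goal
        return [0, 1]

    def is_done(self) -> bool:
        # The episode ends at the goal or when the step budget runs out
        return self.position == self.goal or self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise RuntimeError("Game is over")
        self.steps_left -= 1
        if action == 1:
            self.position += 1
            return 1.0  # +1 for each step toward the goal
        return 0.0


env = GoalEnvironment(goal=3)
total = 0.0
while not env.is_done():
    total += env.action(1)  # always move forward in this demo
print(env.get_observation(), total)  # [3.0, 3.0] 3.0
```

Unlike the original `Environment`, the observation here changes as the agent acts and the episode can end early, so an agent that learns to prefer action 1 finishes in fewer steps than one acting at random.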