<a href="https://colab.research.google.com/github/TKtheFirstone/My-Reinforcement-Learning/blob/main/My_Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Aim -
```
There are two doors for our agent to choose from.
Opening the first door always gives a reward of +10.
Opening the second door gives a reward of +200 with 20% probability and a penalty of -10 with 80% probability.
```


In [2]:
# installing dependencies
!pip install gym
# gym is no longer maintained; its successor can be installed with: !pip install gymnasium




**Key Terms & Concepts:**

* State: The current situation of the agent. This problem has a single state, since the agent always faces the same two doors.  
* Action: A choice the agent can make. Here, the actions are opening door 1 or opening door 2.  
* Reward: The feedback received after taking an action in a state. Here, the reward depends on which door is opened.  
* Q-values: The expected cumulative (discounted) future reward of taking an action in a state. We'll use a Q-table to store these values.  

* **Exploration vs Exploitation**: With probability epsilon the agent picks a random action (exploration); otherwise it picks the action with the highest Q-value learned so far (exploitation).
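
The exploration/exploitation trade-off can be sketched as a small helper. This is a minimal illustration using only the standard library; the function name `epsilon_greedy` and its `rng` parameter are hypothetical and not part of the notebook, which inlines the same logic with NumPy.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (the first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the helper never explores and returns the greedy action
print(epsilon_greedy([1.0, 5.0], epsilon=0.0))
```

A larger `epsilon` spends more steps on random doors; a smaller one commits sooner to whichever door currently looks best.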


**Steps:**

* Initialization: Initialize the Q-values to zeros.  
* Policy: Define a policy for choosing actions. We'll use an epsilon-greedy policy.  
* Learning: Update the Q-values using the Q-learning update rule.  
* Iteration: Repeat the policy and learning steps for a fixed number of episodes.  
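
For reference, the learning step can be written as the standard tabular Q-learning update. The symbols are the usual ones and are assumed here rather than defined in the notebook: $s$ is the current state, $a$ the chosen action, $r$ the reward, $s'$ the next state, $\alpha$ the learning rate (`alpha`) and $\gamma$ the discount factor (`gamma`).

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

In the two-door problem there is only one state, so $s' = s$ and the Q-table reduces to one value per door.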

In [8]:
import gym
import numpy as np

# Environment
num_doors = 2
reward_door1 = 10
reward_door2 = [200, -10]
prob_door2 = [0.2, 0.8]

# Q-learning parameters
alpha = 0.1
gamma = 0.9
epsilon = 0.1
episodes = 1000

# Q-table initialization
q_table = np.zeros(num_doors)

In [13]:
class Choice:
    def run(self):
        for _ in range(episodes):
            # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit
            if np.random.uniform(0, 1) < epsilon:
                action = np.random.choice(num_doors)
            else:
                action = np.argmax(q_table)

            # Get the reward: door 1 is deterministic, door 2 is sampled
            if action == 0:
                reward = reward_door1
            else:
                reward = np.random.choice(reward_door2, p=prob_door2)

            # Q-learning update rule (single state, so the next state's best value is np.max(q_table))
            q_table[action] = q_table[action] + alpha * (reward + gamma * np.max(q_table) - q_table[action])

        print("Q-values:", q_table)
        best_door = np.argmax(q_table) + 1
        print(f"Best door to choose is Door {best_door}")

In [14]:
RL = Choice()
RL.run()

Q-values: [291.80244844 291.83601929]
Best door to choose is Door 2
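
The two Q-values are much larger than any single reward and sit close together because the update bootstraps with `gamma * np.max(q_table)` and the single state never terminates. As a rough check, the sketch below computes the expected reward of each door and the fixed point the Q-values would approach, assuming the expected update converges and door 2 stays the better door. The variable names `exp_door1`, `exp_door2`, `q_door1` and `q_door2` are introduced here for illustration only.

```python
# Environment and discount values copied from the cells above
reward_door1 = 10
reward_door2 = [200, -10]
prob_door2 = [0.2, 0.8]
gamma = 0.9

# Expected immediate reward of each door
exp_door1 = reward_door1
exp_door2 = sum(r * p for r, p in zip(reward_door2, prob_door2))

# Fixed point of Q(a) = E[r(a)] + gamma * max(Q), with door 2 as the maximizer
q_door2 = exp_door2 / (1 - gamma)
q_door1 = exp_door1 + gamma * q_door2

print(exp_door2, q_door1, q_door2)
```

Door 2 has an expected reward of about 32 against 10 for door 1, and the fixed-point values come out near 298 and 320. The printed Q-values of roughly 292 are in the same range; they are noisy estimates because of the constant learning rate, the random rewards, and the limited number of episodes.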


##### Another sample program: training a mountain car to climb a hill

In [3]:
from gym import wrappers

n_states = 40
iter_max = 10000

initial_lr = 1.0 # Learning rate
min_lr = 0.003
gamma = 1.0
t_max = 10000
eps = 0.02

def run_episode(env, policy=None, render=False):
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    for _ in range(t_max):
        if render:
            env.render()
        if policy is None:
            action = env.action_space.sample()
        else:
            a,b = obs_to_state(env, obs)
            action = policy[a][b]
        obs, reward, done, _ = env.step(action)
        total_reward += gamma ** step_idx * reward
        step_idx += 1
        if done:
            break
    return total_reward

def obs_to_state(env, obs):
    """ Maps an observation to state """
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    env_dx = (env_high - env_low) / n_states
    a = int((obs[0] - env_low[0])/env_dx[0])
    b = int((obs[1] - env_low[1])/env_dx[1])
    return a, b

if __name__ == '__main__':
    env_name = 'MountainCar-v0'
    env = gym.make(env_name)
    env.seed(0)
    np.random.seed(0)
    print ('----- using Q Learning -----')
    q_table = np.zeros((n_states, n_states, 3))
    for i in range(iter_max):
        obs = env.reset()
        total_reward = 0
        # eta: the learning rate decays by a factor of 0.85 every 100 episodes, down to min_lr
        eta = max(min_lr, initial_lr * (0.85 ** (i//100)))
        for j in range(t_max):
            a, b = obs_to_state(env, obs)
            if np.random.uniform(0, 1) < eps:
                action = np.random.choice(env.action_space.n)
            else:
                logits = q_table[a][b]
                # Softmax over Q-values; subtracting the max keeps exp numerically stable
                logits_exp = np.exp(logits - np.max(logits))
                probs = logits_exp / np.sum(logits_exp)
                action = np.random.choice(env.action_space.n, p=probs)
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            # update q table
            a_, b_ = obs_to_state(env, obs)
            q_table[a][b][action] = q_table[a][b][action] + eta * (reward + gamma *  np.max(q_table[a_][b_]) - q_table[a][b][action])
            if done:
                break
        if i % 100 == 0:
            print('Iteration #%d -- Total reward = %d.' %(i+1, total_reward))
    solution_policy = np.argmax(q_table, axis=2)
    solution_policy_scores = [run_episode(env, solution_policy, False) for _ in range(100)]
    print("Average score of solution = ", np.mean(solution_policy_scores))
    # Animate it
    run_episode(env, solution_policy, True)
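
The discretization done by `obs_to_state` can be tried on its own, without creating an environment. The sketch below assumes the usual MountainCar-v0 observation bounds (position in [-1.2, 0.6], velocity in [-0.07, 0.07]); the names `LOW`, `HIGH`, `N_STATES` and `to_bins` are hypothetical. It also clamps the index into the valid range, which the cell above does not do.

```python
# Assumed MountainCar-v0 observation bounds: (position, velocity)
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)
N_STATES = 40  # same bin count as n_states in the cell above

def to_bins(obs, low=LOW, high=HIGH, n_bins=N_STATES):
    """Map a (position, velocity) pair to bin indices in [0, n_bins - 1]."""
    indices = []
    for value, lo, hi in zip(obs, low, high):
        width = (hi - lo) / n_bins
        idx = int((value - lo) / width)
        # Clamp so a value exactly at the upper bound lands in the last bin
        indices.append(min(max(idx, 0), n_bins - 1))
    return tuple(indices)

print(to_bins((-1.2, -0.07)))  # lowest corner of the grid
print(to_bins((0.6, 0.07)))    # highest corner, clamped into the last bin
```

Without the clamp, an observation exactly at the upper bound would map to index `n_bins`, which is outside a table of shape `(n_bins, n_bins, 3)`.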



----- using Q Learning -----
Iteration #1 -- Total reward = -200.
Iteration #101 -- Total reward = -200.
Iteration #201 -- Total reward = -200.
Iteration #301 -- Total reward = -200.
Iteration #401 -- Total reward = -200.
Iteration #501 -- Total reward = -200.
Iteration #601 -- Total reward = -200.
Iteration #701 -- Total reward = -200.
Iteration #801 -- Total reward = -200.
Iteration #901 -- Total reward = -200.
Iteration #1001 -- Total reward = -200.
Iteration #1101 -- Total reward = -200.
Iteration #1201 -- Total reward = -200.
Iteration #1301 -- Total reward = -200.
Iteration #1401 -- Total reward = -200.
Iteration #1501 -- Total reward = -200.
Iteration #1601 -- Total reward = -200.
Iteration #1701 -- Total reward = -200.
Iteration #1801 -- Total reward = -200.
Iteration #1901 -- Total reward = -200.
Iteration #2001 -- Total reward = -200.
Iteration #2101 -- Total reward = -200.
Iteration #2201 -- Total reward = -200.
Iteration #2301 -- Total reward = -200.
Iteration #2401 -- Tota

If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.
See here for more information: https://www.gymlibrary.ml/content/api/
