Q1. Implement the Q-learning algorithm using an OpenAI Gym environment. The smart cab's job is to pick up the passenger at one location and drop them off at another.
The agent should receive a high positive reward for a successful drop-off, because this
behavior is highly desired.
The agent should be penalized if it tries to drop off a passenger at a wrong location.
The agent should get a slight negative reward at every time-step in which it has not yet
reached the destination.
The passenger can be in one of the four possible locations: R, G, Y, B, which are
represented in row, column coordinates as (0,0), (0,4), (4,0), (4,3) respectively.
Additionally, we need to consider a fifth state where the passenger is already inside the
taxi. Therefore, the number of possible states for the passenger's location is 5.
The destination can be one of the four possible locations: R, G, Y, B, which are also
represented in row, column coordinates. Therefore, the number of possible states for the
destination is 4.
We have six possible actions:
1. south
2. north
3. east
4. west
5. pickup
6. drop off
Implement the above problem using the Gym environment called Taxi-v2 (the code below uses Taxi-v3, because Taxi-v2 is no longer registered in recent Gym releases).
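
The state count implied by the description above can be checked with a short sketch. It assumes a 5x5 grid (the question does not state the grid size, but the coordinates run from 0 to 4) and one particular ordering of the state components: taxi row, taxi column, passenger location, destination. The helper names encode_state and decode_state are hypothetical and are not part of the Gym API.

```python
# Sketch: counting and indexing Taxi states (assumes a 5x5 grid).
GRID_ROWS, GRID_COLS = 5, 5      # assumption: grid size implied by the coordinates
NUM_PASSENGER_LOCATIONS = 5      # R, G, Y, B, or inside the taxi
NUM_DESTINATIONS = 4             # R, G, Y, B

NUM_STATES = GRID_ROWS * GRID_COLS * NUM_PASSENGER_LOCATIONS * NUM_DESTINATIONS


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Map the four state components to a single integer index."""
    index = taxi_row
    index = index * GRID_COLS + taxi_col
    index = index * NUM_PASSENGER_LOCATIONS + passenger_loc
    index = index * NUM_DESTINATIONS + destination
    return index


def decode_state(index):
    """Inverse of encode_state: recover the four components."""
    index, destination = divmod(index, NUM_DESTINATIONS)
    index, passenger_loc = divmod(index, NUM_PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(index, GRID_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


print(NUM_STATES)                  # 500
print(encode_state(3, 1, 2, 0))    # 328
print(decode_state(328))           # (3, 1, 2, 0)
```

With 25 taxi positions, 5 passenger locations and 4 destinations, the Q table needs 500 rows, one per state, and 6 columns, one per action.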


In [26]:
!pip install numpy
!pip install gym



In [27]:
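import numpy as np
import gym

env = gym.make('Taxi-v3')

# Hyperparameters
alpha = 0.1          # learning rate
gamma = 0.6          # discount factor
epsilon = 0.1        # exploration probability
num_episodes = 10000


def reset_env(env):
    # Gym < 0.26 returns the state; Gym >= 0.26 returns (state, info).
    result = env.reset()
    return result[0] if isinstance(result, tuple) else result


def step_env(env, action):
    # Gym < 0.26 returns 4 values; Gym >= 0.26 returns 5, with `done`
    # split into `terminated` and `truncated`.
    result = env.step(action)
    if len(result) == 5:
        next_state, reward, terminated, truncated, info = result
        return next_state, reward, terminated or truncated, info
    return result


# Q table: one row per state, one column per action.
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))

# Training with an epsilon-greedy policy.
for episode in range(num_episodes):
    state = reset_env(env)
    done = False
    total_reward = 0

    while not done:
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()    # explore
        else:
            action = np.argmax(Q[state, :])       # exploit

        next_state, reward, done, info = step_env(env, action)
        # Q-learning update: move Q[state, action] towards the TD target.
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state, :]) - Q[state, action])
        total_reward += reward
        state = next_state
    print("Episode:", episode + 1, "Total Reward:", total_reward)

# Evaluation: act greedily with respect to the learned Q table.
num_test_episodes = 100
total_rewards = []

for _ in range(num_test_episodes):
    state = reset_env(env)
    done = False
    total_reward = 0

    while not done:
        action = np.argmax(Q[state, :])
        next_state, reward, done, info = step_env(env, action)
        total_reward += reward
        state = next_state
    total_rewards.append(total_reward)
average_reward = sum(total_rewards) / num_test_episodes
print("Average Reward:", average_reward)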

Streaming output truncated to the last 5000 lines.
Episode: 5002 Total Reward: -16
Episode: 5003 Total Reward: 10
Episode: 5004 Total Reward: -8
Episode: 5005 Total Reward: 6
Episode: 5006 Total Reward: -3
Episode: 5007 Total Reward: -27
Episode: 5008 Total Reward: -18
Episode: 5009 Total Reward: 3
Episode: 5010 Total Reward: 10
Episode: 5011 Total Reward: 11
Episode: 5012 Total Reward: -5
Episode: 5013 Total Reward: 0
Episode: 5014 Total Reward: 9
Episode: 5015 Total Reward: 6
Episode: 5016 Total Reward: 9
Episode: 5017 Total Reward: 8
Episode: 5018 Total Reward: -4
Episode: 5019 Total Reward: 10
Episode: 5020 Total Reward: -4
Episode: 5021 Total Reward: 10
Episode: 5022 Total Reward: -20
Episode: 5023 Total Reward: -1
Episode: 5024 Total Reward: 10
Episode: 5025 Total Reward: 11
Episode: 5026 Total Reward: -6
Episode: 5027 Total Reward: 10
Episode: 5028 Total Reward: -7
Episode: 5029 Total Reward: -9
Episode: 5030 Total Reward: 3
Episode: 5031 Total Reward: 5
Episode: 5
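
The update applied inside the training loop above is Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)). A minimal sketch of a single update, using the same alpha = 0.1 and gamma = 0.6, may make the arithmetic easier to follow. The function name q_update and the 2-state, 2-action table are hypothetical and serve only as an illustration.

```python
import numpy as np


def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.6):
    """Apply one tabular Q-learning update in place and return the new value."""
    td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
    return Q[state, action]


# Hypothetical 2-state, 2-action table used only for illustration.
Q = np.zeros((2, 2))
Q[1] = [2.0, 5.0]

# TD target = -1 + 0.6 * 5 = 2; TD error = 2 - 0 = 2; new value = 0 + 0.1 * 2.
new_value = q_update(Q, state=0, action=1, reward=-1, next_state=1)
print(new_value)   # approximately 0.2
```

Only the entry for the visited state-action pair changes; every other entry of the table is left untouched.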

Q2. A well-equipped, state-of-the-art hospital monitors patient data remotely using AI
devices. Each patient has several sensors deployed on his/her body to acquire values such
as sugar level, blood pressure, etc. Transmitting data day and night would ideally require
an infinite lifetime, which is nearly impossible because the deployed devices are DC
powered. To increase the network lifetime, we can apply AI techniques such as reinforcement
learning, measuring the nearest path from the cluster head to the sink. Develop such an
algorithm to increase the network lifetime to the maximum extent using reinforcement
learning. Your code must include the cost factor when calculating the reward function for
the next state.
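
One possible reading of the cost-factor requirement is a reward that subtracts a weighted transmission cost for each hop and adds a bonus once the data reaches the sink. The sketch below is an assumption, not a formula prescribed by the question: hop_reward, hop_cost, cost_weight and sink_bonus are hypothetical names and values.

```python
def hop_reward(hop_cost, reached_sink, cost_weight=1.0, sink_bonus=10.0):
    """Reward for one hop: negative weighted cost, plus a bonus at the sink."""
    reward = -cost_weight * hop_cost
    if reached_sink:
        reward += sink_bonus
    return reward


print(hop_reward(hop_cost=3.0, reached_sink=False))   # -3.0
print(hop_reward(hop_cost=2.0, reached_sink=True))    # 8.0
```

A larger cost_weight makes the agent avoid expensive links more strongly, which is the lever that ties the reward to network lifetime in this sketch.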

In [28]:
import numpy as np

class HospitalNetwork:
    def __init__(self, num_nodes, num_actions):
        self.num_nodes = num_nodes
        self.num_actions = num_actions
        # One Q-value per (first node, second node, action) triple.
        self.Q_table = np.zeros((num_nodes, num_nodes, num_actions))
        self.alpha = 0.1    # learning rate
        self.gamma = 0.9    # discount factor
        self.epsilon = 0.1  # exploration probability

    def select_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.uniform(0, 1) < self.epsilon:
            return np.random.randint(self.num_actions)
        else:
            return np.argmax(self.Q_table[state[0], state[1]])

    def update_Q_table(self, state, action, reward, next_state):
        # Standard Q-learning update towards reward + gamma * best next value.
        best_next = np.max(self.Q_table[next_state[0], next_state[1]])
        td_error = reward + self.gamma * best_next - self.Q_table[state[0], state[1], action]
        self.Q_table[state[0], state[1], action] += self.alpha * td_error

    def train(self, num_episodes, max_steps):
        for episode in range(num_episodes):
            # A state is a pair of node indices; the first stays fixed for the episode.
            state = (np.random.randint(self.num_nodes), np.random.randint(self.num_nodes))
            total_reward = 0

            for step in range(max_steps):
                action = self.select_action(state)
                # Note: the next node is sampled at random, independently of the action.
                next_state = (state[0], np.random.randint(self.num_nodes))
                # Cost factor: index distance of the next node from node 0 (which plays
                # the role of the sink here), negated so a larger cost lowers the reward.
                reward = -abs(next_state[1] - 0)
                total_reward += reward

                self.update_Q_table(state, action, reward, next_state)
                state = next_state
            print("Episode:", episode, "Total Reward:", total_reward)

num_nodes = 10
num_actions = 4
num_episodes = 100
max_steps = 10

hospital_network = HospitalNetwork(num_nodes, num_actions)
hospital_network.train(num_episodes, max_steps)

Episode: 0 Total Reward: -38
Episode: 1 Total Reward: -46
Episode: 2 Total Reward: -51
Episode: 3 Total Reward: -39
Episode: 4 Total Reward: -56
Episode: 5 Total Reward: -49
Episode: 6 Total Reward: -42
Episode: 7 Total Reward: -42
Episode: 8 Total Reward: -38
Episode: 9 Total Reward: -63
Episode: 10 Total Reward: -52
Episode: 11 Total Reward: -36
Episode: 12 Total Reward: -45
Episode: 13 Total Reward: -55
Episode: 14 Total Reward: -46
Episode: 15 Total Reward: -34
Episode: 16 Total Reward: -65
Episode: 17 Total Reward: -30
Episode: 18 Total Reward: -55
Episode: 19 Total Reward: -41
Episode: 20 Total Reward: -55
Episode: 21 Total Reward: -33
Episode: 22 Total Reward: -50
Episode: 23 Total Reward: -34
Episode: 24 Total Reward: -35
Episode: 25 Total Reward: -54
Episode: 26 Total Reward: -30
Episode: 27 Total Reward: -54
Episode: 28 Total Reward: -49
Episode: 29 Total Reward: -37
Episode: 30 Total Reward: -40
Episode: 31 Total Reward: -50
Episode: 32 Total Reward: -37
Episode: 33 Total Re
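
In the training loop above the next node is sampled at random, so the chosen action never influences the transition and the total reward shows no trend across episodes. A minimal sketch of a variant in which the action is the next hop, and the reward carries the link cost as its cost factor, is given here. Everything in it is an assumption made for illustration: the 5-node topology, the LINK_COST values, the choice of node 0 as sink and node 4 as cluster head, the uniformly random behaviour policy, and the function names train_routing and greedy_route.

```python
import numpy as np

# Hypothetical 5-node network: node 0 is the sink, node 4 the cluster head.
# LINK_COST[i, j] is the assumed cost of sending from i to j; inf = no link.
INF = np.inf
LINK_COST = np.array([
    [INF, 1.0, 5.0, INF, INF],
    [1.0, INF, 1.0, 5.0, INF],
    [5.0, 1.0, INF, 1.0, 4.0],
    [INF, 5.0, 1.0, INF, 1.0],
    [INF, INF, 4.0, 1.0, INF],
])
SINK, CLUSTER_HEAD = 0, 4


def train_routing(link_cost, sink, episodes=2000, max_steps=20,
                  alpha=0.5, gamma=0.9, cost_weight=1.0, seed=0):
    """Learn Q[node, next_hop] with a uniformly random behaviour policy."""
    rng = np.random.default_rng(seed)
    num_nodes = link_cost.shape[0]
    # Missing links get -inf so a greedy argmax never selects them.
    Q = np.where(np.isfinite(link_cost), 0.0, -np.inf)
    for _ in range(episodes):
        node = int(rng.integers(num_nodes))
        for _ in range(max_steps):
            if node == sink:
                break
            neighbours = np.flatnonzero(np.isfinite(link_cost[node]))
            next_hop = int(rng.choice(neighbours))
            # Cost factor: the reward is the negative weighted link cost.
            reward = -cost_weight * link_cost[node, next_hop]
            # The sink is terminal, so nothing is bootstrapped from it.
            future = 0.0 if next_hop == sink else np.max(Q[next_hop])
            Q[node, next_hop] += alpha * (reward + gamma * future - Q[node, next_hop])
            node = next_hop
    return Q


def greedy_route(Q, start, sink, max_hops=10):
    """Follow the highest-valued next hop from start until the sink."""
    route = [start]
    while route[-1] != sink and len(route) <= max_hops:
        route.append(int(np.argmax(Q[route[-1]])))
    return route


Q = train_routing(LINK_COST, SINK)
print(greedy_route(Q, CLUSTER_HEAD, SINK))
```

In this topology the cheapest path from the cluster head to the sink is 4, 3, 2, 1, 0 with a total cost of 4, and the greedy route should settle on it once the Q-values have converged. Because the transition now depends on the action, the learned values encode which next hop keeps the transmission cost low.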