# Assignment 5

The goal of this exercise is to implement the Q-learning algorithm.
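For reference, the tabular Q-learning update that `run_episode` further down in this notebook implements can be written as shown here. The symbols are notation introduced for this note and do not come from the task statement: $\alpha$ corresponds to `learning_rate`, $\gamma$ to `gamma`, $r$ is the reward received after taking action $a$ in state $s$, and $s'$ is the resulting next state.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the quantity called `delta` in the code.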

Next, create an agent that solves the [Taxi](https://gymnasium.farama.org/environments/toy_text/taxi/) problem. The environment is available in the **gymnasium** package (the maintained successor of **gym**).

Scoring (max 7 pts):
- Implementation of the Q-learning algorithm. [3 pts]
- Experiments with different hyperparameter values. [2 pts]
- Code quality. [1 pt]
- Conclusions. [1 pt]


In [8]:
import numpy as np
import gymnasium as gym

In [9]:
class QLearningSolver:
    """Class containing the Q-learning algorithm that might be used for different discrete environments."""

    def __init__(
        self,
        observation_space: int,
        action_space: int,
        learning_rate: float = 0.9,
        gamma: float = 0.9,
        epsilon: float = 0.1,
        q_table: np.ndarray = None,
    ):
        self.observation_space = observation_space
        self.action_space = action_space
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        if q_table is None:
            self.q_table = np.zeros(shape=(observation_space, action_space))
        else:
            self.q_table = q_table

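    def __call__(self, state: int, action: int) -> float:
        """Return Q-value of given state and action."""
        return self.q_table[state][action]

    def update(self, state: int, action: int, reward: float) -> None:
        """Add an increment to the Q-value of given state and action.

        Despite its name, `reward` is the already scaled temporal-difference
        increment (learning_rate * delta) computed by the caller.
        """
        self.q_table[state][action] += reward

    def get_best_action(self, state: int) -> int:
        """Return action that maximizes Q-value for a given state."""
        return int(np.argmax(self.q_table[state]))

    def get_best_move_evaluation(self, state: int) -> float:
        """Return the highest Q-value available in a given state."""
        return float(np.max(self.q_table[state]))

    def __repr__(self) -> str:
        """Elegant representation of Q-learning solver."""
        # __repr__ must return a string; returning None makes str(solver) raise.
        return (
            f"{type(self).__name__}("
            f"observation_space={self.observation_space}, "
            f"action_space={self.action_space}, "
            f"learning_rate={self.learning_rate}, "
            f"gamma={self.gamma}, "
            f"epsilon={self.epsilon})"
        )

    def __str__(self) -> str:
        return self.__repr__()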

In [10]:
def run_episode(solver: QLearningSolver, environment):
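    """
    Run a single training episode.

    The episode starts in a random state. At each step the agent acts
    epsilon-greedily: with probability epsilon it performs a random action,
    otherwise it takes the best action according to the q_table.
    It then observes the reward of the chosen action and updates the q_table.
    """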
    state = environment.reset()[0]
    terminated, truncated = False, False
    while not terminated and not truncated:
        if np.random.random() < solver.epsilon:
            action = environment.action_space.sample()
        else:
            action = solver.get_best_action(state)

        next_state, reward, terminated, truncated, _ = environment.step(action)
        delta = (
            reward
            + solver.gamma * solver.get_best_move_evaluation(next_state)
            - solver(state, action)
        )
        solver.update(state, action, solver.learning_rate * delta)
        state = next_state


def q_learning(
    environment, learning_rate=0.9, gamma=0.9, epsilon=0.1, number_of_episodes=1000
):
    """
    Train a QLearningSolver with the specified hyperparameters for
    `number_of_episodes` episodes and return the trained solver.
    """
    solver = QLearningSolver(
        environment.observation_space.n,
        environment.action_space.n,
        learning_rate,
        gamma,
        epsilon,
    )
    for _ in range(number_of_episodes):
        run_episode(solver, environment)
    return solver

In [11]:
def test_solver(solver: QLearningSolver, environment, number_of_tests: int = 1000):
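    """
    Evaluate a trained QLearningSolver using a purely greedy policy.

    Runs `number_of_tests` episodes from random start states and counts an
    episode as a success when it terminates with a positive reward.
    Returns the success rate and the average number of steps per episode
    (failed episodes are included in the average).
    """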
    successes = 0
    total_steps = 0
    for _ in range(number_of_tests):
        state = environment.reset()[0]
        terminated, truncated = False, False
        steps = 0
        while not terminated and not truncated:
            action = solver.get_best_action(state)
            next_state, reward, terminated, truncated = environment.step(action)[:4]
            state = next_state
            steps += 1
        if terminated and reward > 0:
            successes += 1
        total_steps += steps
    return successes / number_of_tests, total_steps / number_of_tests

In [12]:
def run_experiment(
    env, learning_rate=0.9, gamma=0.9, epsilon=0.1, number_of_episodes=1000
):
    solver = q_learning(env, learning_rate, gamma, epsilon, number_of_episodes)
    success_rate, average_number_of_steps = test_solver(solver, env)
    print("LEARNING:")
    print(
        f"Learning rate: {learning_rate}, gamma: {gamma}, epsilon: {epsilon}, number of episodes: {number_of_episodes}"
    )
    print("TESTING:")
    print(
        f"Success Rate: {success_rate}\nAverage number of steps: {average_number_of_steps}"
    )

# Tests

In [13]:
env = gym.make("Taxi-v3")

In [14]:
run_experiment(env)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.1, number of episodes: 1000
TESTING:
Success Rate: 0.977
Average number of steps: 17.47


In [15]:
run_experiment(env, number_of_episodes=800)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.1, number of episodes: 800
TESTING:
Success Rate: 0.933
Average number of steps: 25.681


In [16]:
run_experiment(env, number_of_episodes=600)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.1, number of episodes: 600
TESTING:
Success Rate: 0.899
Average number of steps: 31.923


In [17]:
run_experiment(env, number_of_episodes=1500)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.1, number of episodes: 1500
TESTING:
Success Rate: 0.998
Average number of steps: 13.55


In [18]:
run_experiment(env, gamma=0.7)

LEARNING:
Learning rate: 0.9, gamma: 0.7, epsilon: 0.1, number of episodes: 1000
TESTING:
Success Rate: 0.958
Average number of steps: 20.9


In [19]:
run_experiment(env, gamma=0.7, number_of_episodes=1500)

LEARNING:
Learning rate: 0.9, gamma: 0.7, epsilon: 0.1, number of episodes: 1500
TESTING:
Success Rate: 0.991
Average number of steps: 14.84


In [20]:
run_experiment(env, learning_rate=0.6)

LEARNING:
Learning rate: 0.6, gamma: 0.9, epsilon: 0.1, number of episodes: 1000
TESTING:
Success Rate: 0.973
Average number of steps: 18.019


In [21]:
run_experiment(env, learning_rate=0.8, epsilon=0.5)

LEARNING:
Learning rate: 0.8, gamma: 0.9, epsilon: 0.5, number of episodes: 1000
TESTING:
Success Rate: 0.991
Average number of steps: 14.724


In [22]:
run_experiment(env, learning_rate=0.5, epsilon=0.5)

LEARNING:
Learning rate: 0.5, gamma: 0.9, epsilon: 0.5, number of episodes: 1000
TESTING:
Success Rate: 0.97
Average number of steps: 18.644


In [23]:
run_experiment(env, learning_rate=0.1, epsilon=0.5)

LEARNING:
Learning rate: 0.1, gamma: 0.9, epsilon: 0.5, number of episodes: 1000
TESTING:
Success Rate: 0.422
Average number of steps: 120.183


In [41]:
run_experiment(env, learning_rate=0.1, epsilon=0.5, number_of_episodes=2500)

LEARNING:
Learning rate: 0.1, gamma: 0.9, epsilon: 0.5, number of episodes: 2500
TESTING:
Success Rate: 0.955
Average number of steps: 21.239


In [24]:
run_experiment(env, learning_rate=0.9, epsilon=0.5)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.5, number of episodes: 1000
TESTING:
Success Rate: 0.992
Average number of steps: 14.555


In [25]:
run_experiment(env, learning_rate=0.9, gamma=0.9, epsilon=0.25)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.25, number of episodes: 1000
TESTING:
Success Rate: 0.989
Average number of steps: 15.319


In [26]:
run_experiment(env, learning_rate=0.2, gamma=0.9, epsilon=0.4, number_of_episodes=2500)

LEARNING:
Learning rate: 0.2, gamma: 0.9, epsilon: 0.4, number of episodes: 2500
TESTING:
Success Rate: 0.999
Average number of steps: 13.187


In [27]:
run_experiment(env, learning_rate=0.2, gamma=0.9, epsilon=0.4)

LEARNING:
Learning rate: 0.2, gamma: 0.9, epsilon: 0.4, number of episodes: 1000
TESTING:
Success Rate: 0.813
Average number of steps: 47.504


In [28]:
run_experiment(env, learning_rate=0.8, gamma=0.4, epsilon=0.4)

LEARNING:
Learning rate: 0.8, gamma: 0.4, epsilon: 0.4, number of episodes: 1000
TESTING:
Success Rate: 0.871
Average number of steps: 36.931


In [29]:
run_experiment(env, learning_rate=0.8, gamma=0.4, epsilon=0.7)

LEARNING:
Learning rate: 0.8, gamma: 0.4, epsilon: 0.7, number of episodes: 1000
TESTING:
Success Rate: 0.969
Average number of steps: 18.758


In [30]:
run_experiment(env, learning_rate=0.9, gamma=0.4, epsilon=0.4)

LEARNING:
Learning rate: 0.9, gamma: 0.4, epsilon: 0.4, number of episodes: 1000
TESTING:
Success Rate: 0.925
Average number of steps: 26.861


In [31]:
run_experiment(env, learning_rate=0.99, gamma=0.8, epsilon=0.2, number_of_episodes=1000)

LEARNING:
Learning rate: 0.99, gamma: 0.8, epsilon: 0.2, number of episodes: 1000
TESTING:
Success Rate: 0.982
Average number of steps: 16.543


In [32]:
run_experiment(env, learning_rate=0.7, gamma=0.8, epsilon=0.2, number_of_episodes=1000)

LEARNING:
Learning rate: 0.7, gamma: 0.8, epsilon: 0.2, number of episodes: 1000
TESTING:
Success Rate: 0.97
Average number of steps: 18.674


In [34]:
run_experiment(env, learning_rate=0.9, gamma=0.9, epsilon=0.3, number_of_episodes=500)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.3, number of episodes: 500
TESTING:
Success Rate: 0.858
Average number of steps: 39.273


In [35]:
run_experiment(env, learning_rate=0.9, gamma=0.9, epsilon=0.3, number_of_episodes=300)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.3, number of episodes: 300
TESTING:
Success Rate: 0.699
Average number of steps: 68.855


In [39]:
run_experiment(env, learning_rate=0.9, gamma=0.9, epsilon=0.3, number_of_episodes=200)

LEARNING:
Learning rate: 0.9, gamma: 0.9, epsilon: 0.3, number of episodes: 200
TESTING:
Success Rate: 0.355
Average number of steps: 133.094


In [33]:

solver = q_learning(
    env, learning_rate=0.9, gamma=0.9, epsilon=0.4, number_of_episodes=1000
)
np.save("solver", solver.q_table)

# Conclusions

The effectiveness of the Q-learning algorithm depends heavily on the complexity of the environment and on an appropriate choice of hyperparameters.
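Because each experiment above is a single run, one hedged way to compare hyperparameters more systematically is to repeat every configuration several times and rank by the mean score. The sketch below is only an illustration: `sweep` and `dummy_evaluate` are hypothetical names that do not appear in the notebook, and `dummy_evaluate` returns an arbitrary stand-in score rather than a real result.

```python
import itertools
import statistics
from typing import Callable, Dict, Iterable, List, Tuple


def sweep(
    evaluate: Callable[[float, float, float], float],
    learning_rates: Iterable[float],
    gammas: Iterable[float],
    epsilons: Iterable[float],
    repeats: int = 3,
) -> List[Tuple[Dict[str, float], float, float]]:
    """Evaluate every hyperparameter combination `repeats` times.

    Returns (params, mean score, standard deviation) tuples,
    sorted by mean score with the best configuration first.
    """
    results = []
    for lr, gamma, eps in itertools.product(learning_rates, gammas, epsilons):
        scores = [evaluate(lr, gamma, eps) for _ in range(repeats)]
        # stdev needs at least two samples
        spread = statistics.stdev(scores) if repeats > 1 else 0.0
        params = {"learning_rate": lr, "gamma": gamma, "epsilon": eps}
        results.append((params, statistics.mean(scores), spread))
    return sorted(results, key=lambda item: item[1], reverse=True)


def dummy_evaluate(learning_rate: float, gamma: float, epsilon: float) -> float:
    """Hypothetical stand-in score; replace with a real train-and-test run."""
    return learning_rate * gamma - abs(epsilon - 0.1)


ranking = sweep(dummy_evaluate, [0.1, 0.9], [0.4, 0.9], [0.1, 0.5])
best_params, best_mean, best_std = ranking[0]
print(best_params, round(best_mean, 3), round(best_std, 3))
```

In this notebook, `evaluate` could presumably be wired to the existing functions, for example `lambda lr, g, e: test_solver(q_learning(env, lr, g, e), env)[0]`, so that the ranking reflects the measured success rate.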

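If epsilon is too small, the state space is not explored sufficiently. If epsilon is too large, the knowledge gained so far is exploited only to a small extent, so during training the algorithm may fail to reach the terminal state in a reasonable time. In the experiments above, where testing is purely greedy, fairly large values did not hurt: with learning rate 0.9 and gamma 0.9, epsilon 0.5 reached a success rate of 0.992 after 1000 episodes, compared with 0.977 for epsilon 0.1 (single runs, so the difference should be read with caution).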
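To make the trade-off concrete, the following sketch computes how often an epsilon-greedy agent actually follows its current best action. It assumes the random branch samples uniformly over all actions (as `environment.action_space.sample()` does for a discrete space) and that the greedy action is unique; `greedy_action_probability` is a helper name introduced here for illustration.

```python
def greedy_action_probability(epsilon: float, number_of_actions: int) -> float:
    """Probability that an epsilon-greedy agent picks the greedy action.

    With probability (1 - epsilon) the agent acts greedily; otherwise it
    samples uniformly, which still hits the greedy action with chance 1/n.
    """
    return (1.0 - epsilon) + epsilon / number_of_actions


# Taxi has 6 discrete actions.
for epsilon in (0.1, 0.25, 0.5, 0.7):
    print(epsilon, round(greedy_action_probability(epsilon, 6), 3))
```

With epsilon 0.7 the agent follows its own estimate in fewer than half of the steps, which is why training episodes get longer as epsilon grows.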

A learning rate that is too large means that what the algorithm learned in early episodes can be overwritten in later episodes, which lowers effectiveness. A smaller learning rate, on the other hand, requires more training episodes to reach satisfactory results: with learning rate 0.1 and epsilon 0.5, the success rate was 0.422 after 1000 episodes and 0.955 after 2500.
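The effect of the learning rate can be illustrated on a single repeated transition. This is a minimal sketch with made-up numbers, not code from the notebook: `td_update` is a hypothetical helper that applies the same rule as `run_episode`, moving the old estimate toward the target `reward + gamma * best_next_value` by a fraction `learning_rate`.

```python
def td_update(q_value, reward, best_next_value, learning_rate, gamma):
    """Return the new Q-value after one Q-learning step."""
    target = reward + gamma * best_next_value
    return q_value + learning_rate * (target - q_value)


# Same transition seen three times: old estimate 0.0, target 20.0.
for learning_rate in (0.1, 0.5, 0.9):
    q = 0.0
    for _ in range(3):
        q = td_update(
            q,
            reward=20.0,
            best_next_value=0.0,
            learning_rate=learning_rate,
            gamma=0.9,
        )
    print(learning_rate, round(q, 3))
```

A small learning rate needs many visits to approach the target, while a large one nearly replaces the old estimate on every visit, which is also why it can overwrite earlier knowledge when targets are noisy.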

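Gamma is the discount factor: it determines how strongly future rewards count relative to immediate ones. A large value means that larger rewards are preferred even if obtaining them takes more effort (a preference for long-term rewards). A gamma that is too small makes the agent short-sighted, because the reward for a successful drop-off barely propagates back to earlier states; in the experiments above, with the default learning rate and epsilon, gamma 0.7 gave a success rate of 0.958 after 1000 episodes versus 0.977 for gamma 0.9 (single runs). Gamma does not directly control the balance between exploration and exploitation, which is the role of epsilon.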
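A short worked example shows how gamma changes the value of a delayed reward. It assumes Taxi's reward scheme of -1 per step and +20 for a successful drop-off; the 10-step episode is hypothetical and `discounted_return` is a helper introduced only for this illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(reward * gamma**t for t, reward in enumerate(rewards))


# Hypothetical 10-step episode: nine -1 step penalties, then +20 for drop-off.
episode_rewards = [-1.0] * 9 + [20.0]
for gamma in (0.4, 0.7, 0.9):
    print(gamma, round(discounted_return(episode_rewards, gamma), 3))
```

With gamma 0.9 the discounted return of this episode is positive, whereas with gamma 0.4 the final +20 is discounted so heavily that the return is dominated by the step penalties.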

In complex environments it may help to decrease the epsilon and learning rate hyperparameters as knowledge about the environment grows, and also to decrease gamma in the later phases of each episode. Decreasing epsilon lets the algorithm exploit the knowledge it has gained, so it does not have to learn the same paths several times. Decreasing the learning rate keeps the algorithm from forgetting information gained in earlier episodes. Gradually decreasing gamma within each episode may favor quick rewards in the later parts of the episode, when the algorithm has little time left to pursue long-term rewards.
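One possible way to decrease epsilon and the learning rate over training is an exponential schedule with a lower bound. The sketch below is an assumption, not something evaluated in this notebook: the function name `exponential_decay` and all constants are illustrative, and the within-episode gamma schedule mentioned above is not covered.

```python
def exponential_decay(
    initial: float, minimum: float, decay: float, episode: int
) -> float:
    """Value after `episode` episodes: multiplied by `decay` once per
    episode and never allowed to drop below `minimum`."""
    return max(minimum, initial * decay**episode)


for episode in (0, 100, 500, 1000):
    epsilon = exponential_decay(
        initial=0.5, minimum=0.01, decay=0.995, episode=episode
    )
    learning_rate = exponential_decay(
        initial=0.9, minimum=0.1, decay=0.998, episode=episode
    )
    print(episode, round(epsilon, 3), round(learning_rate, 3))
```

In the training loop of `q_learning`, the scheduled values could be assigned to `solver.epsilon` and `solver.learning_rate` before each call to `run_episode`.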