# Pavlov’s Conditioning

When you hear "AI," your mind might zoom straight to super advanced robots and futuristic tech, right? But hang tight, because the story actually begins way back in the **1800s** with a super curious guy named **Ivan Pavlov**.

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/bell_dog.png' width=700px>


Now, Pavlov was this Russian physiologist, and he had a big question on his mind: "**Why do dogs drool when they see food?**" To find out, he spent time with his furry friends, giving them food and observing their reactions. But he noticed something totally unexpected — the dogs started drooling even before the food was brought out!

**This sparked a new idea.** Pavlov began ringing a bell before feeding the dogs, and pretty soon, just the sound of the bell made them salivate, even if there was no food in sight. They'd learned to associate the bell with mealtime.

**It's a bit like how AI works, learning and adapting based on experiences.** And it all started with Pavlov, a physiologist intrigued by drooling dogs, laying the groundwork for today's smart tech. It's kind of cool to think about how this simple experiment with dogs has connections to the futuristic world of AI, don't you think?

# What is Reinforcement Learning?

Well, it's this neat branch of Machine Learning that essentially lets an AI entity learn on the go by interacting with its environment. It's kind of like learning through a series of actions and experiences, always adapting and evolving.

**Imagine this scenario:** You find yourself stranded on a deserted island, all alone. Naturally, panic hits first. But as reality sets in, your survival instincts kick in. **You start to figure out the basics:** where to find food, the safest spot to sleep, and identifying which plants are safe to eat and which ones to steer clear of.

Let's say you find a cozy shelter — that's a win, **a positive action reinforced.** But oh no, you ate something that didn't agree with your stomach; you learn quickly to avoid that food in the future. **Just like that, through trial and error**, you're learning to navigate your surroundings more efficiently with each passing day.

Well, reinforcement learning operates on this very principle. It's all about an **agent** — think of it as our AI explorer — who learns from interacting with its environment, constantly tweaking its approach based on the outcomes of its actions, whether good or bad, aiming to bag the maximum rewards and learn optimally from each situation. It's a continuous cycle of action, consequence, and adaptation to get better and smarter with each experience.





# Types of ML

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/Types%20of%20ML.png' width=700px>

[Image source](https://medium.com/geekculture/three-main-categories-of-machine-learning-with-examples-of-usage-41e2d136c66f)

**Supervised Learning:** This is a type of machine learning where the model is trained on a labeled dataset, which is a dataset where the "right answers" are provided during training. The model makes predictions or decisions based on input data and is corrected when its predictions are incorrect. Basically, it's "supervised" in the sense that the learning algorithm is guided towards finding the correct answer through the training data.

**Task-Driven:** This term refers to a system or approach being designed to accomplish a specific task or a set of tasks. In the context of machine learning, "task-driven" would imply that the learning is directed towards accomplishing specific predefined tasks.

So, when you combine the two terms, "**task-driven supervised learning**" refers to a scenario where supervised learning is applied in a manner that is directed towards achieving specific tasks. The learning is both supervised (guided through labeled data) and focused on optimizing performance for those specific tasks.
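To make "labeled data" a bit more concrete, here is a minimal sketch of supervised learning using a 1-nearest-neighbour rule. The fruit weights and labels are invented purely for illustration, and `predict` is a hypothetical helper name, not something from a library.

```python
# Labeled training data: (weight in grams, label). Made-up numbers for illustration.
training_data = [
    (5.0, "cherry"),
    (8.0, "cherry"),
    (150.0, "apple"),
    (180.0, "apple"),
]

def predict(weight):
    """Return the label of the training example whose weight is closest."""
    nearest_weight, nearest_label = min(
        training_data, key=lambda pair: abs(pair[0] - weight)
    )
    return nearest_label

print(predict(7.0))    # cherry
print(predict(160.0))  # apple
```

The "supervision" is the label attached to every training example: the model never has to guess what the right answer looks like, it only has to generalize from the answers it was given.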

**Unsupervised Learning:** This is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. The system aims to learn the underlying structure of the data without any supervision (i.e., without being provided with the correct answers). The goal is often to discover hidden patterns in the data. Common unsupervised learning approaches include clustering (grouping similar data points together) and association (finding rules that highlight relationships between seemingly independent data in a database).


**Data-Driven:** Being "data-driven" means that decisions, processes, or algorithms are developed and optimized based on available data. It signifies a reliance on data to make decisions and build strategies, rather than relying on intuition or observations alone.

When combining these terms, "**unsupervised learning and data-driven**" could refer to a scenario where an unsupervised learning approach is applied in a data-driven manner. This would mean utilizing unsupervised learning techniques to analyze and leverage large amounts of data to uncover hidden patterns or insights without relying on pre-defined labels or categories. The ultimate goal is to allow the data itself to guide the learning process and the insights derived from it, fostering a more organic exploration of underlying structures or relationships present in the data. It's about letting the data 'speak for itself' to discover new information or patterns.
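As a rough illustration of clustering, the sketch below runs a tiny one-dimensional k-means on unlabeled numbers. The data points and starting centers are assumptions chosen for the example, and `kmeans_1d` is a hypothetical helper written here, not a library function.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: alternate between assigning points and moving centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = sum(members) / len(members)
    return centers, clusters

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]  # unlabeled, made-up data
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)   # [1.5, 8.5]
print(clusters)  # [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
```

Nobody told the algorithm that there were a "small" group and a "large" group; the structure came out of the data alone.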


**Reinforcement Learning:** This is a type of machine learning paradigm where an agent learns how to behave in an environment by taking actions and receiving rewards or penalties in return. It's somewhat similar to how humans learn from their experiences. The agent learns to achieve a goal in an uncertain, potentially complex environment.

**Learning from Mistakes:** In the context of reinforcement learning, "learning from mistakes" refers to the agent improving its strategy over time through trial and error. Initially, the agent might make incorrect or suboptimal choices, but it uses the feedback (rewards or penalties) from these experiences to make better decisions in the future. This way, the agent iteratively refines its policy, aiming to maximize the cumulative reward.

When we combine these concepts, we're describing a learning process where an agent uses reinforcement learning to improve its policy over time, learning both from its successful actions (which earn it rewards) and its mistakes (which incur penalties). The agent aims to find the strategy that will yield the highest cumulative reward, effectively learning "what works" through a continuous process of trial and error, learning progressively from the mistakes it makes along the way. It's a dynamic learning process that encourages the agent to explore various strategies and learn the most efficient pathway to achieving its goal.




# But, how does it compare against supervised learning?

We could totally choose supervised learning instead of exploring reinforcement learning techniques. But here's the thing – we'd need an incredibly large dataset that records every imaginable action and its outcome. And, well, there's a cap to how much we can learn this way.

**Picture this:** if we base all the learning on the actions of the best player out there, the machine might become as great as them, but going beyond them? That won't be on the cards, since it won't know any strategies beyond what the player has shown. It's kind of a tricky situation.

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/super_rein.webp' width=700px>

[Table Source](https://medium.com/p/d2643ca39b51)

# So, how does it compare against unsupervised learning?

Well, in unsupervised learning, there isn't exactly a direct tie between input and output; it's more about picking up on patterns. But reinforcement learning? That's a different ball game; it learns straight from the results created by previous inputs.

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/unsupervised%2Breinforcement.webp' width=700px>

 [Table Source](https://medium.com/p/d2643ca39b51)




# So, would this be considered Deep Learning?

Deep learning definitely falls under the big roof of Machine Learning, and it is particularly well suited to complex problems.

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/deep_reinforc.webp' width=700px>

  [Image Source](https://medium.com/p/d2643ca39b51)


Taking a look at the Venn diagram, it lays out how different Machine Learning techniques relate to each other. And yep, based on the **Universal Approximation Theorem (UAT)**, **Neural Nets** can in principle approximate the function behind a very wide range of problems. But here's the thing - they aren't the go-to solution for every issue since **they munch on heaps of data and can be pretty tough to figure out**.

**The diagram makes it clear that not all reinforcement learning problems are a job for deep learning, breaking free from the notion that it's all about deep learning.**

**NOTE:**

The Universal Approximation Theorem (UAT) is kind of a big deal in the world of neural networks and artificial intelligence. Basically, it says that a neural network can approximate, or come really close to, any continuous function.

In simpler terms, give a neural network **enough hidden units and the right settings**, and it can represent, as closely as you like, the function that links the inputs to the outputs. The theorem only says that such settings exist; actually finding them still takes data and training. It's like saying that with the right recipe and enough ingredients, you can bake any kind of cake!

**It's a pretty cool concept that shows the amazing potential of neural networks.**
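Here is a small sketch of the idea behind the theorem, under stated assumptions: instead of training, it builds a one-hidden-layer ReLU network by hand so that it linearly interpolates `sin(x)` on the interval from 0 to pi, and then measures how the worst-case error shrinks as hidden units are added. The function names (`build_relu_net`, `max_error`) and the choice of target function are illustrative, not part of the theorem.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, a, b, hidden_units):
    """One-hidden-layer ReLU net that linearly interpolates f at evenly spaced knots."""
    knots = [a + (b - a) * i / hidden_units for i in range(hidden_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(hidden_units)]
    # The output weight of unit i is the change in slope at knot i.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, hidden_units)]
    bias = f(a)

    def net(x):
        # zip stops after the first `hidden_units` knots, one per hidden unit.
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))
    return net

def max_error(f, net, a, b, samples=1000):
    """Largest absolute gap between f and the network on a grid of test points."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

errors = {}
for units in (4, 16, 64):
    net = build_relu_net(math.sin, 0.0, math.pi, units)
    errors[units] = max_error(math.sin, net, 0.0, math.pi)
    print(units, errors[units])
```

More hidden units means more "kinks" available to bend the network's output around the target curve, which is why the error keeps dropping.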

# How does reinforcement learning work?

To get a handle on this, let's kick things off with a little scenario.
Picture this: you're about to play the Super Mario video game for the very first time, and you've just been handed the controller. Initially, you're not really sure what's what, so you begin by simply experimenting with different buttons to see how it all works.

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/super-mario.gif' width=400px>

Imagine pressing the right arrow and (ding!) you nab a coin, earning a +1 reward. Feeling lucky, you press right again but (oops!) you tumble into a pit, snagging a -1 penalty.

As you keep playing, a pattern emerges through good old trial and error. You begin to figure out that the name of the game is to grab as many coins as you can, avoid those pesky pits, and make it to the end of the level. And the best part? You're doing this all on your own, learning the ropes without any guidance. Every new game makes you a little bit better, a little bit smarter.

Reinforcement learning is just the computational approach to solving this problem. In this world, we create a **virtual agent** whose mission is to **explore its environment** through a **series of trial and error adventures**, just like you did with Super Mario. The ultimate goal? **To rack up the highest score by learning to choose the most rewarding actions over time.**
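The coin-and-pit story above can be sketched as a toy environment. Everything here is an assumption made for illustration: the one-dimensional `LEVEL`, the two actions, and the rule that a pit ends the game. The "player" just presses buttons at random, like a first-timer with the controller.

```python
import random

# A made-up one-dimensional level: each tile is empty, a coin, a pit, or the goal.
LEVEL = ["empty", "coin", "empty", "pit", "coin", "empty", "goal"]
ACTIONS = ["walk", "jump"]  # walk moves 1 tile, jump moves 2 tiles

def step(position, action):
    """Apply an action and return (new_position, reward, done)."""
    move = 1 if action == "walk" else 2
    new_position = min(position + move, len(LEVEL) - 1)
    tile = LEVEL[new_position]
    if tile == "coin":
        return new_position, 1, False
    if tile == "pit":
        return new_position, -1, True   # falling into a pit ends the game
    if tile == "goal":
        return new_position, 0, True
    return new_position, 0, False

def play_random_episode(rng):
    """Press buttons at random and add up the rewards collected."""
    position, total_reward, done = 0, 0, False
    while not done:
        action = rng.choice(ACTIONS)
        position, reward, done = step(position, action)
        total_reward += reward
    return total_reward

rng = random.Random(0)
scores = [play_random_episode(rng) for _ in range(5)]
print(scores)
```

A learning agent would replace the random choice with a rule that gradually favours the actions that led to coins and away from the ones that led to pits.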

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/mario2.png' width=700px>




# How Does Reinforcement Learning Function?

Let's focus on the dynamic world of Reinforcement Learning where the central players are the **'Agent'** and the **'Environment'**.

*   **Agent**

Think of the **Agent as our problem-solver**, a **computer program** designed to navigate a **series of decisions to reach a goal**. On the other hand, the Environment is essentially the stage or setting where all these decisions unfold, dictating the challenges the Agent needs to overcome.

To illustrate, let's talk chess: the **Agent is one of the players**, and the **Environment consists of the board and the opponent**.

*   **Interaction of Agent and Environment**

The two components are interdependent: the Agent adjusts its actions based on feedback from the Environment, and the Environment reacts to the Agent’s actions.


*   **State space**

The Environment is described by a set of variables relevant to the **decision-making problem**. **The set of all possible values these variables can take is the state space.** A **state** is one element of the state space, i.e. one particular combination of values.

**At each state**, the Environment **offers the Agent a set of actions**, from which it must choose one. The Agent tries to influence the Environment using these actions, and the **Environment may change state in response to the Agent’s actions**.


*   **Transition function**

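The transition function captures these associations: given the current state and the action the Agent chose, it specifies which state the Environment moves to next (or how likely each possible next state is).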


*   **Reward and Penalty**

As our Agent makes its moves, the Environment gives feedback in the form of rewards or penalties, **encouraging strategies that steer towards the goal and discouraging those that don't.** It's a game of trial and error, with the Agent continuously tweaking its approach to bag the **maximum rewards** and inch closer to victory.


*   **Training time**

Another thing that Reinforcement Learning requires is a lot of training time, because the rewards often aren’t disclosed to the Agent until the end of an episode (game). For example, if our computer plays chess against us and wins, it is rewarded (winning was the desired outcome), but it still needs to figure out which of its actions earned that reward. That can only be achieved with a ton of training time and data.
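The pieces listed above (state space, actions, transition function, and a reward that only shows up at the end of an episode) can be sketched with plain dictionaries. The toy problem below is hypothetical: the state names, action names, and rewards are invented just to show how the parts fit together.

```python
# Hypothetical toy problem: all names and numbers are illustrative only.
STATES = ["start", "middle", "win", "lose"]        # the state space
ACTIONS = {"start": ["safe", "risky"],             # actions offered in each state
           "middle": ["safe", "risky"]}

# Transition function: (state, action) -> next state (deterministic here).
TRANSITIONS = {
    ("start", "safe"): "middle",
    ("start", "risky"): "lose",
    ("middle", "safe"): "lose",
    ("middle", "risky"): "win",
}

# The reward is only revealed when the episode ends.
TERMINAL_REWARD = {"win": 1, "lose": -1}

def run_episode(policy):
    """Follow `policy` (a dict state -> action) until a terminal state is reached."""
    state, history = "start", []
    while state not in TERMINAL_REWARD:
        action = policy[state]
        next_state = TRANSITIONS[(state, action)]
        history.append((state, action, next_state))
        state = next_state
    return history, TERMINAL_REWARD[state]

history, reward = run_episode({"start": "safe", "middle": "risky"})
print(history)  # [('start', 'safe', 'middle'), ('middle', 'risky', 'win')]
print(reward)   # 1
```

Notice that the single reward at the end says nothing about which of the two moves deserved the credit. That is exactly why the Agent needs many episodes to sort it out.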



The Reinforcement Learning problem involves an **agent** exploring **an unknown environment** to achieve a goal. RL is based on the **hypothesis** that all goals can be described by the maximization of expected cumulative reward. The agent **must learn to sense and perturb the state of the environment** using its actions to derive **maximal reward**. The formal framework for RL borrows from the problem of optimal **control of Markov Decision Processes (MDP)**.

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/reinforcement-learningV1-02.webp' width=600px>

[Source](https://www.synopsys.com/ai/what-is-reinforcement-learning.html#:~:text=Definition,environment%20to%20obtain%20maximum%20reward.)

The main elements of an RL system are:

1. The agent or the learner
2. The environment the agent interacts with
3. The policy that the agent follows to take actions
4. The reward signal that the agent observes upon taking actions

A useful abstraction of the reward signal is the **value function**, which faithfully captures the **‘goodness’ of a state**.

While the **reward signal** represents the **immediate benefit of being in a certain state**, the value function captures the cumulative reward that is expected to be collected from that state on, going into the future.

The **objective of an RL algorithm** is to discover the **action policy** that maximizes the average value that it can extract from every state of the system.
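To see the difference between the reward signal and the value function, here is a minimal sketch on a made-up three-state chain (A leads to B, B leads to C, then the episode ends). The rewards, the discount factor of 0.9, and the recursive `value` helper are all assumptions for the example.

```python
GAMMA = 0.9                                   # assumed discount factor
REWARD = {"A": 0.0, "B": 0.0, "C": 10.0}      # immediate reward of each state
NEXT_STATE = {"A": "B", "B": "C", "C": None}  # None marks the end of the episode

def value(state):
    """V(s) = R(s) + gamma * V(next state), with V = 0 once the episode is over."""
    if state is None:
        return 0.0
    return REWARD[state] + GAMMA * value(NEXT_STATE[state])

for s in REWARD:
    print(s, "reward:", REWARD[s], "value:", round(value(s), 2))
```

State A has an immediate reward of 0, yet its value works out to about 8.1, because being in A puts the agent on the road to the big reward in C. That look-ahead is what the value function adds on top of the reward signal.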

# Two main types of reinforcement learning algorithms

## Model-free algorithms

In reinforcement learning (RL), there are two main ways algorithms figure out the best actions to take: using models or not using them. Let's break down each category:

 <img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/reinforcement-learningV1-01.webp' width=600px>


1. **Model-free algorithms:** "Model-free algorithms do not build an **explicit model of the environment**, or more rigorously, the MDP. They are closer to **trial-and-error algorithms** that run experiments with the environment using actions and **derive the optimal policy** from it directly."

In other words, these don't use a detailed blueprint of the environment they're in. Instead, they learn by doing, trying different actions and seeing what happens. They can go about this in two ways:

   *   **Value-based:** "Value-based algorithms consider **optimal policy** to be a **direct result** of estimating the value function of every state accurately. Using a **recursive relation** described by the **Bellman equation**, the agent interacts with the environment to sample trajectories of states and rewards. Given enough trajectories, the value function of the MDP can be estimated. Once the value function is known, discovering the optimal policy is simply a matter of acting greedily with respect to the value function at every state of the process. Some popular value-based algorithms are **SARSA and Q-learning**."
   
   Here, the algorithms work out the best actions by first understanding the 'value' of each possible situation they can find themselves in. They use a set of equations called the Bellman equations to help them learn from their experiences and gradually figure out the values. Once they know these, they always choose the action that leads to the highest value. Some well-known algorithms of this type are SARSA and Q-learning.

   *   **Policy-based:** "Policy-based algorithms, on the other hand, directly estimate the optimal policy without modeling the value function. By parametrizing the policy directly using learnable weights, they render the learning problem into an explicit optimization problem. Like value-based algorithms, the agent samples trajectories of states and rewards; however, this information is used to explicitly improve the policy by maximizing the average value function across all states. Popular policy-based RL algorithms include Monte Carlo policy gradient (REINFORCE) and deterministic policy gradient (DPG). Policy-based approaches suffer from a high variance which manifests as instabilities during the training process."
   
   These algorithms are a bit different because they don't bother working out the value of different situations. Instead, they focus on finding the best policy, which means the best set of actions to take. They tweak and improve their policy over time using the feedback from the actions they take. But these algorithms can sometimes have a tough time because they might experience a lot of ups and downs during the learning process.
   
   

2. **Model-based algorithms:** These actually try to understand and create a model of the environment to help them make decisions. They are not covered in detail here, but they tend to have a more structured approach compared to model-free methods.

" Value-based approaches, though more stable, are not suitable to model continuous action spaces.  One of the most powerful RL algorithms, called the actor-critic algorithm, is built by combining the value-based and policy-based approaches. In this algorithm, both the policy (actor) and the value function (critic) are parametrized to enable effective use of training data with stable convergence."

Lastly, there is a super powerful type of RL algorithm that combines the best parts of both value-based and policy-based approaches, called the actor-critic method. This method uses two parts:

**Actor:** This part is like the policy-based approach, focusing on finding the best actions to take.

**Critic:** This part is like the value-based approach, evaluating the actions chosen by the actor to help it learn and make better choices in the future.

This actor-critic method tries to get the best of both worlds, aiming to learn effectively and steadily get better over time, using both the policy and value methods together.
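To give a feel for the policy-based idea described above (parametrize the policy with learnable weights and improve it directly), here is a minimal sketch on a two-action problem with no states. It is a simplification: it uses the exact gradient of the expected reward rather than sampled trajectories as REINFORCE would, and the rewards, learning rate, and function names are assumptions.

```python
import math

def softmax(prefs):
    """Turn preference weights into action probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(prefs, rewards, lr):
    """One exact gradient-ascent step on expected reward J = sum_a pi(a) * r(a)."""
    probs = softmax(prefs)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy, dJ/dpref_a = pi(a) * (r(a) - J).
    return [pref + lr * p * (r - expected)
            for pref, p, r in zip(prefs, probs, rewards)]

rewards = [1.0, 0.0]  # hypothetical: action 0 pays off, action 1 does not
prefs = [0.0, 0.0]    # learnable weights, starting with no preference
for _ in range(200):
    prefs = policy_gradient_step(prefs, rewards, lr=0.5)

print(softmax(prefs))
```

No value table is ever built here. The weights are nudged so that the better action simply becomes more probable, which is the core of the policy-based family.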






## Model-based RL algorithms

"Model-based RL algorithms build a model of the environment by sampling the states, taking actions, and observing the rewards. For every state and a possible action, the model predicts the expected reward and the expected future state. While the former is a regression problem, the latter is a density estimation problem. Given a model of the environment, the RL agent can plan its actions without directly interacting with the environment. This is like a thought experiment that a human might run when trying to solve a problem. When the process of planning is interweaved with the process of policy estimation, the RL agent’s ability to learn." [Source](https://www.synopsys.com/ai/what-is-reinforcement-learning.html#:~:text=Definition,environment%20to%20obtain%20maximum%20reward.)



---



**Model-based RL algorithms** are like detailed road maps that help a computer (called an RL agent) **figure out what to do next**. To create this road map, the RL agent takes different actions and notes down what rewards it gets and what new situations it finds itself in each time.

Let's break this down a bit:


*   **Predicting rewards:** For each action it can take in each situation, the agent makes a guess about what the reward will be. It's like guessing how many likes a specific type of post will get on social media.

*   **Guessing the next situation:** The agent also tries to predict what situation it will find itself in after taking each action. This is a bit like trying to guess what your friend will reply to a text message you're planning to send.

*   **Thinking ahead without acting:** The cool part about this is that once the RL agent has built its road map, it can think through its actions without actually taking them, kind of like how we sometimes think through different scenarios in our head before deciding what to do.

*   **Blending planning and learning:** The RL agent doesn't just stick to its initial road map. As it learns more, it keeps updating its plans to make better choices in the future. So, it's a continuous cycle of learning and improving.

By doing this, the RL agent gets better and better at choosing the actions that will give it the highest rewards over time.
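The four steps above can be sketched in a few lines. This is an illustration under assumptions, not a standard implementation: the "real world" is a made-up four-state corridor, the model is just a table of averages and counts (the simplest possible stand-in for the regression and density-estimation steps), and planning is done with value iteration on the learned model.

```python
from collections import defaultdict

# Hypothetical corridor with states 0..3; reaching state 3 pays a reward of 1.
# This function stands in for the real environment.
def real_environment(state, action):
    next_state = max(0, state - 1) if action == "left" else min(3, state + 1)
    reward = 1.0 if next_state == 3 and state != 3 else 0.0
    return next_state, reward

STATES, ACTIONS, GAMMA = [0, 1, 2, 3], ["left", "right"], 0.9

# 1) Build the model from experience: try actions and record what happens.
next_counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visits = defaultdict(int)
for state in STATES:
    for action in ACTIONS:
        for _ in range(3):  # a few samples per (state, action) pair
            next_state, reward = real_environment(state, action)
            next_counts[(state, action)][next_state] += 1
            reward_sums[(state, action)] += reward
            visits[(state, action)] += 1

def predicted_reward(state, action):
    """Average reward seen for this pair (predicting rewards)."""
    return reward_sums[(state, action)] / visits[(state, action)]

def predicted_next(state, action):
    """Estimated next-state distribution: {next_state: probability}."""
    n = visits[(state, action)]
    return {s2: c / n for s2, c in next_counts[(state, action)].items()}

# 2) Think ahead using only the learned model (no further interaction).
def q_from_model(state, action, values):
    return predicted_reward(state, action) + GAMMA * sum(
        prob * values[s2] for s2, prob in predicted_next(state, action).items())

values = {s: 0.0 for s in STATES}
for _ in range(50):  # value iteration; state 3 is terminal, so its value stays 0
    values = {s: (0.0 if s == 3 else max(q_from_model(s, a, values) for a in ACTIONS))
              for s in STATES}

plan = {s: max(ACTIONS, key=lambda a: q_from_model(s, a, values))
        for s in STATES if s != 3}
print(plan)  # {0: 'right', 1: 'right', 2: 'right'}
```

Once the table is filled in, the planning loop never calls `real_environment` again, which is the "thinking ahead without acting" part.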

# Examples of Reinforcement Learning

RL can be a game-changer in scenarios where an agent needs to navigate unpredictable environments to achieve a certain goal. Here’s a look at some of the areas where RL has made a big impact:

**Robotics:** Traditional robots working on factory floors are great because they follow a set script, doing the same job day in, day out. But when we take robots out of this structured setup and place them in unpredictable situations, things get a lot tougher. That's where RL steps in, helping to create robots that can think on their feet, finding the safest and most efficient paths from A to B without bumping into anything — kind of like a smart GPS system but for robots.

**AlphaGo and the Ancient Game of Go:** The game of Go is ancient, intricate, and has more possible board positions than there are atoms in the observable universe, which makes it far more complex than chess. Enter AlphaGo, a system powered by RL, which beat Lee Sedol, one of the world's top Go players, in 2016. How did it pull off this feat? Well, by first learning from a large collection of games played by human experts, and then practicing against itself to develop strategies that surprised even professionals.

**Self-Driving Cars:** Imagine a car that learns to drive by itself, understanding how to avoid obstacles, predict the moves of pedestrians, and choose the best route in real time — that's what RL can do in the realm of autonomous driving. It's all about teaching the car to predict what might happen next and make smart decisions on the spot, making self-driving cars safer and more efficient.

# How does Reinforcement Learning learn through Q-learning?

Q-learning is one of the foundational algorithms in reinforcement learning. It helps an agent find the optimal action-selection policy for any given finite Markov decision process, essentially helping it decide the best action from a set of actions based on its current state.

1. **Defining the Q-function**

The Q-function, denoted as $Q(s, a)$, is a representation of the "quality" of an action $a$ taken in state $s$. It essentially gives us a measure of the total expected reward of an action taken in a particular state, considering the long-term future.

2. **Initializing Q-values**

Initially, we assume a $Q$-value for each state-action pair. Often, this is a table initialized with zeros.

3. **The Q-learning formula**

The Q-values are updated using the formula:

$$
Q(s, a) = Q(s, a) + \alpha [R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a)]
$$

where

$Q(s, a)$ represents the Q-value of a particular state-action pair $(s, a)$, where $s$ is the current state and $a$ is the action taken in state $s$.

$\alpha$ is the learning rate, which determines to what extent the newly acquired information will override the old information.

$R(s)$ is the immediate reward received after transitioning from the current state $s$ using action $a$.

$\gamma$ is the discount factor, which models the agent's consideration for future rewards; a high value will make the agent prioritize long-term reward over short-term reward.

$\max_{a'} Q(s', a')$ is the estimate of the optimal future value, obtained by taking the maximum Q-value at the next state $s'$ over all possible actions $a'$.

$s'$ represents the new state after action $a$ is taken in state $s$.

4. **Policy**

The policy, denoted as $ \pi $, dictates what action to take in each state. A common policy derived from Q-learning is the ε-greedy policy, where with probability $ 1-\varepsilon $ the agent chooses the action with the highest Q-value, and with probability $ \varepsilon $ chooses a random action. This helps in exploring the action space sufficiently during training.

5. **Learning through Iterations**

The agent interacts with the environment over many iterations, and after each step it updates the Q-values using the Q-learning update rule:

$$
Q(s, a) = Q(s, a) + \alpha [R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a)]
$$

where:

*   $ Q(s, a) $: Current Q-value estimate
*   $ \alpha $: Learning rate
*   $ R(s) $: Immediate reward received after taking action $ a $ in state $ s $
*   $ \gamma $: Discount factor
*   $ \max_{a'} Q(s', a') $: Maximum estimated future value
*   $ s' $: New state
*   $ a' $: Candidate action in the new state

Over many such updates, the Q-values converge towards the optimal Q-values, which allow the agent to take the best action in each state.
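The ε-greedy policy and the update rule from the steps above translate almost line for line into code. The sketch below is a minimal version with hypothetical helper names (`epsilon_greedy`, `q_update`) and made-up numbers for the worked example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one.

    q_values maps each action available in the current state to its Q-value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

def q_update(q_sa, reward, max_next_q, alpha, gamma):
    """Return the new Q(s, a) after one application of the update rule."""
    return q_sa + alpha * (reward + gamma * max_next_q - q_sa)

# Worked example with made-up numbers:
# Q(s, a) = 0.5, reward = 1, max over a' of Q(s', a') = 2, alpha = 0.1, gamma = 0.9
new_q = q_update(0.5, reward=1.0, max_next_q=2.0, alpha=0.1, gamma=0.9)
print(round(new_q, 2))  # 0.73

print(epsilon_greedy({"left": 0.2, "right": 0.7}, epsilon=0.0))  # right
```

In the worked example the target is 1 + 0.9 × 2 = 2.8, the old estimate is 0.5, and the learning rate of 0.1 moves the estimate one tenth of the way towards the target: 0.5 + 0.1 × 2.3 = 0.73.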

**Example Implementation**
To illustrate this with a simple example, let's assume the agent is learning to navigate a maze. The states are the different locations in the maze, the actions are the directions it can move in (up, down, left, right), and the rewards are positive when it moves closer to the exit and negative when it hits a wall. The Q-learning algorithm would iteratively update the Q-values using the formula as the agent explores the maze, and eventually, it would learn the optimal policy to navigate from any point in the maze to the exit.
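Here is one way the maze example could look in code. Treat it as a sketch under assumptions rather than a reference implementation: the 4x4 layout, the hyperparameters, and the reward scheme are all invented for illustration. In particular, the rewards are simplified to +10 for reaching the exit, -1 for bumping into a wall, and -0.1 for every other step, instead of rewarding each move that gets closer to the exit.

```python
import random

# Hypothetical 4x4 maze: S = start, E = exit, # = wall, . = free cell.
MAZE = ["S.#.",
        ".##.",
        "....",
        "#..E"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ACTIONS = list(MOVES)
START, EXIT = (0, 0), (3, 3)

def step(state, action):
    """Return (next_state, reward, done) for one move in the maze."""
    row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    hits_wall = not (0 <= row < 4 and 0 <= col < 4) or MAZE[row][col] == "#"
    if hits_wall:
        return state, -1.0, False       # bump: stay put and take a penalty
    if (row, col) == EXIT:
        return (row, col), 10.0, True   # reached the exit
    return (row, col), -0.1, False      # small cost for every ordinary step

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Q-table initialized with zeros for every (state, action) pair.
    q = {((r, c), a): 0.0 for r in range(4) for c in range(4) for a in ACTIONS}
    for _ in range(episodes):
        state = START
        for _ in range(200):            # cap the episode length
            if rng.random() < epsilon:  # explore
                action = rng.choice(ACTIONS)
            else:                       # exploit the current estimates
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from START and return the visited cells."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q = train()
print(greedy_path(q))
```

After training, following the largest Q-value in each cell should trace a route from the start to the exit that avoids the walls. Changing `epsilon` is a quick way to see the exploration trade-off: with very little exploration the agent may settle on the first route it stumbles upon.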

**Conclusion**
Through Q-learning, a form of reinforcement learning, the agent learns to navigate its environment optimally by iteratively updating the expected rewards for different actions in different states, eventually finding the policy that gives the highest cumulative reward. It's a powerful tool for teaching agents to perform a wide variety of tasks, with the math and logic underpinning it being both robust and well-understood.

# How does Reinforcement Learning learn? (Q-learning)

Goal: To maximize the total reward

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula1.webp' width=300px>

We would like the rewards to come early, since that makes training faster and gets us to the desired outcome sooner.


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula2.webp' width=300px>

But in real cases we often encounter late rewards, and to penalize late rewards we introduce the discount factor ($\gamma$).

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula3.webp' width=300px>
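As a small numerical illustration of discounting, the sketch below weights the reward that arrives k steps in the future by $\gamma^k$ and adds everything up. The reward sequences and the value 0.9 for the discount factor are made up for the example, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

early = [1, 0, 0, 0]  # the reward arrives immediately
late = [0, 0, 0, 1]   # the same reward, three steps later

print(discounted_return(early, gamma=0.9))
print(discounted_return(late, gamma=0.9))
```

The early reward counts in full, while the same reward three steps later is worth only about 0.729 (that is, 0.9 cubed). This is how the discount factor nudges the agent towards collecting rewards sooner.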


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula4.webp' width=700px>



<img src='https://raw.githubusercontent.com//MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula5.webp' width=300px>


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula6.webp' width=600px>




<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-2-Introduction-to-machine-learning/imgs/formula8.webp' width=600px>


https://huggingface.co/learn/deep-rl-course/unit0/introduction

https://huggingface.co/blog/deep-rl-q-part1


https://huggingface.co/blog/deep-rl-q-part2

https://medium.com/p/d2643ca39b51

https://bharathikannann.github.io/blogs/an-introduction-to-machine-learning-and-its-types/

https://www.synopsys.com/ai/what-is-reinforcement-learning.html#:~:text=Definition,environment%20to%20obtain%20maximum%20reward.