# Reinforcement Learning Intuition

**Reinforcement learning is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. Unlike supervised learning, where the model is trained on a fixed dataset of input-output pairs, RL involves learning through interaction with the environment, with the aim of discovering the optimal actions that yield the highest reward over time.**

**Key Components of Reinforcement Learning**

Agent: The learner or decision-maker that interacts with the environment.

Environment: Everything the agent interacts with and operates within.

State (s): A representation of the current situation or configuration of the environment.

Action (a): A decision or move made by the agent that affects the state.

Reward (r): Feedback from the environment, typically a scalar value, which the agent aims to maximize.

Policy (π): A strategy used by the agent to decide which actions to take based on the current state. Policies can be deterministic or stochastic.

Value Function (V): Estimates the expected cumulative reward the agent will collect starting from a given state and following its policy, helping the agent assess the long-term benefit of states.

Q-Value or Action-Value Function (Q): Estimates the expected cumulative reward from taking a particular action in a given state and following the policy afterwards.
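To make "cumulative reward" concrete, here is a minimal sketch of how a sequence of rewards can be folded into a single number. It assumes a discount factor `gamma` (not introduced above) that weights later rewards less than immediate ones; the function name and the sample rewards are purely illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, where the reward t steps ahead is weighted by gamma**t.

    gamma is an assumed discount factor in [0, 1]; gamma=1.0 gives a plain sum.
    """
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards from three consecutive steps
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

A value function V(s) is an estimate of the average of this quantity over the many trajectories that can start in state s.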

**How Reinforcement Learning Works**

Initialization: The agent starts with some initial policy or strategy.

Interaction: The agent interacts with the environment by observing the state, taking an action, and receiving a reward.

Learning: The agent updates its knowledge (policy, value functions) based on the received reward and observed transitions.

Iteration: The interaction and learning steps are repeated, allowing the agent to learn and improve its policy over time.
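The four steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the environment is a made-up five-cell corridor with a reward at the right end, the learning rule is tabular Q-learning with epsilon-greedy action selection (one possible choice among many), and all names and hyperparameters (`step`, `train`, `alpha`, `gamma`, `epsilon`) are hypothetical.

```python
import random

N_STATES = 5        # cells 0..4; reaching cell 4 ends the episode (toy environment)
ACTIONS = [-1, +1]  # action index 0 moves left, index 1 moves right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialization: every action value starts at zero.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):  # Iteration: repeat interaction and learning
        state, done = 0, False
        while not done:
            # Interaction: explore with probability epsilon (or on ties),
            # otherwise take the action with the higher estimated value.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Learning: move Q(s, a) toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
greedy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(greedy)  # greedy action in each non-terminal cell
```

After training, the learned values should favor moving right in every non-terminal cell, since that is the only way to reach the reward.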

**Applications of Reinforcement Learning**

**Robotics:**

Autonomous navigation and manipulation tasks.
Example: A robot learning to pick and place objects.

**Gaming:**

Training AI to play and excel in games.
Example: AlphaGo, which mastered the game of Go, and AlphaZero, which excelled in Go, Chess, and Shogi.

**Finance:**

Portfolio management, algorithmic trading, and risk management.
Example: RL-based strategies for trading stocks and options.

**Healthcare:**

Personalized treatment plans, medical diagnosis, and drug discovery.
Example: Optimizing chemotherapy dosage for cancer treatment.

**Examples of Reinforcement Learning in Action**

**AlphaGo by DeepMind:**

Utilized deep RL and Monte Carlo tree search to defeat human champions in the game of Go.

**OpenAI Five:**

Trained a team of five AI agents to play the complex strategy game Dota 2, reaching a level of play that defeated the reigning world champions (team OG) in 2019.

**Autonomous Helicopter Flight:**

RL algorithms were used to teach helicopters to perform acrobatic maneuvers autonomously.

Reinforcement learning algorithms are commonly grouped into **model-free (value-based) algorithms, policy-based algorithms, model-based algorithms, and advanced algorithms**. These groups overlap: the policy-based and advanced algorithms listed below are also model-free, meaning they learn from experience without building a model of the environment.

**1. Model-free (value-based) algorithms** - Q-Learning, SARSA (State-Action-Reward-State-Action), Deep Q-Networks (DQN), Double DQN, Dueling DQN.

**2. Policy-based algorithms** - REINFORCE, Actor-Critic, Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C).

**3. Model-based algorithms** - Monte Carlo Tree Search (MCTS), AlphaZero.

**4. Advanced algorithms** - Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), Trust Region Policy Optimization (TRPO), Rainbow, TD3 (Twin Delayed Deep Deterministic Policy Gradient).

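**The Multi-Armed Bandit Problem**

Rather than the algorithms listed above, today we are going to deal with two significant algorithms for **multi-armed bandit problems**. A multi-armed bandit is a simplified reinforcement learning setting: the agent repeatedly chooses one of a fixed set of actions (arms), each with an unknown reward distribution, and there are no state transitions to track. The two algorithms are:

1. Upper Confidence Bound (UCB)

2. Thompson Sampling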

Both Upper Confidence Bound and Thompson Sampling algorithms are essential tools in the arsenal of reinforcement learning, particularly for problems where the **primary challenge is balancing exploration and exploitation.** They are simple, effective, and provide a good foundation for understanding more complex RL algorithms.
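As a rough preview of how the two algorithms balance exploration and exploitation, here is a minimal sketch on a simulated bandit. It makes several assumptions that are not part of the text above: rewards are Bernoulli (0 or 1), the UCB variant is the common UCB1 rule (empirical mean plus a `sqrt(2 ln t / n)` bonus), Thompson Sampling uses a uniform Beta(1, 1) prior, and the arm probabilities and function names are made up for illustration.

```python
import math
import random


def pull(arm, true_probs, rng):
    """Simulated Bernoulli arm: reward 1 with probability true_probs[arm], else 0."""
    return 1 if rng.random() < true_probs[arm] else 0


def ucb1(true_probs, rounds=2000, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k  # times each arm was played
    totals = [0] * k  # summed reward per arm
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # play every arm once so each count is non-zero
        else:
            # Exploit (empirical mean) plus explore (a bonus that shrinks
            # as an arm is played more often).
            arm = max(
                range(k),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        totals[arm] += pull(arm, true_probs, rng)
        counts[arm] += 1
    return counts


def thompson_sampling(true_probs, rounds=2000, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    wins = [0] * k
    losses = [0] * k
    for _ in range(rounds):
        # Draw a plausible success rate for each arm from its Beta posterior
        # (Beta(1, 1), i.e. uniform, before any data), then play the best draw.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        if pull(arm, true_probs, rng):
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[i] + losses[i] for i in range(k)]


probs = [0.2, 0.5, 0.8]  # hypothetical success rates, unknown to the algorithms
print("UCB1 pulls per arm:    ", ucb1(probs))
print("Thompson pulls per arm:", thompson_sampling(probs))
```

In both cases the pull counts should concentrate on the arm with the highest success rate, while the weaker arms still receive some plays early on. UCB1 explores through a deterministic optimism bonus, whereas Thompson Sampling explores through randomness in its posterior draws.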
