**Table of contents**<a id='toc0_'></a>    
- [Lunar Lander](#toc1_)    
- [Neurons](#toc2_)    
- [Markov Decision Processes](#toc3_)    
- [Episodic vs. Continuing Tasks](#toc4_)    
- [Notes on Lunar Lander Environment](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Lunar Lander](#toc0_)

<p align="center">
  <img width="700" height="400" src="imgs/c4m2-reward-function.png">
</p>

# <a id='toc2_'></a>[Neurons](#toc0_)

- A neuron's output, or action, is called an **action potential**, transmitted to other neurons by its **axon**.
- When a neuron generates an action potential it is said to **fire**.
- Action potentials act on other neurons via **synapses**.
- Synapses have **efficacies** (weights or strengths) which change during learning. 

We could say 
- Neurons are RL agents.
- The brain is a **society** of RL agents.

$\textbf{Klopf's specific hypothesis:}$

> - When a neuron fires an action potential, **all the contributing synapses** become ${\textcolor{red}{\textbf{eligible}}}$ to undergo changes in their efficacies, or weights. 
>
> - If the action potential is followed **within an appropriate time period** by an increase in reward, the efficacies of all eligible synapses increase (or decrease in the case of punishment).

$\textbf{What are Eligibility Traces?}$

- ${\textcolor{red}{\textbf{Eligibility Traces}}}$ are a **memory mechanism** that tracks which **synapses** (connections) were recently active, so that **delayed rewards or punishments** can update the correct connections.
- They act like **"credit markers"** for learning: if a reward comes later, which synapses should get credit? 

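In practice, RL uses a simplified, exponentially decaying trace:

$$
    \mathbf{e}_t = \gamma\lambda\,\mathbf{e}_{t-1} + \nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})
$$

Here $\gamma$ is the discount factor, $\lambda \in [0, 1]$ is the trace-decay rate, and $\mathbf{w}$ are the weights of the value estimate $\hat{v}$.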
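As a rough illustration of how the decaying trace assigns credit, here is a minimal sketch of one TD($\lambda$) update for a linear value estimate. The function name `td_lambda_step`, the step size `alpha`, the TD-error form, and the one-hot features are assumptions made for this example and do not come from the notes above.

```python
import numpy as np

def td_lambda_step(w, trace, x, x_next, reward, gamma=0.9, lam=0.8, alpha=0.1):
    """One semi-gradient TD(lambda) update for a linear estimate v(s) = w @ x(s)."""
    # Decay the old trace, then add the gradient of v, which is x for a linear estimate.
    trace = gamma * lam * trace + x
    # TD error: how much better or worse the outcome was than predicted.
    delta = reward + gamma * (w @ x_next) - (w @ x)
    # Each weight moves in proportion to its current eligibility.
    w = w + alpha * delta * trace
    return w, trace

# Three one-hot states visited in order; reward arrives only on the last transition.
features = np.eye(3)
w = np.zeros(3)
trace = np.zeros(3)
w, trace = td_lambda_step(w, trace, features[0], features[1], reward=0.0)
w, trace = td_lambda_step(w, trace, features[1], features[2], reward=0.0)
w, trace = td_lambda_step(w, trace, features[2], np.zeros(3), reward=1.0)

print(trace)  # eligibility of each state when the reward arrives
print(w)      # earlier states receive less credit than recent ones
```

With these settings the earliest state keeps a trace of $(\gamma\lambda)^2 = 0.5184$ by the time the reward arrives, so it receives about half the credit of the most recent state.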

$\textbf{Two Types of Eligibility Traces:}$

- ${\textcolor{blue}{\textbf{Contingent eligibility}}}$ (requires pre- and post-synaptic activity) $\rightarrow$ Used by the actor
  - Purpose: learns to act
- ${\textcolor{blue}{\textbf{Non-contingent eligibility}}}$ (only pre-synaptic activity) $\rightarrow$ Used by the critic
  - Purpose: predicts reward

$\textbf{Key Takeaways:}$

- Eligibility traces **connect past activity to delayed feedback**.
- They're used in nearly all RL algorithms, often in a simple one-step form.
- They help implement **time-sensitive learning** in both artificial and biological systems.
- Neuroscience increasingly supports their existence in the brain.

Before, we tried to maximize reward; now we try to ***predict*** reward.


# <a id='toc3_'></a>[Markov Decision Processes](#toc0_)

- ${\textcolor{red}{\textbf{MDPs}}}$ provide a general framework for sequential decision-making
- The **dynamics** of an MDP are defined by a probability distribution

The k-Armed Bandit problem we looked at previously introduces many interesting questions. However, it leaves out many aspects of real-world problems. The agent is presented with the same situation each time, and the same action is always optimal. In many problems, different situations call for different responses, and the actions we choose now affect the amount of reward we can get in the future. The Markov Decision Process formalism captures these two aspects of real-world problems.

$\textbf{Markov Property:}$

> The future state $s'$ and reward $r$ **depend only** on the current state $s$ and action $a$.
>
> Or differently said: 
>
> ${\textcolor{green}{\textbf{The present state contains all the information necessary to predict the future.}}}$

It means that the present state is sufficient and remembering earlier states would not improve predictions about the future.
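To make the dynamics concrete, the sketch below stores $p(s', r \mid s, a)$ as a lookup table and samples from it. The two-state MDP, its state and action names, and all of its probabilities and rewards are made up for illustration. The point is only that `step` needs nothing beyond the current state and action, which is the Markov property.

```python
import random

# Hypothetical MDP: dynamics[(state, action)] lists (probability, next_state, reward).
dynamics = {
    ("high", "search"): [(0.7, "high", 1.0), (0.3, "low", 1.0)],
    ("high", "wait"): [(1.0, "high", 0.0)],
    ("low", "search"): [(0.6, "low", 1.0), (0.4, "high", -3.0)],
    ("low", "wait"): [(1.0, "low", 0.0)],
}

def step(state, action, rng):
    """Sample (next_state, reward) using only the current state and action."""
    outcomes = dynamics[(state, action)]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

rng = random.Random(0)
state = "high"
for _ in range(5):
    next_state, reward = step(state, "search", rng)
    print(state, "->", next_state, "reward", reward)
    state = next_state
```

No history of earlier states is passed to `step`. If the environment needed that history to predict the next outcome, the state as defined would not be Markov.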

# <a id='toc4_'></a>[Episodic vs. Continuing Tasks](#toc0_)

**🎮 Episodic Task Example: Video Game Agent**

- Agent collects treasure blocks for +1 reward.
- Episode ends when the agent touches a green enemy.
- Each episode **resets** to the same initial state.
- Objective: maximize total reward **per episode**.
- Naturally modeled as an **episodic task**.

**🖥️ Continuing Task Example: Job Scheduler**

- Agent schedules jobs on servers based on priority.
- Accepting high-priority jobs gives reward; rejecting gives penalty.
- Jobs keep arriving; servers are reused.
- Task **never ends** — no natural episodes.
- Modeled as a **continuing task**.

$\textbf{Key Takeaways:}$

- ${\textcolor{red}{\textbf{Episodic tasks}}}$:
  - break naturally into independent episodes
  - have clear beginnings and ends, and often reset to the same initial state.
- ${\textcolor{red}{\textbf{Continuing tasks}}}$:
  - are assumed to continue indefinitely, with no natural episode boundaries
  - have the agent interact with the environment continually.
- Choose the formulation that fits the **structure and goals** of the problem.
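The structural difference shows up in the interaction loop. The sketch below is a toy stand-in for the two examples, not an implementation of either: the termination probability, the reward values, and the function names are all assumptions chosen to keep the code short.

```python
import random

def run_episode(rng, enemy_prob=0.2):
    """Episodic: start fresh, accumulate reward, stop at the terminal event."""
    total_reward = 0.0
    while True:
        if rng.random() < enemy_prob:  # touching the enemy ends the episode
            return total_reward
        total_reward += 1.0            # otherwise a treasure block is collected

def run_continuing(rng, num_steps, high_priority_prob=0.5):
    """Continuing: no terminal state; num_steps only limits how long we observe."""
    rewards = []
    for _ in range(num_steps):
        high_priority = rng.random() < high_priority_prob
        # Placeholder rule: this toy agent accepts every job that arrives.
        rewards.append(1.0 if high_priority else 0.0)
    return rewards

rng = random.Random(0)
episode_returns = [run_episode(rng) for _ in range(3)]  # each call is a fresh episode
stream = run_continuing(rng, num_steps=10)              # one unbroken stream
print(episode_returns)
print(stream)
```

Each call to `run_episode` is independent of the previous one, while `run_continuing` has no reset at all. Stopping after `num_steps` is a choice made by the observer, not a property of the task.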


# <a id='toc5_'></a>[Notes on Lunar Lander Environment](#toc0_)

In this notebook, you will be applying these functions to __structure the reward signal__ based on the following criteria:

1. **The lander will crash if** it touches the ground when ``y_velocity < -3`` (the downward velocity is greater than three).


2. **The lander will crash if** it touches the ground when ``x_velocity < -10 or 10 < x_velocity`` (horizontal speed is greater than $10$).


3. The lander's angle takes values in $[0, 359]$. It is completely vertical at $0$ degrees. **The lander will crash if** it touches the ground when ``5 < angle < 355`` (angle differs from vertical by more than $5$ degrees).


4. **The lander will crash if** it has yet to land and ``fuel <= 0`` (it runs out of fuel).


5. MST would like to save money on fuel when it is possible **(using less fuel is preferred)**.


6. The lander can only land in the landing zone. **The lander will crash if** it touches the ground when ``x_position`` $\not\in$ ``landing_zone`` (it lands outside the landing zone).
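The crash criteria above can be collected into a single predicate, sketched below. This is not the notebook's own method: the function name `is_crash`, its argument list, the `touching_ground` flag, and the assumption that `landing_zone` is a `(left, right)` interval are all hypothetical. Criterion 5 (fuel preference) affects the reward rather than crashing, so it is left out.

```python
def is_crash(y_velocity, x_velocity, angle, fuel, x_position,
             landing_zone, touching_ground):
    """Return True if any crash criterion from the list is met.

    Hypothetical signature; landing_zone is assumed to be a (left, right) pair.
    """
    if touching_ground:
        too_fast_down = y_velocity < -3                     # criterion 1
        too_fast_sideways = x_velocity < -10 or 10 < x_velocity  # criterion 2
        tilted = 5 < angle < 355                            # criterion 3
        left, right = landing_zone
        outside_zone = not (left <= x_position <= right)    # criterion 6
        return too_fast_down or too_fast_sideways or tilted or outside_zone
    # Still in flight: the only way to crash is to run out of fuel (criterion 4).
    return fuel <= 0

# A gentle, upright touchdown inside the zone is not a crash.
print(is_crash(-1, 0, 0, 10, 5, (0, 10), touching_ground=True))
# Touching down too fast is.
print(is_crash(-4, 0, 0, 10, 5, (0, 10), touching_ground=True))
```

Note that the angle test uses the wrap-around range directly: an angle of $357$ is within $5$ degrees of vertical and passes, while $180$ fails.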


Fill in the methods below to create an environment for the lunar lander.