### **Long Short-Term Memory (LSTM) Explained in a Colorful Way 🎨✨**

Imagine your brain as a **notebook** where you write important things you need to remember. But here’s the catch—your memory is not perfect! Sometimes, you **forget unimportant details** and **retain only the essential ones**. This is exactly how an **LSTM (Long Short-Term Memory)** network works in deep learning!  


### **🌟 What is LSTM?**
LSTM is a special type of **Recurrent Neural Network (RNN)** designed to **remember important information** over long periods and **forget unnecessary details**. A plain RNN struggles with long-term dependencies because the training signal (the gradient) shrinks as it is propagated back through many time steps; LSTM adds a **gated memory mechanism** to selectively store and erase information.  


### **🧠 LSTM’s Secret Superpowers: Gates! 🚪**
LSTM has three magical "gates" that decide what to **keep, update, and forget** in the memory:  

1️⃣ **Forget Gate 🔥**  
   - This gate decides what old information should be thrown away.  
   - Example: "Do I really need to remember what I ate for breakfast three days ago? Nope! Forget it!"  

2️⃣ **Input Gate 📥**  
   - This gate decides what new information should be added to memory.  
   - Example: "Ah! I just learned a new word today! Let’s save it in memory."  

3️⃣ **Output Gate 📤**  
   - This gate determines what should be **sent as output** to the next time step.  
   - Example: "I need to recall my friend’s birthday today, so let’s retrieve it from memory!"  


### **🎨 Visualizing the LSTM Process**
1️⃣ **Incoming data arrives** at the LSTM cell.  
2️⃣ The **Forget Gate** decides what past info should be erased.  
3️⃣ The **Input Gate** updates memory with useful new info.  
4️⃣ The **Output Gate** selects what needs to be passed forward.  

The **Cell State** is like a conveyor belt 🎢 that keeps flowing, carrying essential information through time while discarding what’s unnecessary.  


### **🚀 Where is LSTM Used?**
LSTMs are widely used in:  
🔹 **Speech Recognition** (e.g., Siri, Google Assistant)  
🔹 **Chatbots** (handling long conversations)  
🔹 **Stock Price Prediction** (analyzing past trends)  
🔹 **Language Translation** (remembering previous words for better sentences)  
🔹 **Music Generation** (creating melodies that make sense over time)  


### **🔑 Key Takeaways**
✔️ LSTM is an advanced type of RNN that **remembers** important things for long durations.  
✔️ It uses **Forget, Input, and Output Gates** to manage memory efficiently.  
✔️ Used in applications where remembering past information is **crucial** (speech, text, stock trends, etc.).  

Now, if LSTMs were people, they’d be **the best note-takers in the world!** 📝✨  

![](lstm.png)

---

### **📌 Long Short-Term Memory (LSTM) Architecture Explained in Detail 🚀**  

LSTM is a type of **Recurrent Neural Network (RNN)** designed to handle **long-term dependencies** in sequential data. Unlike vanilla RNNs, which struggle with the **vanishing gradient problem**, LSTMs have a **memory cell** that selectively stores and forgets information over long sequences.  

Let’s break down the **LSTM architecture** in an easy-to-understand and colorful way! 🎨✨  



## **🛠️ LSTM Architecture: The Building Blocks 🏗️**  
Each LSTM unit (or **cell**) consists of:  
✅ **Cell State** ($ C_t $) – The "memory" that carries long-term information.  
✅ **Hidden State** ($ h_t $) – The output of the current LSTM cell, passed to the next step.  
✅ **Three Gates** (Forget, Input, and Output) – Control what gets updated, remembered, or forgotten.  

At each time step $ t $, an LSTM cell processes:  
🔹 The current input $ x_t $  
🔹 The previous hidden state $ h_{t-1} $  
🔹 The previous cell state $ C_{t-1} $  

Now, let’s go deep into **each component**! 🔍  



### **🚪 1. Forget Gate $ f_t $ – Decides What to Erase! 🔥**  
The **Forget Gate** decides which parts of the previous cell state $ C_{t-1} $ should be discarded.  
👉 It uses a **sigmoid activation function** ($ \sigma $) to produce values between **0 and 1** (0 = forget completely, 1 = keep fully).  

🔢 **Formula:**  
$$
f_t = \sigma (W_f \cdot [h_{t-1}, x_t] + b_f)
$$  
where:  
- $ W_f $ and $ b_f $ are the weight matrix and bias for the forget gate.  
- $ h_{t-1} $ is the previous hidden state.  
- $ x_t $ is the current input.  

📌 **Intuition:**  
- If $ f_t $ is **close to 0**, forget the information.  
- If $ f_t $ is **close to 1**, retain the information.  
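
As a minimal sketch of the formula above, the forget gate can be written in plain Python. The sizes (hidden size 2, input size 1) and every weight value here are made up for illustration; the point to notice is that $ W_f $ needs one column per entry of the concatenated vector $ [h_{t-1}, x_t] $.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    """Return f_t = sigmoid(W_f . [h_prev, x_t] + b_f) as a list."""
    concat = h_prev + x_t  # list concatenation builds [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]

# Hypothetical parameters: hidden size 2, input size 1, so W_f is 2 x 3.
W_f = [[4.0, 0.0, 1.0],
       [-4.0, 0.0, -1.0]]
b_f = [0.0, 0.0]

f_t = forget_gate(W_f, b_f, h_prev=[1.0, 0.5], x_t=[1.0])
print(f_t)  # first entry is close to 1 (keep), second is close to 0 (forget)
```

Each entry of `f_t` gates one entry of the cell state, so a single step can keep one memory slot while erasing another.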



### **📥 2. Input Gate $ i_t $ – Decides What to Store! 📝**  
The **Input Gate** determines what new information should be added to the memory cell.  
👉 It consists of:  
✅ A **sigmoid layer** to decide which values to update.  
✅ A **tanh layer** to create a candidate memory update $ \tilde{C}_t $.  

🔢 **Formulas:**  
$$
i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)
$$  
$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$  

📌 **Intuition:**  
- $ i_t $ controls **how much** of $ \tilde{C}_t $ should be stored in memory.  
- $ \tilde{C}_t $ contains the potential **new information**.  
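
The two layers can be sketched side by side. This is only an illustration: the helper names (`affine`, `input_gate_and_candidate`) and all weights are hypothetical. It shows the practical difference between the two activations: the sigmoid gate lies in (0, 1), while the tanh candidate lies in (-1, 1) and can therefore push memory up or down.

```python
import math

def affine(W, b, v):
    """Compute W . v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def input_gate_and_candidate(W_i, b_i, W_C, b_C, h_prev, x_t):
    """Return (i_t, c_tilde) for one time step."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    i_t = [1.0 / (1.0 + math.exp(-z)) for z in affine(W_i, b_i, concat)]
    c_tilde = [math.tanh(z) for z in affine(W_C, b_C, concat)]
    return i_t, c_tilde

# Hypothetical parameters: hidden size 2, input size 1 (matrices are 2 x 3).
W_i = [[0.5, -0.5, 1.0], [0.0, 0.0, 0.0]]
b_i = [0.0, 0.0]
W_C = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
b_C = [0.0, 0.0]

i_t, c_tilde = input_gate_and_candidate(W_i, b_i, W_C, b_C, [0.2, 0.2], [0.6])
print(i_t)      # gate values, each strictly between 0 and 1
print(c_tilde)  # candidate values, one positive and one negative
```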



### **🔄 3. Update Cell State $ C_t $ – The Actual Memory! 🧠**  
After **forgetting some old info** and **adding new info**, we update the **cell state**:  

🔢 **Formula:**  
$$
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
$$  

📌 **Intuition:**  
- The **old memory $ C_{t-1} $** is reduced based on $ f_t $.  
- The **new memory $ \tilde{C}_t $** is added based on $ i_t $.  
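
The update is purely element-wise, which a short sketch makes concrete. The values are made up to show the two extremes: the first slot keeps its old memory untouched, the second is fully overwritten by the candidate.

```python
def update_cell_state(f_t, c_prev, i_t, c_tilde):
    """Element-wise C_t = f_t * C_{t-1} + i_t * c_tilde."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]

# Made-up values chosen to show the two extremes.
c_prev = [2.0, 2.0]
f_t = [1.0, 0.0]      # keep slot 0, erase slot 1
i_t = [0.0, 1.0]      # write nothing to slot 0, write fully to slot 1
c_tilde = [0.5, -0.5]

c_t = update_cell_state(f_t, c_prev, i_t, c_tilde)
print(c_t)  # [2.0, -0.5]
```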



### **📤 4. Output Gate $ o_t $ – Decides the Final Output! 📊**  
The **Output Gate** determines what the **hidden state** $ h_t $ (the output of the LSTM cell) should be.  

🔢 **Formulas:**  
$$
o_t = \sigma (W_o \cdot [h_{t-1}, x_t] + b_o)
$$  
$$
h_t = o_t * \tanh(C_t)
$$  

📌 **Intuition:**  
- $ o_t $ acts as a filter, deciding **which parts of $ C_t $** should be output.  
- The **hidden state $ h_t $** is used in the next LSTM step and can also be passed to other layers (like dense layers for classification).  
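
A small sketch of the second formula, with made-up numbers: the same cell value can be exposed strongly or weakly depending on the gate, and $ \tanh $ keeps every output inside (-1, 1).

```python
import math

def hidden_state(o_t, c_t):
    """Element-wise h_t = o_t * tanh(C_t)."""
    return [o * math.tanh(c) for o, c in zip(o_t, c_t)]

# Made-up values: identical magnitude in the cell state, different gate openings.
c_t = [2.0, 2.0, -2.0]
o_t = [0.9, 0.1, 0.9]

h_t = hidden_state(o_t, c_t)
print(h_t)  # slot 0 is mostly exposed, slot 1 is mostly hidden, slot 2 is negative
```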



## **🎯 Putting It All Together: LSTM Workflow 🔄**
At each time step $ t $, an LSTM cell follows these steps:  
1️⃣ **Forget** old information ($ f_t $).  
2️⃣ **Decide what new information to store** ($ i_t $, $ \tilde{C}_t $).  
3️⃣ **Update the memory cell** ($ C_t $).  
4️⃣ **Compute the final output** ($ h_t $) using the Output Gate.  
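
The four steps above can be sketched as a single function. This is an illustrative pure-Python version, not a library implementation: the `params` layout (a dict mapping `"f"`, `"i"`, `"c"`, `"o"` to `(W, b)` pairs) is an assumption made for this example, and the all-zero weights are chosen only so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step; returns (h_t, c_t)."""
    concat = h_prev + x_t                                             # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(*params["f"], concat)]          # 1. forget
    i_t = [sigmoid(z) for z in affine(*params["i"], concat)]          # 2. input gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], concat)]    #    candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # 3. update
    o_t = [sigmoid(z) for z in affine(*params["o"], concat)]          # 4. output
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical setup: hidden size 2, input size 1, all weights and biases zero.
# Every gate is then sigmoid(0) = 0.5 and the candidate is tanh(0) = 0.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {name: (zeros, [0.0, 0.0]) for name in ("f", "i", "c", "o")}

h, c = [0.0, 0.0], [2.0, -4.0]
for x_t in ([1.0], [0.5], [-1.0]):   # a three-step input sequence
    h, c = lstm_step(params, h, c, x_t)
print(h, c)
```

With these zero weights the cell state is simply halved at every step, which makes the role of the forget gate easy to see; trained weights would make each gate depend on the input.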



## **🛠️ Where is LSTM Used?**
LSTM is widely used in:  
🔹 **Speech Recognition** 🎙️ (e.g., Siri, Google Assistant)  
🔹 **Text Generation** 📝 (e.g., character-level language models, poetry generation)  
🔹 **Time-Series Forecasting** 📈 (e.g., stock prices, weather prediction)  
🔹 **Machine Translation** 🌍 (e.g., Google Translate)  
🔹 **Music Generation** 🎵 (e.g., AI composing music)  



## **🔑 Key Takeaways**
✔️ LSTM has a **memory cell** that retains important information over time.  
✔️ It uses **Forget, Input, and Output Gates** to control information flow.  
✔️ Unlike vanilla RNNs, LSTM can handle **long-term dependencies** far more reliably.  
✔️ Used in various applications like **NLP, speech processing, and forecasting**.  



### **🎨 Visual Summary**
Imagine LSTM as a **smart secretary** 🧑‍💼 managing a **to-do list**:  
✅ **Forget Gate** removes unnecessary tasks.  
✅ **Input Gate** adds new important tasks.  
✅ **Cell State** is the notebook holding all tasks.  
✅ **Output Gate** decides what tasks should be shared.  

LSTMs are **powerful tools** in deep learning, allowing AI to learn patterns in time-dependent data effectively! 🚀🔥  

---

### **📌 Forget Gate Architecture in LSTM – A Deep Dive 🔥**  

The **Forget Gate** is a crucial component of Long Short-Term Memory (LSTM) networks. Its main job is to **decide which information should be discarded (forgotten) from the cell state** at each time step. This prevents the network from storing irrelevant or outdated information.  

Let’s explore its architecture, mathematical equations, and how it works step by step. 🚀  



## **🔎 1. Forget Gate Overview**
The **Forget Gate** is responsible for **removing unnecessary information** from the previous **Cell State** $ C_{t-1} $.  

### **💡 Key Idea**  
At every time step $ t $, the Forget Gate receives:  
- The **previous hidden state** $ h_{t-1} $ (short-term memory)  
- The **current input** $ x_t $ (new incoming data)  

It then decides, using a **sigmoid activation function ($ \sigma $)**, which parts of the previous cell state $ C_{t-1} $ should be **kept** and which should be **forgotten**.



## **📐 2. Forget Gate Architecture 🏗️**  

🔹 The Forget Gate consists of:  
✅ **A weight matrix** $ W_f $ that helps learn which information should be forgotten.  
✅ **A bias term** $ b_f $ that adds flexibility to the learning process.  
✅ **A sigmoid activation function** $ \sigma $ to produce values between **0 and 1** (0 = completely forget, 1 = completely remember).  

### **🔢 Mathematical Formula**  
$$
f_t = \sigma (W_f \cdot [h_{t-1}, x_t] + b_f)
$$
where:  
- $ W_f $ is the weight matrix for the forget gate.  
- $ [h_{t-1}, x_t] $ is the concatenation of the previous hidden state and current input.  
- $ b_f $ is the bias term.  
- $ \sigma $ is the sigmoid activation function.  

📌 **Sigmoid ensures that**:  
- If $ f_t $ is **close to 0**, the information is forgotten.  
- If $ f_t $ is **close to 1**, the information is retained.  



## **🔄 3. Step-by-Step Working of the Forget Gate**
At **each time step $ t $**, the Forget Gate operates as follows:

### **🟢 Step 1: Take Input**
- The Forget Gate receives **two inputs**:
  - **Previous hidden state** $ h_{t-1} $ (from the last LSTM cell).
  - **Current input** $ x_t $ (new information).  

📌 **Example:**  
If we are processing a sentence, $ x_t $ could be a **new word**, and $ h_{t-1} $ holds the context from previous words.



### **🔵 Step 2: Compute Forget Score**
- The Forget Gate applies a **linear transformation**:  
  $$
  z = W_f \cdot [h_{t-1}, x_t] + b_f
  $$
- Then, a **sigmoid activation function** is applied to get a value between **0 and 1**:
  $$
  f_t = \sigma(z)
  $$
  
📌 **Example Output:**  
- If $ f_t = 0.1 $ → Forget most of the past information.  
- If $ f_t = 0.9 $ → Retain most of the past information.  
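
To get a feel for which linear outputs $ z $ lead to these scores, the sigmoid can be inverted. The sketch below uses a helper named `logit` (a name chosen for this example) to recover the $ z $ that produces a given $ f_t $.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the z for which sigmoid(z) == p."""
    return math.log(p / (1.0 - p))

for target in (0.1, 0.5, 0.9):
    z = logit(target)
    print(f"z = {z:+.3f}  ->  f_t = {sigmoid(z):.2f}")
```

A score of 0.1 corresponds to $ z \approx -2.2 $ and a score of 0.9 to $ z \approx +2.2 $, so the gate only becomes decisive when the linear transformation produces values well away from zero.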



### **🟣 Step 3: Update Cell State**
- The **Forget Gate output** $ f_t $ is **multiplied element-wise** with the previous **cell state** $ C_{t-1} $. This is the first term of the full update $ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $:  
  $$
  f_t * C_{t-1}
  $$
- This determines **how much of the old memory should be kept** before the Input Gate adds new information.  

📌 **Example:**  
Let’s say the previous cell state $ C_{t-1} = 5 $ and the Forget Gate outputs $ f_t = 0.2 $, then:  
$$
f_t * C_{t-1} = 0.2 \times 5 = 1
$$
This means **most of the past information is discarded**.
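
Applying the multiplication above repeatedly shows how quickly memory fades when the score is small. The sketch deliberately ignores the Input Gate term, and the function name `decay_only` is made up for this illustration.

```python
def decay_only(c0, f, steps):
    """Apply c <- f * c repeatedly, ignoring any newly added information."""
    c = c0
    history = [c]
    for _ in range(steps):
        c = f * c
        history.append(c)
    return history

print(decay_only(5.0, 0.2, 3))  # fades almost completely within three steps
print(decay_only(5.0, 0.9, 3))  # most of the value survives three steps
```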



## **📊 4. Visualization of Forget Gate Architecture**  

```
    ┌─────────────────────────────────────────────┐
    │ Inputs: h(t-1), x(t)                         │
    │                                             │
    │  ⬇ Concatenate inputs                      │
    │                                             │
    │  W_f * [h(t-1), x(t)] + b_f                 │
    │           ⬇                                 │
    │        Sigmoid (σ) Activation               │
    │           ⬇                                 │
    │        Forget Score (f_t) (0 to 1)          │
    │           ⬇                                 │
    │     Multiply with Previous Cell State       │
    │           ⬇                                 │
    │     Update Cell State (C_t)                 │
    └─────────────────────────────────────────────┘
```



## **🎯 5. Intuition with a Real-Life Example 🧠**
Imagine you’re **reading a book** 📖:  

- You **remember** important plot details.  
- You **forget** unnecessary descriptions that don’t contribute much to the story.  

The Forget Gate works the **same way**:  
✅ **Keeps important details** (high $ f_t $ value).  
❌ **Discards unnecessary details** (low $ f_t $ value).  



## **📌 6. Importance of the Forget Gate**
🔹 Prevents the network from accumulating **too much unnecessary information**.  
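🔹 Helps mitigate the **vanishing gradient problem**: when $ f_t $ stays close to 1, the cell state (and the gradient flowing through it) passes across time steps almost unchanged.  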
🔹 Helps LSTMs **handle long-term dependencies** efficiently.  



## **🔑 Key Takeaways**
✔️ The **Forget Gate** determines **what past information to retain or discard**.  
✔️ Uses **sigmoid activation ($ \sigma $)** to produce a value between **0 and 1**.  
✔️ Helps LSTM networks avoid **overloading memory with irrelevant information**.  
✔️ **Plays a crucial role** in handling long-term dependencies in sequential data.  

---

### **📖 Manual Example of Forget Gate Calculation Using Text**  
Let's take a simple **sentence** as input and see how the **Forget Gate** decides what to keep and what to forget step by step.  



## **🔍 Example Sentence**
📌 Suppose we have the sentence:  
**"John is a great football player. He scored a goal in the last match."**  

We want our **LSTM model** to retain only the relevant information for predicting the next word.  

- Some words are **important** (e.g., **"John"**, **"football player"**, **"scored a goal"**).  
- Some words are **not very useful** (e.g., **"is"**, **"a"**, **"in the last match"**).  
- The Forget Gate **decides** which parts to **keep** and which to **discard**.  



## **🔢 Step 1: Assign Word Vectors**
Each word is converted into a numerical vector (simplified here to made-up 2-dimensional values):

| Word  | Word Vector Representation (Simplified) |
|--------|----------------------------|
| John   | **[0.8, 0.5]**   |
| is     | **[0.2, 0.1]**   |
| a      | **[0.1, 0.05]**  |
| great  | **[0.9, 0.7]**   |
| football | **[0.7, 0.6]**   |
| player | **[0.85, 0.75]**  |
| He     | **[0.3, 0.2]**   |
| scored | **[0.95, 0.85]**  |
| a      | **[0.1, 0.05]**  |
| goal   | **[0.9, 0.8]**   |
| in     | **[0.15, 0.1]**  |
| the    | **[0.1, 0.05]**  |
| last   | **[0.25, 0.2]**  |
| match  | **[0.7, 0.6]**   |

We will now apply the **Forget Gate** on these word vectors.



## **🔵 Step 2: Compute Forget Gate Scores**
The Forget Gate uses the formula:

$$
f_t = \sigma (W_f \cdot [h_{t-1}, x_t] + b_f)
$$

Let's assume:  
✅ **Weight Matrix $ W_f $**:  
$$
W_f =
\begin{bmatrix}
0.4 & 0.3 \\
0.2 & 0.1
\end{bmatrix}
$$

✅ **Bias $ b_f $**:  
$$
b_f = [0.1, 0.1]
$$

✅ **Previous Hidden State $ h_{t-1} $**:  
$$
h_{t-1} = [0.5, 0.4]
$$

✅ **Applying the Forget Gate** (For each word):

### Example Calculation for "John":
$$
z = W_f \cdot [h_{t-1}, x_{John}] + b_f
$$

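$$
z =
W_f
\cdot
\begin{bmatrix}
0.5 \\ 0.4 \\ 0.8 \\ 0.5
\end{bmatrix}
+
\begin{bmatrix}
0.1 \\ 0.1
\end{bmatrix}
$$

Strictly, multiplying the 4-dimensional concatenated vector requires a $ 2 \times 4 $ weight matrix, so the $ 2 \times 2 $ $ W_f $ above is a simplification. Taking an illustrative pre-activation (simplified for understanding), we get:

$$
z = [0.78, 0.55]
$$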

Applying **sigmoid activation function**:

$$
f_t = \sigma (z) = \frac{1}{1 + e^{-z}}
$$

$$
f_t = [0.68, 0.63]
$$

Interpretation:  
✅ **A score of ~0.68 is fairly high: when "John" arrives, about two-thirds of the previous cell state is kept.**  
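
The hand calculation lists a $ 2 \times 2 $ matrix, but the concatenated vector $ [h_{t-1}, x_{John}] $ has four entries. As a hedged sketch, here is one hypothetical $ 2 \times 4 $ weight matrix that reproduces the pre-activation $ z = [0.78, 0.55] $. Its last two columns are the $ 2 \times 2 $ values given above; its first two columns are invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Columns 0-1 act on h_{t-1} (invented values); columns 2-3 act on x_t
# and reuse the 2 x 2 matrix from the text.
W_f = [[0.1, 0.4, 0.4, 0.3],
       [0.4, 0.1, 0.2, 0.1]]
b_f = [0.1, 0.1]

h_prev = [0.5, 0.4]
x_john = [0.8, 0.5]
concat = h_prev + x_john  # [0.5, 0.4, 0.8, 0.5]

z = [sum(w * v for w, v in zip(row, concat)) + b for row, b in zip(W_f, b_f)]
f_t = [sigmoid(v) for v in z]

print([round(v, 2) for v in z])    # [0.78, 0.55]
print([round(v, 3) for v in f_t])  # approximately [0.686, 0.634]
```

The resulting scores are close to the $ [0.68, 0.63] $ used in the text; the small difference in the first entry comes from rounding.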



### Example Calculation for "is":
$$
z = W_f \cdot [h_{t-1}, x_{is}] + b_f
$$

Computing this:

$$
z = [0.32, 0.25]
$$

Applying sigmoid:

$$
f_t = \sigma (z) = [0.58, 0.56]
$$

Interpretation:  
🤔 **For "is", the scores are slightly lower (~0.58 and ~0.56), so a little less of the previous cell state is kept.**  



### **🟣 Step 3: Apply Forget Scores to Cell State**
Now, let's apply the Forget Gate scores to the **previous cell state** $ C_{t-1} $.  

Let's assume $ C_{t-1} = [0.9, 0.8] $ (previous memory).

For "John":
$$
C_t = f_t * C_{t-1}
$$

$$
= [0.68, 0.63] * [0.9, 0.8]
$$

$$
= [0.612, 0.504]
$$

When "John" arrives, **more** of the previous memory is carried forward.

For "is":
$$
C_t = [0.58, 0.56] * [0.9, 0.8]
$$

$$
= [0.522, 0.448]
$$

When "is" arrives, slightly **less** of the previous memory is carried forward than for "John."



## **🔴 Step 4: Summary of Forget Gate Decisions**
| Word       | Forget Gate Score $ f_t $ | Retained in Memory? |
|------------|----------------|------------------|
| **John**   | **0.68**   | ✅ Kept (important) |
| **is**     | **0.56**   | ❌ Partially forgotten |
| **a**      | **0.40**   | ❌ Mostly forgotten |
| **great**  | **0.75**   | ✅ Kept (important) |
| **football** | **0.80**  | ✅ Kept (important) |
| **player** | **0.85**   | ✅ Kept (important) |
| **He**     | **0.50**   | ❌ Partially forgotten |
| **scored** | **0.90**   | ✅ Kept (important) |
| **goal**   | **0.92**   | ✅ Kept (important) |
| **last**   | **0.30**   | ❌ Mostly forgotten |
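| **match**  | **0.60**   | ❌ Partially forgotten |

Only the scores for **John** and **is** were computed above; the other values are illustrative. Each row reports a single representative component of $ f_t $.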



## **🎯 Final Understanding**
In this simplified illustration, after processing the entire sentence the LSTM has **forgotten unnecessary words** like **"is", "a", "in the last match"**, while **retaining important words** like **"John", "football player", "scored a goal"**.  

### 🔥 **Key Takeaways**
✔ **Forget Gate helps the LSTM focus only on relevant information.**  
✔ **Higher forget score → Memory is retained.**  
✔ **Lower forget score → Memory is removed.**  

This allows LSTM to process long sentences **efficiently** while avoiding information overload! 🚀  

---

### **📖 Manual Example of Input Gate Calculation Using Text**  
Now, let’s go **step by step** to understand how the **Input Gate** in an LSTM works using a **manual example** with actual calculations.  



## **🧠 What is the Input Gate in LSTM?**
The **Input Gate** decides **what new information** should be **added to the cell state**. It controls how much of the **current input** should be stored in the memory.  

Formula for the Input Gate:  
$$
i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)
$$

where:
- $ i_t $ → Input Gate Activation (between 0 and 1, decides how much to store)
- $ W_i $ → Weight matrix for the Input Gate
- $ h_{t-1} $ → Previous hidden state
- $ x_t $ → Current input
- $ b_i $ → Bias for the Input Gate
- $ \sigma $ → Sigmoid activation function



## **🔍 Example Sentence**
Let’s consider the same example:  
📌 **"John is a great football player. He scored a goal."**  

The **goal** is to store the most relevant information in the memory while ignoring unnecessary words.



## **🔢 Step 1: Assign Word Vectors**
Each word is represented as a vector:

| Word  | Word Vector Representation (Simplified) |
|--------|----------------------------|
| John   | **[0.8, 0.5]**   |
| is     | **[0.2, 0.1]**   |
| great  | **[0.9, 0.7]**   |
| football | **[0.7, 0.6]**   |
| player | **[0.85, 0.75]**  |
| He     | **[0.3, 0.2]**   |
| scored | **[0.95, 0.85]**  |
| goal   | **[0.9, 0.8]**   |

Now, let’s compute the **Input Gate Activation** for "John."



## **🟢 Step 2: Compute Input Gate Activation**
Let’s assume:

✅ **Weight Matrix $ W_i $**:  
$$
W_i =
\begin{bmatrix}
0.5 & 0.4 \\
0.3 & 0.2
\end{bmatrix}
$$

✅ **Bias $ b_i $**:  
$$
b_i = [0.1, 0.1]
$$

✅ **Previous Hidden State $ h_{t-1} $**:  
$$
h_{t-1} = [0.5, 0.4]
$$

✅ **Current Input $ x_{John} $**:  
$$
x_t = [0.8, 0.5]
$$

$$
z = W_i \cdot [h_{t-1}, x_t] + b_i
$$

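Expanding:

$$
z =
W_i
\cdot
\begin{bmatrix}
0.5 \\ 0.4 \\ 0.8 \\ 0.5
\end{bmatrix}
+
\begin{bmatrix}
0.1 \\ 0.1
\end{bmatrix}
$$

Strictly, multiplying the 4-dimensional concatenated vector requires a $ 2 \times 4 $ weight matrix, so the $ 2 \times 2 $ $ W_i $ above is a simplification. Taking an illustrative pre-activation:

$$
z = [0.89, 0.64]
$$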

Applying **sigmoid activation function**:

$$
i_t = \sigma (z) = \frac{1}{1 + e^{-z}}
$$

$$
i_t = [0.71, 0.65]
$$

📌 **Interpretation**:
- **"John" is relevant, so the Input Gate assigns high values (~0.71).**  



## **🔵 Step 3: Compute Candidate Memory Content ($\tilde{C}_t$)**
The candidate content is **potential new information** to add to the memory.

$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$

Let’s assume:

✅ **Weight Matrix $ W_C $**:  
$$
W_C =
\begin{bmatrix}
0.6 & 0.5 \\
0.4 & 0.3
\end{bmatrix}
$$

✅ **Bias $ b_C $**:  
$$
b_C = [0.1, 0.1]
$$

$$
z_C = W_C \cdot [h_{t-1}, x_t] + b_C
$$

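Expanding:

$$
z_C =
W_C
\cdot
\begin{bmatrix}
0.5 \\ 0.4 \\ 0.8 \\ 0.5
\end{bmatrix}
+
\begin{bmatrix}
0.1 \\ 0.1
\end{bmatrix}
$$

Strictly, multiplying the 4-dimensional concatenated vector requires a $ 2 \times 4 $ weight matrix, so the $ 2 \times 2 $ $ W_C $ above is a simplification. Taking an illustrative pre-activation:

$$
z_C = [1.12, 0.76]
$$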

Applying **tanh activation function**:

$$
\tilde{C}_t = \tanh(z_C)
$$

$$
= [0.81, 0.64]
$$

📌 **Interpretation**:
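- The candidate values (~0.81 and ~0.64) are the new content proposed for the cell state when **"John"** arrives; the Input Gate then scales how much of it is actually written.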
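
Taking the pre-activations from the hand calculation as given ($ z = [0.89, 0.64] $ for the Input Gate and $ z_C = [1.12, 0.76] $ for the candidate), the two activations can be checked with a few lines of Python. This only verifies the sigmoid and tanh steps, not the matrix product.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z_i = [0.89, 0.64]   # Input Gate pre-activation from the text
z_c = [1.12, 0.76]   # candidate pre-activation from the text

i_t = [sigmoid(z) for z in z_i]
c_tilde = [math.tanh(z) for z in z_c]

print([round(v, 2) for v in i_t])      # [0.71, 0.65]
print([round(v, 2) for v in c_tilde])  # [0.81, 0.64]
```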



## **🟠 Step 4: Update Cell State**
Now, the **Input Gate** decides how much of this new information to store:

$$
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
$$

From the **Forget Gate example** and the steps above, we have:

✅ **Forget Gate** $ f_t = [0.68, 0.63] $  
✅ **Previous Cell State** $ C_{t-1} = [0.9, 0.8] $  
✅ **Input Gate** $ i_t = [0.71, 0.65] $  
✅ **Candidate Memory** $ \tilde{C}_t = [0.81, 0.64] $  

Now, applying the formula:

$$
C_t = [0.68, 0.63] * [0.9, 0.8] + [0.71, 0.65] * [0.81, 0.64]
$$

Breaking it down:

$$
= [0.612, 0.504] + [0.5751, 0.416]
$$

$$
= [1.1871, 0.92]
$$

📌 **Final Interpretation**:
- The **cell state has been updated**, retaining past information and adding new relevant details.  
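- **The new content for "John" is added on top of the retained memory.** Note that $ C_t $ is not limited to the range 0 to 1 (here its first component is above 1), which is why $ \tanh $ is applied before it is exposed as output.  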
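
The arithmetic of this update can be double-checked with a short element-wise computation, using exactly the rounded gate values listed above.

```python
f_t = [0.68, 0.63]      # Forget Gate
c_prev = [0.9, 0.8]     # previous cell state
i_t = [0.71, 0.65]      # Input Gate
c_tilde = [0.81, 0.64]  # candidate memory

kept = [f * c for f, c in zip(f_t, c_prev)]    # f_t * C_{t-1}
added = [i * g for i, g in zip(i_t, c_tilde)]  # i_t * candidate
c_t = [k + a for k, a in zip(kept, added)]

print([round(v, 4) for v in kept])   # [0.612, 0.504]
print([round(v, 4) for v in added])  # [0.5751, 0.416]
print([round(v, 4) for v in c_t])    # [1.1871, 0.92]
```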



## **🎯 Final Summary of Input Gate**
| Word       | Input Gate Score $ i_t $ | Candidate Memory $ \tilde{C}_t $ | Updated Memory $ C_t $ |
|------------|----------------|----------------|----------------|
| **John**   | **0.71**   | **0.81**   | **1.1871** |
| **is**     | **0.45**   | **0.30**   | **0.58** |
| **great**  | **0.75**   | **0.88**   | **1.25** |
| **football** | **0.80**  | **0.92**  | **1.32** |
| **player** | **0.85**   | **0.95**  | **1.38** |



## **🔥 Key Takeaways**
✔ The **Input Gate** decides **how much new information should be stored**.  
✔ **High Input Gate Score → More important information is stored.**  
✔ **The Forget Gate + Input Gate work together** to balance **what to keep** and **what to forget**.  

This is how **LSTMs** maintain memory over long sequences! 🚀  

---

### **🧠 Understanding the Output Gate in LSTM with Manual Calculation**  

Now, let's break down the **Output Gate** in an **LSTM** using **step-by-step manual calculations**, just like we did for the **Forget Gate** and **Input Gate**.  



## **🔍 What is the Output Gate in LSTM?**  
The **Output Gate** decides how much of the **cell state’s information** should be passed to the **next hidden state** ($ h_t $).  

Formula for the **Output Gate Activation**:

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$

where:  
- $ o_t $ → Output Gate activation (decides how much information should be **exposed** as output)  
- $ W_o $ → Weight matrix for the Output Gate  
- $ h_{t-1} $ → Previous hidden state  
- $ x_t $ → Current input  
- $ b_o $ → Bias for the Output Gate  
- $ \sigma $ → Sigmoid activation function  

### **Final Hidden State Calculation**:  

$$
h_t = o_t * \tanh(C_t)
$$

where:  
- $ h_t $ → New hidden state  
- $ C_t $ → Updated Cell State (from Input and Forget Gates)  
- $ \tanh(C_t) $ → Squashing the cell state values between -1 and 1  



## **📖 Example Sentence**
Let’s continue with the same example:  
📌 **"John is a great football player. He scored a goal."**  

We will calculate the **Output Gate Activation** and **Hidden State** for the word "John."



## **🔢 Step 1: Assign Word Vectors**  
We use the same word vectors:

| Word  | Word Vector Representation (Simplified) |
|--------|----------------------------|
| John   | **[0.8, 0.5]**   |
| is     | **[0.2, 0.1]**   |
| great  | **[0.9, 0.7]**   |
| football | **[0.7, 0.6]**   |
| player | **[0.85, 0.75]**  |
| He     | **[0.3, 0.2]**   |
| scored | **[0.95, 0.85]**  |
| goal   | **[0.9, 0.8]**   |



## **🟢 Step 2: Compute Output Gate Activation $ o_t $**  
Let’s assume:

✅ **Weight Matrix $ W_o $**:  
$$
W_o =
\begin{bmatrix}
0.4 & 0.3 \\
0.2 & 0.1
\end{bmatrix}
$$

✅ **Bias $ b_o $**:  
$$
b_o = [0.05, 0.05]
$$

✅ **Previous Hidden State $ h_{t-1} $**:  
$$
h_{t-1} = [0.5, 0.4]
$$

✅ **Current Input $ x_{John} $**:  
$$
x_t = [0.8, 0.5]
$$

$$
z_o = W_o \cdot [h_{t-1}, x_t] + b_o
$$

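Expanding:

$$
z_o =
W_o
\cdot
\begin{bmatrix}
0.5 \\ 0.4 \\ 0.8 \\ 0.5
\end{bmatrix}
+
\begin{bmatrix}
0.05 \\ 0.05
\end{bmatrix}
$$

Strictly, multiplying the 4-dimensional concatenated vector requires a $ 2 \times 4 $ weight matrix, so the $ 2 \times 2 $ $ W_o $ above is a simplification. Taking an illustrative pre-activation:

$$
z_o = [0.67, 0.38]
$$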

Applying **sigmoid activation function**:

$$
o_t = \sigma (z_o) = \frac{1}{1 + e^{-z_o}}
$$

$$
o_t = [0.66, 0.59]
$$

📌 **Interpretation**:  
- **The Output Gate assigns moderate values (~0.66), meaning "John" should contribute moderately to the hidden state.**  



## **🔵 Step 3: Compute Final Hidden State $ h_t $**  
Now, we use the updated **cell state** ($ C_t $) computed in the Input Gate example.  

✅ **Updated Cell State $ C_t $ from Input & Forget Gates**:  
$$
C_t = [1.1871, 0.92]
$$

Applying **tanh activation**:

$$
\tanh(C_t) = [\tanh(1.1871), \tanh(0.92)]
$$

Approximating:

$$
\tanh(C_t) = [0.83, 0.73]
$$

Now, calculating $ h_t $:

$$
h_t = o_t * \tanh(C_t)
$$

$$
h_t = [0.66, 0.59] * [0.83, 0.73]
$$

$$
= [0.5478, 0.4307]
$$

📌 **Interpretation**:
- **The new hidden state** ($ h_t $) **is a filtered, squashed view of the cell state**: it is what the next time step and any downstream layers actually see.
- **Since the Output Gate was moderately open (~0.66), it allows partial information to flow.**  
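
The hidden-state calculation can be reproduced in a few lines. The sketch keeps full precision for $ \tanh $ instead of rounding to two decimals first, so its result agrees with the hand calculation only up to the rounding of intermediate values.

```python
import math

o_t = [0.66, 0.59]    # Output Gate activation from the text
c_t = [1.1871, 0.92]  # updated cell state from the text

squashed = [math.tanh(c) for c in c_t]
h_t = [o * s for o, s in zip(o_t, squashed)]

print([round(v, 3) for v in squashed])  # approximately [0.83, 0.726]
print([round(v, 3) for v in h_t])       # approximately [0.548, 0.428]
```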



## **🎯 Final Summary of Output Gate**
| Word       | Output Gate Score $ o_t $ | Cell State $ C_t $ | $ \tanh(C_t) $ | Hidden State $ h_t $ |
|------------|----------------|----------------|----------------|----------------|
| **John**   | **0.66**   | **1.1871**   | **0.83**   | **0.5478** |
| **is**     | **0.45**   | **0.58**   | **0.52**   | **0.234** |
| **great**  | **0.75**   | **1.25**   | **0.85**   | **0.6375** |
| **football** | **0.80**  | **1.32**  | **0.87**  | **0.696** |
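| **player** | **0.85**   | **1.38**  | **0.89**  | **0.7565** |

Only the **John** row was computed above (its first vector component is shown); the other rows are illustrative.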



## **🔥 Key Takeaways**
✔ The **Output Gate** determines **how much information flows to the next step**.  
✔ The **higher the Output Gate value**, the more information is exposed in the **hidden state**.  
✔ **The hidden state is passed, together with the cell state, to the next time step (the next word in the sequence).**  



## **🔗 Full LSTM Recap**
✔ **Forget Gate** → Decides **what to forget**.  
✔ **Input Gate** → Decides **what to store**.  
✔ **Output Gate** → Decides **what to expose as output**.  

🚀 **Together, these gates make LSTMs powerful for handling long-term dependencies in sequences!**  

---