<a href="https://www.kaggle.com/code/ayushs9020/inventing-lstm-from-scratch?scriptVersionId=132608861" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# LSTM 🔃️️

$Long$ $Short-Term$ $Memory$ ($LSTM$) is a type of $Recurrent$ $Neural$ $Network$ ($RNN$) that is capable of `learning long-term dependencies`. $LSTM$s are commonly used for `natural language processing` tasks such as
* $Machine$ $Translation$
* $Text$ $Summarization$
* $Question$ $Answering$

$LSTMs$ are made up of `cells`, each of `which has a cell state` and three gates: 
* $Input$ $Gate$ - decides what information to add to the cell state
* $Forget$ $Gate$ - decides what information to remove from the cell state
* $Output$ $Gate$ - decides what information to output from the cell state

<img src = "https://miro.medium.com/v2/resize:fit:984/1*Mb_L_slY9rjMr8-IADHvwg.png" width = 500>

# 1 | Data 🚀
Let's get our data ready to work with

In [1]:
import pandas as pd
import numpy as np
import re

import torch

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/2023-kaggle-ai-report/sample_submission.csv
/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json
/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv


In [2]:
data = pd.read_csv("/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv")
data

Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,Writeup,Writeup URL
0,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 00:06:46,Released: my Source Code and Analysis,<p>I had a lot of fun with this competition an...,https://www.kaggle.com/c/2447/discussion/185
1,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 04:38:53,6th place(UriB) by Uri Blass,<P>I calculated rating for every player in mon...,https://www.kaggle.com/c/2447/discussion/192
2,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/23/2010 10:38:23,7th place - littlefish,I'm a little surprised I ended up in the top-1...,https://www.kaggle.com/c/2447/discussion/194
3,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 11:27:17,3rd place: Chessmetrics - Variant,"<p><span id=""post_text_content_1230""><div dir=...",https://www.kaggle.com/c/2447/discussion/193
4,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 02:44:10,2nd place: TrueSkill Through Time,"Wow, this is a surprise! I looked at this comp...",https://www.kaggle.com/c/2447/discussion/186
...,...,...,...,...,...,...,...
3122,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 09:45:01,49th place silver solution,<p>Thank you Kaggle and Pop sign for hosting t...,https://www.kaggle.com/c/46105/discussion/406426
3123,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 10:13:31,10th place solution,"<blockquote>\n <p>First, I would like to than...",https://www.kaggle.com/c/46105/discussion/406434
3124,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 03:24:28,Solution - Single transformer without val dataset,<p>Thanks to the organisers of the PopSign Gam...,https://www.kaggle.com/c/46105/discussion/406346
3125,02/23/2023 17:25:32,Google - Isolated Sign Language Recognition,https://www.kaggle.com/c/46105,05/02/2023 04:01:15,Top 8% Bronze Medal Solution,<blockquote>\n <p><strong>Many congratulation...,https://www.kaggle.com/c/46105/discussion/406354


For now we will focus only on the `Writeup` column; we will try to use the other columns in upcoming versions

Our data is stored in a `CSV File`. We need to extract the writeups into one large text corpus (a single string)

In [3]:
data["Writeup"]

0       <p>I had a lot of fun with this competition an...
1       <P>I calculated rating for every player in mon...
2       I'm a little surprised I ended up in the top-1...
3       <p><span id="post_text_content_1230"><div dir=...
4       Wow, this is a surprise! I looked at this comp...
                              ...                        
3122    <p>Thank you Kaggle and Pop sign for hosting t...
3123    <blockquote>\n  <p>First, I would like to than...
3124    <p>Thanks to the organisers of the PopSign Gam...
3125    <blockquote>\n  <p><strong>Many congratulation...
3126    <p>Thank you Kaggle, Kagglers, PopSign, and Pa...
Name: Writeup, Length: 3127, dtype: object

We can't just concatenate all of this into a single string. Let me show you what the data actually looks like

In [4]:
print(data["Writeup"][0])

<p>I had a lot of fun with this competition and learned a lot about ratings systems.</p>
<div>Sadly, I only came 18th :)</div>
<div>If you're interested, you can download all of my code and&nbsp;analysis&nbsp;from my github repo:&nbsp;https://github.com/jbrownlee/ChessML</div>
<div>There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).</div>
<div>Thanks all. Looking forward to the next big comp!</div>
<div>jasonb</div>


Notice that the data contains many `HTML tags` and links. We do not need them, so let's just remove all of them

In [5]:
print(
    re.sub(
        ':' , " " , 
        re.sub(
            ';' , ' ' , 
            re.sub(
                '&nbsp' , "" , 
                (
                    re.sub(
                        r'http\S+', ' ', 
                        (
                            re.compile(r'<.*?>').sub(
                                "" , 
                                data["Writeup"][0]
                            )
                        )
                    )
                )
            )
        )
    )
)

I had a lot of fun with this competition and learned a lot about ratings systems.
Sadly, I only came 18th  )
If you're interested, you can download all of my code and analysis from my github repo   
There are implementations of a few rating systems (elo, glicko, chessmetrics, etc) and many attempts at improving them (a nice little experimentation framework).
Thanks all. Looking forward to the next big comp!
jasonb
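
The nested `re.sub` calls above can also be written as a flat sequence of steps, which is easier to read and reuse. The sketch below is only an illustration: the helper name `clean_writeup` and the sample string are assumptions, and the steps are applied in the same order as in the cell above.

```python
import re

# Patterns applied in the same order as the nested re.sub calls above
TAG = re.compile(r'<.*?>')     # HTML tags
URL = re.compile(r'http\S+')   # links

def clean_writeup(raw):
    """Hypothetical helper: strip tags, links, '&nbsp', ';' and ':' from one writeup."""
    cleaned = TAG.sub("", str(raw))
    cleaned = URL.sub(" ", cleaned)
    cleaned = cleaned.replace("&nbsp", "")
    cleaned = cleaned.replace(";", " ")
    cleaned = cleaned.replace(":", " ")
    return cleaned

# Made-up sample in the style of the first writeup
sample = "<div>my repo:&nbsp;https://github.com/jbrownlee/ChessML</div>"
print(repr(clean_writeup(sample)))

# A whole corpus can then be built with one join instead of repeated `+=`
corpus = "".join(clean_writeup(w) for w in [sample, "<p>Thanks all.</p>"])
```

Note that the `;` left over from `&nbsp;` is what turns into the space between words, which is why the steps must stay in this order.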


In [6]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
    u"\U00002702-\U000027B0"  # dingbats
    u"\U000024C2-\U0001F251"  # very wide range: also covers CJK and other scripts
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"  # variation selector-16
    u"\u3030"
    u"\u2028"
    "\x08"
    u"\u200a"
    u"\u200b"
                  "]+", re.UNICODE)
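
To see what such a pattern does, here is a small sketch that uses only the first four ranges from the cell above. The name `emoji_subset` and the helper `strip_emoji` are made up for this illustration; the full `emoj` pattern removes far more characters than this reduced one.

```python
import re

# Reduced, illustrative pattern: only the four emoji blocks listed first above
emoji_subset = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "]+"
)

def strip_emoji(s):
    # Replace every run of matched characters with nothing
    return emoji_subset.sub("", s)

print(repr(strip_emoji("great comp \U0001F600\U0001F680 thanks")))
```

The spaces around the removed emoji stay in place, so the cleaned text can contain double spaces.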

In [7]:
text = str()
for i in data["Writeup"]: 
    k = re.sub(
        ':' , " " , 
        re.sub(
            ';' , ' ' , 
            re.sub(
                '&nbsp' , '' , 
                (
                    re.sub(
                        r'http\S+', ' ', 
                        (
                            re.compile(r'<.*?>').sub(
                                "" , str(i)
                            )
                        )
                    )
                )
            )
        )
    )
    k = emoj.sub(r'' , k)
    text += k

Now we have a large corpus of data, and we can finally move on to building a model

# 2 | Embeddings/Tokenizing 🔢

Let's assume we have the word `Shrek`, and we want to represent this word in terms of numbers. We don't know how yet, but we want to.

So, how can we represent this as a number...?
One way is to `OneHotEncode` it like this

|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|Result
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|0|0|1|0|0|1|0|0|1|0|0|0|0|0|0|1|1|0|0|0|0|0|0|0|Shrek

But this is not a good way, as it represents not only `Shrek` but also `Kresh` or any other rearrangement of the same letters. One way around this is to build such a row for every letter and stack them all

|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|Result
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|S
|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|h
|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|r
|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|e
|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|k
|Result||||||||||||||||||||||||||Shrek
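
The difference between the two tables can be checked directly. This is a minimal sketch assuming a fixed lowercase alphabet `a-z`; the function names `bag_of_letters` and `stacked_one_hot` are made up for the illustration.

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase

def bag_of_letters(word):
    # One row for the whole word: 1 wherever a letter occurs (order is lost)
    vec = np.zeros(len(ALPHABET), dtype=int)
    for ch in word.lower():
        vec[ALPHABET.index(ch)] = 1
    return vec

def stacked_one_hot(word):
    # One row per letter, stacked in order (order is kept)
    mat = np.zeros((len(word), len(ALPHABET)), dtype=int)
    for row, ch in enumerate(word.lower()):
        mat[row, ALPHABET.index(ch)] = 1
    return mat

same_bag = np.array_equal(bag_of_letters("Shrek"), bag_of_letters("Kresh"))
same_stack = np.array_equal(stacked_one_hot("Shrek"), stacked_one_hot("Kresh"))
print(same_bag, same_stack)
```

The single-row encoding cannot tell the two words apart, while the stacked version can.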

These tables use the full alphabet `a-z` as columns, but our corpus does not come with such a fixed alphabet. We have to build the columns from the characters that actually appear in the text

So how can we do this in code?

First we need to make a list of all the distinct letters in the word. Let's handle the general case, where letters can repeat. Let's call that list `vocab`

In [8]:
import tqdm

In [9]:
sample_name = "Shrek"
vocab = []
for letter in sample_name :
    if not (letter in vocab) :
        vocab.append(letter)
        
vocab

['S', 'h', 'r', 'e', 'k']

Now we map these letters to integer ids, and the ids back to letters, using two dictionaries

In [10]:
char_id = dict()
id_char = dict()

for i,char in enumerate(vocab):
    char_id[char] = i
    id_char[i] = char

Now we can `Encode` them easily 

In [11]:
name_list = []
x = np.zeros([6 , len(vocab)])

for index in range(len(sample_name)):
    
    x[index][char_id[sample_name[index]]] = 1

name_list.append(x)

name_list

[array([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.]])]
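
Since `id_char` maps ids back to letters, an encoded matrix can be turned back into text. The sketch below rebuilds the small `Shrek` example on its own; the `decode` helper is an assumption added for illustration and skips all-zero rows such as the extra sixth row above.

```python
import numpy as np

sample_name = "Shrek"
vocab = list(dict.fromkeys(sample_name))          # distinct letters, first-seen order
char_id = {ch: i for i, ch in enumerate(vocab)}
id_char = {i: ch for ch, i in char_id.items()}

# One row per letter, as in the cell above
encoded = np.zeros((len(sample_name), len(vocab)))
for row, ch in enumerate(sample_name):
    encoded[row, char_id[ch]] = 1

def decode(matrix):
    # The position of the 1 in each row is the character id; all-zero rows are padding
    return "".join(id_char[int(r.argmax())] for r in matrix if r.any())

print(decode(encoded))
```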

Let's build the vocabulary for the whole corpus and then make a function for this

In [12]:
vocab = []
for letter in tqdm.tqdm(text , total = len(text)) :
    if not (letter in vocab) :
        vocab.append(letter)
        
char_id = dict()
id_char = dict()

for i,char in tqdm.tqdm(enumerate(vocab) , total = len(vocab)):
    char_id[char] = i
    id_char[i] = char

100%|██████████| 10084907/10084907 [00:08<00:00, 1138488.98it/s]
100%|██████████| 273/273 [00:00<00:00, 421747.70it/s]
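
Checking `letter in vocab` on a list scans the whole list every time. A dictionary gives the same result without the scan, because `dict` keys are unique and keep their insertion order (Python 3.7+). The helper name `build_vocab` and the sample string are assumptions for this sketch.

```python
def build_vocab(corpus):
    # dict.fromkeys keeps each character once, in order of first appearance
    vocab = list(dict.fromkeys(corpus))
    char_id = {ch: i for i, ch in enumerate(vocab)}
    id_char = {i: ch for ch, i in char_id.items()}
    return vocab, char_id, id_char

vocab_demo, char_id_demo, id_char_demo = build_vocab("Shrek likes kaggle")
print(vocab_demo)
```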


In [13]:
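def encoder(sample_name , vocabu = vocab):
    # One-hot encode `sample_name`, one row per character.
    # Relies on the global `char_id` mapping built above.
    
    name_list = []
    
    # Always 273 rows: shorter inputs are padded with all-zero rows,
    # inputs longer than 273 characters raise an IndexError
    x = np.zeros([273 , len(vocabu)])

    for index in range(len(sample_name)):

        x[index][char_id[sample_name[index]]] = 1

    name_list.append(x)

    return name_list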

# 3 | LSTM Cell ♾️️

Now just remember what I say. This is going to get a little bit confusing. But believe me and just jump off the cliff, doing a `yeeet`

## 3.1 | Path 1

<img src = "https://miro.medium.com/v2/resize:fit:984/1*Mb_L_slY9rjMr8-IADHvwg.png" width = 500>

* First of all, let's assume we are a sentence $X_i$.
* We enter the $LSTM$ cell.
* We meet another sentence with the same `dimensions` as ours, named $H_{i-1}$.
* We both decide to go further together.
* We first reach the `Forget Gate`.
* The `Forget Gate` is just a matrix.
* We multiply ourselves by the `Forget Gate`.
* We then apply the `Sigmoid` function to ourselves.
* Then we meet another sentence just like us, named $C_{i-1}$.
* We multiply ourselves with it element-wise.
* Then we go further and find an operator like this $(+)$; for the moment we decide to ignore it.
* Then we go straight and move out of the `LSTM` cell with the name $C_i$.

Let's assume we have these $2$ sentences

In [14]:
text[:273] , text[273:547]

("I had a lot of fun with this competition and learned a lot about ratings systems.\r\nSadly, I only came 18th  )\r\nIf you're interested, you can download all of my code and analysis from my github repo   \r\nThere are implementations of a few rating systems (elo, glicko, chessme",
 'trics, etc) and many attempts at improving them (a nice little experimentation framework).\r\nThanks all. Looking forward to the next big comp!\r\njasonbI calculated rating for every player in months 101-105 and after having the rating I have a simple formula to calculate the e')

If we encode this we get 

In [15]:
encoder(text[:273]) , encoder(text[273:546])

([array([[1., 0., 0., ..., 0., 0., 0.],
         [0., 1., 0., ..., 0., 0., 0.],
         [0., 0., 1., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]])],
 [array([[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 1., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 1., 0., ..., 0., 0., 0.]])])

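These become our $X_i$ and $H_{i - 1}$

First we concatenate the $2$. Note that the cell below encodes only $6$ characters for each of them; `encoder` still pads every input to $273$ rows, so the shapes stay the same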

In [16]:
sample_input = np.concatenate((np.array(encoder(text[:6])) , 
                              np.array(encoder(text[6:12]))) , axis = 1)

sample_input = torch.tensor(sample_input)
sample_input = sample_input.type(torch.LongTensor)

sample_input , sample_input.shape

(tensor([[[1, 0, 0,  ..., 0, 0, 0],
          [0, 1, 0,  ..., 0, 0, 0],
          [0, 0, 1,  ..., 0, 0, 0],
          ...,
          [0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0]]]),
 torch.Size([1, 546, 273]))

There is an extra dimension of size $1$, so we need to get rid of it

In [17]:
sample_input = sample_input[-1 , : , :]
sample_input.shape

torch.Size([546, 273])
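
The shape bookkeeping of the last two cells can be reproduced on tiny arrays. This is a NumPy sketch with made-up sizes (`seq_len = 4`, `vocab_size = 3`), not the notebook's real data.

```python
import numpy as np

seq_len, vocab_size = 4, 3                      # small stand-ins for 273 and 273
x_i = np.zeros((1, seq_len, vocab_size))        # encoder-style output: leading axis of size 1
h_prev = np.ones((1, seq_len, vocab_size))

# Joining along axis 1 stacks the rows of H_{i-1} under the rows of X_i
stacked = np.concatenate((x_i, h_prev), axis=1)
print(stacked.shape)

# Indexing the leading axis removes the extra dimension
squeezed = stacked[-1, :, :]
print(squeezed.shape)
```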

Now we will multiply this with a `Forget Gate`. The Forget Gate is just a normal matrix, which we initialize with random values. But when we `typecast` it to `LongTensor`, we get only $0$ values. That is because all the random values are smaller than $1$ in the first place, so they truncate to $0$. So we multiply them by $10$ first

In [18]:
forget_gate = torch.rand((273 , 546)) * 10
forget_gate = forget_gate.type(torch.LongTensor)
forget_gate

tensor([[4, 0, 5,  ..., 7, 0, 4],
        [0, 5, 5,  ..., 3, 9, 5],
        [7, 0, 5,  ..., 2, 1, 8],
        ...,
        [0, 2, 3,  ..., 2, 3, 6],
        [0, 7, 4,  ..., 5, 2, 2],
        [5, 3, 6,  ..., 7, 8, 9]])
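
The truncation is easy to check on a small array. This NumPy sketch uses an arbitrary seed and shape; keeping the weights as floats would avoid the problem altogether, but then the input would have to be converted to floats as well.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random((3, 4))                  # uniform values in [0, 1)

as_int = weights.astype(np.int64)             # every value truncates to 0
scaled = (weights * 10).astype(np.int64)      # integers between 0 and 9

print(as_int)
print(scaled)
```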

Now we will multiply the `sample_input` with `forget_gate` and name it `forget_output`

In [19]:
forget_output = forget_gate @ sample_input
forget_output , forget_output.shape

(tensor([[ 4, 14,  5,  ...,  0,  0,  0],
         [ 0, 20,  5,  ...,  0,  0,  0],
         [ 7, 17,  5,  ...,  0,  0,  0],
         ...,
         [ 0, 16,  3,  ...,  0,  0,  0],
         [ 0, 21,  4,  ...,  0,  0,  0],
         [ 5, 24,  6,  ...,  0,  0,  0]]),
 torch.Size([273, 273]))

Now we simply apply the sigmoid function to this

In [20]:
forget_output = torch.sigmoid(forget_output)
forget_output

tensor([[0.9820, 1.0000, 0.9933,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 1.0000, 0.9933,  ..., 0.5000, 0.5000, 0.5000],
        [0.9991, 1.0000, 0.9933,  ..., 0.5000, 0.5000, 0.5000],
        ...,
        [0.5000, 1.0000, 0.9526,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 1.0000, 0.9820,  ..., 0.5000, 0.5000, 0.5000],
        [0.9933, 1.0000, 0.9975,  ..., 0.5000, 0.5000, 0.5000]])
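
The values in this output follow directly from the integer inputs: a $0$ becomes exactly $0.5$, and anything larger than about $10$ is printed as $1.0000$. A quick check in plain Python, with a hand-written `sigmoid` for illustration:

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

for z in (0, 4, 5, 7, 14):
    print(z, round(sigmoid(z), 4))
```

Because the gate values here are integers between $0$ and $9$, most sums land far from $0$ and the sigmoid saturates.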

Now we introduce another `random Tensor` named `Cell State`, and we multiply it element-wise with our `Forget Output`

In [21]:
cell_state = torch.rand((273 , 273)) * 10
cell_state = cell_state.type(torch.LongTensor)

cell_state = cell_state * forget_output
cell_state

tensor([[6.8741, 4.0000, 1.9866,  ..., 2.0000, 1.0000, 0.0000],
        [4.5000, 1.0000, 0.9933,  ..., 0.5000, 0.5000, 0.5000],
        [3.9964, 2.0000, 3.9732,  ..., 0.0000, 3.5000, 4.0000],
        ...,
        [3.0000, 6.0000, 7.6206,  ..., 3.0000, 1.0000, 0.0000],
        [2.5000, 5.0000, 2.9460,  ..., 1.0000, 3.0000, 2.0000],
        [2.9799, 7.0000, 5.9852,  ..., 2.5000, 2.0000, 1.0000]])
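
The multiplication between the forget output and the cell state is element-wise (a Hadamard product), which is what `*` does on tensors and arrays; `@` is matrix multiplication. A small NumPy sketch with made-up values shows the difference:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise = a * b   # each entry multiplied with the entry at the same position
matmul = a @ b        # rows of a combined with columns of b

print(elementwise)
print(matmul)
```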

And at this point our first path ends. 

## 3.2 | Path 2 

Now we come to our second path. 

* Now we leave the `Forget Gate` and move to the `Input Gate` $I_i$.
* Again, the `Input Gate` is just a matrix like the `Forget Gate`.
* We again go through a `Sigmoid Function`.
* Next comes an `Element-wise Multiplication`; we skip it for now, because its other operand is only produced in Path 3.
* After that multiplication, the result is added element-wise to the `Cell State` on the way to $C_i$. We do this addition at the end of Path 3 as well.

For now we only keep the sigmoid output of the gate and name it `input_output`

In [22]:
input_gate = torch.rand((273 , 546)) * 10
input_gate = input_gate.type(torch.LongTensor)

input_output = input_gate @ sample_input

input_output = torch.sigmoid(input_output)

## 3.3 | Path 3

The next path is the same, but it uses `Tanh` instead of `Sigmoid`

On this path we meet the `Element-wise Multiplication` again. This time we do not ignore it: we multiply the `Tanh` output with `input_output`, and then add the result to the `Cell State`

In [23]:
input_node = torch.rand((273 , 546)) * 10
input_node = input_node.type(torch.LongTensor)

input_n_output = input_node @ sample_input

input_n_output = torch.tanh(input_n_output)

# Gate the candidate values with the input gate, then add them to the cell state
input_output = input_output * input_n_output

cell_state = cell_state + input_output

## 3.4 | Path 4

Now comes our final path

* This time we go to the `Output Gate`.
* This is also a matrix, just like before.
* We again go through the `Sigmoid Function`.
* Then we are multiplied element-wise with the `Cell State` after it has passed through `Tanh`.
* Then we come out as $H_i$.

In [24]:
output_gate = torch.rand((273 , 546)) * 10
output_gate = output_gate.type(torch.LongTensor)

output_output = output_gate @ sample_input

output_output = torch.sigmoid(output_output)

cell_state = torch.tanh(cell_state)

hidden_state = cell_state * output_output

And we are done with all our paths

Let's summarize all the paths we walked. Below, $W_F$, $W_I$, $W_N$ and $W_O$ are the four gate matrices, and $\odot$ is element-wise multiplication

* First we are an input $X_i$
* We meet another input $H_{i-1}$ and are concatenated with it $$Z_i = [X_i ; H_{i - 1}]$$
* Then we go to $4$ different gates
* * $Forget$ $Gate$
* * * We are multiplied by the `Forget Gate` and go through the sigmoid function $$F_i = \sigma (W_F @ Z_i)$$
* * * Then the result is multiplied element-wise with the `Cell State` $C_{i-1}$ $$F_i \odot C_{i - 1}$$
* * $Input$ $Gate$
* * * At the same time we also go through the `Input Gate` and the `Sigmoid Function` $$I_i = \sigma(W_I @ Z_i)$$
* * $Input$ $Node$
* * * Again at the same time we go through the `Input Node` and `Tanh` $$I_{n_i} = \tanh(W_N @ Z_i)$$
* * * Then we multiply $I_{n_i}$ and $I_i$ element-wise and add the result to the forgotten cell state, which gives the new `Cell State` $$C_i = F_i \odot C_{i - 1} + I_i \odot I_{n_i}$$
* * $Output$ $Gate$
* * * At the same time we also go through the `Output Gate` and the `Sigmoid Function` $$O_i = \sigma(W_O @ Z_i)$$
* * * We pass the new `Cell State` $C_i$ through `Tanh` and multiply it element-wise with $O_i$ $$H_i = O_i \odot \tanh(C_i)$$

And that's how we mapped the $LSTM$. We repeat this at every step over the training period, adjusting the `Gates` so that each one only passes the values it is named for

* The forget gate determines how much of the previous cell state is forgotten.
* The input gate determines how much new information is added to the cell state.
* The output gate determines how much of the cell state is outputted.
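
Putting the three roles together, one step of the cell can be sketched in a few lines of NumPy. This is a bias-free illustration under assumptions: the function name `lstm_step`, the `gates` dictionary and its keys, and the small sizes are all made up here, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, gates):
    """One bias-free LSTM step.

    x, h_prev, c_prev: arrays of shape (n, m)
    gates: dict with keys 'forget', 'input', 'node', 'output',
           each a matrix of shape (n, 2n)  (hypothetical container)
    """
    z = np.concatenate((x, h_prev), axis=0)      # stack input on previous hidden state: (2n, m)
    forget = sigmoid(gates["forget"] @ z)        # how much of the old cell state to keep
    inp = sigmoid(gates["input"] @ z)            # how much new information to add
    node = np.tanh(gates["node"] @ z)            # candidate values in (-1, 1)
    out = sigmoid(gates["output"] @ z)           # how much of the cell state to output
    c = forget * c_prev + inp * node             # new cell state
    h = out * np.tanh(c)                         # new hidden state
    return h, c

rng = np.random.default_rng(0)
n = 4
gates = {name: rng.normal(scale=0.1, size=(n, 2 * n))
         for name in ("forget", "input", "node", "output")}
h, c = lstm_step(np.eye(n), np.zeros((n, n)), np.zeros((n, n)), gates)
print(h.shape, c.shape)
```

Feeding `h` and `c` back in as `h_prev` and `c_prev` for the next input is what makes the network recurrent.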

Now we will write a class for the $LSTM$, making our concepts more concrete

In [25]:
class LSTM: pass

Now let's add an initializer

In [26]:
class LSTM:
    
    def __init__(self , text , batch = None):
        
        self.text = text
        
        # No size given: fall back to the number of distinct characters
        if batch is None:
            batch = len(set(self.text))
        
        self.batch = batch

Now we will initialize the `Gates`, the `Cell State` and the `Hidden State`

In [27]:
import warnings

class LSTM:
    
    def __init__(self , text , batch = None , lr = 0.0001):
        
        self.text = text
        self.lr = lr
        
        if batch is None:
            
            # No size given: fall back to the number of distinct characters and warn (do not raise)
            batch = len(set(self.text))
            
            warnings.warn("Batch not defined. Using the whole data at once. This can cost exceptionally high computation power" , UserWarning)
        
        self.batch = batch
        self.gate_shape = (self.batch , 2 * self.batch)
        
        self.forget_gate = torch.rand(self.gate_shape)
        self.input_gate = torch.rand(self.gate_shape)
        self.input_node = torch.rand(self.gate_shape)
        self.output_gate = torch.rand(self.gate_shape)
        
        self.cell_state = torch.rand((self.batch , self.batch))
        
        self.hidden_state = torch.rand((self.batch , self.batch))

Now we will initialize gradients for the gates

In [28]:
import warnings

class LSTM:
    
    def __init__(self , text , batch = None , lr = 0.0001):
        
        self.text = text
        self.lr = lr
        
        if batch is None:
            
            # No size given: fall back to the number of distinct characters and warn (do not raise)
            batch = len(set(self.text))
            
            warnings.warn("Batch not defined. Using the whole data at once. This can cost exceptionally high computation power" , UserWarning)
            
        self.batch = batch
        self.gate_shape = (self.batch , 2 * self.batch)
        
        self.forget_gate = torch.rand(self.gate_shape)
        self.input_gate = torch.rand(self.gate_shape)
        self.input_node = torch.rand(self.gate_shape)
        self.output_gate = torch.rand(self.gate_shape)
        
        self.cell_state = torch.rand((self.batch , self.batch))
        
        self.hidden_state = torch.rand((self.batch , self.batch))
        
        # Gradient buffers, one per parameter, all starting at zero
        self.forget_grad = torch.zeros_like(self.forget_gate)
        self.input_grad = torch.zeros_like(self.input_gate)
        self.input_node_grad = torch.zeros_like(self.input_node)
        self.output_grad = torch.zeros_like(self.output_gate)
        
        self.cell_grad = torch.zeros_like(self.cell_state)

Now we will add the `forward` method to the class

In [29]:
import warnings

class LSTM:
    
    def __init__(self , text , batch = None , lr = 0.0001):
        
        self.text = text
        self.lr = lr
        
        if batch is None:
            
            # No size given: fall back to the number of distinct characters and warn (do not raise)
            batch = len(set(self.text))
            
            warnings.warn("Batch not defined. Using the whole data at once. This can cost exceptionally high computation power" , UserWarning)
            
        self.batch = batch
        self.gate_shape = (self.batch , 2 * self.batch)
        
        self.forget_gate = torch.rand(self.gate_shape)
        self.input_gate = torch.rand(self.gate_shape)
        self.input_node = torch.rand(self.gate_shape)
        self.output_gate = torch.rand(self.gate_shape)
        
        self.cell_state = torch.rand((self.batch , self.batch))
        
        self.hidden_state = torch.rand((self.batch , self.batch))
        
        self.forget_grad = torch.zeros_like(self.forget_gate)
        self.input_grad = torch.zeros_like(self.input_gate)
        self.input_node_grad = torch.zeros_like(self.input_node)
        self.output_grad = torch.zeros_like(self.output_gate)
        
        self.cell_grad = torch.zeros_like(self.cell_state)
        
    def forward(self , x):
        
        # x: concatenated input of shape (2 * batch , batch), like `sample_input` above
        x = x.float()
        
        forget_output = torch.sigmoid(self.forget_gate @ x)
        input_output = torch.sigmoid(self.input_gate @ x)
        input_node_output = torch.tanh(self.input_node @ x)
        output_output = torch.sigmoid(self.output_gate @ x)
        
        # Forget part of the old cell state
        self.cell_state = forget_output * self.cell_state
        
        # Gate the candidate values and add them to the cell state
        input_output = input_output * input_node_output
        
        self.cell_state = self.cell_state + input_output
        
        self.hidden_state = torch.tanh(self.cell_state) * output_output
        
        return self.hidden_state

Now we will update the gates with the error. We will not actually calculate the error; we assume that it is supplied by the user. We will also add `alpha`, `epsilon` and the running `update` accumulators to the initializer

In [30]:
import warnings

class LSTM:
    
    def __init__(self , text , batch = None , lr = 0.0001 , 
                alpha = 0.01 , epsilon = 1e-7):
        
        self.text = text
        self.alpha = alpha
        self.epsilon = epsilon
        
        self.lr = lr
        
        if batch is None:
            
            # No size given: fall back to the number of distinct characters and warn (do not raise)
            batch = len(set(self.text))
            
            warnings.warn("Batch not defined. Using the whole data at once. This can cost exceptionally high computation power" , UserWarning)
            
        self.batch = batch
        self.gate_shape = (self.batch , 2 * self.batch)
        
        self.forget_gate = torch.rand(self.gate_shape)
        self.input_gate = torch.rand(self.gate_shape)
        self.input_node = torch.rand(self.gate_shape)
        self.output_gate = torch.rand(self.gate_shape)
        
        self.cell_state = torch.rand((self.batch , self.batch))
        
        self.hidden_state = torch.rand((self.batch , self.batch))
        
        self.forget_grad = torch.zeros_like(self.forget_gate)
        self.input_grad = torch.zeros_like(self.input_gate)
        self.input_node_grad = torch.zeros_like(self.input_node)
        self.output_grad = torch.zeros_like(self.output_gate)
        
        self.cell_grad = torch.zeros_like(self.cell_state)
        
        # Running averages of the squared error, one per parameter
        self.forget_update = 0
        self.input_update = 0
        self.input_node_update = 0
        self.output_update = 0
        
        self.cell_update = 0
        
    def forward(self , x):
        
        # x: concatenated input of shape (2 * batch , batch), like `sample_input` above
        x = x.float()
        
        forget_output = torch.sigmoid(self.forget_gate @ x)
        input_output = torch.sigmoid(self.input_gate @ x)
        input_node_output = torch.tanh(self.input_node @ x)
        output_output = torch.sigmoid(self.output_gate @ x)
        
        # Forget part of the old cell state
        self.cell_state = forget_output * self.cell_state
        
        # Gate the candidate values and add them to the cell state
        input_output = input_output * input_node_output
        
        self.cell_state = self.cell_state + input_output
        
        self.hidden_state = torch.tanh(self.cell_state) * output_output
        
        return self.hidden_state
        
    def update(self , error):
        
        # RMSProp-style step (assumed intent). `error` is a scalar, or a tensor
        # broadcastable to every parameter, and plays the role of the gradient
        error = torch.as_tensor(error , dtype = torch.float32)
        
        squared = (error ** 2) * self.lr 
        
        self.forget_update = 0.9 * self.forget_update + squared
        self.input_update = 0.9 * self.input_update + squared 
        self.input_node_update = 0.9 * self.input_node_update + squared
        self.output_update = 0.9 * self.output_update + squared 
        
        self.cell_update = 0.9 * self.cell_update + squared 
        
        # Step against the error, scaled down by the running average
        self.forget_gate -= self.alpha * error / torch.sqrt(self.forget_update + self.epsilon)
        self.input_gate -= self.alpha * error / torch.sqrt(self.input_update + self.epsilon)
        self.input_node -= self.alpha * error / torch.sqrt(self.input_node_update + self.epsilon)
        self.output_gate -= self.alpha * error / torch.sqrt(self.output_update + self.epsilon)
        
        self.cell_state -= self.alpha * error / torch.sqrt(self.cell_update + self.epsilon)

And we made a class for $LSTM$
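
The `update` method keeps a running average of squared errors and divides by its square root, which is the idea behind the RMSProp optimizer. For comparison, here is a minimal NumPy sketch of the textbook RMSProp step. The function name `rmsprop_step`, its default values and the toy objective are assumptions for illustration, and the class above differs in details such as how the squared term is scaled.

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-7):
    # Running average of squared gradients
    cache = decay * cache + (1.0 - decay) * grad ** 2
    # Step against the gradient, scaled by the root of the running average
    param = param - lr * grad / np.sqrt(cache + eps)
    return param, cache

# Toy objective: minimise sum(w ** 2), whose gradient is 2 * w
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for _ in range(200):
    w, cache = rmsprop_step(w, 2 * w, cache)
print(w)
```

Dividing by the running average keeps every step at a similar size, no matter how large or small the raw gradient is.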

# 4 | To Do List 📝
```
# TO DO 1 : TRAIN LSTM

# TO DO 2 : INTRODUCE GRU

# TO DO 3 : TRAIN GRU

# TO DO 4 : GET BETTER RESULTS

# TO DO 5 : DANCE ON I LIKE TO MOVE IT MOVE IT 
```
# 5 | Ending 🎭

**THAT'S IT FOR TODAY GUYS**

**WE WILL IMPROVE THIS IN UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE IF YOU LIKED MY WORK**

<img src = "https://i.imgflip.com/19aadg.jpg">

**PEACE OUT $:)$**