# Natural Language Processing

![](https://www.thuatngumarketing.com/wp-content/uploads/2017/12/NLP.png.pagespeed.ce_.1YNuw_5dJH.png)

Natural Language Processing (NLP) is a field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human languages.

NLP helps with many tasks, and its fields of application keep growing:
>* Amazon Comprehend Medical is a service that uses NLP to extract disease conditions, medications and treatment outcomes from patient notes, clinical trial reports and other electronic health records.
>* Organizations can determine what customers are saying about a service or product by identifying and extracting information in sources like social media.
>* An inventor at IBM developed a cognitive assistant that works like a personalized search engine: it learns all about you and then reminds you of a name, a song, or anything else you can’t remember at the moment you need it.
>* Companies like Yahoo and Google filter and classify your emails with NLP, analyzing the text of emails that flow through their servers and stopping spam before it even enters your inbox.

![](https://miro.medium.com/max/1120/1*CGHaWd635jtRa47n4nhsiQ.png)
Number of publications containing the phrase “natural language processing” in PubMed in the period 1978–2018. As of 2018, PubMed comprised more than 29 million citations for biomedical literature.

## Part 1: Pre-processing data

* Delete HTML
* Delete links
* Delete punctuation
* Delete emoji
* Delete numbers
* Collapse whitespace (and convert to lower case)
* Tokenize text

In [34]:
import re
import json
import time
import _pickle

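import bs4
import demoji
import numpy as np
import pandas as pd
import flashtext
from pyvi.ViTokenizer import tokenize

# Download the emoji codes used by demoji.replace().
# Older demoji releases need this call; newer releases bundle the data, so it may be unnecessary there.
demoji.download_codes()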

Downloading emoji data ...
... OK (Got response in 0.66 seconds)
Writing emoji data to C:\Users\X1\.demoji\codes.json ...
... OK


In [23]:
# Read data from the JSON file (explicit UTF-8, since the text is Vietnamese)
with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

In [73]:
data

[{'title': 'Thủ tướng: Khác với đa số các nước, dư địa chính sách tài khóa, tiền tệ của Việt Nam còn khá lớn',
  'snippet': '(Tổ Quốc) - Với bối cảnh hiện nay, Hội đồng Tư vấn chính sách tài chính, tiền tệ quốc gia thống nhất kịch bản tăng trưởng từ 3-4%. Kiểm soát lạm phát dưới 4%. Năm 2020 và đầu 2021, tăng trưởng tín dụng trên 10%, chủ trương tăng thêm bội chi ngân sách, nợ công khoảng 3-4 % GDP.',
  'content': 'Sáng ngày 9/7, Hội đồng Tư vấn chính sách tài chính, tiền tệ quốc gia đã họp dưới sự chủ trì của Thủ tướng Nguyễn Xuân Phúc, Chủ tịch Hội đồng. Dư địa chính sách tài khóa, tiền tệ còn lớn Thủ tướng Nguyễn Xuân Phúc cho rằng, dịch bệnh COVID-19 bùng phát diện rộng trên toàn cầu, diễn biến phức tạp, chưa dừng lại, nhất là tại các đối tác lớn của nước ta. Kinh tế toàn cầu sụt giảm mạnh. Nhiều tổ chức quốc tế dự báo kinh tế thế giới tăng trưởng âm trong năm nay. Trong bối cảnh đó, hầu hết các nước đều nới lỏng chính sách tài chính, tiền tệ với mức độ chưa từng có. Theo thống kê 

In [24]:
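# Delete HTML
def del_html(text):
    # Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
    soup = bs4.BeautifulSoup(text, 'html.parser')
    return soup.get_text(' ')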

In [25]:
# Delete links
def del_link(text):
    # Match http:// or https:// followed by any run of non-whitespace characters
    link = r'https?://\S+'
    return re.sub(link, ' ', str(text))

In [26]:
# Delete punctuation
def del_punctuation(doc):
    # Same character set as before, with duplicates and unneeded escapes removed
    pattern = r'[,./\\!@#+"\';()“”\-:…&><=%|^$\[\]{}?*•]'
    record = re.sub(pattern, ' ', doc)
    return re.sub(r'\n', ' ', record)

In [40]:
# Delete emoji
def del_emoji(text):
    return demoji.replace(text, '')

In [28]:
# Delete numbers
def del_numbers(text):
    return re.sub(r'\d+', ' ', text)

In [29]:
# Collapse runs of whitespace into a single space and convert to lower case
def del_space(doc):
    space_pattern = r'\s+'
    return re.sub(space_pattern, ' ', doc.lower())

In [30]:
# Tokenize text
def text_token(text):
    return tokenize(text)

In [41]:
# Clean text
def clean_text(text):
    text = del_html(text)
    text = del_link(text)
    text = del_numbers(text)
    text = del_emoji(text)
    text = del_punctuation(text)
    text = del_space(text)
    return text_token(text)

In [42]:
corpus = []
for record in data:
    text = ' '.join([record['snippet'], record['title'], record['content']])
    text = clean_text(text)
    corpus.append(text)

In [43]:
corpus

['tổ_quốc với bối_cảnh hiện_nay hội_đồng tư_vấn chính_sách tài_chính tiền_tệ quốc_gia thống_nhất kịch_bản tăng_trưởng từ kiểm_soát lạm_phát dưới năm và đầu tăng_trưởng tín_dụng trên chủ_trương tăng thêm bội_chi ngân_sách nợ công khoảng gdp thủ_tướng khác với đa_số các nước dư địa_chính_sách tài khóa tiền_tệ của việt nam còn khá lớn sáng ngày hội_đồng tư_vấn chính_sách tài_chính tiền_tệ quốc_gia đã họp dưới sự chủ_trì của thủ_tướng nguyễn xuân phúc chủ_tịch hội_đồng dư địa_chính_sách tài khóa tiền_tệ còn lớn thủ_tướng nguyễn xuân phúc cho rằng dịch_bệnh covid bùng_phát diện rộng trên toàn_cầu diễn_biến phức_tạp chưa dừng lại nhất_là tại các đối_tác lớn của nước ta kinh_tế toàn_cầu sụt_giảm mạnh nhiều tổ_chức quốc_tế dự_báo kinh_tế thế_giới tăng_trưởng âm trong năm nay trong bối_cảnh đó hầu_hết các nước đều nới lỏng chính_sách tài_chính tiền_tệ với mức_độ chưa từng có theo thống_kê mới nhất nếu tháng tổng_các gói kích_thích tài khóa mới là tỷ usd thì đến nay đã tăng lên tỷ usd chưa có dấ

## Part 2: Word embedding

### One-hot encoding

Suppose we have the sentence “Can I eat the Pizza”. One-hot encoding converts each word (category) into a numerical label and then into a binary vector:
>1. Convert the text to lower case and sort the unique words in ascending (A–Z) order. This gives “can, eat, i, pizza, the”.
>2. Assign each word its position in the sorted list as a numerical label: can:0, eat:1, i:2, pizza:3, the:4.
>3. Transform each label into a binary vector.

Categorical variables take one of a fixed set of values based on some qualitative property. Examples are the sex recorded for an individual, or the weather, which can be sunny, cloudy, or rainy.

**Steps to follow:**
1. Convert Text to lower case
2. Tokenize the text
3. Get unique words
4. Sort the word list
5. Get the integer/position of the words
6. Create a vector of each word by marking its position as 1 and rest as 0
7. Create a matrix of the found vectors.

In [19]:
docs = "Can I eat the Pizza".lower().split()
doc1 = sorted(set(docs))  # sorted list of unique words (the vocabulary)
print("\nvalues: ", doc1)

# Integer label of each word = its position in the sorted vocabulary
integer_encoded = [doc1.index(word) for word in docs]
print("\ninteger encoded: ", integer_encoded)

def get_vec(len_doc, word):
    # One-hot vector: 1 at the word's vocabulary position, 0 elsewhere
    vector = [0] * len_doc
    vector[doc1.index(word)] = 1
    return vector

def get_matrix(doc1):
    # One row per word of the sentence, in sentence order
    len_doc = len(doc1)
    return np.asarray([get_vec(len_doc, word) for word in docs])

print("\nMATRIX:")
print(get_matrix(doc1))


values:  ['can', 'eat', 'i', 'pizza', 'the']

integer encoded:  [0, 2, 1, 4, 3]

MATRIX:
[[1 0 0 0 0]
 [0 0 1 0 0]
 [0 1 0 0 0]
 [0 0 0 0 1]
 [0 0 0 1 0]]


### Bag-of-words

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. 

The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

>1. A vocabulary of known words.
2. A measure of the presence of known words.

Example:

*“It was the best of times”*

*“It was the worst of times”*

*“It was the age of wisdom”*

*“It was the age of foolishness”*
    

We treat each sentence as a separate document and we make a list of all words from all the four documents excluding the punctuation. We get, *‘It’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’*.

The next step is to create vectors. Vectors turn text into a numerical form that a machine learning algorithm can use.

We take the first document, “It was the best of times”, and count how often each of the 10 unique words occurs in it.

“it” = 1

“was” = 1

“the” = 1

“best” = 1

“of” = 1

“times” = 1

“worst” = 0

“age” = 0

“wisdom” = 0

“foolishness” = 0

The four documents are then represented as:

“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

“It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

“It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is called a bigram model.
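
The walkthrough above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than a production tokenizer: the helper names (`tokenize_simple`, `build_vocabulary`, `bow_vector`, `bigrams`) are made up for this example, and the vocabulary keeps words in order of first appearance so that the vectors line up with the ones listed above.

```python
import re

documents = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

def tokenize_simple(text):
    # Lowercase and keep runs of ASCII letters only, which drops punctuation
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(token_lists):
    # Collect unique words in order of first appearance
    vocabulary = []
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def bow_vector(tokens, vocabulary):
    # Count how often each vocabulary word occurs in the document
    return [tokens.count(word) for word in vocabulary]

def bigrams(tokens):
    # Adjacent word pairs: the units of a bigram vocabulary
    return list(zip(tokens, tokens[1:]))

tokenized = [tokenize_simple(doc) for doc in documents]
vocabulary = build_vocabulary(tokenized)
vectors = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for doc, vector in zip(documents, vectors):
    print(doc, "=", vector)
print(bigrams(tokenized[0]))
```

Swapping `tokenize_simple` for a language-specific tokenizer is the main change needed for other languages; the counting logic stays the same.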

### TF-IDF
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. The score increases in proportion to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as *this*, *what*, and *if*, rank low even though they may appear many times, since they don’t mean much to that document in particular.

TF-IDF for a word in a document is calculated by multiplying two different metrics:
>* The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by length of a document, or by the raw frequency of the most frequent word in a document.
>* The inverse document frequency of the word across a set of documents. This measures how common or rare a word is in the entire document set: the closer it is to 0, the more common the word. It is calculated by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm of the result.
>* So, if the word is very common and appears in many documents, this number approaches 0. The rarer the word, the larger the number, up to the logarithm of the total number of documents for a word that appears in only one document.


To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the document set D is calculated as follows:

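$$ \text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) $$

where:

$$ \text{tf}(t, d) = \log(1 + \text{freq}(t, d)) $$
$$ \text{idf}(t, D) = \log \left( \dfrac{N}{|\{d \in D : t \in d\}|} \right) $$

Here $N$ is the total number of documents in $D$, and the denominator is the number of documents that contain $t$.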
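
To make the formulas concrete, here is a small sketch that applies them directly with the natural logarithm. The toy corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions made for illustration. Library implementations such as gensim's `TfidfModel` may use different weighting and normalization, so their numbers will generally not match these.

```python
import math

# Toy corpus of pre-tokenized documents, assumed for illustration only
corpus_tokens = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ate", "the", "fish"],
]

def tf(term, doc):
    # tf(t, d) = log(1 + freq(t, d))
    return math.log(1 + doc.count(term))

def idf(term, docs):
    # idf(t, D) = log(N / number of documents containing t)
    # Assumes the term occurs in at least one document
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

last_doc = corpus_tokens[2]
for term in ["the", "cat", "fish"]:
    print(term, round(tf_idf(term, last_doc, corpus_tokens), 4))
```

In this corpus “the” appears in every document, so its idf is log(1) = 0 and its score is 0 no matter how often it occurs. “fish” appears in a single document and gets the highest score of the three.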

In [57]:
# small corpus
import numpy as np
import gensim
import pprint
from gensim import corpora, models
from gensim.utils import simple_preprocess

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

for doc in BoW_corpus:
    print([[dictionary[id], freq] for id, freq in doc])

tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')
for doc in tfidf[BoW_corpus]:
    print([[dictionary[id], np.around(freq)] for id, freq in doc])

[['are', 1], ['hello', 1], ['how', 1], ['you', 1]]
[['how', 1], ['you', 1], ['do', 2]]
[['are', 2], ['you', 3], ['doing', 2], ['hey', 1], ['what', 2], ['yes', 1]]
[['are', 0.0], ['hello', 1.0], ['how', 0.0], ['you', 0.0]]
[['how', 0.0], ['you', 0.0], ['do', 1.0]]
[['are', 0.0], ['you', 0.0], ['doing', 1.0], ['hey', 0.0], ['what', 1.0], ['yes', 0.0]]


In [49]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

In [63]:
dictionary = corpora.Dictionary()
doc_tokenized = [doc.split(' ') for doc in corpus]
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

model = TfidfModel(BoW_corpus)
print(model[BoW_corpus[0]])

[(0, 0.041056810958015416), (1, 0.04908549713856512), (2, 0.022150413279576557), (3, 0.07118236260548674), (4, 0.044608066725595884), (5, 0.049957499668618746), (6, 0.03466252866181522), (7, 0.10062651461836701), (8, 0.06060873104638708), (9, 0.04908549713856512), (10, 0.06915152597396647), (11, 0.03617867248710988), (12, 0.010316815220866141), (13, 0.026828466033872247), (14, 0.06111081378160896), (15, 0.0028250430885219864), (16, 0.06235428291639299), (17, 0.16319377892084536), (18, 0.078057438529472), (19, 0.052419825952484185), (20, 0.02820453338317384), (21, 0.04547000476013332), (22, 0.05253766443453049), (23, 0.09514985068582214), (24, 0.04231184045728052), (25, 0.06663472514424916), (26, 0.008110526439717347), (27, 0.030484355333397262), (28, 0.007526929182420807), (29, 0.13148204782600803), (30, 0.04069441407621065), (31, 0.04692231586209833), (32, 0.010444859212821667), (33, 0.0226333639147486), (34, 0.039697435784915046), (35, 0.07015038562083413), (36, 0.007070734738354771)

### Word2vec

Consider the following similar sentences: “Have a good day” and “Have a great day”. They have nearly the same meaning. If we construct an exhaustive vocabulary (let’s call it V), it would be V = {Have, a, good, great, day}.

Have = $[1,0,0,0,0]^T$; a = $[0,1,0,0,0]^T$; good = $[0,0,1,0,0]^T$; great = $[0,0,0,1,0]^T$; day = $[0,0,0,0,1]^T$ (where $^T$ denotes the transpose)

In this representation, “good” and “great” are as different as “day” and “have”, which does not reflect their meaning. Our objective is for words that appear in similar contexts to occupy nearby positions in the vector space. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0.
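
The point about cosine similarity can be checked numerically. The sketch below first builds the one-hot vectors for the five-word vocabulary, then compares them with some dense two-dimensional vectors. The dense values are invented for illustration and were not learned from any data; `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["have", "a", "good", "great", "day"]
one_hot = {
    word: [1 if i == j else 0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Any two distinct one-hot vectors are orthogonal, so the similarity is 0
print(cosine_similarity(one_hot["good"], one_hot["great"]))
print(cosine_similarity(one_hot["day"], one_hot["have"]))

# Hypothetical dense vectors, made up for illustration (not learned embeddings)
dense = {"good": [0.9, 0.1], "great": [0.8, 0.2], "day": [0.1, 0.9]}
print(cosine_similarity(dense["good"], dense["great"]))
print(cosine_similarity(dense["good"], dense["day"]))
```

With one-hot vectors every pair of different words scores 0. Dense vectors can place “good” and “great” close together while keeping “day” further away.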

Word2Vec was developed at Google by Tomas Mikolov, et al. and uses Neural Networks to learn word embeddings. The beauty with word2vec is that the vectors are learned by understanding the context in which words appear. The result is vectors in which words with similar meanings end up with a similar numerical representation.

For example, with regular one-hot encoded vectors, every pair of words ends up the same distance apart, regardless of how similar or different their meanings are. In other words, information is lost in the encoding.
![](https://i2.wp.com/insightbot.blob.core.windows.net/blogimagecontainer/b3c56245-db43-48ab-b652-9ba03f4d9900.jpg?ssl=1)

With word embeddings methods such as Word2Vec, the resulting vector does a better job of maintaining context. For instance, cats and dogs are more similar than fish and sharks. This extra information makes a big difference in your machine learning algorithms.
![](https://i1.wp.com/insightbot.blob.core.windows.net/blogimagecontainer/8cbfc874-3ba3-46c8-ab68-2711812ecbf1.jpg?ssl=1)

Word2Vec comprises two learning models: CBOW (Continuous Bag of Words) and Skip-Gram.

The CBOW model learns word embeddings by training a model to predict a word given its context.

The Skip-Gram model does the opposite: it learns word embeddings by training a model to predict the context given a word.

![](https://i0.wp.com/insightbot.blob.core.windows.net/blogimagecontainer/7938152f-71c8-4f28-9c25-06735e6e2b67.jpg?ssl=1)

**Skip-gram**

The window size defines how many words before and after the target word are used as context; a typical window size is 5.
![](https://i2.wp.com/insightbot.blob.core.windows.net/blogimagecontainer/a8066c1d-c532-4549-bb24-19dfea5eb178_med.jpg?ssl=1)

Using a window size of 2, the input pairs for training on w(4), “royal”, would be:
![](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/training_data.png)

In this sense, a simple definition of context is the set of words that usually appear together with a given word. In a trained skip-gram model, inputting the word “Royal” would yield the words “The” and “King” as predicted context. The prediction comes from the model’s output layer, which holds a probability distribution over every word in the vocabulary for the given input word.
![](https://i1.wp.com/insightbot.blob.core.windows.net/blogimagecontainer/56060b2d-41f6-4788-9ae7-bba23fa00f0e_med.jpg?ssl=1)
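
As a rough illustration of how (target, context) training pairs are produced from a window, here is a minimal sketch. The example sentence is hypothetical (it is not the sentence used in the figures), and `skipgram_pairs` is a helper written for this example, not part of any library.

```python
def skipgram_pairs(tokens, window_size):
    # For every position, pair the target word with each word that lies
    # at most `window_size` positions before or after it
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical sentence, chosen only for illustration
sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=2)

# Training pairs whose target word is "fox"
print([pair for pair in pairs if pair[0] == "fox"])
```

Words near the start or end of the sentence simply get fewer context words, because the window is clipped at the sentence boundaries.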

In Skip-Gram, the input is an (R x V) matrix, where V is the vocabulary size and R is the number of training samples; each row is the one-hot encoded vector of one word from the vocabulary. The input is multiplied by the hidden-layer weight matrix (V x E), where E is the number of embedding dimensions (features) you are trying to learn. The output layer is an (R x V) matrix: after softmax is applied, each row holds a probability for every word in the vocabulary given the input word. The word embedding is the hidden-layer weight matrix (V x E).
![](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/word2vec_weight_matrix_lookup_table.png)

![](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/matrix_mult_w_one_hot.png)
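
The lookup-table view can be sketched with NumPy for a single input word. The sizes, the random weights, and the names `W_in` and `W_out` are assumptions for illustration; in particular, the (E x V) shape of the output weights is an assumption, since the text only gives the shape of the output layer. Nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E = 5, 3  # assumed toy vocabulary size and embedding dimension

W_in = rng.normal(size=(V, E))   # hidden-layer weights: one embedding row per word
W_out = rng.normal(size=(E, V))  # output-layer weights (assumed shape)

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

word_index = 2
x = one_hot(word_index, V)

hidden = x @ W_in                # selects row `word_index` of W_in
probs = softmax(hidden @ W_out)  # one probability per vocabulary word

print(np.allclose(hidden, W_in[word_index]))
print(probs)
```

Because multiplying by a one-hot vector just picks out one row, practical implementations can skip the multiplication and index the weight matrix directly.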

![](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/linear-relationships.png)

In [59]:
from gensim.models import Word2Vec

In [64]:
word2vec = Word2Vec(doc_tokenized, min_count=2)

In [65]:
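# gensim < 4.0 API (it produced the output below); gensim 4.x replaced wv.vocab with wv.key_to_index
vocabulary = word2vec.wv.vocab
print(vocabulary)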

{'tổ_quốc': <gensim.models.keyedvectors.Vocab object at 0x00000215E3603DA0>, 'với': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A6D8>, 'bối_cảnh': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A898>, 'hiện_nay': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A7F0>, 'hội_đồng': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A780>, 'tư_vấn': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A7B8>, 'chính_sách': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A6A0>, 'tài_chính': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A470>, 'tiền_tệ': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A320>, 'quốc_gia': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A4A8>, 'thống_nhất': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A438>, 'kịch_bản': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A3C8>, 'tăng_trưởng': <gensim.models.keyedvectors.Vocab object at 0x00000215E442A4E0>,

In [67]:
v1 = word2vec.wv['tổ_quốc']
v1

array([ 0.12416643, -0.0042013 ,  0.2530074 ,  0.3738347 , -0.01195809,
        0.24246785, -0.08152559,  0.15248968, -0.13780174, -0.49535426,
        0.15098403, -0.0854509 ,  0.3534799 , -0.14441416, -0.31444937,
       -0.0076703 , -0.022089  , -0.13297945,  0.01288203, -0.0766298 ,
        0.17047074,  0.19727796, -0.09008525, -0.2136717 , -0.05885999,
        0.09681803, -0.20772286, -0.34104082,  0.01049158,  0.40065753,
       -0.03859928, -0.15907373, -0.10808727,  0.21206434, -0.03683862,
       -0.06156401, -0.1186076 , -0.2005384 , -0.21708606,  0.09689088,
       -0.0634633 , -0.03559205,  0.02914888, -0.19676241, -0.43852183,
       -0.1829509 ,  0.07776973,  0.10082722, -0.01793282,  0.04023144,
        0.06146409,  0.02169067,  0.47531855, -0.10721502,  0.13048634,
       -0.2281351 , -0.35125458, -0.3424357 ,  0.13317683,  0.10750926,
        0.44313076, -0.04781875,  0.10149067, -0.1967978 , -0.2053582 ,
        0.0796575 , -0.15381086, -0.16167809,  0.32356513,  0.13

In [72]:
sim_words = word2vec.wv.most_similar('kéo_dài')
sim_words

[('ngắn', 0.6219161152839661),
 ('nó', 0.5537564754486084),
 ('khiếu_kiện', 0.5367236137390137),
 ('kết_thúc', 0.5046229362487793),
 ('phong_tỏa', 0.48663008213043213),
 ('giãn', 0.47034239768981934),
 ('hết', 0.46944886445999146),
 ('khó', 0.46831953525543213),
 ('chặng', 0.4581029415130615),
 ('điểm_nóng', 0.4571770131587982)]

## Part 3: RNN and LSTM models

### RNN (Recurrent Neural Networks)
Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states.

Their typical structure is shown below:
![](https://miro.medium.com/max/1225/1*AQ52bwW55GsJt6HTxPDuMA.gif)

While processing a sequence, the network passes the previous hidden state to the next step. The hidden state acts as the neural network’s memory: it holds information on the data the network has seen so far.

![](https://miro.medium.com/max/1225/1*o-Cq5U8-tfa1_ve2Pf3nfg.gif)

The input and previous hidden state are combined to form a vector. That vector now has information on the current input and previous inputs. The vector goes through the tanh activation, and the output is the new hidden state, or the memory of the network.
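
A single recurrent step as described above can be sketched with NumPy. The sizes and the random weights are assumptions for illustration (an untrained cell), and `rnn_step` is a name chosen for this example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    # Combine the previous hidden state and the current input into one vector,
    # apply a linear map, and squash the result with tanh
    combined = np.concatenate([h_prev, x_t])
    return np.tanh(W @ combined + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # assumed toy sizes

W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))  # 5 time steps of random input
h = np.zeros(hidden_size)                    # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h, W, b)               # the same weights are reused at every step

print(h)
```

The loop makes the recurrence explicit: each new hidden state depends on the current input and on the hidden state produced by the previous step.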

**Tanh activation**
The tanh activation is used to help regulate the values flowing through the network. The tanh function squishes values to always be between -1 and 1.

![](https://miro.medium.com/max/1225/1*iRlEg1GBKRzGTre5aOQUCg.gif)

![](https://miro.medium.com/max/1225/1*LgbEFcGiUpseZ--M7wuZhg.gif)
vector transformations without tanh

![](https://miro.medium.com/max/1225/1*gFC2bTg3uihp1klknWU0qg.gif)
vector transformations with tanh
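
The effect shown in the two animations can be imitated with a single number. In this sketch a value is repeatedly multiplied by 3, once without and once with tanh applied after each step; the factor 3, the starting value, and the number of steps are arbitrary choices for illustration.

```python
import math

value_plain = 0.5
value_tanh = 0.5

for _ in range(10):
    value_plain = value_plain * 3            # grows without bound
    value_tanh = math.tanh(value_tanh * 3)   # squashed back into [-1, 1] each step

print(value_plain)
print(value_tanh)
```

Without tanh the value grows by a factor of 3 at every step, while the tanh version stays inside the range [-1, 1].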

### Long Short-Term Memory (LSTM)
An LSTM has a control flow similar to that of a recurrent neural network. It processes data, passing on information as it propagates forward. The differences are the operations within the LSTM’s cells.

![](https://miro.medium.com/max/1225/1*0f8r3Vd-i4ueYND1CUrhMA.png)

The core concepts of an LSTM are the cell state and its various gates. The cell state acts as a transport highway that transfers relevant information all the way down the sequence chain. You can think of it as the “memory” of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence. So even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via gates. The gates are different neural networks that decide which information is allowed on the cell state. During training, the gates learn what information is relevant to keep or forget.

**Sigmoid**

Gates contain sigmoid activations. A sigmoid activation is similar to the tanh activation. Instead of squishing values between -1 and 1, it squishes values between 0 and 1. That is helpful for updating or forgetting data, because any number multiplied by 0 is 0, causing values to disappear or be “forgotten”. Any number multiplied by 1 keeps its value, so it is “kept”.
![](https://miro.medium.com/max/1225/1*rOFozAke2DX5BmsX2ubovw.gif)
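
A short sketch of gating with a sigmoid, using made-up numbers: the gate scores below are assumed pre-activation values chosen so that the three gates come out close to 0, exactly 0.5, and close to 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

values = [2.0, -1.5, 0.7]         # information flowing through the cell
gate_scores = [-10.0, 0.0, 10.0]  # assumed scores before the sigmoid

gate = [sigmoid(score) for score in gate_scores]
gated = [g * v for g, v in zip(gate, values)]

print(gate)   # close to 0, exactly 0.5, close to 1
print(gated)  # first value nearly erased, second halved, third nearly unchanged
```

Multiplying element by element is what lets a gate forget some components of a vector while keeping others.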

**Forget gate**
This gate decides what information should be thrown away or kept. Information from the previous hidden state and from the current input is passed through the sigmoid function. Values come out between 0 and 1: the closer to 0, the more is forgotten, and the closer to 1, the more is kept.
 
![](https://miro.medium.com/max/1225/1*GjehOa513_BgpDDP6Vkw2Q.gif)

**Input Gate**
To update the cell state, we have the input gate. First, we pass the previous hidden state and the current input into a sigmoid function. That decides which values will be updated by transforming the values to be between 0 and 1: 0 means not important, and 1 means important. We also pass the hidden state and the current input into the tanh function to squish values between -1 and 1, which helps regulate the network. Then we multiply the tanh output by the sigmoid output. The sigmoid output decides which information from the tanh output is important to keep.

![](https://miro.medium.com/max/1225/1*TTmYy7Sy8uUXxUXfzmoKbA.gif)

**Cell State**
Now we should have enough information to calculate the cell state. First, the cell state gets pointwise multiplied by the forget vector. This has a possibility of dropping values in the cell state if it gets multiplied by values near 0. Then we take the output from the input gate and do a pointwise addition which updates the cell state to new values that the neural network finds relevant. That gives us our new cell state.

![](https://miro.medium.com/max/1225/1*S0rXIeO_VoUVOyrYHckUWg.gif)

**Output Gate**
Last, we have the output gate. The output gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs. The hidden state is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state to the tanh function. We multiply the tanh output by the sigmoid output to decide what information the hidden state should carry. The output is the hidden state. The new cell state and the new hidden state are then carried over to the next time step.

![](https://miro.medium.com/max/1225/1*VOXRGhOShoWWks6ouoDN3Q.gif)

To review, the forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be.
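
Putting the gates together, one LSTM time step can be sketched with NumPy as follows. This is a minimal, untrained illustration of the steps described above: the sizes, the random weights, and the parameter names (`W_f`, `W_i`, `W_g`, `W_o` and the matching biases) are assumptions chosen for this example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    # Every gate looks at the previous hidden state and the current input
    combined = np.concatenate([h_prev, x_t])

    f = sigmoid(params["W_f"] @ combined + params["b_f"])  # forget gate
    i = sigmoid(params["W_i"] @ combined + params["b_i"])  # input gate
    g = np.tanh(params["W_g"] @ combined + params["b_g"])  # candidate values
    o = sigmoid(params["W_o"] @ combined + params["b_o"])  # output gate

    c_t = f * c_prev + i * g   # drop old information, then add new information
    h_t = o * np.tanh(c_t)     # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # assumed toy sizes

params = {}
for name in ("f", "i", "g", "o"):
    params[f"W_{name}"] = rng.normal(
        scale=0.5, size=(hidden_size, hidden_size + input_size)
    )
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)  # initial hidden state
c = np.zeros(hidden_size)  # initial cell state
for x_t in rng.normal(size=(5, input_size)):  # 5 time steps of random input
    h, c = lstm_step(x_t, h, c, params)

print("hidden state:", h)
print("cell state:", c)
```

The cell state is only changed by an element-wise multiplication and an addition, which is what lets information travel across many time steps with little modification.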