
# FINAL_ASSIGNMENT

*   Student name: Francesco Minchio
*   Student contact: francesco.minchio@studenti.unitn.it
*   Student referral: 225269


## INTRODUCTION




### **Neural Network**

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.

### **LSTM**

The gradient is the value used to update a neural network's weights. The vanishing gradient problem occurs when the gradient shrinks as it is back-propagated through time: once it becomes extremely small, it contributes very little to learning. In a recurrent network, the layers that receive such small updates, usually the earlier ones, effectively stop learning. As a result, RNNs can forget what they saw earlier in long sequences; in other words, they have a short-term memory.

The LSTM was created as a solution to this short-term memory. It has internal mechanisms called gates that regulate the flow of information. The gates learn which data in a sequence is important to keep and which to throw away, so the network can carry relevant information forward to make predictions. Almost all state-of-the-art results based on RNNs rely on such gated networks. We can find them in speech recognition, speech synthesis and text generation models, and we can even use them to generate captions for videos. The intuition is similar to reading a review: if your goal is to judge whether the review is good or bad, your brain subconsciously remembers only the important keywords. In the same way, an LSTM can learn to keep only the information that is relevant for its predictions.
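To make the role of the gates more concrete, here is a minimal sketch of a single LSTM time step using the standard gate equations. It is simplified to scalar input and state, and the weight dictionary `keep_weights` is a hypothetical example, not the weights learned by the model in this notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with scalar input, hidden state and cell state.

    `w` maps each gate name to a hypothetical tuple
    (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value computed from the input
    c = f * c_prev + i * g            # new cell state (the long-term memory)
    h = o * math.tanh(c)              # new hidden state (the output)
    return h, c

# Hypothetical weights: forget gate almost fully open, input gate almost closed.
keep_weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.8
for x in [0.5, -1.0, 2.0]:
    h, c = lstm_step(x, h, c, keep_weights)
print(c)  # stays close to the initial value 0.8
```

With these weights the cell state is carried almost unchanged across the steps, which is the "remembering" behaviour described above; a forget bias of -10 would instead erase it.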

### **LSTM for Text Generation**

In a recurrent network, activations are fed back into the network through a loop, so the network keeps a memory state that carries what it has previously learned. The LSTM uses this state to control the flow of information, remembering or forgetting trends selectively. The model used here has the following layers:



*   Input Layer
*   LSTM Layer
*   Dropout Layer
*   Output Layer





### **Source**

New York Times Dataset (articles, comments).

### **Purpose**

Text generation using LSTMs. The aim is to predict the next word/token.








## CODE

As the first step, we need to import the required libraries:

Keras is an API that follows best practices for reducing cognitive load: it minimizes the number of user actions required for common use cases and provides clear, actionable error messages.

In [None]:
#keras module for building LSTM 
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

#set seeds for reproducibility
from numpy.random import seed
seed(1)
from tensorflow import random
random.set_seed(1)

#data handling (pandas, numpy) and standard library
import pandas as pd
import numpy as np
import string, os 

#suppress warnings (including FutureWarning) to keep the output readable
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

### Dataset load

In [None]:
curr_dir = '../input/'
all_headlines = []
for filename in os.listdir(curr_dir):
    if 'Archive' in filename:
        article_df = pd.read_csv(curr_dir + filename)
        all_headlines.extend(list(article_df.headline.values))
        break

all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)

### Dataset Preparation & Text Cleaning

The model needs its input as sequences of words/tokens.
Tokenization extracts the tokens from a corpus; here the Keras `Tokenizer` is used to obtain the tokens and their index in the vocabulary. After that, every text in the dataset is converted into a sequence of token indices.
From each text we then generate N-gram phrases: every prefix of the text containing at least two tokens becomes one training sequence. The result can be viewed as a table with two columns, with the N-gram phrase in the first and its sequence of token indices in the second. In essence, each phrase is represented as a list of integers [ , , ... ], where each integer is the index of a word in the vocabulary built from the text.
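The N-gram construction can be illustrated without Keras. The sketch below is a simplified stand-in: the helper names `build_vocab` and `ngram_prefixes` are hypothetical, and indices are assigned in order of first appearance, whereas the Keras `Tokenizer` orders words by frequency.

```python
def build_vocab(corpus):
    """Give each distinct word an integer index starting at 1 (0 is left for padding)."""
    word_index = {}
    for line in corpus:
        for word in line.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def ngram_prefixes(line, word_index):
    """Return every prefix of the line with at least two tokens, as index lists."""
    tokens = [word_index[w] for w in line.split()]
    return [tokens[:i + 1] for i in range(1, len(tokens))]

toy_corpus = ["the cat sat", "the dog sat down"]
toy_index = build_vocab(toy_corpus)
toy_sequences = ngram_prefixes("the dog sat down", toy_index)
print(toy_index)      # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
print(toy_sequences)  # [[1, 4], [1, 4, 3], [1, 4, 3, 5]]
```

A headline of n tokens therefore contributes n - 1 training sequences, each one token longer than the previous.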

In [None]:
#remove punctuation and lowercase all the words
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in all_headlines]
corpus[:10]

#generate sequence of N-gram Tokens

tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

#pad the sequences to the same length before training the model

def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    ##create predictors (N-gram prefix, vector X) and labels (next word, vector Y)
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
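To see what the padding and the predictor/label split produce, here is a small plain-Python sketch of the same idea. The helpers `pad_pre` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding='pre')` and `to_categorical`, and the toy sequences and the vocabulary size of 6 are assumptions made for illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value` up to `maxlen`, keeping the last `maxlen` items."""
    return [value] * (maxlen - len(seq)) + seq[-maxlen:]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

toy_seqs = [[1, 4], [1, 4, 3], [1, 4, 3, 5]]
toy_max_len = max(len(s) for s in toy_seqs)
toy_padded = [pad_pre(s, toy_max_len) for s in toy_seqs]

# Everything except the last token is the input; the last token is the target.
toy_predictors = [row[:-1] for row in toy_padded]
toy_labels = [one_hot(row[-1], num_classes=6) for row in toy_padded]

print(toy_padded)      # [[0, 0, 1, 4], [0, 1, 4, 3], [1, 4, 3, 5]]
print(toy_predictors)  # [[0, 0, 1], [0, 1, 4], [1, 4, 3]]
print(toy_labels[0])   # [0, 0, 0, 0, 1, 0]
```

Padding on the left keeps the most recent words next to the prediction position, so the label is always the final element of each row.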

### Model Creation

In [None]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    ##Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    ##Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    ##Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()
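The parameter counts reported by `model.summary()` can be checked by hand. The sketch below applies the usual formulas for an embedding layer, a standard LSTM layer with biases, and a dense layer, using the sizes from `create_model` (embedding dimension 10, 100 LSTM units). The vocabulary size of 1000 is only an assumed example; the real value of `total_words` depends on the dataset.

```python
def embedding_params(vocab_size, embed_dim):
    # one embedding vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four weight sets (input, forget and output gates plus the candidate),
    # each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

assumed_vocab = 1000  # hypothetical value standing in for total_words
param_counts = {
    "embedding": embedding_params(assumed_vocab, 10),
    "lstm": lstm_params(10, 100),
    "dense": dense_params(100, assumed_vocab),
}
total_params = sum(param_counts.values())
print(param_counts)  # {'embedding': 10000, 'lstm': 44400, 'dense': 101000}
print(total_params)  # 155400
```

The dropout layer has no trainable parameters, so it does not appear in the count. Note that the output layer grows linearly with the vocabulary and quickly dominates the total.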

#model training

#verbose=2 prints one line per epoch
model.fit(predictors, label, epochs=100, verbose=2)

#text generation

def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
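        #predict_classes was removed in recent TensorFlow/Keras releases; take the argmax of the predicted probabilities instead
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]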
        
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()
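The generation loop above is a greedy decoder: at each step it takes the most probable next word and appends it to the seed text. The sketch below isolates that logic from Keras. The function `generate_greedy`, the toy vocabulary and `toy_predict` are hypothetical stand-ins for the trained model, and the reverse dictionary plays the role of the tokenizer's `index_word` mapping, which avoids scanning `word_index` for every prediction.

```python
def generate_greedy(seed_text, next_words, predict_probs, index_word):
    """Append `next_words` words, always taking the most probable next index."""
    words = seed_text.split()
    for _ in range(next_words):
        probs = predict_probs(words)                          # one probability per index
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the vocabulary
        words.append(index_word[best])                        # direct reverse lookup
    return " ".join(words)

# Hypothetical toy vocabulary; index 0 is reserved for padding.
toy_index_word = {1: "the", 2: "cat", 3: "sat"}
toy_next_index = {"the": 2, "cat": 3, "sat": 1}

def toy_predict(words):
    """Stand-in for model.predict: puts all probability on one fixed successor."""
    probs = [0.0] * (len(toy_index_word) + 1)
    probs[toy_next_index[words[-1]]] = 1.0
    return probs

generated = generate_greedy("the", 3, toy_predict, toy_index_word)
print(generated)  # the cat sat the
```

Because greedy decoding always picks the single most likely word, the same seed text always yields the same continuation; sampling from the predicted distribution would be one way to get more varied output.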