[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Chuvard/NLP_generator_and_detector/blob/main/NLP.ipynb)

## Outline

1. Project overview
2. Importing modules and libraries
3. NLP Generator 
    * 3.1. Problem statement
    * 3.2. Load and inspect the data
    * 3.3. Model construction
    * 3.4. Testing
4. NLP Recognizer 
    * 4.1. Problem statement
    * 4.2. Load and inspect the data
    * 4.3. Model construction
    * 4.4. Testing

## 1. Project overview

The project consists of two subprojects that use natural language processing techniques: one generates text, and the other detects sarcasm in text, reaching roughly 82% validation accuracy.

## 2. Importing modules and libraries

In [3]:
## Required libraries for generator
import requests
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

## Required libraries for recognizer
## (requests, numpy, Tokenizer and pad_sequences are already imported above)
import json
import tensorflow as tf

## 3. NLP Generator

### 3.1. Problem statement

The goal of this subproject is to build a model that predicts the next word in a text. We train it on Shakespeare's sonnets and write a few helper functions to prepare the data.

### 3.2. Load and inspect the data

In [17]:
## Load data to train the model

# Define path for file with sonnets
SONNETS_FILE = 'https://raw.githubusercontent.com/Chuvard/NLP_generator_and_detector/main/data/sonnets.txt'

# Read the data from the URL
response = requests.get(SONNETS_FILE)
data = response.text

# Convert to lower case and save as a list
corpus = data.lower().split("\n")

In [16]:
## Inspect the data
print(f"There are {len(corpus)} lines of sonnets\n")
print("The first 5 lines look like this:\n")
for i in range(5):
  print(corpus[i])

There are 2159 lines of sonnets

The first 5 lines look like this:

from fairest creatures we desire increase,
that thereby beauty's rose might never die,
but as the riper should by time decease,
his tender heir might bear his memory:
but thou, contracted to thine own bright eyes,


### 3.3. Model construction

#### Tokenizing the text

Now fit the Tokenizer to the corpus and save the total number of words. The `+ 1` accounts for index 0, which is reserved for padding and never assigned to a word.

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

To convert the text into sequences we can use the `texts_to_sequences` method.

In the next cells we will process the corpus one line at a time, so it is important to keep in mind that the way we feed the data into this method affects the result. The following example makes this clearer.

The first example of the corpus is a string and looks like this:

In [6]:
corpus[0]

'from fairest creatures we desire increase,'

If we pass this text directly into the `texts_to_sequences` method, we will get an unexpected result:

In [7]:
tokenizer.texts_to_sequences(corpus[0])

[[],
 [],
 [58],
 [],
 [],
 [],
 [17],
 [6],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [17],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [6],
 [],
 [],
 [],
 [6],
 [],
 [],
 [],
 [],
 [17],
 [],
 [],
 []]

This happens because `texts_to_sequences` expects a list of strings and we are passing a single string. A string is still an iterable in Python, so the method treats every character as a separate text and returns one sequence per character, most of them empty.

Instead, we need to place the example within a list before passing it to the method:

In [8]:
tokenizer.texts_to_sequences([corpus[0]])

[[34, 417, 877, 166, 213, 517]]

Notice that the sequence comes back wrapped inside a list, so to get only the desired sequence we need to take the first item of that list explicitly:

In [9]:
tokenizer.texts_to_sequences([corpus[0]])[0]

[34, 417, 877, 166, 213, 517]
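
The difference between the two calls can be reproduced without Keras. The sketch below is a simplified stand-in for `texts_to_sequences`, not the library implementation: `toy_texts_to_sequences`, `FILTERS` and the tiny `word_index` are hypothetical names chosen for illustration, and the filter string is assumed to approximate the Tokenizer's default punctuation filter.

```python
# Assumed approximation of the Tokenizer's default punctuation filter
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})


def toy_texts_to_sequences(texts, word_index):
    """Toy stand-in for Tokenizer.texts_to_sequences (illustration only)."""
    sequences = []
    # Iterating over a str yields single characters; over a list it yields lines
    for text in texts:
        words = text.lower().translate(_TO_SPACE).split()
        # Words missing from the vocabulary are silently dropped
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


# Hypothetical vocabulary; "a" stands in for any one-letter word in the corpus
word_index = {"from": 1, "fairest": 2, "creatures": 3, "a": 4}
line = "from fairest creatures,"

as_string = toy_texts_to_sequences(line, word_index)   # one entry per character
as_list = toy_texts_to_sequences([line], word_index)   # one entry per line

print(len(as_string), as_list)
```

Passing the bare string produces as many sequences as there are characters, and only characters that happen to be vocabulary words (here `a`) get an index. Wrapping the line in a list produces a single sequence of word indices.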

#### Generating n_grams (subphrases)

The `n_gram_seqs` function below receives the corpus (a list of strings) and the fitted tokenizer, and returns a list containing the `n_gram` sequences for each line in the corpus:

In [10]:
def n_gram_seqs(corpus, tokenizer):
    """
    Generates a list of n-gram sequences

    Args:
        corpus (list of string): lines of texts to generate n-grams for
        tokenizer (object): an instance of the Tokenizer class containing the word-index dictionary

    Returns:
        input_sequences (list of int): the n-gram sequences for each line in the corpus
    """
    input_sequences = []

    # Loop over every line
    for line in corpus:

        # Tokenize the current line
        token_list = tokenizer.texts_to_sequences([line])[0]

        # Loop over the line several times to generate the subphrases
        for i in range(1, len(token_list)):

            # Generate the subphrase
            n_gram_sequence = token_list[:i+1]

            # Append the subphrase to the sequences list
            input_sequences.append(n_gram_sequence)

    return input_sequences

In [13]:
# Test our function with one example (one line)
first_example_sequence = n_gram_seqs([corpus[0]], tokenizer)

print("n_gram sequences for first example look like this:\n")
first_example_sequence

n_gram sequences for first example look like this:



[[34, 417],
 [34, 417, 877],
 [34, 417, 877, 166],
 [34, 417, 877, 166, 213],
 [34, 417, 877, 166, 213, 517]]

In [14]:
# Test our function with a bigger corpus (three lines)
next_3_examples_sequence = n_gram_seqs(corpus[1:4], tokenizer)

print("n_gram sequences for next 3 examples look like this:\n")
next_3_examples_sequence

n_gram sequences for next 3 examples look like this:



[[8, 878],
 [8, 878, 134],
 [8, 878, 134, 351],
 [8, 878, 134, 351, 102],
 [8, 878, 134, 351, 102, 156],
 [8, 878, 134, 351, 102, 156, 199],
 [16, 22],
 [16, 22, 2],
 [16, 22, 2, 879],
 [16, 22, 2, 879, 61],
 [16, 22, 2, 879, 61, 30],
 [16, 22, 2, 879, 61, 30, 48],
 [16, 22, 2, 879, 61, 30, 48, 634],
 [25, 311],
 [25, 311, 635],
 [25, 311, 635, 102],
 [25, 311, 635, 102, 200],
 [25, 311, 635, 102, 200, 25],
 [25, 311, 635, 102, 200, 25, 278]]

Apply the `n_gram_seqs` transformation to the whole corpus and save the maximum sequence length to use it later:

In [15]:
# Apply the n_gram_seqs transformation to the whole corpus
input_sequences = n_gram_seqs(corpus, tokenizer)

# Save max length
max_sequence_len = max([len(x) for x in input_sequences])

print(f"n_grams of input_sequences have length: {len(input_sequences)}")
print(f"maximum length of sequences is: {max_sequence_len}")

n_grams of input_sequences have length: 15462
maximum length of sequences is: 11
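
Because a line with k tokens contributes k - 1 subphrases, the total number of n-grams can be checked without building them. A minimal sketch, where `count_n_grams` is a hypothetical helper and the token lists are the first two corpus lines shown above plus an empty line:

```python
def count_n_grams(token_lists):
    """Count the subphrases n_gram_seqs would produce: len - 1 per non-empty line."""
    return sum(max(len(tokens) - 1, 0) for tokens in token_lists)


token_lists = [
    [34, 417, 877, 166, 213, 517],        # first line of the corpus
    [8, 878, 134, 351, 102, 156, 199],    # second line of the corpus
    [],                                   # blank lines produce no n-grams
]

print(count_n_grams(token_lists))
```

Lines with zero or one token contribute nothing, which is why blank lines in the sonnets file do not add training examples.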


#### Add padding to the sequences

The `pad_seqs` function below pads any given sequences to the desired maximum length. It receives a list of sequences and returns a numpy array with the padded sequences:

In [18]:
def pad_seqs(input_sequences, maxlen):
    """
    Pads tokenized sequences to the same length

    Args:
        input_sequences (list of int): tokenized sequences to pad
        maxlen (int): maximum length of the token sequences

    Returns:
        padded_sequences (array of int): tokenized sequences padded to the same length
    """

    padded_sequences = pad_sequences(input_sequences, maxlen=maxlen)

    return padded_sequences

In [20]:
# Test our function with the n_grams_seq of the first example (subphrases from the first line)
first_padded_seq = pad_seqs(first_example_sequence, max([len(x) for x in first_example_sequence]))
first_padded_seq

array([[  0,   0,   0,   0,  34, 417],
       [  0,   0,   0,  34, 417, 877],
       [  0,   0,  34, 417, 877, 166],
       [  0,  34, 417, 877, 166, 213],
       [ 34, 417, 877, 166, 213, 517]])

In [21]:
# Test our function with the n_grams_seq of the next 3 examples (subphrases from three lines)
next_3_padded_seq = pad_seqs(next_3_examples_sequence, max([len(s) for s in next_3_examples_sequence]))
next_3_padded_seq

array([[  0,   0,   0,   0,   0,   0,   8, 878],
       [  0,   0,   0,   0,   0,   8, 878, 134],
       [  0,   0,   0,   0,   8, 878, 134, 351],
       [  0,   0,   0,   8, 878, 134, 351, 102],
       [  0,   0,   8, 878, 134, 351, 102, 156],
       [  0,   8, 878, 134, 351, 102, 156, 199],
       [  0,   0,   0,   0,   0,   0,  16,  22],
       [  0,   0,   0,   0,   0,  16,  22,   2],
       [  0,   0,   0,   0,  16,  22,   2, 879],
       [  0,   0,   0,  16,  22,   2, 879,  61],
       [  0,   0,  16,  22,   2, 879,  61,  30],
       [  0,  16,  22,   2, 879,  61,  30,  48],
       [ 16,  22,   2, 879,  61,  30,  48, 634],
       [  0,   0,   0,   0,   0,   0,  25, 311],
       [  0,   0,   0,   0,   0,  25, 311, 635],
       [  0,   0,   0,   0,  25, 311, 635, 102],
       [  0,   0,   0,  25, 311, 635, 102, 200],
       [  0,   0,  25, 311, 635, 102, 200,  25],
       [  0,  25, 311, 635, 102, 200,  25, 278]])

In [22]:
# Pad the whole corpus
input_sequences = pad_seqs(input_sequences, max_sequence_len)

print(f"padded corpus has shape: {input_sequences.shape}")

padded corpus has shape: (15462, 11)
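
The zeros appear on the left because `pad_sequences` pads (and truncates) at the front by default. The pure-Python sketch below mimics that default behaviour for illustration; `pre_pad` is a hypothetical helper, it returns nested lists rather than a numpy array, and it assumes `maxlen` is at least 1.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # 'pre' truncation
        padded.append([value] * (maxlen - len(seq)) + seq)  # 'pre' padding
    return padded


example = [[34, 417], [34, 417, 877], [34, 417, 877, 166]]
print(pre_pad(example, maxlen=3))
```

Padding at the front keeps the most recent words at the end of every row, so the last column always holds the word that will become the label.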


#### Split the data into features and labels

Before feeding the data into the neural network we should split it into features and labels. In this case the features will be the padded n_gram sequences with the last word removed from them and the labels will be the removed word.

The `features_and_labels` function below expects the padded n_gram sequences as input and returns a tuple containing the features and the one-hot encoded labels.

The function also receives the total number of words in the corpus. This parameter matters when one-hot encoding the labels, because it sets the width of every label vector so that any word in the vocabulary can be represented. For a refresher on how the `to_categorical` function works, take a look at the [docs](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

In [24]:
def features_and_labels(input_sequences, total_words):
    """
    Generates features and labels from n-grams

    Args:
        input_sequences (array of int): padded sequences to split features and labels from
        total_words (int): vocabulary size

    Returns:
        features, one_hot_labels (array of int, array of int): arrays of features and one-hot encoded labels
    """
    
    # Create inputs by dropping the last token of each subphrase
    features = input_sequences[:,:-1]

    # Create labels by taking the last token of each subphrase
    labels = input_sequences[:,-1]

    # Convert the label into one-hot arrays
    one_hot_labels = to_categorical(labels, num_classes=total_words)

    return features, one_hot_labels

In [25]:
# Test your function with the padded n_grams_seq of the first example
first_features, first_labels = features_and_labels(first_padded_seq, total_words)

print(f"labels have shape: {first_labels.shape}")
print("\nfeatures look like this:\n")
first_features

labels have shape: (5, 3211)

features look like this:



array([[  0,   0,   0,   0,  34],
       [  0,   0,   0,  34, 417],
       [  0,   0,  34, 417, 877],
       [  0,  34, 417, 877, 166],
       [ 34, 417, 877, 166, 213]])

In [26]:
# Split the whole corpus
features, labels = features_and_labels(input_sequences, total_words)

print(f"features have shape: {features.shape}")
print(f"labels have shape: {labels.shape}")

features have shape: (15462, 10)
labels have shape: (15462, 3211)
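
The split and the one-hot encoding can be illustrated on a tiny example without numpy or Keras. This is a sketch under stated assumptions: `split_features_and_labels` is a hypothetical pure-Python counterpart of `features_and_labels`, and the two padded rows and the vocabulary size of 5 are made up.

```python
def split_features_and_labels(padded_sequences, num_classes):
    """Split each row into (all but last token, one-hot vector of last token)."""
    features = [row[:-1] for row in padded_sequences]
    labels = [row[-1] for row in padded_sequences]
    one_hot = [
        [1.0 if index == label else 0.0 for index in range(num_classes)]
        for label in labels
    ]
    return features, one_hot


padded = [[0, 0, 3, 1], [0, 3, 1, 4]]   # made-up padded n-grams
toy_features, toy_one_hot = split_features_and_labels(padded, num_classes=5)

print(toy_features)
print(toy_one_hot)
```

Each label row has exactly one non-zero entry, at the position of the word index, which is why the label array for the full corpus is as wide as the vocabulary.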


#### Create the model

Now we define a model architecture capable of reaching a training accuracy of at least 80%.

To achieve this we take the following considerations into account:

- An appropriate `output_dim` for the first layer (Embedding) is 100.
- A Bidirectional LSTM is helpful for this particular problem.
- The last layer should have the same number of units as the total number of words in the corpus and a softmax activation function.
- The problem can be solved with only two layers (excluding the Embedding) so try out small architectures first.

In [27]:
def create_model(total_words, max_sequence_len):
    """
    Creates a text generator model

    Args:
        total_words (int): size of the vocabulary for the Embedding layer input
        max_sequence_len (int): length of the input sequences

    Returns:
        model (tf.keras Model): the text generator model
    """
    model = Sequential()

    model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
    model.add(Bidirectional(LSTM(150)))
    model.add(Dense(total_words, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

In [28]:
# Get the untrained model
model = create_model(total_words, max_sequence_len)

# Train the model
history = model.fit(features, labels, epochs=50, verbose=1)

Epoch 1/50
...
Epoch 50/50


#### Evaluate the model

In [None]:
# Take a look at the training curves of your model

acc = history.history['accuracy']
loss = history.history['loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'b', label='Training accuracy')
plt.title('Training accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'b', label='Training Loss')
plt.title('Training loss')
plt.legend()

plt.show()

### 3.4. Testing

#### See your model in action

Now that the model is built, let's play with it by generating the next 20 words of a seed text.

In [30]:
## Feed the text to generate the next 20 words
seed_text = "Help me Obi Wan Kenobi, you're my only hope"
next_words = 20

for _ in range(next_words):
    # Convert the text into sequences
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad the sequences
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Get the probabilities of predicting a word
    predicted = model.predict(token_list, verbose=0)
    # Choose the next word based on the maximum probability
    predicted = np.argmax(predicted, axis=-1).item()
    # Get the actual word from the word index
    output_word = tokenizer.index_word[predicted]
    # Append to the current text
    seed_text += " " + output_word

print(seed_text)

Help me Obi Wan Kenobi, you're my only hope to another fair place doth tell my glory be it is new old 'will ' still to be fire your
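
The loop above is a form of greedy decoding: at every step it appends the single most probable next word. The sketch below shows the same idea with a hypothetical lookup table in place of the trained model; `greedy_generate` and the probabilities in `toy_next_word_probs` are invented for illustration and do not come from the notebook.

```python
def greedy_generate(seed_words, next_word_probs, num_words):
    """Repeatedly append the most probable next word given the last word."""
    words = list(seed_words)
    for _ in range(num_words):
        candidates = next_word_probs.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        # Equivalent of np.argmax over the predicted distribution
        words.append(max(candidates, key=candidates.get))
    return words


# Invented probabilities standing in for model.predict
toy_next_word_probs = {
    "fair": {"place": 0.6, "rose": 0.4},
    "place": {"doth": 0.9, "fair": 0.1},
    "doth": {"tell": 1.0},
}

print(" ".join(greedy_generate(["fair"], toy_next_word_probs, num_words=3)))
```

Greedy decoding is deterministic, so the same seed text always yields the same continuation.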


## 4. NLP Recognizer

### 4.1. Problem statement

The objective of this subproject is to build and train a classifier for detecting sarcasm in text using a given sarcasm dataset. The model will be evaluated on its ability to correctly identify sarcasm in a set of sentences that it has not encountered during training. The performance of the classifier will be scored based on its accuracy in detecting sarcastic sentiment in these unseen sentences.

### 4.2. Load and inspect the data

In [31]:
# Define empty lists with sentences and corresponding labels for them
sentences = []
labels = []

# Set path for sarcasm data set
path = 'https://raw.githubusercontent.com/Chuvard/NLP_generator_and_detector/main/data/sarcasm.json'

# Fetch the data from the URL
response = requests.get(path)
datastore = response.json()

# Add labels and sentences
for item in datastore:
    sentences.append(item['headline'].lower())
    labels.append(item['is_sarcastic'])

# Define the size for the training set
training_size = int(len(sentences) * 0.8)

# Split the sentences
training_sentences = sentences[:training_size]
testing_sentences = sentences[training_size:]

# Split the labels
training_labels = labels[:training_size]
testing_labels = labels[training_size:]

In [32]:
## Inspect the data
print(f"Total sentences: {len(sentences)}")
print(f"Training sentences: {len(training_sentences)}")
print(f"Testing sentences: {len(testing_sentences)}")

Total sentences: 26709
Training sentences: 21367
Testing sentences: 5342
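
The 80/20 split is a plain slice at a fixed position, with no shuffling. A minimal sketch of the arithmetic, where `split_by_fraction` is a hypothetical helper and a list of integers stands in for the 26709 headlines:

```python
def split_by_fraction(items, train_fraction=0.8):
    """Slice items into a leading training part and a trailing testing part."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


placeholder_sentences = list(range(26709))   # stand-in for the real headlines
train_part, test_part = split_by_fraction(placeholder_sentences)

print(len(train_part), len(test_part))
```

Since the order of the file is preserved, the test set consists of the last fifth of the records.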


### 4.3. Model construction

#### Set hyperparameters

In [36]:
# Set hyperparameters
vocab_size = 1000          # The number of words to keep in the vocabulary
embedding_dim = 16         # Dimension of the embedding vectors
max_length = 120           # Maximum length of input sequences
trunc_type = 'post'        # Truncate sequences after the max_length
padding_type = 'post'      # Pad sequences after the end of the sequence
oov_tok = "<OOV>"          # Token to represent out-of-vocabulary words

#### Preprocessing the train and test sets

In [35]:
# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# Generate the word index dictionary
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Generate and pad the training sequences
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Generate and pad the testing sequences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Convert the labels lists into numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)
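
The effect of `num_words` and `oov_token` can be sketched in plain Python. The code below is a simplified illustration, not the Keras implementation: it assumes, as I understand the Tokenizer's behaviour, that the OOV token takes index 1, that the remaining words are ranked by frequency, and that any word whose index is `num_words` or higher is encoded as OOV. `build_word_index` and `encode` are hypothetical helpers.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency, reserving index 1 for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        keep = index is not None and index < num_words
        sequence.append(index if keep else oov_index)
    return sequence


toy_index = build_word_index(["the cat sat", "the cat ran", "the dog"])
print(encode("the dog barked", toy_index, num_words=4))
```

In this toy run `dog` is in the word index but ranked too low to be kept, and `barked` was never seen, so both map to the OOV index. Fitting on the training sentences only, as the cell above does, keeps test-set vocabulary out of the word index.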

#### Train a model

In [37]:
# Define the model architecture
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [38]:
# Summarize the model architecture
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 120, 16)           16000     
                                                                 
 bidirectional_1 (Bidirecti  (None, 120, 64)           12544     
 onal)                                                           
                                                                 
 bidirectional_2 (Bidirecti  (None, 64)                24832     
 onal)                                                           
                                                                 
 dense_1 (Dense)             (None, 16)                1040      
                                                                 
 dense_2 (Dense)             (None, 1)                 17        
                                                                 
Total params: 54433 (212.63 KB)
Trainable params: 5443
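
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM and Dense layers; the helper functions are hypothetical names, and the layer sizes are taken from the model definition above.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per vocabulary entry
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


layer_params = {
    "embedding_1": embedding_params(1000, 16),
    "bidirectional_1": 2 * lstm_params(16, 32),   # forward + backward LSTM
    "bidirectional_2": 2 * lstm_params(64, 32),   # input is the 64-wide output above
    "dense_1": dense_params(64, 16),
    "dense_2": dense_params(16, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print(f"total: {total_params}")
```

The bidirectional wrapper doubles the LSTM count because it trains separate forward and backward layers, and it also doubles the output width from 32 to 64.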

#### Compile and fit the model

In [39]:
# Compile the model with binary cross-entropy loss and the Adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 20

# Fit the model on the training data and evaluate it on the validation data
history = model.fit(training_padded, training_labels, epochs=num_epochs,
                        validation_data=(testing_padded, testing_labels), verbose=2)

Epoch 1/20
668/668 - 64s - loss: 0.4483 - accuracy: 0.7752 - val_loss: 0.3939 - val_accuracy: 0.8229 - 64s/epoch - 96ms/step
Epoch 2/20
668/668 - 55s - loss: 0.3545 - accuracy: 0.8383 - val_loss: 0.3739 - val_accuracy: 0.8287 - 55s/epoch - 82ms/step
Epoch 3/20
668/668 - 56s - loss: 0.3277 - accuracy: 0.8532 - val_loss: 0.3812 - val_accuracy: 0.8220 - 56s/epoch - 84ms/step
Epoch 4/20
668/668 - 55s - loss: 0.3096 - accuracy: 0.8618 - val_loss: 0.3762 - val_accuracy: 0.8287 - 55s/epoch - 82ms/step
Epoch 5/20
668/668 - 55s - loss: 0.2979 - accuracy: 0.8691 - val_loss: 0.3716 - val_accuracy: 0.8362 - 55s/epoch - 82ms/step
Epoch 6/20
668/668 - 54s - loss: 0.2882 - accuracy: 0.8743 - val_loss: 0.3837 - val_accuracy: 0.8261 - 54s/epoch - 81ms/step
Epoch 7/20
668/668 - 54s - loss: 0.2820 - accuracy: 0.8761 - val_loss: 0.4045 - val_accuracy: 0.8190 - 54s/epoch - 80ms/step
Epoch 8/20
668/668 - 55s - loss: 0.2732 - accuracy: 0.8815 - val_loss: 0.4073 - val_accuracy: 0.8272 - 55s/epoch - 82ms/step


#### Evaluate the model

In [40]:
# Extract accuracy values from the training history
training_accuracy = history.history['accuracy']
validation_accuracy = history.history['val_accuracy']

# Calculate the average accuracy across all epochs
average_training_accuracy = sum(training_accuracy) / len(training_accuracy)
average_validation_accuracy = sum(validation_accuracy) / len(validation_accuracy)

# Print the average accuracies
print(f'Average Training Accuracy: {average_training_accuracy}')
print(f'Average Validation Accuracy: {average_validation_accuracy}')

Average Training Accuracy: 0.8833855956792831
Average Validation Accuracy: 0.8240546584129333
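
An average over all epochs mixes early, under-trained epochs with later ones, so it can be useful to look at the best single epoch as well. The sketch below does this for the eight epochs visible in the training log above; the values are copied from that output, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(values, mode="min"):
    """Return (1-based epoch, value) of the lowest or highest entry."""
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Copied from the first eight epochs of the training log
val_loss = [0.3939, 0.3739, 0.3812, 0.3762, 0.3716, 0.3837, 0.4045, 0.4073]
val_accuracy = [0.8229, 0.8287, 0.8220, 0.8287, 0.8362, 0.8261, 0.8190, 0.8272]

print("lowest val_loss:", best_epoch(val_loss, mode="min"))
print("highest val_accuracy:", best_epoch(val_accuracy, mode="max"))
```

In these eight epochs the validation loss is lowest at epoch 5 and rises afterwards while the training loss keeps falling, which is a common sign of overfitting.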


### 4.4. Testing

In [41]:
def predict_sarcasm(text):
    # Convert text to lowercase
    text = text.lower()

    # Tokenize and pad the input text
    sequence = tokenizer.texts_to_sequences([text])
    padded_sequence = pad_sequences(sequence, maxlen=max_length, padding=padding_type, truncating=trunc_type)

    # Make prediction
    prediction = model.predict(padded_sequence)

    # Return True if the predicted probability exceeds the 0.5 threshold (adjust as needed)
    return prediction[0, 0] > 0.5

# Test the model with some examples
test_texts = [
    "This is a normal sentence.",
    "Well, that's just great.",
    "Oh, what a surprise!",
    "Scientists confirm that water is wet. Brilliant!",
    "Sure, because that makes total sense."
]

for text in test_texts:
    is_sarcasm = predict_sarcasm(text)
    sarcasm_label = "Sarcasm" if is_sarcasm else "Not Sarcasm"
    print(f'Text: "{text}" is {sarcasm_label}')

Text: "This is a normal sentence." is Not Sarcasm
Text: "Well, that's just great." is Sarcasm
Text: "Oh, what a surprise!" is Not Sarcasm
Text: "Scientists confirm that water is wet. Brilliant!" is Sarcasm
Text: "Sure, because that makes total sense." is Sarcasm
