### Natural Language Processing (NLP)
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that aims to bridge the gap between human communication and computational systems by enabling machines to interpret, understand, and generate human language. NLP combines linguistics (the structure and meaning of language) with machine learning algorithms to process and analyze large amounts of natural language data.

Before text can be fed to a model, it typically passes through the following steps:
- Text preprocessing - Cleaning the text by removing unnecessary characters or sequences (e.g., HTML tags, punctuation, or other non-alphanumeric characters), normalizing letter case, removing stopwords, etc. Stopwords are words that carry little meaning for most analyses. They are usually high-frequency function words such as articles, conjunctions, prepositions, and pronouns, which help structure sentences but don’t provide much useful information.
- Tokenization - Process of splitting text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenization method used, and serve as the basic units for further text analysis or processing.
- Vectorization - Transforming tokens (e.g., words) into a numerical representation (usually n-dimensional vectors).
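
The three steps above can be sketched in plain Python. This is only a minimal illustration, not the pipeline used later in this notebook: the stopword set, the sample sentence, and the helper names (clean, tokenize, vectorize) are assumptions made up for the example, and the vectorization shown is a simple bag-of-words count.

```python
import re
from collections import Counter

# Tiny, hand-picked stopword set (an assumption for this example only)
STOPWORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "was"}

def clean(text):
    # Lowercase, then replace runs of non-alphanumeric characters with a space
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def tokenize(text):
    # Word-level tokenization: split on whitespace and drop stopwords
    return [tok for tok in text.split() if tok not in STOPWORDS]

def vectorize(tokens, vocabulary):
    # Bag-of-words: count how often each vocabulary word occurs
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = tokenize(clean("The plot was great, and the acting was GREAT!"))
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)

print(tokens)      # ['plot', 'great', 'acting', 'great']
print(vocabulary)  # ['acting', 'great', 'plot']
print(vector)      # [1, 2, 1]
```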

The most common NLP tasks are:
- Language Understanding - For example, identifying entities such as names, dates, and locations (Named Entity Recognition, NER), labeling words as nouns, verbs, adjectives, etc. (Part-of-Speech Tagging), or analyzing the grammatical structure of sentences (Syntactic Parsing).
- Sentiment Analysis - Determining the emotional tone or sentiment expressed in a piece of text, like whether a review is positive or negative.
- Machine Translation - Automatically translating text from one language to another (e.g., Google Translate).
- Text Generation - Creating new text based on learned patterns (e.g., generating coherent responses in chatbots or summarizing long documents).
- Speech Recognition - Converting spoken language into written text (used in virtual assistants like Siri or Alexa).

In [1]:
import os
from keras.utils import get_file


# Download dataset from provided URL
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
# url = '/content/drive/MyDrive/Colab Notebooks/aclImdb'
dataset = get_file("aclImdb_v1", url, untar=True, cache_dir=".", cache_subdir="")
dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 ━━━━━━━━━━━━━━━━━━━━ 2s 0us/step


### Load the data

In [2]:
import os
import shutil
from keras.utils import text_dataset_from_directory


# Define train path
dataset_dir = '/content/drive/MyDrive/Colab Notebooks/aclImdb'
train_dir = os.path.join(dataset_dir, "train")

# Remove additional unsup/ directory from the train/ directory
remove_dir = os.path.join(train_dir, "unsup")
if os.path.exists(remove_dir):
  shutil.rmtree(remove_dir)

### Load the train dataset
The text_dataset_from_directory function scans the train/ directory for subdirectories (i.e., pos/ and neg/) and uses their names as class labels. It reads each text file in the subdirectories and assigns the text content to a dataset sample and the corresponding class label based on the subdirectory.

In [3]:
from keras.utils import text_dataset_from_directory

# Load train dataset (train_dir was defined in the previous cell)
train_dataset = text_dataset_from_directory(train_dir, batch_size=32)

Found 25020 files belonging to 2 classes.
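
How class labels follow from the directory names can be imitated in pure Python. The sketch below is a rough stand-in, not the Keras implementation: the helper name load_labeled_texts and the two sample reviews are hypothetical, and the real function additionally shuffles, batches, and returns a dataset object. It does reproduce the alphabetical ordering of class names, which is why neg maps to 0 and pos to 1.

```python
import os
import tempfile

def load_labeled_texts(root):
    # Class names are the subdirectory names, sorted alphabetically
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, fname), encoding="utf-8") as f:
                samples.append((f.read(), label))
    return class_names, samples

# Build a tiny stand-in for train/ with two made-up reviews
with tempfile.TemporaryDirectory() as root:
    for name, text in [("pos", "Loved it."), ("neg", "Terrible.")]:
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    class_names, samples = load_labeled_texts(root)

print(class_names)  # ['neg', 'pos']
print(samples)      # [('Terrible.', 0), ('Loved it.', 1)]
```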


### Text pre-processing
Individual reviews are stored in separate text files. The preprocess function below takes a single argument, text, which holds the review text read from a file. It first converts all letters to lowercase, then uses simple regular expressions to replace the HTML line-break tags (<br />) and any remaining non-alphanumeric characters with spaces, and finally returns the cleaned text. This is a custom function and could be extended or redefined with additional cleaning techniques.

In [4]:
from tensorflow.strings import lower, regex_replace

def preprocess(text):
  # Lower casing
  text = lower(text)

  # Replace HTML line-break tags with a space
  text = regex_replace(text, "<br />", " ")

  # Remove special characters and punctuation
  text = regex_replace(text, "[^A-Za-z0-9]+", " ")

  return text
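
To see what the cleaning does to a concrete string without running TensorFlow, the same three steps can be mirrored with Python's built-in str methods and the re module. This is a stand-in for illustration only: preprocess_py and the sample string are assumptions, and the notebook itself keeps using the TensorFlow version above.

```python
import re

def preprocess_py(text):
    # Same three steps as preprocess, using str and re instead of TensorFlow ops
    text = text.lower()
    text = text.replace("<br />", " ")
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return text

sample = "Great movie!<br /><br />Loved it."
print(repr(preprocess_py(sample)))  # 'great movie loved it '
```

The trailing space comes from the final period being replaced. It is harmless here because the text is later split on whitespace.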

### TextVectorization Layer
The TextVectorization layer in Keras is a preprocessing layer that transforms raw text into numeric tensors, preparing the text for ML models. It involves three main steps: standardization (i.e., text preprocessing), tokenization, and vectorization. Each of these steps is configurable, which makes the layer adaptable to different text preprocessing needs.

In the code snippet below, we first initialize a TextVectorization layer by defining its key parameters: the preprocessing function used for standardization, the maximum vocabulary size, the output mode, and the output sequence length. After the layer has been created, its adapt method is called on the list of strings from the train dataset in order to build the vocabulary.
A detailed breakdown of each step in the TextVectorization pipeline is provided below:
1. Standardize each example - This is the first part of the TextVectorization process. Its purpose is to clean and normalize the input text to ensure uniformity, regardless of variations in casing, punctuation, or other noise in the data. Although Keras provides default standardization that can be applied automatically (e.g., lower_and_strip_punctuation), in our case a custom standardization function (preprocess) is used, overriding this default behavior.
2. Split each example into substrings - Split the standardized text into substrings, typically words. By default, the TextVectorization layer splits each example on whitespace (which means that each substring is an individual word).
3. Recombine substrings into tokens - Tokens are the building blocks extracted from the raw text data. In the simplest case (like ours), each resulting substring (i.e., word) is treated as a token. However, tokens do not have to be single words: they can also be combinations of consecutive words (this can be accomplished by setting the ngrams parameter).
4. Index tokens - In this step each unique token is associated with an integer index based on the vocabulary built during the adapt step. When the TextVectorization layer is adapted, it analyzes the train dataset, determines the frequency of individual string values, and creates a vocabulary from them. This vocabulary can have unlimited size or can be capped. In our case, max_tokens is set to 10,000. In "int" output mode this count includes two reserved entries (index 0 for padding and index 1 for out-of-vocabulary tokens), so the vocabulary holds at most the 9,998 most frequent words, and rarer words are mapped to the out-of-vocabulary index.
5. Transform each example using the index - The goal of this step is to transform the array of token indices into a fixed-length output. In our case, the output is an array of exactly 250 integers. If the array is shorter than 250, it is padded (i.e., zeros are appended), and if it is longer, it is truncated to exactly 250 elements.
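
The five steps above can be imitated in a few lines of plain Python. This is a simplified sketch under stated assumptions, not the Keras implementation: the helper names (standardize, build_vocabulary, vectorize) and the toy corpus are made up, and tie-breaking between equally frequent words may differ from Keras. It does follow the same conventions of reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import re
from collections import Counter

def standardize(text):
    # Step 1: lowercase and replace non-alphanumeric runs with a space
    return re.sub(r"[^a-z0-9]+", " ", text.lower())

def build_vocabulary(texts, max_tokens):
    # Steps 2-4 (adapt): split on whitespace, count tokens, keep the most frequent.
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens,
    # so only max_tokens - 2 real words fit.
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(words, start=2)}

def vectorize(text, vocabulary, sequence_length):
    # Step 5: map tokens to indices, then pad with 0 or truncate to a fixed length
    indices = [vocabulary.get(tok, 1) for tok in standardize(text).split()]
    indices = indices[:sequence_length]
    return indices + [0] * (sequence_length - len(indices))

corpus = ["A great movie, great cast!", "A dull movie."]
vocabulary = build_vocabulary(corpus, max_tokens=5)
print(vocabulary)  # {'a': 2, 'great': 3, 'movie': 4}
print(vectorize("A great but dull movie", vocabulary, sequence_length=8))
# [2, 3, 1, 1, 4, 0, 0, 0]
```

With max_tokens=5 only three real words survive, so "but" and "dull" both map to the out-of-vocabulary index 1.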

In [5]:
from keras.layers import TextVectorization


# Initialize TextVectorization layer
vectorize_layer = TextVectorization(
  standardize=preprocess, # Custom standardization function
  max_tokens=10000, # Maximum size of the vocabulary
  output_mode="int", # Output of this layer will be a sequence of integer indices
  output_sequence_length=250 # Pad or truncate sequences to exactly 250 values
)

# Extract text from train dataset (without labels)
trainX = [x for x, _ in train_dataset.unbatch()]

# Call adapt on the list of strings to create the vocabulary
vectorize_layer.adapt(trainX)

### Applying the TextVectorization Layer
Here each input sample is passed to the TextVectorization layer as a 1D tensor holding a single string. To get that shape, we add an extra dimension with expand_dims(text, -1), which turns a batch of shape (batch_size,) into one of shape (batch_size, 1). We do this in a custom vectorize_text function, which will be used to vectorize both the train and test datasets.

In [6]:
from tensorflow import expand_dims

def vectorize_text(text, label):
  text = expand_dims(text, -1)
  return vectorize_layer(text), label

train_dataset = train_dataset.map(vectorize_text)
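
The effect of adding a trailing dimension is easiest to see on shapes. The sketch below uses NumPy as a stand-in (np.expand_dims follows the same axis convention as the TensorFlow function used above). The three sample strings are made up for the example.

```python
import numpy as np

# A batch of three raw strings: one dimension, one entry per review
batch = np.array(["first review", "second review", "third review"])
print(batch.shape)     # (3,)

# Axis -1 appends a new last dimension, so each review becomes its own row
expanded = np.expand_dims(batch, -1)
print(expanded.shape)  # (3, 1)
```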

### Defining the Model
After the TextVectorization layer is applied, each input text is transformed into a vector of integers in which every entry represents one token (one word in this case). The next step is to define the classification model for sentiment analysis.

The Embedding layer transforms the integer-encoded tokens into dense vectors of a predefined dimension (set by the output_dim parameter). These vectors are learned as the model trains. The sequence of vectors is then averaged by a GlobalAveragePooling1D layer to produce a fixed-length output vector suitable for feeding into a dense layer. Dropout layers with a rate of 0.2 are placed before and after the pooling layer to reduce overfitting. The output dense layer has a single neuron with a sigmoid activation, because the label is encoded as a single integer (0 for negative and 1 for positive reviews) and the sigmoid output can be read as the probability of the positive class. Finally, the model is compiled with the adam optimizer and the binary_crossentropy loss function.

In [7]:
from keras.models import Sequential
from keras import layers


def define_model():
  model = Sequential()
  # Input dim - Maximum size of the vocabulary
  # Output dim - Embedding vectors dimension
  model.add(layers.Embedding(input_dim=10000, output_dim=16))
  model.add(layers.Dropout(0.2))
  model.add(layers.GlobalAveragePooling1D())
  model.add(layers.Dropout(0.2))
  model.add(layers.Dense(1, activation="sigmoid"))
  # Compile model
  model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
  return model
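
To make the data flow through this model concrete, the forward pass can be traced with NumPy. This is a sketch under assumptions, not what Keras runs internally: the weights are random stand-ins for learned values, the batch of token indices is random, and dropout is left out because it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len, batch_size = 10000, 16, 250, 4

# Randomly initialised stand-ins for the weights Keras would learn
embedding_table = rng.normal(size=(vocab_size, embed_dim))
dense_weights = rng.normal(size=(embed_dim, 1))
dense_bias = np.zeros(1)

# A batch of integer-encoded reviews, as produced by the vectorization step
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))

embedded = embedding_table[token_ids]          # (4, 250, 16): one vector per token
pooled = embedded.mean(axis=1)                 # (4, 16): average over the sequence
logits = pooled @ dense_weights + dense_bias   # (4, 1)
probabilities = 1.0 / (1.0 + np.exp(-logits))  # (4, 1): sigmoid output

n_params = embedding_table.size + dense_weights.size + dense_bias.size
print(embedded.shape, pooled.shape, probabilities.shape)  # (4, 250, 16) (4, 16) (4, 1)
print(n_params)  # 160017
```

Almost all of the 160,017 trainable parameters sit in the embedding table (10,000 x 16). The classifier on top adds only 17.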

### Assignment

In [8]:
# import os
# import shutil
# from keras.utils import text_dataset_from_directory


# Define test path
dataset_dir = '/content/drive/MyDrive/Colab Notebooks/aclImdb'
test_dir = os.path.join(dataset_dir, "test")

# Remove additional unsup/ directory from the train/ directory
remove_dir = os.path.join(train_dir, "unsup")
if os.path.exists(remove_dir):
  shutil.rmtree(remove_dir)

In [9]:
# Extract labels from train dataset
train_labels = [label.numpy() for _, label in train_dataset.unbatch()]

# Print first 10 labels
print(train_labels[:10])


[1, 0, 1, 1, 1, 0, 1, 1, 0, 1]


In [10]:
import os
import tensorflow as tf
from tensorflow.strings import lower, regex_replace


print(os.listdir(train_dir))  # Should show 'pos' and 'neg'
print(os.listdir(os.path.join(train_dir, "pos"))[:5])  # List first 5 positive reviews
print(os.listdir(os.path.join(train_dir, "neg"))[:5])  # List first 5 negative reviews
# print(os.listdir(os.path.join(train_dir, "neg")))  # List all negative reviews


# Print positive review example
pos_file = os.path.join(train_dir, "pos", "5576_9.txt")  # Adjust filename if needed

with open(pos_file, "r", encoding="utf-8") as f:
    # print("Positive review example:", f.read())
    text = f.read()



def preprocess(text):
    text = lower(text)  # Convert to lowercase
    text = regex_replace(text, "<br />", " ")  # Remove HTML tags
    text = regex_replace(text, "[^A-Za-z0-9]+", " ")  # Remove non-alphanumeric characters
    return text

print("Positive review example:", text)
processed_text = preprocess(tf.convert_to_tensor(text))
print("Positive review example preprocessed:", processed_text.numpy().decode("utf-8"))



['urls_unsup.txt', 'urls_pos.txt', 'urls_neg.txt', 'labeledBow.feat', 'unsupBow.feat', 'neg', 'pos']
['11410_7.txt', '11669_9.txt', '11771_10.txt', '11659_9.txt', '11778_10.txt']
['11508_1.txt', '11519_1.txt', '11632_1.txt', '11636_3.txt', '11671_2.txt']
Positive review example: This movie is hilarious! I watched it with my friend and we just had to see it again. This movie is not for you movie-goers who will only watch the films that are nominated for Academy Awards (you know who you are.)I won't recap it because you have seen that from all the other reviews.<br /><br />"Whipped" is a light-hearted comedy that had me laughing throughout. It doesn't take itself too seriously and should be watched with your friends, not your girlfriend. It won't win any awards, but it just has to be watched to be appreciated. True, some of the jokes are toilet humor, but that is not necessarily a bad thing. Everyone can use some of it sometimes. Some people need to lighten up and see "Whipped" for what 

In [None]:
import matplotlib.pyplot as plt


# Load train and test datasets
train_dataset = text_dataset_from_directory(train_dir, batch_size=32, label_mode="int")
test_dataset = text_dataset_from_directory(test_dir, batch_size=32, label_mode="int")

# Extract only text data from train_dataset to adapt the TextVectorization layer
text_only_train = train_dataset.map(lambda text, label: text)

# Adapt the TextVectorization layer
vectorize_layer.adapt(text_only_train)

# Apply the TextVectorization layer to both datasets
train_dataset = train_dataset.map(lambda text, label: (vectorize_layer(text), label))
test_dataset = test_dataset.map(lambda text, label: (vectorize_layer(text), label))

# Define the model using the provided function
model = define_model()

# Train the model. The datasets are already batched, so batch_size is not passed.
# Validate on the held-out test set rather than on the training data.
history = model.fit(
    x=train_dataset,
    epochs=10,
    validation_data=test_dataset
)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_dataset)
print(f"Test accuracy: {accuracy:.4f}")

# Plot accuracy and loss in one plot
plt.figure(figsize=(10, 5))

plt.plot(history.history["accuracy"], label="Training Accuracy", color="blue")
plt.plot(history.history["val_accuracy"], label="Validation Accuracy", color="green")
plt.plot(history.history["loss"], label="Training Loss", linestyle="dashed", color="red")
plt.plot(history.history["val_loss"], label="Validation Loss", linestyle="dashed", color="orange")

plt.xlabel("Epochs")
plt.ylabel("Value")
plt.legend()
plt.title("Training and Validation Accuracy & Loss Over Epochs")
plt.show()
