# Intro to NLP: Representations, Embeddings, and Sentiment Classification

This notebook explores core NLP representation techniques (one‑hot, bag‑of‑words, TF‑IDF), embeddings/visualizations, and a simple sentiment classifier.

## How to represent text for AI

## One-Hot Encoding

**What it does:**  
Turns each word in a sentence into a list of numbers in which exactly one entry is "on" (1) and all the others are "off" (0). This tells the computer which word is present, but the numbers carry no information about meaning.

**Why it's useful:**  
It's a basic way to show computers what words are in a sentence, so they can start to "see" text.

**Example:**

Sentence: "I like pizza"

Vocabulary: ["i", "like", "pizza"]

One-hot encoding matrix:

| Word   | i | like | pizza |
|--------|---|------|-------|
| i      | 1 | 0    | 0     |
| like   | 0 | 1    | 0     |
| pizza  | 0 | 0    | 1     |

So, "i" becomes [1, 0, 0], "like" becomes [0, 1, 0], and "pizza" becomes [0, 0, 1].

This is how computers turn words into numbers using one-hot encoding!
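
The table above can be reproduced in a few lines of plain Python. This is only a minimal sketch: `one_hot` is a hypothetical helper name chosen for illustration, and the vocabulary is kept in sentence order so the output matches the table.

```python
# Minimal sketch: one-hot vectors for the sentence "I like pizza".
# `one_hot` is an illustrative helper, not part of any library.
def one_hot(word, vocabulary):
    """Return a list with a single 1 at the position of `word` in `vocabulary`."""
    return [1 if word == entry else 0 for entry in vocabulary]


vocabulary = ["i", "like", "pizza"]
vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

for word, vector in vectors.items():
    print(word, vector)
```

Each vector has the same length as the vocabulary, so adding a new word to the vocabulary makes every vector one entry longer.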

### How Rows and Tokens Work in One-Hot Encoding
- Each **row** in the one-hot encoding matrix stands for a word (token) from your sentence, in the order they appear.
- For example, if your sentence is:
  `Should we go to a pizzeria or do you a prefer a restaurant?`
- The tokens will be:
  `['should', 'we', 'go', 'to', 'a', 'pizzeria', 'or', 'do', 'you', 'a', 'prefer', 'a', 'restaurant']`
- The matrix will have 13 rows (one for each token above).
- Each row shows which word from the vocabulary is present at that position in the sentence.

**Summary:**
- Number of rows = number of tokens (words) in your sentence.
- Number of columns = number of unique words (vocabulary) in your sentence.

### What are tokens?
- **Tokens** are just the individual words in your sentence, after cleaning (removing punctuation, making lowercase, etc.).
- For example, if your sentence is "Should we go to a pizzeria or do you a prefer a restaurant?", the tokens will be:
  `['should', 'we', 'go', 'to', 'a', 'pizzeria', 'or', 'do', 'you', 'a', 'prefer', 'a', 'restaurant']`
- You can see the tokens by adding this line to your code:
```python
print("Tokens:", tokens)
```
- This helps you understand how the sentence is split into words before any encoding happens.
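
To try the tokenization step on its own, here is a small self-contained sketch. It uses the same idea as the cleaning described above (lowercase, replace anything that is not a letter with a space, split on spaces); the variable names are chosen for this example only.

```python
import re

sentence = "Should we go to a pizzeria or do you a prefer a restaurant?"

# Lowercase, replace every run of non-letters with a space, then split on whitespace.
tokens = re.sub(r"[^a-z]+", " ", sentence.lower()).split()
unique_words = sorted(set(tokens))

print("Tokens:", tokens)
print("Number of tokens:", len(tokens))              # rows of the one-hot matrix
print("Number of unique words:", len(unique_words))  # columns of the one-hot matrix
```

For this sentence the one-hot matrix therefore has 13 rows and 11 columns, because "a" appears three times but counts only once in the vocabulary.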

### Understanding the One-Hot Encoding Output
- The output matrix has rows for each word in your sentence (in order) and columns for each unique word in the sentence (the vocabulary).
- Each row has a single `1` in the column that matches the word at that position; all other entries are `0`.
- This lets the computer know which word is present at each spot, but doesn't tell it anything about meaning or similarity yet.
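
One way to see the last point is to compare one-hot vectors with a dot product. The sketch below uses the three vectors from the "I like pizza" example; `dot` is a hypothetical helper written for this illustration.

```python
# Sketch: one-hot vectors cannot express similarity between different words.
def dot(u, v):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(u, v))


i_vec = [1, 0, 0]
like_vec = [0, 1, 0]
pizza_vec = [0, 0, 1]

print(dot(pizza_vec, pizza_vec))  # a word compared with itself
print(dot(pizza_vec, like_vec))   # two different words
print(dot(pizza_vec, i_vec))      # two different words
```

Every pair of different words gives 0, so under one-hot encoding "pizza" is exactly as unrelated to "like" as it is to any other word.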

**Vocabulary = Unique Words**
- The vocabulary is simply the list of all unique words found in your sentence.
- For example, if your sentence is "Should we go to a pizzeria or do you a prefer a restaurant?", the vocabulary will be all the different words in that sentence, with no repeats.

**Summary:**
- One-hot encoding is a way for computers to turn words into numbers, so they can start working with text.

In [None]:
import numpy as np

In [None]:
import re
from typing import List


def clean_tokenize(text: str) -> List[str]:
    """Lowercase, replace every non-letter (punctuation, digits, symbols) with a space, and split.
    Keeps only ASCII letters a-z, so tokens like 'awesome?' become 'awesome'.
    """
    text = text.lower()
    # Replace non-letters with space (keeps a–z only)
    text = re.sub(r"[^a-z]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.split()

In [None]:
def one_hot_encoding(sentence):
    tokens = clean_tokenize(sentence)
    vocabulary = sorted(set(tokens))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    one_hot_matrix = np.zeros((len(tokens), len(vocabulary)), dtype=int)
    for i, word in enumerate(tokens):
        one_hot_matrix[i, word_to_index[word]] = 1
    return one_hot_matrix, vocabulary, tokens


# Example of usage
sentence = "Should we go to a pizzeria or do you a prefer a restaurant?"
one_hot_matrix, vocabulary, tokens = one_hot_encoding(sentence)

print("Original Sentence:", sentence)
print("Tokens:", tokens)
print("Vocabulary ({} unique words):".format(
    len(vocabulary)), vocabulary[:20])
print("One-Hot Encoding Matrix shape:", one_hot_matrix.shape)
print(one_hot_matrix)

## Bag of Words

**What it does:**  
Counts how many times each word appears in each sentence (or document). It builds a vocabulary of unique words and then makes a table of counts.

**Why it's useful:**  
Lets us compare sentences by the words they contain and how often. Helpful for search, clustering, or as input features for ML models.

Tiny example (3 sentences)
Sentences:
1) "This movie is awesome awesome"
2) "I do not say is good, but neither awesome"
3) "Awesome? Only a fool can say that"

Vocabulary (sorted): ["a", "awesome", "but", "can", "do", "fool", "good", "i", "is", "movie", "neither", "not", "only", "say", "that", "this"]

Counts matrix (rows = sentences, columns = vocabulary):
| sentence | a | awesome | but | can | do | fool | good | i | is | movie | neither | not | only | say | that | this |
|---------:|:-:|:------:|:---:|:---:|:--:|:----:|:----:|:-:|:--:|:-----:|:-------:|:---:|:----:|:---:|:----:|:----:|
| 1        | 0 |   2    |  0  |  0  | 0  |  0   |  0   | 0 |  1 |   1   |    0    |  0  |  0   |  0  |  0   |  1   |
| 2        | 0 |   1    |  1  |  0  | 1  |  0   |  1   | 1 |  1 |   0   |    1    |  1  |  0   |  1  |  0   |  0   |
| 3        | 1 |   1    |  0  |  1  | 0  |  1   |  0   | 0 |  0 |   0   |    0    |  0  |  1   |  1  |  1   |  0   |

How to read it
- Row 1 has "awesome" twice, and contains the words "is", "movie", and "this" once each.
- Row 2 includes many single‑occurrence words like "but", "do", "good", "i", "is", "neither", "not", "say", and one "awesome".
- Row 3 contains words like "a", "awesome", "can", "fool", "only", "say", "that" each once.

Notes
- Bag of Words ignores order: "movie awesome" equals "awesome movie".
- It captures counts, not meaning. Later, TF‑IDF and embeddings improve on this.
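
The first note can be checked directly with Python's `collections.Counter`. This is a minimal sketch with a hypothetical helper `bag`; it splits on whitespace only and is not the notebook's `bag_of_words` function.

```python
from collections import Counter


def bag(text):
    """Count lowercase whitespace-separated words, ignoring their order."""
    return Counter(text.lower().split())


same_words = bag("movie awesome") == bag("awesome movie")
meaning_lost = bag("not good but awesome") == bag("not awesome but good")

print(same_words)
print(meaning_lost)
```

Both comparisons are `True`. The second pair shows the cost of ignoring order: two sentences with opposite meanings get the same representation.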

In [None]:
def bag_of_words(sentences):
    """
    Creates a bag-of-words representation of a list of documents.
    """
    tokenized_sentences = [clean_tokenize(sentence) for sentence in sentences]
    flat_words = [word for sublist in tokenized_sentences for word in sublist]
    vocabulary = sorted(set(flat_words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    bow_matrix = np.zeros((len(sentences), len(vocabulary)), dtype=int)
    for i, sentence in enumerate(tokenized_sentences):
        for word in sentence:
            if word in word_to_index:
                bow_matrix[i, word_to_index[word]] += 1
    return vocabulary, bow_matrix


# Example of usage
corpus = [
    "This movie is awesome awesome",
    "I do not say is good, but neither awesome",
    "Awesome? Only a fool can say that"
]
vocabulary, bow_matrix = bag_of_words(corpus)
print("Vocabulary ({}):".format(len(vocabulary)), vocabulary[:20])
print("Bag of Words Matrix shape:", bow_matrix.shape)
print(bow_matrix)

In [None]:
# Optional: view Bag of Words as a labeled table
import pandas as pd
bow_df = pd.DataFrame(bow_matrix, columns=vocabulary)
bow_df

## Term Frequency (TF) and TF-IDF

**What it does:**  
- **TF:** Shows how often each word appears in a sentence, compared to the total number of words.
- **TF-IDF:** Adjusts these counts to highlight words that are special or unique to each sentence, and downplays words that appear everywhere.

**Why it's useful:**  
TF-IDF helps computers find the words that best characterize each text, which is useful for search, ranking results, and keyword extraction.


### How to read TF and TF‑IDF

- Term Frequency (TF): Within one sentence, count each word and divide by total words in that sentence. TF values are between 0 and 1.
- Inverse Document Frequency (IDF): Words that appear in many sentences get lower weight; rare words get higher weight.
- TF‑IDF = TF × IDF: High when a word is frequent in a sentence but rare across other sentences.

Tiny example
Sentences:
1) "this movie is awesome awesome"
2) "i do not say is good, but neither awesome"
3) "awesome? only a fool can say that"

Intuition:
- The word "awesome" appears in all three sentences, so its IDF weight is lower than a word that appears in only one sentence.
- A word that appears many times in a single sentence (like "awesome" in sentence 1) can still get a decent score thanks to its high TF, while words that appear in most sentences (like "is", found in two of the three) receive a lower IDF than words found in only one.

Reading the printed matrix:
- Rows are sentences; columns are words in the vocabulary.
- Larger numbers indicate words that are more important to that sentence.
- Compare columns: a word with small values in many rows is probably not very informative; a word with a large value in a single row is likely distinctive there.

#### Tiny numeric walk‑through

Let’s compute TF and IDF for the word "awesome" across the three sentences above.

1) Token counts
- Sentence 1: this(1), movie(1), is(1), awesome(2) → total words = 5
- Sentence 2: i(1), do(1), not(1), say(1), is(1), good(1), but(1), neither(1), awesome(1) → total words = 9
- Sentence 3: a(1), awesome(1), can(1), fool(1), only(1), say(1), that(1) → total words = 7

2) TF (per sentence)
- TF1(awesome) = 2 / 5 = 0.40
- TF2(awesome) = 1 / 9 ≈ 0.11
- TF3(awesome) = 1 / 7 ≈ 0.14

3) IDF (with smoothing like in code): idf = log((1 + N) / (1 + df)) + 1
- N = 3 sentences; df(awesome) = 3 (appears in all three)
- IDF(awesome) = log((1+3)/(1+3)) + 1 = log(1) + 1 = 1.0

4) TF‑IDF = TF × IDF
- TF‑IDF1(awesome) = 0.40 × 1.0 = 0.40
- TF‑IDF2(awesome) ≈ 0.11 × 1.0 ≈ 0.11
- TF‑IDF3(awesome) ≈ 0.14 × 1.0 ≈ 0.14

Interpretation
- "awesome" is still most important to sentence 1 due to repetition.
- Words rare across sentences (df close to 1) would get higher IDF and stand out in their sentence; words present everywhere get down‑weighted.
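
The walk-through can be reproduced with the standard library. This is a sketch under stated assumptions: the document frequencies below were counted by hand from the three example sentences, and `smoothed_idf` is a hypothetical helper implementing the formula from step 3 with the natural logarithm.

```python
import math

N = 3  # number of sentences
# Document frequencies counted by hand from the three example sentences.
document_frequency = {"awesome": 3, "is": 2, "movie": 1}


def smoothed_idf(df, n_docs):
    """idf = ln((1 + N) / (1 + df)) + 1, the smoothed form from the walk-through."""
    return math.log((1 + n_docs) / (1 + df)) + 1.0


idf = {word: smoothed_idf(df, N) for word, df in document_frequency.items()}
for word, value in idf.items():
    print(f"idf({word}) = {value:.3f}")

# Sentence 1 has 5 words: "awesome" appears twice, "movie" once.
tfidf_awesome_s1 = (2 / 5) * idf["awesome"]
tfidf_movie_s1 = (1 / 5) * idf["movie"]
print(f"tf-idf(awesome, s1) = {tfidf_awesome_s1:.3f}")
print(f"tf-idf(movie, s1)   = {tfidf_movie_s1:.3f}")
```

"movie" has a higher IDF than "awesome" because it occurs in only one sentence, yet "awesome" still scores highest in sentence 1 because it is repeated.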

#### Tiny TF table (normalized counts)

Using the same three sentences:
1) "this movie is awesome awesome" (5 words)
2) "i do not say is good, but neither awesome" (9 words)
3) "awesome? only a fool can say that" (7 words)

Vocabulary (sorted): ["a", "awesome", "but", "can", "do", "fool", "good", "i", "is", "movie", "neither", "not", "only", "say", "that", "this"]

TF matrix (rows = sentences, columns = vocabulary; values rounded to 2 decimals):
| sentence | a | awesome | but | can | do | fool | good | i | is | movie | neither | not | only | say | that | this |
|---------:|:-:|:------:|:---:|:---:|:--:|:----:|:----:|:-:|:--:|:-----:|:-------:|:---:|:----:|:---:|:----:|:----:|
| 1        |0.00|  0.40  |0.00 |0.00 |0.00| 0.00 |0.00  |0.00|0.20| 0.20 | 0.00    |0.00 |0.00 |0.00 |0.00 |0.20 |
| 2        |0.00|  0.11  |0.11 |0.00 |0.11| 0.00 |0.11  |0.11|0.11| 0.00 | 0.11    |0.11 |0.00 |0.11 |0.00 |0.00 |
| 3        |0.14|  0.14  |0.00 |0.14 |0.00| 0.14 |0.00  |0.00|0.00| 0.00 | 0.00    |0.00 |0.14 |0.14 |0.14 |0.00 |

Notes
- Each row sums to 1.0 (up to rounding), because TF divides counts by total words in that sentence.
- TF emphasizes repetition within a sentence; IDF will reduce the weight of words spread across many sentences.
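
As a quick check of the first note, the TF row for sentence 1 can be computed with `collections.Counter`. This is a minimal sketch that assumes the sentence is already cleaned, so a plain `split()` is enough.

```python
from collections import Counter

tokens = "this movie is awesome awesome".split()
counts = Counter(tokens)

# Term frequency: count of each word divided by the total number of words.
tf = {word: count / len(tokens) for word, count in counts.items()}

for word, value in tf.items():
    print(f"{word}: {value:.2f}")
print("Row sum:", round(sum(tf.values()), 6))
```

The values match row 1 of the table, and they add up to 1 because every token contributes exactly `1 / len(tokens)`.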

In [None]:
def compute_tf(sentences):
    """Compute the term frequency matrix for a list of sentences."""
    tokenized = [clean_tokenize(s) for s in sentences]
    vocabulary = sorted(set(w for sent in tokenized for w in sent))
    word_index = {word: i for i, word in enumerate(vocabulary)}
    tf = np.zeros((len(sentences), len(vocabulary)), dtype=np.float32)
    for i, words in enumerate(tokenized):
        word_count = len(words) if len(words) > 0 else 1
        for word in words:
            tf[i, word_index[word]] += 1 / word_count
    return tf, vocabulary


def compute_idf(sentences, vocabulary):
    """Compute the inverse document frequency with sklearn-style smoothing."""
    tokenized = [set(clean_tokenize(s)) for s in sentences]
    num_documents = len(sentences)
    idf = np.zeros(len(vocabulary), dtype=np.float32)
    word_index = {word: i for i, word in enumerate(vocabulary)}
    for word in vocabulary:
        df = sum(1 for sent in tokenized if word in sent)
        # sklearn-style: log((1 + N) / (1 + df)) + 1 → equals 1.0 for df == N
        idf[word_index[word]] = np.log((1 + num_documents) / (1 + df)) + 1.0
    return idf


def tf_idf(sentences):
    """Generate a TF-IDF matrix for a list of sentences."""
    tf, vocabulary = compute_tf(sentences)
    idf = compute_idf(sentences, vocabulary)
    tf_idf_matrix = tf * idf
    return vocabulary, tf_idf_matrix


vocabulary, tf_idf_matrix = tf_idf(corpus)
print("Vocabulary ({}):".format(len(vocabulary)), vocabulary[:20])
print("TF-IDF Matrix shape:", tf_idf_matrix.shape)
print(tf_idf_matrix)

In [None]:
# Optional: view TF / IDF / TF-IDF as tables

import pandas as pd
from IPython.display import display

tf_matrix, tf_vocab = compute_tf(corpus)
idf_vector = compute_idf(corpus, tf_vocab)


tf_df = pd.DataFrame(tf_matrix, columns=tf_vocab)
idf_df = pd.DataFrame([idf_vector], columns=tf_vocab)
tfidf_df = pd.DataFrame(tf_idf_matrix, columns=vocabulary)

print("TF table:")
display(tf_df.round(2))
print("IDF vector:")
display(idf_df.round(2))
print("TF-IDF table:")
display(tfidf_df.round(2))

In [None]:
# Show top TF-IDF terms per sentence
import numpy as np


def top_tfidf_per_doc(vocab, tfidf_matrix, top_k=3):
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i]
        idxs = np.argsort(-row)[:top_k]
        pairs = [(vocab[j], float(row[j])) for j in idxs if row[j] > 0]
        print(f"Sentence {i+1}: {pairs}")


top_tfidf_per_doc(vocabulary, tf_idf_matrix, top_k=3)

## Embeddings: Making Words Understandable for Computers

**What it does:**  
Turns words into lists of numbers (called "embeddings") so computers can compare words and find patterns. These numbers capture some meaning and relationships between words.

**Why it's useful:**  
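Embeddings let computers measure how related two words are: words used in similar contexts, such as "good" and "great", end up with similar vectors. Note that antonyms like "good" and "bad" also share many contexts, so they can land closer together than you might expect. Either way, this is a big step up from just counting words!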
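
Embedding vectors are usually compared with cosine similarity, which measures the angle between two vectors. The sketch below uses hand-made 3-dimensional vectors that are purely illustrative (they are not learned embeddings), and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "table": [0.1, 0.0, 0.9],
}

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_table = cosine_similarity(toy_embeddings["good"], toy_embeddings["table"])

print(f"good vs great: {sim_good_great:.2f}")
print(f"good vs table: {sim_good_table:.2f}")
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in unrelated directions score close to 0. Real embeddings work the same way, only with many more dimensions.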

In [None]:
import numpy as np
import pandas as pd
import os
import re
import time
import nltk
from gensim.models import Word2Vec
from tqdm import tqdm
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage
from adjustText import adjust_text
from umap import UMAP
# nltk.download('punkt')

In [None]:
# Download, unzip, and read the dataset (Linux/macOS/Colab; needs wget and unzip)
!wget https://github.com/SalvatoreRa/tutorial/blob/main/datasets/IMDB.zip?raw=true
!unzip IMDB.zip?raw=true
df=pd.read_csv("IMDB Dataset.csv")

Windows users: Skip the shell cell above and run the next cell instead (pure-Python download/extract).

In [None]:
# Download and extract IMDB.zip using Python (Windows-friendly). No external tools needed.
import os
import zipfile
import urllib.request
import pandas as pd


def ensure_imdb_csv(csv_name="IMDB Dataset.csv"):
    if os.path.exists(csv_name):
        return csv_name

    zip_url = "https://raw.githubusercontent.com/SalvatoreRa/tutorial/main/datasets/IMDB.zip"
    zip_path = "IMDB.zip"

    print(f"Downloading {zip_url} -> {zip_path} ...")
    urllib.request.urlretrieve(zip_url, zip_path)

    print(f"Extracting {zip_path} ...")
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(".")

    try:
        os.remove(zip_path)
    except OSError:
        pass

    if not os.path.exists(csv_name):
        raise FileNotFoundError(
            f"Expected '{csv_name}' after extraction, but it was not found.")
    return csv_name


csv_path = ensure_imdb_csv()
df = pd.read_csv(csv_path)
df.head()

In [40]:
# Ensure NLTK punkt is available (needed for word_tokenize)
import nltk
try:
    print("Checking for NLTK punkt tokenizer...")
    nltk.data.find('tokenizers/punkt')
    print("NLTK punkt tokenizer is available.")
except LookupError:
    print("Downloading NLTK punkt tokenizer...")
    nltk.download('punkt')

Checking for NLTK punkt tokenizer...
NLTK punkt tokenizer is available.


In [None]:
def preprocessing_reviews(reviews):
    """
    Simple preprocessing: strip HTML tags, keep only letters and spaces,
    lowercase, and drop single-character words.
    """

    processed_reviews = []

    for review in tqdm(reviews):
        # Replace HTML tags with a space so neighbouring words do not merge
        review = re.sub('<[^>]+>', ' ', review)
        processed = re.sub('[^a-zA-Z ]', '', review)
        words = processed.split()
        processed_reviews.append(
            ' '.join([word.lower() for word in words if len(word) > 1]))
    return processed_reviews


df['reviews_processed'] = preprocessing_reviews(df['review'])
df['tokens'] = df['reviews_processed'].apply(nltk.word_tokenize)
df.head()

In [None]:
start_time = time.time()
# embedding
model = Word2Vec(sentences=df['tokens'].tolist(),
                 sg=1,
                 vector_size=100,
                 window=5,
                 workers=4)

print(f'Time needed : {(time.time() - start_time) / 60:.2f} mins')

In [None]:
# Entire set of words in the model
all_words = list(model.wv.index_to_key)
all_vectors = np.array([model.wv[word] for word in all_words])

# Highlighted words and their vectors (keep only words present in the model vocabulary)
highlight_words = ['Berlin', 'Paris', 'London', 'Rome', 'Italy',
                   'France', 'Germany', 'England', 'movie', 'production', 'good', 'bad']
highs = [w.lower() for w in highlight_words
         if w.lower() in model.wv.key_to_index]
indices = [model.wv.key_to_index[word] for word in highs]
highlight_vectors = all_vectors[indices]

linked = linkage(highlight_vectors, 'ward')

plt.figure(figsize=(5, 4))
# Labels must match the rows of highlight_vectors one-to-one
dendrogram(linked,
           orientation='top',
           labels=highs,
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Words')
plt.ylabel('Euclidean distances')
plt.xticks(rotation=45)
plt.savefig('word_dendrogram.jpg', format='jpeg', bbox_inches='tight')
plt.show()

In [None]:
# Apply t-SNE to the entire set of vectors
tsne = TSNE(n_components=2, random_state=0)
Y_tsne = tsne.fit_transform(all_vectors)

highlight_words = ['berlin', 'rome', 'London', 'France', 'Germany',
                   'movie', 'production', 'mother', 'family']

# Lowercase and keep only words the model knows, so labels and points stay aligned
highlight_words = [w.lower() for w in highlight_words
                   if w.lower() in model.wv.key_to_index]
indices = [model.wv.key_to_index[word] for word in highlight_words]
highlight_vectors = all_vectors[indices]
Y_highlight = Y_tsne[indices]


plt.figure(figsize=(10, 7))


sns.scatterplot(x=Y_tsne[:, 0], y=Y_tsne[:, 1], color="lightgrey", alpha=0.3)

# Plot highlighted words
palette = sns.color_palette("hsv", len(highlight_words))
texts = []
for i, word in enumerate(highlight_words):
    plt.scatter(Y_highlight[i, 0], Y_highlight[i, 1],
                color=palette[i], s=100, label=word)
    # adjust text
    texts.append(plt.text(Y_highlight[i, 0],
                 Y_highlight[i, 1], word, fontsize=12))

adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

plt.title('t-SNE visualization of Word2Vec embeddings', fontsize=20)
plt.xlabel('Component 1')
plt.ylabel('Component 2')


plt.grid(True)
plt.legend(title='Highlighted Words', title_fontsize='13', fontsize='11')
plt.savefig('word_tsne.jpg', format='jpeg')
plt.show()

In [None]:
# Apply UMAP to the entire set of vectors
umap = UMAP(n_components=2, random_state=42)
Y_umap = umap.fit_transform(all_vectors)

Y_highlight = Y_umap[indices]


plt.figure(figsize=(10, 7))
sns.scatterplot(x=Y_umap[:, 0], y=Y_umap[:, 1], color="lightgrey", alpha=0.3)

palette = sns.color_palette("hsv", len(highlight_words))
texts = []
for i, word in enumerate(highlight_words):
    plt.scatter(Y_highlight[i, 0], Y_highlight[i, 1],
                color=palette[i], s=100, label=word)

    texts.append(plt.text(Y_highlight[i, 0],
                 Y_highlight[i, 1], word, fontsize=12))

adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

plt.title('UMAP visualization of Word2Vec embeddings', fontsize=20)
plt.xlabel('Component 1')
plt.ylabel('Component 2')


plt.grid(True)
plt.legend(title='Highlighted Words', title_fontsize='13', fontsize='11')
plt.savefig('word_umap.jpg', format='jpeg')
plt.show()

In [None]:
def plot_vectors_and_angle(v1, v2):

    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    cosine_similarity = dot_product / (norm_v1 * norm_v2)
    # Clip to [-1, 1] so rounding error cannot push arccos out of its domain
    angle_radians = np.arccos(np.clip(cosine_similarity, -1.0, 1.0))
    angle_degrees = np.degrees(angle_radians)

    fig, ax = plt.subplots(figsize=(5, 5))

    ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy',
              scale=1, color='r', label=f"Vector 1: {v1}")
    ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy',
              scale=1, color='b', label=f"Vector 2: {v2}")

    start_angle = np.arctan2(v1[1], v1[0])
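    # z-component of the 2-D cross product (np.cross on 2-D vectors is deprecated in NumPy 2.0)
    if v1[0] * v2[1] - v1[1] * v2[0] < 0: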
        angle_radians = -angle_radians

    theta = np.linspace(start_angle, start_angle + angle_radians, 100)
    r = 0.5 * min(np.linalg.norm(v1), np.linalg.norm(v2))
    x = r * np.cos(theta)
    y = r * np.sin(theta)

    ax.plot(x, y, linestyle='-', color='green', lw=2)

    midpoint = (start_angle + angle_radians / 2)
    ax.annotate(r'$\theta$', xy=(r * np.cos(midpoint), r * np.sin(midpoint)), xytext=(20, 10),
                textcoords='offset points', fontsize=16, arrowprops=dict(arrowstyle='->', lw=0.5))

    max_range = np.max(
        np.abs(np.vstack([v1, v2, [x.max(), y.max()]]))) * 1.1  # 10% padding
    ax.set_xlim([-max_range, max_range])
    ax.set_ylim([-max_range, max_range])

    plt.grid(True)
    plt.axhline(0, color='black', linewidth=0.5)
    plt.axvline(0, color='black', linewidth=0.5)
    plt.title(f'Angle between vectors: {angle_degrees:.2f} degrees')
    plt.suptitle(
        f'Similarity between vectors: {cosine_similarity:.2f}', fontsize=10, y=.95)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.legend(loc='lower right')
    plt.savefig('cosine_similarity.jpg', format='jpeg', bbox_inches='tight')
    plt.show()

    return cosine_similarity, angle_degrees


# Example usage
v1 = np.array([2, 3])
v2 = np.array([-1, 2])
cos_sim, angle = plot_vectors_and_angle(v1, v2)

In [None]:
word_1 = "good"
syn = "great"
ant = "bad"
most_sim = model.wv.most_similar(word_1)
print("Top 3 most similar words to {} are: {}".format(word_1, most_sim[:3]))

synonyms_dist = model.wv.distance(word_1, syn)
antonyms_dist = model.wv.distance(word_1, ant)
print("Synonyms {}, {} have cosine distance: {}".format(word_1, syn, synonyms_dist))
print("Antonyms {}, {} have cosine distance: {}".format(word_1, ant, antonyms_dist))
a = 'king'
a_star = 'man'
b = 'woman'
b_star = model.wv.most_similar(positive=[a, b], negative=[a_star])
# king - man + woman: the analogy reads "man is to king as woman is to ?"
print("{} is to {} as {} is to: {}".format(a_star, a, b, b_star[0][0]))

## RNN, LSTM, GRU, CNN for Text

**What it does:**  
These are different types of computer models that can read and understand text. They help computers learn patterns, remember information, and make predictions about words or sentences.

**Why it's useful:**  
These models are the building blocks for things like chatbots, translators, and tools that can read and understand human language.
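
To make "remember information" concrete, here is a sketch of a single vanilla (Elman) RNN update in plain Python: the new hidden state mixes the current input with the previous hidden state. The sizes and weight values are arbitrary and chosen only for illustration, and `rnn_step` is a hypothetical helper. PyTorch's `nn.RNN` uses the same tanh form with learned weights and two bias terms.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN update: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t


# Toy sizes: 3 input features, 2 hidden units. Weights are arbitrary, not learned.
W_x = [[0.1, -0.2, 0.3],
       [0.0, 0.5, -0.1]]
W_h = [[0.2, 0.1],
       [-0.3, 0.4]]
b = [0.0, 0.1]

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0, 0.0]  # initial hidden state
for step, x_t in enumerate(sequence, start=1):
    h = rnn_step(x_t, h, W_x, W_h, b)
    print(f"step {step}: h = {[round(v, 3) for v in h]}")
```

Because `h` is fed back in at every step, the final hidden state depends on the whole sequence and on the order of its elements. LSTM and GRU layers add gates on top of this idea to control what is kept and what is forgotten.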

In [None]:
import numpy as np
import torch
import torch.nn as nn

In [None]:
array = np.random.random((10, 5, 3))

data_tensor = torch.tensor(array, dtype=torch.float32)
RNN = nn.RNN(input_size=3, hidden_size=10,
             num_layers=1, batch_first=True)
output, hidden = RNN(data_tensor)
output.shape

In [None]:
data_tensor = torch.tensor(np.random.random((10, 5, 3)), dtype=torch.float32)
LSTM = nn.LSTM(input_size=3, hidden_size=10,
               num_layers=1, batch_first=True)
output, (hidden, cell) = LSTM(data_tensor)
output.shape

In [None]:
data_tensor = torch.tensor(np.random.random((10, 5, 3)), dtype=torch.float32)
GRU = nn.GRU(input_size=3, hidden_size=10,
             num_layers=1, batch_first=True)
output, hidden = GRU(data_tensor)
output.shape

In [None]:
data_tensor = torch.tensor(np.random.random((10, 5, 3)), dtype=torch.float32)
# Conv1d expects (batch, channels, length): move the 3 features to the channel axis
Conv1d = nn.Conv1d(in_channels=3, out_channels=16,
                   kernel_size=3, stride=1, padding=1)
output = Conv1d(data_tensor.permute(0, 2, 1))
output.shape

## Classify Reviews with Deep Learning

**What it does:**  
Uses a computer model to read movie reviews and decide if they are positive or negative. The model learns from lots of examples and gets better over time.

**Why it's useful:**  
This is how computers can automatically sort reviews, detect spam, or even understand how people feel about products or movies.
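
Before a model can read a review, each word has to become a number. The sketch below shows the idea on three toy reviews: keep the most common words, give each an index starting at 1, and turn every review into a fixed-length list of indices. The toy reviews, the helper `encode`, the vocabulary size of 3, and the use of index 0 for left-padding are all assumptions made for this illustration.

```python
from collections import Counter

# Toy reviews invented for this example.
reviews = [
    "great movie great acting",
    "terrible movie",
    "great fun",
]

word_counts = Counter(word for review in reviews for word in review.split())
# Keep the 3 most common words; index 0 is left free for padding (an assumption here).
vocab = {word: i + 1 for i, (word, _) in enumerate(word_counts.most_common(3))}


def encode(review, vocab, max_len):
    """Map known words to indices, drop unknown words, left-pad with 0 up to max_len."""
    ids = [vocab[w] for w in review.split() if w in vocab][:max_len]
    return [0] * (max_len - len(ids)) + ids


encoded = [encode(review, vocab, max_len=4) for review in reviews]
print(vocab)
for row in encoded:
    print(row)
```

Words outside the vocabulary (such as "terrible" here) are simply dropped, which is the price of limiting the vocabulary to the most frequent words.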

In [None]:
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import os
import sys
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
import re
import string
from collections import Counter
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import nltk
nltk.download('stopwords')
nltk.download('punkt')

is_cuda = torch.cuda.is_available()

# Check whether a GPU is available
if is_cuda:
    device = torch.device("cuda")
    print("Using GPU")
else:
    device = torch.device("cpu")
    print("Using CPU")

In [None]:
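# Read the dataset; download and unzip it first if it is missing (needs wget and unzip)
try:
    df = pd.read_csv("IMDB Dataset.csv")
except FileNotFoundError:
    !wget https://github.com/SalvatoreRa/tutorial/blob/main/datasets/IMDB.zip?raw=true
    !unzip IMDB.zip?raw=true
    df = pd.read_csv("IMDB Dataset.csv")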

df['sentiment_encoded'] = np.where(df['sentiment']=='positive',0,1)
X,y = df['review'].values, df['sentiment_encoded'].values
x_train,x_test,y_train,y_test = train_test_split(X,y,stratify=y, test_size=.2)
x_train,x_val,y_train,y_val = train_test_split(x_train,y_train,stratify=y_train, test_size=.1)
y_train, y_val, y_test = np.array(y_train), np.array(y_val), np.array(y_test)

Windows users: If the shell commands above fail on this platform, run the next cell to download/extract IMDB via Python.

In [None]:
# Windows-friendly: download and extract IMDB.zip without external tools
import os
import zipfile
import urllib.request
import pandas as pd


def ensure_imdb_csv(csv_name="IMDB Dataset.csv", zip_url="https://raw.githubusercontent.com/SalvatoreRa/tutorial/main/datasets/IMDB.zip"):
    if os.path.exists(csv_name):
        return csv_name
    zip_path = "IMDB.zip"
    print(f"Downloading {zip_url} -> {zip_path} ...")
    urllib.request.urlretrieve(zip_url, zip_path)
    print(f"Extracting {zip_path} ...")
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(".")
    try:
        os.remove(zip_path)
    except OSError:
        pass
    if not os.path.exists(csv_name):
        raise FileNotFoundError(
            f"Expected '{csv_name}' after extraction, but it was not found.")
    return csv_name


csv_path = ensure_imdb_csv()
df = pd.read_csv(csv_path)

df['sentiment_encoded'] = np.where(df['sentiment'] == 'positive', 0, 1)
X, y = df['review'].values, df['sentiment_encoded'].values
x_train, x_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=.2)
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, stratify=y_train, test_size=.1)
y_train, y_val, y_test = np.array(y_train), np.array(y_val), np.array(y_test)

In [None]:
def generate_wordclouds(df):
    '''
    Generate two word clouds (up to 200 words each) from the most frequent words
    in the positive and negative reviews respectively.
    '''
    stop_words = set(stopwords.words('english'))

    # Separating reviews by sentiment
    positive_reviews = df[df['sentiment'] == 'positive']['review']
    negative_reviews = df[df['sentiment'] == 'negative']['review']

    def get_words(reviews):
        all_words = []
        for review in reviews:
            # Drop HTML tags such as <br /> and lowercase so stop words match
            review = re.sub(r"<[^>]+>", ' ', review).lower()
            review = re.sub(r"[^\w\s]", '', review)
            review = re.sub(r"\d", '', review)
            words = review.split()
            filtered_words = [
                word for word in words if word not in stop_words and len(word) > 1]
            all_words.extend(filtered_words)
        return all_words

    positive_words = get_words(positive_reviews)
    negative_words = get_words(negative_reviews)

    positive_counts = Counter(positive_words)
    negative_counts = Counter(negative_words)

    positive_wordcloud = WordCloud(
        width=400,
        height=400,
        max_words=200,
        max_font_size=100,
        background_color='white',
        color_func=lambda *args, **kwargs: "green"
    ).generate_from_frequencies(positive_counts)

    negative_wordcloud = WordCloud(
        width=400,
        height=400,
        max_words=200,
        max_font_size=100,
        background_color='white',
        color_func=lambda *args, **kwargs: "red"
    ).generate_from_frequencies(negative_counts)

    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.imshow(positive_wordcloud, interpolation='bilinear')
    plt.title('Positive Reviews')
    plt.axis("off")

    plt.subplot(1, 2, 2)
    plt.imshow(negative_wordcloud, interpolation='bilinear')
    plt.title('Negative Reviews')
    plt.axis("off")
    plt.savefig('word_clouds.jpg', format='jpeg', bbox_inches='tight')
    plt.show()


generate_wordclouds(df)

In [None]:
def plot_review_length_by_sentiment(df):
    '''
    Plots histograms of the number of words per review for positive and negative reviews with summary statistics.

    '''

    positive_reviews = df[df['sentiment'] == 'positive']['review']
    negative_reviews = df[df['sentiment'] == 'negative']['review']

    def get_review_lengths(reviews):
        return [len(review.split()) for review in reviews]

    positive_lengths = get_review_lengths(positive_reviews)
    negative_lengths = get_review_lengths(negative_reviews)

    def get_summary_stats(lengths):
        return {
            'min': np.min(lengths),
            'avg': np.mean(lengths),
            'median': np.median(lengths),
            'max': np.max(lengths)
        }

    pos_stats = get_summary_stats(positive_lengths)
    neg_stats = get_summary_stats(negative_lengths)

    plt.figure(figsize=(12, 6))

    # Plot for positive reviews
    plt.subplot(1, 2, 1)
    plt.hist(positive_lengths, bins=30, color='green',
             edgecolor='black', alpha=0.7)
    plt.title('Word Distribution for Positive Reviews')
    plt.xlabel('Number of Words')
    plt.ylabel('Number of Reviews')
    plt.grid(True)
    stats_text = f"Min: {pos_stats['min']}\nAvg: {pos_stats['avg']:.2f}\nMedian: {pos_stats['median']}\nMax: {pos_stats['max']}"
    plt.text(0.95, 0.95, stats_text, transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
             horizontalalignment='right', bbox=dict(boxstyle="round,pad=0.5", facecolor='wheat', alpha=0.5))

    # Plot for negative reviews
    plt.subplot(1, 2, 2)
    plt.hist(negative_lengths, bins=30, color='red',
             edgecolor='black', alpha=0.7)
    plt.title('Word Distribution for Negative Reviews')
    plt.xlabel('Number of Words')
    plt.ylabel('Number of Reviews')
    plt.grid(True)
    stats_text = f"Min: {neg_stats['min']}\nAvg: {neg_stats['avg']:.2f}\nMedian: {neg_stats['median']}\nMax: {neg_stats['max']}"
    plt.text(0.95, 0.95, stats_text, transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
             horizontalalignment='right', bbox=dict(boxstyle="round,pad=0.5", facecolor='wheat', alpha=0.5))

    plt.tight_layout()
    plt.savefig('review_length.jpg', format='jpeg', bbox_inches='tight')
    plt.show()


plot_review_length_by_sentiment(df)

In [None]:
def preprocess_review(review):
    '''
    Cleaning of the review: drop HTML tags, replace non-word characters,
    remove digits, collapse whitespace, and lowercase.
    '''
    review = re.sub(r"<[^>]+>", ' ', review)  # Drop HTML tags such as <br />
    review = re.sub(r"[^\w\s]", ' ', review)  # Replace non-word characters with space
    review = re.sub(r"\d", '', review)        # Remove digits
    review = re.sub(r"\s+", ' ', review)      # Collapse whitespace last
    return review.strip().lower()


def tokenize_reviews(x_train, x_val, x_test):
    stop_words = set(stopwords.words('english'))

    # tokenize and clean list of reviews
    def tokenize_and_filter(reviews):
        word_list = []
        for review in reviews:
            words = word_tokenize(preprocess_review(review))
            filtered_words = [
                word for word in words if word not in stop_words and len(word) > 1]
            word_list.extend(filtered_words)
        return word_list

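    # Count token frequencies on the training split only, so the vocabulary
    # never sees validation or test reviews
    corpus = Counter(tokenize_and_filter(x_train))
    # Keep the 1000 most common words; indices start at 1 so 0 stays free for padding
    vocab = {word: i + 1
             for i, (word, _) in enumerate(corpus.most_common(1000))}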

    # convert reviews into sequences of indices
    def vectorize_reviews(reviews):
        vectorized = []
        for review in reviews:
            tokenized = word_tokenize(preprocess_review(review))
            indexed = [vocab[word] for word in tokenized if word in vocab]
            vectorized.append(indexed)
        return vectorized

    _x_train = vectorize_reviews(x_train)
    _x_val = vectorize_reviews(x_val)
    _x_test = vectorize_reviews(x_test)

    return _x_train, _x_val, _x_test, vocab


x_train, x_val, x_test, vocab = tokenize_reviews(x_train, x_val, x_test)
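
The vocabulary-and-indexing step above can be illustrated on a tiny example. This is a minimal sketch, not the notebook's pipeline: `toy_reviews` and `toy_vocab` are hypothetical names, the reviews are assumed to be already tokenized, and only the 2 most frequent words are kept instead of 1000.

```python
from collections import Counter

# Toy, already-tokenized training reviews (hypothetical data for illustration).
toy_reviews = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot"],
]

# Count token frequencies over the training reviews.
counts = Counter(token for review in toy_reviews for token in review)

# Keep the 2 most frequent tokens; indices start at 1 so 0 can mean "padding".
toy_vocab = {word: i + 1 for i, (word, _) in enumerate(counts.most_common(2))}

# Map each review to indices, dropping out-of-vocabulary tokens.
encoded = [[toy_vocab[t] for t in review if t in toy_vocab]
           for review in toy_reviews]

print(toy_vocab)  # {'great': 1, 'movie': 2}
print(encoded)    # [[1, 2, 1], [2], [1]]
```

Note that words outside the vocabulary simply disappear from the encoded review, which is why the tokenized reviews are shorter than the raw ones.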

In [None]:
def plot_review_length_distribution(tokenized_reviews):
    '''
    Plots a histogram of the lengths of tokenized reviews and includes a box with summary statistics.

    '''

    review_lengths = [len(review) for review in tokenized_reviews]

    # Calculate summary statistics
    min_length = np.min(review_lengths)
    avg_length = np.mean(review_lengths)
    median_length = np.median(review_lengths)
    max_length = np.max(review_lengths)

    plt.figure(figsize=(10, 6))
    plt.hist(review_lengths, bins=30, color='blue',
             edgecolor='black', alpha=0.7)
    plt.title('Distribution of Review Lengths')
    plt.xlabel('Number of Tokens')
    plt.ylabel('Frequency')
    plt.grid(True)

    stats_text = f'Min Length: {min_length}\nAverage Length: {avg_length:.2f}\nMedian Length: {median_length}\nMax Length: {max_length}'
    plt.gca().text(0.95, 0.95, stats_text, transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
                   horizontalalignment='right', bbox=dict(boxstyle="round,pad=0.5", facecolor='wheat', alpha=0.5))
    plt.savefig('review_length_after_tokenization.jpg',
                format='jpeg', bbox_inches='tight')
    plt.show()


plot_review_length_distribution(x_train)

In [None]:
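def padding_(reviews, max_seq):
    '''
    Left-pad each index sequence with zeros to length max_seq.
    Reviews longer than max_seq keep only their first max_seq tokens.
    '''
    features = np.zeros((len(reviews), max_seq), dtype=int)
    for ii, review in enumerate(reviews):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:max_seq]
    return features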


train_data = TensorDataset(torch.from_numpy(
    padding_(x_train, 500)), torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(
    padding_(x_val, 500)), torch.from_numpy(y_val))
test_data = TensorDataset(torch.from_numpy(
    padding_(x_test, 500)), torch.from_numpy(y_test))

# Shuffle only the training data; evaluation order does not affect the metrics
train_loader = DataLoader(train_data, shuffle=True, batch_size=50)
valid_loader = DataLoader(valid_data, shuffle=False, batch_size=50)
test_loader = DataLoader(test_data, shuffle=False, batch_size=50)
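
To make the padding rule concrete, here is a plain-Python sketch of the same left-padding and truncation behavior as `padding_`. The helper name `pad_left` and its `pad_value` parameter are assumptions made for this illustration; the notebook itself uses the NumPy version above.

```python
def pad_left(sequence, max_len, pad_value=0):
    """Left-pad with pad_value, or keep only the first max_len items."""
    truncated = sequence[:max_len]
    return [pad_value] * (max_len - len(truncated)) + truncated


print(pad_left([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_left([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
print(pad_left([], 5))                  # [0, 0, 0, 0, 0]
```

Padding on the left keeps the real tokens at the end of the sequence, so the last time step the RNN reads is always an actual word rather than padding (unless the review is empty).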

In [None]:
class SentimentRNN(nn.Module):
    def __init__(self, no_layers, vocab_size, hidden_dim, embedding_dim,
                 output_dim=1, drop_prob=0.5):
        super().__init__()

        # output_dim is an explicit argument (default 1) rather than a global lookup
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.no_layers = no_layers
        self.vocab_size = vocab_size

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(input_size=embedding_dim, hidden_size=self.hidden_dim,
                          num_layers=no_layers, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(self.hidden_dim, output_dim)
        self.sig = nn.Sigmoid()

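    def forward(self, x, hidden):
        batch_size = x.size(0)
        embeds = self.embedding(x)                  # (batch, seq_len, embedding_dim)
        rnn_out, hidden = self.rnn(embeds, hidden)  # (batch, seq_len, hidden_dim)
        # Flatten so the linear layer is applied at every time step
        rnn_out = rnn_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(rnn_out)
        out = self.fc(out)                          # project hidden_dim -> output_dim
        sig_out = self.sig(out)
        # Reshape to one row per review and keep the last time step's prediction
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]
        return sig_out, hidden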

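    def init_hidden(self, batch_size):
        ''' Return a zero hidden state of shape (no_layers, batch_size, hidden_dim) '''

        # Allocate on the same device as the model's parameters instead of a global
        param_device = next(self.parameters()).device
        h0 = torch.zeros((self.no_layers, batch_size, self.hidden_dim),
                         device=param_device)
        return h0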


# Hyperparameters
no_layers = 3
vocab_size = len(vocab) + 1
embedding_dim = 300
output_dim = 1
hidden_dim = 256

# Initialize the model
model = SentimentRNN(no_layers, vocab_size, hidden_dim,
                     embedding_dim, drop_prob=0.5)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(model)

In [None]:
x_train_tsne = padding_(x_train, 500)
x_train_tsne = x_train_tsne[:1000, :]
y_train_tsne = y_train[:1000]


def plot_embeddings(x_train, y_train, model, device, batch_size=50):
    model.eval()
    embeddings_list = []

    # Create a DataLoader to handle the x_train data in batches
    train_dataset = torch.utils.data.TensorDataset(torch.from_numpy(x_train),
                                                   torch.from_numpy(y_train))
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=False)

    with torch.no_grad():  # No need to track gradients
        for x_batch, _ in train_loader:
            x_batch = x_batch.to(device)
            hidden = model.init_hidden(x_batch.size(0))

            # Run the forward path without the final sigmoid:
            # embedding -> GRU -> dropout (a no-op in eval mode) -> linear layer
            embeds = model.embedding(x_batch)
            rnn_out, hidden = model.rnn(embeds, hidden)
            rnn_out = rnn_out.contiguous().view(-1, model.hidden_dim)  # Flatten the output
            out = model.dropout(rnn_out)
            linear_output = model.fc(out)  # One score per time step

            embeddings_list.append(linear_output.cpu())  # Store CPU data

    # Concatenate all batches into a single matrix
    all_embeddings = torch.cat(embeddings_list, dim=0)

    # One row per review, one column per time step (the padded sequence length)
    all_embeddings = all_embeddings.view(-1, x_train.shape[1])

    # Reduce dimensions to 2D using t-SNE for visualization
    tsne = TSNE(n_components=2, random_state=42)
    embeddings_2d = tsne.fit_transform(all_embeddings.numpy())

    df = pd.DataFrame(data=embeddings_2d, columns=['TSNE-1', 'TSNE-2'])
    df['label'] = y_train
    # Match the histogram colors, assuming label 0 = negative and 1 = positive
    custom_palette = {0: 'red', 1: 'green'}

    plt.figure(figsize=(10, 8))
    scatter = sns.scatterplot(data=df, x='TSNE-1', y='TSNE-2',
                              hue='label', palette=custom_palette, s=60, alpha=0.6)
    plt.title('2D t-SNE Visualization of Sentence Embeddings')
    plt.xlabel('t-SNE dimension 1')
    plt.ylabel('t-SNE dimension 2')
    plt.legend(title='Label', bbox_to_anchor=(1.05, 1), loc=2)
    plt.savefig('tsne_model_untrained_projection.jpg',
                format='jpeg', bbox_inches='tight')
    plt.show()


plot_embeddings(x_train_tsne, y_train_tsne, model, device, batch_size=50)

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


lr = 0.001

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)


def acc(pred, label):
    """Calculate accuracy by comparing predicted labels with true labels."""
    pred = torch.round(pred.squeeze())
    return torch.sum(pred == label.squeeze()).item()


clip = 5
epochs = 5
valid_loss_min = np.inf


epoch_tr_loss, epoch_vl_loss = [], []
epoch_tr_acc, epoch_vl_acc = [], []


for epoch in range(epochs):
    train_losses = []
    train_acc = 0.0
    model.train()  # Set model to training mode

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Start every batch from a fresh zero hidden state sized to the batch:
        # shuffled reviews are independent, and the last batch may hold fewer than 50
        h = model.init_hidden(inputs.size(0))

        model.zero_grad()
        output, h = model(inputs, h)

        # Calculate the loss
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        train_losses.append(loss.item())

        # Calculate accuracy
        accuracy = acc(output, labels)
        train_acc += accuracy

        # Clip gradients to prevent exploding gradient issues in RNNs
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

    # Validation phase
    val_losses = []
    val_acc = 0.0
    model.eval()  # Set model to evaluation mode (disables dropout)

    with torch.no_grad():  # No gradients needed for validation
        for inputs, labels in valid_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Fresh hidden state sized to the current batch
            val_h = model.init_hidden(inputs.size(0))

            output, val_h = model(inputs, val_h)
            val_loss = criterion(output.squeeze(), labels.float())

            val_losses.append(val_loss.item())

            accuracy = acc(output, labels)
            val_acc += accuracy

    epoch_train_loss = np.mean(train_losses)
    epoch_val_loss = np.mean(val_losses)
    epoch_train_acc = train_acc / len(train_loader.dataset)
    epoch_val_acc = val_acc / len(valid_loader.dataset)

    epoch_tr_loss.append(epoch_train_loss)
    epoch_vl_loss.append(epoch_val_loss)
    epoch_tr_acc.append(epoch_train_acc)
    epoch_vl_acc.append(epoch_val_acc)

    print(f'Epoch {epoch+1}')
    print(f'Train Loss: {epoch_train_loss} Val Loss: {epoch_val_loss}')
    print(
        f'Train Accuracy: {epoch_train_acc * 100}% Val Accuracy: {epoch_val_acc * 100}%')
    print(' ')
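
The gradient clipping step in the training loop rescales all gradients together when their combined norm exceeds `clip`. The sketch below shows that idea on a flat list of numbers. It is an illustration only: `clip_by_global_norm` is a hypothetical helper, not a PyTorch function, and the small `1e-6` term is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0 for gradients already small enough.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([6.0, 8.0], max_norm=5.0)
print(norm)     # 10.0
print(clipped)  # approximately [3.0, 4.0]

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
print(unchanged)  # [0.3, 0.4]
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its size is limited.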

In [None]:
fig = plt.figure(figsize=(20, 6))
plt.subplot(1, 2, 1)
plt.plot(epoch_tr_acc, label='Train Acc')
plt.plot(epoch_vl_acc, label='Validation Acc')
plt.title("Accuracy")
plt.legend()
plt.grid()

plt.subplot(1, 2, 2)
plt.plot(epoch_tr_loss, label='Train loss')
plt.plot(epoch_vl_loss, label='Validation loss')
plt.title("Loss")
plt.legend()
plt.grid()
plt.savefig('accuracy_and_loss.jpg', format='jpeg', bbox_inches='tight')
plt.show()

In [None]:
def predict_batch(model, data_loader, device):
    """Predict labels for every batch in data_loader using the RNN model.

    Returns the predictions and true labels for the whole loader, plus the
    probabilities and labels of the last batch only.
    """
    model.eval()
    predictions = []
    true_labels = []

    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device)
            batch_size = inputs.size(0)

            hidden = model.init_hidden(batch_size)

            # The model's forward pass already ends in a sigmoid, so its output
            # is a probability; applying sigmoid again would squash it into [0.5, 0.73]
            predicted_probs, _ = model(inputs, hidden)
            # Threshold at 0.5, consistent with torch.round in acc()
            predicted_labels = (predicted_probs > 0.5).float()

            predictions.extend(predicted_labels.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    return predictions, true_labels, predicted_probs, labels


predictions, true_labels, predicted_probs, labels = predict_batch(
    model, test_loader, device)
print(f'Accuracy on test set: {accuracy_score(true_labels, predictions)}')
# Plot confusion matrix
conf_matrix = confusion_matrix(true_labels, predictions)
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=[
            'Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.jpg', format='jpeg', bbox_inches='tight')
plt.show()

In [None]:
plot_embeddings(x_train_tsne, y_train_tsne, model, device, batch_size=50)