Intro to NLP Practical
======================
Students will work through problems on n-grams, probabilities, OOV handling, and classifiers.

In [1]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Toy corpus for language modeling

In [2]:
corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go"
]

--- Part 1: Preprocessing ---

Q1.1 Sequence notation
Exercise: Write sequence notation for the sentence:
"Mary had a little lamb, its fleece was white as snow"

In [3]:
sentence = ["Mary had a little lamb, its fleece was white as snow"]

In [4]:
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=100)  # when converting texts to sequences, keep only the most frequent words (at most num_words - 1 of them)

tokenizer.fit_on_texts(sentence)  # build a word index that assigns an integer ID to each unique word in the sentence

word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentence)

print("Word Index: ", word_index)
print("Sequences: ", sequences)



Word Index:  {'mary': 1, 'had': 2, 'a': 3, 'little': 4, 'lamb': 5, 'its': 6, 'fleece': 7, 'was': 8, 'white': 9, 'as': 10, 'snow': 11}
Sequences:  [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]
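
For readers without TensorFlow installed, the same word index can be approximated in plain Python. This is a minimal sketch, not Keras's implementation: the helper name `build_word_index` is hypothetical, and it assumes lowercasing, punctuation stripping, and IDs that start at 1 and follow descending word frequency.

```python
import string
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer ID, most frequent first (IDs start at 1)."""
    # Replace punctuation with spaces so "lamb," becomes "lamb".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = [text.lower().translate(table).split() for text in texts]
    counts = Counter(word for tokens in tokenized for word in tokens)
    # most_common() keeps first-seen order among words with equal counts.
    word_index = {
        word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
    }
    sequences = [[word_index[word] for word in tokens] for tokens in tokenized]
    return word_index, sequences


word_index, sequences = build_word_index(
    ["Mary had a little lamb, its fleece was white as snow"]
)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
```

Because every word in this sentence occurs once, the IDs simply follow the order of first appearance.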


 Q1.2 Add start/end tokens<br>
Exercise: Write a function that tokenizes the corpus and adds `<s>` and `</s>` markers to each sentence

In [5]:
def add_start_end_tokens(corpus):
    start_token = "<s>"
    end_token = "</s>"
    tokenized_corpus = []
    for sentence in corpus:
        tokenized_sentence = [start_token] + sentence.split() + [end_token]
        tokenized_corpus.append(tokenized_sentence)
    return tokenized_corpus

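# Note: this runs on the single-sentence list from In [3], not on the four-line corpus.
# str.split() splits on whitespace only, so punctuation stays attached (e.g. 'lamb,').
tokenized_corpus = add_start_end_tokens(sentence)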
print(tokenized_corpus)

[['<s>', 'Mary', 'had', 'a', 'little', 'lamb,', 'its', 'fleece', 'was', 'white', 'as', 'snow', '</s>']]
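
The output above keeps the comma attached to `lamb,` because `str.split()` only splits on whitespace. A possible variant, sketched below under the assumption that punctuation should be dropped, uses a regular expression and runs over the full four-line toy corpus. The names `tokenize_with_boundaries` and `tokenized_full` are hypothetical.

```python
import re

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]


def tokenize_with_boundaries(sentences, start_token="<s>", end_token="</s>"):
    """Split each sentence into word tokens and wrap it in boundary markers."""
    tokenized = []
    for text in sentences:
        # \w+ matches runs of letters, digits, and underscores; punctuation is dropped.
        words = re.findall(r"\w+", text)
        tokenized.append([start_token] + words + [end_token])
    return tokenized


tokenized_full = tokenize_with_boundaries(corpus)
for tokens in tokenized_full:
    print(tokens)
```

Whether to lowercase as well is a separate design choice; this sketch keeps the original casing so that `Mary` stays distinct from other tokens.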


--- Part 2: N-grams & Probabilities ---

 Q2.1 Extract unigrams, bigrams, trigrams

In [6]:
from nltk.util import ngrams

unigrams = []
bigrams = []
trigrams = []

# Use a separate loop variable so the `sentence` list from In [3] is not overwritten.
for tokens in tokenized_corpus:
    unigrams.extend(list(ngrams(tokens, 1)))
    bigrams.extend(list(ngrams(tokens, 2)))
    trigrams.extend(list(ngrams(tokens, 3)))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Unigrams: [('<s>',), ('Mary',), ('had',), ('a',), ('little',), ('lamb,',), ('its',), ('fleece',), ('was',), ('white',), ('as',), ('snow',), ('</s>',)]
Bigrams: [('<s>', 'Mary'), ('Mary', 'had'), ('had', 'a'), ('a', 'little'), ('little', 'lamb,'), ('lamb,', 'its'), ('its', 'fleece'), ('fleece', 'was'), ('was', 'white'), ('white', 'as'), ('as', 'snow'), ('snow', '</s>')]
Trigrams: [('<s>', 'Mary', 'had'), ('Mary', 'had', 'a'), ('had', 'a', 'little'), ('a', 'little', 'lamb,'), ('little', 'lamb,', 'its'), ('lamb,', 'its', 'fleece'), ('its', 'fleece', 'was'), ('fleece', 'was', 'white'), ('was', 'white', 'as'), ('white', 'as', 'snow'), ('as', 'snow', '</s>')]


 Q2.2 Bigram probabilities<br>
Exercise: Write function to compute P(w_i | w_{i-1})
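
For reference, the maximum-likelihood estimate that this exercise asks for can be written as follows, where $C(\cdot)$ is a count over the training data (the symbol $C$ is notation introduced here, not taken from the exercise):

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}
$$

When the training data is a single sentence in which no word repeats, each context word is followed by exactly one word, so every observed bigram gets probability 1.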

In [7]:
def bigram_probabilities(bigrams):
    bigram_counts = Counter(bigrams)
    unigram_counts = Counter([bigram[0] for bigram in bigrams])
    probabilities = {}
    for bigram, count in bigram_counts.items():
        word1, word2 = bigram
        unigram_count = unigram_counts[word1]
        # Calculate conditional probability
        probability = count / unigram_count
        probabilities[bigram] = probability

    return probabilities



In [8]:
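# Store the result under a new name so the function is not overwritten
# (reusing the name would make this cell fail when run a second time).
bigram_probs = bigram_probabilities(bigrams)
print(bigram_probs)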

{('<s>', 'Mary'): 1.0, ('Mary', 'had'): 1.0, ('had', 'a'): 1.0, ('a', 'little'): 1.0, ('little', 'lamb,'): 1.0, ('lamb,', 'its'): 1.0, ('its', 'fleece'): 1.0, ('fleece', 'was'): 1.0, ('was', 'white'): 1.0, ('white', 'as'): 1.0, ('as', 'snow'): 1.0, ('snow', '</s>'): 1.0}


 Q2.3 Sentence probability<br>
Exercise: Compute probability of "Mary had a little lamb"
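
One way to approach this is sketched below. It assumes the model is trained on the four-line toy corpus with whitespace tokenization and `<s>`/`</s>` markers, and that an unseen bigram contributes probability 0 (no smoothing). The function names `train_bigram_model` and `sentence_probability` are hypothetical.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]


def train_bigram_model(sentences):
    """Return MLE bigram probabilities P(w2 | w1) from whitespace-tokenized text."""
    bigram_counts = Counter()
    context_counts = Counter()
    for text in sentences:
        tokens = ["<s>"] + text.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}


def sentence_probability(text, probs):
    """Multiply the bigram probabilities; an unseen bigram gives 0."""
    tokens = ["<s>"] + text.split() + ["</s>"]
    p = 1.0
    for bigram in zip(tokens, tokens[1:]):
        p *= probs.get(bigram, 0.0)
    return p


probs = train_bigram_model(corpus)
p = sentence_probability("Mary had a little lamb", probs)
print(p)  # 0.25 * 0.5 * 1 * 1 * 1 * 0.5 = 0.0625
```

The factors are P(Mary | `<s>`) = 1/4, P(had | Mary) = 1/2, and P(`</s>` | lamb) = 1/2; the remaining bigrams have probability 1. Note that with the single-sentence model from In [8] the result would be 0, because that model only contains the token `lamb,` (with the comma attached).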

Q2.4 Handling OOV/UNK<br>
Exercise: Replace unseen words with `<UNK>` and recompute the probabilities
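
A possible approach is sketched below. It rests on one assumption that the exercise does not state: words seen fewer than `min_count` times in training are also mapped to `<UNK>`, so the model has `<UNK>` statistics to draw on when an unseen word appears at test time. All function names are hypothetical.

```python
from collections import Counter

UNK = "<UNK>"

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]


def tokenize(text):
    return ["<s>"] + text.split() + ["</s>"]


def build_vocab(sentences, min_count=2):
    """Keep tokens seen at least min_count times (assumed threshold)."""
    counts = Counter(tok for text in sentences for tok in tokenize(text))
    return {tok for tok, c in counts.items() if c >= min_count}


def map_unk(tokens, vocab):
    """Replace every out-of-vocabulary token with <UNK>."""
    return [tok if tok in vocab else UNK for tok in tokens]


def train_unk_bigram_model(sentences, vocab):
    """MLE bigram probabilities computed after <UNK> replacement."""
    bigram_counts, context_counts = Counter(), Counter()
    for text in sentences:
        tokens = map_unk(tokenize(text), vocab)
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}


def unk_sentence_probability(text, probs, vocab):
    tokens = map_unk(tokenize(text), vocab)
    p = 1.0
    for bigram in zip(tokens, tokens[1:]):
        p *= probs.get(bigram, 0.0)
    return p


vocab = build_vocab(corpus)
unk_probs = train_unk_bigram_model(corpus, vocab)
test_tokens = map_unk(tokenize("Mary had a little dog"), vocab)
p_unk = unk_sentence_probability("Mary had a little dog", unk_probs, vocab)
print(test_tokens)
print(p_unk)
```

On a corpus this small, most words fall below the threshold, so the vocabulary shrinks to a handful of tokens. The point of the sketch is only that "Mary had a little dog" now receives a non-zero probability even though "dog" never occurs in training.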


--- Part 3: Classifier ---

 Q3.1 Naive Bayes sentiment classifier

# 📽 Exercise 3.1: Sentiment Classification on toy dataset

In this exercise, you will build a simple sentiment classification model that predicts whether a given sentence is **positive** or **negative**.

---

## ✏️ Instructions:


### 1️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the sentences into numerical features.


---

### 2️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.


🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.

In [9]:
train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful"
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]
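
A minimal sketch of one possible solution follows, using scikit-learn's `TfidfVectorizer` and `MultinomialNB` in a pipeline. With only six sentences, a train/test split leaves very little to learn from, so this sketch trains on all six and predicts labels for two made-up sentences (`new_texts` is hypothetical data, not part of the exercise).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful",
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

# TF-IDF features feed a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Hypothetical unseen sentences, used only to illustrate prediction.
new_texts = ["I love this phone", "The food was awful"]
predictions = model.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{text!r} -> {label}")
```

Predictions on a dataset this small depend heavily on which words happen to overlap with the training sentences, so they should be read as an illustration rather than an evaluation.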

# 📽 Exercise 3.2: Movie Review Classification using Movies Review Corpus

In this exercise, you will build a simple text classification model that predicts whether a given **movie review** is **positive** or **negative** using the **NLTK Movie Reviews Corpus**.

This is a classic example of text classification at the **document level**: each example is a full review rather than a single sentence.

---

## ✏️ Instructions:

### 1️⃣ Load the Data
- Import the **Movie Reviews corpus** from **NLTK**.
- Create a dataset where each example is a review and the label is either `'positive'` or `'negative'` (NLTK's own category names are `'pos'` and `'neg'`).

---

### 2️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the reviews into numerical features.


---

### 3️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.

---

### 4️⃣ Evaluate the Classifier
- Use **accuracy** and a **classification report** to evaluate your model on the test set.
- Think about: How well does the model perform? Which reviews are harder to classify?

---

✅ You are free to explore:
- Trying different classifiers.
- Visualizing the results (e.g., confusion matrix).

---

🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.
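
The steps above could be sketched as follows. This is one possible outline, not a reference solution: the function names `load_movie_reviews` and `train_and_evaluate` are hypothetical, Logistic Regression is an arbitrary choice among the classifiers the exercise allows, and the 80/20 split and random seed are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def load_movie_reviews():
    """Return (texts, labels); requires nltk.download('movie_reviews') once."""
    # Imported lazily so the rest of the code runs without the corpus installed.
    from nltk.corpus import movie_reviews

    texts, labels = [], []
    for category in movie_reviews.categories():  # 'neg' and 'pos'
        for fileid in movie_reviews.fileids(category):
            texts.append(movie_reviews.raw(fileid))
            labels.append("positive" if category == "pos" else "negative")
    return texts, labels


def train_and_evaluate(texts, labels, test_size=0.2, seed=42):
    """Fit TF-IDF + logistic regression and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return model, accuracy, report
```

To run it on the corpus, call `nltk.download("movie_reviews")` once, then `model, accuracy, report = train_and_evaluate(*load_movie_reviews())` and print `accuracy` and `report`.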

 Q3.3 Discussion: Why bigrams vs unigrams?<br>

 Q3.4 Limitations of n-grams
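
One limitation worth raising in the discussion is data sparsity: as n grows, a larger share of n-grams is seen only once, so their probability estimates become unreliable. The sketch below measures this on the toy corpus; the helper `ngram_counts` and the variable `singleton_fraction` are hypothetical names, and whitespace tokenization with `<s>`/`</s>` markers is assumed.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]


def ngram_counts(sentences, n):
    """Count every n-gram across the sentences, including boundary markers."""
    counts = Counter()
    for text in sentences:
        tokens = ["<s>"] + text.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


singleton_fraction = {}
for n in (1, 2, 3):
    counts = ngram_counts(corpus, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    # Share of distinct n-grams that occur exactly once.
    singleton_fraction[n] = singletons / len(counts)
    print(n, len(counts), round(singleton_fraction[n], 3))
```

On this corpus a few unigrams repeat (`Mary`, `lamb`, `was`, and the boundary markers), while no bigram or trigram occurs more than once.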

--- Part 4: Wrap-up Reflection ---

 Discussion Questions<br>
1. Why do we need `<UNK>` tokens?<br>
2. Why start/end tokens?<br>
3. Why not always use higher n-grams?<br>
4. How do classifiers differ from language models?