Intro to NLP Practical<br>
======================<br>
Students will work through problems on n-grams, probabilities, OOV handling, and classifiers.<br>

In [1]:
! pip install nltk

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Toy corpus for language modeling

In [2]:
corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go"
]

--- Part 1: Preprocessing ---

 Q1.1 Sequence notation<br>
Exercise: Write sequence notation for the sentence:<br>
"Mary had a little lamb, its fleece was white as snow"



In [6]:
text = "Mary had a little lamb, its fleece was white as snow"
tokens = text.split()
for i in range(len(tokens)):
  print(f"W_{i} ", end="")

W_0 W_1 W_2 W_3 W_4 W_5 W_6 W_7 W_8 W_9 W_10 

 Q1.2 Add start/end tokens<br>
Exercise: Write a function to tokenize the corpus and add <s>, </s>

In [7]:


def tokenize(text):
  tokens = text.split()
  return ['<s>'] + tokens + ['</s>']

tokenized_corpus = [tokenize(text) for  text in corpus]
tokenized_corpus

[['<s>', 'Mary', 'had', 'a', 'little', 'lamb', '</s>'],
 ['<s>', 'Its', 'fleece', 'was', 'white', 'as', 'snow', '</s>'],
 ['<s>', 'And', 'everywhere', 'that', 'Mary', 'went', '</s>'],
 ['<s>', 'The', 'lamb', 'was', 'sure', 'to', 'go', '</s>']]

--- Part 2: N-grams & Probabilities ---

 Q2.1 Extract unigrams, bigrams, trigrams

In [13]:

from nltk.util import ngrams
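# unigrams: every token in the corpus, in order
unigrams = [token for text in tokenized_corpus for token in text]

# bigrams: built per sentence so that no pair spans a sentence boundary
# (the flattened list would otherwise yield pairs such as ('</s>', '<s>'))
bigrams = [bg for text in tokenized_corpus for bg in ngrams(text, 2)]

# trigrams: likewise built per sentence
trigrams = [tg for text in tokenized_corpus for tg in ngrams(text, 3)]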




 Q2.2 Bigram probabilities<br>
Exercise: Write a function to compute P(w_i | w_{i-1})

In [22]:
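def bigram_prob(w2, w1): # P(w2 | w1) = count(w1, w2) / count(w1)
  w1_count = unigrams.count(w1)
  if w1_count == 0:  # unseen context word: avoid division by zero
    return 0.0
  w1_w2_count = bigrams.count((w1, w2))
  return w1_w2_count / w1_count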

bigram_prob("had", "Mary")


0.5
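
Calling `list.count` rescans the whole corpus on every query. The sketch below computes the same maximum-likelihood estimate from precomputed counts with `collections.Counter`. It rebuilds the toy corpus so that it runs on its own; `sentences`, `unigram_counts`, `bigram_counts`, and `bigram_prob_fast` are names introduced here for illustration, not part of the exercise.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

# Count each token and each adjacent pair once, sentence by sentence.
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in sentences for i in range(len(sent) - 1)
)

def bigram_prob_fast(w2, w1):
    """MLE estimate of P(w2 | w1); returns 0.0 when w1 was never seen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob_fast("had", "Mary"))  # 0.5
```

Each lookup is now a dictionary access, so the cost no longer grows with corpus size.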

 Q2.3 Sentence probability<br>
Exercise: Compute the probability of "Mary had a little lamb"

In [26]:
sentence = "Mary had a little lamb"
probabilities = []
tokens = tokenize(sentence)
# tokens[0] is '<s>', so starting at i = 1 already covers P(Mary | <s>)
for i in range(1, len(tokens)):
  probabilities.append(bigram_prob(tokens[i], tokens[i-1]))

product = 1
for prob in probabilities:
  product *= prob

print(probabilities)
print(product)

[0.25, 0.5, 1.0, 1.0, 1.0, 0.5]
0.0625
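
Multiplying many probabilities below 1 underflows quickly on longer sentences, so the product is often accumulated as a sum of logs. The following is a minimal sketch under that assumption, using the same toy corpus; `sentence_logprob` and the count tables are illustrative names, not part of the exercise.

```python
import math
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(
    pair for sent in sentences for pair in zip(sent, sent[1:])
)

def sentence_logprob(sentence):
    """Sum of log P(w_i | w_{i-1}); -inf if any bigram was never observed."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        count = bigram_counts[(w1, w2)]
        if count == 0:
            return float("-inf")  # log(0) is undefined; treat as impossible
        total += math.log(count / unigram_counts[w1])
    return total

logp = sentence_logprob("Mary had a little lamb")
print(logp, math.exp(logp))
```

Exponentiating the result recovers the plain product, up to floating-point rounding.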


Q2.4 Handling OOV/UNK<br>
Exercise: Replace unseen words with <UNK> and recompute


In [33]:
def replace(text):
  vocab = set(unigrams)  # set lookup instead of scanning the list per token
  tokens = text.split()
  return [token if token in vocab else '<UNK>' for token in tokens]

sentence = "Mary had a hello little lamb "
print(replace(sentence))



['Mary', 'had', 'a', '<UNK>', 'little', 'lamb']
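
The exercise also asks to recompute the probability. If `<UNK>` never occurs in the training data, every bigram that contains it has a count of zero, so the sentence probability is still zero. One common workaround, assumed here rather than prescribed by the exercise, is to map rare training words to `<UNK>` before counting. The threshold of 2 and the names `vocab`, `tokenize_unk`, and `sentence_prob` are illustrative choices. On a corpus this small the threshold turns most words into `<UNK>`, so the numbers only demonstrate the mechanics.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]

# Assumption: words seen fewer than 2 times in training count as unknown.
raw_counts = Counter(tok for s in corpus for tok in s.split())
vocab = {w for w, c in raw_counts.items() if c >= 2}

def tokenize_unk(text):
    """Split on whitespace, map out-of-vocabulary words to <UNK>, add markers."""
    tokens = [t if t in vocab else "<UNK>" for t in text.split()]
    return ["<s>"] + tokens + ["</s>"]

# Re-count on the <UNK>-mapped training data.
sentences = [tokenize_unk(s) for s in corpus]
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(
    pair for sent in sentences for pair in zip(sent, sent[1:])
)

def sentence_prob(text):
    """Product of MLE bigram probabilities over the <UNK>-mapped sentence."""
    tokens = tokenize_unk(text)
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_counts[(w1, w2)] / unigram_counts[w1]
    return prob

print(tokenize_unk("Mary had a hello little lamb"))
print(sentence_prob("Mary had a hello little lamb"))
```

Because `<UNK>` now has counts of its own, the sentence containing "hello" receives a small but nonzero probability.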


--- Part 3: Classifier ---

 Q3.1 Naive Bayes sentiment classifier

# 📽 Exercise 3.1: Sentiment Classification on toy dataset

In this exercise, you will build a simple sentiment classification model that predicts whether a given sentence is **positive** or **negative**.

---

## ✏️ Instructions:


### 1️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the sentences into numerical features.


---

### 2️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.


🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.

In [35]:
train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful"
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Feature Extraction using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)
y = train_labels



# Train a Logistic Regression classifier
model = LogisticRegression()
model.fit(X, y)





In [43]:
new_text = ["movie boring"]
new_text = vectorizer.transform(new_text)
y_pred_new = model.predict(new_text)
print("\nPrediction for 'movie boring':", y_pred_new)


Prediction for 'movie boring': ['neg']
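
The cells above use Logistic Regression on TF-IDF features. Since the question title mentions Naive Bayes, here is a minimal sketch of that alternative using `CountVectorizer` and `MultinomialNB`, both imported at the top of the notebook. The variable names `count_vectorizer`, `nb_model`, and `nb_pred` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful",
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

# Raw word counts are the natural input for a multinomial model.
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(train_texts)

# The default alpha=1.0 applies add-one smoothing to the word counts.
nb_model = MultinomialNB()
nb_model.fit(X_counts, train_labels)

nb_pred = nb_model.predict(count_vectorizer.transform(["movie boring"]))
print(nb_pred)  # ['neg']
```

Both words in the query occur only in negative training sentences, so the negative class receives the higher likelihood.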


# 📽 Exercise 3.2: Movie Review Classification using the Movie Reviews Corpus

In this exercise, you will build a simple text classification model that predicts whether a given **movie review** is **positive** or **negative** using the **NLTK Movie Reviews Corpus**.

This is a classic example of text classification at the **document level**: each review spans many sentences and receives a single label.

---

## ✏️ Instructions:

### 1️⃣ Load the Data
- Import the **Movie Reviews corpus** from **NLTK**.
- Create a dataset where each example is a review and the label is either `'positive'` or `'negative'`.

---

### 2️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the reviews into numerical features.


---

### 3️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.

---

### 4️⃣ Evaluate the Classifier
- Use **accuracy** and a **classification report** to evaluate your model on the test set.
- Think about: How well does the model perform? Which reviews are harder to classify?

---

✅ You are free to explore:
- Trying different classifiers.
- Visualizing the results (e.g., confusion matrix).

---

🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.
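
One possible outline for the four steps above, offered as a sketch rather than a reference solution. The helper names `load_movie_reviews` and `train_and_evaluate`, the 80/20 split, the fixed seed, and `max_iter=1000` are assumptions made for illustration. The corpus itself labels reviews `pos` and `neg`, so the loader maps them to the names used in the instructions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

def load_movie_reviews():
    """Return (texts, labels) from the NLTK Movie Reviews corpus."""
    import nltk
    from nltk.corpus import movie_reviews

    nltk.download("movie_reviews", quiet=True)  # needs network on first run
    label_names = {"pos": "positive", "neg": "negative"}
    texts, labels = [], []
    for category in movie_reviews.categories():
        for fileid in movie_reviews.fileids(category):
            texts.append(movie_reviews.raw(fileid))
            labels.append(label_names[category])
    return texts, labels

def train_and_evaluate(texts, labels, test_size=0.2, seed=42):
    """Split, vectorize with TF-IDF, fit a classifier, and score the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Fit the vectorizer on training text only so test vocabulary does not leak.
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)

    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, zero_division=0)
    return accuracy, report
```

To run it on the real corpus, call `texts, labels = load_movie_reviews()` followed by `accuracy, report = train_and_evaluate(texts, labels)` and print both values.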

 Q3.3 Discussion: Why bigrams vs unigrams?<br>

 Q3.4 Limitations of n-grams
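
One limitation that can be measured directly is sparsity: most word pairs that could occur are never observed. The sketch below counts observed versus possible bigrams on the toy corpus and scores a grammatical sentence that was not in training. The test sentence and the names `observed_bigrams`, `possible_bigrams`, and `mle_sentence_prob` are assumptions for illustration.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

vocab = {tok for sent in sentences for tok in sent}
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(
    pair for sent in sentences for pair in zip(sent, sent[1:])
)

observed_bigrams = len(bigram_counts)
possible_bigrams = len(vocab) ** 2
print(f"{observed_bigrams} of {possible_bigrams} possible bigrams observed")

def mle_sentence_prob(text):
    """Unsmoothed bigram probability; a single unseen pair makes it zero."""
    tokens = ["<s>"] + text.split() + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        if unigram_counts[w1] == 0:
            return 0.0
        prob *= bigram_counts[(w1, w2)] / unigram_counts[w1]
    return prob

# Every word is known, but the pair ('white', '</s>') never occurred.
print(mle_sentence_prob("The lamb was white"))
```

The sentence is perfectly plausible, yet the unsmoothed model assigns it zero probability because one of its bigrams was never seen in training.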

--- Part 4: Wrap-up Reflection ---

 Discussion Questions<br>
1. Why do we need <UNK> tokens?<br>
2. Why start/end tokens?<br>
3. Why not always use higher n-grams?<br>
4. How do classifiers differ from language models?