# Homework #2 – Natural Language Processing

This notebook follows the homework questions sequentially using the files in `homework/homework2/resources` and adapts the workflow from `textbook code.txt`.


## 0) Setup
Install/import required libraries and point to the resources folder.


In [None]:
%pip install --quiet pandas matplotlib seaborn nltk scikit-learn

from pathlib import Path
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

from nltk.tree import Tree
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

sns.set_theme(style="whitegrid")

RESOURCES = Path("../resources")
RESOURCES


In [None]:
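# NLTK resources used for tokenization/tagging/chunking.
# Newer NLTK releases (3.9+) look for the "_tab"/"_eng" variants, while older
# releases use the original names; requesting both keeps the notebook portable.
for resource in [
    "punkt", "punkt_tab",
    "stopwords",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker", "maxent_ne_chunker_tab",
    "words",
]:
    nltk.download(resource)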


## 1) Word Tagging (50 points)

### Q1.1 Parse the sentences of the speech.
Using `I-have-a-dream-speech.txt`, tokenize the speech into sentences.


In [None]:
# Load speech text
speech_path = RESOURCES / "I-have-a-dream-speech.txt"
speech = speech_path.read_text(encoding="utf-8")

# Parse sentences
sentences = nltk.sent_tokenize(speech)

print(f"Total sentences: {len(sentences)}")
print("\nFirst 3 sentences:")
for i, s in enumerate(sentences[:3], start=1):
    print(f"{i}. {s}")


### Q1.2 Run parts-of-speech tagging to determine the top named entities.

Following the textbook approach, we:
1. tokenize each sentence,
2. POS-tag each token,
3. run `ne_chunk_sents(..., binary=True)`, and
4. extract entity strings and remove obvious stop/common words.
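
The extraction in step 4 can be tried on a hand-built tree without downloading any models. This is a minimal sketch, not the textbook's method: the sample sentence and the helper name `ne_strings` are made up for illustration, and it uses `Tree.subtrees` as a non-recursive alternative to the `get_entity_names` helper in the next cell.

```python
from nltk.tree import Tree


def ne_strings(tree):
    """Return the text of every subtree labelled "NE" (binary chunker output)."""
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NE")
    ]


# Hand-built stand-in for one chunked sentence (illustrative, not chunker output)
sample_tree = Tree("S", [
    Tree("NE", [("Martin", "NNP"), ("Luther", "NNP"), ("King", "NNP")]),
    ("spoke", "VBD"),
    ("in", "IN"),
    Tree("NE", [("Washington", "NNP")]),
])

print(ne_strings(sample_tree))
```

Leaves of a chunk tree are `(word, tag)` tuples, so only the word is kept when joining each entity span.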


In [None]:
# Tokenize and POS tag
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(tokens) for tokens in tokenized_sentences]

# Named entity chunking (binary NE chunker)
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)


def get_entity_names(tree):
    """Recursively extract named entities from an NLTK chunk tree."""
    entity_names = []
    if isinstance(tree, Tree):  # leaves are (word, tag) tuples, not Trees
        if tree.label() == "NE":
            entity_names.append(" ".join([child[0] for child in tree]))
        else:
            for child in tree:
                entity_names.extend(get_entity_names(child))
    return entity_names


# Collect all entity names
entity_names = []
for tree in chunked_sentences:
    entity_names.extend(get_entity_names(tree))

# Basic cleanup: drop single-character entities and bare stop words
stop_words = set(nltk.corpus.stopwords.words("english"))
filtered_entities = [
    name
    for name in (e.strip() for e in entity_names)
    if len(name) > 1 and name.lower() not in stop_words
]

entity_counts = Counter(filtered_entities)
top_10_entities = entity_counts.most_common(10)

top_10_entities


### Q1.3 Plot Named Entities (X-axis) vs Frequency (Y-axis).


In [None]:
top_df = pd.DataFrame(top_10_entities, columns=["Named Entity", "Frequency"])

plt.figure(figsize=(12, 6))
ax = sns.barplot(data=top_df, x="Named Entity", y="Frequency", palette="viridis")
ax.set_title("Top 10 Named Entities in 'I Have a Dream' Speech")
ax.set_xlabel("Named Entities")
ax.set_ylabel("Frequency")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

top_df


### Q1.4 What do you notice about the top 10 named entities? Does anything surprise you?

**Answer (rationale):**
- The most frequent entities are expected to be strongly related to civil rights themes (for example, references to the U.S., states, and major identity groups).
- Place names and national references tend to dominate because the speech repeatedly contrasts promises of American ideals with lived realities.
- Because the binary chunker only marks spans as `NE` and does not resolve aliases, different names for the same thing (e.g., `United States` vs `America`) are counted separately, which splits their totals.
- A possible surprise is that the chunker can label some abstract or context-dependent words as entities, which tends to happen with capitalized words; this is a known limitation of NLTK's pre-trained statistical chunker when no custom post-processing is applied.
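
The split counts mentioned above could be merged with a small post-processing step. The sketch below assumes a hand-written alias map; `ALIASES`, `merge_aliases`, and the toy counts are hypothetical and are not results from the speech.

```python
from collections import Counter

# Hypothetical alias map: lowercase variant -> canonical entity name
ALIASES = {
    "united states": "America",
    "u.s.": "America",
}


def merge_aliases(counts, aliases=ALIASES):
    """Fold counts of variant entity names into one canonical entry."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name.lower(), name)] += n
    return merged


# Toy counts for illustration only (not taken from the speech)
toy_counts = Counter({"America": 3, "United States": 2, "Georgia": 1})
print(merge_aliases(toy_counts).most_common())
```

In the notebook this would be applied to `entity_counts` before taking `most_common(10)`, so that the top-10 ranking reflects merged totals.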


## 2) Tweets and LDA (50 points)

### Q2.1 Using the provided training and testing tweets files, perform LDA.

This section loads train/test tweets, vectorizes text, fits an LDA model, and inspects topic-word outputs.


In [None]:
train_path = RESOURCES / "Train_QuantumTunnel_Tweets.csv"
test_path = RESOURCES / "Test_QuantumTunnel_Tweets.csv"

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("Train columns:", train_df.columns.tolist())
print("Test columns:", test_df.columns.tolist())

train_df.head()


In [None]:
# Ensure the expected text column exists
text_col = "Tweet"
assert text_col in train_df.columns and text_col in test_df.columns, "Expected a 'Tweet' column in both files."

# Vectorize tweet text for LDA
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=True,
    max_df=0.95,
    min_df=2,
)

X_train_counts = vectorizer.fit_transform(train_df[text_col].astype(str))
X_test_counts = vectorizer.transform(test_df[text_col].astype(str))

# Fit LDA
n_topics = 5
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    learning_method="batch",
)
lda.fit(X_train_counts)


def print_top_words(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        top_terms = [feature_names[i] for i in top_indices]
        print(f"Topic {topic_idx + 1}: {', '.join(top_terms)}")


feature_names = vectorizer.get_feature_names_out()
print_top_words(lda, feature_names, n_top_words=10)


In [None]:
# Topic distribution for test tweets
test_topic_dist = lda.transform(X_test_counts)
test_topic_ids = test_topic_dist.argmax(axis=1)

test_results = test_df.copy()
test_results["Predicted_Topic"] = test_topic_ids + 1  # 1-based, matching the "Topic N" labels printed above
test_results["Topic_Confidence"] = test_topic_dist.max(axis=1)

test_results[["Tweet", "Predicted_Topic", "Topic_Confidence"]].head(15)


### Q2.2 Show some predictions on test data. Does it look accurate?

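To make the predictions interpretable, read each tweet's `Predicted_Topic` against that topic's top words from Q2.1, which serve as a rough human label. The next cell adds an optional supervised check: if a `Data_Science` label column is present in both files, it trains a bag-of-words logistic regression baseline and reports its accuracy on the test set.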
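
One way to derive such rough labels programmatically is sketched here. The helper name `label_topics` and the toy weights are hypothetical; the function simply joins each topic's highest-weighted words and is not a substitute for reading the topics by hand.

```python
def label_topics(components, feature_names, n_words=3):
    """Map each 1-based topic number to a label built from its top-weighted words."""
    labels = {}
    for topic_no, weights in enumerate(components, start=1):
        # Indices of the n_words largest weights, highest first
        top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n_words]
        labels[topic_no] = " / ".join(feature_names[i] for i in top)
    return labels


# Toy topic-word weights for illustration only (not fitted values)
toy_vocab = ["data", "python", "model", "coffee", "weekend"]
toy_components = [
    [9.0, 7.0, 5.0, 0.1, 0.2],
    [0.3, 0.1, 0.2, 8.0, 6.0],
]
print(label_topics(toy_components, toy_vocab, n_words=2))
```

With the fitted model this would be called as `label_topics(lda.components_, feature_names)`; the keys follow the 1-based `Topic N` numbering printed in Q2.1.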


In [None]:
# OPTIONAL supervised check (if Data_Science label exists):
# Use train labels to estimate practical prediction quality.
if "Data_Science" in train_df.columns and "Data_Science" in test_df.columns:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_counts, train_df["Data_Science"])
    y_pred = clf.predict(X_test_counts)
    acc = accuracy_score(test_df["Data_Science"], y_pred)

    print(f"Baseline classification accuracy (LogReg on BoW): {acc:.3f}")
    print(classification_report(test_df["Data_Science"], y_pred))
else:
    print("No ground-truth label column found in both files for supervised accuracy check.")


**Answer (rationale):**
- The LDA topic assignments generally look plausible for broad themes (e.g., data/science/programming vs general conversation), but they are not perfect on short tweets.
- Short texts are difficult for LDA because each tweet has limited context, slang, hashtags, and links, which can dilute topic quality.
- If the optional supervised baseline is run, it usually gives a clearer estimate of predictive accuracy than unsupervised LDA topic IDs.
- Overall, the outputs are useful for exploratory grouping, but I would not treat raw LDA topic IDs as high-precision class predictions without additional modeling/cleaning.
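
As an example of the cleaning mentioned above, links and @mentions can be stripped before vectorizing. This is a minimal sketch under assumptions: `clean_tweet`, the regular expressions, and the sample tweet are illustrative choices, not part of the homework resources.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Strip links and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # "#DataScience" -> "DataScience"
    return re.sub(r"\s+", " ", text).strip().lower()


# Made-up tweet for illustration
sample = "Loving #DataScience with @someone! https://example.com/post"
print(clean_tweet(sample))
```

It could be applied as `train_df[text_col].astype(str).map(clean_tweet)` before `vectorizer.fit_transform`, with the same mapping on the test tweets so both sets are preprocessed identically.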


## 3) Final Notes
- This notebook includes all required questions in order, with markdown answers and corresponding code blocks.
- If any cell raises an error in a fresh environment, run the setup and NLTK download cells first.
