# 🧠 Topic Modeling on COVID-19 Tweets Using NMF

**Author:** Yiling Xu  
**Last Updated:** Sept 2025  
**Status:** Ongoing exploration | NLP | Topic Modeling  

---

## 📌 Project Description

This notebook explores the use of **unsupervised learning** to uncover hidden themes in a large-scale Twitter dataset about the COVID-19 pandemic.  
Using **TF-IDF vectorization** and **Non-negative Matrix Factorization (NMF)**, we project tweet content into a low-dimensional semantic space and extract **interpretable topics**.

---

## 💡 Key Steps:
- Load and clean ~100,000 English-language tweets from April 30, 2020
- Vectorize text using TF-IDF
- Apply NMF to extract latent topics
- Visualize tweet distributions and outlier content in topic space
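
The steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's actual pipeline: the four example sentences, the use of `make_pipeline`, the built-in English stop-word list, and `max_iter=500` are all assumptions chosen to keep the example small and self-contained.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the cleaned tweets
toy_tweets = [
    "hospital nurses need masks and gloves",
    "nurses thank donors for masks",
    "schools closed students learn online",
    "online classes for students start monday",
]

# TF-IDF vectorization followed by NMF, chained into one estimator
topic_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    NMF(n_components=2, init="nndsvd", random_state=416, max_iter=500),
)

# One row per document, one non-negative weight per topic
doc_topics = topic_pipeline.fit_transform(toy_tweets)
print(doc_topics.shape)
print(doc_topics.round(3))
```

Each row of `doc_topics` plays the same role as a row of `tweets_projected` later in the notebook.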

---

## Dataset:

The dataset comes from [Kaggle](https://www.kaggle.com/smid80/coronavirus-covid19-tweets-late-april?select=2020-04-30+Coronavirus+Tweets.CSV). The preprocessing follows the steps in [this Kaggle notebook](https://www.kaggle.com/satanizer/covid-19-tweets-analysis).


## Setup & Data Loading

```python
import numpy as np
import pandas as pd

# Reproducibility
np.random.seed(416)

# Load dataset (expects a CSV with at least a 'text' column; optional 'lang' column)
df = pd.read_csv("tweets-2020-4-30.csv")
df["text"] = df["text"].fillna("")  # treat missing tweets as empty strings
df.tail()
```

## 🧹 Preprocessing

The raw tweets are cleaned to support topic modeling:
- Filter to English-language tweets only
- Remove URLs, punctuation, and common COVID hashtags
- Lowercase the text and remove stop words using `nltk`

The original text stays in the `text` column; the cleaned text is stored in the `text_clean` column.

```python
import re, string
import pandas as pd
import nltk
from nltk.corpus import stopwords

# Make sure NLTK resources are available
nltk.download("stopwords")

# Filter to English if the column exists
if "lang" in df.columns:
    df = df.loc[df["lang"].astype(str).str.lower() == "en"].copy()

# Basic cleaning functions
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def strip_urls(s: str) -> str:
    return URL_PATTERN.sub("", s)

def strip_punct(s: str) -> str:
    return s.translate(str.maketrans("", "", string.punctuation))

# Stopwords & domain-specific high-frequency tokens.
# Punctuation (including "#", "-", and "_") is stripped before this filter runs,
# so hashtags are listed without "#" and "covid-19" / "covid_19" reduce to "covid19".
stop_words = set(stopwords.words("english"))
stop_words.update([
    "coronavirus", "coronavirusoutbreak", "coronaviruspandemic",
    "covid19", "covidー19", "epitwitter", "ihavecorona", "amp",
])

def remove_stopwords(s: str) -> str:
    return " ".join(w for w in s.split() if w not in stop_words)

# Apply cleaning
clean = (
    df["text"]
    .astype(str)
    .str.lower()
    .apply(strip_urls)
    .apply(strip_punct)
    .apply(remove_stopwords)
)

df["text_clean"] = clean
df["text_clean"].head()
```
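
The order of the cleaning steps matters: because punctuation is stripped before stop words are removed, a hashtag such as `#covid19` reaches the stop-word filter as `covid19`. The sketch below traces one made-up tweet through the same sequence of steps. The sample text and the tiny `DEMO_STOP_WORDS` set are assumptions for illustration; the notebook itself uses NLTK's English list plus the domain tokens.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Tiny illustrative stop-word set (an assumption, not NLTK's full list)
DEMO_STOP_WORDS = {"the", "is", "a", "to", "covid19"}

def clean_tweet(text: str, stop_words: set) -> str:
    """Lowercase, drop URLs, drop punctuation, then drop stop words."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "The #COVID19 curve is flattening! Details: https://example.com/report"
cleaned = clean_tweet(sample, DEMO_STOP_WORDS)
print(cleaned)  # curve flattening details
```

A stop-word entry that still carries `#` would never match here, since no token contains `#` by the time the filter runs.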

## TF-IDF Matrix

Convert the cleaned text into a sparse document–term matrix via TF-IDF.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.95)  # drop terms that appear in more than 95% of tweets
tf_idf = vectorizer.fit_transform(df["text_clean"])  # shape: (num_docs, num_terms)
feature_names = vectorizer.get_feature_names_out()

num_tweets, num_words = tf_idf.shape
print(f"TF-IDF shape: {tf_idf.shape}  | tweets={num_tweets}, terms={num_words}")
```

Recorded matrix shape (tweets, terms): `(119147, 183012)`

## 🧩 Topic Modeling with NMF

Use Non‑negative Matrix Factorization (NMF) to uncover latent themes.
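
For reference, the factorization can be written out as follows. The symbols $V$, $W$, $H$, $n$, $m$, and $k$ are notation introduced here rather than names from the notebook: $V$ is the $n \times m$ TF-IDF matrix, $W$ corresponds to `tweets_projected`, and $H$ to `components`. The objective shown assumes scikit-learn's default Frobenius loss with no regularization.

$$
V \approx W H, \qquad W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

$$
\min_{W \ge 0,\; H \ge 0} \; \frac{1}{2} \lVert V - W H \rVert_F^2
$$

Each row of $W$ gives one tweet's weights over the $k$ topics, and each row of $H$ gives one topic's weights over the vocabulary.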

```python
from sklearn.decomposition import NMF

# 5 topics for interpretability; adjust n_components as needed
nmf = NMF(n_components=5, init="nndsvd", random_state=416)
tweets_projected = nmf.fit_transform(tf_idf)   # (num_docs, n_topics)
components = nmf.components_                   # (n_topics, num_terms)
```
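
One rough way to explore other values of `n_components` is to compare the reconstruction error across a few candidates. The sketch below does this on a small synthetic matrix rather than the tweet data. The matrix `V`, the candidate values, and `max_iter=500` are assumptions for illustration, and a lower error does not by itself mean more interpretable topics.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(416)
# Synthetic non-negative matrix built from 3 hidden factors plus small noise
V = rng.random((40, 3)) @ rng.random((3, 15)) + 0.01 * rng.random((40, 15))

errors = {}
for k in (2, 3, 4, 5):
    model = NMF(n_components=k, init="nndsvd", random_state=416, max_iter=500)
    model.fit(V)
    # Frobenius norm of V - WH for the fitted model
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"k={k}: reconstruction error = {err:.4f}")
```

On the real TF-IDF matrix each fit is considerably more expensive, so a short list of candidate values is usually enough for a first look.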


## 🔑 Top Words per Topic


```python
import numpy as np

def words_from_topic(topic_row: np.ndarray, vocab: np.ndarray, top_k: int = 10):
    # Indices of the top_k largest weights, in descending order
    idx = np.argsort(topic_row)[::-1][:top_k]
    return [vocab[i] for i in idx]

for i, row in enumerate(components):
    top_words = ", ".join(words_from_topic(row, feature_names, top_k=10))
    print(f"Topic #{i}: {top_words}")
```


## 🔎 Inspect a Sample Tweet

Check a specific tweet’s topic weights (use a safe index).

```python
idx = min(40151, len(df) - 1)  # safe-guard if dataset is smaller
print("Tweet:", df.iloc[idx]["text"])
print("Cleaned:", df.iloc[idx]["text_clean"])
print("Topic weights:", tweets_projected[idx])
```


## 📊 Dominant Topic Distribution

Compute each tweet’s dominant topic and find the most frequent one.

```python
dom = np.argmax(tweets_projected, axis=1)         # dominant topic per tweet
largest_topic = int(np.bincount(dom).argmax())    # most common dominant topic
print(f"Most frequent dominant topic: {largest_topic}")
```
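
One caveat when taking the `argmax` per row: a tweet whose weights are all zero (for example, one with no terms left after cleaning) is assigned to topic 0 by default, which can inflate that topic's count. The sketch below shows one possible way to count dominant topics while setting such rows aside. The toy matrix `W` and the `-1` placeholder label are assumptions for illustration.

```python
import numpy as np

# Hypothetical document-topic weights: 4 tweets, 3 topics
W = np.array([
    [0.10, 0.00, 0.30],
    [0.00, 0.00, 0.00],  # no signal, e.g. an empty tweet after cleaning
    [0.05, 0.20, 0.01],
    [0.00, 0.02, 0.40],
])

has_signal = W.sum(axis=1) > 0
# Label rows without any weight as -1 instead of letting argmax pick topic 0
dominant = np.where(has_signal, W.argmax(axis=1), -1)

# Count dominant topics over rows with signal; minlength keeps empty topics visible
counts = np.bincount(dominant[has_signal], minlength=W.shape[1])
print("Dominant topic per tweet:", dominant)
print("Tweets per topic:", counts)
```

Using `minlength` also guarantees one count per topic, even when a topic is never dominant.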


## 📈 3D Visualization (NMF with 3 Topics)

Project into 3 topics to enable a 3D scatter visualization.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older Matplotlib to register the 3D projection)

nmf_small = NMF(n_components=3, init="nndsvd", random_state=416)
proj3 = nmf_small.fit_transform(tf_idf)  # (num_docs, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection="3d")
ax.scatter(proj3[:, 0], proj3[:, 1], proj3[:, 2], s=2, alpha=0.5)
ax.set_xlabel("Topic 0")
ax.set_ylabel("Topic 1")
ax.set_zlabel("Topic 2")
ax.set_title("Tweets in 3D Topic Space (NMF, k=3)")
plt.tight_layout()
plt.show()
```


Output of the sample-tweet inspection (cleaned text, followed by its weights under the 5-topic model):

attention seattle shoppers grocery stores working hard keep employees customers safe part help slow spread ☑️ limit trips ☑️ respect special shopping hours ☑️ follow socialdistance guidance stores wegotthisseattle

[0.00823661 0.         0.02895533 0.         0.01529455]
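
Raw NMF weights are not probabilities, so their scale can be hard to read. One optional convenience, sketched below, is to rescale a tweet's weights so they sum to 1. The variable names are assumptions for illustration, and the numbers are copied from the five-weight vector shown above.

```python
import numpy as np

# Weights copied from the sample output above (5-topic model)
weights = np.array([0.00823661, 0.0, 0.02895533, 0.0, 0.01529455])

total = weights.sum()
# Guard against an all-zero row, which would otherwise divide by zero
shares = weights / total if total > 0 else np.zeros_like(weights)

dominant_topic = int(np.argmax(weights))
print("Shares:", shares.round(3))
print("Dominant topic:", dominant_topic)
```

For this tweet, a little over half of the total weight falls on Topic 2, with the remainder split between Topics 0 and 4.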


## Outlier Tweets in Topic Space

Identify tweets strongly aligned with Topic 2 in the 3‑topic model.

```python
threshold = 0.15
mask = proj3[:, 2] >= threshold  # tweets with a Topic 2 weight of at least 0.15
outlier_tweets = df.loc[mask, "text_clean"].drop_duplicates().to_numpy()
print(f"Outlier tweets (unique): {len(outlier_tweets)}")
outlier_tweets[:10]  # preview
```
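
The fixed cutoff of 0.15 depends on the scale of the fitted weights, which can change with the data or the number of topics. As an alternative sketch, the threshold can be taken from a quantile of the topic's own weights. The toy weights and the 0.8 quantile below are assumptions for illustration, not values from the notebook.

```python
import numpy as np

# Hypothetical Topic 2 weights for ten tweets
topic2 = np.array([0.01, 0.02, 0.00, 0.30, 0.03, 0.18, 0.02, 0.01, 0.00, 0.04])

# Use the 80th percentile of the weights as a data-driven cutoff
quantile_threshold = np.quantile(topic2, 0.8)
outlier_mask = topic2 >= quantile_threshold

outlier_rows = np.flatnonzero(outlier_mask)
print(f"Threshold: {quantile_threshold:.3f}")
print("Outlier rows:", outlier_rows)
```

With this approach the share of tweets flagged stays roughly constant, whereas a fixed cutoff keeps the weight level constant; which is preferable depends on the question being asked.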
