# Topic Modeling & Keyword Extraction Assignment using Gensim

## Introduction
This assignment delves into **Topic Modeling** and **Keyword Extraction**, two powerful techniques in Natural Language Processing (NLP) for uncovering hidden thematic structures and identifying important terms within a collection of documents. You'll primarily use the **Gensim** library, a robust open-source library for unsupervised topic modeling and natural language understanding.

**Topic modeling** helps us discover abstract "topics" that occur in a collection of documents. **Keyword extraction** focuses on identifying the most representative words or phrases from a single document or a set of documents.

---

## Learning Objectives
By completing this assignment, you should be able to:
- Load and preprocess text data effectively for topic modeling.
- Create a Gensim dictionary and corpus from a collection of documents.
- Apply Latent Dirichlet Allocation (LDA) using Gensim to discover topics.
- Interpret the topics generated by an LDA model.
- Implement basic keyword extraction techniques.
- Discuss the challenges and applications of topic modeling and keyword extraction.

---

## Dataset
For this assignment, we'll use a collection of news articles or abstracts. A suitable dataset would be a subset of the **20 Newsgroups dataset** or a collection of research paper abstracts.

**Assumption:** We'll assume you have a list of text documents. If you have a CSV, you might need to load it and extract a text column. We'll use a small, built-in example dataset for demonstration purposes if a file isn't provided.
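
If your documents live in a CSV file, the following minimal sketch shows one way to pull a single text column into a list. The column name `text`, the helper name `load_text_column`, and the in-memory sample are assumptions for illustration; substitute your own file and column.

```python
import csv
import io

# Hypothetical CSV content. In practice, open your own file instead, e.g.
# open("articles.csv", newline="", encoding="utf-8").
sample_csv = io.StringIO(
    "id,text\n"
    "1,Machine learning is a fascinating field.\n"
    "2,\n"
    "3,Space exploration is a huge endeavor.\n"
)

def load_text_column(file_obj, column="text"):
    """Return the non-empty values of one column as a list of strings."""
    reader = csv.DictReader(file_obj)
    # `or ""` guards against missing values, which DictReader reports as None.
    values = ((row.get(column) or "").strip() for row in reader)
    return [value for value in values if value]

csv_documents = load_text_column(sample_csv)
print(len(csv_documents), "documents loaded")
```

Rows with an empty text field are skipped so that they do not become empty documents later in the pipeline.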

**If you need a real dataset, consider downloading a small subset of:**
- **20 Newsgroups Dataset:** Available via scikit-learn (`from sklearn.datasets import fetch_20newsgroups`). It has various categories, which can be useful for seeing how well LDA uncovers them.
- **Abstracts from ArXiv:** Often available as JSON or CSV files.

---

In [None]:
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel
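# gensim.summarization (TextRank keywords) was removed in Gensim 4.0.
# Question 6 needs an older release, e.g. pip install "gensim<4.0".
try:
    from gensim.summarization import keywords
except ImportError:
    keywords = None  # Not available in this Gensim version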

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore') # Suppress library warnings to keep the notebook output readable


# Download NLTK data (if not already downloaded).
# nltk.data.find raises LookupError when a resource is missing.
nltk_resources = {
    'stopwords': 'corpora/stopwords',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4',          # Required for WordNetLemmatizer
    'punkt': 'tokenizers/punkt',
    'punkt_tab': 'tokenizers/punkt_tab',   # Used by word_tokenize in recent NLTK releases
}
for package, path in nltk_resources.items():
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(package)


# --- Sample Data (if you don't have a specific dataset) ---
documents = [
    "The quick brown fox jumps over the lazy dog. Dogs are loyal pets.",
    "Machine learning is a fascinating field. Neural networks are a type of machine learning algorithm.",
    "Artificial intelligence is rapidly advancing. AI models are becoming more sophisticated.",
    "Computers process data very quickly. Data science involves analyzing large datasets.",
    "Space exploration is a huge endeavor. Astronauts explore the cosmos and discover new planets.",
    "Deep learning, a subset of machine learning, has revolutionized image recognition.",
    "Robotics combine engineering and computer science to build intelligent machines.",
    "The stock market saw a surge in tech stocks. Investors are optimistic about future growth.",
    "Climate change impacts the environment. Scientists are studying global warming.",
    "Genetics research explores DNA and heredity. The human genome project was a landmark study.",
    "The new smartphone has an amazing camera and long battery life. Users love the features.",
    "Cryptocurrency values are volatile. Blockchain technology underpins many digital currencies.",
    "Travel to distant galaxies is currently science fiction, but inspiring for astronomy.",
    "Big data analytics helps businesses make informed decisions from vast amounts of information."
]

print("Sample documents loaded. Total documents:", len(documents))
print("\nFirst document example:\n", documents[0])

---

## Assignment Questions

---

### Question 1: Text Preprocessing for Topic Modeling
Topic models perform best on clean, well-processed text. Create a function `preprocess_text(text)` that performs the following steps:
1.  **Convert to Lowercase:** Convert the entire text to lowercase.
2.  **Remove Punctuation:** Remove all punctuation marks.
3.  **Remove Numbers:** Remove all numerical digits.
4.  **Remove Extra Whitespace:** Replace multiple spaces with a single space and strip leading/trailing whitespace.
5.  **Tokenization:** Tokenize the text into individual words.
6.  **Remove Stop Words:** Remove common English stop words using NLTK's `stopwords` corpus.
7.  **Lemmatization:** Apply lemmatization to each token using NLTK's `WordNetLemmatizer`. For simplicity, you can keep the default POS tag (`pos='n'`, noun); passing `pos='v'` reduces verb forms such as "running" to "run".
8.  **Filter Short/Long Words:** Remove words that are too short (e.g., length < 3) or too long (e.g., length > 15) as they might be noise or not meaningful.

Apply this `preprocess_text` function to each document in your `documents` list. Store the result as a list of lists of tokens (e.g., `[['word1', 'word2'], ['wordA', 'wordB']]`). Print the preprocessed version of the first document.
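
Steps 1 to 4 and step 8 need only the standard library. The sketch below covers just those cleanup steps: whitespace splitting stands in for `word_tokenize`, and stop-word removal and lemmatization are deliberately left out. The helper name `basic_clean` is an assumption, not part of NLTK or Gensim.

```python
import re
import string

def basic_clean(text, min_len=3, max_len=15):
    """Lowercase, strip punctuation and digits, normalize spaces, filter by length."""
    text = text.lower()
    # Replace punctuation with spaces so "learning,a" does not fuse into one token.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    text = re.sub(r"\d+", " ", text)          # remove digits
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    # Whitespace split is a stand-in for nltk.tokenize.word_tokenize.
    return [tok for tok in text.split() if min_len <= len(tok) <= max_len]

print(basic_clean("Deep learning, a subset of ML, won 3 awards in 2012!"))
# ['deep', 'learning', 'subset', 'won', 'awards']
```

Replacing punctuation with a space, rather than deleting it, is a design choice: it avoids gluing together words that were separated only by a comma or slash.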

---

### Question 2: Create Gensim Dictionary and Corpus
Gensim's LDA model requires input in a specific format: a dictionary (mapping words to IDs) and a corpus (a list of documents represented as bags-of-words).

1.  **Create a Dictionary:** Use `gensim.corpora.Dictionary` to create a dictionary from your list of preprocessed documents (from Question 1).
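2.  **Filter Extremes (Optional but Recommended):** Filter out words that appear too frequently or too infrequently. This can help remove very common words (even if not stop words) and very rare words that don't contribute much to topic identification. Use `dictionary.filter_extremes()` (e.g., `no_below=5`, `no_above=0.5` on a larger dataset). Note that `no_below` is a minimum document count: with the 14 sample documents, `no_below=5` would remove nearly every word, so use a lower threshold there (e.g., `no_below=1` or `2`) or skip this step.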
3.  **Create a Corpus:** Use the `dictionary.doc2bow()` method to convert each preprocessed document into a bag-of-words (list of `(word_id, count)` tuples). Store this as your Gensim corpus.
4.  Print the dictionary size and the bag-of-words representation of the first document in your corpus.

---

### Question 3: Train an LDA Model
Now, let's train the LDA model to discover topics within your document collection.

1.  **Initialize and Train LDA Model:** Use `gensim.models.LdaModel`.
    * Set `num_topics` to a reasonable number (e.g., 3-5 for our small sample, or 10-20 for larger datasets). Experiment if you like!
    * Pass your `corpus`, and pass your `dictionary` as the `id2word` argument.
    * Set `passes` (the number of full passes over the corpus during training) to a value like 10-20.
    * Set `random_state` for reproducibility.
2.  **Print Topics:** Print the top 10 most significant words for each discovered topic. Use `lda_model.print_topics()`.
3.  **Interpret Topics:** Based on the words associated with each topic, try to assign a descriptive name or theme to each topic. Discuss what each topic appears to be about.

---

### Question 4: Evaluate Topic Coherence (Optional/Bonus)
Topic coherence is an automated score that estimates how interpretable the topics are to humans, based on how often the top words of each topic occur together in the texts. A higher coherence score generally indicates better topics.

1.  **Calculate Coherence Score:** Use `gensim.models.CoherenceModel`.
    * Set `model` to your trained `lda_model`.
    * Set `texts` to your preprocessed documents (list of lists of tokens from Q1).
    * Set `dictionary` to your Gensim dictionary.
    * Set `coherence='c_v'` (a common and robust coherence measure).
2.  Print the coherence score.
3.  Briefly explain what this score indicates about your topics.

---

### Question 5: Assign Topics to Documents
After training the LDA model, you can determine the dominant topic(s) for each document.

1.  Iterate through your `corpus` (bag-of-words documents).
2.  For each document, use `lda_model.get_document_topics()` to get the topic distribution.
3.  Identify the dominant topic (the one with the highest probability) for at least the first 5 documents.
4.  Print the original document and its dominant topic (including its ID and probability).
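
`get_document_topics()` returns a list of `(topic_id, probability)` pairs, so picking the dominant topic comes down to taking the pair with the largest probability. The sketch below uses a hypothetical distribution instead of a trained model; the numbers and the helper name `dominant_topic` are assumptions.

```python
# Hypothetical result of lda_model.get_document_topics(bow) for one document.
example_distribution = [(0, 0.12), (1, 0.81), (2, 0.07)]

def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

topic_id, probability = dominant_topic(example_distribution)
print(f"Dominant topic: {topic_id} (probability {probability:.2f})")
# Dominant topic: 1 (probability 0.81)
```

By default, Gensim omits topics whose probability falls below a small threshold; passing `minimum_probability=0.0` to `get_document_topics()` returns the full distribution.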

---

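### Question 6: Keyword Extraction (using Gensim's `summarization.keywords` function - TextRank)
Gensim releases before 4.0 provide `gensim.summarization.keywords()`, which extracts keywords from text using a TextRank-based algorithm. The `summarization` module was removed in Gensim 4.0, so this question requires an older release (e.g., `pip install "gensim<4.0"`). This approach differs from reading the important words in LDA topics, because it focuses on single-document salience rather than themes shared across the collection.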

1.  Choose one of your original `documents` (e.g., `documents[0]`).
2.  Use `gensim.summarization.keywords()` to extract keywords from this document.
3.  Print the original document and the extracted keywords.
4.  Discuss whether the extracted keywords accurately represent the main themes of the document. How do these compare to the words you saw in the LDA topics?
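
To make the idea behind TextRank-style extraction concrete, here is a simplified standard-library sketch: words that occur close together are linked in a graph, and a PageRank-style iteration scores them. It runs under assumed settings (a window of 2, damping of 0.85, a tiny hand-picked stop-word set) and is not Gensim's implementation, so its output will differ from `gensim.summarization.keywords()`.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; NLTK's list is far larger).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "over", "to", "in"}

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=3):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

    # Link each word to the words that follow it within the window.
    # Simplification: the window is allowed to cross sentence boundaries.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every node in `neighbors` has at least one edge, so the division is safe.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Sort by score, breaking ties alphabetically for a stable result.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

sample = ("Machine learning is a fascinating field. "
          "Neural networks are a type of machine learning algorithm.")
print(textrank_keywords(sample))
```

Because the score depends only on co-occurrence within one document, frequently repeated and well-connected words rise to the top, which is a useful point of comparison with the corpus-wide word lists produced by LDA.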

---

### Question 7: Discussion and Applications
Reflect on your experience with topic modeling and keyword extraction.

1.  **Challenges of Topic Modeling:** What were some challenges or ambiguities you faced when interpreting the LDA topics? (e.g., overlapping topics, vague words).
2.  **Applications:** Describe two real-world scenarios (different from news articles) where topic modeling and/or keyword extraction could be highly beneficial. Explain *how* they would be used in each scenario.
3.  **Limitations of Simple Keyword Extraction:** What are some limitations of simple keyword extraction methods like the one used in Q6 (TextRank-like)? How might they struggle with context or nuance?

---

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_topic_modeling_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested in Markdown cells.
- Feel free to add additional code cells or markdown cells for clarity or experimentation.

---