### Name : Sai Kumar Gandham
### Student ID: IG45378

Homework:

For this line of code:
```
# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
```
Rerun with no_above=.75, .9 and removed.  How do the topics change?

For this line of code:

```
# Set training parameters.
num_topics = 15
```
Set to 10, 15, 20.  Try to interpret the topics.

In [5]:
import io
import os.path
import re
import tarfile

import smart_open
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from pprint import pprint

# Download the necessary NLTK resources.
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/saikumargandham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/saikumargandham/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            for member in tar.getmembers():
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')

In [7]:
# Extracting the documents
docs = list(extract_documents())

# Tokenize the documents.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

# Lemmatize the documents.
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

# Remove rare and common tokens and create dictionary
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
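
The token-cleaning steps above can be tried on a single sentence without downloading the corpus. The following is a minimal sketch, not the notebook's pipeline: it uses `re.findall` in place of NLTK's `RegexpTokenizer`, skips lemmatization, and the helper name `basic_tokens` and the sample sentence are made up for illustration.

```python
import re

def basic_tokens(text):
    """Lowercase, split into runs of word characters, drop pure numbers and 1-char tokens."""
    tokens = re.findall(r"\w+", text.lower())
    # Remove numbers, but not words that contain numbers (e.g. "3d" stays).
    tokens = [t for t in tokens if not t.isnumeric()]
    # Remove words that are only one character.
    return [t for t in tokens if len(t) > 1]

sample = "The 2 networks use a 3D model in 1995."  # hypothetical sentence
print(basic_tokens(sample))
# ['the', 'networks', 'use', '3d', 'model', 'in']
```

Note that "3d" survives because it only contains a digit, while "2" and "1995" are dropped as purely numeric and "a" is dropped for its length.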

In [8]:
# Check whether document x is assigned to a topic with a probability above a threshold.
# Note: this relies on the global variables `model` and `corpus`.
def check_topic_threshold(x, topic, threshold):
    topics = model.get_document_topics(corpus[x])
    for topic_id, probability in topics:
        if topic_id == topic and probability > threshold:
            return True
    # No matching topic exceeded the threshold.
    return False

In [9]:
# Function to train LDA model and print top topics
def train_lda_model(num_topics):
    # Set training parameters.
    chunksize = 2000
    passes = 20
    iterations = 400
    eval_every = 50 

    # Make an index to word dictionary.
    temp = dictionary[0]  # Accessing one item only "loads" the dictionary's id2token mapping.
    id2word = dictionary.id2token

    model = LdaModel(
        corpus=corpus,
        id2word=id2word,
        chunksize=chunksize,
        alpha='auto',
        eta='auto',
        iterations=iterations,
        num_topics=num_topics,
        passes=passes,
        eval_every=eval_every
    )

    top_topics = model.top_topics(corpus)
    avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
    print('Average topic coherence: %.4f.' % avg_topic_coherence)
    pprint(top_topics)

    return model

In [2]:
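# Homework:
# Rerun with no_above=.75, .9 and removed. How do the topics change?
# no_above=1.0 disables the upper cutoff; omitting the argument would fall
# back to gensim's default of no_above=0.5.
no_above_values = [0.75, 0.9, 1.0]
for no_above in no_above_values:
    # Rebuild the dictionary from the tokens each time: filter_extremes
    # modifies it in place, so tokens removed earlier are not restored.
    dictionary = Dictionary(docs)
    dictionary.filter_extremes(no_below=20, no_above=no_above)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    print(f'\nNumber of unique tokens (no_above={no_above}): {len(dictionary)}')
    print('Number of documents:', len(corpus))
    # Keep the number of topics fixed at 15 for comparison.
    model = train_lda_model(15)

# Set to 10, 15, 20. Try to interpret the topics.
# Restore the original filter settings before varying the number of topics.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in docs]
num_topics_values = [10, 15, 20]
for num_topics in num_topics_values:
    model = train_lda_model(num_topics)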



Number of unique tokens (no_above=0.75): 6617
Number of documents: 1740
Average topic coherence: -1.2467.
[([(0.016425958, 'mixture'),
   (0.012225279, 'likelihood'),
   (0.01066263, 'gaussian'),
   (0.010462013, 'density'),
   (0.010308881, 'em'),
   (0.008545883, 'expert'),
   (0.0072623435, 'markov'),
   (0.0071332203, 'log'),
   (0.0066856025, 'estimate'),
   (0.006619725, 'posterior'),
   (0.0060191755, 'approximation'),
   (0.0058542686, 'conditional'),
   (0.005572926, 'estimation'),
   (0.0054739336, 'hidden'),
   (0.0051836977, 'matrix'),
   (0.0051822113, 'prior'),
   (0.0050695976, 'field'),
   (0.0047448375, 'noise'),
   (0.004385949, 'maximum'),
   (0.0043639583, 'xt')],
  -0.9204694810589242),
 ([(0.030993043, 'neuron'),
   (0.017409876, 'cell'),
   (0.011736575, 'spike'),
   (0.011304937, 'response'),
   (0.010290459, 'stimulus'),
   (0.009857588, 'synaptic'),
   (0.008739324, 'firing'),
   (0.008286187, 'activity'),
   (0.0057290317, 'potential'),
   (0.005004475, 'fre

***Lower no_above (0.75):*** Only tokens that appear in more than 75% of the documents are removed. Compared with the original setting of 0.5, this keeps more of the common words, so the vocabulary grows. The topics can pick up additional frequent terms, which may make them easier to interpret. In this run, this setting gave a higher average topic coherence score.

***Higher no_above (0.9):*** This setting is less strict: only tokens that appear in more than 90% of the documents are removed, so even more common words stay in the vocabulary. Because these high-frequency words occur in most papers, they can show up in several topics at once, which tends to make the topics more general and less distinctive. In this run, the average topic coherence decreased.

***No removal of common tokens:*** Every token is kept no matter how common it is (rare tokens are still dropped by no_below=20). The topics can then cover a wide range of both specific and general terms, but very common words may dominate them and make them less focused. Note that omitting the no_above argument in gensim falls back to the default of 0.5, so the upper cutoff has to be disabled explicitly with no_above=1.0.
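
To make the direction of the `no_above` cutoff concrete, here is a minimal pure-Python sketch of the document-frequency rule. It is a simplified stand-in for `Dictionary.filter_extremes`, not gensim's code; the function name `filter_vocabulary` and the four toy documents are assumptions made for illustration.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=1, no_above=1.0):
    """Keep tokens found in at least no_below docs and at most no_above * len(docs) docs."""
    # Document frequency: count each token once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

# Hypothetical toy corpus: "network" is in 4/4 docs, "learning" in 3/4.
toy_docs = [
    ["network", "learning", "neuron"],
    ["network", "learning", "gaussian"],
    ["network", "learning", "spike"],
    ["network", "kernel", "mixture"],
]

for no_above in (0.5, 0.75, 1.0):
    vocab = filter_vocabulary(toy_docs, no_above=no_above)
    print(no_above, len(vocab), vocab)
```

On this toy corpus the vocabulary grows from 5 to 6 to 7 tokens as `no_above` rises: "learning" is admitted at 0.75 and "network" only at 1.0, which is the same direction of change described above.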

***Interpretation of Topics***:
Topic interpretation involves examining the terms with the highest probabilities within each topic and understanding the semantic coherence of these terms.

For example, in the first set of topics, terms like "mixture," "likelihood," "gaussian," and "density" suggest a topic related to probabilistic models.

Similarly, terms like "neuron," "cell," "activity," and "response" suggest a topic related to neuroscience or neural networks.
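
Reading off the highest-probability terms can be automated. The sketch below assumes each topic is a list of `(probability, word)` pairs, as in the printed output above, and copies the first few pairs of the two topics shown there; the helper name `top_terms` and the labels "topic A" and "topic B" are made up for illustration.

```python
def top_terms(topic_terms, n=3):
    """Return the n words with the highest probability in one topic."""
    ranked = sorted(topic_terms, key=lambda pair: pair[0], reverse=True)
    return [word for _, word in ranked[:n]]

# First few (probability, word) pairs copied from the output above.
topics = {
    "topic A": [(0.016425958, "mixture"), (0.012225279, "likelihood"),
                (0.01066263, "gaussian"), (0.010462013, "density")],
    "topic B": [(0.030993043, "neuron"), (0.017409876, "cell"),
                (0.011736575, "spike"), (0.011304937, "response")],
}

for label, terms in topics.items():
    print(label, top_terms(terms))
# topic A ['mixture', 'likelihood', 'gaussian']
# topic B ['neuron', 'cell', 'spike']
```

A short list like this is often enough to attach a working label to a topic, such as "probabilistic models" for topic A, before checking the remaining terms.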