<a href="https://colab.research.google.com/github/RST0310/INFO-5731/blob/main/RAYABARAPU_SAITEJA_Exercise_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
pip install nltk gensim




In [None]:
import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Sample text data
documents = [
    "One interesting text classification task related to cricket could be classifying match commentary or articles into different categories such as match summaries, player profiles, match predictions, and analysis pieces. Here are some features that could be useful for building a machine learning model for this task:",
    "1. Word Frequency Features: The frequency of specific cricket-related terms or phrases in the text could be indicative of its category. For example, terms like 'century,' 'wicket,' 'runs,' 'dismissal,' and 'boundary' might be more common in match summaries, while terms like 'player statistics,' 'performance analysis,' and 'strategy' might be more common in analysis pieces.",
    "2. N-grams: Analyzing sequences of words (n-grams) could provide valuable context. For instance, phrases like 'man of the match,' 'caught behind,' or 'run rate' might be indicative of specific types of content.",
    "3. Sentiment Analysis: The sentiment expressed in the text could help classify articles as positive (e.g., celebrating a team's victory), negative (e.g., criticizing a player's performance), or neutral. This could be achieved by analyzing the sentiment of individual sentences or using pre-trained sentiment analysis models.",
    "4. Named Entity Recognition (NER): Identifying named entities such as player names, team names, tournament names, and locations mentioned in the text could provide clues about the content's category. For example, mentions of players and teams are likely to appear in player profiles or match summaries.",
    "5. Part-of-Speech (POS) Tagging: Analyzing the grammatical structure of the text by tagging words with their parts of speech could reveal patterns specific to certain categories. For instance, articles with a high frequency of verbs related to analysis (e.g., 'analyze,' 'evaluate,' 'assess') might belong to the analysis category.",
    "6. Topic Modeling: Identifying underlying topics in the text using techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) could help categorize articles based on the dominant themes present. For instance, topics related to match tactics, player performances, or match outcomes might be indicative of specific categories.",
    "These features capture different aspects of the text content, allowing the machine learning model to learn patterns associated with each category and improve classification accuracy. By combining multiple types of features, the model can better understand the usage of cricket-related text and accurately classify it into relevant categories."
]

# Preprocess the text data
stop_words = set(stopwords.words('english'))
texts = [word_tokenize(document.lower()) for document in documents]
texts = [[word for word in text if word.isalnum() and word not in stop_words] for text in texts]

# Create a dictionary representation of the documents
dictionary = Dictionary(texts)

# Create a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# Determine the optimal number of topics based on coherence score
coherence_scores = []
for k in range(2, 11):
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((k, coherence_score))

best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])

# Train the LDA model with the best number of topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=best_k, random_state=42)

# Print the topics
print(f"The best number of topics (K) based on coherence score: {best_k}")
print("Summarized Topics:")
for idx, topic in lda_model.print_topics():
    print(f'Topic {idx + 1}: {topic}')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The best number of topics (K) based on coherence score: 7
Summarized Topics:
Topic 1: 0.030*"content" + 0.029*"indicative" + 0.029*"might" + 0.028*"words" + 0.028*"specific" + 0.028*"sequences" + 0.028*"instance" + 0.028*"context" + 0.028*"could" + 0.028*"types"
Topic 2: 0.046*"tagging" + 0.045*"analysis" + 0.025*"articles" + 0.025*"might" + 0.025*"specific" + 0.025*"categories" + 0.025*"grammatical" + 0.025*"parts" + 0.025*"frequency" + 0.025*"words"
Topic 3: 0.032*"sentiment" + 0.028*"match" + 0.025*"could" + 0.024*"topics" + 0.022*"using" + 0.020*"player" + 0.020*"text" + 0.019*"help" + 0.019*"articles" + 0.017*"analysis"
Topic 4: 0.008*"match" + 0.008*"might" + 0.008*"could" + 0.008*"analysis" + 0.008*"text" + 0.008*"like" + 0.008*"player" + 0.008*"features" + 0.008*"specific" + 0.008*"sentiment"
Topic 5: 0.039*"model" + 0.037*"features" + 0.035*"text" + 0.028*"match" + 0.025*"categories" + 0.024*"classification" + 0.024*"machine" + 0.023*"different" + 0.022*"learning" + 0.021*"typ

## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Sample text data
documents = [
    "One interesting text classification task related to cricket could be classifying match commentary or articles into different categories such as match summaries, player profiles, match predictions, and analysis pieces. Here are some features that could be useful for building a machine learning model for this task:",
    "1. Word Frequency Features: The frequency of specific cricket-related terms or phrases in the text could be indicative of its category. For example, terms like 'century,' 'wicket,' 'runs,' 'dismissal,' and 'boundary' might be more common in match summaries, while terms like 'player statistics,' 'performance analysis,' and 'strategy' might be more common in analysis pieces.",
    "2. N-grams: Analyzing sequences of words (n-grams) could provide valuable context. For instance, phrases like 'man of the match,' 'caught behind,' or 'run rate' might be indicative of specific types of content.",
    "3. Sentiment Analysis: The sentiment expressed in the text could help classify articles as positive (e.g., celebrating a team's victory), negative (e.g., criticizing a player's performance), or neutral. This could be achieved by analyzing the sentiment of individual sentences or using pre-trained sentiment analysis models.",
    "4. Named Entity Recognition (NER): Identifying named entities such as player names, team names, tournament names, and locations mentioned in the text could provide clues about the content's category. For example, mentions of players and teams are likely to appear in player profiles or match summaries.",
    "5. Part-of-Speech (POS) Tagging: Analyzing the grammatical structure of the text by tagging words with their parts of speech could reveal patterns specific to certain categories. For instance, articles with a high frequency of verbs related to analysis (e.g., 'analyze,' 'evaluate,' 'assess') might belong to the analysis category.",
    "6. Topic Modeling: Identifying underlying topics in the text using techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) could help categorize articles based on the dominant themes present. For instance, topics related to match tactics, player performances, or match outcomes might be indicative of specific categories.",
    "These features capture different aspects of the text content, allowing the machine learning model to learn patterns associated with each category and improve classification accuracy. By combining multiple types of features, the model can better understand the usage of cricket-related text and accurately classify it into relevant categories."
]

# Preprocess the text data
stop_words = set(stopwords.words('english'))
texts = [' '.join([word for word in word_tokenize(document.lower()) if word.isalnum() and word not in stop_words]) for document in documents]

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

# Compute the truncated SVD (LSA)
num_topics_range = range(2, 11)
coherence_scores = []

for num_topics in num_topics_range:
    svd = TruncatedSVD(n_components=num_topics)
    document_topics = svd.fit_transform(tfidf_matrix)

    # Compute coherence-like score
    similarity_matrix = cosine_similarity(document_topics)
    np.fill_diagonal(similarity_matrix, -1)  # Set diagonal to -1 to ignore self-similarity
    coherence_score = np.mean(similarity_matrix)
    coherence_scores.append((num_topics, coherence_score))

# Choose the number of topics with the highest coherence-like score
best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])

# Train LSA model with the best number of topics
svd = TruncatedSVD(n_components=best_k)
document_topics = svd.fit_transform(tfidf_matrix)

# Normalize document-topic matrix
document_topics = normalize(document_topics, axis=1, norm='l2')

# Print the topics
print(f"The best number of topics (K) based on coherence-like score: {best_k}")
print("Summarized Topics:")
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(svd.components_):
    top_words_idx = topic.argsort()[-10:][::-1]  # Get the indices of top 10 words for each topic
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {idx + 1}: {' '.join(top_words)}")


The best number of topics (K) based on coherence-like score: 2
Summarized Topics:
Topic 1: match analysis could might like terms text specific frequency features
Topic 2: model features task learning classification different machine improve combining capture


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Question 3 (10 points):
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
import numpy as np
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Sample text data
documents = [
    "One interesting text classification task related to cricket could be classifying match commentary or articles into different categories such as match summaries, player profiles, match predictions, and analysis pieces. Here are some features that could be useful for building a machine learning model for this task:",
    "1. Word Frequency Features: The frequency of specific cricket-related terms or phrases in the text could be indicative of its category. For example, terms like 'century,' 'wicket,' 'runs,' 'dismissal,' and 'boundary' might be more common in match summaries, while terms like 'player statistics,' 'performance analysis,' and 'strategy' might be more common in analysis pieces.",
    "2. N-grams: Analyzing sequences of words (n-grams) could provide valuable context. For instance, phrases like 'man of the match,' 'caught behind,' or 'run rate' might be indicative of specific types of content.",
    "3. Sentiment Analysis: The sentiment expressed in the text could help classify articles as positive (e.g., celebrating a team's victory), negative (e.g., criticizing a player's performance), or neutral. This could be achieved by analyzing the sentiment of individual sentences or using pre-trained sentiment analysis models.",
    "4. Named Entity Recognition (NER): Identifying named entities such as player names, team names, tournament names, and locations mentioned in the text could provide clues about the content's category. For example, mentions of players and teams are likely to appear in player profiles or match summaries.",
    "5. Part-of-Speech (POS) Tagging: Analyzing the grammatical structure of the text by tagging words with their parts of speech could reveal patterns specific to certain categories. For instance, articles with a high frequency of verbs related to analysis (e.g., 'analyze,' 'evaluate,' 'assess') might belong to the analysis category.",
    "6. Topic Modeling: Identifying underlying topics in the text using techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) could help categorize articles based on the dominant themes present. For instance, topics related to match tactics, player performances, or match outcomes might be indicative of specific categories.",
    "These features capture different aspects of the text content, allowing the machine learning model to learn patterns associated with each category and improve classification accuracy. By combining multiple types of features, the model can better understand the usage of cricket-related text and accurately classify it into relevant categories."
]

# Preprocess the text data
stop_words = set(stopwords.words('english'))
texts = [word_tokenize(document.lower()) for document in documents]
texts = [[word for word in text if word.isalnum() and word not in stop_words] for text in texts]

# Remove documents with no terms after preprocessing
texts = [text for text in texts if text]

# Create a dictionary representation of the documents
dictionary = Dictionary(texts)

# Filter out words that occur less than 1 document, or more than 50% of the documents
dictionary.filter_extremes(no_below=1, no_above=0.5)

# Create a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# Determine the optimal number of topics
coherence_scores = []
for k in range(2, 11):
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((k, coherence_score))

best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])

# Train the LDA model with the best number of topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=best_k, random_state=42)

# Print the topics
print(f"The best number of topics (K) based on coherence score: {best_k}")
print("Summarized Topics:")
for idx, topic in lda_model.print_topics():
    print(f'Topic {idx + 1}: {topic}')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The best number of topics (K) based on coherence score: 8
Summarized Topics:
Topic 1: 0.044*"analysis" + 0.043*"tagging" + 0.027*"speech" + 0.027*"analyzing" + 0.026*"structure" + 0.026*"grammatical" + 0.026*"categories" + 0.026*"high" + 0.025*"patterns" + 0.025*"parts"
Topic 2: 0.052*"sentiment" + 0.027*"features" + 0.027*"analysis" + 0.027*"classify" + 0.027*"model" + 0.019*"articles" + 0.018*"player" + 0.018*"using" + 0.018*"help" + 0.018*"categories"
Topic 3: 0.036*"tagging" + 0.035*"analysis" + 0.024*"pos" + 0.024*"reveal" + 0.023*"certain" + 0.022*"articles" + 0.021*"instance" + 0.021*"specific" + 0.020*"words" + 0.020*"might"
Topic 4: 0.034*"topics" + 0.021*"categories" + 0.021*"lda" + 0.021*"factorization" + 0.020*"allocation" + 0.020*"related" + 0.020*"categorize" + 0.020*"outcomes" + 0.020*"performances" + 0.020*"help"
Topic 5: 0.053*"task" + 0.029*"analysis" + 0.029*"features" + 0.028*"pieces" + 0.028*"summaries" + 0.028*"interesting" + 0.028*"profiles" + 0.028*"player" + 0.

## Question 4 (10 points):
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
pip install bertopic


Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.1/154.1 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m90.9/90.9 kB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)
[2K     [90m━━━━━

In [37]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Sample text data
documents = [
    "One interesting text classification task related to cricket could be classifying match commentary or articles into different categories such as match summaries, player profiles, match predictions, and analysis pieces. Here are some features that could be useful for building a machine learning model for this task:",
    "1. Word Frequency Features: The frequency of specific cricket-related terms or phrases in the text could be indicative of its category. For example, terms like 'century,' 'wicket,' 'runs,' 'dismissal,' and 'boundary' might be more common in match summaries, while terms like 'player statistics,' 'performance analysis,' and 'strategy' might be more common in analysis pieces.",
    "2. N-grams: Analyzing sequences of words (n-grams) could provide valuable context. For instance, phrases like 'man of the match,' 'caught behind,' or 'run rate' might be indicative of specific types of content.",
    "3. Sentiment Analysis: The sentiment expressed in the text could help classify articles as positive (e.g., celebrating a team's victory), negative (e.g., criticizing a player's performance), or neutral. This could be achieved by analyzing the sentiment of individual sentences or using pre-trained sentiment analysis models.",
    "4. Named Entity Recognition (NER): Identifying named entities such as player names, team names, tournament names, and locations mentioned in the text could provide clues about the content's category. For example, mentions of players and teams are likely to appear in player profiles or match summaries.",
    "5. Part-of-Speech (POS) Tagging: Analyzing the grammatical structure of the text by tagging words with their parts of speech could reveal patterns specific to certain categories. For instance, articles with a high frequency of verbs related to analysis (e.g., 'analyze,' 'evaluate,' 'assess') might belong to the analysis category.",
    "6. Topic Modeling: Identifying underlying topics in the text using techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) could help categorize articles based on the dominant themes present. For instance, topics related to match tactics, player performances, or match outcomes might be indicative of specific categories.",
    "These features capture different aspects of the text content, allowing the machine learning model to learn patterns associated with each category and improve classification accuracy. By combining multiple types of features, the model can better understand the usage of cricket-related text and accurately classify it into relevant categories."
]

# Preprocess the text data
stop_words = set(stopwords.words('english'))
texts = [' '.join([word for word in word_tokenize(document.lower()) if word.isalnum() and word not in stop_words]) for document in documents]

# Initialize BERTopic model
model = BERTopic()

# Fit BERTopic
topics, _ = model.fit_transform(texts)

# Summarize the topics
print("Summarized Topics:")
for topic_id in range(max(model.get_topic_freq().Topic)):
    topic_words = model.get_topic(topic_id)[:10]  # Get top 10 words for each topic
    print(f"Topic {topic_id + 1}: {' '.join(topic_words)}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


ValueError: k must be less than or equal to the number of training points

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
# LDA (Latent Dirichlet Allocation):

Pros: Provides interpretable topics, widely applicable in various contexts.
Cons: Requires parameter tuning and is less effective with very large datasets.
LSA (Latent Semantic Analysis):

Pros: Offers consistent results, relatively easy to implement.
Cons: Topics may lack interpretability and are sensitive to dataset size.
Word Clouds:

Pros: Simple to create and understand.
Cons: Limited in capturing topic coherence and hierarchy, may lack depth in insights.
BERTopic:

Pros: Captures semantic relationships effectively and yields consistent outcomes.
Cons: Demands greater computational resources and may struggle with scalability.


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms and did the implementations helped in grasping the nuances of feature extraction from text data.

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# It was a worthwhile learning experience to work with text data and extract features using topic modeling methods. My understanding of how to find latent subjects in the text to help with feature extraction has improved since I started using LDA. Preprocessing the text and adjusting the settings to get the best results were challenges. With its insights into textual data categorization and analysis—a critical skill in domains such as information retrieval, document classification, and sentiment analysis—this exercise is extremely pertinent to natural language processing (NLP).





'''