## Lab 8: Unsupervised Machine Learning - Topic extraction example
**Note:** This practicum is graded. Complete the exercises and turn in the assignment here under "Lab 8: Keyword Extraction using Clustering and Topic Modeling Techniques" (https://utexas.instructure.com/courses/1382133/assignments/6627284) by end of today (03/08)

### Introduction:

Topic extraction from text is a fundamental natural language processing (NLP) task that involves automatically identifying and categorizing the main themes or subjects present in a collection of textual documents. This unsupervised method is motivated by the need to efficiently organize and summarize large volumes of text data, making it more manageable and informative. By automatically uncovering latent topics within the text, topic extraction helps in various applications, including document clustering, recommendation systems, content summarization, and content understanding.

**Example:**

Consider a news website that publishes articles on various topics like politics, sports, technology, and entertainment. Using unsupervised topic extraction, we can automatically categorize each article into its respective topic without needing manual labels. This allows the website to:

- Organize articles on its homepage by topics, making it easier for users to navigate and find articles of interest.
- Recommend related articles to readers based on their past reading history, leveraging the identified topics.
- Perform sentiment analysis on each topic to gauge public sentiment on current events or issues.

## 1. Using topic models such as LDA

The Gensim library provides a widely used implementation of Latent Dirichlet Allocation (LDA) for topic modeling. LDA is a probabilistic generative model that assumes each document is a mixture of topics, and each word in a document is attributable to one of the document's topics. It iteratively uncovers these topics from the given text corpus.
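
To make the generative assumption concrete, here is a minimal sketch of how LDA imagines a document being produced: for each word position, first draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The topics, words, and probabilities below are made up for illustration, and `generate_document` is a hypothetical helper, not part of Gensim. Fitting LDA is the reverse problem: recovering such distributions from observed documents.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "technology": {"camera": 0.5, "quantum": 0.3, "algorithm": 0.2},
    "health": {"vaccine": 0.6, "diet": 0.4},
}


def generate_document(topic_mixture, n_words, rng):
    """Sample (word, topic) pairs the way LDA assumes documents are produced."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic for this word from the document's mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        word = rng.choices(vocab, weights=word_weights, k=1)[0]
        words.append((word, topic))
    return words


rng = random.Random(0)
# A document assumed to be 80% technology and 20% health.
doc = generate_document({"technology": 0.8, "health": 0.2}, n_words=8, rng=rng)
for word, topic in doc:
    print(f"{word:10s} <- {topic}")
```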

In [1]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


# Sample text data (you can replace this with your own text data)
text_data = [
    # Technology
    "The latest smartphone model boasts a revolutionary camera system that enhances low-light photography.",
    "Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations.",
    "Quantum computing holds the promise of solving complex problems exponentially faster than classical computers.",

    # Environment
    "Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities.",
    "Renewable energy sources such as solar and wind power are crucial for reducing carbon emissions and combating climate change.",
    "Plastic pollution in oceans is a pressing environmental issue, with millions of marine animals suffering from ingestion or entanglement.",

    # Health
    "Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity.",
    "Mental health awareness initiatives aim to reduce stigma and promote access to support services for individuals struggling with psychological disorders.",
    "Regular exercise and a balanced diet are key components of maintaining a healthy lifestyle and preventing chronic illnesses like heart disease and diabetes."
]

# Define preprocessing function
def preprocess(text):
  # Remove punctuation and convert to lowercase
  text = re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())
  # Tokenize the text
  tokens = word_tokenize(text)
  # Remove stopwords (build the set once instead of reloading the list for every token)
  stop_words = set(stopwords.words("english"))
  tokens = [word for word in tokens if word not in stop_words]
  # Lemmatize words
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]
  return tokens

# Preprocess the text data
processed_data = [preprocess(text) for text in text_data]

# Create a Gensim dictionary from the processed data
dictionary = corpora.Dictionary(processed_data)

# Create a corpus
corpus = [dictionary.doc2bow(text) for text in processed_data]

# Build the LDA model
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

# Print the topics and associated keywords
for topic_id, topic_keywords in lda_model.print_topics():
  print(f"Topic {topic_id}: {topic_keywords}")


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Topic 0: 0.020*"routine" + 0.020*"vaccination" + 0.020*"essential" + 0.020*"achieving" + 0.020*"immunity" + 0.020*"herd" + 0.020*"spread" + 0.020*"artificial" + 0.020*"campaign" + 0.020*"infectious"
Topic 1: 0.018*"crucial" + 0.018*"change" + 0.018*"access" + 0.018*"solar" + 0.018*"energy" + 0.018*"combating" + 0.018*"reducing" + 0.018*"renewable" + 0.018*"climate" + 0.018*"initiative"
Topic 2: 0.018*"preventing" + 0.018*"disease" + 0.018*"regular" + 0.018*"diabetes" + 0.018*"lifestyle" + 0.018*"heart" + 0.018*"boast" + 0.018*"enhances" + 0.018*"component" + 0.018*"revolutionary"
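
The `corpora.Dictionary` and `doc2bow` steps in the cell above turn each token list into a bag of words: a list of `(token_id, count)` pairs. The sketch below mimics that idea in plain Python so the data structure is easy to inspect. `build_vocabulary` and `to_bag_of_words` are hypothetical helpers, and the ids are assigned in first-seen order here, which may differ from the ids Gensim assigns.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Two tiny, already-preprocessed documents (made up for illustration).
docs = [
    ["renewable", "energy", "solar", "energy"],
    ["solar", "power", "climate"],
]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation; only the counts per document reach the LDA model.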


## 2. By applying K-means Clustering on GloVe vectors

- This method combines pre-trained word embeddings, such as GloVe, with K-means clustering to extract topics based on word similarity and clustering.
- It represents words as vectors in a high-dimensional space and groups similar words into clusters, which can be interpreted as topics.

**Example:** In the code below, we use GloVe embeddings and K-means clustering to identify keywords and topics within a set of text documents. This method leverages the semantic similarity of words to form topic clusters.
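
Before running this on real embeddings, it may help to see the clustering step in isolation. The sketch below is a plain-Python version of Lloyd's algorithm (the basic K-means loop) on made-up 2-D vectors standing in for word embeddings. The vectors, the `kmeans` helper, and the hand-picked initial centroids are all assumptions for illustration; scikit-learn's `KMeans` chooses its own initialization and works on the full 100-dimensional GloVe vectors.

```python
import math

# Made-up 2-D "embeddings"; the GloVe vectors used in this lab have 100 dimensions.
toy_vectors = {
    "solar": (0.9, 0.1), "wind": (0.8, 0.2), "climate": (0.85, 0.15),
    "vaccine": (0.1, 0.9), "disease": (0.2, 0.8), "diet": (0.15, 0.85),
}


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        centroids = new_centroids
    return labels, centroids


words = list(toy_vectors)
points = [toy_vectors[w] for w in words]
# Deterministic starting centroids chosen by hand for this sketch.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
```

Note that `words` and `points` are built from the same dictionary, so every label lines up with the word it belongs to.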

In [2]:
import gensim.downloader as api
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Load pre-trained GloVe word vectors (you can choose a different word vector model if you prefer)
# We choose the 100-dimensional vectors
# gensim's downloader fetches the model for us, which is much easier than downloading the vectors manually
glove_model = api.load("glove-wiki-gigaword-100")

In [3]:
# Sample text data (you can replace this with your own text data)
text_data = [
    # Technology
    "The latest smartphone model boasts a revolutionary camera system that enhances low-light photography.",
    "Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations.",
    "Quantum computing holds the promise of solving complex problems exponentially faster than classical computers.",

    # Environment
    "Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities.",
    "Renewable energy sources such as solar and wind power are crucial for reducing carbon emissions and combating climate change.",
    "Plastic pollution in oceans is a pressing environmental issue, with millions of marine animals suffering from ingestion or entanglement.",

    # Health
    "Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity.",
    "Mental health awareness initiatives aim to reduce stigma and promote access to support services for individuals struggling with psychological disorders.",
    "Regular exercise and a balanced diet are key components of maintaining a healthy lifestyle and preventing chronic illnesses like heart disease and diabetes."
]

# Define preprocessing function
def preprocess(text):
  # Remove punctuation and convert to lowercase
  text = text.lower()
  text = "".join([char for char in text if char not in string.punctuation])

  # Tokenize the text
  tokens = word_tokenize(text)

  # Remove stopwords and lemmatize words
  stop_words = set(stopwords.words("english"))
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
  return tokens

# Preprocess the text data and store only unique words
processed_data = [preprocess(text) for text in text_data]

# Maintain a list of id_to_word
token_list = []
for doc in processed_data:
  for token in doc:
    token_list.append(token)

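# Keep only unique tokens that have a GloVe vector, so that token_list stays
# aligned with the rows of word_vectors (and with kmeans.labels_) below
token_list = [token for token in set(token_list) if token in glove_model]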

# Create a list of word vectors for each token in the text
word_vectors = []
for token in token_list:
  if token in glove_model:
    word_vectors.append(glove_model[token])

# Convert the list of word vectors to a numpy array because scikit-learn's kmeans accepts numpy array
word_vectors = np.array(word_vectors)

# Let's pick K = 3 for the K-means algorithm
# This means we are assuming that three broad topics are covered in the data
k_clusters = 3

# Perform K-means clustering
kmeans = KMeans(n_clusters=k_clusters, random_state=0, n_init=10).fit(word_vectors)

# Map words to their corresponding clusters
word_clusters = {}
for word, cluster_label in zip(token_list, kmeans.labels_):
    if cluster_label not in word_clusters:
        word_clusters[cluster_label] = []
    word_clusters[cluster_label].append(word)

# Print the first 5 words in each cluster
for cluster_label, cluster_words in word_clusters.items():
    print(f"Cluster {cluster_label}: {', '.join(cluster_words[:5])} (total words: {len(cluster_words)})")

Cluster 1: struggling, ocean, intelligence, like, system (total words: 63)
Cluster 2: exponentially, renewable, model, quantum, photography (total words: 24)
Cluster 0: chronic, suffering, stigma, diabetes, deforestation (total words: 18)


## Exercise E1. Analyze the topics of a website

1. Pick a website of your choice that has a significant amount of text (e.g., Wikipedia articles or blog pages). If you are multilingual, you can and should select a page in a non-English language that you speak.

2. Copy-paste the content of the page to a `txt` file and proceed with step 3. Alternatively, if you are fluent with web-scraping, you can scrape the website automatically and store the content in a file.

3. Read the file, perform `sentence tokenization` using NLTK to obtain a list of sentences.

4. Perform topic analysis with both the LDA and the clustering-based methods by reusing the code above. Play with different parameters (number of topics / clusters, number of words per topic).

5. Write down your observations and summary.

**Optional:** Can you plot word clouds of topics based on the scores returned by the LDA / clustering methods? Do the word clouds look interesting?
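
For the optional word clouds, the per-word scores first need to be available as word-to-weight pairs. One possible starting point, sketched below, is to parse the strings that `print_topics()` returns. `parse_topic_string` is a hypothetical helper, and the example string is abbreviated from the Topic 1 output shown earlier. Gensim's `show_topic()` also returns `(word, probability)` pairs directly, which avoids parsing altogether.

```python
import re


def parse_topic_string(topic_string):
    """Turn '0.018*"solar" + 0.018*"energy"' into a {word: weight} dict."""
    # Each term looks like <weight>*"<word>"; scientific notation is not handled.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return {word: float(weight) for weight, word in pairs}


example = '0.018*"solar" + 0.018*"energy" + 0.018*"climate"'
weights = parse_topic_string(example)
print(weights)
```

A dictionary in this shape is what frequency-based word cloud tools typically expect as input.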

In [35]:
# Source page: https://en.wikipedia.org/wiki/Force_carrier

# Part 2: read the saved page text, one stripped line per list entry
with open("file.txt") as f:
    sentences = [line.strip() for line in f]
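
The cell above treats every line of `file.txt` as one unit, which matches sentences only when the file happens to hold one sentence per line. Step 3 asks for sentence tokenization with NLTK (`nltk.tokenize.sent_tokenize`). To stay dependency-free, the sketch below swaps in a naive regex splitter instead; `split_sentences` is a hypothetical helper and the sample text is made up. It will mis-split abbreviations such as "e.g.", which is one reason to prefer NLTK's tokenizer on real pages.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    # Collapse line breaks first so sentences wrapped across lines stay whole.
    text = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]


sample_text = "Topic models find themes. They work on\nraw text. Do they need labels?"
sample_sentences = split_sentences(sample_text)
for sentence in sample_sentences:
    print(sentence)
```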

In [38]:
# Define preprocessing function
def preprocess(text):
  # Remove punctuation and convert to lowercase
  text = re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())
  # Tokenize the text
  tokens = word_tokenize(text)
  # Remove stopwords (build the set once instead of reloading the list for every token)
  stop_words = set(stopwords.words("english"))
  tokens = [word for word in tokens if word not in stop_words]
  # Lemmatize words
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]
  return tokens

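# Preprocess the text data
# NOTE: this still runs on the sample `text_data` from the earlier cells, which is what
# the output below reflects; pass `sentences` instead to analyze the page read from file.txt
processed_data = [preprocess(text) for text in text_data]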

# Create a Gensim dictionary from the processed data
dictionary = corpora.Dictionary(processed_data)

# Create a corpus
corpus = [dictionary.doc2bow(text) for text in processed_data]

# Build the LDA model with 6 topics
lda_model = LdaModel(corpus, num_topics=6, id2word=dictionary, passes=15)

# Print the topics and associated keywords
for topic_id, topic_keywords in lda_model.print_topics():
  print(f"Topic {topic_id}: {topic_keywords}")

Topic 0: 0.029*"pollution" + 0.029*"entanglement" + 0.029*"environmental" + 0.029*"issue" + 0.029*"million" + 0.029*"computer" + 0.029*"ocean" + 0.029*"marine" + 0.029*"faster" + 0.029*"pressing"
Topic 1: 0.036*"disorder" + 0.036*"aim" + 0.036*"struggling" + 0.036*"stigma" + 0.036*"mental" + 0.036*"support" + 0.036*"promote" + 0.036*"service" + 0.036*"health" + 0.036*"psychological"
Topic 2: 0.029*"source" + 0.029*"energy" + 0.029*"change" + 0.029*"crucial" + 0.029*"emission" + 0.029*"solar" + 0.029*"power" + 0.029*"climate" + 0.029*"renewable" + 0.029*"wind"
Topic 3: 0.027*"chronic" + 0.027*"key" + 0.027*"healthy" + 0.027*"lifestyle" + 0.027*"component" + 0.027*"regular" + 0.027*"exercise" + 0.027*"diabetes" + 0.027*"heart" + 0.027*"balanced"
Topic 4: 0.042*"photography" + 0.042*"model" + 0.042*"smartphone" + 0.042*"revolutionary" + 0.042*"boast" + 0.042*"enhances" + 0.042*"camera" + 0.042*"lowlight" + 0.042*"latest" + 0.042*"system"
Topic 5: 0.042*"disease" + 0.042*"preventing" + 0.0

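Increasing the number of topics beyond a certain point does not seem to change the results much. With only nine short sentences in the data, six topics already come close to one topic per sentence, so this is likely a case of diminishing returns.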

In [39]:
# Define preprocessing function
def preprocess(text):
  # Remove punctuation and convert to lowercase
  text = text.lower()
  text = "".join([char for char in text if char not in string.punctuation])

  # Tokenize the text
  tokens = word_tokenize(text)

  # Remove stopwords and lemmatize words
  stop_words = set(stopwords.words("english"))
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
  return tokens

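# Preprocess the text data and store only unique words
# NOTE: this still runs on the sample `text_data` from the earlier cells, which is what
# the output below reflects; pass `sentences` instead to analyze the page read from file.txt
processed_data = [preprocess(text) for text in text_data]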

# Maintain a list of id_to_word
token_list = []
for doc in processed_data:
  for token in doc:
    token_list.append(token)

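# Keep only unique tokens that have a GloVe vector, so that token_list stays
# aligned with the rows of word_vectors (and with kmeans.labels_) below
token_list = [token for token in set(token_list) if token in glove_model]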

# Create a list of word vectors for each token in the text
word_vectors = []
for token in token_list:
  if token in glove_model:
    word_vectors.append(glove_model[token])

# Convert the list of word vectors to a numpy array because scikit-learn's kmeans accepts numpy array
word_vectors = np.array(word_vectors)

# Let's pick K = 7 for the K-means algorithm
# This means we are assuming that seven broad topics are covered in the data
k_clusters = 7

# Perform K-means clustering
kmeans = KMeans(n_clusters=k_clusters, random_state=0, n_init=10).fit(word_vectors)

# Map words to their corresponding clusters
word_clusters = {}
for word, cluster_label in zip(token_list, kmeans.labels_):
    if cluster_label not in word_clusters:
        word_clusters[cluster_label] = []
    word_clusters[cluster_label].append(word)

# Print the first 5 words in each cluster
for cluster_label, cluster_words in word_clusters.items():
    print(f"Cluster {cluster_label}: {', '.join(cluster_words[:5])} (total words: {len(cluster_words)})")

Cluster 3: struggling, intelligence, like, system, complex (total words: 30)
Cluster 6: ocean, marine, climate, wind, amazon (total words: 8)
Cluster 1: exponentially, pose, streamlining, combating, reshaping (total words: 15)
Cluster 2: renewable, deforestation, energy, pollution, reduce (total words: 9)
Cluster 0: chronic, suffering, stigma, diabetes, disorder (total words: 18)
Cluster 5: model, quantum, photography, smartphone, lifestyle (total words: 12)
Cluster 4: promote, million, support, health, awareness (total words: 13)


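With seven clusters the results are mixed. Some clusters are coherent: Cluster 6 (ocean, marine, climate, wind, amazon) is clearly environmental, and Cluster 0 (chronic, suffering, stigma, diabetes, disorder) is health-related. Others are harder to interpret, such as Cluster 3 (struggling, intelligence, like, system, complex), and Cluster 5 mixes technology words with "lifestyle". Overall, I think the method does a reasonable job of grouping related words together.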