## LDA & LSA

**Using the above data, analyze:**
1. TFIDF and TruncatedSVD
    - Identify the top 4 topics
    - Find the most predominant topic for each document
2. CountVectorizer and LatentDirichletAllocation
    - From each topic, extract the top features
    - Name the top n topics
    - Label each document with the most predominant topic


**Classification using TFIDF vectors**
- Explore and predict the sentiment using https://www.cs.cornell.edu/people/pabo/movie-review-data
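
A minimal sketch of the classification task, assuming scikit-learn is available. The six reviews and their labels are made-up placeholders rather than the Cornell data, and `LogisticRegression` is only one reasonable choice of classifier; `sentiment_clf` and the other names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the Cornell movie-review files
# (1 = positive, 0 = negative); replace with the real data once downloaded.
train_texts = [
    "a wonderful, moving film with great acting",
    "brilliant story and awesome performances",
    "I loved every minute of this great movie",
    "a dull, boring plot and terrible acting",
    "awful script, I hated this boring movie",
    "terrible pacing and a dull ending",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the TF-IDF vocabulary and the classifier in one step,
# so the same vectorizer is reused automatically at prediction time.
sentiment_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
sentiment_clf.fit(train_texts, train_labels)

new_reviews = ["great acting and a wonderful story", "boring and terrible"]
predicted = sentiment_clf.predict(new_reviews)
probabilities = sentiment_clf.predict_proba(new_reviews)
print(predicted)
print(probabilities.round(2))
```

With the real dataset, a held-out test split (for example via `train_test_split`) would be needed before reporting accuracy.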

**Word Embedding**
- Create a count-based word embedding
- Create a one-hot embedding
- Word embedding using Gensim
    - Create a word2vec model using text from Wikipedia: https://en.wikipedia.org/wiki/Machine_learning
    - What preprocessing steps are possible?
    - How should the text be tokenized?
    - What is the trained vocabulary?
    - How should words not in the vocabulary be treated?
    - What is the length of each vector?
- Interpret the output of the following
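
A minimal sketch of the one-hot and count-based items from the list above, in plain NumPy. The three-sentence corpus, the whitespace tokenizer, the window size of 1, and the choice to map unknown words to an all-zero vector are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy corpus; the whitespace tokenizer is a simplifying assumption.
corpus = [
    "machine learning is awesome",
    "the movie was awesome",
    "machine learning is technology",
]
tokenized = [sentence.lower().split() for sentence in corpus]

# Vocabulary: one index per distinct token, in sorted order.
vocab = sorted({token for sentence in tokenized for token in sentence})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector; unknown words map to the all-zero vector."""
    vector = np.zeros(len(vocab), dtype=int)
    if word in word_to_index:
        vector[word_to_index[word]] = 1
    return vector

# Count-based embedding: co-occurrence counts within a +/-1 token window.
window = 1
cooccurrence = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in tokenized:
    for pos, word in enumerate(sentence):
        start = max(0, pos - window)
        end = min(len(sentence), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                row = word_to_index[word]
                col = word_to_index[sentence[ctx_pos]]
                cooccurrence[row, col] += 1

print(vocab)
print(one_hot("movie"))
print(cooccurrence[word_to_index["learning"]])
```

Each row of `cooccurrence` can serve as a count-based vector for the corresponding word; its length equals the vocabulary size, unlike word2vec, where the vector length is a chosen hyperparameter.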

In [1]:
my_docs = ["The economic slowdown is becoming more severe",
           "The movie was simply awesome",
           "I like cooking my own food",
           "Samsung is announcing a new technology",
           "Machine Learning is an example of awesome technology",
           "All of us were excited at the movie",
           "We have to do more to reverse the economic slowdown"]

In [32]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

## TFIDF

In [24]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(my_docs)

In [25]:
tfidf_matrix.shape
# (m, n) = (No of documents, No of features)

(7, 18)

In [26]:
# View the features
features = tfidf.get_feature_names_out()
features

array(['announcing', 'awesome', 'cooking', 'economic', 'example',
       'excited', 'food', 'learning', 'like', 'machine', 'movie', 'new',
       'reverse', 'samsung', 'severe', 'simply', 'slowdown', 'technology'],
      dtype=object)
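
To see where the weights in `tfidf_matrix` come from, the sketch below recomputes one row by hand using scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The token lists are written out manually to mirror the 18 features shown above, which is an assumption about what the vectorizer keeps after stop-word removal; `smooth_idf` and `tfidf_row` are illustrative helper names.

```python
import math

# Token lists written by hand to mirror what TfidfVectorizer(stop_words='english')
# keeps for the seven documents (an assumption based on the 18 features above).
docs_tokens = [
    ["economic", "slowdown", "severe"],
    ["movie", "simply", "awesome"],
    ["like", "cooking", "food"],
    ["samsung", "announcing", "new", "technology"],
    ["machine", "learning", "example", "awesome", "technology"],
    ["excited", "movie"],
    ["reverse", "economic", "slowdown"],
]
n_docs = len(docs_tokens)
vocab = sorted({term for doc in docs_tokens for term in doc})

# Document frequency: number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in docs_tokens) for term in vocab}

def smooth_idf(term):
    # scikit-learn default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    # Raw term count times idf, then scaled to unit L2 length (norm='l2').
    weights = {term: doc.count(term) * smooth_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(docs_tokens[0])
for term in sorted(row):
    print(f"{term}: {row[term]:.3f}")
```

In the first document, "severe" gets a higher weight than "economic" or "slowdown" because it occurs in only one document, while the other two also appear in the last document.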

## TruncatedSVD

In [36]:
# Apply TruncatedSVD (LSA) to the TF-IDF matrix
n_topics = 4
lsa_model = TruncatedSVD(n_components=n_topics, random_state=42)  # fixed seed for reproducibility
lsa_topic_matrix = lsa_model.fit_transform(tfidf_matrix)
# Scale each document's topic vector to unit L2 length
lsa_topic_matrix = normalize(lsa_topic_matrix, norm='l2')

# Identify the top 4 topics by listing the 4 highest-weighted words of each component
n_top_words = 4
feature_names = tfidf.get_feature_names_out()
top_topics = []
for topic_idx in range(n_topics):
    top_words_indices = lsa_model.components_[topic_idx].argsort()[::-1][:n_top_words]
    top_words = [feature_names[index] for index in top_words_indices]
    top_topics.append(top_words)

print("Top 4 topics:")
for i, topic_words in enumerate(top_topics):
    print(f"Topic {i + 1}: {', '.join(topic_words)}")

Top 4 topics:
Topic 1: economic, slowdown, severe, reverse
Topic 2: movie, awesome, simply, excited
Topic 3: technology, samsung, new, announcing
Topic 4: like, cooking, food, excited
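
As a rough illustration of what `TruncatedSVD` computes, the sketch below performs the same rank-k truncation with `np.linalg.svd`. The 4 x 5 matrix `X` is hypothetical and its values are illustrative only; the results match scikit-learn's only up to the sign of each component.

```python
import numpy as np

# Hypothetical 4-document x 5-term weight matrix (values are illustrative only).
X = np.array([
    [0.7, 0.7, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.7],
])
k = 2  # number of latent topics to keep

# Full SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

doc_topic = U[:, :k] * S[:k]        # analogue of TruncatedSVD.fit_transform(X)
topic_term = Vt[:k]                 # analogue of TruncatedSVD.components_
X_approx = doc_topic @ topic_term   # rank-k approximation of X

# Share of the squared singular values captured by the k topics.
retained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(doc_topic.shape, topic_term.shape, round(float(retained), 3))
```

The squared reconstruction error `np.linalg.norm(X - X_approx) ** 2` equals the sum of the discarded squared singular values, which is one way to judge whether `k` is large enough.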


In [35]:
# Find the most predominant topic for each document.
# Note: LSA loadings can be negative; argmax picks the largest signed value.
predominant_topics = lsa_topic_matrix.argmax(axis=1)

print("\nMost Predominant Topic for Each Document:")
for doc_idx, topic_idx in enumerate(predominant_topics):
    print(f"Document {doc_idx + 1}: Topic {topic_idx + 1}")


Most Predominant Topic for Each Document:
Document 1: Topic 1
Document 2: Topic 2
Document 3: Topic 4
Document 4: Topic 3
Document 5: Topic 3
Document 6: Topic 2
Document 7: Topic 1
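
Because SVD loadings can be negative, taking `argmax` over signed values and over magnitudes may pick different topics. The small sketch below shows the difference; the `loadings` array is hypothetical and not taken from the notebook's output.

```python
import numpy as np

# Hypothetical document-topic loadings from an SVD; note the negative entries.
loadings = np.array([
    [0.10, -0.90],
    [0.80, 0.20],
    [-0.70, 0.30],
])

by_signed_value = loadings.argmax(axis=1)       # largest signed loading
by_magnitude = np.abs(loadings).argmax(axis=1)  # strongest loading, either sign

print(by_signed_value)  # [0 0 1]
print(by_magnitude)     # [1 0 0]
```

Which rule is appropriate depends on whether a strongly negative loading should count as association with a topic; that is a modelling decision rather than something the library decides.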


## LDA

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# CountVectorizer (default settings, so stop words such as "the" and "is" are kept)
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(my_docs)

# LatentDirichletAllocation
n_topics = 4
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_topic_matrix = lda_model.fit_transform(count_matrix)

# Extract the top 5 features from each topic
n_top_features = 5
count_feature_names = count_vectorizer.get_feature_names_out()
top_features = []
for topic_idx, topic in enumerate(lda_model.components_):
    top_feature_indices = topic.argsort()[::-1][:n_top_features]
    top_feature_names = [count_feature_names[index] for index in top_feature_indices]
    top_features.append(top_feature_names)

print("Top features for each topic:")
# Use a distinct loop variable so the earlier `features` array is not overwritten
for i, topic_features in enumerate(top_features):
    print(f"Topic {i + 1}: {', '.join(topic_features)}")

# Name the top n topics, ranked by each topic's maximum weight across documents
n_top_topics = 2
top_topics_indices = np.argsort(np.max(lda_topic_matrix, axis=0))[::-1][:n_top_topics]
top_topic_names = [f"Topic {index + 1}" for index in top_topics_indices]

print(f"\nTop {n_top_topics} topics: {', '.join(top_topic_names)}")

# Label each document with the most predominant topic
predominant_topics = lda_topic_matrix.argmax(axis=1)

print("\nMost Predominant Topic for Each Document:")
for doc_idx, topic_idx in enumerate(predominant_topics):
    print(f"Document {doc_idx + 1}: Topic {topic_idx + 1}")

Top features for each topic:
Topic 1: to, more, slowdown, economic, the
Topic 2: technology, awesome, is, of, learning
Topic 3: the, movie, were, at, excited
Topic 4: is, becoming, severe, like, food

Top 2 topics: Topic 1, Topic 3

Most Predominant Topic for Each Document:
Document 1: Topic 4
Document 2: Topic 3
Document 3: Topic 4
Document 4: Topic 4
Document 5: Topic 2
Document 6: Topic 3
Document 7: Topic 1
