# Session 15 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 114. Gensim 
Gensim is an open-source Python library for topic modeling, document similarity analysis, and other natural language processing (NLP) tasks. It processes large text collections with streaming and incremental algorithms, so a corpus does not have to fit in memory.

***

# 115. Important Features of Gensim
- Topic Modeling: Extracts topics from documents using algorithms like Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA/LSI), and Hierarchical Dirichlet Process (HDP).
- Word Embeddings: Implements Word2Vec, FastText, and Doc2Vec for word and document vector representations.
- Document Similarity: Computes similarity between documents using TF-IDF, BM25, and Word Mover’s Distance (WMD).
- Efficient & Scalable: Works well with large datasets using memory-efficient streaming.
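- Preprocessing Tools: Includes tokenization and stopword removal. Gensim 4.x no longer ships a lemmatizer, so lemmatization is typically done with an external library such as spaCy or NLTK.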

***

# 116. Core Components & Usage

***

## 116-1. Preprocessing Text
Gensim provides tools to clean and prepare text data:

In [None]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS

text = "Gensim is a powerful library for NLP and topic modeling."

# Tokenization & lowercase conversion
tokens = simple_preprocess(text)
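# simple_preprocess drops tokens shorter than 2 characters ("a") but keeps stopwords ("is", "for", "and")
print(tokens)  # Output: ['gensim', 'is', 'powerful', 'library', 'for', 'nlp', 'and', 'topic', 'modeling']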

# Remove stopwords
filtered_text = remove_stopwords(text)
print(filtered_text)  # Output: "Gensim powerful library NLP topic modeling."

***

## 116-2. Creating a Dictionary & Corpus
Before topic modeling, we convert text into numerical representations:

In [None]:
from gensim import corpora

documents = [
    "Gensim is great for NLP tasks.",
    "Topic modeling is useful for text analysis.",
    "Gensim supports Word2Vec and LDA."
]

# Tokenize and create a dictionary
tokenized_docs = [simple_preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenized_docs)

# Create a Bag-of-Words (BoW) corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

print("Dictionary:", dictionary.token2id)
print("Corpus (BoW):", corpus)

***

## 116-3. Topic Modeling with LDA

In [None]:
from gensim.models import LdaModel

# Train LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    random_state=42,
    passes=10
)

# Print topics
print(lda_model.print_topics())

***

## 116-4. Word Embeddings with Word2Vec

In [None]:
from gensim.models import Word2Vec

sentences = [
    ["gensim", "is", "great", "for", "NLP"],
    ["word2vec", "is", "used", "for", "embeddings"],
    ["topic", "modeling", "is", "useful"]
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar_words = model.wv.most_similar("gensim")
print(similar_words)

***

## 116-5. Document Similarity with TF-IDF

In [None]:
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# Train TF-IDF model
tfidf = TfidfModel(corpus)

# Convert corpus to TF-IDF vectors
tfidf_corpus = tfidf[corpus]

# Index for similarity search
index = SparseMatrixSimilarity(tfidf_corpus, num_features=len(dictionary))

# Query a new document
query = "NLP and topic modeling"
query_bow = dictionary.doc2bow(simple_preprocess(query))
query_tfidf = tfidf[query_bow]

# Find similar documents
similarities = index[query_tfidf]
print(list(enumerate(similarities)))

***

# 117. Real-World Applications

***

## 117-1. News Article Classification

In [None]:
# Train LDA on news dataset
# Assign topics to new articles
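new_doc = ["economy", "grows", "5%"]
new_bow = dictionary.doc2bow(new_doc)  # tokens missing from the dictionary are silently dropped
topics = lda_model[new_bow]  # list of (topic_id, probability) pairs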

***

## 117-2. Recommender Systems

In [None]:
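# Use Doc2Vec to find similar products; product_descriptions is assumed to map product name -> description text
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(simple_preprocess(text), [name]) for name, text in product_descriptions.items()]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
sims = model.dv.most_similar("iphone 13")  # "iphone 13" must be one of the training tags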

***

# 118. Advanced Features

***

## 118-1. Streaming Large Datasets

In [None]:
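from gensim.corpora import WikiCorpus

# The `lemmatize` argument is no longer supported in Gensim 4.x, so it is omitted here.
# Passing dictionary={} skips the initial vocabulary-building pass over the dump.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
for text in wiki.get_texts():
    process(text)  # placeholder for your own handler; receives one tokenized article at a time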

***

## 118-2. Multicore Training

In [None]:
from gensim.models import Word2Vec
from multiprocessing import cpu_count

# `workers` sets the number of training threads on a single machine
model = Word2Vec(sentences, workers=cpu_count())

***

# 119. Optimization & Performance Tuning

***

## 119-1. Memory Efficiency

In [None]:
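# Save memory by streaming the corpus from disk
class MyCorpus:
    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated multiple times
        with open('bigdata.txt', encoding='utf-8') as f:
            for line in f:
                yield simple_preprocess(line)  # one tokenized document per line

corpus = MyCorpus()  # Doesn’t load full data into RAM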

***

## 119-2. Model Evaluation

In [None]:
from gensim.models import CoherenceModel

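coherence = CoherenceModel(
    model=lda_model,        # LDA model trained in 116-3
    texts=tokenized_docs,   # c_v needs the tokenized texts, not the BoW corpus
    dictionary=dictionary,
    coherence='c_v'
)
print(coherence.get_coherence())  # Higher is generally better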

***

# 120. Troubleshooting & Best Practices

***

## 120-1. Common Issues

***

## 120-2. Hyperparameter Tuning

| Model | Parameter | Typical Values |
|-------|-----------|----------------|
| LDA | num_topics | 10-100 |
| LDA | passes | 10-50 |
| Word2Vec | vector_size | 100-300 |
| Word2Vec | window | 5-10 |
| FastText | min_n (shortest character n-gram) | 3 |
| FastText | max_n (longest character n-gram) | 6 |

***

# Some Exercises

**1.**  Clean and tokenize raw text data
- Remove punctuation and stopwords
- Apply lemmatization (for example with spaCy or NLTK; Gensim 4.x no longer bundles a lemmatizer)
- Detect and merge bigrams automatically

***

**2.** Handle a large text file without loading into RAM
- Implement a Python generator that yields one preprocessed document at a time
- Build a dictionary from the streaming corpus

***

**3.**  Discover hidden topics in news articles
- Train an LDA model on 20 Newsgroups dataset
- Compute topic coherence score (c_v)
- Visualize topics with pyLDAvis

***

**4.**  Train and evaluate word embeddings
- Train Skip-gram and CBOW models on Wikipedia text
- Find most similar words to "algorithm"
- Solve analogies: "king - man + woman = ?"

***

**5.** Build a news article recommender
- Convert articles to TF-IDF vectors
- Implement similarity search using SparseMatrixSimilarity
- Query with "climate change policy" and return top 3 matches

***

**6.** Scale training across CPU cores
- Configure Word2Vec to use all available CPU cores
- Monitor memory usage during training
- Compare speed vs single-core training

***

**7.** Tune hyperparameters for best results
- Grid search over num_topics (5,10,20) and passes (5,10,20)
- Evaluate using topic coherence scores
- Implement memory-efficient online LDA (update_every)

***

**8.** Create a Flask endpoint for topic prediction
- Save trained LDA model to disk
- Build a Flask API that:
  - Accepts raw text input
  - Returns predicted topics
- Test with cURL or Postman

***

#                                                        🌞 https://github.com/AI-Planet 🌞