# Week 6, Day 4: Topic Modeling and Text Summarization

## Learning Objectives
- Understand topic modeling concepts
- Learn text summarization techniques
- Master document clustering
- Practice implementing topic models

## Topics Covered
1. Topic Modeling
2. Text Summarization
3. Document Clustering
4. Model Evaluation

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.cluster import KMeans
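import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import gensim
from gensim import corpora, models

# Download the tokenizer data needed by sent_tokenize and word_tokenize
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)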

## 1. Topic Modeling

In [None]:
def topic_modeling_example():
    # Sample documents
    documents = [
        "Machine learning algorithms help computers learn from data.",
        "Deep learning models have revolutionized AI applications.",
        "Neural networks are inspired by biological brains.",
        "Data science combines statistics and programming.",
        "Python is a popular programming language for AI.",
        "Statistical methods are fundamental to data analysis.",
        "Natural language processing helps computers understand text.",
        "Computer vision systems can recognize images and video.",
        "Reinforcement learning enables autonomous decision making.",
        "Big data analytics requires efficient processing systems."
    ]
    
    # Create document-term matrix
    vectorizer = CountVectorizer(max_features=1000, stop_words='english')
    doc_term_matrix = vectorizer.fit_transform(documents)
    
    # Apply LDA
    n_topics = 3
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    doc_topics = lda.fit_transform(doc_term_matrix)
    
    # Get feature names
    feature_names = vectorizer.get_feature_names_out()
    
    # Print top words for each topic
    n_top_words = 5
    for topic_idx, topic in enumerate(lda.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]
        print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")
    
    # Visualize topic distribution
    plt.figure(figsize=(12, 5))
    
    # Document-Topic Distribution
    plt.subplot(121)
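    plt.imshow(doc_topics, aspect='auto', cmap='YlOrRd')
    plt.title('Document-Topic Distribution')
    plt.xlabel('Topic')
    plt.ylabel('Document')
    # Label topics 1..n_topics so the heatmap matches the printed topic numbers
    plt.xticks(range(n_topics), [str(t) for t in range(1, n_topics + 1)])
    plt.colorbar()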
    
    # Topic Proportions
    plt.subplot(122)
    topic_proportions = doc_topics.mean(axis=0)
    plt.bar(range(1, n_topics + 1), topic_proportions)
    plt.title('Topic Proportions')
    plt.xlabel('Topic')
    plt.ylabel('Proportion')
    
    plt.tight_layout()
    plt.show()

topic_modeling_example()

## 2. Text Summarization

In [None]:
def text_summarization_example():
    # Sample text
    text = """
    Artificial Intelligence (AI) has transformed various industries in recent years.
    Machine learning algorithms enable computers to learn from data and improve their performance.
    Deep learning, a subset of machine learning, has achieved remarkable results in tasks like image recognition and natural language processing.
    Neural networks, inspired by biological brains, form the basis of deep learning systems.
    These networks can automatically learn hierarchical representations of data.
    Applications of AI include autonomous vehicles, medical diagnosis, and personal assistants.
    However, challenges remain in areas such as ethics, bias, and transparency.
    Researchers continue to work on making AI systems more reliable and interpretable.
    The future of AI holds great promise for solving complex problems.
    Continued development of AI technology will likely lead to more breakthroughs.
    """
    
    # Extractive summarization
    def extractive_summarize(text, n_sentences=3):
        # Tokenize sentences
        sentences = sent_tokenize(text)
        
        # Create TF-IDF matrix
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(sentences)
        
        # Score each sentence by the sum of its TF-IDF weights
        sentence_scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()
        
        # Get top sentences
        top_indices = sentence_scores.argsort()[-n_sentences:][::-1]
        summary = ' '.join([sentences[i] for i in sorted(top_indices)])
        
        return summary
    
    # Generate summary
    summary = extractive_summarize(text)
    
    # Print results
    print("Original Text:")
    print(text)
    print("\nSummary:")
    print(summary)
    
    # Analyze compression
    original_words = len(word_tokenize(text))
    summary_words = len(word_tokenize(summary))
    compression_ratio = summary_words / original_words
    
    # Visualize results
    plt.figure(figsize=(8, 4))
    plt.bar(['Original', 'Summary'], [original_words, summary_words])
    plt.title(f'Text Compression (Ratio: {compression_ratio:.2f})')
    plt.ylabel('Word Count')
    plt.show()

text_summarization_example()
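
Summing TF-IDF weights tends to favor long sentences, because more distinct terms mean more weights to add. One possible adjustment, sketched here under stated assumptions, divides each sentence's total by its number of distinct scored terms. The names `split_sentences`, `score_sentences`, and `summarize_text` are hypothetical, the regex splitter is a naive stand-in for `sent_tokenize` so the block runs without NLTK data, and `norm=None` is chosen so the raw weights are visible.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]


def score_sentences(sentences, normalize=True):
    """Score sentences by total TF-IDF weight, optionally per distinct term."""
    vectorizer = TfidfVectorizer(stop_words='english', norm=None)
    tfidf = vectorizer.fit_transform(sentences)
    totals = np.asarray(tfidf.sum(axis=1)).ravel()
    if not normalize:
        return totals
    # Number of distinct non-stop-word terms in each sentence
    n_terms = tfidf.getnnz(axis=1)
    return totals / np.maximum(n_terms, 1)


def summarize_text(text, n_sentences=3, normalize=True):
    """Pick the top-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences, normalize=normalize)
    top_indices = np.argsort(scores)[::-1][:n_sentences]
    return ' '.join(sentences[i] for i in sorted(top_indices))


# A shortened copy of the sample text from the example above
sample_text = (
    "Artificial Intelligence (AI) has transformed various industries in recent years. "
    "Machine learning algorithms enable computers to learn from data and improve their performance. "
    "Deep learning, a subset of machine learning, has achieved remarkable results in tasks "
    "like image recognition and natural language processing. "
    "Neural networks, inspired by biological brains, form the basis of deep learning systems. "
    "However, challenges remain in areas such as ethics, bias, and transparency."
)

print("Sum of weights: ", summarize_text(sample_text, n_sentences=2, normalize=False))
print("Mean per term:  ", summarize_text(sample_text, n_sentences=2, normalize=True))
```

Comparing the two printed summaries shows whether length normalization changes which sentences are selected for a given text.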

## 3. Document Clustering

In [None]:
def document_clustering_example():
    # Sample documents
    documents = [
        "Python programming basics for beginners",
        "Machine learning algorithms in Python",
        "Data structures and algorithms",
        "Neural networks and deep learning",
        "Python web development frameworks",
        "Introduction to programming concepts",
        "Deep learning for computer vision",
        "Web development best practices",
        "Algorithm optimization techniques",
        "Python libraries for data science"
    ]
    
    # Create TF-IDF matrix
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(documents)
    
    # Apply K-means clustering (n_init is set explicitly because its default
    # differs between scikit-learn versions)
    n_clusters = 3
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(tfidf_matrix)
    
    # Create DataFrame
    df = pd.DataFrame({
        'Document': documents,
        'Cluster': clusters
    })
    
    # Print clusters
    for cluster in range(n_clusters):
        print(f"\nCluster {cluster + 1}:")
        for doc in df[df['Cluster'] == cluster]['Document']:
            print(f"- {doc}")
    
    # Visualize clusters
    from sklearn.decomposition import PCA
    
    # Reduce dimensionality for visualization
    pca = PCA(n_components=2)
    coords = pca.fit_transform(tfidf_matrix.toarray())
    
    # Plot clusters
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='viridis')
    plt.title('Document Clusters')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.colorbar(scatter, label='Cluster')
    plt.show()

document_clustering_example()
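
The clustering example fixes `n_clusters = 3` without checking whether that choice suits the data. A rough way to compare candidate values is the silhouette score, which ranges from -1 to 1 and is higher when documents sit closer to their own cluster than to neighboring ones. The sketch below reuses the same sample documents; the candidate range of 2 to 5 clusters is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Same sample documents as in the clustering example
documents = [
    "Python programming basics for beginners",
    "Machine learning algorithms in Python",
    "Data structures and algorithms",
    "Neural networks and deep learning",
    "Python web development frameworks",
    "Introduction to programming concepts",
    "Deep learning for computer vision",
    "Web development best practices",
    "Algorithm optimization techniques",
    "Python libraries for data science",
]

tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(documents)

# Fit K-means for each candidate k and record the mean silhouette score
silhouette_by_k = {}
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(tfidf_matrix)
    silhouette_by_k[k] = silhouette_score(tfidf_matrix, labels)
    print(f"k = {k}: silhouette = {silhouette_by_k[k]:.3f}")

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(f"Highest silhouette score at k = {best_k}")
```

On a corpus this small the scores are close together and sensitive to the random seed, so the highest-scoring `k` is a hint rather than a verdict.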

## Practical Exercises

In [None]:
# Exercise 1: Advanced Topic Modeling

def topic_modeling_exercise():
    # Sample articles
    articles = [
        """The latest advances in artificial intelligence have led to significant
        breakthroughs in natural language processing and computer vision...""",
        """Climate change continues to affect global weather patterns, leading to
        more extreme weather events and rising sea levels...""",
        """Researchers have discovered new potential treatments for cancer using
        targeted immunotherapy approaches...""",
        # Add more articles...
    ]
    
    print("Task: Implement advanced topic modeling")
    print("1. Preprocess the articles")
    print("2. Apply different topic modeling methods")
    print("3. Compare results")
    print("4. Visualize topics")
    
    # Your code here

topic_modeling_exercise()

In [None]:
# Exercise 2: Custom Summarizer

def summarization_exercise():
    # Sample long text
    text = """
    [Your long text here with multiple paragraphs discussing a topic in detail...]
    """
    
    print("Task: Create a custom summarization system")
    print("1. Implement sentence scoring")
    print("2. Select important sentences")
    print("3. Generate summary")
    print("4. Evaluate quality")
    
    # Your code here

summarization_exercise()
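
As a possible starting point for step 4, summary quality is often estimated by word overlap with a human-written reference summary. The sketch below computes a clipped unigram overlap in the spirit of ROUGE-1. It is not the official ROUGE implementation (no stemming or stop-word handling), and the function names `simple_tokens` and `unigram_overlap` as well as the two toy strings are hypothetical.

```python
import re
from collections import Counter


def simple_tokens(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_overlap(candidate, reference):
    """Return precision, recall, and F1 of clipped unigram overlap."""
    cand_counts = Counter(simple_tokens(candidate))
    ref_counts = Counter(simple_tokens(reference))
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy strings for illustration only
candidate_summary = "the cat sat"
reference_summary = "the cat sat on the mat"

scores = unigram_overlap(candidate_summary, reference_summary)
print(scores)
```

Recall shows how much of the reference the summary covers, while precision shows how much of the summary is supported by the reference; reporting both avoids rewarding summaries that are simply long.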

## MCQ Quiz

1. What is topic modeling?
   - a) Text classification
   - b) Topic discovery
   - c) Text generation
   - d) Language translation

2. What is LDA used for?
   - a) Text summarization
   - b) Topic modeling
   - c) Machine translation
   - d) Sentiment analysis

3. What does extractive summarization do?
   - a) Generates new text
   - b) Selects existing sentences
   - c) Translates text
   - d) Paraphrases text

4. What is document clustering?
   - a) Text generation
   - b) Document grouping
   - c) Translation
   - d) Summarization

5. What is perplexity in topic modeling?
   - a) Number of topics
   - b) Model quality measure
   - c) Text length
   - d) Word count

6. Which algorithm is NOT a topic modeling method?
   - a) LDA
   - b) NMF
   - c) K-means
   - d) LSA

7. What is coherence score?
   - a) Text length
   - b) Topic quality measure
   - c) Word count
   - d) Document length

8. What is abstractive summarization?
   - a) Sentence selection
   - b) New text generation
   - c) Word counting
   - d) Text clustering

9. What is silhouette score used for?
   - a) Topic modeling
   - b) Cluster evaluation
   - c) Text generation
   - d) Summarization

10. What is the purpose of dimensionality reduction?
    - a) Text generation
    - b) Data visualization
    - c) Translation
    - d) Summarization

Answers: 1-b, 2-b, 3-b, 4-b, 5-b, 6-c, 7-b, 8-b, 9-b, 10-b