# Multimedia Processing Course - Part 5: Text Processing

Text is a fundamental type of multimedia data. In this notebook, we explore how to manipulate, analyze, and extract information from text using Python.

**Content:**
1.  **Level 1 (Basic)**: String Manipulation and Tokenization.
2.  **Level 2 (Intermediate)**: Text Cleaning, Frequency Analysis, and Word Clouds.
3.  **Level 3 (Advanced)**: Sentiment Analysis and Text Similarity.

In [None]:
# Install required libraries (run once if needed)
# !pip install nltk wordcloud matplotlib

import re
import string
from collections import Counter

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('vader_lexicon', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

print("Libraries loaded successfully!")

### Explanation
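We import the core libraries: `re` for regular expressions, `string` for character constants such as `string.punctuation`, `nltk` for natural language processing, `Counter` for frequency counting, `WordCloud` for visualization, and `matplotlib` for plotting. The `nltk.download` calls fetch the tokenizer models, the stop-word list, and the VADER lexicon used in later cells.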

## Level 1: String Manipulation and Tokenization
The first step in text processing is breaking raw text into meaningful units called **tokens** (words or sentences).

In [None]:
# Sample text for the entire notebook
sample_text = """
Multimedia is a combination of text, images, audio, and video. It is widely used in education,
entertainment, and communication. Text processing is one of the most important areas in
multimedia systems. Natural language processing (NLP) enables computers to understand,
interpret, and generate human language. Text data is everywhere: emails, websites, social
media, books, and more. Analyzing text helps us extract valuable information and insights.
"""

# Basic string operations
print("Original length (chars):", len(sample_text))
print("Uppercase:", sample_text[:50].upper())
print("Lowercase:", sample_text[:50].lower())

### Explanation
Python strings have built-in methods such as `.upper()` and `.lower()`, and the built-in function `len()` returns the number of characters. These basic operations are commonly used early on when working with raw text data.
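Beyond case conversion, a few other string methods come up often when preparing raw text. The sketch below uses a made-up string `raw` (not the notebook's `sample_text`) to show trimming, splitting, and replacement.

```python
# Illustrative string; `raw` is a hypothetical example, not the notebook's sample_text.
raw = "  Text processing is FUN.  "

trimmed = raw.strip()                        # remove leading/trailing whitespace
lowered = trimmed.lower()                    # normalize case
pieces = lowered.split()                     # split on runs of whitespace
replaced = lowered.replace("fun", "useful")  # substitute a substring

print(len(raw), len(trimmed))                # len() is a function, not a string method
print(pieces)
print(replaced)
```

Note that `.split()` leaves punctuation attached to the neighboring word (`'fun.'`), which is one reason a dedicated tokenizer is usually preferred.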

In [None]:
# Sentence tokenization
sentences = sent_tokenize(sample_text.strip())
print(f"Number of sentences: {len(sentences)}")
for i, sent in enumerate(sentences):
    print(f"  Sentence {i+1}: {sent}")

### Explanation
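`sent_tokenize` from NLTK splits the text into a list of sentences. It uses the pre-trained Punkt model, which recognizes many common abbreviations (such as "Dr.") so that not every period is treated as a sentence boundary.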
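To see why sentence splitting needs more than a simple rule, here is a minimal sketch of a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace. The helper `naive_sent_split` and the example string are hypothetical and only serve as a contrast to a trained tokenizer.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

example = "Dr. Smith teaches multimedia. The course starts at 9 a.m. on Monday!"

# The naive rule wrongly breaks after "Dr." and after "a.m."
for part in naive_sent_split(example):
    print(repr(part))
```

The rule produces four fragments for what a reader would call two sentences. A trained tokenizer is designed to avoid many of these false breaks, although no tokenizer is perfect.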

In [None]:
# Word tokenization
words = word_tokenize(sample_text.lower())
print(f"Total tokens (words + punctuation): {len(words)}")
print("First 20 tokens:", words[:20])

### Explanation
`word_tokenize` splits text into individual words and punctuation marks. We convert to lowercase first so that variants such as "Text" and "text" are counted as the same token.
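The idea of separating words from punctuation can be sketched with a single regular expression. The helper `simple_tokenize` below is a hypothetical simplification, not how NLTK works internally; for instance, NLTK also splits contractions such as "don't" into "do" and "n't", which this pattern does not attempt.

```python
import re

def simple_tokenize(text):
    """Return lowercase word tokens and single punctuation marks (hypothetical helper)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

sentence = "Text, images, and video."

print(simple_tokenize(sentence))   # punctuation becomes separate tokens
print(sentence.lower().split())    # plain split keeps punctuation glued to words
```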

## Level 2: Text Cleaning, Frequency Analysis, and Word Clouds
Raw text is messy. We first remove punctuation and stop words, and then analyze which words appear most often.

In [None]:
# Text cleaning: remove punctuation and stop words
stop_words = set(stopwords.words('english'))

# Keep only alphabetic tokens and remove stop words
clean_words = [
    word for word in words
    if word.isalpha() and word not in stop_words
]

print(f"Tokens after cleaning: {len(clean_words)}")
print("Clean words sample:", clean_words[:20])

### Explanation
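**Stop words** are common words (like 'the', 'is', 'in') that carry little meaning on their own. Removing them reduces noise. `.isalpha()` keeps only tokens made up entirely of letters, so it drops punctuation and numbers, but also hyphenated words and contraction fragments such as `n't`.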
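The effect of the two filters can be checked on a small hand-made token list. The set `mini_stop_words` and the list `tokens` below are hypothetical; NLTK's English stop-word list is much longer.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer.
mini_stop_words = {"the", "is", "in", "a", "of", "and"}

tokens = ["the", "state-of-the-art", "model", "is", "n't", "in", "2024", ",", "café"]

# Same filter as the cleaning cell: letters only, and not a stop word.
kept = [t for t in tokens if t.isalpha() and t not in mini_stop_words]

# Tokens rejected by .isalpha() alone, regardless of the stop-word list.
dropped_by_isalpha = [t for t in tokens if not t.isalpha()]

print(kept)
print(dropped_by_isalpha)
```

Accented letters pass `.isalpha()`, while the hyphenated term is lost entirely. If such terms matter for the analysis, a looser filter (for example, stripping punctuation instead of rejecting the token) may be a better fit.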

In [None]:
# Word frequency analysis
word_freq = Counter(clean_words)
most_common = word_freq.most_common(10)

print("Top 10 most frequent words:")
for word, count in most_common:
    print(f"  '{word}': {count} time(s)")

# Bar chart of frequencies
words_list, counts_list = zip(*most_common)
plt.figure(figsize=(10, 5))
plt.bar(words_list, counts_list, color='steelblue')
plt.title('Top 10 Word Frequencies')
plt.xlabel('Word')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

### Explanation
`Counter` from Python's standard-library `collections` module counts occurrences of each element. `.most_common(10)` returns the 10 most frequent words as (word, count) pairs, sorted from most to least frequent.
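A tiny example makes the behavior of `Counter` easier to see. The token list here is made up for illustration.

```python
from collections import Counter

tokens = ["text", "audio", "text", "video", "audio", "text"]
freq = Counter(tokens)

print(freq["text"])          # count for one word
print(freq["image"])         # a missing key counts as 0 instead of raising KeyError
print(freq.most_common(2))   # the two most frequent (word, count) pairs
```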

In [None]:
# Word Cloud visualization
wordcloud_text = ' '.join(clean_words)

wc = WordCloud(
    width=800,
    height=400,
    background_color='white',
    colormap='viridis',
    max_words=50
).generate(wordcloud_text)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Sample Text', fontsize=16)
plt.tight_layout()
plt.show()

### Explanation
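A **Word Cloud** is a visual representation where more frequent words appear larger. `WordCloud` from the `wordcloud` library generates it. We join our cleaned words into a single string and pass it to `.generate()`. Note that `.generate()` tokenizes the string again and, by default, also counts frequent word pairs (collocations), so the sizes in the image may not match `word_freq` exactly.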
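The "more frequent means larger" idea can be sketched as a simple linear mapping from counts to font sizes. This is only an illustration: the helper `scale_font_sizes` and its size limits are hypothetical, and the `wordcloud` library uses its own scaling and layout logic rather than this formula. The sketch assumes a non-empty frequency table.

```python
from collections import Counter

def scale_font_sizes(freq, min_size=10, max_size=60):
    """Map each word's count linearly onto [min_size, max_size] (hypothetical helper)."""
    lo, hi = min(freq.values()), max(freq.values())
    span = hi - lo
    sizes = {}
    for word, count in freq.items():
        # If all counts are equal, give every word the maximum size.
        ratio = (count - lo) / span if span else 1.0
        sizes[word] = round(min_size + ratio * (max_size - min_size))
    return sizes

freq = Counter({"text": 5, "multimedia": 3, "audio": 1})
print(scale_font_sizes(freq))
```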

## Level 3: Sentiment Analysis and Text Similarity
**Sentiment analysis** determines whether a piece of text is positive, negative, or neutral. **Text similarity** measures how alike two pieces of text are.

In [None]:
# Sentiment Analysis using NLTK's VADER
sia = SentimentIntensityAnalyzer()

test_sentences = [
    "Text processing is absolutely amazing and very powerful!",
    "This task is terrible and extremely difficult.",
    "Multimedia combines text, audio, image, and video."
]

print("Sentiment Analysis Results:")
print("-" * 50)
for sentence in test_sentences:
    scores = sia.polarity_scores(sentence)
    # Classify using the conventional VADER thresholds on the compound score
    compound = scores['compound']
    if compound >= 0.05:
        label = 'POSITIVE'
    elif compound <= -0.05:
        label = 'NEGATIVE'
    else:
        label = 'NEUTRAL'
    print(f"Text   : {sentence}")
    print(f"Scores : {scores}")
    print(f"Label  : {label}")
    print("-" * 50)

### Explanation
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based sentiment analyzer. `polarity_scores()` returns:
- `neg`, `neu`, `pos`: the proportions of the text that VADER rates as negative, neutral, and positive; the three values sum to roughly 1.
- `compound`: a single normalized score between -1 (most negative) and +1 (most positive).
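The sketch below shows how an unbounded sum of word valences can be squashed into the range -1 to +1 and then turned into a label. As far as we know, VADER's reference implementation uses this normalization with `alpha = 15`; treat that detail as an assumption to verify against the `nltk.sentiment.vader` source. The helper `label_from_compound` is hypothetical and simply applies the ±0.05 cut-offs from the cell above.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1); alpha=15 is an assumed default."""
    return score / math.sqrt(score * score + alpha)

def label_from_compound(compound, threshold=0.05):
    """Apply the +/-0.05 cut-offs to a compound score (hypothetical helper)."""
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"

# Made-up raw valence sums, only to show the shape of the mapping.
for raw_sum in (-4.0, 0.0, 0.1, 4.0):
    c = normalize(raw_sum)
    print(f"{raw_sum:+.1f} -> {c:+.3f} -> {label_from_compound(c)}")
```

Small raw sums stay close to zero and fall into the neutral band, while large sums approach but never reach ±1.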

In [None]:
# Text Similarity using Jaccard Similarity
def jaccard_similarity(text1, text2):
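    """Compute Jaccard similarity between the word sets of two texts."""
    # Keep alphabetic tokens only, so shared punctuation (',' or '.')
    # does not inflate the similarity of unrelated texts.
    set1 = {w for w in word_tokenize(text1.lower()) if w.isalpha()}
    set2 = {w for w in word_tokenize(text2.lower()) if w.isalpha()}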
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union) if union else 0.0

text_a = "Multimedia includes text, images, audio, and video."
text_b = "Text and audio are key parts of multimedia systems."
text_c = "Python is a popular programming language for data science."

sim_ab = jaccard_similarity(text_a, text_b)
sim_ac = jaccard_similarity(text_a, text_c)

print(f"Similarity (A vs B): {sim_ab:.2f}  <- Both are about multimedia")
print(f"Similarity (A vs C): {sim_ac:.2f}  <- Different topics")

### Explanation
**Jaccard Similarity** measures overlap between two sets of words:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

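A score of 1.0 means the two texts contain exactly the same set of distinct tokens; word order and repetition are ignored. A score of 0.0 means they have no tokens in common. The measure is simple and works reasonably well for comparing short texts, but it does not account for synonyms or word importance.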
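A worked example on small, hand-written sets shows how the formula behaves at its extremes. The helper `jaccard` below operates directly on sets (no tokenization) and is a minimal sketch for illustration.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

a = {"text", "audio", "video"}
b = {"text", "audio", "image"}

print(jaccard(a, b))            # 2 shared words out of 4 distinct words
print(jaccard(a, a))            # identical sets
print(jaccard(a, {"python"}))   # no overlap

# Repetition and order do not matter once the words are placed in a set.
print(jaccard(set("a b a b".split()), set("b a".split())))
```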