# 1. Text Cleaning and Frequency Analysis

**Project:** Text EDA (20 Newsgroups)  
**Goal:** Transform raw, messy text into clean tokens and analyze statistical properties (Zipf's Law, N-grams).

---

## 1. Imports and Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.datasets import fetch_20newsgroups
from collections import Counter
import os

# Ensure NLTK resources are available
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Load Dataset
We use the **20 Newsgroups** dataset. To make it challenging (and realistic), we strip headers, footers, and quotes, leaving only the raw message body.

In [None]:
categories = ['sci.space', 'comp.graphics', 'talk.politics.mideast', 'rec.sport.hockey']
# Download to local project folder to avoid cache corruption and permission issues
data_home = '../../data/raw/scikit_learn_data'
os.makedirs(data_home, exist_ok=True)

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), 
                               categories=categories, data_home=data_home)

df = pd.DataFrame({'text': newsgroups.data, 'target': newsgroups.target})
df['category'] = df['target'].map(lambda x: newsgroups.target_names[x])

print(f"Dataset Shape: {df.shape}")
print("\nSample Text:")
print(df['text'].iloc[0][:500])
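
Stripping headers, footers, and quotes may leave some posts empty or whitespace-only. A minimal sketch of filtering such rows follows; the helper name `drop_empty_texts` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def drop_empty_texts(frame, text_col="text"):
    """Return a copy without rows whose text is empty or whitespace-only."""
    mask = frame[text_col].str.strip().str.len() > 0
    return frame.loc[mask].reset_index(drop=True)

# Toy data standing in for the newsgroups DataFrame
toy = pd.DataFrame({
    "text": ["Hello world", "   ", "", "Orbit data"],
    "target": [0, 1, 2, 3],
})
cleaned = drop_empty_texts(toy)
print(len(toy), "->", len(cleaned))
```

If this helper were used here, it could be applied as `df = drop_empty_texts(df)` before cleaning.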

## 3. Text Preprocessing Pipeline
Raw text is noisy. We apply a standard cleaning pipeline:
1.  **Lowercasing**: Uniformity.
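2.  **Regex Cleaning**: Remove every character that is not a letter or whitespace (digits, punctuation). Extra whitespace disappears when the text is split into tokens.
3.  **Stopword Removal**: Remove common function words (the, is, at) and tokens shorter than three characters.
4.  **Lemmatization**: Convert words to their base form. `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, so plurals are reduced (cars -> car) while verb forms such as "running" are left unchanged.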

In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Replace anything that is not a letter or whitespace with a space,
    #    so "e-mail" becomes "e mail" instead of being fused into "email"
    text = re.sub(r'[^a-z\s]', ' ', text)
    # 3. Tokenize (simple split)
    tokens = text.split()
    # 4. Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2]
    
    return " ".join(tokens)

df['clean_text'] = df['text'].apply(clean_text)

print("Cleaning Complete. Sample Comparison:")
print(f"Original: {df['text'].iloc[0][:100]}...")
print(f"Clean:    {df['clean_text'].iloc[0][:100]}...")
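
The effect of the regex and stopword steps can be checked on a single string without NLTK. The sketch below compares deleting non-letter characters with replacing them by a space. `TOY_STOPWORDS` and `tokens_after_cleaning` are hypothetical names introduced for illustration, and lemmatization is left out.

```python
import re

# Hypothetical mini stopword list, standing in for NLTK's English list
TOY_STOPWORDS = {"the", "is", "at", "don", "not"}

def tokens_after_cleaning(text, replacement):
    """Lowercase, swap non-letters for `replacement`, then drop stopwords and short tokens."""
    text = re.sub(r"[^a-z\s]", replacement, text.lower())
    return [t for t in text.split() if t not in TOY_STOPWORDS and len(t) > 2]

sample = "Don't e-mail the 3 files at 10:30!"
deleted = tokens_after_cleaning(sample, "")    # non-letters deleted
spaced = tokens_after_cleaning(sample, " ")    # non-letters replaced by a space
print(deleted)
print(spaced)
```

Deleting characters fuses "don't" into "dont", which no longer matches a stopword entry. Replacing them with a space splits it into pieces that the stopword and length filters remove.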

## 4. Word Frequency Analysis
What are the most common terms in these domains?

In [None]:
all_words = " ".join(df['clean_text']).split()
word_freq = Counter(all_words)

common_words = pd.DataFrame(word_freq.most_common(20), columns=['Word', 'Count'])

sns.barplot(x='Count', y='Word', data=common_words, palette='viridis')
plt.title('Top 20 Most Frequent Words (Cleaned)')
plt.show()
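
A corpus-wide count mixes four quite different domains, so a per-category breakdown can be more informative. A minimal sketch, assuming columns named `clean_text` and `category` as in the DataFrame above; the helper `top_words_by_category` and the toy data are illustrative only.

```python
from collections import Counter
import pandas as pd

def top_words_by_category(frame, text_col="clean_text", label_col="category", k=3):
    """Return {category: [(word, count), ...]} with the k most frequent words."""
    result = {}
    for label, group in frame.groupby(label_col):
        counts = Counter(" ".join(group[text_col]).split())
        result[label] = counts.most_common(k)
    return result

# Toy data standing in for the cleaned newsgroups DataFrame
toy = pd.DataFrame({
    "clean_text": ["orbit launch orbit", "launch moon", "goal puck goal", "goal team"],
    "category": ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"],
})
top = top_words_by_category(toy, k=2)
for label, words in top.items():
    print(label, words)
```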

## 5. Zipf's Law Verification
Zipf's Law states that a word's frequency is roughly inversely proportional to its rank in the frequency table. On a log-log plot of frequency against rank, this appears as an approximately straight line with a slope close to -1.
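
The straight line follows from taking logarithms of the power-law form. The symbols here are assumptions for illustration: $f(r)$ is the frequency of the word at rank $r$, $C$ is a corpus-dependent constant, and $s$ is the exponent (the classic statement of the law corresponds to $s = 1$).

```latex
f(r) = \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) = \log C - s \log r
```

So $\log f$ is linear in $\log r$ with slope $-s$.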

In [None]:
counts = np.array(list(word_freq.values()))
counts = -np.sort(-counts) # Descending sort
ranks = np.arange(1, len(counts) + 1)

plt.figure(figsize=(10, 6))
plt.loglog(ranks, counts, marker=".", linestyle='none', color='purple')
plt.title("Zipf's Law: Log-Log Plot of Word Frequency vs Rank")
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.grid(True, which="both", ls="--")
plt.show()
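
Beyond eyeballing the plot, the slope can be estimated with a least-squares fit in log space. The sketch below is one simple way to do this, not a rigorous power-law test. `zipf_slope` and its `max_rank` parameter are hypothetical names, and the data is synthetic.

```python
import numpy as np

def zipf_slope(counts, max_rank=None):
    """Least-squares slope of log(frequency) against log(rank)."""
    sorted_counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    if max_rank is not None:
        sorted_counts = sorted_counts[:max_rank]
    ranks = np.arange(1, len(sorted_counts) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(sorted_counts), 1)
    return slope

# Synthetic counts that follow f(r) = 1000 / r exactly
synthetic = 1000.0 / np.arange(1, 201)
slope = zipf_slope(synthetic)
print(round(slope, 3))
```

On the notebook's data this could be called as `zipf_slope(list(word_freq.values()), max_rank=1000)`. Restricting the fit to the top ranks is a judgment call, since the long tail of words that occur only once tends to flatten the curve.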

## 6. Word Cloud
A word cloud gives a quick visual summary of the corpus: the more frequent a word, the larger it is drawn.

In [None]:
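# Build the cloud from the counts computed above; generate() would re-tokenize the joined text
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      colormap='plasma').generate_from_frequencies(word_freq)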

plt.figure(figsize=(14, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of 20 Newsgroups (Subset)')
plt.show()

## 7. N-Gram Analysis (Bi-grams)
Single words miss context. Let's look at pairs of words (bi-grams).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def plot_top_ngrams(text, n=2, top_k=15):
    # fit_transform learns the n-gram vocabulary and counts it in a single pass
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english')
    bag_of_words = vec.fit_transform(text)
    # Column sums give the total count of each n-gram over all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    
    df_ngram = pd.DataFrame(words_freq[:top_k], columns=['Ngram', 'Count'])
    
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Count', y='Ngram', data=df_ngram, palette='magma')
    plt.title(f'Top {top_k} {n}-grams')
    plt.show()

plot_top_ngrams(df['clean_text'], n=2, top_k=15)
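
To make the bi-gram counts concrete, they can be reproduced by hand on a toy corpus. This sketch uses only the standard library; `count_bigrams` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def count_bigrams(documents):
    """Count adjacent word pairs within each document (pairs never span documents)."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

docs = ["space shuttle launch", "space shuttle mission"]
bigrams = count_bigrams(docs)
print(bigrams.most_common(1))
```

Counts from `CountVectorizer` can differ slightly, since its default tokenizer drops single-character tokens and its stop word list is applied before the n-grams are formed.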