# 📄 Text Classification Deep Dive

This notebook explores advanced text classification techniques beyond sentiment analysis:
- **Multi-class text classification**
- **Topic classification**
- **Spam detection**
- **Advanced evaluation metrics**
- **Handling imbalanced datasets**

## 📚 Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    precision_recall_fscore_support, roc_auc_score, 
    multilabel_confusion_matrix
)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Download the NLTK stopword list if it is not already available locally
nltk.download('stopwords', quiet=True)

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("Set2")

## 📊 Multi-Class Dataset Creation

Let's build a small, balanced multi-class dataset of news-style headlines: five categories with eight examples each.

In [None]:
# Create multi-class dataset - News Categories
news_data = {
    'Technology': [
        "New smartphone features artificial intelligence capabilities",
        "Software company releases innovative machine learning platform",
        "Tech startup develops breakthrough quantum computing solution",
        "Major update brings enhanced security features to popular app",
        "Cloud computing services expand to new global regions",
        "Artificial intelligence helps improve medical diagnosis accuracy",
        "New programming language simplifies data science workflows",
        "Cryptocurrency adoption increases among financial institutions"
    ],
    'Sports': [
        "Football team wins championship after incredible season performance",
        "Basketball player breaks scoring record in playoff game",
        "Tennis tournament features exciting matches and upsets",
        "Olympic athletes prepare for upcoming international competition",
        "Soccer team signs new player for record transfer fee",
        "Baseball season starts with promising rookie performances",
        "Swimming world records broken at international championship",
        "Golf tournament concludes with thrilling final round"
    ],
    'Health': [
        "Medical researchers discover new treatment for rare disease",
        "Health experts recommend regular exercise for mental wellness",
        "New vaccine shows promising results in clinical trials",
        "Diet and nutrition study reveals benefits of Mediterranean eating",
        "Mental health awareness campaign launches in schools nationwide",
        "Medical device innovation improves patient care outcomes",
        "Health insurance coverage expands to include preventive care",
        "Fitness tracker data helps doctors monitor patient health"
    ],
    'Business': [
        "Stock market reaches new highs amid economic recovery",
        "Company reports strong quarterly earnings and revenue growth",
        "Merger between two major corporations creates industry leader",
        "Small business loans program helps entrepreneurs during pandemic",
        "E-commerce sales continue rapid growth in retail sector",
        "Investment firm announces new fund for sustainable projects",
        "Manufacturing sector shows signs of economic improvement",
        "Banking industry adapts to digital transformation trends"
    ],
    'Entertainment': [
        "New movie breaks box office records on opening weekend",
        "Popular television series announces final season premiere date",
        "Music festival features lineup of international artists",
        "Award ceremony celebrates achievements in film and television",
        "Streaming service launches original content production studio",
        "Celebrity couple announces engagement at red carpet event",
        "Video game releases generate excitement among gaming community",
        "Concert tour sells out venues across multiple countries"
    ]
}

# Create DataFrame
texts = []
categories = []

for category, articles in news_data.items():
    texts.extend(articles)
    categories.extend([category] * len(articles))

df_multi = pd.DataFrame({
    'text': texts,
    'category': categories
})

print(f"Multi-class dataset created with {len(df_multi)} samples")
print(f"Categories: {df_multi['category'].unique().tolist()}")
print("\nCategory distribution:")
print(df_multi['category'].value_counts())

## 📊 Dataset Analysis

In [None]:
# Visualize category distribution
plt.figure(figsize=(12, 8))

# Category distribution
plt.subplot(2, 2, 1)
df_multi['category'].value_counts().plot(kind='bar', color='skyblue', alpha=0.8)
plt.title('Category Distribution')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Text length analysis
df_multi['text_length'] = df_multi['text'].str.len()
df_multi['word_count'] = df_multi['text'].str.split().str.len()

plt.subplot(2, 2, 2)
sns.boxplot(data=df_multi, x='category', y='text_length')
plt.title('Text Length by Category')
plt.xticks(rotation=45)

plt.subplot(2, 2, 3)
sns.boxplot(data=df_multi, x='category', y='word_count')
plt.title('Word Count by Category')
plt.xticks(rotation=45)

plt.subplot(2, 2, 4)
for category in df_multi['category'].unique():
    category_data = df_multi[df_multi['category'] == category]['text_length']
    plt.hist(category_data, alpha=0.6, label=category, bins=10)
plt.title('Text Length Distribution by Category')
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()

## 🧹 Text Preprocessing

In [None]:
# Build the stop-word set and stemmer once rather than on every call.
# A few words are kept out of the stop list in case they carry class signal.
DOMAIN_WORDS = {'new', 'major', 'first', 'last', 'best', 'good', 'great'}
STOP_WORDS = set(stopwords.words('english')) - DOMAIN_WORDS
STEMMER = PorterStemmer()

def advanced_preprocessing(text):
    """
    Clean a text for multi-class classification: lowercase, keep letters
    only, drop short words and stopwords, then stem.
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove everything except letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenize and drop words of one or two characters
    words = [word for word in text.split() if len(word) > 2]
    
    # Remove stopwords
    words = [word for word in words if word not in STOP_WORDS]
    
    # Stemming
    words = [STEMMER.stem(word) for word in words]
    
    return ' '.join(words)

# Apply preprocessing
df_multi['processed_text'] = df_multi['text'].apply(advanced_preprocessing)

print("Preprocessing Examples:")
for i in range(3):
    print(f"\nCategory: {df_multi.iloc[i]['category']}")
    print(f"Original: {df_multi.iloc[i]['text']}")
    print(f"Processed: {df_multi.iloc[i]['processed_text']}")

## 🎯 Multi-Class Classification

In [None]:
# Prepare data for classification
X = df_multi['processed_text']
y = df_multi['category']

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Number of classes: {len(label_encoder.classes_)}")
print(f"Classes: {label_encoder.classes_.tolist()}")

In [None]:
# Feature extraction with TF-IDF
tfidf_vectorizer = TfidfVectorizer(
    max_features=2000,
    ngram_range=(1, 2),
    min_df=1,
    max_df=0.8
)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.get_feature_names_out())}")
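
To make the weights in this matrix less opaque, the sketch below recomputes TF-IDF by hand for a toy corpus. It assumes the smoothed IDF that scikit-learn documents as the `TfidfVectorizer` default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The whitespace tokenizer, the unigram-only vocabulary, the helper name `tfidf_rows`, and the three toy documents are illustrative assumptions, not part of the notebook's pipeline.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hand-rolled TF-IDF (hypothetical helper): raw counts x smoothed IDF, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

toy_docs = ["team win game", "team sign player", "market reach high"]
vocab, idf, rows = tfidf_rows(toy_docs)

for term in ("team", "market"):
    print(f"{term}: idf = {idf[term]:.4f}")
```

A term that appears in more documents ("team") gets a lower IDF than one that appears in a single document ("market"), which is why rare, category-specific words dominate the feature vectors.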

## 🤖 Multi-Class Algorithm Comparison

In [None]:
# Define algorithms for multi-class classification
algorithms = {
    'Multinomial NB': MultinomialNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM (OvR)': OneVsRestClassifier(SVC(random_state=42, probability=True)),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}

# Train and evaluate each algorithm
results = {}

for name, algorithm in algorithms.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    algorithm.fit(X_train_tfidf, y_train)
    
    # Make predictions
    y_pred = algorithm.predict(X_test_tfidf)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    # The fourth return value (support) is None when an average is requested
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average='weighted', zero_division=0
    )
    
    results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'predictions': y_pred
    }
    
    print(f"{name} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}")

# Create results DataFrame
results_df = pd.DataFrame({
    'Algorithm': list(results.keys()),
    'Accuracy': [results[name]['accuracy'] for name in results.keys()],
    'Precision': [results[name]['precision'] for name in results.keys()],
    'Recall': [results[name]['recall'] for name in results.keys()],
    'F1-Score': [results[name]['f1_score'] for name in results.keys()]
})

print("\nMulti-Class Classification Results:")
print(results_df.round(4).to_string(index=False))

In [None]:
# Visualize algorithm comparison
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Accuracy comparison
ax1.bar(results_df['Algorithm'], results_df['Accuracy'], color='skyblue', alpha=0.8)
ax1.set_title('Accuracy Comparison')
ax1.set_ylabel('Accuracy')
ax1.tick_params(axis='x', rotation=45)
ax1.set_ylim(0, 1)

# Precision comparison
ax2.bar(results_df['Algorithm'], results_df['Precision'], color='lightgreen', alpha=0.8)
ax2.set_title('Precision Comparison')
ax2.set_ylabel('Precision')
ax2.tick_params(axis='x', rotation=45)
ax2.set_ylim(0, 1)

# Recall comparison
ax3.bar(results_df['Algorithm'], results_df['Recall'], color='lightcoral', alpha=0.8)
ax3.set_title('Recall Comparison')
ax3.set_ylabel('Recall')
ax3.tick_params(axis='x', rotation=45)
ax3.set_ylim(0, 1)

# F1-Score comparison
ax4.bar(results_df['Algorithm'], results_df['F1-Score'], color='gold', alpha=0.8)
ax4.set_title('F1-Score Comparison')
ax4.set_ylabel('F1-Score')
ax4.tick_params(axis='x', rotation=45)
ax4.set_ylim(0, 1)

plt.tight_layout()
plt.show()

## 📊 Detailed Analysis of Best Model

In [None]:
# Find best model
best_model_name = results_df.loc[results_df['F1-Score'].idxmax(), 'Algorithm']
best_predictions = results[best_model_name]['predictions']

print(f"Best performing model: {best_model_name}")
print(f"F1-Score: {results[best_model_name]['f1_score']:.4f}")

# Detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(
    y_test, best_predictions, 
    target_names=label_encoder.classes_,
    digits=4
))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, best_predictions)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.title(f'Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted Category')
plt.ylabel('Actual Category')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Per-class recall: correct predictions divided by the number of actual samples in each class
class_recall = cm.diagonal() / cm.sum(axis=1)
recall_df = pd.DataFrame({
    'Category': label_encoder.classes_,
    'Recall': class_recall
})

print("\nPer-Class Recall:")
print(recall_df.round(4).to_string(index=False))
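
The diagonal-over-row-sum ratio computed from the confusion matrix is each class's recall; dividing the diagonal by the column sums gives precision instead. The sketch below shows both on a made-up 3x3 matrix, assuming the same layout as `confusion_matrix` (rows are actual classes, columns are predicted classes). The helper `per_class_metrics` and the counts in `toy_cm` are hypothetical.

```python
def per_class_metrics(cm, class_names):
    """Per-class precision and recall from a square confusion matrix (hypothetical helper)."""
    metrics = {}
    for k, name in enumerate(class_names):
        true_positives = cm[k][k]
        actual_total = sum(cm[k])                    # row sum: samples that truly belong to class k
        predicted_total = sum(row[k] for row in cm)  # column sum: samples predicted as class k
        recall = true_positives / actual_total if actual_total else 0.0
        precision = true_positives / predicted_total if predicted_total else 0.0
        metrics[name] = {'precision': precision, 'recall': recall}
    return metrics

# Made-up counts for illustration only
toy_cm = [
    [2, 1, 0],
    [0, 3, 0],
    [1, 0, 2],
]
toy_metrics = per_class_metrics(toy_cm, ['Business', 'Health', 'Sports'])

for name, values in toy_metrics.items():
    print(f"{name}: precision={values['precision']:.2f}, recall={values['recall']:.2f}")
```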

## 🔍 Feature Analysis

In [None]:
# Analyze most important features for each class (using Logistic Regression)
if best_model_name == 'Logistic Regression':
    lr_model = algorithms['Logistic Regression']
    feature_names = tfidf_vectorizer.get_feature_names_out()
    
    # Get coefficients for each class
    coefficients = lr_model.coef_
    
    # Plot top features for each class
    fig, axes = plt.subplots(2, 3, figsize=(20, 12))
    axes = axes.ravel()
    
    for i, class_name in enumerate(label_encoder.classes_):
        if i < len(axes):
            # Get top features for this class
            class_coef = coefficients[i]
            top_indices = np.argsort(class_coef)[-10:][::-1]
            top_features = [feature_names[idx] for idx in top_indices]
            top_values = [class_coef[idx] for idx in top_indices]
            
            # Plot
            axes[i].barh(range(len(top_features)), top_values, color=f'C{i}', alpha=0.7)
            axes[i].set_yticks(range(len(top_features)))
            axes[i].set_yticklabels(top_features)
            axes[i].set_title(f'Top Features - {class_name}')
            axes[i].set_xlabel('Coefficient Value')
    
    # Hide empty subplot
    if len(label_encoder.classes_) < len(axes):
        axes[-1].set_visible(False)
    
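    plt.tight_layout()
    plt.show()
else:
    # Coefficient plots apply only to the linear model, so report why nothing is shown
    print(f"Feature analysis skipped: best model is {best_model_name}, not Logistic Regression")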

## 🧪 Testing with New Examples

In [None]:
def predict_category(text, model, vectorizer, label_encoder, preprocessor):
    """
    Predict category for new text
    """
    # Preprocess
    processed_text = preprocessor(text)
    
    # Vectorize
    text_vector = vectorizer.transform([processed_text])
    
    # Predict
    prediction = model.predict(text_vector)[0]
    probabilities = model.predict_proba(text_vector)[0]
    
    # Get category name
    category = label_encoder.inverse_transform([prediction])[0]
    
    # Get top predictions
    top_indices = np.argsort(probabilities)[::-1]
    top_predictions = [(label_encoder.classes_[idx], probabilities[idx]) for idx in top_indices]
    
    return category, top_predictions

# Test examples
test_texts = [
    "Apple releases new iPhone with improved camera and battery life",
    "Tennis player wins Wimbledon championship in straight sets",
    "Scientists develop new cancer treatment using immunotherapy",
    "Stock market surges after positive earnings reports",
    "Netflix announces new original series starring popular actors",
    "Machine learning algorithm helps doctors diagnose diseases faster",
    "Football team trades star quarterback to division rival",
    "Cryptocurrency prices fluctuate amid regulatory uncertainty"
]

best_model = algorithms[best_model_name]

print("Category Predictions for New Texts:")
print("=" * 80)

for text in test_texts:
    predicted_category, top_predictions = predict_category(
        text, best_model, tfidf_vectorizer, label_encoder, advanced_preprocessing
    )
    
    print(f"\nText: {text}")
    print(f"Predicted Category: {predicted_category}")
    print(f"Confidence: {top_predictions[0][1]:.4f}")
    print("Top 3 Predictions:")
    for category, prob in top_predictions[:3]:
        print(f"  {category}: {prob:.4f}")

## 📧 Spam Detection Example

In [None]:
# Create spam detection dataset
spam_emails = [
    "Congratulations! You have won $1000000! Click here to claim now!",
    "Buy cheap medications online! Best prices guaranteed!",
    "Make money fast! Work from home opportunity!",
    "FREE! Get rich quick scheme! Limited time offer!",
    "Urgent! Your account will be closed! Verify now!",
    "Hot singles in your area want to meet you!",
    "Lose weight fast with this amazing pill!",
    "Get your loan approved instantly! No credit check!"
]

legitimate_emails = [
    "Meeting scheduled for tomorrow at 2 PM in conference room",
    "Please find attached the quarterly financial report",
    "Thank you for your order. It will be shipped within 2 days",
    "Your subscription has been renewed successfully",
    "Welcome to our newsletter! Here are this week's updates",
    "Your appointment has been confirmed for next Monday",
    "Project deadline has been extended by one week",
    "Happy birthday! Hope you have a wonderful day"
]

# Create spam dataset
spam_texts = spam_emails + legitimate_emails
spam_labels = ['spam'] * len(spam_emails) + ['legitimate'] * len(legitimate_emails)

spam_df = pd.DataFrame({
    'text': spam_texts,
    'label': spam_labels
})

print("Spam Detection Dataset:")
print(spam_df['label'].value_counts())

# Preprocess spam data
spam_df['processed_text'] = spam_df['text'].apply(advanced_preprocessing)

# Train spam classifier
X_spam = spam_df['processed_text']
y_spam = spam_df['label']

X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
    X_spam, y_spam, test_size=0.3, random_state=42, stratify=y_spam
)

# Vectorize spam data
spam_vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_train_spam_vec = spam_vectorizer.fit_transform(X_train_spam)
X_test_spam_vec = spam_vectorizer.transform(X_test_spam)

# Train spam classifier
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train_spam_vec, y_train_spam)

# Evaluate spam classifier
spam_predictions = spam_classifier.predict(X_test_spam_vec)
spam_accuracy = accuracy_score(y_test_spam, spam_predictions)

print(f"\nSpam Classification Accuracy: {spam_accuracy:.4f}")
print("\nSpam Classification Report:")
print(classification_report(y_test_spam, spam_predictions))
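
Because `advanced_preprocessing` keeps letters only, cues such as exclamation marks, digits, and currency symbols never reach the spam classifier. One possible complement, sketched here as an assumption rather than as part of the notebook's method, is to compute a few surface features from the raw text. The helper `spam_surface_features` and its feature names are hypothetical.

```python
import re

def spam_surface_features(text):
    """Simple counts taken from the raw, unprocessed text (hypothetical helper)."""
    letters = [ch for ch in text if ch.isalpha()]
    uppercase = sum(ch.isupper() for ch in letters)
    return {
        'exclamation_count': text.count('!'),
        'digit_count': sum(ch.isdigit() for ch in text),
        'has_currency': int(bool(re.search(r'[$€£]', text))),
        'uppercase_ratio': uppercase / len(letters) if letters else 0.0,
    }

spam_example = "Congratulations! You have won $1000000! Click here to claim now!"
ham_example = "Meeting scheduled for tomorrow at 2 PM in conference room"

spam_features = spam_surface_features(spam_example)
ham_features = spam_surface_features(ham_example)
print(spam_features)
print(ham_features)
```

Features like these could be stacked next to the TF-IDF matrix (for example with `scipy.sparse.hstack`), though whether they help would need to be measured on a realistic dataset.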

## 📈 Cross-Validation Analysis

In [None]:
# Perform stratified k-fold cross-validation
from sklearn.model_selection import cross_validate

cv_results = {}
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

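# Use stratified k-fold for multi-class
# Note: X_train_tfidf was vectorized on the full training set, so each fold's
# vocabulary and IDF weights have already seen its validation texts; these
# scores may therefore be slightly optimistic.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)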

for name, algorithm in algorithms.items():
    print(f"Cross-validating {name}...")
    
    cv_scores = cross_validate(
        algorithm, X_train_tfidf, y_train, cv=cv, scoring=scoring
    )
    
    cv_results[name] = {
        'accuracy': cv_scores['test_accuracy'],
        'precision': cv_scores['test_precision_weighted'],
        'recall': cv_scores['test_recall_weighted'],
        'f1': cv_scores['test_f1_weighted']
    }

# Create CV results DataFrame
cv_summary = []
for name, scores in cv_results.items():
    cv_summary.append({
        'Algorithm': name,
        'Accuracy_Mean': np.mean(scores['accuracy']),
        'Accuracy_Std': np.std(scores['accuracy']),
        'F1_Mean': np.mean(scores['f1']),
        'F1_Std': np.std(scores['f1'])
    })

cv_df = pd.DataFrame(cv_summary)
print("\nCross-Validation Results:")
print(cv_df.round(4).to_string(index=False))
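
The loop above cross-validates on a matrix that was vectorized before the folds were made. A common alternative, sketched below under stated assumptions, wraps the vectorizer and the classifier in a scikit-learn pipeline so that TF-IDF is refit on each training fold only. The toy texts, labels, and the names `leak_free_model`, `toy_cv`, and `toy_scores` are illustrative and not taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Toy, already-cleaned texts: three classes with four examples each
toy_texts = [
    "team win championship game", "player break score record",
    "tournament match upset final", "athlete prepare olympic competition",
    "stock market reach high", "company report earnings growth",
    "merger create industry leader", "bank adapt digital trend",
    "movie break box office record", "series announce season premiere",
    "festival feature music artist", "concert tour sell venue",
]
toy_labels = ['Sports'] * 4 + ['Business'] * 4 + ['Entertainment'] * 4

# The vectorizer is a pipeline step, so it is fit inside each training fold
leak_free_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

toy_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
toy_scores = cross_validate(
    leak_free_model, toy_texts, toy_labels,
    cv=toy_cv, scoring=['accuracy', 'f1_weighted'],
)

print(f"Mean accuracy: {toy_scores['test_accuracy'].mean():.4f}")
print(f"Mean weighted F1: {toy_scores['test_f1_weighted'].mean():.4f}")
```

With a corpus this small the scores themselves mean little; the point is the structure, which passes raw texts to `cross_validate` instead of a precomputed matrix.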

## 📋 Key Takeaways and Best Practices

### **Multi-Class Classification Insights:**
- **Class Balance**: Ensure balanced representation across all classes
- **Feature Selection**: TF-IDF with unigrams and bigrams is a strong baseline for many text classification tasks
- **Algorithm Choice**: Naive Bayes and Logistic Regression are excellent starting points
- **Evaluation Metrics**: Prefer F1-score (macro or weighted) for imbalanced datasets; accuracy is adequate for balanced ones
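
The point about evaluation metrics can be illustrated with a small worked example: on a 9-to-1 imbalanced label set, a classifier that always predicts the majority class scores high accuracy while its macro F1 collapses. The labels and the helper `f1_for_class` below are made up for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 for one class treated as the positive label (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Nine legitimate emails, one spam; the "model" always predicts the majority class
y_true = ['legitimate'] * 9 + ['spam']
y_pred = ['legitimate'] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
classes = sorted(set(y_true))
per_class_f1 = {c: f1_for_class(y_true, y_pred, c) for c in classes}

# Macro F1 weights every class equally; weighted F1 weights by class frequency
macro_f1 = sum(per_class_f1.values()) / len(classes)
weighted_f1 = sum(per_class_f1[c] * y_true.count(c) for c in classes) / len(y_true)

print(f"Accuracy:    {accuracy:.4f}")
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

Accuracy and weighted F1 both look healthy here even though the spam class is never detected; macro F1 exposes the failure.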

### **Performance Optimization:**
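- **Preprocessing**: Text cleaning often helps, but measure its effect; aggressive cleaning can strip useful signals such as digits and punctuation in spam
- **Feature Engineering**: Experiment with different n-gram ranges and max_features
- **Cross-Validation**: Use stratified CV for multi-class problems so each fold preserves the class proportions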
- **Hyperparameter Tuning**: Grid search can improve model performance
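
As a rough sketch of the tuning and feature-engineering points above, the example below runs a small grid search over the n-gram range and the regularization strength `C`. The toy corpus, the step names `tfidf` and `clf`, and the grid values are assumptions chosen for illustration; a real search would use the full dataset and a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy, already-cleaned texts: three classes with three examples each
grid_texts = [
    "team win championship game", "player break score record",
    "tournament match upset final",
    "stock market reach high", "company report earnings growth",
    "merger create industry leader",
    "movie break box office record", "series announce season premiere",
    "festival feature music artist",
]
grid_labels = ['Sports'] * 3 + ['Business'] * 3 + ['Entertainment'] * 3

search_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameter names follow the <step name>__<parameter> convention
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    search_pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='f1_weighted',
)
search.fit(grid_texts, grid_labels)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV weighted F1: {search.best_score_:.4f}")
```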

### **Real-World Applications:**
- **Email Classification**: Spam detection, priority classification
- **News Categorization**: Automatic content organization
- **Customer Support**: Ticket routing and prioritization
- **Content Moderation**: Automatic content filtering

### **Common Challenges:**
- **Class Imbalance**: Use SMOTE, class weights, or stratified sampling
- **New Categories**: Consider online learning or model retraining
- **Multilingual Text**: Language detection and separate models
- **Short Text**: Consider context expansion or embeddings
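
Of the class-imbalance options listed above, class weights are the simplest to reason about. The sketch below computes weights with the heuristic scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count)`, on a made-up 8-to-2 label set. The helper `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up imbalanced labels: eight legitimate emails, two spam
imbalanced_labels = ['legitimate'] * 8 + ['spam'] * 2
weights = balanced_class_weights(imbalanced_labels)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

A dictionary like this can be passed as `class_weight` to estimators such as `LogisticRegression`, or the same effect can be requested directly with `class_weight='balanced'`.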

### **Next Steps:**
1. **Advanced Techniques**: Word embeddings, BERT, transformers
2. **Ensemble Methods**: Combine multiple models for better performance
3. **Active Learning**: Improve models with minimal labeling effort
4. **Production Deployment**: API development and model serving