# Exercise 2: Traditional Text Classification

Welcome to text classification! You'll learn how to automatically categorize text using traditional machine learning methods.

## Learning Objectives
By the end of this exercise, you will be able to:
1. **Feature Engineering**: Convert text to numerical features (TF-IDF, Count Vectors)
2. **Dataset Creation**: Build and prepare text classification datasets
3. **Model Training**: Train traditional ML classifiers (Naive Bayes, SVM, Logistic Regression)
4. **Model Evaluation**: Assess classifier performance using metrics and cross-validation
5. **German Text Classification**: Handle German language specifics
6. **Pipeline Creation**: Build complete classification pipelines

## What You'll Build
- German sentiment analysis system
- Multi-class text classifier
- Feature comparison tools
- Model evaluation framework
- Production-ready classification pipeline

## Applications
- **Sentiment Analysis**: Classify reviews as positive/negative
- **Topic Classification**: Categorize news articles by topic
- **Spam Detection**: Filter unwanted emails
- **Content Moderation**: Identify inappropriate content

**Ready to build your first text classifier?** 🤖

## Exercise 1: Building Your First Text Classifier

**Goal**: Create a German sentiment analysis system using traditional machine learning.

**Your Tasks**: 
1. Prepare text data and labels
2. Create feature vectors from text
3. Train multiple classifiers
4. Evaluate and compare performance

### Setup and Imports

In [None]:
# Essential imports for text classification
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # used for the confusion-matrix heatmap in Step 5
from collections import Counter

# Scikit-learn for machine learning
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Text preprocessing
import re
import string

print("🤖 Text Classification Toolkit Ready!")
print("Available classifiers: Naive Bayes, Logistic Regression, SVM, Random Forest")

### Step 1: Create Sample Dataset

In [None]:
# Create a tiny sample German sentiment dataset.
# In practice, you would load this from a file or API.
texts = [
    "Das ist super gut",      # positive
    "Ich bin sehr glücklich", # positive
    "Fantastisch",           # positive
    "Das ist schlecht",      # negative
    "Ich bin traurig",       # negative
    "Furchtbar"              # negative
]

labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

print("Our simple dataset:")
for text, label in zip(texts, labels):
    print(f"'{text}' -> {label}")
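Before training, it helps to check how many examples each class has, because stratified splitting and cross-validation both depend on the size of the smallest class. The following is a minimal sketch using `collections.Counter`; the `sample_labels` list simply repeats the labels above so the block runs on its own.

```python
from collections import Counter

# Same labels as the toy dataset above, repeated so this block is self-contained
sample_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

label_counts = Counter(sample_labels)
total = sum(label_counts.values())

# Print the count and share of each class
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count} ({count / total:.0%})")

# The smallest class size limits how many stratified CV folds you can use
min_class_size = min(label_counts.values())
print(f"Smallest class size: {min_class_size}")
```

With only three examples per class, any stratified cross-validation on this data is limited to very few folds, which is worth keeping in mind for the later steps.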

### Step 2: Text Preprocessing

In [None]:
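# Small illustrative German stopword list (an assumption for this exercise;
# swap in a fuller list, e.g. from NLTK or spaCy, for real data)
german_stopwords = {
    "der", "die", "das", "und", "ist", "ich", "bin", "ein", "eine",
    "von", "mit", "zu", "in", "es",
}

def preprocess_text(text, remove_stopwords=True, min_length=2):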
    """
    Preprocess German text for classification.
    
    Args:
        text (str): Input text
        remove_stopwords (bool): Whether to remove stopwords
        min_length (int): Minimum word length to keep
    
    Returns:
        str: Preprocessed text
    """
    # TODO: Implement text preprocessing:
    # 1. Convert to lowercase
    # 2. Remove special characters and digits
    # 3. Remove extra whitespace
    # 4. Optionally remove stopwords
    # 5. Filter words by minimum length
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and digits, keep only letters and spaces
    text = re.sub(r'[^a-zäöüß\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    # Optionally remove stopwords, then always filter by minimum length
    words = text.split()
    if remove_stopwords:
        words = [word for word in words if word not in german_stopwords]
    words = [word for word in words if len(word) >= min_length]
    text = ' '.join(words)
    
    return text

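# Collect the sample data in a DataFrame, then apply preprocessing
df = pd.DataFrame({'text': texts, 'sentiment': labels})
df['text_processed'] = df['text'].apply(preprocess_text)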

print("Preprocessing example:")
for i in range(3):
    print(f"Original: {df['text'].iloc[i]}")
    print(f"Processed: {df['text_processed'].iloc[i]}")
    print()
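One German-specific pitfall: an umlaut can be stored either as a single character or as a base letter plus a combining mark. A regex that keeps only `a-zäöüß` would strip the combining mark and silently turn "glücklich" into "glucklich". The sketch below normalizes to Unicode NFC first. The helper name `normalize_german` is hypothetical and not part of the exercise code.

```python
import re
import unicodedata

def normalize_german(text):
    """Normalize to NFC, lowercase, and keep only German letters and spaces."""
    # NFC merges 'u' + combining diaeresis into the single character 'ü'
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^a-zäöüß\s]", "", text)
    # Collapse runs of whitespace left behind by the removed characters
    return " ".join(text.split())

decomposed = "Ich bin sehr glu\u0308cklich!"  # 'u' followed by U+0308
print(normalize_german(decomposed))
print(normalize_german("Preis: 20 Euro!!"))
```

If your data comes from mixed sources (web scraping, PDFs, macOS file names), applying this normalization before the regex step is a cheap safeguard.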

### Step 3: Feature Extraction Comparison

In [None]:
def compare_feature_extraction_methods(texts, labels):
    """
    Compare different feature extraction methods.
    
    Args:
        texts (list): List of preprocessed texts
        labels (list): List of labels
    
    Returns:
        tuple: (dict of results per feature extraction method,
                (X_train, X_test, y_train, y_test) split)
    """
    # TODO: Implement and compare the following feature extraction methods:
    # 1. Count Vectorizer (Bag of Words)
    # 2. TF-IDF Vectorizer (word level)
    # 3. TF-IDF Vectorizer with n-grams
    
    methods = {}
    
    # Split data for testing
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42, stratify=labels
    )
    
    # Method 1: Count Vectorizer (Bag of Words)
    count_vectorizer = CountVectorizer(max_features=1000)
    X_train_count = count_vectorizer.fit_transform(X_train)
    X_test_count = count_vectorizer.transform(X_test)
    
    # Train a simple Naive Bayes classifier
    nb_count = MultinomialNB()
    nb_count.fit(X_train_count, y_train)
    count_accuracy = nb_count.score(X_test_count, y_test)
    
    methods['Count Vectorizer'] = {
        'vectorizer': count_vectorizer,
        'accuracy': count_accuracy,
        'feature_count': X_train_count.shape[1]
    }
    
    # Method 2: TF-IDF (word level)
    tfidf_vectorizer = TfidfVectorizer(max_features=1000)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    nb_tfidf = MultinomialNB()
    nb_tfidf.fit(X_train_tfidf, y_train)
    tfidf_accuracy = nb_tfidf.score(X_test_tfidf, y_test)
    
    methods['TF-IDF Words'] = {
        'vectorizer': tfidf_vectorizer,
        'accuracy': tfidf_accuracy,
        'feature_count': X_train_tfidf.shape[1]
    }
    
    # Method 3: TF-IDF with n-grams
    tfidf_ngram = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
    X_train_ngram = tfidf_ngram.fit_transform(X_train)
    X_test_ngram = tfidf_ngram.transform(X_test)
    
    nb_ngram = MultinomialNB()
    nb_ngram.fit(X_train_ngram, y_train)
    ngram_accuracy = nb_ngram.score(X_test_ngram, y_test)
    
    methods['TF-IDF N-grams'] = {
        'vectorizer': tfidf_ngram,
        'accuracy': ngram_accuracy,
        'feature_count': X_train_ngram.shape[1]
    }
    
    return methods, (X_train, X_test, y_train, y_test)

# Compare feature extraction methods
feature_methods, data_split = compare_feature_extraction_methods(
    df['text_processed'].tolist(), 
    df['sentiment'].tolist()
)

print("Feature Extraction Method Comparison:")
for method_name, results in feature_methods.items():
    print(f"{method_name}:")
    print(f"  Accuracy: {results['accuracy']:.3f}")
    print(f"  Feature Count: {results['feature_count']}")
    print()

### Step 4: Multi-Algorithm Comparison

In [None]:
def compare_classification_algorithms(X_train, X_test, y_train, y_test, vectorizer):
    """
    Compare different classification algorithms.
    
    Args:
        X_train, X_test: Training and test texts
        y_train, y_test: Training and test labels
        vectorizer: Fitted vectorizer to use
    
    Returns:
        dict: Results for different algorithms
    """
    # TODO: Implement and compare the following algorithms:
    # 1. Naive Bayes
    # 2. Support Vector Machine
    # 3. Logistic Regression
    # 4. Random Forest
    
    # Transform texts to features
    X_train_vec = vectorizer.transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    
    algorithms = {
        'Naive Bayes': MultinomialNB(),
        'SVM': SVC(kernel='linear', random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
    }
    
    results = {}
    
    for name, algorithm in algorithms.items():
        # Train the algorithm
        algorithm.fit(X_train_vec, y_train)
        
        # Make predictions
        y_pred = algorithm.predict(X_test_vec)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        
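        # Cross-validation score (cap the fold count at the smallest class size
        # so stratified CV also works on very small datasets)
        n_splits = min(5, min(Counter(y_train).values()))
        cv_scores = cross_val_score(algorithm, X_train_vec, y_train, cv=n_splits)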
        
        results[name] = {
            'model': algorithm,
            'accuracy': accuracy,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'predictions': y_pred,
            'classification_report': classification_report(y_test, y_pred)
        }
    
    return results

# Use the TF-IDF n-gram vectorizer (check the comparison above to confirm
# it is actually the best choice on your data)
best_vectorizer = feature_methods['TF-IDF N-grams']['vectorizer']
X_train, X_test, y_train, y_test = data_split

# Compare algorithms
algorithm_results = compare_classification_algorithms(
    X_train, X_test, y_train, y_test, best_vectorizer
)

print("Classification Algorithm Comparison:")
for name, results in algorithm_results.items():
    print(f"\n{name}:")
    print(f"  Test Accuracy: {results['accuracy']:.3f}")
    print(f"  CV Score: {results['cv_mean']:.3f} (±{results['cv_std']:.3f})")
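The cross-validation above runs on features from a vectorizer that was already fitted on the whole training set, so each validation fold has indirectly influenced the vocabulary and IDF weights. A common alternative, sketched here under the assumption of a small invented corpus (`toy_texts`, `toy_labels`), is to put the vectorizer inside a pipeline so it is refitted within every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus, for illustration only
toy_texts = [
    "super gut", "sehr gut", "wirklich super", "gut und super",
    "sehr schlecht", "wirklich schlecht", "furchtbar schlecht", "schlecht und furchtbar",
]
toy_labels = ["positive"] * 4 + ["negative"] * 4

# The vectorizer is part of the model, so it is fitted on each training fold only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, toy_texts, toy_labels, cv=cv)

print(f"Fold accuracies: {scores}")
print(f"Mean: {scores.mean():.3f} (±{scores.std():.3f})")
```

On a dataset this small the fold scores will vary a lot; the point of the sketch is the structure, not the numbers.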

### Step 5: Detailed Evaluation

In [None]:
def detailed_evaluation(results_dict, y_test, algorithm_name='Logistic Regression'):
    """
    Perform detailed evaluation of the best performing algorithm.
    
    Args:
        results_dict (dict): Algorithm results dictionary
        y_test (list): True test labels
        algorithm_name (str): Name of algorithm to evaluate
    """
    # TODO: Create detailed evaluation including:
    # 1. Confusion matrix visualization
    # 2. Classification report
    # 3. Feature importance (if available)
    
    if algorithm_name not in results_dict:
        print(f"Algorithm {algorithm_name} not found in results")
        return
    
    results = results_dict[algorithm_name]
    y_pred = results['predictions']
    
    # Create subplots
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
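    # Confusion Matrix (pass the labels explicitly so the tick labels always
    # match the matrix rows and columns)
    class_labels = sorted(set(y_test) | set(y_pred))
    cm = confusion_matrix(y_test, y_pred, labels=class_labels)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_labels,
                yticklabels=class_labels,
                ax=axes[0])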
    axes[0].set_title(f'Confusion Matrix - {algorithm_name}')
    axes[0].set_xlabel('Predicted Label')
    axes[0].set_ylabel('True Label')
    
    # Algorithm Comparison Bar Plot
    alg_names = list(results_dict.keys())
    accuracies = [results_dict[name]['accuracy'] for name in alg_names]
    
    axes[1].bar(alg_names, accuracies, color=['skyblue', 'lightcoral', 'lightgreen', 'orange'])
    axes[1].set_title('Algorithm Accuracy Comparison')
    axes[1].set_ylabel('Accuracy')
    axes[1].set_ylim(0, 1)
    axes[1].tick_params(axis='x', rotation=45)
    
    # Add value labels on bars
    for i, v in enumerate(accuracies):
        axes[1].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed classification report
    print(f"\nDetailed Classification Report for {algorithm_name}:")
    print(results['classification_report'])

# Evaluate one algorithm in detail (Logistic Regression here; substitute the
# best performer from the comparison above)
detailed_evaluation(algorithm_results, y_test, 'Logistic Regression')
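The numbers in the classification report come directly from the confusion matrix. As a worked example, the sketch below computes precision, recall, and F1 by hand for one class. The labels in `toy_true` and `toy_pred` are invented, and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
# Invented labels for illustration only
toy_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
toy_pred = ["positive", "positive", "negative", "negative", "negative", "positive"]

def binary_metrics(y_true, y_pred, positive_label):
    """Compute precision, recall, and F1 for one class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

precision, recall, f1 = binary_metrics(toy_true, toy_pred, "positive")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted positives are correct (precision 2/3) and two of the three true positives are found (recall 2/3), so F1 is also 2/3.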

### Step 6: Pipeline Creation and Hyperparameter Tuning

In [None]:
def create_optimized_pipeline(texts, labels):
    """
    Create an optimized classification pipeline with hyperparameter tuning.
    
    Args:
        texts (list): List of texts
        labels (list): List of labels
    
    Returns:
        tuple: (best pipeline found by the grid search,
                (X_train, X_test, y_train, y_test) split)
    """
    # TODO: Create a pipeline with:
    # 1. TF-IDF Vectorization
    # 2. Classification algorithm
    # 3. Hyperparameter tuning using GridSearchCV
    
    # Create pipeline
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', LogisticRegression(random_state=42))
    ])
    
    # Define parameter grid for tuning
    param_grid = {
        'tfidf__max_features': [500, 1000, 2000],
        'tfidf__ngram_range': [(1, 1), (1, 2)],
        'tfidf__min_df': [1, 2],
        'classifier__C': [0.1, 1, 10]
    }
    
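    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42, stratify=labels
    )
    
    # Perform grid search (cap the fold count at the smallest class size so
    # stratified CV also works on very small datasets)
    n_splits = min(5, min(Counter(y_train).values()))
    grid_search = GridSearchCV(
        pipeline, param_grid, cv=n_splits, scoring='accuracy', n_jobs=-1, verbose=1
    )
    
    # Fit the grid search
    grid_search.fit(X_train, y_train)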
    
    # Get best pipeline
    best_pipeline = grid_search.best_estimator_
    
    # Evaluate on test set
    test_accuracy = best_pipeline.score(X_test, y_test)
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
    print(f"Test accuracy: {test_accuracy:.3f}")
    
    return best_pipeline, (X_train, X_test, y_train, y_test)

# Create optimized pipeline
print("Creating optimized pipeline (this may take a moment)...")
best_pipeline, final_data_split = create_optimized_pipeline(
    df['text_processed'].tolist(), 
    df['sentiment'].tolist()
)
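Grid search cost grows multiplicatively with every parameter you add, so it is worth estimating the number of model fits before starting a run. The sketch below enumerates the same grid as above with `itertools.product`, assuming 5 cross-validation folds.

```python
from itertools import product

# Same grid as in create_optimized_pipeline, repeated so this block runs alone
param_grid = {
    'tfidf__max_features': [500, 1000, 2000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2],
    'classifier__C': [0.1, 1, 10],
}
cv_folds = 5  # assumed number of cross-validation folds

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {total_fits} fits")
print("First combination:", combinations[0])
```

The grid search also refits the best combination once on the full training set at the end. If the fit count becomes too large, shrinking the grid or switching to a randomized search are the usual options.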

### Step 7: Model Testing and Prediction

In [None]:
def test_model_predictions(pipeline, test_texts=None):
    """
    Test the trained model with new text examples.
    
    Args:
        pipeline: Trained classification pipeline
        test_texts (list): Optional list of test texts
    
    Returns:
        list: Predictions for test texts
    """
    # TODO: Test the model with new examples and get predictions with confidence
    
    if test_texts is None:
        test_texts = [
            "Das Produkt ist absolut fantastisch und sehr empfehlenswert!",
            "Ich bin sehr enttäuscht von der schlechten Qualität.",
            "Das ist ein ganz normales Produkt, nichts Besonderes.",
            "Hervorragende Leistung und ausgezeichneter Service!",
            "Furchtbar schlecht, das Geld nicht wert."
        ]
    
    # Preprocess test texts
    processed_texts = [preprocess_text(text) for text in test_texts]
    
    # Make predictions
    predictions = pipeline.predict(processed_texts)
    
    # Get prediction probabilities (not every classifier supports predict_proba)
    try:
        probabilities = pipeline.predict_proba(processed_texts)
        classes = pipeline.classes_
    except AttributeError:
        probabilities = None
        classes = None
    
    print("Model Predictions on New Texts:")
    print("=" * 80)
    
    for i, (original, processed, pred) in enumerate(zip(test_texts, processed_texts, predictions)):
        print(f"\nExample {i+1}:")
        print(f"Original: {original}")
        print(f"Processed: {processed}")
        print(f"Prediction: {pred}")
        
        if probabilities is not None:
            print("Confidence scores:")
            for class_name, prob in zip(classes, probabilities[i]):
                print(f"  {class_name}: {prob:.3f}")
    
    return predictions

# Test the model
predictions = test_model_predictions(best_pipeline)
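The classifier only knows the labels it was trained on, so a neutral sentence such as the third example is still forced into one of the two classes. One possible workaround, sketched here as an assumption rather than as part of the exercise, is to treat low-confidence predictions as uncertain. The helper `label_with_threshold` and the 0.6 cutoff are hypothetical.

```python
def label_with_threshold(classes, probabilities, threshold=0.6, fallback="uncertain"):
    """Return the most probable class, or a fallback if confidence is too low."""
    # Index of the class with the highest probability
    best_index = max(range(len(classes)), key=lambda i: probabilities[i])
    if probabilities[best_index] < threshold:
        return fallback
    return classes[best_index]

# Invented probability rows, in the same layout predict_proba would return
example_classes = ["negative", "positive"]
print(label_with_threshold(example_classes, [0.10, 0.90]))
print(label_with_threshold(example_classes, [0.45, 0.55]))
```

A sensible threshold depends on the data, and probabilities from a model trained on six sentences are not well calibrated, so treat the cutoff as something to tune on a validation set.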

### Step 8: Feature Analysis

In [None]:
def analyze_important_features(pipeline, top_n=10):
    """
    Analyze the most important features for each class.
    
    Args:
        pipeline: Trained pipeline
        top_n (int): Number of top features to show per class
    """
    # TODO: Extract and analyze the most important features:
    # 1. Get feature names from vectorizer
    # 2. Get feature coefficients from classifier
    # 3. Find top features for each class
    
    try:
        # Get vectorizer and classifier from pipeline
        vectorizer = pipeline.named_steps['tfidf']
        classifier = pipeline.named_steps['classifier']
        
        # Get feature names
        feature_names = vectorizer.get_feature_names_out()
        
        # Get coefficients (works for LogisticRegression)
        if hasattr(classifier, 'coef_'):
            coefficients = classifier.coef_
            classes = classifier.classes_
            
            print("Most Important Features by Class:")
            print("=" * 50)
            
            for i, class_name in enumerate(classes):
                print(f"\n{class_name.upper()} class:")
                
                if len(classes) == 2:  # Binary classification
                    # For binary classification, use the single coefficient vector
                    coef = coefficients[0] if i == 1 else -coefficients[0]
                else:  # Multi-class classification
                    coef = coefficients[i]
                
                # Get top positive coefficients (most indicative)
                top_indices = np.argsort(coef)[-top_n:]
                top_features = [(feature_names[idx], coef[idx]) for idx in reversed(top_indices)]
                
                print("Most indicative features:")
                for feature, score in top_features:
                    print(f"  {feature}: {score:.3f}")
        else:
            print("Feature importance analysis not available for this classifier type.")
    
    except Exception as e:
        print(f"Error in feature analysis: {e}")

# Analyze important features
analyze_important_features(best_pipeline)
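For binary logistic regression, `coef_` holds a single row of weights: positive values push a text toward the second entry of `classes_`, negative values toward the first. That is why the function above negates the coefficients for the first class. The sketch below shows the ranking step on its own, using made-up feature names and weights.

```python
import numpy as np

# Invented feature names and weights, for illustration only
feature_names = np.array(["furchtbar", "gut", "schlecht", "super", "traurig"])
coef = np.array([-1.2, 0.9, -1.5, 1.4, -0.7])  # weights toward the second class
classes = ["negative", "positive"]

top_n = 2
order = np.argsort(coef)  # indices sorted from lowest to highest weight

# Highest weights are most indicative of classes[1]
top_positive = list(feature_names[order[::-1][:top_n]])
# Lowest (most negative) weights are most indicative of classes[0]
top_negative = list(feature_names[order[:top_n]])

print(f"{classes[1]}: {top_positive}")
print(f"{classes[0]}: {top_negative}")
```

Keep in mind that coefficient size is only comparable across features when the features are on a similar scale, which TF-IDF with row normalization roughly provides.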

## Exercise Tasks

Complete the following tasks to deepen your understanding:

1. **Dataset Expansion**: 
   - Create a larger, more diverse German sentiment dataset
   - Include different domains (movies, products, restaurants)
   - Handle class imbalance using techniques like SMOTE

2. **Advanced Feature Engineering**:
   - Implement character-level n-grams
   - Add sentiment lexicon features
   - Include text length and other statistical features

3. **Cross-validation Strategy**:
   - Implement stratified k-fold cross-validation
   - Compare different validation strategies
   - Analyze variance in model performance

4. **Error Analysis**:
   - Identify misclassified examples
   - Analyze common error patterns
   - Suggest improvements based on errors

5. **Model Deployment**:
   - Save and load the trained model
   - Create a simple web interface for predictions
   - Implement batch prediction functionality
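As a starting point for the character-level n-grams in Task 2, the sketch below shows why they suit German: inflected and derived forms share most of their character trigrams even when the whole words differ. The helper `char_ngrams` is hypothetical. In scikit-learn, the corresponding option is the `analyzer='char'` or `analyzer='char_wb'` setting of the vectorizers.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Two related word forms that a word-level vectorizer treats as unrelated
trigrams_a = set(char_ngrams("enttäuscht", 3))
trigrams_b = set(char_ngrams("enttäuschung", 3))

shared = trigrams_a & trigrams_b
print(f"Trigrams in 'enttäuscht': {sorted(trigrams_a)}")
print(f"Shared with 'enttäuschung': {sorted(shared)}")
print(f"Overlap: {len(shared)} of {len(trigrams_a | trigrams_b)} distinct trigrams")
```

The trade-off is a much larger feature space, so character n-grams are usually combined with `max_features` or `min_df` limits.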

## Reflection Questions

1. Which feature extraction method worked best and why?
2. How does the choice of preprocessing affect model performance?
3. What are the trade-offs between different classification algorithms?
4. How would you handle a much larger dataset?
5. What challenges are specific to German text classification?

## Next Steps

- Explore ensemble methods (voting, bagging, boosting)
- Learn about advanced feature selection techniques
- Study domain adaptation for different text types
- Prepare for deep learning approaches (next topic!)