# Text Classification with Machine Learning

This notebook demonstrates how to build text classification models for categorizing documents.

## What you'll learn:
- Text feature extraction (Bag of Words, TF-IDF)
- Multiple classification algorithms
- Model evaluation and comparison
- Stratified splitting to keep class proportions equal across train and test sets
- Building a complete classification pipeline

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Creating Sample Dataset
We'll create a news classification dataset with different categories.

In [None]:
# Sample news articles for different categories
sports_articles = [
    "The basketball team won their championship game with a score of 98-87",
    "Football season starts next month with high expectations for the home team",
    "Tennis player advances to semifinals after defeating opponent in straight sets",
    "Baseball game postponed due to rain, rescheduled for tomorrow evening",
    "Olympic swimmer breaks world record in 200m freestyle competition",
    "Soccer match ends in draw after intense 90-minute battle",
    "Hockey team trades star player to rival team for draft picks",
    "Marathon runner completes race in personal best time despite challenging weather",
    "Golf tournament concludes with surprise victory by amateur player",
    "Boxing match scheduled for next weekend features heavyweight champions"
]

technology_articles = [
    "New smartphone features advanced AI camera technology and 5G connectivity",
    "Software company releases major update with improved security features",
    "Artificial intelligence breakthrough enables more accurate medical diagnosis",
    "Electric car manufacturer announces new battery technology with extended range",
    "Tech startup develops innovative app for remote work collaboration",
    "Quantum computing research shows promising results for encryption algorithms",
    "Social media platform introduces new privacy controls for user data",
    "Robotics company unveils autonomous delivery system for urban areas",
    "Cloud computing service expands to new regions with enhanced performance",
    "Virtual reality headset offers immersive experience for gaming and education"
]

health_articles = [
    "Medical study reveals new treatment options for diabetes patients",
    "Health experts recommend daily exercise for cardiovascular wellness",
    "Vaccine research shows promising results in clinical trials",
    "Nutrition guidelines updated to include more plant-based food options",
    "Mental health awareness campaign launches in schools nationwide",
    "Hospital introduces new surgical technique with faster recovery times",
    "Pharmaceutical company develops innovative drug for rare disease treatment",
    "Health insurance coverage expands to include preventive care services",
    "Medical device helps patients monitor vital signs at home",
    "Research indicates meditation benefits for stress reduction and sleep quality"
]

# Create dataset
texts = sports_articles + technology_articles + health_articles
labels = ['sports'] * len(sports_articles) + ['technology'] * len(technology_articles) + ['health'] * len(health_articles)

df = pd.DataFrame({
    'text': texts,
    'category': labels
})

print(f"Dataset created with {len(df)} articles")
print(f"Categories: {df['category'].value_counts().to_dict()}")
print("\nFirst few examples:")
print(df.head())

## Exploratory Data Analysis

In [None]:
# Text length analysis
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Category distribution
df['category'].value_counts().plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Articles per Category')
axes[0, 0].set_xlabel('Category')
axes[0, 0].set_ylabel('Count')
axes[0, 0].tick_params(axis='x', rotation=45)

# Text length distribution
df['text_length'].hist(bins=15, ax=axes[0, 1], color='lightcoral', alpha=0.7)
axes[0, 1].set_title('Distribution of Text Length')
axes[0, 1].set_xlabel('Characters')
axes[0, 1].set_ylabel('Frequency')

# Word count by category
df.boxplot(column='word_count', by='category', ax=axes[1, 0])
axes[1, 0].set_title('Word Count by Category')
axes[1, 0].set_xlabel('Category')
axes[1, 0].set_ylabel('Word Count')

# Average text length by category
avg_length = df.groupby('category')['text_length'].mean()
avg_length.plot(kind='bar', ax=axes[1, 1], color='lightgreen')
axes[1, 1].set_title('Average Text Length by Category')
axes[1, 1].set_xlabel('Category')
axes[1, 1].set_ylabel('Average Characters')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("Dataset Statistics:")
print(df.groupby('category')[['text_length', 'word_count']].describe())

## Feature Extraction
Converting text to numerical features that machine learning algorithms can understand.

In [None]:
# Split the data
X = df['text']
y = df['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Training distribution: {pd.Series(y_train).value_counts().to_dict()}")
print(f"Test distribution: {pd.Series(y_test).value_counts().to_dict()}")

In [None]:
# Method 1: Bag of Words (Count Vectorizer)
count_vectorizer = CountVectorizer(max_features=500, stop_words='english', lowercase=True)
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

print("Bag of Words Features:")
print(f"Feature matrix shape: {X_train_counts.shape}")
print(f"Number of unique words: {len(count_vectorizer.vocabulary_)}")
print(f"Sample feature names: {list(count_vectorizer.get_feature_names_out())[:10]}")

In [None]:
# Method 2: TF-IDF (Term Frequency-Inverse Document Frequency)
tfidf_vectorizer = TfidfVectorizer(max_features=500, stop_words='english', lowercase=True)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("TF-IDF Features:")
print(f"Feature matrix shape: {X_train_tfidf.shape}")
print(f"Number of unique words: {len(tfidf_vectorizer.vocabulary_)}")

# Show top TF-IDF features for each category
def show_top_features(vectorizer, X, y, category, top_n=5):
    """Show top features for a specific category"""
    # Get indices for the category
    category_indices = [i for i, label in enumerate(y) if label == category]
    
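    # Sum TF-IDF scores for the category; np.asarray(...).ravel() flattens the
    # result whether the sparse sum returns a matrix or a plain array
    category_tfidf = np.asarray(X[category_indices].sum(axis=0)).ravel()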
    
    # Get feature names
    feature_names = vectorizer.get_feature_names_out()
    
    # Get top features
    top_indices = category_tfidf.argsort()[-top_n:][::-1]
    top_features = [(feature_names[i], category_tfidf[i]) for i in top_indices]
    
    return top_features

print("\nTop TF-IDF features by category:")
for category in df['category'].unique():
    top_features = show_top_features(tfidf_vectorizer, X_train_tfidf, y_train, category)
    print(f"\n{category.upper()}:")
    for feature, score in top_features:
        print(f"  {feature}: {score:.3f}")

## Building Classification Models
We'll compare different algorithms to see which works best for our data.

In [None]:
# Define classifiers
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42, kernel='linear'),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}

# Train and evaluate models with both feature types
from sklearn.base import clone

results = {}

for feature_type, (X_train_feat, X_test_feat) in [('Count', (X_train_counts, X_test_counts)), 
                                                   ('TF-IDF', (X_train_tfidf, X_test_tfidf))]:
    results[feature_type] = {}
    
    for name, classifier in classifiers.items():
        # Fit a fresh copy so the model stored for one feature type is not
        # overwritten when the same estimator is refit on the other features
        model = clone(classifier)
        model.fit(X_train_feat, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test_feat)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        
        # Store results
        results[feature_type][name] = {
            'accuracy': accuracy,
            'predictions': y_pred,
            'model': model
        }
        
        print(f"{feature_type} - {name}: {accuracy:.3f}")

print("\nModel training completed!")

## Model Evaluation and Comparison

In [None]:
# Create comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Accuracy comparison
count_accuracies = [results['Count'][name]['accuracy'] for name in classifiers.keys()]
tfidf_accuracies = [results['TF-IDF'][name]['accuracy'] for name in classifiers.keys()]

x = np.arange(len(classifiers))
width = 0.35

axes[0].bar(x - width/2, count_accuracies, width, label='Count Vectorizer', color='skyblue')
axes[0].bar(x + width/2, tfidf_accuracies, width, label='TF-IDF', color='lightcoral')
axes[0].set_xlabel('Classifiers')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Performance Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(classifiers.keys(), rotation=45)
axes[0].legend()
axes[0].set_ylim(0, 1)

# Add value labels on bars
for i, (count_acc, tfidf_acc) in enumerate(zip(count_accuracies, tfidf_accuracies)):
    axes[0].text(i - width/2, count_acc + 0.01, f'{count_acc:.3f}', ha='center', va='bottom')
    axes[0].text(i + width/2, tfidf_acc + 0.01, f'{tfidf_acc:.3f}', ha='center', va='bottom')

# Find best model
best_feature_type = 'TF-IDF' if max(tfidf_accuracies) > max(count_accuracies) else 'Count'
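# Pick the classifier name with the highest accuracy for that feature type
# (np.argmax on a generator expression would always return index 0)
best_classifier = max(classifiers, key=lambda name: results[best_feature_type][name]['accuracy'])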
best_accuracy = results[best_feature_type][best_classifier]['accuracy']

print(f"Best model: {best_classifier} with {best_feature_type} features")
print(f"Best accuracy: {best_accuracy:.3f}")

# Confusion matrix for best model
# Pass explicit labels so the tick labels match the matrix's row/column order
best_predictions = results[best_feature_type][best_classifier]['predictions']
class_labels = sorted(df['category'].unique())
cm = confusion_matrix(y_test, best_predictions, labels=class_labels)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1],
            xticklabels=class_labels,
            yticklabels=class_labels)
axes[1].set_title(f'Confusion Matrix - {best_classifier} ({best_feature_type})')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

In [None]:
# Detailed classification report for best model
print(f"Detailed Classification Report - {best_classifier} ({best_feature_type}):")
print("=" * 60)
print(classification_report(y_test, best_predictions))

# Show some predictions
print("\nSample Predictions:")
print("=" * 40)
for i, (text, true_label, pred_label) in enumerate(zip(X_test, y_test, best_predictions)):
    if i < 5:  # Show first 5 predictions
        status = "✓" if true_label == pred_label else "✗"
        print(f"{status} Text: {text[:60]}...")
        print(f"  True: {true_label}, Predicted: {pred_label}")
        print("-" * 40)

## Building a Complete Classification Pipeline

In [None]:
# Create a pipeline with the best performing combination
if best_feature_type == 'TF-IDF':
    vectorizer = TfidfVectorizer(max_features=500, stop_words='english', lowercase=True)
else:
    vectorizer = CountVectorizer(max_features=500, stop_words='english', lowercase=True)

pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifiers[best_classifier])
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Test the pipeline
pipeline_predictions = pipeline.predict(X_test)
pipeline_accuracy = accuracy_score(y_test, pipeline_predictions)

print(f"Pipeline Accuracy: {pipeline_accuracy:.3f}")

# Function to classify new text
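def classify_text(text):
    """
    Classify a new piece of text.
    
    Args:
        text (str): Text to classify
    
    Returns:
        tuple: (predicted category, confidence). Confidence is the highest
            class probability, or NaN if the model has no predict_proba
            (e.g. SVC without probability=True).
    """
    prediction = pipeline.predict([text])[0]
    if hasattr(pipeline, 'predict_proba'):
        confidence = pipeline.predict_proba([text])[0].max()
    else:
        confidence = float('nan')
    
    return prediction, confidence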

# Test with new examples
test_texts = [
    "The soccer team scored three goals in the first half of the match",
    "Researchers develop new machine learning algorithm for data analysis",
    "Clinical trials show positive results for new cancer treatment drug"
]

print("\nClassifying New Texts:")
print("=" * 40)
for text in test_texts:
    prediction, confidence = classify_text(text)
    print(f"Text: {text}")
    print(f"Predicted Category: {prediction} (confidence: {confidence:.3f})")
    print("-" * 40)
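
The "confidence" reported above is the largest value returned by `predict_proba`. As a rough illustration of where that number comes from, the sketch below trains a miniature pipeline on four hypothetical sentences (`toy_texts` and `toy_targets` are invented here, and Naive Bayes is chosen only because it always exposes `predict_proba`) and prints the probability assigned to each class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set, for illustration only
toy_texts = [
    "team wins the final match",
    "player scores in the game",
    "new phone uses faster chip",
    "software update improves security",
]
toy_targets = ['sports', 'sports', 'technology', 'technology']

toy_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_targets)

# predict_proba returns one row per input text and one column per class
proba = toy_pipeline.predict_proba(["the team wins the game"])[0]

# Columns follow the order of toy_pipeline.classes_
for class_name, p in zip(toy_pipeline.classes_, proba):
    print(f"{class_name}: {p:.3f}")

top_class = toy_pipeline.classes_[np.argmax(proba)]
print("Predicted:", top_class)
```

The probabilities in a row sum to 1, so a confidence close to 1 / (number of classes) means the model is barely preferring one class over the others.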

## Cross-Validation for Robust Evaluation

In [None]:
# Perform cross-validation on the full dataset
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

print("Cross-Validation Results:")
print(f"Scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Visualize CV scores
plt.figure(figsize=(10, 6))
plt.bar(range(1, 6), cv_scores, color='lightblue', alpha=0.7)
plt.axhline(y=cv_scores.mean(), color='red', linestyle='--', label=f'Mean: {cv_scores.mean():.3f}')
plt.xlabel('Cross-Validation Fold')
plt.ylabel('Accuracy')
plt.title('Cross-Validation Scores')
plt.legend()
plt.ylim(0, 1)

# Add value labels on bars
for i, score in enumerate(cv_scores):
    plt.text(i + 1, score + 0.01, f'{score:.3f}', ha='center', va='bottom')

plt.show()
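
When `cv` is an integer and the estimator is a classifier, `cross_val_score` splits the data with stratified folds. The sketch below shows what those folds look like for a label layout like this notebook's (10 samples in each of 3 classes); `toy_labels` and `toy_inputs` are placeholders introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels mirroring the notebook: 10 samples for each of 3 classes
toy_labels = np.array(['sports'] * 10 + ['technology'] * 10 + ['health'] * 10)
toy_inputs = np.arange(len(toy_labels)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(skf.split(toy_inputs, toy_labels), start=1):
    # Count how many test samples each class contributes to this fold
    test_classes, test_counts = np.unique(toy_labels[test_idx], return_counts=True)
    per_class = {str(c): int(n) for c, n in zip(test_classes, test_counts)}
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, test classes={per_class}")
```

Each fold holds out 6 samples, 2 per class, so every accuracy score above is computed on only 6 predictions. That explains why individual fold scores can jump in large steps.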

## Feature Importance Analysis

In [None]:
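# For linear models, we can analyze feature importance through the coefficients.
# Note: SVC trains one-vs-one classifiers, so its coef_ rows correspond to class
# pairs (and are sparse for sparse input), not to individual classes. The
# per-class plot below therefore applies to Logistic Regression only.
if best_classifier == 'Logistic Regression':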
    # Get feature names
    feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
    
    # Get coefficients
    if hasattr(pipeline.named_steps['classifier'], 'coef_'):
        coefficients = pipeline.named_steps['classifier'].coef_
        
        # Plot top features for each class
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        classes = pipeline.classes_
        
        for idx, (class_name, ax) in enumerate(zip(classes, axes)):
            # Get top positive and negative coefficients
            class_coef = coefficients[idx]
            top_indices = np.argsort(np.abs(class_coef))[-10:]
            top_features = [(feature_names[i], class_coef[i]) for i in top_indices]
            
            # Sort by coefficient value
            top_features.sort(key=lambda x: x[1])
            
            features, values = zip(*top_features)
            colors = ['red' if v < 0 else 'green' for v in values]
            
            ax.barh(range(len(features)), values, color=colors, alpha=0.7)
            ax.set_yticks(range(len(features)))
            ax.set_yticklabels(features)
            ax.set_title(f'Top Features for {class_name.title()}')
            ax.set_xlabel('Coefficient Value')
        
        plt.tight_layout()
        plt.show()
        
        print("Feature Importance Analysis:")
        print("Green bars: Positive association with class")
        print("Red bars: Negative association with class")
        
elif best_classifier == 'Random Forest':
    # For Random Forest, show feature importance
    feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
    importances = pipeline.named_steps['classifier'].feature_importances_
    
    # Get top 20 most important features
    top_indices = np.argsort(importances)[-20:]
    top_features = [(feature_names[i], importances[i]) for i in top_indices]
    top_features.sort(key=lambda x: x[1])
    
    features, values = zip(*top_features)
    
    plt.figure(figsize=(10, 8))
    plt.barh(range(len(features)), values, color='skyblue', alpha=0.7)
    plt.yticks(range(len(features)), features)
    plt.xlabel('Feature Importance')
    plt.title('Top 20 Most Important Features (Random Forest)')
    plt.tight_layout()
    plt.show()

## Interactive Classification Demo

In [None]:
def interactive_classifier():
    """
    Interactive function to classify user input.
    """
    print("Text Classification Demo")
    print("=" * 30)
    print("Categories: sports, technology, health")
    print("Enter your text below (or 'quit' to stop):\n")
    
    while True:
        user_text = input("Enter text: ")
        
        if user_text.lower() == 'quit':
            print("Thanks for using the classifier!")
            break
        
        if user_text.strip():
            prediction, confidence = classify_text(user_text)
            print(f"Predicted Category: {prediction}")
            print(f"Confidence: {confidence:.3f}")
            print("-" * 30)
        else:
            print("Please enter some text.")

# For demonstration purposes, let's test with predefined examples
demo_texts = [
    "The quarterback threw a perfect pass for a touchdown",
    "Scientists discover new way to edit genes using CRISPR technology",
    "Regular exercise can help prevent heart disease and diabetes"
]

print("Demo Classification Results:")
print("=" * 40)
for text in demo_texts:
    prediction, confidence = classify_text(text)
    print(f"Text: {text}")
    print(f"Category: {prediction} (confidence: {confidence:.3f})")
    print("-" * 40)

# Uncomment the line below to run the interactive classifier
# interactive_classifier()

## Key Takeaways

1. **Feature extraction** is crucial - TF-IDF often performs better than simple word counts
2. **Different algorithms** have different strengths - linear models work well for text classification
3. **Evaluation** should include multiple metrics, not just accuracy
4. **Cross-validation** provides a more robust estimate of model performance
5. **Pipelines** make it easy to combine preprocessing and modeling steps
6. **Feature importance** analysis helps understand what the model has learned
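
To illustrate the third point, here is a small sketch with hypothetical labels (`y_true_toy` and `y_pred_toy` are invented for this example) in which a classifier that always predicts the majority class still reaches high accuracy, while class-averaged metrics expose the problem.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels: 8 'sports' documents and 2 'health' documents
y_true_toy = ['sports'] * 8 + ['health'] * 2
# A degenerate classifier that always predicts the majority class
y_pred_toy = ['sports'] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
# Balanced accuracy is the mean of per-class recall
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)
# Macro F1 averages the per-class F1 scores with equal weight per class
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)

print(f"Accuracy:          {acc:.2f}")
print(f"Balanced accuracy: {bal_acc:.2f}")
print(f"Macro F1:          {macro_f1:.3f}")
```

Accuracy comes out at 0.80, but balanced accuracy is 0.50 and macro F1 is about 0.444, because the 'health' class is never predicted.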

## Next Steps

- Try larger datasets: with only 30 articles, the accuracy figures here are too noisy to generalize
- Experiment with n-grams (bigrams, trigrams)
- Use word embeddings (Word2Vec, GloVe) as features
- Try deep learning approaches with neural networks
- Handle imbalanced datasets with appropriate techniques