# TF-IDF Text Classification with Naive Bayes

This notebook demonstrates:
1. **TF-IDF Vectorization** using scikit-learn
2. **Naive Bayes Classification** for Tamil news articles
3. **Model Evaluation** with comprehensive metrics

**Features:**
- Unigram TF-IDF features (max 10,000 features)
- Document frequency filtering (min_df=3, max_df=0.8)
- Multinomial Naive Bayes classifier
- Dual task: Category classification and Sentiment classification

**Dataset:** Tamil news articles with categories and processed text

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import json
import pickle
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("Libraries imported successfully!")

## 2. Load the Cleaned Data

In [None]:
# Load the processed data from preprocessing notebook
df = pd.read_csv('output/processed_data.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Check data distribution
print("Category distribution:")
print(df['category'].value_counts())
print(f"\nTotal samples: {len(df)}")
print(f"Missing values:\n{df.isnull().sum()}")

## 3. Prepare Text Data for TF-IDF

We'll use the `cleaned_title` column, which contains preprocessed Tamil text. Missing titles are replaced with empty strings so the vectorizer receives a string for every row.

In [None]:
# Select the text column and target variable
documents = df['cleaned_title'].fillna('').tolist()
labels = df['category'].tolist()

print(f"Total documents: {len(documents)}")
print(f"Total labels: {len(labels)}")
print(f"\nSample document: {documents[0]}")
print(f"Sample label: {labels[0]}")

## 4. TF-IDF Vectorization

**TF-IDF (Term Frequency-Inverse Document Frequency)** converts text into numerical features:
- **TF**: Measures how frequently a term appears in a document
- **IDF**: Down-weights terms that appear in many documents and up-weights rare ones
- **TF-IDF**: Multiplies the two, so a term scores high when it is frequent in a document but rare across the corpus

**Parameters:**
- `max_features=10000`: Keep only the 10,000 terms with the highest total frequency across the corpus
- `min_df=3`: Ignore terms appearing in fewer than 3 documents
- `max_df=0.8`: Ignore terms appearing in more than 80% of documents
- `ngram_range=(1,1)`: Use only single words (unigrams)
- `token_pattern=r'\S+'`: Treat every whitespace-separated string as a token

In [None]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    min_df=3,
    max_df=0.8,
    ngram_range=(1, 1),  # Unigrams only
    token_pattern=r'\S+',  # Split on whitespace
    dtype=np.float32
)

# Fit and transform the documents
print("Creating TF-IDF matrix...")
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(f"\n✓ TF-IDF Matrix created: {tfidf_matrix.shape}")
print(f"  Documents: {tfidf_matrix.shape[0]:,}")
print(f"  Features: {tfidf_matrix.shape[1]:,}")
print(f"  Matrix type: {type(tfidf_matrix)}")
print(f"  Data type: {tfidf_matrix.dtype}")
print(f"  Sparsity: {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")

## 5. Prepare Data for Machine Learning

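Split the data into training (80%) and test (20%) sets with stratified sampling, so each category keeps roughly the same share in both sets. Note that the TF-IDF vectorizer above was fitted on all documents before the split, so the vocabulary and IDF weights also reflect the test set.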
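
If the test set should stay entirely unseen, one option is to split the raw text first and fit the vectorizer on the training portion only. The sketch below is an alternative arrangement, not this notebook's own flow: it wraps the same vectorizer settings and `MultinomialNB` in a scikit-learn pipeline. The helper name `build_tfidf_nb_pipeline` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_tfidf_nb_pipeline(min_df=3, max_df=0.8, max_features=10000):
    """Hypothetical helper: vectorizer and classifier that are fitted together.

    Calling .fit() on the pipeline learns the vocabulary and IDF weights
    from the training text only.
    """
    return make_pipeline(
        TfidfVectorizer(
            max_features=max_features,
            min_df=min_df,
            max_df=max_df,
            ngram_range=(1, 1),
            token_pattern=r"\S+",
        ),
        MultinomialNB(),
    )


# Usage with the notebook's `documents` and `labels` lists (left commented out
# so it does not replace the variables used by the cells that follow):
# docs_train, docs_test, lab_train, lab_test = train_test_split(
#     documents, labels, test_size=0.2, random_state=42, stratify=labels
# )
# pipeline = build_tfidf_nb_pipeline().fit(docs_train, lab_train)
# pipeline_test_pred = pipeline.predict(docs_test)
```

A side benefit of this design is that a single pickled pipeline carries both the vectorizer and the classifier, so they cannot drift apart at inference time.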

In [None]:
# Use TF-IDF matrix as features
X = tfidf_matrix
y = np.array(labels)

# Split data: 80% train, 20% test with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"Feature dimension: {X_train.shape[1]}")
print(f"\nTraining set label distribution:")
print(pd.Series(y_train).value_counts())
print(f"\nTest set label distribution:")
print(pd.Series(y_test).value_counts())

## 6. Train Naive Bayes Model

**Naive Bayes** is a probabilistic classifier based on Bayes' theorem:
- **Fast training and prediction**
- **Works well with text data**
- **Assumes features are conditionally independent given the class**
- **MultinomialNB** models term counts; TF-IDF weights are fractional rather than true counts, but they often work well in practice
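
The decision rule behind `MultinomialNB` can be checked by hand: for each class, add the log prior to the sum of feature values times the log term probabilities, then pick the largest score. A minimal sketch on a hypothetical toy count matrix (the data and variable names are illustrative only):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy term-count matrix: rows are documents, columns are terms.
X_toy = np.array([
    [3, 0, 1],
    [2, 1, 0],
    [0, 4, 1],
    [1, 3, 0],
])
y_toy = np.array(["sports", "sports", "politics", "politics"])

toy_nb = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# Score per class: log P(class) + sum_j x_j * log P(term_j | class)
x_new = np.array([[2, 0, 1]])
manual_scores = toy_nb.class_log_prior_ + x_new @ toy_nb.feature_log_prob_.T
manual_pred = toy_nb.classes_[np.argmax(manual_scores, axis=1)]

print("Classes:", toy_nb.classes_)
print("Manual scores:", manual_scores)
print("Manual prediction:", manual_pred, "| sklearn:", toy_nb.predict(x_new))
```

With `alpha=1.0` (Laplace smoothing), the "sports" term probabilities are `(5+1, 1+1, 1+1) / 10`, which is why an unseen term never drives a class probability to zero.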

In [None]:
# Initialize and train Naive Bayes model
print("Training Naive Bayes model...")
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Make predictions
nb_train_pred = nb_model.predict(X_train)
nb_test_pred = nb_model.predict(X_test)

print("✓ Naive Bayes training completed")
print(f"  Classes: {nb_model.classes_}")
print(f"  Number of classes: {len(nb_model.classes_)}")

## 7. Model Evaluation

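Evaluate the model using multiple metrics. Precision, recall, and F1 are computed per class and then averaged, weighted by each class's number of true samples (`average='weighted'`):
- **Accuracy**: Fraction of all predictions that are correct
- **Precision**: Of the items predicted as a class, the fraction that truly belong to it
- **Recall**: Of the items that truly belong to a class, the fraction predicted as that class
- **F1-Score**: Harmonic mean of precision and recall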
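
As a worked example of weighted averaging, the sketch below recomputes weighted precision by hand on hypothetical toy labels and compares it with the macro average, which treats every class equally regardless of size.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical toy labels: class "a" has 4 true samples, class "b" has 2.
y_true_toy = ["a", "a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "a", "b", "b", "a"]
toy_labels = ["a", "b"]

# Per-class precision: "a" -> 3 correct of 4 predicted, "b" -> 1 correct of 2 predicted
per_class = precision_score(
    y_true_toy, y_pred_toy, labels=toy_labels, average=None, zero_division=0
)
support = np.array([y_true_toy.count(label) for label in toy_labels])

# Weighted average: each class contributes in proportion to its true sample count
manual_weighted = float(np.sum(per_class * support) / support.sum())
sklearn_weighted = precision_score(
    y_true_toy, y_pred_toy, average="weighted", zero_division=0
)
macro = precision_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

print(f"Per-class precision: {per_class}")
print(f"Weighted (manual):   {manual_weighted:.4f}")
print(f"Weighted (sklearn):  {sklearn_weighted:.4f}")
print(f"Macro:               {macro:.4f}")
```

When categories are imbalanced, the weighted score is dominated by the large classes, so comparing it with the macro score is a quick check on how the small classes are doing.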

In [None]:
def evaluate_model(y_true, y_pred, model_name, dataset_name):
    """
    Evaluate model performance with multiple metrics.
    """
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    
    print(f"\n{'='*60}")
    print(f"{model_name} - {dataset_name} Set")
    print(f"{'='*60}")
    print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"Precision: {precision:.4f} ({precision*100:.2f}%)")
    print(f"Recall:    {recall:.4f} ({recall*100:.2f}%)")
    print(f"F1-Score:  {f1:.4f} ({f1*100:.2f}%)")
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Evaluate on training and test sets
nb_train_metrics = evaluate_model(y_train, nb_train_pred, "Naive Bayes", "Training")
nb_test_metrics = evaluate_model(y_test, nb_test_pred, "Naive Bayes", "Test")

In [None]:
# Detailed classification report
print("\nDetailed Classification Report (Test Set):")
print(classification_report(y_test, nb_test_pred, zero_division=0))

## 8. Confusion Matrix

Visualize the confusion matrix to see which categories are confused with each other.
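
Besides the heatmap, the largest off-diagonal cells can be listed directly, which is easier to read when there are many categories. A small sketch on hypothetical toy labels; `top_confusions` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def top_confusions(y_true, y_pred, labels, k=3):
    """Hypothetical helper: the k largest misclassification cells.

    Returns a list of (true_label, predicted_label, count), largest first.
    """
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [
        (labels[r], labels[c], int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0
    ]


# Toy example: rows of the matrix are true labels, columns are predictions.
toy_true = ["sports", "sports", "sports", "politics", "politics", "cinema"]
toy_pred = ["sports", "politics", "politics", "politics", "cinema", "cinema"]
toy_label_order = ["cinema", "politics", "sports"]

toy_confusions = top_confusions(toy_true, toy_pred, toy_label_order)
for true_label, pred_label, count in toy_confusions:
    print(f"{true_label} predicted as {pred_label}: {count}")
```

On the notebook's data, the call would be `top_confusions(y_test, nb_test_pred, list(nb_model.classes_))`.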

In [None]:
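# Compute confusion matrix with an explicit label order,
# so the tick labels are guaranteed to match its rows and columns
classes = list(nb_model.classes_)
nb_cm = confusion_matrix(y_test, nb_test_pred, labels=classes)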

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(nb_cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=classes, yticklabels=classes, cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Naive Bayes (Category Classification)', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

## 9. Save Model and Vectorizer

In [None]:
# Create directories if they don't exist
os.makedirs('models', exist_ok=True)
os.makedirs('reports', exist_ok=True)
os.makedirs('output', exist_ok=True)

# Save the trained model
with open('models/category_naive_bayes_tfidf.pkl', 'wb') as f:
    pickle.dump(nb_model, f)

# Save the TF-IDF vectorizer
with open('models/category_tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

# Save evaluation report
import json

report = {
    'model': 'MultinomialNB',
    'vectorizer': 'TfidfVectorizer',
    'ngram_range': '(1,1)',
    'max_features': 10000,
    'train_metrics': nb_train_metrics,
    'test_metrics': nb_test_metrics,
    'classification_report': classification_report(y_test, nb_test_pred, output_dict=True, zero_division=0)
}

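# Write UTF-8 and keep non-ASCII labels readable in the JSON file
with open('reports/category_naive_bayes_tfidf_report.json', 'w', encoding='utf-8') as f:
    json.dump(report, f, indent=2, ensure_ascii=False)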

print("✓ Model saved to models/category_naive_bayes_tfidf.pkl")
print("✓ Vectorizer saved to models/category_tfidf_vectorizer.pkl")
print("✓ Evaluation report saved to reports/category_naive_bayes_tfidf_report.json")
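
To reuse the saved artifacts later, the model and vectorizer are loaded together and the vectorizer is applied with `transform` only, never refitted. A minimal sketch, assuming the artifact paths written above; `load_pickle` and `predict_titles` are hypothetical helper names, and new titles are assumed to have gone through the same preprocessing as `cleaned_title`.

```python
import pickle


def load_pickle(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_titles(titles, model, vectorizer):
    """Hypothetical helper: predict labels for already-preprocessed titles."""
    # transform() reuses the vocabulary and IDF weights learned during training
    features = vectorizer.transform(titles)
    return model.predict(features)


# Usage, assuming the files saved by the cell above exist:
# category_model = load_pickle('models/category_naive_bayes_tfidf.pkl')
# category_vectorizer = load_pickle('models/category_tfidf_vectorizer.pkl')
# predict_titles(["<preprocessed Tamil title>"], category_model, category_vectorizer)
```

Pickled scikit-learn objects are tied to the library version that wrote them, so it helps to load them with the same scikit-learn version used for training.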

---

# SENTIMENT CLASSIFICATION

---

## 10. Load Sentiment Dataset

In [None]:
# Load sentiment data
df_sentiment = pd.read_csv('output/processed_sentiment_data.csv')

print(f"Sentiment Dataset shape: {df_sentiment.shape}")
print(f"\nSentiment distribution:")
print(df_sentiment['sentiment'].value_counts())

sentiment_documents = df_sentiment['tokenized_title'].fillna('').tolist()
sentiment_labels = df_sentiment['sentiment'].tolist()

print(f"\nTotal sentiment documents: {len(sentiment_documents)}")
print(f"Sample: {sentiment_documents[0]}")

## 11. TF-IDF Vectorization for Sentiment

In [None]:
# Initialize TF-IDF vectorizer for sentiment
sentiment_tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    min_df=3,
    max_df=0.8,
    ngram_range=(1, 1),  # Unigrams only
    token_pattern=r'\S+',
    dtype=np.float32
)

# Fit and transform sentiment documents
print("Creating TF-IDF matrix for sentiment...")
sentiment_tfidf_matrix = sentiment_tfidf_vectorizer.fit_transform(sentiment_documents)

print(f"\n✓ Sentiment TF-IDF Matrix: {sentiment_tfidf_matrix.shape}")
print(f"  Documents: {sentiment_tfidf_matrix.shape[0]:,}")
print(f"  Features: {sentiment_tfidf_matrix.shape[1]:,}")
print(f"  Sparsity: {(1 - sentiment_tfidf_matrix.nnz / (sentiment_tfidf_matrix.shape[0] * sentiment_tfidf_matrix.shape[1])) * 100:.2f}%")

## 12. Prepare Sentiment Data for Training

In [None]:
X_sent = sentiment_tfidf_matrix
y_sent = np.array(sentiment_labels)

# Split data with stratification
X_sent_train, X_sent_test, y_sent_train, y_sent_test = train_test_split(
    X_sent, y_sent, test_size=0.2, random_state=42, stratify=y_sent
)

print(f"Sentiment Training set: {X_sent_train.shape[0]} samples")
print(f"Sentiment Test set: {X_sent_test.shape[0]} samples")
print(f"Features: {X_sent_train.shape[1]}")
print(f"\nSentiment distribution (train):")
print(pd.Series(y_sent_train).value_counts())
print(f"\nSentiment distribution (test):")
print(pd.Series(y_sent_test).value_counts())

## 13. Train Sentiment Naive Bayes Model

In [None]:
# Initialize and train Naive Bayes for sentiment
print("Training Sentiment Naive Bayes model...")
sent_nb_model = MultinomialNB()
sent_nb_model.fit(X_sent_train, y_sent_train)

# Make predictions
sent_nb_train_pred = sent_nb_model.predict(X_sent_train)
sent_nb_test_pred = sent_nb_model.predict(X_sent_test)

print("✓ Sentiment Naive Bayes training completed")
print(f"  Classes: {sent_nb_model.classes_}")

## 14. Evaluate Sentiment Model

In [None]:
# Evaluate sentiment model
sent_nb_train_metrics = evaluate_model(y_sent_train, sent_nb_train_pred, "Sentiment Naive Bayes", "Training")
sent_nb_test_metrics = evaluate_model(y_sent_test, sent_nb_test_pred, "Sentiment Naive Bayes", "Test")

In [None]:
# Detailed classification report
print("\nDetailed Classification Report (Sentiment - Test Set):")
print(classification_report(y_sent_test, sent_nb_test_pred, zero_division=0))

## 15. Sentiment Confusion Matrix

In [None]:
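# Compute and plot sentiment confusion matrix with an explicit label order,
# so the tick labels are guaranteed to match its rows and columns
sent_classes = list(sent_nb_model.classes_)
sent_nb_cm = confusion_matrix(y_sent_test, sent_nb_test_pred, labels=sent_classes)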

plt.figure(figsize=(8, 6))
sns.heatmap(sent_nb_cm, annot=True, fmt='d', cmap='Greens', 
            xticklabels=sent_classes, yticklabels=sent_classes, cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Naive Bayes (Sentiment Classification)', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

## 16. Save Sentiment Model and Vectorizer

In [None]:
# Save sentiment model
with open('models/sentiment_naive_bayes_tfidf.pkl', 'wb') as f:
    pickle.dump(sent_nb_model, f)

# Save sentiment vectorizer
with open('models/sentiment_tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(sentiment_tfidf_vectorizer, f)

# Save evaluation report
sent_report = {
    'model': 'MultinomialNB',
    'vectorizer': 'TfidfVectorizer',
    'ngram_range': '(1,1)',
    'max_features': 10000,
    'train_metrics': sent_nb_train_metrics,
    'test_metrics': sent_nb_test_metrics,
    'classification_report': classification_report(y_sent_test, sent_nb_test_pred, output_dict=True, zero_division=0)
}

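# Write UTF-8 and keep non-ASCII labels readable in the JSON file
with open('reports/sentiment_naive_bayes_tfidf_report.json', 'w', encoding='utf-8') as f:
    json.dump(sent_report, f, indent=2, ensure_ascii=False)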

print("✓ Sentiment model saved to models/sentiment_naive_bayes_tfidf.pkl")
print("✓ Sentiment vectorizer saved to models/sentiment_tfidf_vectorizer.pkl")
print("✓ Sentiment report saved to reports/sentiment_naive_bayes_tfidf_report.json")

## 17. Final Summary

**Model Performance Summary**

In [None]:
print("\n" + "="*80)
print("TF-IDF + NAIVE BAYES CLASSIFICATION SUMMARY")
print("="*80)

print("\nCATEGORY CLASSIFICATION:")
print(f"  Dataset: {len(documents)} documents")
print(f"  Feature Count: {X_train.shape[1]:,} unigrams")
print(f"  Train/Test Split: {X_train.shape[0]}/{X_test.shape[0]}")
print(f"  Test Accuracy: {nb_test_metrics['accuracy']:.4f} ({nb_test_metrics['accuracy']*100:.2f}%)")
print(f"  Test F1-Score: {nb_test_metrics['f1']:.4f}")

print("\nSENTIMENT CLASSIFICATION:")
print(f"  Dataset: {len(sentiment_documents)} documents")
print(f"  Feature Count: {X_sent_train.shape[1]:,} unigrams")
print(f"  Train/Test Split: {X_sent_train.shape[0]}/{X_sent_test.shape[0]}")
print(f"  Test Accuracy: {sent_nb_test_metrics['accuracy']:.4f} ({sent_nb_test_metrics['accuracy']*100:.2f}%)")
print(f"  Test F1-Score: {sent_nb_test_metrics['f1']:.4f}")

print("\nKEY FEATURES:")
print("  ✓ TF-IDF vectorization with sklearn")
print("  ✓ Multinomial Naive Bayes classifier")
print("  ✓ Unigram features (1,1)")
print("  ✓ Document frequency filtering")
print("  ✓ Fast training and prediction")

print("\nSAVED ARTIFACTS:")
print("  Models:")
print("    - models/category_naive_bayes_tfidf.pkl")
print("    - models/sentiment_naive_bayes_tfidf.pkl")
print("  Vectorizers:")
print("    - models/category_tfidf_vectorizer.pkl")
print("    - models/sentiment_tfidf_vectorizer.pkl")
print("  Reports:")
print("    - reports/category_naive_bayes_tfidf_report.json")
print("    - reports/sentiment_naive_bayes_tfidf_report.json")

print("\n" + "="*80)
print("TF-IDF CLASSIFICATION PIPELINE COMPLETED SUCCESSFULLY!")
print("="*80)