# Tamil News Headline Prediction System

## Comprehensive Multi-Model Prediction System

This notebook loads **ALL trained models** and predicts:
1. **Category Classification** - Using all 3 models (Naive Bayes, SVM, Logistic Regression)
2. **Sentiment Classification** - Using all 3 models (Naive Bayes, SVM, Logistic Regression)

**Features:**
- Input: Tamil news headline
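- Output: category and sentiment predictions from all models
- Reports each model's test accuracy and F1 score alongside its prediction
- Includes the same preprocessing pipeline used during training
- Provides single-headline and batch prediction functions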

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
import re
import os
from typing import List, Dict
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully!")

✓ Libraries imported successfully!


## 2. Tamil Text Preprocessing Class

In [2]:
class TamilTextPreprocessor:
    """
    Preprocessing pipeline for Tamil text (same as training preprocessing).
    """
    
    def __init__(self, stopwords_file: str = 'resources/stopwords.txt', 
                 suffixes_file: str = 'resources/suffixes.csv'):
        self.stopwords = set()
        self.suffixes = {}
        # Always defined, so remove_suffixes() is safe even if loading fails midway
        self.sorted_suffixes = []
        
        # Load stopwords if available
        if os.path.exists(stopwords_file):
            with open(stopwords_file, 'r', encoding='utf-8') as f:
                self.stopwords = {line.strip() for line in f if line.strip()}
            print(f"✓ Loaded {len(self.stopwords)} stopwords")
        else:
            print("⚠ Stopwords file not found, continuing without stopwords")
        
        # Load suffixes if available
        if os.path.exists(suffixes_file):
            try:
                df = pd.read_csv(suffixes_file, encoding='utf-8')
                for _, row in df.iterrows():
                    suffix = str(row['suffix']).strip()
                    if suffix:
                        self.suffixes[suffix] = str(row['meaning']).strip()
                # Longest suffixes first, so the longest match is stripped
                self.sorted_suffixes = sorted(self.suffixes.keys(), key=len, reverse=True)
                print(f"✓ Loaded {len(self.suffixes)} suffixes")
            except Exception:
                # Unreadable file or missing columns: fall back to no suffix removal
                self.suffixes = {}
                self.sorted_suffixes = []
                print("⚠ Could not load suffixes, continuing without suffix removal")
        else:
            print("⚠ Suffixes file not found, continuing without suffix removal")
    
    def clean_text(self, text: str) -> str:
        """Clean Tamil text (remove English, digits, punctuation)."""
        if pd.isna(text) or not text:
            return ""
        
        text = str(text)
        # Remove English letters
        text = re.sub(r'[a-zA-Z]+', '', text)
        # Remove ASCII digits and Tamil digits (U+0BE6 to U+0BEF)
        text = re.sub(r'[0-9௦-௯]+', '', text)
        # Remove punctuation (including ellipsis, en dash, em dash)
        text = re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~\u2026\u2013\u2014]', ' ', text)
        # Remove currency signs and decorative symbols
        text = re.sub(r'[₹$€£¥●○■□★☆♦♥♠♣]', '', text)
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def tokenize(self, text: str) -> List[str]:
        """Tokenize text into words."""
        if not text:
            return []
        return [w.strip() for w in text.split() if w.strip()]
    
    def remove_stopwords(self, tokens: List[str]) -> List[str]:
        """Remove stopwords from token list."""
        if not self.stopwords:
            return tokens
        return [t for t in tokens if t not in self.stopwords]
    
    def remove_suffixes(self, word: str) -> str:
        """Remove Tamil suffixes from word."""
        if not self.suffixes:
            return word
        
        for suffix in self.sorted_suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[:-len(suffix)]
        return word
    
    def preprocess(self, text: str) -> str:
        """Complete preprocessing pipeline."""
        # Clean text
        cleaned = self.clean_text(text)
        # Tokenize
        tokens = self.tokenize(cleaned)
        # Remove stopwords
        tokens = self.remove_stopwords(tokens)
        # Remove suffixes
        tokens = [self.remove_suffixes(t) for t in tokens]
        # Join back
        return ' '.join(tokens)

# Initialize preprocessor
preprocessor = TamilTextPreprocessor()
print("\n✓ Preprocessor initialized")

✓ Loaded 127 stopwords
✓ Loaded 133 suffixes

✓ Preprocessor initialized
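
The cleaning and suffix-stripping steps of the class above can be tried in isolation. The following is a minimal sketch, not the notebook's actual resources: `DEMO_SUFFIXES` is a hypothetical three-entry list standing in for `resources/suffixes.csv`, and `clean` keeps only the letter, digit, and whitespace rules of `clean_text`.

```python
import re
from typing import List

# Hypothetical suffix list for illustration; the real one is read from resources/suffixes.csv
DEMO_SUFFIXES = ["களை", "கள்", "த்தது"]
# Longest first, so the longest matching suffix is tried before shorter ones
SORTED_SUFFIXES = sorted(DEMO_SUFFIXES, key=len, reverse=True)

def clean(text: str) -> str:
    """Drop Latin letters and ASCII/Tamil digits, then collapse whitespace."""
    text = re.sub(r'[a-zA-Z]+', '', text)
    text = re.sub(r'[0-9௦-௯]+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def strip_suffix(word: str, suffixes: List[str]) -> str:
    """Remove the first matching suffix, but never reduce a word to nothing."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

cleaned = clean("IPL 2024 இந்திய அணி")
stem = strip_suffix("சீர்திருத்தங்களை", SORTED_SUFFIXES)
print(cleaned)
print(stem)
```

Only one suffix is removed per word, and the `len(word) > len(suffix)` guard leaves a word untouched when it consists of nothing but the suffix.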


## 3. Load All Trained Models

Loading:
- **Category Models**: Naive Bayes, SVM, Logistic Regression
- **Sentiment Models**: Naive Bayes, SVM, Logistic Regression
- **Vectorizers**: Category and Sentiment TF-IDF vectorizers

In [3]:
print("Loading all trained models...\n")

# Dictionary to store all models and vectorizers
models = {}

def load_pickle(key: str, path: str, label: str) -> None:
    """Load one pickled object into `models[key]`; store None if loading fails."""
    try:
        with open(path, 'rb') as f:
            models[key] = pickle.load(f)
        print(f"  ✓ {label} loaded")
    except FileNotFoundError:
        print(f"  ✗ {label} not found")
        models[key] = None
    except Exception as exc:
        # Corrupt file, or a class/library version the pickle depends on is missing
        print(f"  ✗ {label} could not be loaded ({type(exc).__name__}: {exc})")
        models[key] = None

# Load Category Models and Vectorizer
print("📁 CATEGORY CLASSIFICATION MODELS:")
load_pickle('category_nb', 'models/category_naive_bayes.pkl', 'Naive Bayes')
load_pickle('category_svm', 'models/category_svm.pkl', 'SVM')
load_pickle('category_lr', 'models/category_logistic.pkl', 'Logistic Regression')
load_pickle('category_vectorizer', 'models/category_vectorizer.pkl', 'Category Vectorizer')

# Load Sentiment Models and Vectorizer
print("\n💭 SENTIMENT CLASSIFICATION MODELS:")
load_pickle('sentiment_nb', 'models/sentiment_naive_bayes.pkl', 'Naive Bayes')
load_pickle('sentiment_svm', 'models/sentiment_svm.pkl', 'SVM')
load_pickle('sentiment_lr', 'models/sentiment_logistic.pkl', 'Logistic Regression')
load_pickle('sentiment_vectorizer', 'models/sentiment_vectorizer.pkl', 'Sentiment Vectorizer')

print("\n" + "="*60)
print("MODEL LOADING COMPLETE")
print("="*60)

Loading all trained models...

📁 CATEGORY CLASSIFICATION MODELS:
  ✓ Naive Bayes loaded
  ✓ SVM loaded
  ✓ Logistic Regression loaded
  ✓ Category Vectorizer loaded

💭 SENTIMENT CLASSIFICATION MODELS:
  ✓ Naive Bayes loaded
  ✓ SVM loaded
  ✓ Logistic Regression loaded
  ✓ Sentiment Vectorizer loaded

MODEL LOADING COMPLETE
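
The prediction cells in this notebook index each loaded vectorizer with the keys `vocabulary`, `word2idx`, and `idf_dict`, which suggests the vectorizers are pickled dictionaries rather than fitted library objects. A small check along these lines could catch a malformed pickle early. This is a sketch under that assumption: `missing_vectorizer_keys` is a hypothetical helper and `toy_vectorizer` contains made-up values.

```python
from typing import Any, Set

# Keys the prediction code reads from each pickled vectorizer
REQUIRED_KEYS = {'vocabulary', 'word2idx', 'idf_dict'}

def missing_vectorizer_keys(vectorizer: Any) -> Set[str]:
    """Return the required keys that a loaded vectorizer object lacks."""
    if not isinstance(vectorizer, dict):
        # None (failed load) or an unexpected type: everything is missing
        return set(REQUIRED_KEYS)
    return REQUIRED_KEYS - set(vectorizer)

# Toy stand-in for models['category_vectorizer']; contents are made up
toy_vectorizer = {
    'vocabulary': ['அணி', 'அரசு'],
    'word2idx': {'அணி': 0, 'அரசு': 1},
    'idf_dict': {'அணி': 1.2, 'அரசு': 0.8},
}

print(missing_vectorizer_keys(toy_vectorizer))
print(missing_vectorizer_keys({'vocabulary': []}))
print(missing_vectorizer_keys(None))
```

In the notebook, the same function could be called on `models['category_vectorizer']` and `models['sentiment_vectorizer']` right after loading.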


## 4. Load Model Accuracies from Reports

In [4]:
import json

# Dictionary to store model accuracies
accuracies = {
    'category': {},
    'sentiment': {}
}

# (dictionary key, printed label, report file stem) for each model
REPORT_SPECS = [
    ('Naive Bayes', 'Naive Bayes', 'naive_bayes'),
    ('SVM', 'SVM', 'svm'),
    ('Logistic Regression', 'Logistic Reg.', 'logistic'),
]

def load_task_accuracies(task: str) -> None:
    """Read test accuracy and F1 for every model of one task from its JSON report."""
    for name, label, stem in REPORT_SPECS:
        path = f'reports/{task}_{stem}_report.json'
        try:
            with open(path, 'r', encoding='utf-8') as f:
                metrics = json.load(f)['test_metrics']
            acc, f1 = metrics['accuracy'], metrics['f1']
        except FileNotFoundError:
            print(f"  {label:<16}: Report not found")
            continue
        except (OSError, KeyError, ValueError):
            # Unreadable file, invalid JSON, or missing metric fields
            print(f"  {label:<16}: Report unreadable")
            continue
        accuracies[task][name] = {'accuracy': acc, 'f1': f1}
        print(f"  {label:<16}: Accuracy = {acc:.4f} ({acc*100:.2f}%), F1 = {f1:.4f}")

print("Loading model accuracies from reports...\n")

# Load Category Model Accuracies
print("📊 CATEGORY MODEL ACCURACIES:")
load_task_accuracies('category')

# Load Sentiment Model Accuracies
print("\n💭 SENTIMENT MODEL ACCURACIES:")
load_task_accuracies('sentiment')

print("\n" + "="*60)

Loading model accuracies from reports...

📊 CATEGORY MODEL ACCURACIES:
  Naive Bayes     : Accuracy = 0.6455 (64.55%), F1 = 0.6390
  SVM             : Accuracy = 0.6621 (66.21%), F1 = 0.6570
  Logistic Reg.   : Accuracy = 0.6520 (65.20%), F1 = 0.6512

💭 SENTIMENT MODEL ACCURACIES:
  Naive Bayes     : Accuracy = 0.6525 (65.25%), F1 = 0.6507
  SVM             : Accuracy = 0.6719 (67.19%), F1 = 0.6613
  Logistic Reg.   : Accuracy = 0.6840 (68.40%), F1 = 0.6571
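
The figures above show that the best model depends on the metric: for sentiment, Logistic Regression has the highest accuracy while SVM has the highest F1. The sketch below rebuilds the `accuracies` dictionary by hand from the printed values (instead of reading the report files) and picks the top model per task and metric. `best_model` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict

# Same shape as the notebook's `accuracies`, filled with the values printed above
accuracies: Dict[str, Dict[str, Dict[str, float]]] = {
    'category': {
        'Naive Bayes': {'accuracy': 0.6455, 'f1': 0.6390},
        'SVM': {'accuracy': 0.6621, 'f1': 0.6570},
        'Logistic Regression': {'accuracy': 0.6520, 'f1': 0.6512},
    },
    'sentiment': {
        'Naive Bayes': {'accuracy': 0.6525, 'f1': 0.6507},
        'SVM': {'accuracy': 0.6719, 'f1': 0.6613},
        'Logistic Regression': {'accuracy': 0.6840, 'f1': 0.6571},
    },
}

def best_model(task: str, metric: str) -> str:
    """Return the model name with the highest value of `metric` for `task`."""
    scores = accuracies[task]
    return max(scores, key=lambda name: scores[name][metric])

for task in ('category', 'sentiment'):
    for metric in ('accuracy', 'f1'):
        print(f"{task:9s} best by {metric:8s}: {best_model(task, metric)}")
```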



## 5. TF-IDF Vectorization Helper Functions

In [5]:
from scipy.sparse import csr_matrix
from collections import Counter
import math

def compute_tf(tokens: List[str], vocabulary_set: set) -> Dict[str, float]:
    """Compute term frequency for tokens."""
    tf_dict = {}
    if not tokens:
        return tf_dict
    
    term_counts = Counter(tokens)
    doc_length = len(tokens)
    
    for term, count in term_counts.items():
        if term in vocabulary_set:
            tf_dict[term] = count / doc_length
    
    return tf_dict

def compute_tfidf(tf_dict: Dict[str, float], idf_dict: Dict[str, float]) -> Dict[str, float]:
    """Compute TF-IDF weights."""
    tfidf_dict = {}
    for term, tf_value in tf_dict.items():
        tfidf_dict[term] = tf_value * idf_dict.get(term, 0)
    return tfidf_dict

def create_tfidf_vector(tfidf_dict: Dict[str, float], word2idx: Dict[str, int], vocab_size: int):
    """Create sparse TF-IDF vector from TF-IDF dictionary."""
    row = []
    col = []
    data = []
    
    for term, tfidf_value in tfidf_dict.items():
        if term in word2idx:
            col.append(word2idx[term])
            row.append(0)
            data.append(tfidf_value)
    
    if not data:
        # Return zero vector if no terms match
        return csr_matrix((1, vocab_size), dtype=np.float32)
    
    return csr_matrix((data, (row, col)), shape=(1, vocab_size), dtype=np.float32)

print("✓ TF-IDF helper functions defined")

✓ TF-IDF helper functions defined
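
A small worked example may help to see what these helpers compute. The sketch below mirrors the arithmetic of `compute_tf` and `compute_tfidf` in plain Python and builds a dense list in place of the sparse row. The IDF values and the two-word vocabulary are made up for illustration; the real ones come from the pickled vectorizer.

```python
from collections import Counter
from typing import Dict, List

def tfidf_weights(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """TF-IDF per in-vocabulary term: (count / document length) * IDF."""
    counts = Counter(tokens)
    doc_length = len(tokens)  # out-of-vocabulary tokens are counted here too
    return {term: (count / doc_length) * idf[term]
            for term, count in counts.items() if term in idf}

# Made-up IDF values and index mapping for a two-word vocabulary
toy_idf = {'அணி': 2.0, 'அரசு': 1.0}
toy_word2idx = {'அணி': 0, 'அரசு': 1}

tokens = ['அணி', 'அணி', 'அரசு', 'புதிய']  # 'புதிய' is out of vocabulary
weights = tfidf_weights(tokens, toy_idf)

# Dense equivalent of the 1 x vocab_size sparse row
vector = [0.0] * len(toy_word2idx)
for term, weight in weights.items():
    vector[toy_word2idx[term]] = weight
print(vector)
```

Because the document length includes out-of-vocabulary tokens, unknown words dilute the weights of known terms, and a headline with no known terms yields an all-zero vector.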


## 6. Main Prediction Function

This function:
1. Preprocesses the input Tamil headline
2. Converts it to TF-IDF vectors (category & sentiment)
3. Predicts using ALL available models
4. Returns results with model accuracies

In [6]:
# (results key, printed label, key suffix in `models`) for each classifier
MODEL_SPECS = [
    ('Naive Bayes', 'Naive Bayes', 'nb'),
    ('SVM', 'Linear SVM', 'svm'),
    ('Logistic Regression', 'Logistic Regression', 'lr'),
]

def predict_task(task: str, tokens: List[str]) -> Dict:
    """
    Run every available model for one task ('category' or 'sentiment').
    
    Returns a dictionary mapping model name to its prediction and test metrics.
    """
    predictions = {}
    vectorizer = models[f'{task}_vectorizer']
    if vectorizer is None:
        print(f"  ✗ {task.capitalize()} vectorizer not available")
        return predictions
    
    # Create the TF-IDF vector from this task's saved vocabulary and IDF values
    vocab = vectorizer['vocabulary']
    tf = compute_tf(tokens, set(vocab))
    tfidf = compute_tfidf(tf, vectorizer['idf_dict'])
    X = create_tfidf_vector(tfidf, vectorizer['word2idx'], len(vocab))
    
    for name, label, suffix in MODEL_SPECS:
        model = models[f'{task}_{suffix}']
        if model is None:
            continue
        pred = model.predict(X)[0]
        acc_info = accuracies[task].get(name, {})
        acc = acc_info.get('accuracy', 0)
        f1 = acc_info.get('f1', 0)
        predictions[name] = {
            'prediction': pred,
            'accuracy': f"{acc:.4f} ({acc*100:.2f}%)",
            'f1_score': f"{f1:.4f}"
        }
        print(f"\n  🔹 {label}")
        print(f"     Predicted {task.capitalize()}: {pred}")
        print(f"     Model Accuracy: {acc:.4f} ({acc*100:.2f}%)")
        print(f"     F1-Score: {f1:.4f}")
    
    return predictions

def predict_headline(headline: str) -> Dict:
    """
    Predict category and sentiment for a Tamil news headline using all models.
    
    Args:
        headline: Tamil news headline text
    
    Returns:
        Dictionary with predictions from all models and their accuracies
    """
    print("="*80)
    print("TAMIL NEWS HEADLINE PREDICTION")
    print("="*80)
    print(f"\n📰 Input Headline:\n  {headline}")
    
    # Step 1: Preprocess headline
    print("\n🔧 Preprocessing...")
    preprocessed = preprocessor.preprocess(headline)
    print(f"  Cleaned: {preprocessed}")
    
    if not preprocessed.strip():
        print("\n⚠ Warning: Headline is empty after preprocessing!")
        return {'error': 'Empty headline after preprocessing'}
    
    # Step 2: Tokenize
    tokens = preprocessed.split()
    print(f"  Tokens: {tokens[:10]}..." if len(tokens) > 10 else f"  Tokens: {tokens}")
    
    results = {
        'original_headline': headline,
        'preprocessed_headline': preprocessed,
        'category_predictions': {},
        'sentiment_predictions': {}
    }
    
    # Step 3: Vectorize and predict for each task with all available models
    for task, banner in (('category', "📁 CATEGORY CLASSIFICATION"),
                         ('sentiment', "💭 SENTIMENT CLASSIFICATION")):
        print("\n" + "="*80)
        print(banner)
        print("="*80)
        results[f'{task}_predictions'] = predict_task(task, tokens)
    
    print("\n" + "="*80)
    print("PREDICTION COMPLETE")
    print("="*80 + "\n")
    
    return results

print("✓ Prediction function defined")

✓ Prediction function defined
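
The dictionary returned by `predict_headline` nests one entry per model under `category_predictions` and `sentiment_predictions`. The sketch below shows one way to condense it into vote counts per label. The `result` literal is a hand-written stand-in shaped like the return value, filled with predictions and metrics reported elsewhere in this notebook for the first sample headline, and `summarize` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict

# Hand-written stand-in for the value returned by predict_headline()
result = {
    'original_headline': 'இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது',
    'preprocessed_headline': 'இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது',
    'category_predictions': {
        'Naive Bayes': {'prediction': 'sports', 'accuracy': '0.6455 (64.55%)', 'f1_score': '0.6390'},
        'SVM': {'prediction': 'sports', 'accuracy': '0.6621 (66.21%)', 'f1_score': '0.6570'},
        'Logistic Regression': {'prediction': 'sports', 'accuracy': '0.6520 (65.20%)', 'f1_score': '0.6512'},
    },
    'sentiment_predictions': {
        'Naive Bayes': {'prediction': 'Positive', 'accuracy': '0.6525 (65.25%)', 'f1_score': '0.6507'},
        'SVM': {'prediction': 'Positive', 'accuracy': '0.6719 (67.19%)', 'f1_score': '0.6613'},
        'Logistic Regression': {'prediction': 'Positive', 'accuracy': '0.6840 (68.40%)', 'f1_score': '0.6571'},
    },
}

def summarize(result: Dict) -> Dict[str, Dict[str, int]]:
    """Count how many models predicted each label, per task."""
    summary = {}
    for task in ('category', 'sentiment'):
        preds = result.get(f'{task}_predictions', {})
        summary[task] = dict(Counter(p['prediction'] for p in preds.values()))
    return summary

print(summarize(result))
```

Using `.get` with an empty default keeps the helper safe for the `{'error': ...}` dictionary that is returned when a headline is empty after preprocessing.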


## 7. Test with Sample Headlines

Let's test the prediction system with some sample Tamil news headlines.

In [7]:
# Sample Tamil headlines for testing
sample_headlines = [
    "இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது",  # Indian cricket team won the World Cup
    "புதிய தொழில்நுட்ப கண்டுபிடிப்பு அறிவியல் உலகில் சாதனை",  # New technological invention, a milestone in the world of science
    "அரசு புதிய கல்வி கொள்கை அறிவிப்பு செய்தது",  # Government announced a new education policy
    "இலங்கை அரசு புதிய பொருளாதார சீர்திருத்தங்களை அறிவித்தது",  # Sri Lankan government announced new economic reforms
    "சென்னை மாநகரில் புதிய மெட்ரோ ரயில் திட்டம் தொடங்கியது"  # New metro rail project started in Chennai
]

print("Testing with sample headlines...\n")

Testing with sample headlines...



In [8]:
# Test with first sample headline
if sample_headlines:
    result = predict_headline(sample_headlines[0])

TAMIL NEWS HEADLINE PREDICTION

📰 Input Headline:
  இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது

🔧 Preprocessing...
  Cleaned: இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது
  Tokens: ['இந்திய', 'கிரிக்கெட்', 'அணி', 'உலகக்', 'கோப்பையை', 'வென்றது']

📁 CATEGORY CLASSIFICATION

  🔹 Naive Bayes
     Predicted Category: sports
     Model Accuracy: 0.6455 (64.55%)
     F1-Score: 0.6390

  🔹 Linear SVM
     Predicted Category: sports
     Model Accuracy: 0.6621 (66.21%)
     F1-Score: 0.6570

  🔹 Logistic Regression
     Predicted Category: sports
     Model Accuracy: 0.6520 (65.20%)
     F1-Score: 0.6512

💭 SENTIMENT CLASSIFICATION

  🔹 Naive Bayes
     Predicted Sentiment: Positive
     Model Accuracy: 0.6525 (65.25%)
     F1-Score: 0.650

## 8. Interactive Prediction Cell

**Enter your own Tamil headline below and run the cell to get predictions!**

In [9]:
# Enter your Tamil headline here
your_headline = "இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது"

# Get predictions
result = predict_headline(your_headline)

TAMIL NEWS HEADLINE PREDICTION

📰 Input Headline:
  இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது

🔧 Preprocessing...
  Cleaned: இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது
  Tokens: ['இந்திய', 'கிரிக்கெட்', 'அணி', 'உலகக்', 'கோப்பையை', 'வென்றது']

📁 CATEGORY CLASSIFICATION

  🔹 Naive Bayes
     Predicted Category: sports
     Model Accuracy: 0.6455 (64.55%)
     F1-Score: 0.6390

  🔹 Linear SVM
     Predicted Category: sports
     Model Accuracy: 0.6621 (66.21%)
     F1-Score: 0.6570

  🔹 Logistic Regression
     Predicted Category: sports
     Model Accuracy: 0.6520 (65.20%)
     F1-Score: 0.6512

💭 SENTIMENT CLASSIFICATION

  🔹 Naive Bayes
     Predicted Sentiment: Positive
     Model Accuracy: 0.6525 (65.25%)
     F1-Score: 0.650

## 9. Batch Prediction Function

Predict multiple headlines at once and save results to CSV.

In [10]:
def predict_batch(headlines: List[str], save_to_csv: bool = True) -> pd.DataFrame:
    """
    Predict category and sentiment for multiple headlines.
    
    Args:
        headlines: List of Tamil news headlines
        save_to_csv: Whether to save results to CSV file
    
    Returns:
        DataFrame with predictions from all models
    """
    print(f"\nProcessing {len(headlines)} headlines...\n")
    
    results_list = []
    
    for i, headline in enumerate(headlines, 1):
        print(f"[{i}/{len(headlines)}] Processing: {headline[:50]}...")
        
        # Preprocess
        preprocessed = preprocessor.preprocess(headline)
        
        if not preprocessed.strip():
            print("  ⚠ Skipping empty headline")
            continue
        
        tokens = preprocessed.split()
        
        row = {
            'original_headline': headline,
            'preprocessed_headline': preprocessed
        }
        
        # Category predictions first, then sentiment: one column per model
        for task in ('category', 'sentiment'):
            vectorizer = models[f'{task}_vectorizer']
            if vectorizer is None:
                continue
            vocab = vectorizer['vocabulary']
            tf = compute_tf(tokens, set(vocab))
            tfidf = compute_tfidf(tf, vectorizer['idf_dict'])
            X = create_tfidf_vector(tfidf, vectorizer['word2idx'], len(vocab))
            
            for suffix in ('nb', 'svm', 'lr'):
                model = models[f'{task}_{suffix}']
                if model is not None:
                    row[f'{task}_{suffix}'] = model.predict(X)[0]
        
        results_list.append(row)
    
    # Create DataFrame
    df_results = pd.DataFrame(results_list)
    
    # Save to CSV if requested
    if save_to_csv:
        output_file = 'output/batch_predictions.csv'
        # Create the output directory if it does not exist yet
        os.makedirs(os.path.dirname(output_file), exist_ok=True)
        # utf-8-sig writes a BOM so spreadsheet tools detect the Tamil text as UTF-8
        df_results.to_csv(output_file, index=False, encoding='utf-8-sig')
        print(f"\n✓ Results saved to: {output_file}")
    
    print(f"\n✓ Batch prediction complete: {len(df_results)} headlines processed")
    
    return df_results

print("✓ Batch prediction function defined")

✓ Batch prediction function defined


In [11]:
# Test batch prediction with sample headlines
if sample_headlines:
    batch_results = predict_batch(sample_headlines, save_to_csv=True)
    print("\n" + "="*80)
    print("BATCH PREDICTION RESULTS")
    print("="*80)
    display(batch_results)


Processing 5 headlines...

[1/5] Processing: இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது...
[2/5] Processing: புதிய தொழில்நுட்ப கண்டுபிடிப்பு அறிவியல் உலகில் சா...
[3/5] Processing: அரசு புதிய கல்வி கொள்கை அறிவிப்பு செய்தது...
[4/5] Processing: இலங்கை அரசு புதிய பொருளாதார சீர்திருத்தங்களை அறிவி...
[5/5] Processing: சென்னை மாநகரில் புதிய மெட்ரோ ரயில் திட்டம் தொடங்கி...

✓ Results saved to: output/batch_predictions.csv

✓ Batch prediction complete: 5 headlines processed

BATCH PREDICTION RESULTS


| | original_headline | preprocessed_headline | category_nb | category_svm | category_lr | sentiment_nb | sentiment_svm | sentiment_lr |
|---|---|---|---|---|---|---|---|---|
| 0 | இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது | இந்திய கிரிக்கெட் அணி உலகக் கோப்பையை வென்றது | sports | sports | sports | Positive | Positive | Positive |
| 1 | புதிய தொழில்நுட்ப கண்டுபிடிப்பு அறிவியல் உலகில... | புதிய தொழில்நுட்ப கண்டுபிடிப்பு அறிவியல் உலகில... | technology | technology | technology | Positive | Positive | Positive |
| 2 | அரசு புதிய கல்வி கொள்கை அறிவிப்பு செய்தது | அரசு புதிய கல்வி கொள்கை அறிவிப்பு செய்தது | tamilnadu | technology | technology | Neutral | Neutral | Neutral |
| 3 | இலங்கை அரசு புதிய பொருளாதார சீர்திருத்தங்களை அ... | இலங்கை அரசு புதிய பொருளாதார சீர்திருத்தங் அறிவி | india | world | world | Neutral | Neutral | Neutral |
| 4 | சென்னை மாநகரில் புதிய மெட்ரோ ரயில் திட்டம் தொட... | சென்னை மாநகரில் புதிய மெட்ரோ ரயில் திட்டம் தொட... | tamilnadu | tamilnadu | tamilnadu | Neutral | Neutral | Neutral |
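
In the table above the three category models disagree on rows 2 and 3, while the sentiment models agree everywhere. Rows where models disagree are natural candidates for manual review. The sketch below finds them in plain Python. The `rows` list is copied by hand from the prediction columns of the table, and `disagreement_rows` is a hypothetical helper rather than part of the notebook.

```python
from typing import Dict, List

# Prediction columns copied from the batch results table
rows: List[Dict[str, str]] = [
    {'category_nb': 'sports', 'category_svm': 'sports', 'category_lr': 'sports',
     'sentiment_nb': 'Positive', 'sentiment_svm': 'Positive', 'sentiment_lr': 'Positive'},
    {'category_nb': 'technology', 'category_svm': 'technology', 'category_lr': 'technology',
     'sentiment_nb': 'Positive', 'sentiment_svm': 'Positive', 'sentiment_lr': 'Positive'},
    {'category_nb': 'tamilnadu', 'category_svm': 'technology', 'category_lr': 'technology',
     'sentiment_nb': 'Neutral', 'sentiment_svm': 'Neutral', 'sentiment_lr': 'Neutral'},
    {'category_nb': 'india', 'category_svm': 'world', 'category_lr': 'world',
     'sentiment_nb': 'Neutral', 'sentiment_svm': 'Neutral', 'sentiment_lr': 'Neutral'},
    {'category_nb': 'tamilnadu', 'category_svm': 'tamilnadu', 'category_lr': 'tamilnadu',
     'sentiment_nb': 'Neutral', 'sentiment_svm': 'Neutral', 'sentiment_lr': 'Neutral'},
]

def disagreement_rows(rows: List[Dict[str, str]], task: str) -> List[int]:
    """Return indices of rows where the three models of `task` do not all agree."""
    columns = [f'{task}_{suffix}' for suffix in ('nb', 'svm', 'lr')]
    return [i for i, row in enumerate(rows)
            if len({row[col] for col in columns}) > 1]

print(disagreement_rows(rows, 'category'))
print(disagreement_rows(rows, 'sentiment'))
```

On the DataFrame itself, a comparable check could be written with `DataFrame.nunique(axis=1)` over the three columns of a task.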


## 10. Summary Statistics

In [12]:
print("\n" + "="*80)
print("PREDICTION SYSTEM SUMMARY")
print("="*80)

print("\n📊 LOADED MODELS:")
print("  Category Classification:")
print(f"    - Naive Bayes:        {'✓' if models['category_nb'] is not None else '✗'}")
print(f"    - SVM:                {'✓' if models['category_svm'] is not None else '✗'}")
print(f"    - Logistic Regression: {'✓' if models['category_lr'] is not None else '✗'}")
print(f"    - Vectorizer:         {'✓' if models['category_vectorizer'] is not None else '✗'}")

print("\n  Sentiment Classification:")
print(f"    - Naive Bayes:        {'✓' if models['sentiment_nb'] is not None else '✗'}")
print(f"    - SVM:                {'✓' if models['sentiment_svm'] is not None else '✗'}")
print(f"    - Logistic Regression: {'✓' if models['sentiment_lr'] is not None else '✗'}")
print(f"    - Vectorizer:         {'✓' if models['sentiment_vectorizer'] is not None else '✗'}")

print("\n📈 MODEL PERFORMANCE:")
print("  Category Models:")
for model_name, metrics in accuracies['category'].items():
    print(f"    {model_name:20s}: {metrics['accuracy']:.2%} accuracy, F1={metrics['f1']:.4f}")

print("\n  Sentiment Models:")
for model_name, metrics in accuracies['sentiment'].items():
    print(f"    {model_name:20s}: {metrics['accuracy']:.2%} accuracy, F1={metrics['f1']:.4f}")

print("\n" + "="*80)
print("✅ SYSTEM READY FOR PREDICTIONS")
print("="*80)
print("\nUsage:")
print("  1. Single prediction: predict_headline('your tamil headline here')")
print("  2. Batch prediction:  predict_batch([headline1, headline2, ...])")
print("="*80)


PREDICTION SYSTEM SUMMARY

📊 LOADED MODELS:
  Category Classification:
    - Naive Bayes:        ✓
    - SVM:                ✓
    - Logistic Regression: ✓
    - Vectorizer:         ✓

  Sentiment Classification:
    - Naive Bayes:        ✓
    - SVM:                ✓
    - Logistic Regression: ✓
    - Vectorizer:         ✓

📈 MODEL PERFORMANCE:
  Category Models:
    Naive Bayes         : 64.55% accuracy, F1=0.6390
    SVM                 : 66.21% accuracy, F1=0.6570
    Logistic Regression : 65.20% accuracy, F1=0.6512

  Sentiment Models:
    Naive Bayes         : 65.25% accuracy, F1=0.6507
    SVM                 : 67.19% accuracy, F1=0.6613
    Logistic Regression : 68.40% accuracy, F1=0.6571

✅ SYSTEM READY FOR PREDICTIONS

Usage:
  1. Single prediction: predict_headline('your tamil headline here')
  2. Batch prediction:  predict_batch([headline1, headline2, ...])
