# 📰 News Category Classification using N-grams

This notebook builds a text classification model that assigns news headlines to one of 5 categories, using TF-IDF n-gram features and a Multinomial Naive Bayes classifier.

## 📋 Categories:
- **POLITICS**
- **WELLNESS**
- **ENTERTAINMENT**
- **TRAVEL**
- **STYLE & BEAUTY**

---

In [1]:
import pandas as pd

## 1️⃣ Import Libraries

In [2]:
# Load the dataset
data = pd.read_json('../data/News_Category_Dataset_v3.json', lines=True)[['category', 'headline']]
print(f"✓ Loaded {len(data)} news articles")

✓ Loaded 209527 news articles


## 2️⃣ Load Dataset

Load the news dataset containing categories and headlines.

In [3]:
# Check shape and preview data
print("📊 Data Shape:", data.shape)
print("\n" + "="*50)
print("\n📋 First 5 rows:")
data.head()

📊 Data Shape: (209527, 2)


📋 First 5 rows:


| | category | headline |
|---|---|---|
| 0 | U.S. NEWS | Over 4 Million Americans Roll Up Sleeves For O... |
| 1 | U.S. NEWS | American Airlines Flyer Charged, Banned For Li... |
| 2 | COMEDY | 23 Of The Funniest Tweets About Cats And Dogs ... |
| 3 | PARENTING | The Funniest Tweets From Parents This Week (Se... |
| 4 | U.S. NEWS | Woman Who Called Cops On Black Bird-Watcher Lo... |


### 📊 Explore the Data

In [4]:
# Filter desired categories
desired_category = ['POLITICS', 'WELLNESS', 'ENTERTAINMENT', 'TRAVEL', 'STYLE & BEAUTY']
desired_data = data[data['category'].isin(desired_category)]

print(f"✓ Filtered to {len(desired_data)} articles")
print(f"📊 Filtered Data Shape: {desired_data.shape}")
print("\n" + "="*50)
print("\n📋 Preview:")
desired_data.head()

✓ Filtered to 90623 articles
📊 Filtered Data Shape: (90623, 2)


📋 Preview:


| | category | headline |
|---|---|---|
| 20 | ENTERTAINMENT | Golden Globes Returning To NBC In January Afte... |
| 21 | POLITICS | Biden Says U.S. Forces Would Defend Taiwan If ... |
| 24 | POLITICS | ‘Beautiful And Sad At The Same Time’: Ukrainia... |
| 28 | ENTERTAINMENT | James Cameron Says He 'Clashed' With Studio Be... |
| 30 | POLITICS | Biden Says Queen's Death Left 'Giant Hole' For... |


## 3️⃣ Filter Target Categories

Select only the 5 categories we want to classify.

In [5]:
# Data quality checks
print("📊 Dataset Info:")
desired_data.info()

print("\n" + "="*50)
print("\n📈 Category Distribution:")
print(desired_data['category'].value_counts())

print("\n" + "="*50)
print(f"\n🏷️ Number of Categories: {len(desired_data['category'].unique())}")

print("\n" + "="*50)
print("\n❌ Null Values:")
print(desired_data.isnull().sum())

print("\n" + "="*50)
print(f"\n🔄 Duplicate Rows: {desired_data.duplicated().sum()}")

📊 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 90623 entries, 20 to 209513
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  90623 non-null  object
 1   headline  90623 non-null  object
dtypes: object(2)
memory usage: 2.1+ MB


📈 Category Distribution:
category
POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9814
Name: count, dtype: int64


🏷️ Number of Categories: 5


❌ Null Values:
category    0
headline    0
dtype: int64


🔄 Duplicate Rows: 722


## 4️⃣ Data Quality Check

Check for missing values, duplicates, and category distribution.

In [6]:
# Clean the data
print("🧹 Cleaning data...")

# Remove duplicates, then reset the index once on the deduplicated frame
initial_count = len(desired_data)
desired_data = desired_data.drop_duplicates().reset_index(drop=True)
removed = initial_count - len(desired_data)

print(f"✓ Removed {removed} duplicate rows")
print(f"✓ Final cleaned shape: {desired_data.shape}")

🧹 Cleaning data...
✓ Removed 722 duplicate rows
✓ Final cleaned shape: (89901, 2)


## 5️⃣ Data Cleaning

Remove duplicates and reset index.

In [7]:
# Balance the dataset by undersampling every category to the size of the smallest one
min_count = desired_data['category'].value_counts().min()
# Sample each group explicitly instead of using groupby(...).apply(...), which
# emits a warning in recent pandas versions; the seed per category is unchanged.
balanced_data = pd.concat(
    [group.sample(min_count, random_state=42) for _, group in desired_data.groupby('category')]
).reset_index(drop=True)

print(f"⚖️ Balanced to {min_count} samples per category")
print(f"📊 Balanced Data Shape: {balanced_data.shape}")
print("\n" + "="*50)
print("\n📈 Balanced Category Distribution:")
print(balanced_data['category'].value_counts())

⚖️ Balanced to 9330 samples per category
📊 Balanced Data Shape: (46650, 2)


📈 Balanced Category Distribution:
category
ENTERTAINMENT     9330
POLITICS          9330
STYLE & BEAUTY    9330
TRAVEL            9330
WELLNESS          9330
Name: count, dtype: int64


## 6️⃣ Balance the Dataset

Ensure equal representation of all categories.
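
The balancing step is random undersampling: every category is cut down to the size of the smallest one. A minimal standard-library sketch of the idea follows; the `undersample` helper and the toy rows are hypothetical and only stand in for the pandas cell above.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, seed=42):
    """Randomly keep the same number of rows per category.

    `rows` is a list of (category, headline) tuples. Every category is
    reduced to the size of the smallest category.
    """
    by_category = defaultdict(list)
    for category, headline in rows:
        by_category[category].append((category, headline))

    min_count = min(len(group) for group in by_category.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for category in sorted(by_category):
        balanced.extend(rng.sample(by_category[category], min_count))
    return balanced


# Hypothetical toy data: 4 POLITICS rows and 2 TRAVEL rows.
toy_rows = [("POLITICS", f"politics headline {i}") for i in range(4)]
toy_rows += [("TRAVEL", f"travel headline {i}") for i in range(2)]

balanced_rows = undersample(toy_rows)
print(Counter(category for category, _ in balanced_rows))
```

Undersampling discards data from the larger categories (here, most POLITICS headlines), which is the price paid for equal class sizes.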

In [8]:
# Encode categories
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
balanced_data['category_encoded'] = le.fit_transform(balanced_data['category'])

print("🔢 Category Encoding:")
print(balanced_data[['category', 'category_encoded']].drop_duplicates().sort_values('category_encoded'))
print("\n" + "="*50)
print("\n📋 Preview:")
balanced_data.head()

🔢 Category Encoding:
             category  category_encoded
0       ENTERTAINMENT                 0
9330         POLITICS                 1
18660  STYLE & BEAUTY                 2
27990          TRAVEL                 3
37320        WELLNESS                 4


📋 Preview:


| | category | headline | category_encoded |
|---|---|---|---|
| 0 | ENTERTAINMENT | Even Captain America Is 'Devastated' That This... | 0 |
| 1 | ENTERTAINMENT | Alice Cooper Slams Mumford & Sons And The Lumi... | 0 |
| 2 | ENTERTAINMENT | Toni Collette Schools Daniel Radcliffe In This... | 0 |
| 3 | ENTERTAINMENT | Maksim Chmerkovskiy And Peta Murgatroyd Expect... | 0 |
| 4 | ENTERTAINMENT | Yes, Bette Midler Really Named Her Chickens Af... | 0 |


## 7️⃣ Encode Categories

Convert category labels to numeric values for classification.
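
The encoding printed above assigns integers in alphabetical order of the category names. The sketch below reproduces that mapping in plain Python to make the round trip between labels and integers explicit; `fit_label_encoding` is a hypothetical helper, not part of scikit-learn.

```python
def fit_label_encoding(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    to_index = {label: index for index, label in enumerate(classes)}
    return classes, to_index


labels = ["TRAVEL", "POLITICS", "WELLNESS", "POLITICS", "ENTERTAINMENT", "STYLE & BEAUTY"]
classes, to_index = fit_label_encoding(labels)

encoded = [to_index[label] for label in labels]  # labels -> integers
decoded = [classes[index] for index in encoded]  # integers -> labels

print(to_index)
print(encoded)
```

The integers carry no ordering meaning; they are only identifiers, and decoding them (as `le.inverse_transform` does in the notebook) recovers the original names.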

In [9]:
# Split data for baseline model
from sklearn.model_selection import train_test_split

X = balanced_data['headline']
y = balanced_data['category_encoded']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=balanced_data['category_encoded']
)

print("✓ Data split complete (80/20)")

✓ Data split complete (80/20)


## 8️⃣ Split Data (Before Preprocessing)

First model: Train/Test split without text preprocessing.
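
The split passes `stratify`, so each category keeps the same share in the training and test sets. The following is a simplified sketch of that idea using only the standard library; `stratified_split` is a hypothetical helper and does not reproduce scikit-learn's implementation or its exact row selection.

```python
import random
from collections import Counter, defaultdict


def stratified_split(samples, labels, test_size=0.2, seed=42):
    """Split samples so each label keeps the same share in train and test."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)
    train_indices, test_indices = [], []
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)  # shuffle within the label before cutting
        n_test = round(len(indices) * test_size)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])

    X_train = [samples[i] for i in train_indices]
    X_test = [samples[i] for i in test_indices]
    y_train = [labels[i] for i in train_indices]
    y_test = [labels[i] for i in test_indices]
    return X_train, X_test, y_train, y_test


# Hypothetical toy data: 10 headlines for each of two encoded categories.
toy_labels = [0] * 10 + [1] * 10
toy_headlines = [f"headline {i}" for i in range(20)]

X_train, X_test, y_train, y_test = stratified_split(toy_headlines, toy_labels)
print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

In the notebook, 20% of the 9,330 headlines per category is 1,866, which matches the per-class support shown in the classification reports.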

In [10]:
# View split shapes
print("📊 Train/Test Split:")
print(f"  X_train: {X_train.shape}")
print(f"  X_test:  {X_test.shape}")
print(f"  y_train: {y_train.shape}")
print(f"  y_test:  {y_test.shape}")

📊 Train/Test Split:
  X_train: (37320,)
  X_test:  (9330,)
  y_train: (37320,)
  y_test:  (9330,)


In [11]:
# Build baseline model with unigrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

print("🤖 Training baseline model (unigrams only)...")

# Create pipeline
model = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1))),
    ('clf', MultinomialNB())
])

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"\n✓ Baseline Accuracy: {accuracy:.4f}")
print("\n" + "="*50)
print("\n📊 Classification Report:")
print(metrics.classification_report(y_test, y_pred, target_names=le.classes_))

🤖 Training baseline model (unigrams only)...

✓ Baseline Accuracy: 0.8493


📊 Classification Report:
                precision    recall  f1-score   support

 ENTERTAINMENT       0.85      0.79      0.82      1866
      POLITICS       0.86      0.89      0.88      1866
STYLE & BEAUTY       0.86      0.85      0.85      1866
        TRAVEL       0.86      0.87      0.86      1866
      WELLNESS       0.82      0.85      0.84      1866

      accuracy                           0.85      9330
     macro avg       0.85      0.85      0.85      9330
  weighted avg       0.85      0.85      0.85      9330



## 9️⃣ Baseline Model (Unigrams Only)

Train a baseline model using unigrams without preprocessing.
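
To make the unigram TF-IDF features more concrete, here is a small worked sketch in plain Python. It assumes the weighting that scikit-learn documents as the `TfidfVectorizer` default (raw term counts, smoothed IDF, L2 normalisation per document) and simplifies tokenisation to whitespace splitting; `tfidf_rows` and the three documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    df = Counter(token for tokens in tokenised for token in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = [
    "senate passes budget bill",
    "budget travel tips",
    "senate debates travel ban",
]
vocabulary, rows = tfidf_rows(docs)

# Show the non-zero weights of the first document.
for term, weight in zip(vocabulary, rows[0]):
    if weight > 0:
        print(f"{term:8s} {weight:.3f}")
```

Within the first document, "passes" and "bill" occur in only one document and therefore receive a higher weight than "senate" and "budget", which each occur in two.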

In [12]:
# Test with sample headlines
sample_headlines = [
    "New advancements in AI technology",
    "Top 10 travel destinations for 2024",
    "The impact of climate change on politics",
    "Latest trends in wellness and health",
    "Upcoming movies to watch this summer"
]

print("🧪 Sample Predictions (Baseline Model):\n")
predicted_categories = model.predict(sample_headlines)
for headline, category_encoded in zip(sample_headlines, predicted_categories):
    category = le.inverse_transform([category_encoded])[0]
    print(f"  • {headline}")
    print(f"    → {category}\n")

🧪 Sample Predictions (Baseline Model):

  • New advancements in AI technology
    → WELLNESS

  • Top 10 travel destinations for 2024
    → TRAVEL

  • The impact of climate change on politics
    → POLITICS

  • Latest trends in wellness and health
    → WELLNESS

  • Upcoming movies to watch this summer
    → ENTERTAINMENT



### 🧪 Test Baseline Model

In [13]:
# Preprocess text
import nltk
from nltk.corpus import stopwords

print("🧹 Applying text preprocessing...")

# Download stopwords
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

# 1. Lowercase
balanced_data['headline'] = balanced_data['headline'].str.lower()

# 2. Remove punctuation (raw string so \w and \s are passed to the regex engine as-is)
balanced_data['headline'] = balanced_data['headline'].str.replace(r'[^\w\s]', '', regex=True)

# 3. Remove stopwords
balanced_data['headline'] = balanced_data['headline'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

print("✓ Preprocessing complete!")
print("\n📋 Preview of cleaned headlines:")
balanced_data.head()

🧹 Applying text preprocessing...
✓ Preprocessing complete!

📋 Preview of cleaned headlines:


| | category | headline | category_encoded |
|---|---|---|---|
| 0 | ENTERTAINMENT | even captain america devastated country electe... | 0 |
| 1 | ENTERTAINMENT | alice cooper slams mumford sons lumineers says... | 0 |
| 2 | ENTERTAINMENT | toni collette schools daniel radcliffe imperiu... | 0 |
| 3 | ENTERTAINMENT | maksim chmerkovskiy peta murgatroyd expecting ... | 0 |
| 4 | ENTERTAINMENT | yes bette midler really named chickens kardash... | 0 |


---

## 🔟 Text Preprocessing

Apply NLP preprocessing to improve model performance:
1. Convert to lowercase
2. Remove punctuation
3. Remove stopwords
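
The three steps can be sketched as a single function applied to one headline. This version uses only the standard library; `clean_headline` is a hypothetical helper, and the tiny `STOP_WORDS` set is an illustrative stand-in for NLTK's full English stopword list used in the notebook.

```python
import re

# Hypothetical, deliberately small stopword set (the notebook uses NLTK's list).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "that", "the", "this", "to"}


def clean_headline(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stopwords from one headline."""
    text = text.lower()                  # 1. lowercase
    text = re.sub(r"[^\w\s]", "", text)  # 2. remove punctuation
    kept = [word for word in text.split() if word not in stop_words]  # 3. stopwords
    return " ".join(kept)


print(clean_headline("Top 10 Travel Destinations To Visit In 2024!"))
print(clean_headline("The Impact Of Climate Change On Politics"))
```

Wrapping the steps in one function also makes it easy to apply the same cleaning to new headlines before prediction, so training and inference see text in the same form.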

In [14]:
# Train final model with preprocessed data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

print("🤖 Training final model (unigrams + bigrams)...")

# Split preprocessed data
X = balanced_data['headline']
y = balanced_data['category_encoded']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=balanced_data['category_encoded']
)

# Create pipeline with bigrams
model = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),  # Unigrams + Bigrams
    ('clf', MultinomialNB())
])

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"\n🎯 Final Accuracy: {accuracy:.4f}")
baseline_accuracy = 0.8493  # accuracy reported by the baseline run above
print(f"📈 Change vs. baseline: {(accuracy - baseline_accuracy)*100:+.2f} percentage points")
print("\n" + "="*50)
print("\n📊 Final Classification Report:")
print(metrics.classification_report(y_test, y_pred, target_names=le.classes_))

🤖 Training final model (unigrams + bigrams)...

🎯 Final Accuracy: 0.8481
📈 Change vs. baseline: -0.12 percentage points


📊 Final Classification Report:
                precision    recall  f1-score   support

 ENTERTAINMENT       0.84      0.80      0.82      1866
      POLITICS       0.85      0.90      0.88      1866
STYLE & BEAUTY       0.85      0.87      0.86      1866
        TRAVEL       0.86      0.85      0.86      1866
      WELLNESS       0.84      0.81      0.82      1866

      accuracy                           0.85      9330
     macro avg       0.85      0.85      0.85      9330
  weighted avg       0.85      0.85      0.85      9330


## 1️⃣1️⃣ Final Model (Unigrams + Bigrams)

Train the final model with preprocessed text and both unigrams & bigrams.
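
The difference between `ngram_range=(1, 1)` and `ngram_range=(1, 2)` is which token sequences become features. The sketch below lists them for one cleaned headline; `word_ngrams` is a hypothetical helper that mimics the idea with whitespace tokens and is not the vectorizer's own code.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    tokens = text.split()
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the headline.
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


headline = "golden globes returning nbc"
print(word_ngrams(headline, (1, 1)))  # unigrams only
print(word_ngrams(headline, (1, 2)))  # unigrams + bigrams
```

Bigrams such as "golden globes" capture short phrases that single words miss, at the cost of a much larger and sparser vocabulary.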

In [15]:
# Export the trained model
import joblib

model_path = '../text_classification_model.pkl'
joblib.dump(model, model_path)

print("✅ Model saved successfully!")
print(f"📁 Location: {model_path}")
print("📦 Model includes:")
print("   • TF-IDF Vectorizer (unigrams + bigrams)")
print("   • Multinomial Naive Bayes Classifier")
print(f"   • Accuracy: {accuracy:.4f}")

✅ Model saved successfully!
📁 Location: ../text_classification_model.pkl
📦 Model includes:
   • TF-IDF Vectorizer (unigrams + bigrams)
   • Multinomial Naive Bayes Classifier
   • Accuracy: 0.8481


---

## ✅ Summary

**Model Performance:**
- Baseline (unigrams, no preprocessing): 84.93% accuracy
- Final (unigrams + bigrams, with preprocessing): 84.81% accuracy, so preprocessing and bigrams did not improve on the baseline in this run

**Preprocessing Steps:**
1. Lowercase conversion
2. Punctuation removal
3. Stopwords removal

**Model Architecture:**
- TF-IDF Vectorization (ngram_range=(1, 2))
- Multinomial Naive Bayes Classifier

**Next Steps:**
- Deploy with a Flask web app
- Test with real news headlines
- Monitor performance

## 1️⃣2️⃣ Export Model

Save the trained model for deployment.
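
A saved pipeline can be loaded back with `joblib.load` and used for prediction without retraining. The round trip below is a self-contained sketch on a hypothetical toy dataset and a temporary file; it does not touch the notebook's `../text_classification_model.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data, only to keep the example self-contained.
headlines = [
    "senate votes on new budget bill",
    "president signs election reform law",
    "best beaches to visit this summer",
    "cheap flights and hotel deals for your vacation",
]
labels = ["POLITICS", "POLITICS", "TRAVEL", "TRAVEL"]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
toy_model.fit(headlines, labels)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

new_headlines = ["senate budget vote", "summer vacation flights"]
original_pred = toy_model.predict(new_headlines)
restored_pred = restored.predict(new_headlines)
print(list(restored_pred))
```

One difference from this sketch: the notebook's pipeline was trained on encoded labels, so it predicts integers. The `LabelEncoder` is not part of the saved file, so the mapping (0 = ENTERTAINMENT through 4 = WELLNESS) has to be stored or recreated alongside the model to decode its predictions.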