<a href="https://colab.research.google.com/github/MohammedKaif037/NewsArticleNLP/blob/main/NewsArticleClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Article Classification using NLP

## Introduction

News articles cover a wide range of topics such as politics, economics, sports and entertainment. Automatically classifying them into categories can save significant time for journalists, readers and content aggregators.

### Key Benefits of Automated News Categorization:
- **Media Monitoring**: Quickly track news on specific topics.
- **Content Recommendations**: Recommend articles based on users' interests.
- **Sentiment Analysis**: Determine public sentiment towards political events, companies, etc.

Natural language processing (NLP) makes this possible: we can classify news articles into predefined categories using text representation techniques such as **Bag of Words (BoW)** and **Term Frequency-Inverse Document Frequency (TF-IDF)**. Both techniques convert text into numerical vectors that machine learning algorithms can use as input.

## 1. Importing Necessary Libraries

We will import pandas, NumPy, NLTK and scikit-learn.

In [2]:
import pandas as pd
import numpy as np
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

## 2. Loading the Dataset

We will download the dataset with `kagglehub`, load it into a DataFrame and inspect its labels, size and a sample article. You can also download the dataset from [here](link-to-dataset).

In [6]:
import kagglehub
path = kagglehub.dataset_download("amirarsalankhroush/bbc-data")
print("Dataset downloaded to:", path)

Using Colab cache for faster access to the 'bbc-data' dataset.
Dataset downloaded to: /kaggle/input/bbc-data


In [7]:
df = pd.read_csv('/kaggle/input/bbc-data/bbc_data.csv')
df["labels"].unique()



**Output:**
```
array(['entertainment', 'business', 'sport', 'politics', 'tech'],
      dtype=object)
```

In [8]:
# Check dataset size
print(f"Total articles: {len(df)}")

# Check distribution across categories
print("\nCategory distribution:")
print(df['labels'].value_counts())

# Check a sample article
print("\nSample article (first 200 characters):")
print(df['data'].iloc[0][:200], "...")

Total articles: 2225

Category distribution:
labels
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

Sample article (first 200 characters):
Musicians to tackle US red tape  Musicians groups are to tackle US visa regulations which are blamed for hindering British acts chances of succeeding across the Atlantic.  A singer hoping to perform i ...


In [9]:
df['data'] # view data before preprocessing

|  | data |
| --- | --- |
| 0 | Musicians to tackle US red tape Musicians gro... |
| 1 | U2s desire to be number one U2, who have won ... |
| 2 | Rocker Doherty in on-stage fight Rock singer ... |
| 3 | Snicket tops US box office chart The film ada... |
| 4 | Oceans Twelve raids box office Oceans Twelve,... |
| ... | ... |
| 2220 | Warning over Windows Word files Writing a Mic... |
| 2221 | Fast lifts rise into record books Two high-sp... |
| 2222 | Nintendo adds media playing to DS Nintendo is... |
| 2223 | Fast moving phone viruses appear Security fir... |


## 3. Downloading NLTK Resources

We will download the following NLTK resources:
- **punkt**: The pre-trained Punkt tokenizer models, which `word_tokenize` relies on to split text into sentences before splitting them into words.
- **stopwords**: Predefined lists of common words such as "the" and "is" that carry little topical meaning.
- **punkt_tab**: The same Punkt models in the tabular data format that recent NLTK releases load in place of the older `punkt` package.

In [10]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True
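
In an offline or frequently re-run notebook it can be convenient to skip the download when the data is already present. The following is a minimal sketch of such a guard, not part of the original notebook: the helper name `ensure_nltk_resources` and its `finder` and `downloader` parameters are hypothetical and exist only so the logic can be exercised without network access. It relies on `nltk.data.find` raising `LookupError` for a missing resource.

```python
# Resource name -> path that nltk.data.find expects for it
DEFAULT_RESOURCES = {
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
    "punkt_tab": "tokenizers/punkt_tab",
}


def ensure_nltk_resources(resources=None, finder=None, downloader=None):
    """Download only the NLTK resources that are not installed yet.

    `finder` and `downloader` are hypothetical injection points; by default
    they fall back to nltk.data.find and nltk.download.
    """
    if resources is None:
        resources = DEFAULT_RESOURCES
    if finder is None or downloader is None:
        import nltk  # imported lazily so the helper can be tested offline
        finder = finder or nltk.data.find
        downloader = downloader or nltk.download

    downloaded = []
    for package, path in resources.items():
        try:
            finder(path)  # raises LookupError when the resource is missing
        except LookupError:
            downloader(package, quiet=True)
            downloaded.append(package)
    return downloaded
```

Calling `ensure_nltk_resources()` with no arguments checks the three resources used in this notebook and returns the names of the packages it had to fetch.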

## 4. Preprocessing the Text

We will preprocess the text data by following these steps:
1. **Lowercasing and Tokenization**: Convert the text to lowercase and split it into individual tokens.
2. **Non-alphabetic Token Removal**: Drop tokens that contain anything other than letters, such as numbers and punctuation.
3. **Stopword Removal**: Remove common words like "the", "is", etc. which do not add significant meaning.

We will define a function for pre-processing our text.

- `stopwords.words('english')`: Fetches a predefined list of common stopwords in English for filtering.
- `word_tokenize(text.lower())`: Converts the text to lowercase and splits it into individual words (tokens).
- `tokens = [word for word in tokens if word.isalpha()]`: Keeps only tokens made up entirely of letters, so numbers, punctuation and mixed tokens such as "u2s" are dropped whole.
- `tokens = [word for word in tokens if word not in stop_words]`: Removes stopwords to focus on meaningful words for analysis.

In [11]:
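def preprocess_text(text):
    # A set gives constant-time membership checks during filtering
    stop_words = set(stopwords.words('english'))
    # Lowercase first so that "The" and "the" map to the same token
    tokens = word_tokenize(text.lower())
    # Keep only tokens made up entirely of letters
    tokens = [word for word in tokens if word.isalpha()]
    # Drop common words that carry little topical meaning
    tokens = [word for word in tokens if word not in stop_words]
    return tokens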
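
To see what each step does without downloading any NLTK data, here is a dependency-free sketch of the same three stages. It is an approximation, not the notebook's function: the regular expression stands in for `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list in place of NLTK's English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list (NLTK's list is much longer)
STOP_WORDS = {"the", "is", "a", "an", "to", "be", "of", "and", "in"}


def simple_preprocess(text):
    # Stage 1: lowercase, then split into word chunks and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Stage 2: keep tokens made up entirely of letters ("u2s", "3", ":" are dropped)
    tokens = [token for token in tokens if token.isalpha()]
    # Stage 3: remove stopwords
    return [token for token in tokens if token not in STOP_WORDS]


sample = "U2s desire to be number one is clear: 3 awards in 2005!"
cleaned = simple_preprocess(sample)
print(cleaned)
```

The mixed token "u2s" disappears entirely rather than being reduced to "us", which matches the processed output for the second article shown in the next section.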

## 5. Applying Preprocessing to Dataset

We will now apply the preprocessing function and clean the data.

In [12]:
df['processed_content'] = df['data'].apply(preprocess_text)
df['processed_content'].head()

|  | processed_content |
| --- | --- |
| 0 | [musicians, tackle, us, red, tape, musicians, ... |
| 1 | [desire, number, one, three, prestigious, gram... |
| 2 | [rocker, doherty, fight, rock, singer, pete, d... |
| 3 | [snicket, tops, us, box, office, chart, film, ... |
| 4 | [oceans, twelve, raids, box, office, oceans, t... |


## 6. Text Vectorization

Now, we will transform the text data into numerical vectors using Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

### 6.1. Vectorize Text Using BoW

- `CountVectorizer()`: Converts a collection of text documents into a matrix of token counts (BoW).
- `bow_vectorizer.fit_transform()`: Fits the vectorizer on the dataset and transforms the text into a matrix of word counts.
- `df['processed_content'].apply(' '.join)`: Joins the list of tokens (processed content) into a single string for each article, as CountVectorizer expects input in string format.

In [13]:
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(df['processed_content'].apply(' '.join))

### 6.2. Vectorize Text Using TF-IDF

- `TfidfVectorizer()`: Converts a collection of text documents into a matrix of TF-IDF features.
- `tfidf_vectorizer.fit_transform()`: Fits the vectorizer on the dataset and transforms the text into a matrix of TF-IDF values.
- `df['processed_content'].apply(' '.join)`: Joins the list of tokens (processed content) into a single string for each article, as TfidfVectorizer expects input in string format.

In [14]:
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['processed_content'].apply(' '.join))
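
As a rough illustration of how TF-IDF differs from raw counts, the sketch below recomputes the inverse document frequency by hand on a made-up corpus and compares it with what `TfidfVectorizer` stores. It assumes the default settings (`smooth_idf=True`, `norm='l2'`), under which scikit-learn uses `idf = ln((1 + n) / (1 + df)) + 1` and scales every row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the BBC dataset
docs = [
    "markets rally as stocks rise",
    "goal after goal wins the match",
    "stocks fall as markets slide",
]

toy_tfidf = TfidfVectorizer()
X_dense = toy_tfidf.fit_transform(docs).toarray()

# Document frequency: number of documents in which each term appears
n_docs = len(docs)
doc_freq = (X_dense > 0).sum(axis=0)

# Smoothed IDF as used by the default TfidfVectorizer settings
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_stocks = toy_tfidf.idf_[toy_tfidf.vocabulary_["stocks"]]
idf_goal = toy_tfidf.idf_[toy_tfidf.vocabulary_["goal"]]
print("idf('stocks'), appears in 2 documents:", round(float(idf_stocks), 3))
print("idf('goal'), appears in 1 document:", round(float(idf_goal), 3))

# With norm='l2', every document vector has length 1
row_norms = np.linalg.norm(X_dense, axis=1)
print(row_norms)
```

Terms that occur in many documents receive a smaller weight than terms concentrated in a few, which is the main difference from the BoW counts.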

## 7. Training our Models

We will train a **Naive Bayes classifier** on the BoW and TF-IDF representations of the text. Naive Bayes is a machine learning model based on Bayes' theorem which assumes that features are independent given the class. It is particularly effective for text classification tasks where the features (words) are treated as independent predictors of the class label.

- `train_test_split()`: Splits the dataset into training and test sets.
- `test_size=0.2`: Allocates 20% of the data for testing.
- `random_state=42`: Ensures reproducibility by using a fixed random seed.
- `MultinomialNB()`: Initializes the Naive Bayes classifier.
- `fit()`: Trains the model on the training data.
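
To make the independence assumption concrete, the sketch below reproduces the multinomial Naive Bayes scoring by hand on a tiny count matrix and checks it against `MultinomialNB`. The data and the four-word vocabulary are hypothetical. It assumes the default Laplace smoothing (`alpha=1.0`): each class score is the log prior plus the sum of per-word log likelihoods weighted by the word counts.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts: rows are documents, columns are 4 vocabulary terms
X = np.array([
    [2, 1, 0, 0],
    [3, 0, 1, 0],
    [0, 0, 2, 3],
    [0, 1, 1, 2],
])
y = np.array(["sport", "sport", "tech", "tech"])

toy_model = MultinomialNB(alpha=1.0).fit(X, y)

classes = np.unique(y)
alpha = 1.0

# log P(class): share of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class) with Laplace smoothing
word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score a new document: counts times log likelihoods, plus the prior
x_new = np.array([[1, 0, 0, 2]])
scores = x_new @ log_likelihood.T + log_prior
manual_prediction = classes[np.argmax(scores, axis=1)]

print("Manual prediction: ", manual_prediction[0])
print("sklearn prediction:", toy_model.predict(x_new)[0])
```

Because the per-word terms are simply added, no interaction between words is modelled, which is what "features are independent given the class" means in practice.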

### 7.1. Training BoW model

We will train the BoW model by splitting the data into training and test sets and using a Naive Bayes classifier to fit the training data.

In [15]:
X_train_bow, X_test_bow, y_train, y_test = train_test_split(X_bow, df['labels'], test_size=0.2, random_state=42)

nb_model1 = MultinomialNB()
nb_model1.fit(X_train_bow, y_train)

### 7.2. Training TF-IDF model

We will train the TF-IDF model in the same way. Because `test_size` and `random_state` are unchanged, the split selects the same articles as in 7.1, so `y_train` and `y_test` are identical to the BoW case.

In [16]:
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, df['labels'], test_size=0.2, random_state=42)

nb_model2 = MultinomialNB()
nb_model2.fit(X_train_tfidf, y_train)
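
The cells above fit the vectorizers on the full dataset before splitting, so the vocabulary and IDF weights also see the test articles. One possible alternative, sketched below on a made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vectorizer is fitted on the training portion only. The `stratify` argument is an addition that keeps the class proportions equal in both splits; neither change is part of the original notebook, and the reported scores would likely shift slightly if it were applied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the article texts and their labels
texts = [
    "striker scores late goal",
    "team wins league match",
    "coach names squad for final",
    "club signs new defender",
    "chip maker unveils faster processor",
    "new phone software released",
    "browser update fixes security flaw",
    "startup launches cloud service",
]
labels = ["sport"] * 4 + ["tech"] * 4

# Split the raw text first, keeping class proportions equal in both parts
X_train_text, X_test_text, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(X_train_text, y_train_toy)

learned_vocabulary = set(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
print(len(learned_vocabulary), "terms learned from", len(X_train_text), "training texts")
print(pipeline.predict(X_test_text))
```

Swapping `TfidfVectorizer()` for `CountVectorizer()` gives the BoW variant with the same structure.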

## 8. Comparing Both Models

We will now evaluate both models on the held-out test set.

- `predict()`: Makes predictions on the test data.
- `classification_report()`: Computes per-class precision, recall and F1-score along with overall accuracy.
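
As a small worked example of how the numbers in a classification report arise, the sketch below derives per-class precision and recall from a confusion matrix for five made-up predictions. The labels and predictions are hypothetical and unrelated to the models in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for five articles
y_true = ["tech", "tech", "sport", "sport", "politics"]
y_pred = ["tech", "politics", "sport", "sport", "politics"]
label_order = ["politics", "sport", "tech"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=label_order)
print(cm)

# Precision: correct predictions of a class / all predictions of that class
manual_precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class
manual_recall = cm.diagonal() / cm.sum(axis=1)

sk_precision = precision_score(y_true, y_pred, labels=label_order, average=None)
sk_recall = recall_score(y_true, y_pred, labels=label_order, average=None)

for name, p, r in zip(label_order, manual_precision, manual_recall):
    print(f"{name:>8}: precision={p:.2f} recall={r:.2f}")
```

Here one tech article is mistaken for politics, which lowers the precision of "politics" and the recall of "tech" while leaving "sport" untouched.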

In [17]:
y_pred_bow = nb_model1.predict(X_test_bow)
print("BoW Model Performance:\n", classification_report(y_test, y_pred_bow))

y_pred_tfidf = nb_model2.predict(X_test_tfidf)
print("TF-IDF Model Performance:\n", classification_report(y_test, y_pred_tfidf))

BoW Model Performance:
                precision    recall  f1-score   support

     business       0.98      0.96      0.97       103
entertainment       1.00      0.98      0.99        84
     politics       0.98      0.99      0.98        80
        sport       1.00      0.99      0.99        98
         tech       0.95      1.00      0.98        80

     accuracy                           0.98       445
    macro avg       0.98      0.98      0.98       445
 weighted avg       0.98      0.98      0.98       445

TF-IDF Model Performance:
                precision    recall  f1-score   support

     business       0.96      0.99      0.98       103
entertainment       1.00      0.95      0.98        84
     politics       0.90      0.97      0.93        80
        sport       0.99      0.99      0.99        98
         tech       1.00      0.93      0.96        80

     accuracy                           0.97       445
    macro avg       0.97      0.97      0.97       445
 weighted

### Performance Analysis

Both models classify the 445 test articles with high accuracy across all five categories.

**BoW Model:**
- Achieves an accuracy of **98%**, with precision, recall and F1-scores of at least 0.95 in every category.
- Performs best on the "sport" and "entertainment" categories, with a precision of 1.00 and an F1-score of 0.99 for both.

**TF-IDF Model:**
- Reaches a slightly lower overall accuracy of **97%**.
- Is weakest on "politics" (precision 0.90, F1-score 0.93) and "tech" (recall 0.93, F1-score 0.96), where BoW reaches an F1-score of 0.98 in both. On "business" it is on par with BoW (F1-score 0.98 versus 0.97).
- Shows perfect precision (1.00) for the "tech" and "entertainment" categories, at the cost of lower recall (0.93 and 0.95).

For a more detailed comparison of BoW and TF-IDF, refer to [this article](link-to-article).

## 9. Making Predictions

We will use the trained Naive Bayes models to make predictions on custom input text. The text will first be preprocessed, then transformed using both BoW and TF-IDF and finally classified into categories.

- `preprocess_text(custom_text1)`: Preprocesses the input text by tokenizing, removing stopwords and filtering non-alphabetic words.
- `' '.join(preprocess_text(custom_text1))`: Joins tokens into a string for vectorization.
- `bow_vectorizer.transform([processed_custom_text])`: Transforms the processed text into BoW format for the model.
- `tfidf_vectorizer.transform([processed_custom_text])`: Transforms the processed text into TF-IDF format.
- `nb_model1.predict(custom_text_bow)`: Predicts the category using the BoW model.
- `nb_model2.predict(custom_text_tfidf)`: Predicts the category using the TF-IDF model.

In [18]:
custom_text1 = "Artificial intelligence is revolutionizing the tech industry, with companies racing to develop the next big innovation."

print("Input text: ", custom_text1)

processed_custom_text = ' '.join(preprocess_text(custom_text1))

custom_text_bow = bow_vectorizer.transform([processed_custom_text])
custom_text_tfidf = tfidf_vectorizer.transform([processed_custom_text])

predicted_category_bow = nb_model1.predict(custom_text_bow)
print(f"Predicted Category (BoW): {predicted_category_bow[0]}")

predicted_category_tfidf = nb_model2.predict(custom_text_tfidf)
print(f"Predicted Category (TF-IDF): {predicted_category_tfidf[0]}")



**Output:**
```
Input text: Artificial intelligence is revolutionizing the tech industry, with companies racing to develop the next big innovation.
Predicted Category (BoW): tech
Predicted Category (TF-IDF): tech
```
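
The transform-then-predict steps can be wrapped in a small helper that also reports how confident the model is. This is a sketch under assumptions: the function name `classify` and the four training sentences are hypothetical, and the preprocessing step is left out for brevity. It uses `predict_proba`, which `MultinomialNB` provides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def classify(text, vectorizer, model):
    """Return the predicted label and its estimated probability for one text."""
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


# Hypothetical toy training data standing in for the fitted notebook objects
train_texts = [
    "new phone software released",
    "chip maker unveils faster processor",
    "team wins league match",
    "striker scores late goal",
]
train_labels = ["tech", "tech", "sport", "sport"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(train_texts), train_labels)

label, confidence = classify("software update for the new phone", toy_vectorizer, toy_model)
print(f"Predicted category: {label} (probability {confidence:.2f})")
```

With the notebook's own objects, the call would take the preprocessed string together with `bow_vectorizer` and `nb_model1`, or `tfidf_vectorizer` and `nb_model2`. Naive Bayes probabilities tend to be pushed towards 0 or 1, so they are better read as a ranking signal than as calibrated confidence.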

Both models assign the sample text to the `tech` category, which matches its content.