# News Group Classification

In this project, we aim to build a **text classification model** that can automatically categorize news articles into their respective topics. This involves applying **Natural Language Processing (NLP)** techniques and training a machine learning model on labeled news data.


# Importing Libraries

In [None]:
!pip install contractions
!pip install pyarrow
!pip install numpy
!pip install gensim
import pandas as pd
import contractions
import re
import nltk
import sklearn
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Dataset Overview

Our dataset consists of news articles with the following columns:

- **category**: The target label indicating the topic of the article (e.g., sports, tech, politics, etc.).
- **filename**: The file name or path associated with each article.
- **content**: The full text of the news article, which will serve as our main input for training the classification model.

We'll use the **`content`** column as the input feature for NLP processing, and the **`category`** column as the target for model training.


In [13]:
data = pd.read_parquet("news_data.parquet",engine='pyarrow')

In [14]:
data.to_csv('news_data.csv', index=False)

In [15]:
print(data.head())

      category filename                                            content
0  alt.atheism    49960  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...
1  alt.atheism    51060  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
2  alt.atheism    51119  Newsgroups: alt.atheism\nPath: cantaloupe.srv....
3  alt.atheism    51120  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
4  alt.atheism    51121  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...


In [16]:
print(data.shape)               # Rows and columns

(19997, 3)


In [17]:
print(data['category'].value_counts())  # How many articles per category

category
alt.atheism                 1000
comp.graphics               1000
talk.politics.misc          1000
talk.politics.mideast       1000
talk.politics.guns          1000
sci.space                   1000
sci.med                     1000
sci.electronics             1000
sci.crypt                   1000
rec.sport.hockey            1000
rec.sport.baseball          1000
rec.motorcycles             1000
rec.autos                   1000
misc.forsale                1000
comp.windows.x              1000
comp.sys.mac.hardware       1000
comp.sys.ibm.pc.hardware    1000
comp.os.ms-windows.misc     1000
talk.religion.misc          1000
soc.religion.christian       997
Name: count, dtype: int64


In [18]:
print(data['content'][0])       # View a sample article

Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
From: mathew <mathew@mantis.co.uk>
Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
Subject: Alt.Atheism FAQ: Atheist Resources
Summary: Books, addresses, music -- anything related to atheism
Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
Message-ID: <19930329115719@mantis.co.uk>
Date: Mon, 29 Mar 1993 11:57:19 GMT
Expires: Thu, 29 Apr 1993 11:57:19 GMT
Followup-To: alt.atheism
Distribution: world
Organization: Mantis Consultants, Cambridge. UK.
Approved: news-answers-request@mit.edu
Supersedes: <19930301143317@mantis.co.uk>
Lines: 290

Archive-name: atheism/resources
Alt-atheism-archive-name: resources
Last-modified: 11 December

In [19]:
news_content = data['content']
news_labels = data['category']

# Preprocessing

## Clean Raw Input Data

To ensure our text data is ready for machine learning, we apply the following preprocessing steps:

- **Expand Contractions**  
  Convert common contractions to their full form for consistency.  
  _Examples:_
  - `"don't"` → `"do not"`  
  - `"it's"` → `"it is"`

- **Lowercase the Text**  
  Normalize all text to lowercase to reduce vocabulary size and avoid case-sensitive duplicates.

- **Remove Metadata**  
  Strip away unnecessary headers, footers, and email signatures that do not contribute to the actual content.

- **️Remove Numbers and Punctuation**  
  These elements often add noise and are usually not meaningful in text classification tasks.

- **Remove Extra Whitespace**  
  Clean up unnecessary spaces, tabs, and newline characters to maintain uniformity in the text.


In [20]:
news_content = news_content.apply(lambda x: contractions.fix(x)) # expand contractions
news_content  = news_content.str.lower()

In [21]:
def clean_article(text):
    # Remove headers and footers (common email/news metadata)
    text = re.sub(r"(?s)^.*?Lines: \d+\s+", "", text)
    text = re.sub(r"(?s)^.*?(?=Archive-name:)", "", text)
    text = re.sub(r"(?s)^\s*From:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Subject:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Path:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Newsgroups:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Message-ID:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Organization:.*?\n", "", text)

    # Remove email signatures (e.g., lines starting with '--')
    text = re.sub(r"--\s*\n.*", "", text, flags=re.DOTALL)

    # Remove numbers and punctuation
    text = re.sub(r"[^a-zA-Z\s]", " ", text)

    # Remove extra whitespace (tabs, newlines, multiple spaces)
    text = re.sub(r"\s+", " ", text).strip()

    return text


In [22]:
news_content = news_content.apply(clean_article)

In [23]:
#news_content[0] # sample output to ensure cleaning was applied


# Tokenization and stopwords removal

- **Tokenize Text**    
  Break each cleaned text into individual words (tokens) using NLTK's word_tokenize. This enables more granular analysis and further NLP processing.
  
  _Examples:_
  - `"the quick brown fox"` → `["the", "quick", "brown", "fox"]`

- **Remove Stopwords**
  Eliminate common English stopwords (e.g., "the", "is", "and") using NLTK’s predefined list. These words typically carry less semantic meaning and can introduce noise in text classification tasks.

  _Examples:_
  - `["the", "quick", "brown", "fox"]` → `["quick", "brown", "fox"]`

In [24]:
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

# Apply to the series
news_tokens = news_content.apply(tokenize_and_remove_stopwords)


## TF–IDF Vectorization

Transform raw text data into numerical features.TF–IDF captures the importance of words across documents by balancing their frequency within a document.

 - Reconstruct Tokenized Text

In [25]:
cleaned_text = news_tokens.apply(lambda tokens: " ".join(tokens))

- Initialize the vectorizer

In [26]:
vectorizer = TfidfVectorizer(
    # ignore very rare words
    min_df=5,
    # ignore very common words
    max_df=0.8,
    # include unigrams and bigrams -> Unigrams are capture the frequency of individual words,Bigrams capture the relationship between two consecutive words, which can be useful for understanding the context and meaning of words in a sentence.
    ngram_range=(1,2),
)



In [27]:
#vectorizer = TfidfVectorizer()

- transform into a TF–IDF matrix

In [28]:
tfidf_matrix = vectorizer.fit_transform(cleaned_text)


In [29]:
print("TF–IDF matrix shape:", tfidf_matrix.shape)

feature_names = vectorizer.get_feature_names_out()

TF–IDF matrix shape: (19997, 111908)


## Word Embedding using Word2Vec

Convert words into dense, continuous-valued vectors that capture semantic relationships. Unlike TF–IDF, which treats each term independently, Word2Vec learns word representations so that words appearing in similar contexts end up with similar vectors.

How it works: Word2Vec slides a window over each token list and learns to predict a target word from its neighbors (or vice versa).

In [30]:
# 7. Train Word2Vec model
w2v_model = Word2Vec(
    sentences=news_tokens,
    vector_size=100,  # dimensionality
    window=5,
    min_count=5,      # ignore words with freq < 5
    workers=4,        # Number of CPU cores used
    seed=42
)
w2v_model.save("word2vec.model")

-  Build document embeddings by averaging word vectors

In [31]:
def document_vector(tokens):
    vecs = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    if not vecs:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vecs, axis=0)

In [33]:
import numpy as np

embeddings = np.vstack(news_tokens.apply(document_vector).values)
df_embeddings = pd.DataFrame(
    embeddings,
    columns=[f"dim_{i}" for i in range(w2v_model.vector_size)]
)

In [34]:
df_embeddings.to_parquet('doc_embeddings.parquet', engine='pyarrow')
print("Document embeddings saved; shape:", df_embeddings.shape)

Document embeddings saved; shape: (19997, 100)


# Train-Validation-Test split


Split the dataset into three subsets to enable effective model training, validation, and testing. This ensures unbiased evaluation and prevents overfitting.

 -  Split off test (20%)

In [35]:
X_temp, X_test, y_temp, y_test = train_test_split(
    tfidf_matrix,
    news_labels,
    test_size=0.20,       # 20% → test set
    random_state=42,
    stratify=news_labels
)

- Split the remaining 80% into train (60%) and validation (20%)

In [36]:
X_train, X_val, y_train, y_val = train_test_split(
    X_temp,
    y_temp,
    test_size=0.25,       # 25% of 80% = 20% overall for validation
    random_state=42,
    stratify=y_temp
)

In [37]:
print(f"Train  → X_train: {X_train.shape},  y_train: {y_train.shape}")
print(f"Val    → X_val:   {X_val.shape},    y_val:   {y_val.shape}")
print(f"Test   → X_test:  {X_test.shape},   y_test:  {y_test.shape}")

Train  → X_train: (11997, 111908),  y_train: (11997,)
Val    → X_val:   (4000, 111908),    y_val:   (4000,)
Test   → X_test:  (4000, 111908),   y_test:  (4000,)
