# News Group Classification 

In this project, we aim to build a **text classification model** that can automatically categorize news articles into their respective topics. This involves applying **Natural Language Processing (NLP)** techniques and training a machine learning model on labeled news data.


# Importing Libraries

In [117]:
import pandas as pd
import contractions
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Dataset Overview

Our dataset consists of news articles with the following columns:

- **category**: The target label indicating the topic of the article (e.g., sports, tech, politics, etc.).
- **filename**: The file name or path associated with each article.
- **content**: The full text of the news article, which will serve as our main input for training the classification model.

We'll use the **`content`** column as the input feature for NLP processing, and the **`category`** column as the target for model training.


In [118]:
data = pd.read_parquet("news_data.parquet")

In [119]:
print(data.head())

      category filename                                            content
0  alt.atheism    49960  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...
1  alt.atheism    51060  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
2  alt.atheism    51119  Newsgroups: alt.atheism\nPath: cantaloupe.srv....
3  alt.atheism    51120  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
4  alt.atheism    51121  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...


In [120]:
print(data.shape)               # Rows and columns

(19997, 3)


In [121]:
print(data['category'].value_counts())  # How many articles per category

category
alt.atheism                 1000
comp.graphics               1000
comp.os.ms-windows.misc     1000
comp.sys.ibm.pc.hardware    1000
comp.sys.mac.hardware       1000
comp.windows.x              1000
misc.forsale                1000
rec.autos                   1000
rec.motorcycles             1000
rec.sport.baseball          1000
rec.sport.hockey            1000
sci.crypt                   1000
sci.electronics             1000
sci.med                     1000
sci.space                   1000
talk.politics.guns          1000
talk.politics.misc          1000
talk.politics.mideast       1000
talk.religion.misc          1000
soc.religion.christian       997
Name: count, dtype: int64


In [122]:
print(data['content'][0])       # View a sample article

Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
From: mathew <mathew@mantis.co.uk>
Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
Subject: Alt.Atheism FAQ: Atheist Resources
Summary: Books, addresses, music -- anything related to atheism
Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
Message-ID: <19930329115719@mantis.co.uk>
Date: Mon, 29 Mar 1993 11:57:19 GMT
Expires: Thu, 29 Apr 1993 11:57:19 GMT
Followup-To: alt.atheism
Distribution: world
Organization: Mantis Consultants, Cambridge. UK.
Approved: news-answers-request@mit.edu
Supersedes: <19930301143317@mantis.co.uk>
Lines: 290

Archive-name: atheism/resources
Alt-atheism-archive-name: resources
Last-modified: 11 December

In [123]:
news_content = data['content']
news_labels = data['category']

# Preprocessing

## Clean Raw Input Data

To ensure our text data is ready for machine learning, we apply the following preprocessing steps:

- **Expand Contractions**  
  Convert common contractions to their full form for consistency.  
  _Examples:_
  - `"don't"` → `"do not"`  
  - `"it's"` → `"it is"`

- **Lowercase the Text**  
  Normalize all text to lowercase to reduce vocabulary size and avoid case-sensitive duplicates.

- **Remove Metadata**  
  Strip away unnecessary headers, footers, and email signatures that do not contribute to the actual content.

- **️Remove Numbers and Punctuation**  
  These elements often add noise and are usually not meaningful in text classification tasks.

- **Remove Extra Whitespace**  
  Clean up unnecessary spaces, tabs, and newline characters to maintain uniformity in the text.


In [124]:
news_content = news_content.apply(lambda x: contractions.fix(x)) # expand contractions
news_content  = news_content.str.lower() 

In [125]:
def clean_article(text):
    # Remove headers and footers (common email/news metadata)
    text = re.sub(r"(?s)^.*?Lines: \d+\s+", "", text)
    text = re.sub(r"(?s)^.*?(?=Archive-name:)", "", text)
    text = re.sub(r"(?s)^\s*From:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Subject:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Path:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Newsgroups:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Message-ID:.*?\n", "", text)
    text = re.sub(r"(?s)^.*?Organization:.*?\n", "", text)

    # Remove email signatures (e.g., lines starting with '--')
    text = re.sub(r"--\s*\n.*", "", text, flags=re.DOTALL)

    # Remove numbers and punctuation
    text = re.sub(r"[^a-zA-Z\s]", " ", text)

    # Remove extra whitespace (tabs, newlines, multiple spaces)
    text = re.sub(r"\s+", " ", text).strip()

    return text


In [126]:
news_content = news_content.apply(clean_article)

In [None]:
#news_content[0] # sample output to ensure cleaning was applied


'xref cantaloupe srv cs cmu edu alt atheism alt atheism moderated news answers alt answers path cantaloupe srv cs cmu edu crabapple srv cs cmu edu bb andrew cmu edu news sei cmu edu cis ohio state edu magnus acs ohio state edu usenet ins cwru edu agate spool mu edu uunet pipex ibmpcug mantis mathew from mathew mathew mantis co uk newsgroups alt atheism alt atheism moderated news answers alt answers subject alt atheism faq atheist resources summary books addresses music anything related to atheism keywords faq atheism books music fiction addresses contacts message id mantis co uk date mon mar gmt expires thu apr gmt followup to alt atheism distribution world organization mantis consultants cambridge uk approved news answers request mit edu supersedes mantis co uk lines archive name atheism resources alt atheism archive name resources last modified december version atheist resources addresses of atheist organizations usa freedom from religion foundation darwin fish bumper stickers and as

# Tokenization and stopwords removal

- **Tokenize Text**    
  Break each cleaned text into individual words (tokens) using NLTK's word_tokenize. This enables more granular analysis and further NLP processing.
  
  _Examples:_
  - `"the quick brown fox"` → `["the", "quick", "brown", "fox"]`

- **Remove Stopwords**
  Eliminate common English stopwords (e.g., "the", "is", "and") using NLTK’s predefined list. These words typically carry less semantic meaning and can introduce noise in text classification tasks.

  _Examples:_
  - `["the", "quick", "brown", "fox"]` → `["quick", "brown", "fox"]`

In [128]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\CITY-
[nltk_data]     LAP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\CITY-
[nltk_data]     LAP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\CITY-
[nltk_data]     LAP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [129]:
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

# Apply to the series
news_tokens = news_content.apply(tokenize_and_remove_stopwords)
