### The 20 Newsgroups Dataset
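The 20 Newsgroups dataset is a classic, widely used collection of documents for text classification and clustering. The original collection consists of approximately 20,000 documents, partitioned nearly evenly across 20 different newsgroups; the scikit-learn version loaded below contains 18,846 posts (train and test subsets combined).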

### What it Contains
The dataset contains posts collected from 20 different Usenet newsgroups. These groups cover a wide range of topics, including:

__Computers:__ comp.graphics, comp.sys.mac.hardware

__Recreation:__ rec.sport.baseball, rec.autos

__Science:__ sci.space, sci.med

__Religion:__ talk.religion.misc, alt.atheism

__Politics:__ talk.politics.guns, talk.politics.mideast

### Format and Structure
The dataset is in __plain text__ format. Each document is a single newsgroup post. The raw data includes headers, footers, and quoted replies, but scikit-learn can strip these on loading through the `remove` argument of `fetch_20newsgroups`, as the loading cell below does.

The data is structured as a collection of text documents and their corresponding target labels (the newsgroup they belong to). This makes it ideal for supervised learning, where a model learns to predict the correct category based on the document's content.
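
As a rough illustration of that structure, the sketch below uses a tiny hand-made stand-in for the object returned by `fetch_20newsgroups`. The three posts and their labels are invented so the example runs offline; only the field names `data`, `target`, and `target_names` mirror scikit-learn. It shows how each integer target indexes into the list of newsgroup names.

```python
from collections import Counter

# Hypothetical stand-in for the loaded dataset: same field names as the
# scikit-learn result, but with invented contents.
toy = {
    "data": [
        "The shuttle launch was delayed.",
        "Pitching wins pennants.",
        "New orbit data released.",
    ],
    "target": [1, 0, 1],
    "target_names": ["rec.sport.baseball", "sci.space"],
}

# Each target value is an index into target_names.
labeled = [
    (toy["target_names"][t], text)
    for text, t in zip(toy["data"], toy["target"])
]
for name, text in labeled:
    print(f"{name}: {text}")

# Count documents per newsgroup to check how balanced the classes are.
counts = Counter(name for name, _ in labeled)
print(counts)
```

With the real data, `newsgroups_data.target` and `newsgroups_data.target_names` play the same roles as the toy fields here.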

### How it was Collected
The dataset was originally collected by Ken Lang from Usenet newsgroup posts, probably for his 1995 paper "NewsWeeder: Learning to Filter Netnews". It was one of the earlier publicly available datasets for text classification, and its clear division into distinct categories made it a popular benchmark for evaluating machine learning algorithms. Its consistent structure and thorough documentation have kept it a staple of natural language processing (NLP) research and education for decades.

In [1]:
import os
import pickle

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Load train + test splits; strip headers, footers, and quoted replies
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups_data.data

In [6]:
print(len(documents))  # documents is a list of strings, one per post

18846


In [7]:
print(documents[0])



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [8]:
# TF-IDF with English stop-word removal; ignore terms that appear in more than
# 95% of documents (max_df) or in fewer than 2 documents (min_df)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2)
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")

Shape of TF-IDF matrix: (18846, 51840)
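
To make the `min_df` and `max_df` thresholds concrete, here is a minimal pure-Python sketch of document-frequency pruning on an invented four-document corpus. It is not scikit-learn's implementation: the regular expression only approximates the vectorizer's default tokenization, stop-word removal is left out, and no TF-IDF weights are computed. It only shows which terms would survive the two thresholds.

```python
import re
from collections import Counter

# Invented toy corpus (not taken from the dataset).
docs = [
    "the pens beat the devils",
    "the devils lost the game",
    "the shuttle reached orbit",
    "the orbit of the moon",
]

min_df = 2     # keep terms found in at least 2 documents
max_df = 0.95  # drop terms found in more than 95% of documents

# Tokens of two or more word characters, lowercased (an approximation).
token_pattern = re.compile(r"\b\w\w+\b")

# Document frequency: in how many documents does each term occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(token_pattern.findall(doc.lower())))

max_count = max_df * len(docs)
vocabulary = sorted(
    term for term, df in doc_freq.items() if min_df <= df <= max_count
)
print(vocabulary)  # ['devils', 'orbit']
```

Here "the" is dropped because it occurs in all four documents, and every term that occurs in a single document falls below `min_df`. Note that both thresholds count documents, not total occurrences.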


Save files for reuse

In [13]:
data_dir = "D:/Data_and_AI/Projects/End-to-end/news_search_engine/data"

with open(os.path.join(data_dir, 'documents.pkl'), 'wb') as f:
    pickle.dump(documents, f)
    
with open(os.path.join(data_dir, "tfidf_matrix.pkl"), 'wb') as f:
    pickle.dump(tfidf_matrix, f)
    
with open(os.path.join(data_dir, "tfidf_vectorizer.pkl"), 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
print("Data saved successfully")

Data saved successfully
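
For later reuse, the saved objects can be read back with `pickle.load`. The sketch below shows the round trip in a temporary directory with a small stand-in list rather than the real artifacts; the `save_pickle` and `load_pickle` helpers and the temporary path are illustrative and not part of the notebook.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path with pickle (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read a pickled object back from path (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the `documents` list; the TF-IDF matrix and the fitted
# vectorizer could be stored and restored the same way.
sample_documents = ["first post", "second post"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "documents.pkl")
    save_pickle(sample_documents, path)
    restored = load_pickle(path)

print(restored == sample_documents)
```

Pickle files should only be loaded from sources you trust, and pickled scikit-learn objects are generally safest to load with the same library version that wrote them.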
