## 20 Newsgroups Dataset Overview

The 20 Newsgroups dataset is a widely used benchmark for text classification and text mining. It consists of Usenet newsgroup posts drawn from 20 topic-specific groups, which makes it a convenient corpus for natural language processing (NLP) research and machine learning (ML) experiments.

### Key Features

- **Topics**: The 20 newsgroups span a diverse range of subjects, which fall into six broad groups:
  - **Computing** (`comp.*`): graphics, Microsoft Windows, IBM PC and Mac hardware, and the X Window System.
  - **Recreation** (`rec.*`): cars, motorcycles, baseball, and hockey.
  - **Science** (`sci.*`): cryptography, electronics, medicine, and space.
  - **Politics** (`talk.politics.*`): guns, the Middle East, and miscellaneous politics.
  - **Religion** (`alt.atheism`, `soc.religion.christian`, `talk.religion.misc`): atheism, Christianity, and religion in general.
  - **Marketplace** (`misc.forsale`): for-sale listings.

- **Content**: It contains close to 20,000 posts spread roughly evenly across the 20 newsgroups. The training subset loaded below holds 11,314 of them.

- **Format**: Each entry is a raw text post that contains the message body and, by default, its headers, footer (signature), and quoted replies.
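
Headers, footers, and quoted replies often give away the newsgroup, which is why scikit-learn's `fetch_20newsgroups` accepts a `remove` parameter (for example `remove=('headers', 'footers', 'quotes')`) to strip them. The sketch below illustrates only the header-stripping idea on a hand-written sample, assuming the headers end at the first blank line. `strip_headers` and `sample_post` are hypothetical names for illustration, not part of any library.

```python
def strip_headers(post: str) -> str:
    """Return the text after the first blank line (assumed end of the headers)."""
    _headers, _blank, body = post.partition("\n\n")
    return body


# Hand-written sample in the style of a newsgroup post (illustrative only).
sample_post = (
    "From: someone@example.com\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 2\n"
    "\n"
    "I was wondering if anyone could identify this car.\n"
    "Thanks!\n"
)

body = strip_headers(sample_post)
print(body)
```

Note that a post without any blank line would come back empty under this assumption, so real data may need a more careful rule.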

In [1]:
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer # pip install -U sentence-transformers

In [2]:
train = fetch_20newsgroups(subset='train')

In [3]:
train.target

array([7, 4, 4, ..., 3, 1, 8])

In [4]:
train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
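
The newsgroup names form a dotted hierarchy, so grouping them by their top-level prefix gives a quick view of the topic families. The following is a minimal sketch that hard-codes the names printed above so it runs without downloading the dataset; the variable names are illustrative.

```python
from collections import defaultdict

# Names copied from the train.target_names output above.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Group each newsgroup under the part of its name before the first dot.
groups = defaultdict(list)
for name in target_names:
    top_level = name.split(".", 1)[0]
    groups[top_level].append(name)

for top_level, names in sorted(groups.items()):
    print(f"{top_level:5s} {len(names):2d}  {', '.join(names)}")
```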

In [5]:
print(train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [6]:
model = SentenceTransformer("all-MiniLM-L6-v2") 

In [7]:
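# Each post is treated as one "sentence" to encode
sentences = train.data

# model.encode() returns one fixed-length vector per input text;
# text beyond the model's maximum sequence length is truncated
embeddings = model.encode(sentences)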

In [8]:
embeddings[0].shape

(384,)

In [9]:
len(embeddings)

11314

In [10]:
df = pd.DataFrame(embeddings)

In [11]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.073384 | 0.144642 | 0.043866 | -0.008487 | 0.010976 | 0.00465 | -0.092037 | 0.056421 | -0.117753 | -0.004202 | ... | -0.065488 | -0.052008 | -0.062907 | 0.004635 | 0.022742 | -0.064751 | 0.104309 | -0.024578 | -0.006717 | 0.132525 |
| 1 | 0.010949 | 0.03887 | 0.048669 | 0.013852 | 0.006879 | -0.030265 | -0.027496 | 0.062196 | -0.020248 | -0.073615 | ... | -0.010721 | 0.015176 | -0.073615 | 0.076515 | 0.015066 | 0.117065 | 0.029308 | -0.011192 | -0.052652 | -0.000681 |
| 2 | -0.072989 | -0.018359 | 0.014476 | 0.039985 | 0.038344 | -0.004665 | -0.105183 | 0.053926 | 0.000405 | -0.000927 | ... | -0.035562 | 0.035174 | -0.084836 | 0.048907 | -0.061031 | 0.035921 | 0.042243 | -0.125008 | -0.101583 | 0.016186 |
| 3 | -0.107436 | 0.012275 | -0.032905 | 0.003114 | -0.017748 | -0.009349 | 0.010851 | 0.091343 | -0.082454 | -0.130297 | ... | -0.114026 | 0.030369 | 0.012861 | 0.007753 | -0.013295 | 0.056767 | -0.016463 | -0.047806 | -0.058419 | -0.0184 |
| 4 | -0.027212 | -0.032349 | 0.024046 | 0.117962 | 0.055256 | -0.044903 | 0.065358 | 0.071563 | -0.045105 | 0.076455 | ... | -0.03533 | 0.037122 | -0.013958 | -0.03862 | -0.012475 | -0.006234 | 0.099191 | 0.039579 | -0.044369 | -0.014249 |


In [12]:
df['label'] = train.target

In [13]:
label_dict = dict(enumerate(train.target_names))

In [14]:
label_dict

{0: 'alt.atheism',
 1: 'comp.graphics',
 2: 'comp.os.ms-windows.misc',
 3: 'comp.sys.ibm.pc.hardware',
 4: 'comp.sys.mac.hardware',
 5: 'comp.windows.x',
 6: 'misc.forsale',
 7: 'rec.autos',
 8: 'rec.motorcycles',
 9: 'rec.sport.baseball',
 10: 'rec.sport.hockey',
 11: 'sci.crypt',
 12: 'sci.electronics',
 13: 'sci.med',
 14: 'sci.space',
 15: 'soc.religion.christian',
 16: 'talk.politics.guns',
 17: 'talk.politics.mideast',
 18: 'talk.politics.misc',
 19: 'talk.religion.misc'}

In [15]:
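# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions
df['label'] = df['label'].map(label_dict)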

In [16]:
df.sample(3)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2538 | -0.034843 | -0.058041 | -0.066117 | 0.008092 | -0.003159 | -0.003017 | -0.027399 | -0.035066 | 0.027299 | 0.037996 | ... | 0.00664 | -0.013894 | -0.052924 | 0.006114 | 0.00653 | 0.127832 | -0.015681 | 0.01131 | 0.06171 | talk.politics.mideast |
| 7845 | 0.016892 | -0.10639 | 0.001552 | -0.100962 | -0.095707 | 0.008934 | -0.119062 | 0.02675 | -0.029804 | 0.005235 | ... | 0.049609 | -0.007812 | -0.05395 | 0.031513 | -0.003336 | 0.033478 | 0.062172 | -0.045588 | 0.025845 | comp.sys.mac.hardware |
| 1955 | -0.069299 | -0.045954 | 0.083345 | -0.005201 | -0.011273 | -0.04489 | -0.046241 | -0.014178 | 0.00107 | 0.014388 | ... | -0.08285 | -0.00143 | -0.041279 | -0.049106 | 0.07468 | -0.096011 | -0.037772 | 0.010779 | 0.026012 | sci.crypt |


In [17]:
df.label.value_counts()

label
rec.sport.hockey            600
soc.religion.christian      599
rec.motorcycles             598
rec.sport.baseball          597
sci.crypt                   595
rec.autos                   594
sci.med                     594
comp.windows.x              593
sci.space                   593
comp.os.ms-windows.misc     591
sci.electronics             591
comp.sys.ibm.pc.hardware    590
misc.forsale                585
comp.graphics               584
comp.sys.mac.hardware       578
talk.politics.mideast       564
talk.politics.guns          546
alt.atheism                 480
talk.politics.misc          465
talk.religion.misc          377
Name: count, dtype: int64
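
The counts above show that the classes are only roughly balanced. As a quick check, the sketch below recomputes the total and the largest-to-smallest class ratio from the printed counts (hard-coded here so the snippet runs on its own; the variable names are illustrative). The two classes selected in the next step, `rec.sport.baseball` and `sci.space`, are nearly equal in size.

```python
# Counts copied from the value_counts() output above.
counts = {
    'rec.sport.hockey': 600, 'soc.religion.christian': 599,
    'rec.motorcycles': 598, 'rec.sport.baseball': 597, 'sci.crypt': 595,
    'rec.autos': 594, 'sci.med': 594, 'comp.windows.x': 593,
    'sci.space': 593, 'comp.os.ms-windows.misc': 591,
    'sci.electronics': 591, 'comp.sys.ibm.pc.hardware': 590,
    'misc.forsale': 585, 'comp.graphics': 584,
    'comp.sys.mac.hardware': 578, 'talk.politics.mideast': 564,
    'talk.politics.guns': 546, 'alt.atheism': 480,
    'talk.politics.misc': 465, 'talk.religion.misc': 377,
}

total = sum(counts.values())
largest = max(counts, key=counts.get)
smallest = min(counts, key=counts.get)
imbalance_ratio = counts[largest] / counts[smallest]

# Size of the binary subset used in the next step.
binary_total = counts['rec.sport.baseball'] + counts['sci.space']

print(f"total posts: {total}")  # 11314, matching len(embeddings)
print(f"largest: {largest}, smallest: {smallest}")
print(f"largest/smallest ratio: {imbalance_ratio:.2f}")
print(f"baseball + space: {binary_total}")
```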

In [18]:
df = df[df['label'].isin(['rec.sport.baseball', 'sci.space'])]

In [19]:
X = df.drop(columns='label')
y = (df['label'] == 'rec.sport.baseball').astype(int)  # 1 = baseball, 0 = space

In [20]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

In [21]:
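# No feature scaling is needed here: tree-based models such as XGBoost split on
# thresholds, so they are insensitive to the scale of each feature.
# If unsure, inspect the value ranges with df.describe().
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted.
clf = Pipeline(steps=[
    ('classifier', xgb.XGBClassifier(eval_metric='logloss'))
])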

# Train the XGBoost model
clf.fit(X_train, y_train)

# Predict on the testing set
y_pred = clf.predict(X_test)

# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Accuracy: 0.9831932773109243
Precision: 0.9915254237288136
Recall: 0.975
F1 Score: 0.9831932773109243
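
The four scores can be reproduced by hand from confusion-matrix counts. The notebook computes `conf_matrix` but never prints it, so the counts below are an assumption: they are inferred from the reported scores and the 238-row test split (20% of the 1,190 selected posts), with baseball as the positive class.

```python
# Inferred counts (not printed by the notebook): positive class = rec.sport.baseball.
tp, fp, fn, tn = 117, 1, 3, 117

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

Under these counts the model misclassifies 4 of the 238 test posts: one space post labeled as baseball and three baseball posts labeled as space.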
