## Importing libraries

In [3]:
# Import all libraries for the project
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

print("All necessary libraries have been imported.")

All necessary libraries have been imported.


## Loading the dataset


In [4]:
# Load the full dataset (train and test subsets combined), shuffled with a fixed seed

newsgroups = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

print("--- ✅ SUCCESS! ---")
print("The 20 Newsgroups dataset has been loaded.")

--- ✅ SUCCESS! ---
The 20 Newsgroups dataset has been loaded.


## Exploring the loaded dataset

In [5]:
# Explore the loaded data
# The 'target_names' attribute gives us the list of all category names
print("The categories are:")
print(newsgroups.target_names)

# The 'data' attribute is a list of all the articles. Let's look at the first one.
print("\n--- Sample Article ---")
print(newsgroups.data[0])

The categories are:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

--- Sample Article ---
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
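
The sample above shows that each raw article starts with header lines (`From:`, `Subject:`, and so on), followed by a blank line and then the body. A minimal sketch of separating the two is given below. It assumes the header block always ends at the first blank line, as it does in this sample; `split_post` is a hypothetical helper, not part of scikit-learn, and the sample string is a shortened copy of the article printed above.

```python
def split_post(raw_post):
    """Split a raw post into (headers dict, body text) at the first blank line."""
    header_block, _, body = raw_post.partition("\n\n")
    headers = {}
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip any line that is not in "Name: value" form
            headers[name.strip()] = value.strip()
    return headers, body.strip()


# Shortened copy of the sample article (assumption: same layout as the real data)
sample_post = (
    "From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\n"
    "Subject: Pens fans reactions\n"
    "Lines: 12\n"
    "\n"
    "I am sure some bashers of Pens fans are pretty confused\n"
)

headers, body = split_post(sample_post)
print(headers["Subject"])
print(body)
```

Looking at headers separately is useful because fields such as `Subject` may hint at the category. `fetch_20newsgroups` also accepts a `remove` argument, for example `remove=('headers', 'footers', 'quotes')`, for stripping this metadata at load time.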


## Preparing features and splitting the data

In [6]:
# Assign features and target, then split the data
# X is the list of articles (our features)
X = newsgroups.data
# y is the list of categories (our target)
y = newsgroups.target

# Split our data into a training set (75%) and a testing set (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("Data has been prepared and split into training and testing sets.")

Data has been prepared and split into training and testing sets.
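
The split above samples articles at random, so the category proportions in the two sets can differ slightly. One optional variation is a stratified split, which keeps the proportions equal. The sketch below is not what this notebook does; it uses hypothetical toy data (`docs`, `labels`) so it runs without downloading anything, and only adds the `stratify` argument of `train_test_split`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the articles and their labels (hypothetical data)
docs = [f"document {i}" for i in range(40)]
labels = [0] * 20 + [1] * 12 + [2] * 8

# stratify=labels keeps each class's share the same in both sets
docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42, stratify=labels
)

train_counts = Counter(labels_train)
test_counts = Counter(labels_test)
print("train:", sorted(train_counts.items()))
print("test: ", sorted(test_counts.items()))
```

With 40 toy documents and `test_size=0.25`, the test set holds 10 documents split 5, 3, and 2 across the three classes, matching the 50%, 30%, and 20% shares in the full list.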


## Building and Training the Machine Learning Model

In [7]:
# Create and train the model pipeline
# Our model is a pipeline that first vectorizes the text and then applies Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# We train the model by calling .fit() with our training data
print("Training the model... (This can take a moment)")
model.fit(X_train, y_train)
print("Model training is complete!")

Training the model... (This can take a moment)
Model training is complete!
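
To make the two pipeline stages concrete, here is a small sketch on a hypothetical four-sentence corpus (`toy_docs`, `toy_labels`), not the newsgroups data. It shows that the vectorizer turns text into a document-by-vocabulary TF-IDF matrix and that the classifier then predicts from that matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two categories
toy_docs = [
    "the goalie stopped every shot in the hockey game",
    "the team won the hockey playoff game",
    "the rocket reached orbit after launch",
    "the satellite launch put the probe in orbit",
]
toy_labels = ["hockey", "hockey", "space", "space"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_docs, toy_labels)

# make_pipeline names each step after its lowercased class name
vectorizer = toy_model.named_steps["tfidfvectorizer"]
tfidf_matrix = vectorizer.transform(toy_docs)
print(tfidf_matrix.shape)  # (number of documents, vocabulary size)

# The pipeline vectorizes new text with the fitted vocabulary, then classifies it
toy_prediction = toy_model.predict(["a hockey game tonight"])
print(toy_prediction[0])
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned only from the data passed to `.fit()`, so nothing from the test set influences the features.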


## Evaluating Model Performance

In [8]:
# Make predictions and evaluate the model
# Use the trained model to make predictions on the test set
predictions = model.predict(X_test)

# Compare the model's predictions to the actual answers to get the accuracy
accuracy = accuracy_score(y_test, predictions)

print("\n--- ✅ Final Result ---")
print(f"The model's accuracy is: {accuracy * 100:.2f}%")


--- ✅ Final Result ---
The model's accuracy is: 84.25%
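
A single accuracy figure does not show which categories the model confuses. A possible next step is a per-class breakdown with `confusion_matrix` and `classification_report` from `sklearn.metrics`. The sketch below uses hypothetical label lists and category names so that it runs on its own; in the notebook, `y_test`, `predictions`, and `newsgroups.target_names` would take their place.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels for three categories
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 1, 1, 1, 2, 0, 2]
category_names = ["hockey", "space", "med"]  # illustrative names only

toy_accuracy = accuracy_score(true_labels, predicted_labels)

# Rows are true classes, columns are predicted classes
toy_confusion = confusion_matrix(true_labels, predicted_labels)

# Precision, recall, and F1 for each category
toy_report = classification_report(
    true_labels, predicted_labels, target_names=category_names
)

print(f"Accuracy: {toy_accuracy:.2f}")
print(toy_confusion)
print(toy_report)
```

Off-diagonal entries in the confusion matrix count the errors: here one `hockey` item was predicted as `space` and one `med` item as `hockey`.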
