# Sport vs Politics Text Classification

## Introduction

The objective of this project is to design a machine-learning-based text
classification system that classifies a given document as either
**Sports** or **Politics**. The documents are posts from the
20 Newsgroups dataset. The project explores feature representation
with TF-IDF and compares the performance of multiple machine learning
models on the same dataset.


In [2]:
# Step 1: Load Sports and Politics data from 20 Newsgroups dataset

from sklearn.datasets import fetch_20newsgroups

# Define categories
sports_categories = [
    "rec.sport.baseball",
    "rec.sport.hockey"
]

politics_categories = [
    "talk.politics.guns",
    "talk.politics.mideast"
]

# Load datasets
sports_data = fetch_20newsgroups(
    subset="all",
    categories=sports_categories,
    remove=("headers", "footers", "quotes")
)

politics_data = fetch_20newsgroups(
    subset="all",
    categories=politics_categories,
    remove=("headers", "footers", "quotes")
)

# Check dataset sizes
print("Number of Sports documents:", len(sports_data.data))
print("Number of Politics documents:", len(politics_data.data))

# Show one example from each class
print("\nSample Sports document:\n")
print(sports_data.data[0][:500])

print("\nSample Politics document:\n")
print(politics_data.data[0][:500])


Number of Sports documents: 1993
Number of Politics documents: 1850

Sample Sports document:

Ditto...

If we allow people like him to continue to do what he does, it's a
shame.  People say that cheap shots and drawing penalties by fake-
ing is part of the game, I say "Bullsh-t!".  If he ever tried some
like that on a Yzerman, he'd would have to deal with Probert now
wouldn't he?  What Ulf does isn't even retaliatory!  There's now
way one could justify what he does and if they do they're fools.


Sample Politics document:

Re: More on Gun Buybacks

The Denver buy back, trading guns for Denver Nuggets tickets was pretty much
a bust. Very few guns were turned in. The news tried to hype it but 
when the best they could do was ".... including a loaded .38..." well,
you get the picture.

A side note- the news also reported that the guns would be checked for
whether or not they were stolen. STOLEN GUNS WILL BE RETURNED TO THEIR
OWNERS!!!!! (They say)

(Does this have anything to do with the 

In [3]:
# Step 2: Create labels and combine datasets

import pandas as pd

# Create labeled data
sports_df = pd.DataFrame({
    "text": sports_data.data,
    "label": "sports"
})

politics_df = pd.DataFrame({
    "text": politics_data.data,
    "label": "politics"
})

# Combine both classes into one dataset
dataset = pd.concat([sports_df, politics_df], ignore_index=True)

# Shuffle the dataset
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)

# Display basic dataset information
print("Total documents:", len(dataset))
print("\nClass distribution:")
print(dataset["label"].value_counts())

# Show first few rows
dataset.head()


Total documents: 3843

Class distribution:
label
sports      1993
politics    1850
Name: count, dtype: int64


|   | text | label |
|---|------|-------|
| 0 | \n\nTrue, coach Matikainen is ready to keep a ... | sports |
| 1 | \n\nYou apparently think you are some sort of ... | politics |
| 2 | I basically agree, the Tigers are my favorite ... | sports |
| 3 | Andy Beyer has claimed that the Israeli Press ... | politics |
| 4 | \n "I have no question that our plan was ... | politics |


In [4]:
# Step 3: Split dataset into training and testing sets

from sklearn.model_selection import train_test_split

X = dataset["text"]
y = dataset["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Verify split sizes
print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))

print("\nTraining class distribution:")
print(y_train.value_counts())

print("\nTesting class distribution:")
print(y_test.value_counts())


Training samples: 3074
Testing samples: 769

Training class distribution:
label
sports      1594
politics    1480
Name: count, dtype: int64

Testing class distribution:
label
sports      399
politics    370
Name: count, dtype: int64
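
The `stratify=y` argument is what keeps the sports/politics ratio nearly identical in both splits. As a rough check, the sketch below rebuilds only the label counts reported above with placeholder texts (the `labels` and `texts` series are stand-ins, not the real documents) and compares class proportions across the full set, the training split, and the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the dataset above.
labels = pd.Series(["sports"] * 1993 + ["politics"] * 1850)
# Placeholder texts; only the labels matter for this check.
texts = pd.Series([f"doc {i}" for i in range(len(labels))])

_, _, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of each class in the full set, the training split, and the test split.
proportions = pd.DataFrame({
    "full": labels.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(4))
```

Without `stratify`, the split would still be random but the class shares could drift by a few points between train and test, which makes per-class metrics harder to compare.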


In [5]:
# Step 4: Feature extraction using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1, 2),     # unigrams + bigrams
    max_df=0.9,             # ignore very common terms
    min_df=2                # ignore very rare terms
)

# Fit on training data and transform both train and test
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Check feature space size
print("Number of TF-IDF features:", X_train_tfidf.shape[1])


Number of TF-IDF features: 50615


In [6]:
# Step 5: Train Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize model
nb_model = MultinomialNB()

# Train model
nb_model.fit(X_train_tfidf, y_train)

# Predict on test data
nb_preds = nb_model.predict(X_test_tfidf)

# Evaluate
nb_accuracy = accuracy_score(y_test, nb_preds)

print("Naive Bayes Accuracy:", round(nb_accuracy, 4))
print("\nClassification Report:\n")
print(classification_report(y_test, nb_preds))


Naive Bayes Accuracy: 0.9649

Classification Report:

              precision    recall  f1-score   support

    politics       0.97      0.96      0.96       370
      sports       0.96      0.97      0.97       399

    accuracy                           0.96       769
   macro avg       0.97      0.96      0.96       769
weighted avg       0.96      0.96      0.96       769
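
The precision and recall columns in a classification report are derived from a confusion matrix. The sketch below uses a handful of made-up labels (`y_true_toy` and `y_pred_toy` are illustrative, not the notebook's predictions) to show how the politics row of such a report is computed.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for seven documents.
y_true_toy = ["politics", "politics", "politics",
              "sports", "sports", "sports", "sports"]
y_pred_toy = ["politics", "politics", "sports",
              "sports", "sports", "sports", "politics"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["politics", "sports"])
print(cm)

# Precision: of everything predicted "politics", how much was correct.
politics_precision = cm[0, 0] / cm[:, 0].sum()
# Recall: of all true "politics" documents, how many were found.
politics_recall = cm[0, 0] / cm[0, :].sum()
print("politics precision:", round(politics_precision, 3))
print("politics recall:", round(politics_recall, 3))
```

In the notebook, calling `confusion_matrix(y_test, nb_preds, labels=["politics", "sports"])` would give the corresponding counts for the Naive Bayes model.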



In [7]:
# Step 6: Train Logistic Regression classifier

from sklearn.linear_model import LogisticRegression

# Initialize model
lr_model = LogisticRegression(max_iter=1000)

# Train model
lr_model.fit(X_train_tfidf, y_train)

# Predict on test data
lr_preds = lr_model.predict(X_test_tfidf)

# Evaluate
lr_accuracy = accuracy_score(y_test, lr_preds)

print("Logistic Regression Accuracy:", round(lr_accuracy, 4))
print("\nClassification Report:\n")
print(classification_report(y_test, lr_preds))


Logistic Regression Accuracy: 0.9493

Classification Report:

              precision    recall  f1-score   support

    politics       0.96      0.93      0.95       370
      sports       0.94      0.96      0.95       399

    accuracy                           0.95       769
   macro avg       0.95      0.95      0.95       769
weighted avg       0.95      0.95      0.95       769



In [8]:
# Step 7: Train Support Vector Machine (SVM) classifier

from sklearn.svm import LinearSVC

# Initialize model
svm_model = LinearSVC()

# Train model
svm_model.fit(X_train_tfidf, y_train)

# Predict on test data
svm_preds = svm_model.predict(X_test_tfidf)

# Evaluate
svm_accuracy = accuracy_score(y_test, svm_preds)

print("SVM Accuracy:", round(svm_accuracy, 4))
print("\nClassification Report:\n")
print(classification_report(y_test, svm_preds))


SVM Accuracy: 0.961

Classification Report:

              precision    recall  f1-score   support

    politics       0.97      0.95      0.96       370
      sports       0.95      0.97      0.96       399

    accuracy                           0.96       769
   macro avg       0.96      0.96      0.96       769
weighted avg       0.96      0.96      0.96       769



In [9]:
# Step 8: Create comparison table for all models

import pandas as pd

results = pd.DataFrame({
    "Model": ["Naive Bayes", "Logistic Regression", "SVM"],
    "Accuracy": [nb_accuracy, lr_accuracy, svm_accuracy]
})

results



|   | Model | Accuracy |
|---|-------|----------|
| 0 | Naive Bayes | 0.964889 |
| 1 | Logistic Regression | 0.949285 |
| 2 | SVM | 0.960988 |


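From the comparison, Naive Bayes achieved the highest test accuracy
(96.5%), followed closely by the linear SVM (96.1%) and Logistic
Regression (94.9%). On the 769-document test set these figures
correspond to 742, 739, and 730 correct predictions, so Naive Bayes and
the SVM differ by only three documents. The results indicate that
probabilistic and margin-based linear classifiers are well suited for
high-dimensional text data represented using TF-IDF features. Because
they come from a single train/test split, the small gaps between the
models should not be read as a firm ranking.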
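
One way to check whether such small accuracy differences hold up is k-fold cross-validation with the vectorizer placed inside a pipeline, so that TF-IDF is refit on each training fold. The sketch below is an illustration on a made-up corpus (`toy_docs`, `toy_labels`, and the three-fold setting are assumptions for the example); in the notebook, `dataset["text"]` and `dataset["label"]` would take their place, likely with more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus with six documents per class.
toy_docs = [
    "the hockey team won the game in overtime",
    "the pitcher threw a perfect baseball game",
    "hockey fans cheered the winning goal",
    "the baseball team traded its best pitcher",
    "the goalie stopped every shot in the hockey game",
    "a home run won the baseball game",
    "the senate passed the gun control bill",
    "peace talks resumed between the two governments",
    "the government debated the new gun law",
    "senate leaders rejected the peace proposal",
    "the gun bill divided the senate",
    "the government signed the peace treaty",
]
toy_labels = ["sports"] * 6 + ["politics"] * 6

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_means = {}
for name, model in models.items():
    # The vectorizer is part of the pipeline, so each fold fits its own vocabulary.
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), model)
    scores = cross_val_score(pipe, toy_docs, toy_labels, cv=cv, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and standard deviation across folds gives a sense of how much of a gap between models is just split-to-split variation.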
