# Flat Classifier: Logistic Regression with TF-IDF

This notebook demonstrates a simple **flat classification approach**: the hierarchy implied by the newsgroup names is ignored, so a label such as `rec.sport.hockey` is treated as one of 20 unrelated classes. We use **TF-IDF vectorization** followed by a **Logistic Regression** classifier.

In [8]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Load the 20 Newsgroups dataset with headers, footers, and quotes stripped
categories = None  # Use all categories; or specify a subset like ['rec.sport.baseball', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))


In [16]:
# Check dataset size
print(f"Number of training samples: {len(newsgroups_train.data)}")
print(f"Number of test samples: {len(newsgroups_test.data)}")

Number of training samples: 11314
Number of test samples: 7532


In [17]:
# Check number of categories
print(f"Number of categories: {len(newsgroups_train.target_names)}")
print("Categories:", newsgroups_train.target_names)

Number of categories: 20
Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [18]:
# Show some examples
print("\n--- Sample training text (index 0) ---")
print(newsgroups_train.data[0][:500], "...")  # first 500 chars

print("\nCorresponding label (category):")
print(newsgroups_train.target_names[newsgroups_train.target[0]])


--- Sample training text (index 0) ---
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail. ...

Corresponding label (category):
rec.autos


## 🔹 TF-IDF Vectorization

We convert the text into numerical features using `TfidfVectorizer`. Here we remove English stop words and cap the vocabulary at the 10,000 most frequent terms; other options, such as n-grams, can be tuned to improve performance.

In [9]:
# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=10_000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(newsgroups_train.data)
X_test_tfidf = vectorizer.transform(newsgroups_test.data)

## 🔹 Logistic Regression Classifier

We train a multiclass `LogisticRegression` model and evaluate it on the test set with overall accuracy and a per-class classification report.
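
As a rough sketch of what these metrics measure, the snippet below computes accuracy and per-class precision, recall, and F1 by hand on six toy labels. The labels and the `toy_` names are invented for illustration and are not notebook results; the actual evaluation relies on `accuracy_score` and `classification_report`.

```python
from collections import Counter

# Toy labels, invented for illustration only (not notebook output).
toy_true = ["autos", "autos", "autos", "med", "med", "space"]
toy_pred = ["autos", "autos", "med", "med", "space", "space"]

# Accuracy: fraction of samples whose predicted label matches the true label.
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

true_pos = Counter(t for t, p in zip(toy_true, toy_pred) if t == p)
predicted = Counter(toy_pred)  # how often each label was predicted
support = Counter(toy_true)    # how often each label actually occurs

toy_report = {}
for label in sorted(support):
    # Precision: of the samples predicted as `label`, how many were correct.
    precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
    # Recall: of the samples that truly are `label`, how many were found.
    recall = true_pos[label] / support[label]
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    toy_report[label] = {"precision": precision, "recall": recall, "f1": f1}

print(f"Accuracy: {toy_accuracy:.4f}")
for label, scores in toy_report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

A class can show low precision together with high recall when the model predicts it too often, so the per-class rows carry information that the single accuracy figure hides.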

In [10]:
# Train Logistic Regression classifier
clf = LogisticRegression(max_iter=1000, n_jobs=-1, verbose=1)
clf.fit(X_train_tfidf, newsgroups_train.target)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


In [11]:
# Predict on test set
y_pred = clf.predict(X_test_tfidf)
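
The classifier only accepts text that has passed through the same fitted vectorizer. One way to keep the two steps together is scikit-learn's `make_pipeline`, sketched below on a tiny toy dataset. The texts, labels, and `toy_` names are invented for illustration, and this notebook does not use a pipeline itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
toy_texts = [
    "the car engine needs new oil",
    "which tires fit this old car",
    "the doctor prescribed a new medicine",
    "symptoms of the disease include fever",
]
toy_labels = ["rec.autos", "rec.autos", "sci.med", "sci.med"]

# Chain vectorizer and classifier so raw strings can be passed to fit/predict.
toy_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
toy_model.fit(toy_texts, toy_labels)

# Predict directly on raw text; the pipeline applies the fitted TF-IDF transform first.
toy_query = ["my car engine stalls"]
toy_prediction = toy_model.predict(toy_query)
toy_probabilities = toy_model.predict_proba(toy_query)
print("Predicted label:", toy_prediction[0])
print("Class order:", list(toy_model.classes_))
print("Class probabilities:", toy_probabilities[0])
```

With a pipeline, the vectorizer cannot accidentally be refit on test data, because `predict` only ever calls `transform` on the already fitted step.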

In [12]:
# Evaluate
accuracy = accuracy_score(newsgroups_test.target, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("Classification report:")
print(classification_report(newsgroups_test.target, y_pred, target_names=newsgroups_test.target_names))

Accuracy: 0.6694
Classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.47      0.44      0.46       319
           comp.graphics       0.63      0.70      0.66       389
 comp.os.ms-windows.misc       0.64      0.61      0.62       394
comp.sys.ibm.pc.hardware       0.64      0.61      0.62       392
   comp.sys.mac.hardware       0.73      0.65      0.69       385
          comp.windows.x       0.81      0.71      0.76       395
            misc.forsale       0.75      0.78      0.76       390
               rec.autos       0.70      0.69      0.70       396
         rec.motorcycles       0.47      0.76      0.58       398
      rec.sport.baseball       0.78      0.78      0.78       397
        rec.sport.hockey       0.88      0.86      0.87       399
               sci.crypt       0.87      0.67      0.76       396
         sci.electronics       0.54      0.58      0.56       393
                 sci.med       0.74