# Practice 3

Use the 20 Newsgroups data set, available at http://qwone.com/~jason/20Newsgroups/.

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

1. Read two categories from the data set using sklearn.datasets.load_files. You can start with ‘comp.graphics’ and ‘sci.med’.
2. Use scikit-learn's sklearn.feature_extraction.text.CountVectorizer to convert the text content into numerical feature vectors.
3. Use scikit-learn's sklearn.feature_extraction.text.TfidfTransformer to compute the TF-IDF weights:
   - Term Frequency (TF) = (number of times term t appears in a document) / (number of terms in the document)
   - Inverse Document Frequency (IDF) = log(N/n), where N is the number of documents and n is the number of documents in which term t appears. The IDF of a rare word is high, whereas the IDF of a frequent word is low, so the weighting highlights words that are distinctive.
   - TF-IDF = TF × IDF
4. Use scikit-learn to build a basic KNN (k-nearest neighbors) classifier for this data set.
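
The TF, IDF and TF-IDF definitions in step 3 can be checked by hand on a tiny example. The sketch below is plain Python and follows the formulas exactly as written above. The three-document corpus and the helper names `tf`, `idf` and `tf_idf` are made up for illustration and are not part of the data set or of scikit-learn.

```python
import math

# Hypothetical toy corpus, already split into tokens (illustration only).
docs = [
    ["image", "pixel", "image", "render"],
    ["patient", "doctor", "image"],
    ["doctor", "treatment", "patient", "patient"],
]

def tf(term, doc):
    # (number of times the term appears in the document) / (number of terms in the document)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(N / n); assumes the term appears in at least one document, otherwise n would be 0
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "image" appears twice in the first document but also occurs in a second document;
# "pixel" appears once and nowhere else.
print(tf("image", docs[0]), idf("image", docs), tf_idf("image", docs[0], docs))
print(tf("pixel", docs[0]), idf("pixel", docs), tf_idf("pixel", docs[0], docs))
```

Although "image" is the more frequent term in the first document, "pixel" ends up with the higher TF-IDF score because it occurs in only one document.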

In [1]:
from sklearn.datasets import load_files

# Choose comp.graphics and sci.med
categories = ['comp.graphics', 'sci.med']

train_path = './20news-bydate-train'
test_path = './20news-bydate-test'

train_data = load_files(train_path, categories=categories, encoding='latin1', decode_error='ignore')
test_data = load_files(test_path, categories=categories, encoding='latin1', decode_error='ignore')

print(f"Train docs: {len(train_data.data)}, Test docs: {len(test_data.data)}")

Train docs: 1178, Test docs: 785
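
load_files expects one sub-folder per category, with one text file per document inside it, which is how the 20news-bydate folders are organised. As a rough sketch of that convention, the snippet below builds a throwaway directory with two categories and loads it. The file names and document texts are invented for the example.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical miniature corpus: one folder per category, one file per document.
samples = {
    "comp.graphics": "How do I render this image faster?",
    "sci.med": "Which treatment did the doctor recommend?",
}

with tempfile.TemporaryDirectory() as root:
    for category, text in samples.items():
        folder = Path(root) / category
        folder.mkdir()
        (folder / "0001").write_text(text, encoding="latin1")
    # shuffle=False keeps the documents in folder order for this demonstration
    toy = load_files(root, encoding="latin1", shuffle=False)

print(toy.target_names)  # category names are taken from the folder names
print(toy.target)        # integer label per document, indexing into target_names
print(toy.data[0])       # decoded text, because an encoding was given
```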


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert text content into numerical feature vectors
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_data.data)
X_test_counts = vectorizer.transform(test_data.data)
print(f"Train shape: {X_train_counts.shape}, Test shape: {X_test_counts.shape}")

Train shape: (1178, 24614), Test shape: (785, 24614)
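
The test matrix has the same number of columns as the training matrix because the vocabulary is learned only in fit_transform, and transform silently ignores words it has not seen. A small sketch on two made-up sentences (not taken from the data set) shows what the count matrix looks like:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, for illustration only.
toy_docs = ["The image has pixels", "The doctor saw the patient"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Columns follow the learned vocabulary, in index order.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())

# Words outside the learned vocabulary are dropped at transform time.
unseen = toy_vectorizer.transform(["unknown words about an image"])
print(unseen.toarray())
```

With the default settings the text is lowercased and single-character tokens are discarded, so "The" and "the" share one column.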


In [3]:
from sklearn.feature_extraction.text import TfidfTransformer

# Learn the IDF weights from the training counts only, then reuse them for the test set
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
print(f"Train TF-IDF shape: {X_train_tfidf.shape}, Test TF-IDF shape: {X_test_tfidf.shape}")

Train TF-IDF shape: (1178, 24614), Test TF-IDF shape: (785, 24614)
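
Note that TfidfTransformer does not use the textbook formula from step 3 verbatim. With its default settings it uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and then scales every row to unit Euclidean length. The sketch below reproduces this by hand on a small, made-up count matrix and compares the result with the transformer's output.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [0, 0, 3]])

# Defaults: norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False
toy_transformer = TfidfTransformer()
tfidf = toy_transformer.fit_transform(csr_matrix(counts)).toarray()

# Manual computation under the same defaults
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # n: documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed IDF
manual = counts * idf                              # raw count x IDF
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

print(np.allclose(tfidf, manual))
```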


In [4]:
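from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# 5 nearest neighbours with the default Euclidean distance. TF-IDF rows are
# L2-normalised by default, so this ranks neighbours like cosine similarity would.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_tfidf, train_data.target)
y_pred = knn.predict(X_test_tfidf)

# Evaluate on the held-out test split
print("Accuracy:", accuracy_score(test_data.target, y_pred))
print(classification_report(test_data.target, y_pred, target_names=train_data.target_names))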

Accuracy: 0.9070063694267516
               precision    recall  f1-score   support

comp.graphics       0.93      0.88      0.90       389
      sci.med       0.89      0.93      0.91       396

     accuracy                           0.91       785
    macro avg       0.91      0.91      0.91       785
 weighted avg       0.91      0.91      0.91       785
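
The three steps (counts, TF-IDF, KNN) can also be chained into a single estimator, so that raw text goes in and predicted labels come out. This is only a sketch of that idea, not part of the exercise: the six training sentences, the two query sentences and the name `toy_model` are invented for illustration, and `n_neighbors=3` is chosen to suit such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set (illustration only).
toy_texts = [
    "render the image with more pixels",
    "the polygon mesh and image texture",
    "3d graphics card renders pixels",
    "the doctor prescribed a treatment",
    "patient symptoms and medical treatment",
    "the doctor examined the patient",
]
toy_labels = ["comp.graphics"] * 3 + ["sci.med"] * 3

# Same three stages as in the cells above, fitted and applied as one object.
toy_model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    KNeighborsClassifier(n_neighbors=3),
)
toy_model.fit(toy_texts, toy_labels)

toy_pred = toy_model.predict(["image pixels", "doctor patient treatment"])
print(toy_pred)
```

Each query shares words only with the documents of one category, so its three nearest neighbours all come from that category.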

