# Text Classification with Scikit-learn

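This Jupyter notebook demonstrates text classification with five scikit-learn models: logistic regression, multinomial naive Bayes, a linear support vector classifier, a random forest, and k-nearest neighbors. The goal is to classify documents from the 20 Newsgroups dataset into two categories: sci.med and comp.graphics.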

## Set-up environment

First, we install the libraries we'll use, for example: `pip install torch transformers datasets scikit-learn numpy pandas matplotlib`. The cells below import only scikit-learn.
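Before running the cells, it can help to confirm that the required package is importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the notebook. Note that scikit-learn is imported under the name `sklearn`, which differs from its pip name.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# The cells in this notebook import only from sklearn (the import name of scikit-learn).
required = ["sklearn"]
print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, install the corresponding package with pip and re-run the check.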

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset
newsgroups = fetch_20newsgroups(subset='train', categories=['sci.med', 'comp.graphics'])
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Text vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Classifier 1 : Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf, y_train)
y_pred = lr.predict(X_test_tfidf)

# Evaluation
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.96      0.95       120
           1       0.96      0.94      0.95       116

    accuracy                           0.95       236
   macro avg       0.95      0.95      0.95       236
weighted avg       0.95      0.95      0.95       236
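To make the columns of the report above concrete, here is a small pure-Python sketch of how precision, recall, F1, and support can be computed for one class. The helper `per_class_metrics` and the demo labels are hypothetical and made up for illustration; returning 0.0 when a denominator is zero is a simplification chosen for this sketch.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


# Made-up labels, not taken from the newsgroups data.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1]

for label in (0, 1):
    p, r, f1, n = per_class_metrics(y_true_demo, y_pred_demo, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

Each row of the report corresponds to one call of this kind: precision is the fraction of predictions for a class that are correct, recall is the fraction of that class's true samples that were found, and F1 is their harmonic mean.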



In [3]:
from sklearn.naive_bayes import MultinomialNB

# Classifier 2 : Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
y_pred_nb = nb.predict(X_test_tfidf)

# Evaluation
print(classification_report(y_test, y_pred_nb))

              precision    recall  f1-score   support

           0       0.99      0.95      0.97       120
           1       0.95      0.99      0.97       116

    accuracy                           0.97       236
   macro avg       0.97      0.97      0.97       236
weighted avg       0.97      0.97      0.97       236
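The last two rows of the report are averages over the per-class rows. As a worked example, the sketch below recomputes the precision averages from the Multinomial Naive Bayes numbers printed above. Because the inputs are already rounded to two decimals, the result is only an approximation of what scikit-learn computes internally.

```python
# Per-class precision and support copied from the Multinomial Naive Bayes report.
precision = {0: 0.99, 1: 0.95}
support = {0: 120, 1: 116}

total = sum(support.values())

# Macro average: plain mean over classes, every class counts equally.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(f"total support:          {total}")
print(f"macro avg precision:    {macro_avg:.2f}")
print(f"weighted avg precision: {weighted_avg:.2f}")
```

With classes this close in size (120 versus 116 samples), the two averages are nearly identical; they diverge only when the classes are imbalanced.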



In [4]:
from sklearn.svm import SVC

# Classifier 3 : Support Vector Classifier
svm = SVC(kernel='linear')
svm.fit(X_train_tfidf, y_train)
y_pred_svm = svm.predict(X_test_tfidf)

# Evaluation
print(classification_report(y_test, y_pred_svm))

              precision    recall  f1-score   support

           0       0.97      0.98      0.98       120
           1       0.98      0.97      0.97       116

    accuracy                           0.97       236
   macro avg       0.97      0.97      0.97       236
weighted avg       0.97      0.97      0.97       236



In [5]:
from sklearn.ensemble import RandomForestClassifier

# Classifier 4 : Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train_tfidf, y_train)
y_pred_rf = rf.predict(X_test_tfidf)

# Evaluation
print(classification_report(y_test, y_pred_rf))


              precision    recall  f1-score   support

           0       0.90      0.94      0.92       120
           1       0.94      0.90      0.92       116

    accuracy                           0.92       236
   macro avg       0.92      0.92      0.92       236
weighted avg       0.92      0.92      0.92       236



In [6]:
from sklearn.neighbors import KNeighborsClassifier

# Classifier 5 : K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_tfidf, y_train)
y_pred_knn = knn.predict(X_test_tfidf)

# Evaluation
print(classification_report(y_test, y_pred_knn))


              precision    recall  f1-score   support

           0       0.96      0.92      0.94       120
           1       0.92      0.97      0.94       116

    accuracy                           0.94       236
   macro avg       0.94      0.94      0.94       236
weighted avg       0.94      0.94      0.94       236
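The five cells above repeat the same fit, predict, and evaluate steps. One way to compare models side by side is to wrap the vectorizer and each classifier in a scikit-learn pipeline and loop over them. The sketch below assumes a tiny made-up corpus in place of the newsgroup posts so that it runs without a download; the texts, labels, `n_neighbors=3`, and `random_state=42` are illustrative choices, not the settings or results of the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus (0 = graphics-like text, 1 = medicine-like text).
train_texts = [
    "render the polygon mesh with a 3d shader",
    "image file formats for texture mapping",
    "ray tracing and pixel shading on the gpu",
    "convert the bitmap image to vector graphics",
    "the patient was given a dose of antibiotics",
    "symptoms of the disease include fever and pain",
    "the doctor recommended a new treatment",
    "clinical study of blood pressure medication",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_texts = ["shading a 3d image on the gpu", "the doctor treated the patient for fever"]
test_labels = [0, 1]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVC": SVC(kernel="linear"),
    # random_state is fixed here only to make the sketch repeatable.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Fewer neighbors than in the notebook because the toy corpus is so small.
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, model in models.items():
    # The pipeline fits TF-IDF on the training texts only, then the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))

for name, score in scores.items():
    print(f"{name:<25} accuracy = {score:.2f}")
```

To apply the same loop to the notebook's data, the raw `X_train` and `X_test` text lists could be passed in place of the toy corpus, since the pipeline handles vectorization itself.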

