# Text Classification on 20Newsgroups

In this tutorial you'll learn to classify the [20Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset with SparseTensorClassifier (STC) and compare its performance with standard classifiers.


## Colab

This tutorial and the others in [this sequence](https://github.com/SparseTensorClassifier/tutorial) can be run in Google Colab. To open this notebook in Colab, click [here](https://colab.research.google.com/github/SparseTensorClassifier/tutorial/blob/main/Text_Classification_20Newsgroups.ipynb) or use the badge below.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SparseTensorClassifier/tutorial/blob/main/Text_Classification_20Newsgroups.ipynb)

## Setup

Uncomment and run the following cell to install the packages. Then, import the modules.

In [1]:
# !pip install stc pandas numpy scikit-learn nltk

In [2]:
import nltk
import warnings
import pandas as pd
import numpy as np

import sklearn.metrics as mtr
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

from stc import SparseTensorClassifier

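np.random.seed(42)
warnings.filterwarnings('ignore')

# nltk.word_tokenize needs the Punkt tokenizer data; recent NLTK releases
# look for 'punkt_tab', older ones for 'punkt'.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)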

## Download the 20Newsgroups dataset

In [3]:
data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

## Set up the competing algorithms

In [4]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

### Prepare train and test sets

Tokenize the documents with `nltk.word_tokenize` and vectorize them with TF-IDF. The vectorizer is fitted on the training set only and then reused to transform the test set.

In [5]:
vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
y_train, y_test = data_train.target, data_test.target
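
To make the fit/transform split concrete, here is a minimal sketch on a toy corpus. The sentences are made up for illustration, and `str.split` stands in for `nltk.word_tokenize` so that no NLTK data is required; neither is part of the tutorial's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from 20Newsgroups.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the bird sat on the dog"]

# Assumption: whitespace splitting replaces nltk.word_tokenize here.
toy_vectorizer = TfidfVectorizer(tokenizer=str.split)
toy_train = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
toy_test = toy_vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))                    # the seven distinct training words
print(toy_train.shape, toy_test.shape)  # both matrices share the same columns
print('bird' in vocab)                  # False: 'bird' never appeared in training
```

Because the test matrix uses the vocabulary learned from the training documents, both matrices have the same number of columns, which is what the classifiers below rely on.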

### Fit

In [6]:
for model_name, model in models.items():
    print("Training: {}".format(model_name))
    model.fit(X_train, y_train)

Training: Logistic Regression
Training: Support Vector Machine
Training: Multinomial Naive Bayes
Training: Decision Tree
Training: Random Forest
Training: K-Nearest Neighbors


### Predict

In [7]:
predictions = {}
for model_name, model in models.items():
    print("Predicting: {}".format(model_name))
    predictions[model_name] = model.predict(X_test)

Predicting: Logistic Regression
Predicting: Support Vector Machine
Predicting: Multinomial Naive Bayes
Predicting: Decision Tree
Predicting: Random Forest
Predicting: K-Nearest Neighbors


## Prepare the data for SparseTensorClassifier

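Tokenize the documents with `nltk.word_tokenize` and convert each one to a JSON-like record: a dictionary with a `words` list and, for the training set, a `target` list holding the label.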

In [8]:
json_train, json_test = [], []
# Training records carry both the tokens and the label.
for doc, label in zip(data_train.data, data_train.target):
    json_train.append({'words': nltk.word_tokenize(doc), 'target': [label]})
# Test records carry only the tokens.
for doc in data_test.data:
    json_test.append({'words': nltk.word_tokenize(doc)})
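
As an illustration of the record layout built in the cell above, the sketch below constructs two records by hand. The documents and labels are hypothetical, and `str.split` again stands in for `nltk.word_tokenize`; only the dictionary shape (`words` and `target` keys) mirrors the tutorial.

```python
# Hypothetical stand-ins for data_train.data and data_train.target.
docs = ["GPUs render graphics fast", "The team won the game"]
targets = [1, 10]

# One dictionary per document: a list of tokens and a one-element list with the label.
records = [
    {'words': doc.split(), 'target': [label]}
    for doc, label in zip(docs, targets)
]

print(records[0])
# {'words': ['GPUs', 'render', 'graphics', 'fast'], 'target': [1]}
```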

### Fit

In [9]:
STC = SparseTensorClassifier(features=['words'], targets=['target'])
STC.fit(json_train)



### Predict

In [10]:
labels, _, _ = STC.predict(json_test, probability=False, explain=False)
predictions['Sparse Tensor Classifier'] = labels.target.values.astype(int)



## Print evaluation metrics

In [11]:
E = []
for estimator, y_pred in predictions.items():
    report = mtr.classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    E.append({
        'Model': estimator, 'Accuracy': report['accuracy'],
        'Avg Precision (macro)': report['macro avg']['precision'],
        'Avg Recall (macro)': report['macro avg']['recall'],
        'Avg F1-score (macro)': report['macro avg']['f1-score'],
        'Avg Precision (weighted)': report['weighted avg']['precision'],
        'Avg Recall (weighted)': report['weighted avg']['recall'],
        'Avg F1-score (weighted)': report['weighted avg']['f1-score']
    })
E = pd.DataFrame(E).set_index('Model', inplace=False)
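
The macro and weighted averages collected above differ only in how per-class scores are combined. The following sketch uses a small, made-up imbalanced example (not the 20Newsgroups predictions) to show the difference, and why weighted recall always coincides with accuracy.

```python
import sklearn.metrics as mtr

# Hypothetical labels: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 1, 0]  # one class-1 sample is misclassified as class 0

toy_report = mtr.classification_report(y_true, y_hat, output_dict=True, zero_division=0)

# Macro: plain mean of per-class recall, (1.0 + 0.5) / 2.
macro_recall = toy_report['macro avg']['recall']
# Weighted: per-class recall weighted by class size, (4 * 1.0 + 2 * 0.5) / 6.
weighted_recall = toy_report['weighted avg']['recall']

print(macro_recall)
print(weighted_recall, toy_report['accuracy'])
```

Macro averaging treats every class equally, so it penalizes poor performance on small classes more heavily; weighted averaging follows the class distribution of the test set.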

In [12]:
E

| Model | Accuracy | Avg Precision (macro) | Avg Recall (macro) | Avg F1-score (macro) | Avg Precision (weighted) | Avg Recall (weighted) | Avg F1-score (weighted) |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.797663 | 0.802854 | 0.787721 | 0.789587 | 0.804665 | 0.797663 | 0.796521 |
| Support Vector Machine | 0.777217 | 0.790612 | 0.768136 | 0.772872 | 0.793172 | 0.777217 | 0.779243 |
| Multinomial Naive Bayes | 0.735661 | 0.8164 | 0.716682 | 0.717503 | 0.811408 | 0.735661 | 0.732126 |
| Decision Tree | 0.536511 | 0.530171 | 0.529172 | 0.52879 | 0.537341 | 0.536511 | 0.536033 |
| Random Forest | 0.747345 | 0.7615 | 0.735418 | 0.733717 | 0.759892 | 0.747345 | 0.741517 |
| K-Nearest Neighbors | 0.523898 | 0.592446 | 0.52229 | 0.533538 | 0.600985 | 0.523898 | 0.538897 |
| Sparse Tensor Classifier | 0.863516 | 0.862865 | 0.855559 | 0.85565 | 0.865961 | 0.863516 | 0.861457 |
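
To read the comparison at a glance, the accuracies can be ranked. This sketch copies the Accuracy column reported above into a standalone `pandas.Series` (in the notebook itself, `E['Accuracy']` would play this role); the variable names are illustrative.

```python
import pandas as pd

# Accuracy values copied from the results table above.
accuracy = pd.Series({
    'Logistic Regression': 0.797663,
    'Support Vector Machine': 0.777217,
    'Multinomial Naive Bayes': 0.735661,
    'Decision Tree': 0.536511,
    'Random Forest': 0.747345,
    'K-Nearest Neighbors': 0.523898,
    'Sparse Tensor Classifier': 0.863516,
}, name='Accuracy')

ranked = accuracy.sort_values(ascending=False)
print(ranked)

# Gap between the best and the second-best model on this run.
gap = ranked.iloc[0] - ranked.iloc[1]
print(ranked.index[0], ranked.index[1], round(gap, 6))
```

On this run, Sparse Tensor Classifier ranks first, about 6.6 accuracy points ahead of Logistic Regression.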


# Congratulations! 

Congratulations on completing this tutorial notebook! If you enjoyed working through it and want to continue with SparseTensorClassifier, we encourage you to finish the rest of the tutorials in [this series](https://github.com/SparseTensorClassifier/tutorial). Don't forget to star the [repository](https://github.com/SparseTensorClassifier/stc)!

[![GitHub Repo stars](https://img.shields.io/github/stars/SparseTensorClassifier/stc?style=social)](https://github.com/SparseTensorClassifier/stc)

Thanks from [sparsetensorclassifier.org](https://sparsetensorclassifier.org)