# Week 4, Lesson 4, Activity 7: End-to-end topic classification

&copy;2021, Ekaterina Kochmar \
(updated to newer library versions: Nadejda Roubtsova, February 2022)

Your task in this activity is to:

- Implement a topic classification algorithm and apply it to the set of `20 Newsgroups` posts specified in this notebook.

## Step 1: Data loading

First, let's import the libraries that we are going to use in this notebook. Then, let's define a function to load the *training* and *test* subsets using a predefined list of categories. Note that the following options are also available:
- you can use `load_dataset('all', categories)` to load the whole dataset (training and test subsets together) for the selected categories
- you can use `load_dataset('train', None)` to load the training subset for all topics

In [1]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

def load_dataset(a_set, cats):
    dataset = fetch_20newsgroups(subset=a_set, categories=cats,
                          remove=('headers', 'footers', 'quotes'),
                          shuffle=True)
    return dataset

categories = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball"]
categories += ["rec.sport.hockey", "sci.crypt", "sci.med", "sci.space", "talk.politics.mideast"]

newsgroups_train = load_dataset(# apply load_dataset to the training subset (a_set='train') and the set of categories
                                )
newsgroups_test = load_dataset(# apply load_dataset to the test subset (a_set='test') and the set of categories
                               )

Let's check what is contained in the loaded data subsets:

In [None]:
def check_data(dataset):
    print(list(dataset.target_names)) # names of the categories
    print(# the number of texts in the dataset can be accessed using dataset.filenames.shape
          )
    print(# the number of target labels is accessible in a similar way, using .target field instead of .filenames
          # this number should be equal to the number of texts
          )
    if # check that the sizes of the data (number of texts) and the labels (number of targets) are equal
        print("Equal sizes for data and targets")
    print(dataset.filenames[0]) # name and location of the file
    print(dataset.data[0])
    print(# print out the first 10 target labels
          )
    
check_data(newsgroups_train)
print("\n***\n")
check_data(newsgroups_test)
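To make the fields inspected above more concrete, here is a minimal sketch on a hand-built stand-in object. The `toy` Bunch and its three made-up documents are assumptions for illustration only; the real object returned by `fetch_20newsgroups` exposes the same `data`, `filenames`, `target` and `target_names` fields, but with far more entries.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical toy stand-in for a loaded subset (not real 20 Newsgroups data)
toy = Bunch(
    data=["engine oil change", "goalie saves the puck", "orbit of the probe"],
    filenames=np.array(["doc0", "doc1", "doc2"]),
    target=np.array([0, 1, 2]),
    target_names=["rec.autos", "rec.sport.hockey", "sci.space"],
)

n_texts = toy.filenames.shape[0]   # number of texts
n_labels = toy.target.shape[0]     # number of target labels
sizes_match = n_texts == n_labels  # every text should have exactly one label

# Targets are integer indices into target_names
first_label_names = [toy.target_names[t] for t in toy.target[:10]]

print(n_texts, n_labels, sizes_match)
print(first_label_names)
```

Mapping the integer targets back to `target_names`, as in the last step, is a quick way to sanity-check that a label really matches the text it is attached to.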

## Step 2: ML pipeline with sklearn

Now let's create word vectors based on the content of the posts:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english')

def text2vec(vectorizer, train_set, test_set):
    vectors_train = vectorizer.fit_transform(train_set.data)
    vectors_test = vectorizer.transform(# now apply vectorizer to the test_set data
                                        # Note: you apply only the .transform method, 
                                        # not .fit_transform to the test data
                                        )
    return vectors_train, vectors_test

vectors_train, vectors_test = text2vec(# apply to the relevant data structures
                                       )

Let's check what the data looks like now:

In [None]:
print(vectors_train.shape)
print(# apply the same to the test data
      # the number of test documents should be the same as before
      # the number of features should be the same for the train and the test data
      )
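print(vectors_train[0]) # sparse row: the non-zero feature indices of the first document with their TF-IDF weights
print(vectorizer.get_feature_names_out()[33404]) # the term stored at feature index 33404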

## Step 3: Apply a machine learning classifier to the data

Next, let's apply the Multinomial Naive Bayes classifier:

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.1)
clf.fit(vectors_train, newsgroups_train.target)
predictions = clf.predict(vectors_test)
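The same fit-and-predict pattern can be tried on a tiny made-up dataset, where it is easy to see what the classifier does. The documents, labels and `toy_` names below are assumptions for illustration, not part of the activity's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: label 0 = cars, label 1 = hockey
train_docs = ["engine brakes tyres", "engine oil tyres",
              "goalie puck ice", "puck ice skates"]
train_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
X_train = toy_vectorizer.fit_transform(train_docs)

# alpha is the additive smoothing parameter: it keeps terms that were
# never seen with a class from getting a probability of exactly zero
toy_clf = MultinomialNB(alpha=0.1)
toy_clf.fit(X_train, train_labels)

X_new = toy_vectorizer.transform(["new tyres and oil", "ice and a puck"])
toy_predictions = toy_clf.predict(X_new)
toy_probabilities = toy_clf.predict_proba(X_new)  # one row per document, one column per class

print(toy_predictions)
print(toy_probabilities.round(3))
```

`predict_proba` is useful when you want to see how confident the model is, rather than only its top choice.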

## Step 4: Evaluate the results

Finally, let's evaluate the results, extract the most informative terms per topic, and print out and visualise the confusion matrix. What can you say about the final results?

In [None]:
from sklearn import metrics

def show_top(classifier, categories, vectorizer, n):
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        top = np.argsort(classifier.feature_log_prob_[i])[-n:]
        print(f'{category}: {" ".join(feature_names[top])}')
        

full_report = metrics.classification_report(newsgroups_test.target, 
                                            predictions, target_names=newsgroups_test.target_names)
print(full_report)
show_top(# apply to the relevant data structures to return the top 10 most informative words
         )

Further evaluation using confusion matrices and visualisation:

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

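# fit() returns the fitted estimator itself, so `classifier` is the same model as `clf`
classifier = clf.fit(vectors_train, newsgroups_train.target)
# rows show the true labels, columns show the predicted labels
ConfusionMatrixDisplay.from_estimator(classifier, vectors_test, newsgroups_test.target)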
plt.show()

for i, category in enumerate(newsgroups_train.target_names):
    print(i, category)
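If you prefer to read the numbers behind the plot, the matrix can also be computed directly. This is a small sketch with made-up labels (the lists below are assumptions for illustration); entry `[i, j]` counts the items whose true class is `i` and whose predicted class is `j`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six items in three classes
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(true_labels, predicted_labels)
print(cm)

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()
# Recall per class: correct predictions divided by the number of true items in that class
recall_per_class = np.diag(cm) / cm.sum(axis=1)
print(accuracy, recall_per_class)
```

Large off-diagonal entries point to pairs of topics that the classifier tends to confuse with each other.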