# Text Classification with the 20 Newsgroups Dataset

The *20 Newsgroups* dataset was created in the 1990s and contains texts extracted from various Usenet groups dedicated to specific topics. Its official homepage can be found here: http://qwone.com/~jason/20Newsgroups/

It is a commonly used dataset for benchmarking text classification approaches. Some examples of state-of-the-art benchmark scores on the dataset can be found here: https://paperswithcode.com/dataset/20-newsgroups

The labels of the dataset have been computed from the Usenet group where each message was posted. "Off-topic" messages therefore may have unintuitive or seemingly "wrong" labels.

## Download the Dataset

Run the following cell to download the CSV file containing the data. Note that in this example we download the data the "Pythonic" way, rather than with the terminal command wget as in the previous notebook.

In [None]:
import requests

data_url = "https://raw.githubusercontent.com/TaylorPeer/bfh/main/PoML/data/20_newsgroups.csv"

response = requests.get(data_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    
    # Save the file to your working directory
    with open("20_newsgroups.csv", "wb") as file:
        file.write(response.content)
    print("File downloaded successfully.")
    
else:
    print(f"Failed to download the file. Status code: {response.status_code}")


## Inspect the Dataset

As with the previous notebook, inspect the downloaded file to find out how to properly load it. Take note of its formatting.

In [None]:
!head -n 2 20_newsgroups.csv

## Load the Dataset

Use Pandas to load the dataset into a DataFrame.

In [None]:
import pandas as pd

df = pd.read_csv(...)

## Inspect the Dataset

Use the .unique() method of a DataFrame column to view all of the distinct values it contains, in this case the class labels.

In [None]:
df["newsgroup"].unique()

### Label Distribution

Using the groupby() and count() methods, we can calculate the number of examples per class label.

Being aware of the underlying distribution of our data is important. Some learning algorithms may be affected by skewed data. Also, our interpretation of evaluation scores depends heavily on the distribution of the class labels as well as the number of classes.

In [None]:
df.groupby("newsgroup").count()
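
To make the link between label distribution and evaluation scores concrete, the following sketch computes the share of each label and the accuracy of a "classifier" that always predicts the most frequent label. The toy DataFrame and its counts are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; labels and counts are made up
toy_df = pd.DataFrame({
    "newsgroup": ["sci.space"] * 6 + ["rec.autos"] * 3 + ["misc.forsale"] * 1,
})

# Relative frequency of each class label, sorted from most to least common
label_share = toy_df["newsgroup"].value_counts(normalize=True)
print(label_share)

# Accuracy of a "classifier" that always predicts the most frequent label
majority_baseline = label_share.iloc[0]
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model score should be read against this kind of baseline: on heavily skewed data, a high accuracy may mean little more than "predicts the largest class".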

### Example Texts

Take a look at some of the raw data to get a feeling for the kinds of messages to be classified. Being familiar with the data can help with recognizing text that may be problematic for our classifiers later on.

In [None]:
# Increase the amount of text displayed when rendering DataFrames
pd.set_option('display.max_colwidth', 200)

df.sample(5)

## Vectorization

In the previous notebook using the Iris dataset, the raw features (petal and sepal widths and lengths) of the flowers to be classified were already in a format that could be processed by machine learning algorithms, so no explicit vectorization step was necessary. This is not the case when working with text data, and we will have to introduce a new explicit vectorization step. Even advanced text processing systems like ChatGPT do not work directly on text internally, but also use vector representations.

One way to vectorize text is to regard every word as a feature. In that case, the length of our feature vector equals the number of unique words contained in our dataset. For each document in the dataset, we then create a feature vector with a 1 for every word that the document contains and a 0 for every word that appears elsewhere in the dataset but not in that particular document.

As you can imagine, the feature vectors would be extremely large and mostly **sparse**, meaning they would mostly contain 0s, since most individual documents will not contain very many of the possible words. The size of these vectors presents a challenge for many learning algorithms and is often referred to as the **curse of dimensionality**.

Many **hyperparameters** are available that influence the size of these feature vectors to help keep them to a manageable size.


### TF-IDF

Rather than only using 1s and 0s to represent the presence and absence of feature terms, we can also score the importance of each term. One way to compute such scores is via a TF-IDF calculation. TF-IDF combines two simple assumptions to assign scores to terms. The first assumption is that if a term appears frequently in a document, then it must be important in that document, e.g. the document is likely about that topic. The second assumption is that if a term appears frequently in the entire document collection, it is likely not that important, for instance, because it is a commonly used term overall. TF-IDF takes both of these assumptions into consideration: it weights a term more highly for a document if it occurs often in that document and lower if it occurs often overall in the collection.

### Feature Vectorization of Text Using TF-IDF

Run the following cell to vectorize the dataframe and inspect its new representation.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the 'text' column
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

tfidf_df

### TfidfVectorizer Hyperparameters

Use the *min_df* and *max_df* parameters of the TfidfVectorizer to adjust the vectorization of our dataframe. 

*min_df* represents the minimum number of documents that must contain a term in order for it to be used as a feature in the vectorization process. This helps us set a sensible lower cutoff to avoid vectorizing typos, spelling mistakes, and other uncommon words that are unlikely to be useful during training.

*max_df* represents the maximum number of documents that can contain a term in order for it to be used as a feature in the vectorization process. This helps us filter out extremely frequent terms that are unlikely to give us any hints about the class label of a document.

Note that both parameters accept either a floating point value between 0 and 1, representing a proportion of the document collection, or an integer value, representing an absolute number of documents in the collection.

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=..., min_df=...)

tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

## Training

Before training, let's divide our dataset into training and test splits.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(df["text"], df["newsgroup"], test_size=0.2)

classifier = LogisticRegression()

### Fit the Vectorizer only on the Training Set

It's important to fit the vectorizer only on the training set and not the entire dataset - why?

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=20)
tfidf_matrix = tfidf_vectorizer.fit_transform(X_train)

### Apply the Fitted Vectorizer to the Test Set

In [None]:
tf_idf_vectorized_test = tfidf_vectorizer.transform(X_test)
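
One common way to make this fit-on-train, transform-on-test pattern harder to get wrong is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. The sketch below is an optional alternative to the separate steps used in this notebook, shown on invented toy data; the names text_clf, train_texts and test_texts are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training and test data (made up for illustration)
train_texts = [
    "rocket launch orbit",
    "shuttle orbit crew",
    "engine car wheels",
    "car dealer price",
]
train_labels = ["space", "space", "autos", "autos"]
test_texts = ["orbit rocket crew", "car engine price"]

# fit() fits the vectorizer and the classifier on the training texts only
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_clf.fit(train_texts, train_labels)

# predict() only applies the already fitted vectorizer to the test texts
toy_predictions = text_clf.predict(test_texts)
print(toy_predictions)
```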

### Train the Model

In [None]:
trained_model = classifier.fit(tfidf_matrix, y_train)

### Evaluate the Model

In [None]:
from sklearn.metrics import classification_report

predictions = trained_model.predict(tf_idf_vectorized_test)

print(classification_report(
    y_test, 
    predictions, 
    target_names=trained_model.classes_))


### Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=trained_model.classes_)
disp.plot(xticks_rotation="vertical")
plt.show()


### Additional Models

Adjust the hyperparameters of the following classifiers and TfidfVectorizer to investigate their impact on the runtime and performance of the models. 

Some models used in the previous Iris notebook have been excluded here due to their extremely slow performance on large feature vectors (KNeighborsClassifier and SVC).

In [None]:
import time
from tqdm import tqdm
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier


# Decision Tree:
# - max_depth: maximum depth of the tree (default: None, i.e. no limit)
decision_tree = DecisionTreeClassifier(max_depth=10)

# Random Forest
# - n_estimators: number of individual decision trees used internally by the model (default: 100)
random_forest = RandomForestClassifier(n_estimators=50)

# Naive Bayes:
naive_bayes = MultinomialNB()

# Logistic Regression:
# - max_iter: maximum number of iterations (default: 100)
logistic_regression = LogisticRegression(max_iter=1000)

classifiers = [
    decision_tree,
    random_forest,
    naive_bayes,
    logistic_regression
]

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=20)
tfidf_matrix = tfidf_vectorizer.fit_transform(X_train)
tf_idf_vectorized_test = tfidf_vectorizer.transform(X_test)

model_metrics = []
for classifier in tqdm(classifiers):
    
    # Train the classifier
    start_time = time.time()
    trained_model = classifier.fit(tfidf_matrix, y_train)
    end_training_time = time.time()
    training_time_elapsed = end_training_time - start_time
    
    # Apply trained classifier to test set
    start_time = time.time()
    predictions = trained_model.predict(tf_idf_vectorized_test)
    prediction_time = time.time()
    prediction_time_elapsed = prediction_time - start_time
    
    # Measure model performance
    score = classifier.score(tf_idf_vectorized_test, y_test)
    
    # Record model metrics
    model_metrics.append({
        "model": trained_model.__class__.__name__,
        "training_time": training_time_elapsed,
        "prediction_time": prediction_time_elapsed,
        "score": score,
    })
    
# Print model metrics table
pd.DataFrame(model_metrics)

### Impact of Training Set Size

Since our TfidfVectorizer is "trained" on the training set, it only knows words contained in that collection. If this set is very small, our vectorizer will not know many terms and other, new documents to be classified will contain many **out-of-vocabulary** terms. This is an instance where having more data available to train with typically drastically improves model performance - to a point.

Run the cell below to view the impact of training on different amounts of data.

In [None]:
# Naive Bayes:
naive_bayes = MultinomialNB()

training_set_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0]

model_metrics = []
for training_set_size in tqdm(training_set_sizes):
    
    X_train_sample = X_train.sample(frac=training_set_size).sort_index()
    y_train_sample = y_train[y_train.index.isin(X_train_sample.index)].sort_index()
    
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=20)
    tfidf_matrix = tfidf_vectorizer.fit_transform(X_train_sample)
    tf_idf_vectorized_test = tfidf_vectorizer.transform(X_test)
    
    # Train the Naive Bayes classifier defined above
    start_time = time.time()
    trained_model = naive_bayes.fit(tfidf_matrix, y_train_sample)
    end_training_time = time.time()
    training_time_elapsed = end_training_time - start_time
    
    # Apply trained classifier to test set
    predictions = trained_model.predict(tf_idf_vectorized_test)
    
    # Measure model performance
    score = trained_model.score(tf_idf_vectorized_test, y_test)
    
    # Record model metrics
    model_metrics.append({
        "training_set_proportion": training_set_size,
        "training_set_size": len(X_train_sample),
        "training_time": training_time_elapsed,
        "score": score,
    })
    
# Print model metrics table
pd.DataFrame(model_metrics)
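
The out-of-vocabulary effect mentioned at the start of this section can be reproduced in isolation. In this sketch, a vectorizer is fitted on a deliberately tiny, invented training set and then applied to two new documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Deliberately tiny training set (made up for illustration)
small_training_set = ["rocket launch orbit", "car engine repair"]
toy_vectorizer = TfidfVectorizer().fit(small_training_set)

new_docs = [
    "rocket engine",         # both words are in the vocabulary
    "goalie scored hockey",  # no word is in the vocabulary
]
new_vectors = toy_vectorizer.transform(new_docs)

# Number of non-zero features per new document
nonzero_per_doc = (new_vectors.toarray() != 0).sum(axis=1)
print(dict(zip(new_docs, nonzero_per_doc)))
```

A document made up entirely of unknown words is mapped to an all-zero vector, which gives the classifier nothing to base its decision on.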

### Explainability

Different models have different internal workings. In the previous Iris notebook we saw how to inspect a Decision Tree classifier to interpret how it classifies our data. Other models have other ways of interpreting their output. 

Logistic regression works by learning a weight for each feature; in a multi-class setting such as ours, it learns one set of weights per class. At prediction time, these weights are multiplied with the feature values and the results are combined to score each class label. The value of a weight can be interpreted as a kind of "importance": a large positive weight means the feature counts in favor of a particular class label, and a large negative weight means it counts against it.

The feature weights of our trained Logistic Regression model can be inspected as follows:

In [None]:
# Fit our vectorizer and logistic regression model
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=20)
tfidf_matrix = tfidf_vectorizer.fit_transform(X_train)
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression = logistic_regression.fit(tfidf_matrix, y_train)

# Get the coefficients from the trained model.
# coef_ holds one row of weights per class; row 0 belongs to the first label in classes_
coefficients = logistic_regression.coef_[0]

# Feature names, in the same order as the columns of the TF-IDF matrix
feature_names = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame to display the coefficients with feature names
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the DataFrame by the absolute magnitude of coefficients
coefficients_df['Abs_Coefficient'] = abs(coefficients_df['Coefficient'])
sorted_coefficients_df = coefficients_df.sort_values(by='Abs_Coefficient', ascending=False)

# Display the sorted DataFrame to see the most important features
sorted_coefficients_df.head(20)