<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/tutorials_notebooks_in_class_2024/W02_Intro_to_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to scikit-learn

We would like to introduce you to scikit-learn with the help of an instructional example about text classification. We will cover the most basic principles and ideas about scikit-learn in this notebook. This tutorial is inspired by the sklearn tutorial on http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html, but contains a few more explanations and is suited to introduce scikit-learn in class.

$Author$: Phillip Ströbel

With minor adjustments from: Janis Goldzycher, Andrianos Michail

## Data

Get the data from http://qwone.com/~jason/20Newsgroups/. We will work with the 20news-bydate.tar.gz data set. Unzip it to a suitable destination. Here, all the data lies in the data folder. Conveniently, it has already been split into a training and a test set, so we don't have to do this ourselves. What we do need to do is load the data and put it into a dataframe (you could also work with dictionaries or other data containers). We do this for both the training and the test set.

In [None]:
import os
import pandas as pd


def create_df(path_to_data, random_state=42):
    """
    Takes the path of a folder containing all the subfolders (which contain the actual documents).
    Builds a pandas dataframe with the text and the label of each document.
    :param path_to_data: path to top folder as a string
    :param random_state: integer, seed for shuffling
    :return: pandas dataframe with all the data
    """
    doc_list = list()  # doc_list now: [[doc<str>, label<str>], ...]

    for category in os.listdir(path_to_data):
        for document in os.listdir(os.path.join(path_to_data, category)):
            # Use a context manager so each file handle is closed after reading.
            with open(os.path.join(path_to_data, category, document), 'r', encoding='latin-1') as f:
                doc = f.read().replace('\n', ' ')
            doc_list.append([doc, category])

    df = pd.DataFrame(doc_list, columns=['text', 'label'])

    return df.sample(frac=1, random_state=random_state) # return and shuffle dataframe

In [None]:
!wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz

In [None]:
!tar -xvzf 20news-bydate.tar.gz
!echo "The files are unzipped and this folder now contains:"
!ls


In [None]:
!mkdir data
!mv 20news-bydate-* data/

In [None]:
train = create_df('data/20news-bydate-train')
test = create_df('data/20news-bydate-test')

There are several ways to inspect the data.

In [None]:
print('training size: ', train.shape)
print('test size: ', test.shape)

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.groupby('label').size()

As usual, we split the labels from the training and the test set.

In [None]:
X_train = train.text
y_train = train.label
X_test = test.text
y_test = test.label

In [None]:
type(X_train)

Series is just a "One-dimensional ndarray with axis labels". Let's see if we got this right.

In [None]:
print('Training set shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test set shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

In [None]:
X_train.head()

## Preprocessing
So far, so good! But we know that machine learning algorithms cannot work with text data directly. So we need to vectorise the data somehow. Also, we might do some preprocessing. Let's see how we can tackle these problems.
### Vectorise the data
Luckily, sklearn offers some nice classes which help us. We should tokenise the data and then vectorise it. Conveniently, sklearn's `CountVectorizer` does exactly that. Let's see how it works.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(encoding='latin-1')
X_train_counts = count_vect.fit_transform(X_train)  # num_docs x num_words

The central methods in sklearn are `fit`, `transform`, `fit_transform`, and `predict`. We will see how each of these works and when to use them. We have already made use of `fit_transform`. Instead of using this method, we could have called `fit` on the training set first and then used `transform` to vectorise the data (to 'transform' it). With the fitted `CountVectorizer` we can now transform other data, for example the test set.

In [None]:
X_test_counts = count_vect.transform(X_test)
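To make the relationship between `fit`, `transform` and `fit_transform` concrete, here is a minimal sketch on a made-up toy corpus. The documents and the names `vect_a` and `vect_b` are assumptions for illustration only, not part of the newsgroup data. It shows that fitting and transforming in two steps gives the same counts as `fit_transform`, and that words not seen during fitting are ignored when new data is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up for this illustration).
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat barked loudly"]

# Variant A: fit and transform in one call.
vect_a = CountVectorizer()
counts_a = vect_a.fit_transform(train_docs)

# Variant B: fit first, then transform the same data.
vect_b = CountVectorizer()
vect_b.fit(train_docs)
counts_b = vect_b.transform(train_docs)

same_result = bool((counts_a.toarray() == counts_b.toarray()).all())
vocabulary = sorted(vect_a.vocabulary_)

# "loudly" was not seen during fitting, so it gets no column.
test_counts = vect_a.transform(test_docs)

print(same_result)
print(vocabulary)
print(test_counts.toarray())
```

This is also why we only ever call `transform` (never `fit`) on the test set: the vocabulary must come from the training data alone.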

We will return to this later. First let us see what `CountVectorizer` produces.

In [None]:
X_train_counts

The vectorised form contains 11314 rows, which is the number of our documents, while the number of columns is the vocabulary size of the whole corpus. But what's a sparse matrix? Note that the complete, dense document-vocabulary matrix would need to hold 1,472,030,598 values, most of which would be zero. Why? Instead, we only save the 1,787,565 non-zero values in a compressed sparse row format. An example:

In [None]:
import numpy as np
from scipy import sparse

row_indices = np.array([0, 0, 1, 2, 2, 2])
col_indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mtx = sparse.csr_matrix((data, (row_indices, col_indices)), shape=(3, 3))
mtx

In [None]:
print(mtx)

In [None]:
m = mtx.todense()
m

In [None]:
m[0,0]

In [None]:
m[0,:]

How does indexing of sparse matrices work?

In [None]:
print(mtx[:,0])
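If you are curious what "compressed sparse row" means internally, the following sketch rebuilds the small matrix from above and looks at the three arrays a CSR matrix stores: `data`, `indices` and `indptr`. This is only meant as an illustration of the storage format; you rarely need to touch these arrays directly.

```python
import numpy as np
from scipy import sparse

row_indices = np.array([0, 0, 1, 2, 2, 2])
col_indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mtx = sparse.csr_matrix((data, (row_indices, col_indices)), shape=(3, 3))

# indptr[i]:indptr[i + 1] is the slice of `data` and `indices` that belongs to row i.
print("indptr: ", mtx.indptr)
print("indices:", mtx.indices)
print("data:   ", mtx.data)

# Recover row 2 by hand from the three arrays.
start, end = mtx.indptr[2], mtx.indptr[3]
row2_cols = mtx.indices[start:end]
row2_vals = mtx.data[start:end]

# Column slicing works too, but is slower on CSR because rows are stored contiguously.
first_column = mtx[:, 0].toarray().ravel()
print(row2_cols, row2_vals, first_column)
```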

Now let's apply our new knowledge to our word-document matrix.

In [None]:
X_train_counts.shape
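The numbers mentioned earlier can be checked with a bit of arithmetic. The sketch below only uses the three figures quoted in the text (number of documents, size of the dense matrix, number of stored values); the variable names are made up for illustration.

```python
# Figures quoted in the text above.
n_docs = 11314
n_dense_values = 1_472_030_598
n_stored_values = 1_787_565

# The dense matrix has n_docs * vocabulary_size cells.
vocabulary_size = n_dense_values // n_docs

# Fraction of cells that actually hold a non-zero count.
density = n_stored_values / n_dense_values

# Average number of distinct vocabulary entries per document.
avg_distinct_words = n_stored_values / n_docs

print(f"vocabulary size: {vocabulary_size}")
print(f"density: {density:.4%}")
print(f"average distinct words per document: {avg_distinct_words:.1f}")
```

So only a tiny fraction of the cells is non-zero, which is exactly the situation sparse formats are designed for.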

In [None]:
X_train_counts[0,:]

We can see which positions of the document vector are occupied. A `1` means the word occurs once in the document, while any other number gives the exact count.

In [None]:
print(X_train_counts[0,:])

The number of words in a document is also trivial to get.

In [None]:
X_train_counts[0,:].sum()

In a similar fashion, we can count how many times a certain word occurs in the training set. (In this case, the word occurring first in the vocabulary.)

In [None]:
X_train_counts[:,0].sum()

We can also learn more about the vocabulary, e.g., how many times a word occurs in the corpus. First, we need to find the index:

In [None]:
count_vect.vocabulary_.get('sin')

Now that we have the index, we can count how many times the word "sin" occurs in our corpus.

In [None]:
sin_index = count_vect.vocabulary_.get('sin')
X_train_counts[:,sin_index].sum()

So far, so good. `CountVectorizer` also lets you define whether you want to count bigrams or other n-grams. Moreover, you can count not only words, but also characters. We suggest you try these out for yourself. In the following, we will continue with unigrams.
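As a starting point for your own experiments, here is a small sketch of both options on made-up strings (the inputs and variable names are assumptions for illustration). `ngram_range=(1, 2)` counts unigrams and bigrams, and `analyzer="char"` switches from words to characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word unigrams and bigrams.
word_vect = CountVectorizer(ngram_range=(1, 2))
word_vect.fit(["the cat sat"])
# Column indices follow the sorted vocabulary, so sorting the keys gives the column order.
word_features = sorted(word_vect.vocabulary_)

# Character bigrams.
char_vect = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_counts = char_vect.fit_transform(["abab"])
char_features = sorted(char_vect.vocabulary_)

print(word_features)
print(char_features, char_counts.toarray())
```

Keep in mind that the vocabulary grows quickly with larger n-grams, so on the full corpus this costs noticeably more memory and time.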

Since we have numbers now instead of strings, we could start training models now. However, raw counts will not be very informative, since we also have to take the length of a document into account. Dividing each row by the total number of words will give us the term frequency for each document. That will be much better! Now we still might have higher values for words which occur often in many documents. Typically, these words are less informative, so we need to downscale those weights. This will modify our counts so that we are left with what is called the "term frequency-inverse document frequency" measure, or tf-idf. The tf-idf measure is given by
\begin{equation}
f_{t,d}\cdot \log \frac{N}{n_t}
\end{equation}
where $f_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$. In sklearn, there is the `TfidfTransformer`, which does this for us (by default with a smoothed variant of the formula) :-).

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_tranformer = TfidfTransformer(smooth_idf=True).fit(X_train_counts)
X_train_tfidf = tfidf_tranformer.transform(X_train_counts)

In [None]:
X_train_tfidf.shape

In [None]:
X_train_tfidf[0,:]

In [None]:
print(X_train_tfidf[0,:])
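To see the tf-idf formula from above at work, here is a hand-rolled sketch in plain Python on two made-up, already tokenised documents. It follows the textbook formula literally (term frequency times $\log \frac{N}{n_t}$). The numbers will not match `TfidfTransformer`, which by default smooths the idf and L2-normalises each row, so treat this as an illustration of the idea rather than a re-implementation.

```python
import math
from collections import Counter

# Two toy documents, already tokenised (made up for this illustration).
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
N = len(docs)

# n_t: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return {term: tf-idf weight} for one tokenised document."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(N / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

Note how "banana", which occurs in every document, ends up with weight zero: it does not help to tell the documents apart.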

Again we apply the transformation to the test set:

In [None]:
X_test_tfidf = tfidf_tranformer.transform(X_test_counts)

This should suffice as features to train a classifier (for the moment).

### Vectorise labels
Next, we deal with the labels. Every document has exactly one label attached. We have 20 labels in total. This means we can basically assign a number to each label.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)

In [None]:
y_train[0]

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
label_encoder.classes_
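What `LabelEncoder` does is conceptually simple. The following plain-Python sketch mimics the idea on a few example labels: collect the distinct labels, sort them, and use the position in that sorted list as the numeric id. The variable names are made up for illustration; in practice you would simply use `LabelEncoder` and its `inverse_transform` method.

```python
# A handful of example labels.
labels = ["sci.space", "alt.atheism", "sci.space", "rec.autos"]

# Sorted distinct labels play the role of `classes_`.
classes = sorted(set(labels))
label_to_id = {label: index for index, label in enumerate(classes)}

# Encode strings as integers and decode them again.
encoded = [label_to_id[label] for label in labels]
decoded = [classes[index] for index in encoded]

print(classes)
print(encoded)
print(decoded)
```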

## Finally, let's train models
Now it's time to train models. Let's stick to the Multinomial Naive Bayes classifier for the moment.

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB()
nb_clf.fit(X_train_tfidf, y_train)

Let's see how well we do on the test set:

In [None]:
nb_clf.predict(X_test_tfidf)

In [None]:
y_test

Computing the accuracy is simple:

In [None]:
correct = 0

for index, prediction in enumerate(nb_clf.predict(X_test_tfidf)):
    if prediction == y_test[index]:
        correct +=1

print('Accuracy: ', correct/y_test.shape[0])
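The loop above can also be written as a single vectorised NumPy expression. Here is a sketch with made-up predictions and labels (the arrays are assumptions for illustration): comparing two arrays gives a boolean array, and its mean is the fraction of correct predictions.

```python
import numpy as np

# Made-up labels and predictions for five documents.
y_true = np.array([0, 2, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 2])

# Element-wise comparison, then the mean of the resulting booleans.
accuracy = (y_pred == y_true).mean()
print("Accuracy: ", accuracy)
```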

In [None]:
from sklearn.metrics import accuracy_score
# sklearn metrics expect the true labels first, then the predictions
accuracy_score(y_test, nb_clf.predict(X_test_tfidf))

Almost 80 percent, that is not too bad. What about a Support Vector Classifier?

In [None]:
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train_tfidf, y_train)

In [None]:
correct = 0

for index, prediction in enumerate(svc.predict(X_test_tfidf)):
    if prediction == y_test[index]:
        correct +=1

print('Accuracy: ', correct/y_test.shape[0])

An increase of about 8 percentage points, that's good!

However, a single score on one test set can be misleading. To get a more reliable estimate of the performance of our models, we use cross validation.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(nb_clf, X_train_tfidf, y_train, scoring='accuracy', cv=10)

In [None]:
scores

In [None]:
scores = cross_val_score(svc, X_train_tfidf, y_train, scoring='accuracy', cv=10)

In [None]:
scores
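To get a feeling for what happens behind `cv=10`, here is a plain-Python sketch of how k-fold cross validation partitions the data. The helper `k_fold_indices` is hypothetical and written only for this illustration. As far as we know, sklearn additionally stratifies the folds by label for classifiers, which this simplified version does not do.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        for fold in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once, and the model is retrained from scratch for each fold. The array returned by `cross_val_score` holds one score per fold, so `scores.mean()` and `scores.std()` give a compact summary.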

We can also calculate precision, recall, and f1 relatively easily:

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=50)
sgd_clf.fit(X_train_tfidf, y_train)

y_train_predictions = cross_val_predict(sgd_clf, X_train_tfidf, y_train, cv=3)

In [None]:
print(precision_score(y_train, y_train_predictions, average='micro'))
print(recall_score(y_train, y_train_predictions, average='micro'))
print(f1_score(y_train, y_train_predictions, average='micro'))
conf_mx = confusion_matrix(y_train, y_train_predictions)
conf_mx
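A note on `average='micro'`: in a single-label multi-class setting like ours, micro-averaged precision, recall and f1 all coincide with accuracy, so the three printed numbers are identical. The sketch below shows this on a small made-up confusion matrix (rows are true classes, columns are predicted classes) and contrasts it with macro averaging. All names are chosen for illustration.

```python
# Made-up 3-class confusion matrix: rows = true class, columns = predicted class.
conf = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
]
n_classes = len(conf)

true_positives = [conf[i][i] for i in range(n_classes)]
true_totals = [sum(row) for row in conf]
predicted_totals = [sum(conf[r][c] for r in range(n_classes)) for c in range(n_classes)]

# Macro: compute the metric per class, then average the per-class values.
macro_precision = sum(tp / p for tp, p in zip(true_positives, predicted_totals)) / n_classes
macro_recall = sum(tp / t for tp, t in zip(true_positives, true_totals)) / n_classes

# Micro: pool all decisions first, then compute the metric once.
micro_precision = sum(true_positives) / sum(predicted_totals)
micro_recall = sum(true_positives) / sum(true_totals)
accuracy = sum(true_positives) / sum(true_totals)

print("macro precision / recall:", macro_precision, macro_recall)
print("micro precision / recall:", micro_precision, micro_recall)
print("accuracy:", accuracy)
```

If you want the metrics to reflect how well each individual class is handled, `average='macro'` is usually the more informative choice.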

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

In [None]:
label_encoder.classes_
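Instead of reading the most confused classes off the plot, you can also look them up programmatically. The following sketch does this on a small made-up confusion matrix (an assumption for illustration); on the real data you would use `conf_mx` and map the two indices back to names with `label_encoder.classes_`.

```python
import numpy as np

# Made-up confusion matrix: rows = true class, columns = predicted class.
conf = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
])

# Normalise each row so that classes of different size are comparable.
row_sums = conf.sum(axis=1, keepdims=True)
norm_conf = conf / row_sums

# Ignore the correct predictions on the diagonal.
np.fill_diagonal(norm_conf, 0)

# Position of the largest remaining error rate.
true_class, predicted_class = np.unravel_index(norm_conf.argmax(), norm_conf.shape)
print(f"class {true_class} is most often mistaken for class {predicted_class}")
```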

## Shortcuts in sklearn - pipelines
Sklearn allows us to build convenient `Pipelines`, which make managing our data and training our models much easier. Consider for example:

In [None]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb_clf', MultinomialNB())
])

We could even replace the first two steps of the pipeline with a single `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer`: it tokenises and counts the input in the same way and then applies the tf-idf weighting.

In [None]:
text_clf.fit(X_train, y_train)

In [None]:
scores = cross_val_score(text_clf, X_train, y_train, scoring='accuracy', cv=10)

In [None]:
scores
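The claim that `TfidfVectorizer` can stand in for the first two pipeline steps is easy to check. Here is a minimal sketch on a made-up toy corpus (documents and variable names are assumptions for illustration) that compares both routes with default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy documents (made up for this illustration).
docs = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]

# Route 1: count first, then apply the tf-idf weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)

# Route 2: both steps in a single vectoriser.
one_step = TfidfVectorizer().fit_transform(docs)

identical = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(identical)
```

With non-default parameters the two routes only agree if you pass the same settings to both.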

## Model selection - find your best model
For every model you would like to train, there is a plethora of parameters you could set. How do you find the best model? Again, sklearn has a solution: `GridSearchCV`. With grid search cross validation, you define your hyperparameter space and train a model for every parameter combination. Keep in mind that every combination is trained once per fold, so the whole training procedure takes significantly longer. But let's set up grid search cross validation. We set up a new pipeline for an SVC.

In [None]:
from sklearn.model_selection import GridSearchCV

text_svc = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC())
])

param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
             'svc__loss': ['hinge', 'squared_hinge'],
             'svc__multi_class': ['ovr', 'crammer_singer']}

gs_svc = GridSearchCV(text_svc, param_grid, cv=5, n_jobs=4, verbose=1)
gs_svc.fit(X_train, y_train)

In [None]:
# Repeat the search with 10 folds and keep the training scores for inspection.
# This reuses text_svc and param_grid from the previous cell.
gs_svc = GridSearchCV(text_svc, param_grid, cv=10, n_jobs=3, verbose=1, return_train_score=True)
gs_svc.fit(X_train, y_train)
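Before starting a grid search it is worth estimating how many models will be trained. The sketch below enumerates the combinations of the parameter grid used above with `itertools.product` and multiplies by the number of folds. It is plain Python for illustration only; `GridSearchCV` does this enumeration for you.

```python
from itertools import product

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "svc__loss": ["hinge", "squared_hinge"],
    "svc__multi_class": ["ovr", "crammer_singer"],
}

# Every combination of one value per parameter.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_fits_cv5 = len(combinations) * 5
n_fits_cv10 = len(combinations) * 10

print(f"{len(combinations)} parameter combinations")
print(f"{n_fits_cv5} fits with cv=5, {n_fits_cv10} fits with cv=10")
```

Adding one more parameter with two values would double these numbers again, so grids grow quickly.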

In [None]:
svc_df = pd.DataFrame.from_dict(gs_svc.cv_results_)
svc_df.sort_values(by=["rank_test_score"])

In [None]:
gs_svc.predict(X_test)

In [None]:
y_test

In [None]:
best_model = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2))),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC(loss='hinge', multi_class='crammer_singer'))
])

best_model.fit(X_train, y_train)

In [None]:
best_model.predict(X_test)

In [None]:
correct = 0

for index, prediction in enumerate(best_model.predict(X_test)):
    if prediction == y_test[index]:
        correct +=1

print('Accuracy: ', correct/y_test.shape[0])

## Modern Solutions Sneak Peek - Transformer

Let's look at another task, paraphrase detection. Do two sentences have the same meaning?

In [None]:
!pip install transformers datasets -qU

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import torch

# 1. Load Pre-trained Model and Tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "AMHR/adversarial-paraphrasing-detector"  # Replace with the model you want to use
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Prepare the Dataset
dataset = load_dataset("glue", "mrpc")
eval_dataset = dataset["validation"]

correct = 0
total = 0

# 3. Evaluate the Model
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    for i, example in enumerate(eval_dataset):
        if i >= 5:  # only run the model on the first 5 examples
            break
        # Tokenize the inputs and get the model's predictions
        inputs = tokenizer(example['sentence1'], example['sentence2'], return_tensors='pt', truncation=True, padding=True, max_length=128)
        # Move input tensors to the same device as the model
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        # Print first 5 example pairs along with predictions and ground truth labels
        if i < 5:
            print(f"Example {i+1}")
            print(f"Sentence 1: {example['sentence1']}")
            print(f"Sentence 2: {example['sentence2']}")
            print(f"Prediction: {'Paraphrase' if predictions == 1 else 'Not a Paraphrase'}")
            print(f"Ground Truth: {'Paraphrase' if example['label'] == 1 else 'Not a Paraphrase'}")
            print("="*50)


In [None]:
def predict_paraphrase(sentence1, sentence2, model, tokenizer, device):
    # Prepare the sentences for the model
    inputs = tokenizer(sentence1, sentence2, return_tensors='pt', truncation=True, padding=True, max_length=128)

    # Move the input tensors to the device the model is on
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

    model.eval()
    with torch.no_grad():
        # Get model's prediction
        outputs = model(**inputs)
        logits = outputs.logits
        prediction = torch.argmax(logits, dim=-1).item()

    return "Paraphrase" if prediction == 1 else "Not a Paraphrase"
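The function above only returns the label with the highest logit. If you also want to know how confident the model is, the logits can be turned into probabilities with a softmax. Here is a plain-Python sketch with made-up logits (the values and the helper name `softmax` are assumptions for illustration); on real tensors, `torch.softmax(logits, dim=-1)` does the equivalent.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow for large logits.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for [not a paraphrase, paraphrase].
example_logits = [-1.2, 2.3]
probabilities = softmax(example_logits)

# Same decision rule as above: index 1 means "Paraphrase".
predicted_index = probabilities.index(max(probabilities))
label = "Paraphrase" if predicted_index == 1 else "Not a Paraphrase"
print(label, probabilities)
```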

In [None]:
# Custom Sentences testing
sentence1 = "This tutorial rocks."
sentence2 = "I want to throw rocks at this Tutor."

result = predict_paraphrase(sentence1, sentence2, model, tokenizer, device)
print(f"Sentence 1: {sentence1}")
print(f"Sentence 2: {sentence2}")
print(f"Prediction: {result}")

In [None]:
# Custom Sentences Testing
sentence1 = "I am so tired but I want to stay in this tutorial."
sentence2 = "I am exhausted and forced to be here."

result = predict_paraphrase(sentence1, sentence2, model, tokenizer, device)
print(f"Sentence 1: {sentence1}")
print(f"Sentence 2: {sentence2}")
print(f"Prediction: {result}")

In [None]:
# Custom Sentences Testing
sentence1 = "This field of research is pretty cool."
sentence2 = "I find this line of research very cool."

result = predict_paraphrase(sentence1, sentence2, model, tokenizer, device)
print(f"Sentence 1: {sentence1}")
print(f"Sentence 2: {sentence2}")
print(f"Prediction: {result}")