<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2023-Tutorial-Notebooks/blob/main/tutorial_notebooks/03_tutorial_pipeline_skorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to sklearn Pipeline and skorch

In this tutorial, we will introduce:
* the `Pipeline` class of sklearn
* the skorch library: a framework unifying PyTorch and sklearn.

Based on the notebook of Phillip Ströbel (adjustments from Janis Goldzycher, Andrianos Michail)


## Recap

We are still working with the text classification data from http://qwone.com/~jason/20Newsgroups/.
See last week's notebook for more detailed comments.

In [None]:
import os
import pandas as pd


def create_df(path_to_data, random_state=42):
    """
    Takes the path of a folder containing all the subfolders (which contain the actual documents).
    Builds a pandas dataframe with the text and the label of each document (one row per document).
    :param path_to_data: path to top folder as a string
    :param random_state: integer, seed for shuffling
    :return: shuffled pandas dataframe with the columns 'text' and 'label'
    """
    doc_list = list()  # doc_list now: [[doc<str>, label<str>], ...]

    for category in os.listdir(path_to_data):
        for document in os.listdir(os.path.join(path_to_data, category)):
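            file_path = os.path.join(path_to_data, category, document)
            # latin-1 can decode any byte sequence; the context manager closes the file again
            with open(file_path, 'r', encoding='latin-1') as f:
                doc = f.read().replace('\n', ' ')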
            doc_list.append([doc, category])

    df = pd.DataFrame(doc_list, columns=['text', 'label'])

    return df.sample(frac=1, random_state=random_state)  # shuffle the rows and return the dataframe

In [None]:
train = create_df('../datasets/20news-bydate/20news-bydate-train')
test = create_df('../datasets/20news-bydate/20news-bydate-test')

Let's inspect the first few rows of the training data.

In [None]:
train.head()

| | text | label |
|---|---|---|
| 7492 | From: fragante@unixg.ubc.ca (Gv Fragante) Subj... | comp.sys.ibm.pc.hardware |
| 3546 | Organization: Central Michigan University From... | rec.sport.hockey |
| 5582 | From: dmeier@casbah.acns.nwu.edu (Douglas Meie... | talk.politics.misc |
| 4793 | From: shavlik@cs.wisc.edu (Jude Shavlik) Subje... | sci.med |
| 3813 | From: nhmas@gauss.med.harvard.edu (Mark Shneyd... | rec.sport.hockey |


As usual, we split the labels from the training and the test set.

In [None]:
X_train = train.text
y_train = train.label
X_test = test.text
y_test = test.label

The four objects are pandas `Series`, which the pandas documentation describes as a "One-dimensional ndarray with axis labels". Let's check that their shapes are what we expect.

In [None]:
print('Training set shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test set shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Training set shape:  (11314,)
Training labels shape:  (11314,)
Test set shape:  (7532,)
Test labels shape:  (7532,)


In [None]:
# randomly sample 5000 documents from the training set
X_train = X_train.sample(n=5000, random_state=42)
# select the labels of the sampled documents via their shared index
y_train = y_train.loc[X_train.index]

## Preprocessing and fitting models


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

In [None]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)  # num_docs x num_words
X_test_counts = count_vect.transform(X_test)

In [None]:
tfidf_transformer = TfidfTransformer(smooth_idf=True).fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

In [None]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
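
As a small sketch of what the `LabelEncoder` does, the snippet below encodes a handful of label strings and maps the integers back again. The `toy_*` names and the chosen labels are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings in the style of the newsgroup names.
toy_labels = ["sci.med", "rec.sport.hockey", "sci.med", "talk.politics.misc"]

toy_encoder = LabelEncoder()
toy_ids = toy_encoder.fit_transform(toy_labels)  # strings -> integer ids

print(toy_encoder.classes_)                    # sorted unique labels; position = id
print(toy_ids)                                 # integer id of every label
print(toy_encoder.inverse_transform(toy_ids))  # ids -> original strings
```

`inverse_transform` is handy later on, when integer predictions should be shown as readable newsgroup names.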

In [None]:
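svc = LinearSVC()
svc.fit(X_train_tfidf, y_train)
# cross_val_score clones the estimator and refits it on every fold,
# so it does not depend on the fit above
scores = cross_val_score(svc, X_train_tfidf, y_train, scoring='accuracy', cv=10)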

In [None]:
print(scores)

[0.91607774 0.92402827 0.93992933 0.94787986 0.93015031 0.92219275
 0.92130858 0.92484527 0.93191866 0.93191866]


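Precision, recall, and F1 are just as easy to obtain: `cross_val_predict` returns one out-of-fold prediction per training document, and these predictions can be passed to the metric functions.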

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=50)
sgd_clf.fit(X_train_tfidf, y_train)

y_train_predictions = cross_val_predict(sgd_clf, X_train_tfidf, y_train, cv=3)

In [None]:
# maybe add confusion matrix
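
A minimal sketch of how the imported metric functions could be applied. To keep it self-contained, it uses two small hand-made arrays (`toy_true`, `toy_pred`) as hypothetical stand-ins for `y_train` and `y_train_predictions`. With more than two classes, `precision_score`, `recall_score` and `f1_score` need an `average` argument; `'macro'` (unweighted mean over classes) is just one possible choice assumed here.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical gold labels and predictions for a 3-class problem.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

# average='macro': compute the metric per class, then take the unweighted mean.
toy_precision = precision_score(toy_true, toy_pred, average='macro')
toy_recall = recall_score(toy_true, toy_pred, average='macro')
toy_f1 = f1_score(toy_true, toy_pred, average='macro')

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)

print('Precision:', toy_precision)
print('Recall:   ', toy_recall)
print('F1:       ', toy_f1)
print(toy_cm)
```

In the notebook itself, the same calls would take `y_train` and `y_train_predictions` as arguments.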

## Shortcuts in sklearn: Pipelines
Sklearn lets us build convenient `Pipeline` objects, which chain the preprocessing steps and the final model into a single estimator and greatly simplify both data handling and model training. Consider for example:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

In [None]:
# Define a pipeline: first vectorize, then tfidf, then classify
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])
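
To illustrate what such a pipeline does, the following sketch compares the step-by-step approach with the equivalent `Pipeline` on a toy corpus. The documents and all `toy_*` names are invented for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): class 0 = hockey, class 1 = medicine.
toy_docs = ["the puck hit the goalie", "a great hockey game",
            "the doctor prescribed medicine", "new treatment for the disease"]
toy_y = [0, 0, 1, 1]
toy_new = ["hockey goalie", "medicine for the disease"]

# By hand: every transformer is fitted on the training data and then reused.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = LogisticRegression()
train_features = tfidf.fit_transform(vect.fit_transform(toy_docs))
clf.fit(train_features, toy_y)
manual_pred = clf.predict(tfidf.transform(vect.transform(toy_new)))

# With a pipeline: fit() and predict() run the same chain of calls internally.
toy_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])
toy_pipe.fit(toy_docs, toy_y)
pipe_pred = toy_pipe.predict(toy_new)

print(manual_pred, pipe_pred)
print(np.array_equal(manual_pred, pipe_pred))
```

Because the pipeline takes raw text as input, it can be handed directly to helpers such as `cross_val_score`.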

In [None]:
# Q: What types of ngrams are used here?
# Q: What type of regularization did we use here? How can we change it?

We could even replace the first two steps of the pipeline with a single `TfidfVectorizer`, which is equivalent to a `CountVectorizer` followed by a `TfidfTransformer`.
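
The sketch below checks this equivalence on a toy corpus. The documents and the `toy_docs` name are made up, and both variants use the default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical).
toy_docs = ["the cat sat on the mat", "the dog chased the cat", "a dog and a cat"]

# Two steps: raw counts first, then tf-idf weighting.
two_step = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
two_step_matrix = two_step.fit_transform(toy_docs)

# One step: TfidfVectorizer does both internally.
one_step_matrix = TfidfVectorizer().fit_transform(toy_docs)

same = np.allclose(two_step_matrix.toarray(), one_step_matrix.toarray())
print(two_step_matrix.shape, one_step_matrix.shape)
print(same)
```

Keeping the two steps separate can still be useful, for instance when the raw counts are needed elsewhere.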

In [None]:
text_clf.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression())])

In [None]:
scores = cross_val_score(text_clf, X_train, y_train, scoring='accuracy', cv=10)

In [None]:
scores

array([0.91607774, 0.92402827, 0.93992933, 0.94787986, 0.93015031,
       0.92219275, 0.92130858, 0.92484527, 0.93191866, 0.93191866])

## Model selection - find your best model
For every model you would like to train, there is a plethora of hyperparameters you could set. How do you find the best model? Again, sklearn has a solution: `GridSearchCV`. With grid search cross-validation, you define a hyperparameter space, and one model is trained and evaluated for every combination of parameter values. Keep in mind that every combination is fitted once per fold, so the whole procedure takes considerably longer than a single training run. But let's set up grid search cross-validation, using a new pipeline for a linear SVC.

In [None]:
from sklearn.model_selection import GridSearchCV

text_svc = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC())
])

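# keys follow the pattern <step name>__<parameter name>
# note: with multi_class='crammer_singer', LinearSVC ignores the `loss` setting
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
              'svc__loss': ['hinge', 'squared_hinge'],
              'svc__multi_class': ['ovr', 'crammer_singer']}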

# Q: With how many combinations of parameters will we end up?

gs_svc = GridSearchCV(text_svc, param_grid, cv=5, verbose=1)
gs_svc.fit(X_train, y_train)
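
To get a feeling for the cost of a grid search, the following sketch expands a small grid with `ParameterGrid` and counts the resulting fits. The grid is hypothetical and deliberately different from the one above; the step name `clf` and its parameter `C` are assumptions that refer to a pipeline step holding a `LogisticRegression`.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid; keys follow the pattern <step name>__<parameter name>.
toy_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}
n_folds = 5  # corresponds to cv=5 in GridSearchCV

# ParameterGrid enumerates every combination of the listed values.
combinations = list(ParameterGrid(toy_grid))
for params in combinations:
    print(params)

# Every combination is trained once per fold.
n_fits = len(combinations) * n_folds
print(len(combinations), 'combinations x', n_folds, 'folds =', n_fits, 'fits')
```

With the default `refit=True`, `GridSearchCV` additionally trains the best combination once more on the complete training data.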

In [None]:
svc_df = pd.DataFrame.from_dict(gs_svc.cv_results_)
svc_df.sort_values(by=["rank_test_score"])

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_svc__loss | param_svc__multi_class | param_vect__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 436.592107 | 171.236469 | 2.384123 | 0.070692 | hinge | crammer_singer | (1, 2) | {'svc__loss': 'hinge', 'svc__multi_class': 'cr... | 0.933716 | 0.935484 | 0.933274 | 0.92532 | 0.930592 | 0.931677 | 0.003544 | 1 |
| 7 | 433.028093 | 113.623423 | 2.36251 | 0.188189 | squared_hinge | crammer_singer | (1, 2) | {'svc__loss': 'squared_hinge', 'svc__multi_cla... | 0.933716 | 0.935484 | 0.933274 | 0.92532 | 0.930592 | 0.931677 | 0.003544 | 1 |
| 5 | 35.854953 | 1.288269 | 2.438577 | 0.162266 | squared_hinge | ovr | (1, 2) | {'svc__loss': 'squared_hinge', 'svc__multi_cla... | 0.929739 | 0.931507 | 0.931065 | 0.924437 | 0.930592 | 0.929468 | 0.002583 | 3 |
| 1 | 176.69541 | 15.243509 | 2.438571 | 0.082104 | hinge | ovr | (1, 2) | {'svc__loss': 'hinge', 'svc__multi_class': 'ov... | 0.929739 | 0.932391 | 0.930623 | 0.922669 | 0.929266 | 0.928938 | 0.003311 | 4 |
| 4 | 9.12273 | 0.169448 | 1.026385 | 0.044109 | squared_hinge | ovr | (1, 1) | {'svc__loss': 'squared_hinge', 'svc__multi_cla... | 0.926204 | 0.931507 | 0.929739 | 0.922227 | 0.925729 | 0.927081 | 0.00325 | 5 |
| 2 | 107.631349 | 46.304807 | 1.015226 | 0.0285 | hinge | crammer_singer | (1, 1) | {'svc__loss': 'hinge', 'svc__multi_class': 'cr... | 0.925762 | 0.928414 | 0.927972 | 0.92046 | 0.92794 | 0.926109 | 0.002972 | 6 |
| 6 | 118.908495 | 45.469967 | 0.999624 | 0.033259 | squared_hinge | crammer_singer | (1, 1) | {'svc__loss': 'squared_hinge', 'svc__multi_cla... | 0.925762 | 0.928414 | 0.927972 | 0.92046 | 0.92794 | 0.926109 | 0.002972 | 6 |
| 0 | 23.560421 | 1.666786 | 1.010844 | 0.064981 | hinge | ovr | (1, 1) | {'svc__loss': 'hinge', 'svc__multi_class': 'ov... | 0.927088 | 0.930181 | 0.926204 | 0.91825 | 0.924845 | 0.925314 | 0.003943 | 8 |


In [None]:
best_model = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2))),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC(loss='hinge', multi_class='crammer_singer'))
])

best_model.fit(X_train, y_train)



Pipeline(steps=[('vect', CountVectorizer(ngram_range=(1, 2))),
                ('tfidf', TfidfTransformer()),
                ('svc', LinearSVC(loss='hinge', multi_class='crammer_singer'))])
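
Instead of re-typing the winning settings by hand, a fitted `GridSearchCV` also exposes them directly. The following sketch shows this on a toy corpus with a tiny grid; the documents, the `toy_*` names and `cv=2` are assumptions made only to keep the example small and runnable.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): class 0 = hockey, class 1 = medicine.
toy_docs = [
    "the goalie stopped the puck", "a great hockey game tonight",
    "the team won the hockey final", "the puck hit the goalie mask",
    "the doctor prescribed new medicine", "a treatment for the disease",
    "the patient needs medicine", "the doctor studied the disease",
]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC()),
])
toy_search = GridSearchCV(toy_pipe, {'vect__ngram_range': [(1, 1), (1, 2)]}, cv=2)
toy_search.fit(toy_docs, toy_y)

print(toy_search.best_params_)  # winning parameter combination
print(toy_search.best_score_)   # its mean cross-validated score

# With the default refit=True, best_estimator_ is the winning pipeline,
# already refitted on all the data passed to fit().
toy_best = toy_search.best_estimator_
toy_new = ["hockey puck", "medicine for the patient"]
print(toy_best.predict(toy_new))
```

Applied to the search above, `gs_svc.best_estimator_` would play the role of `best_model`.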

In [None]:
# accuracy_score (imported above) compares the predictions with the gold labels
# and returns the fraction of correctly classified test documents
test_predictions = best_model.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, test_predictions))

Accuracy:  0.8600637280934679
