DAT340, Assignment 4

Romain THEODET

### Exercise question

With the first training set, the model can simply "memorize" that it rains whenever the city is Gothenburg or the month is December,
while for the second training set, no such simple threshold separates the two classes.
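
The difference can be illustrated with a small sketch. The assignment's actual training sets are not reproduced here, so the data below is hypothetical: two binary indicators (city is Gothenburg, month is December), an OR-style labelling standing in for the first set, and an XOR-style labelling as one assumed reading of the second. A plain perceptron, written out by hand, fits the first perfectly and cannot fit the second.

```python
import numpy as np


def train_perceptron(X, y, epochs=100):
    """Plain perceptron with a bias term; labels y must be in {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * w.dot(xi) <= 0:  # misclassified or exactly on the boundary
                w += yi * xi
    return w


def accuracy(w, X, y):
    """Fraction of rows of X whose predicted sign matches y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.where(Xb.dot(w) > 0, 1, -1) == y))


# Hypothetical encoding: column 0 = "city is Gothenburg", column 1 = "month is December".
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
y_or = np.array([1, 1, 1, -1])    # rain if Gothenburg OR December
y_xor = np.array([-1, 1, 1, -1])  # rain if exactly one of the two holds (assumed)

w_or = train_perceptron(X, y_or)
w_xor = train_perceptron(X, y_xor)
print("OR-style labels: ", accuracy(w_or, X, y_or))
print("XOR-style labels:", accuracy(w_xor, X, y_xor))
```

No linear decision rule gets more than three of the four XOR-style points right, which is one way to read "the threshold isn't that obvious".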

In [1]:
from sklearn.model_selection import train_test_split


def read_data(corpus_file):
    X = []
    Y = []
    with open(corpus_file, encoding="utf-8") as f:
        for line in f:
            _, y, _, x = line.split(maxsplit=3)
            X.append(x.strip())
            Y.append(y)
    return X, Y


# Read all the documents.
X, Y = read_data("data/all_sentiment_shuffled.txt")

# Split into training and test parts.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)


### SVC implementation

In [2]:
import time
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
import pegasos

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(k=1000),
    Normalizer(),
    pegasos.PegasosSVC()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for SVC: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for SVC: {:.4f}.".format(accuracy_score(Ytest, Yguess)))


Iteration 1*10^3, average loss: 1.0728
Iteration 2*10^3, average loss: 0.7800
Iteration 3*10^3, average loss: 0.6764
Iteration 4*10^3, average loss: 0.6257
Iteration 5*10^3, average loss: 0.5856
Iteration 6*10^3, average loss: 0.5607
Iteration 7*10^3, average loss: 0.5429
Iteration 8*10^3, average loss: 0.5301
Iteration 9*10^3, average loss: 0.5169
Iteration 1*10^4, average loss: 0.5053
Iteration 2*10^4, average loss: 0.4623
Iteration 3*10^4, average loss: 0.4480
Iteration 4*10^4, average loss: 0.4415
Iteration 5*10^4, average loss: 0.4377
Iteration 6*10^4, average loss: 0.4359
Iteration 7*10^4, average loss: 0.4331
Iteration 8*10^4, average loss: 0.4315
Iteration 9*10^4, average loss: 0.4299

Training time for SVC: 6.20 sec.
Accuracy for SVC: 0.8225.
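
The source of `pegasos.PegasosSVC` is not shown in this notebook. As a rough sketch of what such a classifier typically does, the function below implements the standard Pegasos update for the hinge loss on dense NumPy arrays. The function name, the regularization strength `lam`, the iteration count and the seed are assumptions for illustration, not the values used above.

```python
import numpy as np


def pegasos_svc_fit(X, y, lam=1e-4, n_iter=100_000, seed=0):
    """Pegasos with hinge loss. X is a dense (n, d) array, y is in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(X.shape[0])      # pick one training example at random
        eta = 1.0 / (lam * t)             # decreasing step size
        margin = y[i] * w.dot(X[i])       # margin computed with the current w
        w *= 1.0 - eta * lam              # shrink step from the L2 regularizer
        if margin < 1.0:                  # hinge loss is active for this example
            w += eta * y[i] * X[i]
    return w


# Toy usage on four hand-made points (hypothetical data).
X_toy = np.array([[1.0, 0.2], [0.8, -0.1], [-1.0, 0.1], [-0.9, -0.2]])
y_toy = np.array([1, 1, -1, -1])
w_toy = pegasos_svc_fit(X_toy, y_toy, lam=0.01, n_iter=2000)
print(np.sign(X_toy.dot(w_toy)))
```

The shrink step touches every coordinate of `w` on every iteration, which is the cost the sparse variants later in the notebook try to avoid.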


### Logistic regression implementation

In [6]:
import time
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(k=1000),
    Normalizer(),
    pegasos.PegasosLR()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for LR: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for LR: {:.4f}.".format(accuracy_score(Ytest, Yguess)))

Iteration 1*10^3, average loss: 0.8854
Iteration 2*10^3, average loss: 0.6788
Iteration 3*10^3, average loss: 0.6117
Iteration 4*10^3, average loss: 0.5759
Iteration 5*10^3, average loss: 0.5557
Iteration 6*10^3, average loss: 0.5413
Iteration 7*10^3, average loss: 0.5308
Iteration 8*10^3, average loss: 0.5237
Iteration 9*10^3, average loss: 0.5166
Iteration 1*10^4, average loss: 0.5126
Iteration 2*10^4, average loss: 0.4928
Iteration 3*10^4, average loss: 0.4864
Iteration 4*10^4, average loss: 0.4829
Iteration 5*10^4, average loss: 0.4810
Iteration 6*10^4, average loss: 0.4789
Iteration 7*10^4, average loss: 0.4779
Iteration 8*10^4, average loss: 0.4773
Iteration 9*10^4, average loss: 0.4766

Training time for LR: 11.19 sec.
Accuracy for LR: 0.8061.
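
For comparison, a logistic-regression variant only changes the loss gradient. The sketch below is not the code of `pegasos.PegasosLR` (which is not shown here); it assumes the same stochastic update scheme with the log loss `log(1 + exp(-y * w.x))` in place of the hinge loss. The function name and default parameters are hypothetical.

```python
import numpy as np


def pegasos_lr_fit(X, y, lam=1e-4, n_iter=100_000, seed=0):
    """Pegasos-style SGD with log loss. X is a dense (n, d) array, y is in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(X.shape[0])
        eta = 1.0 / (lam * t)
        z = y[i] * w.dot(X[i])
        # The negative gradient of log(1 + exp(-z)) w.r.t. w is y * x / (1 + exp(z)).
        # 1 / (1 + exp(z)) is written with tanh to avoid overflow for large z.
        coeff = 0.5 * (1.0 - np.tanh(0.5 * z))
        w = (1.0 - eta * lam) * w + eta * coeff * y[i] * X[i]
    return w


# Toy usage on four hand-made points (hypothetical data).
X_toy = np.array([[1.0, 0.2], [0.8, -0.1], [-1.0, 0.1], [-0.9, -0.2]])
y_toy = np.array([1, 1, -1, -1])
w_toy = pegasos_lr_fit(X_toy, y_toy, lam=0.01, n_iter=2000)
print(np.sign(X_toy.dot(w_toy)))
```

Unlike the hinge loss, the log-loss gradient is never exactly zero, so every iteration adds a scaled copy of the sampled example to `w`. This is one plausible reason for the longer training time reported above, although the notebook does not investigate it.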


### Sparse matrices

Using sparse matrices, we can (try to) speed up the fitting process.

Here, we implemented tasks 1.b and 1.c of the assignment.
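
The `pegasos_sparse` module itself is not included in this notebook. The sketch below shows two techniques commonly used to make Pegasos cheaper on sparse input, on the assumption that they roughly correspond to the tasks mentioned above: computing dot products and updates only over the non-zero entries of a CSR row, and storing the weight vector as `scale * v` so that the shrink step becomes a single scalar multiplication. All names and default values are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def pegasos_svc_fit_sparse(X, y, lam=1e-4, n_iter=100_000, seed=0):
    """Hinge-loss Pegasos on a CSR matrix, with the weights stored as scale * v."""
    X = csr_matrix(X)
    rng = np.random.default_rng(seed)
    v = np.zeros(X.shape[1])
    scale = 1.0
    for t in range(1, n_iter + 1):
        i = rng.integers(X.shape[0])
        # Non-zero column indices and values of row i.
        start, end = X.indptr[i], X.indptr[i + 1]
        idx, vals = X.indices[start:end], X.data[start:end]
        eta = 1.0 / (lam * t)
        margin = y[i] * scale * v[idx].dot(vals)   # sparse dot product
        if t == 1:
            # The shrink factor 1 - eta * lam is exactly zero on the first step.
            v[:] = 0.0
            scale = 1.0
        else:
            scale *= 1.0 - 1.0 / t                 # eta * lam == 1 / t
        if margin < 1.0:
            # Update only the coordinates that are non-zero in this example.
            v[idx] += (eta * y[i] / scale) * vals
    return scale * v


# Toy usage on four hand-made points (hypothetical data).
X_toy = csr_matrix(np.array([[1.0, 0.0], [0.8, 0.1], [-1.0, 0.0], [-0.9, -0.1]]))
y_toy = np.array([1, 1, -1, -1])
w_toy = pegasos_svc_fit_sparse(X_toy, y_toy, lam=0.01, n_iter=2000)
print(np.sign(X_toy.dot(w_toy)))
```

With this representation the per-iteration cost depends on the number of non-zero features in the sampled document rather than on the vocabulary size.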

### SVC implementation using sparse matrices

In [8]:
import pegasos_sparse

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    Normalizer(),
    pegasos_sparse.PegasosSVCSparse()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for sparse SVC: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for sparse SVC: {:.4f}.".format(accuracy_score(Ytest, Yguess)))

Iteration 1*10^3, average loss: 9.7449
Iteration 2*10^3, average loss: 5.7722
Iteration 3*10^3, average loss: 4.2145
Iteration 4*10^3, average loss: 3.3652
Iteration 5*10^3, average loss: 2.8044
Iteration 6*10^3, average loss: 2.4032
Iteration 7*10^3, average loss: 2.1153
Iteration 8*10^3, average loss: 1.8971
Iteration 9*10^3, average loss: 1.7260
Iteration 1*10^4, average loss: 1.5811
Iteration 2*10^4, average loss: 0.8917
Iteration 3*10^4, average loss: 0.6389
Iteration 4*10^4, average loss: 0.5067
Iteration 5*10^4, average loss: 0.4231
Iteration 6*10^4, average loss: 0.3653
Iteration 7*10^4, average loss: 0.3235
Iteration 8*10^4, average loss: 0.2920
Iteration 9*10^4, average loss: 0.2667

Training time for sparse SVC: 7.62 sec.
Accuracy for sparse SVC: 0.8355.


### LR implementation using sparse matrices

In [11]:
# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    Normalizer(),
    pegasos_sparse.PegasosLRSparse()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for sparse LR: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for sparse LR: {:.4f}.".format(accuracy_score(Ytest, Yguess)))


Iteration 1*10^3, average loss: 7.8656
Iteration 2*10^3, average loss: 4.5425
Iteration 3*10^3, average loss: 3.2869
Iteration 4*10^3, average loss: 2.5816
Iteration 5*10^3, average loss: 2.1335
Iteration 6*10^3, average loss: 1.8311
Iteration 7*10^3, average loss: 1.6075
Iteration 8*10^3, average loss: 1.4418
Iteration 9*10^3, average loss: 1.3104
Iteration 1*10^4, average loss: 1.2041
Iteration 2*10^4, average loss: 0.7113
Iteration 3*10^4, average loss: 0.5434
Iteration 4*10^4, average loss: 0.4586
Iteration 5*10^4, average loss: 0.4083
Iteration 6*10^4, average loss: 0.3740
Iteration 7*10^4, average loss: 0.3499
Iteration 8*10^4, average loss: 0.3310
Iteration 9*10^4, average loss: 0.3167

Training time for sparse LR: 32.32 sec.
Accuracy for sparse LR: 0.8410.
