DAT340, Assignment 4

Romain THEODET

### Exercise question

With the first training set, the model can simply "memorize" that it rains in Gothenburg or in December,
whereas with the second training set, no such clear-cut threshold exists.

In [66]:
from sklearn.model_selection import train_test_split


def read_data(corpus_file):
    X = []
    Y = []
    with open(corpus_file, encoding="utf-8") as f:
        for line in f:
            _, y, _, x = line.split(maxsplit=3)
            X.append(x.strip())
            Y.append(y)
    return X, Y


# Read all the documents.
X, Y = read_data("data/all_sentiment_shuffled.txt")

# Split into training and test parts.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)
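
To make the parsing step in `read_data` concrete, here is a small sketch of what `split(maxsplit=3)` does to a single line. The sample line is hypothetical and only assumes the four-field layout the function expects: the second field is the label, the fourth is the document text, and the first and third are discarded.

```python
# Hypothetical input line in the four-field layout that read_data expects.
line = "music neg 241.txt i bought this album because i loved the title song .\n"

# Split on whitespace at most three times, so the document text stays in one piece.
_, label, _, text = line.split(maxsplit=3)

# The last field keeps its trailing newline, hence the strip() in read_data.
text = text.strip()

print(label)
print(text)
```

Without `maxsplit=3`, every token of the document would become its own field and the four-way unpacking would fail.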


### SVC implementation

In [67]:
import time
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
import pegasos

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(k=1000),
    Normalizer(),
    pegasos.PegasosSVC()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for SVC: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for SVC: {:.4f}.".format(accuracy_score(Ytest, Yguess)))


Iteration 1*10^3, average loss: 0.8985
Iteration 2*10^3, average loss: 0.7027
Iteration 3*10^3, average loss: 0.6185
Iteration 4*10^3, average loss: 0.5775
Iteration 5*10^3, average loss: 0.5494
Iteration 6*10^3, average loss: 0.5290
Iteration 7*10^3, average loss: 0.5138
Iteration 8*10^3, average loss: 0.5043
Iteration 9*10^3, average loss: 0.4925
Iteration 1*10^4, average loss: 0.4858
Iteration 2*10^4, average loss: 0.4545
Iteration 3*10^4, average loss: 0.4410
Iteration 4*10^4, average loss: 0.4359
Iteration 5*10^4, average loss: 0.4336
Iteration 6*10^4, average loss: 0.4318
Iteration 7*10^4, average loss: 0.4282
Iteration 8*10^4, average loss: 0.4275
Iteration 9*10^4, average loss: 0.4274

Training time for SVC: 3.28 sec.
Accuracy for SVC: 0.8196.


### Logistic regression implementation

In [68]:
import time
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(k=1000),
    Normalizer(),
    pegasos.PegasosLR()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()
print("Training time for LR: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for LR: {:.4f}.".format(accuracy_score(Ytest, Yguess)))

Iteration 1*10^3, average loss: 0.6310
Iteration 2*10^3, average loss: 0.5603
Iteration 3*10^3, average loss: 0.5334
Iteration 4*10^3, average loss: 0.5182
Iteration 5*10^3, average loss: 0.5115
Iteration 6*10^3, average loss: 0.5082
Iteration 7*10^3, average loss: 0.5024
Iteration 8*10^3, average loss: 0.5002
Iteration 9*10^3, average loss: 0.4971
Iteration 1*10^4, average loss: 0.4959
Iteration 2*10^4, average loss: 0.4838
Iteration 3*10^4, average loss: 0.4802
Iteration 4*10^4, average loss: 0.4783
Iteration 5*10^4, average loss: 0.4774
Iteration 6*10^4, average loss: 0.4762
Iteration 7*10^4, average loss: 0.4759
Iteration 8*10^4, average loss: 0.4755
Iteration 9*10^4, average loss: 0.4753
Training time for LR: 4.46 sec.
Accuracy for LR: 0.8078.


### Sparse matrices

Using sparse matrices, we can (try to) speed up the fitting process.

Here, we implemented tasks 1.b and 1.c of the assignment.
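
The `pegasos_sparse` module is not shown in this notebook, so the following is only a minimal sketch of the general idea, not the implementation used here. It assumes a training row is given as parallel `indices` and `values` lists, which is the same information SciPy's CSR format stores in its `indices` and `data` arrays. The names `sparse_dot` and `pegasos_step` are hypothetical.

```python
def sparse_dot(w, indices, values):
    """Dot product between a dense weight list and a sparse row."""
    return sum(w[i] * v for i, v in zip(indices, values))


def pegasos_step(w, indices, values, y, t, lam):
    """One Pegasos hinge-loss update, in place; y must be +1 or -1."""
    eta = 1.0 / (lam * t)                   # learning rate at step t
    score = sparse_dot(w, indices, values)  # computed before the weights shrink
    shrink = 1.0 - eta * lam
    for j in range(len(w)):                 # regularization: scales every weight
        w[j] *= shrink
    if y * score < 1.0:                     # margin violated: add the gradient term
        for i, v in zip(indices, values):   # touches only the nonzero features
            w[i] += eta * y * v


# Toy example: 3 features, one row with nonzero entries at positions 0 and 2.
w = [0.0, 0.0, 0.0]
pegasos_step(w, [0, 2], [1.0, 2.0], y=1, t=1, lam=1.0)
print(w)
```

In this sketch the dot product and the gradient update only visit the nonzero entries of the row, but the scaling loop still visits every weight, so the cost per step stays proportional to the number of features unless the scaling is handled separately.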

### SVC implementation using sparse matrices

In [69]:
import pegasos_sparse

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    Normalizer(),
    pegasos_sparse.PegasosSVCSparse()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for sparse SVC: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for sparse SVC: {:.4f}.".format(accuracy_score(Ytest, Yguess)))

Iteration 1*10^3, average loss: 0.7498
Iteration 2*10^3, average loss: 0.6575
Iteration 3*10^3, average loss: 0.6159
Iteration 4*10^3, average loss: 0.5939
Iteration 5*10^3, average loss: 0.5819
Iteration 6*10^3, average loss: 0.5784
Iteration 7*10^3, average loss: 0.5706
Iteration 8*10^3, average loss: 0.5646
Iteration 9*10^3, average loss: 0.5624
Iteration 1*10^4, average loss: 0.5595
Iteration 2*10^4, average loss: 0.5418
Iteration 3*10^4, average loss: 0.5343
Iteration 4*10^4, average loss: 0.5283
Iteration 5*10^4, average loss: 0.5245
Iteration 6*10^4, average loss: 0.5216
Iteration 7*10^4, average loss: 0.5187
Iteration 8*10^4, average loss: 0.5165
Iteration 9*10^4, average loss: 0.5145

Training time for sparse SVC: 4.87 sec.
Accuracy for sparse SVC: 0.8053.


### LR implementation using sparse matrices

In [70]:
# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    Normalizer(),
    pegasos_sparse.PegasosLRSparse()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()

print()
print("Training time for sparse LR: {:.2f} sec.".format(t1 - t0))

# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print("Accuracy for sparse LR: {:.4f}.".format(accuracy_score(Ytest, Yguess)))


Iteration 1*10^3, average loss: 0.6447
Iteration 2*10^3, average loss: 0.6026
Iteration 3*10^3, average loss: 0.5900
Iteration 4*10^3, average loss: 0.5827
Iteration 5*10^3, average loss: 0.5783
Iteration 6*10^3, average loss: 0.5757
Iteration 7*10^3, average loss: 0.5740
Iteration 8*10^3, average loss: 0.5717
Iteration 9*10^3, average loss: 0.5699
Iteration 1*10^4, average loss: 0.5690
Iteration 2*10^4, average loss: 0.5620
Iteration 3*10^4, average loss: 0.5605
Iteration 4*10^4, average loss: 0.5591
Iteration 5*10^4, average loss: 0.5584
Iteration 6*10^4, average loss: 0.5577
Iteration 7*10^4, average loss: 0.5573
Iteration 8*10^4, average loss: 0.5570
Iteration 9*10^4, average loss: 0.5566

Training time for sparse LR: 11.36 sec.
Accuracy for sparse LR: 0.7948.
