# <center>Text Classification with fastText</center>

This project introduces text classification with the fastText library. We first obtain the dataset, then split it into training and validation sets, train several models, and finally predict labels for unseen input.

In [1]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os
os.chdir('/content/gdrive/My Drive/AppDataSci')

In [10]:
!head -n 10 goodreads_reviews.txt

__label__negative Quarantine: The Loners is the first in a series by Lex Thomas (a pen name for the writing team of Lex Hrabe and Thomas Voorhies).
__label__negative I have an qualm with this book, and it is that it took me out of my Brandon Sanderson haze.
__label__negative What a waste of time.
__label__positive Really interesting look at life in the royal court.
__label__negative DNF.
__label__positive If you're looking for a book that will creep you out, stay you up all night, get paranoid, and sleep with lights on (that's the exact thing happen to me).
__label__positive I've been reading a lot of classics lately, which has been enjoyable.
__label__positive 4.5 Stars!
__label__negative Haunting and lovely as always, though I must say it's not my favorite of his.
__label__negative Interesting.


Each line starts with a label followed by the review text. Here __label__negative marks a bad review and __label__positive marks a good review.
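
To make the line format concrete, here is a minimal sketch of how such a line can be separated into its label and its text. The helper name `parse_line` is hypothetical, and the sketch assumes the default `__label__` prefix and treats every whitespace-separated token that starts with it as a label.

```python
LABEL_PREFIX = "__label__"  # default label prefix (assumed unchanged)

def parse_line(line):
    """Split one line of the dataset into (labels, text)."""
    tokens = line.strip().split()
    # Tokens carrying the prefix are labels; everything else is review text.
    labels = [t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, " ".join(words)

labels, text = parse_line("__label__negative What a waste of time.")
print(labels, text)
```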

In [18]:
!wc goodreads_reviews.txt

  1330981  17187179 111987316 goodreads_reviews.txt


We've got roughly 1.3M samples in our dataset. Let's split it into a training set of the first 1M samples and a validation set of the remaining 330,981 samples (about 331k).

In [17]:
!head -n 1000000 goodreads_reviews.txt > reviews.train
!tail -n 330981 goodreads_reviews.txt > reviews.valid
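
Since 1,000,000 + 330,981 = 1,330,981, the `head` and `tail` calls partition the file without overlap. They do keep the original line order, though, so if the file happens to be sorted in some way (by book or by date, say) the two sets could differ systematically. A shuffled split is one alternative. The sketch below is only an illustration: the helper `split_lines` and the fixed seed are hypothetical, and it assumes all lines fit in memory.

```python
import random

def split_lines(lines, n_train, seed=0):
    """Shuffle a copy of `lines` and cut it into (train, valid)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]

# Toy stand-in for the lines of goodreads_reviews.txt
toy = [f"__label__positive review {i}" for i in range(10)]
train, valid = split_lines(toy, n_train=7)
print(len(train), len(valid))
```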

In [7]:
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!make
!cp fasttext ../
%cd ..

Cloning into 'fastText'...
remote: Enumerating objects: 3854, done.
remote: Total 3854 (delta 0), reused 0 (delta 0), pack-reused 3854
Receiving objects: 100% (3854/3854), 8.22 MiB | 10.63 MiB/s, done.
Resolving deltas: 100% (2417/2417), done.
Checking out files: 100% (526/526), done.
/content/gdrive/My Drive/AppDataSci/fastText
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG

In [19]:
!./fastText/fasttext supervised -input ./reviews.train -output ./reviews_model1

Read 13M words
Number of words:  316262
Number of labels: 2
Progress: 100.0% words/sec/thread:  395026 lr:  0.000000 avg.loss:  0.511251 ETA:   0h 0m 0s


In [27]:
!echo "Really interesting look at life in the royal court." | ./fastText/fasttext predict reviews_model1.bin -

__label__positive


In [21]:
!./fastText/fasttext test reviews_model1.bin ./reviews.valid

N	330981
P@1	0.745
R@1	0.745


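Precision and recall are identical here. Each review carries exactly one label and the model predicts exactly one label per review, so every misclassification counts once as a false positive (for the predicted label) and once as a false negative (for the true label). The number of false positives therefore equals the number of false negatives, and precision, recall, and F1 all take the same value.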
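
The equality of the two metrics can be checked on a toy example. This is a minimal sketch, assuming one true label and one predicted label per example; the function name `precision_recall_at_1` and the toy data are made up for illustration.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Precision and recall when every example has one true and one predicted label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # correct / number of predictions
    recall = correct / len(true_labels)          # correct / number of true labels
    return precision, recall

truth = ["positive", "negative", "negative", "positive"]
preds = ["positive", "positive", "negative", "positive"]
p, r = precision_recall_at_1(truth, preds)
print(p, r)  # 0.75 0.75
```

Both denominators equal the number of examples, which is why the two values cannot differ in this setting.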

In [22]:
!./fastText/fasttext supervised -input reviews.train -output reviews_model2 -epoch 25

Read 13M words
Number of words:  316262
Number of labels: 2
Progress: 100.0% words/sec/thread:  394267 lr:  0.000000 avg.loss:  0.480929 ETA:   0h 0m 0s


In [23]:
!./fastText/fasttext test reviews_model2.bin reviews.valid

N	330981
P@1	0.741
R@1	0.741


In [24]:
!./fastText/fasttext supervised -input reviews.train -output reviews_model3 -lr 1.0 -epoch 25 -wordNgrams 2

Read 13M words
Number of words:  316262
Number of labels: 2
Progress: 100.0% words/sec/thread:  242138 lr:  0.000000 avg.loss:  4.187792 ETA:   0h 0m 0s


In [25]:
!./fastText/fasttext test reviews_model3.bin reviews.valid

N	330981
P@1	0.718
R@1	0.718
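
On this split, neither more epochs (P@1 0.741) nor the higher learning rate with bigrams (P@1 0.718) improved on the default settings (P@1 0.745). The third run also ended with a much higher average loss (4.19 versus about 0.5), which may indicate that a learning rate of 1.0 is too high for this data. To compare runs without reading the numbers by eye, the `test` output can be parsed; the sketch below uses a hypothetical helper `parse_test_output` and the three outputs copied from the cells above.

```python
def parse_test_output(output):
    """Turn the tab-separated lines printed by the test command into a dict of floats."""
    metrics = {}
    for line in output.strip().splitlines():
        name, value = line.split()
        metrics[name] = float(value)
    return metrics

# Outputs copied from the three test cells above
runs = {
    "reviews_model1": "N\t330981\nP@1\t0.745\nR@1\t0.745",
    "reviews_model2": "N\t330981\nP@1\t0.741\nR@1\t0.741",
    "reviews_model3": "N\t330981\nP@1\t0.718\nR@1\t0.718",
}
scores = {name: parse_test_output(out)["P@1"] for name, out in runs.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # reviews_model1 0.745
```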


<h2>Multiclass</h2>

In [28]:
!head -n 10 goodreads_books.txt

__label__2 Quarantine: The Loners is the first in a series by Lex Thomas (a pen name for the writing team of Lex Hrabe and Thomas Voorhies).
__label__3 I have an qualm with this book, and it is that it took me out of my Brandon Sanderson haze.
__label__1 What a waste of time.
__label__4 Really interesting look at life in the royal court.
__label__1 DNF.
__label__5 If you're looking for a book that will creep you out, stay you up all night, get paranoid, and sleep with lights on (that's the exact thing happen to me).
__label__4 I've been reading a lot of classics lately, which has been enjoyable.
__label__4 4.5 Stars!
__label__3 Haunting and lovely as always, though I must say it's not my favorite of his.
__label__2 Interesting.


In [29]:
!wc goodreads_books.txt

  1330981  17187179 102670449 goodreads_books.txt


In [30]:
!head -n 1000000 goodreads_books.txt > books.train
!tail -n 330981 goodreads_books.txt > books.valid

In [31]:
!./fastText/fasttext supervised -input ./books.train -output ./books_model1

Read 13M words
Number of words:  316262
Number of labels: 5
Progress: 100.0% words/sec/thread:  350414 lr:  0.000000 avg.loss:  1.157621 ETA:   0h 0m 0s


In [32]:
!echo "Really interesting look at life in the royal court." | ./fastText/fasttext predict books_model1.bin -

__label__4


In [33]:
!./fastText/fasttext test books_model1.bin ./books.valid

N	330981
P@1	0.487
R@1	0.487


In [34]:
!./fastText/fasttext supervised -input books.train -output books_model2 -epoch 25

Read 13M words
Number of words:  316262
Number of labels: 5
Progress: 100.0% words/sec/thread:  347216 lr:  0.000000 avg.loss:  1.073316 ETA:   0h 0m 0s


In [35]:
!./fastText/fasttext test books_model2.bin books.valid

N	330981
P@1	0.475
R@1	0.475


In [36]:
!./fastText/fasttext supervised -input books.train -output books_model3 -lr 1.0 -epoch 25 -wordNgrams 2

Read 13M words
Number of words:  316262
Number of labels: 5
Progress: 100.0% words/sec/thread:  200720 lr:  0.000000 avg.loss:  8.792355 ETA:   0h 0m 0s


In [37]:
!./fastText/fasttext test books_model3.bin books.valid

N	330981
P@1	0.434
R@1	0.434


In [38]:
!head -n 10 books_reviews.txt

__label__poor Quarantine: The Loners is the first in a series by Lex Thomas (a pen name for the writing team of Lex Hrabe and Thomas Voorhies).
__label__average I have an qualm with this book, and it is that it took me out of my Brandon Sanderson haze.
__label__terrible What a waste of time.
__label__good Really interesting look at life in the royal court.
__label__terrible DNF.
__label__excellent If you're looking for a book that will creep you out, stay you up all night, get paranoid, and sleep with lights on (that's the exact thing happen to me).
__label__good I've been reading a lot of classics lately, which has been enjoyable.
__label__good 4.5 Stars!
__label__average Haunting and lovely as always, though I must say it's not my favorite of his.
__label__poor Interesting.
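
Comparing these ten lines with the first ten lines of goodreads_books.txt suggests that books_reviews.txt is the same data with the numeric ratings renamed (1 to terrible, 2 to poor, 3 to average, 4 to good, 5 to excellent). That mapping is inferred only from the sample lines shown, so treat it as an assumption. Under that assumption, the relabeling could be sketched as follows; `STAR_TO_NAME` and `rename_label` are hypothetical names.

```python
import re

# Mapping inferred from the ten sample lines shown above (assumption).
STAR_TO_NAME = {"1": "terrible", "2": "poor", "3": "average", "4": "good", "5": "excellent"}

def rename_label(line):
    """Replace a leading __label__<rating> with its named equivalent."""
    match = re.match(r"__label__([1-5])\b", line)
    if match is None:
        return line  # leave lines without a numeric label untouched
    return "__label__" + STAR_TO_NAME[match.group(1)] + line[match.end():]

print(rename_label("__label__1 DNF."))  # __label__terrible DNF.
```

If the two files really differ only in label names, the models trained on them should score almost identically, which is consistent with the P@1 values reported for the two sets of experiments.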


In [39]:
!wc books_reviews.txt

  1330981  17187179 109730560 books_reviews.txt


In [40]:
!head -n 1000000 books_reviews.txt > sentences.train
!tail -n 330981 books_reviews.txt > sentences.valid

In [41]:
!./fastText/fasttext supervised -input ./sentences.train -output ./sentences_model1

Read 13M words
Number of words:  316262
Number of labels: 5
Progress: 100.0% words/sec/thread:  321170 lr:  0.000000 avg.loss:  1.157532 ETA:   0h 0m 0s


In [42]:
!echo "Really interesting look at life in the royal court." | ./fastText/fasttext predict sentences_model1.bin -

__label__good


In [43]:
!./fastText/fasttext test sentences_model1.bin ./sentences.valid

N	330981
P@1	0.487
R@1	0.487


In [44]:
!./fastText/fasttext supervised -input sentences.train -output sentences_model2 -epoch 25

Read 13M words
Number of words:  316262
Number of labels: 5
Progress: 100.0% words/sec/thread:  324285 lr:  0.000000 avg.loss:  1.057963 ETA:   0h 0m 0s


In [45]:
!./fastText/fasttext test sentences_model2.bin sentences.valid

N	330981
P@1	0.475
R@1	0.475


In [46]:
!./fastText/fasttext supervised -input sentences.train -output sentences_model3 -lr 1.0 -epoch 25 -wordNgrams 2

Read 13M words
Number of words:  316262
Number of labels: 5
Progress: 100.0% words/sec/thread:  191249 lr:  0.000000 avg.loss:  8.703199 ETA:   0h 0m 0s


In [47]:
!./fastText/fasttext test sentences_model3.bin sentences.valid

N	330981
P@1	0.432
R@1	0.432
