<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/bow_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

You can wget this notebook from https://raw.githubusercontent.com/TurkuNLP/intro-to-nlp/master/bow_classifier.ipynb

# Text classification

## Objectives of this lecture

* Introduce text classification as a task
* Introduce basic approaches to text classification, especially the bag-of-words model
* Show practical implementation of the concepts in the previous two lectures (intro and machine learning primer)
* Train a simple classifier on the IMDB dataset

## Purpose of these materials

* These materials support, but do not replace, the lectures
* Attending the lectures is highly recommended
* Take notes!

## Text classification

* Text in, class label out
* Often framed as "document classification": document in, class label(s) out
  * Movie review -> star rating
  * Email -> ham or spam
  * News article -> list of topics from a pre-defined set of topic categories
  * Comment in online discussion -> abusive or not
  * Customer email / call to customer service -> topic and relevant business process
  * Customer feedback -> needs reaction or not
  * ...
  
 * Document classification can be seen as a supervised classification problem with pre-defined classes
 * A direct application of the standard approach:
   1. prepare training/dev/test data
   2. extract features
   3. train your favorite classifier
   4. test on held-out data
 

# Bag-of-words model

* BoW is the simplest way to model documents for classification
* The document is reduced to a **set** of features, in the simplest case the words


* Feature vector: a vector with as many dimensions as we have unique features, and a non-zero value set for every feature present in our example
* Values: 1/0 (present/absent) or, often, TF-IDF weights (why?)
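
As a rough illustration of the two kinds of values, here is a minimal pure-Python sketch. The three toy documents, the whitespace tokenization, and the helper names `binary_vector` and `tfidf_vector` are assumptions made for this example, and the TF-IDF formula is the plain textbook variant (count times log of inverse document frequency), not the smoothed one scikit-learn uses.

```python
import math
from collections import Counter

# Three made-up documents, tokenized by whitespace (an assumption of this sketch)
docs = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great great",
]
tokenized = [doc.split() for doc in docs]

# One dimension per unique word, in alphabetical order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def binary_vector(tokens):
    """1 if the word occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def tfidf_vector(tokens):
    """Word count multiplied by log(number of documents / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for word in vocabulary:
        doc_freq = sum(1 for doc_tokens in tokenized if word in doc_tokens)
        vector.append(counts[word] * math.log(n_docs / doc_freq))
    return vector

binary_vectors = [binary_vector(tokens) for tokens in tokenized]
tfidf_vectors = [tfidf_vector(tokens) for tokens in tokenized]

print(vocabulary)
print(binary_vectors[0])
print([round(value, 2) for value in tfidf_vectors[0]])
```

Words such as "the" and "was" occur in every document, so their TF-IDF weight here is zero, while a word that appears in only one document gets the largest weight. That is one way to approach the "why?" above.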

# IMDB data

* Movie review sentiment positive/negative
* Some 25,000 examples, 50:50 split of classes (why is this number relevant?)
* Current state-of-the-art is about 95% accuracy (what is accuracy?)
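
To make the two questions above concrete, here is a small sketch of accuracy and of the accuracy you would get by always predicting the most common class. The function names and the toy label lists are assumptions made for this example; they are not part of the IMDB data.

```python
from collections import Counter

def accuracy(gold_labels, predicted_labels):
    """Fraction of examples where the prediction equals the gold label."""
    correct = sum(1 for gold, pred in zip(gold_labels, predicted_labels) if gold == pred)
    return correct / len(gold_labels)

def majority_baseline_accuracy(gold_labels):
    """Accuracy of a 'classifier' that always answers with the most common class."""
    _, count = Counter(gold_labels).most_common(1)[0]
    return count / len(gold_labels)

balanced = ["pos"] * 50 + ["neg"] * 50   # 50:50 split, like IMDB
skewed = ["pos"] * 95 + ["neg"] * 5      # hypothetical skewed data set

print(majority_baseline_accuracy(balanced))  # 0.5
print(majority_baseline_accuracy(skewed))    # 0.95
print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))  # 0.75
```

With a 50:50 split the trivial baseline sits at 50%, so any accuracy well above that reflects something learned. On the skewed data, 95% accuracy could be reached without looking at the text at all.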

In [1]:
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/imdb_train.json


--2022-01-26 13:42:07--  https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/imdb_train.json
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/TurkuNLP/intro-to-nlp/master/Data/imdb_train.json [following]
--2022-01-26 13:42:07--  https://raw.githubusercontent.com/TurkuNLP/intro-to-nlp/master/Data/imdb_train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33944099 (32M) [text/plain]
Saving to: ‘imdb_train.json’


2022-01-26 13:42:08 (274 MB/s) - ‘imdb_train.json’ saved [33944099/33944099]



In [2]:
import json
import random
with open("imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) #play it safe! (why?)
print("class label:", data[0]["class"])
print("text:",data[0]["text"])

class label: neg
text: Everything in this film is bad , the story , the acting , the effects but its funny , funny , funny !!!Scott Valentine with the army uniform thats ten sizes too big is so bad with the permanent attempt at a scowl on his face as the leader of a special ops group its hilarious ! The ''terrorists'' are as scary and realistic as the ''raptors'' , this is so phoney and bad at everything it tries you have to laugh .The part where the giant T-REX who somehow snuck on board a ship and then somehow got below is blown up and you see the metal pole sticking up where its head was is the perfect ending .If your into bad films , this is the pot of gold , the mona lisa of b-b-bad !!!
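
One possible answer to the "why?" in the shuffle comment: if a data file happens to be sorted by class, any slice of it contains only one class. The sketch below uses a made-up, class-sorted label list (an assumption; the actual ordering of `imdb_train.json` is not shown in these notes).

```python
import random
from collections import Counter

# Hypothetical data set stored sorted by class
ordered = ["neg"] * 50 + ["pos"] * 50

# Taking the first 20 examples without shuffling gives a single class
print(Counter(ordered[:20]))

# After shuffling, a slice is a mix of both classes
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed only to make the sketch repeatable
print(Counter(shuffled[:20]))
```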


## Bag of words in practice

* We will need to build a feature vector for every example
* We will need to know the class for every example

* Build a data matrix with dimensionality (number of examples, number of possible features) and a value for each feature: 0/1 for binary features, with TF-IDF weights being another typical choice

There is little point in implementing all of this ourselves, so we will mostly use ready-made classes and functions from scikit-learn

In [3]:
# We need to gather the texts and labels into separate lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print("This many texts",len(texts))
print("This many labels",len(labels))
print()
for label,text in list(zip(labels,texts))[:20]:
    print(label,text[:50]+"...")



This many texts 25000
This many labels 25000

neg Everything in this film is bad , the story , the a...
pos Jafar Panahi's comedy-drama \Offside\" portrays so...
pos This is a great movie that everyone should see. It...
pos I saw this movie with a bunch of friends and altho...
pos Having previously seen the abridged print presente...
neg In my opinion, this movie's title should be change...
neg Well. Where to begin. Let's just say this; avoid t...
neg Flat, ordinary thriller about a conniving woman wh...
pos This is one of the most guilty pleasure movies eve...
neg If Alien, Jurassic Park and countless other sci fi...
pos SPOILERS 9/11 is a very good and VERY realistic do...
neg This is one of the funniest movies i've ever seen....
neg I never like to comment on a good film but when it...
neg Words can't describe how utterly stupid this story...
neg I'd give it a 2/10.  I was really, really disappoi...
pos This movie is one of the funniest, saddest and mos...
neg What can possibly said

## Sklearn text vectorizers

* Vectorizers take care of turning inputs into feature vectors
* Also build the feature name to feature index mapping:
    * `.fit()` learn the mapping
    * `.fit_transform()` learn and apply the mapping
    * `.transform()` apply the learned mapping
    * (What is the difference?)

Let's first try on a tiny example, then work our way to the real data:

In [4]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer()

# just two short documents
toy_data=["More precious than gold: Why the metal palladium is soaring","The price of the precious metal palladium has soared on the global commodities markets."]

vectorizer.fit(toy_data)
print("Unique features:")
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in newer scikit-learn
print()
print("Feature vectors (sparse format):")
print(vectorizer.transform(toy_data))

Unique features:
['commodities', 'global', 'gold', 'has', 'is', 'markets', 'metal', 'more', 'of', 'on', 'palladium', 'precious', 'price', 'soared', 'soaring', 'than', 'the', 'why']

Feature vectors (sparse format):
  (0, 2)	1
  (0, 4)	1
  (0, 6)	1
  (0, 7)	1
  (0, 10)	1
  (0, 11)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1
  (0, 17)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	1
  (1, 6)	1
  (1, 8)	1
  (1, 9)	1
  (1, 10)	1
  (1, 11)	1
  (1, 12)	1
  (1, 13)	1
  (1, 16)	3




In [5]:
#...and in a more understandable format
#...these are the feature vectors of our two toy documents
print(vectorizer.transform(toy_data).todense())

[[0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1]
 [1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 3 0]]


In [6]:
# only features seen in the training data can be taken into account!
print(vectorizer.transform(["unseen words only"]).todense())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
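
The difference between `.fit()`, `.fit_transform()` and `.transform()` asked about earlier can be checked on a toy example. This is a minimal sketch; the documents and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the metal is precious", "the price has soared"]
new_docs = ["the price of gold"]

# Option 1: learn the mapping, then apply it, as two separate steps
two_step = CountVectorizer()
two_step.fit(train_docs)
matrix_two_step = two_step.transform(train_docs)

# Option 2: learn and apply the mapping in a single call
one_step = CountVectorizer()
matrix_one_step = one_step.fit_transform(train_docs)

# Both options give the same feature matrix for the training documents
same = bool((matrix_two_step.toarray() == matrix_one_step.toarray()).all())
print("same result:", same)

# transform() alone reuses the learned mapping: "of" and "gold" were never
# seen during fit, so only "the" and "price" contribute to the vector
new_matrix = one_step.transform(new_docs)
print(new_matrix.shape)
print(new_matrix.sum())
```

The practical rule that follows: call `.fit()` or `.fit_transform()` on training data only, and `.transform()` on everything else, so that dev and test data never influence the vocabulary.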


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)
print("what did we get? ->", feature_matrix.__class__)

shape= (25000, 74849)
what did we get? -> <class 'scipy.sparse.csr.csr_matrix'>
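
The three `CountVectorizer` parameters used above are easier to see on a tiny input. The following sketch uses two made-up documents; the variable names are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "good good movie"]

# ngram_range=(1, 1): single words only
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(sorted(unigrams.vocabulary_))

# ngram_range=(1, 2): single words plus pairs of adjacent words, e.g. "not good"
bigrams_too = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(bigrams_too.vocabulary_))

# binary=False keeps counts, binary=True only records presence
good_index = unigrams.vocabulary_["good"]
counts = CountVectorizer(binary=False).fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(counts[1][good_index], binary[1][good_index])

# max_features keeps only the most frequent features across the corpus
capped = CountVectorizer(max_features=2).fit(docs)
print(len(capped.vocabulary_))
```

On the IMDB data the cap of 100000 was never reached, since the unigram vocabulary has 74849 entries. Raising `ngram_range` to `(1, 2)` would likely make the cap matter.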


In [8]:
print(feature_matrix)

  (0, 18292)	1
  (0, 46050)	1
  (0, 38755)	1
  (0, 66339)	1
  (0, 32435)	1
  (0, 46680)	1
  (0, 24202)	1
  (0, 68640)	1
  (0, 4753)	1
  (0, 2662)	1
  (0, 60351)	1
  (0, 402)	1
  (0, 72365)	1
  (0, 36865)	1
  (0, 67117)	1
  (0, 67125)	1
  (0, 6334)	1
  (0, 25734)	1
  (0, 9304)	1
  (0, 73342)	1
  (0, 66367)	1
  (0, 65747)	1
  (0, 62338)	1
  (0, 3258)	1
  (0, 21827)	1
  :	:
  (24999, 3538)	1
  (24999, 57668)	1
  (24999, 49147)	1
  (24999, 66526)	1
  (24999, 66621)	1
  (24999, 31868)	1
  (24999, 13574)	1
  (24999, 31669)	1
  (24999, 68412)	1
  (24999, 45388)	1
  (24999, 66329)	1
  (24999, 45656)	1
  (24999, 50496)	1
  (24999, 72910)	1
  (24999, 33779)	1
  (24999, 18839)	1
  (24999, 72239)	1
  (24999, 17239)	1
  (24999, 29971)	1
  (24999, 70493)	1
  (24999, 12484)	1
  (24999, 51610)	1
  (24999, 15621)	1
  (24999, 67385)	1
  (24999, 25109)	1


In [9]:
print(vectorizer.get_feature_names_out()[:1000])

['00' '000' '0000000000001' '00001' '00015' '000s' '001' '003830' '006'
 '007' '0079' '0080' '0083' '0093638' '00am' '00pm' '00s' '01' '01pm' '02'
 '020410' '029' '03' '04' '041' '05' '050' '06' '06th' '07' '08' '087'
 '089' '08th' '09' '0f' '0ne' '0r' '0s' '10' '100' '1000' '1000000'
 '10000000000000' '1000lb' '1000s' '1001' '100b' '100k' '100m' '100min'
 '100mph' '100s' '100th' '100x' '100yards' '101' '101st' '102' '102nd'
 '103' '104' '1040' '1040a' '1040s' '105' '1050' '105lbs' '106' '106min'
 '107' '108' '109' '10am' '10lines' '10mil' '10min' '10minutes' '10p'
 '10pm' '10s' '10star' '10th' '10x' '10yr' '11' '110' '1100' '11001001'
 '1100ad' '111' '112' '1138' '114' '1146' '115' '116' '117' '11f' '11m'
 '11th' '12' '120' '1200' '1200f' '1201' '1202' '123' '12383499143743701'
 '125' '125m' '127' '128' '12a' '12hr' '12m' '12mm' '12s' '12th' '13'
 '130' '1300' '1300s' '131' '1318' '132' '134' '135' '135m' '136' '137'
 '138' '139' '13k' '13s' '13th' '14' '140' '1408' '140hp' '1415' '14

Now we have the feature matrix done!

# Data split

* Train data - all training is based on it (this includes fitting the vectorizer!)
* Development data - used to set the parameters
* Test data - not used for anything during training, only to produce the final results
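
One way to get all three parts is to call `train_test_split` twice. This is a sketch on made-up data; the 60/20/20 proportions, the fixed `random_state`, the use of `stratify`, and the `toy_` names are choices made for this example, not something the lecture prescribes.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the real texts and labels
toy_texts = [f"document {i}" for i in range(100)]
toy_labels = ["pos" if i % 2 == 0 else "neg" for i in range(100)]

# Step 1: set aside 20% of everything as the test set
rest_texts, test_texts, rest_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels)

# Step 2: split the remaining 80 examples into train (60) and dev (20)
train_part, dev_part, train_part_labels, dev_part_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.25, random_state=0, stratify=rest_labels)

print(len(train_part), len(dev_part), len(test_texts))  # 60 20 20
```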

In [10]:
from sklearn.model_selection import train_test_split

train_texts, dev_texts, train_labels, dev_labels=train_test_split(texts,labels,test_size=0.2)
vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix_train=vectorizer.fit_transform(train_texts)
feature_matrix_dev=vectorizer.transform(dev_texts)




In [11]:
print(feature_matrix_train.shape)
print(feature_matrix_dev.shape)

(20000, 68538)
(5000, 68538)


# Training the classifier

* Let us try the venerable, if a bit outdated, SVM
* Linear SVM for simplicity

In [12]:
import sklearn.svm
classifier=sklearn.svm.LinearSVC(C=0.0005,verbose=1)
classifier.fit(feature_matrix_train, train_labels)

[LibLinear]

LinearSVC(C=0.0005, verbose=1)

# Test

* For a quick test we can use the .score() method

In [13]:
print("DEV",classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN",classifier.score(feature_matrix_train, train_labels))


DEV 0.8746
TRAIN 0.8949


* Try varying the C value and observe the results
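
A loop like the following is one way to do this systematically. The helper `sweep_c` and the tiny toy corpus are assumptions made so that the sketch runs on its own. With the real data you would pass `feature_matrix_train`, `train_labels`, `feature_matrix_dev` and `dev_labels` instead.

```python
import sklearn.svm
from sklearn.feature_extraction.text import CountVectorizer

def sweep_c(train_X, train_y, dev_X, dev_y, c_values):
    """Train one LinearSVC per C value; return {C: (train accuracy, dev accuracy)}."""
    results = {}
    for c in c_values:
        clf = sklearn.svm.LinearSVC(C=c)
        clf.fit(train_X, train_y)
        results[c] = (clf.score(train_X, train_y), clf.score(dev_X, dev_y))
    return results

# Tiny made-up corpus, only there to make the sketch self-contained
toy_train = ["great film", "wonderful acting", "great fun",
             "terrible film", "awful acting", "terrible bore"]
toy_train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_dev = ["wonderful fun", "awful bore"]
toy_dev_labels = ["pos", "neg"]

toy_vectorizer = CountVectorizer(binary=True)
toy_train_X = toy_vectorizer.fit_transform(toy_train)  # fit on train only
toy_dev_X = toy_vectorizer.transform(toy_dev)

results = sweep_c(toy_train_X, toy_train_labels, toy_dev_X, toy_dev_labels,
                  c_values=[0.001, 0.1, 10.0])
for c, (train_acc, dev_acc) in results.items():
    print(f"C={c}: train={train_acc:.3f} dev={dev_acc:.3f}")
```

When reading the results, compare the train and dev columns: a large gap between them at high C values is a typical sign of overfitting.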

In [14]:
import sklearn.metrics
predictions_dev=classifier.predict(feature_matrix_dev)
print(predictions_dev)
print(sklearn.metrics.confusion_matrix(dev_labels,predictions_dev))
print(sklearn.metrics.accuracy_score(dev_labels,predictions_dev))

['neg' 'pos' 'pos' ... 'pos' 'neg' 'neg']
[[2126  359]
 [ 268 2247]]
0.8746
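
The confusion matrix printed above can be unpacked by hand. In scikit-learn's convention the rows are the true classes and the columns the predicted classes, in sorted label order (here `neg`, then `pos`). The sketch below recomputes the accuracy from the four cells and, as an extra, precision and recall for the `pos` class.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
confusion = [[2126, 359],    # true neg: predicted neg, predicted pos
             [268, 2247]]    # true pos: predicted neg, predicted pos

true_neg, false_pos = confusion[0]
false_neg, true_pos = confusion[1]
total = true_neg + false_pos + false_neg + true_pos

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = (true_neg + true_pos) / total
# Precision for pos: of everything predicted pos, how much really is pos
precision_pos = true_pos / (true_pos + false_pos)
# Recall for pos: of everything that really is pos, how much was found
recall_pos = true_pos / (true_pos + false_neg)

print(total)                      # 5000
print(round(accuracy, 4))         # 0.8746
print(round(precision_pos, 4))
print(round(recall_pos, 4))
```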


# Saving and loading the trained classifier

* We fitted two things: the vectorizer and the classifier
* If we want to reuse them later, we need to save them
* You can use `pickle` in most cases

In [15]:
import pickle

with open("saved_model.pickle","wb") as f:
    pickle.dump((classifier,vectorizer),f)

* Let's try to load and test

In [16]:
with open("saved_model.pickle","rb") as f:
    classifier_loaded,vectorizer_loaded=pickle.load(f)

feature_matrix_dev_loaded=vectorizer_loaded.transform(dev_texts)
print("DEV - loaded (should match the score above)",classifier_loaded.score(feature_matrix_dev_loaded, dev_labels))



DEV - loaded (should match the score above) 0.8746
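
As an alternative to pickling a tuple, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline, which then takes raw text directly. This is a sketch of that option on a made-up toy corpus, not what the cells above do. It pickles to memory with `pickle.dumps` only to avoid writing a file, and `pickle.dump` to a file works the same way.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only there to make the sketch self-contained
toy_texts = ["great film", "wonderful acting", "great fun",
             "terrible film", "awful acting", "terrible bore"]
toy_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# The pipeline fits the vectorizer and the classifier together
pipeline = make_pipeline(CountVectorizer(binary=True), LinearSVC())
pipeline.fit(toy_texts, toy_labels)

# Save and restore the whole pipeline as one object
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw strings; no separate transform() call needed
new_reviews = ["a great and wonderful film", "an awful bore"]
print(restored.predict(new_reviews))
```

Keeping both parts in one object removes the risk of pairing a classifier with the wrong vectorizer. As with any pickle, only load files from sources you trust.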
