You can wget this notebook from https://raw.githubusercontent.com/TurkuNLP/intro-to-nlp/master/bow_classifier.ipynb

# Text classification

## Objectives of this lecture

* Introduce text classification as a task
* Introduce basic approaches to text classification, especially the bag-of-words model
* Show a practical implementation of the concepts from the previous two lectures (intro and machine learning primer)
* Train a simple classifier on the IMDB dataset

## Purpose of these materials

* These materials support the lectures but do not replace them
* Attending the lectures is highly recommended
* Take notes!

## Text classification

* Text in, class label out
* Often framed as "document classification": document in, class label(s) out
  * Movie review -> star rating
  * Email -> ham or spam
  * News article -> list of topics from a pre-defined set of topic categories
  * Comment in online discussion -> abusive or not
  * Customer email / call to customer service -> topic and relevant business process
  * Customer feedback -> needs reaction or not
  * ...
  
* Document classification can be seen as a supervised classification problem with pre-defined classes
* A direct application of the standard approach:
  1. prepare training/dev/test data
  2. extract features
  3. train your favorite classifier
  4. test on held-out data
 

# Bag-of-words model

* BoW is the simplest way to model documents for classification
* The document is reduced to a **set** of features, in the simplest case the words


* Feature vector: a vector with as many dimensions as we have unique features, and a non-zero value set for every feature present in our example
* Values: 1/0 (present/absent) or, often, TF-IDF weights (why?)
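
As a minimal sketch of the idea, binary feature vectors can be built by hand in plain Python. The toy documents and the helper names `tokenize` and `binary_vector` are invented for illustration, and splitting on whitespace is a simplifying assumption, not how a real tokenizer works.

```python
# Toy documents, invented for illustration only
toy_docs = ["the movie was great", "the movie was not great at all"]

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace
    return text.lower().split()

# Feature name -> feature index mapping over all unique words
unique_words = sorted({word for doc in toy_docs for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(unique_words)}

def binary_vector(text):
    # 1 if the word is present in the document, 0 otherwise;
    # word order and word counts are discarded
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

print(vocabulary)
for doc in toy_docs:
    print(binary_vector(doc), doc)
```

Note that the vector has one dimension per unique word, so it grows with the vocabulary and not with the length of a document.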

# IMDB data

* Movie review sentiment positive/negative
* Some 25,000 examples, 50:50 split of classes (why is this number relevant?)
* Current state-of-the-art is about 95% accuracy (what is accuracy?)
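
To make the two questions above concrete, here is a small sketch of accuracy and of the majority-class baseline. The labels are invented toy data and the helper name `accuracy` is hypothetical. With a 50:50 class split, always predicting one class gives about 50% accuracy, which is the number any classifier should be compared against.

```python
from collections import Counter

def accuracy(gold, predicted):
    # Fraction of examples whose predicted label equals the gold label
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Invented toy labels with a 50:50 class split
gold = ["pos", "neg", "pos", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos"]
print("classifier accuracy:", accuracy(gold, predicted))

# Majority baseline: always predict the most common class
majority_label = Counter(gold).most_common(1)[0][0]
baseline = [majority_label] * len(gold)
print("majority baseline accuracy:", accuracy(gold, baseline))
```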

In [10]:
import json
import random
with open("Data/imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) # play it safe! (why?)
# The data given to you may be ordered, and some classifiers might not work well on ordered data
print("class label:", data[0]["class"])
print("text:",data[0]["text"])

class label: pos
text: I saw this film with a special screening of the work of Owen Alik Shahadah. It is so interesting where did this guy come from. Now he is probably the key independent African filmmaker in the world. And I am not talking about black filmmakers I am talking about filmmakers who are rooted in culture. The idea if anything testifies to the diversity and range of African themes, with his film 500 Years Later it is a African issue. But the Idea doesn't fit that mold. Showing the artistic diversity. The film is an all African cast but the topic is a human topic which most of us could relate to. I just love the mild comedy in it. And the Kora of Tunde Jegede is just amazing, it is really a art-house gem.


## Bag of words in practice

* We will need to build a feature vector for every example
* We will need to know the class for every example

* Build a data matrix with dimensionality (number of examples, number of possible features) and a value for each feature: 0/1 for binary features; TF-IDF weights are also a typical choice
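
To illustrate what TF-IDF weights do, the sketch below computes them by hand for three invented toy documents. It assumes one common textbook variant, raw term count times `log(N / df)`. scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will differ. The helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Invented toy documents, tokenized by whitespace
toy_docs = [
    "the film was great".split(),
    "the film was awful".split(),
    "the plot was great fun".split(),
]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each word occur?
df = Counter(word for doc in toy_docs for word in set(doc))

def tfidf(doc):
    # Term frequency (raw count) times inverse document frequency log(N / df)
    tf = Counter(doc)
    return {word: tf[word] * math.log(n_docs / df[word]) for word in tf}

for doc in toy_docs:
    print({word: round(weight, 3) for word, weight in tfidf(doc).items()})
```

Words that occur in every document, such as "the" here, get weight zero, while words that occur in few documents get the highest weights.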

There is little point in implementing all this ourselves, so we will use ready-made classes and functions, mostly from scikit-learn

In [11]:
# We need to gather the texts and labels into separate lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print("This many texts",len(texts))
print("This many labels",len(labels))
print()
for label,text in list(zip(labels,texts))[:20]:
    print(label,text[:50]+"...")



This many texts 25000
This many labels 25000

pos I saw this film with a special screening of the wo...
neg This film so NOT funny - such a waste of great sta...
neg A Christmas Story Is A Holiday Classic And My Favo...
neg I've seen all kinds of \Hamlet\"s.   Kenneth Brana...
neg This movie is a modest effort by Spike Lee. He is ...
neg Surely one of the most ill-advised remakes of a cl...
neg Okay. So there aren't really that many great movie...
pos BEWARE SPOILERS. This movie was okay. Goldie Hawn ...
pos I love occult Horror, and the great British Hammer...
pos In Iran, the Islamic Revolution has shaped all par...
pos I found this to be a so-so romance/drama that has ...
neg Sorry did i miss something? did i walk out early? ...
neg The movie had so much potential, but due to 70's t...
neg *****probably minor spoilers******  I cant say i l...
neg (aka: The Bloodsucker Leads the Dance)  Lots of na...
pos Everyone has already commented on the cinematograp...
pos \What would you do?\" 

## Sklearn text vectorizers

* Vectorizers take care of turning inputs into feature vectors
* Also build the feature name to feature index mapping:
    * `.fit()` learn the mapping
    * `.fit_transform()` learn and apply the mapping
    * `.transform()` apply the learned mapping
    * (What is the difference?)
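
A small sketch of the difference between the three methods, using invented toy texts and hypothetical variable names. It suggests that `.fit()` followed by `.transform()` gives the same result as `.fit_transform()` on the same data, and that `.transform()` alone is what you apply to new data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy texts
toy_train = ["good movie", "bad movie"]
toy_new = ["good good film"]

# Two steps: learn the mapping, then apply it
v_two_step = CountVectorizer()
v_two_step.fit(toy_train)
matrix_two_step = v_two_step.transform(toy_train)

# One step: learn and apply the mapping on the same data
v_one_step = CountVectorizer()
matrix_one_step = v_one_step.fit_transform(toy_train)

# Both routes give the same mapping and the same matrix
print(v_two_step.vocabulary_ == v_one_step.vocabulary_)
print((matrix_two_step.toarray() == matrix_one_step.toarray()).all())

# New data is only transformed with the already learned mapping;
# "film" was never seen during fit, so it is silently dropped
print(v_two_step.transform(toy_new).toarray())
```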

Let's first try on a tiny example, then work our way to the real data:

In [15]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(ngram_range=(1,1))

# just two short documents
toy_data=["More precious than gold: Why the metal palladium is soaring","The price of the precious metal palladium has soared on the global commodities markets."]

vectorizer.fit(toy_data)
print("Unique features:")
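# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out().tolist())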
print()
print("Feature vectors (sparse format):")
print(vectorizer.transform(toy_data))

Unique features:
['commodities', 'global', 'gold', 'has', 'is', 'markets', 'metal', 'more', 'of', 'on', 'palladium', 'precious', 'price', 'soared', 'soaring', 'than', 'the', 'why']

Feature vectors (sparse format):
  (0, 2)	1
  (0, 4)	1
  (0, 6)	1
  (0, 7)	1
  (0, 10)	1
  (0, 11)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1
  (0, 17)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	1
  (1, 6)	1
  (1, 8)	1
  (1, 9)	1
  (1, 10)	1
  (1, 11)	1
  (1, 12)	1
  (1, 13)	1
  (1, 16)	3


In [16]:
#...and in a more understandable format
#...these are the feature vectors of our two toy documents
print(vectorizer.transform(toy_data).todense())

[[0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1]
 [1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 3 0]]


In [7]:
# only features seen in the training data can be taken into account!
print(vectorizer.transform(["unseen words only"]).todense())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)
print("what did we get? ->", feature_matrix.__class__)

shape= (25000, 74849)
what did we get? -> <class 'scipy.sparse.csr.csr_matrix'>
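
The reason for the sparse format can be sketched on a tiny invented matrix: only the non-zero values are stored. The toy matrix and variable names below are assumptions for illustration, and the last line only multiplies out the shape reported above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Invented toy matrix: 3 documents, 6 features, mostly zeros
toy_dense = np.array([
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
])
toy_sparse = csr_matrix(toy_dense)

n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
print("stored non-zero values:", toy_sparse.nnz)
print("total cells:", n_cells)
print("density:", toy_sparse.nnz / n_cells)

# Number of cells a dense matrix of shape (25000, 74849) would need
print("cells in a dense 25000 x 74849 matrix:", 25000 * 74849)
```

Since a single review contains only a small part of the whole vocabulary, almost all cells of the real matrix are zero, and storing it densely would waste a lot of memory.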


In [9]:
print(feature_matrix)

  (0, 18292)	1
  (0, 46050)	1
  (0, 38755)	1
  (0, 66339)	1
  (0, 32435)	1
  (0, 46680)	1
  (0, 24202)	1
  (0, 68640)	1
  (0, 4753)	1
  (0, 2662)	1
  (0, 60351)	1
  (0, 402)	1
  (0, 72365)	1
  (0, 36865)	1
  (0, 67117)	1
  (0, 67125)	1
  (0, 6334)	1
  (0, 25734)	1
  (0, 9304)	1
  (0, 73342)	1
  (0, 66367)	1
  (0, 65747)	1
  (0, 62338)	1
  (0, 3258)	1
  (0, 21827)	1
  :	:
  (24999, 3538)	1
  (24999, 57668)	1
  (24999, 49147)	1
  (24999, 66526)	1
  (24999, 66621)	1
  (24999, 31868)	1
  (24999, 13574)	1
  (24999, 31669)	1
  (24999, 68412)	1
  (24999, 45388)	1
  (24999, 66329)	1
  (24999, 45656)	1
  (24999, 50496)	1
  (24999, 72910)	1
  (24999, 33779)	1
  (24999, 18839)	1
  (24999, 72239)	1
  (24999, 17239)	1
  (24999, 29971)	1
  (24999, 70493)	1
  (24999, 12484)	1
  (24999, 51610)	1
  (24999, 15621)	1
  (24999, 67385)	1
  (24999, 25109)	1


In [10]:
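# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out()[:1000].tolist())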

['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02', '020410', '029', '03', '04', '041', '05', '050', '06', '06th', '07', '08', '087', '089', '08th', '09', '0f', '0ne', '0r', '0s', '10', '100', '1000', '1000000', '10000000000000', '1000lb', '1000s', '1001', '100b', '100k', '100m', '100min', '100mph', '100s', '100th', '100x', '100yards', '101', '101st', '102', '102nd', '103', '104', '1040', '1040a', '1040s', '105', '1050', '105lbs', '106', '106min', '107', '108', '109', '10am', '10lines', '10mil', '10min', '10minutes', '10p', '10pm', '10s', '10star', '10th', '10x', '10yr', '11', '110', '1100', '11001001', '1100ad', '111', '112', '1138', '114', '1146', '115', '116', '117', '11f', '11m', '11th', '12', '120', '1200', '1200f', '1201', '1202', '123', '12383499143743701', '125', '125m', '127', '128', '12a', '12hr', '12m', '12mm', '12s', '12th', '13', '130', '1300', '1300s', '131', 

Now we have the feature matrix done!

# Data split

* Train data - all training based on it (this includes the vectorizer!)
* Development data - set the parameters
* Test data - used for nothing during training, produce final results
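
One possible way to get all three sets, sketched here on invented placeholder data, is to call `train_test_split` twice. The 60/20/20 proportions, the fixed `random_state` and the variable names are assumptions made for this example only.

```python
from sklearn.model_selection import train_test_split

# Invented placeholder data: 100 texts with balanced labels
toy_texts = ["text %d" % i for i in range(100)]
toy_labels = ["pos" if i % 2 == 0 else "neg" for i in range(100)]

# First set aside 20% as test data...
rest_x, test_x, rest_y, test_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0)
# ...then 25% of the rest (20% of all data) as development data
train_x, dev_x, train_y, dev_y = train_test_split(
    rest_x, rest_y, test_size=0.25, random_state=0)

print(len(train_x), len(dev_x), len(test_x))
```

Fixing `random_state` makes the split reproducible, so scores from different runs can be compared.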

In [18]:
from sklearn.model_selection import train_test_split

train_texts, dev_texts, train_labels, dev_labels = train_test_split(texts,labels,test_size=0.2)
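# Fit the vectorizer on the training texts only; fitting it on all texts, as in the earlier cell, is a bad habit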
vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix_train=vectorizer.fit_transform(train_texts)
feature_matrix_dev=vectorizer.transform(dev_texts)




In [19]:
print(feature_matrix_train.shape)
print(feature_matrix_dev.shape)

(20000, 68545)
(5000, 68545)


# Training the classifier

* Let us try the venerable, if a bit outdated, SVM
* Linear SVM for simplicity

In [76]:
import sklearn.svm
# Always experiment with the C value to find one that works well
classifier=sklearn.svm.LinearSVC(C=0.007,verbose=1)
classifier.fit(feature_matrix_train, train_labels)

[LibLinear]

LinearSVC(C=0.007, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

# Test

* For a quick test we can use the `.score()` method, which returns the accuracy

In [77]:
print("DEV",classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN",classifier.score(feature_matrix_train, train_labels))


DEV 0.886
TRAIN 0.965


* Try varying the C value and observe the results
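
A loop like the following is one way to run that experiment. This is only a sketch on invented toy data, which is far too small to give meaningful numbers. On the real data you would use the train and development feature matrices instead. All `toy_` names and the list of C values are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented toy data, only to make the sketch runnable
toy_train_texts = ["great film", "loved it", "wonderful acting", "great fun",
                   "awful film", "hated it", "terrible acting", "awful mess"]
toy_train_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
toy_dev_texts = ["great acting", "awful acting", "loved the film", "hated the mess"]
toy_dev_labels = ["pos", "neg", "pos", "neg"]

toy_vectorizer = CountVectorizer(binary=True)
toy_matrix_train = toy_vectorizer.fit_transform(toy_train_texts)
toy_matrix_dev = toy_vectorizer.transform(toy_dev_texts)

results = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    toy_classifier = LinearSVC(C=C, dual=True)
    toy_classifier.fit(toy_matrix_train, toy_train_labels)
    train_score = toy_classifier.score(toy_matrix_train, toy_train_labels)
    dev_score = toy_classifier.score(toy_matrix_dev, toy_dev_labels)
    results[C] = (train_score, dev_score)
    print("C=%g train=%.3f dev=%.3f" % (C, train_score, dev_score))
```

The value of C is chosen by the development score, not the training score, since a high training score alone may just mean the classifier has memorized the training data.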

In [71]:
import sklearn.metrics
predictions_dev=classifier.predict(feature_matrix_dev)
print(predictions_dev)
print(sklearn.metrics.confusion_matrix(dev_labels,predictions_dev))
print(sklearn.metrics.accuracy_score(dev_labels,predictions_dev))
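# Depending on the application, one kind of error can be costlier than the other
# (e.g. a wrongly predicted negative vs. a wrongly predicted positive),
# so look at the confusion matrix and not only at the accuracy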

['neg' 'neg' 'neg' ... 'pos' 'pos' 'pos']
[[2189  289]
 [ 281 2241]]
0.886
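
The confusion matrix above can be read by hand. In scikit-learn the rows are the true classes and the columns the predicted classes, in sorted label order (here `neg`, then `pos`). The sketch below recomputes the accuracy from the printed numbers and adds precision, recall and F1 for the `pos` class. Treating `pos` as the class of interest is an assumption made for this example.

```python
# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes, in the order (neg, pos).
confusion = [[2189, 289],
             [281, 2241]]

true_neg, false_pos = confusion[0]
false_neg, true_pos = confusion[1]

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_pos + true_neg) / total
# Of everything predicted pos, how much was really pos?
precision_pos = true_pos / (true_pos + false_pos)
# Of everything really pos, how much did we find?
recall_pos = true_pos / (true_pos + false_neg)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print("accuracy: %.3f" % accuracy)
print("precision (pos): %.3f" % precision_pos)
print("recall (pos): %.3f" % recall_pos)
print("F1 (pos): %.3f" % f1_pos)
```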


# Saving and loading the trained classifier

* We fitted two things: the vectorizer and the classifier
* If we want to reuse them later, we need to save them
* You can use `pickle` in most cases

In [78]:
import pickle

with open("saved_model.pickle","wb") as f:
    pickle.dump((classifier,vectorizer),f)

* Let's try to load and test

In [17]:
with open("saved_model.pickle","rb") as f:
    classifier_loaded,vectorizer_loaded=pickle.load(f)

feature_matrix_dev_loaded=vectorizer_loaded.transform(dev_texts)
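# Score on the matrix produced by the loaded vectorizer, so that both loaded objects are actually tested
print("DEV - loaded (should match the score above)",classifier_loaded.score(feature_matrix_dev_loaded, dev_labels))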



DEV - loaded (should match the score above) 0.8654
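
An alternative to saving a tuple, not used in this lecture, is to combine the vectorizer and the classifier into one scikit-learn pipeline, so that a single object takes raw texts in and gives labels out. The sketch below uses invented toy data and pickles to memory instead of to a file. The C value and all variable names are assumptions.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data
toy_texts = ["great film", "loved it", "awful film", "hated it"]
toy_labels = ["pos", "pos", "neg", "neg"]

# The pipeline applies the vectorizer first and the classifier second
model = make_pipeline(CountVectorizer(binary=True), LinearSVC(C=1.0, dual=True))
model.fit(toy_texts, toy_labels)

# Round trip through pickle, in memory here instead of through a file
model_loaded = pickle.loads(pickle.dumps(model))

# The loaded model should predict exactly what the original predicts
new_texts = ["great great film", "hated the film"]
print(model.predict(new_texts))
print(model_loaded.predict(new_texts))
```

With a pipeline it is not possible to load a classifier together with the wrong vectorizer by accident, since the two are always saved as one object.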
