# Text classification

## Objectives of this lecture

* Introduce text classification as a task
* Introduce basic approaches to text classification, especially the bag-of-words model
* Show a practical implementation of the concepts from the previous two lectures (intro and machine learning primer)
* Train a simple classifier on the IMDB dataset

## Purpose of these materials

* These materials support, but do not replace, the lectures
* Attending the lectures is highly recommended
* Take notes!

## Text classification

* Text in, class label out
* Often framed as "document classification": document in, class label(s) out
  * Movie review -> star rating
  * Email -> ham or spam
  * News article -> list of topics from a pre-defined set of topic categories
  * Comment in online discussion -> abusive or not
  * Customer email / call to customer service -> topic and relevant business process
  * Customer feedback -> needs reaction or not
  * ...
  
 * Document classification can be seen as a supervised classification problem with pre-defined classes
 * A direct application of the standard approach:
   1. prepare training/dev/test data
   2. extract features
   3. train your favorite classifier
   4. test on held-out data
 

# Bag-of-words model

* BoW is the simplest way to model documents for classification
* The document is reduced to a **set** of features, in the simplest case the words


* Feature vector: a vector with as many dimensions as we have unique features, and a non-zero value set for every feature present in our example
* Values: 1/0 (present/absent) or, quite often, TF-IDF weights (why?)
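
As a minimal sketch of the idea above (not the implementation used later in these notes), the following pure-Python snippet builds a vocabulary from two made-up documents and turns each document into a binary feature vector and a TF-IDF-weighted one. The whitespace tokenizer, the helper names and the unsmoothed `count * log(N / df)` weighting are assumptions chosen for illustration; libraries such as scikit-learn tokenize differently and use a smoothed IDF.

```python
import math

# Two made-up documents, tokenized naively by lowercasing and splitting on spaces.
docs = ["the movie was great", "the movie was terrible terrible"]
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: one dimension per unique word, in sorted order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})
index = {word: i for i, word in enumerate(vocabulary)}

def binary_vector(tokens):
    """1 if the word occurs in the document, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for word in tokens:
        vector[index[word]] = 1
    return vector

def tfidf_vector(tokens):
    """Term count times log(N / document frequency), one textbook variant."""
    n_docs = len(tokenized)
    vector = [0.0] * len(vocabulary)
    for word in set(tokens):
        df = sum(1 for other in tokenized if word in other)
        vector[index[word]] = tokens.count(word) * math.log(n_docs / df)
    return vector

print(vocabulary)
for tokens in tokenized:
    print(binary_vector(tokens), [round(w, 2) for w in tfidf_vector(tokens)])
```

In this toy setting, words that occur in every document receive a weight of zero, while words specific to one document keep a positive weight.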

# IMDB data

* Movie reviews labeled with positive/negative sentiment
* Some 25,000 examples, 50:50 split of classes (why is this number relevant?)
* Current state-of-the-art is about 95% accuracy (what is accuracy?)
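
To make the two questions above concrete, here is a small sketch of how accuracy is computed and what a trivial majority-class baseline achieves on balanced data. The label list is made up and the function name `accuracy` is only an illustrative helper, not part of any library used in these notes.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Hypothetical gold labels with a 50:50 class balance, like the IMDB data.
gold = ["pos", "neg"] * 5

# Majority-class baseline: always predict the most frequent label.
majority_label = Counter(gold).most_common(1)[0][0]
baseline_predictions = [majority_label] * len(gold)

print("majority label:", majority_label)
print("baseline accuracy:", accuracy(gold, baseline_predictions))
```

With balanced classes, any classifier worth using has to score clearly above this baseline.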

In [1]:
import json
import random
with open("data-bow/imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) #play it safe! (why?)
print("class label:", data[0]["class"])
print("text:",data[0]["text"])

class label: pos
text: When people nowadays hear of a 1940s drama, they usually appear to create a distance of irony claiming that it's another tearjerker with great stars in the lead of tragic, melancholic roles. This opinion, however, does not resemble Neorealist movies, in particular this one directed by Count Luchino Visconti. OSSESSIONE as his debut once censored and once cherished as nearly a realistic masterpiece is still loved by some people and strongly criticized by others. The contradictory opinions about the film that have appeared in these 65 years seem to have been caused by the content of the movie itself, exceptionally controversial for modern times as well as the past. At the same time, while being based on the novel by James M. Cain, THE POSTMAN ALWAYS RINGS TWICE, it is one of the most genuine screen adaptations where director remains his own style, view, his own art. I have seen the film twice and the second viewing led me to very detailed analysis part of which I'd

## Bag of words in practice

* We will need to build a feature vector for every example
* We will need to know the class for every example

* Build a data matrix of shape (number of examples, number of possible features) that holds a value for each feature: 0/1 for binary features, or TF-IDF weights, which are also a typical choice

There is little point in implementing all of this ourselves, so we will use ready-made classes and functions, mostly from scikit-learn.

In [2]:
# We need to gather the texts and labels into separate lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print("This many texts",len(texts))
print("This many labels",len(labels))
print()
for label,text in list(zip(labels,texts))[:20]:
    print(label,text[:50]+"...")



This many texts 25000
This many labels 25000

pos When people nowadays hear of a 1940s drama, they u...
neg Oh my god! This movie insults the intelligence of ...
neg I wasn't as \lucky\" as some of the others comment...
neg Can anybody do good CGI films besides Pixar? I mea...
neg For romantic comedies, I often judge the quality o...
pos This- and not a certain slightly overrated Souther...
neg I am starting this review with a big giant spoiler...
neg I was in second grade, 12 years ago. I remember it...
pos You play as B.J. Blazkowicz, a US secret agent sol...
pos I just realised I've been using IMDb for years now...
pos I love the way that this game can make you literal...
pos Chris and Andre are two average, ordinary teens. M...
pos \Twelve monkeys\"'s got all the elements to become...
pos Come on people. This movie is better than 4. I can...
pos definitely needed a little work in season 2. Such ...
neg La Sanguisuga Conduce la Danza, or The Bloodsucker...
neg 1st watched 11/7/2002 

## Sklearn text vectorizers

* Vectorizers take care of turning inputs into feature vectors
* Also build the feature name to feature index mapping:
    * `.fit()` learn the mapping
    * `.fit_transform()` learn and apply the mapping
    * `.transform()` apply the learned mapping
    * (What is the difference?)
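
One way to think about the three methods is the toy class below. It is a hypothetical stand-in written for illustration (the name `TinyCountVectorizer` and its whitespace tokenizer are assumptions) and it does not reproduce scikit-learn's actual implementation; it only mirrors the division of labor between learning the mapping and applying it.

```python
class TinyCountVectorizer:
    """Toy stand-in for a text vectorizer; not scikit-learn's implementation."""

    def fit(self, documents):
        # Learn the feature name -> feature index mapping.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.vocabulary_ = {word: i for i, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Apply the learned mapping; words not seen in fit() are ignored.
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word in doc.lower().split():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        # Learn the mapping and apply it to the same documents.
        return self.fit(documents).transform(documents)

train_docs = ["good film", "bad film"]
vec = TinyCountVectorizer()
train_rows = vec.fit_transform(train_docs)   # mapping is learned here
new_rows = vec.transform(["good new film"])  # mapping is only applied here
print(vec.vocabulary_)
print(train_rows)
print(new_rows)
```

The word "new" was never seen during fitting, so it has no column and leaves no trace in the resulting vector.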

Let's first try on a tiny example, then work our way to the real data:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer()

# just two short documents
toy_data=["More precious than gold: Why the metal palladium is soaring","The price of the precious metal palladium has soared on the global commodities markets."]

vectorizer.fit(toy_data)
print("Unique features:")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print()
print("Feature vectors (sparse format):")
print(vectorizer.transform(toy_data))

Unique features:
['commodities', 'global', 'gold', 'has', 'is', 'markets', 'metal', 'more', 'of', 'on', 'palladium', 'precious', 'price', 'soared', 'soaring', 'than', 'the', 'why']

Feature vectors (sparse format):
  (0, 2)	1
  (0, 4)	1
  (0, 6)	1
  (0, 7)	1
  (0, 10)	1
  (0, 11)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1
  (0, 17)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	1
  (1, 6)	1
  (1, 8)	1
  (1, 9)	1
  (1, 10)	1
  (1, 11)	1
  (1, 12)	1
  (1, 13)	1
  (1, 16)	3


In [4]:
#...and in a more understandable format
#...these are the feature vectors of our two toy documents
print(vectorizer.transform(toy_data).todense())

[[0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1]
 [1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 3 0]]


In [5]:
# only features seen in the training data can be taken into account!
print(vectorizer.transform(["unseen words only"]).todense())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)
print("what did we get? ->", feature_matrix.__class__)

shape= (25000, 74849)
what did we get? -> <class 'scipy.sparse.csr.csr_matrix'>
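
A quick back-of-the-envelope calculation suggests why the vectorizer returns a sparse matrix rather than a dense array. The sketch below uses the shape printed above and assumes 8 bytes per value (64-bit integers); the actual dtype and memory footprint may differ.

```python
rows, cols = 25000, 74849   # shape printed above
bytes_per_value = 8         # assumption: 64-bit integers

dense_cells = rows * cols
dense_gigabytes = dense_cells * bytes_per_value / 1e9
print(f"{dense_cells:,} cells, about {dense_gigabytes:.1f} GB if stored densely")
```

Since each review contains only a small fraction of the vocabulary, nearly all of these cells would be zero, and a sparse format stores only the non-zero ones.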


In [7]:
print(feature_matrix)

  (0, 39)	1
  (0, 7045)	1
  (0, 73498)	1
  (0, 53839)	1
  (0, 71547)	1
  (0, 2238)	1
  (0, 19563)	1
  (0, 74324)	1
  (0, 66619)	1
  (0, 66387)	1
  (0, 59728)	1
  (0, 24608)	1
  (0, 41089)	1
  (0, 5502)	1
  (0, 2149)	1
  (0, 30670)	1
  (0, 57715)	1
  (0, 72088)	1
  (0, 14645)	1
  (0, 67324)	1
  (0, 22083)	1
  (0, 15153)	1
  (0, 22050)	1
  (0, 46957)	1
  (0, 73714)	1
  :	:
  (24999, 72246)	1
  (24999, 22056)	1
  (24999, 6334)	1
  (24999, 8583)	1
  (24999, 9881)	1
  (24999, 30646)	1
  (24999, 41513)	1
  (24999, 2662)	1
  (24999, 71159)	1
  (24999, 41798)	1
  (24999, 58613)	1
  (24999, 43995)	1
  (24999, 72784)	1
  (24999, 30118)	1
  (24999, 24536)	1
  (24999, 34585)	1
  (24999, 3258)	1
  (24999, 4465)	1
  (24999, 44164)	1
  (24999, 46050)	1
  (24999, 66562)	1
  (24999, 66339)	1
  (24999, 34683)	1
  (24999, 67125)	1
  (24999, 46680)	1


In [8]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[:1000].tolist())

['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02', '020410', '029', '03', '04', '041', '05', '050', '06', '06th', '07', '08', '087', '089', '08th', '09', '0f', '0ne', '0r', '0s', '10', '100', '1000', '1000000', '10000000000000', '1000lb', '1000s', '1001', '100b', '100k', '100m', '100min', '100mph', '100s', '100th', '100x', '100yards', '101', '101st', '102', '102nd', '103', '104', '1040', '1040a', '1040s', '105', '1050', '105lbs', '106', '106min', '107', '108', '109', '10am', '10lines', '10mil', '10min', '10minutes', '10p', '10pm', '10s', '10star', '10th', '10x', '10yr', '11', '110', '1100', '11001001', '1100ad', '111', '112', '1138', '114', '1146', '115', '116', '117', '11f', '11m', '11th', '12', '120', '1200', '1200f', '1201', '1202', '123', '12383499143743701', '125', '125m', '127', '128', '12a', '12hr', '12m', '12mm', '12s', '12th', '13', '130', '1300', '1300s', '131', 

Now we have the feature matrix done!

# Data split

* Training data - everything that is learned is learned from it (this includes the vectorizer!)
* Development data - used to set the hyperparameters
* Test data - not used for anything during training, only to produce the final results
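
A three-way split along these lines could be sketched as follows. The helper `three_way_split`, its fractions and the placeholder data are assumptions for illustration; the same effect can be had by calling scikit-learn's `train_test_split` twice.

```python
import random

def three_way_split(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and cut it into train/dev/test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

examples = [f"example {i}" for i in range(100)]  # placeholder data
train, dev, test = three_way_split(examples)
print(len(train), len(dev), len(test))
```

The fixed seed makes the split reproducible, so that scores from different runs remain comparable.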

In [9]:
from sklearn.model_selection import train_test_split

train_texts, dev_texts, train_labels, dev_labels=train_test_split(texts,labels,test_size=0.2)
vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix_train=vectorizer.fit_transform(train_texts)
feature_matrix_dev=vectorizer.transform(dev_texts)




In [10]:
print(feature_matrix_train.shape)
print(feature_matrix_dev.shape)

(20000, 68295)
(5000, 68295)


# Training the classifier

* Let us try the venerable, if a bit outdated, SVM
* Linear SVM for simplicity

In [11]:
import sklearn.svm
classifier=sklearn.svm.LinearSVC(C=0.0005,verbose=1)
classifier.fit(feature_matrix_train, train_labels)

[LibLinear]

LinearSVC(C=0.0005, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=1)

# Test

* For a quick test we can use the `.score()` method, which returns the accuracy

In [12]:
print("DEV",classifier.score(feature_matrix_dev, dev_labels))
print("TRAIN",classifier.score(feature_matrix_train, train_labels))


DEV 0.8666
TRAIN 0.8954


* Try varying the C value and observe the results
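
One possible way to run such an experiment is a simple loop over candidate values. The sketch below is self-contained and therefore uses a tiny made-up dataset (all `toy_` names and the list of C values are assumptions); on the real data the same loop would use the train and dev feature matrices built earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up dataset, only to keep the example runnable on its own.
toy_train_texts = ["great movie", "wonderful acting", "loved it",
                   "terrible movie", "awful acting", "hated it"]
toy_train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_dev_texts = ["great acting", "awful movie"]
toy_dev_labels = ["pos", "neg"]

toy_vectorizer = CountVectorizer(binary=True)
toy_train = toy_vectorizer.fit_transform(toy_train_texts)  # fit on train only
toy_dev = toy_vectorizer.transform(toy_dev_texts)

scores = {}
for C in [0.0005, 0.005, 0.05, 0.5, 5.0]:
    classifier = LinearSVC(C=C)
    classifier.fit(toy_train, toy_train_labels)
    scores[C] = (classifier.score(toy_train, toy_train_labels),
                 classifier.score(toy_dev, toy_dev_labels))
    print(f"C={C}: train={scores[C][0]:.3f} dev={scores[C][1]:.3f}")
```

The value of C should be chosen by the dev score, not the train score, since the train score tends to keep rising as regularization is weakened.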

In [13]:
import sklearn.metrics
predictions_dev=classifier.predict(feature_matrix_dev)
print(predictions_dev)
print(sklearn.metrics.confusion_matrix(dev_labels,predictions_dev))
print(sklearn.metrics.accuracy_score(dev_labels,predictions_dev))

['pos' 'pos' 'pos' ... 'pos' 'pos' 'pos']
[[2128  368]
 [ 299 2205]]
0.8666
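
The accuracy can be recomputed by hand from the confusion matrix printed above, which also gives access to per-class measures. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted label order (`neg`, then `pos`); precision and recall for the `pos` class are added only as an illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# both ordered as ["neg", "pos"].
confusion = [[2128, 368],
             [299, 2205]]

true_neg, false_pos = confusion[0]
false_neg, true_pos = confusion[1]
total = true_neg + false_pos + false_neg + true_pos

accuracy = (true_pos + true_neg) / total
precision_pos = true_pos / (true_pos + false_pos)
recall_pos = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.4f}")
print(f"precision(pos)={precision_pos:.4f} recall(pos)={recall_pos:.4f}")
```

The diagonal holds the correctly classified examples, so accuracy is simply the diagonal sum divided by the total number of dev examples.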
