# Text classification

In a nutshell (details will be explained during the lecture)

- Assign input texts to categories, either predefined (supervised) or not (unsupervised/clustering)
  - Spam / not spam
  - One of several topics
  - Who is the author?
  - ...
- Done with machine learning
  - We covered clustering last week, so now we look into **supervised** classification
  - Main difference: unsupervised = no labeled training data, supervised = labeled training data
- Training data:
  - Ready-made examples of documents and their classes
  - Learn the task from these examples
  - Unsupervised = we don't know the classes, supervised = we know the classes
- Training: text features + model induction algorithm -> model
- Classification: text features + model -> predictions
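The two steps above can be sketched on toy data. This is a minimal illustration, assuming Python 3 and a recent scikit-learn; the four texts and the spam/ham labels are invented for the example and have nothing to do with the Suomi24 data.

```python
# Toy sketch: invented texts and labels, Python 3, recent scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = ["win money now", "cheap money offer",
               "meeting at noon", "lunch meeting tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

# Training: text features + model induction algorithm -> model
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = LinearSVC(C=1.0)
model.fit(X_train, train_labels)

# Classification: text features + model -> predictions
# Note: transform(), not fit_transform(); new texts must be mapped
# onto the features learned from the training data.
X_new = vectorizer.transform(["cheap money", "meeting tomorrow"])
predictions = list(model.predict(X_new))
print(predictions)
```

The important detail is that the same fitted vectorizer is reused at classification time, so training and new documents live in the same feature space.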


# Features

- Represent each document for the classifier
- E.g.
  - Bag of Words (BoW)
  - Character N-Grams
  - Document metadata
  - PoS tags
  - ...you name it, someone tried it...
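To make the Bag of Words idea concrete, here is a small pure-Python sketch (Python 3). The two toy documents are made up; the point is only that each document becomes a vector of word counts over a shared vocabulary, and word order is thrown away.

```python
from collections import Counter

# Two invented, whitespace-tokenized toy documents
docs = ["se olla hyvä se", "olla kissa"]

# Shared vocabulary: every distinct token, in a fixed (sorted) order
vocab = sorted(set(token for doc in docs for token in doc.split()))

def bow_vector(doc, vocab):
    """Count how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```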
  
Let's try this on the Suomi24 VRT data:

```
<text discussionarea="Suhteet" subsections="Sinkut" title="Jos ATM hyppäisi benjihypyn" views="0" cid="unspecified" anonnick="ätminkäinen" comms="9" year="2015" date="12.05.2015" dateto="20150512" tid="13592337" datefrom="20150512" time="22:50" sect="Suhteet" subsect="Sinkut" ssubsect="" sssubsect="" ssssubsect="" sssssubsect="" ssssssubsect="" urlboard="http://keskustelu.suomi24.fi/t/13592337" urlmsg="http://keskustelu.suomi24.fi/t/13592337">
<paragraph>
<sentence>
Niin    1       niin    Adv     CASECHANGE_Up   2       advmod
parantaisiko    2       parantaa        V       PRS_Sg3|VOICE_Act|MOOD_Cond|CLIT_Qst    0       ROOT
se      3       se      Pron    SUBCAT_Dem|NUM_Sg|CASE_Nom      2       nsubj
hänen   4       hän     Pron    SUBCAT_Pers|NUM_Sg|CASE_Gen     5       poss
markkina-arvoaan        5       markkina-arvo   N       NUM_Sg|CASE_Par|POSS_Px3        2       dobj
naisten 6       nainen  N       NUM_Pl|CASE_Gen 7       poss
silmissä        7       silmä   N       NUM_Pl|CASE_Ine 2       nommod
?       8       ?       Punct   _       2       punct
</sentence>
</paragraph>
</text>
<text discussionarea="Suhteet" subsections="Sinkut" title="Jos ATM hyppäisi benjihypyn" cid="79614512" anonnick="NaisetOvatElukoita" comms="9" views="" date="20.06.2015" dateto="20150620" year="2015" tid="13592337" datefrom="20150620" time="20:34" sect="Suhteet" subsect="Sinkut" ssubsect="" sssubsect="" ssssubsect="" sssssubsect="" ssssssubsect="" urlboard="http://keskustelu.suomi24.fi/t/13592337" urlmsg="http://keskustelu.suomi24.fi/t/13592337#comment-79614512">
<paragraph>
<sentence>
No      1       no      Interj  CASECHANGE_Up   3       intj
jos     2       jos     Adv     _       3       advmod
teet    3       tehdä   V       PRS_Sg2|VOICE_Act|TENSE_Prs|MOOD_Ind    0       ROOT
sen     4       se      Pron    SUBCAT_Dem|NUM_Sg|CASE_Gen      5       poss
```

In [1]:
import re
import codecs
txt_re=re.compile(ur'^<text discussionarea="(.*?)".*tid="([0-9]+?)"',re.U)
ignore_re=re.compile(ur'^</?(text|sentence|paragraph)')


def read_vrt(inp):
    """Function to read the Suomi24 VRT format"""
    current_topic=None #topic name
    current_tid=None #discussion thread number
    words=[] #words in the discussion
    for line in inp:
        line=line.strip()
        match=txt_re.match(line)
        if match: #we have a new post
            if match.group(2)!=current_tid and words:#...and it is not part of the current thread
                yield current_topic, words
                words=[]
            current_topic=match.group(1) #Pick groups out of the regular expression
            current_tid=match.group(2)
        if ignore_re.match(line):
            continue
        columns=line.split(u"\t")
        if len(columns)<3 or not columns[1].isdigit(): #there seem to be a few broken lines, skip them
            continue
        words.append(columns[2].lower())
    else: #for loop ran out of items
        if words:
            yield current_topic, words

topics=[] #list of strings
texts=[] #list of strings
with codecs.open("s24.vrt","r","utf-8") as f:
    for topic, words in read_vrt(f):
        topics.append(topic)
        texts.append(u" ".join(words))

print "Document count:", len(topics)
print "Distinct topics:", u", ".join(set(topics))    

Document count: 12453
Distinct topics: Paikkakunnat, Tori, Koti ja rakentaminen, Työ ja opiskelu, Ajanviete, Nuoret, Ruoka ja juoma, MainPage, Suhteet, Lemmikit, Matkailu, Suomi24, Perhe, Ajoneuvot ja liikenne, Yhteiskunta, Tiede ja teknologia, Harrastukset, Viihde ja kulttuuri, Muoti ja kauneus, Ryhmät, Urheilu ja kuntoilu, Talous, Terveys


# TF.IDF weights

$$ TF\cdot\frac{N}{DF} $$

* TF - term frequency - count of term in current document
* N - number of documents in the data
* DF - number of documents with the term
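A worked example of the weighting on three invented toy documents (pure Python 3). The `log_scale` option is an addition for illustration: in practice the IDF part is usually log-scaled, and as far as I know scikit-learn's `TfidfVectorizer` by default uses a smoothed logarithmic IDF and normalizes each document vector to unit length, so its numbers will not match the plain formula exactly.

```python
import math
from collections import Counter

# Three invented, already tokenized toy documents
docs = [["kissa", "koira", "kissa"], ["koira", "auto"], ["koira"]]

N = len(docs)
# DF: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc, log_scale=False):
    """TF * N/DF as in the formula; optionally TF * log(N/DF)."""
    tf = doc.count(term)
    idf = N / df[term]
    if log_scale:
        idf = math.log(idf)
    return tf * idf

# "kissa" is frequent in doc 0 and rare in the data -> high weight
print(tf_idf("kissa", docs[0]))
# "koira" is in every document -> low weight (zero with log scaling)
print(tf_idf("koira", docs[0]), tf_idf("koira", docs[0], log_scale=True))
```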

In [2]:
import sklearn.feature_extraction

def tokenizer(txt):
    """Simple whitespace tokenizer"""
    return txt.split()

#Extract the features
tfidf_v=sklearn.feature_extraction.text.TfidfVectorizer(tokenizer=tokenizer) #,max_df=0.9
d=tfidf_v.fit_transform(texts)
print "documents x features", d.shape
print "feature matrix"
print d
print "features"
fnames=tfidf_v.get_feature_names()
for feature_id in range(1,100000,5000):
    print feature_id,fnames[feature_id]



documents x features (12453, 229764)
feature matrix
  (0, 227112)	0.13654656756
  (0, 227095)	0.109876693955
  (0, 226981)	0.112921934495
  (0, 223285)	0.0671284409095
  (0, 219663)	0.0644993532335
  (0, 218297)	0.0261740937501
  (0, 217821)	0.0602237358203
  (0, 215121)	0.0799111500286
  (0, 208541)	0.0322894333732
  (0, 208466)	0.104748001507
  (0, 203132)	0.0731111585761
  (0, 202008)	0.0455023327263
  (0, 200001)	0.150731517533
  (0, 199437)	0.0306226943858
  (0, 199156)	0.0911279995445
  (0, 198843)	0.100900616753
  (0, 198766)	0.0270201961422
  (0, 190661)	0.0596916611611
  (0, 187069)	0.0299879590476
  (0, 180836)	0.0968389976216
  (0, 176730)	0.0316865123023
  (0, 176630)	0.0637229924694
  (0, 175613)	0.0543242115156
  (0, 175480)	0.112138936284
  (0, 171781)	0.0798338024262
  :	:
  (12452, 51024)	0.0471395227882
  (12452, 49511)	0.0523624941598
  (12452, 47227)	0.0938046288728
  (12452, 39342)	0.0198309929611
  (12452, 38446)	0.0428498935496
  (12452, 36926)	0.0342077174773
  



# Support Vector Machines

* Will be explained during the lecture, Google if you couldn't attend
* Key concepts:
  - Separating hyperplane
  - Margin
  - Errors and slack variables
  - The parameter C
  - Regularization
  
<img src="http://docs.opencv.org/2.4/_images/sample-errors-dist.png"/>

* Multiclass classification = number of classes > 2
* One vs all = train one binary classifier per class (that class vs. the rest), pick the class with the max score
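The "pick the max score" step can be checked directly on a toy problem. A minimal sketch, assuming Python 3, numpy and a recent scikit-learn, where `LinearSVC` handles multiclass data one-vs-rest by default; the six 2-D points and the class names are invented.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Invented toy data: three classes, two points each
X = np.array([[1.0, 0.0], [0.9, 0.1],
              [0.0, 1.0], [0.1, 0.9],
              [-1.0, -1.0], [-0.9, -1.1]])
y = np.array(["a", "a", "b", "b", "c", "c"])

clf = LinearSVC(C=1.0).fit(X, y)

# One score per (document, class): one binary classifier for each class
scores = clf.decision_function(X)

# One vs all: for each document, pick the class with the highest score
picked = clf.classes_[scores.argmax(axis=1)]
print(scores.shape)
print(picked)
```

Taking the argmax over the per-class scores gives the same labels as `clf.predict(X)`.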

# Evaluation

* Will be explained during the lecture, Google key concepts if you couldn't attend
* Key concepts:
  - Accuracy, Precision, Recall, F-score
  - Train / Development / Test Data
  - Crossvalidation
  - Overfitting
  - Parameter optimization
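For reference, the basic metrics are simple enough to compute by hand. A small pure-Python sketch (Python 3) on invented gold labels and predictions, with precision, recall and F-score computed for the class "spam":

```python
# Invented gold-standard labels and classifier predictions
gold = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

target = "spam"
pairs = list(zip(gold, pred))
tp = sum(1 for g, p in pairs if g == target and p == target)  # true positives
fp = sum(1 for g, p in pairs if g != target and p == target)  # false positives
fn = sum(1 for g, p in pairs if g == target and p != target)  # false negatives

accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
precision = tp / (tp + fp)  # how many of the predicted spams are spam
recall = tp / (tp + fn)     # how many of the real spams were found
f_score = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_score)
```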


In [3]:
import sklearn.svm
import sklearn.cross_validation
X_train,X_test,Y_train,Y_test=sklearn.cross_validation.train_test_split(d, topics, test_size=0.3, random_state=0)

for C in (0.01,0.1,1,10,100):
    lin_clf = sklearn.svm.LinearSVC(C=C)
    lin_clf.fit(X_train,Y_train)
    print "C=%.3f  Accuracy=%.2f%%"%(C,lin_clf.score(X_test,Y_test)*100.0)


C=0.010  Accuracy=37.42%
C=0.100  Accuracy=59.82%
C=1.000  Accuracy=66.68%
C=10.000  Accuracy=66.62%
C=100.000  Accuracy=65.95%


...66% is not bad, keeping in mind we have 23 classes to choose from.

# Random baseline

* That we have 23 classes doesn't mean our baseline is 1/23!
* Class imbalance: some topics have far more documents than others
* Accuracy is susceptible to this!
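The effect of class imbalance on the baseline can be worked out from the class distribution alone. A sketch in pure Python 3 with an invented 6/3/1 label distribution, assuming the test data follows the same distribution as the training data: always guessing the biggest class scores max(p), guessing at random according to the class distribution scores the sum of p squared, and both differ from 1/(number of classes).

```python
from collections import Counter

# Invented, imbalanced toy labels: 6 x "a", 3 x "b", 1 x "c"
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

counts = Counter(labels)
probs = [count / len(labels) for count in counts.values()]

uniform = 1 / len(counts)             # naive guess: 1 / number of classes
most_frequent = max(probs)            # always predict the biggest class
stratified = sum(p * p for p in probs)  # random guess following class dist.

print(uniform, most_frequent, stratified)
```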

How do we fare compared to making random choices?

In [4]:
import sklearn.dummy
dummy=sklearn.dummy.DummyClassifier(strategy="most_frequent")
dummy.fit(X_train,Y_train)
print "Dummy classifier predicting most frequent class: %.2f%%"%(dummy.score(X_test,Y_test)*100.0)
dummy=sklearn.dummy.DummyClassifier(strategy="stratified")
dummy.fit(X_train,Y_train)
print "Dummy classifier predicting at random by class dist.: %.2f%%"%(dummy.score(X_test,Y_test)*100.0)

Dummy classifier predicting most frequent class: 28.88%
Dummy classifier predicting at random by class dist.: 13.01%


So, if you always predict the most frequent class, you get to about 29% accuracy, and with the simple SVM we get over 66%. I.e., we can safely say the classifier is learning something. :)

# Character n-grams

* Quite popular choice
* Does it work?
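A quick look at what character n-gram features are, in pure Python 3. The helper `char_ngrams` is a hypothetical stand-in for illustration, not scikit-learn's own analyzer. One possible reason such features can help (not tested here) is that related strings share features even when the whole tokens differ, e.g. a compound and its head word:

```python
def char_ngrams(text, n_min=3, n_max=4):
    """All character n-grams of text with n_min <= n <= n_max, as a set."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

compound = char_ngrams("markkina-arvo")
head = char_ngrams("arvo")

# As whole words these two are unrelated features,
# as character n-grams they overlap:
shared = compound & head
print(sorted(shared))
```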


In [5]:
tfidf_v_char=sklearn.feature_extraction.text.TfidfVectorizer(analyzer='char',ngram_range=(3,4)) #,max_df=0.9
d_char=tfidf_v_char.fit_transform(texts)
print "documents x features", d_char.shape
print "feature matrix"
print d_char
print "features"
fnames=tfidf_v_char.get_feature_names()
for feature_id in range(1,100000,5000):
    print feature_id,fnames[feature_id]


documents x features (12453, 336566)
feature matrix
  (0, 327495)	0.0279364084575
  (0, 327494)	0.0246940006294
  (0, 326082)	0.0284459123917
  (0, 326081)	0.0195666672323
  (0, 325561)	0.0570182735877
  (0, 325556)	0.0397059026255
  (0, 323870)	0.0145962063742
  (0, 323857)	0.0141140700881
  (0, 323698)	0.0329225192112
  (0, 323691)	0.0325440317323
  (0, 323579)	0.013446493561
  (0, 323569)	0.00752836225688
  (0, 323164)	0.0143551104277
  (0, 323159)	0.0139049754487
  (0, 322949)	0.0260674398704
  (0, 322948)	0.0252456237298
  (0, 322845)	0.0755417813006
  (0, 322843)	0.0727451050801
  (0, 322632)	0.0144044808613
  (0, 322621)	0.0124514648769
  (0, 322401)	0.0172161118601
  (0, 322400)	0.0151691640639
  (0, 322151)	0.0413451316873
  (0, 322141)	0.0229695854203
  (0, 321922)	0.0415038550878
  :	:
  (12452, 7121)	0.012006175183
  (12452, 7119)	0.0228802062083
  (12452, 7107)	0.0191241708685
  (12452, 7104)	0.00912417192638
  (12452, 7100)	0.0192372220089
  (12452, 7093)	0.0294206477113


In [7]:
X_train_char,X_test_char,Y_train_char,Y_test_char=\
    sklearn.cross_validation.train_test_split(d_char, topics, test_size=0.3, random_state=0)

for C in (0.01,0.1,1):
    lin_clf_char = sklearn.svm.LinearSVC(C=C)
    lin_clf_char.fit(X_train_char,Y_train_char)
    print "C=%.3f  Accuracy=%.2f%%"%(C,lin_clf_char.score(X_test_char,Y_test_char)*100.0)


C=0.010  Accuracy=43.98%
C=0.100  Accuracy=63.30%
C=1.000  Accuracy=68.55%


Forget about words and you'll get better numbers! Cool, eh? :)

Does this generalize? Let's run on Finnish tweets!

In [8]:
# I gathered a bunch of totally random Finnish tweets, will my model work?
import json

tweets=[]
with open("fin_tweets.json","r") as f:
    for lineno,line in enumerate(f):
        line=line.strip()
        if not line:
            continue
        try:
            tweet=json.loads(line)
        except ValueError: #some of these are broken
            continue
        tweets.append(tweet["text"])


In [9]:
d_tweet_char=tfidf_v_char.transform(tweets)
print d_tweet_char.shape
for counter,(tweet, cls) in enumerate(zip(tweets,lin_clf_char.predict(d_tweet_char))):
    print cls, " --- ", tweet
    if counter==50:
        break

(886, 336566)
Yhteiskunta  ---  RT @zeekends: wcw babe ; isha asli sofia rye vanessa 👅
Yhteiskunta  ---  DaanLuyten #TilItHappensToYou #BestMovieSong #iHeartAwards
Yhteiskunta  ---  Mulla on kangasväriä farkuissa rip 😢😢 https://t.co/67iudMBfuj
Yhteiskunta  ---  @MaayronFerreira IJAEJIOEJIOEAJOIEAJI
Yhteiskunta  ---  #NowPlaying BFC-radio (@BFC_radio) https://t.co/xYZVlndWyG … #Erdioo
Yhteiskunta  ---  omfg esQUEJ MEESTOYJ
Yhteiskunta  ---  Meikä oli jo hetken pitkäperjantaissa. Ja nyt on vasta kiiraskeskiviikko. Päivät sekoo kun on näitä pyhiä.
Yhteiskunta  ---  [22:59:10] 118.113.52.162:4384 &gt;&gt; :1433 (TCP:SYN)
Työ ja opiskelu  ---  Oho tukkani on ekaa kertaa vuosiin mitassa jossa se alkaa aaltoilla ellen kampaa sitä suoraksi suihkun jälkeen. Hassua.
Yhteiskunta  ---  #vibrator adulttoys #sextoys https://t.co/dk83khR6lo
Yhteiskunta  ---  Tänään osui Hip Hop ja Rap YouTube Video Suomessa.「Cheek」's 『Kuka Muu Muka』 https://t.co/pQQ8kJTWiL
Yhteiskunta  ---  [23:00:26] 125.123.234.198

# 8-O

Oh good lord - Twitter is such crap! [pulling hair 1AM the night before the lecture] Let's try to apply some of our newly acquired skills to recover. :| How about we run the tweets through the parser, check the lemmas against the most frequent Finnish vocabulary, and only keep the tweets where most words are known?

In [10]:
import lwvlib
wv=lwvlib.load("pb34_lemma_200_v2.bin",70000,70000)

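def read_conllu(inp):
    """Read parser output in CoNLL-U format, yield each tweet as a list of lemmas"""
    tweet=[] #list of lemmas
    for line in inp:
        line=line.strip().replace(u"#",u"")
        if not line: #empty line = end of tweet
            if tweet:
                yield tweet
            tweet=[]
        else:
            tweet.append(line.split(u"\t")[2]) #third column is the lemma
    if tweet: #last tweet, if the file does not end with an empty line
        yield tweet

wrdre=re.compile(u"^[a-zäöå-]+$")
def known_words(tweet):
    """Count lemmas which are in the vocabulary and have only lowercase letters and hyphens"""
    return sum(1 for word in tweet if word in wv.words and wrdre.match(word))

tweets=[]
with codecs.open("fin_tweets.conllu","r","utf-8") as f:
    for tweet in read_conllu(f):
        if float(known_words(tweet))/len(tweet)>0.7: #keep if over 70% of the lemmas are known
            tweets.append(u" ".join(tweet).replace(u"#",u"|"))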
for t in tweets[:10]:
    print t



oho tukka olla eka kerta vuosi mitta joka se alkaa aaltoilla josei kammata se suora suihku jälkeen . hassu .
@kuningaskulutta olla olla turhauttaa joutua siirtää päivittäinen raha-asia laina vuoksi . jotenkin tykätä , kun olla erillään .
@maijalarmo tuoda Felix uusi korkki olla ihan ykkönen
ei haluu liikkua , pitää mennä kauppa mut ulkona sata lumi ja mä olla ruokakooma päällä ugh
@BornForFiNRS mä ei ärsyttää vielä koska vetää just pussi fanipaloi ja nyt sattua maha lol
@RenneKorppila vai sellanen kaveri . mä ei toisaalta mikään ihme ettäei olla koskaan kuulla ko . tyyppi .
@Kinukki sanoma olla selvä , että ei uskoa olla väärä kun arvella sinä kertoa tämä itse ,olethan aikuinen ..
@Nysses ei . olla ilo huomata että minä @jysk_fi tämä tapahtua päivittäin ja asiakaspalvelu olla kunnia-asia .
paitsi ain olla kiva nähä ämmii tappelees mut veikka tämä menoo vappun sata lumi
RT @SaaraHuttunen : mä haluta olla terve ja onnellinen . muu prioriteetti mä ei nyt olla . toki koulu ois kiva joskus 

In [11]:
d_tweet_char=tfidf_v_char.transform(tweets)
print d_tweet_char.shape
for counter,(tweet, cls) in enumerate(zip(tweets,lin_clf_char.predict(d_tweet_char))):
    print cls, " --- ", tweet
    print
    if counter==50:
        break

(46, 336566)
Koti ja rakentaminen  ---  oho tukka olla eka kerta vuosi mitta joka se alkaa aaltoilla josei kammata se suora suihku jälkeen . hassu .

Yhteiskunta  ---  @kuningaskulutta olla olla turhauttaa joutua siirtää päivittäinen raha-asia laina vuoksi . jotenkin tykätä , kun olla erillään .

Paikkakunnat  ---  @maijalarmo tuoda Felix uusi korkki olla ihan ykkönen

Suhteet  ---  ei haluu liikkua , pitää mennä kauppa mut ulkona sata lumi ja mä olla ruokakooma päällä ugh

Suhteet  ---  @BornForFiNRS mä ei ärsyttää vielä koska vetää just pussi fanipaloi ja nyt sattua maha lol

Yhteiskunta  ---  @RenneKorppila vai sellanen kaveri . mä ei toisaalta mikään ihme ettäei olla koskaan kuulla ko . tyyppi .

Suhteet  ---  @Kinukki sanoma olla selvä , että ei uskoa olla väärä kun arvella sinä kertoa tämä itse ,olethan aikuinen ..

Yhteiskunta  ---  @Nysses ei . olla ilo huomata että minä @jysk_fi tämä tapahtua päivittäin ja asiakaspalvelu olla kunnia-asia .

Yhteiskunta  ---  paitsi ain olla ki

# And how about the vectors, do they help any?


In [12]:
import lwvlib
import numpy
import sklearn.preprocessing
wv=lwvlib.load("pb34_lemma_200_v2.bin",50000,50000)

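def doc2vec(txt,wv,i,data_matrix):
    """Add up the vectors of all known words in txt into one row of data_matrix
    txt - text, tokenized on whitespace
    wv - word vector model loaded with lwvlib
    i - which row are we filling
    data_matrix - and to where?"""
    for w in txt.split():
        w=w.lower()
        dim=wv.get(w) #index of the word in wv.vectors, None if unknown
        if dim is None:
            continue
        data_matrix[i]+=wv.vectors[dim]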

#topics,texts
data_matrix=numpy.zeros((len(texts),wv.vectors.shape[1]))
for i,txt in enumerate(texts):
    doc2vec(txt,wv,i,data_matrix)
sklearn.preprocessing.normalize(data_matrix,copy=False)
    



array([[-0.10225586,  0.08287574,  0.04128639, ...,  0.01366414,
         0.14373324,  0.17274647],
       [-0.05364433, -0.01276515,  0.0603284 , ...,  0.00620707,
         0.19762445,  0.13600904],
       [-0.06920946,  0.01818488,  0.03807921, ...,  0.02176174,
         0.13039736,  0.17019013],
       ..., 
       [-0.07167699,  0.06385985,  0.05177857, ..., -0.00953654,
         0.17387475,  0.13991461],
       [-0.06337436,  0.04784775,  0.04624651, ...,  0.03932344,
         0.09021433,  0.12855842],
       [-0.10096667,  0.06961406,  0.01622496, ..., -0.01524248,
         0.14144494,  0.164709  ]])
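Each row of the matrix above is the sum of the vectors of the known words in one document, scaled to unit length. A toy sketch of the same idea, assuming Python 3 and numpy; the two-dimensional `toy_vectors` are invented stand-ins for the real word vectors, and `doc_vector` is a hypothetical helper, not part of lwvlib.

```python
import numpy as np

# Invented 2-dimensional "word vectors" for illustration only
toy_vectors = {"kissa": np.array([1.0, 0.0]),
               "koira": np.array([0.0, 1.0])}

def doc_vector(text, vectors, dim=2):
    """Sum the vectors of known words, then scale to unit (L2) length."""
    total = np.zeros(dim)
    for w in text.lower().split():
        if w in vectors:  # unknown words are simply skipped
            total += vectors[w]
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

v = doc_vector("Kissa koira kissa tuntematon", toy_vectors)
print(v)
```

Word order and unknown words are lost, and after normalization long and short documents end up on the same scale.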

In [13]:
cls=sklearn.svm.LinearSVC(C=1.0)
cls.fit(data_matrix[:10000],topics[:10000])

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)

In [14]:
cls.score(data_matrix[10000:],topics[10000:])

0.59315124337545866

In [15]:
# maybe we could try with some nonlinear stuff
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
import keras.optimizers
import keras.utils.np_utils

def class2id(topics):
    d={}
    nums=[]
    for t in topics:
        nums.append(d.setdefault(t,len(d)))
    return nums,d

topic_numbers,class_dict=class2id(topics)
topic_numbers_matrix=keras.utils.np_utils.to_categorical(topic_numbers)
dim_in,dim_internal,dim_out=data_matrix.shape[1],200,len(class_dict)

print dim_in, dim_internal,dim_out

#Neural network:
model = Sequential()
#Non-linear layer #1
model.add(Dense(dim_internal, input_dim=dim_in))
model.add(Activation("tanh"))
model.add(Dropout(0.5))
#Non-linear layer #2
model.add(Dense(dim_internal))
model.add(Activation("tanh"))
model.add(Dropout(0.5))
#Linear projection to the classes, softmax turns the scores into probabilities
model.add(Dense(dim_out))
model.add(Activation("softmax"))

sgd = keras.optimizers.SGD(lr=0.05, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',optimizer=sgd,class_mode='categorical')
#Learn!
model.fit(data_matrix[:10000],topic_numbers_matrix[:10000],verbose=2,batch_size=200,show_accuracy=True,validation_split=0.3)



Using Theano backend.


200 200 23
Train on 7000 samples, validate on 3000 samples
Epoch 1/100
0s - loss: 2.6129 - acc: 0.2510 - val_loss: 2.3120 - val_acc: 0.3457
Epoch 2/100
0s - loss: 2.3415 - acc: 0.3149 - val_loss: 2.1417 - val_acc: 0.3963
Epoch 3/100
0s - loss: 2.2013 - acc: 0.3466 - val_loss: 2.0366 - val_acc: 0.4387
Epoch 4/100
0s - loss: 2.0875 - acc: 0.3940 - val_loss: 1.9314 - val_acc: 0.4633
Epoch 5/100
0s - loss: 1.9935 - acc: 0.4280 - val_loss: 1.8639 - val_acc: 0.4880
Epoch 6/100
0s - loss: 1.9230 - acc: 0.4533 - val_loss: 1.8401 - val_acc: 0.4793
Epoch 7/100
1s - loss: 1.8661 - acc: 0.4650 - val_loss: 1.7977 - val_acc: 0.4993
Epoch 8/100
0s - loss: 1.8140 - acc: 0.4759 - val_loss: 1.7457 - val_acc: 0.5007
Epoch 9/100
1s - loss: 1.7796 - acc: 0.4880 - val_loss: 1.7327 - val_acc: 0.5083
Epoch 10/100
1s - loss: 1.7398 - acc: 0.4964 - val_loss: 1.6941 - val_acc: 0.5173
Epoch 11/100
0s - loss: 1.7191 - acc: 0.5017 - val_loss: 1.7063 - val_acc: 0.5160
Epoch 12/100
0s - loss: 1.7055 - acc: 0.5049 - v

<keras.callbacks.History at 0x14376f90>

In [16]:
import sklearn.metrics

sklearn.metrics.accuracy_score(topic_numbers[10000:],model.predict_classes(data_matrix[10000:]))




0.58622095393395846