# The whys of classifiers and regularization

* A linear SVM really just learns a weight vector, with one weight for every feature
* Test for being above the hyperplane (positive class): $ \vec{w}\cdot \vec{x} + b \gt 0 $, where $ \vec{w} $ is the learned weight vector, $ \vec{x} $ is the feature vector you are classifying, and $ b $ is a bias.
* So we just sum the elementwise products of the features and the learned weights, and add the bias
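
The decision rule above can be sketched in a few lines of plain Python. The weights, bias, and feature values here are made-up numbers for illustration, not values learned from the movie review data, and the function names are my own.

```python
def decision_value(w, x, b):
    """Return w . x + b for two equal-length sequences of numbers."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def classify(w, x, b):
    """Return 1 (positive class) if the point lies above the hyperplane, else 0."""
    return 1 if decision_value(w, x, b) > 0 else 0

# Hypothetical weights for three features and a hypothetical bias
w = [2.0, -3.0, 0.5]
b = -0.25

print(classify(w, [0.5, 0.0, 0.0], b))  # 2.0*0.5 - 0.25 = 0.75, which is > 0
print(classify(w, [0.0, 0.5, 0.5], b))  # -1.5 + 0.25 - 0.25 = -1.5, which is not > 0
```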

## Most discriminative features

The magnitude of a weight reflects how important the feature is for the classification, and its sign tells which class the feature speaks for. Let's see for ourselves on the movie review data from the last demos.


In [1]:
# Read data in (these are one review per file)
import glob
import codecs

def read_dir(dir_name):
    texts=[]
    for f_name in sorted(glob.glob(dir_name+"/*.txt")):
        with codecs.open(f_name,"r","utf-8") as f:
            texts.append(f.read())
    return texts

train_pos_txt=read_dir("imdb/train/pos")
train_neg_txt=read_dir("imdb/train/neg")
test_pos_txt=read_dir("imdb/test/pos")
test_neg_txt=read_dir("imdb/test/neg")

print len(train_pos_txt), "positive and", len(train_neg_txt), "negative reviews"


12500 positive and 12500 negative reviews


In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

tfidf_v=TfidfVectorizer(sublinear_tf=True) #sublinear_tf dampens the tf scores: each count tf is replaced with 1+log(tf)
d=tfidf_v.fit_transform(train_pos_txt+train_neg_txt)
d_test=tfidf_v.transform(test_pos_txt+test_neg_txt)
lin_c=LinearSVC(C=1.0) #I tried and this is a good C value
lin_c.fit(d,[1]*len(train_pos_txt)+[0]*len(train_neg_txt))
#the learned feature weights are in lin_c.coef_[0]
f_names=tfidf_v.get_feature_names()
sorted_by_weight=sorted(zip(lin_c.coef_[0],f_names))
for f_weight,f_name in sorted_by_weight[:30]:
    print f_name, f_weight
print "------------------------"
for f_weight,f_name in sorted_by_weight[-30:]:
    print f_name, f_weight


worst -5.04284864046
waste -3.84302809294
awful -3.66709733104
boring -3.30642775532
bad -3.29585913666
disappointment -3.22802725648
poorly -3.04569604489
poor -2.94538722211
disappointing -2.86117762915
worse -2.72027362264
fails -2.64784124737
terrible -2.64249575048
lacks -2.6293188406
mess -2.5912934151
dull -2.50818415226
nothing -2.45162316557
unfortunately -2.3782031781
horrible -2.35114065677
annoying -2.30886533656
save -2.29264857263
avoid -2.16329662223
laughable -2.16206626355
ridiculous -2.15838484198
unfunny -2.08066875969
weak -2.05548219122
forgettable -2.03245698378
badly -2.03199956277
supposed -2.02040781557
lame -2.0079412708
lousy -1.94495953337
------------------------
love 1.76886158776
carrey 1.7814144629
liked 1.78228883306
subtle 1.83466969424
well 1.85343988421
appreciated 1.86413443733
incredible 1.87133672818
job 1.90344759777
rare 1.91900092273
enjoyed 1.92269534387
brilliant 1.94120781876
it 1.94821522476
perfectly 1.95311731401
funniest 1.95993067312
fu

# Regularization

Key concepts (explained during the lecture; look them up if you were absent):
* Regularization
  * Need to avoid overfitting training examples
  * Keeping the magnitude of the SVM weight vector (hyperplane) small means that spurious individual features cannot be given very high weights to overfit to your training data
  * In other words, the weights must be kept sane
* L1 and L2 regularization
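  * L1: the sum of the absolute values of the weights is kept small
  * L2: the sum of the squares of the weights is kept small
* In the figure below, the point marked on the line on the left minimizes the L1 norm, and the point marked on the line on the right minimizes the L2 norm. $ w_2 $ is exactly zero for the L1 solution but not for the L2 solution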
  
<img src="https://upload.wikimedia.org/wikipedia/en/f/fd/L1_and_L2_balls.jpg" />
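
A small numeric sketch of the same picture, in plain Python. The line $ 2w_1 + w_2 = 2 $ is an arbitrary choice of mine standing in for the line in the figure, and the helper names are hypothetical. We scan points on the line and pick the ones with the smallest L1 and L2 norms.

```python
# Hypothetical line of candidate solutions: 2*w1 + w2 = 2, i.e. w2 = 2 - 2*w1
def point_on_line(w1):
    return (w1, 2.0 - 2.0 * w1)

def l1_norm(w):
    """Sum of absolute values of the weights."""
    return sum(abs(w_i) for w_i in w)

def l2_norm_squared(w):
    """Sum of squares of the weights (same minimizer as the L2 norm itself)."""
    return sum(w_i * w_i for w_i in w)

# Scan w1 from -1.0 to 2.0 in steps of 0.001
candidates = [point_on_line(i / 1000.0) for i in range(-1000, 2001)]

best_l1 = min(candidates, key=l1_norm)
best_l2 = min(candidates, key=l2_norm_squared)

print(best_l1)  # point on the line with the smallest L1 norm
print(best_l2)  # point on the line with the smallest L2 norm
```

On this grid the L1 winner is (1.0, 0.0), so $ w_2 $ is exactly zero, while the L2 winner is close to (0.8, 0.4), with both weights non-zero.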

# Feature selection with L1

As you have seen, unlike L2, L1 regularization tends to drive feature weights down to exactly zero. It is therefore very useful for in/out feature selection: a feature is in if its weight is non-zero and out otherwise. Let us try with the same IMDB dataset.
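
As a reminder of what the `C` parameter controls: to my understanding, `LinearSVC` with its default squared hinge loss minimizes roughly the objectives below. This is a sketch, written with labels $ y_i \in \{-1,+1\} $ and glossing over how the library treats the bias internally.

$$
\text{L2:}\quad \min_{\vec{w},\,b}\ \frac{1}{2}\sum_j w_j^2 \;+\; C\sum_i \max\left(0,\ 1 - y_i(\vec{w}\cdot\vec{x}_i + b)\right)^2
$$

$$
\text{L1:}\quad \min_{\vec{w},\,b}\ \sum_j |w_j| \;+\; C\sum_i \max\left(0,\ 1 - y_i(\vec{w}\cdot\vec{x}_i + b)\right)^2
$$

A small $ C $ lets the penalty term dominate, so the weights are pushed hard towards zero. A large $ C $ lets the classifier fit the training examples more closely.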

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy

tfidf_v=TfidfVectorizer(sublinear_tf=True)
d=tfidf_v.fit_transform(train_pos_txt+train_neg_txt)
d_test=tfidf_v.transform(test_pos_txt+test_neg_txt)

# First with L2 which is the default
print "*********** L2 **********"
for C in (0.01,0.03,0.05,0.1,1.0,10):
    lin_c=LinearSVC(C=C)
    lin_c.fit(d,[1]*len(train_pos_txt)+[0]*len(train_neg_txt))
    print "C=%04.2f   Accuracy=%.2f%%   Non-zero weight features: %d"%(C,lin_c.score(d_test,[1]*len(test_pos_txt)+[0]*len(test_neg_txt))*100.0,numpy.count_nonzero(lin_c.coef_))
print 
print
print "*********** L1 **********"
for C in (0.01,0.03,0.05,0.1,1.0,10):
    lin_c=LinearSVC(penalty="l1",C=C,dual=False) #the L1 penalty requires dual=False (the dual solver does not support it)
    lin_c.fit(d,[1]*len(train_pos_txt)+[0]*len(train_neg_txt))
    print "C=%04.2f   Accuracy=%.2f%%   Non-zero weight features: %d"%(C, lin_c.score(d_test,[1]*len(test_pos_txt)+[0]*len(test_neg_txt))*100.0,numpy.count_nonzero(lin_c.coef_))



*********** L2 **********
C=0.01   Accuracy=86.10%   Non-zero weight features: 74824
C=0.03   Accuracy=87.98%   Non-zero weight features: 74032
C=0.05   Accuracy=88.68%   Non-zero weight features: 73088
C=0.10   Accuracy=88.92%   Non-zero weight features: 71238
C=1.00   Accuracy=88.01%   Non-zero weight features: 64477
C=10.00   Accuracy=85.80%   Non-zero weight features: 62097


*********** L1 **********
C=0.01   Accuracy=73.49%   Non-zero weight features: 21
C=0.03   Accuracy=81.63%   Non-zero weight features: 119
C=0.05   Accuracy=83.98%   Non-zero weight features: 209
C=0.10   Accuracy=86.44%   Non-zero weight features: 394
C=1.00   Accuracy=87.88%   Non-zero weight features: 3970
C=10.00   Accuracy=85.36%   Non-zero weight features: 6587


Looks like C=0.01 gives us only **21** active features. Let's look at them!

In [36]:
#Let's try this one
lin_c=LinearSVC(penalty="l1",C=0.01,dual=False)
lin_c.fit(d,[1]*len(train_pos_txt)+[0]*len(train_neg_txt))
feats=numpy.nonzero(lin_c.coef_) #Tuple of (row indices, column indices) of the non-zero coefs; feats[1] holds the feature indices
f_names=tfidf_v.get_feature_names()
for f_weight,f_name in sorted((lin_c.coef_[0][idx],f_names[idx]) for idx in feats[1]):
    print f_name, f_weight

bad -5.80960603215
worst -4.76305681117
waste -2.52218371608
no -2.29481501206
awful -1.87703501105
nothing -1.26514803954
boring -1.09520869636
even -0.680857894709
plot -0.494747346774
terrible -0.468445491958
minutes -0.442163097956
just -0.437272508868
poor -0.330544770183
stupid -0.122586460651
very 0.196431419642
well 0.265343065818
wonderful 0.610334460291
love 1.21605247426
excellent 1.56261628434
best 1.58308357965
great 3.78397235444


## ...not bad(?)

...and these 21 features are all the classifier ever uses to get 73% accuracy on the task (which, mind you, is not very good compared to the 88% we see otherwise and the 94% reported in the paper this data comes from).

In [38]:
#Let's try this one - 394 features
lin_c=LinearSVC(penalty="l1",C=0.1,dual=False)
lin_c.fit(d,[1]*len(train_pos_txt)+[0]*len(train_neg_txt))
feats=numpy.nonzero(lin_c.coef_) #Tuple of (row indices, column indices) of the non-zero coefs; feats[1] holds the feature indices
f_names=tfidf_v.get_feature_names()
for f_weight,f_name in sorted((lin_c.coef_[0][idx],f_names[idx]) for idx in feats[1]):
    print f_name, f_weight

worst -6.87349592774
waste -5.35435625783
awful -5.01860127605
bad -4.4911046273
boring -3.94362135886
poor -3.64374427939
poorly -3.550536664
dull -3.39664453478
fails -3.22196148183
terrible -3.10702704783
nothing -2.97105160271
mess -2.90293775497
unfortunately -2.89159274004
disappointing -2.8064114785
horrible -2.77848952548
annoying -2.72574521534
worse -2.67771457585
disappointment -2.67286381622
lame -2.48326262442
ridiculous -2.42347315866
supposed -2.4103855913
pointless -2.34258712999
no -2.31902773125
avoid -2.31359887562
stupid -2.25299710512
laughable -2.19388787459
oh -2.15841896888
script -2.14880225756
minutes -2.14077101761
save -2.08975151902
badly -2.0858420394
instead -2.06403385676
lacks -1.83856101787
unfunny -1.76695332212
unless -1.6771504156
crap -1.65092507076
pathetic -1.50644537582
looks -1.50542942059
just -1.47719166217
wasted -1.40232726332
even -1.37908439281
weak -1.34762895768
predictable -1.32278439817
plot -1.32116346684
attempt -1.30704859478
woode

...Let's also test the effect of the sublinear_tf option:

In [39]:
#Let me switch the sublinear_tf off - see how the common word features creep in
tfidf_v=TfidfVectorizer()
d=tfidf_v.fit_transform(train_pos_txt+train_neg_txt)
d_test=tfidf_v.transform(test_pos_txt+test_neg_txt)
lin_c=LinearSVC(penalty="l1",C=0.01,dual=False)
lin_c.fit(d,[1]*len(train_pos_txt)+[0]*len(train_neg_txt))
feats=numpy.nonzero(lin_c.coef_)
f_names=tfidf_v.get_feature_names()
for f_weight,f_name in sorted((lin_c.coef_[0][idx],f_names[idx]) for idx in feats[1]):
    print f_name, f_weight

bad -5.08123163107
worst -4.38981930799
no -2.34662691435
waste -2.02745498996
awful -1.39197174668
even -0.979188595952
nothing -0.912777476398
boring -0.891127772481
just -0.888671837571
was -0.541189528328
to -0.534307473047
they -0.514391761016
plot -0.411256591686
br -0.210465862914
movie -0.15522111619
terrible -0.143288441578
this -0.141724247277
minutes -0.126692192144
there -0.0272097939993
wonderful 0.0366115803371
his 0.0968399465279
as 0.119787864376
is 0.143553932311
well 0.249518189519
very 0.325843067457
excellent 0.990287868212
love 1.09597880933
best 1.35149402781
and 1.73153913202
great 3.5491986854
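
Words like "and", "to" and "this" occur many times in a single review, so with raw counts they get large feature values. A rough sketch of the scaling that `sublinear_tf=True` applies to each raw count, as the scikit-learn documentation describes it (tf is replaced with 1 + log(tf)). The helper name is my own, and the idf weighting and normalization that follow are left out.

```python
import math

def sublinear_tf(tf):
    """Replace a raw term count tf >= 1 with 1 + log(tf); a count of 0 stays 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# A word seen 100 times ends up only a few times heavier than a word seen once
for count in (1, 2, 10, 100):
    print(sublinear_tf(count))
```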
