# SVM for POS Tagging

In this exercise, we implement a POS tagger with an SVM classifier: it sweeps each sentence left to right and assigns every word a part-of-speech tag. We use the scikit-learn library.

**training data**: wsj00-18.tag

**testing data**: wsj22-24.tag

Please **output** a file pos.guess that is identical to wsj22-24.tag, except that each line gains a new last field containing the tag predicted by your classifier.

**Hint**:

The focus of the task is on feature engineering. The following features are good to start with:

- word form
- suffix
- whether it is a digit or not
- initial letter capitalized or not
- word on left
- word on right
- previous tag
- ...
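
One way to encode the features listed above is a dictionary whose keys carry a prefix naming the feature type, so that the same string seen as the current word and as the previous word maps to two different features. The following is only a minimal sketch: the function name `sketch_features`, the key prefixes, and the suffix length limit of 3 are illustrative assumptions, not part of the exercise.

```python
def sketch_features(words, i, prev_tag):
    """Build a feature dict for words[i]; prev_tag is the tag given to words[i-1]."""
    word = words[i]
    feats = {
        'w=' + word.lower(): 1,                                        # word form
        'prev_w=' + (words[i - 1] if i > 0 else '<s>'): 1,             # word on left
        'next_w=' + (words[i + 1] if i + 1 < len(words) else '</s>'): 1,  # word on right
        'prev_t=' + prev_tag: 1,                                       # previous tag
    }
    # Suffixes of length 1 to 3 (the limit of 3 is an arbitrary choice)
    for k in range(1, min(3, len(word)) + 1):
        feats['suf=' + word[-k:]] = 1
    if word[0].isupper():
        feats['cap'] = 1
    if word.isdigit():
        feats['digit'] = 1
    return feats

# Features for the token '5' in a toy sentence, with 'VBD' as the previous tag
example = sketch_features(['Prices', 'rose', '5', 'percent'], 2, 'VBD')
```

Because every key names its feature type, the resulting dictionary can be inspected directly when debugging a feature set.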

## Example solution

In [20]:
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm

In [21]:
def readconll(file):
    # Read all lines, closing the file when done
    with open(file) as f:
        lines = [line.strip() for line in f]
    while lines[-1] == '':  # Remove trailing empty lines
        lines.pop()
    # Quick split of the corpus into sentences (assumes no token contains '_')
    s = [x.split('_') for x in '_'.join(lines).split('__')]
    return [[y.split() for y in x] for x in s]
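
The reader above implies a file layout of one whitespace-separated `word tag` pair per line, with a blank line between sentences. As a hedged alternative sketch, the same grouping can be done with `itertools.groupby`, which does not depend on the underscore character being absent from the data. The name `read_tagged` and the sample lines are hypothetical.

```python
from itertools import groupby

def read_tagged(lines):
    """Group token lines into sentences; blank lines act as sentence separators."""
    stripped = (line.strip() for line in lines)
    # bool('') is False, so groupby alternates between blank and non-blank runs
    return [[tok.split() for tok in group]
            for nonblank, group in groupby(stripped, key=bool) if nonblank]

# In-memory stand-in for a tagged file
sample = ["The DT\n", "dog NN\n", "\n", "Stop VB\n", "\n"]
parsed = read_tagged(sample)
```

Here `parsed` holds two sentences, each a list of `[word, tag]` pairs, which is the same shape that `readconll` returns.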

In [22]:
sentences = readconll('wsj00-18.tag')

In [23]:
def FeatureExtract(word, history, future):
    # history: (word, tag) pairs to the left; future: words to the right

    # current word in lower case, previous word, previous tag, next word
    feature_dict = {word.lower():1, history[-1][0]:1, history[-1][1]:1, future[0]:1}

    # first letter capitalized or not
    if word[0].isupper():
        feature_dict['Upper1'] = 1

    # suffixes of every length up to len(word) - 1
    for i in range(1, len(word)):
        feature_dict['suf-'+word[-i:]] = 1

    # word consists only of digits
    if word.isdigit():
        feature_dict['isdigit'] = 1

    return feature_dict

In [24]:
vectorizer = DictVectorizer(sparse=True)
clf = svm.LinearSVC()
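
To make the role of these two objects concrete, here is a small sketch on toy data (the feature dicts and tags are invented for illustration). `DictVectorizer` assigns one column to each distinct key seen during `fit_transform`, and `LinearSVC` is then trained on the resulting sparse matrix.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy feature dicts and tags, invented for this illustration
toy_features = [{'the': 1, 'suf-e': 1}, {'dog': 1, 'suf-g': 1}, {'runs': 1, 'suf-s': 1}]
toy_tags = ['DT', 'NN', 'VBZ']

toy_vec = DictVectorizer(sparse=True)
X_toy = toy_vec.fit_transform(toy_features)  # sparse matrix, one column per distinct key
toy_clf = LinearSVC().fit(X_toy, toy_tags)

# Keys that were not seen during fitting are ignored by transform
X_new = toy_vec.transform([{'dog': 1, 'unseen-key': 1}])
guess = toy_clf.predict(X_new)[0]
```

Note that `transform` silently drops unknown keys, so a test-time word that never occurred in training contributes nothing through its word-form feature; only its other features, such as suffixes, remain.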

In [25]:
#Training 

features = []
classes = []
for sentence in sentences:
    for i, (w, c) in enumerate(sentence):
        if i == 0:
            history = [['<s>', '<S>']]
        else:
            history = sentence[:i]
        if i == len(sentence) -1:
            future = ['</s>']
        else:
            future = [word for word, tag in sentence][i+1:]
        classes.append(c)
        features.append(FeatureExtract(w, history, future))
        
X = vectorizer.fit_transform(features)
clf.fit(X, classes)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [26]:
#Testing

fw = open('pos.guess', 'w')
test_sentences = readconll('wsj22-24.tag')

gc = 0.0
cc = 0.0
for sentence in test_sentences:
    testdata = [word for word, tag in sentence]
    guesstags = []
    for i, w in enumerate(testdata):
        if i == 0:
            history = [('<s>', '<S>')]
        else:
            history = list(zip(testdata[:i], guesstags))
            #print (history)
        if i == len(testdata) -1:
            future = ['</s>']
        else:
            future = testdata[i+1:]
        test_features = vectorizer.transform([FeatureExtract(w, history, future)])
        guesstag = clf.predict(test_features)[0]
        guesstags.append(guesstag) 
    
    
    for (w, t), g in zip(sentence, guesstags):
        if t == g:
            cc += 1
        gc += 1
        fw.write(w+'\t'+t+'\t'+g+'\n')
    fw.write('\n') 
fw.close()

print ("Number of words correctly tagged: ", cc)
print ("Number of all words tagged: ", gc)
print ("Accuracy: ", round(cc*100/gc, 4))

Number of words correctly tagged:  125052.0
Number of all words tagged:  129654.0
Accuracy:  96.4506
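
Beyond the overall accuracy, the output file can be used for error analysis. The sketch below counts (gold tag, guessed tag) pairs from lines in the `word<TAB>gold<TAB>guess` layout written above; the function name `tag_confusions` and the sample lines are hypothetical, and real use would pass the lines of pos.guess instead.

```python
from collections import Counter

def tag_confusions(lines):
    """Count (gold, guess) pairs over tab-separated lines, skipping blank lines."""
    pairs = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) == 3:
            _, gold, guess = fields
            pairs[(gold, guess)] += 1
    return pairs

# In-memory stand-in for the lines of the output file
sample = ["The\tDT\tDT\n", "back\tNN\tRB\n", "\n", "door\tNN\tNN\n"]
pairs = tag_confusions(sample)

correct = sum(n for (gold, guess), n in pairs.items() if gold == guess)
accuracy = correct / sum(pairs.values())
# Mismatched pairs, most frequent first
errors = [(p, n) for p, n in pairs.most_common() if p[0] != p[1]]
```

Sorting the mismatched pairs by frequency shows which tag confusions dominate, which can suggest what features to add next.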
