<h2>SVM - StackOverflow Tags Dataset</h2>

<p>This is a dataset which contains question text and the corisponding tag to those questions. This example uses the top 20 tags as a classification task.</p>

<p><strong>Current best: 81.833.. %</strong></p>

First we import libaries. We will be using sklearn mainly for the classification functionailty.

In [3]:
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup

from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report



<p>Now we grap our dataset and convert it to be used by a classifyer.</p>
<p>Some of the cleaning steps include excluding characters that are not word-like and exlcuding stopwords</p>

In [5]:
# Used tags for classification
my_tags = ['java','html','asp.net','c#','ruby-on-rails','jquery','mysql','php','ios','javascript','python','c','css','android','iphone','sql','objective-c','c++','angularjs','.net']

# Read SO dataset
data = pd.read_csv('https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv')
data = data[pd.notnull(data.tags)]

STOPWORDS = set(stopwords.words('english'))
REG_SYMBOLS_TO_SPACE = re.compile('[/(){}\[\]\|@,;]')
REG_STRIPED_SYMBOLS = re.compile('[^0-9a-z #+_]')

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text
    text = REG_SYMBOLS_TO_SPACE.sub(' ', text)
    text = REG_STRIPED_SYMBOLS.sub('', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

data.post = data.post.apply(clean_text)

X = data.post
Y = data.tags

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 42)


<p>
First we create a count vector which converts to lowercase. It also includes both 1-gram and 2-gram features, this means a BOW feature extration that includes unquie words plus unquie word pairs. Then we apply tf-idf.
</p>

<p>
The classifyer uses a crosss-valdiation of 4 and a `linearSVC` (_N one-vs-rest SVMs, where N is the number of classes_) with hinge loss.
</p>

In [8]:
svm = Pipeline([('vect', CountVectorizer(lowercase=True, ngram_range=(1, 2))),
                ('tfidf', TfidfTransformer()),
                ('clf', CalibratedClassifierCV(LinearSVC(C=1, loss='hinge', class_weight='balanced'), cv=4))
               ])
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)

print('Accuracy: ' + str(accuracy_score(y_pred, y_test)))
print(classification_report(y_test, y_pred,target_names=my_tags))

Accuracy: 0.8183333333333334
               precision    recall  f1-score   support

         java       0.78      0.66      0.72       613
         html       0.92      0.92      0.92       620
      asp.net       0.96      0.96      0.96       587
           c#       0.82      0.79      0.81       586
ruby-on-rails       0.83      0.86      0.84       599
       jquery       0.62      0.65      0.63       589
        mysql       0.80      0.78      0.79       594
          php       0.84      0.90      0.87       610
          ios       0.74      0.74      0.74       617
   javascript       0.70      0.65      0.67       587
       python       0.71      0.70      0.70       611
            c       0.87      0.87      0.87       594
          css       0.80      0.80      0.80       619
      android       0.86      0.87      0.87       574
       iphone       0.85      0.84      0.84       584
          sql       0.71      0.69      0.70       578
  objective-c       0.86      0.88 