References:   
https://www.kaggle.com/alokmalik/text-classification-using-svm    
https://github.com/YangLinyi/SVM-CNN-RNN-HAN-Popular-NLP-Models    

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split #split data into train and test sets
from sklearn.feature_extraction.text import CountVectorizer #convert text comment into a numeric vector
from sklearn.feature_extraction.text import TfidfTransformer #use TF IDF transformer to change text vector created by count vectorizer
from sklearn.svm import SVC# Support Vector Machine
from sklearn.pipeline import Pipeline #pipeline to implement steps in series
from gensim import parsing # To stem data
from sklearn import metrics 

**Pre-defined Evaluation Functions**

In [2]:
def Get_Accuracy(y_true, y_pred): #Accuracy 准确率：分类器正确分类的样本数与总样本数之比 
    accuracy = metrics.accuracy_score(y_true,y_pred)  
    return accuracy

def Get_Precision_score(y_true, y_pred): #Precision：精准率 正确被预测的正样本(TP)占所有被预测为正样本(TP+FP)的比例. 
    precision = metrics.precision_score(y_true,y_pred)  
    return precision

def Get_Recall(y_true, y_pred): #Recall 召回率 正确被预测的正样本(TP)占所有真正 正样本(TP+FN)的比例.  
    Recall = metrics.recall_score(y_true,y_pred)  
    return Recall 
 
def Get_f1_score(y_true, y_pred): #F1-score: 精确率(precision)和召回率(Recall)的调和平均数  
    f1_score = metrics.f1_score(y_true,y_pred)  
    return f1_score

**Glove embedding**

In [3]:
# embeddings_dict = {}
# with open("GloVe/glove.6B.300d.txt", 'r') as f:
#     for line in f:
#         values = line.split()
#         word = values[0]
#         vector = np.asarray(values[1:], "float32")
#         embeddings_dict[word] = vector

**Read Data**

In [4]:
train_path = 'train.tsv'
test_path = 'test.tsv'

train_df = pd.read_csv(train_path, sep='\t')
test_df = pd.read_csv(test_path, sep='\t')

**Print dataframe to check the shape**  
Back part of this jupyter used some specific column names only for this specific dataset. So check your dataset to make sure use it in the correct shape.

In [5]:
train_df

Unnamed: 0,Sentiment,Text
0,Negative,"Long, boring, blasphemous. Never have I been s..."
1,Negative,Not good! Rent or buy the original! Watch this...
2,Negative,"This movie is so bad, it can only be compared ..."
3,Negative,"Spanish horrors are not bad at all, some are s..."
4,Negative,I've seen about 820 movies released between 19...
...,...,...
1702,Positive,Robert A. Heinlein's classic novel Starship Tr...
1703,Positive,"Well, I have finally caught up with ""Rock 'N' ..."
1704,Positive,Bacall does well here - especially considering...
1705,Positive,"Eddie Murphy plays Chandler Jarrell, a man who..."


In [6]:
test_df

Unnamed: 0,Sentiment,Text
0,Negative,"If you haven't seen this, it's terrible. It is..."
1,Negative,"being a NI supporter, it's hard to objectively..."
2,Negative,"Nine minutes of psychedelic, pulsating, often ..."
3,Negative,Really bad movie. Maybe the worst I've ever se...
4,Negative,I read the novel some years ago and I liked it...
...,...,...
483,Positive,Both Robert Duvall and Glenn Close played thei...
484,Positive,With various Bogdanoviches and Gazzaras scatte...
485,Positive,This is one of the best made movies from 2002....
486,Positive,"Im not usually a lover of musicals,but if i ha..."


**Cleansing label to numeric**

In [7]:
def clean_target(target_col):
    new_labels = []
    for each in target_col:
        if each == 'Negative':
            new_labels.append(0)
        elif each == 'Positive':
            new_labels.append(1)
    return new_labels
        
train_df['Sentiment'] = clean_target(train_df['Sentiment'])
test_df['Sentiment'] = clean_target(test_df['Sentiment'])      

**Parse**

In [8]:
#for grouping similar words such as 'trying" and "try" are same words
def parse(s):
    parsing.stem_text(s)
    return s

#applying parsing to comments.
for i in range(0,len(train_df)):
    train_df.iloc[i,1]=parse(train_df.iloc[i,1])


for i in range(0,len(test_df)):
    test_df.iloc[i,1]=parse(test_df.iloc[i,1])

**Split train and test**   
if you use only one data source

In [19]:
#Seperate data into feature and results
X, y = text, labels

#Split data in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

**Define feature and target**

In [8]:
X_train = train_df['Text']
y_train = train_df['Sentiment']
X_test = test_df['Text']
y_test = test_df['Sentiment']

**Build SVM and predict**

In [9]:
#Use pipeline to carry out steps in sequence with a single object
#SVM's rbf kernel gives highest accuracy in this classification problem.
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC(kernel='rbf'))])

#train model
text_clf.fit(X_train, y_train)


#predict class form test data 
predicted = text_clf.predict(X_test)

**Evaluation**

In [14]:
# f = open("result.txt","a")
accuracy = Get_Accuracy(y_test, predicted)
# f.write("SVM Accuracy_Score = %f",accuracy)
precision = Get_Precision_score(y_test, predicted)
# f.write("SVM Precision = %f",precision)
recall = Get_Recall(y_test, predicted)
# f.write("SVM Recall = %f",recall)
f1_score = Get_f1_score(y_test, predicted)
print("SVM evaluation Result: Accuracy {:.2%}  ".format(accuracy),\
      "Precision {:.2%}  ".format(precision),\
      "Recall {:.2%}  ".format(recall),\
      "F1-Score {:.2%}  ".format(f1_score))
# f.write("SVM F1-Score  = %f",f1_score)
# f.close()

SVM evaluation Result: Accuracy 84.84%   Precision 82.02%   Recall 89.39%   F1-Score 85.55%  


**With parsing**   
SVM Accuracy_Score 84.84%  
SVM Precision 82.02%  
SVM Recall 89.39%  
SVM F1-Score 85.55%   
**Without parsing**   
SVM Accuracy_Score 84.84%  
SVM Precision 82.02%  
SVM Recall 89.39%  
SVM F1-Score 85.55%  