# <h1><center><font color=blue>D</font>escubrimiento de <font color=blue>I</font>nteracciones que <font color=blue>I</font>mpactan el <font color=blue>A</font>prendizaje (<font color=blue>D</font><font color=blue>I</font><font color=blue>I</font><font color=blue>A</font>)</center></h1>
## <h2><center>Development of a software service for semantic patterns discovery impacting learning based on student's interaction in social networks</center></h2>

# Sentiment Analysis Component

This component performs sentiment analysis on text documents such as posts, comments and messages generated on formal and informal learning platforms. It predicts the negative, positive or neutral polarity of each document based on its lexical-syntactic structure.

Implementation notes:

- All the programs (files *.py) were implemented in Python 2.7

- The datasets used are part of a Spanish collection of tweets from TASS 2018 http://www.sepln.org/workshops/tass/

- The Python packages required to run the programs are the following:
  
     - NumPy (classification) http://www.numpy.org/
     - scikit-learn (classification) http://scikit-learn.org/stable/
     - NLTK (NLP techniques) https://www.nltk.org/
     - CLiPS pattern (NLP techniques) https://www.clips.uantwerpen.be/pattern

The different steps used to clean, pre-process, represent and classify the set of text documents are described as follows.

For the development of this component, the InterTASS Corpus (2017) was used. This corpus was created by the SEPLN (Spanish Society for Natural Language Processing) for TASS-2017: Workshop on Semantic Analysis at SEPLN.
It consists of tweets written in Spanish, each annotated with one of four polarity levels: positive (P), neutral (NEU), negative (N) and no sentiment (NONE). The corpus has three datasets:

- Training: 1008 tweets.
- Development: 506 tweets.
- Test: 1899 tweets.

Each of the three datasets is stored as an XML file. An example entry is shown below:

<tweet>
    #The integer representation of the unique identifier for this Tweet.
	<tweetid>768228580673347584</tweetid>
    #The user who posted the Tweet.
	<user>pacosan1269</user>
    #The UTF-8 text of the Tweet.
	<content>Oye @luciaskaa te pones a hablar con @josecs02 como en los viejos tiempos y no avisas  @JuanAlfonso251 @CrisDazGar</content>
    #The Tweet's timestamp.
	<date>2016-08-23 23:29:01</date>
    #The language in which the Tweet's content is written. The whole corpus is written in Spanish ("es").
	<lang>es</lang>
	<sentiment>
        #The polarity value attributed to the Tweet's content.
		<polarity><value>N</value></polarity>
	</sentiment>
</tweet>
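
The nesting of the polarity label can be illustrated with a short sketch. It is written for Python 3 (unlike the Python 2.7 cells of this notebook), and the `<tweets>` root element, the shortened tweet text and the `read_tweets` helper are assumptions made for the example, not part of the corpus documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-tweet sample that follows the layout above (annotations
# removed, text shortened); the <tweets> root element is an assumption.
SAMPLE = """<tweets>
  <tweet>
    <tweetid>768228580673347584</tweetid>
    <user>pacosan1269</user>
    <content>Oye @luciaskaa te pones a hablar</content>
    <date>2016-08-23 23:29:01</date>
    <lang>es</lang>
    <sentiment>
      <polarity><value>N</value></polarity>
    </sentiment>
  </tweet>
</tweets>"""


def read_tweets(xml_text):
    """Return one dict per <tweet> with its id, text and polarity label."""
    root = ET.fromstring(xml_text)
    tweets = []
    for tweet in root.iter("tweet"):
        tweets.append({
            "tweetid": tweet.findtext("tweetid"),
            "content": tweet.findtext("content"),
            # The label is nested three levels deep: sentiment/polarity/value.
            "polarity": tweet.findtext("sentiment/polarity/value"),
        })
    return tweets


tweets = read_tweets(SAMPLE)
print(tweets[0]["polarity"])
```

Using a path expression keeps the access readable compared with positional indexing into the child elements.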


## Pre-processing

### Parsing

When the original data is not in the desired format, the first step is parsing.

First, the original training XML document is parsed into a plain TXT file using Python's ElementTree module. For a training file, the sentiment tags (positive, negative and neutral) are extracted; tweets tagged *NONE* are discarded. The steps below describe how to achieve this:

Note: the characters "@-?@" are used as delimiters in the input and output files. 


- The content (text) and polarity of each document are obtained, except for documents with a "NONE" polarity value.

In [1]:
#Module to generate and parse XML documents.
import xml.etree.ElementTree as ET
tweetDic=[]
tree = ET.ElementTree(file="trainingOriginalFile.xml")
for x in tree.iter(tag='tweet'):
   tempDict={}
   for y in x:
      #Line breaks are replaced with spaces; an element without text yields an empty string.
      tempDict[y.tag]=(y.text or '').replace('\n',' ')
      if y.tag=="sentiment":
        if (y[0][0].text) != "NONE":
          tempDict[y.tag]=y[0][0].text
          tweetDic.append(tempDict)

- A TXT file that contains each document in one line is created.

In [2]:
#Module to transcode text between different representations.
import codecs
with codecs.open("training.txt","w","utf-8") as file:
    for x in tweetDic:
        file.write(x["tweetid"]+"@-?@"+
                   x["user"]+"@-?@"+
                   x["content"]+"@-?@"+
                   x["date"]+"@-?@"+
                   x["lang"]+"@-?@"+
                   x["sentiment"]+"\n")

In [3]:
#Module to transcode text between different representations.
import codecs
#First 5 Tweets in the parsed training document.
with codecs.open("training.txt","r","utf-8") as file:
    for x in range(5):
        print file.readline()

768213567418036224@-?@anahorxn@-?@@myendlesshazza a. que puto mal escribo  b. me sigo surrando help   3. ha quedado raro el "cómetelo" ahí JAJAJAJA@-?@2016-08-23 22:29:21@-?@es@-?@N

768212591105703936@-?@martitarey13@-?@@estherct209 jajajaja la tuya y la d mucha gente seguro!! Pero yo no puedo sin mi melena me muero @-?@2016-08-23 22:25:29@-?@es@-?@N

768221670255493120@-?@endlessmilerr@-?@Quiero mogollón a @AlbaBenito99 pero sobretodo por lo rápido que contesta a los wasaps @-?@2016-08-23 23:01:33@-?@es@-?@P

768221021300264964@-?@JunoWTFL@-?@Vale he visto la tia bebiendose su regla y me hs dado muchs grima @-?@2016-08-23 22:58:58@-?@es@-?@N

768220253730009091@-?@Alis_8496@-?@@Yulian_Poe @guillermoterry1 Ah. mucho más por supuesto! solo que lo incluyo. Me habías entendido mal @-?@2016-08-23 22:55:55@-?@es@-?@P
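
Each line of the parsed file can be turned back into named fields by splitting on the delimiter. The following is a minimal sketch for Python 3; the `FIELDS` tuple, the two helper functions and the shortened sample line are assumptions introduced for illustration.

```python
DELIMITER = "@-?@"
# Field order used when the parsed training file is written.
FIELDS = ("tweetid", "user", "content", "date", "lang", "sentiment")


def to_line(record):
    """Join the fields of one tweet into a single delimited line."""
    return DELIMITER.join(record[name] for name in FIELDS)


def from_line(line):
    """Split a delimited line back into a dict keyed by field name."""
    values = line.rstrip("\n").split(DELIMITER)
    return dict(zip(FIELDS, values))


# Hypothetical, shortened line in the same layout as the output above.
line = ("768221670255493120@-?@endlessmilerr@-?@Quiero mogollon a @AlbaBenito99"
        "@-?@2016-08-23 23:29:01@-?@es@-?@P\n")
record = from_line(line)
print(record["content"], record["sentiment"])
```

The delimiter works because the sequence `@-?@` is unlikely to occur in a tweet, whereas commas or tabs could appear in the text itself.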



The test dataset follows the same steps: the original XML file is parsed into a plain TXT file using Python's ElementTree module. The only difference is that no _"sentiment"_ tag is extracted from a test file. Therefore:

In [4]:
#Module to generate and parse XML documents.
import xml.etree.ElementTree as ET
#Module to transcode text between different representations.
import codecs

tweetDic=[]

tree = ET.ElementTree(file="testOriginalFile.xml")
for x in tree.iter(tag='tweet'):
   tempDict={}
   for y in x:
      #Line breaks are replaced with spaces; an element without text yields an empty string.
      tempDict[y.tag]=(y.text or '').replace('\n',' ')
   tweetDic.append(tempDict)


with codecs.open("test.txt","w","utf-8") as file:
    for x in tweetDic:
        file.write(x["tweetid"]+"@-?@"+
               x["user"]+"@-?@"+
               x["content"]+"@-?@"+
               x["date"]+"@-?@"+
               x["lang"]+"\n")
               #No "sentiment" tag.

In [5]:
#Module to transcode text between different representations.
import codecs
#First 5 Tweets in the parsed test document.
with codecs.open("test.txt","r","utf-8") as file:
    for x in range(5):
        print file.readline()

770567971701940224@-?@wikimiscojones@-?@@LonelySoad mientras que no te pillen la primera semana que es cuando tienes la nariz como un payaso @-?@2016-08-30 10:24:55@-?@es

770503386789711872@-?@HLF_Metr4spt@-?@@ceemeese ya era hora de volver al csgo y dejares el padel bienvenida @-?@2016-08-30 06:08:17@-?@es

770502863017635840@-?@AVazquez_C@-?@@mireiaescribano justo cuando se terminan las fiestas de verano, me viene genial  @-?@2016-08-30 06:06:12@-?@es

770599972102348800@-?@minniecris@-?@@LuisMartinez22_ pensba q iba a hacer @wxplosive una reflexion profunda de las q me hace a mi pero @-?@2016-08-30 12:32:05@-?@es

770599962216390656@-?@VI_Lelouch@-?@@Vic_Phantomhive Si lo encuentro, sin compañeros y barato, me iría hasta ahora mismo @-?@2016-08-30 12:32:02@-?@es



### Cleaning

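Once the documents are in the desired format, the training and test datasets are cleaned: URLs are removed, the text is lowercased, diacritical marks are stripped, and every remaining character other than an ASCII letter or digit is deleted.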

- The diacritical marks are removed from the words.

In [6]:
#Module that provides access to the Unicode Character Database, which defines character properties for all Unicode characters.
import unicodedata

#Function that eliminates diacritical marks from Spanish words. 
def remove_accents(s):
    try:
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
    #An input that is not a unicode string, or that cannot be decoded, yields an empty string.
    except (TypeError, UnicodeDecodeError):
        return ""

- URLs are removed from each document, the text is lowercased, and the cleaned documents are written to a new file.

In [7]:
#Module to transcode text between different representations.
import codecs
#Regular expressions module to search within and change text using formal patterns.
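import re
#Module that provides system-specific configuration and operations (used in the error handling below).
import sys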
#Pattern.es is a web mining module with natural language processing (NLP) tools for Spanish.
#The parse() function annotates words in the given string with their part-of-speech (POS) tags.
#The split() function takes the output of parse() and returns a Text.
from pattern.es import parse, split
try:
    trainingList=[]
    with codecs.open("training.txt","r","utf-8") as file:
        for line in file:
            elementList=line.split("@-?@")
            #The URLs are removed by matching a regular expression pattern and replacing it with an empty string.
            TextWithoutURL=re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''',
                               '',elementList[2], flags=re.MULTILINE)
            elements=TextWithoutURL.split(" ")
            wordList=[]
            for x in elements:
                #Every character other than an ASCII letter or digit is removed by replacing it with an empty string.
                #Diacritical marks are removed through the remove_accents function.
                #Letters are lowercased.
                word= re.sub('[^A-Za-z0-9 ]+', '',remove_accents(x.lower()))
                #After the previous formatting, words are stored in a list.
                if len(word)>0:
                    wordList.append(word)
            #A TXT file containing each document in one line is generated.
            trainingList.append(elementList[0]+"@-?@"+
                                  elementList[1]+"@-?@"+
                                  " ".join(wordList)+"@-?@"+
                                  elementList[3]+"@-?@"+
                                  elementList[4]+"@-?@"+
                                  elementList[5])        

    with codecs.open("trainingClean.txt","w","utf-8") as file:
        for a in trainingList:
            file.write(a)

#Error handling.
except IOError as (errno, strerror):
    print "Input/output error ({0}): {1}".format(errno, strerror)

except:
    print "Unexpected error:", sys.exc_info()[0] 

In [8]:
#Module to transcode text between different representations.
import codecs
#First 5 Tweets in the clean training document.
with codecs.open("trainingClean.txt","r","utf-8") as file:
    for x in range(5):
        print file.readline()

768213567418036224@-?@anahorxn@-?@myendlesshazza a que puto mal escribo b me sigo surrando help 3 ha quedado raro el cometelo ahi jajajaja@-?@2016-08-23 22:29:21@-?@es@-?@N

768212591105703936@-?@martitarey13@-?@estherct209 jajajaja la tuya y la d mucha gente seguro pero yo no puedo sin mi melena me muero@-?@2016-08-23 22:25:29@-?@es@-?@N

768221670255493120@-?@endlessmilerr@-?@quiero mogollon a albabenito99 pero sobretodo por lo rapido que contesta a los wasaps@-?@2016-08-23 23:01:33@-?@es@-?@P

768221021300264964@-?@JunoWTFL@-?@vale he visto la tia bebiendose su regla y me hs dado muchs grima@-?@2016-08-23 22:58:58@-?@es@-?@N

768220253730009091@-?@Alis_8496@-?@yulianpoe guillermoterry1 ah mucho mas por supuesto solo que lo incluyo me habias entendido mal@-?@2016-08-23 22:55:55@-?@es@-?@P
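
The cleaning steps above can be condensed into a single function. This is a sketch for Python 3 under stated assumptions: `URL_PATTERN` is a deliberately simplified pattern rather than the longer expression used in the notebook, and `strip_accents` and `clean_text` are names chosen for the example.

```python
import re
import unicodedata

# Simplified URL pattern used only in this sketch (an assumption).
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Anything that is not a lowercase ASCII letter or a digit.
NON_ALNUM = re.compile(r"[^a-z0-9]+")


def strip_accents(text):
    """Drop combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def clean_text(text):
    """Remove URLs, lowercase, strip accents, keep only ASCII letters and digits."""
    without_urls = URL_PATTERN.sub("", text)
    words = []
    for token in without_urls.split(" "):
        word = NON_ALNUM.sub("", strip_accents(token.lower()))
        # Tokens that become empty (pure punctuation, removed URLs) are skipped.
        if word:
            words.append(word)
    return " ".join(words)


print(clean_text('ha quedado raro el "cómetelo" ahí JAJAJAJA http://example.com/x'))
```

Note that stripping combining marks also turns "ñ" into "n", which matches the cleaned output shown above.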



The test dataset is cleaned following the same steps:

In [9]:
#Module to transcode text between different representations.
import codecs
#Regular expressions module to search within and change text using formal patterns.
import re
#Module that provides access to the Unicode Character Database, which defines character properties for all Unicode characters.
import unicodedata
#Module that provides system-specific configuration and operations (used in the error handling below).
import sys

#Function that eliminates diacritical marks from Spanish words. 
def remove_accents(s):
    try:
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
    #An input that is not a unicode string, or that cannot be decoded, yields an empty string.
    except (TypeError, UnicodeDecodeError):
        return ""
try:
    testList=[]
    with codecs.open("test.txt","r","utf-8") as file:
        for line in file:
            elementList=line.split("@-?@")
            #The URLs are removed by matching a regular expression pattern and replacing it with an empty string.
            TextWithoutURL=re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''',
                               '',elementList[2], flags=re.MULTILINE)
            elements=TextWithoutURL.split(" ")
            wordList=[]
            for x in elements:
                #Every character other than an ASCII letter or digit is removed by replacing it with an empty string.
                #Diacritical marks are removed through the remove_accents function.
                #Letters are lowercased.
                word= re.sub('[^A-Za-z0-9 ]+', '',remove_accents(x.lower()))
                #After the previous formatting, words are stored in a list.
                if len(word)>0:
                    wordList.append(word)
            #A TXT file containing each document in one line is generated.
            testList.append(elementList[0]+"@-?@"+
                                  elementList[1]+"@-?@"+
                                  " ".join(wordList)+"@-?@"+
                                  elementList[3]+"@-?@"+
                                  elementList[4])
                                  #No "sentiment" tag.

    with codecs.open("testClean.txt","w","utf-8") as file:
         for a in testList:
             file.write(a)

#Error handling.             
except IOError as (errno, strerror):
    print "Input/output error ({0}): {1}".format(errno, strerror)

except:
    print "Unexpected error:", sys.exc_info()[0]

In [10]:
#Module to transcode text between different representations.
import codecs
#First 5 Tweets in the clean test document.
with codecs.open("testClean.txt","r","utf-8") as file:
    for x in range(5):
        print file.readline()

770567971701940224@-?@wikimiscojones@-?@lonelysoad mientras que no te pillen la primera semana que es cuando tienes la nariz como un payaso@-?@2016-08-30 10:24:55@-?@es

770503386789711872@-?@HLF_Metr4spt@-?@ceemeese ya era hora de volver al csgo y dejares el padel bienvenida@-?@2016-08-30 06:08:17@-?@es

770502863017635840@-?@AVazquez_C@-?@mireiaescribano justo cuando se terminan las fiestas de verano me viene genial@-?@2016-08-30 06:06:12@-?@es

770599972102348800@-?@minniecris@-?@luismartinez22 pensba q iba a hacer wxplosive una reflexion profunda de las q me hace a mi pero@-?@2016-08-30 12:32:05@-?@es

770599962216390656@-?@VI_Lelouch@-?@vicphantomhive si lo encuentro sin companeros y barato me iria hasta ahora mismo@-?@2016-08-30 12:32:02@-?@es



## Feature Set

After pre-processing, the next step is to collect the textual features used to build the classification model. Here, the frequency of occurrence of N-grams is computed. N-grams are obtained with the ngrams function of the Natural Language Toolkit (NLTK). For this component, trigrams are used.

- The N-grams found in the text documents are collected together with their frequencies and written to a feature set file, ordered from most to least frequent.

In [11]:
#Module to transcode text between different representations.
import codecs
#Module that provides system-specific configuration and operations.
import sys
#Regular expressions module to search within and change text using formal patterns.
import re
#Module that serves as a functional interface to built-in operators.
import operator
#The Natural Language Toolkit is a module with natural language processing (NLP) resources and corpora.
import nltk
#Word_tokenize is a tokenizer that divides a string into a list of substrings.
from nltk import word_tokenize
#Ngrams returns the ngrams generated from a sequence of items
from nltk.util import ngrams

#Determine the N-gram window size.
NgramType=3

try:
    nGramFeatureSet={}
    with codecs.open("trainingClean.txt","r","utf-8") as file:
        for line in file:
            elementList=line.split("@-?@")
            #Words are tokenized.
            words = nltk.word_tokenize(elementList[2])
            #N-grams are obtained.
            nGrams=ngrams(words,NgramType)
            #N-grams are collected and their frequency is calculated.
            for nGram in nGrams:
                nGram=' '.join(e for e in nGram)   
                if nGram in nGramFeatureSet:
                    nGramFeatureSet[nGram]=nGramFeatureSet[nGram]+1
                else:
                    nGramFeatureSet[nGram]=1     
    #N-grams are sorted from most to least frequent.
    nGramFeatureSetSort = sorted(nGramFeatureSet.items(),
                                 key=operator.itemgetter(1),
                                 reverse=True)

    #A TXT file containing one N-gram and its frequency per line is generated.
    with codecs.open("feature3Gram.txt","w","utf-8") as file:
         for a in nGramFeatureSetSort:
             file.write(a[0]+"@-?@"+str(a[1])+"\n")

#Error handling.
except IOError as (errno, strerror):
    print "Input/output error ({0}): {1}".format(errno, strerror)

except:
    print "Unexpected error:", sys.exc_info()[0] 

In [12]:
#Module to transcode text between different representations.
import codecs
#First 10 training trigrams.
with codecs.open("feature3Gram.txt","r","utf-8") as file:
    for x in range(10):
        print file.readline()

es que es@-?@9

que no me@-?@7

no pasa nada@-?@5

y no me@-?@5

me voy a@-?@5

que tengas un@-?@4

va a ser@-?@4

no he podido@-?@4

a mi me@-?@4

es que no@-?@4
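
The counting step can also be sketched without NLTK. The example below is for Python 3 and rests on two assumptions: whitespace splitting stands in for `nltk.word_tokenize` (reasonable for already cleaned text), and the three sample documents are invented for the illustration.

```python
from collections import Counter


def count_ngrams(documents, n=3):
    """Count every word n-gram across an iterable of cleaned texts."""
    counts = Counter()
    for text in documents:
        # Assumption: whitespace splitting replaces nltk.word_tokenize here.
        words = text.split()
        # Zipping n shifted views of the word list yields consecutive n-grams.
        for gram in zip(*(words[i:] for i in range(n))):
            counts[" ".join(gram)] += 1
    return counts


# Hypothetical cleaned documents.
docs = [
    "es que es muy tarde",
    "yo creo que es que es asi",
    "no pasa nada",
]
counts = count_ngrams(docs)
# most_common() sorts from most to least frequent, like the feature file.
for gram, freq in counts.most_common(2):
    print(gram + "@-?@" + str(freq))
```

A `Counter` removes the need for the explicit "if key in dict" branch and provides the frequency ordering directly.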



## Vector Representation

The next step towards predicting polarity is to transform each training text into a vector holding the frequency of occurrence of the "N" most frequent N-grams. For this component, the 150 most frequent trigrams are used.

- The N most frequent N-grams are obtained. 
- Each text is transformed into a vector representation based on the frequency of occurrence of the selected features.

In [13]:
#Module to transcode text between different representations.
import codecs
#Regular expressions module to search within and change text using formal patterns.
import re
#Module that provides system-specific configuration and operations.
import sys
#Module that serves as a functional interface to built-in operators.
import operator
#The Natural Language Toolkit is a module with natural language processing (NLP) resources and corpora.
import nltk
#Word_tokenize is a tokenizer that divides a string into a list of substrings.
from nltk import word_tokenize
#Ngrams returns the ngrams generated from a sequence of items
from nltk.util import ngrams
#The Collections module includes container data types beyond the built-in types list, dict, and tuple.
#A Counter is a container that keeps track of how many times equivalent values are added.
from collections import Counter


#Determine the N-gram window size.
NgramType=3
#Determine the amount of features to select.
NgramNumber=150

try: 
    nGramFeatureSet=[]
    counter=0
    with codecs.open("feature3Gram.txt","r","utf-8") as file:
        #The "N" features are collected.
        for line in file:
            counter=counter+1
            if counter <= NgramNumber:
                elementsList=line.split("@-?@")
                nGramFeatureSet.append(elementsList[0])
            else:
                break

    vectorList=[]
    sentimentTags=[]
    with codecs.open("trainingClean.txt","r","utf-8") as file:
        for line in file:
            nGramList=[]
            vector=[]
            elementsList=line.split("@-?@")
            #Sentiment tags are collected.
            sentimentTags.append(elementsList[5])
            #The text is tokenized.
            words = nltk.word_tokenize(elementsList[2])
            #N-grams are obtained.
            nGrams=ngrams(words,NgramType)
            #N-grams are collected.
            for nGram in nGrams:
                nGram=' '.join(e for e in nGram)
                nGramList.append(nGram)
            #Texts are transformed into vectors.
            for nGram in nGramFeatureSet:
                vector.append(nGramList.count(nGram))    
            vectorList.append(vector)    
                   
    #A TXT file containing one vector per line, followed by its sentiment tag, is generated.
    with codecs.open("trainingVectors-3-150.txt","w","utf-8") as file:
         for x,y in zip(vectorList,sentimentTags):
             file.write(','.join(map(str, x))+","+y)

#Error handling.
except IOError as (errno, strerror):
    print "Input/output error ({0}): {1}".format(errno, strerror)

except:
    print "Unexpected error:", sys.exc_info()[0] 

The same actions are taken for the test documents to transform them into vector representations.

In [14]:
#Module to transcode text between different representations.
import codecs
#Regular expressions module to search within and change text using formal patterns.
import re
#Module that provides system-specific configuration and operations.
import sys
#Module that serves as a functional interface to built-in operators.
import operator
#The Natural Language Toolkit is a module with natural language processing (NLP) resources and corpora.
import nltk
#Word_tokenize is a tokenizer that divides a string into a list of substrings.
from nltk import word_tokenize
#Ngrams returns the ngrams generated from a sequence of items
from nltk.util import ngrams
#The Collections module includes container data types beyond the built-in types list, dict, and tuple.
#A Counter is a container that keeps track of how many times equivalent values are added.
from collections import Counter


#Determine the N-gram window size.
NgramType=3
#Determine the amount of features to select.
NgramNumber=150


try: 
    nGramFeatureSet=[]
    counter=0
    with codecs.open("feature3Gram.txt","r","utf-8") as file:
        #The "N" features are collected.
        for line in file:
            counter=counter+1
            if counter <= NgramNumber:
                elementsList=line.split("@-?@")
                nGramFeatureSet.append(elementsList[0])
            else:
                break

    vectorList=[]
    with codecs.open("testClean.txt","r","utf-8") as file:
        for line in file:
            nGramList=[]
            vector=[]
            elementsList=line.split("@-?@")
            #The text is tokenized.
            words = nltk.word_tokenize(elementsList[2])
            #N-grams are obtained.
            nGrams=ngrams(words,NgramType)
            #N-grams are collected.
            for nGram in nGrams:
                nGram=' '.join(e for e in nGram)
                nGramList.append(nGram)
            #Texts are transformed into vectors.
            for nGram in nGramFeatureSet:
                vector.append(nGramList.count(nGram))    
            vectorList.append(vector)    

    #A TXT file containing one vector per line is generated.
    with codecs.open("testVectors-3-150.txt","w","utf-8") as file:
         for x in vectorList:
             file.write(','.join(map(str, x))+"\n")

#Error handling.            
except IOError as (errno, strerror):
    print "Input/output error ({0}): {1}".format(errno, strerror)

except:
    print "Unexpected error:", sys.exc_info()[0]
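
The transformation from text to vector can be summarised in a compact sketch. It is written for Python 3; the three-item feature list, the sample sentence and the helper names `text_ngrams` and `vectorize` are assumptions made for the example, whereas the notebook reads the 150 most frequent trigrams from feature3Gram.txt.

```python
from collections import Counter


def text_ngrams(text, n=3):
    """Return the list of word n-grams of a cleaned text (whitespace tokens)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def vectorize(text, features, n=3):
    """Map a text to the occurrence count of each selected feature, in order."""
    counts = Counter(text_ngrams(text, n))
    # A Counter returns 0 for n-grams that do not occur in the text.
    return [counts[feature] for feature in features]


# Hypothetical feature list standing in for the top 150 trigrams.
features = ["es que es", "que no me", "no pasa nada"]
vector = vectorize("pues es que es que no me gusta", features)
print(vector)
```

Counting the n-grams of a text once and then looking up each feature avoids scanning the n-gram list once per feature, which matters as the number of features grows.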

## Training - Testing

Finally, the vectors created previously are used to build a supervised learning model that predicts the sentiment (positive, negative or neutral) of a text.

- The training vectors, the test vectors and the gold-standard tags are loaded.

In [15]:
#Module to transcode text between different representations.
import codecs
#Regular expressions module to search within and change text using formal patterns.
import re
#Module that provides system-specific configuration and operations.
import sys
#Module for Python object serialization.
import pickle
#NumPy is a package for scientific computing that provides some of the highly optimized data structures. 
import numpy as np
#The scikit-learn module provides tools for data mining and data analysis.
#Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
from sklearn import svm
#The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. 
#Precision_score computes the precision classification score.
from sklearn.metrics import precision_score
#Accuracy_score computes the accuracy classification score.
from sklearn.metrics import accuracy_score

#Determine the N-gram window size.
NgramType=3
#Determine the amount of features to select.
NgramNumber=150


#try:
temp=[]
temp2=[]
trainingVectors=[]
testVectors=[]
sentimentTag=[]
goldstandarTag=[]
testResults=[]

with codecs.open("trainingVectors-3-150.txt","r","utf-8") as file:
        for line in file:
          elements=line.split(",")
          vector=elements[0:len(elements)-1]
          trainingVectors.append([int(i) for i in vector])
          sentimentTag.append((elements[len(elements)-1]).rstrip())

with codecs.open("testVectors-3-150.txt","r","utf-8") as file:
        for line in file:
          vector=line.split(",")
          vector[len(vector)-1]=vector[len(vector)-1].rstrip()
          temp.append([[int(i) for i in vector]])

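#The gold-standard tags are read, one per line, assumed to be in the same order as the test vectors.
with codecs.open("goldStandar.qrel","r","utf-8") as file:
        for line in file:
          #Each line holds a tweet identifier and its tag separated by a tab.
          elements=line.split("\t")
          #Every tag, including NONE, is kept so that the tags stay aligned with the test vectors.
          temp2.append(elements[1].rstrip())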

#The NONE elements are removed from the dataset
for x,y in zip(temp,temp2):
    if y!="NONE":
        testVectors.append(x)
        goldstandarTag.append(y)

- A model is trained on the training samples and used to predict a sentiment for each test vector.

In [16]:
#Module to transcode text between different representations.
import codecs
#Regular expressions module to search within and change text using formal patterns.
import re
#Module that provides system-specific configuration and operations.
import sys
#Module for Python object serialization.
import pickle
#NumPy is a package for scientific computing that provides some of the highly optimized data structures. 
import numpy as np
#The scikit-learn module provides tools for data mining and data analysis.
#Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
from sklearn import svm
#The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. 
#Precision_score computes the precision classification score.
from sklearn.metrics import precision_score
#Accuracy_score computes the accuracy classification score.
from sklearn.metrics import accuracy_score

# SVC with polynomial (degree 3) kernel
C = 1.0
clf= poly_svc = svm.SVC(kernel='poly', degree=3, C=C)
#training phase (model construction)
clf.fit(trainingVectors, sentimentTag)
#Model serialization
with open("Model-3-150.txt", "wb") as modelFile:
    pickle.dump(clf, modelFile)
#To load the model later use: clf = pickle.load(open("Model-3-150.txt", "rb"))
    
#Testing phase
for vector in testVectors:
    testResults.append(clf.predict(vector)[0])
    
print "Model Accuracy: "+str(accuracy_score(goldstandarTag,testResults))
print "Model micro precision: "+str(precision_score(goldstandarTag,testResults, average='micro'))
print "Model macro precision: "+str(precision_score(goldstandarTag,testResults, average='macro'))
print "Model weighted precision: "+str(precision_score(goldstandarTag,testResults, average='weighted'))

Model Accuracy: 0.472
Model micro precision: 0.472
Model macro precision: 0.157333333333
Model weighted precision: 0.222784
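
The relation between these figures can be checked against the definitions of the averages. The sketch below is a plain-Python illustration for Python 3, not the scikit-learn implementation: macro precision is the unweighted mean of the per-label precisions, and weighted precision weights each label by its share of the gold tags. The gold tags and the always-"N" classifier are hypothetical. When a classifier returns the same label for every sample and that label covers a fraction p of three gold classes, accuracy is p, macro precision is p / 3 and weighted precision is p squared. The reported values follow this pattern (0.472 / 3 = 0.157333..., 0.472 * 0.472 = 0.222784), which is consistent with the model predicting a single class for every test vector.

```python
def precision_by_label(gold, predicted):
    """Per-label precision; labels that are never predicted get 0.0."""
    labels = sorted(set(gold) | set(predicted))
    result = {}
    for label in labels:
        hits = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
        predicted_count = sum(1 for p in predicted if p == label)
        result[label] = hits / float(predicted_count) if predicted_count else 0.0
    return result


def summarize(gold, predicted):
    """Return accuracy, macro precision and support-weighted precision."""
    total = float(len(gold))
    per_label = precision_by_label(gold, predicted)
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / total
    # Macro: every label counts the same, however rare it is.
    macro = sum(per_label.values()) / len(per_label)
    # Weighted: each label counts in proportion to its number of gold tags.
    weighted = sum(per_label[label] * gold.count(label) / total for label in per_label)
    return accuracy, macro, weighted


# Hypothetical gold tags: 5 N, 3 P, 2 NEU; the classifier always answers "N".
gold = ["N"] * 5 + ["P"] * 3 + ["NEU"] * 2
predicted = ["N"] * 10
accuracy, macro, weighted = summarize(gold, predicted)
print(accuracy, macro, weighted)
```

Inspecting the distribution of predicted labels is therefore a useful first check before tuning the kernel, the value of C or the number of selected trigrams.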


