<h1>Sentiment analysis with scikit-learn</h1>
<p>As part of my <a href="http://niconico.dk/files/Opinionmining_speciale.pdf" target="_blank">master's thesis</a> in IT and Cognition at the University of Copenhagen I worked on this fairly straightforward piece of code.</p>

<p>From a collection of news articles about the Danish bank Nykredit, the relevant data is extracted, cleaned and used to train and test a few classifiers from the machine learning library scikit-learn.</p>

<p>More than anything, this is a test of the quality of the tagging of the data. And as the results show, the quality isn't the best. The reasons for this are discussed in chapter 6.4.2 of the above-mentioned thesis.</p>

In [1]:
import pickle

# numpy and pandas for data handling
import numpy as np
import pandas as pd

# BeautifulSoup for parsing XML
from bs4 import BeautifulSoup

# nltk for various NLP tasks (Natural Language Toolkit)
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# difflib for measuring similarity of text
from difflib import SequenceMatcher as textSimilarity

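# sklearn objects for feature extraction, classification and cross validation
# (sklearn.cross_validation was removed in scikit-learn 0.20; newer releases
# provide these tools in sklearn.model_selection, where StratifiedShuffleSplit
# takes n_splits instead of the labels and n_iter)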
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import cross_validation
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn import svm
from sklearn.naive_bayes import BernoulliNB as bnb

print ("Modules imported")

Modules imported


In [2]:
def nykredit_xls_to_dataframe():
    """
    Parses Nykredit_feed.xls and returns selected columns of the spreadsheet
    as a Pandas dataframe.
    """
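    try:
        NK_data = pd.read_pickle("NK_data.pkl")
        print ("NK_data loaded")
    except FileNotFoundError:
        # no cached pickle yet: build the dataframe from the spreadsheet
        print("NK_data.pkl not found, parsing Nykredit_feed.xls")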
        # loads the spreadsheet as a pandas.DataFrame
        xls_file = pd.ExcelFile("Nykredit_feed.xls")
        nykredit_data = xls_file.parse('tmp62.tmp')
        nykredit_text = xls_file.parse('ArticleText')
        
        # choose which columns to use from the spreadsheet:
        selected_parameters = ['ArticleKey', 'ArticleDate', 'QualitativeScore',
                               'Headline', 'Kilde', 'Raw Xml']
        NK_data = nykredit_data.reindex(columns=selected_parameters)

        # add a column containing the texts extracted from the xml:
        NK_data['Text'] = NK_data.apply(extract_text_from_xml, axis=1)
        
        # change column name 'Kilde'=>'Source'
        NK_data = NK_data.rename(columns={'Kilde': 'Source'})
        
        NK_data.to_pickle("NK_data.pkl")
        print ('NK_data.pkl constructed and saved to file')
        
    return NK_data

def extract_text_from_xml(row):
    """
    Parses the raw XML string in row['Raw Xml'] and returns the article text
    contained in its <p> blocks.
    """
    soup = BeautifulSoup(row['Raw Xml'], 'lxml')
    p_blocks = soup.find_all('p')

    # p.string is None for <p> blocks that are empty or hold more than one
    # child node, so those blocks are skipped
    output = [p.string for p in p_blocks if p.string]
    return " ".join(output)

NK_data = nykredit_xls_to_dataframe()

NK_data loaded


<b>Now the NK_data frame looks as follows:</b>

In [3]:
NK_data.head()

Unnamed: 0,ArticleKey,ArticleDate,QualitativeScore,Headline,Source,Raw Xml,Text
0,e25f3cf8,2010-12-28,0,Fortsat lav rente på flexlån,Hadsund Folkeblad,"<?xml version=""1.0"" encoding=""utf-8""?><NewsML ...",ALS: December måned ventes hvert år med spændi...
1,e25bea65,2010-12-21,0,Fortsat en meget lav rente på et-årige flexlån,Aars Avis,"<?xml version=""1.0"" encoding=""utf-8""?><NewsML ...",AARS: December ventes hvert år med spænding af...
2,e252432b,2010-12-01,1,Førstegangskøbere kan med fordel købe bolig nu,Aabenraa Ugeavis,"<?xml version=""1.0"" encoding=""utf-8""?><NewsML ...",BOLIG: » Mange førstegangskøbere kan med forde...
3,e23bc88a,2010-10-05,1,Sparbank Nord anbefaler at lægge huslånene om,Annonce Bladet Salling-Fur-Skive,"<?xml version=""1.0"" encoding=""utf-8""?><NewsML ...",I de seneste måneder har massevis af boligejer...
4,e2520d69,2010-12-01,1,Nu kan der med fordel købes bolig,Lokal-Bladet Budstikken Vejen,"<?xml version=""1.0"" encoding=""utf-8""?><NewsML ...",-Mange førstegangskøbere kan med fordel slå ti...


<h3>Removing duplicate and semi-duplicate texts</h3>
<p>
    Besides a few identical duplicates in the collection (which can be removed quickly and easily) there are also a number of semi-duplicates. An example could be an article from a news agency published in two different papers: the first paper prints the article in its entirety whereas the second cuts the last paragraph. The texts aren't identical, but they are similar enough that one of them can be removed to avoid redundant data points. The degree of similarity determines whether a text is considered a semi-duplicate or not.
</p>
<p>
    The similarity is measured using the <code><a href="https://docs.python.org/2/library/difflib.html">difflib.SequenceMatcher</a></code> class, which compares sequences (texts in our case) and scores their similarity. The threshold is set to 0.95: if two texts have a similarity score above this, one of them is removed while the other is kept in the data set.
</p>
<p>
    <code>difflib.SequenceMatcher</code> (imported as <code>textSimilarity</code> in the import section) is rather computationally heavy. Hence
    the checks inside the nested while loops, which skip any text that has already been flagged as a duplicate instead of comparing it to further texts.
</p>
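<p>
    A minimal sketch of the scoring step on made-up texts. The strings and the helper name <code>is_semi_duplicate</code> are illustrative assumptions, not taken from the data set, and the helper uses <code>ratio()</code> where the notebook cell uses <code>quick_ratio()</code>. The last lines show why that matters: <code>quick_ratio()</code> is only an upper bound on <code>ratio()</code>, cheaper to compute but prone to overestimating the similarity.
</p>

```python
from difflib import SequenceMatcher

def is_semi_duplicate(text_a, text_b, threshold=0.95):
    """Return True if the similarity ratio of the two texts exceeds threshold."""
    return SequenceMatcher(None, text_a, text_b).ratio() > threshold

# hypothetical example texts: the second paper cuts the final sentence
cut_article = "Renten på flexlån er fortsat lav. " * 5
full_article = cut_article + "Slut."
unrelated = "Mange førstegangskøbere kan med fordel købe bolig nu."

print(is_semi_duplicate(full_article, cut_article))  # above the threshold
print(is_semi_duplicate(full_article, unrelated))    # far below the threshold

# quick_ratio() only compares character counts, so it ignores ordering:
# same characters in reverse order give quick_ratio 1.0 but ratio 0.25
matcher = SequenceMatcher(None, "abcd", "dcba")
print(matcher.quick_ratio(), matcher.ratio())
```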

In [4]:
def duplicate_indexes(dataFrame):
    """
    Returns the indexes of duplicate and semi duplicates texts in dataFrame.
    """
    try:
        with open("duplicates", "rb") as f:
            duplicates_list = pickle.load(f)
        print ("duplicates loaded")
    except FileNotFoundError:
        duplicates_list = []
        i = 0
        while i<dataFrame.shape[0]-1:
            if i not in duplicates_list: # skip texts already flagged as duplicates
                j = i+1
                while j<dataFrame.shape[0]:
                    if j not in duplicates_list: # skip texts already flagged as duplicates
                        # quick_ratio() is a fast upper bound on ratio()
                        text_similarity = textSimilarity(None,
                                                         dataFrame.loc[i].Text,
                                                         dataFrame.loc[j].Text
                                                        ).quick_ratio()
                        if text_similarity > 0.95:
                            duplicates_list.append(j)
                            print((i, j))
                    # advance exactly once per iteration so no text is skipped
                    j += 1
            i += 1
        with open("duplicates", "wb") as f:
            pickle.dump(duplicates_list, f)
    return duplicates_list

duplicates = duplicate_indexes(NK_data)
NK_unique_data = NK_data.drop(duplicates).copy()

duplicates loaded


<h3>Feature vectors</h3>
<p>
    I'm using scikit-learn's <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html"><code>CountVectorizer</code></a> class for counting the number of times the tokens (i.e. words) appear in the documents. The <code>vectorizer</code> is then used for building the feature vectors.
</p>
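<p>
    To make the idea of a count-based feature vector concrete, here is a rough sketch built on <code>collections.Counter</code> instead of <code>CountVectorizer</code>. The function <code>count_vectors</code> and the toy documents are hypothetical, and the tokenizer is simplified to lowercasing and splitting on whitespace, which is not how <code>CountVectorizer</code> tokenizes by default.
</p>

```python
from collections import Counter

def count_vectors(texts, stop_words=(), max_features=None):
    """Build bag-of-words count vectors (simplified whitespace tokenizer)."""
    stop_words = set(stop_words)
    tokenized = [[tok for tok in text.lower().split() if tok not in stop_words]
                 for text in texts]
    # rank the vocabulary by total corpus frequency, ties broken alphabetically
    totals = Counter(tok for doc in tokenized for tok in doc)
    ranked = sorted(totals, key=lambda tok: (-totals[tok], tok))
    # keep only the max_features most frequent tokens (all if None)
    vocabulary = sorted(ranked[:max_features])
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors

# hypothetical toy documents
docs = ["renten er lav", "renten er lav og lav", "køb bolig nu"]
vocabulary, vectors = count_vectors(docs, stop_words=["er", "og"])
print(vocabulary)  # one column per token
print(vectors)     # one row of counts per document
```

<p>
    Each row is one document and each column one token of the vocabulary, which is the shape of input the classifiers are trained on.
</p>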
<p>
    Finally <code>crossvalidate_algorithms</code> cross-validates a list of classifiers on the given feature vectors using stratified shuffled splits: five random splits that each hold out 10% of the data for testing while preserving the class proportions.
</p>
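<p>
    The splitting scheme can be sketched in plain Python as follows. This is an illustration of the idea only, not scikit-learn's implementation; the function name <code>stratified_shuffle_split</code>, its parameters and the toy labels are assumptions made for the example.
</p>

```python
import random
from collections import defaultdict

def stratified_shuffle_split(labels, n_splits=5, test_size=0.1, seed=0):
    """Yield (train_indices, test_indices) pairs that preserve class ratios."""
    rng = random.Random(seed)
    # group the sample indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for _ in range(n_splits):
        train, test = [], []
        # shuffle each class separately and hold out the same share of it
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_size))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)

# hypothetical labels: 30 texts of class 0 and 70 of class 1
labels = [0] * 30 + [1] * 70
for train, test in stratified_shuffle_split(labels):
    class_1_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), class_1_in_test)
```

<p>
    Every split holds out a different random 10% of the texts, and the reported mean accuracy is the average over the splits.
</p>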

In [5]:
# Creating feature vectors using sklearn

def texts_to_feature_vectors(dataframe, number_of_features=5000):
    """
    Based on the dataframe given in input, texts_to_feature_vectors builds and returns
    an array of feature vectors representing the texts in the dataframe and an array of
    the corresponding labels.
    """
    
    # keep only the texts whose QualitativeScore is not -1
    NK_positive_negative = dataframe[dataframe.QualitativeScore != -1].copy()
    target = np.array(NK_positive_negative.QualitativeScore) # list of corresponding labels
    
    texts = np.array(NK_positive_negative.Text)
    
    # Initializing the CountVectorizer object
    # vectorizer counts the number of times the tokens appear in the document
    # (the nltk stop word list is identified by the lowercase name "danish")
    vectorizer = CountVectorizer(analyzer='word', tokenizer=None, preprocessor=None,
                                 stop_words=stopwords.words("danish"),
                                 max_features=number_of_features)

    # Creating the feature vectors
    feature_vectors = vectorizer.fit_transform(texts).toarray()
    
    return feature_vectors, target

X, y = texts_to_feature_vectors(NK_unique_data)

def crossvalidate_algorithms(algos, f_vectors, target):
    """
    crossvalidate_algorithms cross-validates a list of classifiers on the set of
    feature vectors given in input, using stratified shuffled splits.
    """
    
    # StratifiedShuffleSplit is an iterator for generating stratified and
    # shuffled splits for the cross validation
    sss = StratifiedShuffleSplit(target, n_iter=5, test_size=0.1, random_state=0)
    
    scores = {}
    for algo in algos:
        s = cross_validation.cross_val_score(algo(), f_vectors, target, cv=sss)
        scores[algo.__name__] = s
        print (algo.__name__, ":", s, "\nMean accuracy:", round(100*s.mean(), 2), "\n")
    return scores

algorithm_scores = crossvalidate_algorithms([bnb, svm.LinearSVC], X, y)

BernoulliNB : [ 0.80147059  0.84558824  0.83088235  0.79411765  0.83823529] 
Mean accuracy: 82.21 

LinearSVC : [ 0.78676471  0.83823529  0.84558824  0.75        0.82352941] 
Mean accuracy: 80.88 



<h3>Results</h3>
<p>The accuracy scores are pretty low: a mean of 82.21% for BernoulliNB and 80.88% for LinearSVC. The various reasons for this are discussed in chapter 6.4.2 of the aforementioned thesis, but one of the main reasons is incorrectly tagged texts.</p>

In [6]:
#
# As of now these functions aren't used for anything. They are intended for a
# finer feature selection than scikit-learn's vanilla equivalents offer.
#

def createFreqDist(dataframe):
    """
    Creates an nltk frequency distribution for the input dataframe.
    """
    
    # join all texts into one string before tokenizing
    allTexts = " ".join(dataframe.Text)
    frequencyDistribution = nltk.FreqDist(word_tokenize(allTexts))
    return frequencyDistribution

frequencyDistribution = createFreqDist(NK_unique_data)

def trimFreqDist(freqDist, loCut, hiCut):
    """
    trimFreqDist removes the tokens that appear only once and then trims both ends
    of the list of tokens ranked from most to least common: the first loCut fraction
    (the most common tokens) and everything from the hiCut fraction onwards (the
    least common tokens).
    """
    
    # use the freqDist argument rather than the global frequencyDistribution
    print("frequencyDistribution.B() =", freqDist.B(), ", loCut =", loCut, ", hiCut =", hiCut)
    fd = freqDist.copy()
    FreqDistList = fd.most_common(fd.B()) # Ordered list of (token, freq) from the most to least common tokens
    for (token, freq) in FreqDistList: # Remove tokens that only appear once
        if freq==1:
            fd.pop(token)
            
    FreqDistList = fd.most_common(fd.B()) # Update list after removing tokens of frequency 1
    loCutIndex = int(fd.B()*loCut) # The B() method returns the number of distinct tokens (bins) in the freqDist
    hiCutIndex = int(fd.B()*hiCut)
    
    for (token, freq) in FreqDistList[:loCutIndex]: # the most common tokens
        fd.pop(token)
    for (token, freq) in FreqDistList[hiCutIndex:]: # the least common tokens
        fd.pop(token)
    print(fd.B())
    return fd

trimmedFreqDist = trimFreqDist(frequencyDistribution, 0.1, 0.9)

frequencyDistribution.B() = 39920 , loCut = 0.1 , hiCut = 0.9
15618
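
<p>
    What the trimming does is easier to see on a toy example. The sketch below uses a plain <code>collections.Counter</code> in place of <code>nltk.FreqDist</code>; the name <code>trim_counts</code> and the token counts are made up for illustration.
</p>

```python
from collections import Counter

def trim_counts(counts, lo_cut, hi_cut):
    """Drop tokens seen once, then keep ranks [lo_cut, hi_cut) of the rest."""
    kept = Counter({tok: n for tok, n in counts.items() if n > 1})
    ranked = [tok for tok, _ in kept.most_common()]  # most common first
    lo_index = int(len(ranked) * lo_cut)
    hi_index = int(len(ranked) * hi_cut)
    return Counter({tok: kept[tok] for tok in ranked[lo_index:hi_index]})

# hypothetical token counts
counts = Counter({"og": 50, "rente": 10, "lån": 9, "bolig": 8, "bank": 7,
                  "kurs": 6, "pris": 5, "salg": 4, "køb": 3, "år": 2,
                  "hapax": 1})
trimmed = trim_counts(counts, 0.1, 0.9)
print(sorted(trimmed))
```

<p>
    With cuts of 0.1 and 0.9 the single-occurrence token is dropped first, then the most common token ("og") and the least common remaining token ("år"), leaving the mid-frequency tokens as candidate features.
</p>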
