
# <font color='green'>Sentiment Analysis</font> ![title](./resources/img/sent_twitter.png)

## <font color='red'>Reading Data</font>
***

#### Import libraries

Refer to the web pages of the individual libraries:
* [Pandas](http://pandas.pydata.org/), to load and manage data
* [Matplotlib](http://matplotlib.org/), for visualization
* [numpy](http://www.numpy.org/), for array representation and manipulation
* [re](https://docs.python.org/3/library/re.html), for regular expressions
* [nltk](http://www.nltk.org/), for preprocessing

In [1]:
import pandas as pd
import re
import os
from copy import copy
import collections
import scipy
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn import metrics
from sklearn.svm import SVC
from sklearn import model_selection

%matplotlib inline

#### Reading the dataset
The dataset "downloaded_cleansed_B" is derived from "downloaded_cleansed_A". The differences are:
- "downloaded_cleansed_A" has three columns that we won't use.
- "downloaded_cleansed_A" has repeated tweets.

In [2]:
df = pd.read_csv('./data/train/downloaded_cleansed_B.tsv', sep= '\t', header=None)
print (df.shape)
df.head()

(9665, 4)


| |0|1|2|3|
|---|---|---|---|---|
|0|264183816548130816|15140428|positive|Gas by my house hit $3.39!!!! I'm going to Cha...|
|1|263405084770172928|591166521|negative|Not Available|
|2|262163168678248449|35266263|negative|Not Available|
|3|264249301910310912|18516728|negative|Iranian general says Israel's Iron Dome can't ...|
|4|262682041215234048|254373818|neutral|Not Available|


Note that some tweets are "Not Available". We will discard them because they do not help with sentiment analysis.

#### Remove all "Not Available" tweets

In [3]:
df = df[df[3] != "Not Available"]
df.head()

| |0|1|2|3|
|---|---|---|---|---|
|0|264183816548130816|15140428|positive|Gas by my house hit $3.39!!!! I'm going to Cha...|
|3|264249301910310912|18516728|negative|Iranian general says Israel's Iron Dome can't ...|
|6|264105751826538497|147088367|positive|with J Davlar 11th. Main rivals are team Polan...|
|7|264094586689953794|332474633|negative|Talking about ACT's &amp;&amp; SAT's, deciding...|
|9|254941790757601280|557103111|negative|They may have a SuperBowl in Dallas, but Dalla...|


In [4]:
df.shape

(7205, 4)

#### <font color='blue'>Training tweets are too limited: just 7205 tweets ...</font>

In [5]:
raw_tweets = list(df[3])
labels = df[2]

***
## <font color='red'>Preprocess the tweets</font>
https://nlp.stanford.edu/IR-book/html/htmledition/determining-the-vocabulary-of-terms-1.html
***

In [6]:
TT = TweetTokenizer()

def emoticondictionary(filename):
    """
    Reads the emoticon file and represents it as a dictionary where the emoticon is the key
    and its sentiment is the value
    """
    emo_scores = {'Positive': 'positive', 'Extremely-Positive': 'positive', 
                  'Negative': 'negative','Extremely-Negative': 'negative',
                  'Neutral': 'neutral'}
    emo_score_list = {}
    fi = open(filename,"r")
    l = fi.readline()
    while l:
        #replace the "Non-break space" with the ordinary space " "
        l = l.replace("\xa0"," ")
        li = l.split(" ")
        l2 = li[:-1] #removes the polarity of the emoticon ('negative', 'positive')
        l2.append(li[len(li) - 1].split("\t")[0]) #gets the last emoticon attached to the polarity by '\t'
        sentiment=li[len(li) - 1].split("\t")[1][:-1] #gets only the polarity, and removes '\n'
        score=emo_scores[sentiment]
        l2.append(score)
        for i in range(0,len(l2)-1):
            emo_score_list[l2[i]]=l2[len(l2)-1]
        l=fi.readline()
    fi.close()
    return emo_score_list

emoticon_dict = emoticondictionary('./resources/emoticon.txt')


# substitute each emoticon with its associated sentiment
def subsEmoticon(tweet,d):
    l = TT.tokenize(tweet)
    tweet = [d[i] if i in d.keys() else i for i in l]
    return tweet

raw_tweets = [subsEmoticon(tweet, emoticon_dict) for tweet in raw_tweets]
# print(":D X3 :|")
# subsEmoticon(":D X3 :|", dict)
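
The parser above implies a particular layout for each line of `./resources/emoticon.txt`: several emoticons separated by spaces, then a tab, then a polarity label. The sketch below parses one such line using only the standard library. The sample line and the helper name `parse_emoticon_line` are illustrative assumptions, not content taken from the actual resource file.

```python
# Polarity labels collapsed to three classes, mirroring emo_scores above.
POLARITY = {
    "Positive": "positive", "Extremely-Positive": "positive",
    "Negative": "negative", "Extremely-Negative": "negative",
    "Neutral": "neutral",
}

def parse_emoticon_line(line):
    """Map every emoticon on one line to its collapsed polarity (hypothetical helper)."""
    line = line.replace("\xa0", " ").rstrip("\n")   # normalise non-breaking spaces
    emoticons, label = line.rsplit("\t", 1)         # the label follows the last tab
    return {emo: POLARITY[label] for emo in emoticons.split(" ") if emo}

# Assumed sample line; the real file may contain different emoticons.
sample = ":) :-) :D\tPositive\n"
emoticon_map = parse_emoticon_line(sample)
print(emoticon_map)  # {':)': 'positive', ':-)': 'positive', ':D': 'positive'}
```

Replacing a token is then a dictionary lookup, which is what `subsEmoticon` does after tokenizing the tweet.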

In [7]:
def correct_case(prev_word, words):
    """
    Applies the case of 'prev_word' to the first word of 'words' and returns the result.
    """
    def case_of(text):
        """
        Return the case function appropriate for the given word:
        str.upper, str.lower, str.title, or plain str (no change).
        """
        if len(text) == 1:
            return str.title if text.isupper() else str.lower
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)
    assert isinstance(words, list)
    words = list(words)  # copy, so the shared dictionary entry is not modified in place
    words[0] = case_of(prev_word)(words[0])
    return words
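
To make the case-preservation rule concrete, here is a small self-contained sketch of the same idea. The names `case_function` and `apply_case` are hypothetical stand-ins for the helpers in the cell above, and the sample expansion is invented for illustration.

```python
def case_function(text):
    """Pick the str method that reproduces the casing of `text`."""
    if len(text) == 1:
        return str.title if text.isupper() else str.lower
    if text.isupper():
        return str.upper
    if text.islower():
        return str.lower
    if text.istitle():
        return str.title
    return str  # mixed case: leave the replacement untouched

def apply_case(original, replacement_words):
    """Return a copy of `replacement_words` whose first word follows `original`'s casing."""
    cased = list(replacement_words)
    cased[0] = case_function(original)(cased[0])
    return cased

expansion = ["laughing", "out", "loud"]
print(apply_case("LOL", expansion))   # ['LAUGHING', 'out', 'loud']
print(apply_case("Lol", expansion))   # ['Laughing', 'out', 'loud']
print(apply_case("lol", expansion))   # ['laughing', 'out', 'loud']
print(apply_case("U", ["you"]))       # ['You']
```

Only the first word of the expansion is re-cased; the remaining words keep the lower-case form stored in the dictionary.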

In [8]:
def loadSlangs(filename):
    """
    This function reads the file that contains the slangs and puts them in a dictionary such that
    the key is the slang and the value is its expansion as a list of words.
    slangs["i've"] = ['i',  'have']
    slangs['12be'] = ['want', 'to', 'be']
    ...
    CAUTION: the keys and values are in lower-case
    """
    slangs={}
    fi=open(filename,'r')
    line=fi.readline()
    while line:
        l=line.split(r',%,')
        if len(l) == 2:
            slangs[l[0].lower()]=l[1][:-1].lower().split()
        line=fi.readline()
    fi.close()
    return slangs


replaced = 0
def replaceSlangs(tweet, slangs):
    """
    This function replaces each slang token in the tokenized tweet with its expansion.
    The case of the original token is applied to the first word of the expansion.
    """
    global replaced
    result = []
    for w in tweet:
        if w.lower() in slangs.keys():
            replaced += 1
            result.extend(correct_case(w, slangs[w.lower()]))
        else:
            result.append(w)
    return result


slangs = loadSlangs('./resources/internetSlangs.txt')
raw_tweets = [replaceSlangs(tweet, slangs) for tweet in raw_tweets]
print (str(replaced)+" words were replaced")

3488 words were replaced
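
`loadSlangs` expects one entry per line in the form `slang,%,expansion`. The following sketch reads that layout from an in-memory file and expands a token list. The sample entries and the helper name `load_expansions` are assumptions made for illustration; the real `internetSlangs.txt` may contain different entries.

```python
import io

def load_expansions(handle):
    """Read 'slang,%,expansion' lines into {slang: [expansion words]} (hypothetical helper)."""
    table = {}
    for line in handle:
        parts = line.rstrip("\n").split(",%,")
        if len(parts) == 2:                      # skip malformed lines
            table[parts[0].lower()] = parts[1].lower().split()
    return table

# Assumed sample content standing in for the resource file.
sample_file = io.StringIO("lol,%,laughing out loud\nbrb,%,be right back\nbroken line\n")
slang_table = load_expansions(sample_file)

tokens = ["brb", "that", "was", "funny", "lol"]
expanded = []
for token in tokens:
    expanded.extend(slang_table.get(token.lower(), [token]))
print(expanded)
# ['be', 'right', 'back', 'that', 'was', 'funny', 'laughing', 'out', 'loud']
```

Using `rstrip("\n")` instead of slicing off the last character keeps the final entry intact when the file does not end with a newline.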


In [9]:
def load_apostrophe_words(filename):
    """
    This function reads the file that contains the words with an apostrophe and puts them in a dictionary
    such that the key is the word containing the apostrophe and the value is its expanded form as a list of words.
    apo["i've"] = ['i', 'have']
    apo["i'm"] = ['i', 'am']
    ...
    CAUTION: the keys and values are lower-case letters
    """
    apo={}
    fi=open(filename,'r')
    line=fi.readline()
    while line:
        l=line.split(r',%,')
        if len(l) == 2:
            apo[l[0].lower()] = l[1][:-1].lower().split()
        line=fi.readline()
    fi.close()
    return apo

replaced = 0
def replace_apostrophe(tweet,apos):
    global replaced
    result = []
    for w in tweet:
        if w.lower() in apos.keys():
            result.extend(correct_case(w, apos[w.lower()]))
            replaced += 1
        else:
            result.append(w)
    return result

apos = load_apostrophe_words('./resources/apostrophe_words.txt')
raw_tweets = [replace_apostrophe(tweet, apos) for tweet in raw_tweets]
print (str(replaced)+" words were replaced")

1682 words were replaced


In [10]:
negation_words = set(['barely', 'hardly', 'lack', 'never', 'neither', 'no', 'nobody', \
                      'not', 'nothing', 'none', 'nowhere', 'shortage', 'scarcely'])
punctuations = [',', '.', ':', ';', '!', '?']

def handle_negation(tweet):
    output = []
    negate = False
    for word in tweet:
        if word[-1] in punctuations and negate:
            negate = False
        if negate and not word.lower() in negation_words:
            output.append(word+"_not")
        else:
            output.append(word)
        if word.lower() in negation_words and not negate:
            negate = True
        elif word.lower() in negation_words and negate:
            negate = False
    return output

raw_tweets = [handle_negation(tweet) for tweet in raw_tweets]
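
The negation rule above works as a toggle: a negation word opens a span in which every following token receives a `_not` suffix, and the span closes at a punctuation token or at a second negation word. The sketch below restates the same rules in a compact, self-contained form (the name `mark_negation` is hypothetical) and runs them on two invented token lists.

```python
NEGATIONS = {"barely", "hardly", "lack", "never", "neither", "no", "nobody",
             "not", "nothing", "none", "nowhere", "shortage", "scarcely"}
CLAUSE_END = {",", ".", ":", ";", "!", "?"}

def mark_negation(tokens):
    """Append '_not' to tokens inside a negated span (same rules as handle_negation)."""
    marked, negate = [], False
    for token in tokens:
        is_negation = token.lower() in NEGATIONS
        if token[-1] in CLAUSE_END:
            negate = False                      # punctuation closes the span
        marked.append(token + "_not" if negate and not is_negation else token)
        if is_negation:
            negate = not negate                 # a second negation cancels the first
    return marked

first = mark_negation(["I", "do", "not", "like", "this", "movie", ".", "It", "is", "great"])
print(first)
# ['I', 'do', 'not', 'like_not', 'this_not', 'movie_not', '.', 'It', 'is', 'great']

second = mark_negation(["never", "say", "no", "again"])
print(second)
# ['never', 'say_not', 'no', 'again']
```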

In [11]:
def preprocess(tweet):
    tweet = ' '.join(tweet) #change from 'list' to str
    # delete @mentions, URLs and symbols (keep #, _ and ')
    tweet =  ' '.join(re.sub(r"(@[A-Za-z0-9_]+)|([^0-9A-Za-z_#' \t])|(\w+:\/\/\S+)", '', tweet).split())
    # Convert '@username' to 'at_user'
    # tweet = re.sub(r'@[^\s]+','at_user',tweet)
    # remove hashtags
    # tweet = re.sub(r'#\s', '', tweet)
    # remove digits
    tweet = re.sub(r'[0-9]', '', tweet)
    # collapse repeated whitespace
    tweet = re.sub(r'\s+', ' ', tweet)
    # shorten any character repeated 3 or more times to 2, e.g. loooong -> loong
    tweet = re.sub(r'(.)\1{2,}', r'\1\1', tweet)
    return tweet

preprocessed_tweets = [preprocess(tweet) for tweet in raw_tweets]
print (len(preprocessed_tweets))

7205
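
As a worked example of the cleaning passes in `preprocess`, the sketch below applies the same four regular expressions to a single invented tweet. The function name `clean_tweet` and the sample text are assumptions for illustration only.

```python
import re

def clean_tweet(text):
    """Apply the same four regex passes as preprocess to one string."""
    # drop @mentions, URLs and symbols other than # _ ', then squeeze whitespace
    text = " ".join(re.sub(r"(@[A-Za-z0-9_]+)|([^0-9A-Za-z_#' \t])|(\w+://\S+)", "", text).split())
    text = re.sub(r"[0-9]", "", text)           # remove digits
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    return re.sub(r"(.)\1{2,}", r"\1\1", text)  # 3+ repeats of a character -> 2

cleaned = clean_tweet("@user Sooooo good!!! 2day http://t.co/abc #happy")
print(cleaned)  # Soo good day #happy
```

Note the side effect of digit removal: `2day` becomes `day`, so number-based abbreviations lose part of their meaning unless they were expanded in an earlier step.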


#### Remove stop words
https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

In [12]:
stop_words = stopwords.words('english')
stop_words.extend([word+'_not' for word in stop_words]) #negation
stop_words = set(stop_words)
# stop_words.update('j', 'im')
print (len(stop_words))

# remove stopwords
def rem_stop(tweet):
    words = tweet.split()
    tweet = ' '.join([word for word in words if word.lower() not in stop_words])
    return tweet

final_tweets = [rem_stop(tweet) for tweet in preprocessed_tweets]
del raw_tweets, preprocessed_tweets

print("\nCompare tweets before / after")
df['final_tweets'] = final_tweets
df[[3, 'final_tweets']].head(10)

358

Compare tweets before / after


| |3|final_tweets|
|---|---|---|
|0|Gas by my house hit $3.39!!!! I'm going to Cha...|Gas house hit going Chapel Hill Sat positive|
|3|Iranian general says Israel's Iron Dome can't ...|Iranian general says Israel's Iron Dome deal_n...|
|6|with J Davlar 11th. Main rivals are team Polan...|J Davlar th Main rivals team Poland Hopefully ...|
|7|Talking about ACT's &amp;&amp; SAT's, deciding...|Talking ACT's SAT's deciding want go college a...|
|9|They may have a SuperBowl in Dallas, but Dalla...|may SuperBowl Dallas Dallas winning_not SuperB...|
|10|Im bringing the monster load of candy tomorrow...|Instant message bringing monster load candy to...|
|11|Apple software, retail chiefs out in overhaul:...|Apple software retail chiefs overhaul SAN FRAN...|
|12|@oluoch @victor_otti @kunjand I just watched i...|watched Sridevi's comeback remember Sun mornin...|
|14|#Livewire Nadal confirmed for Mexican Open in ...|#Livewire Nadal confirmed Mexican Open Februar...|
|15|@MsSheLahY I didnt want to just pop up... but ...|didnt want pop yep chapel hill next wednesday ...|
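
Because negation handling runs before stop-word removal, the stop list is extended with a `_not` variant of every stop word so that tokens such as `a_not` are dropped too. The sketch below shows the effect with a tiny hand-written stop list, an assumption standing in for the NLTK English list (which also contains "not").

```python
# Tiny stand-in stop list; the notebook uses nltk's English list instead.
base_stop_words = ["the", "is", "a", "was", "not"]
stop_set = set(base_stop_words) | {word + "_not" for word in base_stop_words}

def drop_stop_words(text):
    """Remove stop words and their negated '_not' variants from a cleaned tweet."""
    return " ".join(word for word in text.split() if word.lower() not in stop_set)

filtered = drop_stop_words("The movie was not a_not success_not")
print(filtered)  # movie success_not
```

The negation word itself disappears, but its effect survives in the `_not` suffix of the content words that followed it.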


***
## <font color='red'>Train the model</font>
***
#### Create a feature vector
* See [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) for more details

In [29]:
#CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', preprocessor=None, stop_words=None, tokenizer=None,  ngram_range=(1,3))
counter_vocabulary = vectorizer.fit(final_tweets)
# features = vectorizer.fit_transform(final_tweets)
# del final_tweets
# features.shape
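
With `ngram_range=(1,3)` the vocabulary contains every unigram, bigram, and trigram seen in the training tweets. The sketch below illustrates that counting with the standard library. It splits on whitespace, which is a simplification of the tokenization `CountVectorizer` performs, and the helper name `word_ngrams` is hypothetical.

```python
from collections import Counter

def word_ngrams(text, min_n=1, max_n=3):
    """Yield all word n-grams of `text` for min_n <= n <= max_n."""
    tokens = text.lower().split()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])

counts = Counter(word_ngrams("good movie good plot"))
print(counts["good"])        # 2
print(counts["good movie"])  # 1
print(len(counts))           # 8 distinct n-grams: 3 unigrams, 3 bigrams, 2 trigrams
```

Each distinct n-gram becomes one column of the feature matrix, which is why the vocabulary grows quickly as the n-gram range widens.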

In [48]:
#TFIDFVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

vectorizer = TfidfVectorizer(analyzer='word', preprocessor=None, stop_words=None, tokenizer=None,  ngram_range=(1,3))
tfidf_vocabulary = vectorizer.fit(final_tweets)
# features.shape

In [49]:
type(tfidf_vocabulary)

sklearn.feature_extraction.text.TfidfVectorizer
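
For intuition about what the TF-IDF weighting does, the sketch below computes it by hand for three toy documents. It assumes scikit-learn's default settings (raw term counts, smoothed idf, L2 normalisation); the documents and helper names are invented for illustration.

```python
import math

def smooth_idf(n_documents, document_frequency):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed form scikit-learn uses by default."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

docs = [["good", "movie"], ["bad", "movie"], ["good", "plot", "good"]]
vocabulary = sorted({term for doc in docs for term in doc})
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

def tfidf_vector(doc):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    weights = [doc.count(term) * smooth_idf(len(docs), doc_freq[term]) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vector = tfidf_vector(docs[2])
print(dict(zip(vocabulary, (round(w, 3) for w in vector))))
```

Terms that occur in fewer documents receive a larger idf, so a rare term counts for more than a common term with the same frequency in the tweet.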


#### Put labels to train

In [15]:
mapper = {'positive': 1, 'negative': -1, 'neutral': 0}

labels = labels.map(mapper)
labels.shape

(7205,)

#### Import SVM

http://scikit-learn.org/stable/modules/svm.html

For a mathematical overview,
https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html

#### Get the optimal regularization parameter using the holdout method

In [17]:
# best_score_C_val = 1.300
# KERNEL = 'linear'
# classifier = SVC(kernel=KERNEL)
# classifier.fit(features, labels)

In [52]:
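from sklearn.pipeline import Pipeline

# note: ngram_range defaults to (1, 1) here, so only the unigram entries
# of the fitted vocabulary are ever counted
classifier = Pipeline( [('vect', CountVectorizer(vocabulary = counter_vocabulary.vocabulary_)),
                        ('tfidf', TfidfTransformer()),
                        ('clf', SVC(kernel='linear'))] )
# fit every pipeline step on the training tweets; predicting with an unfitted
# pipeline raises "NotFittedError: idf vector is not fitted"
classifier.fit(final_tweets, labels)

#### Prediction on training data

In [53]:
from sklearn import metrics

nb_predict_train = classifier.predict(final_tweets)
#check accuracy
print("Accuracy: {:0.4f}".format(metrics.accuracy_score(labels, nb_predict_train)))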

In [None]:
#print confusion matrix
print("{}".format(metrics.confusion_matrix(labels, nb_predict_train, labels=[1,-1, 0])))

print("{}".format(metrics.classification_report(labels, nb_predict_train, labels=[1, -1, 0])))

### Predict using the model
***

#### Import test data

In [None]:
t_df = pd.read_csv('./data/test/actual/test_B_labeled.tsv', sep='\t', header=None)
t_df.shape

In [None]:
t_df = t_df[t_df[3] != 'Not Available']
t_df.shape

In [None]:
# The bar chart for the test data set

y = [len(t_df[t_df[2] == i]) for i in ['positive', 'negative', 'neutral']]
x = ['positive', 'negative', 'neutral']
x_pos = range(len(x))

plt.figure(figsize=(10,8))
plt.bar(x_pos, y, alpha=0.5)
plt.xticks(x_pos, x)
plt.ylabel('# Occurrences').set_size(15)
plt.xlabel('Classes').set_size(15)

#### Pre-process tweets from the test dataset

In [None]:
raw_tweets_test = t_df[3]
raw_tweets_test = [subsEmoticon(tweet, emoticon_dict) for tweet in raw_tweets_test]
raw_tweets_test = [replaceSlangs(tweet, slangs) for tweet in raw_tweets_test]
raw_tweets_test = [replace_apostrophe(tweet, apos) for tweet in raw_tweets_test]
raw_tweets_test = [handle_negation(tweet) for tweet in raw_tweets_test]
preprocessed_tweets_test = [preprocess(tweet) for tweet in raw_tweets_test]
final_tweets_test = [rem_stop(tweet) for tweet in preprocessed_tweets_test]
t_df[3] = final_tweets_test

del raw_tweets_test, preprocessed_tweets_test

In [None]:
t_df.head()

#### Get labels from the test dataset

In [None]:
actual_labels = t_df[2]
actual_labels = actual_labels.map(mapper)
actual_labels.shape

#### Predict labels using the model

In [None]:
predicted_labels = classifier.predict(final_tweets_test)

### Evaluate the Model
***

#### Evaluate the accuracy

In [None]:
print('Accuracy: {:0.2f}%'.format(metrics.accuracy_score(actual_labels, predicted_labels) * 100))

#### 10-fold cross-validation accuracy on test data

In [None]:
# from sklearn import model_selection

# scores = model_selection.cross_val_score(classifier, test_features, actual_labels, cv=10, scoring='accuracy')
# print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# del test_features
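
The commented-out cell above refers to a `test_features` matrix that is not built anywhere in this notebook. Independently of that, the sketch below shows how k-fold cross-validation partitions sample indices so that every sample is used for validation exactly once. It uses contiguous, unshuffled folds and ignores class stratification, which scikit-learn applies by default for classifiers; the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, validation) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(25, k=10))
print([len(validation) for _, validation in folds])
# [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

The model is trained on each `train` list and scored on the matching `validation` list; the reported accuracy is the mean of the k scores.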

In [None]:
# print the classification report
print('{}'.format(metrics.classification_report(actual_labels, predicted_labels)))

See [Confusion Matrix](https://fr.wikipedia.org/wiki/Matrice_de_confusion) for more details


In [None]:
# Confusion Matrix
print('{}\n'.format(metrics.confusion_matrix(actual_labels, predicted_labels, labels=[1,-1,0])))
print("\x1b[31m\" macro f1 score \"\x1b[0m")
print('{}\n'.format(metrics.f1_score(actual_labels, predicted_labels, average='macro')))
print("\x1b[31m\" micro f1 score \"\x1b[0m")
print('{}\n'.format(metrics.f1_score(actual_labels, predicted_labels, average='micro')))
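
The two averages printed above can differ noticeably on imbalanced data. Macro F1 averages the per-class F1 scores, so every class counts equally; micro F1 pools the counts of all classes first, which for single-label classification equals plain accuracy. The sketch below computes both by hand on six invented labels, using the same class encoding as `mapper`; the function name `f1_scores` is hypothetical.

```python
from collections import Counter

def f1_scores(actual, predicted, classes=(1, -1, 0)):
    """Return (macro F1, micro F1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(actual, predicted):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # the guessed class gains a false positive
            fn[truth] += 1   # the true class gains a false negative

    def f1(t, p, n):
        return 2 * t / (2 * t + p + n) if (2 * t + p + n) else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

actual_demo    = [1, 1, 1, 1, -1, 0]
predicted_demo = [1, 1, 1, 0, -1, 1]
macro_f1, micro_f1 = f1_scores(actual_demo, predicted_demo)
print(round(macro_f1, 4), round(micro_f1, 4))  # 0.5833 0.6667
```

Here the neutral class is never predicted correctly, which pulls the macro average (per-class F1 of 0.75, 1.0 and 0.0) well below the micro average of 4 correct out of 6.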

### Comparison with the 5 best teams of subtask B

We compare our averaged F-scores with those of the five best teams in the workshop. The results are taken from the following report:
[Final report, SemEval-2014 Task 9](http://www.aclweb.org/anthology/S14-2009)

|Team|Accuracy (Macro Averaged)| Accuracy (Micro Averaged)|
|----|-------------------------|--------------------------|
|TeamX|65.63%|69.99%|
|coooolll|63.23%|70.51%|
|RTRGO|63.08%|70.15%|
|NRC-Canada|67.62%|71.37%|
|TUGAS|63.89%|68.84%|
|**_ME_**|_57.48%_|_64.86%_|
| | |***rank: 23 / 50***|
 