# TP : Analyse des opinions sous twitter

Maël Fabien

In [451]:
### General import
import pandas as pd
import numpy as np
import os
import re
from sklearn.metrics import accuracy_score

### NLTK
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
import nltk as nltk
from nltk.tokenize.casual import TweetTokenizer

nltk.download('sentiwordnet')

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/maelfabien/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

## I. Importer les fichiers

Les tweets à analyser sont disponibles à l’adresse suivante : https://clavel.wp.imt.fr/files/ 2018/05/testdata.manual.2009.06.14.csv_.zip. Cette base (Sentiment140) a été obtenue sur le site de l’université de Stanford http://help.sentiment140.com/for-students. Un extrait en est donné dans le tableau 1. La base contient 498 tweets annotés manuellement. La base propose 6 champs corres- pondant aux informations suivantes :
1. la polarité du tweet : Chaque tweet est accompagné d’un score pouvant être égal à 0 (négatif), 2 (neutre) ou 4 (positif).
2. l’identifiant du tweet (2087)
3. la date du tweet (Sat May 16 23 :58 :44 UTC 2009)
4. la requête associée (lyx). Si pas de requête la valeur est NO_ QUERY.
5. l’utilisateur qui a tweeté (robotickilldozr)
6. le texte du tweet(Lyx is cool)

In [452]:
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

In [453]:
# Import data
df = pd.read_csv('testdata.manual.2009.06.14.csv', header=None)
df = df.drop([1], axis=1)
df.head(15)

Unnamed: 0,0,2,3,4,5
0,4,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
2,4,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
4,4,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...
5,4,Mon May 11 03:22:00 UTC 2009,kindle2,GeorgeVHulme,@richardebaker no. it is too big. I'm quite ha...
6,0,Mon May 11 03:22:30 UTC 2009,aig,Seth937,Fuck this economy. I hate aig and their non lo...
7,4,Mon May 11 03:26:10 UTC 2009,jquery,dcostalis,Jquery is my new best friend.
8,4,Mon May 11 03:27:15 UTC 2009,twitter,PJ_King,Loves twitter
9,4,Mon May 11 03:29:20 UTC 2009,obama,mandanicole,how can you not love Obama? he makes jokes abo...


In [454]:
# Compute sentiment
score = {}
score[0] = 'Negatif'
score[2] = 'Neutre'
score[4] = 'Positif'

df['sent'] = df[0].apply(lambda x : score[x])

## II. Prétraitements

Les tweets contiennent des caractères spéciaux susceptibles de nuire à la mise en place des méthodes d’analyse d’opinions. Ecrire un programme permettant pour chaque tweet de :

- récupérer le texte associé
- segmenter en tokens
- supprimer les urls
- nettoyer les caractères inhérents à la structure d’un tweet
- corriger les abréviations et les spécificités langagières des tweets à l’aide du dictionnaire DicoS- lang (fichier SlangLookupTable.txt disponible ici : https://clavel.wp.imt.fr/files/2018/06/ Lexiques.zip), encodage du fichier : latin1

In [455]:
# Slang lookup table
slang = pd.read_csv('Lexiques/SlangLookupTable.txt', encoding='latin1', sep='\t', header=None)
slang.head()

Unnamed: 0,0,1
0,121,one to one
1,a/s/l,"age, sex, location"
2,adn,any day now
3,afaik,as far as I know
4,afk,away from keyboard


Vous préciserez dans le CR le nombre d’occurrences des caractères inhérents à la struc- ture du tweet et le nombre d’occurrences des ’hash-tags’ dans le corpus.

In [456]:
# Count characters
def count_char(txt) :
    
    count_hastag = 0
    count_struc = 0
    
    for string in txt :
        count_hastag += string.count("#")
        count_struc += string.count("@")
        
    return "There are %d hashtags and %d tweet inherecent characters." % (count_hastag, count_struc)
        

In [457]:
count_char(list(df[5]))

'There are 52 hashtags and 128 tweet inherecent characters.'

In [458]:
# Clean the text

def treat_text(txt) :
    
    # Remove URLs
    txt = re.sub(r'^https?:\/\/.*[\r\n]*', '', txt, flags=re.MULTILINE)
    
    # Replace characters
    txt = txt.replace("@", "")
    txt = txt.replace("#", "")
    
    # Tokenize
    tokens = tokenizer.tokenize(txt)
    
    # Replace slang language
    for i in range(len(tokens)) :
        if tokens[i] in slang[0] :
            tokens[i] = slang[slang[0] == tokens[i]][1]
    
    return tokens

In [459]:
tokens = df[5].apply(lambda x : treat_text(x))

In [460]:
tokens[:15]

0     [stellargirl, I, looovvveee, my, Kindle, 2, .,...
1     [Reading, my, kindle, 2, ..., Love, it, ..., L...
2     [Ok, ,, first, assesment, of, the, kindle, 2, ...
3     [kenburbary, You'll, love, your, Kindle, 2, .,...
4     [mikefish, Fair, enough, ., But, i, have, the,...
5     [richardebaker, no, ., it, is, too, big, ., I'...
6     [Fuck, this, economy, ., I, hate, aig, and, th...
7                [Jquery, is, my, new, best, friend, .]
8                                      [Loves, twitter]
9     [how, can, you, not, love, Obama, ?, he, makes...
10    [Check, this, video, out, -, -, President, Oba...
11    [Karoli, I, firmly, believe, that, Obama, /, P...
12    [House, Correspondents, dinner, was, last, nig...
13    [Watchin, Espn, .., Jus, seen, this, new, Nike...
14    [dear, nike, ,, stop, with, the, flywire, ., t...
Name: 5, dtype: object

## III. Etiquetage Grammatical

Développer une fonction capable de déterminer la catégorie grammaticale (POS : Part Of Speech) de chaque mot du tweet en utilisant la commande suivante de la libraire nltk :

In [461]:
def POS(tokens) :
    
    list_pos = []
    
    # For each token
    for token in tokens:
        
        # Append the POS tag
        list_pos.append(nltk.pos_tag(token))
        
    return list_pos

In [462]:
taggedData = POS(tokens)

## IV. Algorithme de détection - V1

NLTK dispose entre autre d’une interface pour manipuler la base de données WordNet. Ainsi, après installation de NLTK et du package WordNet, un utilisateur peut accéder à l’ensemble des synsets qui sont liés à un mot donné à l’aide d’une commande simple sous Python. Observez son fonctionnement à l’aide des lignes de code suivantes :

In [464]:
from nltk.corpus import wordnet as wn
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

Pour cette étape, vous devez développer un programme permettant :
- de récupérer uniquement les mots correspondant à des adjectifs, noms, adverbes et verbes

In [465]:
def sort_words(taggedData) :
    list_token = []
    
    # For each tweet
    for tokens in taggedData :
        intermediate_list_token = []
        # For each token
        for token in tokens : 
            # Keep only tokens whose POS starts with the following letters
            if token[1][:2] == 'JJ' or token[1][:2] == 'NN' or token[1][:2] == 'VB' or token[1][:2] == 'JJ' == 'RB' :
                intermediate_list_token.append(token[0])
        list_token.append(intermediate_list_token)
    # Returns a list of lists
    return list_token
                

In [466]:
filtered_words = sort_words(taggedData)

- d’accéder aux scores (positifs et négatifs) des synsets dans la librairie NLTK. Ce script définira dans une classe Python l’objet SentiSynset sur le même modèle que le Synset développé dans NLTK pour WordNet, et permettra de lire le tableau de SentiWordNet comme suit.
- de calculer pour chaque mot les scores associés à leur premier synset,
- de calculer pour chaque tweet la somme des scores positifs et négatifs des SentiSynsets du tweet,
- de comparez la somme des scores positifs et des scores négatifs de chaque tweet pour décider de la classe à associer au tweet.

In [467]:
def compute_score_v1(taggedData) :
    
    # Move the function defined above here
    def sort_words(taggedData) :
        list_token = []
        for tokens in taggedData :
            intermediate_list_token = []
            for token in tokens :
                if token[1][:2] == 'JJ' or token[1][:2] == 'NN' or token[1][:2] == 'VB' or token[1][:2] == 'JJ' == 'RB' :
                    intermediate_list_token.append(token[0])
            list_token.append(intermediate_list_token)
        return list_token

    filtered_words = sort_words(taggedData)
    score_tweet = []
        
    # For each tweer
    for tweet in filtered_words :
        
        # Initalize the scores
        score = 0
        score_pos = 0
        score_neg = 0
        
        # For each token within the tweer
        for token in tweet :
            
            try :
                # Try to compute and add the positive and negative scores linked to the first wordnet
                score_pos += swn.senti_synset(wn.synsets(token)[0].name()).pos_score()
                score_neg += swn.senti_synset(wn.synsets(token)[0].name()).neg_score()
                    
            except : 
                pass
        
        # Format of the output : score pos, score neg, label
        if score_pos > score_neg :
            score_tweet.append([score_pos, score_neg, 'Positif'])
        elif score_pos == score_neg :
            score_tweet.append([score_pos, score_neg, 'Neutre'])
        else : 
            score_tweet.append([score_pos, score_neg, 'Negatif'])
            
    return np.array(score_tweet)

In [468]:
score_v1 = compute_score_v1(taggedData)

In [469]:
accuracy_score(df['sent'], score_v1[:,2])

0.5341365461847389

In [470]:
sum(df['sent'] == score_v1[:,2])

266

L'accuracy atteint 53.4% et on identifie correctement 266 labels sur les 498.

## Algorithme de détection - V2

Vous aurez besoin de : 
- la liste des mots en anglais correspondant à des négations (fichier NegatingWordList.txt disponible ici : https://clavel.wp.imt.fr/files/2018/06/Lexiques.zip) 
- et celle correspondant aux modifieurs (fichier BoosterWordList.txt disponible ici : https://clavel.wp.imt.fr/files/2018/06/Lexiques.zip). 

Pour chaque mot, l’algorithme doit effectuer les opérations suivantes :
- multiplie par 2 le score négatif et le score positif associés au mot si le mot précédent est un modifieur ;
- utilise uniquement le score négatif du mot pour le score positif global du tweet et le score positif du mot pour le score négatif global du tweet si le mot précédent est une négation.

In [471]:
negating = ["aren't","arent", "can't", "cannot", "cant", "don't", "dont", "isn't", "isnt", "never", "not", "won't", "wont", "wouldn't", "wouldnt"]


In [472]:
booster = pd.read_csv('Lexiques/BoosterWordList.txt', header=None, delimiter='\t')
booster

Unnamed: 0,0,1
0,absolutely,1
1,definitely,1
2,extremely,2
3,fuckin,2
4,fucking,2
5,hugely,2
6,incredibly,2
7,just,-1
8,overwhelmingly,2
9,so,0


In [480]:
def compute_score_v2(taggedData) :
    
    def sort_words(taggedData) :
        list_token = []
        for tokens in taggedData :
            intermediate_list_token = []
            for token in tokens :
                if token[1][:2] == 'JJ' or token[1][:2] == 'NN' or token[1][:2] == 'VB' or token[1][:2] == 'JJ' == 'RB' :
                    intermediate_list_token.append(token[0])
            list_token.append(intermediate_list_token)
        return list_token

    filtered_words = sort_words(taggedData)
    score_tweet = []
    
    for tweet in filtered_words :
        modifier = None
        negator = None
        
        score = 0
        score_pos = 0
        score_neg = 0
        
        total_negating = 0
        pos_negating = 0
    
        for token in tweet :
            
            try :
                # Check if the previous word in negating list (and inverse scores)
                if negator in negating :
                    # Total negating words
                    total_negating += 1
                    # Check if previous word in booster list (and double score)
                    if modifier in booster[0] :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).neg_score() * 2
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).pos_score() * 2
                    else :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).neg_score()
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).pos_score()
                else :
                    if modifier in booster[0] :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).pos_score() * 2
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).neg_score() * 2
                    else :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).pos_score()
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).neg_score()
                    
            except : 
                pass
            
            # Set the current token as the modifier and the negator for the next iteration
            modifier = token
            negator = token
            
        
        if score_pos > score_neg :
            # Count number of positive tweets with negating words
            if total_negating > 0 :
                pos_negating =+ 1
            score_tweet.append([score_pos, score_neg, 'Positif'])
        elif score_pos == score_neg :
            score_tweet.append([score_pos, score_neg, 'Neutre'])
        else : 
            score_tweet.append([score_pos, score_neg, 'Negatif'])
    print("Total number of negating terms in positive tweets : " + str(pos_negating))
    return np.array(score_tweet)

In [481]:
score_v2 = compute_score_v2(taggedData)

Total number of negating terms in positive tweets : 0


In [482]:
accuracy_score(df['sent'], score_v2[:,2])

0.5381526104417671

In [483]:
sum(df['sent'] == score_v2[:,2])

268

L'accuracy augmente légèrement avec cette nouvelle version, et on classifie correctement 2 exemples de plus.

## Algorithme de détection - V3

Vous avez ici besoin du dictionnaire d’émoticons est disponible (fichier EmoticonLookupTable.txt disponible ici : https://clavel.wp.imt.fr/files/2018/06/Lexiques.zip). 

Cet algorithme demande en entrée deux listes supplémentaires : 
- une liste d’emoticons positifs 
- et une liste d’émoticons négatifs

Les émoticons positifs rencontrés augmentent de 1 la valeur du score positif du tweet, tandis que les émoticons négatifs augmentent de 1 la valeur du score négatif du tweet.

In [484]:
emo = pd.read_csv('Lexiques/EmoticonLookupTable.txt', sep='\t', header=None)

In [485]:
emo_pos = list(emo[emo[1] > 0][0])
emo_neg = list(emo[emo[1] < 0][0])

In [492]:
def compute_score_v3(taggedData) :
    
    def sort_words(taggedData) :
        list_token = []
        for tokens in taggedData :
            intermediate_list_token = []
            for token in tokens :
                if token[1][:2] == 'JJ' or token[1][:2] == 'NN' or token[1][:2] == 'VB' or token[1][:2] == 'JJ' == 'RB' :
                    intermediate_list_token.append(token[0])
            list_token.append(intermediate_list_token)
        return list_token

    filtered_words = sort_words(taggedData)
    score_tweet = []
        
    # Total number of smileys
    nb_smileys = 0 
    
    for tweet in filtered_words :
        
        modifier = None
        negator = None
        
        score = 0
        score_pos = 0
        score_neg = 0
        
        for token in tweet :

            try :
                if negator in negating :
                    if modifier in booster[0] :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).neg_score() * 2
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).pos_score() * 2
                    else :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).neg_score()
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).pos_score()
                else :
                    if modifier in booster[0] :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).pos_score() * 2
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).neg_score() * 2
                    else :
                        score_pos += swn.senti_synset(wn.synsets(token)[0].name()).pos_score()
                        score_neg += swn.senti_synset(wn.synsets(token)[0].name()).neg_score()
            
            # We now change the except and if the token does not have a synset, we check if it is an emoji
            except : 
                
                if token in emo_pos :
                    score_pos += 1
                    nb_smileys += 1
                elif token in emo_neg :
                    score_neg += 1
                    nb_smileys += 1
            
            else :
                pass
            
            modifier = token
            negator = token
            
        
        if score_pos > score_neg :
            score_tweet.append([score_pos, score_neg, 'Positif'])
        
        elif score_pos == score_neg :
            score_tweet.append([score_pos, score_neg, 'Neutre'])
        
        else : 
            score_tweet.append([score_pos, score_neg, 'Negatif'])
    
    print("Total number of smileys : " + str(nb_smileys))
    return np.array(score_tweet)

In [493]:
score_v3 = compute_score_v3(taggedData)

Total number of smileys : 52


In [494]:
accuracy_score(df['sent'], score_v3[:,2])

0.5742971887550201

In [495]:
sum(df['sent'] == score_v3[:,2])

286

## Algorithme de détection - V4

En analysant les sorties des algorithmes proposés précédemment, proposez votre propre algorithme d’analyse des opinions dans les tweets et les performances que vous obtenez.

In [516]:
def compute_score_v4(taggedData, factor) :
    
    def sort_words(taggedData) :
        list_token = []
        for tokens in taggedData :
            intermediate_list_token = []
            for token in tokens :
                if token[1][:2] == 'JJ' or token[1][:2] == 'NN' or token[1][:2] == 'VB' or token[1][:2] == 'JJ' == 'RB' :
                    intermediate_list_token.append(token[0])
            list_token.append(intermediate_list_token)
        return list_token

    filtered_words = sort_words(taggedData)
    score_tweet = []
    nb_smileys = 0
    
    for tweet in filtered_words :
        modifier = None
        negator = None
        
        score = 0
        score_pos = 0
        score_neg = 0
        
        for token in tweet :
            
            score_pos_int = 0
            score_neg_int = 0
            score_obj_int = 0
            
            tot = len(wn.synsets(token))
            
            if tot > 0 :
                
                # We now iterate through all the wordnets of a given token
                # And give each definition a decaying weight on the overall score (Factor ^ i where i is position)
                for i in range(tot) :
                     
                    if negator in booster[0] :
                        if modifier in booster[0] :
                            score_pos_int += swn.senti_synset(wn.synsets(token)[i].name()).neg_score() * 2 * factor ** (i)
                            score_neg_int += swn.senti_synset(wn.synsets(token)[i].name()).pos_score() * 2 * factor ** (i)
                            score_obj_int += swn.senti_synset(wn.synsets(token)[i].name()).obj_score() * 2 * factor ** (i)
                        
                        else :
                            score_pos_int += swn.senti_synset(wn.synsets(token)[i].name()).neg_score() * factor ** (i)
                            score_neg_int += swn.senti_synset(wn.synsets(token)[i].name()).pos_score() * factor ** (i)
                            score_obj_int += swn.senti_synset(wn.synsets(token)[i].name()).obj_score() * 2 * factor ** (i)
                    else :
                        if modifier in booster[0] :
                            score_pos_int += swn.senti_synset(wn.synsets(token)[i].name()).pos_score() * 2 * factor ** (i)
                            score_neg_int += swn.senti_synset(wn.synsets(token)[i].name()).neg_score() * 2 * factor ** (i)
                            score_obj_int += swn.senti_synset(wn.synsets(token)[i].name()).obj_score() * 2 * factor ** (i)
                        else :
                            score_pos_int += swn.senti_synset(wn.synsets(token)[i].name()).pos_score() * factor ** (i)
                            score_neg_int += swn.senti_synset(wn.synsets(token)[i].name()).neg_score() * factor ** (i)
                            score_obj_int += swn.senti_synset(wn.synsets(token)[i].name()).obj_score() * 2 * factor ** (i)
                
            else :
                if token in emo_pos :
                    score_pos += 1
                    nb_smileys += 1
                elif token in emo_neg :
                    score_neg += 1
                    nb_smileys += 1
            
            # We then normalize the total score to sum to 1 for each token
            total = score_pos_int + score_neg_int + score_obj_int
            
            if total > 0 :
                score_pos += score_pos_int / total
                score_neg += score_neg_int / total
            
            modifier = token
            negator = token
            
        # Also change thresholds by allocation more values to Neutral
        if score_pos > score_neg + 0.1 :
            score_tweet.append([score_pos, score_neg, 'Positif'])
        
        elif score_pos < score_neg - 0.1 :
            score_tweet.append([score_pos, score_neg, 'Negatif'])
        
        else : 
            score_tweet.append([score_pos, score_neg, 'Neutre'])
    
    print("Number of smileys : " + str(nb_smileys))
    return np.array(score_tweet)

In [536]:
score_v4 = compute_score_v4(taggedData, 0.9)

Number of smileys : 52


In [537]:
accuracy_score(df['sent'], score_v4[:,2])

0.642570281124498

In [529]:
sum(df['sent'] == score_v4[:,2])

320

Le résultat global est amélioré de quasiment 7% avec le seuil rajouté et le facteur de poids.