# Twitter sentiment analysis

For this project, we will compute the <b>semantic orientation</b> score, which tells us whether a term (and therefore a tweet) is more closely related to a positive or to a negative vocabulary. It is particularly interesting that this approach is <b>unsupervised</b>, that is, it requires no labeled data.

## Loading packages and data

In [1]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
import unicodedata
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = pd.read_csv("dataset.csv", encoding='latin1')
print(data.shape)
print('%.2f percents of tweets are positives'%(data.Sentiment.sum()*100/len(data)))
data.head()

(99989, 3)
56.46 percents of tweets are positives


Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


## Cleaning tweets

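The tweets need a cleaning step:  
- Stripping surrounding white space and converting to lowercase
- Removing URLs
- Collapsing repeated letters (e.g. "looove")
- Removing @mentions, hashtags, punctuation, digits and accents
- Dropping words shorter than 3 letters and English stop words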

In [3]:
%%time
SentimentText = []
for i,text in enumerate(data.SentimentText):
    if i%10000 == 0:
        print(i,'tweets processed')
    
    # Remove white spaces before and after, and set lowercase
    t = text.strip().lower()
    
    # Remove url
    t = re.sub(r'http(s)?:?//[\w$.…@&+-/]*',' ',t)
    t = re.sub(r'(www\.)(\w|\.)*', ' ', t)
    
    # Remove repeated letters (e.g. "I looove it !"): vowels, spaces and dots
    # are collapsed from 2+ repeats, consonants from 3+ repeats
    for n in range(25,1,-1):
        for voyelle in ['a','e','i','o','u','y',' ','.']:
            t = t.replace(voyelle*n, voyelle)
    for n in range(25,2,-1):
        for consonne in ['b','c','d','f','g','h','j','k','l','m','n','p','q','r','s','t','v','w','x','z']:
            t = t.replace(consonne*n, consonne)

    # Remove @mentions, hashtags and ampersands (with the word attached to them)
    t = re.sub(r'(@|#|&)(\w)*',' ',t)
    
    # Remove remaining non-text characters
    t = re.sub(r'[^\w\s]',' ',t) # everything
    t = re.sub(r'[\d]+',' ',t)   # numbers
    t = re.sub(r'[_]+',' ',t)   # underscore '_' 
    
    # Remove white spaces IN the tweets
    t = re.sub(r'[\s]{2,}',' ',t).strip()
    t = unicodedata.normalize('NFD', t).encode('ascii', 'ignore').decode()
    
    # We consider words of fewer than 3 letters to be non-informative
    t = [mot for mot in t.split() if len(mot)>=3]
    # Remove stop-words
    t = [mot for mot in t if mot not in stopwords.words('english')]
    t = ' '.join(t)
    
    SentimentText.append(t.lower())

0 tweets processed
10000 tweets processed
20000 tweets processed
30000 tweets processed
40000 tweets processed
50000 tweets processed
60000 tweets processed
70000 tweets processed
80000 tweets processed
90000 tweets processed
CPU times: user 3min 29s, sys: 31.6 s, total: 4min 1s
Wall time: 4min 2s
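Most of the four minutes above is likely spent in `stopwords.words('english')`, which the loop calls again for every word of every tweet. The sketch below shows one possible faster variant, not the cell that produced the results in this notebook: the stop words are held in a set built once, and the letter-collapsing loops are replaced by two regular expressions. `clean_tweet`, `STOP_WORDS` and the tiny stop-word list are hypothetical names and data chosen for illustration, the URL pattern is simplified, and accented letters are dropped rather than converted to ASCII.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# nltk's stopwords.words('english'), which should be turned into a set once.
STOP_WORDS = frozenset({"the", "and", "for", "this", "that"})

# Vowels, spaces and dots: 2 or more repeats collapse to one character.
VOWEL_RUN = re.compile(r"([aeiouy .])\1+")
# Consonants (every letter except the vowels and 'y'): 3 or more repeats collapse.
CONSONANT_RUN = re.compile(r"([b-df-hj-np-tv-xz])\1{2,}")


def clean_tweet(text, stop_words=STOP_WORDS):
    """Simplified version of the cleaning steps applied in the loop above."""
    t = text.strip().lower()
    t = re.sub(r"https?://\S+|www\.\S+", " ", t)   # URLs (simplified pattern)
    t = VOWEL_RUN.sub(r"\1", t)
    t = CONSONANT_RUN.sub(r"\1", t)
    t = re.sub(r"[@#&]\w*", " ", t)                 # mentions, hashtags, ampersands
    t = re.sub(r"[^a-z\s]", " ", t)                 # anything that is not a-z or space
    words = [w for w in t.split() if len(w) >= 3 and w not in stop_words]
    return " ".join(words)


print(clean_tweet("  Missed the New Moon trailer  "))
```

Like the original loop, this turns "moon" into "mon": collapsing every doubled vowel is a deliberate simplification that also damages some legitimate words.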


In [4]:
data['SentimentText'] = pd.Series(SentimentText)
data.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,sad apl friend
1,2,0,missed new mon trailer
2,3,1,omg already
3,4,0,omgaga gunna cry ben dentist since suposed get...
4,5,0,think cheating


In [5]:
data.to_csv('dataset_processed.csv')

## Semantic orientation

For some positively connoted vocabulary $V^{+}$ and some negatively connoted vocabulary $V^{-}$, the **semantic orientation** (SO) of a term $t$ is defined as follows:
$$SO(t) = \sum_{t' \in V^{+}} PMI(t,t') - \sum_{t' \in V^{-}} PMI(t,t'),$$
where $PMI(t_1, t_2)$ is a proximity measure (https://en.wikipedia.org/wiki/Pointwise_mutual_information) given as 
$$PMI(t_1,t_2) = \ln\left(\frac{\mathbb{P}(t_1, t_2)}{\mathbb{P}(t_1) \cdot \mathbb{P}(t_2)}\right) = \ln\left(\frac{ \frac{DF(t_1,t_2)}{D} }{\frac{DF(t_1)}{D} \cdot \frac{DF(t_2)}{D}}\right)$$

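In other words, SO measures how much closer a term $t$ is to the positive vocabulary than to the negative one. Here $D$ is the number of documents (tweets) in the corpus, $DF(t)$ is the document frequency of $t$, i.e. the number of documents in which $t$ occurs, and $DF(t,t')$ is the number of documents in which both $t$ and $t'$ occur.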

In the end, we need to compute the following for each term $t$ in our corpus:
$$SO(t) = \sum_{t' \in V^{+}} \ln\left(\frac{ \frac{DF(t,t')}{D} }{\frac{DF(t)}{D} \cdot \frac{DF(t')}{D}}\right) - \sum_{t' \in V^{-}} \ln\left(\frac{ \frac{DF(t,t')}{D} }{\frac{DF(t)}{D} \cdot \frac{DF(t')}{D}}\right),$$

which simplifies to

$$SO(t) = \sum_{t' \in V^{+}} \Big( \ln DF(t,t') - \ln DF(t) - \ln DF(t') + \ln D \Big) - \sum_{t' \in V^{-}} \Big( \ln DF(t,t') - \ln DF(t) - \ln DF(t') + \ln D \Big)$$
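As a quick sanity check, the ratio form of PMI and its expansion into a sum of logarithms can be compared numerically. The sketch below uses made-up counts (10 co-occurrences, 50 and 40 individual occurrences, 1000 documents); the function names `pmi_ratio` and `pmi_expanded` are hypothetical and not part of the notebook.

```python
import math


def pmi_ratio(df_joint, df_t1, df_t2, n_docs):
    """PMI as the log of P(t1, t2) / (P(t1) * P(t2))."""
    p_joint = df_joint / n_docs
    p_t1 = df_t1 / n_docs
    p_t2 = df_t2 / n_docs
    return math.log(p_joint / (p_t1 * p_t2))


def pmi_expanded(df_joint, df_t1, df_t2, n_docs):
    """Same quantity written as ln DF(t1,t2) - ln DF(t1) - ln DF(t2) + ln D."""
    return (math.log(df_joint) - math.log(df_t1)
            - math.log(df_t2) + math.log(n_docs))


# Made-up counts: (10/1000) / ((50/1000) * (40/1000)) = 5, so PMI = ln 5
print(pmi_ratio(10, 50, 40, 1000))
print(pmi_expanded(10, 50, 40, 1000))
```

Both calls print a value close to 1.609 (that is, $\ln 5$), up to floating-point rounding. Neither function handles zero counts; that case is covered by the log(0) = 0 convention.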

For a (relatively) small corpus like ours, $DF(t,t')$ and $DF(t')$ may be equal to zero, in which case $\ln(DF(t,t'))$ and $\ln(DF(t'))$ are undefined. For this reason, we will assume log(0) = 0, as stated on p. 12 here: https://web.stanford.edu/class/linguist236/materials/ling236-handout-05-09-vsm.pdf

We get $V^{+}$ and $V^{-}$ from the existing lexicons made by Bing Liu : http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

In [13]:
V_plus  = pd.read_table('opinion-lexicon-English/positive-words.txt', encoding='latin1')[33:].values
V_minus = pd.read_table('opinion-lexicon-English/negative-words.txt', encoding='latin1')[33:].values
V_plus  = np.ravel(V_plus).tolist()
V_minus = np.ravel(V_minus).tolist()

In [14]:
print(len(V_plus))
V_plus[:6]

2007


['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable']

In [15]:
print(len(V_minus))
V_minus[:6]

4783


['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably']
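Slicing with `[33:]` relies on the header having exactly that many lines. A less position-dependent alternative is sketched below, assuming that header lines in the lexicon files start with `;` (an assumption about the file layout that should be checked against the downloaded files). `load_lexicon` is a hypothetical helper, and the sample text stands in for a real file.

```python
import io


def load_lexicon(lines):
    """Return the words of a lexicon, skipping blank lines and ';' comment lines."""
    words = []
    for line in lines:
        word = line.strip()
        if word and not word.startswith(";"):
            words.append(word)
    return words


# In-memory stand-in for a lexicon file (assumed layout: ';' header, one word per line)
sample = io.StringIO("; header comment\n;\n\na+\nabound\nabounds\n")
print(load_lexicon(sample))
```

With a real file, the same function could be fed an open file handle, e.g. `with open(path, encoding='latin1') as f: V_plus = load_lexicon(f)`.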

## Split train-valid

In [16]:
x_train, x_valid , y_train, y_valid = train_test_split(data.SentimentText, 
                                                       data.Sentiment.values,
                                                       test_size=0.2,
                                                       random_state=123, shuffle=True,
                                                       stratify=data.Sentiment.values)
# Reset the index of both series so that rows are numbered from 0
for series in (x_train, x_valid):
    series.reset_index(drop=True, inplace=True)
    
len(x_train), len(y_train), len(x_valid), len(y_valid)

(79991, 79991, 19998, 19998)

## Document-term and co-occurrence matrices

### Document-term matrix
This matrix will be used to efficiently compute $DF(t)$.

In [17]:
cv = CountVectorizer()
docs_terms = cv.fit_transform(x_train)
print(docs_terms.shape)

(79991, 39073)


So we have <b>79991</b> documents, resulting in a vocabulary of <b>39073</b> words (for x_train).

### Co-occurrence matrix
This matrix will be used to efficiently compute $DF(t,t')$.

In [18]:
%%time
co_occurence = docs_terms.transpose().dot(docs_terms)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = cv.get_feature_names_out()
docs_terms   = pd.DataFrame(docs_terms.toarray(), columns=terms) 
co_occurence = pd.DataFrame(co_occurence.toarray(), columns=terms, index=terms) 
print(co_occurence.shape)

(39073, 39073)
CPU times: user 1.97 s, sys: 2.62 s, total: 4.6 s
Wall time: 4.85 s


In [19]:
print(docs_terms.shape)
docs_terms.head()

(79991, 39073)


Unnamed: 0,aaww,aba,aback,abagail,abandon,abandoned,abandoning,abangan,abb,abba,...,zumba,zune,zurg,zuri,zurich,zushi,zwart,zwijger,zwinky,zxcv
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
print(co_occurence.shape)
co_occurence.head()

(39073, 39073)


Unnamed: 0,aaww,aba,aback,abagail,abandon,abandoned,abandoning,abangan,abb,abba,...,zumba,zune,zurg,zuri,zurich,zushi,zwart,zwijger,zwinky,zxcv
aaww,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aba,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aback,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abagail,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As an example one can spot DF('suck','sucks'), that is, the tweets featuring both "suck" and "sucks" : 

In [21]:
print(co_occurence.loc['sucks','suck'])
idx = docs_terms[(docs_terms['sucks'] != 0) & (docs_terms['suck'] != 0)].index
for i in idx:
    print(i, x_train.loc[i])  # .ix was removed from pandas; the index was reset, so .loc works

2
20402 life sucks would suck lot beautiful check yes gingerette kep company
72735 well sucks people suck havent met one honestly nice person today evil


## DF(t), DF(t,t')

Let's find an efficient way to compute DF(t), that is, the number of documents in which a term $t$ appears. Beware: this is not the same as the number of times $t$ occurs across the documents! The diagonal entry `co_occurence.loc[t,t]` is built from raw counts (it sums the squared per-tweet counts of $t$), so it matches the total term frequency <b>TF(t)</b> only when $t$ never repeats within a tweet, whereas we are looking for <b>DF(t)</b>.

In [22]:
sum_by_terms = docs_terms.values > 0
sum_by_terms = sum_by_terms.sum(axis=0)
sum_by_terms = pd.DataFrame(data={'Nb_times_appears':sum_by_terms}, index=docs_terms.columns).T
sum_by_terms

Unnamed: 0,aaww,aba,aback,abagail,abandon,abandoned,abandoning,abangan,abb,abba,...,zumba,zune,zurg,zuri,zurich,zushi,zwart,zwijger,zwinky,zxcv
Nb_times_appears,1,2,1,1,4,7,1,1,2,6,...,3,6,1,1,2,1,1,1,1,1
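The difference between counting occurrences and counting documents can be illustrated without any matrix. The sketch below is a dependency-free illustration on a toy corpus, not the notebook's method: each tweet is reduced to its set of distinct terms, so a word repeated inside one tweet is counted once, both for DF(t) and for DF(t,t'). The names `document_frequencies`, `pair_df` and `toy_docs` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def document_frequencies(docs):
    """Count, for each term and each pair of terms, the documents containing them."""
    df = Counter()
    df_pair = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))        # each term counts once per document
        df.update(terms)
        df_pair.update(combinations(terms, 2))  # unordered pairs, alphabetical order
    return df, df_pair


def pair_df(df_pair, t, t1):
    """DF(t, t1), whatever the order of the two terms; 0 if they never co-occur."""
    return df_pair[tuple(sorted((t, t1)))]


toy_docs = ["life sucks would suck", "sucks sucks sucks", "people suck"]
df, df_pair = document_frequencies(toy_docs)
print(df["sucks"])                        # 2 documents, although the word occurs 4 times
print(pair_df(df_pair, "sucks", "suck"))  # 1 document contains both
```

A product of raw count matrices would instead weight the second toy tweet by its repeated occurrences, which is why binarizing the counts (for example with `docs_terms.values > 0`, as above) matters when document frequencies are wanted.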


In [32]:
def DF(t, t1=''):
    """
    return either :
        - the number of documents where t appears 
        - the number of documents where both t and t1 appear (if t1 specified)
    """
    if t1 == '':
        try:
            return sum_by_terms.loc['Nb_times_appears', t]
        except KeyError:   # t is not in the training vocabulary
            return 0
    else:
        try:
            return co_occurence.loc[t, t1]
        except KeyError:   # t or t1 is not in the training vocabulary
            return 0
DF('cat','dog') # value read from co_occurence for the pair ('cat', 'dog')

0

## PMI(t1, t2)

As stated earlier, we will assume log(0) = 0 :

In [26]:
def logarithme(nombre):
    return 0 if nombre==0 else np.log(nombre)

In [27]:
D_size = len(x_train)
def PMI(t,t1):
    """
    return the Pointwise Mutual Information (PMI) between two terms
    """
    if DF(t,t1) == 0:
        return 0
    else:
        return logarithme(DF(t,t1)) - logarithme(DF(t)) - logarithme(DF(t1)) + logarithme(D_size)
PMI('kitty','happy') # test

0.36769561060246758

PMI('kitty','happy') returns 0.368, while PMI('wedding','betrayal') would yield 0, as 'betrayal' doesn't occur in our corpus.

## Semantic orientation of a word: SO(t)

In [28]:
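def SO(t):
    pmi_plus, pmi_minus = 0, 0
    # NOTE: zip stops at the shorter list, so only the first len(V_plus)
    # words of V_minus contribute to the negative sum
    for t1, t2 in zip(V_plus[:], V_minus[:]):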
        pmi_plus = pmi_plus + PMI(t, t1)
        pmi_minus = pmi_minus + PMI(t, t2)
    return pmi_plus - pmi_minus

print('SO(\"adventure\") =', SO('adventure'))
print('SO(\"wedding\") =', SO('wedding'))
print('SO(\"darkness\") =', SO('darkness'))
print('SO(\"honey\") =', SO('honey'))
print('SO(\"suicide\") =', SO('suicide'))
print('SO(\"france\") =', SO('france'))

SO("adventure") = 16.6390233941
SO("wedding") = 18.1816566727
SO("darkness") = 12.0822023553
SO("honey") = 19.2777473106
SO("suicide") = -5.5589319577
SO("france") = -9.45185374576
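One detail of `SO` is worth noting: `zip` stops at the shorter of its arguments, so the loop pairs only the first `len(V_plus)` negative words with the positive ones and ignores the rest of `V_minus`. A variant that follows the definition literally, summing over each lexicon separately, is sketched below. `so_full`, `so_zipped` and the toy PMI table are hypothetical and only illustrate the difference; they do not reproduce the scores printed above.

```python
def so_full(term, v_plus, v_minus, pmi):
    """SO(t) summed over every word of both lexicons, whatever their lengths."""
    return sum(pmi(term, w) for w in v_plus) - sum(pmi(term, w) for w in v_minus)


def so_zipped(term, v_plus, v_minus, pmi):
    """Same computation with zip, which truncates to the shorter lexicon."""
    total = 0.0
    for w_plus, w_minus in zip(v_plus, v_minus):
        total += pmi(term, w_plus) - pmi(term, w_minus)
    return total


# Hypothetical PMI values standing in for the notebook's PMI function
toy_pmi = {("trip", "great"): 1.5, ("trip", "awful"): 0.5, ("trip", "boring"): 0.25}


def pmi_lookup(t, t1):
    return toy_pmi.get((t, t1), 0.0)


print(so_full("trip", ["great"], ["awful", "boring"], pmi_lookup))    # 0.75
print(so_zipped("trip", ["great"], ["awful", "boring"], pmi_lookup))  # 1.0, "boring" is ignored
```

Since the negative lexicon is more than twice as long as the positive one, full sums may shift scores towards the negative side, so some normalization (for instance averaging per lexicon) might be worth considering.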


Wait... how come France, the most visited country in the world, ends up with such a bad semantic orientation?

<b>Answer:</b> an Air France plane crashed, leading Twitter users to express their condolences and, therefore, to use a negatively connoted vocabulary.

In [29]:
for tweet in data.SentimentText.values[:6000]:
    if 'france' in tweet:
        print(tweet)

bodies air france crash ben found via
confirmed missing air france flight crashed atlantica
abt missing air france plane follow best air news source
thinking air france plane passengers
sad air france catastrophe
france tragic
prayers air france victims
france sad
miss tobs already fun france dont leave long pall love love love also love juju portsmouth
parts airfrance found near coast senegal


## Semantic orientation of a tweet : SO(tweet)
We assume that the semantic orientation of a tweet is simply the sum of the semantic orientations of its terms.
$$SO\text{("Python is fun !")} = SO(\text{"Python"}) + SO(\text{"is"}) + SO(\text{"fun"})$$
As a result, we are able to predict whether a tweet is positively connoted or not!

In [30]:
def SO_tweet(tweet):
    somme_SO = 0
    for mot in tweet.split():
        somme_SO = somme_SO + SO(mot)
    return somme_SO

SO_tweet("python is fun")

58.039340540465389
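To turn the score into a prediction, one natural choice is to label a tweet positive when its score exceeds a threshold. The threshold of 0 used below is an assumption, not something fixed by the method, and could be tuned on the training split. The sketch also shows how accuracy on a labeled set could be measured; `predict_label`, `accuracy` and the toy scorer are hypothetical and stand in for `SO_tweet` and the validation data.

```python
def predict_label(tweet, so_tweet, threshold=0.0):
    """Return 1 (positive) if the tweet's score exceeds the threshold, else 0."""
    return 1 if so_tweet(tweet) > threshold else 0


def accuracy(tweets, labels, so_tweet, threshold=0.0):
    """Fraction of tweets whose predicted label matches the true label."""
    hits = sum(predict_label(t, so_tweet, threshold) == y
               for t, y in zip(tweets, labels))
    return hits / len(labels)


# Toy per-word scores standing in for SO(t); the values are made up
toy_so = {"love": 2.0, "fun": 1.0, "sad": -3.0, "crash": -1.5}


def toy_so_tweet(tweet):
    return sum(toy_so.get(word, 0.0) for word in tweet.split())


toy_tweets = ["love fun", "sad crash", "fun sad"]
toy_labels = [1, 0, 1]
print(accuracy(toy_tweets, toy_labels, toy_so_tweet))  # 2 of the 3 toy tweets are correct
```

On the real data, the same functions could be called with `SO_tweet`, `x_valid` and `y_valid`.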

## Conclusion and limitation

The semantic orientation approach, although intuitive, has several limitations:
- It does not capture:
    - negation (e.g. "not bad");
    - intensity (e.g. "so successful");
    - irony (e.g. "he played football so well once again...").
- The prediction process is time-consuming.
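Part of the prediction cost comes from recomputing the score of a word every time it appears, since each call loops over the whole lexicons. One possible mitigation, sketched here under the assumption that scores do not change between calls, is to memoize per-term scores with `functools.lru_cache`. `slow_so` is a hypothetical placeholder for the real `SO` function, and the call counter only serves to show the effect of the cache.

```python
from functools import lru_cache

calls = 0  # counts how many times the expensive function really runs


def slow_so(term):
    """Placeholder for an expensive per-term score (stands in for SO)."""
    global calls
    calls += 1
    return float(len(term))


@lru_cache(maxsize=None)
def cached_so(term):
    """Memoized wrapper: each distinct term is scored only once."""
    return slow_so(term)


def so_tweet_cached(tweet):
    return sum(cached_so(word) for word in tweet.split())


print(so_tweet_cached("fun fun python"))  # 12.0 with the placeholder score
print(calls)                              # 2: "fun" was computed once, then reused
```

Because the vocabulary is much smaller than the total number of word occurrences, each distinct term would be scored only once across a whole validation set.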

However, this approach has proved to be quite effective at finding how close a tweet is to a positive (resp. negative) vocabulary, in an unsupervised setting.