# Quick Python Sentiment Analysis on scraped Reuters Tweets

### First step: Topic modeling, to extract the common semantic structures from the tweets

Let's import the packages needed for the analysis.

In [51]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import numpy
import gensim
from gensim import corpora
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Let's start by looking at a few of the tweets I scraped.

In [54]:
with open("Reuters_tweets.txt","r") as f:
    for tweet in f.read().split('\n')[8:15]:
        print(tweet)

Divided we fall? Australia labor unions' slump may be one reason for low wages growth\
Golf - Mickelson silent after final round at U.S. Open\
Despite the odds, Prauge Zoo successfully bred three white-belted ruffed baby lemurs\
Panasonic halts operations at 2 Osaka plants after Japan earthquake\
Earthquake of 5.6 magnitude strikes Guatemala - USGS\
FIFA probes chants by Mexico fans for homophobia\
Unilever takes stand against digital media's fake followers.\


We have to filter and clean the data before analyzing it.

In [33]:
stop = set(stopwords.words('english'))  # English stop words to drop
exclude = set(string.punctuation)       # punctuation characters to strip
lemma = WordNetLemmatizer()             # reduces words to their lemma

In [34]:
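def clean(doc):
    """Lowercase a tweet, drop stop words and punctuation, then lemmatize."""
    # Remove English stop words (matched on whitespace-separated tokens)
    stop_free = " ".join(i for i in doc.lower().split() if i not in stop)
    # Strip punctuation character by character
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    # Reduce each remaining word to its lemma
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized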
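
To make the cleaning steps concrete, here is a minimal standard-library sketch applied to one of the tweets above. It is only an illustration: `DEMO_STOPWORDS` and `clean_sketch` are hypothetical names, the stop-word set is a tiny hand-picked one rather than NLTK's list, and the lemmatization step is left out.

```python
import string

# Hypothetical, hand-picked stop words for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"we", "may", "be", "one", "for", "at", "after", "of", "by"}
PUNCTUATION = set(string.punctuation)


def clean_sketch(doc):
    """Lowercase, drop stop words, strip punctuation (no lemmatization)."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in DEMO_STOPWORDS)
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCTUATION)
    # Re-split to collapse the double spaces left by removed punctuation
    return " ".join(punc_free.split())


tweet = "Golf - Mickelson silent after final round at U.S. Open"
print(clean_sketch(tweet))
# golf mickelson silent final round us open
```

Note how "U.S." collapses to "us" once punctuation is stripped. In the full pipeline the lemmatizer may shorten such tokens further, which would be consistent with the stray "u" token that shows up in the topics further down.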

In [49]:
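# Use a context manager so the file handle is closed after reading
with open("Reuters_tweets.txt", "r") as f:
    lines = f.readlines()[8:]  # skip the first 8 lines, as in the preview above
doc_clean = [clean(doc).split() for doc in lines]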

Now that the data has been cleaned, I can fit a topic model, here using LDA.

In [39]:
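# Map every distinct token to an integer id
dictionary = corpora.Dictionary(doc_clean)
# Convert each tweet into a bag-of-words list of (token_id, count) pairs
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
# Shorthand for the LDA model class
Lda = gensim.models.ldamodel.LdaModel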
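
For readers unfamiliar with the bag-of-words representation built above, the following standard-library sketch shows the idea on two toy token lists. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the tokens are made up for the example, and the ids gensim assigns will generally differ.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy, already-cleaned documents (illustrative only)
docs = [
    ["earthquake", "magnitude", "strike", "guatemala"],
    ["panasonic", "halt", "plant", "japan", "earthquake", "earthquake"],
]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))
# [(0, 2), (4, 1), (5, 1), (6, 1), (7, 1)]
```

Each tweet becomes a short list of (id, count) pairs, which is the form of input the LDA model is trained on.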

In [57]:
# Fit 8 topics with 200 passes over the corpus, then show the top 5 words per topic
ldamodel = Lda(doc_term_matrix, num_topics=8, id2word=dictionary, passes=200)
ldamodel.print_topics(num_topics=8, num_words=5)

[(0,
  '0.020*"japan" + 0.020*"trade" + 0.020*"fall" + 0.020*"u" + 0.020*"since"'),
 (1,
  '0.029*"say" + 0.029*"defeat" + 0.015*"colombia" + 0.015*"earthquake" + 0.015*"2017"'),
 (2,
  '0.024*"mexico" + 0.023*"may" + 0.023*"world" + 0.023*"plan" + 0.023*"brexit"'),
 (3,
  '0.015*"fall" + 0.015*"take" + 0.015*"solution" + 0.015*"part" + 0.015*"splitting"'),
 (4,
  '0.012*"amazon" + 0.012*"week" + 0.012*"microsoft" + 0.012*"peace" + 0.012*"deal"'),
 (5,
  '0.017*"new" + 0.017*"three" + 0.017*"ruffed" + 0.017*"successfully" + 0.017*"zoo"'),
 (6,
  '0.021*"ramadan" + 0.021*"dessert" + 0.021*"zalabia" + 0.021*"town" + 0.021*"bangladesh"'),
 (7,
  '0.034*"internet" + 0.018*"new" + 0.018*"earthquake" + 0.018*"u" + 0.018*"magnitude"')]

A few comments on the output: the results are interesting, and some words clearly belong to the same theme ("brexit", "plan" and "may" in topic 2, for example). However, because my dataset is quite small (it was extracted only for the purpose of this demonstration), the model struggles to group words into coherent semantic structures, and most words within a topic carry nearly identical weights. With a larger dataset and more training, the output would be much more informative.

### Step two: Sentiment analysis, to see whether a tweet is more positive, neutral or negative

I'll start by instantiating the sentiment analysis tool (NLTK's VADER analyzer).

In [45]:
sid = SentimentIntensityAnalyzer()

Now let's use it on every tweet in my extract:

In [47]:
with open("Reuters_tweets.txt","r") as f:
    for tweet in f.read().split('\n')[8:]:
        print(tweet)
        ss = sid.polarity_scores(tweet)
        for k in sorted(ss):
            print('{0}: {1}, '.format(k, ss[k]), end='')
        print()

Divided we fall? Australia labor unions' slump may be one reason for low wages growth\
compound: -0.2732, neg: 0.13, neu: 0.87, pos: 0.0, 
Golf - Mickelson silent after final round at U.S. Open\
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
Despite the odds, Prauge Zoo successfully bred three white-belted ruffed baby lemurs\
compound: 0.4939, neg: 0.0, neu: 0.775, pos: 0.225, 
Panasonic halts operations at 2 Osaka plants after Japan earthquake\
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
Earthquake of 5.6 magnitude strikes Guatemala - USGS\
compound: -0.3612, neg: 0.294, neu: 0.706, pos: 0.0, 
FIFA probes chants by Mexico fans for homophobia\
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
Unilever takes stand against digital media's fake followers.\
compound: -0.4767, neg: 0.307, neu: 0.693, pos: 0.0, 
@JLDastin and @Stephen nellis report Microsoft is gearing up to eliminate cashiers and checkout lines from stores, in a nascent challenge to Amazon's automated grocery shop. More from t

The output shows the sentiment scores associated with each tweet. Some are fairly accurate ("Suspect dead after 20 hurt in shooting at New Jersey arts festival" received a negative score of 0.5 out of 1), others less so ("Panasonic halts operations at 2 Osaka plants after Japan earthquake" scored 1 as neutral, although I would read it as negative). To improve this, we would need to tune the analyzer, for example by making it more sensitive to certain types of words or by adjusting some of its parameters.
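
One simple way to turn these scores into a positive, neutral or negative label is to threshold the `compound` value. The sketch below assumes the commonly used cutoff of ±0.05; both the threshold and the `label_sentiment` helper are illustrative choices rather than part of the analysis above, and the cutoff can be adjusted.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    The default threshold is an assumed, conventional cutoff; tune as needed.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the output above
examples = {
    "Divided we fall? ...": -0.2732,
    "Golf - Mickelson silent ...": 0.0,
    "Despite the odds, Prauge Zoo ...": 0.4939,
}
for text, compound in examples.items():
    print(label_sentiment(compound), "|", text)
```

With these labels it becomes easier to spot cases like the Panasonic tweet, which lands in the neutral bucket despite reading as negative.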