# Twitter Data Mining

In this project, we'll go over the following topics:
- Collecting data from Twitter
- Text pre-processing using NLTK
- Analysing term frequencies
- Data Visualization
- Sentiment Analysis

This project was made by going through this (great) blog [post](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/).  Code extracts are mainly taken from there. I've also added pieces of code at various places following my interest. 

## Collecting data

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

In [1]:
import tweepy
from tweepy import OAuthHandler

import os
consumer_key = os.environ['CONSUMER_KEY']
consumer_secret = os.environ['CONSUMER_SECRET']
access_token = os.environ['ACCESS_TOKEN']
access_secret = os.environ['ACCESS_SECRET']
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

Display of the 10 tweets of the home timeline :

In [2]:
for status in tweepy.Cursor(api.home_timeline).items(5):
    # Process a single status
    print(status.text)

Arabie saoudite : les paris risqués de Mohammed ben Salmane 
Le point de vue Francis Perrin, directeur de recherche… https://t.co/RoiSUyEsQf
RT @IRIS_SUP_: Félicitation aux nouveaux diplômés IRIS Sup' 2017! 👏
Près de 500 personnes étaient présentes, diplômés, parents, amis, tous…
RT @carole_gomez: Conférence ce soir sur lutte contre la manipulation des compétitions en Fr. à @InstitutIRIS avec T.Pujol, C.Kalb @ffhandb…
RT @MouvEuropeen_Fr: Dans le cadre du #OnePlanetSummit le Mouvement Européen et la @EIB vous invitent à mettre en débat les investissements…
RT @InteragencyRAN: JUST OUT our new #report on #fragilestates by 2030 - looking at what the major #drivers of #state fragility are and how…


Display of the JSON response :

In [8]:
import json 
    
for status in tweepy.Cursor(api.home_timeline).items(1):
    # Process a single status
    print(json.dumps(status._json)[:1000])

{"created_at": "Mon Dec 11 17:17:30 +0000 2017", "id": 940269321837862912, "id_str": "940269321837862912", "text": "Arabie saoudite : les paris risqu\u00e9s de Mohammed ben Salmane \nLe point de vue Francis Perrin, directeur de recherche\u2026 https://t.co/RoiSUyEsQf", "truncated": true, "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [{"url": "https://t.co/RoiSUyEsQf", "expanded_url": "https://twitter.com/i/web/status/940269321837862912", "display_url": "twitter.com/i/web/status/9\u2026", "indices": [117, 140]}]}, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 265897069, "id_str": "265897069", "name": "IRIS", "screen_name": "InstitutIRIS", "location": "Paris", "description": "Acteur fran\u00e7ais de la recherche strat\u00e9gique et g\u00e9opolitique. 

Listing connections :

In [9]:
for friend in tweepy.Cursor(api.friends).items(1):
    print(json.dumps(friend._json)[:1000])

{"id": 265897069, "id_str": "265897069", "name": "IRIS", "screen_name": "InstitutIRIS", "location": "Paris", "description": "Acteur fran\u00e7ais de la recherche strat\u00e9gique et g\u00e9opolitique. Ses activit\u00e9s : la recherche, l\u2019organisation de manifestations, la publication, la formation.", "url": "http://t.co/EctAbFIhS6", "entities": {"url": {"urls": [{"url": "http://t.co/EctAbFIhS6", "expanded_url": "http://iris-france.org", "display_url": "iris-france.org", "indices": [0, 22]}]}, "description": {"urls": []}}, "protected": false, "followers_count": 24428, "friends_count": 422, "listed_count": 603, "created_at": "Mon Mar 14 09:29:08 +0000 2011", "favourites_count": 670, "utc_offset": 3600, "time_zone": "Paris", "geo_enabled": true, "verified": false, "statuses_count": 9309, "lang": "fr", "status": {"created_at": "Mon Dec 11 17:17:30 +0000 2017", "id": 940269321837862912, "id_str": "940269321837862912", "text": "Arabie saoudite : les paris risqu\u00e9s de Mohammed ben Sa

Listing my own tweets :

In [5]:
for tweet in tweepy.Cursor(api.user_timeline).items():
    #print(json.dumps(tweet._json))
    print(tweet._json['text'])
    print(tweet._json['created_at'])

"« Cool, bières et bonbons gratuits ! » : plongée dans une start-up déjantée" #actualites #feedly https://t.co/cKWCAPZSVr
Sun Jul 24 17:30:30 +0000 2016


Some variables :
- `text`: the text of the tweet itself
- `created_at`: the date of creation
- `favorite_count`, `retweet_count`: the number of favourites and retweets
- `favorited`, `retweeted`: boolean stating whether the authenticated user (you) have favourited or retweeted this tweet
- `lang`: acronym for the language (e.g. “en” for english)
- `id`: the tweet identifier
- `place`, `coordinates`, `geo`: geo-location information if available
- `user`: the author’s full profile
- `entities`: list of entities like URLs, @-mentions, hashtags and symbols
- `in_reply_to_user_id`: user identifier if the tweet is a reply to a specific user
- `in_reply_to_status_id`: status identifier id the tweet is a reply to a specific status

Listing tweets from another user :

In [6]:
for tweet in tweepy.Cursor(api.user_timeline, id='BarackObama').items(5):
    #print(json.dumps(tweet._json))
    print(tweet._json['text'])
    print(tweet._json['created_at'])

RT @ObamaFoundation: Watch: We hosted a Town Hall in New Delhi with @BarackObama and young leaders about how to drive change and make an im…
Mon Dec 04 22:57:47 +0000 2017
Michelle and I are delighted to congratulate Prince Harry and Meghan Markle on their engagement. We wish you a life… https://t.co/KC9nmjZPuX
Mon Nov 27 21:13:50 +0000 2017
From the Obama family to yours, we wish you a Happy Thanksgiving full of joy and gratitude. https://t.co/xAvSQwjQkz
Thu Nov 23 14:44:27 +0000 2017
ME:  Joe, about halfway through the speech, I’m gonna wish you a happy birth--
BIDEN:  IT’S MY BIRTHDAY!
ME:  Joe.… https://t.co/5qLUsDoaMi
Mon Nov 20 19:02:11 +0000 2017
RT @ObamaFoundation: Today, we honor those who have honored our country with its highest form of service. https://t.co/IbJNCwIofL https://t…
Sat Nov 11 15:13:46 +0000 2017


## Text pre-processing

Text tokenization using NLTK:

In [7]:
import nltk
from nltk.tokenize import word_tokenize

for tweet in tweepy.Cursor(api.user_timeline, id='BarackObama').items(5):
    tweet_text = tweet._json['text']
    print(word_tokenize(tweet_text))

['RT', '@', 'ObamaFoundation', ':', 'Watch', ':', 'We', 'hosted', 'a', 'Town', 'Hall', 'in', 'New', 'Delhi', 'with', '@', 'BarackObama', 'and', 'young', 'leaders', 'about', 'how', 'to', 'drive', 'change', 'and', 'make', 'an', 'im…']
['Michelle', 'and', 'I', 'are', 'delighted', 'to', 'congratulate', 'Prince', 'Harry', 'and', 'Meghan', 'Markle', 'on', 'their', 'engagement', '.', 'We', 'wish', 'you', 'a', 'life…', 'https', ':', '//t.co/KC9nmjZPuX']
['From', 'the', 'Obama', 'family', 'to', 'yours', ',', 'we', 'wish', 'you', 'a', 'Happy', 'Thanksgiving', 'full', 'of', 'joy', 'and', 'gratitude', '.', 'https', ':', '//t.co/xAvSQwjQkz']
['ME', ':', 'Joe', ',', 'about', 'halfway', 'through', 'the', 'speech', ',', 'I', '’', 'm', 'gon', 'na', 'wish', 'you', 'a', 'happy', 'birth', '--', 'BIDEN', ':', 'IT', '’', 'S', 'MY', 'BIRTHDAY', '!', 'ME', ':', 'Joe.…', 'https', ':', '//t.co/5qLUsDoaMi']
['RT', '@', 'ObamaFoundation', ':', 'Today', ',', 'we', 'honor', 'those', 'who', 'have', 'honored', 'our',

Enhancing the tokenization by accounting for @-mentions, emoticons, URLs and hash-tags:

In [17]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
 
for tweet in tweepy.Cursor(api.user_timeline, id='BarackObama').items(5):
    tweet_text = tweet._json['text']
    print(preprocess(tweet_text))
    

['RT', '@ObamaFoundation', ':', 'Watch', ':', 'We', 'hosted', 'a', 'Town', 'Hall', 'in', 'New', 'Delhi', 'with', '@BarackObama', 'and', 'young', 'leaders', 'about', 'how', 'to', 'drive', 'change', 'and', 'make', 'an', 'im', '…']
['Michelle', 'and', 'I', 'are', 'delighted', 'to', 'congratulate', 'Prince', 'Harry', 'and', 'Meghan', 'Markle', 'on', 'their', 'engagement', '.', 'We', 'wish', 'you', 'a', 'life', '…', 'https://t.co/KC9nmjZPuX']
['From', 'the', 'Obama', 'family', 'to', 'yours', ',', 'we', 'wish', 'you', 'a', 'Happy', 'Thanksgiving', 'full', 'of', 'joy', 'and', 'gratitude', '.', 'https://t.co/xAvSQwjQkz']
['ME', ':', 'Joe', ',', 'about', 'halfway', 'through', 'the', 'speech', ',', 'I', '’', 'm', 'gonna', 'wish', 'you', 'a', 'happy', 'birth', '-', '-', 'BIDEN', ':', 'IT', '’', 'S', 'MY', 'BIRTHDAY', '!', 'ME', ':', 'Joe', '.', '…', 'https://t.co/5qLUsDoaMi']
['RT', '@ObamaFoundation', ':', 'Today', ',', 'we', 'honor', 'those', 'who', 'have', 'honored', 'our', 'country', 'with', 

As stated in the blog post, the tokeniser is probably far from perfect, but it gives you the general idea. The tokenisation is based on regular expressions (regexp), which is a common choice for this type of problem. Some particular types of tokens (e.g. phone numbers or chemical names) will not be captured, and will be probably broken into several tokens. To overcome this problem, as well as to improve the richness of your pre-processing pipeline, you can improve the regular expressions, or even employ more sophisticated techniques like Named Entity Recognition.

## Analysing term frequencies

Let's count the terms used in the last 200 tweets of Barack Obama :

In [15]:
import operator 
import json
from collections import Counter

count_all = Counter()
for tweet in tweepy.Cursor(api.user_timeline, id='BarackObama').items(200):
    tweet_text = tweet._json['text']
    # Create a list with all the terms
    terms_all = [term for term in preprocess(tweet_text)]
    # Update the counter
    count_all.update(terms_all)
    
# Print the first 5 most frequent words
print(count_all.most_common(5))

NameError: name 'preprocess' is not defined

As you can see, the most frequent words are not exactly meaningful.  Let's remove the common words, called "stop-words".  We can use NLTK for this.  We also include tweet-specific stop-words such as RT (used for re-tweets), "via" and others:

In [11]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     d:\Profiles\cnozaradan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [12]:
from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'RT', '…', '’', '“', 'The']

We can now adapt the code.  Let's embed it in a method :

In [23]:
def tweet_stats(twitter_id, tweet_nbr=200, most_common=5, lang='english'):
    punctuation = list(string.punctuation)
    stop = stopwords.words(lang) + punctuation + ['rt', 'via', 'RT', '…', '’', '“', 'The','é', 'les', 'a', 'La','Le','Il','être']
    count_all = Counter()
    count_single = Counter()
    count_hash = Counter()
    for tweet in tweepy.Cursor(api.user_timeline, id=twitter_id).items(tweet_nbr):
        tweet_text = tweet._json['text']
        # Create a list with all the terms
        terms_preprocessed = preprocess(tweet_text)
        terms_all = [term for term in terms_preprocessed if term not in stop]
        # Update the counter
        count_all.update(terms_all)
        # Count terms only once, equivalent to Document Frequency
        terms_single = set(terms_all)
        count_single.update(terms_single)
        # Count hashtags only
        terms_hash = [term for term in terms_preprocessed if term.startswith('#')]
        count_hash.update(terms_hash)
       
    # Print the first 5 most frequent words
    print('Most common terms:\n', count_all.most_common(most_common))
    print('\n Most common terms, counted once per tweet:\n', count_single.most_common(most_common))
    print('\n Most common hash-tag terms: ', count_hash.most_common(most_common))

In [49]:
tweet_stats('BarackObama')

Most common terms:
 [('—', 28), ('leaders', 25), ('Senate', 24), ('#DoYourJob', 23), ('Americans', 19)]

 Most common terms, counted once per tweet:
 [('—', 28), ('leaders', 25), ('Senate', 24), ('#DoYourJob', 23), ('Americans', 19)]

 Most common hash-tag terms:  [('#DoYourJob', 23), ('#ActOnClimate', 18), ('#Obamacare', 10), ('#GetCovered', 9), ('#LeadOnLeave', 3)]


For comparison, let's do the same on the tweets from Donald Trump:

In [46]:
tweet_stats('realDonaldTrump')

Most common terms:
 [('I', 32), ('great', 22), ('Tax', 15), ('years', 15), ('President', 14)]

 Most common terms, counted once per tweet:
 [('I', 29), ('great', 21), ('years', 15), ('President', 14), ('Tax', 13)]

 Most common hash-tag terms:  [('#MAGA', 3), ('#MakeAmericaGreatAgain', 2), ('#GES2017', 2), ('#Periscope', 1), ('#WorldAIDSDay', 1)]


This shows already some interesting trends...  Out of curiosity:

In [24]:
tweet_stats('EmmanuelMacron', lang='french')

Most common terms:
 [('Je', 27), ('Nous', 27), ('France', 21), ('jeunesse', 16), ('Afrique', 15)]

 Most common terms, counted once per tweet:
 [('Nous', 25), ('Je', 24), ('France', 20), ('jeunesse', 14), ('contre', 13)]

 Most common hash-tag terms:  [('#NeRienLaisserPasser', 7), ('#CongresAMF', 6), ('#OnePlanet', 5), ('#JohnnyHallyday', 5), ('#TraceMeetsMacron', 4)]
