# Twitter Data Mining

In this project, we'll go over the following topics:
- Collecting data from Twitter
- Text pre-processing using NLTK
- Analysing term frequencies
- Data Visualization
- Sentiment Analysis

This project follows this (great) blog [post](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/).

## Collecting data

To authorise our app to access Twitter on our behalf, we use the OAuth interface. The four credentials are read from environment variables:

In [1]:
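import os

import tweepy
from tweepy import OAuthHandler

# Read the credentials from environment variables so they never appear in the notebook
consumer_key = os.environ['CONSUMER_KEY']
consumer_secret = os.environ['CONSUMER_SECRET']
access_token = os.environ['ACCESS_TOKEN']
access_secret = os.environ['ACCESS_SECRET']

# The consumer key/secret identify the app; the access token/secret identify the user
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Every API call below goes through this authenticated client
api = tweepy.API(auth)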
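
If one of the four variables is missing, `os.environ[...]` fails with a bare `KeyError` on the first missing key. As a minimal sketch (the helper name `load_credentials` and the constant `REQUIRED_KEYS` are hypothetical and not part of Tweepy), the lookup could be wrapped so that every missing variable is reported at once:

```python
import os

# The four environment variables read in the cell above
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

def load_credentials(env=os.environ):
    """Return the four OAuth values, or raise one error naming every missing key."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}

# Demonstration with placeholder values instead of real secrets
demo_env = {key: "placeholder" for key in REQUIRED_KEYS}
credentials = load_credentials(demo_env)
print(sorted(credentials))
```

Calling `load_credentials()` without arguments reads the real environment; the resulting values can then be passed to `OAuthHandler` and `set_access_token` exactly as in the cell above.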

Display the 10 most recent tweets from the home timeline:

In [22]:
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

RT @HUMASCOP: #DeLaVillardiere parrain de la promo @InstitutIRIS @IRIS_SUP_  remise des diplômes https://t.co/FsXjVfZ02V
RT @IRIS_SUP_: Remise des diplômes IRIS 2017. C’est parti! Lancée par P. Boniface et B de la Villardiere, parrain de la promo 2017... https…
RT @HUMASCOP: @BdLVillardiere  parrain de la promo @InstitutIRIS @IRIS_SUP_  remise des diplômes https://t.co/33NEfjcNBQ
RT @HUMASCOP: Équipe pédagogique @IRIS_SUP_ @InstitutIRIS Remise des diplômes https://t.co/4019QGEe3N
RT @JFFiorina: Fier d’avoir remis les doubles diplômes @InstitutIRIS @IRIS_SUP_    Bravo 😀👏 @PascalBoniface #ilovegeopolitique #geopolitiqu…
RT @carole_gomez: Gala de l @IRIS_SUP_ avec discours du @bde_iris @InstitutIRIS https://t.co/42Itj2xNyF
📄Transfert de l’ambassade américaine à Jérusalem : un nouveau camouflet pour le droit international
L'analyse de… https://t.co/jCOdmMUWRW
RT @centrehl: The core team is assembled to prepare for the arrival of our #humanitarianleaders @InstitutIRIS @centrehl and the rest

Display the raw JSON response behind each status:

In [23]:
import json

for status in tweepy.Cursor(api.home_timeline).items(10):
    # status._json holds the raw payload returned by the API, as a dict
    print(json.dumps(status._json))

{"created_at": "Fri Dec 08 21:49:07 +0000 2017", "id": 939250511668240385, "id_str": "939250511668240385", "text": "RT @HUMASCOP: #DeLaVillardiere parrain de la promo @InstitutIRIS @IRIS_SUP_  remise des dipl\u00f4mes https://t.co/FsXjVfZ02V", "truncated": false, "entities": {"hashtags": [{"text": "DeLaVillardiere", "indices": [14, 30]}], "symbols": [], "user_mentions": [{"screen_name": "HUMASCOP", "name": "FREEHUMANITARIAN", "id": 316278379, "id_str": "316278379", "indices": [3, 12]}, {"screen_name": "InstitutIRIS", "name": "IRIS", "id": 265897069, "id_str": "265897069", "indices": [51, 64]}, {"screen_name": "IRIS_SUP_", "name": "IRIS Sup'", "id": 3151901333, "id_str": "3151901333", "indices": [65, 75]}], "urls": [], "media": [{"id": 939197098511208453, "id_str": "939197098511208453", "indices": [97, 120], "media_url": "http://pbs.twimg.com/media/DQiyyPbWkAU6eYM.jpg", "media_url_https": "https://pbs.twimg.com/media/DQiyyPbWkAU6eYM.jpg", "url": "https://t.co/FsXjVfZ02V", "display_url":
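
Since each status carries its raw JSON, one possible way to keep the collected data for later analysis is to write one JSON document per line (the "JSON Lines" layout). The sketch below is only an illustration: the helpers `write_jsonl` and `read_jsonl`, the file name, and the second sample record are assumptions, and stand-in dicts are used so that it runs without Twitter credentials.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Append each record (a dict) to `path` as one JSON document per line."""
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Load every non-empty line of `path` back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Stand-in records; with the notebook's client this could instead be
# (status._json for status in tweepy.Cursor(api.home_timeline).items(10)).
records = [
    {"id": 939250511668240385, "created_at": "Fri Dec 08 21:49:07 +0000 2017"},
    {"id": 1, "text": "hypothetical second record"},
]

path = os.path.join(tempfile.mkdtemp(), "home_timeline.jsonl")
write_jsonl(records, path)
print(read_jsonl(path) == records)
```

Appending rather than overwriting means repeated runs accumulate tweets in the same file, so duplicates would have to be filtered on `id` when reading the file back.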

List the accounts we follow (our "friends"). In Tweepy 4 and later, `api.friends` is named `api.get_friends`:

In [25]:
for friend in tweepy.Cursor(api.friends).items():
    print(json.dumps(friend._json))

{"id": 265897069, "id_str": "265897069", "name": "IRIS", "screen_name": "InstitutIRIS", "location": "Paris", "description": "Acteur fran\u00e7ais de la recherche strat\u00e9gique et g\u00e9opolitique. Ses activit\u00e9s : la recherche, l\u2019organisation de manifestations, la publication, la formation.", "url": "http://t.co/EctAbFIhS6", "entities": {"url": {"urls": [{"url": "http://t.co/EctAbFIhS6", "expanded_url": "http://iris-france.org", "display_url": "iris-france.org", "indices": [0, 22]}]}, "description": {"urls": []}}, "protected": false, "followers_count": 24390, "friends_count": 422, "listed_count": 603, "created_at": "Mon Mar 14 09:29:08 +0000 2011", "favourites_count": 666, "utc_offset": 3600, "time_zone": "Paris", "geo_enabled": true, "verified": false, "statuses_count": 9301, "lang": "fr", "status": {"created_at": "Fri Dec 08 21:49:07 +0000 2017", "id": 939250511668240385, "id_str": "939250511668240385", "text": "RT @HUMASCOP: #DeLaVillardiere parrain de la promo @Institu

List my own tweets:

In [10]:
for tweet in tweepy.Cursor(api.user_timeline).items():
    #print(json.dumps(tweet._json))
    print(tweet._json['text'])
    print(tweet._json['created_at'])

"« Cool, bières et bonbons gratuits ! » : plongée dans une start-up déjantée" #actualites #feedly https://t.co/cKWCAPZSVr
Sun Jul 24 17:30:30 +0000 2016


Some of the fields available on each tweet:
- `text`: the text of the tweet itself
- `created_at`: the date of creation
- `favorite_count`, `retweet_count`: the number of favourites and retweets
- `favorited`, `retweeted`: booleans stating whether the authenticated user (you) has favourited or retweeted this tweet
- `lang`: the language code (e.g. “en” for English)
- `id`: the tweet identifier
- `place`, `coordinates`, `geo`: geo-location information if available
- `user`: the author’s full profile
- `entities`: list of entities like URLs, @-mentions, hashtags and symbols
- `in_reply_to_user_id`: user identifier if the tweet is a reply to a specific user
- `in_reply_to_status_id`: status identifier if the tweet is a reply to a specific status
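
To show how these fields can be read from the raw JSON, here is a minimal sketch. The function `summarize_tweet` and the constant `TWITTER_DATE_FORMAT` are hypothetical names, and `sample` is a hand-trimmed copy of the first status printed earlier, so the code runs without calling the API.

```python
from datetime import datetime

# Format of the `created_at` strings shown in the outputs above
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def summarize_tweet(tweet):
    """Pick a few fields out of a tweet's raw JSON dict."""
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id"],
        "created_at": datetime.strptime(tweet["created_at"], TWITTER_DATE_FORMAT),
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "mentions": [user["screen_name"] for user in entities.get("user_mentions", [])],
        # A missing or null in_reply_to_status_id means the tweet is not a reply
        "is_reply": tweet.get("in_reply_to_status_id") is not None,
    }

# Trimmed copy of the first status from the JSON output above
sample = {
    "created_at": "Fri Dec 08 21:49:07 +0000 2017",
    "id": 939250511668240385,
    "entities": {
        "hashtags": [{"text": "DeLaVillardiere", "indices": [14, 30]}],
        "user_mentions": [
            {"screen_name": "HUMASCOP"},
            {"screen_name": "InstitutIRIS"},
            {"screen_name": "IRIS_SUP_"},
        ],
    },
}

summary = summarize_tweet(sample)
print(summary["hashtags"], summary["mentions"], summary["created_at"].isoformat())
```

With a live client, the same function could be applied to `tweet._json` inside any of the `Cursor` loops in this notebook.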

List the 5 most recent tweets from another user:

In [15]:
# screen_name selects the account explicitly; items(5) stops the cursor after 5 tweets
for tweet in tweepy.Cursor(api.user_timeline, screen_name='BarackObama').items(5):
    print(tweet._json['text'])
    print(tweet._json['created_at'])

RT @ObamaFoundation: Watch: We hosted a Town Hall in New Delhi with @BarackObama and young leaders about how to drive change and make an im…
Mon Dec 04 22:57:47 +0000 2017
Michelle and I are delighted to congratulate Prince Harry and Meghan Markle on their engagement. We wish you a life… https://t.co/KC9nmjZPuX
Mon Nov 27 21:13:50 +0000 2017
From the Obama family to yours, we wish you a Happy Thanksgiving full of joy and gratitude. https://t.co/xAvSQwjQkz
Thu Nov 23 14:44:27 +0000 2017
ME:  Joe, about halfway through the speech, I’m gonna wish you a happy birth--
BIDEN:  IT’S MY BIRTHDAY!
ME:  Joe.… https://t.co/5qLUsDoaMi
Mon Nov 20 19:02:11 +0000 2017
RT @ObamaFoundation: Today, we honor those who have honored our country with its highest form of service. https://t.co/IbJNCwIofL https://t…
Sat Nov 11 15:13:46 +0000 2017


## Text pre-processing

Tokenize the text of each tweet with NLTK's `word_tokenize`:

In [17]:
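import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on NLTK's Punkt tokenizer data ('punkt'; newer NLTK releases
# name it 'punkt_tab'), which has to be fetched once with nltk.download(...)
for tweet in tweepy.Cursor(api.user_timeline, screen_name='BarackObama').items(5):
    tweet_text = tweet._json['text']
    print(word_tokenize(tweet_text))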

['RT', '@', 'ObamaFoundation', ':', 'Watch', ':', 'We', 'hosted', 'a', 'Town', 'Hall', 'in', 'New', 'Delhi', 'with', '@', 'BarackObama', 'and', 'young', 'leaders', 'about', 'how', 'to', 'drive', 'change', 'and', 'make', 'an', 'im…']
['Michelle', 'and', 'I', 'are', 'delighted', 'to', 'congratulate', 'Prince', 'Harry', 'and', 'Meghan', 'Markle', 'on', 'their', 'engagement', '.', 'We', 'wish', 'you', 'a', 'life…', 'https', ':', '//t.co/KC9nmjZPuX']
['From', 'the', 'Obama', 'family', 'to', 'yours', ',', 'we', 'wish', 'you', 'a', 'Happy', 'Thanksgiving', 'full', 'of', 'joy', 'and', 'gratitude', '.', 'https', ':', '//t.co/xAvSQwjQkz']
['ME', ':', 'Joe', ',', 'about', 'halfway', 'through', 'the', 'speech', ',', 'I', '’', 'm', 'gon', 'na', 'wish', 'you', 'a', 'happy', 'birth', '--', 'BIDEN', ':', 'IT', '’', 'S', 'MY', 'BIRTHDAY', '!', 'ME', ':', 'Joe.…', 'https', ':', '//t.co/5qLUsDoaMi']
['RT', '@', 'ObamaFoundation', ':', 'Today', ',', 'we', 'honor', 'those', 'who', 'have', 'honored', 'our',
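
The output above shows that `word_tokenize` splits Twitter-specific tokens: `@ObamaFoundation` becomes `'@', 'ObamaFoundation'`, and URLs are cut into `'https', ':', '//t.co/...'`. One possible workaround, sketched here with only the standard library, is a small regular-expression tokenizer that tries the Twitter-specific patterns first. The patterns and the names `TOKEN_PATTERNS`, `TOKEN_RE` and `tokenize` are illustrative assumptions, not NLTK's behaviour, and they are not exhaustive.

```python
import re

# Patterns are tried left to right, so the Twitter-specific ones come first
TOKEN_PATTERNS = [
    r"https?://\S+",         # URLs
    r"@\w+",                 # @-mentions
    r"#\w+",                 # hashtags
    r"\w+(?:['’\-]\w+)*",    # words, allowing inner apostrophes and hyphens
    r"\S",                   # any other single non-space character
]
TOKEN_RE = re.compile("|".join(TOKEN_PATTERNS))

def tokenize(text):
    """Split a tweet into tokens, keeping mentions, hashtags and URLs whole."""
    return TOKEN_RE.findall(text)

# One of the tweets printed earlier in the notebook
tweet_text = (
    "From the Obama family to yours, we wish you a Happy Thanksgiving "
    "full of joy and gratitude. https://t.co/xAvSQwjQkz"
)
print(tokenize(tweet_text))
```

The order of the alternatives matters: if the generic word pattern came first, a URL would again be split at `https`.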

In [18]:
# To be continued