<h1><div align="center">Social Data Mining</div></h1>
<h2><div align="center">Lesson IV - Twitter with Tweepy</div></h2>
<div align="center">Bruno Gonçalves</div>
<div align="center"><a href="http://www.data4sci.com/">www.data4sci.com</a></div>
<div align="center">@bgoncalves, @data4sci</div>

In [1]:
import sys
import json
import numpy as np
import matplotlib.pyplot as plt
import tweepy

import watermark

%load_ext watermark
%matplotlib inline

Let's start by printing out the versions of the libraries we're using for future reference

In [2]:
%watermark -n -v -m -p numpy,tweepy,matplotlib

Thu Sep 05 2019 

CPython 3.7.3
IPython 6.2.1

numpy 1.16.2
tweepy 3.8.0
matplotlib 3.1.0

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 18.7.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit


The first step is to load up the account information. I recommend that you keep all your account credentials in a dictionary like this one to make it easier to switch between accounts. You can find this stub in **twitter_accounts_STUB.py**

In [3]:
accounts = {
    "social" : { 'api_key' : 'API_KEY',
                 'api_secret' : 'API_SECRET',
                 'token' : 'TOKEN',
                 'token_secret' : 'TOKEN_SECRET'
                },               
    }

All my credentials are listed in **twitter_accounts.py**. You can find a stub version of this file in the GitHub repository. Fill in your own credentials and then import them:

In [4]:
from twitter_accounts import accounts
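
If you would rather not keep secrets in a file at all, the same dictionary layout can be filled from environment variables. This is only a minimal sketch: the variable names (*TWITTER_API_KEY* and so on) and the helper *account_from_env* are hypothetical, not something the repository provides.

```python
import os

def account_from_env(prefix="TWITTER", environ=None):
    """Build one credentials dict from environment variables.

    The variable names (<prefix>_API_KEY, ...) are hypothetical; pick your own.
    """
    environ = os.environ if environ is None else environ
    fields = ("api_key", "api_secret", "token", "token_secret")
    missing = [f for f in fields if f"{prefix}_{f.upper()}" not in environ]
    if missing:
        raise KeyError(f"missing credentials: {missing}")
    return {f: environ[f"{prefix}_{f.upper()}"] for f in fields}

# Example with a fake environment, so no real secrets are needed
fake_env = {
    "TWITTER_API_KEY": "API_KEY",
    "TWITTER_API_SECRET": "API_SECRET",
    "TWITTER_TOKEN": "TOKEN",
    "TWITTER_TOKEN_SECRET": "TOKEN_SECRET",
}
accounts_from_env = {"social": account_from_env(environ=fake_env)}
print(accounts_from_env["social"]["api_key"])
```

The resulting dictionary has the same keys as the stub above, so the rest of the notebook works unchanged.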

You load the credentials for a specific account using its dictionary key:

In [5]:
app = accounts["bgoncalves"]

Which contains all the information you need to create an OAuth Handler

In [6]:
auth = tweepy.OAuthHandler(app["api_key"], app["api_secret"])
auth.set_access_token(app["token"], app["token_secret"])

That you can finally pass to the tweepy module

In [7]:
twitter_api = tweepy.API(auth)

This object will be your main way of interacting with the Twitter API.

# Searching Tweets

In [8]:
query = "instagram"
count = 200

To search for tweets matching a specific query we simply use the **search** method

In [9]:
statuses = twitter_api.search(q=query, count=count)

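The count parameter specifies the number of results we want. The standard search API returns at most 100 tweets per call, so larger values such as the 200 requested here are capped (the metadata printed below reports a count of 100). The **search** method returns a *SearchResults* object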

In [10]:
type(statuses)

tweepy.models.SearchResults

That contains a lot of metadata in addition to just the list of results

In [11]:
print("max_id:", statuses.max_id)
print("since_id:", statuses.since_id)
print("refresh_url:", statuses.refresh_url)
print("completed_in:", statuses.completed_in)
print("query:", statuses.query)
print("count:", statuses.count)
print("next_results:", statuses.next_results)

max_id: 1169562026936586239
since_id: 1169562053771554817
refresh_url: ?since_id=1169562053771554817&q=instagram&include_entities=1
completed_in: 0.122
query: instagram
count: 100
next_results: ?max_id=1169562026936586239&q=instagram&count=100&include_entities=1
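
The *refresh_url* and *next_results* fields are ordinary URL query strings, so the standard library can turn them into keyword arguments. A small sketch using the *next_results* value printed above (nothing here touches the network):

```python
from urllib import parse

# Value copied from the next_results field printed above
next_results = "?max_id=1169562026936586239&q=instagram&count=100&include_entities=1"

# Drop the leading "?" and split the rest into key/value pairs
args = dict(parse.parse_qsl(next_results[1:]))
print(args)
```

Note that every value comes back as a string, including *max_id* and *count*.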


We can also access the results as if the *SearchResults* was a list

In [13]:
tweet = statuses[0]

In [15]:
print(tweet.text)

RT @suhocompany522: ☁면스타 135번째 게시물☁ 

찬열아 선물 고맙다 사랑한다🥰

https://t.co/YenW61WBKC
#김준면 #수호 #SUHO #준면 #金俊勉 #スホ #EXO @weareoneEXO https://t.co/…


To request the next page of results we pass the *next_results* field to the next call to **search**

In [19]:
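from urllib import parse

# Print the current page of results
for tweet in statuses:
    print(tweet.text)

# next_results is a query string such as "?max_id=...&q=...";
# it is None when there are no more pages
next_results = getattr(statuses, "next_results", None)

if next_results is not None:
    # Drop the leading "?" and turn the query string into keyword arguments
    args = dict(parse.parse_qsl(next_results[1:]))
    statuses = twitter_api.search(**args)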

RT @suhocompany522: ☁면스타 135번째 게시물☁ 

찬열아 선물 고맙다 사랑한다🥰

https://t.co/YenW61WBKC
#김준면 #수호 #SUHO #준면 #金俊勉 #スホ #EXO @weareoneEXO https://t.co/…
RT @playboilauren: Instagram enferma a las personas
えっ北ちゃん😂騙されたよー😂
https://t.co/2CcHGYZbeo https://t.co/Z9m7jimO2c
RT @geminids_night: &lt;(ㅍ▽ㅍ💙✨

" 필승!! 일병 최민호
슈퍼스타 이태민 응원을 명 받았습니다!
이에 신고 합니다 필승!!!
막내 탬 보여줘라!
슈퍼엠 빌보드 대박내자!! 다 응원한다! "
https://t.co/CLhkg0rc7…
RT @LILITEAMTH327: [IG UPDATE]

🗓️ 05.09.2019

Shoot Film
Sis : lalalalisa_m
Camera : Lomo Lubitel 166+
Film : Portra 400 

📸 Tongsouthernf…
RT @syoungstagram: [INSTAGRAM] sooyoungchoi: 조만간🎁있을듯....#뭘까 https://t.co/qR5pldgBLr https://t.co/dLdTrMr0RS
#thursdaymood😎 https://t.co/utlKRycez8
RT @iaiastyx_: My piece for Rozenberry's draw this in your style on instagram! Hope you like it~💓

#art #draw #drawthisinyourstyle #dtiys #…
RT @taemin_comeback: 190905 탬스타
https://t.co/cpcHBCVSSp https://t.co/q13yijsXwd
RT @YoccieYoccie: うどんコレクションスタンプラリーの、分かりにくいと言われる所をまとめてみました。

#repost… https://t.co/BZ0M1S2nE

# Streaming

If instead we are interested in real-time results, we can use the *Streaming* API. We simply declare a Listener that overrides the on_status and on_error methods appropriately.

In [20]:
class StdOutListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)
        return True

    def on_error(self, status):
        print(status)
        
    def on_timeout(self):
        print('Timeout...', file=sys.stderr)
        return True

Instantiate the Listener

In [21]:
listen = StdOutListener()

And pass the listener and the OAuth object to the **Stream** class.

In [22]:
stream = tweepy.Stream(auth, listen)

This will return a stream object that we can (finally) use to track the Twitter stream for specific results

In [23]:
stream.filter(track=[query])

RT @Toey_Pongsk: Puppy love 🐾 
.
.
#smhhoundexplorer 
#smileyhoundmybestbuddy
#smileyhoundcrew https://t.co/DkgpFO6Fxo
RT @BonfirePictures: The City of Haarlem
Photos shot by Gabriel Guita (@gabrielguita_)
IG: https://t.co/QXg3utVoEq
#Netherlands https://t.c…
Please come support this event organised by my Mother !!! https://t.co/RiXZnr9q3D
RT @_brooklynsummer: Honor working under direction of Noah Dillon @noah__dillon and The Hellp https://t.co/tNslXwjW2G on their latest creat…
RT @EXOXOXOID: [SUHOSTAGRAM] 190905 kimjuncotton Instagram.

찬열아 선물 고맙다 사랑한다🥰
https://t.co/ygnU3mj3HJ
https://t.co/AvsiG2kTH8

#EXO #엑소 #Su…
RT @suhocompany522: ☁면스타스토리 UP☁ 

찬열이짱👍

https://t.co/r9RpHYMGEx
#김준면 #수호 #SUHO #준면 #金俊勉 #スホ #EXO @weareoneEXO https://t.co/h1HIgjvktw
RT @special1004: #sj https://t.co/QVm3lhu7hN
作ってます！！！！
今作ってますから！！！！

来週9日は草津までおじゃまンボ！！
お初ですわよ💗びわ湖市^ -… https://t.co/OZPOkXPPRI
Фикус Бенджамина Твилайт в ассортименте, 3 растения  в горшке,  цена 95  гривен! Бронь в лс автору фото #фикус… ht

Hola @radioenter Queremos escuchar a #CarlosRivera con su tema #TeEsperaba y dejar nuestro voto en el #RankingEnter… https://t.co/Wv3dfAIHUw
RT @rickbonadio: As palavras do Samuel faz a nossa equipe se encher de alegria e orgulho do nosso menino Sol e faz a gente continuar motiva…
RT @special1004: #newlogo https://t.co/XdxHRiikAj
"Life is not about who you once were, It's about who you are right now and the person you have the potential to be.… https://t.co/q0vob4DmKQ
RT @araki_hiro0614: 大阪二日目！
#幽遊白書
#桑原和真
#コエンマ https://t.co/O4pSmeMdQV
@City_Wonders I’m interested 
https://t.co/sZ9TAW1yYP
Evy@flourish-ious.com
https://t.co/qpKywvMcwB
RT @WWEXOL: [OFFICIAL] kimjuncotton Instagram Story Update #SUHO

"@.real__pcy Chanyeol-ie is the best 👍"

📸: https://t.co/CHL7AGBiWq

#EXO…
Manisa ovaları sergideki üzümlerle rengarenk oldu https://t.co/yCGfUVlqRv @habergundemim35 aracılığıyla https://t.co/XXjITYsAsx
RT @EXOXiuminTurkey: #OurPreciousXIUMIN 🦋

Super Junior Donghae'nin Minseok'u EXO konseri

KeyboardInterrupt: 

# User information

### Profile

Profile information is just an API call away

In [29]:
screen_name = 'neiltyson'

In [24]:
user = twitter_api.get_user(screen_name=screen_name)

In [25]:
print(user.screen_name, "has", user.followers_count, "followers and follows", user.friends_count, "other users")

neiltyson has 13446342 followers and follows 39 other users


### Friends

Requesting information on a user's friends is also simple

In [30]:
friends = twitter_api.friends(screen_name=screen_name, count=200)

And we can see that we retrieved all the friends

In [31]:
len(friends)

39

And their screen names are:

In [32]:
for i, friend in enumerate(friends):
    print(i, friend.screen_name)

0 PlayingwScience
1 peeweeherman
2 StarTalkRadio
3 DefenseIntel
4 DeptofDefense
5 USNavy
6 DARPA
7 republicofmath
8 Snowden
9 TheTweetOfGod
10 levarburton
11 BrannonBraga
12 GirlsAreGeeks
13 bug_gwen
14 billmaher
15 JimGaffigan
16 SarahKSilverman
17 WhoopiGoldberg
18 Burghound
19 algore
20 rickygervais
21 Pogue
22 SamHarrisOrg
23 chucknicecomic
24 JohnAllenPaulos
25 sciencecomedian
26 SethMacFarlane
27 JRichardGott
28 michaelshermer
29 ProfBrianCox
30 pzmyers
31 milesobrien
32 kevinmitnick
33 billprady
34 RichardDawkins
35 StephenAtHome
36 BillNye
37 BadAstronomer
38 elakdawalla


### Followers

Since we already saw that the number of followers is significantly larger, we use a **Cursor** to seamlessly paginate through all the results. For expediency, here we only request the first 100 results.

In [33]:
for i, follower in enumerate(tweepy.Cursor(twitter_api.followers, screen_name=screen_name).items(100)):
    print(i, follower.screen_name)

0 mia_liv_lloyd
1 GeraldKutney
2 a62460415
3 AndreiRdulescu1
4 jennydiem
5 leonidasfourlis
6 Karl28301338
7 Ah78705537
8 minnesotaah
9 ColinHamsher
10 jediknightbren
11 AlvinAra5
12 pedurietz
13 shamit0
14 JezzBrown3
15 marklaaron
16 7_egal
17 XennialRevolut1
18 ___hades_
19 experts_editing
20 thesssguy
21 GogoiAnubhov
22 LouisMougoue
23 sew5726
24 FormigaLarrossa
25 VanTayl20559739
26 mikesamcali
27 studyinIndia6
28 sahas_bhatia
29 DjBlessGh4
30 Mienie_jaco
31 TFG57983868
32 Agathiyan_R
33 SpataroSofia
34 XaferZaki
35 Murda_B76
36 TotoWipeout
37 Raahul01415208
38 ChristianBabi11
39 akaki454
40 NeoTristan3
41 sizzlenewy
42 Caliber9icloud1
43 maguazzer
44 Companion_dev
45 StacyPa61356962
46 ypqLoDSz2ryLU1v
47 uclmaps
48 cammmmeee
49 Sealove54631190
50 Draeh17
51 ml3584
52 Yvette9754
53 19kasra97
54 Pratik9k
55 ChaunceyJared
56 ErinlynneK
57 Michael39484540
58 VictorWWPK
59 AdelinaLipsa
60 chris_costner
61 krssly
62 Jonathan_Land
63 Jess71856117
64 hfmpinto
65 Mries13
66 OussamaRimouche
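
Under the hood, a Cursor keeps requesting pages and hands back the items one at a time, stopping as soon as the requested number of items is reached. The sketch below imitates that behaviour without calling Twitter: *fake_followers_page* and *iterate_items* are hypothetical helpers, and only the convention that a cursor of -1 asks for the first page and a next cursor of 0 marks the last one is taken from Twitter's cursoring scheme.

```python
def fake_followers_page(cursor, page_size=3):
    """Hypothetical stand-in for one followers API call.

    Returns (users, next_cursor); next_cursor == 0 marks the last page.
    """
    all_users = ["user%d" % i for i in range(8)]
    start = 0 if cursor == -1 else cursor
    end = start + page_size
    next_cursor = end if end < len(all_users) else 0
    return all_users[start:end], next_cursor

def iterate_items(fetch_page, limit=None):
    """Yield items one by one, requesting new pages only when needed."""
    cursor = -1  # -1 requests the first page
    yielded = 0
    while cursor != 0:
        users, cursor = fetch_page(cursor)
        for user in users:
            if limit is not None and yielded >= limit:
                return
            yield user
            yielded += 1

first_five = list(iterate_items(fake_followers_page, limit=5))
print(first_five)
```

Only two of the three available pages are requested here, which is the reason to cap the number of items when a user has millions of followers.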


### User timeline

The *user_timeline* method returns the tweets of a given user. As before, we can use a Cursor to iterate over all the tweets, but do keep in mind that Twitter limits our access to roughly the 3,200 most recent tweets

In [34]:
screen_name = "BarackObama"

tweets = []

for status in tweepy.Cursor(twitter_api.user_timeline, screen_name=screen_name).items():
    tweets.append(status.text)

print("Found", len(tweets), "tweets")

Found 3220 tweets


# Social Interactions

By following a user's timeline, we can see who they interact with and generate a social interaction graph. We define the edge direction to be the direction of information flow, so:
   - **retweet** - information flows from the author of the original tweet to the retweeter
   - **mention** - information flows from the author of the tweet to the one being mentioned (the code below uses replies, identified by *in_reply_to_screen_name*, as the simplest case of a mention)

In [35]:
for status in tweepy.Cursor(twitter_api.user_timeline, screen_name=screen_name).items(200):
    if hasattr(status, 'retweeted_status'):
        print(status.retweeted_status.author.screen_name, '->', screen_name)
    elif status.in_reply_to_screen_name is not None:
        print(screen_name, '->', status.in_reply_to_screen_name)

BarackObama -> BarackObama
ObamaFoundation -> BarackObama
BarackObama -> BarackObama
MichelleObama -> BarackObama
MichelleObama -> BarackObama
BarackObama -> BarackObama
ObamaFoundation -> BarackObama
ObamaFoundation -> BarackObama
ObamaFoundation -> BarackObama
MichelleObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
ObamaFoundation -> BarackObama
MichelleObama -> BarackObama
MBK_Alliance -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
ObamaFoundation -> BarackObama
GetUSCovered -> BarackObama
ObamaFoundation -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
BarackObama -> BarackObama
nowthisnews -> Ba

For convenience, here we chose to just list out all the edges in the order in which they appear. This information could naturally have been used to define a *NetworkX* graph for further analysis.
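
As a rough sketch of that next step, the same edge rules can be applied to tweets in raw JSON form and the repeated edges collapsed into weights. The sample tweets below are hand-made for illustration, and *interaction_edges* is a hypothetical helper, not part of tweepy.

```python
from collections import Counter

def interaction_edges(status, screen_name):
    """Return the (source, target) edges implied by one tweet given as a dict."""
    if "retweeted_status" in status:
        # Retweet: information flows from the original author to the retweeter
        original_author = status["retweeted_status"]["user"]["screen_name"]
        return [(original_author, screen_name)]
    if status.get("in_reply_to_screen_name") is not None:
        # Reply: information flows from the author to the user being answered
        return [(screen_name, status["in_reply_to_screen_name"])]
    return []

# Hand-made sample tweets, not real API output
sample_statuses = [
    {"retweeted_status": {"user": {"screen_name": "ObamaFoundation"}}},
    {"in_reply_to_screen_name": "BarackObama"},
    {"retweeted_status": {"user": {"screen_name": "ObamaFoundation"}}},
    {"in_reply_to_screen_name": None},
]

weights = Counter()
for status in sample_statuses:
    weights.update(interaction_edges(status, "BarackObama"))

# (source, target, weight) triples, the shape expected by
# networkx.DiGraph.add_weighted_edges_from
weighted_edges = [(src, dst, w) for (src, dst), w in weights.items()]
print(weighted_edges)
```

Counting repeated edges keeps the graph small while preserving how often each interaction occurred.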

# Geolocated data

Here we demonstrate how to search for tweets containing geolocation information. We also take the opportunity to illustrate a more sophisticated StreamListener implementation

In [40]:
class FileOutListener(tweepy.StreamListener):
    def __init__(self, fp = None):
        super().__init__()
        self.tweet_count = 0
        if fp is not None:
            self.fp = fp
        else:
            self.fp = open("tweets.json", "wt")

    def on_data(self, data):
        # With on_data (instead of on_status), tweets are returned as raw JSON strings
        # that we parse ourselves to extract the information we require
        status = json.loads(data)

        # The stream also carries non-tweet messages (e.g. limit notices) without an "id"
        if "id" not in status:
            return True

        self.tweet_count += 1
        print(self.tweet_count, status["id"])
        print(data.strip(), file=self.fp)
        return True

    def on_error(self, status):
        print(status)
    
    def on_timeout(self):
        print('Timeout...', file=sys.stderr)
        return True

Our bounding box will be NYC

In [41]:
bb = [-74,40,-73,41]  # NYC
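
The four numbers are the longitude and latitude of the south-west corner followed by the longitude and latitude of the north-east corner. A quick sketch to make the ordering concrete; *in_bounding_box* is a hypothetical helper and the two test points are just illustrative coordinates:

```python
def in_bounding_box(lon, lat, bb):
    """Check whether a (longitude, latitude) point falls inside a bounding box."""
    west, south, east, north = bb
    return west <= lon <= east and south <= lat <= north

bb = [-74, 40, -73, 41]  # NYC: [west, south, east, north]

print(in_bounding_box(-73.9857, 40.7484, bb))  # Empire State Building -> True
print(in_bounding_box(-0.1276, 51.5072, bb))   # central London -> False
```

Keep in mind that the stream can also return tweets that carry only a tagged place rather than an exact point, so a check like this applies only to tweets with coordinates.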

And we will save the raw json from the tweets we obtain in a text file

In [42]:
with open("NYC.json", "wt") as fp:
    listener = FileOutListener(fp)

    stream = tweepy.Stream(auth, listener)
    stream.filter(locations=bb)

1 1169565200242106373
2 1169565201986985984
3 1169565202788077569
4 1169565203660492800
5 1169565217833050112
6 1169565220869722112
7 1169565232563441664
8 1169565238305460224
9 1169565254126362624
10 1169565256139624448
11 1169565257964150784
12 1169565264003903488
13 1169565268223373313
14 1169565271251664896
15 1169565278084173825
16 1169565286187569152
17 1169565302365065216
18 1169565356777713664
19 1169565357910253570
20 1169565361001381889
21 1169565387110981632
22 1169565394576781312
23 1169565399635111938
24 1169565404416622592
25 1169565423366529025
26 1169565438000476160
27 1169565441397796864
28 1169565441448038400
29 1169565457382330368
30 1169565466009985024
31 1169565473735921664
32 1169565491045838849
33 1169565494157987840
34 1169565494598389760
35 1169565498083872774
36 1169565499199541249
37 1169565498411036673
38 1169565506464079872
39 1169565507768508416
40 1169565514840064000
41 1169565519109873665
42 1169565538361720832
43 1169565541603917825
44 11695655582637465

KeyboardInterrupt: 
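
Once the stream is stopped, the file holds one JSON document per line and can be read back line by line. The sketch below uses a small hand-made file instead of the real **NYC.json**, and *read_points* is a hypothetical helper. It relies only on the *id* and *coordinates* fields of the tweet JSON, where *coordinates* is either null or a GeoJSON point in longitude, latitude order.

```python
import json
import os
import tempfile

def read_points(path):
    """Return (tweet id, longitude, latitude) for every tweet with exact coordinates."""
    points = []
    with open(path, "rt") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            try:
                status = json.loads(line)
            except json.JSONDecodeError:
                # Skip a line left incomplete when the stream was interrupted
                continue
            coords = status.get("coordinates")
            if coords is not None:
                lon, lat = coords["coordinates"]
                points.append((status["id"], lon, lat))
    return points

# Hand-made sample records, not real tweets
sample_lines = [
    '{"id": 1, "coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]}}',
    '{"id": 2, "coordinates": null}',
    '{"id": 3, "coordi',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "wt") as fp:
        fp.write("\n".join(sample_lines) + "\n")
    points = read_points(path)

print(points)
```

Tweets without an exact point are dropped here; depending on the analysis, their tagged place could be used instead.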