# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

# Twitter API Access

Twitter uses OAuth 1.0A as its standard authentication mechanism. To make requests to Twitter's API with it, go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note for an OAuth 1.0A workflow: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (Click on Create Access Token to create those).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.

The first time you execute the notebook, fill in all four credentials so that they are saved to the `pkl` file. After that, you can remove the secret keys from the notebook, because they will be loaded from the `pkl` file.

The `pkl` file contains sensitive information that can be used to take control of your Twitter account. **Do not share it**.

In [1]:
import pickle
import os

In [2]:
if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    Twitter['Consumer Key'] = 'PJltWkpqGafHgwxHhEWOa3Yd6'
    Twitter['Consumer Secret'] = 'M8XjkaGHrQr0AP7iiXUf5Eq74lO5CNdRa0C86JOePlx84qevE9'
    Twitter['Access Token'] = '4569071-70RMr6SFOUqrVz76FQGtzd3K0tlXdbL5MJq74eWsId'
    Twitter['Access Token Secret'] = 'VwhthQYTSGxupS0ooEkL5T1oeBqYqPwO1UVMaBrmMi1YD'
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    # Use a context manager so the file is closed after loading
    with open('secret_twitter_credentials.pkl','rb') as f:
        Twitter = pickle.load(f)
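
Before using the loaded dictionary, it can help to confirm that all four identifiers listed above are present. The following is a minimal sketch; the helper name `missing_credentials` and the constant `REQUIRED_KEYS` are hypothetical and are not part of this notebook or of the `twitter` package.

```python
# Hypothetical helper: report which of the four OAuth 1.0A identifiers
# are missing or empty in a credentials dictionary.
REQUIRED_KEYS = ('Consumer Key', 'Consumer Secret',
                 'Access Token', 'Access Token Secret')

def missing_credentials(credentials):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not credentials.get(key)]

# Placeholder values for illustration only, not real credentials.
example = {'Consumer Key': 'xxx', 'Consumer Secret': 'yyy'}
print(missing_credentials(example))
```

An empty list means the dictionary is complete, so something like `assert not missing_credentials(Twitter)` could be run right after the cell above.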

Install the `twitter` package to interface with the Twitter API:

In [3]:
!pip install twitter

Collecting twitter
  Downloading twitter-1.17.1-py2.py3-none-any.whl (55kB)
Installing collected packages: twitter
Successfully installed twitter-1.17.1


## Example 1. Authorizing an application to access Twitter account data

In [4]:
import twitter

auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print(twitter_api)

<twitter.api.Twitter object at 0x000001DED5DE5208>


## Example 2. Retrieving trends

Twitter identifies locations using Yahoo! Where On Earth IDs (WOEIDs).

The WOEID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

You can also look at the BOSS PlaceFinder here: https://developer.yahoo.com/boss/placefinder/

In [5]:
WORLD_WOE_ID = 1
#US_WOE_ID = 23424977
COL_WOE_ID = 23424787

Look up the WOEID of a city, for example [San Diego](http://woeid.rosselliot.co.nz/lookup/san%20diego%20%20ca).

The cell below uses Bogotá; you can change it to another location.

In [6]:
LOCAL_WOE_ID=368148 #Bogotá, CO

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
col_trends = twitter_api.trends.place(_id=COL_WOE_ID)
local_trends = twitter_api.trends.place(_id=LOCAL_WOE_ID)

In [7]:
world_trends[:2]

[{'as_of': '2017-08-27T12:20:13Z',
  'created_at': '2017-08-27T12:13:37Z',
  'locations': [{'name': 'Worldwide', 'woeid': 1}],
  'trends': [{'name': '#BelgianGP',
    'promoted_content': None,
    'query': '%23BelgianGP',
    'tweet_volume': 61555,
    'url': 'http://twitter.com/search?q=%23BelgianGP'},
   {'name': '#ماذا_قدموا_مشايخ_الشمل_للمواطن',
    'promoted_content': None,
    'query': '%23%D9%85%D8%A7%D8%B0%D8%A7_%D9%82%D8%AF%D9%85%D9%88%D8%A7_%D9%85%D8%B4%D8%A7%D9%8A%D8%AE_%D8%A7%D9%84%D8%B4%D9%85%D9%84_%D9%84%D9%84%D9%85%D9%88%D8%A7%D8%B7%D9%86',
    'tweet_volume': 22318,
    'url': 'http://twitter.com/search?q=%23%D9%85%D8%A7%D8%B0%D8%A7_%D9%82%D8%AF%D9%85%D9%88%D8%A7_%D9%85%D8%B4%D8%A7%D9%8A%D8%AE_%D8%A7%D9%84%D8%B4%D9%85%D9%84_%D9%84%D9%84%D9%85%D9%88%D8%A7%D8%B7%D9%86'},
   {'name': '#FelizDomingo',
    'promoted_content': None,
    'query': '%23FelizDomingo',
    'tweet_volume': 14236,
    'url': 'http://twitter.com/search?q=%23FelizDomingo'},
   {'name': 'Tobe Hooper',


In [8]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])

<class 'twitter.api.TwitterListResponse'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': '#McGregor', 'url': 'http://twitter.com/search?q=%23McGregor', 'promoted_content': None, 'query': '%23McGregor', 'tweet_volume': 475711}, {'name': '#PeleaDelSiglo', 'url': 'http://twitter.com/search?q=%23PeleaDelSiglo', 'promoted_content': None, 'query': '%23PeleaDelSiglo', 'tweet_volume': None}, {'name': 'Manuel', 'url': 'http://twitter.com/search?q=Manuel', 'promoted_content': None, 'query': 'Manuel', 'tweet_volume': 1209997}, {'name': 'Uribe', 'url': 'http://twitter.com/search?q=Uribe', 'promoted_content': None, 'query': 'Uribe', 'tweet_volume': 16024}, {'name': '#FelizDomingo', 'url': 'http://twitter.com/search?q=%23FelizDomingo', 'promoted_content': None, 'query': '%23FelizDomingo', 'tweet_volume': 14236}, {'name': '#F1xFOX', 'url': 'http://twitter.com/search?q=%23F1xFOX', 'promoted_content': None, 'query': '%23F1xFOX', 'tweet_volume': None}, {'name': '#ICFES', 'url': 'http://twitter
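
As the output shows, `tweet_volume` is `None` for some trends, so sorting on it directly would fail when `None` is compared with an integer. The sketch below ranks trends by volume and treats a missing volume as 0, which is an assumption made here for illustration. The sample data is copied from the output above, keeping only two fields.

```python
# Sample trends copied from the output above (only two fields kept).
sample_trends = [
    {'name': '#McGregor', 'tweet_volume': 475711},
    {'name': '#PeleaDelSiglo', 'tweet_volume': None},
    {'name': 'Manuel', 'tweet_volume': 1209997},
    {'name': 'Uribe', 'tweet_volume': 16024},
]

# Treat a missing volume as 0 so that None is never compared with an int.
ranked = sorted(sample_trends,
                key=lambda trend: trend['tweet_volume'] or 0,
                reverse=True)

for trend in ranked:
    print(trend['name'], trend['tweet_volume'])
```

With live data, `sample_trends` could be replaced by `trends[0]['trends']`.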

## Example 3. Displaying API responses as pretty-printed JSON

In [10]:
import json

print((json.dumps(col_trends[:2], indent=2)))

[
  {
    "trends": [
      {
        "name": "#FelizDomingo",
        "url": "http://twitter.com/search?q=%23FelizDomingo",
        "promoted_content": null,
        "query": "%23FelizDomingo",
        "tweet_volume": 14236
      },
      {
        "name": "#F1xFOX",
        "url": "http://twitter.com/search?q=%23F1xFOX",
        "promoted_content": null,
        "query": "%23F1xFOX",
        "tweet_volume": null
      },
      {
        "name": "#ICFES",
        "url": "http://twitter.com/search?q=%23ICFES",
        "promoted_content": null,
        "query": "%23ICFES",
        "tweet_volume": null
      },
      {
        "name": "Primera Emisi\u00f3n",
        "url": "http://twitter.com/search?q=%22Primera+Emisi%C3%B3n%22",
        "promoted_content": null,
        "query": "%22Primera+Emisi%C3%B3n%22",
        "tweet_volume": null
      },
      {
        "name": "#MargaritaOrtega",
        "url": "http://twitter.com/search?q=%23MargaritaOrtega",
        "promoted_content": null,


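In the output above, non-ASCII characters appear as escape sequences (for example `Emisi\u00f3n`), because `json.dumps` escapes them by default. A minimal sketch of the difference when `ensure_ascii=False` is passed, using one trend name copied from the output:

```python
import json

# One trend copied from the output above (two fields kept).
trend = {"name": "Primera Emisión", "tweet_volume": None}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(trend, indent=2)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(trend, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the textual representation differs.
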
## Example 4. Computing the intersection of two sets of trends

In [12]:
trends_set = {}
trends_set['world'] = set([trend['name'] 
                        for trend in world_trends[0]['trends']])

trends_set['col'] = set([trend['name'] 
                     for trend in col_trends[0]['trends']]) 

trends_set['bogota'] = set([trend['name'] 
                     for trend in local_trends[0]['trends']]) 

In [15]:
for loc in ['world','col','bogota']:
    print(('-'*10,loc))
    print((','.join(trends_set[loc])))

('----------', 'world')
#CursoPoliticoPP,#NottingHillCarnival,#27Ago,#ShakespeareSunday,#وش_تتمني_عيديتك,#DomingoDetremuraSDV,3rd ODI,#DeshBachao,Tobe Hooper,#muzafferizgu,#バリバラ,#buitenhof,#ماذا_قدموا_مشايخ_الشمل_للمواطن,#FelizDomingo,#خميس_مشيط,Mick Schumacher,Enric González,#BANvAUS,#يلا_نردد_استغفرالله,One Love,#JLMMarseille,#feywil,Mignolet,#BelgianGP,#we_are_one,#HTCMania,#masterchefthailand,#dopa,#Cite4CoisasQueVcAma,Karius,#SundayMorning,#DejadHablarAGabriel,龍雲丸,#houstonflood,#F1xFOX,withB,#vvvaja,#MannKiBaat,#تقول_لمين_اتفوو,#FenerinMaçıVar,Mutlu Pazarlar,#رساله_بنت_يتيمه_تعور_القلب,La Diada del Terror,ブルゾンちえみ,鉄腕DASH,#BOCSGD,#CHEEVE,#NamimissKoYung,サライ,#AFLEaglesCrows
('----------', 'col')
#MargaritaOrtega,Quintero,Piedecuesta,#RadioReal,Manuel,Cuadrado,Chará,#ICFES,Ovelar,Juventus,#BatallaDeLosGallos,#FelizDomingo,Conor,#FelizSabado,Once Caldas,Primera Emisión,#PeleaDelSiglo,Chuck Norris,#McGregor,#F1xFOX,Egan Bernal,#DíaDeLaCrispeta,Tigres,#PeriodistasAsocajas,#ForoCDTurismo,

In [16]:
print(( '='*10,'intersection of world and colombia'))
print((trends_set['world'].intersection(trends_set['col'])))

print(('='*10,'intersection of bogota and colombia'))
print((trends_set['bogota'].intersection(trends_set['col'])))

{'#FelizDomingo', '#F1xFOX'}
{'#MargaritaOrtega', 'Quintero', 'Piedecuesta', '#RadioReal', 'Manuel', 'Cuadrado', 'Chará', '#ICFES', 'Ovelar', 'Juventus', '#BatallaDeLosGallos', '#FelizDomingo', 'Conor', '#FelizSabado', 'Once Caldas', 'Primera Emisión', '#PeleaDelSiglo', 'Chuck Norris', '#McGregor', '#F1xFOX', 'Egan Bernal', '#DíaDeLaCrispeta', 'Tigres', '#PeriodistasAsocajas', '#ForoCDTurismo', 'Demi Lovato', 'Venezuela y Brasil', 'Texas', '#PremiumXDIRECTV', 'Death Note'}
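
Python sets also support operators for these computations. The sketch below uses small subsets of the trend names shown above, chosen for illustration, to show `&` (intersection) and `-` (difference).

```python
# Small subsets of the trend names shown above, for illustration.
world = {'#BelgianGP', '#FelizDomingo', '#F1xFOX', 'Tobe Hooper'}
col = {'#FelizDomingo', '#F1xFOX', '#ICFES', 'Quintero'}

common = world & col        # same as world.intersection(col)
only_col = col - world      # trends in Colombia that are not worldwide

print(sorted(common))
print(sorted(only_col))
```

Sorting before printing gives a stable order, since sets themselves are unordered.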


## Example 5. Collecting search results

Set the variable `q` to a trending topic,
or anything else for that matter. The example query below
was a trending topic when this content was being developed
and is used throughout the remainder of this chapter.

In [62]:
q = '#ICFES' #q = topic 
number = 100

# See https://dev.twitter.com/docs/api/1.1/get/search/tweets

search_results = twitter_api.search.tweets(q=q, count=number)
statuses = search_results['statuses']
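
A single call returns at most `count` tweets. In version 1.1 of the search API, the response's `search_metadata` may include a `next_results` query string that describes the following page. The sketch below only shows how such a string could be turned into keyword arguments; the sample value is made up for illustration, and whether the field is present depends on the response.

```python
from urllib.parse import parse_qsl

# Hypothetical example of a 'next_results' value; a real one would come from
# search_results['search_metadata'].get('next_results').
next_results = '?max_id=901790000000000000&q=%23ICFES&count=100&include_entities=1'

# Drop the leading '?' and decode the query string into keyword arguments.
kwargs = dict(parse_qsl(next_results.lstrip('?')))
print(kwargs)

# With a live connection, the next page could then be requested with:
# search_results = twitter_api.search.tweets(**kwargs)
```

`parse_qsl` also decodes percent-encoded values, so `%23ICFES` becomes `#ICFES` again.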

In [63]:
len(statuses)
print(statuses)

[{'created_at': 'Sun Aug 27 13:06:28 +0000 2017', 'id': 901793034064588800, 'id_str': '901793034064588800', 'text': 'RT @tweet_latino: Forma de responder el #ICFES : \n\nA) A la de Dios\nB) Bendíceme Señor\nC) Si no sé, es la C\nD) D de Diosito\n\nAhí les dejo el…', 'truncated': False, 'entities': {'hashtags': [{'text': 'ICFES', 'indices': [40, 46]}], 'symbols': [], 'user_mentions': [{'screen_name': 'tweet_latino', 'name': 'TweetLatino™', 'id': 746152640241893376, 'id_str': '746152640241893376', 'indices': [3, 16]}], 'urls': []}, 'metadata': {'iso_language_code': 'es', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 334971138, 'id_str': '334971138', 'name': 'Angy', 'screen_name': 'AngyPaola_1D', 'location': 'Valledupar', 'description': 'Uni

Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [64]:
all_text = []
filtered_statuses = []
for s in statuses:
    if s["text"] not in all_text:
        filtered_statuses.append(s)
        all_text.append(s["text"])
statuses = filtered_statuses     

In [65]:
len(statuses)

79
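
For larger result lists, the same filtering can be done with a set, whose membership test does not scan every previously seen text. This is a sketch on a made-up miniature of `statuses`; the names `sample_statuses`, `seen_texts`, and `unique_statuses` are hypothetical.

```python
# Hypothetical miniature of `statuses`: only 'id' and 'text' are kept.
sample_statuses = [
    {'id': 1, 'text': 'RT @duenax: Llegó el día de #Saber11 #Icfes.'},
    {'id': 2, 'text': '#ICFES Existos!'},
    {'id': 3, 'text': 'RT @duenax: Llegó el día de #Saber11 #Icfes.'},
]

seen_texts = set()   # set membership is O(1) on average, unlike a list
unique_statuses = []
for status in sample_statuses:
    if status['text'] not in seen_texts:
        seen_texts.add(status['text'])
        unique_statuses.append(status)

print([status['id'] for status in unique_statuses])
```

As in the loop above, the first occurrence of each text is kept and the original order is preserved.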

In [66]:
[s['text'] for s in search_results['statuses']]

['RT @tweet_latino: Forma de responder el #ICFES : \n\nA) A la de Dios\nB) Bendíceme Señor\nC) Si no sé, es la C\nD) D de Diosito\n\nAhí les dejo el…',
 'RT @Santiiago_W: Mañana en el #ICFES\n-Si Andrea va a la luna y la luz solar pega en la tierra, ¿Que tipo de hongo alucinógeno se comió Manu…',
 '#ICFES Existos!',
 'RT @Mey_boni: Cuando tuve q hacer la prueba dl #icfes no hacían actividads antiestres,cómo yoga.que les pasa a los jóvenes d ahora. No pues…',
 'RT @guaritotapaazul: Mientras tanto a ésta hora en el #Icfes @ICFEScol https://t.co/4TxLUxeBNa',
 'RT @duenax: Llegó el día de #Saber11 #Icfes.  Espero que hayan podido descansar. Enfrenten la prueba con calma que hay tiempo suficiente. @…',
 'RT @duenax: Llegó el día de #Saber11 #Icfes.  Espero que hayan podido descansar. Enfrenten la prueba con calma que hay tiempo suficiente. @…',
 'Les deseamos lo mejor a todos los estudiantes que presentan las pruebas #ICFES en todo el país el día de hoy :) https://t.co/eJ5StdcyoB',
 'RT @gu

In [67]:
# Show one sample search result by indexing into the list...
print(json.dumps(statuses[0], indent=1))

{
 "created_at": "Sun Aug 27 13:06:28 +0000 2017",
 "id": 901793034064588800,
 "id_str": "901793034064588800",
 "text": "RT @tweet_latino: Forma de responder el #ICFES : \n\nA) A la de Dios\nB) Bend\u00edceme Se\u00f1or\nC) Si no s\u00e9, es la C\nD) D de Diosito\n\nAh\u00ed les dejo el\u2026",
 "truncated": false,
 "entities": {
  "hashtags": [
   {
    "text": "ICFES",
    "indices": [
     40,
     46
    ]
   }
  ],
  "symbols": [],
  "user_mentions": [
   {
    "screen_name": "tweet_latino",
    "name": "TweetLatino\u2122",
    "id": 746152640241893376,
    "id_str": "746152640241893376",
    "indices": [
     3,
     16
    ]
   }
  ],
  "urls": []
 },
 "metadata": {
  "iso_language_code": "es",
  "result_type": "recent"
 },
 "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
 "in_reply_to_status_id": null,
 "in_reply_to_status_id_str": null,
 "in_reply_to_user_id": null,
 "in_reply_to_user_id_str": null,
 "in_reply_to_screen_name"

In [68]:
# Pick one status by its index and assign it to the variable t.
t = statuses[10]
# Alternatively, select a status by id with a list comprehension, which
# yields a one-element list whose only item is accessed by index:
#[ status for status in statuses 
#          if status['id'] == 316948241264549888 ][0]

# Explore the variable t to get familiarized with the data structure...

print(t['retweet_count'])
print(t['retweeted'])
print(t['coordinates'])

0
False
None


## Example 6. Extracting text, screen names, and hashtags from tweets

In [69]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

In [70]:
# Explore the first 5 items for each...

print(json.dumps(status_texts[0:5], indent=1))
print(json.dumps(screen_names[0:5], indent=1)) 
print(json.dumps(hashtags[0:5], indent=1))
print(json.dumps(words[0:5], indent=1))

[
 "RT @tweet_latino: Forma de responder el #ICFES : \n\nA) A la de Dios\nB) Bend\u00edceme Se\u00f1or\nC) Si no s\u00e9, es la C\nD) D de Diosito\n\nAh\u00ed les dejo el\u2026",
 "RT @Santiiago_W: Ma\u00f1ana en el #ICFES\n-Si Andrea va a la luna y la luz solar pega en la tierra, \u00bfQue tipo de hongo alucin\u00f3geno se comi\u00f3 Manu\u2026",
 "#ICFES Existos!",
 "RT @Mey_boni: Cuando tuve q hacer la prueba dl #icfes no hac\u00edan actividads antiestres,c\u00f3mo yoga.que les pasa a los j\u00f3venes d ahora. No pues\u2026",
 "RT @guaritotapaazul: Mientras tanto a \u00e9sta hora en el #Icfes @ICFEScol https://t.co/4TxLUxeBNa"
]
[
 "tweet_latino",
 "Santiiago_W",
 "Mey_boni",
 "guaritotapaazul",
 "ICFEScol"
]
[
 "ICFES",
 "ICFES",
 "ICFES",
 "icfes",
 "Icfes"
]
[
 "RT",
 "@tweet_latino:",
 "Forma",
 "de",
 "responder"
]


## Example 7. Creating a basic frequency distribution from the words in tweets

In [71]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('#ICFES', 56), ('el', 40), ('que', 33), ('a', 30), ('de', 26), ('la', 22), ('en', 17), ('los', 17), ('RT', 14), ('para', 14)]

[('ICFEScol', 3), ('Santiiago_W', 2), ('duenax', 2), ('tweet_latino', 1), ('Mey_boni', 1), ('guaritotapaazul', 1), ('BarraganFCB', 1), ('USTA_COLOMBIA', 1), ('Pontifex_es', 1), ('SectorMovilidad', 1)]

[('ICFES', 63), ('icfes', 8), ('Icfes', 7), ('FelizDomingo', 5), ('PeñalosaTransmileniófilo', 5), ('Saber11', 3), ('margaritaOrtega', 2), ('CaminataSolidaridad', 1), ('PruebasSaber', 1), ('OrianaEnICFES', 1)]
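
The hashtag counts above list `ICFES`, `icfes`, and `Icfes` separately. If spellings that differ only by case should count as one hashtag, the counts can be merged after lowercasing. A minimal sketch, using counts taken from the output above:

```python
from collections import Counter

# Counts for four hashtags taken from the output above.
hashtag_counts = Counter({'ICFES': 63, 'icfes': 8, 'Icfes': 7, 'Saber11': 3})

# Merge spellings that differ only by case.
merged = Counter()
for tag, count in hashtag_counts.items():
    merged[tag.lower()] += count

print(merged.most_common(2))
```

With the raw list, the equivalent would be `Counter(h.lower() for h in hashtags)`.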



## Example 8. Create a prettyprint function to display tuples in a nice tabular format

In [72]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [73]:
for label, data in (('Word', words), 
                    ('Screen Name', screen_names), 
                    ('Hashtag', hashtags)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))


        Word         | Count 
****************************************
#ICFES               |     56
el                   |     40
que                  |     33
a                    |     30
de                   |     26
la                   |     22
en                   |     17
los                  |     17
RT                   |     14
para                 |     14

    Screen Name      | Count 
****************************************
ICFEScol             |      3
Santiiago_W          |      2
duenax               |      2
tweet_latino         |      1
Mey_boni             |      1
guaritotapaazul      |      1
BarraganFCB          |      1
USTA_COLOMBIA        |      1
Pontifex_es          |      1
SectorMovilidad      |      1

      Hashtag        | Count 
****************************************
ICFES                |     63
icfes                |      8
Icfes                |      7
FelizDomingo         |      5
PeñalosaTransmileniófilo |      5
Saber11              |      3


## Example 9. Finding the most popular retweets

In [74]:
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
                # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]


We can build another `prettyprint` function to print entire tweets with their retweet count.

We also want to split the text of the tweet into up to 3 lines, if needed.

In [75]:
row_template = "{:^7} | {:^15} | {:50}"
def prettyprint_tweets(list_of_tuples):
    print()
    print(row_template.format("Count", "Screen Name", "Text"))
    print("*"*80)
    for count, screen_name, text in list_of_tuples:
        print(row_template.format(count, screen_name, text[:50]))
        if len(text) > 50:
            print(row_template.format("", "", text[50:100]))
            if len(text) > 100:
                print(row_template.format("", "", text[100:]))
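
The nested `if` statements above hard-code the split points at 50 and 100 characters. The same splitting can be sketched as a small helper; the name `chunk_text` and its parameters are hypothetical. Like the code above, it leaves whatever remains in the last line.

```python
def chunk_text(text, width=50, max_lines=3):
    """Split text into at most max_lines pieces of `width` characters.

    The last piece keeps whatever remains, as in prettyprint_tweets.
    """
    chunks = [text[i:i + width] for i in range(0, len(text), width)]
    if len(chunks) > max_lines:
        # Fold the overflow back into the last allowed line.
        chunks = chunks[:max_lines - 1] + [''.join(chunks[max_lines - 1:])]
    return chunks

sample = 'x' * 120
print([len(piece) for piece in chunk_text(sample)])
```

One difference from the function above: for an empty text this helper returns an empty list, whereas `prettyprint_tweets` still prints one row.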

In [76]:
# Slice off the first 10 from the sorted results and display each item in the tuple

prettyprint_tweets(sorted(retweets, reverse=True)[:10])


 Count  |   Screen Name   | Text                                              
********************************************************************************
  66    |   andresfsoca   | RT @andresfsoca: -¿Y como te sientes para el #ICFE
        |                 | S? 😝 https://t.co/8vV6FlNDo8                      
  23    |  razeofficial   | RT @razeofficial: A dormir, que mañana #ICFES     
  18    |  tweet_latino   | RT @tweet_latino: Forma de responder el #ICFES : \
        |                 | \A) A la de Dios\B) Bendíceme Señor\C) Si no sé, e
        |                 | s la C\D) D de Diosito\\Ahí les dejo el…          
  15    |   BarraganFCB   | RT @BarraganFCB: Oh queridas pruebas SABER 11. Trá
        |                 | tenme bien. #ICFES https://t.co/MfAMeWWDOX        
  14    |   Santiiago_W   | RT @Santiiago_W: *Los demás en el #ICFES*\–Esto es
        |                 | tá fácil,me sirvió estudiar\\*Yo*\–¿Será la B de b
        |                 | endito seas o la C de