## Gather Follower Descriptions
We start with a Twitter username (along with your Twitter credentials) and pull all the followers of that user. We then pull the descriptions of those followers and write them to a file for future mining. You can use more than one user in a group.

I wrote some functions for you in the file `twitter_functions.py`. That file needs to be in the same folder where you're running this notebook.

You may not have installed `tweepy` yet. If you get an error on the next cell, go to the command line (where you type `git` commands) and run something like `pip install tweepy`. There may be dependencies (like `unidecode`) that you need as well.
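
If you'd rather check before running the import cell, here is a minimal sketch that reports which packages Python can't find. The helper name `missing_packages` is my own invention for illustration; it only uses the standard library.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# `tweepy` and `unidecode` are the two packages mentioned above.
missing = missing_packages(["tweepy", "unidecode"])
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    print("All packages found.")
```

This only tells you whether a package is installed, not whether its version matches what `twitter_functions.py` expects.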

In [1]:
from datetime import datetime
import json
from pprint import pprint

from twitter_functions import * # these are the functions I wrote for you.

The next cell holds your Twitter authorization credentials. Then it calls a function that initializes your connection to Twitter. I've left my keys in there (slightly perturbed so that they won't actually work) so that you can see the form these take.

In [2]:
from twitter_config import *

#auth =  { "consumer_key": "xks2XTK4gr2PajPio1RBGWsYU",
#          "consumer_secret": "jkjkCjph2vx38uBVbVHLkhzesGVY6ZqEywXd3B0sDeSAWVcDNo",
#          "access_key": "33029025-i1Mm907o7BsKnufMIxjVzByKsuDEhOBb0yV3EAa1E",
#          "access_secret": "jkjkXqAijIwRQMZmW3b7AgpFXU6Ve0RU30fzsbzpfx9uf"
#        }

api = initialize_twitter(auth)

Now you set the handle (or handles) that represent one group or topic on Twitter. These should be in a list. The output file name (`ofile_name`) is built from today's date and the first element in the list. Feel free to modify it.
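
If you end up building file names for several groups, the naming rule could be wrapped in a small helper. This is just a sketch: `build_ofile_name` and its parameters are hypothetical and not part of `twitter_functions.py`. Passing the date in explicitly makes the result easy to check.

```python
from datetime import datetime

def build_ofile_name(handles, when=None, suffix="followers.txt"):
    """Build '<YYYYMMDD>_<first handle>_<suffix>' (hypothetical helper)."""
    if not handles:
        raise ValueError("handles must contain at least one screen name")
    when = when or datetime.today()
    # Only the first handle is used in the name, even if there are several.
    return "_".join([when.strftime("%Y%m%d"), handles[0], suffix])

example_name = build_ofile_name(["GeneralMills", "KraftBrand"], datetime(2019, 10, 14))
print(example_name)
```

With the date fixed to October 14, 2019, this gives `20191014_GeneralMills_followers.txt`.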

In [3]:
starting_user = ['GeneralMills','KraftBrand'] # my first group
#starting_user = ['michaelpollan'] # my second group

ofile_name = (datetime.today().strftime("%Y%m%d") + "_" + 
             starting_user[0] + "_" + # Just use the first starting_user for the name if there are multiple
             "followers.txt")

In [4]:
# We'll now look up the full information on your starting user(s).

# All records will be a dictionary with the Twitter ID as the key and
# a UserRecord as the value. This is a named tuple I created.
all_records = lookup_users_from_handles(api, starting_user)

# We need the IDs that we're getting followers from in a list.
# Iterating over a dictionary yields its keys, which are the IDs.
starting_user_id = list(all_records)

Start lookup_users_from_ids on 2 handles.
20191014-153408: looking up user records for 2 handles.
20191014-153408:  users pulled:  2
total failures: 0


In [5]:
all_records

defaultdict(twitter_functions.UserRecord,
            {280557152: UserRecord(id=280557152, id_str='280557152', name='General Mills', screen_name='GeneralMills', location='Minneapolis', followers_count=96999, friends_count=3437, favourites_count=7839, description='News/information from General Mills. Learn more at https://t.co/K7ale2Yhbq and https://t.co/brNL7A0kV1', geo_enabled=False, lang=None, statuses_count=18693, time_zone=None, created_at='Mon Apr 11 15:26:07 +0000 2011', verified=True, utc_offset=None, contributors_enabled=False, listed_count=911, protected=False, url='https://t.co/brNL7A0kV1'),
             908425516914528257: UserRecord(id=908425516914528257, id_str='908425516914528257', name='Kraft', screen_name='KraftBrand', location='', followers_count=3616, friends_count=163, favourites_count=1127, description='We make delicious food for families. #ForTheWinWin', geo_enabled=False, lang=None, statuses_count=3227, time_zone=None, created_at='Thu Sep 14 20:21:35 +0000 2017', 

In [6]:
# How long is it going to take us to pull these followers?
total_followers = 0
for id, rec in all_records.items() :
    total_followers += rec.followers_count
    
print("Ooh, {fol:,d} followers. A complete run with no limits is ".format(fol=total_followers) + 
      "going to take {min:.2f} minutes ({hour:.2f} hours)".format(min=total_followers/5000,
                                                                  hour=total_followers/(60*5000)))

Ooh, 100,615 followers. A complete run with no limits is going to take 20.12 minutes (0.34 hours)
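
The estimate above treats the limit as a steady 5,000 followers per minute. A slightly different way to look at it is to count the calls and the forced waits. The sketch below assumes the limits described in this notebook (15 calls per 15-minute window, up to 5,000 IDs per call) and that you start with an unused window; `estimate_pull` is a hypothetical helper, not something in `twitter_functions.py`.

```python
import math

def estimate_pull(follower_counts, page_size=5000, calls_per_window=15, window_minutes=15):
    """Return (calls needed, minutes spent waiting on rate-limit resets)."""
    # Each account is paged separately, so round up per account.
    # Even an account with no followers costs one call.
    calls = sum(max(1, math.ceil(count / page_size)) for count in follower_counts)
    # After each full window of calls we have to wait for a reset.
    waits = max(calls - 1, 0) // calls_per_window
    return calls, waits * window_minutes

calls, wait_minutes = estimate_pull([96999, 3616])
print(f"{calls} calls, about {wait_minutes} minutes waiting on rate limits")
```

For the two accounts here that works out to 21 calls and one 15-minute wait. If you've already used some calls in the current window, the wait will come sooner.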


In [7]:
# Now let's pull all the followers of our starting_user
# the function I wrote allows you to cap the number of followers you pull
# and uses the ID to generate the query.
# 
# Note that this pull is subject to rate limiting. You can make 15 calls per
# 15 minute window and each can return 5000 users. 
followers_of_starting = gather_followers(api,
                                         starting_user_id,
                                         follower_limit=None) # Modify this limit if you need to.
                                                              # Set it to None to get all followers.

# followers_of_starting will be a dictionary with each ID in starting_user_id as a key
# and the list of that user's follower IDs as the value.

Pulling followers for 280557152
Number pulled: 5000
Number pulled: 10000
Number pulled: 15000
Number pulled: 20000
Number pulled: 25000
Number pulled: 30000
Number pulled: 35000
Number pulled: 40000
Number pulled: 45000
Number pulled: 50000
Number pulled: 55000
Number pulled: 60000
Number pulled: 65000
Number pulled: 70000


Rate limit reached. Sleeping for: 896


Number pulled: 75000
Number pulled: 80000
Number pulled: 85000
Number pulled: 90000
Number pulled: 95000
Number pulled: 97000
Pulling followers for 908425516914528257
Number pulled: 3617
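
To give a feel for how a follower cap can work with paged results, here is a rough sketch. It is not the code inside `gather_followers`; the function `collect_with_limit` is hypothetical, and the "pages" are small made-up lists standing in for the pages of follower IDs that come back from Twitter.

```python
def collect_with_limit(pages, limit=None):
    """Gather IDs from an iterable of pages, stopping once `limit` is reached."""
    collected = []
    for page in pages:
        collected.extend(page)
        if limit is not None and len(collected) >= limit:
            # Stop asking for more pages and trim the overshoot.
            return collected[:limit]
    return collected

# Stand-in for the API: three pages of fake IDs.
fake_pages = [range(0, 5), range(5, 10), range(10, 12)]
capped = collect_with_limit(iter(fake_pages), limit=7)
everything = collect_with_limit(iter(fake_pages), limit=None)
print(len(capped), len(everything))
```

Stopping early matters because every extra page costs one of your rate-limited calls.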


In [8]:
# And now we'll "hydrate" these user records.
for start_id, list_of_followers in followers_of_starting.items() :
    
    # Using a set here instead of a list so that we pull each ID only once.
    ids_to_hydrate = {id for id in list_of_followers if id not in all_records} 
    
    these_records = lookup_users_from_ids(api,ids=ids_to_hydrate)

    for id, rec in these_records.items() :
        all_records[id] = rec


Start lookup_users_from_ids on 97000 IDs.
20191014-154915: looking up user records for 100 IDs.
20191014-154916: looking up user records for 100 IDs.
20191014-154917: looking up user records for 100 IDs.
20191014-154917: looking up user records for 100 IDs.
20191014-154918: looking up user records for 100 IDs.
20191014-154919: looking up user records for 100 IDs.
20191014-154920: looking up user records for 100 IDs.
20191014-154921: looking up user records for 100 IDs.
20191014-154921: looking up user records for 100 IDs.
20191014-154922: looking up user records for 100 IDs.
20191014-154923: looking up user records for 100 IDs.
20191014-154924: looking up user records for 100 IDs.
20191014-154924: looking up user records for 100 IDs.
20191014-154925: looking up user records for 100 IDs.
20191014-154926: looking up user records for 100 IDs.
20191014-154927: looking up user records for 100 IDs.
20191014-154927: looking up user records for 100 IDs.
20191014-154928: looking up user records

20191014-155110: looking up user records for 100 IDs.
20191014-155111: looking up user records for 100 IDs.
20191014-155112: looking up user records for 100 IDs.
20191014-155112: looking up user records for 100 IDs.
20191014-155113: looking up user records for 100 IDs.
20191014-155114: looking up user records for 100 IDs.
20191014-155115: looking up user records for 100 IDs.
20191014-155115: looking up user records for 100 IDs.
20191014-155116: looking up user records for 100 IDs.
20191014-155117: looking up user records for 100 IDs.
20191014-155117: looking up user records for 100 IDs.
20191014-155118: looking up user records for 100 IDs.
20191014-155119: looking up user records for 100 IDs.
20191014-155120: looking up user records for 100 IDs.
20191014-155120: looking up user records for 100 IDs.
20191014-155121: looking up user records for 100 IDs.
20191014-155122: looking up user records for 100 IDs.
20191014-155123: looking up user records for 100 IDs.
20191014-155123: looking up 

20191014-155306: looking up user records for 100 IDs.
20191014-155307: looking up user records for 100 IDs.
20191014-155308: looking up user records for 100 IDs.
20191014-155308: looking up user records for 100 IDs.
20191014-155309: looking up user records for 100 IDs.
20191014-155310: looking up user records for 100 IDs.
20191014-155311: looking up user records for 100 IDs.
20191014-155311: looking up user records for 100 IDs.
20191014-155312: looking up user records for 100 IDs.
20191014-155313: looking up user records for 100 IDs.
20191014-155314: looking up user records for 100 IDs.
20191014-155314: looking up user records for 100 IDs.
20191014-155315: looking up user records for 100 IDs.
20191014-155316: looking up user records for 100 IDs.
20191014-155317: looking up user records for 100 IDs.
20191014-155317: looking up user records for 100 IDs.
20191014-155318: looking up user records for 100 IDs.
20191014-155319: looking up user records for 100 IDs.
20191014-155320: looking up 

20191014-155502: looking up user records for 100 IDs.
20191014-155503: looking up user records for 100 IDs.
20191014-155504: looking up user records for 100 IDs.
20191014-155505: looking up user records for 100 IDs.
20191014-155505: looking up user records for 100 IDs.
20191014-155506: looking up user records for 100 IDs.
20191014-155507: looking up user records for 100 IDs.
20191014-155508: looking up user records for 100 IDs.
20191014-155509: looking up user records for 100 IDs.
20191014-155509: looking up user records for 100 IDs.
20191014-155510: looking up user records for 100 IDs.
20191014-155511: looking up user records for 100 IDs.
20191014-155512: looking up user records for 100 IDs.
20191014-155513: looking up user records for 100 IDs.
20191014-155513: looking up user records for 100 IDs.
20191014-155514: looking up user records for 100 IDs.
20191014-155515: looking up user records for 100 IDs.
20191014-155516: looking up user records for 100 IDs.
20191014-155516: looking up 

20191014-155700: looking up user records for 100 IDs.
20191014-155701: looking up user records for 100 IDs.
20191014-155701: looking up user records for 100 IDs.
20191014-155703: looking up user records for 100 IDs.
20191014-155703: looking up user records for 100 IDs.
20191014-155704: looking up user records for 100 IDs.
20191014-155705: looking up user records for 100 IDs.
20191014-155705: looking up user records for 100 IDs.
20191014-155706: looking up user records for 100 IDs.
20191014-155707: looking up user records for 100 IDs.
20191014-155708: looking up user records for 100 IDs.
20191014-155708: looking up user records for 100 IDs.
20191014-155709: looking up user records for 100 IDs.
20191014-155710: looking up user records for 100 IDs.
20191014-155711: looking up user records for 100 IDs.
20191014-155712: looking up user records for 100 IDs.
20191014-155712: looking up user records for 100 IDs.
20191014-155713: looking up user records for 100 IDs.
20191014-155714: looking up 

20191014-155857: looking up user records for 100 IDs.
20191014-155857: looking up user records for 100 IDs.
20191014-155858: looking up user records for 100 IDs.
20191014-155859: looking up user records for 100 IDs.
20191014-155900: looking up user records for 100 IDs.
20191014-155901: looking up user records for 100 IDs.
20191014-155902: looking up user records for 100 IDs.
20191014-155903: looking up user records for 100 IDs.
20191014-155904: looking up user records for 100 IDs.
20191014-155904: looking up user records for 100 IDs.
20191014-155905: looking up user records for 100 IDs.
20191014-155906: looking up user records for 100 IDs.
20191014-155907: looking up user records for 100 IDs.
20191014-155908: looking up user records for 100 IDs.
20191014-155908: looking up user records for 100 IDs.
20191014-155909: looking up user records for 100 IDs.
20191014-155910: looking up user records for 100 IDs.
20191014-155911: looking up user records for 100 IDs.
20191014-155911: looking up 

Rate limit reached. Sleeping for: 209


20191014-160046: looking up user records for 100 IDs.
20191014-160421: looking up user records for 100 IDs.
20191014-160422: looking up user records for 100 IDs.
20191014-160423: looking up user records for 100 IDs.
20191014-160424: looking up user records for 100 IDs.
20191014-160424: looking up user records for 100 IDs.
20191014-160425: looking up user records for 100 IDs.
20191014-160426: looking up user records for 100 IDs.
20191014-160427: looking up user records for 100 IDs.
20191014-160428: looking up user records for 100 IDs.
20191014-160428: looking up user records for 100 IDs.
20191014-160429: looking up user records for 100 IDs.
20191014-160430: looking up user records for 100 IDs.
20191014-160431: looking up user records for 100 IDs.
20191014-160432: looking up user records for 100 IDs.
20191014-160433: looking up user records for 100 IDs.
20191014-160433: looking up user records for 100 IDs.
20191014-160434: looking up user records for 100 IDs.
20191014-160435: looking up 
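
The log above shows the lookups going out 100 IDs at a time. Splitting a collection of IDs into groups like that could look something like the following sketch. The `batches` helper is hypothetical (I'm not showing the internals of `lookup_users_from_ids` here), and the IDs are made up.

```python
from itertools import islice

def batches(ids, size=100):
    """Yield successive lists of at most `size` IDs from any iterable."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

ids_to_hydrate = set(range(97000))  # made-up IDs, same count as the run above
n_batches = sum(1 for _ in batches(ids_to_hydrate))
print(n_batches)
```

With 97,000 IDs and 100 per batch, that's 970 lookups. Because `batches` works on any iterable, it's happy with the set we built in the cell above.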

In [9]:
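# Now let's write out all the records. I wrote some functions to help.
# Descriptions can contain non-ASCII characters, so we set the encoding explicitly.
with open(ofile_name, 'w', encoding='utf-8') as ofile :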
    write_user_rec_headers(ofile)
    for id, rec in all_records.items() :
        write_user_rec(ofile, rec)
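
If you're curious what writing named tuples to a delimited file can look like, here is a small sketch using the standard `csv` module. It is not the code behind `write_user_rec`; `MiniRecord` is a cut-down, hypothetical stand-in for `UserRecord`, and the record in it is made up. Descriptions are free text and can contain tabs and line breaks, which is why I let `csv` handle the quoting.

```python
import csv
import io
from collections import namedtuple

# Cut-down stand-in for UserRecord; the real one has many more fields.
MiniRecord = namedtuple("MiniRecord", ["id", "screen_name", "description"])

def write_records(ofile, records):
    """Write a header row and one tab-delimited row per record."""
    writer = csv.writer(ofile, delimiter="\t", lineterminator="\n")
    writer.writerow(MiniRecord._fields)
    for rec in records.values():
        writer.writerow(rec)

# A made-up record whose description contains a line break and a tab.
records = {1: MiniRecord(1, "example_user", "Line one\nline two\twith a tab")}

buffer = io.StringIO()  # in-memory file so the sketch doesn't touch disk
write_records(buffer, records)

# Reading it back recovers the description exactly.
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
print(rows[1][2] == records[1].description)
```

When you mine the descriptions later, reading the file with the same delimiter and the `csv` module keeps multi-line descriptions in one piece.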