# SC207 - Session 6
# APIs - Gathering Twitter Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">


- API = Application Programming Interface
- A standardised way to retrieve data from platforms.
- Many platforms have an API and they all work relatively similarly
- Today we will use the package `tweepy` to retrieve data from the Twitter API

[Tweepy Documentation](http://docs.tweepy.org/en/stable/)

### Imports

Today we will be using Tweepy and Pandas to retrieve, store and explore data.

In [1]:
import tweepy
import pandas as pd

# These two helper functions are here just to make the class go smoothly!
# Each returns the first matching tweet, or None if there is no match.
def find_first_retweet(list_of_tweets):
    for tweet in list_of_tweets:
        if 'retweeted_status' in tweet._json:
            return tweet
        
def find_first_regular_tweet(list_of_tweets):
    for tweet in list_of_tweets:
        if 'retweeted_status' not in tweet._json:
            return tweet

# 1. Authorising and Connecting the API
`Tweepy` streamlines this process into three simple stages.

### a) Identify your Access Tokens
APIs require authorisation tokens to identify who is using the API and to manage API usage by a single account holder.
- Go to https://developer.twitter.com
- Sign in with your Twitter account details
- You may have to navigate back to the Twitter developer page if you get redirected to normal Twitter.
- Once signed in use the drop down menu at the top right and select 'Apps'
- Create a new app (follow along in class with the details)
- Once created, go to the keys and tokens tab
- Copy and paste your Consumer Key and your Consumer secret into the variables below.


In [2]:
CONSUMER_KEY = 'YOUR_CONSUMER_KEY'        # paste your own Consumer Key here
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'  # paste your own Consumer Secret here, and never share real keys
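
If you would rather not keep keys inside the notebook at all, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` and the helper `load_twitter_keys` are hypothetical, not something Tweepy or Twitter defines.

```python
import os

def load_twitter_keys(environ=os.environ):
    """Read the consumer key and secret from environment variables.

    The variable names are hypothetical; use whatever names you set yourself.
    """
    key = environ.get('TWITTER_CONSUMER_KEY')
    secret = environ.get('TWITTER_CONSUMER_SECRET')
    if not key or not secret:
        raise RuntimeError('Set TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET first.')
    return key, secret

# Usage, once the variables are set in your environment:
# CONSUMER_KEY, CONSUMER_SECRET = load_twitter_keys()
```

This keeps the keys out of any notebook you share or upload.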

### b) Create an Authorisation Object
We create a special authorisation handler to store our keys.

In [3]:
auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

### c) Connect the API
We create a new `API` object and feed it our authorisation handler.

We also set two additional arguments...
- `wait_on_rate_limit` sets the API to wait if you have maxed out your number of queries, and then resume when the limit is lifted
- `wait_on_rate_limit_notify` ensures Tweepy informs you of the wait occurring. (This is a Tweepy 3.x argument; it was removed in Tweepy 4.)

In [4]:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# 2. Gathering Data - Search
Search is one of the simpler ways you can interact with the API.
- Search returns a list of tweet objects matching your query
- Every request returns up to 100 tweets
- You can make 450 requests in a 15 minute window.
- A maximum of 45,000 tweets every 15 minutes.
- Each request counts against your quota, no matter how many Tweets it returns.
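
The arithmetic behind these limits can be sketched in a few lines. This is a best-case estimate that assumes every request comes back full (100 tweets); the function name `estimate_collection` is made up for illustration.

```python
import math

TWEETS_PER_REQUEST = 100      # maximum value of `count`
REQUESTS_PER_WINDOW = 450     # search limit quoted above
WINDOW_MINUTES = 15

def estimate_collection(n_tweets):
    """Return (requests, windows, minutes spent waiting) for n_tweets, best case."""
    requests = math.ceil(n_tweets / TWEETS_PER_REQUEST)
    windows = math.ceil(requests / REQUESTS_PER_WINDOW)
    # Only the windows after the first one involve waiting for the limit to reset.
    wait_minutes = max(windows - 1, 0) * WINDOW_MINUTES
    return requests, windows, wait_minutes

print(TWEETS_PER_REQUEST * REQUESTS_PER_WINDOW)  # 45000
print(estimate_collection(100_000))              # (1000, 3, 30)
```

In practice requests often return fewer than 100 tweets, so real collections usually need more requests than this estimate.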

### What you receive
It is important to be clear what Twitter is providing you when you ask for data.
>The Twitter's standard search API (search/tweets) allows simple queries against the indices of recent or popular Tweets and behaves similarly to, but not exactly like the Search UI feature available in Twitter mobile or web clients. The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days. Before digging in, it’s important to know that the standard search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results.
[Twitter API Documentation: Standard Search](https://developer.twitter.com/en/docs/tweets/search/overview/standard)

- Already sampled based on 'relevance'.
- Max. 7 days old.
- NOT complete.

### Making a Single Request
Let's make a single request for something that will have a lot of results.

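- Tweepy has a range of 'arguments' built in to the search function (`api.search` in Tweepy 3.x, which is what we use here; it was renamed `api.search_tweets` in Tweepy 4).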
- `q=` query: a string to search for. You can also use [operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators) to make complex queries.
- `result_type=`: set this to either `mixed`, `recent` or `popular`. Note again that Twitter is to some extent pre-sampling for you.
    - `popular` - prioritise popular tweets
    - `recent`  - prioritise recent tweets
    - `mixed` **DEFAULT** - include both popular and recent
- `tweet_mode=` Set this to 'extended' to ensure you get the full text of a tweet, otherwise it will be cut off after 140 characters (Tweets can now be 280 characters). If your project doesn't care about tweet text then don't bother including the argument.
- `count=` max tweets per request. Defaults to 15, can be set up to 100.
- `since_id=` Each tweet has a unique ID. If you provide a tweet's ID number here, it will only return tweets posted AFTER that tweet was posted.
- `max_id=` As above, but limits the API to returning tweets posted AT OR BEFORE the tweet provided (IDs less than or equal to the one you give).

[You can view all the argument options in the Tweepy Documentation](http://docs.tweepy.org/en/latest/api.html#search-methods)
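
To make `since_id` and `max_id` concrete, here is a toy model that filters a plain list of made-up IDs. It makes no API call; `filter_by_id` is a hypothetical helper. It assumes, following Twitter's documentation, that `since_id` is exclusive and `max_id` is inclusive, and that larger IDs belong to more recent tweets.

```python
def filter_by_id(tweet_ids, since_id=None, max_id=None):
    """Mimic how since_id / max_id narrow a set of tweet IDs (toy model, no API call)."""
    kept = []
    for tweet_id in tweet_ids:
        if since_id is not None and tweet_id <= since_id:
            continue  # posted at or before the since_id tweet
        if max_id is not None and tweet_id > max_id:
            continue  # posted after the max_id tweet
        kept.append(tweet_id)
    return kept

toy_ids = [105, 104, 103, 102, 101]          # newest first
print(filter_by_id(toy_ids, since_id=102))   # [105, 104, 103]
print(filter_by_id(toy_ids, max_id=103))     # [103, 102, 101]
```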

In [5]:
single_response = api.search(q='brexit',tweet_mode='extended',result_type='mixed', count=100)

In [6]:
# let's check the number of results we got
len(single_response)

100

In [7]:
# let's examine just one tweet object

single_tweet = single_response[0]

single_tweet

Status(_api=<tweepy.api.API object at 0x7fdef0847e90>, _json={'created_at': 'Sun Nov 22 14:01:44 +0000 2020', 'id': 1330511787712794624, 'id_str': '1330511787712794624', 'full_text': 'It\'s so ****ed up that the government whose record on Covid is the highest excess death rate in Europe and the worst economic recession in the G7, is using the damage THEY CAUSED to cover up the damage from Brexit, WHICH THEY ALSO CAUSED.\n#Marr "Rishi Sunak" https://t.co/3ySAXLXyRU', 'truncated': False, 'display_text_range': [0, 258], 'entities': {'hashtags': [{'text': 'Marr', 'indices': [239, 244]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1330508816669171715, 'id_str': '1330508816669171715', 'indices': [259, 282], 'media_url': 'http://pbs.twimg.com/media/Enbxy7tW4Ak8BE5.jpg', 'media_url_https': 'https://pbs.twimg.com/media/Enbxy7tW4Ak8BE5.jpg', 'url': 'https://t.co/3ySAXLXyRU', 'display_url': 'pic.twitter.com/3ySAXLXyRU', 'expanded_url': 'https://twitter.com/Femi_Sorry/status/

In [8]:
# You get a LOT of data in one single Tweet of a single response, but it's also a bit unwieldy. 
# Luckily we can access a nice structured version of this with the ._json attribute attached to each tweet object

# we'll use the ._json attribute in later sessions...
single_tweet._json

{'created_at': 'Sun Nov 22 14:01:44 +0000 2020',
 'id': 1330511787712794624,
 'id_str': '1330511787712794624',
 'full_text': 'It\'s so ****ed up that the government whose record on Covid is the highest excess death rate in Europe and the worst economic recession in the G7, is using the damage THEY CAUSED to cover up the damage from Brexit, WHICH THEY ALSO CAUSED.\n#Marr "Rishi Sunak" https://t.co/3ySAXLXyRU',
 'truncated': False,
 'display_text_range': [0, 258],
 'entities': {'hashtags': [{'text': 'Marr', 'indices': [239, 244]}],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 1330508816669171715,
    'id_str': '1330508816669171715',
    'indices': [259, 282],
    'media_url': 'http://pbs.twimg.com/media/Enbxy7tW4Ak8BE5.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/Enbxy7tW4Ak8BE5.jpg',
    'url': 'https://t.co/3ySAXLXyRU',
    'display_url': 'pic.twitter.com/3ySAXLXyRU',
    'expanded_url': 'https://twitter.com/Femi_Sorry/status/1330511787712794624

### Types of Data in a single Tweet object
Tweets from the API contain data such as...
- Time posted
- The text of the tweet
- Full details on the User who posted.
- Details of any media embedded in the tweet
- Details of any hashtags, user mentions and URLs
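
As a rough illustration of how these pieces are nested, the sketch below builds a small dictionary shaped like the tweet data shown above and pulls a few values out of it. The values in `toy_tweet` are made up; only the key names come from the real output.

```python
from datetime import datetime

# A cut-down, made-up tweet using the same key names as the real data.
toy_tweet = {
    'created_at': 'Sun Nov 22 14:01:44 +0000 2020',
    'full_text': 'An example tweet #Marr',
    'user': {'screen_name': 'example_user', 'followers_count': 10},
    'entities': {
        'hashtags': [{'text': 'Marr', 'indices': [17, 22]}],
        'user_mentions': [],
        'urls': [],
    },
}

# Time posted: Twitter's timestamp string parsed into a datetime.
posted = datetime.strptime(toy_tweet['created_at'], '%a %b %d %H:%M:%S %z %Y')

# Hashtags and mentions are lists of small dictionaries inside 'entities'.
hashtags = [tag['text'] for tag in toy_tweet['entities']['hashtags']]
mentions = [m['screen_name'] for m in toy_tweet['entities']['user_mentions']]

print(posted.year, toy_tweet['user']['screen_name'], hashtags, mentions)
```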

In [9]:
# If we check the type of our single_tweet we can see it is a tweepy Status object.
# When Tweepy received the response from Twitter, it wrapped it up into a useful object for us.
type(single_tweet)

tweepy.models.Status

In [10]:
# You can access any of these items individually as they are set as attributes of the Status class...

print(single_tweet.full_text)
print(single_tweet.source)
print(single_tweet.retweet_count)

It's so ****ed up that the government whose record on Covid is the highest excess death rate in Europe and the worst economic recession in the G7, is using the damage THEY CAUSED to cover up the damage from Brexit, WHICH THEY ALSO CAUSED.
#Marr "Rishi Sunak" https://t.co/3ySAXLXyRU
Twitter Media Studio
1333


In [11]:
# a clean way to see all the relevant attributes is to ask for the json keys...
single_tweet._json.keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])

In [14]:
# You can also use Jupyter to help you by using the code completion suggestions
# type single_tweet. and then hit Tab on your keyboard to see your options.

# single_tweet.

In [15]:
# The values of some items will themselves be other objects, with their own attributes...

type(single_tweet.user)

tweepy.models.User

In [16]:
single_tweet.user

User(_api=<tweepy.api.API object at 0x7fdef0847e90>, _json={'id': 234694571, 'id_str': '234694571', 'name': 'Femi😷', 'screen_name': 'Femi_Sorry', 'location': 'Solihull, England', 'description': 'Femi Oluwole\n🇬🇧Law grad "Do-Gooder"\nCalling out #Brexit and Boris & Farage\'s lies & lack of humanity in @Independent/@theIpaper/@LondonEconomic\nEx @OFOCbrexit', 'url': 'https://t.co/3qVH7HQFJk', 'entities': {'url': {'urls': [{'url': 'https://t.co/3qVH7HQFJk', 'expanded_url': 'http://Instagram.com/femi_sorry', 'display_url': 'Instagram.com/femi_sorry', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 282591, 'friends_count': 1598, 'listed_count': 1087, 'created_at': 'Thu Jan 06 09:50:22 +0000 2011', 'favourites_count': 97575, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 71859, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_col

In [17]:
single_tweet.user._json

{'id': 234694571,
 'id_str': '234694571',
 'name': 'Femi😷',
 'screen_name': 'Femi_Sorry',
 'location': 'Solihull, England',
 'description': 'Femi Oluwole\n🇬🇧Law grad "Do-Gooder"\nCalling out #Brexit and Boris & Farage\'s lies & lack of humanity in @Independent/@theIpaper/@LondonEconomic\nEx @OFOCbrexit',
 'url': 'https://t.co/3qVH7HQFJk',
 'entities': {'url': {'urls': [{'url': 'https://t.co/3qVH7HQFJk',
     'expanded_url': 'http://Instagram.com/femi_sorry',
     'display_url': 'Instagram.com/femi_sorry',
     'indices': [0, 23]}]},
  'description': {'urls': []}},
 'protected': False,
 'followers_count': 282591,
 'friends_count': 1598,
 'listed_count': 1087,
 'created_at': 'Thu Jan 06 09:50:22 +0000 2011',
 'favourites_count': 97575,
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': True,
 'verified': True,
 'statuses_count': 71859,
 'lang': None,
 'contributors_enabled': False,
 'is_translator': False,
 'is_translation_enabled': True,
 'profile_background_color': 'C0DEED',
 'pr

In [18]:
# We can access these subvalues by just chaining our attribute requests

single_tweet.user.screen_name

'Femi_Sorry'

If a tweet is a retweet, it will also contain another tweet object with all the information on the original tweet.

In [19]:
# let's make sure we are all looking at a retweet by using our handy function - this may be the tweet you were looking at already!

single_tweet_with_RT = find_first_retweet(single_response)
print(single_tweet_with_RT)

Status(_api=<tweepy.api.API object at 0x7fdef0847e90>, _json={'created_at': 'Mon Nov 23 13:31:55 +0000 2020', 'id': 1330866671498747904, 'id_str': '1330866671498747904', 'full_text': 'RT @liz_langfield: Right. So was Covid to blame for tearing up the countryside in Kent for the huge lorry parks?  Is Covid to blame for all…', 'truncated': False, 'display_text_range': [0, 140], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'liz_langfield', 'name': 'Liz Langfield 3.5% #FBPE🐝', 'id': 826465702156627968, 'id_str': '826465702156627968', 'indices': [3, 17]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 4332062969, 'id_str': '4332062969', 'name': 'Neil 🏴\U000e0067\U000e00

In [20]:
#.... and therefore we can also access the details of that tweet

print(single_tweet_with_RT.created_at)
print(single_tweet_with_RT.full_text)
print(single_tweet_with_RT.user.screen_name)

print('*'*100)

print(single_tweet_with_RT.retweeted_status.created_at)
print(single_tweet_with_RT.retweeted_status.full_text)
print(single_tweet_with_RT.retweeted_status.user.screen_name)



2020-11-23 13:31:55
RT @liz_langfield: Right. So was Covid to blame for tearing up the countryside in Kent for the huge lorry parks?  Is Covid to blame for all…
devchem123
****************************************************************************************************
2020-11-22 23:31:49
Right. So was Covid to blame for tearing up the countryside in Kent for the huge lorry parks?  Is Covid to blame for all the extra “paperwork” needed to export/import goods?  Is Covid to blame for all those financial institutions moving out of London?   NO. NO. NO.  IT’S BREXIT.
liz_langfield
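
Notice that the retweet's own `full_text` is cut off with "…", while the original tweet inside `retweeted_status` has the complete text. A minimal sketch of a helper that always returns the complete text, working on the dictionary form of a tweet, might look like this. The name `original_text` and the toy dictionaries are made up for illustration.

```python
def original_text(tweet_json):
    """Return the untruncated text, following retweeted_status when present."""
    source = tweet_json.get('retweeted_status', tweet_json)
    # 'full_text' is present with tweet_mode='extended'; 'text' is the fallback otherwise.
    return source.get('full_text', source.get('text', ''))

toy_retweet = {
    'full_text': 'RT @someone: A long tweet that gets cut…',
    'retweeted_status': {'full_text': 'A long tweet that gets cut off when retweeted.'},
}
toy_regular = {'full_text': 'Just a regular tweet.'}

print(original_text(toy_retweet))   # A long tweet that gets cut off when retweeted.
print(original_text(toy_regular))   # Just a regular tweet.
```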


### Making Multiple Requests
 - With a single request we can retrieve 100 tweets
 - What if we want to maximise our data access and make multiple requests?
 - We could make a second request and then join the lists of results together...
 - However Twitter doesn't know what tweets we already retrieved in the first request, so we might get the same ones again.
 - Enter Tweepy's `Cursor` object.
 - The `Cursor` will keep track of where we are in the results stream, handle any api limits and blocks, and keep producing results until it reaches the set limit.
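
To give a feel for what "keeping track of where we are" means, here is a toy version of the idea. It is not Tweepy's actual implementation: `fake_search` stands in for the API, the IDs are made up, and `toy_cursor` is a hypothetical name. It simply asks each new request for tweets older than the oldest one already seen.

```python
FAKE_TWEET_IDS = list(range(1000, 0, -1))  # made-up IDs, newest first

def fake_search(count=100, max_id=None):
    """Stand-in for a search request: newest matching IDs, at most `count`."""
    matching = [i for i in FAKE_TWEET_IDS if max_id is None or i <= max_id]
    return matching[:count]

def toy_cursor(search, limit, count=100):
    """Yield up to `limit` results by requesting successive pages."""
    yielded = 0
    max_id = None
    while yielded < limit:
        page = search(count=count, max_id=max_id)
        if not page:
            return  # no more results available
        for tweet_id in page:
            if yielded >= limit:
                return
            yield tweet_id
            yielded += 1
        # Next request: only ask for tweets older than the oldest one seen.
        max_id = page[-1] - 1

collected = list(toy_cursor(fake_search, limit=250))
print(len(collected), len(set(collected)))  # 250 250
```

Because every request starts below the oldest ID already collected, no tweet comes back twice.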


In [21]:
import tweepy
import pandas as pd


CONSUMER_KEY = 'YOUR_CONSUMER_KEY'        # paste your own Consumer Key here
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'  # paste your own Consumer Secret here, and never share real keys

auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)



In [22]:
# This is how we originally made a request for data from the API

old_approach = api.search(q='brexit',tweet_mode='extended',result_type='mixed', count=100)

In [23]:
# Using the cursor is similar to our original single_response method.
# we first create our custom cursor, providing it the api method we want to use,
# and any of the arguments we want to be used by that method.



our_cursor = tweepy.Cursor(api.search, q='brexit', tweet_mode='extended',result_type='mixed', count=100)

## Cursors
Cursor objects don't DO anything alone. They are almost like a set of instructions, but the instructions aren't acted on until we do two things....
1. Specify whether we want our results as `items` or `pages`.
2. Iterate over the cursor

#### 1. Items / Pages

Cursors can return either a list of individual result items, or result pages depending on what is best. 
- Pages returns you a stream of response objects, each containing the maximum number of tweets per request.
- Items returns you a stream of tweets, essentially joining together the results of the responses.

We set whether we are using pages or items using a method attached to the cursor. The number we pass to the cursor defines the limit, of either pages or items. These arguments would return the same number of tweets, presuming we set our count to 100.

`our_cursor.pages(2)`

`our_cursor.items(200)`
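
The relationship between pages and items can be sketched without touching the API. Here `fake_pages` is a made-up generator standing in for the stream of responses; joining its pages end to end gives the same stream that items would.

```python
from itertools import chain, islice

def fake_pages(n_pages, per_page=100):
    """Yield n_pages lists of made-up tweet IDs, standing in for API responses."""
    for page_number in range(n_pages):
        start = page_number * per_page
        yield list(range(start, start + per_page))

# "Pages": two responses, each a list of 100 results.
pages = list(fake_pages(2))

# "Items": the same results as one flat stream, capped at 200.
items = list(islice(chain.from_iterable(fake_pages(2)), 200))

print(len(pages), len(items))  # 2 200
```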

#### 2. Iterating over the Cursor

For our purposes asking for the `.items()` is sufficient. Now we need to iterate over it.

In [24]:
# The most explicit way - using a for loop

item_results = []
for status in our_cursor.items(200):
    item_results.append(status)
print(len(item_results))

200


# 3. Managing Tweet Data
- It's all well and good having this data and printing out pieces of it, but how do we...
- Structure it...
- Store it...
- and Explore it?

In [25]:
# First let's get a fresh set of results with a new cursor. Limit to 500 items.

results = []
our_cursor = tweepy.Cursor(api.search, q='brexit', count=100, tweet_mode='extended')

for item in our_cursor.items(500):
    results.append(item)

Ideally we'd like this data now in a Pandas DataFrame so we can work with it. Let's just try and put it in and see what happens...

In [26]:
df = pd.DataFrame(results)
df

| | 0 |
|---|---|
| 0 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 1 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 2 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 3 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 4 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| ... | ... |
| 495 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 496 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 497 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |
| 498 | `Status(_api=<tweepy.api.API object at 0x7fdef2...` |


Ok....partial success.
Pandas doesn't understand these `Status` objects we're trying to load into it. 
Whilst often people will load data into Pandas using .csv files, Pandas can create dataframes from python data structures such as lists and dictionaries.

In [27]:
flintstones_data = [ {'name':'Fred', 'age':30}, {'name':'Wilma', 'age':27}, {'name':'Barney', 'age':32}, {'name':'Betty', 'age':26}  ]

toy_df = pd.DataFrame(flintstones_data)
toy_df

| | name | age |
|---|---|---|
| 0 | Fred | 30 |
| 1 | Wilma | 27 |
| 2 | Barney | 32 |
| 3 | Betty | 26 |


So we need to somehow convert all of our `status` objects into some sort of Python data structure like our Flintstones data....

#### Luckily for us....
The `._json` attribute attached to each `Status` holds the tweet's data as a dictionary.

In [28]:
single_tweet = results[0]
print(type(single_tweet._json))
single_tweet._json

<class 'dict'>


{'created_at': 'Mon Nov 23 13:32:48 +0000 2020',
 'id': 1330866892983181313,
 'id_str': '1330866892983181313',
 'full_text': 'Bit of a pain having Covid restrictions this festive season, but remember - Brexit is for life, not just for Christmas.',
 'truncated': False,
 'display_text_range': [0, 119],
 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []},
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 4047208852,
  'id_str': '4047208852',
  'name': 'SpersJR',
  'screen_name': 'Frantically2',
  'location': '',
  'description': "The only thing we have to fear is fear itself. That's what scares me.",
  'url': None,
  'entities': {'description': {'urls': []}},
  'protected': False,
  

In [29]:
# Let's first create a new list of the transformed Status objects

json_results = []

for tweet in results:
    json_results.append(tweet._json)
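
The same loop can be written more compactly as a list comprehension. The sketch below uses made-up stand-in objects (built with `SimpleNamespace`) so that it runs without the API; with real data you would loop over `results` instead of `fake_results`.

```python
from types import SimpleNamespace

# Made-up stand-ins for Status objects: each just carries a ._json dictionary.
fake_results = [
    SimpleNamespace(_json={'id': 1, 'full_text': 'first'}),
    SimpleNamespace(_json={'id': 2, 'full_text': 'second'}),
]

# One line instead of the explicit loop; the result is the same list of dictionaries.
fake_json_results = [tweet._json for tweet in fake_results]

print(fake_json_results)
```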
    

In [30]:
# Now try...

df = pd.DataFrame(json_results)
df.head()

| | created_at | id | id_str | full_text | truncated | display_text_range | entities | metadata | source | in_reply_to_status_id | ... | favorite_count | favorited | retweeted | lang | retweeted_status | quoted_status_id | quoted_status_id_str | possibly_sensitive | quoted_status | extended_entities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mon Nov 23 13:32:48 +0000 2020 | 1330866892983181313 | 1330866892983181313 | Bit of a pain having Covid restrictions this f... | False | [0, 119] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="https://mobile.twitter.com" rel="nofo...` | | ... | 0 | False | False | en | | | | | | |
| 1 | Mon Nov 23 13:32:48 +0000 2020 | 1330866892614086657 | 1330866892614086657 | RT @JamesGr49498338: @LeftieCatLady No, this i... | False | [0, 140] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | 0 | False | False | en | `{'created_at': 'Mon Nov 23 08:46:51 +0000 2020...` | | | | | |
| 2 | Mon Nov 23 13:32:47 +0000 2020 | 1330866891502522370 | 1330866891502522370 | RT @carolJhedges: 🇪🇺😠The new Home Office regul... | False | [0, 140] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | 0 | False | False | en | `{'created_at': 'Mon Nov 23 13:22:08 +0000 2020...` | | | | | |
| 3 | Mon Nov 23 13:32:47 +0000 2020 | 1330866891154485251 | 1330866891154485251 | RT @NvOndarza: Very interesting if this materi... | False | [0, 139] | `{'hashtags': [{'text': 'Brexit', 'indices': [6...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="https://mobile.twitter.com" rel="nofo...` | | ... | 0 | False | False | en | `{'created_at': 'Mon Nov 23 11:04:06 +0000 2020...` | 1.330824e+18 | 1.330823757829722e+18 | | | |
| 4 | Mon Nov 23 13:32:46 +0000 2020 | 1330866885345370113 | 1330866885345370113 | RT @MilesKing10: Brain of Brexit Steve Baker. ... | False | [0, 140] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | 0 | False | False | en | `{'created_at': 'Mon Nov 23 09:01:05 +0000 2020...` | | | | | |


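If we check, we can see that the columns in the DataFrame match the keys in our tweet dictionaries: each column represents one attribute, and each row represents a single Tweet/Status. A column appears if any tweet in the list has that key, so a tweet without it (for example `retweeted_status` on a regular tweet) is simply left empty in that column.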

In [31]:
results[0]._json.keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])
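
This first tweet has fewer keys than the DataFrame has columns, because the table needs a column for every key that appears in any tweet. A rough sketch of that idea in plain Python, using two made-up records, is below. Pandas does the equivalent for you and marks the gaps as missing values; here they are shown as `None`.

```python
# Two made-up records: only the second has a 'retweeted_status' key.
records = [
    {'id': 1, 'full_text': 'a regular tweet'},
    {'id': 2, 'full_text': 'RT @someone: ...', 'retweeted_status': {'id': 0}},
]

# Collect every key seen in any record, in order of first appearance.
all_keys = []
for record in records:
    for key in record:
        if key not in all_keys:
            all_keys.append(key)

# Build one row per record, leaving a gap (None) where a key is absent.
rows = [[record.get(key) for key in all_keys] for record in records]

print(all_keys)   # ['id', 'full_text', 'retweeted_status']
print(rows[0])    # [1, 'a regular tweet', None]
```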

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   created_at                 500 non-null    object 
 1   id                         500 non-null    int64  
 2   id_str                     500 non-null    object 
 3   full_text                  500 non-null    object 
 4   truncated                  500 non-null    bool   
 5   display_text_range         500 non-null    object 
 6   entities                   500 non-null    object 
 7   metadata                   500 non-null    object 
 8   source                     500 non-null    object 
 9   in_reply_to_status_id      97 non-null     float64
 10  in_reply_to_status_id_str  97 non-null     object 
 11  in_reply_to_user_id        103 non-null    float64
 12  in_reply_to_user_id_str    103 non-null    object 
 13  in_reply_to_screen_name    103 non-null    object 

# 4. Saving Tweet Data

We can finally save our data to disk if we like. In this case we're going to save to something called a `pickle` file. Why?

In [33]:
# If we examine one of our columns...

df.entities

0      {'hashtags': [], 'symbols': [], 'user_mentions...
1      {'hashtags': [], 'symbols': [], 'user_mentions...
2      {'hashtags': [], 'symbols': [], 'user_mentions...
3      {'hashtags': [{'text': 'Brexit', 'indices': [6...
4      {'hashtags': [], 'symbols': [], 'user_mentions...
                             ...                        
495    {'hashtags': [], 'symbols': [], 'user_mentions...
496    {'hashtags': [], 'symbols': [], 'user_mentions...
497    {'hashtags': [{'text': 'SteveBaker', 'indices'...
498    {'hashtags': [], 'symbols': [], 'user_mentions...
499    {'hashtags': [], 'symbols': [], 'user_mentions...
Name: entities, Length: 500, dtype: object

The values in the entities column aren't strings, they're dictionaries...

In [34]:
# Here is the first row's value in the 'entities' column
df.loc[0, 'entities']

{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}

In [35]:
# The type of the value is dict - dictionary.
type(df.loc[0, 'entities'])

dict

In [36]:
# and parts of it can be accessed like a dictionary
df.loc[0, 'entities']['user_mentions']

[]

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/pickle.jpg?raw=true" align="right" height="200">
If we were to save this DataFrame as a .csv file, those dictionaries would have to be turned into strings, because .csv files don't understand Python objects. When we reloaded the data from the CSV, our entities column would be a column of messy strings rather than dictionaries.

### How do we solve this?

# PICKLES!
- A pickle file is a saved version of a python object. So long as it is saved and loaded with the same version of Pandas, it will retain all the data exactly in the state it is in now.

How do we complete this highly complex procedure?....

In [37]:
# Pickle it!
df.to_pickle('my_tweet_df.pkl')
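
The difference between the two formats can be seen with a single made-up row and Python's built-in `csv` and `pickle` modules. This is only an illustration of the principle; the Pandas methods handle the file reading and writing for you.

```python
import csv
import io
import pickle

# One made-up row with a dictionary value, like our 'entities' column.
row = {'id': 1, 'entities': {'hashtags': [], 'user_mentions': []}}

# CSV round trip: every value is written out as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'entities'])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# Pickle round trip: the Python objects come back as they were.
from_pickle = pickle.loads(pickle.dumps(row))

print(type(from_csv['entities']))     # <class 'str'>
print(type(from_pickle['entities']))  # <class 'dict'>
```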

# Extending your Collection

If you want to gather data across a longer period, such as sampling across a week, you may want to pull from the Twitter API once a day. How do we do this without duplicating our data, and how do we easily just add the new data to our dataset, rather than creating a new one each time?

In [38]:
import os

my_data_filename = 'twitter_data.pkl'
query = 'brexit'
n_items = 1000


# First load in your data if you have it, otherwise create a new DataFrame

if os.path.exists(my_data_filename):
    df = pd.read_pickle(my_data_filename)
    
    # if there is data check to find the largest id in your dataset, this will be the most recent
    max_id = df['id'].max()
else:
    df = pd.DataFrame()
    # set max_id to None because on the first run we don't need to provide an id to limit results
    max_id = None
    
# Pull results from the Twitter API

results = []
our_cursor = tweepy.Cursor(api.search, q=query, count=100, tweet_mode='extended', since_id=max_id)

for item in our_cursor.items(n_items):
    results.append(item._json)

# Load this batch of data into a DataFrame
    
current_data = pd.DataFrame(results)

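# Append the new data onto the end of the loaded data (or the empty dataframe if this is the first run)
# pd.concat is used here because DataFrame.append was removed in pandas 2.0
df = pd.concat([df, current_data], ignore_index=True)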

# Check the dataset for any duplicates by dropping any rows with duplicate ids
df = df.drop_duplicates('id')

# Save back to disk
df.to_pickle(my_data_filename)

print(len(df))

1000
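
The two ideas that keep the dataset clean, finding the newest ID already stored and keeping one row per ID, can be tried out on toy data without calling the API. The helper names `newest_id` and `merge_batches` and the IDs below are made up for illustration.

```python
def newest_id(stored):
    """Largest id collected so far, or None on the first run."""
    return max((tweet['id'] for tweet in stored), default=None)

def merge_batches(stored, new_batch):
    """Combine two lists of tweet dicts, keeping one entry per tweet id."""
    merged = {tweet['id']: tweet for tweet in stored}
    for tweet in new_batch:
        merged.setdefault(tweet['id'], tweet)  # keep the copy we already had
    return list(merged.values())

day_one = [{'id': 101}, {'id': 102}, {'id': 103}]
day_two = [{'id': 103}, {'id': 104}]   # overlaps with day one on id 103

combined = merge_batches(day_one, day_two)
print(newest_id([]), newest_id(day_one), len(combined))  # None 103 4
```

Passing the newest stored ID as `since_id` should already prevent most overlap, so dropping duplicates acts as a safety net.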
