In this notebook:

1. [Connecting to the Twitter API](#1)
2. [Searching for a specific user](#2)
3. [Searching for a specific topic](#3)
4. [Extending the search and working with multi-level JSON Data](#4)

<a id="1"></a>
# 1. Connecting to the Twitter API

## Questions & Objectives

* Setting up access and validity signing
* Setting up a handler to manage the connection
* Running a test search

First we install `tweepy`, the library that handles access to the API, and import it along with `json`, which is part of the Python standard library and handles JSON data.

In [None]:
# Run this cell now to import the libraries.

!pip install tweepy
import tweepy        # https://github.com/tweepy/tweepy
import json

We then set up the variables that hold the validation keys. You need to add your keys (tokens) and secrets in the spaces below. Make sure to put them between the speech marks and make sure there are no extra spaces.

In [None]:
# Add your keys and secrets and then run this cell.

access_key = ''
access_secret = ''
api_key = ''
api_secret = ''
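
Hard-coding secrets in a notebook makes them easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the four values from environment variables. The variable names (`TWITTER_API_KEY` and so on) and the helper `load_credentials` are assumptions made for illustration, not names that `tweepy` or Twitter require.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
ENV_NAMES = {
    "api_key": "TWITTER_API_KEY",
    "api_secret": "TWITTER_API_SECRET",
    "access_key": "TWITTER_ACCESS_KEY",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, stripping stray whitespace."""
    creds = {name: environ.get(env_name, "").strip()
             for name, env_name in ENV_NAMES.items()}
    missing = [name for name, value in creds.items() if not value]
    if missing:
        raise ValueError("Missing credentials: " + ", ".join(missing))
    return creds
```

With the variables set, `creds = load_credentials()` returns a dictionary, and `creds["api_key"]` can be used in place of the hard-coded `api_key` string.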

Next, we set up the authentication handler. We pass the API key and secret to the handler, set the access token on it, and then create the API object, which we use to send requests to the API.

In [None]:
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

To test the connection we will run a test query.

We use the API object to request recent tweets from the accounts you follow (your home timeline).


In [None]:
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

<a id="2"></a>
# 2. Searching for a Specific User

* Search for a specific user
* Retrieve data from the Twitter API
* Call specific items from the JSON data object
* Look at the full JSON data

We will now look for tweets from a specific person. To do this we need their Twitter name (handle). If you go to https://twitter.com/BarackObama you can see the handle under the display name. It is shown with an @ sign in front, which we leave out in our code.
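
If a handle is copied straight from a profile page it may still carry the leading `@`. A small helper such as the hypothetical `clean_handle` below (it is not part of `tweepy`) removes it before the name is passed as `screen_name`.

```python
def clean_handle(handle):
    """Strip surrounding whitespace and any leading '@' from a Twitter handle."""
    return handle.strip().lstrip("@")

print(clean_handle("@BarackObama"))
```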

For this we use the `get_user` method from the Twitter API.

In [None]:
# First, we create a variable, call the information on the user Barack Obama, and hold it
# in the variable we created.

user = api.get_user(screen_name='BarackObama')

In [None]:
# The user object exposes the fields of the JSON data as attributes.
# We can access these attributes and print their content.
# We will look more at JSON later.
# We can print the screen name as below:

print(user.screen_name)

In [None]:
# We can print the number of followers:

print(user.followers_count)

In [None]:
# We can print the user description:

print(user.description)

In [None]:
# To see all of the user information in its raw format we can type:

print(user)

### 🐛Minitask

* Try printing other pieces of information from the user object.
* See if you can work out how to reach the nested items.
* Try looking at another user.

In [None]:
# We can get tweets from a user's timeline.
# This time we call the user_timeline method with the BarackObama screen name.
# Here we request the last two tweets.
# These are returned in a list object.

new_tweets = api.user_timeline(screen_name = 'BarackObama', count = 2)   # replace BarackObama with another user's name

In [None]:
# Here we can view the first tweet (which, remember, is at index 0 in a list).
# What other information can you access from the tweets?  How about the number of retweets?

new_tweets[0]

<a id="3"></a>
# 3. Searching for a Topic

* Search the Twitter API using a keyword
* Retrieve the text from a single tweet
* Retrieve the text from multiple tweets
* Process and clean the text
* Visualise the text

We will now look for tweets that contain a specific word. 

For this we use the `search_tweets` method from `tweepy`.

In [None]:
# Here we are looking for the word covid.
# We are asking for 10 English tweets to be returned.
# They are returned as a list.

covid_tweets = api.search_tweets(q='covid', lang='en', count=10)

In [None]:
# We can print out the first tweet in the list.

covid_tweets[0]

A tweet object also keeps the raw JSON it was built from.
We can reach it through the `_json` attribute, which holds the data as a dictionary.
Then we can look up any item in that dictionary by its key.

(Remember that a dictionary item is a key-value pair of the form `{'text': 'this is tweet text'}`, which means that we can retrieve the value by its key.)
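
To make the lookup-by-key idea concrete, here is a small sketch that uses a hand-written dictionary in place of a real tweet. The values are invented for illustration; the dictionary held in a real tweet's `_json` has many more keys.

```python
import json

# A hand-made stand-in for the dictionary held in a tweet's _json attribute.
sample_tweet = {
    "text": "this is tweet text",
    "retweet_count": 3,
    "user": {"screen_name": "example_user", "followers_count": 42},
}

# Look up a value by its key.
print(sample_tweet["text"])

# Nested dictionaries are reached by chaining keys.
print(sample_tweet["user"]["screen_name"])

# .get() returns a default instead of raising KeyError when a key is absent.
print(sample_tweet.get("lang", "unknown"))

# json.dumps with indent gives a readable, pretty-printed view.
pretty = json.dumps(sample_tweet, indent=2)
print(pretty)
```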

In [None]:
# Here we can see all of the json in a nice format...

covid_tweets[0]._json

In [None]:
# ...or we can just call the text.

covid_tweets[0]._json['text'] 

In [None]:
# We can put the text into its own list and work with just the text.

tweets_text = []
for each in covid_tweets:
    tweets_text.append(each._json['text'])

In [None]:
# We can see how we have put the tweets' text into a list.

print(tweets_text)

In [None]:
# We can treat the tweets' text like we did in earlier badges,
# for example, we can turn it into a string and tokenise it.

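import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')   # tokenizer data used by word_tokenize (newer NLTK releases may ask for 'punkt_tab')
tweets_string = " ".join(tweets_text)
tokens = word_tokenize(tweets_string)
print(tokens[0:10])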

In [None]:
# We can clean up the tweets' text like we did earlier, making it all lowercase and removing stop words.

import nltk
import string
nltk.download('stopwords')
from nltk.corpus import stopwords
lowercase_tokens = [token.lower() for token in tokens]
remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))
filtered_text = [token 
                 for token in lowercase_tokens 
                 if token not in remove_these]
print(filtered_text)
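
Tweet text often contains URLs and @mentions, which the stop-word filter above does not remove. As an optional extra step, a regular expression can strip them before tokenising. The helper name `strip_urls_and_mentions` and the two patterns are illustrative assumptions and will not catch every case.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove http(s) links and @mentions, then collapse repeated whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    return " ".join(text.split())

# Invented example text, not a real tweet.
sample = "RT @someone: Covid update here https://t.co/abc123 stay safe"
print(strip_urls_and_mentions(sample))
```

Applied to each item of `tweets_text` before the join, this would keep link fragments and user names out of the word frequencies.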

In [None]:
# We can calculate word frequencies...

from collections import Counter
simple_frequencies_dict = Counter(filtered_text)
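
A `Counter` behaves like a dictionary that maps each token to its count. The sketch below uses a short invented token list, rather than real tweets, to show how the counts can be inspected.

```python
from collections import Counter

# Invented tokens standing in for filtered_text.
example_tokens = ["covid", "vaccine", "covid", "mask", "vaccine", "covid"]
example_frequencies = Counter(example_tokens)

# Look up the count for a single token (missing tokens count as 0).
print(example_frequencies["covid"])
print(example_frequencies["lockdown"])

# List the two most frequent tokens with their counts.
print(example_frequencies.most_common(2))
```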

In [None]:
# ...and produce word clouds.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(width=800, 
                  height=400, 
                  max_font_size=160, 
                  colormap="hsv").generate_from_frequencies(simple_frequencies_dict)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### 🐛Minitask

* Try using a visualisation method or a search method you have used before to visualise the text.
* Try searching for a different word.

In [None]:
# Write your code here.

<a id="4"></a>
# 4. Extending the Search and Working with Multi-level JSON Data

* Search the Twitter API using an extended query with multiple terms
* Search using a tweepy cursor to retrieve more data
* Look at nested data in the JSON

We will now look for tweets that contain several words. We can combine query words with the operator `OR`. We can use this operator to say, give me tweets that contain `word1` or `word2`. You might want to do this with related words on the same topic, or with multiple spellings or potential typos of a word. 
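
When the list of terms grows, building the query string by hand becomes error-prone. The sketch below joins a list of terms with `OR` written in capitals. The helper `build_or_query` and the example terms are assumptions for illustration only.

```python
def build_or_query(terms):
    """Join search terms with the OR operator, dropping blanks and duplicates."""
    unique_terms = []
    for term in terms:
        term = term.strip()
        if term and term not in unique_terms:
            unique_terms.append(term)
    return " OR ".join(unique_terms)

query = build_or_query(["covid", "covid19", "#covid", "coronavirus"])
print(query)
```

The resulting string can be passed as the `q` argument of the search.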

For this we will continue to use the `search_tweets` method.

We want to gather more data than we did before. A single call to `search_tweets` returns only a limited number of tweets. To retrieve more, we use a tweepy `Cursor`. Twitter returns results in pages, almost like a book, but it will only give you one page at a time. Before, we only took the first page. This time, we will page through the results using a `Cursor` object, which keeps track of our position and requests the next page for us.
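
The paging idea can be sketched without contacting the API. In the example below, `fetch_page` is a hypothetical stand-in for a single API request and `iterate_pages` plays a role similar to `Cursor(...).pages(n)`. Neither is part of `tweepy`, and the real `Cursor` tracks its position differently.

```python
# Invented data: 25 fake tweets that the source hands out in pages of 10.
FAKE_TWEETS = [f"tweet {i}" for i in range(25)]
PAGE_SIZE = 10

def fetch_page(page_number):
    """Hypothetical single request: return one page of results (may be empty)."""
    start = page_number * PAGE_SIZE
    return FAKE_TWEETS[start:start + PAGE_SIZE]

def iterate_pages(limit):
    """Yield up to `limit` pages, stopping early when the source runs out."""
    for page_number in range(limit):
        page = fetch_page(page_number)
        if not page:
            break
        yield page

pages = list(iterate_pages(limit=5))
print(len(pages))        # number of pages actually returned

# Flatten the list of pages into a single list of tweets.
all_tweets = [tweet for page in pages for tweet in page]
print(len(all_tweets))
```

Collecting pages produces a list of lists, which is why later code needs two nested loops (or a flattening step like the one above) to reach individual tweets.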

In [None]:
# We set up a list to hold the tweets so we can then append to it as we iterate through the pages.
# Previously, we created a list in the search, but here we need to create a list so we can add to it.

covid_tweets = []

# We set up a tweepy Cursor to handle the paging.
# We set up the query with the OR operator.
# We iterate through the pages from the API using a for loop.
# We append the content to a list.
for page in tweepy.Cursor(api.search_tweets, 
                          q='covid OR covid19 OR COVID OR COVID19 OR #covid', 
                          lang='en').pages(10):
    covid_tweets.append(page)

In [None]:
covid_tweets[0]

In [None]:
# We can see the text from the first tweet:

print(covid_tweets[0][0].text) # covid_tweets[0][0] is the first Status (tweet) object

Twitter data is nested.

This means that it can contain items within items. 

For example hashtags, user mentions, and URLs are contained within an `entities` dictionary.

This looks like:

```
'entities': { 
    'hashtags': [{'text': 'hashtag1'}, {'text': 'hashtag2'}], 
    'user_mentions': [{'screen_name':'barackobama', 'name': 'Barack Obama'}], 
    'urls': [{'url':'www.bbc.co.uk'}]
    }
```
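
As an illustration of walking this nested structure, the sketch below uses a hand-written `entities` dictionary. The values are invented, and each hashtag is assumed to be a dictionary with a `text` key, which is the key the hashtag code in this notebook reads.

```python
# Hand-made example; real tweets carry more keys in each nested dictionary.
sample_entities = {
    "hashtags": [{"text": "covid"}, {"text": "vaccine"}],
    "user_mentions": [{"screen_name": "barackobama", "name": "Barack Obama"}],
    "urls": [{"url": "www.bbc.co.uk"}],
}

# Each entity type is a list of dictionaries, so we loop over the list
# and pick one key out of every dictionary.
hashtag_texts = [h["text"] for h in sample_entities["hashtags"]]
mention_names = [m["screen_name"] for m in sample_entities["user_mentions"]]
url_values = [u["url"] for u in sample_entities["urls"]]

print(hashtag_texts)
print(mention_names)
print(url_values)
```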

In [None]:
# The hashtags are contained in a list within the entities dictionary,
# which means we need to look up the 'hashtags' key and then iterate through the list.
# We set up a list to hold the hashtags so we can then append to it as we iterate.
# We iterate through each page and each tweet, and then through the hashtags in the list,
# adding each hashtag's text to our list.

covid_hashtags = []
for search_result in covid_tweets:
    for status in search_result:  # for every tweet
        hashtags = status.entities['hashtags']
        if len(hashtags) > 0:     # if there are hashtags
            for h in hashtags:
                covid_hashtags.append(h['text'])

print(covid_hashtags)
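
Hashtags that differ only in capitalisation would be counted as separate items. One optional way to merge them, shown here with an invented list rather than real results, is to lowercase each hashtag before counting.

```python
from collections import Counter

# Invented hashtags standing in for covid_hashtags.
example_hashtags = ["COVID19", "covid19", "Vaccine", "covid19", "vaccine"]

# Lowercase every hashtag so that variants are counted together.
normalised = [tag.lower() for tag in example_hashtags]
hashtag_counts = Counter(normalised)

print(hashtag_counts.most_common())
```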

In [None]:
# We can then visualise these hashtags in the ways we learnt before.

hashtag_string = " ".join(covid_hashtags)
tokens = word_tokenize(hashtag_string)
simple_frequencies_dict_covid = Counter(tokens)
cloud = WordCloud(width=800, height=400, max_font_size=160, 
                  colormap="viridis", 
                  background_color='white',).generate_from_frequencies(simple_frequencies_dict_covid)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### 🐛Minitask

* Try creating a visualisation with a different nested item.

In [None]:
# # Have this as a task -- look for another item of interest, maybe alter to be URLs?

# covid_mentions = []
# for search_result in covid_tweets:
#     for status in search_result:
#         mention = status._json['entities']['user_mentions']
#         if len(mention) > 0:
#             i = 0
#             while i < len(mention):
#                 covid_mentions.append(mention[i]['name'])
#                 i += 1
# people_dict=Counter(covid_mentions)

In [None]:
# cloud = WordCloud(width=800, height=400, max_font_size=200,
#                   background_color='white', colormap="viridis").generate_from_frequencies(people_dict)
# plt.figure(figsize=(16,12))
# plt.imshow(cloud, interpolation='bilinear')
# plt.axis('off')
# plt.show()

In [None]:
# covid_tweets = []
# for page in tweepy.Cursor(api.search, q='brexit', lang='en', min_retweets="1000").pages(100):
#     covid_tweets.append(page)

In [None]:
# print(len(covid_tweets))
# print(covid_tweets[0])