# SC207 - Session 4
# Addendum: Exploring our Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">

It's down to you!
- Tweets give us a lot of different kinds of data.
- Some of these headings might be self-explanatory, others less so.
- Twitter provides us with explanations of these different values in their [Tweet Object Dictionary](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)

Explore the dataset using what you learned in the Pandas session. Interesting questions you might ask...
- How many tweets are original, how many are retweets?
- Who is the most active user in the sample - have we collected lots of tweets from one person?
- Which tweets got the most retweets / favourites?
- When were these tweets posted, and when were the most active times?
- What hashtags do the tweets use and which are most popular (this may be a little more challenging - but possible with what we've learned)?

We'll set you up with a couple of things...
- The relevant imports you might need
- The data to examine
- A couple of quick examples 


## Imports

In [1]:
import pandas as pd # Just in case you want to easily come back and start at section 4
import seaborn as sns

You can either explore the data we generated together, or you can pull your own focused on a different search query.

# Step 1 - Load your data

## Option A: I want to use the data we have

In [2]:
df = pd.read_csv('brexit_tweets.csv')

You are finished here - skip ahead to the examples (Example 1 onwards).

## Option B: I want to use different data
We can compress all the api work and the data management into just a few steps...

In [2]:
# we need our Tweepy import
import tweepy

In [3]:
# datetime_to_twitter_id below needs these
from datetime import datetime, timezone

# we need our function to extract the tweets from the results list
def extract_original_tweets(list_of_tweets):
    json_results = []
    
    for tweet in list_of_tweets:
        json_results.append(tweet._json)
        if 'retweeted_status' in tweet._json:
            original_tweet = tweet._json['retweeted_status']
            json_results.append(original_tweet)
            
    return json_results

def datetime_to_twitter_id(day, month, year):
    def utc2snowflake(stamp):
        # from https://github.com/client9/snowflake2time
        return (int(round(stamp * 1000)) - 1288834974657) << 22

    stamp = datetime(year, month, day).replace(tzinfo=timezone.utc).timestamp()
    return utc2snowflake(stamp)
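
`datetime_to_twitter_id` is defined above but not used later in this notebook, so here is a minimal sketch of what it does. Tweet ids are "snowflake" numbers that store a millisecond timestamp in their upper bits, which is why a date can be turned into an id. The inverse helper `twitter_id_to_datetime` and the constant name `TWITTER_EPOCH_MS` are illustrative additions, not part of the original materials.

```python
from datetime import datetime, timezone

# Same constant as in utc2snowflake above: Twitter's custom epoch in milliseconds
TWITTER_EPOCH_MS = 1288834974657


def datetime_to_twitter_id(day, month, year):
    # Smallest id a tweet posted at midnight UTC on this date could have
    stamp = datetime(year, month, day, tzinfo=timezone.utc).timestamp()
    return (int(round(stamp * 1000)) - TWITTER_EPOCH_MS) << 22


def twitter_id_to_datetime(tweet_id):
    # Hypothetical inverse helper: recover the timestamp stored in an id
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


since_id = datetime_to_twitter_id(1, 10, 2019)
round_trip = twitter_id_to_datetime(since_id)
print(since_id, round_trip)
```

A value like `since_id` could then be passed to the search call as `since_id=` (or `max_id=`) to restrict results by date, assuming your version of Tweepy forwards those standard search parameters.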

In [4]:
# We set up our API access. Make sure you fill in your own key and secret details.

CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [7]:
# run our query and store the results
# remember to provide a query to the q= argument

results = []

for item in tweepy.Cursor(api.search, q='brexit', count=100, tweet_mode='extended').items(100000):
    results.append(item)

Rate limit reached. Sleeping for: 510
Rate limit reached. Sleeping for: 639


In [8]:
extracted_results = extract_original_tweets(results)

In [9]:
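# make into a dataframe, flattening nested fields into dotted column names
# (pd.json_normalize is the current name for pd.io.json.json_normalize, pandas 1.0+)
df = pd.json_normalize(extracted_results)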

In [10]:
# drop the duplicate rows
df = df.drop_duplicates('id')

In [11]:
# drop the columns with no data in at all
df = df.dropna(axis='columns', how='all')
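
To see what the three cleaning steps above do, here is a small sketch on made-up data. The records in `fake_tweets` (including the `empty_field` key) are invented for illustration and are far simpler than real tweet JSON. The point is that nested dictionaries become dotted column names such as `user.screen_name`.

```python
import pandas as pd

# Made-up, heavily simplified tweet records (illustrative only)
fake_tweets = [
    {'id': 1, 'full_text': 'hello', 'empty_field': None,
     'user': {'screen_name': 'alice', 'followers_count': 10}},
    {'id': 2, 'full_text': 'world', 'empty_field': None,
     'user': {'screen_name': 'bob', 'followers_count': 5}},
    # the same tweet collected twice
    {'id': 1, 'full_text': 'hello', 'empty_field': None,
     'user': {'screen_name': 'alice', 'followers_count': 10}},
]

flat = pd.json_normalize(fake_tweets)          # nested 'user' dict -> 'user.*' columns
flat = flat.drop_duplicates('id')              # keep one row per tweet id
flat = flat.dropna(axis='columns', how='all')  # remove columns that are entirely empty

print(flat.columns.tolist())
print(len(flat))
```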


In [23]:
subset = ['created_at',
          'id',
          'full_text',
          'user.screen_name',
          'user.id',
          'user.statuses_count',
          'user.followers_count',
          'retweeted_status.user.screen_name',
          'retweeted_status.user.id',
         'entities.user_mentions',
         'entities.hashtags']

In [24]:
df[subset].to_pickle('large_brexit_tweets.pkl')

In [14]:
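# alternatively, save every column (note: this writes to the same filename as the
# subset version, so whichever cell you run last is the file you keep)
df.to_pickle('large_brexit_tweets.pkl')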

In [None]:
df.to_csv('large_brexit_tweets.csv')

## Example 1 - Most Favorited tweets

In [None]:
subset = ['full_text','favorite_count'] # not strictly necessary, but subset lets us control what columns to show

df[subset].sort_values(by='favorite_count', ascending=False).head(10)

## Example 2 - Number of retweets collected

In [None]:
# remember we created a 'retweeted_status_id' column. It will either contain the id of an original tweet
# that was retweeted, or it will be empty
df['retweeted_status_id'].head(10)

In [None]:
# easiest to work with is a simple column that is True if the tweet is a retweet.
# notna() is True wherever 'retweeted_status_id' holds a value, i.e. the tweet is a retweet
df['is_retweet'] = df['retweeted_status_id'].notna()
df['is_retweet'].head()

In [None]:
# now we can do this

df['is_retweet'].value_counts()

## Example 3 - Popularity of Tweets, split by retweet status

In [None]:
sns.barplot(data=df, y='favorite_count', x='is_retweet')

## Example 4 - Time Series of Tweets
Sometimes it is really useful to get a sense of the time distribution of tweets. We can use Time series information to...

- See trends such as peak times for particular topics
- Detect potential co-ordinated disinformation campaigns by examining...
  - the account creation date of all the accounts pushing a particular hashtag. Were a significant proportion of the accounts created in a small window of time?
  - the rate at which accounts are tweeting. Some accounts might tweet hundreds of times per hour - upwards of 50 is considered highly unusual.
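
As a rough sketch of the tweeting-rate check, the snippet below counts tweets per account per hour on made-up data and flags accounts whose busiest hour exceeds a cut-off. The toy data, the `hour` column and the `THRESHOLD` value are all illustrative assumptions; choose a cut-off that suits your own sample.

```python
import pandas as pd

# Made-up sample: account 'a' tweets three times within one hour
toy = pd.DataFrame({
    'user.screen_name': ['a', 'a', 'a', 'b', 'b'],
    'created_at': pd.to_datetime([
        '2019-10-01 09:01', '2019-10-01 09:20', '2019-10-01 09:59',
        '2019-10-01 09:30', '2019-10-01 11:00',
    ]),
})

# Round each timestamp down to the start of its hour
toy['hour'] = toy['created_at'].dt.floor('h')

# Number of tweets per account in each hour
tweets_per_hour = toy.groupby(['user.screen_name', 'hour']).size()

# Each account's busiest hour
busiest_hour = tweets_per_hour.groupby(level='user.screen_name').max()

THRESHOLD = 2  # hypothetical cut-off, tiny only because the toy data is tiny
flagged = busiest_hour[busiest_hour > THRESHOLD].index.tolist()
print(flagged)
```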

To ensure Pandas understands that the information in a column is a date, we convert it into date format...

In [None]:
df['created_at'] = pd.to_datetime(df['created_at']) #easy!


We then want to group our data into periods of time. There is no point grouping our data just on the 'created_at' column, because every time stamp will be slightly different by a second or two. Grouping by time needs a special object called a `Grouper`.

First we create a grouper. We provide it with two arguments:
- The `key` which is the column you want to group by
- The `freq` which specifies the time period you want to group by for example 'd' for day, or 'h' for hour.
- You can see all the options for freq [here in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)
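
To make the effect of `freq` concrete, here is a small sketch on made-up timestamps, grouped once by day and once by hour. The toy DataFrame and the variable names are illustrative only. Note that time periods with no tweets still appear in the result, with a count of 0.

```python
import pandas as pd

# Five made-up tweets: four on 1 October, one on 2 October
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'created_at': pd.to_datetime([
        '2019-10-01 09:15:00',
        '2019-10-01 09:45:00',
        '2019-10-01 10:05:00',
        '2019-10-01 12:30:00',
        '2019-10-02 09:00:00',
    ]),
})

# One bin per calendar day
daily_counts = toy.groupby(pd.Grouper(key='created_at', freq='D'))['id'].count()

# One bin per hour, including the empty hours in between
hourly_counts = toy.groupby(pd.Grouper(key='created_at', freq='h'))['id'].count()

print(daily_counts)
print(hourly_counts.head())
```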


In [None]:
time_grouper = pd.Grouper(key='created_at', freq='d')
df.groupby(time_grouper).count()['id'].plot()

In [None]:
# depending on the result it might be better to only look at recent tweets,
# here those posted after 1 October 2019

last_7_days_filter = df['created_at'] > '01 October 2019'
last_7_days = df[last_7_days_filter]

In [None]:
# try again

last_7_days.groupby(time_grouper).count()['id'].plot()

# Try your own explorations below