# SC207 - Session 7
# APIs - Exploring and Summarising Twitter Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">


What kinds of exploratory analysis can we run on social media data? This session works through examples of the insights you can draw from social media data, and shows how to present the results.

[Tweepy Documentation](http://docs.tweepy.org/en/stable/)

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
filename = 'example_twitter_data.pkl'

tweets = pd.read_pickle(filename) 

In [None]:
tweets.head()

## 1. Summarising your Sample

### 1.1 How many Tweets did I get?

In [None]:
# Tweet ids are unique so we can count the number of unique Tweets using nunique


### 1.2 When are they from?
Twitter can be a very active place, so a dataset that sounds large may only cover the last 3 minutes. To understand the time span our dataset covers we can use the `created_at` column.

In [None]:
tweets['created_at']

Currently these are just strings, so Pandas doesn't know how to interpret them. If we convert them to a special type called `datetime` Pandas will be able to better handle them.
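As a rough illustration of the conversion on toy data (not the session dataset), the sketch below parses a few strings with `pd.to_datetime`. The timestamp layout and the explicit `format` string are assumptions based on the classic Twitter API `created_at` style; if your strings look different, calling `pd.to_datetime` without `format` and letting Pandas infer the layout may be enough.

```python
import pandas as pd

# Toy strings, assumed to follow the classic Twitter `created_at` layout.
raw_times = pd.Series([
    'Wed Oct 10 20:19:24 +0000 2018',
    'Wed Oct 10 20:21:03 +0000 2018',
    'Wed Oct 10 21:02:47 +0000 2018',
])

# Convert the strings to timezone-aware datetimes.
parsed_times = pd.to_datetime(raw_times, format='%a %b %d %H:%M:%S %z %Y')

# Once converted, Pandas can do date arithmetic, e.g. the span the sample covers.
time_span = parsed_times.max() - parsed_times.min()

print(parsed_times.dtype)
print(time_span)
```

With a real datetime column, `min()`, `max()` and `describe()` report the earliest and latest tweet rather than treating the values as text.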

In [None]:
tweets['created_at'] = 

In [None]:
tweets['created_at']

In [None]:
tweets['created_at'] # describe the column

### 1.3 How often is it being Tweeted?

Whilst our time info is to the second, it is more intuitive to see larger trends by the minute or hour. Grouping by time needs a special object called a `Grouper`.

First we create a grouper. We provide it two arguments:
- The `key`, which is the column you want to group by.
- The `freq`, which specifies the time period you want to group by, for example 'D' for day, 'h' for hour, or 'min' for minute.

You can see all the options for `freq` [here in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
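A minimal sketch of the idea on toy data: the frame `toy`, its values, and the column name `n_tweets` are invented for illustration, and the session dataset may need different choices.

```python
import pandas as pd

# Toy frame with a datetime column; values are made up for illustration.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'created_at': pd.to_datetime([
        '2021-03-01 10:05:00',
        '2021-03-01 10:45:00',
        '2021-03-01 11:15:00',
        '2021-03-01 13:30:00',
    ]),
})

# Group rows into hourly bins based on the `created_at` column.
toy_grouper = pd.Grouper(key='created_at', freq='h')

# Count the tweets per bin and turn the result back into a DataFrame.
toy_counts = toy.groupby(toy_grouper)['id'].count().reset_index(name='n_tweets')

print(toy_counts)
```

Changing `freq` to `'min'` or `'D'` regroups the same rows by minute or by day without touching the rest of the code.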


In [None]:
time_grouper = 
count_per_hour = 
count_per_hour.head()

In [None]:
# Make a relplot



### 1.4 What types of activity are we seeing?
Is this all uniquely written material, or is this a lot of people engaging with the content of others?

On Twitter there are retweets. Retweets are where you reproduce and recirculate someone's tweet directly. Though not a hard rule, retweets tend to indicate agreement and can inflate the amount of attention a topic gets.
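One possible way to tell retweets from original tweets is sketched below on toy data. It assumes, as in this dataset, that a retweet carries the tweet it reproduces in a `retweeted_status` column and that original tweets leave that column empty. The frame `toy` and the column name `is_retweet` are made up for illustration.

```python
import pandas as pd

# Toy data: only the second row has something in `retweeted_status`.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'retweeted_status': [None, {'id': 99, 'full_text': 'original'}, None],
})

# .isna() is True where the cell is empty; ~ inverts it,
# so the flag is True where the row is a retweet.
toy['is_retweet'] = ~toy['retweeted_status'].isna()

# How many rows of each type are there?
type_counts = toy['is_retweet'].value_counts()

print(toy[['id', 'is_retweet']])
print(type_counts)
```

`toy['retweeted_status'].notna()` would give the same flag without the inversion step.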

In [None]:
# We don't have a column for whether or not a collected tweet is a retweet, but we can make one.
# If we check the dataframe info we can see that 'retweeted_status' has fewer non-null values than our total number of tweets.
# If a tweet has data in 'retweeted_status' it is itself a retweet. If not, it's an original tweet.

tweets.info()

In [None]:
# We can ask what rows are empty at a particular column using .isna() 
# We then invert that result (True becomes False and False becomes True) by using tilde ~

is_retweet = 
is_retweet

In [None]:
tweets['is_retweet_status'] = 

In [None]:
tweet_type_data = 
tweet_type_data.head()

In [None]:
time_grouper = 
tweet_type_data_to_plot = 
tweet_type_data_to_plot.head()

In [None]:
# create a relplot



## 2. Unpacking our Data further
To better understand the trend we're examining we can do some qualitative and quantitative work, but first we need to know how to dig into the data to unearth specific information. It also helps to unpack any tweets nested in the data as retweets and add them to our dataset at this point, so we have full access to any retweets that were particularly popular.


### 2.1 Unpacking Retweets

In [None]:
rt_filter = 
retweet_df = 

tweets = 
len(tweets)

In [None]:
# We could even turn this into a function...

def extract_original_tweets(df):
    rt_filter = ~df.retweeted_status.isna()
    retweet_df = pd.DataFrame( df[rt_filter]['retweeted_status'].tolist() )

    # DataFrame.append was removed in pandas 2.0, so we combine the frames with pd.concat
    df = pd.concat([df, retweet_df]).drop_duplicates('id').reset_index(drop=True)
    
            
    return df

### 2.2 Unpacking Nested Data
To examine the most popular tweets, see what kinds of hashtags are being shared most widely, and see what users are being drawn into the discussion we need to unpack this from the nested data.

As a lot of our data is retweets, it may be better to drop these and retain only original tweets to examine the kinds of activity being produced by people.

In [None]:
original_filter = 
original_tweets = 
original_tweets.info()

In [None]:
# Examining the first row's 'user' column shows us the complex dictionary object containing all the user info.
original_tweets.loc[0,'user']

In [None]:
# Similarly for entities
original_tweets.loc[0,'entities']

We need to use a couple of functions to unpack our nested columns.
- `.to_dict(orient='records')` translates a dataframe into a list of dictionaries, each containing their own nested dictionaries
- `pd.json_normalize` can create a DataFrame from a list of dictionaries, and flattens out nested dictionaries into their own columns.
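To see how the two calls fit together, here is a small sketch on toy data. The helper name `flatten_records` and the toy frame are hypothetical, and this is only one way the pieces could be combined.

```python
import pandas as pd

# Toy frame with one nested dictionary column, loosely modelled on 'user'.
toy = pd.DataFrame({
    'id': [1, 2],
    'user': [
        {'screen_name': 'alice', 'followers_count': 10},
        {'screen_name': 'bob', 'followers_count': 25},
    ],
})

def flatten_records(df):
    """Flatten nested dictionary columns into dotted column names."""
    # Step 1: one dictionary per row, nested dictionaries left intact.
    records = df.to_dict(orient='records')
    # Step 2: rebuild a DataFrame, spreading nested keys into their own columns.
    return pd.json_normalize(records)

flat = flatten_records(toy)
print(flat.columns.tolist())
```

Nested keys become column names joined by a dot, such as `user.screen_name`. Note that the rebuilt frame gets a fresh 0-based index, so any original index labels are lost.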

In [None]:
# create flatten_nested_dicts function

original_tweets = flatten_nested_dicts(original_tweets)
original_tweets.info(verbose=True,show_counts=True)

In [None]:
# We won't need all of this, let's just grab a subset of columns
keep_columns = ['id','full_text','favorite_count','retweet_count',
                'user.screen_name', 'entities.hashtags','entities.user_mentions']

original_tweets = original_tweets[keep_columns]
original_tweets.head()

### 2.3 What is the most popular content?

In [None]:
def tweet_url(screen_name, tweet_id):
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"

def print_top_tweets(data, sort_by):
    top_favs = data.sort_values(by=sort_by, ascending=False).head(5)

    for index_number, row in top_favs.iterrows():

        print('*'*10)
        print("INDEX:", index_number)
        print("USER:", row['user.screen_name'])
        print("FAV:", row['favorite_count'])
        print("RT:", row['retweet_count'])
        print(row['full_text'])
        print(tweet_url(screen_name=row['user.screen_name'], tweet_id=row['id']))

In [None]:
print_top_tweets(original_tweets, 'favorite_count')

In [None]:
print_top_tweets(original_tweets, 'retweet_count')

### 2.4 Most Popular #Hashtags
Examining the hashtags of your data can give you a sense of the discourses around a particular topic, and inform you of connectivity to other issues. The first step is to get the hashtags out of their nested data structure.

Each entry in `entities.hashtags` is a list. If the list is not empty it contains one dictionary per hashtag, and the `text` value of each dictionary is what we actually want.
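The structure is easier to see on a toy example. The sketch below uses invented data and pulls out `text` with a plain `.map`, which is a shortcut chosen for illustration and not necessarily the route the session takes.

```python
import pandas as pd

# Toy data: each row holds a list of hashtag dictionaries; the list may be empty.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'entities.hashtags': [
        [{'text': 'python', 'indices': [0, 7]}, {'text': 'data', 'indices': [8, 13]}],
        [],
        [{'text': 'python', 'indices': [5, 12]}],
    ],
})

# explode gives each list item its own row; empty lists become NaN and are dropped.
exploded = toy.explode('entities.hashtags').dropna(subset=['entities.hashtags'])

# Pull the 'text' value out of each hashtag dictionary.
exploded = exploded.assign(tag=exploded['entities.hashtags'].map(lambda h: h['text']))

# Count how often each hashtag appears.
tag_counts = exploded['tag'].value_counts()
print(tag_counts)
```

Tweet 1 appears on two rows after exploding because it has two hashtags, which is why the `id` column is kept alongside the tags.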

In [None]:
hashtag_data = original_tweets[['id','entities.hashtags']].copy()
hashtag_data.head()

In [None]:
# exploding the dataset puts each item of the nested list into its own row.
# This means some Tweets will have multiple rows if they have multiple hashtags.

exp_hashtag_data =  # dropna in case any are empty
exp_hashtag_data

In [None]:
# Again we have nested dicts so we can use our function from earlier
flat_hashtags = 
flat_hashtags

In [None]:
# We can quickly check top hashtags like so...



In [None]:
# or plot them...
plot_tag_data = 
plot_tag_data = 
plot_tag_data

In [None]:
sns.set(rc={"figure.figsize":(8, 6)}) # barplot is an axes-level function with no height argument, so we set the figure size here

# barplot



In general Twitter activity is very skewed. Very few tweets get any attention, and those that do tend to then dominate the discourse. If we examine the top 50 most favourited tweets we can see that a few tweets get way ahead of the competition.

### 2.5 Most Mentioned Users
Similarly we can see what users are most mentioned. Often when big issues hit Twitter, particular key individuals get drawn in as people use their handles to draw their attention to it. Significant amounts of mentioning may also indicate centrality of that user in the wider debate.

In [None]:
# The steps are the same as above... let's speedrun it!

mention_data = original_tweets[['id','entities.user_mentions']].copy()
mention_data.head()

In [None]:
exp_mention_data = mention_data.explode('entities.user_mentions').dropna()
exp_mention_data

In [None]:
# Again we have nested dicts so we can use our function from earlier
flat_mentions = flatten_nested_dicts(exp_mention_data)
flat_mentions

In [None]:
flat_mentions['entities.user_mentions.screen_name'].value_counts().head(20)

In [None]:
# or plot them...
plot_user_data = flat_mentions['entities.user_mentions.screen_name'].value_counts().head(20)
# Name the index 'name' and the counts 'freq'; this works whatever column names value_counts produces in your pandas version
plot_user_data = plot_user_data.rename_axis('name').reset_index(name='freq')
plot_user_data

In [None]:
sns.set(rc={"figure.figsize":(8, 6)}) # barplot is an axes-level function with no height argument, so we set the figure size here
sns.barplot(x='freq', y='name', data=plot_user_data)
