# SC207 - Session 7
# APIs - Exploring and Summarising Twitter Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">


What kinds of exploratory analysis can we run on social media data? This session works through examples of the insights such analysis can yield, and how to present the results.

[Tweepy Documentation](http://docs.tweepy.org/en/stable/)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
filename = 'trhr.json'

tweets = pd.read_json(filename)

In [None]:
tweets.head()

In [None]:
tweets.info()

## 1. How many Tweets did I get?

In [None]:
# Tweet ids are unique so we can count the number of unique Tweets using nunique
tweets['id'].nunique()
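
A collection can contain the same Tweet more than once, so it can help to compare the unique count with the raw row count. The snippet below is a minimal sketch on a made-up toy DataFrame (the `toy` name and its values are assumptions for illustration, not part of the session dataset).

```python
import pandas as pd

# Toy data standing in for the real `tweets` DataFrame (values are made up)
toy = pd.DataFrame({'id': [101, 102, 102, 103],
                    'text': ['a', 'b', 'b', 'c']})

n_rows = len(toy)                 # total rows, duplicates included
n_unique = toy['id'].nunique()    # distinct Tweet ids
n_duplicates = n_rows - n_unique  # rows that repeat an id already seen

print(n_rows, n_unique, n_duplicates)

# Keep only the first occurrence of each id
deduplicated = toy.drop_duplicates(subset='id')
```

If the two counts match on your own data, there is nothing to deduplicate.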

## 2. When are they from?
Twitter can be a very active place, meaning that whilst it may sound like you have a lot of data it may only represent the last 3 minutes. To better understand our dataset we can use the `created_at` column.

In [None]:
tweets['created_at']

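If these values were loaded as plain strings, Pandas doesn't know how to interpret them as dates. Converting them to a special type called `datetime` lets Pandas handle them properly. (`pd.read_json` often parses a column named `created_at` automatically, in which case the conversion below changes nothing.)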

In [None]:
tweets['created_at'] = pd.to_datetime(tweets['created_at'])

In [None]:
tweets['created_at']

In [None]:
# pandas 1.1 to 1.5 needed datetime_is_numeric=True here; pandas 2.0 removed the argument
tweets['created_at'].describe()
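
To see how much time a dataset covers, you can subtract the earliest timestamp from the latest. This is a small sketch on two made-up timestamps written in the ISO style Twitter uses; the `raw`, `parsed` and `span` names are assumptions for illustration.

```python
import pandas as pd

# Two made-up timestamps as plain strings
raw = pd.Series(['2021-03-01T10:15:30.000Z', '2021-03-01T11:45:00.000Z'])

# After conversion the values are timezone-aware datetimes (the trailing Z means UTC)
parsed = pd.to_datetime(raw)

# Subtracting datetimes gives a Timedelta: the period the data covers
span = parsed.max() - parsed.min()
print(span)
```

On the real data the same idea would be `tweets['created_at'].max() - tweets['created_at'].min()`.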

## 3. How often is it being Tweeted?

Whilst our time info is to the second, it is more intuitive to see larger trends by the minute or hour. Grouping by time needs a special object called a `Grouper`.

First we create a grouper, giving it two arguments:
- The `key`, which is the column you want to group by.
- The `freq`, which specifies the time period you want to group by, for example 'd' for day, 'h' for hour, or 'min' for minute.

You can see all the options for `freq` [here in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
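
One useful property of grouping by time with a `Grouper` is that periods with no Tweets still appear, with a count of zero. The sketch below shows this on a made-up toy DataFrame (`toy` and `hourly` are illustrative names only).

```python
import pandas as pd

# Four made-up Tweets: two in the 10:00 hour, none at 11:00, two at 12:00
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'created_at': pd.to_datetime(['2021-03-01 10:05', '2021-03-01 10:40',
                                  '2021-03-01 12:10', '2021-03-01 12:55']),
})

# Group into hourly bins and count the ids that fall in each bin
hourly = toy.groupby(pd.Grouper(key='created_at', freq='h'))['id'].count()
print(hourly)
```

The empty 11:00 hour is kept in the result, which means a line plot of the counts drops to zero there rather than skipping the gap.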


In [None]:
time_grouper = pd.Grouper(key='created_at', freq='h')
count_per_hour = tweets.groupby(time_grouper,as_index=True)['id'].count().reset_index()
count_per_hour.head()

In [None]:
sns.relplot(x = 'created_at',
            y = 'id',
            height = 5, aspect = 3,
            kind = 'line', lw = 3,
            data = count_per_hour)

## 4. What types of activity are we seeing?
Is this all uniquely written material, or is this a lot of people engaging with the content of others?

Tweets perform different functions.
- They might be an utterance into the world completely unconnected from anyone else.
- They might be a retweet, amplifying and reposting what someone else has said.
- They might be a quote, embedding someone else's Tweet but adding their own commentary.
- They might be a reply to another Tweet, acting as a conversation.
- Notably, a single Tweet may be several of these things at once...

In [None]:
tweets['referenced_tweets'].loc[3]

In [None]:
# Here we build a function that checks which classifications apply to a Tweet, then chooses the most fitting one based on a hierarchy.

In [None]:
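def check_type(row):
    # Tweets that reference nothing arrive as None/NaN rather than a list
    if not isinstance(row, list):
        return 'unconnected'
    types = [item['type'] for item in row]
    # Hierarchy: retweet first, then quote, then reply
    if 'retweeted' in types:
        return 'retweet'
    elif 'quoted' in types:
        return 'quoted'
    elif 'replied_to' in types:
        return 'reply_to'
    # Fallback so an empty or unrecognised list never returns None
    return 'unconnected'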

In [None]:
tweets['type'] = tweets['referenced_tweets'].apply(check_type)
tweets['type']
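
Before plotting over time, a quick overall breakdown can answer the question of how much of the data is original material. This is a minimal sketch on a made-up Series of type labels (`types` and `shares` are illustrative names); on the real data you would call the same method on `tweets['type']`.

```python
import pandas as pd

# Made-up type labels standing in for tweets['type']
types = pd.Series(['retweet', 'retweet', 'quoted', 'unconnected', 'retweet'],
                  name='type')

# normalize=True returns proportions instead of raw counts
shares = types.value_counts(normalize=True)
print(shares)
```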

In [None]:
tweet_type_data = tweets[['id','created_at','type']].copy()
tweet_type_data.head()

In [None]:
time_grouper = pd.Grouper(key='created_at', freq='h')
tweet_type_data_to_plot = tweet_type_data.groupby([time_grouper,'type'])['id'].count().reset_index()
tweet_type_data_to_plot.head()

In [None]:
sns.relplot(x = 'created_at',
            y = 'id',
            hue='type',
            height = 5, aspect = 3,
            kind = 'line', lw = 3,
            data = tweet_type_data_to_plot)


## 5. What is the most popular content?

We need to use a couple of functions to unpack our nested columns.
- `.to_dict(orient='records')` translates a dataframe into a list of dictionaries, each containing their own nested dictionaries
- `pd.json_normalize` can create a Dataframe from a list of dictionaries, and flattens out nested dictionaries into their own columns.
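
The two steps above can be tried on a tiny example first. This sketch uses a made-up two-row DataFrame with one nested column (`toy`, `records` and `flat` are illustrative names).

```python
import pandas as pd

# A toy DataFrame where one column holds dictionaries
toy = pd.DataFrame({
    'id': [1, 2],
    'public_metrics': [{'like_count': 5, 'retweet_count': 2},
                       {'like_count': 0, 'retweet_count': 7}],
})

# Step 1: one dictionary per row, nested dictionaries left intact
records = toy.to_dict(orient='records')

# Step 2: nested keys become their own columns, joined with a dot
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

The nested keys end up as columns named `public_metrics.like_count` and `public_metrics.retweet_count`, which is the naming pattern used in the rest of this section.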

In [None]:
def flatten_nested_dicts(df):
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened

tweets = flatten_nested_dicts(tweets)
tweets.info(verbose=True,show_counts=True)

In [None]:
# We won't need all of this, let's just grab a subset of columns
keep_columns = ['id','text',
                'context_annotations',
                'public_metrics.like_count',
                'public_metrics.retweet_count',
                'public_metrics.quote_count',
                'public_metrics.reply_count',
                'user_username', 'entities.hashtags','entities.mentions']

tweets = tweets[keep_columns]
tweets.head()

In [None]:
def tweet_url(screen_name, tweet_id):
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"

def print_top_tweets(data, sort_by):
    # Five highest rows for whichever metric was requested
    top_tweets = data.sort_values(by=sort_by, ascending=False).head(5)

    for index_number, row in top_tweets.iterrows():

        print('*'*10)
        print("INDEX:", index_number)
        print("USER:", row['user_username'])
        print("REPLIES:", row['public_metrics.reply_count'])
        print("LIKE:", row['public_metrics.like_count'])
        print("RT:", row['public_metrics.retweet_count'])
        print(row['text'])
        print(tweet_url(screen_name=row['user_username'], tweet_id=row['id']))

In [None]:
print_top_tweets(tweets, 'public_metrics.like_count')

In [None]:
print_top_tweets(tweets, 'public_metrics.retweet_count')

In [None]:
print_top_tweets(tweets, 'public_metrics.reply_count')


## 6. Most Popular #Hashtags
Examining the hashtags of your data can give you a sense of the discourses around a particular topic, and inform you of connectivity to other issues. The first step is to get the hashtags out of their nested data structure.

Each entry in `entities.hashtags` is a list which, if it is not empty, contains a set of dictionaries. One value in each dictionary, the `tag` value, is what we actually want.
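
The whole unpacking process can be sketched on a toy DataFrame before running it on the real data. The rows below are made up, and the `start` and `end` values are placeholders; only the overall shape (a list of dictionaries with a `tag` key, or nothing at all) is meant to mirror the real column.

```python
import pandas as pd

# Three made-up Tweets: two hashtags, no hashtags, one hashtag
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'entities.hashtags': [
        [{'start': 0, 'end': 5, 'tag': 'news'},
         {'start': 6, 'end': 13, 'tag': 'python'}],
        None,
        [{'start': 3, 'end': 8, 'tag': 'news'}],
    ],
})

# One row per hashtag; the Tweet with no hashtags is dropped
exploded = toy.explode('entities.hashtags').dropna()

# Flatten each hashtag dictionary into its own columns
flat = pd.json_normalize(exploded.to_dict(orient='records'))

counts = flat['entities.hashtags.tag'].value_counts()
print(counts)
```

Tweet 1 contributes two rows after exploding, which is why the `id` column is kept alongside the hashtags.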

In [None]:
hashtag_data = tweets[['id','entities.hashtags']]
hashtag_data.head()

In [None]:
# exploding the dataset puts each item of the nested list into its own row.
# This means some Tweets will have multiple rows if they have multiple hashtags.

exp_hashtag_data = hashtag_data.explode('entities.hashtags').dropna() # in case any are empty
exp_hashtag_data

In [None]:
# Again we have nested dicts so we can use our function from earlier
flat_hashtags = flatten_nested_dicts(exp_hashtag_data)
flat_hashtags

In [None]:
# We can quickly check top hashtags like so...

flat_hashtags['entities.hashtags.tag'].value_counts().head(20)

In [None]:
# or plot them...
top_20 = flat_hashtags['entities.hashtags.tag'].value_counts().head(20).index
top_20

In [None]:
sns.catplot(y='entities.hashtags.tag', data=flat_hashtags, order=top_20,
            height=5, aspect=1.5, kind='count')

## 7. Most Mentioned Users
Similarly we can see which users are mentioned most. Often when big issues hit Twitter, particular key individuals get drawn in as people use their handles to draw their attention to it. Significant amounts of mentioning may also indicate centrality of that user in the wider debate.

In [None]:
# The steps are the same as above... let's speedrun it!

mention_data = tweets[['id','entities.mentions']]
mention_data.head()

In [None]:
exp_mention_data = mention_data.explode('entities.mentions').dropna()
exp_mention_data

In [None]:
# Again we have nested dicts so we can use our function from earlier
flat_mentions = flatten_nested_dicts(exp_mention_data)
flat_mentions

In [None]:
flat_mentions['entities.mentions.username'].value_counts().head(20)

In [None]:
top_20 = flat_mentions['entities.mentions.username'].value_counts().iloc[:20].index
top_20

In [None]:
sns.catplot(y='entities.mentions.username', data=flat_mentions, order=top_20, kind='count', height=5, aspect=1.5)

## 8. Most Popular Topics, Figures, Entities etc.
Twitter also provides us with 'context annotations'. These are keywords assigned to each Tweet by Twitter's AI models that tell us a little about the Tweet. The classifications are not always accurate, but they may be indicative of certain trends or topics.
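
Each Tweet's `context_annotations` value is a list of dictionaries, where every dictionary pairs a `domain` (the kind of thing) with an `entity` (the specific thing). The sketch below uses made-up annotations, with invented ids and names, to show how exploding and flattening turns them into a table.

```python
import pandas as pd

# Made-up annotations for three Tweets; ids and names are placeholders
toy = pd.Series([
    [{'domain': {'id': '10', 'name': 'Person'},
      'entity': {'id': '1', 'name': 'Example Person'}},
     {'domain': {'id': '47', 'name': 'Brand'},
      'entity': {'id': '2', 'name': 'Example Brand'}}],
    None,  # a Tweet with no annotations
    [{'domain': {'id': '10', 'name': 'Person'},
      'entity': {'id': '1', 'name': 'Example Person'}}],
], name='context_annotations')

# One annotation per row, drop Tweets without any, then flatten
annotations = pd.json_normalize(toy.explode().dropna().tolist())

domain_counts = annotations['domain.name'].value_counts()
print(domain_counts)
```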

In [None]:
annotations = tweets['context_annotations'].explode()
annotations = pd.json_normalize(annotations.dropna())
annotations

In [None]:
annotations['domain.name'].value_counts()

In [None]:
annotation_types = ['Person','Politician','Brand','TV Shows','Events [Entity Service]','Interests and Hobbies']

for annot_type in annotation_types:
    subset = annotations[annotations['domain.name'] == annot_type]
    top_20 = subset['entity.name'].value_counts().iloc[:20].index
    # catplot returns a FacetGrid, so we name it g rather than ax
    g = sns.catplot(y='entity.name', data=subset, kind='count', height=5, aspect=1.5, order=top_20)
    g.set(title=f'Top 20 Entities: {annot_type}')
    plt.show()