# SC207 - Session 7
# APIs - Exploring and Summarising Twitter Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">


What kinds of exploratory analysis can we run on social media data? This session works through examples of the insights that can be drawn from social media data, and shows how to present the results.

[Tweepy Documentation](http://docs.tweepy.org/en/stable/)

## Section a) Unpacking your data

In [None]:
import pandas as pd

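def pandas_find_first_retweet(df):
    # Return a copy of the first row whose retweeted_status is not missing
    rt_filter = ~df.retweeted_status.isna()
    return df[rt_filter].iloc[0].copy()


def show_df_breakpoint(df, original_df_len):
    # Show the three rows either side of the point where the new rows were added
    return df.iloc[original_df_len - 3:original_df_len + 3]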

In [None]:
df = pd.read_pickle('example_twitter_data.pkl')

In [None]:
# check the .info() to see an overview of our data
df.info()

## 1. Extracting Original Tweets

Remember, retweeted statuses contain the data of the original tweet as well. If your project would benefit from more data, or you want to include tweets that may be older than the window in which you sampled, then extracting the original tweets from your retweets may be a useful technique for you.
Some projects may not need the original tweets extracted, for example if you only want to examine activity from the period in which you sampled.

##### Presuming we do want to extract them...
If we first look at one row in our DataFrame that represents a retweet that contains the original material posted...

In [None]:
example_retweet = pandas_find_first_retweet(df)
example_retweet

... and we examine the `retweeted_status` column, we can see that it contains a dictionary of the original tweet, just like the ones we originally built the dataset with.

In [None]:
# check the retweeted_status of our example_retweet
example_retweet['retweeted_status']

So we know we have the data in a structure that we can use... in general our steps are...
 1. Filter the dataframe so we are only selecting rows that have retweet data.
 2. Turn the `retweeted_status` column into a simple list such that it is a list of dictionaries, like we originally had when we first made the dataframe.
 3. Turn that list of dictionaries into a DataFrame, and then add it to our existing data
 4. Drop any duplicates (we may already have collected a tweet that was also retweeted)
 

In [None]:
# First let's see how long our DF is so we can see the difference...
original_df_length = len(df)
original_df_length

In [None]:
# 1. Filter the data

rt_filter = ~df['retweeted_status'].isna()
retweets = df[rt_filter]['retweeted_status']
retweets


In [None]:
# 2. Turn the column into a list of dictionaries

retweet_list = retweets.to_list()

# at this point we can check how many retweets we have
print(len(retweet_list))

retweet_list[:1] # if we examine the first item we can see that we have our list of dictionaries


In [None]:
# 3. Turn it into a DataFrame and add it to our existing data
retweet_df = pd.DataFrame(retweet_list)
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
df = pd.concat([df, retweet_df])

In [None]:
print(original_df_length)
print(len(df))

In [None]:
# 4. Drop any duplicates
df = df.drop_duplicates('id')
len(df)

Finally we reset the index. Why? 
- Because we've stuck two DataFrames together, each with its own index running from 0 to however long it is. By adding one to the end of the other we now have rows that share the same index label...
- Because dropping duplicates doesn't reset the index, so there will be holes in our dataframe index.
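
To make the two points above concrete, here is a minimal sketch on two tiny hand-made DataFrames. The names `tweets` and `originals` and their contents are invented for illustration; they stand in for the sampled tweets and the extracted originals.

```python
import pandas as pd

# Two tiny hand-made frames; the ids and text are invented for illustration
tweets = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
originals = pd.DataFrame({"id": [3, 4], "text": ["c", "d"]})

# Sticking them together keeps each frame's own index labels
combined = pd.concat([tweets, originals])
print(combined.index.tolist())

# Dropping duplicates removes rows but leaves the remaining labels untouched
deduped = combined.drop_duplicates("id")
print(deduped.index.tolist())

# Selecting by a repeated label now returns more than one row
print(deduped.loc[1])
```

Because the label `1` is shared by two different tweets, selecting by label no longer identifies a single row.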


In [None]:
# This is a custom function built for teaching purposes (you can see how it works at the beginning of the notebook)
show_df_breakpoint(df,original_df_length)

You can also confirm this with `.info()`. Look for the mismatch in the line that begins...
> Int64Index: ....

How can the index be shorter than the number of entries we have?

In [None]:
# check the info for discrepancies between records and index length
df.info()

Having a broken index like this will cause problems, so we call `.reset_index(drop=True)`. Setting `drop=True` discards the original index completely; otherwise it gets added to the DataFrame as another column.
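
As a small sketch of the difference `drop` makes, the example below uses a hypothetical three-row DataFrame called `broken` with a repeated index label. It is not our Twitter data, just a stand-in with the same kind of problem.

```python
import pandas as pd

# Hypothetical frame with a repeated index label, invented for illustration
broken = pd.DataFrame({"id": [1, 2, 4]}, index=[0, 1, 1])

# Without drop, the old labels are saved in a new column called 'index'
kept = broken.reset_index()
print(kept.columns.tolist())

# With drop=True, the old labels are discarded entirely
dropped = broken.reset_index(drop=True)
print(dropped.columns.tolist())
print(dropped.index.tolist())
```

In both cases the new index runs from 0 upwards with no repeats; the only difference is whether the old labels are kept as data.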

In [None]:
# if the label 0 is repeated, this may return more than one row
df.loc[0]

In [None]:
# reset the index, discarding the old labels
df = df.reset_index(drop=True)

In [None]:
# recheck the info
df.info()

As with much of what we do, we've broken the above into steps to better explain it, however it could reasonably be done in a few lines...

In [None]:
# set the filter
rt_filter = ~df['retweeted_status'].isna()

# filter the dataframe, convert the retweeted_status column to a list and wrap it in a new dataframe
retweet_df = pd.DataFrame( df[rt_filter]['retweeted_status'].tolist() )

# add the new dataframe to the end of the original, drop duplicates and reset the index
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
df = pd.concat([df, retweet_df]).drop_duplicates('id').reset_index(drop=True)

len(df)

In [None]:
# We could even turn this into a function...

# name the function extract_original_tweets

def extract_original_tweets(df):
    rt_filter = ~df['retweeted_status'].isna()
    retweet_df = pd.DataFrame( df[rt_filter]['retweeted_status'].tolist() )
    # DataFrame.append was removed in pandas 2.0, so we use pd.concat
    df = pd.concat([df, retweet_df]).drop_duplicates('id').reset_index(drop=True)

    return df
    

In [None]:
# if we reload from disk to reset everything we can see if our function works
df = pd.read_pickle('example_twitter_data.pkl')
print(len(df))

df = extract_original_tweets(df)

print(len(df))

## 2. Unpacking Nested Data
Some of our other columns contain nested data, such as our user column.

In [None]:
df.loc[0,'user']

Whilst there are a variety of ways to do this, and we could just unpack the columns we need, it is ultimately simpler to unpack everything with two commands.
- `.to_dict(orient='records')` translates a DataFrame into a list of dictionaries, one per row, each still containing its own nested dictionaries.
- `pd.json_normalize` creates a DataFrame from a list of dictionaries, and flattens out nested dictionaries into their own columns.

This will take a while, but it is worth doing once, after extracting all the original tweets. You can then save the result to disk and work with the fully 'blown-up' DataFrame from then on.
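
Before running this on the full dataset, it can help to see the two commands on a couple of hand-made records. This is only a sketch: the `records` list below is invented, and a real `user` dictionary has many more fields than the two shown here.

```python
import pandas as pd

# Hand-made records mimicking the nested 'user' dictionary; values are invented
records = [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice", "followers_count": 10}},
    {"id": 2, "text": "world", "user": {"screen_name": "bob", "followers_count": 25}},
]

# The 'user' column of this frame holds whole dictionaries
nested = pd.DataFrame(records)
print(nested.columns.tolist())

# Round-trip through a list of dictionaries and flatten the nested ones
flat = pd.json_normalize(nested.to_dict(orient="records"))
print(flat.columns.tolist())
```

Each nested key becomes its own column, named with the parent column and the key joined by a dot, such as `user.screen_name`.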

In [None]:
df_dicts = df.to_dict(orient='records')


In [None]:
# create a new dataframe using json_normalize on the list of dictionaries
df = pd.json_normalize(df_dicts)

In [None]:
# check the info, we might need to set some arguments
# show_counts replaces the older null_counts argument, which was removed in pandas 2.0
df.info(verbose=True, show_counts=True)
# df.info()

In [None]:
# save our unpacked version to a new pickle file

df.to_pickle('example_twitter_data_unpacked.pkl')