# 05_epl_tweet_analysis
In addition to the analysis we conducted for the World Cup, we are repeating a similar tweet-level analysis, with a focus on URLs, for the recent round of Premier League fixtures. Specifically, we collected tweets for:
- Liverpool vs Chelsea (21/01/23, 12:30 GMT)
- Arsenal vs Man United (22/01/23, 16:30 GMT)

For both of these games, we collected tweets for one hour before and after kickoff as well as during the game.

This analysis will be structured as follows:
- summary statistics: number of tweets, volume over time, comparison of the two datasets, etc.
- extracting URLs and expanding them

NL, 23/01/23

### IMPORTS

In [29]:
import os
import pandas as pd
import json
from tqdm import tqdm

### PATHS & CONSTANTS

In [7]:
TWEETS_PATH = '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/'
DAY_1_PATH = TWEETS_PATH+'210123/'
DAY_2_PATH = TWEETS_PATH+'220123/'

In [4]:
EXPORT_PATH = '/home/nikloynes/projects/world_cup_misinfo_tracking/data/exports/epl_tweets/'

### INIT

In [5]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

### THE THING!

Sorting out and reading in tweet & URL data... 

In [19]:
tweet_files = os.listdir(DAY_1_PATH)

In [20]:
tweet_files.remove('epl_tweets_2023_01_18-15_43_42.json') # data from testing
tweet_files.remove('meta') # meta dir
tweet_files.remove('epl_tweets_2023_01_18-15_50_01.json') # data from testing

In [21]:
tweet_files = [DAY_1_PATH+x for x in tweet_files]

In [23]:
tmp = os.listdir(DAY_2_PATH)

In [24]:
tmp.remove('meta')

In [25]:
tweet_files += [DAY_2_PATH+x for x in tmp]

In [26]:
tweet_files.sort()

In [27]:
tweet_files

['/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/210123/epl_tweets_2023_01_21-11_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/210123/epl_tweets_2023_01_21-12_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/210123/epl_tweets_2023_01_21-13_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/210123/epl_tweets_2023_01_21-14_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/220123/epl_tweets_2023_01_22-15_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/220123/epl_tweets_2023_01_22-16_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/220123/epl_tweets_2023_01_22-17_30_01.json',
 '/home/nikloynes/projects/world_cup_misinfo_tracking/data/epl_tweets/220123/epl_tweets_2023_01_22-18_30_02.json']
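The list above was built by listing each directory and removing the test files and the `meta` directory by name. A minimal alternative sketch is shown below. It assumes (as the listing suggests, but the notebook does not state) that genuine collection files are named `epl_tweets_<date>-<time>.json`, with the date matching the match day. The helper `list_tweet_files` and its parameters are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def list_tweet_files(day_dir, date_prefix):
    """Return sorted paths of tweet files in `day_dir` collected on `date_prefix`.

    `date_prefix` is assumed to look like '2023_01_21'. Only regular files whose
    names match 'epl_tweets_<date_prefix>-*.json' are kept, so sub-directories
    such as 'meta' and files from other dates are skipped without naming them.
    """
    pattern = f"epl_tweets_{date_prefix}-*.json"
    return sorted(str(p) for p in Path(day_dir).glob(pattern) if p.is_file())
```

With this helper, the combined list could be written as `list_tweet_files(DAY_1_PATH, '2023_01_21') + list_tweet_files(DAY_2_PATH, '2023_01_22')`. It relies on the file naming convention rather than on a hard-coded list of files to exclude.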

Pulling all tweets and all URL-sharing instances into memory

In [30]:
all_tweets = []
all_urls_timestamps = []

for file in tqdm(tweet_files):
    with open(file, 'r') as infile:
        for line in infile:
            # each line holds one tweet as a JSON object
            tweet = json.loads(line)

            # drop the 'domains' and 'entities' fields before storing the tweet
            tweet.pop('domains', None)
            tweet.pop('entities', None)
            all_tweets.append(tweet)

            # record one (url, timestamp) row per URL shared in this tweet
            for url in tweet.get('urls', []):
                all_urls_timestamps.append({'url': url, 'timestamp': tweet['created_at']})

100%|██████████| 8/8 [00:26<00:00,  3.32s/it]


In [35]:
len(all_tweets)

1141414

In [42]:
all_tweets_df = pd.DataFrame(all_tweets)

In [44]:
all_tweets_df = all_tweets_df.rename(columns={'id' : 'tweet_id'})

In [45]:
all_tweets_df = pd.concat([all_tweets_df.drop(['public_metrics'], axis=1), all_tweets_df['public_metrics'].apply(pd.Series)], axis=1)

In [None]:
all_tweets_df = pd.concat([all_tweets_df.drop(['user'], axis=1), all_tweets_df['user'].apply(pd.Series)], axis=1)

In [None]:
all_tweets_df = all_tweets_df.rename(columns={'id' : 'user_id'})

In [None]:
all_tweets_df = pd.concat([all_tweets_df.drop(['public_metrics'], axis=1), all_tweets_df['public_metrics'].apply(pd.Series)], axis=1)

In [None]:
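# rename by position: assuming columns 1 and 11 are both still named 'created_at' here, a name-based rename would give them the same new name
all_tweets_df.columns = ['tweet_created_at' if i == 1 else 'user_created_at' if i == 11 else col for i, col in enumerate(all_tweets_df.columns)]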
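
The flattening cells above expand the nested `public_metrics` and `user` dicts with `.apply(pd.Series)`, which can be slow on a frame of this size, and then rename columns by position. The sketch below shows one possible alternative using `pd.json_normalize`, which flattens nested dicts in a single pass and makes it possible to name columns by where they came from. The function `flatten_tweets` and the sample records are hypothetical. The user-level fields `id` and `public_metrics.followers_count` are assumptions about the stored user object, which is truncated in the notebook output.

```python
import pandas as pd


def flatten_tweets(tweets):
    """Flatten a list of tweet dicts into one row per tweet.

    Nested dicts become top-level columns. Everything under 'user' gets a
    'user_' prefix, and the tweet-level 'id' and 'created_at' get a 'tweet_'
    prefix, so tweet and user fields cannot end up with the same name.
    """
    df = pd.json_normalize(tweets, sep='.')

    def flat_name(col):
        parts = col.split('.')
        if parts[0] == 'user':
            return 'user_' + parts[-1]
        if col in ('id', 'created_at'):
            return 'tweet_' + col
        return parts[-1]

    return df.rename(columns=flat_name)


# Hypothetical records, shaped like the tweets in this notebook
sample_tweets = [
    {
        'author_id': 'a1',
        'created_at': '2023-01-21T11:30:14.000Z',
        'id': 't1',
        'public_metrics': {'retweet_count': 0, 'like_count': 2},
        'text': 'first tweet',
        'user': {
            'created_at': '2020-09-16T16:10:56.000Z',
            'id': 'a1',
            'public_metrics': {'followers_count': 10},
        },
    },
    {
        'author_id': 'a2',
        'created_at': '2023-01-21T11:30:15.000Z',
        'id': 't2',
        'public_metrics': {'retweet_count': 5, 'like_count': 0},
        'text': 'second tweet',
        'user': {
            'created_at': '2021-12-21T01:18:22.000Z',
            'id': 'a2',
            'public_metrics': {'followers_count': 3},
        },
    },
]

flat_df = flatten_tweets(sample_tweets)
print(sorted(flat_df.columns))
```

List-valued fields such as `urls` and `referenced_tweets` are left as lists by `pd.json_normalize`, so they would still need separate handling.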

A total of `1,141,414` tweets were collected over both data collection periods.

Extracting URLs and sending to a separate process for expansion... 

In [34]:
all_urls_timestamps_df = pd.DataFrame(all_urls_timestamps)
len(all_urls_timestamps_df)

542944

In [36]:
len(all_urls_timestamps)/len(all_tweets)*100

47.56766607033031

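In total, the sample contains `542,944` link-sharing instances across `1,141,414` tweets, i.e. `~47.6` URLs per 100 tweets. Because a single tweet can contain more than one URL, this figure is an upper bound on the share of tweets containing a URL, not that share itself.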
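
The ratio above counts URL instances, not tweets. The sketch below separates the two quantities on a small made-up sample. The helper `url_sharing_summary` is a hypothetical name, and it assumes each tweet dict carries an optional `urls` list, as in the loop above.

```python
def url_sharing_summary(tweets):
    """Summarise URL sharing for a list of tweet dicts with an optional 'urls' list."""
    n_tweets = len(tweets)
    # number of URLs per tweet; a missing or empty 'urls' field counts as zero
    urls_per_tweet = [len(tweet.get('urls') or []) for tweet in tweets]
    n_instances = sum(urls_per_tweet)
    n_tweets_with_url = sum(1 for n in urls_per_tweet if n > 0)
    return {
        'url_instances': n_instances,
        'tweets_with_url': n_tweets_with_url,
        'instances_per_100_tweets': n_instances / n_tweets * 100,
        'pct_tweets_with_url': n_tweets_with_url / n_tweets * 100,
    }


# Made-up sample: one tweet with two URLs, one with a single URL, two without
sample = [
    {'text': 'a', 'urls': ['https://example.com/1', 'https://example.com/2']},
    {'text': 'b', 'urls': ['https://example.com/3']},
    {'text': 'c'},
    {'text': 'd', 'urls': []},
]
summary = url_sharing_summary(sample)
print(summary)
```

In this sample there are 75 URL instances per 100 tweets, but only 50% of tweets contain a URL. Applying the same function to `all_tweets` would give the corresponding share for the full dataset.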

In [37]:
unique_urls_freqs_df = all_urls_timestamps_df.groupby('url').count().reset_index().rename(columns={'timestamp' : 'freq'}).sort_values('freq', ascending=False).reset_index(drop=True)

In [39]:
len(unique_urls_freqs_df)

186024

In [40]:
unique_urls_freqs_df.head()

| | url | freq |
|---|---|---|
| 0 | https://t.co/Dw9ltLW7T7 | 10226 |
| 1 | https://t.co/1Tbu8SP8p9 | 5924 |
| 2 | https://t | 5681 |
| 3 | https://t.co/hnt6oNc0Nw | 3202 |
| 4 | https://t.co/eZnJLPp6B5 | 2810 |


We have a total of `186,024` unique (shortened) URLs. Let us now export these for expansion...

In [41]:
unique_urls_freqs_df.to_csv(EXPORT_PATH+'unique_urls_freqs.csv', index=False)
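
The expansion itself happens in a separate process that is not part of this notebook. Purely as an illustration of what that step involves, the sketch below resolves a shortened URL by letting the standard library follow redirects and reporting the final address. The function `expand_url`, its `opener` parameter, and the timeout value are hypothetical; this is not the expansion code actually used for this project.

```python
import urllib.request


def expand_url(short_url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the URL reached after following redirects, or None if the lookup fails.

    `opener` defaults to urllib.request.urlopen and is injectable so the
    function can be exercised without network access.
    """
    try:
        request = urllib.request.Request(short_url)
        with opener(request, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError):
        # covers URLError/HTTPError, timeouts and malformed URLs
        return None
```

A call such as `unique_urls_freqs_df['url'].head(100).map(expand_url)` would then attempt to resolve the most frequent URLs one by one. Truncated entries that are not valid URLs would either fail to resolve or return an unrelated address.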

#### 1. Users

This is what our tweet dataset looks like:

In [43]:
all_tweets_df.head()

| | author_id | created_at | id | public_metrics | referenced_tweets | text | user | urls | withheld |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1306264256384503810 | 2023-01-21T11:30:14.000Z | 1616760076613685248 | {'retweet_count': 0, 'reply_count': 0, 'like_c... | [{'type': 'replied_to', 'id': '161640912456818... | @1cEphraimCFC Lol trash ke? | {'created_at': '2020-09-16T16:10:56.000Z', 'de... | | |
| 1 | 1313192105439711234 | 2023-01-21T11:30:14.000Z | 1616760076043423745 | {'retweet_count': 191, 'reply_count': 0, 'like... | [{'type': 'retweeted', 'id': '1616758845757607... | RT @TrollFootball: Liverpool vs Chelsea be lik... | {'created_at': '2020-10-05T18:59:41.000Z', 'de... | [https://t.co/2NrxYozag2] | |
| 2 | 1473100428015513604 | 2023-01-21T11:30:15.000Z | 1616760078434201600 | {'retweet_count': 0, 'reply_count': 0, 'like_c... | [{'type': 'quoted', 'id': '1616759107096088576'}] | Klopp Salope t’as chaud ou quoi 😂😂😂🫵🏾 https://... | {'created_at': '2021-12-21T01:18:22.000Z', 'de... | [https://t.co/aEltVjkUt2] | |
| 3 | 1469337076579643392 | 2023-01-21T11:30:15.000Z | 1616760078782332931 | {'retweet_count': 10, 'reply_count': 0, 'like_... | [{'type': 'retweeted', 'id': '1616759242450472... | RT @the_smallie: This Chelsea Vs Liverpool mat... | {'created_at': '2021-12-10T16:04:26.000Z', 'de... | | |
| 4 | 1546112807321600001 | 2023-01-21T11:30:15.000Z | 1616760077574176772 | {'retweet_count': 0, 'reply_count': 0, 'like_c... | [{'type': 'replied_to', 'id': '161669937602545... | @FrankKhalidUK All of them | {'created_at': '2022-07-10T12:43:32.000Z', 'de... | | |


In [None]:
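# per-user tweet counts (stored in 'text') and summed engagement metrics;
# columns are selected before aggregating so that non-numeric columns are never summed
metric_cols = ['retweet_count', 'reply_count', 'like_count', 'quote_count']
all_users_df = all_tweets_df.groupby('author_id')[['text']].count().reset_index().merge(all_tweets_df.groupby('author_id')[metric_cols].sum().reset_index(), on='author_id', how='left')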