# 03_reddit_analysis
Having analysed the data obtained from Twitter, we now move on to Reddit. To make the two analyses comparable, we will conduct this Reddit investigation with identical, or at least similar, units of analysis to our Twitter report. This means we will:

- run sentiment analysis on the text of Reddit posts (not on the linked content)
- produce word clouds and frequency tables for negative and positive words
- look at the URLs contained in our Reddit posts:
    - what proportion point to Reddit vs. outside of Reddit?
    - which URLs / domains are most frequent? (we may have to expand a number of URLs first)
    - finally, how do these compare with the URLs and domains we saw a lot on Twitter? Is there significant overlap?

Overall, we collected top-level posts (i.e. no comments) from 5 subreddits that we considered a priori relevant to the World Cup:
`FIFAWorldCupQATAR22, football, qatar, soccer, worldcup`

NL, 12/01/23

### IMPORTS

In [80]:
import os
import json
import pandas as pd
from dateutil import parser as date_parser
import re
from urllib.parse import urlparse

### PATHS & CONSTANTS

In [3]:
SUBREDDITS = ['FIFAWorldCupQATAR22','football','qatar','soccer','worldcup']
REDDIT_PATH = '/home/nikloynes/projects/world_cup_misinfo_tracking/data/reddit_posts/'
EXPORT_PATH = '/home/nikloynes/projects/world_cup_misinfo_tracking/data/exports/reddit/'

### INIT

In [75]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

### FUNCTIONS

In [83]:
def extract_urls(text:str) -> list:
    '''
    extracts urls from text
    returns all urls in text in a list
    '''
    # raw string, so that the regex escapes \( and \) are not treated as (invalid) string escapes
    urls = re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", text)

    return urls
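
One thing to watch with this pattern: the character class `[$-_@.&+]` is a range that also matches `)`, `]` and `.`, so a URL inside a Markdown link such as `[text](https://...)` comes back with the closing bracket attached. Below is a minimal sketch of one possible clean-up. `extract_urls_clean` is a hypothetical helper, not part of this notebook, and stripping trailing punctuation will also shorten the rare URL that legitimately ends in `)`.

```python
import re

# Same pattern as extract_urls above, compiled once.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_urls_clean(text: str) -> list:
    """Hypothetical variant of extract_urls that strips trailing
    punctuation picked up from Markdown links and sentence endings."""
    return [url.rstrip(").,]") for url in URL_PATTERN.findall(text)]


# Illustrative text only (example.com / example.org are placeholders).
sample_text = "Watch [Live](https://example.com/stream/) now. See https://example.org/a."
raw_urls = URL_PATTERN.findall(sample_text)
clean_urls = extract_urls_clean(sample_text)
```

Here `raw_urls` still carries the trailing `)` and `.`, while `clean_urls` does not. The domain extracted by `urlparse` is usually unaffected when the URL has a path, so this matters mainly when counting distinct URLs rather than domains.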


### THE THING!

Let's list all our reddit files

In [10]:
reddit_files = []

for r in SUBREDDITS:
    reddit_files += [REDDIT_PATH+r+'/'+x for x in os.listdir(REDDIT_PATH+r+'/')]

We have data collected in hourly chunks for each of our 5 subreddits.  
This leaves us with a total of 4096 files. Let's quickly check how many reddit posts that represents.

In [12]:
counter = 0
for file in reddit_files:
    with open(file, 'r') as infile:
        for line in infile:
            counter += 1

In [13]:
counter

18711

Great, so we have collected a total of `18,711` Reddit post records. This is clearly a lot fewer than we got for Twitter, but we followed a completely different strategy: rather than filtering the entirety of Reddit for a set of keywords, we tracked 5 subreddits. This is partly an artefact of how Reddit's API, and indeed the platform itself, works.

OK, let's just refresh our memories -- what does a given reddit post look like in json format? 

In [20]:
for file in reddit_files:
    with open(file, 'r') as infile:
        for line in infile:
            tmp = json.loads(line)
            break
    break

In [15]:
tmp

{'id': 'yyt2cj',
 'created_utc': '2022/11/18 19:58:45',
 'title': 'Watch FIFA World Cup QATAR 2022 Live Stream Online Without Cable.',
 'selftext': "Welcome to **Watch [FIFA World Cup Qatar 2022 Live](https://fifa-world-cup-2022-live-streams.blogspot.com/) Stream Online for Free HD TV** coverage match online from here. Everybody knows that (FIFA World Cup) games means excitation. Depending on the country you in, find available options to watch FIFA World Cup games online. So, you can easily watch FIFA World Cup Qatar 2022 live matches from here.\n\n\n#Watch : [FIFA World Cup Qatar 2022 Live Stream Free](https://fifa-world-cup-2022-live-streams.blogspot.com/)\n\n\nYou can easily watch your favorite teams FIFA World Cup live streaming online free. This is a free FIFA World Cup streaming website that's provide multiple links to watch any FIFA World Cup game live. We offer the best FIFA World Cup live streams link in HD/HQ/4K. Watch the FIFA World Cup all game from your mobile, tablet, Mac

First off, it's quite funny that our very first reddit post by chance seems to be telling the same story we observed via our tweet collections. Ignoring that, however, this gives us a good idea of what we want to aggregate over the entirety of our available reddit data.

In [57]:
on_reddit_domains = [
    'i.redd.it',
    'self.worldcup',
    'self.qatar',
    'v.redd.it',
    'self.football',
    'self.FIFAWorldCupQATAR22',
    'self.soccer',
    'reddit.com',
]

In [87]:
urls_df = pd.DataFrame()
domains_freq = {}
user_overview = []
unique_ids = set() # we've noticed that we've collected some duplicate data. not really a problem, but let's make sure we don't double-count
urls_in_text_df = pd.DataFrame()

for file in reddit_files:
    with open(file, 'r') as infile:
        for line in infile:
            tmp = json.loads(line)
            
            # a set gives constant-time membership checks (a list is scanned on every lookup)
            if tmp['id'] in unique_ids:
                # we have already seen this id. let's move on
                continue
            unique_ids.add(tmp['id'])
              
            # urls
            internal = False
            if tmp['domain'] in on_reddit_domains:
                internal = True
            
            tmp_url = {
                'url' : tmp['url'],
                'domain' : tmp['domain'],
                'internal_link' : internal,
                'timestamp' : date_parser.parse(tmp['created_utc']),
                'user_name' : tmp['user_name'],
                'user_id' : tmp['user_id']                 
            }
            urls_df = pd.concat([urls_df, pd.DataFrame([tmp_url])])

            if tmp['domain'] not in domains_freq.keys():
                domains_freq[tmp['domain']] = 1
            else:
                domains_freq[tmp['domain']] += 1

            # we are also interested in URLs contained in the 
            # 'selftext' field in our reddit post object-- as our
            # example from above contains some which were not captured
            # in the object itself.
            embedded_urls = extract_urls(tmp['selftext'])
            for eurl in embedded_urls:
                try:
                    tmp_domain = urlparse(eurl).netloc
                except ValueError:
                    tmp_domain = None
                tmp_eurl = {
                    'url' : eurl,
                    'domain' : tmp_domain,
                    'timestamp' : date_parser.parse(tmp['created_utc']),
                    'user_name' : tmp['user_name'],
                    'user_id' : tmp['user_id']
                }
                urls_in_text_df = pd.concat([urls_in_text_df, pd.DataFrame([tmp_eurl])]) 
            
            if tmp['user_id'] not in [x['user_id'] for x in user_overview]:
                if tmp['user_created_utc'] is not None:
                    user_created = date_parser.parse(tmp['user_created_utc'])
                else:
                    user_created = None
                tmp_user = {
                    'user_id' : tmp['user_id'],
                    'user_name' : tmp['user_name'],
                    'user_created_at' : user_created,
                    'user_comments_total' : tmp['num_comments'],
                    'user_karma_total' : tmp['user_total_karma'],
                    'user_upvotes_total' : tmp['ups'],
                    'user_downvotes_total' : tmp['downs'],
                    'total_wc_posts' : 1
                }
                user_overview.append(tmp_user)
            else:
                # first find the index where this is the case
                the_index = next((index for (index, d) in enumerate(user_overview) if d['user_id'] == tmp['user_id']), None)

                # then make changes in this entry
                user_overview[the_index]['user_comments_total'] += tmp['num_comments']
                user_overview[the_index]['user_upvotes_total'] += tmp['ups']
                user_overview[the_index]['user_downvotes_total'] += tmp['downs']
                user_overview[the_index]['total_wc_posts'] += 1
                
                # the karma recorded in our jsons is already cumulative. so it could only be
                # that a user's karma from a previous json is outdated, so we need to check if
                # it's grown and only then update this value. 
                try:
                    if tmp['user_total_karma'] > user_overview[the_index]['user_karma_total']:
                        user_overview[the_index]['user_karma_total'] = tmp['user_total_karma']
                except TypeError as e:
                    print('Error encountered when comparing user karma. Skipping.')
                    print(e)

Error encountered when comparing user karma. Skipping.
'>' not supported between instances of 'NoneType' and 'int'
[... the same two-line message repeats for further posts; output truncated ...]
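
The `TypeError` above is raised when a post's `user_total_karma` is `None`. A minimal sketch of how the per-user bookkeeping could handle this explicitly, while also avoiding the list scan for every post, is to key the summary by `user_id`. The function name `update_user` and the reduced set of fields are our own simplifications, and the sample posts are made up for illustration.

```python
def update_user(users: dict, post: dict) -> None:
    """Fold one post into a per-user summary keyed by user_id."""
    entry = users.get(post["user_id"])
    if entry is None:
        # first time we see this user
        users[post["user_id"]] = {
            "user_name": post["user_name"],
            "user_karma_total": post["user_total_karma"],
            "total_wc_posts": 1,
        }
        return

    entry["total_wc_posts"] += 1

    # karma is cumulative, so keep the larger value;
    # None means "unknown" and never overwrites a known value
    new_karma = post["user_total_karma"]
    old_karma = entry["user_karma_total"]
    if new_karma is not None and (old_karma is None or new_karma > old_karma):
        entry["user_karma_total"] = new_karma


# made-up posts, only to exercise the None handling
sample_posts = [
    {"user_id": "u1", "user_name": "a", "user_total_karma": 10},
    {"user_id": "u1", "user_name": "a", "user_total_karma": None},
    {"user_id": "u1", "user_name": "a", "user_total_karma": 25},
    {"user_id": "u2", "user_name": "b", "user_total_karma": None},
]

users = {}
for post in sample_posts:
    update_user(users, post)
```

A dict of this shape can be turned into a dataframe with `pd.DataFrame.from_dict(users, orient='index')`, if the tabular view is still wanted afterwards.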

In [88]:
# now some formatting/cleaning on the data we've just built
domains_freq = dict(sorted(domains_freq.items(), key=lambda x:x[1], reverse=True))
domains_freq_df = pd.DataFrame([domains_freq]).transpose().reset_index().rename(columns={'index' : 'domain', 0 : 'freq'})
# flag internal domains directly; this avoids a chained in-place fillna, which newer pandas versions warn about
domains_freq_df['internal'] = domains_freq_df['domain'].isin(on_reddit_domains)

user_overview_df = pd.DataFrame(user_overview)
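
As an aside, the counting and sorting of domains can also be sketched with `collections.Counter` from the standard library. This is an alternative under our own assumptions, not what the cell above does: `domain_frequencies` is a hypothetical helper and the sample domains are purely illustrative.

```python
from collections import Counter


def domain_frequencies(domains, internal_domains):
    """Count domains and return rows sorted by descending frequency."""
    counts = Counter(domains)
    return [
        {"domain": domain, "freq": freq, "internal": domain in internal_domains}
        for domain, freq in counts.most_common()
    ]


# illustrative input only
sample_domains = ["i.redd.it", "twitter.com", "i.redd.it", "self.qatar"]
rows = domain_frequencies(sample_domains, {"i.redd.it", "self.qatar"})
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` to get the same three columns as `domains_freq_df`.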

Let's now have a look at the dataframe we've created.

We've recorded `15,049` URL shares, one for each unique post.

In [89]:
urls_df['internal_link'].value_counts()

True     9563
False    5486
Name: internal_link, dtype: int64

In [90]:
urls_df['internal_link'].value_counts(normalize=True)

True     0.635458
False    0.364542
Name: internal_link, dtype: float64

Out of these URL shares, `64%` (`9563 shares`) were to internal Reddit resources, while `36%` (`5486 shares`) were to external URLs.

Let's have a look at our unique domains

In [91]:
domains_freq_df[:100]

| | domain | freq | internal |
|---|---|---|---|
| 0 | i.redd.it | 2555 | True |
| 1 | self.worldcup | 2185 | True |
| 2 | v.redd.it | 1430 | True |
| 3 | self.qatar | 1400 | True |
| 4 | twitter.com | 1123 | False |
| 5 | self.football | 1083 | True |
| 6 | self.soccer | 591 | True |
| 7 | streamja.com | 513 | False |
| 8 | streamin.me | 467 | False |
| 9 | youtu.be | 321 | False |


Interesting. We again see quite a lot of apparent links to streaming sites... albeit there appear to be fewer of them compared to what we saw on Twitter. Let's dig into this a bit more. 

Let's now have a look at URLs contained within the `selftext` field of a given reddit post.

In [104]:
urls_in_text_freq_df = urls_in_text_df.groupby('domain').count().sort_values('url', ascending=False).reset_index()[['domain', 'url']].rename(columns={'url' : 'freq'})

In [107]:
len(urls_in_text_df)

2773

In [108]:
len(urls_in_text_freq_df)

175

OK, so when looking only at URLs contained directly in the text, we have a further `2,773` URL shares, spread across `175` unique domains.

Let's have a look at the most popular of those shared domains

In [109]:
urls_in_text_freq_df[:100]

| | domain | freq |
|---|---|---|
| 0 | www.reddit.com | 767 |
| 1 | www.espn.com | 301 |
| 2 | reddit.com | 299 |
| 3 | www.reddit-stream.com | 295 |
| 4 | preview.redd.it | 239 |
| 5 | k2s.cc | 130 |
| 6 | discord.gg | 109 |
| 7 | old.reddit.com | 91 |
| 8 | new.reddit.com | 88 |
| 9 | www.youtube.com | 71 |
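
Note that the table above splits Reddit across several hostnames (`www.reddit.com`, `reddit.com`, `old.reddit.com`, `new.reddit.com`, `preview.redd.it`), most of which do not appear in `on_reddit_domains`. If we want an internal/external flag for embedded URLs too, one option is to match on the base domain rather than on an exact list. The sketch below is an assumption on our part: `is_reddit_host` is a hypothetical helper, and treating `reddit.com`, `redd.it` and the `self.` prefix as "internal" is our own rule.

```python
REDDIT_BASE_DOMAINS = ("reddit.com", "redd.it")  # assumed set of base domains


def is_reddit_host(host) -> bool:
    """Return True if host is a Reddit domain or one of its subdomains."""
    if not host:
        return False
    host = host.lower()
    # self-posts show up as 'self.<subreddit>' in the post's domain field
    if host.startswith("self."):
        return True
    # match the base domain itself or any subdomain of it,
    # but not lookalikes such as 'reddit-stream.com'
    return any(
        host == base or host.endswith("." + base)
        for base in REDDIT_BASE_DOMAINS
    )


flags = {
    host: is_reddit_host(host)
    for host in ["www.reddit.com", "old.reddit.com", "preview.redd.it",
                 "www.reddit-stream.com", "www.espn.com", "self.qatar"]
}
```

Matching on `"." + base` rather than a plain suffix keeps `www.reddit-stream.com` external, which matters here because it is one of the streaming-looking domains we want to count separately.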


OK - enough for URLs for now, let's have a look at users.

We will first look at users ranked by how often they've posted in our collection of reddit posts

In [115]:
user_overview_df.sort_values('total_wc_posts', ascending=False)[:100]

| | user_id | user_name | user_created_at | user_comments_total | user_karma_total | user_upvotes_total | user_downvotes_total | total_wc_posts |
|---|---|---|---|---|---|---|---|---|
| 2211 | zsdhx | dotuan | 2016-07-25 04:40:27 | 19096 | 157259 | 86716 | 0 | 449 |
| 2205 | ihocpit6 | Meladroite | 2022-01-12 00:11:53 | 1683 | 261406 | 4397 | 0 | 412 |
| 15 | 7h7t33at | MatchCaster | 2020-07-31 17:01:38 | 3447 | 192855 | 1373 | 0 | 285 |
| 2312 | 14zl0j | PSGAcademy | 2017-02-01 20:51:50 | 68184 | 5537206 | 387809 | 0 | 192 |
| 2233 | fujeo | MatchThreadder | 2014-03-25 16:08:35 | 175658 | 753668 | 18350 | 0 | 168 |
| 984 | 49ocws | DrArcadeTV | 2017-09-07 21:16:17 | 1874 | 17801 | 2620 | 0 | 154 |
| 2201 | 19pp70b8 | dragon8811 | 2019-02-24 15:35:27 | 20299 | 684553 | 103034 | 0 | 113 |
| 4455 | f2iugf6r | JoJo-Bizarre-1997 | 2022-01-16 04:41:42 | 319 | 10526 | 464 | 0 | 90 |
| 2227 | 2glav0rp | 2soccer2bot | 2018-10-22 23:50:16 | 17727 | 111405 | 3278 | 0 | 80 |
| 101 | gbbx93s4 | newzee1 | 2021-11-07 22:52:56 | 203 | 133869 | 210 | 0 | 65 |


Wow - so our top 7 users have each posted more than 100 times in our sample, and the top two more than 400 times. For the most active of them, that's something like 15-20 posts on average, every single day. It will be interesting to look into what kind of posts these users are sharing.

In [116]:
len(user_overview_df)

7209

In [117]:
len(unique_ids)

15049

Interim summary / takeaways:

- after removing duplicates from the `18,711` collected records, we have a total of `15,049` unique Reddit posts.
- these posts have been shared by a total of `7,209` unique users.
- some of those users have been very prolific, with the most active one contributing 449 posts to our sample.
- URLs shared in the url field of our post objects are split approximately 64% internal to 36% external.
- among external URLs, we again find a large number linking to streaming-looking sites.
- we also had a look at URLs embedded within the text of a Reddit post. This gave us a further `2,773` URL shares. Here, again, we see quite a few streaming-looking sites in the top domains.

Next steps: 
- work out the proportion of links to streaming sites among URLs from the url field, among embedded URLs, and overall.
- have a slightly deeper look at the content (self text) posted by our most active users
- run the VADER sentiment analysis on each of our Reddit posts that contain self text
- produce the most frequent words in positive (> 0 compound) and negative (< 0 compound) posts, shown as tables and as word clouds.
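
To make the last step concrete, here is a rough sketch of how word frequencies could be split by the sign of the compound score. It assumes the compound scores have already been computed elsewhere (the values below are made up and are not VADER output), `top_words_by_polarity` is a hypothetical helper, and no stop-word removal is done.

```python
import re
from collections import Counter


def top_words_by_polarity(scored_posts, n=3):
    """scored_posts: iterable of (text, compound_score) pairs.
    Returns the n most frequent words for positive and negative posts."""
    buckets = {"positive": Counter(), "negative": Counter()}
    for text, compound in scored_posts:
        if compound > 0:
            key = "positive"
        elif compound < 0:
            key = "negative"
        else:
            continue  # neutral posts (compound == 0) are left out
        # very simple tokenisation: lowercase runs of letters/apostrophes
        buckets[key].update(re.findall(r"[a-z']+", text.lower()))
    return {key: counter.most_common(n) for key, counter in buckets.items()}


# made-up texts and scores, for illustration only
scored_posts = [
    ("great match great goal", 0.8),
    ("terrible referee", -0.6),
    ("kick off at nine", 0.0),
    ("terrible terrible pitch", -0.4),
]
top_words = top_words_by_polarity(scored_posts, n=1)
```

The `(word, count)` pairs returned per bucket can feed both the frequency tables and a word cloud.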

In [None]:
streaming_domains = [
    'streamja.com',
    'streamin.me',
    'streamable.com',
    'streamff.com',
    'streamag.com',
    'streamingdigitally.com',
    'streamscores.link',
    'www.reddit-stream.com',
    'reddit-stream.com',
    'www.livesoccertv.com',
    'livesoccertv.com',
]
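
With a list like this in hand, the proportion of streaming links can be sketched as below. This is a tentative outline rather than a result: `streaming_share` is a hypothetical helper, the `www.` normalisation is our own choice, and the counts are toy numbers, not figures from our data.

```python
def streaming_share(domain_counts: dict, streaming_domains) -> float:
    """Share of URL shares whose domain is in streaming_domains."""

    def normalise(domain: str) -> str:
        # treat 'www.example.com' and 'example.com' as the same domain
        domain = domain.lower()
        return domain[4:] if domain.startswith("www.") else domain

    streaming = {normalise(d) for d in streaming_domains}
    total = sum(domain_counts.values())
    if total == 0:
        return 0.0
    hits = sum(
        count for domain, count in domain_counts.items()
        if normalise(domain) in streaming
    )
    return hits / total


# toy counts for illustration only
toy_counts = {"streamja.com": 3, "www.reddit-stream.com": 1, "twitter.com": 4}
share = streaming_share(toy_counts, ["streamja.com", "reddit-stream.com"])
```

Running the same function once on the url-field counts, once on the embedded-URL counts and once on their sum would give the three proportions listed in the next steps.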