# 02_reddit_strategy
We now look at how to collect Reddit data. This is a bit more complex than for Twitter, because the flow of information on Reddit isn't as clean or one-directional.  
Hence, for now, we focus solely on subreddits: we collect new posts to these subreddits, along with some author info, any text associated with the post and, crucially, the URL of each post. If we later need the comments on a post, we can collect them at that stage.  
For now, we use the `PRAW` module, although it might also make sense to investigate `pushshift`.

NL, 29/11/22

## IMPORTS

In [198]:
import os
from dotenv import load_dotenv
import praw
from psaw import PushshiftAPI
import datetime as dt
from dateutil import parser as date_parser
from prawcore.exceptions import NotFound
import json

## FUNCTIONS

In [190]:
def reddit_post_to_dict(post:praw.models.reddit.submission.Submission,
                        custom_fields:list=None,
                        overwrite_core_fields:bool=False,
                        convert_timestamp:bool=True) -> dict:
    '''
    function that converts an object of the praw.submission type
    into a dict. this is useful primarily in order to serialise this
    to json and write out to file.

    args:
        - post, a submission ('post') in a subreddit retrieved via praw
        - custom_fields, a list of strings containing additional fields returned by 
          praw which are to be retained
        - overwrite_core_fields, bool, indicates whether core (standard) fields
          of the output dict are to be retained, or whether they should be replaced
          entirely by 'custom_fields'
        - convert_timestamp, bool, indicates whether the epoch timestamps
          'created_utc' and 'user_created_utc' are to be converted into
          '%Y/%m/%d %H:%M:%S' strings
    '''    
    if not isinstance(post, praw.models.reddit.submission.Submission):
        raise TypeError('post must be a praw submission object')

    if overwrite_core_fields:
        if not custom_fields:
            raise ValueError('custom_fields is not defined')

    core_fields = ['id', 'created_utc', 'title', 'selftext', 'domain', 'url', 'num_comments', 'score', 'ups', 'downs', 'author']
    if overwrite_core_fields:
        core_fields = custom_fields
    elif isinstance(custom_fields, list) and len(custom_fields)>0:
        core_fields += custom_fields

    core_user_fields = ['name', 'id', 'total_karma', 'verified', 'created_utc']

    tmp_dict = vars(post)
    out = {field:tmp_dict[field] for field in core_fields if field in tmp_dict}

    # timestamps are epoch seconds; format them explicitly in UTC so the
    # result does not depend on the machine's local timezone
    if convert_timestamp and 'created_utc' in out:
        out['created_utc'] = dt.datetime.fromtimestamp(out['created_utc'], tz=dt.timezone.utc).strftime('%Y/%m/%d %H:%M:%S')

    if 'author' in out:
        # replace the author object by flat 'user_*' fields; attributes that
        # cannot be retrieved come back as None
        user_dict = {'user_'+field:get_user_attribute(user=post.author, attribute=field) for field in core_user_fields}

        del out['author']

        out.update(user_dict)

        if convert_timestamp and isinstance(out.get('user_created_utc'), float):
            out['user_created_utc'] = dt.datetime.fromtimestamp(out['user_created_utc'], tz=dt.timezone.utc).strftime('%Y/%m/%d %H:%M:%S')

    return out

In [183]:
def get_user_attribute(user:praw.models.reddit.redditor.Redditor,
                       attribute:str):
    '''  
    helper function for getting user-level attributes from a praw
    redditor object. it returns None instead of breaking the code when
    the attribute is missing (handled by the getattr default) or when
    the redditor cannot be found (prawcore NotFound)
    '''
    try:
        val = getattr(user, attribute, None)
    except NotFound:
        val = None
        
    return val
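
The default argument of `getattr` only covers `AttributeError`; any other exception raised during attribute lookup still propagates, which is why the helper also needs the `try`/`except`. The sketch below illustrates this with stand-in classes. `FakeNotFound`, `MissingUser`, `PartialUser` and `safe_get` are hypothetical names made up for this illustration and are not part of PRAW; the only assumption about PRAW is that its objects may fetch data lazily on attribute access.

```python
class FakeNotFound(Exception):
    """Stand-in for prawcore.exceptions.NotFound (hypothetical, illustration only)."""


class MissingUser:
    """Hypothetical user whose lazy attribute lookup fails with FakeNotFound."""

    def __getattr__(self, name):
        # called only when normal lookup fails; mimics a failed lazy fetch
        raise FakeNotFound(name)


class PartialUser:
    """Hypothetical user that only carries a name."""
    name = 'some_user'


def safe_get(user, attribute):
    """Same pattern as get_user_attribute, but using the stand-in exception."""
    try:
        # the default (None) is returned only for AttributeError
        return getattr(user, attribute, None)
    except FakeNotFound:
        # any lookup that fails with FakeNotFound is mapped to None as well
        return None


results = [
    safe_get(PartialUser(), 'name'),         # attribute exists
    safe_get(PartialUser(), 'total_karma'),  # attribute missing
    safe_get(MissingUser(), 'name'),         # lookup raises FakeNotFound
    safe_get(None, 'name'),                  # no author object at all
]
print(results)
```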

In [212]:
def extract_newest_date_from_json(reddit_post_json:str):
    ''' 
    when we re-start collecting reddit post data, 
    we want to make sure we don't collect any posts we've already
    retrieved. so, we'll get the date/time of when the most recently
    collected post was published.

    args:  
        - reddit_post_json: full filepath to the json we want to 
          extract a date from.  
    '''
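    if not os.path.isfile(reddit_post_json):
        raise ValueError(f'{reddit_post_json} is not a valid file. Please supply a valid reddit json file')

    # posts are written in the order returned by subreddit.new(), i.e.
    # newest first, so the first line holds the most recent post
    with open(reddit_post_json, 'r') as infile:
        first_line = infile.readline()

    if not first_line.strip():
        raise ValueError(f'{reddit_post_json} is empty')

    tmp = json.loads(first_line)
    date = date_parser.parse(tmp['created_utc'])

    return date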
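
The function above relies on the first line of the file being the newest post. If that ordering cannot be guaranteed, one alternative is to scan every line and keep the maximum. This is a minimal sketch of that variant, not the approach used in this notebook: `newest_date_in_jsonl` and `DATE_FORMAT` are hypothetical names, the demo records are made up, and it assumes one json object per line with `created_utc` already converted to a `'%Y/%m/%d %H:%M:%S'` string.

```python
import datetime as dt
import json
import os
import tempfile

# assumption: string format used when timestamps are converted
DATE_FORMAT = '%Y/%m/%d %H:%M:%S'


def newest_date_in_jsonl(path):
    """Return the latest 'created_utc' in a json-lines file, or None if empty."""
    newest = None
    with open(path, 'r') as infile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            created = dt.datetime.strptime(record['created_utc'], DATE_FORMAT)
            if newest is None or created > newest:
                newest = created
    return newest


# small demonstration with made-up records, deliberately not newest-first
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.json')
    with open(demo_path, 'w') as f:
        f.write(json.dumps({'id': 'a', 'created_utc': '2022/11/29 11:05:08'}) + '\n')
        f.write(json.dumps({'id': 'b', 'created_utc': '2022/11/29 13:47:07'}) + '\n')
        f.write(json.dumps({'id': 'c', 'created_utc': '2022/11/28 09:00:00'}) + '\n')
    demo_newest = newest_date_in_jsonl(demo_path)

print(demo_newest)
```

Scanning the whole file costs one pass per restart, which should be negligible for files of this size.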


## PATHS & CONSTANTS

In [2]:
SUBREDDITS = '../data_collection/reddit_subreddits.txt'
ACCOUNTS = '../data_collection/reddit_accounts.txt'

## INIT

In [3]:
load_dotenv()

True

In [4]:
reddit = praw.Reddit(
    client_id=os.getenv('REDDIT_ID'),
    client_secret=os.getenv('REDDIT_SECRET'),
    user_agent=os.getenv('REDDIT_USER_AGENT'),
    username=os.getenv('REDDIT_USERNAME'),
    password=os.getenv('REDDIT_PASSWORD')
)

In [53]:
with open(SUBREDDITS, 'r') as infile:
    subreddits = [line.rstrip() for line in infile]

In [102]:
filters = ['id', 'created_utc', 'title', 'selftext', 'domain', 'url', 'num_comments', 'score', 'ups', 'downs', 'author']

In [207]:
filelist = os.listdir('../data/logfiles/twitter/')

In [209]:
filelist.sort(key=lambda x: os.path.getmtime('../data/logfiles/twitter/'+x))


In [210]:
filelist

['2022_11_26-04_00_02.log',
 '2022_11_26-05_00_01.log',
 '2022_11_26-06_00_01.log',
 '2022_11_26-07_00_01.log',
 '2022_11_26-08_00_01.log',
 '2022_11_26-09_00_01.log',
 '2022_11_26-10_00_01.log',
 '2022_11_26-11_00_01.log',
 '2022_11_26-12_00_01.log',
 '2022_11_26-13_00_01.log',
 '2022_11_26-14_00_01.log']

In [213]:
min_date = extract_newest_date_from_json('../data/reddit_posts/worldcup/worldcup_2022_11_29-13_45_19.json')

In [214]:
min_date

datetime.datetime(2022, 11, 29, 13, 47, 7)
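
Once `min_date` is known, it can be used to drop posts that were already collected in an earlier run. The following is a minimal sketch of that filtering step, assuming posts have already been converted to dicts with `created_utc` as a `'%Y/%m/%d %H:%M:%S'` string. `keep_posts_newer_than`, `DATE_FORMAT` and the demo records are hypothetical and only serve as illustration.

```python
import datetime as dt

# assumption: string format used when timestamps are converted
DATE_FORMAT = '%Y/%m/%d %H:%M:%S'


def keep_posts_newer_than(posts, min_date):
    """Keep only post dicts whose 'created_utc' is strictly after min_date."""
    return [
        post for post in posts
        if dt.datetime.strptime(post['created_utc'], DATE_FORMAT) > min_date
    ]


# made-up post dicts for demonstration
demo_posts = [
    {'id': 'a', 'created_utc': '2022/11/29 13:40:00'},  # older, already collected
    {'id': 'b', 'created_utc': '2022/11/29 13:47:07'},  # equal to min_date, already collected
    {'id': 'c', 'created_utc': '2022/11/29 14:02:31'},  # newer, keep
]
demo_min_date = dt.datetime(2022, 11, 29, 13, 47, 7)
demo_new = keep_posts_newer_than(demo_posts, demo_min_date)

print([post['id'] for post in demo_new])
```

The strict comparison excludes the post that defines `min_date` itself. Two different posts created in the same second would be treated as one, so deduplicating on `id` may be the safer choice if that matters.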

## THE THING!

In [204]:
res = reddit.subreddit(subreddits[0]).new(limit=10)

In [220]:
dt.datetime.fromtimestamp(post.created) > dt.datetime.now()

False

In [187]:
posts = []
for post in res:
    tmp = reddit_post_to_dict(post)
    posts.append(tmp)

In [197]:
DT_TODAY = dt.datetime.now()
TODAY = DT_TODAY.strftime('%Y_%m_%d-%H_%M_%S')

In [205]:
# same but with streaming to json file (one json object per line)
outfile = f'../data/reddit_posts/{subreddits[0]}/{subreddits[0]}_{TODAY}.json'
with open(outfile, 'a') as o:
    for post in res:
        tmp = reddit_post_to_dict(post)
        o.write(json.dumps(tmp)+'\n')
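
The output path contains a per-subreddit folder, which may not exist yet on a first run. A small helper can create the folder before appending. This is a sketch under assumptions rather than the notebook's own code: `append_jsonl` is a hypothetical name, the demo record is made up, and `default=str` is an assumed fallback for values that `json` cannot serialise (for instance objects pulled in through `custom_fields`).

```python
import json
import os
import tempfile


def append_jsonl(records, path):
    """Append dicts to a json-lines file, creating parent folders if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    written = 0
    with open(path, 'a') as outfile:
        for record in records:
            # default=str is an assumption: it stringifies values that
            # json cannot serialise instead of raising a TypeError
            outfile.write(json.dumps(record, default=str) + '\n')
            written += 1
    return written


# demonstration in a temporary folder with a made-up record
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'worldcup', 'worldcup_demo.json')
    demo_written = append_jsonl([{'id': 'a', 'title': 'first'}], demo_path)
    demo_written += append_jsonl([{'id': 'b', 'title': {1, 2}.__class__}], demo_path)
    with open(demo_path, 'r') as f:
        demo_lines = [json.loads(line) for line in f]

print(demo_written, [line['id'] for line in demo_lines])
```

Because the helper accepts any iterable, a generator of converted posts can be passed in directly and is still written one line at a time.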

In [206]:
dt.datetime.timestamp(1)

TypeError: descriptor 'timestamp' for 'datetime.datetime' objects doesn't apply to a 'int' object

In [191]:
posts[-1]

{'id': 'yc55y1',
 'created_utc': '2022/10/24 07:45:03',
 'title': 'Cwmbran versus Caerleon in the Welsh fourth tier was nothing to scoff at. 6 goals, an immense late comeback and a kicked over bin. Read about it here!',
 'selftext': '',
 'domain': 'georgeharrisfootball598442982.wordpress.com',
 'url': 'https://georgeharrisfootball598442982.wordpress.com/2022/10/24/22nd-october-2022-cwmbran-town-afc-vs-caerleon-afc/',
 'num_comments': 7,
 'score': 90,
 'ups': 90,
 'downs': 0,
 'user_name': 'THEPOSTMATCH',
 'user_id': 'ppe1fboz',
 'user_total_karma': 207,
 'user_verified': True,
 'user_created_utc': 1657216964.0}

In [115]:
date_parser.parse(strdate)

datetime.datetime(2022, 11, 29, 11, 5, 8)

In [88]:
vars(post_dict['author'])

{'_listing_use_sort': True,
 'name': 'WillingCoach',
 '_reddit': <praw.reddit.Reddit at 0xffffb80436a0>,
 '_fetched': True,
 'is_employee': False,
 'is_friend': False,
 'subreddit': UserSubreddit(display_name='u_WillingCoach'),
 'snoovatar_size': [380, 600],
 'awardee_karma': 0,
 'id': 'y4nu3lg',
 'verified': True,
 'is_gold': False,
 'is_mod': True,
 'awarder_karma': 292,
 'has_verified_email': True,
 'icon_img': 'https://styles.redditmedia.com/t5_g1wss/styles/profileIcon_snoo795bb301-91d1-4739-a8a7-572660b73dc1-headshot.png?width=256&height=256&crop=256:256,smart&s=a9756fc135069fbc0a46c08d5b366d6e1cf13056',
 'hide_from_robots': True,
 'link_karma': 906,
 'pref_show_snoovatar': False,
 'is_blocked': False,
 'total_karma': 1198,
 'accept_chats': False,
 'created': 1519210934.0,
 'created_utc': 1519210934.0,
 'snoovatar_img': 'https://i.redd.it/snoovatar/avatars/795bb301-91d1-4739-a8a7-572660b73dc1.png',
 'comment_karma': 0,
 'accept_followers': True,
 'has_subscribed': True,
 'accept_p

In [103]:
{field:post_dict[field] for field in filters}

{'id': 'z7qz9l',
 'created_utc': 1669719908.0,
 'title': 'Pitch invader with rainbow flag interrupts World Cup match between Portugal and Uruguay',
 'selftext': '',
 'domain': 'youtube.com',
 'url': 'https://www.youtube.com/watch?v=0TqHNVilm9o',
 'num_comments': 1,
 'score': 4,
 'ups': 4,
 'downs': 0,
 'author': Redditor(name='WillingCoach')}

Trying to find a post that contains text rather than a link.

In [95]:
post2 = reddit.submission('nlj59n')

In [128]:
reddit_post_to_dict(post=post2)

{'id': 'nlj59n',
 'created_utc': '2021/05/26 15:02:56',
 'title': 'Some API attributes - What do they mean',
 'selftext': 'Hi,\n\nCan someone please help me what these API attributes mean - \n\nPWLS\n\nWLS\n\nIS\\_REDDIT\\_MEDIA\\_DOMAIN\n\nIS\\_META\n\nCAN\\_MOD\\_POST\n\nREMOVED\\_BY\\_CATEGORY\n\nAUTHOR\\_FLAIR\\_TYPE\n\nTREATMENT\\_TAGS \n\nMOD\\_REASON\\_BY \n\nIS\\_ROBOT\\_INDEXABLE\n\nNUM\\_DUPLICATES\n\nThank you for your time. \n\n*"Fidelity to Duty"*',
 'domain': 'self.redditdev',
 'url': 'https://www.reddit.com/r/redditdev/comments/nlj59n/some_api_attributes_what_do_they_mean/',
 'num_comments': 6,
 'score': 1,
 'ups': 1,
 'downs': 0,
 'user_name': 'AjithaBuvan'}

In [101]:
vars(post2)['selftext']

'Hi,\n\nCan someone please help me what these API attributes mean - \n\nPWLS\n\nWLS\n\nIS\\_REDDIT\\_MEDIA\\_DOMAIN\n\nIS\\_META\n\nCAN\\_MOD\\_POST\n\nREMOVED\\_BY\\_CATEGORY\n\nAUTHOR\\_FLAIR\\_TYPE\n\nTREATMENT\\_TAGS \n\nMOD\\_REASON\\_BY \n\nIS\\_ROBOT\\_INDEXABLE\n\nNUM\\_DUPLICATES\n\nThank you for your time. \n\n*"Fidelity to Duty"*'

## PSAW

In [58]:
api = PushshiftAPI(reddit)

In [59]:
earliest = int(date_parser.parse('2022-11-15').timestamp())
latest = int((dt.datetime.now()-dt.timedelta(days=7)).timestamp())

In [68]:
res_psaw = list(api.search_submissions(
        subreddit=subreddits[0],   
        after=earliest,      
        before=latest,       
        filter=filters,        
        limit=100))

KeyboardInterrupt: 

In [7]:
for item in res:
    print(item.title)

[Rob Harris] FIFA President Gianni Infantino at news conference in Qatar: “Today I feel Qatari. Today I feel Arab. Today I feel African. Today I feel gay. Today I feel disabled. Today I feel a migrant worker. ... I know what it feels to be discriminated … I was bullied because I had red hair.”
Sky News: FIFA chief Gianni Infantino hits out at Qatar criticism saying European countries should instead 'be apologising for the next 3,000 years'
Gianni Infantino ‘feels gay’ and ‘like a migrant worker’ as he recalls being bullied for ‘red hair and freckles’
FIFA chief Gianni Infantino hits out at Qatar criticism saying European countries should instead 'be apologising for the next 3,000 years'
FIFA chief Gianni Infantino hits out at Qatar criticism saying European countries should instead 'be apologising for the next 3,000 years'
World Cup 2022: Fifa president Gianni Infantino accuses West of 'hypocrisy'
Erik Niva: Infantino Has completely lost touch with reality [Translation in comments]
Bel

In [16]:
item.url

'https://www.reddit.com/r/soccercirclejerk/comments/yzfryu/rob_harris_fifa_president_gianni_infantino_at/'

In [12]:
item.author.name

'AlwaysFullyObjective'

In [15]:
item.media

In [29]:
item.num_crossposts

0

In [31]:
reddit.submission('z76zb0').url

'https://www.youtube.com/watch?v=PwLAtUDIq1w'