## Before you start

Please make sure that you have a Reddit account and have created an app. We will use this app to get the data (much like in the good old days of the open Twitter API -- which was Elons ago).

To create an app, open this [link](https://www.reddit.com/prefs/apps) and fill in a form that looks more or less like the one below.

<div style="text-align:center"><img src="../png/reddit_app.png" /></div>

Select the `script` option and set the redirect URI to `http://localhost:8080`. In the screenshot above I hid the Reddit Client Id (the red square just below "personal use script") and the Reddit Client Secret. I did it on purpose because, as I said during the class, this is information that you should not share with anyone. The Reddit Client Secret has "secret" in its name for a very good reason. As a reminder, it allows Reddit to match the requests you send with your account. In other words, it lets the website recognize the app as yours. Therefore, if you share your credentials with a stranger, they might use them to do something malicious and you would be the one to blame. It works more or less like your ID card: if you give your ID to a shady person, they might take out a bank loan in your name and you will be the one who has to pay it back.

In [None]:
!pip install praw
## Import modules
import praw
from praw.models import MoreComments
import os
from datetime import datetime
import spacy
import json
import pandas as pd

## Ignore the following
client_id = os.getenv("Reddit_Client_Id")
client_secret = os.getenv('Reddit_Client_Secret')
password = os.getenv('Reddit_password')
user_agent = os.getenv('Reddit_User_Agent')
username = os.getenv('Reddit_Username')

def convert_date(date_float : float) -> str:
    '''
    Takes a date in epoch time format and converts it into a string in human-readable date format.
    
    Parameters:
    -----------
        date_float (float): a float representing a date in epoch time format.
        
    Returns:
    --------
        (str) : a string representing a date in human-readable format.
    '''
    return datetime.fromtimestamp(date_float).strftime('%d-%m-%Y %H:%M:%S')
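
One detail worth knowing: `datetime.fromtimestamp` without a time zone converts to the local time of the computer running the code, while Reddit's `created_utc` fields are expressed in UTC. Below is a minimal sketch of a UTC variant, assuming you want the resulting strings to stay in UTC. The name `convert_date_utc` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone


def convert_date_utc(date_float: float) -> str:
    """Convert epoch time to a 'dd-mm-YYYY HH:MM:SS' string in UTC."""
    # Passing tz=timezone.utc makes the result independent of the
    # time zone of the computer running the code.
    return datetime.fromtimestamp(date_float, tz=timezone.utc).strftime('%d-%m-%Y %H:%M:%S')


print(convert_date_utc(0))           # 01-01-1970 00:00:00
print(convert_date_utc(1609459200))  # 01-01-2021 00:00:00
```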


The chunk above loads the modules, loads my credentials from environment variables on my computer, and defines a helper function that converts epoch time into a human-readable format. However, the credentials will not load in your case (`os.getenv` returns `None` when a variable is not set) because they are stored only on my computer (I will show you some other time how to set this up on your computer, but not now). For now, you need to paste your credentials as strings in the chunk below.

* `client_id` is the string below "personal use script".
* `client_secret` is the string next to "secret".
* `password` is your Reddit password.
* `user_agent` in practice can be just the name of your app. Strictly speaking, however, you should provide a string of the form `<operating system>:<client_id>:<version of the app> (by u/<your username>)`.
* `username` is your Reddit username.
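
The `user_agent` format described above can be assembled with a small helper. This is only a sketch: the function name `build_user_agent` and the example values are made up for illustration, so replace them with your own.

```python
import platform


def build_user_agent(app_id: str, version: str, username: str) -> str:
    """Build '<operating system>:<app id>:<version> (by u/<username>)'."""
    # platform.system() returns e.g. 'Linux', 'Darwin' or 'Windows'.
    operating_system = platform.system().lower()
    return f"{operating_system}:{app_id}:{version} (by u/{username})"


# Hypothetical example values, not real credentials.
print(build_user_agent("my_client_id", "v0.1", "my_username"))
```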

In [None]:
## Replace empty strings with your credentials
## client_id = ''
## client_secret = ''
## password = ''
## user_agent = ''
## username = ''
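
Before connecting, it can be worth checking that no credential was left empty, since `os.getenv` returns `None` for a variable that is not set. Here is a minimal sketch with made-up values (`missing_credentials` is a hypothetical helper, not part of PRAW).

```python
def missing_credentials(credentials: dict) -> list:
    """Return the names of credentials that are None or empty strings."""
    return [name for name, value in credentials.items() if not value]


# Made-up values for illustration only.
example = {
    "client_id": "my_client_id",
    "client_secret": "",      # left empty by mistake
    "password": None,         # environment variable not set
    "user_agent": "my_app",
    "username": "my_username",
}
print(missing_credentials(example))  # ['client_secret', 'password']
```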

Once you have all your credentials stored in _Python_ as strings, let's connect to the Reddit API. [Here](https://praw.readthedocs.io/en/stable/index.html) is the documentation of the `praw` module.

In [None]:
## Connect to Reddit API
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent=user_agent,
    username=username
)

In [None]:
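## Get the top submissions of all time from the multilingualparenting subreddit
## (Reddit listings usually return at most about 1000 items, even with limit=None)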
subreddit = reddit.subreddit('multilingualparenting').top(time_filter="all", limit = None)
## Create a list of dictionaries with the submissions
submissions = [ { 'title' : line.title,
                  'id' : line.id,
                  'upvote_ratio' : line.upvote_ratio,
                  'selftext' : line.selftext,
                  'score' : line.score,
                  'flair' : line.link_flair_text,
                  'num_comments' : line.num_comments,
                  'is_self' : line.is_self,
                  'created' : convert_date(line.created_utc)} 
               for line in subreddit ]

There is a lot of information we can get from submissions. The following fields are available, but we probably don't need all of them. I list them here just in case.

* author -- provides an instance of Redditor.
* author_flair_text -- the text content of the author’s flair, or None if not flaired. In simple terms, a flair on Reddit is a kind of tag added to either a post or a username. Flairs are meant to categorize posts or users.
* clicked -- whether or not the submission has been clicked by the client.
* comments -- provides an instance of CommentForest.
* created_utc -- time the submission was created, represented in Unix Time.
* distinguished -- whether or not the submission is distinguished.
* edited -- whether or not the submission has been edited.
* id -- ID of the submission.
* is_original_content -- whether or not the submission has been set as original content.
* is_self -- whether or not the submission is a selfpost (text-only).
* link_flair_template_id -- the link flair’s ID.
* link_flair_text -- the link flair’s text content, or None if not flaired.
* locked -- whether or not the submission has been locked.
* name -- full name of the submission.
* num_comments -- the number of comments on the submission.
* over_18 -- whether or not the submission has been marked as NSFW.
* permalink -- a permalink for the submission.
* poll_data -- a PollData object representing the data of this submission, if it is a poll submission.
* saved -- whether or not the submission is saved.
* score -- the number of upvotes for the submission.
* selftext -- the submission’s selftext; an empty string if it is a link post.
* spoiler -- whether or not the submission has been marked as a spoiler.
* stickied -- whether or not the submission is stickied.
* subreddit -- provides an instance of Subreddit.
* title -- the title of the submission.
* upvote_ratio -- the percentage of upvotes from all votes on the submission.
* url -- the URL the submission links to, or the permalink if a selfpost.
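
If you later decide to keep a different set of fields, one option is to list the attribute names once and read them with `getattr`. Below is a minimal sketch that uses a stand-in object instead of a real PRAW submission, so it runs without a Reddit connection (`pick_fields` and `fake_submission` are hypothetical names).

```python
from types import SimpleNamespace


def pick_fields(obj, fields):
    """Return a dict with the requested attributes of obj.

    Attributes that do not exist on the object are stored as None.
    """
    return {name: getattr(obj, name, None) for name in fields}


# Stand-in for a submission; a real one would come from PRAW.
fake_submission = SimpleNamespace(id="abc123", title="Example title",
                                  score=10, num_comments=3)

wanted = ["id", "title", "score", "over_18"]
print(pick_fields(fake_submission, wanted))
# {'id': 'abc123', 'title': 'Example title', 'score': 10, 'over_18': None}
```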

In [None]:
## Just print out the most important information about each submission.
for sub in submissions: print({ 'id' : sub['id'], 'title' : sub['title'], 'num_comments' : sub['num_comments'] })

There is also a lot of information about a single comment. The following fields are available, but we probably don't need all of them. I list them here just in case.

* author -- provides an instance of Redditor.
* body -- the body of the comment, as Markdown.
* body_html -- the body of the comment, as HTML.
* created_utc -- time the comment was created, represented in Unix Time.
* distinguished -- whether or not the comment is distinguished.
* edited -- whether or not the comment has been edited.
* id -- the ID of the comment.
* is_submitter -- whether or not the comment author is also the author of the submission.
* link_id -- the submission ID that the comment belongs to.
* parent_id -- the ID of the parent comment (prefixed with t1_). If it is a top-level comment, this returns the submission ID instead (prefixed with t3_).
* permalink -- a permalink for the comment. Comment objects from the inbox have a context attribute instead.
* replies -- provides an instance of CommentForest.
* saved -- whether or not the comment is saved.
* score -- the number of upvotes for the comment.
* stickied -- whether or not the comment is stickied.
* submission -- provides an instance of Submission. The submission that the comment belongs to.
* subreddit -- provides an instance of Subreddit. The subreddit that the comment belongs to.
* subreddit_id -- the subreddit ID that the comment belongs to.
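
The `t1_` and `t3_` prefixes mentioned for `parent_id` can be used to tell replies from top-level comments. Here is a small sketch with made-up IDs (the helper `split_fullname` is hypothetical, not a PRAW function).

```python
def split_fullname(fullname: str):
    """Split a Reddit fullname such as 't1_abc' into (kind, id)."""
    prefix, _, short_id = fullname.partition("_")
    kinds = {"t1": "comment", "t3": "submission"}
    # Unknown prefixes are returned unchanged.
    return kinds.get(prefix, prefix), short_id


print(split_fullname("t3_l4lgjs"))   # ('submission', 'l4lgjs')
print(split_fullname("t1_gkp1abc"))  # ('comment', 'gkp1abc')
```

A comment whose `parent_id` splits into `'submission'` is a top-level comment; one that splits into `'comment'` is a reply.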

And for the Redditor:

* comment_karma -- the comment karma for the Redditor.
* comments -- provides an instance of SubListing for comment access.
* submissions -- provides an instance of SubListing for submission access.
* created_utc -- time the account was created, represented in Unix Time.
* has_verified_email -- whether or not the Redditor has verified their email.
* icon_img -- the URL of the Redditor’s avatar.
* id -- the ID of the Redditor.
* is_employee -- whether or not the Redditor is a Reddit employee.
* is_friend -- whether or not the Redditor is friends with the authenticated user.
* is_mod -- whether or not the Redditor mods any subreddits.
* is_gold -- whether or not the Redditor has active Reddit Premium status.
* is_suspended -- whether or not the Redditor is currently suspended.
* link_karma -- the link karma for the Redditor.
* name -- the Redditor’s username.
* subreddit -- if the Redditor has created a user-subreddit, provides a dictionary of additional attributes. See below.
* subreddit["banner_img"] -- the URL of the user-subreddit banner.
* subreddit["name"] -- the fullname of the user-subreddit.
* subreddit["over_18"] -- whether or not the user-subreddit is NSFW.
* subreddit["public_description"] -- the public description of the user-subreddit.
* subreddit["subscribers"] -- the number of users subscribed to the user-subreddit.
* subreddit["title"] -- the title of the user-subreddit.

In [None]:
## Select a submission by id -- this one is the first submission from multilingualparenting subreddit
submission = reddit.submission("l4lgjs")

## Set the option to get all the comments
submission.comments.replace_more(limit=None)

## Iterate over all the comments, ignoring the comment
## tree. Write the comments to a JSON Lines file.
with open('comments_example.jl', 'w') as file:
    for comment in submission.comments.list():
        temp_dict = {}
        temp_dict['body'] = comment.body
        ## Sometimes a given comment was deleted. Then
        ## we don't want to write it out to the file.
        ## I use the continue statement here. It does not
        ## break the loop; it just goes to the next iteration.
        ## In other words, whenever the comment was deleted
        ## it skips the rest of the code below the continue
        ## statement and goes on to the next comment.
        if temp_dict['body'] == '[deleted]':
            continue
        temp_dict['score'] = comment.score
        temp_dict['link'] = comment.permalink
        ## The author can be missing (e.g. a deleted account).
        ## In that case the author block is simply left out.
        try:
            temp_dict['author'] = { 'name' : comment.author.name,
                                    'karma' : comment.author.comment_karma,
                                    'created_utc' : convert_date(comment.author.created_utc),
                                    'has_verified_email' : comment.author.has_verified_email,
                                    #'is_suspended' : comment.author.is_suspended,
                                    'is_gold' : comment.author.is_gold
            }
        except Exception:
            pass
        temp_dict['created_utc'] = convert_date(comment.created_utc)
        temp_dict['edited'] = comment.edited
        temp_dict['is_submitter'] = comment.is_submitter

        file.write(json.dumps(temp_dict) + '\n')
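
The `try` block around the author fields is needed because `comment.author` is `None` when the account was deleted, so accessing `.name` on it fails. A more explicit way of handling this might look as follows. It is a sketch using stand-in objects so that it runs offline, and `author_to_dict` is a hypothetical helper.

```python
from types import SimpleNamespace


def author_to_dict(author):
    """Return selected author fields, or None if there is no author."""
    if author is None:  # e.g. the account was deleted
        return None
    return {
        "name": author.name,
        # Some accounts may lack attributes, so fall back to None
        # instead of raising AttributeError.
        "karma": getattr(author, "comment_karma", None),
        "is_gold": getattr(author, "is_gold", None),
    }


active = SimpleNamespace(name="some_user", comment_karma=42, is_gold=False)
partial = SimpleNamespace(name="other_user")

print(author_to_dict(active))   # {'name': 'some_user', 'karma': 42, 'is_gold': False}
print(author_to_dict(partial))  # {'name': 'other_user', 'karma': None, 'is_gold': None}
print(author_to_dict(None))     # None
```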

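## Excel file
If you want to create an Excel file from a JSON Lines file, you can easily do it in the following manner. Note that pandas needs an Excel writer package such as `openpyxl` to be installed for this to work.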

In [None]:
## Read a JSON Lines file into Python
with open('comments_example.jl', 'r') as file:
    df = [ json.loads(line) for line in file.readlines() ]
    
pd.DataFrame(df).to_excel('comments_example.xlsx')
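
One thing to keep in mind: the `author` field is a nested dictionary, so in the spreadsheet it ends up as a single cell containing the whole dictionary. Below is a sketch of flattening it before the export, using only the standard library and a made-up record. `flatten` is a hypothetical helper; pandas also offers `pd.json_normalize` for the same purpose.

```python
def flatten(record: dict, sep: str = ".") -> dict:
    """Turn {'author': {'name': 'x'}} into {'author.name': 'x'}."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so that deeper levels are flattened too.
            for sub_key, sub_value in flatten(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Made-up record in the shape written to comments_example.jl
record = {"body": "Hello", "score": 2,
          "author": {"name": "some_user", "karma": 42}}
print(flatten(record))
# {'body': 'Hello', 'score': 2, 'author.name': 'some_user', 'author.karma': 42}
```

A list of such flattened dictionaries can be passed to `pd.DataFrame` in the same way as the unflattened one.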