# Scraping posts and comments from Reddit
*This notebook is part of the [LLMCode library](https://github.com/PerttuHamalainen/LLMCode).*

**Learning goals**
Get hands-on experience finding and scraping interesting Reddit discussions, which can be useful for, e.g., designers trying to understand their potential target users.

The provided example is for game designers, with the goal of understanding what makes a game memorable, based on discussions about Elden Ring, the remarkably successful hit from 2022.

**Before you use this Colab notebook**

*Step 1: Create a Reddit Account*

1. Create a new Reddit account for your bot, or use an existing account.
2. Verify the email address associated with the account.

*Step 2: Register a Reddit app (the scraping code needs this for authentication)*

1. Go to [Reddit's App Page](https://www.reddit.com/prefs/apps).
2. Click on “are you a developer? create an app...” or “create another app…” if you have other applications.
3. Fill out the form:
   - **name**: Your app’s name, e.g., "research_scraper"
   - **app type**: Choose 'script'.
   - **description**: (Optional) Describe what your app will do.
   - **about url**: (Optional).
   - **redirect uri**: Use `http://localhost:8080` (for most script applications).
4. Click “create app”.

You will be provided with a `client_id` and `client_secret`. These will be needed for the data scraping.

**How to use this Colab notebook?**
* Make a copy of the notebook to your own Google Drive via the "File" menu above. Note: If you don't have a Google account but you are an Aalto University employee or student, you can log in to Google using your Aalto email, which takes you to the Aalto SSO sign-in page. This gives you access to an Aalto Google Drive.
* Click "connect" in the top-right menu to make Colab create a Linux virtual machine for you.
* Progress top-down, following the instructions.

**New to Colab notebooks?**

Colab notebooks are browser-based learning environments consisting of *cells* that contain either text or code. The code is executed in a Google virtual machine instead of on your own computer. You can run code cell by cell (click the "play" symbol of each code cell) or run everything by selecting "Run all" from the "Runtime" menu. For more info, see Google's [intro video](https://www.youtube.com/watch?v=inN8seMm7UI) and [curated example notebooks](https://colab.google/notebooks/).


## Initial setup
Do the following:
* Click the play icon (the triangle) to run the setup code.
* Optional: Click "Show code" to see the code.
* Wait for the setup to complete.

In [None]:
# @title
#Install third-party packages. A "!" at the beginning of a line means the line is run as a Linux shell command
!pip install praw
!pip install plyer
!pip install openpyxl
!pip install validators

#import all packages
import praw
from praw.models import MoreComments
import pandas as pd
import time
import string
import sys
import re
from plyer import notification
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
import getpass
from collections import defaultdict
import validators

print("\nSetup complete!")

## Authentication

Do the following:

* Click the play icon to run the code below
* Enter your Reddit client id and client secret when requested


In [9]:
# @title
print("Please enter your Reddit client id:")
client_id=getpass.getpass()
print("Please enter your Reddit client secret:")
client_secret=getpass.getpass()

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    redirect_uri="http://localhost:8080",
    user_agent="none",  #apparently, the app name is not really needed
    check_for_async=False
)
reddit.read_only = True

Please enter your Reddit client id:
··········
Please enter your Reddit client secret:
··········


## Scraping (demonstration)
The code below fetches posts and their comments based on a list of URLs. The URLs were obtained by searching the Elden Ring subreddit (r/Eldenring) as shown in the screenshot below, with the goal of understanding what makes the game memorable.

<img src="https://raw.githubusercontent.com/PerttuHamalainen/LLMCode/master/images/reddit_search.png" alt="screenshot of Reddit search" width="700"/>
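The scraping code in this section identifies each post by the short ID embedded in its URL (the segment right after `/comments/`). As a minimal sketch of that step, separate from the notebook's own implementation, the ID can be pulled out with a regular expression. The helper name `extract_post_id` is hypothetical, and the pattern assumes the common `/r/<subreddit>/comments/<id>/...` URL shape:

```python
import re

# Matches the post ID in URLs shaped like
# https://www.reddit.com/r/<subreddit>/comments/<post_id>/<slug>/
# (assumption: IDs consist of letters and digits only)
_POST_ID_PATTERN = re.compile(r"/comments/([A-Za-z0-9]+)")

def extract_post_id(url):
    """Return the post ID found in a Reddit post URL, or None if there is none."""
    match = _POST_ID_PATTERN.search(url)
    return match.group(1) if match else None

# A post URL yields its ID
print(extract_post_id(
    "https://www.reddit.com/r/Eldenring/comments/124wccb/"
    "what_were_some_your_greatest_holy_shit_moments_in/"
))  # prints 124wccb

# A subreddit front page has no post ID
print(extract_post_id("https://www.reddit.com/r/Eldenring/"))  # prints None
```

Returning `None` for a URL that does not point to a post lets the caller skip it with a message instead of stopping with an error.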

Do the following:

* Run the code by clicking the play icon.
* Inspect the obtained data by clicking on the table icon in the output view.
* View the code to see how the list of URLs was defined.


In [7]:
# @title
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values.
urls=["https://www.reddit.com/r/Eldenring/comments/124wccb/what_were_some_your_greatest_holy_shit_moments_in/",
      "https://www.reddit.com/r/Eldenring/comments/1e2hfhy/what_are_your_most_memorable_moments_youve/",
      "https://www.reddit.com/r/Eldenring/comments/19ea8hd/what_was_the_most_memorable_momentarea_in_the/",
      "https://www.reddit.com/r/Eldenring/comments/11g6rys/what_was_your_most_memorable_moment_from_your/",
      "https://www.reddit.com/r/Eldenring/comments/zwjcgs/what_is_your_most_memorable_jawdropping_moment_in/",
      "https://www.reddit.com/r/Eldenring/comments/tlbdpe/what_for_you_is_the_most_amazing_moment_of_elden/",
      "https://www.reddit.com/r/Eldenring/comments/wbd9fh/what_is_your_most_unforgettable_moment_in_elden/",
      ]

sort_by_score=True #Set to True if you want the highest-scoring comments first

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing
def scrape_urls(urls):
  #Loop over urls (one thread per url)
  dfs=[] #we'll store one DataFrame per thread in this array
  for index,url in enumerate(urls):
    #get all messages in thread
    messages=[]
    if not validators.url(url):
      print(f"URL {url} is not valid.")
      continue
    #get id from url
    id=url.split(r"/comments/")[1].split("/")[0]

    #print some info
    submission = reddit.submission(id=id)
    print(f"Extracting comments from post {index+1}/{len(urls)}: {submission.title}")

    #Make the post the first message in the thread
    id=submission.id
    messages.append([
        submission.title+"\n\n"+submission.selftext,    #comment text
        0,                                      #message depth
        submission.id,                          #id
        "",                                   #parent (first message has no parent)
        submission.score,                   #score
        url                                     #url
    ])

    all_comments = submission.comments.list()
    for comment in all_comments:
        if isinstance(comment, MoreComments):
            continue
        if "[deleted]" in comment.body:
            continue
        messages.append(
            [comment.body,
            comment.depth+1,
            comment.id,
            comment.parent_id if comment.depth>0 else id,
            comment.score,
            "https://www.reddit.com"+comment.permalink  #permalink already starts with "/"
            ])
    #Convert to a Pandas DataFrame
    df = pd.DataFrame(messages, columns=['text', 'depth', 'id', 'parent_id', 'score', 'url'])

    #Sort by score, but only if requested via the sort_by_score parameter
    if sort_by_score:
      df=df.sort_values(by='score', ascending=False)

    ## Sort the df so that comments are below their parents
    n_removed=0
    # Ensure correct data types
    df["parent_id"]=df["parent_id"].astype(str)
    df["id"]=df["id"].astype(str)

    # Remove the "t1_" from the parent ids for correct linkage
    df["parent_id"]=df["parent_id"].str.replace("t1_","",regex=False)

    # Create a dictionary to hold children of each comment
    children_dict = defaultdict(list)
    for _, row in df.iterrows():
        parent_id = row['parent_id']
        if parent_id!="" and (not df["id"].isin([parent_id]).any()):
            print(f"Warning: Parent id {parent_id} of comment {row['id']} not found. The parent has probably been deleted. This comment will not be included in the output data.\n\n")
            #print(f"Comment: \n\n{row['comment']}\n\n")
            n_removed+=1
        children_dict[parent_id].append(row['id'])

    # Function to perform Depth First Search and get comments in order
    def get_thread_order(node_id, order_list):
        #print(f"Iterating node {node_id} with {len(children_dict[node_id])} children\n")
        order_list.append(node_id)
        for child_id in children_dict[node_id]:
            get_thread_order(child_id, order_list)

    # Start DFS from the root comments (where parent_id is "")
    final_order_list = []
    for root_id in children_dict[""]:
        get_thread_order(root_id, final_order_list)
    #print(f"ordered length {len(final_order_list)}\n")

    # Reorder DataFrame based on the computed order
    df = df.set_index('id').loc[final_order_list].reset_index()

    # Print some debug info:
    if n_removed>0:
      print(f"{n_removed} comments deleted due to missing parent comments.\n\n")

    # Store results
    dfs.append(df)
  #merge all dataframes
  if len(dfs)==0:
    return None
  df=pd.concat(dfs, ignore_index=True) #renumber rows so that the index is unique across threads
  return df
df=scrape_urls(urls)
df.to_excel("scraped.xlsx")
display(df)


Extracting comments from post 1/7: What were some your greatest “holy shit!” moments in your first playthrough?
Extracting comments from post 2/7: What are your most memorable moments you've experienced in Elden Ring? Are there any new ones with the DLC out? 
Extracting comments from post 3/7: What was the most memorable moment/area in The Lands Between for you?
Extracting comments from post 4/7: What was your most memorable moment from your first play through?
Extracting comments from post 5/7: What is your most memorable “jaw-dropping” moment in the game?
Extracting comments from post 6/7: What, for you, is the most amazing moment of Elden Ring?
Extracting comments from post 7/7: What is your most unforgettable moment in Elden Ring, Tarnished? Share your cherished memory.








4 comments deleted due to missing parent comments.




|  | id | text | depth | parent_id | score | url |
|---|---|---|---|---|---|---|
| 0 | 124wccb | What were some your greatest “holy shit!” mome... | 0 |  | 2386 | https://www.reddit.com/r/Eldenring/comments/12... |
| 1 | je15ftf | Mine were:\n\n- Taking the Telporter Trap to L... | 1 | 124wccb | 782 | https://www.reddit.com//r/Eldenring/comments/1... |
| 2 | je1he12 | godwyn's body? >!which one?!< | 2 | je15ftf | 149 | https://www.reddit.com//r/Eldenring/comments/1... |
| 3 | je1m5il | Both, but it was a bigger holy shit moment fin... | 3 | je1he12 | 130 | https://www.reddit.com//r/Eldenring/comments/1... |
| 4 | je1y1ad | So I googled different versions of “Fia hugs s... | 4 | je1m5il | 44 | https://www.reddit.com//r/Eldenring/comments/1... |
| ... | ... | ... | ... | ... | ... | ... |
| 478 | ii7anpi | Horrah loux’s transition cutscene. The cool f... | 1 | wbd9fh | 2 | https://www.reddit.com//r/Eldenring/comments/w... |
| 479 | ii7ajpc | Thinking that I found everything I could find ... | 1 | wbd9fh | 2 | https://www.reddit.com//r/Eldenring/comments/w... |
| 480 | ii7a1nu | Fighting the dragon near the start of the game... | 1 | wbd9fh | 2 | https://www.reddit.com//r/Eldenring/comments/w... |
| 481 | ii79lwo | My entire 3 day battle against Radahn, that mo... | 1 | wbd9fh | 2 | https://www.reddit.com//r/Eldenring/comments/w... |
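
The reordering step in `scrape_urls` places every comment directly below its parent by walking the thread depth-first. A toy version of that idea, using plain `(id, parent_id)` pairs instead of a DataFrame, might look like the sketch below. The function name `order_thread` and the sample IDs are made up for illustration:

```python
from collections import defaultdict

def order_thread(messages):
    """Order messages so that each one appears below its parent (depth-first).

    `messages` is a list of (id, parent_id) pairs; the root post has parent_id "".
    Messages whose parent is missing cannot be reached from the root and are dropped.
    """
    # Map each parent id to the ids of its direct children, keeping input order
    children = defaultdict(list)
    for message_id, parent_id in messages:
        children[parent_id].append(message_id)

    ordered = []

    def visit(message_id):
        # Emit the message first, then all of its replies, recursively
        ordered.append(message_id)
        for child_id in children[message_id]:
            visit(child_id)

    # Start from the root message(s), i.e. those without a parent
    for root_id in children[""]:
        visit(root_id)
    return ordered

toy = [
    ("post", ""),
    ("c1", "post"),
    ("c2", "post"),
    ("r1", "c1"),                      # reply to c1
    ("orphan", "deleted_parent"),      # parent is not in the data
    ("r2", "c2"),                      # reply to c2
]
print(order_thread(toy))  # prints ['post', 'c1', 'r1', 'c2', 'r2']
```

Siblings keep the order in which they appear in the input, so sorting by score before this step puts the highest-scoring replies first at each level of the thread.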


## Scraping (exercise for you)
Do the following:
* Go to [Reddit](https://www.reddit.com) in a new tab or window
* Think of your own research question and search for related posts
* Copy-paste the post URLs to the code below
* Run the code below
* Download the results: Open the Colab file browser on the left, hover the mouse over "scraped.xlsx", click on the "⋮" and select "Download"


In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values.
#Enter the URLs to scrape between the brackets, separated by commas.
#The brackets indicate that the variable "urls" is a list of multiple objects.
urls=["<my url here>",
      "<my second url here>",
      ]

#Set to True if you want the highest-scoring comments first
sort_by_score=True

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing
df=scrape_urls(urls)
if df is not None:
  df.to_excel("scraped.xlsx")
  display(df)

## Manual coding

Before we get to coding with LLMs, we ask you to code some data yourself. This data will be used to "teach" the model and validate its output.

We ask you to code 200 texts, or as many as you can during the session. Do the following:
* Open the scraped.xlsx file you downloaded above in Excel, and divide it into two files: scraped_to_be_coded.xlsx with the first 200 rows and scraped_uncoded.xlsx with the rest of the rows.
* Go to the manual coding tool at https://perttuhamalainen.github.io/LLMCode/ and upload scraped_to_be_coded.xlsx (the file with the first 200 rows). **Do not use any private browsing mode (incognito) with this tool**, since the tool stores your results in local storage on your computer and using private mode may result in loss of data.
* Code the texts following the instructions in the tool.
* Once you are finished, **download both the coded file and the coding logs** using the buttons on the left, as we will use these later.
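
If you would rather split the file in Colab than in Excel, a sketch along the following lines could work. It assumes `scraped.xlsx` was written by the scraping code above, so that its first column holds the row index, and `split_for_coding` is a hypothetical helper that is not part of LLMCode. Whether the coding tool expects a particular column layout is not checked here:

```python
import os
import pandas as pd

def split_for_coding(df, n_to_code=200):
    """Return (first n_to_code rows, remaining rows) of df, preserving row order."""
    to_code = df.iloc[:n_to_code]
    rest = df.iloc[n_to_code:]
    return to_code, rest

# Only runs when the scraped file is present in the working directory
if os.path.exists("scraped.xlsx"):
    # index_col=0: treat the first column as the row index written by to_excel
    scraped = pd.read_excel("scraped.xlsx", index_col=0)
    to_code, rest = split_for_coding(scraped)
    to_code.to_excel("scraped_to_be_coded.xlsx")
    rest.to_excel("scraped_uncoded.xlsx")
    print(f"{len(to_code)} rows to code, {len(rest)} rows left uncoded")
```

Both output files then appear in the Colab file browser and can be downloaded the same way as scraped.xlsx.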