# 4plebs Archive Scraper

Max Rizzuto | DFRLAB 2022

------
This notebook uses Selenium and Firefox's GeckoDriver to retrieve the results of a structured query on the 4chan archive [archive.4plebs.org](archive.4plebs.org).


### Step-by-Step of the building process

This scraper will be built in the following order:
* Loading the page / inputting the queries
* Iterating through pages - I like to get this done first.
* Identifying the list of post elements
* Identify the elements within each post
* Iterating through the list of page elements
    * Creating a dictionary output for each page
    * Adding each dictionary to either a list or a central dictionary
* A function that will call all of the above functions, a loop to drive the whole thing.  
* A function that exports the results as a csv

Because there are so many results for our search terms of interest, this scraper has to run totally unattended. This means recognizing when all the results from one search have been collected and moving on to the next automatically. 

__Observations:__

_Loading the page_ ---------------------------------------------------------------------------------------------->

Boolean search does not appear to work on 4plebs. This isn't the end of the world; it just means it's going to take longer to run.
Because we will likely pull the same posts for various search terms, we're going to add the search term to the post results dictionary as metadata so we can de-dup later.
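
A minimal sketch of what that later de-dup step could look like, using plain dictionaries. The helper name `dedupe_posts`, the `search_terms` key, and the sample rows are all made up for illustration; only `post_id` and `search_term` come from the scraper's own output.

```python
def dedupe_posts(posts):
    """Collapse rows that share a post_id, collecting every search term that matched."""
    merged = {}
    for post in posts:
        pid = post["post_id"]
        if pid not in merged:
            # First sighting: copy the row and start a list of matching terms.
            merged[pid] = {**post, "search_terms": [post["search_term"]]}
        elif post["search_term"] not in merged[pid]["search_terms"]:
            # Seen before under a different term: just record the extra term.
            merged[pid]["search_terms"].append(post["search_term"])
    return list(merged.values())


# Made-up rows: the same post captured under two different search terms.
rows = [
    {"post_id": "101", "search_term": "Roof", "post_text": "example"},
    {"post_id": "101", "search_term": "Charleston", "post_text": "example"},
    {"post_id": "202", "search_term": "Roof", "post_text": "another"},
]
unique = dedupe_posts(rows)
print(len(unique), unique[0]["search_terms"])
```

Keeping the list of matching terms, rather than dropping the duplicates outright, preserves the information about which queries surfaced a given post.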

Also, rather than using the search bar to search for results, we're going to use the URL itself (which is structured somewhat like an API). 
> ex: [archive.4plebs.org/{board}/search/text/{query}/](_)
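
A small sketch of assembling such a URL, assuming the path layout that ```get_site``` uses further down (```start```, ```end```, and ```page``` segments). The helper name `build_query_url` is hypothetical, and percent-encoding the term with `urllib.parse.quote` is an assumption about what the site accepts for multi-word terms.

```python
from urllib.parse import quote


def build_query_url(board, query, start_date=None, end_date=None, page=None):
    """Assemble a 4plebs search URL from its path segments (hypothetical helper)."""
    # Percent-encode the term so spaces and non-ASCII characters are URL-safe.
    parts = [board, "search", "text", quote(query, safe="")]
    if start_date:
        parts += ["start", start_date]
    if end_date:
        parts += ["end", end_date]
    if page is not None:
        parts += ["page", str(page)]
    return "http://archive.4plebs.org/" + "/".join(parts) + "/"


print(build_query_url("pol", "reddit rules"))
print(build_query_url("pol", "Roof", "2015-06-17", "2017-08-09", page=3))
```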

I'm going to develop the scraper on a search that only produces a few results; in this case it's the search term "reddit rules", which returns 80 results.

Something else that occurs to me is that we need to manage the list of search terms, feeding the next term to the ```get_site``` function once no more results remain for the current one.

Lmao, I just ran the scraper for the first time and it detected the robotic user agent and blocked the request.
They say that crawling like this is not necessary because everything is backed up on archive.org. This is only partly true and I will continue to work around this.
They caught us because geckodriver announces itself in the browser's user agent. We can generate a fake user agent to appear more authentic.

This worked. I found some code [here](https://stackoverflow.com/questions/29916054/change-user-agent-for-selenium-web-driver), dropped it in and it worked great. 

But now I am on edge. If they put this anti-crawling measure in place, there may be others. We need to write the scraper to save progress iteratively so that if we are caught later down the line we don't lose all of our progress. 
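
One way to make those saves crash-safe, sketched with the standard library only. The notebook itself caches with pandas' `to_pickle`; the helpers `save_checkpoint` and `load_checkpoint` below are hypothetical stand-ins that write to a temporary file first, so an interrupted write cannot leave a corrupted cache behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(data, path):
    """Write data to path atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same folder, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(data, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved data, or an empty list if no checkpoint exists yet."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

A crawl could then begin with `data = load_checkpoint("cache/cache.pkl")` and call `save_checkpoint(data, "cache/cache.pkl")` every few pages.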

_Iterating through pages_ ---------------------------------------------------------------------------------------->

At the bottom of the page I see this very nice, very static "next" button for advancing to the next page. We're going to use that because it's not going to move a whole lot.
The list element that the next button is a child of also includes information about its status: when there are no more pages to advance to, the class of the ```li``` changes from "next" to "next disabled". We're going to use that to tell us when we've reached the last page. 
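
The check itself boils down to inspecting that class string. The notebook compares the whole string against "next"; the sketch below is a slightly more tolerant variant that splits the attribute into tokens, so extra classes would not break it. The helper name `has_next_page` is hypothetical.

```python
def has_next_page(class_attr):
    """True if the pagination <li> carries 'next' without the 'disabled' token."""
    # get_attribute("class") can return None, so fall back to an empty string.
    tokens = (class_attr or "").split()
    return "next" in tokens and "disabled" not in tokens


for value in ("next", "next disabled", None):
    print(repr(value), has_next_page(value))
```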

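Selenium's ```find_elements_by_{something}``` is really interesting and useful. When using ```find_element_by_{something}``` (with ```element``` singular) it will return the first matching element, and raise a ```NoSuchElementException``` if there is none. When using ```elements``` plural it will return a list of matches. Note that these ```find_element_by_*``` helpers belong to the older Selenium API; Selenium 4.3 and later removed them in favor of ```find_element(By.CLASS_NAME, ...)``` and friends.

We figured out the syntax for ```get_attribute("class")```. It's frustrating to forget the difference between ```find_element``` and ```get_attribute```: the first locates an element, the second reads an attribute off an element you already have. 

I encountered an odd error where the next button itself could not be clicked, but its child element could. You can see in the next button function the variable ```clickable_next_button```, which uses ```find_element_by_xpath``` to step down to the child element, at which point I can click.  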

_Identifying the list of posts_ --------------------------------------------------------------------------------->

This step has been really easy. The posts are stored in an ```<aside>``` block with the unique class "posts", so we can just find it using ```find_element_by_class_name("posts")```. Note that this is the outer element; we want the inner list, so we use ```find_elements_by_xpath("./*")``` to navigate to its child elements. 

_Identifying the specific elements from each post that we want to capture_ -------------------------------------->
I stepped through one of the items in the posts list and pulled out each discrete value of interest. 
I've included the original process here. This was done first to make the second iteration of this process easier:
```
posts[0].find_element_by_tag_name("header").text
posts[0].get_attribute('id')
post_title = posts[0].find_element_by_tag_name('h2').text
post_author = posts[1].find_element_by_class_name("post_author").text
post_tripcode = posts[1].find_element_by_class_name("post_tripcode").text
post_hash = posts[1].find_element_by_class_name("poster_hash").text
post_datetime = posts[1].find_element_by_tag_name("time").get_attribute("datetime")
post_type = posts[1].find_element_by_class_name("post_type").find_element_by_xpath("./*").get_attribute("href")
post_replies = 
post_text = posts[4].find_element_by_class_name("text").text
```

_Iterating through the list of page elements: creating a dictionary output for each page, and adding each dictionary to either a list or a central dictionary_ ----------------------------------------->
This turned out to be extremely straightforward. You can see the function ```gather_page_results()``` and how it compares to the above code block where I hashed out each of the values we wanted to extract.

I changed the variables into dictionary entries to save memory. You can declare your values directly in the dictionary, no problem. 

I decided to go for the list of dictionaries because it plays extremely well with pandas, and I already know how it works. You can take the output of that function and put it straight into a dataframe, like so: ```pd.DataFrame(gather_page_results(wd, term))```.

UPDATE: it's now Monday the 11th, and with a fresh pair of eyes I can see that this isn't 100% going to work. We need to make the code adaptable enough to work even when elements do not exist. This means that I am going to have to use the plural ```find_elements_by_x``` approach mentioned earlier, because it returns an empty list rather than raising an error when a desired value does not exist. This empty-list output, coupled with a list comprehension, is really going to save us a lot of heartache. This will affect the output: each value accessed this way will be nested in a list of unknown length. It's my hope that the lists only ever contain one item, but it's likely that there will be variation. This will create some work for us later in exchange for saving us time now. 
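
To illustrate the "work for us later", here is a rough sketch of unwrapping those list-valued fields. The helpers `first_or_blank` and `flatten_post` and the sample dictionary are hypothetical. Taking only the first item silently discards any extras, so this is just one option; the export section of this notebook uses pandas' `explode` instead.

```python
def first_or_blank(values, default=""):
    """Return the first item of a plural-lookup result, or a default if it is empty."""
    return values[0] if values else default


def flatten_post(post_values):
    """Unwrap every list-valued field; leave plain values untouched."""
    return {
        key: first_or_blank(value) if isinstance(value, list) else value
        for key, value in post_values.items()
    }


# Made-up sample shaped like one entry from gather_page_results().
sample = {
    "post_id": "123",
    "post_title": [],                  # element missing on the page
    "post_author": ["Anonymous"],      # the common single-item case
    "post_text": ["first", "second"],  # more than one match
}
print(flatten_post(sample))
```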

UPDATE: it's now Wednesday the 13th and I've discovered that I incorrectly called ```posts``` where I should have called ```post```. This was a simple but fatal mistake and a consequence of not testing / reviewing results thoroughly enough. I have fixed the issues and added new values to the dictionary results. This will require me to re-run the code and capture a new set of data, so I cleared the cache and output folders.

_A function that will call all of the above functions, a loop to drive the whole thing._ ----------------------->
I was preparing for the worst here. After I ran the scraper a couple of times and encountered rate limiting, I decided to build in some redundancies to allow us to save progress in the event that we continued to encounter errors.
A consequence of this is some added complexity in this otherwise pretty simple function. 
You will see the variable ```start_at_page``` is doing some strange things. It is declared in the _Input Parameters_ area below and is designed to allow the crawler to pick up where it left off after a crash in order to avoid duplicate work. In this function it is declared global and used once, at which point it is reset to zero. This is so that the first run picks up at page _x_ of the archive results, while every following search term starts at zero. If the ```start_at_page``` variable is used, users need to truncate the list of terms so that it contains only the terms they have not yet captured. 
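
The same resume behavior can be expressed without a global. The sketch below is an alternative, not what the notebook does: the hypothetical generator `plan_start_pages` pairs each term with the page it should start on, so only the first term resumes mid-way.

```python
def plan_start_pages(search_terms, start_at_page=0):
    """Yield (term, first_page) pairs; only the first term resumes mid-way."""
    for index, term in enumerate(search_terms):
        yield term, (start_at_page if index == 0 else 0)


# Resuming a crashed crawl at page 12 of the first remaining term.
print(list(plan_start_pages(["Poway", "Earnest"], start_at_page=12)))
```

As in the notebook, the list of terms still has to be truncated by hand to the ones not yet captured.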

Down to brass tacks:
We've got two vital loops here: 
>An outer loop that iterates through the list of search terms we plan to query. <br>
>It runs the function to gather page results, adds the results to a ```data``` variable, and initiates the inner loop...

>The inner loop is a while loop that continues to run the ```advance_to_next_page``` function and add gathered results to the ```data``` variable. It also creates caches every 5 pages for redundancy.<br>
>The while loop is broken when the next page button is no longer available and the ```advance_to_next_page``` function returns ```False```, at which point the window is abandoned and a new window is opened for the next term.  
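
Stripped of the browser, the shape of those two loops can be sketched as follows. Everything here is a stand-in: `crawl`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, with `fetch_page` playing the role of loading and parsing one page and reporting whether a next page exists.

```python
def crawl(search_terms, fetch_page, cache_every=5, save=None):
    """Drive the two loops: terms on the outside, result pages on the inside.

    fetch_page(term, page) is a hypothetical stand-in for loading and parsing
    one page; it returns (rows, has_next).
    """
    data = []
    pages_seen = 0
    for term in search_terms:          # outer loop: one pass per search term
        page = 0
        has_next = True
        while has_next:                # inner loop: runs until no next page
            rows, has_next = fetch_page(term, page)
            data += rows
            pages_seen += 1
            # Periodically hand the accumulated data to the cache callback.
            if save is not None and pages_seen % cache_every == 0:
                save(data)
            page += 1
    return data


# Toy stand-in for the site: each term has a fixed number of result pages.
FAKE_PAGES = {"alpha": 3, "beta": 2}


def fake_fetch(term, page):
    rows = [{"search_term": term, "page": page}]
    return rows, page + 1 < FAKE_PAGES[term]


checkpoints = []
results = crawl(["alpha", "beta"], fake_fetch, cache_every=2,
                save=lambda d: checkpoints.append(len(d)))
print(len(results), checkpoints)
```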

_A function that exports the results as a csv_--------------------------------------------------------------------->
I'm not going to dwell on this too much, but the last bit of the process is done pretty much entirely in pandas (a Python data science library), and I'm trying some new organization below which I'm not 100% sold on. 
Basically the section titled "_Exporting Processed Data_" adds a couple of new columns to the data, cleans up some loose ends, and ships the data as a csv in a form that is more polished than the pickle outputs. 

## Import Libraries

In [1]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

from fake_useragent import UserAgent

from time import sleep
import pandas as pd

## Input Parameters
The two cells below can be used interchangeably.<br>
For queries on just one set of terms and one time span, use the first of the two cells below.<br>
For queries on multiple terms over multiple time spans, possibly across multiple boards, use the second of the two cells and write out a dictionary element for each.

In [6]:
start_at_page = 0
board = "pol"
search_terms = ["Tarrant"] # Must be list 
start_date = "2019-03-14"
end_date = "2022-07-14"

In [7]:
start_at_page = 0
queries = {
           3:  {"board": "pol",
                "terms": ["Roof"],
                "start_date": "2015-06-17",
                "end_date": "2017-08-09"
               },
#            4:  {"board": "pol",
#                 "terms": ["Breivik", "Anders"],
#                 "start_date": "2011-07-22",
#                 "end_date": "2022-07-14"
#                }
}
#            0: {"board": "pol",
#                 "terms": ["Tarrant"],
#                 "start_date": "2019-03-14",
#                 "end_date": "2022-07-14"
#                },
#            1:  {"board": "pol",
#                 "terms": ["Earnest", "Poway"],
#                 "start_date": "2019-04-27",
#                 "end_date": "2022-07-14"
#                },
#            2:  {"board": "pol",
#                 "terms": ["El Paso", "Crusius"],
#                 "start_date": "2019-08-03",
#                 "end_date": "2022-07-14"
#                },
#            3:  {"board": "pol",
#                 "terms": ["Charleston", "Roof", "Dylann"],
#                 "start_date": "2015-06-17",
#                 "end_date": "2022-07-14"
#                },
#            4:  {"board": "pol",
#                 "terms": ["Utøya", "Utoya", "Breivik", "Anders"],
#                 "start_date": "2011-07-22",
#                 "end_date": "2022-07-14"
#                }
#          }

## Functions

In [8]:
def get_site(page, board, search_term, start_date, end_date):
    """ Takes query parameters as input (the board of interest, the search term of interest, the start and end dates). 
        Loads archive.4plebs.org with a random user agent to get around anti-crawling measures. 
    """
    
    # Generate User Agent
    ua = UserAgent()
    user_agent = ua.random

    # Set up webdriver with user agent... 
    profile = webdriver.FirefoxProfile()
    profile.set_preference("general.useragent.override", user_agent)
    wd = webdriver.Firefox(profile)
    
    # Make URL with parameters:
    query_url = f"http://archive.4plebs.org/{board}/search/text/{search_term}/start/{start_date}/end/{end_date}/page/{page}/"
    
    # Get website with parameters. 
    try:
        wd.get(query_url)
        sleep(2)
        print("Page Loaded")
    except Exception as e:
        print(f"Error: {e}")
                
    return wd


def advance_to_next_page(wd):
    """ Takes the current webdriver as input.
        Advances to the next page and returns True when possible. When the "next" button is inactive it returns False.  
    """  
    next_button = wd.find_element_by_class_name("next")
    button_status = next_button.get_attribute("class")
    clickable_next_button = next_button.find_element_by_xpath("./*")
    
    if button_status == "next":
        # Return something that will continue the loop.    
        next_button.location_once_scrolled_into_view
        clickable_next_button.click()
        return True
                #advance_to_next_page()
    else: 
        # Return something that will kill the "next page" loop and will move onto the next search term.
        print("Captured all results for this term, advancing to next term")
        return False
    

def gather_page_results(wd, term):
    """ Takes most recent webdriver state and the search term used for embedding in page results data as input.
        Returns a list of dictionary elements as output containing parsed page results.
    """
    # A list of all the post elements on the current page
    posts = wd.find_element_by_class_name("posts").find_elements_by_xpath("./*")
    
    # A list to store all the post data once its been parsed
    post_list = []
    
    # Post Parsing Loop - makes dictionary keys and values for each post element. 
    for post in posts:   
        
        post_values = {"search_term": term,
                       "current_url": wd.current_url,
                       "post_id": post.get_attribute('id'),
                       "post_archive_url": [v.get_attribute("href") for v in post.find_elements_by_link_text("No.")],
                       "post_file": [v.text for v in post.find_elements_by_class_name("post_file_filename")],
                       "post_title": [v.text for v in post.find_elements_by_tag_name("h2")],
                       "post_author": [v.text for v in post.find_elements_by_class_name("post_author")],
                       "post_tripcode": [v.text for v in post.find_elements_by_class_name("post_tripcode")],
                       "post_hash": [v.text for v in post.find_elements_by_class_name("poster_hash")],
                       "post_datetime": post.find_element_by_tag_name("time").get_attribute("datetime"),
                       "post_header": [v.text for v in post.find_element_by_class_name("post_type").find_elements_by_xpath("..")],
                       "post_type": [v.get_attribute('href') for v in post.find_element_by_class_name("post_type").find_elements_by_xpath("./*")],
                       "post_text": [v.text for v in post.find_elements_by_class_name("text")]
                      }
        
        post_list.append(post_values)
    
    return post_list


def run_code(board, search_terms, start_date, end_date):
    """ Takes parameters to run other functions as input.
        Makes periodic caches of pkl files in the cache folder and saves per-term results in the results folder.
        Returns a list of dictionary elements for each post from each page of results.
    """
    
    # make start at page a global variable
    global start_at_page
    
    # a counter for saving a cache of progress
    i = 0
    
    # a list to contain data from every page of results
    data = []
    
    # loop for each term
    for term in search_terms:
        # Determine the page to start on if the crawl was interrupted.
        # Although you need to manually pare down the list of search terms to scrape,
        # this start_at_page variable resets itself to zero after one use,
        # ensuring later terms start from the first page.
        
        if start_at_page != 0:
            wd = get_site(start_at_page, board, term, start_date, end_date)
            start_at_page = 0
        else: 
            wd = get_site(0, board, term, start_date, end_date)
                
        sleep(5)
        
        data += gather_page_results(wd, term)
        i += 1
        # while the next button is operable, click it and run the loop below.
        while advance_to_next_page(wd) == True:
            sleep(10)
            # See if the page gave us a search limit exceeded error.
            # If so, wait 20 seconds, refresh the page, try to gather results again. 
            try:
                data += gather_page_results(wd, term)
                i += 1
                if i % 5 == 0:
                    print(f"Caching: {i}", end="\r")
                    pd.DataFrame(data).to_pickle(f"cache/cache.pkl")
            
            except Exception:
                print(wd.find_element_by_class_name("alert").text)
                sleep(20)
                wd.refresh()  # must be called; the bare attribute did nothing
                
                data += gather_page_results(wd, term)
                i += 1
                if i % 5 == 0:
                    print(f"Caching: {i}", end="\r")
                    pd.DataFrame(data).to_pickle(f"cache/cache.pkl")
            
        else:
            pass


        pd.DataFrame(data).to_pickle(f"results/term_{term}.pkl")
        
    pd.DataFrame(data).to_pickle(f"results/output.pkl")
    return data


# Run the Program
Below are code blocks for two different ways to run the code. <br>
If you intend to use just one set of terms, dates, and boards (as specified in the first of the two __Input Parameters__ blocks), then use the first of the cells below.<br>
If you intend to run a number of terms over a number of time spans and a number of boards, use the second and third cells below: the cell defining the function ```dictionary_input_run_code``` and the cell that calls it. 


__WAIT__, before running the code you need to make some folders in this notebook's working directory.
* One folder named: "results"
* another named: "cache"

Once that is done you can proceed.

A cache file will be made in the ```cache``` folder and updated periodically throughout the scraping process.
<br>Once finished, an output file will be made in the ```results``` folder.

In [None]:
out = run_code(board, search_terms, start_date, end_date)

If you ran the code above, then proceed to the _Exporting Processed Data_ section below.

In [9]:
def dictionary_input_run_code(queries):
    """ Takes parameters to run other functions as input.
        Makes periodic caches of pkl files in the cache folder and saves per-term results in the results folder.
        Returns a list of dictionary elements for each post from each page of results.
    """
    
    # make start at page a global variable
    global start_at_page
    
    # a counter for saving a cache of progress
    i = 0
    
    # a list to contain data from every page of results
    data = []
    
    for k, v in queries.items():
        #board, search_terms, start_date, end_date
        board = v['board']
        search_terms = v['terms']
        start_date = v['start_date']
        end_date = v['end_date']
        
        
        # loop for each term
        for term in search_terms:
            # Determine the page to start on if the crawl was interrupted.
            # Although you need to manually pare down the list of search terms to scrape,
            # this start_at_page variable resets itself to zero after one use,
            # ensuring later terms start from the first page.

            if start_at_page != 0:
                wd = get_site(start_at_page, board, term, start_date, end_date)
                start_at_page = 0
            else: 
                wd = get_site(0, board, term, start_date, end_date)

            sleep(5)

            data += gather_page_results(wd, term)
            i += 1
            # while the next button is operable, click it and run the loop below.
            while advance_to_next_page(wd) == True:
                sleep(10)
                # See if the page gave us a search limit exceeded error.
                # If so, wait 20 seconds, refresh the page, try to gather results again. 
                try:
                    data += gather_page_results(wd, term)
                    i += 1
                    if i % 5 == 0:
                        print(f"Caching: {i}", end="\r")
                        pd.DataFrame(data).to_pickle(f"cache/cache.pkl")

                except Exception:
                    print(wd.find_element_by_class_name("alert").text)
                    sleep(20)
                    wd.refresh()  # must be called; the bare attribute did nothing

                    data += gather_page_results(wd, term)
                    i += 1
                    if i % 5 == 0:
                        print(f"Caching: {i}", end="\r")
                        pd.DataFrame(data).to_pickle(f"cache/cache.pkl")

            else:
                pass


            pd.DataFrame(data).to_pickle(f"results/term_{term}.pkl")

    pd.DataFrame(data).to_pickle(f"results/output.pkl")
    return data


In [10]:
out = dictionary_input_run_code(queries)

Page Loaded
Captured all results for this term, advancing to next term


# Exporting Processed Data:
Using Pandas to turn the output into a nice looking csv for further analysis.

In [13]:
def explode_list_columns(df):
    print("Exploding list cols...")
    return(df.explode(['post_title', 'post_author', 'post_tripcode', 'post_header','post_text', 'post_archive_url'])
             .explode(['post_hash'])
             .explode(['post_file'])
             .fillna("")
          )

def manage_flags(df):
    print("Extracting flag info...")
    return(df.assign(flag_lst = lambda x: x["post_type"].fillna("[]").apply(lambda y: "" if y == [] else y[0]),
                     post_flag = lambda x: x["flag_lst"].str.split("/").fillna("[]").apply(lambda y: "" if (len(y) == 1) else y[-2]))
              .drop(["post_type", "flag_lst"], axis="columns")
          )

def get_thread_id(df):
    print("Getting thread ID...")
    return(df.assign(thread_id = lambda x: x['post_archive_url'].str.split("/").apply(lambda y: y[-2])))

def manage_thread_replies(df):
    print("Counting thread replies...")
    return(df.assign(thread_replies = lambda x: x['post_header'].astype(str).str.extract(r"(?<=Replies\: )(\d+)"))
              .drop("post_header", axis='columns')
              .fillna("")
          )

def extract_text_info(df):
    print("Extracting text info...")
    extracted_reply_no  = df.assign(replying_to = lambda x: x['post_text'].str.findall(r"(?<=\>\>)(\d+)").apply(set).apply(list))

    unique_mentions_set = (extracted_reply_no.drop_duplicates(subset='post_id', keep='last')
                               .explode('replying_to')[['post_id', 'replying_to']].groupby('replying_to').count().sort_values(by='post_id')
                          )
    _df = pd.merge(extracted_reply_no, unique_mentions_set,
             how = 'left',
             left_on = 'post_id',
             right_on = 'replying_to'
            ).rename({'post_id_x':'post_id', 'post_id_y':'replies_in_crawl'}, axis='columns')
    
    return(_df)

def account_for_dates(df):
    print("Adjusting dates to UTC...")
    return (df.assign(dt_split = lambda x: x['post_datetime'].str.split("-"),
                     dt_hour = lambda x: x['dt_split'].apply(lambda y: pd.Timestamp(y[-1]).hour),
                     post_datetime_utc = lambda x: x[['dt_split', 'dt_hour']].apply(lambda y: pd.to_datetime("-".join(y['dt_split'][:-1]))-pd.to_timedelta(y['dt_hour'], unit='h'), axis=1)
                    )
               .drop(['dt_split','dt_hour'], axis=1)
           )
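
To make the string handling in the functions above easier to follow, here is a standalone sketch of the same extractions using only `re`. The helper names and the sample strings are made up, and the URL shape in particular is an assumption about what `post_archive_url` looks like.

```python
import re

# Same patterns as in manage_thread_replies and extract_text_info,
# written without the unnecessary escapes.
REPLY_COUNT = re.compile(r"(?<=Replies: )(\d+)")
QUOTED_POST = re.compile(r"(?<=>>)(\d+)")


def reply_count(header_text):
    """Pull the number that follows 'Replies: ' out of a post header, if any."""
    match = REPLY_COUNT.search(header_text)
    return match.group(1) if match else ""


def quoted_post_ids(post_text):
    """Return the unique post numbers referenced as >>12345 in a post body."""
    return sorted(set(QUOTED_POST.findall(post_text)))


def thread_id_from_url(url):
    """Take the second-to-last path segment, as get_thread_id does."""
    return url.split("/")[-2]


print(reply_count("[Reply] Replies: 12"))
print(quoted_post_ids(">>123 agreed, see >>456 and >>123"))
print(thread_id_from_url("https://archive.4plebs.org/pol/thread/111/#222"))
```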


In [14]:
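# Build the DataFrame from the crawl output; `df` is not defined anywhere above.
# (Alternatively, load a saved run with pd.read_pickle("results/output.pkl").)
df = pd.DataFrame(out)

_df = (df.reset_index(drop=True)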
    .pipe(explode_list_columns)
    .pipe(manage_flags)
    .pipe(get_thread_id)
    .pipe(manage_thread_replies)
    .pipe(extract_text_info)
    .pipe(account_for_dates)
     [["search_term", "thread_id", "post_id", "post_datetime", "post_datetime_utc", "post_text", "post_file", "post_title", "post_author", "post_tripcode", "post_hash", "post_flag", "thread_replies", "replies_in_crawl", "replying_to", "current_url", "post_archive_url"]] 
)

_df.to_csv('results/output.csv')