# RWBY Wiki Scraper
This Jupyter notebook scrapes the transcripts of all the episodes from the RWBY Wiki. Run all of the cells to obtain .json files for the transcripts. I'll give a short description of what each cell does.

Running this cell creates the directory structure that the later cells write to.

In [2]:
!mkdir -p 'Main Series'
!mkdir -p 'Main Series/V1'
!mkdir -p 'Main Series/V2'
!mkdir -p 'Main Series/V3'
!mkdir -p 'Main Series/V4'
!mkdir -p 'Main Series/V5'
!mkdir -p 'Main Series/V6'
!mkdir -p 'Main Series/V7'
!mkdir -p 'Main Series/V8'
!mkdir -p Chibi
!mkdir -p 'Chibi/S1'
!mkdir -p 'Chibi/S2'
!mkdir -p 'Chibi/S3'
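
If shell commands are not available, the same layout can be created from Python. The following is a minimal sketch, not part of the original notebook: the helper name `make_dirs` and its parameters are made up for illustration, and the defaults simply mirror the folders listed above.

```python
import os

def make_dirs(root='.', volumes=8, chibi_seasons=3):
    """Create the 'Main Series/V*' and 'Chibi/S*' folders under root.

    make_dirs is a hypothetical helper; the defaults mirror the mkdir cell.
    """
    paths = []
    for v in range(1, volumes + 1):
        paths.append(os.path.join(root, 'Main Series', f'V{v}'))
    for s in range(1, chibi_seasons + 1):
        paths.append(os.path.join(root, 'Chibi', f'S{s}'))
    for path in paths:
        # exist_ok=True makes the call safe to repeat on an existing tree.
        os.makedirs(path, exist_ok=True)
    return paths
```

Calling `make_dirs()` with no arguments creates the folders in the current working directory.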

`test_if_valid(test_url)` takes the URL of the main wiki page of a given episode and returns a string that encodes the season number and episode number. For episodes which do not belong to either the main series or RWBY Chibi (like the Red Trailer or the World of Remnant episodes), the function returns `'Special'` instead.

Example Output:
1. `test_if_valid('https://rwby.fandom.com/wiki/War_(episode)')` will return `'V8E7'` as its output.
2. `test_if_valid('https://rwby.fandom.com/wiki/Geist_Buster')` will return `'S2E2'` as its output. Note that episodes from the main series use 'V' to indicate the season while episodes from RWBY Chibi use 'S' to indicate season. This can be used to differentiate between episodes of the two series.
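
To use these codes downstream, it can help to split them back into their parts. The sketch below assumes the codes always have the shape shown in the examples above (a `V` or `S`, a number, an `E`, and another number); `parse_episode_code` is a hypothetical helper and not part of the notebook.

```python
import re

# Matches codes such as 'V8E7' or 'S2E2' (assumed format, based on the examples).
CODE_PATTERN = re.compile(r'^([VS])(\d+)E(\d+)$')

def parse_episode_code(code):
    """Return (series, season, episode), or None for anything else (e.g. 'Special')."""
    match = CODE_PATTERN.match(code.strip())
    if match is None:
        return None
    prefix, season, episode = match.groups()
    # 'V' marks the main series, 'S' marks RWBY Chibi.
    series = 'Main Series' if prefix == 'V' else 'Chibi'
    return series, int(season), int(episode)

print(parse_episode_code('V8E7'))     # ('Main Series', 8, 7)
print(parse_episode_code('S2E2'))     # ('Chibi', 2, 2)
print(parse_episode_code('Special'))  # None
```

Unlike indexing a single character of the code, the regular expression also copes with season or episode numbers that have more than one digit.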



In [None]:
import requests
from bs4 import BeautifulSoup

def test_if_valid(test_url):
  # Fetch the given page and look for the subtitle link in its header.
  test_page = requests.get(test_url)
  test_soup = BeautifulSoup(test_page.content, 'html.parser')
  test_main_link = test_soup.find('div', class_='page-header__page-subtitle')
  if test_main_link is None:
    return 'Special'
  test_back_page = test_main_link.find('a')
  if test_back_page is None:
    return 'Special'
  # Follow that link and read the category list of the linked page.
  test_back_url = 'https://rwby.fandom.com' + test_back_page['href']
  test_page_back = requests.get(test_back_url)
  test_pbsoup = BeautifulSoup(test_page_back.content, 'html.parser')
  ul = test_pbsoup.find('ul', class_='categories')
  if ul is None:
    return 'Special'
  # find_all returns an empty list (never None) when nothing matches.
  check = ul.find_all('a')
  if not check:
    return 'Special'
  # Pick the first category that marks the page as an episode.
  proxy = None
  for c in check:
    if c.get('title') in ('Category:RWBY Chibi episodes', 'Category:Episodes'):
      proxy = c['title']
      break
  if proxy is None:
    return 'Special'
  # These cells hold the season (or volume) and the episode number, in that order.
  td = test_pbsoup.find_all('td', class_='pi-horizontal-group-item pi-data-value pi-font pi-border-color pi-item-spacing')
  if proxy == 'Category:RWBY Chibi episodes':
    season = td[0].getText()
    episode = td[1].getText()
    return 'S' + str(season) + 'E' + str(episode)
  # Main-series pages wrap the volume number in a link.
  volume = td[0].find('a').getText()
  episode = td[1].getText()
  return 'V' + str(volume) + 'E' + str(episode)

`get_dialogue(url)` takes the URL of a transcript's wiki page and returns the dialogue as a list of strings, one for each paragraph on the page.

In [None]:
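import requests
from bs4 import BeautifulSoup

def get_dialogue(url):
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')
  # The article body lives inside the 'mw-parser-output' div.
  div = soup.find('div', class_='mw-parser-output')
  if div is None:
    return []  # no article body found on this page
  # Collect the text of every paragraph in the body.
  return [p.getText() for p in div.find_all('p')]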
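
The strings returned by `get_dialogue` are raw paragraph text, so they may carry trailing newlines or be empty. A small clean-up pass can be sketched as follows; `clean_lines` is a hypothetical helper and the sample lines are made up for illustration, not taken from the wiki.

```python
def clean_lines(lines):
    """Strip surrounding whitespace and drop lines that end up empty."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned

# Made-up sample input in the shape get_dialogue returns (a list of strings).
sample = ['Ruby: Hello!\n', '\n', '  Weiss: Hi.  ']
print(clean_lines(sample))  # ['Ruby: Hello!', 'Weiss: Hi.']
```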

Running this cell saves a .json file for each episode. It currently pauses for 1 second between episodes, so it should take around 3-4 minutes to finish running. The output paths start with `/content`, the default working directory on Google Colab, so adjust them if you run the notebook elsewhere. I couldn't find any details about how frequently requests can be made to the Fandom website, so do let me know if there are any official guidelines for bots.

In [None]:
import requests
import json
from bs4 import BeautifulSoup
from time import sleep

# Collect the links to every page listed in the Transcripts category.
transcripts_url = 'https://rwby.fandom.com/wiki/Category:Transcripts'
transcripts_page = requests.get(transcripts_url)
soup = BeautifulSoup(transcripts_page.content, 'html.parser')
results = soup.find(id='mw-content-text')
links = results.find_all('a', class_='category-page__member-link')
for i, link in enumerate(links):
  url = 'https://rwby.fandom.com' + link['href']
  print(i)  # progress indicator
  sleep(1)  # pause between episodes to go easy on the wiki
  test = test_if_valid(url)
  if test != 'Special':
    text = get_dialogue(url)
    # The season is everything between the leading letter and 'E' (e.g. '8' in 'V8E7').
    season = test[1:test.index('E')]
    # 'S' marks RWBY Chibi episodes, 'V' marks main-series episodes.
    folder = 'Chibi/S' if test[0] == 'S' else 'Main Series/V'
    with open('/content/' + folder + season + '/' + test + '.json', 'w') as json_file:
      json.dump(text, json_file)
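
Once the files are written, they can be read back for analysis. This is a minimal sketch that assumes the folder layout and file names produced by the cell above; `load_transcripts` is a hypothetical helper, and the root folder you pass in depends on where you ran the notebook.

```python
import json
from pathlib import Path

def load_transcripts(root):
    """Map each episode code (the file name without .json) to its list of lines."""
    transcripts = {}
    # rglob walks every subfolder, so this finds files in both series at once.
    for path in sorted(Path(root).rglob('*.json')):
        with open(path, encoding='utf-8') as json_file:
            transcripts[path.stem] = json.load(json_file)
    return transcripts
```

For example, `load_transcripts('/content')` would return a dictionary keyed by codes such as `'V8E7'` or `'S2E2'`, assuming the scraper wrote its files there.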