<a href="https://colab.research.google.com/github/DestructionCatalyst/TouhouDialogueGenerator/blob/main/THWiki_parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from bs4 import BeautifulSoup

Sample parsing

In [None]:
in_file = open("TH06_Reimu.htm", encoding="utf-8") # File I downloaded from the wiki for testing
html_code = in_file.read()

In [None]:
in_file.close()

In [None]:
soup = BeautifulSoup(html_code, 'html.parser') # Name the parser explicitly so the result does not depend on what is installed

In [None]:
for match in soup.find_all('sup', {'class': 'reference'}):
    match.decompose() # Delete all references

In [None]:
tags_to_replace_with_children = ['a', 'p', 'b', 'i'] # All links, paragraphs, bold and italic tags

In [None]:
for tag in tags_to_replace_with_children:
  for match in soup.find_all(tag):
    match.unwrap() # Keep the children, drop the tag itself

In [None]:
for match in soup.find_all('br'):
    match.replace_with(' ') # Replace line breaks with spaces so the text fragments are merged
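
The three clean-up steps above can be tried on a small inline snippet instead of a downloaded page. This is only a sketch: the HTML string is made up to mimic the wiki markup, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics one wiki table cell (not taken from the real page)
sample_html = '<td><p>It sure <b>feels</b> great out.<sup class="reference">[1]</sup><br/>Still...</p></td>'
soup = BeautifulSoup(sample_html, 'html.parser')

# Step 1: delete reference markers together with their contents
for match in soup.find_all('sup', {'class': 'reference'}):
    match.decompose()

# Step 2: drop the tags but keep their children
for tag in ['a', 'p', 'b', 'i']:
    for match in soup.find_all(tag):
        match.unwrap()

# Step 3: turn line breaks into spaces
for match in soup.find_all('br'):
    match.replace_with(' ')

cleaned = soup.td.get_text()
print(cleaned)
```

After the three steps only plain text is left inside the cell, which is what the later `.text.strip()` calls rely on.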

In [None]:
tables = soup.find_all('table', {'class': 'wikitable'})

In [None]:
rows = [table.find_all('tr') for table in tables]

In [None]:
print(rows[0][0])

<tr>
<th>
</th>
<th lang="ja" style="width:45%">
夢幻夜行絵巻　～ <span lang="en">Mystic Flier</span>

</th>
<th style="width:55%">
Fantastic Night Parade Scroll ~ Mystic Flier

</th></tr>


In [None]:
print(rows[0][3])

<tr>
<th style="word-wrap: nowrap">
Reimu

</th>
<td lang="ja" width="45%">
気持ちいいわね
毎回、昼間に出発して悪霊が少ない から、夜に出てみたんだけど．．．
どこに行っていいかわからないわ 暗くて
でも．．．
夜の境内裏はロマンティックね （←のんき）

</td>
<td width="55%">
It sure feels great out.
There aren't many evil spirits about during the day, so I'm trying my luck at night...
But it's dark out, and I'm not sure where to go.
Still...
It's so romantic out behind the shrine at night. (← carefree)

</td></tr>


In [None]:
dialogues = []

In [None]:
out_file = open('output.txt', mode='w', encoding='utf-8')

In [None]:
game = 60 # Stored as an integer (TH06 -> 60) because some games have fractional numbers and floats are error-prone
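
The comment above mentions fractional game numbers. One reading is that the stored value is the game number multiplied by 10, so TH06 gives 60. A minimal sketch of that reading, assuming game numbers have at most one decimal digit; `game_code` is a hypothetical helper and not part of the notebook.

```python
from decimal import Decimal

def game_code(number: str) -> int:
    """Hypothetical helper: scale a game number such as '6' or '7.5' by 10."""
    scaled = Decimal(number) * 10
    # Reject inputs that would still be fractional after scaling
    if scaled != scaled.to_integral_value():
        raise ValueError(f"more than one decimal digit: {number}")
    return int(scaled)

print(game_code("6"))    # 60
print(game_code("7.5"))  # 75
```

Parsing from a string with `Decimal` keeps the arithmetic exact, so the result is a plain `int` that compares and sorts without rounding surprises.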

In [None]:
for stage, table in enumerate(rows):
  for row in table:
    name_cell = row.find('th', {'style': 'word-wrap: nowrap'})
    ja_cell = row.find('td', {'lang': 'ja', 'width': '45%'})
    en_cell = row.find('td', {'width': '55%'})

    if name_cell and ja_cell and en_cell:
      out_file.write(f"{name_cell.text.strip()}:\n{en_cell.text.strip()}\n\n")
      #dialogues.append({'Character': name_cell.text.strip(), 'Text': en_cell.text.strip(), 'Game': game, 'Stage': stage+1})
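
The commented-out line hints at a structured alternative to the text file. Below is a minimal sketch of that idea, run on a made-up table fragment that copies the attributes the loop looks for. The HTML, the `html.parser` choice and the single-table setup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table with one header row and one dialogue row
sample_html = '''
<table class="wikitable">
<tr><th></th><th lang="ja" style="width:45%">Title</th><th style="width:55%">Title</th></tr>
<tr>
<th style="word-wrap: nowrap">Reimu</th>
<td lang="ja" width="45%">でも．．．</td>
<td width="55%">Still...</td>
</tr>
</table>
'''
soup = BeautifulSoup(sample_html, 'html.parser')
rows = [table.find_all('tr') for table in soup.find_all('table', {'class': 'wikitable'})]

game = 60
dialogues = []
# As in the loop above, each table is treated as one stage, counted from 1
for stage, table in enumerate(rows, start=1):
    for row in table:
        name_cell = row.find('th', {'style': 'word-wrap: nowrap'})
        ja_cell = row.find('td', {'lang': 'ja', 'width': '45%'})
        en_cell = row.find('td', {'width': '55%'})
        # Header rows lack these cells and are skipped
        if name_cell and ja_cell and en_cell:
            dialogues.append({
                'Character': name_cell.text.strip(),
                'Text': en_cell.text.strip(),
                'Game': game,
                'Stage': stage,
            })

print(dialogues)
```

A list of dictionaries like this keeps the game and stage alongside each line, which the plain text output drops.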

In [None]:
out_file.close()

In [None]:
!cat output.txt

The whole process as a function

In [64]:
def parse_pages(input_file_path, output_file,
                tags_to_decompose=(('sup', {'class': 'reference'}),),
                tags_to_replace_with_children=(('a',), ('p',), ('b',), ('i',)),
                tags_to_replace={('br',): ' '}
                ):
  """
  Parses a Touhou Wiki page saved as an .htm file, extracts the dialogue and writes it to a text file.

  input_file_path is a string containing the path to the input file.
  This file will not be modified in any way.

  output_file is a file stream that the extracted dialogue will be written to.
  It must be opened for writing before the function is called and closed afterwards.

  tags_to_decompose is a tuple of tuples, each holding the positional arguments
  for BeautifulSoup.find_all() (a tag name, optionally followed by an attribute dictionary).
  Matching tags will be decomposed (i.e. deleted with all their children).

  tags_to_replace_with_children is a tuple of tuples in the same format.
  Matching tags will be replaced with their children.

  tags_to_replace is a dictionary whose keys are tuples in the same format
  (keys must be hashable, so they cannot contain an attribute dictionary)
  and whose values are the replacement strings.
  Matching tags will be replaced with the corresponding value.
  """

  # Reading data
  with open(input_file_path, encoding='utf-8') as in_file:
    soup = BeautifulSoup(in_file.read(), 'html.parser')

  # Preprocessing
  for tag in tags_to_decompose:
    for match in soup.find_all(*tag):
      match.decompose()

  for tag in tags_to_replace_with_children:
    for match in soup.find_all(*tag):
      match.unwrap()

  for tag, replacement in tags_to_replace.items():
    for match in soup.find_all(*tag):
      match.replace_with(replacement)

  # Extracting table rows
  tables = soup.find_all('table', {'class': 'wikitable'})
  rows = [table.find_all('tr') for table in tables]

  # Selecting dialogue rows
  for table in rows:
    for row in table:
      name_cell = row.find('th', {'style': 'word-wrap: nowrap'})
      ja_cell = row.find('td', {'lang': 'ja', 'width': '45%'})
      en_cell = row.find('td', {'width': '55%'})

      # If row has all the necessary elements in it
      if name_cell and ja_cell and en_cell:
        # Write it to the stream passed in as output_file (not a global) in the correct format
        output_file.write(f"{name_cell.text.strip()}:\n{en_cell.text.strip()}\n\n")

In [65]:
out_file = open('output1.txt', mode='w', encoding='utf-8')

In [66]:
parse_pages("TH06_Reimu.htm", out_file)

In [67]:
out_file.close()

In [69]:
!cat output1.txt | head

Reimu:
It's been a while since my last job.

Reimu:
It sure feels great out.
There aren't many evil spirits about during the day, so I'm trying my luck at night...
But it's dark out, and I'm not sure where to go.
Still...
It's so romantic out behind the shrine at night. (← carefree)



In [73]:
files = ['TH06_Reimu.htm', 'TH06_Marisa.htm', 'TH06_Reimu_Extra.htm', 'TH06_Marisa_Extra.htm']

In [75]:
out_file = open('output2.txt', mode='w', encoding='utf-8')

In [76]:
for input_file in files:
  parse_pages(input_file, out_file)
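
The open, loop, close sequence around this cell can also be written with a context manager, so the output file is closed even if one page fails to parse. A minimal sketch: `parse_all` is a hypothetical wrapper, and `fake_parse` is a stand-in for `parse_pages` so the example runs without the wiki files.

```python
import os
import tempfile

def parse_all(paths, output_path, parse_one):
    """Hypothetical wrapper: run parse_one(path, stream) for each path into one output file."""
    # The with block closes the file even if parse_one raises
    with open(output_path, mode='w', encoding='utf-8') as out_file:
        for path in paths:
            parse_one(path, out_file)

# Stand-in for parse_pages (assumption: same two-argument call shape)
def fake_parse(path, stream):
    stream.write(f"{path}:\nplaceholder\n\n")

output_path = os.path.join(tempfile.mkdtemp(), 'output2.txt')
parse_all(['TH06_Reimu.htm', 'TH06_Marisa.htm'], output_path, fake_parse)

with open(output_path, encoding='utf-8') as result_file:
    result = result_file.read()
print(result)
```

With the function defined in this notebook, the call would look like `parse_all(files, 'output2.txt', parse_pages)`.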

In [77]:
out_file.close()

In [80]:
!cat output2.txt | head

Reimu:
It's been a while since my last job.

Reimu:
It sure feels great out.
There aren't many evil spirits about during the day, so I'm trying my luck at night...
But it's dark out, and I'm not sure where to go.
Still...
It's so romantic out behind the shrine at night. (← carefree)



In [81]:
!cat output2.txt | tail

Marisa:
Wait, you don't know?
She got married and then there were none...

Flandre:
Marry who?

Marisa:
I'll introduce ya to a girl at a shrine.

