
Open evidence and Wiki scraper #12

Draft · wants to merge 53 commits into base: v3

Conversation

D0ugins

@D0ugins D0ugins commented Feb 18, 2022

Downloads round data and open source documents/cites from the debate wiki, and also downloads files from openev. Uses XWiki's REST API to pull the data. The main limiting factor is the server's response speed, although a full run should only take a day or two. In total there are around 320k rounds across the wikis, with roughly half having open source documents, plus around 10k open ev files.
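As a rough illustration of the kind of request involved (not the code in this PR), the sketch below lists the pages in one space of a wiki over XWiki's REST API. The base URL and space name are placeholders, and the use of global fetch and the pageSummaries field name are assumptions.

```ts
// Hedged sketch only: the host is a placeholder, not the real wiki.
// XWiki's REST API exposes pages under /wikis/{wiki}/spaces/{space}/pages
// and can return JSON when asked via ?media=json.
const BASE_URL = 'https://wiki.example.org/rest'; // placeholder host

interface PageSummary {
  name: string;
  xwikiAbsoluteUrl: string;
}

async function listPages(wiki: string, space: string): Promise<PageSummary[]> {
  const url = `${BASE_URL}/wikis/${wiki}/spaces/${encodeURIComponent(space)}/pages?media=json`;
  const res = await fetch(url); // assumes Node 18+ global fetch
  if (!res.ok) throw new Error(`XWiki request failed with status ${res.status}`);
  // Field name follows XWiki's page-list representation; treat it as an assumption.
  const body = (await res.json()) as { pageSummaries: PageSummary[] };
  return body.pageSummaries;
}
```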

Todo:

  • Implement adding new rounds as they are created. This can be done with roughly one request per new round, so it could just be run once per day or so.
  • Add tags to downloaded files.
  • Maybe add some sort of parsing of round reports and/or cites, e.g. extract links from cites and try to split round reports by speech.
  • Better error handling in the parser for unusual formats.

arvind-balaji and others added 26 commits January 4, 2021 14:19
Since extracting text with basic formatting is relatively simple, it is much faster to unzip the .docx file and parse document.xml from scratch than to use external libraries
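A minimal sketch of the idea, assuming the jszip package (the parser in this PR may work differently): a .docx file is a zip archive, so the document body can be read straight out of word/document.xml and the text pulled from the <w:t> runs.

```ts
import { promises as fs } from 'fs';
import JSZip from 'jszip';

// Read the main document part out of the .docx zip archive.
async function readDocumentXml(path: string): Promise<string> {
  const zip = await JSZip.loadAsync(await fs.readFile(path));
  const entry = zip.file('word/document.xml');
  if (!entry) throw new Error(`${path} has no word/document.xml`);
  return entry.async('string');
}

// Crude text extraction: each <w:t> element holds a run of document text.
// A real parser would also track run properties (bold, underline, highlight).
function extractText(documentXml: string): string {
  return [...documentXml.matchAll(/<w:t[^>]*>([^<]*)<\/w:t>/g)]
    .map((match) => match[1])
    .join('');
}
```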
Since the tokenizer is now much faster, tokensToMarkup became the bottleneck in extracting cards; rewriting it to build a string instead of a cheerio DOM speeds up card extraction by 3-4x
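The token shape below is hypothetical (the real types live in the repo's debate-tools code); it only illustrates why plain string concatenation beats building a cheerio DOM just to serialize it again.

```ts
// Hypothetical token shape, for illustration only.
interface TextToken {
  text: string;
  format: ('strong' | 'u' | 'mark')[]; // e.g. bold, underline, highlight
}

const escapeHtml = (text: string): string =>
  text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// Build the markup string directly instead of constructing a DOM.
function tokensToMarkup(tokens: TextToken[]): string {
  let html = '';
  for (const token of tokens) {
    const open = token.format.map((tag) => `<${tag}>`).join('');
    const close = [...token.format].reverse().map((tag) => `</${tag}>`).join('');
    html += open + escapeHtml(token.text) + close;
  }
  return html;
}
```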
Some text tokens overwrite styles set by the style name, which wasn't handled properly
Some documents use the outlineLvl property instead of the heading style to mark a heading; this is now handled properly
With roughly 200,000 files there is a decent chance of a collision between two file ids. Switching to 64-bit ids lowers the chance to around 1/1000
Switch to the htmlparser2 library that cheerio uses under the hood for parsing. It is around 3 times faster, handles links properly, and the code is probably simpler
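For reference, this is roughly what driving htmlparser2 directly looks like; the handler below just collects text and link targets and is not the PR's actual tokenizer.

```ts
import { Parser } from 'htmlparser2';

// Sketch: stream-parse markup, collecting text and href targets as we go.
function extractTextAndLinks(html: string): { text: string; links: string[] } {
  let text = '';
  const links: string[] = [];
  const parser = new Parser({
    onopentag(name, attributes) {
      if (name === 'a' && attributes.href) links.push(attributes.href);
    },
    ontext(data) {
      text += data;
    },
  });
  parser.write(html);
  parser.end();
  return { text, links };
}
```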
Add full cite field to database
Deduplicator independently fetches evidence by id and creates DedupTasks
Create db entity to hold groups of similar cards and store frequency
Most of the deduplication time is spent waiting for responses, so concurrent parsing is much faster, especially if the ping to the Redis server is high. To prevent race conditions, it locks processing of cards with the same sentences, and locks updating the parent of a card whose parent is itself being updated.
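One way to get that kind of per-card serialization (a sketch under assumed names, not the PR's implementation) is a keyed promise chain: work queued under the same key runs in order, while unrelated keys stay concurrent.

```ts
// Hypothetical keyed lock. Tasks registered under the same key run one after
// another; tasks under different keys are free to overlap.
const chains = new Map<string, Promise<unknown>>();

function withLock<T>(key: string, task: () => Promise<T>): Promise<T> {
  const previous = chains.get(key) ?? Promise.resolve();
  const run = previous.then(task, task); // wait for the prior holder, even if it failed
  const settled = run.catch(() => undefined); // keep the chain alive on errors
  chains.set(key, settled);
  // Clean up once no newer task has queued behind this one.
  settled.then(() => {
    if (chains.get(key) === settled) chains.delete(key);
  });
  return run;
}

// Usage (hypothetical names): withLock(`card:${cardId}`, () => updateParent(cardId, newParent));
```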
With the htmlparser2-based parsing, simplifyTokens takes up around 1/3 of parsing time due to the slow lodash methods
D0ugins added 14 commits May 22, 2022 00:57
Data about sentences is now stored in binary strings. This is more compact than the previous approach and stores more information.
Data is split into buckets so the performance stays reasonable. Each bucket contains a sequence of 11-byte blocks holding the sentence information.
The first 5 bytes are the key of the sentence within the bucket, the next 4 bytes are the card id, and the last 2 bytes are the index of the sentence within the card.
Still uses the one pass algorithm from the last implementation, but this method of storage is more flexible and allows for better algorithms.
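A small sketch of the layout described above, using Node Buffers; the entry type and function names are illustrative, not the PR's code.

```ts
// Each 11-byte block: bytes 0-4 sentence key (within the bucket),
// bytes 5-8 card id, bytes 9-10 index of the sentence in the card.
const BLOCK_SIZE = 11;

interface SentenceEntry {
  key: number;    // 5-byte key, 0 .. 2^40 - 1
  cardId: number; // 4-byte card id
  index: number;  // 2-byte sentence index
}

function encodeEntry({ key, cardId, index }: SentenceEntry): Buffer {
  const block = Buffer.alloc(BLOCK_SIZE);
  block.writeUIntBE(key, 0, 5);
  block.writeUInt32BE(cardId, 5);
  block.writeUInt16BE(index, 9);
  return block;
}

function decodeBucket(bucket: Buffer): SentenceEntry[] {
  const entries: SentenceEntry[] = [];
  for (let offset = 0; offset + BLOCK_SIZE <= bucket.length; offset += BLOCK_SIZE) {
    entries.push({
      key: bucket.readUIntBE(offset, 5),
      cardId: bucket.readUInt32BE(offset + 5),
      index: bucket.readUInt16BE(offset + 9),
    });
  }
  return entries;
}
```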
Now takes into account the index of matches in both cards when determining whether a match is real or a coincidence.
The quality of matches is now much higher; I don't think it's really worth doing it in two passes. Maybe just look through EvidenceBucket entities occasionally to fix edge cases.
Restructures scraper to the same format as other modules
@arvind-balaji
Collaborator

Trying to run this, but the application seems to hang (I think while trying to load spaceData?). Any idea what's up?

@D0ugins
Author

D0ugins commented Jun 25, 2022

Sorry, should have clarified. Loading the list of rounds to download takes a long time (something like 30 minutes, IIRC).
If you want to load data more quickly for testing, you can add a .slice(0, 2) or .slice(0, 1) on these two lines so that you only load the full data for a few of the wikis.
https://github.com/arvind-balaji/debate-cards/blob/e401edee268797b5afb22bcf6b9ff349e9e5eac4/src/lib/debate-tools/wiki.ts#L76-L78
https://github.com/arvind-balaji/debate-cards/blob/e401edee268797b5afb22bcf6b9ff349e9e5eac4/src/lib/debate-tools/wiki.ts#L85
In the future it would probably be a good idea to add some way of configuring which wikis to load. A purely hypothetical sketch of that is below; allWikis and the WIKIS variable are made-up names, not identifiers from wiki.ts.
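```ts
// Hypothetical: choose which wikis to load via an environment variable instead
// of editing the .slice() calls by hand. `allWikis` stands in for whatever list
// wiki.ts builds; the entries here are placeholder data.
const allWikis: { name: string }[] = [{ name: 'hspolicy' }, { name: 'hsld' }];

const selected = process.env.WIKIS?.split(',').map((name) => name.trim());
const wikisToLoad = selected?.length
  ? allWikis.filter((wiki) => selected.includes(wiki.name))
  : allWikis;
```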

@D0ugins D0ugins marked this pull request as draft July 23, 2022 03:10
@D0ugins
Author

D0ugins commented Jul 23, 2022

The wiki was just updated and the API overhauled; the terms now also ban bulk downloads of data. I have a dump of most of the relevant data, though.

@D0ugins D0ugins changed the title Open evidence and Wiki scraper Open evidence and Wiki scrape Sep 29, 2022
@D0ugins D0ugins changed the title Open evidence and Wiki scrape Open evidence and Wiki scraper Sep 29, 2022