
Open evidence and Wiki scraper #12

Draft · wants to merge 53 commits into base: v3

Conversation

D0ugins

@D0ugins D0ugins commented Feb 18, 2022

Downloads round data and open source documents/cites from the debate wiki, and also downloads files from openev. Uses XWiki's REST API to pull the data. The main limiting factor is the server's response speed, although a full run should only take a day or two. In total there are around 320k rounds across the wikis, with roughly half having open source documents, plus around 10k open ev files.
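As a rough illustration of the kind of request involved (not the code in this PR), the sketch below lists the pages in one space of a wiki over XWiki's REST API. The base URL and space name are placeholders, and the use of global fetch and the pageSummaries field name are assumptions.

```ts
// Hedged sketch only: the host is a placeholder, not the real wiki.
// XWiki's REST API exposes pages under /wikis/{wiki}/spaces/{space}/pages
// and can return JSON when asked via ?media=json.
const BASE_URL = 'https://wiki.example.org/rest'; // placeholder host

interface PageSummary {
  name: string;
  xwikiAbsoluteUrl: string;
}

async function listPages(wiki: string, space: string): Promise<PageSummary[]> {
  const url = `${BASE_URL}/wikis/${wiki}/spaces/${encodeURIComponent(space)}/pages?media=json`;
  const res = await fetch(url); // assumes Node 18+ global fetch
  if (!res.ok) throw new Error(`XWiki request failed with status ${res.status}`);
  // Field name follows XWiki's page-list representation; treat it as an assumption.
  const body = (await res.json()) as { pageSummaries: PageSummary[] };
  return body.pageSummaries;
}
```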

Todo:

  • Implement adding new rounds as they are created. This can be done with roughly one request per new round, so it could just be run once per day or so.
  • Add tags to downloaded files.
  • Maybe add some sort of parsing of round reports and/or cites, e.g. extract links from cites and try to split round reports by speech.
  • Better error handling in the parser for unusual formats.

arvind-balaji and others added 26 commits January 4, 2021 14:19
Since extracting text with basic formatting is relatively simple, it is much faster to unzip the .docx file and parse document.xml from scratch than to use external libraries
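A minimal sketch of the idea, assuming the jszip package (the parser in this PR may work differently): a .docx file is a zip archive, so the document body can be read straight out of word/document.xml and the text pulled from the <w:t> runs.

```ts
import { promises as fs } from 'fs';
import JSZip from 'jszip';

// Read the main document part out of the .docx zip archive.
async function readDocumentXml(path: string): Promise<string> {
  const zip = await JSZip.loadAsync(await fs.readFile(path));
  const entry = zip.file('word/document.xml');
  if (!entry) throw new Error(`${path} has no word/document.xml`);
  return entry.async('string');
}

// Crude text extraction: each <w:t> element holds a run of document text.
// A real parser would also track run properties (bold, underline, highlight).
function extractText(documentXml: string): string {
  return [...documentXml.matchAll(/<w:t[^>]*>([^<]*)<\/w:t>/g)]
    .map((match) => match[1])
    .join('');
}
```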
Since the tokenizer is now much faster, tokensToMarkup became the bottleneck in extracting cards; rewriting it to build a string instead of a cheerio DOM speeds up card extraction by 3-4x
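The token shape below is hypothetical (the real types live in the repo's debate-tools code); it only illustrates why plain string concatenation beats building a cheerio DOM just to serialize it again.

```ts
// Hypothetical token shape, for illustration only.
interface TextToken {
  text: string;
  format: ('strong' | 'u' | 'mark')[]; // e.g. bold, underline, highlight
}

const escapeHtml = (text: string): string =>
  text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// Build the markup string directly instead of constructing a DOM.
function tokensToMarkup(tokens: TextToken[]): string {
  let html = '';
  for (const token of tokens) {
    const open = token.format.map((tag) => `<${tag}>`).join('');
    const close = [...token.format].reverse().map((tag) => `</${tag}>`).join('');
    html += open + escapeHtml(token.text) + close;
  }
  return html;
}
```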
Some text tokens overwrite styles set by the style name, which wasn't handled properly
Some documents use the outlineLvl property instead of the heading style to mark a heading; this is now handled properly
With roughly 200,000 files there is a decent chance of a collision between two file ids. Switching to 64-bit ids lowers the chance to around 1/1000
Switch to the htmlparser2 library that cheerio uses under the hood for parsing. It is around 3 times faster, handles links properly, and the code is probably simpler
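For reference, this is roughly what driving htmlparser2 directly looks like; the handler below just collects text and link targets and is not the PR's actual tokenizer.

```ts
import { Parser } from 'htmlparser2';

// Sketch: stream-parse markup, collecting text and href targets as we go.
function extractTextAndLinks(html: string): { text: string; links: string[] } {
  let text = '';
  const links: string[] = [];
  const parser = new Parser({
    onopentag(name, attributes) {
      if (name === 'a' && attributes.href) links.push(attributes.href);
    },
    ontext(data) {
      text += data;
    },
  });
  parser.write(html);
  parser.end();
  return { text, links };
}
```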
Add full cite field to database
Deduplicator independently fetches evidence by id and creates DedupTasks
Create db entity to hold groups of similar cards and store frequency
Most of the deduplication time is spent waiting for responses, so concurrent parsing is much faster, especially if the ping to the Redis server is high. To prevent race conditions, it locks processing of cards with the same sentences, and locks updating the parent of a card whose parent is itself being updated.
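One way to get that kind of per-card serialization (a sketch under assumed names, not the PR's implementation) is a keyed promise chain: work queued under the same key runs in order, while unrelated keys stay concurrent.

```ts
// Hypothetical keyed lock. Tasks registered under the same key run one after
// another; tasks under different keys are free to overlap.
const chains = new Map<string, Promise<unknown>>();

function withLock<T>(key: string, task: () => Promise<T>): Promise<T> {
  const previous = chains.get(key) ?? Promise.resolve();
  const run = previous.then(task, task); // wait for the prior holder, even if it failed
  const settled = run.catch(() => undefined); // keep the chain alive on errors
  chains.set(key, settled);
  // Clean up once no newer task has queued behind this one.
  settled.then(() => {
    if (chains.get(key) === settled) chains.delete(key);
  });
  return run;
}

// Usage (hypothetical names): withLock(`card:${cardId}`, () => updateParent(cardId, newParent));
```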
With the htmlparser2-based parsing, simplifyTokens takes up around 1/3 of parsing time due to the slow lodash methods
D0ugins added 14 commits May 22, 2022 00:57
Data about sentences is now stored in binary strings. This is more compact than the previous approach and stores more information.
Data is split into buckets so the performance stays reasonable. Each bucket contains a sequence of 11-byte blocks holding the sentence information.
The first 5 bytes are the key of the sentence within the bucket, the next 4 bytes are the card id, and the last 2 bytes are the index of the sentence within the card.
Still uses the one pass algorithm from the last implementation, but this method of storage is more flexible and allows for better algorithms.
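A small sketch of the layout described above, using Node Buffers; the entry type and function names are illustrative, not the PR's code.

```ts
// Each 11-byte block: bytes 0-4 sentence key (within the bucket),
// bytes 5-8 card id, bytes 9-10 index of the sentence in the card.
const BLOCK_SIZE = 11;

interface SentenceEntry {
  key: number;    // 5-byte key, 0 .. 2^40 - 1
  cardId: number; // 4-byte card id
  index: number;  // 2-byte sentence index
}

function encodeEntry({ key, cardId, index }: SentenceEntry): Buffer {
  const block = Buffer.alloc(BLOCK_SIZE);
  block.writeUIntBE(key, 0, 5);
  block.writeUInt32BE(cardId, 5);
  block.writeUInt16BE(index, 9);
  return block;
}

function decodeBucket(bucket: Buffer): SentenceEntry[] {
  const entries: SentenceEntry[] = [];
  for (let offset = 0; offset + BLOCK_SIZE <= bucket.length; offset += BLOCK_SIZE) {
    entries.push({
      key: bucket.readUIntBE(offset, 5),
      cardId: bucket.readUInt32BE(offset + 5),
      index: bucket.readUInt16BE(offset + 9),
    });
  }
  return entries;
}
```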
Now takes into account the index of matches in both cards when determining whether a match is real or a coincidence.
The quality of matches is now much higher; I don't think it's really worth doing it in two passes. Maybe just look through EvidenceBucket entities occasionally to fix edge cases.
Restructures scraper to the same format as other modules
@arvind-balaji
Collaborator

Trying to run this, but the application seems to hang (I think while trying to load spaceData?). Any idea what's up?

@D0ugins
Author

D0ugins commented Jun 25, 2022

Sorry, should have clarified. Loading the list of rounds to download takes a long time (something like 30 minutes, IIRC).
If you want to load data more quickly for testing, you can add a .slice(0, 2) or .slice(0, 1) on these two lines so that you only load the full data for a few of the wikis.
https://github.com/arvind-balaji/debate-cards/blob/e401edee268797b5afb22bcf6b9ff349e9e5eac4/src/lib/debate-tools/wiki.ts#L76-L78
https://github.com/arvind-balaji/debate-cards/blob/e401edee268797b5afb22bcf6b9ff349e9e5eac4/src/lib/debate-tools/wiki.ts#L85
In the future it would probably be a good idea to add some way of configuring which wikis to load. A purely hypothetical sketch of that is below; allWikis and the WIKIS variable are made-up names, not identifiers from wiki.ts.
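```ts
// Hypothetical: choose which wikis to load via an environment variable instead
// of editing the .slice() calls by hand. `allWikis` stands in for whatever list
// wiki.ts builds; the entries here are placeholder data.
const allWikis: { name: string }[] = [{ name: 'hspolicy' }, { name: 'hsld' }];

const selected = process.env.WIKIS?.split(',').map((name) => name.trim());
const wikisToLoad = selected?.length
  ? allWikis.filter((wiki) => selected.includes(wiki.name))
  : allWikis;
```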

@D0ugins D0ugins marked this pull request as draft July 23, 2022 03:10
@D0ugins
Author

D0ugins commented Jul 23, 2022

The wiki was just updated and the API overhauled; the terms now also ban bulk downloads of data. I have a dump of most of the relevant data, though.

@D0ugins D0ugins changed the title Open evidence and Wiki scraper Open evidence and Wiki scrape Sep 29, 2022
@D0ugins D0ugins changed the title Open evidence and Wiki scrape Open evidence and Wiki scraper Sep 29, 2022