Implement data handling for RP forum dumps #4

0x000011b · 2023-02-07T15:24:55Z

Summary

The scope of this task is to implement support for Enjin forum dumps in the data-toolbox.

Source file formats

Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.
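Since the table layout of the encuum dumps isn't described here, a reasonable first step is to inspect a dump's schema. This is a minimal sketch using only the standard-library sqlite3 module; the function name `describe_sqlite` is made up for illustration and no table or column names are assumed.

```python
import sqlite3
from typing import Dict, List


def describe_sqlite(path: str) -> Dict[str, List[str]]:
    """Return a mapping of table name -> column names for a SQLite file."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Running this against an example file should show which tables hold forums, threads and posts before any dataset code is written.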

Implementation details

An EnjinDataset class should be implemented under toolbox/datasets/enjim.py, following the general format of the other datasets. Threads should map to Episodes, and posts within threads should map to Turns within the Episode.
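To illustrate the thread-to-Episode and post-to-Turn mapping, here is a rough sketch. The `Turn` and `Episode` classes below are stand-ins, not the toolbox's real classes, and the `(thread_id, author, body)` row shape is an assumption about what a query against the dump might return.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Turn:
    # Stand-in for the toolbox's turn type (hypothetical fields).
    speaker: str
    utterance: str


@dataclass
class Episode:
    # Stand-in for the toolbox's episode type (hypothetical fields).
    thread_id: str
    turns: List[Turn] = field(default_factory=list)


def episodes_from_rows(rows: Iterable[Tuple[str, str, str]]) -> Iterator[Episode]:
    """Group (thread_id, author, body) rows into one Episode per thread.

    Rows are assumed to be sorted by thread and then by post order,
    because groupby only merges adjacent rows with the same key.
    """
    for thread_id, group in groupby(rows, key=lambda row: row[0]):
        yield Episode(thread_id, [Turn(author, body) for _, author, body in group])
```

An actual EnjinDataset would presumably wrap something like this in its iterator and use the toolbox's own Episode and Turn types.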

An EnjinPDM should then be implemented under toolbox/modules/enjim_pdm.py. Feel free to look at the light_pdm.py file in that same folder for an example.

A lot of data processing will then need to take place. Off the top of my head:

BBCode will need to be converted to its nearest Markdown representation, or dropped entirely where it has no sensible equivalent (e.g. different font colors, images)
Irrelevant threads and posts need to be pruned (e.g. non-roleplay, announcements and so on)
Overly short posts will need to be carefully pruned (usually OOC talk)

...and maybe more. This will probably be the trickiest part. Feel free to reach out, we can discuss these points here or in the Matrix.
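As a starting point for the cleanup steps listed above, here is a rough regex-based sketch. The set of tags handled, the list of tags dropped, and the `min_chars` threshold are all assumptions for illustration; real dumps will likely need a proper BBCode parser and more careful OOC detection.

```python
import re

# Tags with a direct Markdown equivalent.
_CONVERT = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
]
# Images are removed together with their contents.
_IMG = re.compile(r"\[img\].*?\[/img\]", re.I | re.S)
# Purely presentational tags: the tag is removed, the inner text is kept.
# Only an explicit list is dropped so bracketed text like "[OOC]" survives.
_DROP = re.compile(r"\[/?(?:color|size|font|center|u)(?:=[^\]]*)?\]", re.I)


def bbcode_to_markdown(text: str) -> str:
    """Convert a small subset of BBCode to Markdown and drop the rest."""
    text = _IMG.sub("", text)
    for pattern, replacement in _CONVERT:
        text = pattern.sub(replacement, text)
    return _DROP.sub("", text).strip()


def is_too_short(text: str, min_chars: int = 40) -> bool:
    """Flag posts below a length threshold (hypothetical default)."""
    return len(text.strip()) < min_chars
```

A length check alone will also flag short in-character replies, so in practice it would probably be combined with other signals before a post is pruned.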


lloorree · 2023-02-08T10:51:38Z

You can assign this to me. I'll be working on it today. I'll follow up in Matrix and put a summary here when that's done.

lloorree · 2023-02-08T14:18:02Z

Summary of the discussion:

Each forum has separate subforums for characters and actual roleplays. The character posts will be cross-referenced to add personas to the prompts.
Posts that are too long will be split up by a character limit and treated as the character speaking multiple times.
The first post in the thread and a summary of the thread so far together form the scenario.
Individual prompts will be made by a rolling window over the thread to keep them from being too large.
Character descriptions and scenarios that are too long will be run through the philschmid/bart-large-cnn-samsum summarizer. I found it while testing a handful of summarizers for generating summaries from TV scripts, and it should work well here since the task is very similar.
At some point this should use vector databases and lookups instead of summarization, but I'm not familiar enough with them to do it that way yet, so TBD.
There don't seem to be images in the dataset, but if there turn out to be a lot of them somewhere they'll be replaced with a constant image tag and a generated description of them.
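The post splitting and rolling window described above could look roughly like the sketch below. The function names, the whitespace-based splitting, and the choice to emit one window per turn are assumptions; the discussion does not specify the character limit, the window size, or how chunks should be cut.

```python
from typing import Iterator, List, Sequence


def split_post(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, breaking between words.

    Whitespace (including newlines) is collapsed to single spaces.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def rolling_windows(turns: Sequence[str], size: int) -> Iterator[List[str]]:
    """For each turn, yield that turn preceded by up to size - 1 earlier turns."""
    for end in range(1, len(turns) + 1):
        yield list(turns[max(0, end - size):end])
```

Each chunk from `split_post` would become its own turn by the same speaker, and each window would become one prompt, with the scenario and personas added on top.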

lloorree · 2023-02-12T02:46:41Z

PR #9 is for this.

TearGosling · 2023-08-11T04:12:13Z

Very old issue, closing for now.

0x000011b added the enhancement label (New feature or request) Feb 7, 2023

0x000011b mentioned this issue Feb 7, 2023

Improve training data #2

Closed

4 tasks

0x000011b assigned lloorree Feb 8, 2023

lloorree added a commit to lloorree/data-toolbox that referenced this issue Feb 10, 2023

in progress work on PygmalionAI#4 adding enjim datasets

020c8d0

lloorree added a commit to lloorree/data-toolbox that referenced this issue Feb 12, 2023

prospective/semi-final draft for PygmalionAI#4

202a18d

0x000011b linked a pull request Feb 19, 2023 that will close this issue

Add enjim dataset(s) #9

Open

TearGosling closed this as completed Aug 11, 2023
