Implement data handling for RP forum dumps #4

Closed
0x000011b opened this issue Feb 7, 2023 · 4 comments · May be fixed by #9
Assignees
Labels
enhancement New feature or request

Comments

@0x000011b
Collaborator

Summary

Scope of this task is to implement support for Enjin forum dumps in the data-toolbox.

Source file formats

Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.
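Since the dump schema isn't described in this issue, a first step could be to inspect a file and see which tables it contains. A minimal sketch using only the standard library; `list_tables` is a hypothetical helper name, and nothing here assumes particular table or column names from encuum:

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of the tables in a SQLite3 database file, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Running this against an example dump should show which tables hold threads and posts before any dataset code is written.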

Implementation details

An EnjinDataset class should be implemented under toolbox/datasets/enjin.py, following the general format of the other datasets. Threads should map to Episodes, and posts within threads should map to Turns within the Episode.
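To illustrate the thread-to-Episode and post-to-Turn mapping, here is a rough sketch. The `Turn` and `Episode` dataclasses below are stand-ins written for this example, not the toolbox's actual types, and the `(thread_id, author, content)` row shape is an assumption about what a query over the dump might return:

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    # Hypothetical stand-in for the toolbox's turn type.
    speaker: str
    utterance: str


@dataclass
class Episode:
    # Hypothetical stand-in for the toolbox's episode type.
    thread_id: str
    turns: list[Turn] = field(default_factory=list)


def rows_to_episodes(rows) -> list[Episode]:
    """Group (thread_id, author, content) rows into one Episode per thread.

    Rows are assumed to already be in posting order within each thread.
    """
    by_thread: dict[str, Episode] = {}
    for thread_id, author, content in rows:
        episode = by_thread.setdefault(thread_id, Episode(thread_id=thread_id))
        episode.turns.append(Turn(speaker=author, utterance=content))
    # Dicts keep insertion order, so episodes come back in first-seen order.
    return list(by_thread.values())
```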

An EnjinPDM should then be implemented under toolbox/modules/enjin_pdm.py. Feel free to look at the light_pdm.py file in that same folder for an example.

A lot of data processing will then need to take place. Off the top of my head:

  • BBCode will need to be converted to its nearest Markdown representation, or dropped entirely where it has no reasonable equivalent (e.g. different font colors, images)
  • Irrelevant threads and posts need to be pruned (e.g. non-roleplay, announcements and so on)
  • Overly short posts will need to be carefully pruned (usually OOC talk)

...and maybe more. This will probably be the trickiest part. Feel free to reach out, we can discuss these points here or in the Matrix.
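As a starting point for discussion, the BBCode and short-post steps could look roughly like the sketch below. The set of handled tags, the function names, and the 50-character threshold are all assumptions picked for illustration; real dumps will likely need more tags and a tuned threshold:

```python
import re

# Tags with a close Markdown equivalent: (pattern, replacement) pairs.
_CONVERSIONS = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
]
# Images are dropped together with their content.
_IMG = re.compile(r"\[img\].*?\[/img\]", re.I | re.S)
# Styling tags are dropped but the text they wrap is kept.
_STYLE_TAG = re.compile(r"\[/?(?:color|size|font)(?:=[^\]]*)?\]", re.I)


def bbcode_to_markdown(text: str) -> str:
    """Convert a small subset of BBCode to Markdown and drop the rest."""
    for pattern, replacement in _CONVERSIONS:
        text = pattern.sub(replacement, text)
    text = _IMG.sub("", text)
    text = _STYLE_TAG.sub("", text)
    return text.strip()


def is_too_short(text: str, min_chars: int = 50) -> bool:
    """Flag posts below an (assumed) length threshold as likely OOC talk."""
    return len(text.strip()) < min_chars
```

A length check alone will also catch short in-character replies, so it is probably best treated as a first filter rather than the final rule.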

@0x000011b 0x000011b added the enhancement New feature or request label Feb 7, 2023
@0x000011b 0x000011b mentioned this issue Feb 7, 2023
4 tasks
@lloorree

lloorree commented Feb 8, 2023

You can assign this to me. I'll be working on it today. I'll follow up in Matrix and put a summary here when that's done.

@lloorree

lloorree commented Feb 8, 2023

Summary of the discussion:

  • Each forum has separate subforums for characters and actual roleplays. The character posts will be cross-referenced to add personas to the prompts.
  • Posts that are too long will be split up by a character limit and treated as the character speaking multiple times.
  • The first post in the thread, together with a summary of the thread so far, will serve as the scenario.
  • Individual prompts will be made by a rolling window over the thread to keep them from being too large.
  • Too-long character descriptions and scenarios will be run through the philschmid/bart-large-cnn-samsum summarizer. I found it after testing a handful of summarizers for generating summaries from TV scripts, and it should work well here since the task is very similar.
  • At some point the summarization step should be replaced with vector-database lookups, but I'm not familiar enough with them to do it that way yet, so TBD.
  • There don't seem to be images in the dataset, but if there turn out to be a lot of them somewhere they'll be replaced with a constant image tag and a generated description of them.
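The splitting and rolling-window points above can be sketched as follows. Both function names, the choice to break at whitespace, and the window semantics (one window ending at each turn) are assumptions made for illustration, not what PR #9 necessarily does; the summarizer step is left out:

```python
def split_post(text: str, max_chars: int) -> list[str]:
    """Split a post into chunks of at most max_chars characters.

    Breaks at the last space that fits, falling back to a hard cut
    when a single word is longer than max_chars.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space: hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def rolling_windows(turns: list, window_size: int) -> list[list]:
    """Return one window per turn, holding the last window_size turns up to it."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return [
        turns[max(0, end - window_size):end]
        for end in range(1, len(turns) + 1)
    ]
```

Each chunk returned by `split_post` would become its own turn by the same speaker, and each window would then be combined with the scenario and personas to form one prompt.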

lloorree added a commit to lloorree/data-toolbox that referenced this issue Feb 10, 2023
lloorree added a commit to lloorree/data-toolbox that referenced this issue Feb 12, 2023
@lloorree

PR #9 is for this.

@0x000011b 0x000011b linked a pull request Feb 19, 2023 that will close this issue
@TearGosling
Contributor

Very old issue, closing for now.

Projects
Status: Done

3 participants