Import reddit archives into Lemmy #999

Closed

Permafacture opened this issue Jul 20, 2020 · 11 comments
Labels
enhancement New feature or request

Comments

@Permafacture

Permafacture commented Jul 20, 2020

Is your proposal related to a problem?

Many people want to switch to Lemmy because subreddits they've enjoyed have been banned/deleted. Having the content that was removed from Reddit loaded into Lemmy communities would help bootstrap those communities and make users feel more at home.

Describe the solution you'd like

Having a method for Lemmy admins to upload posts and comments via JSON along with images previously hosted on Reddit would solve this.

Every post and comment could be from the user data_hoarder (or similar), and the post/comment body could be modified to include the original username. Another option: since the Reddit user unique ID is 8 characters of base 36, the original usernames could be replaced with a fixed-length humanhash. This would maintain the readability of threads without associating text with Reddit users who haven't consented, while keeping traceability from Reddit user to Lemmy post if that was ever needed.
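
A minimal sketch of that pseudonymization idea, assuming a plain SHA-256 digest truncated to 8 base-36 characters rather than any particular humanhash library; the function name, prefix, and salt are hypothetical:

```python
import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def pseudonym(reddit_username: str, salt: str = "lemmy-import") -> str:
    """Map a Reddit username to a deterministic, fixed-length handle.

    The same username always maps to the same handle, so reply chains stay
    readable; casual reverse lookup is hard, but traceability is preserved
    if the importer keeps the original mapping.
    """
    digest = int.from_bytes(
        hashlib.sha256((salt + reddit_username).encode()).digest(), "big"
    )
    chars = []
    for _ in range(8):  # 8 base-36 characters, like Reddit's short IDs
        digest, rem = divmod(digest, 36)
        chars.append(BASE36[rem])
    return "archived_" + "".join(chars)

print(pseudonym("some_reddit_user"))  # prints the same 8-character handle on every run
```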

Additional context

I've attached an example of two posts with all of their comments from MTC (JSON renamed to .txt because GitHub doesn't like JSON attachments). I can provide all the posts/comments and images from MTC and MMTC if requested, and a Python script for getting the same from any other subreddit.

demo_posts.txt
demo_comments.txt



@Permafacture added the enhancement (New feature or request) label Jul 20, 2020
@Permafacture
Author

Note: there are far more fields possible in the JSON. Most of them are useless (flair, awards, CSS, Reddit-specific functionality), so I've filtered it down to draw attention to what's essential.
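
For reference, the public Reddit JSON uses these field names; exactly which ones the attached demo files keep is a guess:

```python
# Hypothetical examples of one filtered submission and one filtered comment.
# Field names follow Reddit's public JSON; the selection shown is an assumption.
demo_post = {
    "id": "abc123",
    "title": "Example post title",
    "selftext": "Post body in markdown",
    "author": "original_username",
    "created_utc": 1595203200,          # Unix timestamp
    "score": 42,
    "url": "https://i.redd.it/example.jpg",
}

demo_comment = {
    "id": "def456",
    "link_id": "t3_abc123",             # the submission this comment belongs to
    "parent_id": "t3_abc123",           # t3_ = top level, t1_ = reply to a comment
    "author": "original_username",
    "created_utc": 1595206800,
    "score": 7,
    "body": "Comment body in markdown",
}
```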

@Nutomic
Member

Nutomic commented Jul 20, 2020

I don't think this should be a feature in Lemmy, because then people would ask us to allow imports from dozens of other websites. It would be much better to write a script that parses the data and posts it to Lemmy through the API.

We also had a long thread about this topic a few days ago.

@Permafacture
Author

I was in that thread, and muad_dibber asked me to open an issue on GitHub for this. It doesn't need to be integrated into Lemmy.

So the preferred solution is for someone to write a script that uses the API to do this? I can't tell from the API docs what a CommentView is. Would it contain the ID of the created comment, so that a subsequent request could reply to it?

Also, MTC has 250K comments. Would that many requests be okay? If the Chapo folks want to do the same then it would be much bigger.

Finally, MTC has something like 3 GB of images. Is that okay?
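
For the script-over-the-API route, the flow would be: create each post, keep the id the server returns, then walk the comments parent-first so every reply can pass its parent's new id. The endpoint paths and payload fields below (`/api/v3/post`, `/api/v3/comment`, `name`, `content`, `parent_id`, `auth`) are assumptions based on later versions of Lemmy's HTTP API, so treat this as a sketch of the flow rather than a working client:

```python
import requests

API = "https://lemmy.example/api/v3"  # hypothetical instance URL
JWT = "..."                           # auth token obtained from the login endpoint

def create_post(community_id: int, title: str, body: str) -> int:
    r = requests.post(f"{API}/post", json={
        "community_id": community_id,
        "name": title,
        "body": body,
        "auth": JWT,
    })
    r.raise_for_status()
    # The response wraps the created post; the id inside is what comments reference.
    return r.json()["post_view"]["post"]["id"]

def create_comment(post_id: int, content: str, parent_id=None) -> int:
    payload = {"post_id": post_id, "content": content, "auth": JWT}
    if parent_id is not None:
        payload["parent_id"] = parent_id
    r = requests.post(f"{API}/comment", json=payload)
    r.raise_for_status()
    # A CommentView is this kind of wrapper: the new comment plus its aggregates;
    # the id inside is what replies pass as parent_id.
    return r.json()["comment_view"]["comment"]["id"]

def import_comments(post_id: int, comments_parent_first):
    """Walk comments so parents come before children, mapping Reddit ids to Lemmy ids."""
    lemmy_ids = {}
    for c in comments_parent_first:                 # hypothetical, pre-sorted iterable
        parent = lemmy_ids.get(c["parent_id"][3:])  # strip the "t1_"/"t3_" prefix
        lemmy_ids[c["id"]] = create_comment(post_id, c["body"], parent)
    return lemmy_ids
```

Whether 250K requests are okay would mostly come down to the instance's rate limits and how long a throttled run is allowed to take.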

@Permafacture
Author

Also, it was suggested that this data could be injected into existing communities without disrupting them, if the creation dates from Reddit were preserved. Keeping the original scores (from upvotes) would be nice too. Neither of these is possible through the API, I believe.

@dessalines
Member

dessalines commented Jul 20, 2020

Oh yeah, I just wanted it here to be able to track and work against it.

I was thinking really low level, since lots of columns (like published) need to be forced in there. As in: write a simple script (probably in Rust) to parse the Reddit .json and generate a .sql file with a bunch of INSERT statements. The comments might get tricky, but converting Reddit's post .json to Lemmy SQL rows would be really easy.

Wiping them out would be pretty easy too, as long as they all have the same creator.

Importing comments would get a bit trickier (parent_ids and all that), but is still probably doable.

Honestly I don't really care about images / memes; those would massively balloon our storage. We could just link to the Reddit / imgur post as the URL.

I mostly care about having posts / comments, especially educational things, backed up.
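
A rough Python sketch of that idea (Rust is suggested above, but the flow is the same): read the Reddit .json and emit INSERT statements with explicit ids so `published` can be forced in. The table and column names (post.name, body, url, creator_id, community_id, published) are assumptions about the schema and would need checking against Lemmy's migrations:

```python
import json
from datetime import datetime, timezone

CREATOR_ID = 1           # the "data_hoarder"-style archive account, created beforehand
COMMUNITY_ID = 2         # target community id
POST_ID_START = 100_000  # chosen above the current max(id) to avoid collisions

def sql_str(s: str) -> str:
    """Escape a value for a single-quoted SQL string literal."""
    return "'" + s.replace("'", "''") + "'"

def post_insert(new_id: int, p: dict) -> str:
    published = datetime.fromtimestamp(p["created_utc"], tz=timezone.utc)
    return (
        "INSERT INTO post (id, name, body, url, creator_id, community_id, published) "
        "VALUES ("
        f"{new_id}, {sql_str(p['title'])}, {sql_str(p.get('selftext', ''))}, "
        f"{sql_str(p.get('url', ''))}, {CREATOR_ID}, {COMMUNITY_ID}, "
        f"'{published.isoformat()}');"
    )

with open("demo_posts.txt") as f:   # assumes a JSON array of filtered submissions
    posts = json.load(f)

with open("import_posts.sql", "w") as out:
    for offset, p in enumerate(posts):
        out.write(post_insert(POST_ID_START + offset, p) + "\n")
```

Comments would be generated the same way, after translating each Reddit parent_id/link_id to the new numeric ids, which is the trickier part mentioned above.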

@Permafacture
Author

Permafacture commented Jul 20, 2020

So, the images in question are the images Reddit is hosting for banned subreddits; I'm not grabbing files from imgur. In my opinion it's not a great idea to rely on Reddit to keep hosting images that are abandoned (no link from any accessible post). If they were smart they'd wipe them.

I think only caring about text and discussion is valid, but there's also a variety of users with different priorities, and keeping the posts intact would satisfy the most users. The comments might not make sense if the image is removed anyway. MTC wasn't active for that long, so Lemmy's storage requirements ought to grow to those levels before long if it gets any usage at all.

If your primary keys are auto-incrementing, then it would be easy to just find the first free post_id and comment_id and specify the primary keys and foreign key relations explicitly. If you disable indexing and constraints, the whole process would be very quick.
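
Concretely, the id bookkeeping could look like this, assuming Postgres, the default serial sequences, and the table names used above; dropping indexes or deferring constraints for speed would be an extra step not shown:

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=lemmy")

with conn, conn.cursor() as cur:
    # Before generating the inserts: find the next free ids.
    cur.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM post")
    post_id_start = cur.fetchone()[0]
    cur.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM comment")
    comment_id_start = cur.fetchone()[0]

    # ... generate and load the INSERTs using those starting ids ...

    # Afterwards: bump the sequences so ordinary posting doesn't reuse imported ids.
    cur.execute("SELECT setval(pg_get_serial_sequence('post', 'id'), (SELECT MAX(id) FROM post))")
    cur.execute("SELECT setval(pg_get_serial_sequence('comment', 'id'), (SELECT MAX(id) FROM comment))")
```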

@Nutomic
Member

Nutomic commented Jul 21, 2020

250K comments is a lot, more than 25 times as many as we have now. And you probably want to import the votes as well to keep the same ranking. So that would completely drown out everything that is posted to Lemmy directly. For those reasons, I am really against importing it into the existing instance.

A separate instance would be a much better solution, because it would give a nice separation between imported content and original content. We also wouldn't have to worry about breaking the database, which could happen very easily if you import it through SQL. And instead of locking each thread, it would be enough to just close signups (at least until federation is working).

We could still run it on the same physical server, so there would be no extra money or effort needed. And there are 60 GB of free storage, so that wouldn't be a problem either.

@dessalines
Member

The votes might be possible by having the DB insert a score higher than 1 into the post_like table. Because, obviously, through the API a user can only vote once, and only with 1 or -1.
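
In the same generated-SQL spirit, that would be one post_like row per post carrying the archived score, with the archive account as the voter. Whether the score column actually accepts values other than 1/-1 (or is guarded by a check constraint) would need verifying, so this is only a sketch with assumed column names:

```python
def post_like_insert(post_id: int, archived_score: int, archive_user_id: int = 1) -> str:
    # Column names are assumptions about the post_like table mentioned above.
    return (
        "INSERT INTO post_like (post_id, user_id, score) VALUES "
        f"({post_id}, {archive_user_id}, {archived_score});"
    )
```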

@Permafacture
Author

I see one of the main points of a data migration as improving the experience of users on the commie Lemmy instance. Putting this data in another instance walls it off. New users might not necessarily come to commie Lemmy; they might browse the archive without seeing new posts, or find the new content too sparse. As we gather more of a user base and more good content, there start to be two different places one would have to search. If the old and new data are together, it gives new users a reason to linger and browse, and in the future it makes commie Lemmy the single resource to search.

Drowning out new posts is a real concern. We could scale the votes so that rank is preserved but the "hottest" imported post is only as hot as current new posts. My hope is that the imported data would increase the number of users and their level of interaction, so the drowning effect (especially after scaling) would be short-lived.
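
The scaling could be a simple linear map from the archive's score range onto whatever the instance's current front page tops out at; a hypothetical sketch:

```python
def scale_score(archived: int, archive_max: int, target_max: int) -> int:
    """Linearly rescale an archived Reddit score so the hottest imported post
    lands at roughly the score of today's hottest native post, preserving rank."""
    if archive_max <= 0:
        return 1
    return max(1, round(archived * target_max / archive_max))

# e.g. an archived post with 5,000 points, when the archive max is 20,000 and
# the current front page tops out around 40, becomes a score of 10.
print(scale_score(5_000, 20_000, 40))  # -> 10
```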

On breaking the database: yeah, there would need to be a test run against a copy of the database, plus scheduled downtime that includes time to back up and possibly restore in case of failure. It's an annoying amount of work to do, and I can't be the one to do it.

@Nutomic closed this as completed Jan 15, 2023
@Permafacture
Author

Did this actually get done?

@dessalines
Member

dessalines commented Jan 16, 2023

Someone did make a tool: https://github.com/rileynull/RedditLemmyImporter

I'll add that to the readme.
