Mal data collection #39
Merged
Owner
ShafathZ
commented
Apr 26, 2026
- Scraped data from MAL / Jikan.moe in anime_scrape.py to enrich our DB with new data
- Added MongoDB collection and DB names to config
- Modified populate_mongo_db_scraped to work with the new JSON format
- Pushed data to anime_enriched collection on live MongoDB store
…nime data required for app. Modified populate code to populate_mongo_db_scraped and archived old code to work with new scraped json integration
…o ensure clean data
… field instead of title
Suryanshg
reviewed
Apr 28, 2026
# Fetch data from MAL
def _fetch_mal_page(ranking_type: str, page: int, limit: int) -> List[Dict]:
    headers = {"X-MAL-CLIENT-ID": MAL_CLIENT_ID}
Collaborator
Minor optimization: you can initialize this outside the function as a module-level variable, since it does not change between calls.
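A minimal sketch of this suggestion, assuming the client ID is available once at import time. Reading it from an environment variable is an assumption made here for illustration (the PR may load it from config), and `MAL_HEADERS` and `get_mal_headers` are hypothetical names that do not appear in the diff.

```python
import os
from typing import Dict

# Assumption: the client ID comes from the environment; the PR may load it from config instead.
MAL_CLIENT_ID: str = os.environ.get("MAL_CLIENT_ID", "")

# Built once at import time instead of on every _fetch_mal_page call.
MAL_HEADERS: Dict[str, str] = {"X-MAL-CLIENT-ID": MAL_CLIENT_ID}


def get_mal_headers() -> Dict[str, str]:
    """Return a copy so callers cannot mutate the shared module-level dict."""
    return dict(MAL_HEADERS)
```

Returning a copy is optional; passing `MAL_HEADERS` directly also works as long as no caller mutates it.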
Suryanshg
reviewed
Apr 28, 2026
"ranking_type": ranking_type,
"limit": limit,
"offset": (page - 1) * limit,
"fields": "id,title,alternative_titles,mean,synopsis,genres,main_picture,start_date,status,studios,num_episodes,average_episode_duration,rating",
Collaborator
"id,title,alternative_titles,mean,synopsis,genres,main_picture,start_date,status,studios,num_episodes,average_episode_duration,rating"
Maybe make this long string a constant?
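One way this could look, as a sketch only: the field names are copied from the diff, while `MAL_ANIME_FIELDS`, `MAL_FIELDS_PARAM`, and `build_ranking_params` are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple, Union

# Field names copied from the diff under review; one per entry so changes are easy to review.
MAL_ANIME_FIELDS: Tuple[str, ...] = (
    "id",
    "title",
    "alternative_titles",
    "mean",
    "synopsis",
    "genres",
    "main_picture",
    "start_date",
    "status",
    "studios",
    "num_episodes",
    "average_episode_duration",
    "rating",
)

# Joined once at import time into the comma-separated form used in the query params.
MAL_FIELDS_PARAM: str = ",".join(MAL_ANIME_FIELDS)


def build_ranking_params(ranking_type: str, page: int, limit: int) -> Dict[str, Union[str, int]]:
    """Build the query params for one ranking page (pages are 1-indexed, as in the diff)."""
    return {
        "ranking_type": ranking_type,
        "limit": limit,
        "offset": (page - 1) * limit,
        "fields": MAL_FIELDS_PARAM,
    }
```

Keeping the fields as a tuple means adding or removing one shows up as a one-line diff rather than an edit inside a long string.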
Suryanshg
reviewed
Apr 28, 2026
# Exclude specific genres and empty genre lists
# Using rating + genre exclusion since sometimes rating or genre alone do not capture all non-friendly show content
EXCLUDED_GENRES = {"Hentai", "Erotica", "Unknown"} # If there exist genres # None of the genres are excluded
Collaborator
Confused by "# None of the genres are excluded".
This may be a carryover comment from a previous iteration, so please double-check and update the inline comments.
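For reference, a plain-Python sketch of what the combined rating and genre filter might look like with the inline comments reworded. This is not the PR's pandas code: `is_allowed` and `EXCLUDED_RATINGS` are hypothetical, the choice of "rx" as the excluded rating is an assumption, and genres are assumed to be a list of name strings.

```python
from typing import Any, Dict

# Shows are dropped if ANY of their genres is in this set.
EXCLUDED_GENRES = {"Hentai", "Erotica", "Unknown"}

# Assumption: "rx" is the rating to exclude; the PR does not state which ratings it filters.
EXCLUDED_RATINGS = {"rx"}


def is_allowed(anime: Dict[str, Any]) -> bool:
    """Keep a show only if it has genres, none are excluded, and its rating is not excluded."""
    genres = anime.get("genres") or []  # assumption: a list of genre name strings
    if not genres:
        # Empty or missing genre lists are excluded as well.
        return False
    if EXCLUDED_GENRES.intersection(genres):
        return False
    # Rating is checked in addition to genres because either one alone can miss content.
    return anime.get("rating") not in EXCLUDED_RATINGS
```

Note that in this sketch a record with no rating at all passes the rating check; whether that is desirable depends on the data.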
Suryanshg
reviewed
Apr 28, 2026
anime_df = anime_df[anime_df["synopsis"].notna() & (anime_df["synopsis"] != "")]

# Generate text metadata
# TODO: Make this metadata better for retrieval
Collaborator
This TODO should go inside the method create_text_metadata_and_embedding().
Suryanshg
approved these changes
Apr 28, 2026