Mal data collection #39

Merged
ShafathZ merged 14 commits into main from mal-data-collection on Apr 28, 2026
Conversation

@ShafathZ
Owner

  • Scraped data from MAL / Jikan.moe in anime_scrape.py to enrich our DB with new data
  • Added MongoDB collection and DB names to config
  • Modified populate_mongo_db_scraped to work with the new JSON format
  • Pushed data to the anime_enriched collection on the live MongoDB store

@ShafathZ added the enhancement label (New feature or request) on Apr 26, 2026
Comment thread data/anime_scrape.py Outdated

# Fetch data from MAL
def _fetch_mal_page(ranking_type: str, page: int, limit: int) -> List[Dict]:
    headers = {"X-MAL-CLIENT-ID": MAL_CLIENT_ID}
Collaborator

Minor optimization: you can initialize this outside the function as a module-level constant, since it doesn't change between calls.

Owner Author

Addressed in caa8551

Comment thread data/anime_scrape.py Outdated
"ranking_type": ranking_type,
"limit": limit,
"offset": (page - 1) * limit,
"fields": "id,title,alternative_titles,mean,synopsis,genres,main_picture,start_date,status,studios,num_episodes,average_episode_duration,rating",
Collaborator

"id,title,alternative_titles,mean,synopsis,genres,main_picture,start_date,status,studios,num_episodes,average_episode_duration,rating"

Maybe make this long string a constant?

Owner Author

Addressed in caa8551
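
A minimal sketch of what pulling the field list into a constant might look like, together with the page-to-offset arithmetic from the diff. The names MAL_ANIME_FIELDS and build_ranking_params are hypothetical, and this is not necessarily how caa8551 implements it.

```python
from typing import Dict, Union

# Hypothetical constant name; the values are the fields quoted in the review comment.
MAL_ANIME_FIELDS = ",".join([
    "id", "title", "alternative_titles", "mean", "synopsis", "genres",
    "main_picture", "start_date", "status", "studios", "num_episodes",
    "average_episode_duration", "rating",
])


def build_ranking_params(ranking_type: str, page: int, limit: int) -> Dict[str, Union[str, int]]:
    """Build the query parameters for one 1-indexed page of results."""
    return {
        "ranking_type": ranking_type,
        "limit": limit,
        "offset": (page - 1) * limit,  # page 1 maps to offset 0
        "fields": MAL_ANIME_FIELDS,
    }


params = build_ranking_params("all", page=3, limit=100)
print(params["offset"])  # 200
```

Keeping the fields as a list joined at import time makes it easier to add or drop a single field in a later diff than editing one long string.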

Comment thread data/populate_mongo_db_scraped.py Outdated

# Exclude specific genres and empty genre lists
# Using rating + genre exclusion since sometimes rating or genre alone do not capture all non-friendly show content
EXCLUDED_GENRES = {"Hentai", "Erotica", "Unknown"} # If there exist genres # None of the genres are excluded
Collaborator

I'm confused by the inline comment "# None of the genres are excluded".

It may be a carryover from a previous iteration, so please double-check and update the inline comments.

Owner Author

Addressed in caa8551
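
To make the intent of the combined rating + genre exclusion concrete, here is a small sketch in plain Python. It assumes each record's genres are already flattened to a list of names, and the EXCLUDED_RATINGS set with its "rx" code is a hypothetical stand-in; the actual script works on a DataFrame and may use different rating values.

```python
from typing import Dict, List

EXCLUDED_GENRES = {"Hentai", "Erotica", "Unknown"}  # taken from the diff above
EXCLUDED_RATINGS = {"rx"}  # assumption: hypothetical set of blocked rating codes


def is_allowed(anime: Dict) -> bool:
    """Keep a record only if it passes both the genre check and the rating check."""
    genres: List[str] = anime.get("genres") or []
    if not genres:  # empty or missing genre lists are excluded
        return False
    if EXCLUDED_GENRES.intersection(genres):  # any excluded genre rejects the record
        return False
    return anime.get("rating") not in EXCLUDED_RATINGS


# Made-up sample records, for illustration only.
records = [
    {"title": "A", "genres": ["Action"], "rating": "pg_13"},
    {"title": "B", "genres": [], "rating": "g"},
    {"title": "C", "genres": ["Comedy", "Erotica"], "rating": "r+"},
    {"title": "D", "genres": ["Drama"], "rating": "rx"},
]
kept = [r["title"] for r in records if is_allowed(r)]
print(kept)  # ['A']
```

Record C shows why the genre check is needed even when the rating looks acceptable, and record D shows the reverse case.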

Comment thread data/populate_mongo_db_scraped.py Outdated
anime_df = anime_df[anime_df["synopsis"].notna() & (anime_df["synopsis"] != "")]

# Generate text metadata
# TODO: Make this metadata better for retrieval
Collaborator

This TODO belongs inside the create_text_metadata_and_embedding() method.

Owner Author

Addressed in caa8551

@ShafathZ merged commit fd1521d into main on Apr 28, 2026
2 checks passed
@ShafathZ deleted the mal-data-collection branch on May 10, 2026 at 03:15