<center>
    <h1 id='api-consumer-scrapper' style='color:#7159c1'>⛏️ API Consumer: Scraper ⛏️</h1>
    <i>Fetching anime, user and score data from MyAnimeList with Jikan</i>
</center>

```
- Animes Data Fetcher
- Users Data Fetcher
- Users Details Fetcher
- Users Scores Fetcher
```

All data is extracted from the [MyAnimeList (MAL)](https://myanimelist.net) website through its unofficial API, [Jikan](https://jikan.moe). Jikan limits the number of requests per unit of time, so you can use this API Consumer in two ways: 1) fetching only the data for IDs 0 to 99; 2) fetching all available data. The second option takes a long time, so be patient 😅

The table below shows the request limits per time window:

<table style='border-style: solid'>
    <caption>Jikan API Request Limits</caption>
    <tr align='center' style='border-style: solid'>
        <th style='border-style: solid'>Duration</th>
        <th style='border-style: solid'>Requests</th>
    </tr>
    <tr align='center'>
        <td style='border-style: solid'>Daily</td>
        <td style='border-style: solid'><b>Unlimited</b></td>
    </tr>
    <tr align='center'>
        <td style='border-style: solid'>Per Minute</td>
        <td style='border-style: solid'>60 requests</td>
    </tr>
    <tr align='center'>
        <td style='border-style: solid'>Per Second</td>
        <td style='border-style: solid'>3 requests</td>
    </tr>
</table>

You can check its documentation here: [Jikan Documentation](https://docs.api.jikan.moe/#section/Information).
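To stay within the limits in the table above, requests can be spaced out on the client side. The sketch below is one minimal way to do that and is not part of the original notebook: `MinIntervalThrottle` and its parameters are hypothetical names, and it assumes that keeping at least one second between requests (60 per minute) is enough, since that spacing also satisfies the 3-per-second limit.

```python
import time


class MinIntervalThrottle:
    """Blocks so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable, so the throttle can be tested without waiting
        self._sleep = sleep      # injectable, so the throttle can be tested without waiting
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 60 requests per minute means at most one request per second
throttle = MinIntervalThrottle(min_interval=60 / 60)
```

Calling `throttle.wait()` right before each `requests.get(...)` would keep a loop under the per-minute limit regardless of how fast the responses come back.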

Now, let's kick off fetching the data!!

In [1]:
# ---- Imports ----
from bs4 import BeautifulSoup   # pip install beautifulsoup4
import json                     # standard library
import numpy as np              # pip install numpy
import pandas as pd             # pip install pandas
import re                       # standard library
import requests                 # pip install requests
import time                     # standard library

# ---- Settings ----
pd.set_option('display.max_columns', None)

# ---- Constants ----
DATASETS_PATH = './datasets'

TRIES_LIMIT = 3 # maximum number of attempts per request

# ANIME_IDS = np.arange(0, 60000) # all animes
# USER_IDS = np.arange(0, 1030000) # all users

ANIME_IDS = np.arange(0, 100) # sample animes
USER_IDS = np.arange(0, 100) # sample users

STATUS_CODE = 7 # 'status' filter of the MyAnimeList list page; 7 selects all list entries
BATCH_SIZE = 250 # number of users to fetch in each batch
MIN_DELAY_SECONDS = 60 # minimum delay between batches
MAX_DELAY_SECONDS = 90 # maximum delay between batches

<p id='0-animes-data-fetcher' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Animes Data Fetcher</p>

<p style='font-size:150%'>📛 Features</p>

<br />

> **ID** - `anime id on MyAnimeList`;

> **Name** - `anime original name`;

> **English Name** - `English version name`;

> **Japanese Name** - `Japanese version name`;

> **Score** - `average score`;

> **Genres** - `related genres`;

> **Synopsis** - `brief description`;

> **Type** - `type of animation (TV, movie, OVA...)`;

> **Episodes** - `number of episodes. Movies are considered having 1 episode`;

> **Aired** - `period when anime was aired`;

> **Premiered** - `season when the anime was released`;

> **Status** - `current status (airing, hiatus, finished...)`;

> **Producers** - `related production companies`;

> **Licensors** - `related streaming platforms and licensors`;

> **Studios** - `related animation studios`;

> **Source** - `source material of the anime (originated from manga, light novel, movie or tv)`; 

> **Duration** - `duration of the movie or each episode`;

> **Rating** - `age restriction`;

> **Rank** - `rank position on MyAnimeList website (based on Score criteria)`;

> **Popularity** - `popularity position on MyAnimeList`;

> **Favorites** - `number of users that marked the anime as favorite`;

> **Members** - `number of users that added the anime to the watch list`;

> **Image Url** - `banner url`;

> **Watching** - `number of users that are watching`;

> **Completed** - `number of users that finished watching all episodes`;

> **On Hold** - `number of users that stopped watching the anime but kept it in the watch list`;

> **Dropped** - `number of users that stopped watching the anime and removed it from the watch list`;

> **Plan to Watch** - `number of users that added the anime to the watch list but have not started watching`;

> **Scored By** - `number of users that rated the anime`;

> **Score 1** - `number of users that rated the anime with score 1`;

> **Score 2** - `number of users that rated the anime with score 2`;

> **Score 3** - `number of users that rated the anime with score 3`;

> **Score 4** - `number of users that rated the anime with score 4`;

> **Score 5** - `number of users that rated the anime with score 5`;

> **Score 6** - `number of users that rated the anime with score 6`;

> **Score 7** - `number of users that rated the anime with score 7`;

> **Score 8** - `number of users that rated the anime with score 8`;

> **Score 9** - `number of users that rated the anime with score 9`;

> **Score 10** - `number of users that rated the anime with score 10`.

<br />

---

<br />

<p style='font-size:150%'>🧮 Calculations</p>

<br />

> **Number of Requests per Anime** - `2 (one for general data and another for statistics)`;

> **Delay per Anime Request** - `2 seconds`;

<br />

> **Sample Number of IDs** - `100`;

> **Total Number of IDs** - `60,000`;

<br />

> **Time to Fetch the First 100 IDs** - `3 minutes (no retries) to 10 minutes (all retries)`;

> **Time to Fetch All Animes** - `about 33 hours (no retries) to 100 hours (all retries), given 60,000 IDs and a 2-second delay`.

In [11]:
# ---- Dataset Structure ----
header = [
    'id', 'name', 'english_name', 'japanese_name', 'score', 'genres',
    'synopsis', 'type', 'episodes', 'aired', 'premiered', 'status',
    'producers', 'licensors', 'studios', 'source', 'duration', 'rating',
    'rank', 'popularity', 'favorites', 'members', 'image_url', 'watching',
    'completed', 'on_hold', 'dropped', 'plan_to_watch', 'scored_by',  'score_1',
    'score_2', 'score_3', 'score_4', 'score_5', 'score_6', 'score_7', 'score_8',
    'score_9', 'score_10'
]
full_anime_df = pd.DataFrame(columns=header)

# ---- API Requests ----
for anime_id in ANIME_IDS:
    anime_api_url = f'https://api.jikan.moe/v4/anime/{anime_id}'
    statistics_api_url = f'https://api.jikan.moe/v4/anime/{anime_id}/statistics'
    anime_page = None
    statistics_page = None

    tries = 0
    while tries < TRIES_LIMIT:
        tries += 1
        anime_page = requests.get(anime_api_url)
        statistics_page = requests.get(statistics_api_url)
        
        if anime_page.status_code == 200 and statistics_page.status_code == 200: break
        time.sleep(2)
    
    # if the request successfully returned the anime data, it is processed
    if anime_page.status_code == 200 and statistics_page.status_code == 200:
        anime_json_data = anime_page.json()
        statistics_json_data = statistics_page.json()
    
        # if 'data' property is present, it's formatted and inserted into the dataset
        if 'data' in anime_json_data and 'data' in statistics_json_data:
            anime = {}
        
            anime['id'] = [anime_id]
            anime['name'] = [anime_json_data['data'].get('title')]
            anime['english_name'] = [anime_json_data['data'].get('title_english')]
            anime['japanese_name'] = [anime_json_data['data'].get('title_japanese')]
            anime['score'] = [anime_json_data['data'].get('score')]
            anime['genres']  = [', '.join([genre['name'] for genre in anime_json_data['data'].get('genres', [])])]
        
            synopsis =  anime_json_data['data'].get('synopsis')
            if synopsis is not None:
                cleared_synopsis = re.sub(r'\[.*?\]', '', synopsis).strip() # remove bracketed text, brackets included
                anime['synopsis'] = [cleared_synopsis]
            else:
                anime['synopsis'] = ['']
            
            anime['type'] = [anime_json_data['data'].get('type')]
            anime['episodes'] = [anime_json_data['data'].get('episodes')]
            anime['aired'] = [anime_json_data['data'].get('aired', {}).get('string')]
        
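            premiered = anime_json_data['data'].get('season')
            year = anime_json_data['data'].get('year')
            # 'season' can be None (e.g. for movies); only append the year when a season exists
            if premiered is not None and year is not None: premiered += ' ' + str(year)
            anime['premiered'] = [premiered]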
        
            anime['status'] = [anime_json_data['data'].get('status')]
            anime['producers'] = [', '.join([producer['name'] for producer in anime_json_data['data'].get('producers', [])])]
            anime['licensors'] = [', '.join([licensor['name'] for licensor in anime_json_data['data'].get('licensors', [])])]
            anime['studios'] = [', '.join([studio['name'] for studio in anime_json_data['data'].get('studios', [])])]
            anime['source'] = [anime_json_data['data'].get('source')]
            anime['duration'] = [anime_json_data['data'].get('duration')]
            anime['rating'] = [anime_json_data['data'].get('rating')]
            anime['rank'] = [anime_json_data['data'].get('rank')]
            anime['popularity'] = [anime_json_data['data'].get('popularity')]
            anime['favorites'] = [anime_json_data['data'].get('favorites')]
            anime['members'] = [anime_json_data['data'].get('members')]
            anime['image_url'] = [anime_json_data['data'].get('images', {}).get('jpg', {}).get('image_url')]
        
            anime['watching'] = [statistics_json_data['data'].get('watching')]
            anime['completed'] = [statistics_json_data['data'].get('completed')]
            anime['on_hold'] = [statistics_json_data['data'].get('on_hold')]
            anime['dropped'] = [statistics_json_data['data'].get('dropped')]
            anime['plan_to_watch'] = [statistics_json_data['data'].get('plan_to_watch')]
            
            anime['scored_by'] = [anime_json_data['data'].get('scored_by')]
            # 'scores' is read by position, as before: index 0 is score 1, index 9 is score 10;
            # missing entries become None instead of raising an IndexError
            scores = statistics_json_data['data'].get('scores') or []
            for index in range(10):
                votes = scores[index].get('votes') if index < len(scores) else None
                anime[f'score_{index + 1}'] = [votes]
        
            anime_df = pd.DataFrame.from_dict(anime)
            full_anime_df = pd.concat([full_anime_df, anime_df])
            
        # if 'data' property is not present, the anime id is skipped
        else: print('Skipping anime {}: Invalid data'.format(anime_id))
        
    # if the request still fails after 'TRIES_LIMIT' attempts, the anime id is skipped
    else: print('Skipping anime {}: Not existent'.format(anime_id))

print('Finished fetching animes.')

Skipping anime 0: Invalid data
Skipping anime 2: Not existent
Skipping anime 3: Not existent
Skipping anime 4: Not existent
Skipping anime 9: Not existent
Skipping anime 10: Not existent
Skipping anime 11: Not existent
Skipping anime 12: Not existent
Skipping anime 13: Not existent
Skipping anime 14: Not existent
Skipping anime 34: Not existent
Skipping anime 35: Not existent
Skipping anime 36: Not existent
Skipping anime 37: Not existent
Skipping anime 38: Not existent
Skipping anime 39: Not existent
Skipping anime 40: Not existent
Skipping anime 41: Not existent
Skipping anime 42: Not existent
Skipping anime 70: Not existent
Skipping anime 78: Not existent
Finished fetching animes.


In [12]:
# ---- Transforming Data for Better CSV Export ----
full_anime_df.set_index('id', inplace=True)
full_anime_df['name'] = full_anime_df['name'].str.replace(';', ' ')
full_anime_df['english_name'] = full_anime_df['english_name'].str.replace(';', ' ')
full_anime_df['japanese_name'] = full_anime_df['japanese_name'].str.replace(';', ' ')
full_anime_df

id,name,english_name,japanese_name,score,genres,synopsis,type,episodes,aired,premiered,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,members,image_url,watching,completed,on_hold,dropped,plan_to_watch,scored_by,score_1,score_2,score_3,score_4,score_5,score_6,score_7,score_8,score_9,score_10
1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26,"Apr 3, 1998 to Apr 24, 1999",spring 1998,Finished Airing,Bandai Visual,Funimation,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),46,43,80199,1810795,https://cdn.myanimelist.net/images/anime/4/196...,168945,1042925,103814,41257,453789,935165,2216,1034,1940,4850,13221,31264,94119,197678,264424,324405
5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1,"Sep 1, 2001",,Finished Airing,"Sunrise, Bandai Visual","Sony Pictures Entertainment, Funimation",Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),190,609,1514,368088,https://cdn.myanimelist.net/images/anime/1439/...,6746,272456,2901,1157,84816,209894,451,144,285,754,2453,7661,30044,65933,62938,39213
6,Trigun,Trigun,トライガン,8.21,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26,"Apr 1, 1998 to Sep 30, 1998",spring 1998,Finished Airing,Victor Entertainment,Funimation,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,341,248,15511,742310,https://cdn.myanimelist.net/images/anime/7/203...,45150,435908,32602,18201,210412,363659,646,403,896,2488,7502,19880,64107,111389,93853,62490
7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.24,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26,"Jul 3, 2002 to Dec 25, 2002",summer 2002,Finished Airing,"Bandai Visual, Dentsu, Victor Entertainment, T...","Funimation, Bandai Entertainment",Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2916,1818,625,113948,https://cdn.myanimelist.net/images/anime/10/19...,5581,51590,5873,6132,44770,43264,170,190,408,1210,3269,6520,12968,10945,5168,2416
8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52,"Sep 30, 2004 to Sep 29, 2005",fall 2004,Finished Airing,"TV Tokyo, Dentsu",Illumitoon Entertainment,Toei Animation,Manga,23 min per ep,PG - Children,4389,5230,16,15201,https://cdn.myanimelist.net/images/anime/7/215...,804,8137,832,1237,4190,6467,54,51,94,284,703,1200,1863,1320,560,338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Turn A Gundam,∀ Gundam,∀ガンダム,7.76,"Action, Adventure, Award Winning, Drama, Roman...","It is the Correct Century, two millennia after...",TV,50,"Apr 9, 1999 to Apr 14, 2000",spring 1999,Finished Airing,"Sotsu, Fuji TV, Nakamura Production",Nozomi Entertainment,Sunrise,Original,24 min per ep,PG-13 - Teens 13 or older,1040,3104,1008,46903,https://cdn.myanimelist.net/images/anime/4/783...,3337,19189,2792,1792,19787,16519,106,102,155,371,861,1755,3646,4015,2863,2646
96,Kidou Butouden G Gundam,Mobile Fighter G Gundam,機動武闘伝Gガンダム,7.58,"Action, Drama, Romance, Sci-Fi","In the year Future Century 0060, the many coun...",TV,49,"Apr 1, 1994 to Mar 31, 1995",spring 1994,Finished Airing,"TV Asahi, Sotsu","Nozomi Entertainment, Bandai Entertainment",Sunrise,Original,24 min per ep,PG-13 - Teens 13 or older,1474,2540,1017,67979,https://cdn.myanimelist.net/images/anime/1187/...,3525,46151,3035,1971,13297,36719,126,151,324,798,1918,4060,9641,9900,5705,4096
97,Last Exile,Last Exile,LAST EXILE（ラストエグザイル）,7.79,"Action, Adventure, Sci-Fi","In the world of Prester, flight is the dominan...",TV,26,"Apr 8, 2003 to Sep 30, 2003",spring 2003,Finished Airing,"GDH, Victor Entertainment, TV Tokyo Music","Funimation, Geneon Entertainment USA",Gonzo,Original,24 min per ep,PG-13 - Teens 13 or older,974,1326,1841,172299,https://cdn.myanimelist.net/images/anime/7/108...,7218,84932,7206,5771,67172,70457,148,152,332,948,2941,6641,16565,21836,13392,7504
98,Mai-HiME,My-Hime,舞-HiME,7.42,"Action, Fantasy, Girls Love","While taking a ferry to Fuuka Academy, new stu...",TV,26,"Oct 1, 2004 to Apr 1, 2005",fall 2004,Finished Airing,"Bandai Visual, TV Tokyo Music, Studio Zain","Funimation, Bandai Entertainment",Sunrise,Original,23 min per ep,PG-13 - Teens 13 or older,2079,1814,1153,114373,https://cdn.myanimelist.net/images/anime/13/72...,4276,65389,4019,4745,35946,51489,236,240,440,1263,3410,6791,13936,13299,7527,4347


In [13]:
# ---- Storing Dataset into Disk ----
full_anime_df.to_csv(f'{DATASETS_PATH}/animes.csv')

<p id='1-users-data-fetcher' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Users Data Fetcher</p>

<p style='font-size:150%'>📛 Features</p>

<br />

> **ID** - `user id`;

> **Username** - `nickname`;

> **URL** - `user URL on MyAnimeList`.

<br />

---

<br />

<p style='font-size:150%'>🧮 Calculations</p>

<br />

> **Number of Requests per User** - `1`;

> **Delay per User Request** - `1 second`;

<br />

> **Sample Number of IDs** - `100`;

> **Total Number of IDs** - `1,030,000`;

<br />

> **Time to Fetch the First 100 IDs** - `1.5 minutes (no retries) to 5 minutes (all retries)`;

> **Time to Fetch All Users** - `286 hours (no retries) to 858 hours (all retries)`.
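The estimates above follow from multiplying the number of IDs by the delay per request and by the number of attempts. A small sketch of that arithmetic, assuming the delay is paid on every attempt and network time is ignored (`estimate_hours` is a hypothetical helper, not part of the notebook):

```python
def estimate_hours(number_of_ids, delay_seconds, tries=1):
    """Return the estimated fetch time in hours: ids x delay x tries."""
    return number_of_ids * delay_seconds * tries / 3600


# all users, one attempt each versus three attempts each (TRIES_LIMIT = 3)
best_case = estimate_hours(1_030_000, delay_seconds=1)
worst_case = estimate_hours(1_030_000, delay_seconds=1, tries=3)
print(f'{best_case:.0f} h to {worst_case:.0f} h')  # prints "286 h to 858 h"
```

The same helper gives roughly 1.7 minutes for the 100-ID sample (`estimate_hours(100, 1) * 60`), in line with the rounded figure listed above.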

In [14]:
# ---- Dataset Structure ----
header = ['id', 'username', 'url']
full_users_df = pd.DataFrame(columns=header)

# ---- API Requests ----
for user_id in USER_IDS:
    user_api_url = f'https://api.jikan.moe/v4/users/userbyid/{user_id}'
    user_page = None
    
    tries = 0
    while tries < TRIES_LIMIT:
        tries += 1
        user_page = requests.get(user_api_url)
        if user_page.status_code == 200: break
        time.sleep(1)
    
    # if the request successfully returned the user data, it's processed
    if user_page.status_code == 200:
        user_json_data = user_page.json()
        
        # if 'data' property is present, it's formatted and inserted into the dataset
        if 'data' in user_json_data:
            user = {}
            
            user['id'] = [user_id]
            user['username'] = [user_json_data['data'].get('username')]
            user['url'] = [user_json_data['data'].get('url')]
            
            user_df = pd.DataFrame.from_dict(user)
            full_users_df = pd.concat([full_users_df, user_df])
        
        # if 'data' property is not present, the user id is skipped
        else: print('Skipping user {}: Invalid data.'.format(user_id))
    
    # if the request still fails after 'TRIES_LIMIT' attempts, the user id is skipped
    else: print('Skipping user {}: Not existent.'.format(user_id))

print('Finished fetching users.')

Skipping user 0: Invalid data.
Skipping user 2: Not existent.
Skipping user 5: Not existent.
Skipping user 6: Not existent.
Skipping user 7: Not existent.
Skipping user 8: Not existent.
Skipping user 10: Not existent.
Skipping user 11: Not existent.
Skipping user 12: Not existent.
Skipping user 13: Not existent.
Skipping user 14: Not existent.
Skipping user 15: Not existent.
Skipping user 16: Not existent.
Skipping user 17: Not existent.
Skipping user 19: Not existent.
Skipping user 21: Not existent.
Skipping user 22: Not existent.
Skipping user 24: Not existent.
Skipping user 25: Not existent.
Skipping user 26: Not existent.
Skipping user 27: Not existent.
Skipping user 28: Not existent.
Skipping user 29: Not existent.
Skipping user 30: Not existent.
Skipping user 31: Not existent.
Skipping user 32: Not existent.
Skipping user 33: Not existent.
Skipping user 34: Not existent.
Skipping user 35: Not existent.
Skipping user 38: Not existent.
Skipping user 39: Not existent.
Skipping user 

In [15]:
# ---- Transforming Data for Better CSV Export ----
full_users_df.set_index('id', inplace=True)
full_users_df

id,username,url
1,Xinil,https://myanimelist.net/profile/Xinil
3,Aokaado,https://myanimelist.net/profile/Aokaado
4,Crystal,https://myanimelist.net/profile/Crystal
9,Arcane,https://myanimelist.net/profile/Arcane
18,Mad,https://myanimelist.net/profile/Mad
20,vondur,https://myanimelist.net/profile/vondur
23,Amuro,https://myanimelist.net/profile/Amuro
36,Baman,https://myanimelist.net/profile/Baman
37,megan,https://myanimelist.net/profile/megan
44,beddan,https://myanimelist.net/profile/beddan


In [17]:
# ---- Storing Dataset into Disk ----
full_users_df.to_csv(f'{DATASETS_PATH}/users.csv')

<p id='2-users-details-fetcher' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>2 | Users Details Fetcher</p>

<p style='font-size:150%'>📛 Features</p>

<br />

> **ID** - `user id`;

> **Username** - `nickname`;

> **Gender** - `user gender`;

> **Birthday** - `birthday`;

> **Location** - `user's location or country`;

> **Joined** - `date the user joined MyAnimeList (ISO 8601 format)`;

> **Days Watched** - `total time, in days, the user has spent watching anime`;

> **Mean Score** - `the average score the user gives to the watched animes`;

> **Total Entries** - `total number of animes into the user's list`;

> **Watching** - `number of animes currently being watched by the user`;

> **Completed** - `number of animes finished by the user`;

> **On Hold** - `number of animes that the user stopped watching but kept in their list`;

> **Dropped** - `number of animes that the user stopped watching and removed from their list`;

> **Plan to Watch** - `number of animes that the user has added to the list but has not started watching`;

> **Rewatched** - `number of animes rewatched`;

> **Episodes Watched** - `number of episodes watched from all animes`.

<br />

---

<br />

<p style='font-size:150%'>🧮 Calculations</p>

<br />

> **Number of Requests per User** - `1`;

> **Delay per User Request** - `1 second`;

<br />

> **Sample Number of IDs** - `100`;

> **Total Number of IDs** - `1,030,000`;

<br />

> **Time to Fetch the First 100 IDs** - `1.5 minutes (no retries) to 5 minutes (all retries)`;

> **Time to Fetch All Users** - `286 hours (no retries) to 858 hours (all retries)`.

In [18]:
# ---- Dataset Structure ----
header = [
    'id', 'username', 'gender', 'birthday', 'location',
    'joined', 'days_watched', 'mean_score', 'total_entries',
    'watching', 'completed', 'on_hold', 'dropped', 'plan_to_watch',
    'rewatched', 'episodes_watched'
]

full_users_details_df = pd.DataFrame(columns=header)
usernames_list = full_users_df['username'].to_list()

# ---- API Requests ----
for username in usernames_list:
    user_details_url = f'https://api.jikan.moe/v4/users/{username}/full'
    user_details_page = None
    
    tries = 0
    while tries < TRIES_LIMIT:
        tries += 1
        user_details_page = requests.get(user_details_url)
        if user_details_page.status_code == 200: break
        time.sleep(1)
    
    # if the request successfully fetched the data, it's processed
    if user_details_page.status_code == 200:
        user_details_json_data = user_details_page.json()
        
        # if 'data' property is present, it's transformed and inserted into the dataset
        if 'data' in user_details_json_data:
            details = {}
            
            details['id'] = [user_details_json_data['data'].get('mal_id')]
            details['username'] = [user_details_json_data['data'].get('username')]
            details['gender'] = [user_details_json_data['data'].get('gender')]
            details['birthday'] = [user_details_json_data['data'].get('birthday')]
            details['location'] = [user_details_json_data['data'].get('location')]
            
            details['joined'] = [user_details_json_data['data'].get('joined')]
            
            anime_statistics = user_details_json_data['data'].get('statistics', {}).get('anime', {})
            details['days_watched'] = [anime_statistics.get('days_watched')]
            details['mean_score'] = [anime_statistics.get('mean_score')]
            details['total_entries'] = [anime_statistics.get('total_entries')]
            
            details['watching'] = [anime_statistics.get('watching')]
            details['completed'] = [anime_statistics.get('completed')]
            details['on_hold'] = [anime_statistics.get('on_hold')]
            details['dropped'] = [anime_statistics.get('dropped')]
            details['plan_to_watch'] = [anime_statistics.get('plan_to_watch')]
            details['rewatched'] = [anime_statistics.get('rewatched')]
            details['episodes_watched'] = [anime_statistics.get('episodes_watched')]
            
            details_df = pd.DataFrame.from_dict(details)
            full_users_details_df = pd.concat([full_users_details_df, details_df])
        
        # if 'data' property is not present, the user is skipped
        else: print('Skipping user {}: Invalid data.'.format(username))
    
    # if the request still fails after 'TRIES_LIMIT' attempts, the user is skipped
    else: print('Skipping user {}: Not existent.'.format(username))
        
print('Finished fetching users details.')

Finished fetching users details.


In [19]:
# ---- Transforming Data for Better CSV Export ----
full_users_details_df.set_index('id', inplace=True)
full_users_details_df

id,username,gender,birthday,location,joined,days_watched,mean_score,total_entries,watching,completed,on_hold,dropped,plan_to_watch,rewatched,episodes_watched
1,Xinil,Male,1985-03-04T00:00:00+00:00,California,2004-11-05T00:00:00+00:00,142.3,7.37,399,1,233,8,93,64,60,8481
3,Aokaado,Male,,"Oslo, Norway",2004-11-11T00:00:00+00:00,68.7,7.34,343,23,137,99,44,40,15,4073
4,Crystal,Female,,"Melbourne, Australia",2004-11-13T00:00:00+00:00,212.8,6.68,1000,16,636,303,0,45,10,12781
9,Arcane,,,,2004-12-05T00:00:00+00:00,30.0,7.71,66,5,54,4,3,0,0,1817
18,Mad,,,,2005-01-03T00:00:00+00:00,52.0,6.27,153,1,114,10,5,23,42,3038
20,vondur,Male,1988-01-25T00:00:00+00:00,"Bergen, Norway",2005-01-05T00:00:00+00:00,73.1,8.06,138,11,94,11,2,20,7,4374
23,Amuro,,1988-02-22T00:00:00+00:00,Canada,2005-01-23T00:00:00+00:00,142.5,7.41,392,20,298,5,19,50,0,8565
36,Baman,Male,,Land of Rain and Fjords,2005-02-05T00:00:00+00:00,276.5,5.91,1588,30,1166,10,55,327,36,16569
37,megan,Female,1987-06-18T00:00:00+00:00,"San Diego, California",2005-02-12T00:00:00+00:00,22.7,8.07,108,8,71,9,6,14,0,1346
44,beddan,Male,,,2005-02-21T00:00:00+00:00,18.6,7.6,37,0,37,0,0,0,0,1083


In [20]:
# ---- Storing Dataset into Disk ----
full_users_details_df.to_csv(f'{DATASETS_PATH}/users_details.csv')

<p id='3-users-scores-fetcher' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>3 | Users Scores Fetcher</p>

<p style='font-size:150%'>📛 Features</p>

<br />

> **User ID** - `user id on MyAnimeList Platform`;

> **Username** - `nickname`;

> **Anime ID** - `anime id on MyAnimeList Platform`;

> **Anime Title** - `anime original name`;

> **Score** - `score that the user gave the anime`.

<br />

---

<br />

<p style='font-size:150%'>🧮 Calculations</p>

<br />

> **Number of Requests per User Score** - `1`;

> **Delay per User Score Request** - `1 second`;

<br />

> **Sample Number of IDs** - `100`;

> **Total Number of IDs** - `1,030,000`;

<br />

> **Number of Users Fetched per Batch** - `250 users`;

> **Delay Between Batches** - `60 to 90 seconds`;

<br />

> **Time to Fetch the First 100 IDs** - `1.5 minutes (no retries) to 5 minutes (all retries)`;

> **Time to Fetch All Users** - `286 hours (no retries) to 858 hours (all retries), disregarding the delay between batches`.
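The batching described above can be sketched as follows. This is a minimal illustration under assumptions, not the notebook's implementation: `iterate_in_batches` is a hypothetical helper, the constants mirror the ones defined at the top of the notebook, and the `sleep` parameter exists only so the pause can be swapped out in tests.

```python
import random
import time

BATCH_SIZE = 250          # mirrors the notebook constant
MIN_DELAY_SECONDS = 60    # mirrors the notebook constant
MAX_DELAY_SECONDS = 90    # mirrors the notebook constant


def iterate_in_batches(items, batch_size=BATCH_SIZE, sleep=time.sleep):
    """Yield items batch by batch, pausing a random 60 to 90 seconds between batches."""
    for start in range(0, len(items), batch_size):
        if start > 0:
            # randomized pause so requests do not arrive at a fixed rhythm
            sleep(random.uniform(MIN_DELAY_SECONDS, MAX_DELAY_SECONDS))
        yield items[start:start + batch_size]
```

A loop such as `for usernames_batch in iterate_in_batches(usernames): ...` would then process 250 users, pause, and continue with the next 250. No pause is added before the first batch or after the last one.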

In [21]:
# ---- Functions ----
def scrap_user_profile(username, user_id, status_code):
    """
    - Definition:
        \ makes a requisition to MyAnimeList.net in order to get a user's ratings;
        \ it tries 'TRIES_LIMIT' times to fetch the data;
        \ if it fails, RETURNS NONE;
        \ if it successfully fetches the data, all present tables in the document are extracted;
        \ if the scores data are present into the tables, they are formatted and RETURNED;
        \ else, NONE is RETURNED.
    
    - Parameters:
        \ username (string)
        \ user_id (integer)
        \ status_code (integer)
    """
    user_profile_url = f'https://myanimelist.net/animelist/{username}?status={status_code}'
    user_profile_page = None
    
    tries = 0
    while tries < TRIES_LIMIT:
        tries += 1
        user_profile_page = requests.get(user_profile_url)
        if user_profile_page.status_code == 200: break
        time.sleep(1) # wait 1 second before retrying, as the other fetchers do
    
    # if the request successfully fetched the data, it's processed
    if user_profile_page.status_code == 200:
        soup = BeautifulSoup(user_profile_page.content, 'html.parser')
        table_1 = soup.find('table', {'data-items': True})
        table_2 = soup.find_all('table', {'border': '0', 'cellpadding': '0', 'cellspacing': '0', 'width': '100%'})
        data_items_parsed = None
        
        # if the main table is present, its data is scraped
        if table_1:
            data_items = table_1['data-items']
            
            try: data_items_parsed = json.loads(data_items)
            except json.JSONDecodeError: return None
            
            data = []
            for data_item in data_items_parsed:
                anime_id = data_item['anime_id']
                title = data_item['anime_title']
                score = data_item['score']
                if score != 0: data.append([user_id, username, anime_id, title, score])
            return data
        
        # else, if other tables are present, their data is scraped
        elif table_2:
            data = []
            
            for table in table_2:
                row = table.find('tr')
                
                if row:
                    cells = row.find_all('td')
                    
                    if len(cells) >= 5:
                        anime_title_cell = cells[1]
                        score_cell = cells[2]
                        
                        anime_title_link = anime_title_cell.find('a', class_='animetitle')
                        anime_id = anime_title_link['href'].split('/')[2] if anime_title_link else ''
                        anime_title_span = anime_title_link.find('span') if anime_title_link else None
                        anime_title = anime_title_span.text.strip() if anime_title_span else ''
                        
                        score_label = score_cell.find('span', class_='score-label')
                        score = score_label.text.strip() if score_label else '-'
                        
                        # only numeric scores are kept, cast to int to match the 'data-items' branch
                        if anime_title and score.isdigit(): data.append([user_id, username, anime_id, anime_title, int(score)])
            return data
        
        # else, the user is skipped
        else:
            print('Skipping User Score: {} ({}): Data not found.'.format(username, user_id))
            return None
    
    # if the request fails 'TRIES_LIMIT' times to fetch the data, the user is skipped
    else:
        print('Skipping User Score: {} ({}): Invalid data.'.format(username, user_id))
        return None
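
The `data-items` branch above boils down to decoding a JSON array and keeping only the scored entries. Below is a minimal, standalone sketch of that step. `rows_from_data_items` is a hypothetical helper name, and the sample payload is made up for illustration; only the `anime_id`, `anime_title` and `score` keys come from the function above.

```python
import json


def rows_from_data_items(data_items, user_id, username):
    """Hypothetical helper: turn a 'data-items' JSON string into score rows."""
    try:
        items = json.loads(data_items)
    except json.JSONDecodeError:
        # malformed attribute: mirror the function above and give up
        return None

    rows = []
    for item in items:
        # entries whose score is 0 are skipped, as the function above does
        if item['score'] != 0:
            rows.append([user_id, username, item['anime_id'], item['anime_title'], item['score']])
    return rows


# made-up payload shaped like the attribute read by scrap_user_profile
sample = json.dumps([
    {'anime_id': 21, 'anime_title': 'One Piece', 'score': 9},
    {'anime_id': 48, 'anime_title': '.hack//Sign', 'score': 0},
])
print(rows_from_data_items(sample, 1, 'Xinil'))
```

An empty list means the payload was decoded but held no scored entries, whereas `None` means it could not be decoded at all. A caller that checks `if scores_data:` treats both cases the same way.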

In [23]:
# ---- Imports ----
import random  # needed for the random delay between batches below

# ---- Dataset Structure ----
header = ['user_id', 'username', 'anime_id', 'anime_title', 'score']
full_user_scores_df = pd.DataFrame(columns=header)

# ---- Users Data ----
usernames = full_users_df['username'].to_list()
user_ids = full_users_df.reset_index()['id'].to_list()

# ---- Scraping ----
for batch_index in range(0, len(usernames), BATCH_SIZE):
    usernames_batch = usernames[batch_index:(batch_index + BATCH_SIZE)]
    user_ids_batch = user_ids[batch_index:(batch_index + BATCH_SIZE)]
    
    for username, user_id in zip(usernames_batch, user_ids_batch):
        scores_data = scrap_user_profile(username, user_id, STATUS_CODE)
        
        if scores_data:
            for score in scores_data:
                full_user_scores_df.loc[len(full_user_scores_df)] = score
        else: print(f'No user details found for username: {username} ({user_id})')
    
    if batch_index + BATCH_SIZE < len(usernames):
        # Adding a random delay between batches
        delay_seconds = random.randint(MIN_DELAY_SECONDS, MAX_DELAY_SECONDS)
        print(f'Waiting for {delay_seconds} seconds before the next batch...')
        time.sleep(delay_seconds)
    
if len(full_user_scores_df) > 0: print('Users scores fetched successfully.')
else: print('No users scores found.')

No user details found for username: Aokaado (3)
Skipping User Score: Mad (18): Invalid data.
No user details found for username: Mad (18)
Skipping User Score: Baman (36): Invalid data.
No user details found for username: Baman (36)
Skipping User Score: beddan (44): Invalid data.
No user details found for username: beddan (44)
No user details found for username: Emp (77)
Skipping User Score: xich (90): Invalid data.
No user details found for username: xich (90)
Users scores fetched successfully.
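
The batching loop above can be isolated into a small generator, which makes the pause logic easy to check without actually waiting. This is only a sketch: `iter_batches_with_delay` and its `sleep` parameter are hypothetical names that are not part of the notebook, and the batch size and delays used below are arbitrary.

```python
import random
import time


def iter_batches_with_delay(items, batch_size, min_delay, max_delay, sleep=time.sleep):
    """Hypothetical helper: yield slices of 'items', pausing between batches."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

        # no pause after the last batch, as in the loop above
        if start + batch_size < len(items):
            sleep(random.randint(min_delay, max_delay))


# a fake 'sleep' records the delays so the example runs instantly
delays = []
batches = list(iter_batches_with_delay(['a', 'b', 'c', 'd', 'e'], 2, 1, 3, sleep=delays.append))
print(batches)
print(len(delays))
```

Passing the sleep function as a parameter is a design choice made for this sketch: the real run keeps `time.sleep`, while a dry run can swap in a recorder as shown.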


In [24]:
# ---- Transforming Data for Better CSV Export ----
full_user_scores_df['username'] = full_user_scores_df['username'].str.replace(';', ' ')
full_user_scores_df['anime_title'] = full_user_scores_df['anime_title'].str.replace(';', ' ')
full_user_scores_df

Unnamed: 0,user_id,username,anime_id,anime_title,score
0,1,Xinil,21,One Piece,9
1,1,Xinil,48,.hack//Sign,7
2,1,Xinil,320,A Kite,5
3,1,Xinil,49,Aa! Megami-sama!,8
4,1,Xinil,304,Aa! Megami-sama! Movie,8
...,...,...,...,...,...
3212,95,sorairo,21,One Piece,1
3213,95,sorairo,7079,Ookamikakushi,1
3214,95,sorairo,527,Pokemon,1
3215,95,sorairo,4719,Queen's Blade: Rurou no Senshi,3


In [27]:
# ---- Storing Dataset into Disk ----
full_user_scores_df.to_csv(f'{DATASETS_PATH}/users_scores.csv')
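
By default `to_csv` also writes the DataFrame index as an unnamed first column, so reading the file back needs `index_col=0` to avoid an extra `Unnamed: 0` column. Below is a minimal sketch using an in-memory buffer instead of a file. The two rows are copied from the output above, and the variable names are only illustrative.

```python
import io

import pandas as pd

scores_df = pd.DataFrame(
    [[1, 'Xinil', 21, 'One Piece', 9], [1, 'Xinil', 48, '.hack//Sign', 7]],
    columns=['user_id', 'username', 'anime_id', 'anime_title', 'score'],
)

# with no path, to_csv returns the CSV text instead of writing a file
csv_text = scores_df.to_csv()

# without index_col, the saved index comes back as a regular column
naive_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that first column back into the index
restored_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive_df.columns))
print(list(restored_df.columns))
```

If the index carries no information, passing `index=False` to `to_csv` is an alternative that drops the column at write time.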

---

<p id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</p>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/)