# COGS 108 - Data Checkpoint

## Authors
Team list and credits
- Kabyan Pathak: Writing - review & editing
- Nicholas Chan: Conceptualization, Background research
- Siyuan Zheng: Conceptualization, Data curation
- Sophia Zhao: Methodology, Experimental investigation
- Yuntian Zhao: Project administration, Analysis, Software


## Research Question

What impact has the increase in isekai anime production had on audience reception over the past decade? How has the average rating of the genre changed? How does episode count correlate with rating and viewership among isekai titles? Does the season of release (Winter/Spring/Summer/Fall) have any impact on the reception of these titles?

## Background and Prior Work


Over the last decade, anime has become more mainstream, and isekai has become one of its most-produced genres. "Isekai", meaning "another world" in Japanese, refers to a story in which the protagonist is transported to, trapped in, or reborn into a fantasy world. 
While the genre has been around since the 1980s, it has only recently taken off. In 2024 alone, the genre made up 15% of new anime TV releases, with 34 isekai titles released. However, as the genre has grown, it 
has faced increasing criticism from audiences who feel it is overproduced and filled with unoriginal ideas. (1) 

With this in mind, our group decided to explore how audiences actually feel about the genre. By analyzing more than a decade of industry data and user ratings from MyAnimeList (MAL), we aim to determine whether the rapid growth 
in isekai production has contributed to a decline in average ratings or whether the genre has maintained its popularity. We also want to see whether episode count or release season has any effect on overall audience rating. 

In reviewing previous analyses of anime that used MyAnimeList data, we found that shows with more than 26 episodes appeared to be about 33% less popular than shorter series (2). Another analysis of the rise of isekai found that, despite the 
increase in production, isekai titles are still performing very well. Despite a community consensus that there is too much isekai, the viewership data in that analysis shows that the genre has not yet hit the point of diminishing returns and was the second most 
popular genre in English-speaking territories in 2024, after action. (3) 

 Sources: 
1. [Anime Industry Overproduction & Isekai Report](https://screenrant.com/anime-industry-problems-overproduction-isekai-report/) 
2. [What Makes an Anime Popular? Statistical Analysis Using MAL Data](https://rpubs.com/SPIJ/714155) 
3. [The State of Isekai Anime](https://www.animenewsnetwork.com/feature/2025-01-22/the-state-of-isekai-anime/.219776) 


## Hypothesis


We hypothesize that we will see a decrease in the overall rating and viewership of the genre, because oversaturation leads to lower-quality productions and unoriginal, repetitive plots. We also expect to see a trend toward shorter series within the genre, and higher-quality or more popular isekai titles being released in the fall and winter.

## Data

### Data overview

- Dataset #1
  - Dataset Name: MyAnimeList dataset extended to 2026
  - Link to the dataset: https://www.kaggle.com/code/rivuletnriver/mal-data-scraping and https://www.kaggle.com/datasets/neelagiriaditya/anime-dataset-jan-1917-to-oct-2025
  - Number of observations: 29382
  - Number of variables: 29
  - Description of the variables most relevant to this project: season (the season in which the anime aired), title (title of the anime), score (public rating), scored_by (number of users who rated it), year (year of release)
  - Descriptions of any shortcomings this dataset has with respect to the project: the most recent entries, which we added to extend the dataset, do not have public ratings yet.

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

# import sys
# sys.path.append('./modules') # this tells python where to look for modules to import

# import get_data # this is where we get the function we need to download data

# # replace the urls and filenames in this list with your actual datafiles
# # yes you can use Google drive share links or whatever
# # format is a list of dictionaries; 
# # each dict has keys of 
# #   'url' where the resource is located
# #   'filename' for the local filename where it will be stored 
# datafiles = [
#     { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
#     { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
# ]

# get_data.get_raw(datafiles,destination_directory='data/00-raw/')

In [3]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

### MyAnimeList site info updated to 2026

This dataset was created by updating the [Kaggle dataset source](https://www.kaggle.com/datasets/neelagiriaditya/anime-dataset-jan-1917-to-oct-2025) with the most recent anime entries from 2026 using [this script](https://www.kaggle.com/code/rivuletnriver/mal-data-scraping). The script lists all anime through the official MyAnimeList API, compares that list with the entries already in the Kaggle dataset, and retrieves the missing entries through the unofficial Jikan API. 
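
The comparison step described above can be sketched as a set difference on `mal_id`. This is a minimal illustration and not the linked script itself: the helper `find_missing_ids` and the toy IDs are hypothetical, and it assumes the API listing has already been collected into a plain list of IDs.

```python
import pandas as pd


def find_missing_ids(existing: pd.DataFrame, listed_ids) -> list:
    """Return the IDs that appear in the API listing but not in the existing dataset."""
    known = set(existing["mal_id"].astype(int).tolist())
    return sorted(set(int(i) for i in listed_ids) - known)


# toy stand-ins for the Kaggle dataset and the API listing (made-up IDs)
existing = pd.DataFrame({"mal_id": [1, 5, 6]})
listed = [1, 5, 6, 7, 8]

missing = find_missing_ids(existing, listed)
print(missing)  # [7, 8]
```

Only the IDs returned here would then need to be requested from the Jikan API, which keeps the number of rate-limited calls small.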

The dataset includes many columns describing each anime. The most important ones for answering our research question are:
1. mal_id: anime ID on the official MyAnimeList website
2. title: title of the anime as listed on MyAnimeList
3. type: type of the work (for example Movie, Music, OVA, TV, Special)
4. score: public rating of the anime
5. start_date: when the anime started airing
6. synopsis: summary of the plot of the anime
7. rank, popularity, members, favorites: measures of how successful an anime is beyond its score
8. genres, themes: genre and theme tags describing the plot
9. season: the season in which an anime aired

One problem with this dataset is that it contains many missing values, which either can be inferred from known information or need to be retrieved through the Jikan API. However, since it is drawn from the largest public anime site, it provides valuable data for answering our research question.

#### Quick overview

In [4]:
# load the raw dataset and take a first look at its shape and first rows
anime2026_df = pd.read_csv("data/00-raw/anime_extended_winter2026.csv")
print(anime2026_df.shape)
anime2026_df.head()

(29382, 29)


Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,end_date,synopsis,rank,popularity,members,favorites,genres,studios,themes,demographics,source,rating,episodes,season,year,producers,explicit_genres,licensors,streaming
0,59356,-Socket-,-socket-,https://myanimelist.net/anime/59356/-Socket-,https://cdn.myanimelist.net/images/anime/1043/...,Movie,Finished Airing,,,2010-01-01T00:00:00+00:00,,A girl with a cord growing out of her back wan...,17086.0,22507,195,0,['Comedy'],[],[],[],Original,G - All Ages,1.0,,,['Nagoya Zokei University'],[],[],[]
1,56036,......,......,https://myanimelist.net/anime/56036/-,https://cdn.myanimelist.net/images/anime/1057/...,Music,Finished Airing,6.53,503.0,2023-06-11T00:00:00+00:00,,Music video directed by obmolot for the song ....,,15004,941,2,"['Horror', 'Supernatural']",['Flat Studio'],['Music'],[],Original,PG-13 - Teens 13 or older,1.0,,,[],[],[],[]
2,2928,.hack//G.U. Returner,.HACK//G.U. RETURNER,https://myanimelist.net/anime/2928/hack__GU_Re...,https://cdn.myanimelist.net/images/anime/1798/...,OVA,Finished Airing,6.65,9745.0,2007-01-18T00:00:00+00:00,,The characters from previous .hack//G.U. Games...,6366.0,5056,22525,31,"['Adventure', 'Drama', 'Fantasy']",['Bee Train'],['Video Game'],[],Game,PG-13 - Teens 13 or older,1.0,,,"['Bandai Visual', 'CyberConnect2']",[],[],[]
3,3269,.hack//G.U. Trilogy,.hack//G.U. Trilogy,https://myanimelist.net/anime/3269/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/1566/...,Movie,Finished Airing,7.06,15373.0,2007-12-22T00:00:00+00:00,,"Based on the CyberConnect2 HIT GAME, now will ...",4194.0,4215,34264,104,"['Action', 'Fantasy']",['CyberConnect2'],['Video Game'],[],Game,PG-13 - Teens 13 or older,1.0,,,['Bandai Visual'],[],"['Funimation', 'Bandai Entertainment']",[]
4,4469,.hack//G.U. Trilogy: Parody Mode,.hack//G.U. Trilogy,https://myanimelist.net/anime/4469/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/10/86...,Special,Finished Airing,6.35,4317.0,2008-03-25T00:00:00+00:00,,A special bonus Parody Mode added to the extra...,8182.0,6696,11135,10,"['Comedy', 'Fantasy', 'Sci-Fi']",[],"['Parody', 'Video Game']",[],Game,PG-13 - Teens 13 or older,1.0,,,['Bandai Visual'],[],[],[]


In [5]:
anime2026_df.columns

Index(['mal_id', 'title', 'title_japanese', 'url', 'image_url', 'type',
       'status', 'score', 'scored_by', 'start_date', 'end_date', 'synopsis',
       'rank', 'popularity', 'members', 'favorites', 'genres', 'studios',
       'themes', 'demographics', 'source', 'rating', 'episodes', 'season',
       'year', 'producers', 'explicit_genres', 'licensors', 'streaming'],
      dtype='object')

#### Infer the season from the start date where it is missing

In [6]:
anime2026_df['season'].isna().sum()

23068

In [7]:
# end_date is less useful than start_date because too many of its values are missing
print(anime2026_df['start_date'].isna().sum())
print(anime2026_df['end_date'].isna().sum())

851
18075


In [8]:
# parse start_date so that the month can be extracted
anime2026_df['start_date'] = pd.to_datetime(anime2026_df['start_date'])
# meteorological seasons: Dec-Feb = Winter, Mar-May = Spring, Jun-Aug = Summer, Sep-Nov = Fall
season_map = {12: 'Winter', 1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall', 11: 'Fall'}
# alternative by calendar quarter (Jan-Mar = Winter, ..., Oct-Dec = Fall), currently unused
# season_map = {1: 'Winter', 2: 'Winter', 3: 'Winter', 4: 'Spring', 5: 'Spring', 6: 'Spring', 7: 'Summer', 8: 'Summer', 9: 'Summer', 10: 'Fall', 11: 'Fall', 12: 'Fall'}
anime2026_df['season_inferred'] = anime2026_df['start_date'].dt.month.map(season_map)

anime2026_df['season_inferred']

0        Winter
1        Summer
2        Winter
3        Winter
4        Spring
          ...  
29377    Summer
29378    Winter
29379    Summer
29380    Summer
29381    Spring
Name: season_inferred, Length: 29382, dtype: object

In [9]:
# check for number of conflicts between inferred value and the original
real_season = anime2026_df['season'].str.lower()
inferred_season = anime2026_df['season_inferred'].str.lower()
conflicts = anime2026_df[(anime2026_df['season'].notna()) & (real_season != inferred_season)]
conflicts.shape[0]

43
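
Part of the disagreement may come from the choice of month-to-season map. As a small worked example, the sketch below compares the two maps that appear in the cell above and lists the months on which they disagree. The names `METEOROLOGICAL` and `QUARTERLY` are ours, and the snippet does not inspect the 43 conflicting rows themselves.

```python
import pandas as pd

# the map used for season_inferred (Dec-Feb = Winter, ...)
METEOROLOGICAL = {12: 'Winter', 1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring',
                  6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall', 11: 'Fall'}
# the commented-out alternative (Jan-Mar = Winter, ...)
QUARTERLY = {1: 'Winter', 2: 'Winter', 3: 'Winter', 4: 'Spring', 5: 'Spring', 6: 'Spring',
             7: 'Summer', 8: 'Summer', 9: 'Summer', 10: 'Fall', 11: 'Fall', 12: 'Fall'}

months = pd.Series(range(1, 13))
# keep the months where the two conventions assign different seasons
differing_months = months[months.map(METEOROLOGICAL) != months.map(QUARTERLY)].tolist()
print(differing_months)  # [3, 6, 9, 12]
```

Titles that start in March, June, September, or December are therefore the ones whose inferred season depends on which convention is chosen.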

In [10]:
# merge into season column
anime2026_df['season'] = anime2026_df['season'].str.capitalize()
anime2026_df['season'] = anime2026_df['season'].combine_first(anime2026_df['season_inferred'])
anime2026_df.drop(columns=['season_inferred'], inplace=True)
anime2026_df['season'].isna().sum()

851

In [11]:
# check if any of the rows missing season info is of genre isekai
nan_seasons = anime2026_df[anime2026_df['season'].isna()]
isekai_missing_season = nan_seasons[nan_seasons['themes'].astype(str).str.contains('Isekai', case=False)]
print(f"{len(isekai_missing_season)} Isekai misses season value")

30 Isekai misses season value


In [12]:
isekai_missing_season['status'].value_counts()

status
Not yet aired      29
Finished Airing     1
Name: count, dtype: int64

29 of them have not aired yet, so they can be ignored.

In [13]:
isekai_missing_season[isekai_missing_season['status'] == 'Finished Airing']

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,end_date,synopsis,rank,popularity,members,favorites,genres,studios,themes,demographics,source,rating,episodes,season,year,producers,explicit_genres,licensors,streaming
6664,40244,Fushigi no Kuni no Alice Specials,ふしぎの国のアリス,https://myanimelist.net/anime/40244/Fushigi_no...,https://cdn.myanimelist.net/images/anime/1166/...,Special,Finished Airing,,,NaT,,,18695.0,19064,397,1,"['Adventure', 'Fantasy']",['Nippon Animation'],"['Anthropomorphic', 'Isekai']",['Kids'],Book,G - All Ages,2.0,,,[],[],[],[]


According to [its MyAnimeList page](https://myanimelist.net/anime/40244/Fushigi_no_Kuni_no_Alice_Specials), this entry consists of "Two episodes that were made for Japanese market but were not aired on TV", so no season or year of release is given.

So all 30 isekai entries without a season value can be dropped.

In [14]:
# drop every row that still has no season value (851 rows overall, 30 of them isekai)
anime2026_df = anime2026_df[~anime2026_df['season'].isna()]

#### Limit the scope to isekai anime and fill in missing values in important feature columns with the Jikan API

In [15]:
isekai_df = anime2026_df[anime2026_df['themes'].astype(str).str.contains('Isekai', case=False)].reset_index(drop = True)

In [16]:
import requests
import time
from tqdm import tqdm

ids_to_fix = isekai_df[isekai_df['score'].isna() | isekai_df['rank'].isna() | isekai_df['rating'].isna() | isekai_df['scored_by'].isna() | isekai_df['episodes'].isna()].index
print(f"total of {len(ids_to_fix)} rows")

for idx in tqdm(ids_to_fix, desc = "retrieving"):
    mal_id = isekai_df.at[idx, 'mal_id']
    try:
        url = f"https://api.jikan.moe/v4/anime/{mal_id}"
        # a timeout keeps one stalled request from hanging the whole loop
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            data = response.json().get('data', {})
            # overwrite the key fields with the values returned by the API
            isekai_df.at[idx, 'score'] = data.get('score')
            isekai_df.at[idx, 'rank'] = data.get('rank')
            isekai_df.at[idx, 'rating'] = data.get('rating')
            isekai_df.at[idx, 'scored_by'] = data.get('scored_by')
            isekai_df.at[idx, 'episodes'] = data.get('episodes')
            
        else:
            print(f"ID {mal_id}: {response.status_code}")
            
        # pause between requests to stay under the API rate limit
        time.sleep(1.2)

    except Exception as e:
        print(f"error on ID {mal_id}: {e}")

print('Done')

total of 56 rows


retrieving: 100%|██████████| 56/56 [01:47<00:00,  1.92s/it]

Done
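
The loop above writes every returned field back into the row, including fields that already had a value. If only the gaps should be filled, a variant along the following lines could be used. This is a sketch under our own assumptions: `fill_missing_fields` is a hypothetical helper, and `payload` is a made-up stand-in for the `data` dictionary returned by the API.

```python
import pandas as pd

FIELDS = ["score", "rank", "rating", "scored_by", "episodes"]


def fill_missing_fields(df, idx, data, fields=FIELDS):
    """Copy values from an API payload into row idx, only where the row is missing them."""
    filled = []
    for field in fields:
        new_value = data.get(field)
        # skip fields that already have a value, or that the payload cannot supply
        if pd.isna(df.at[idx, field]) and new_value is not None:
            df.at[idx, field] = new_value
            filled.append(field)
    return filled


# toy frame: row 0 is complete, row 1 is missing everything (made-up values)
demo = pd.DataFrame({
    "score": [7.5, None],
    "rank": [100.0, None],
    "rating": ["PG-13 - Teens 13 or older", None],
    "scored_by": [10.0, None],
    "episodes": [12.0, None],
})
payload = {"score": 6.1, "rank": None, "rating": "G - All Ages", "scored_by": 2500, "episodes": 12}

filled = fill_missing_fields(demo, 1, payload)
print(filled)
```

Returning the list of filled fields makes it easy to log which rows the API could actually repair.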





In [17]:
print(f"now we have {isekai_df[isekai_df['score'].isna()].shape[0]} entries missing a score")

now we have 19 entries missing a score


The entries still missing a score either have not aired yet or do not have a score on the official site, so we discard them.

In [18]:
# discard entries with no score, then keep titles with a large enough audience (more than 1000 ratings or members)
real_isekai = isekai_df[isekai_df['score'].notna() & ((isekai_df['scored_by'] > 1000) | (isekai_df['members'] > 1000))].reset_index(drop = True)
real_isekai.shape

(387, 29)

With 387 titles, the sample is large enough to run t-tests or ANOVA to check whether the scores of Winter and Fall releases differ significantly from those of the other two seasons.
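
As a rough sketch of how the seasonal comparison could be run, the snippet below applies a one-way ANOVA with `scipy.stats.f_oneway`. The helper `season_anova` and the `toy` frame with made-up scores are assumptions for illustration; the real test would be run on the cleaned isekai table.

```python
import pandas as pd
from scipy import stats


def season_anova(df, score_col="score", season_col="season"):
    """One-way ANOVA on scores, with one group per release season."""
    groups = [g[score_col].dropna().to_numpy() for _, g in df.groupby(season_col)]
    return stats.f_oneway(*groups)


# made-up scores, three titles per season
toy = pd.DataFrame({
    "season": ["Winter"] * 3 + ["Spring"] * 3 + ["Summer"] * 3 + ["Fall"] * 3,
    "score": [7.1, 6.8, 7.4, 6.5, 6.9, 6.2, 6.6, 7.0, 6.3, 7.2, 7.5, 6.9],
})

result = season_anova(toy)
print(f"F = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

A significant result would only say that at least one season differs; pairwise t-tests (with a correction for multiple comparisons) would be needed to single out Winter and Fall.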

In [19]:
real_isekai.isna().sum()

mal_id               0
title                0
title_japanese       1
url                  0
image_url            0
type                 0
status               0
score                0
scored_by            0
start_date           0
end_date           100
synopsis             1
rank                12
popularity           0
members              0
favorites            0
genres               0
studios              0
themes               0
demographics         1
source               0
rating               0
episodes             0
season               0
year               145
producers            1
explicit_genres      1
licensors            1
streaming            1
dtype: int64

In [20]:
# infer the year from start_date and count the rows where it differs from the original year
real_isekai['inferred_year'] = real_isekai['start_date'].dt.year
print(real_isekai[real_isekai['inferred_year'] != real_isekai['year']].shape[0])
# the only mismatches are the 145 rows with a missing year, so we fill them in
real_isekai['year'] = real_isekai['inferred_year'] 
real_isekai = real_isekai.drop('inferred_year', axis = 1)
# these columns hold lists; treat a missing value as an empty list
list_cols = ['producers', 'licensors', 'streaming', 'demographics']
for col in list_cols:
    real_isekai[col] = real_isekai[col].apply(lambda x: [] if pd.isna(x) else x)
# titles without a rank get a placeholder rank of 0
real_isekai = real_isekai.fillna(value={'rank' : 0})
# end_date and explicit_genres are not needed for the analysis
real_isekai = real_isekai.drop(['end_date', 'explicit_genres'], axis = 1)
# one title has no synopsis; fill it in by hand
mask_syn = real_isekai['synopsis'].isna()
real_isekai.loc[mask_syn, 'synopsis'] = ("follows a Japanese high schooler reincarnated as 8-year-old Princess Pride Royal Ivy, "
                                         "the destined evil final boss of an otome game. She uses her cheat abilities and "
                                         "game knowledge to prevent tragedies and save her kingdom")
# one title has no Japanese title; fill it in by hand
mask_title = real_isekai['title_japanese'].isna()
real_isekai.loc[mask_title, 'title_japanese'] = '劇場版 転生したらスライムだった件 紅蓮の絆編'

145


In [21]:
real_isekai.isna().sum()

mal_id            0
title             0
title_japanese    0
url               0
image_url         0
type              0
status            0
score             0
scored_by         0
start_date        0
synopsis          0
rank              0
popularity        0
members           0
favorites         0
genres            0
studios           0
themes            0
demographics      0
source            0
rating            0
episodes          0
season            0
year              0
producers         0
licensors         0
streaming         0
dtype: int64

In [22]:
real_isekai

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,synopsis,rank,popularity,members,favorites,genres,studios,themes,demographics,source,rating,episodes,season,year,producers,licensors,streaming
0,58985,0-saiji Start Dash Monogatari,0歳児スタートダッシュ物語,https://myanimelist.net/anime/58985/0-saiji_St...,https://cdn.myanimelist.net/images/anime/1973/...,TV,Finished Airing,5.43,2530.0,2024-07-07 00:00:00+00:00,"A woman in her thirties, who is enthusiastic a...",12703.0,6424,12263,24,"['Adventure', 'Fantasy']",[],"['Isekai', 'Reincarnation']",[],Manga,PG-13 - Teens 13 or older,12.0,Summer,2024,[],[],[]
1,60425,0-saiji Start Dash Monogatari Season 2,0歳児スタートダッシュ物語 シーズン2,https://myanimelist.net/anime/60425/0-saiji_St...,https://cdn.myanimelist.net/images/anime/1635/...,TV,Finished Airing,5.63,1080.0,2025-01-05 00:00:00+00:00,After reincarnating from her previous life as ...,12053.0,9660,4376,10,"['Adventure', 'Fantasy']",[],"['Isekai', 'Reincarnation']",[],Manga,PG-13 - Teens 13 or older,12.0,Winter,2025,[],[],[]
2,41380,100-man no Inochi no Ue ni Ore wa Tatteiru,100万の命の上に俺は立っている,https://myanimelist.net/anime/41380/100-man_no...,https://cdn.myanimelist.net/images/anime/1506/...,TV,Finished Airing,6.52,161906.0,2020-10-02 00:00:00+00:00,"Yuusuke Yotsuya has always disliked Tokyo, but...",7107.0,826,324016,701,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Maho Film'],['Isekai'],['Shounen'],Manga,PG-13 - Teens 13 or older,12.0,Fall,2020,"['Lantis', 'Warner Bros. Japan']",['Crunchyroll'],"['Crunchyroll', 'Netflix']"
3,44881,100-man no Inochi no Ue ni Ore wa Tatteiru 2nd...,100万の命の上に俺は立っている,https://myanimelist.net/anime/44881/100-man_no...,https://cdn.myanimelist.net/images/anime/1424/...,TV,Finished Airing,6.79,85676.0,2021-07-10 00:00:00+00:00,"Once again, the Game Master's world of quests ...",5509.0,1453,182088,640,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Maho Film'],['Isekai'],['Shounen'],Manga,PG-13 - Teens 13 or older,12.0,Summer,2021,['Warner Bros. Japan'],[],"['Crunchyroll', 'Ani-One Asia', 'Aniplus TV', ..."
4,48976,100-man no Inochi no Ue ni Ore wa Tatteiru Recap,100万の命の上に俺は立っている,https://myanimelist.net/anime/48976/100-man_no...,https://cdn.myanimelist.net/images/anime/1453/...,TV Special,Finished Airing,6.28,1306.0,2021-07-02 00:00:00+00:00,Recap of the first season of 100-man no Inochi...,8588.0,10198,3696,4,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Maho Film'],['Isekai'],['Shounen'],Manga,PG-13 - Teens 13 or older,1.0,Summer,2021,[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382,1840,Zero no Tsukaima: Futatsuki no Kishi,ゼロの使い魔 ～双月の騎士～,https://myanimelist.net/anime/1840/Zero_no_Tsu...,https://cdn.myanimelist.net/images/anime/2/227...,TV,Finished Airing,7.39,318015.0,2007-07-09 00:00:00+00:00,Revered as heroes for their role in defending ...,2490.0,465,522752,1290,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...",['J.C.Staff'],"['Harem', 'Isekai', 'School']",[],Light novel,PG-13 - Teens 13 or older,12.0,Summer,2007,"['Genco', 'Media Factory', 'Shochiku', 'Happin...",['Sentai Filmworks'],['HIDIVE']
383,3712,Zero no Tsukaima: Princesses no Rondo,ゼロの使い魔 ～三美姫（プリンセッセ）の輪舞（ロンド）～,https://myanimelist.net/anime/3712/Zero_no_Tsu...,https://cdn.myanimelist.net/images/anime/6/102...,TV,Finished Airing,7.31,279380.0,2008-07-07 00:00:00+00:00,Following his brave sacrifice in the war again...,2856.0,543,462835,809,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...",['J.C.Staff'],"['Harem', 'Isekai', 'School']",[],Light novel,PG-13 - Teens 13 or older,12.0,Summer,2008,[],['Sentai Filmworks'],['HIDIVE']
384,5764,Zero no Tsukaima: Princesses no Rondo - Yuuwak...,ゼロの使い魔 ～三美姫（プリンセッセ）の輪舞（ロンド）～ 誘惑の砂浜,https://myanimelist.net/anime/5764/Zero_no_Tsu...,https://cdn.myanimelist.net/images/anime/6/123...,Special,Finished Airing,7.22,52275.0,2008-12-25 00:00:00+00:00,DVD beach episode of Zero no Tsukaima: Princes...,3333.0,2360,96777,117,"['Comedy', 'Fantasy', 'Ecchi']",['J.C.Staff'],"['Harem', 'Isekai']",[],Light novel,PG-13 - Teens 13 or older,1.0,Winter,2008,[],['Sentai Filmworks'],['HIDIVE']
385,6489,Zero no Tsukaima: Princesses no Rondo Picture ...,ゼロの使い魔～三美姫の輪舞～ ピクチャードラマ,https://myanimelist.net/anime/6489/Zero_no_Tsu...,https://cdn.myanimelist.net/images/anime/12/22...,Special,Finished Airing,6.91,22948.0,2008-09-25 00:00:00+00:00,Picture Drama episodes included in each DVD vo...,4919.0,3377,52176,66,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...",['J.C.Staff'],"['Harem', 'Isekai', 'School']",[],Light novel,PG-13 - Teens 13 or older,7.0,Fall,2008,[],[],[]


What is left is to keep only the columns we will use and only anime of type TV. 

In [23]:
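# keep only the columns used in the analysis
finalized_isekai = real_isekai[['mal_id', 'title', 'type', 'status', 'score', 'start_date', 'rank', 'popularity', 'members','favorites', 'genres', 'themes', 'episodes', 'season', 'year']]
# keep TV series only; .copy() gives an independent frame so new columns can be added safely
finalized_isekai = finalized_isekai[finalized_isekai['type'] == 'TV'].copy()
finalized_isekai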

Unnamed: 0,mal_id,title,type,status,score,start_date,rank,popularity,members,favorites,genres,themes,episodes,season,year
0,58985,0-saiji Start Dash Monogatari,TV,Finished Airing,5.43,2024-07-07 00:00:00+00:00,12703.0,6424,12263,24,"['Adventure', 'Fantasy']","['Isekai', 'Reincarnation']",12.0,Summer,2024
1,60425,0-saiji Start Dash Monogatari Season 2,TV,Finished Airing,5.63,2025-01-05 00:00:00+00:00,12053.0,9660,4376,10,"['Adventure', 'Fantasy']","['Isekai', 'Reincarnation']",12.0,Winter,2025
2,41380,100-man no Inochi no Ue ni Ore wa Tatteiru,TV,Finished Airing,6.52,2020-10-02 00:00:00+00:00,7107.0,826,324016,701,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Isekai'],12.0,Fall,2020
3,44881,100-man no Inochi no Ue ni Ore wa Tatteiru 2nd...,TV,Finished Airing,6.79,2021-07-10 00:00:00+00:00,5509.0,1453,182088,640,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Isekai'],12.0,Summer,2021
5,306,Abenobashi Mahou☆Shoutengai,TV,Finished Airing,7.21,2002-04-04 00:00:00+00:00,3376.0,2364,96358,380,"['Award Winning', 'Comedy', 'Fantasy', 'Ecchi']","['Isekai', 'Parody']",13.0,Spring,2002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,58502,Zenshuu.,TV,Finished Airing,7.58,2025-01-05 00:00:00+00:00,1705.0,1706,149997,865,"['Action', 'Fantasy']",['Isekai'],12.0,Winter,2025
380,1195,Zero no Tsukaima,TV,Finished Airing,7.20,2006-07-03 00:00:00+00:00,3495.0,243,855507,8120,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...","['Harem', 'Isekai', 'School']",13.0,Summer,2006
381,11319,Zero no Tsukaima F,TV,Finished Airing,7.41,2012-01-07 00:00:00+00:00,2423.0,576,439736,1228,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...","['Harem', 'Isekai', 'School']",12.0,Winter,2012
382,1840,Zero no Tsukaima: Futatsuki no Kishi,TV,Finished Airing,7.39,2007-07-09 00:00:00+00:00,2490.0,465,522752,1290,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...","['Harem', 'Isekai', 'School']",12.0,Summer,2007


Finally, we added a season_length column that categorizes each series as Short (up to 13 episodes), Medium (14 to 26 episodes), or Long (more than 26 episodes), which we will use to answer the episode-count part of our research question.

In [24]:
finalized_isekai["season_length"] = pd.cut(
    finalized_isekai["episodes"],
    bins=[0, 13, 26, float("inf")],
    labels=["Short", "Medium", "Long"]
)
finalized_isekai

Unnamed: 0,mal_id,title,type,status,score,start_date,rank,popularity,members,favorites,genres,themes,episodes,season,year,season_length
0,58985,0-saiji Start Dash Monogatari,TV,Finished Airing,5.43,2024-07-07 00:00:00+00:00,12703.0,6424,12263,24,"['Adventure', 'Fantasy']","['Isekai', 'Reincarnation']",12.0,Summer,2024,Short
1,60425,0-saiji Start Dash Monogatari Season 2,TV,Finished Airing,5.63,2025-01-05 00:00:00+00:00,12053.0,9660,4376,10,"['Adventure', 'Fantasy']","['Isekai', 'Reincarnation']",12.0,Winter,2025,Short
2,41380,100-man no Inochi no Ue ni Ore wa Tatteiru,TV,Finished Airing,6.52,2020-10-02 00:00:00+00:00,7107.0,826,324016,701,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Isekai'],12.0,Fall,2020,Short
3,44881,100-man no Inochi no Ue ni Ore wa Tatteiru 2nd...,TV,Finished Airing,6.79,2021-07-10 00:00:00+00:00,5509.0,1453,182088,640,"['Action', 'Adventure', 'Drama', 'Fantasy']",['Isekai'],12.0,Summer,2021,Short
5,306,Abenobashi Mahou☆Shoutengai,TV,Finished Airing,7.21,2002-04-04 00:00:00+00:00,3376.0,2364,96358,380,"['Award Winning', 'Comedy', 'Fantasy', 'Ecchi']","['Isekai', 'Parody']",13.0,Spring,2002,Short
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,58502,Zenshuu.,TV,Finished Airing,7.58,2025-01-05 00:00:00+00:00,1705.0,1706,149997,865,"['Action', 'Fantasy']",['Isekai'],12.0,Winter,2025,Short
380,1195,Zero no Tsukaima,TV,Finished Airing,7.20,2006-07-03 00:00:00+00:00,3495.0,243,855507,8120,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...","['Harem', 'Isekai', 'School']",13.0,Summer,2006,Short
381,11319,Zero no Tsukaima F,TV,Finished Airing,7.41,2012-01-07 00:00:00+00:00,2423.0,576,439736,1228,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...","['Harem', 'Isekai', 'School']",12.0,Winter,2012,Short
382,1840,Zero no Tsukaima: Futatsuki no Kishi,TV,Finished Airing,7.39,2007-07-09 00:00:00+00:00,2490.0,465,522752,1290,"['Action', 'Adventure', 'Comedy', 'Fantasy', '...","['Harem', 'Isekai', 'School']",12.0,Summer,2007,Short
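
To make the bin edges concrete, here is a small worked example of the same `pd.cut` call on a handful of made-up episode counts. By default `pd.cut` treats each bin as closed on the right, so 13 falls in Short and 26 falls in Medium.

```python
import pandas as pd

# made-up episode counts chosen to sit on and around the bin edges
episodes = pd.Series([1, 12, 13, 14, 24, 26, 27, 50])

labels = pd.cut(
    episodes,
    bins=[0, 13, 26, float("inf")],
    labels=["Short", "Medium", "Long"],
)
length_labels = labels.astype(str).tolist()
print(length_labels)
# ['Short', 'Short', 'Short', 'Medium', 'Medium', 'Medium', 'Long', 'Long']
```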


In [25]:
# # save
# real_isekai.to_csv(r'data/02-processed/isekai_clean.csv', index = False)

finalized_isekai.to_csv(r'data/02-processed/isekai_final.csv', index = False)
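
With the cleaned table saved, a first look at the production-versus-rating part of the research question could be sketched as a per-year summary. The helper `yearly_summary` and the `toy` frame are hypothetical and use made-up values; the same call would be applied to the saved isekai table.

```python
import pandas as pd


def yearly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count titles and average their scores for each release year."""
    return (
        df.groupby("year")
          .agg(n_titles=("mal_id", "count"), mean_score=("score", "mean"))
          .reset_index()
    )


# made-up titles spread over three years
toy = pd.DataFrame({
    "mal_id": [1, 2, 3, 4, 5, 6],
    "year": [2020, 2020, 2021, 2021, 2021, 2022],
    "score": [7.0, 6.0, 6.5, 7.5, 7.0, 8.0],
})

summary = yearly_summary(toy)
print(summary)
# Pearson correlation between yearly output and yearly mean score
print(summary["n_titles"].corr(summary["mean_score"]))
```

A negative correlation on the real data would be consistent with the hypothesis, although it would not by itself show that higher output causes lower ratings.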


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

>  Our data comes from a public Kaggle dataset that was web-scraped from MAL. While individual users were not informed about this use of their data, we look at average/overall ratings rather than information from individual profiles. 

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
>  Our data relies on MAL ratings, which could themselves be biased. MAL users do not represent the whole community and may not proportionally represent anime fans from different regions. 

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

>  We will not use any individual's personal information, only relevant fields such as title, genre, members, episode count, and average ratings. 

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> N/A 

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

>  N/A. All the data we are using is open to the public, either through open-source datasets or APIs, so anyone can access it and there is no need for protection. 

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

>  N/A, as we will not be using anyone's personal information, only aggregated summaries. 

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

>  We acknowledge that our dataset comes from a subgroup of the whole community and represents more of the international fan base, but for the purpose of our research this is sufficient and we do not need to find other relevant data. 

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

>  It is possible that, when comparing by episode count, longer shows have higher ratings than shorter ones, but we have not yet decided how to address this. 

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

>  Yes, we are looking not just at ratings but also at the number of members to accurately show the popularity of a title. 

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

>  N/A, as we will not be using any individual's personal information. 

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

>  Yes, all our queries and data visualizations will be posted on the GitHub repo. 

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

>  We do not have any variables that would have bias associated with them. 

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

>  We will use the overall rating of a show in our calculations, and we have also considered factoring in members or popularity rank to account for shows with high ratings but few members and vice versa. 

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

>  N/A 

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

>  N/A 

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

>  N/A 

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

>  N/A 

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

>  N/A 

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

>  N/A 



## Team Expectations 

 
* Be active in team discussions; if you cannot contribute to the conversation at the time, react to a comment so we know you have seen it 
* Communication will be through Discord; check often and reply within 24 hours 
* Complete assigned tasks in a timely manner 
* Come to meetings well prepared and ready to contribute 


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them