# Data Preparation

Dataset from Kaggle : **"MyAnimeList"** by *Azathoth*  
Source: https://www.kaggle.com/datasets/azathoth42/myanimelist/data (requires login)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
userlist = pd.read_csv('DataSets/Raw Data/UserList.csv')
userlist.head()

Description of the dataset, as available on Kaggle, is as follows.


> **username** : User name  
> **user_id** : ID of each user  
> **user_watching** : Number of anime the user is currently watching  
> **user_completed** : Number of anime the user has completed  
> **user_onhold** : Number of anime the user has put on hold partway through  
> **user_dropped** : Number of anime the user has dropped from their list  
> **user_plantowatch** : Number of anime the user has added to their watch list  
> **user_days_spent_watching** : How much time the user has spent watching anime  
> **gender** : User's gender  
> **location** : Where the user is from  
> **birth_date** : User's date of birth  
> **access_rank** : Unknown  
> **join_date** : When the user joined the community  
> **last_online** : When the user was last seen online  
> **stats_mean_score** : Average score the user has given to anime  
> **stats_rewatched** : Number of episodes the user has rewatched  
> **stats_episodes** : Number of episodes the user has completed  
---

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.

In [None]:
print("Data type : ", type(userlist))
print("Data dims : ", userlist.shape)

Check the variables (and their types) in the dataset using the `info` method.

In [None]:
userlist.info()

---

### Import the Dataset (AnimeList)

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
animelist = pd.read_csv('DataSets/Raw Data/AnimeList.csv')
animelist.head()


Description of the dataset, as available on Kaggle, is as follows.


> **anime_id** : ID of each anime  
> **title** : Anime title  
> **title_english** : Anime title in English  
> **title_japanese** : Anime title in Japanese  
> **image_url** : URL of the poster image  
> **type** : Anime type (TV, Movie, etc.)  
> **source** : Source material (Manga, Original, etc.)  
> **episodes** : Number of episodes  
> **status** : Current status (airing, finished airing)  
> **airing** : Whether it is currently airing  
> **aired_string** : Start date and end date as text  
> **aired** : Start date and end date as a dictionary-style string with `from` and `to` keys  
> **duration** : Length of an episode or of the movie  
> **rating** : Age rating (PG13, NC16, M18, R21)  
> **score** : Overall score of the anime (out of 10)  
> **scored_by** : Number of users who scored the anime  
> **rank** : Rank based on the score of the anime  
> **popularity** : Rank based on how many people watch the anime  
> **members** : Number of people who watch the anime  
> **favorites** : Number of people who marked the anime as a favorite  
> **background** : Background information on the anime  
> **premiered** : Season in which the anime premiered  
> **broadcast** : Day on which it is broadcast  
> **related** : Related works (sequels, prequels, etc.)  
> **producer** : Who produced the anime  
> **licensor** : Who licenses the anime  
> **studio** : Studio that animated the anime  
> **genre** : Genres of the anime  
> **opening_theme** : Opening song  
> **ending_theme** : Ending song  
---

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.

In [None]:
print("Data type : ", type(animelist))
print("Data dims : ", animelist.shape)

Check the variables (and their types) in the dataset using the `info` method.

In [None]:
animelist.info()


## Clean Data (AnimeList)

How might we (action) for (target audience) in order to (outcome: the result we would like to see)?

e.g.  
How might we recommend the top 20 anime shows to anime beginners?  
How might we recommend the top 10 anime shows of the winter season to anime users?

### Gathering information 
---

> Describe numeric columns  
> Describe object columns  
> Display columns  

In [None]:
## for numeric data
animelist.describe()

In [None]:
## for data that is object
animelist.describe(include=object)

In [None]:
## what are the columns involved in the dataset
animelist.columns

---
### Premiered 

> Convert the `premiered` column to a binary indicator (1 or 0)

In [None]:
## create a new column "isPremiered" that contains 1 for rows where "premiered" has a value and 0 for rows where it is null.
## This new column acts as a binary indicator, showing whether an anime has a premiere season or not.
animelist["isPremiered"] = animelist["premiered"].notnull().astype(int)

In [None]:
animelist.isPremiered.info()

### Studio filtering
Group the less common studios (those with fewer than 40 titles) into a single "SmallStudio" category.

In [None]:
## replace missing studio names with "unknown"
animelist["studio"] = animelist["studio"].fillna("unknown")

## count the number of titles per studio
studio_counts = animelist.studio.value_counts()
studio_counts

In [None]:
# collect the studios with fewer than 40 titles
minor = studio_counts[studio_counts < 40].index.to_list()
minor

In [None]:
## combine those minor studios into one "SmallStudio" category
animelist["studio"] = animelist["studio"].apply(lambda x : "SmallStudio" if x in minor else x)
animelist.studio.value_counts()
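
The grouping step can also be written without a per-row `apply`. The sketch below runs on made-up data (the studio names and the threshold of 3 are illustrative assumptions, not values from the dataset) and uses `isin` to build a boolean mask in one pass.

```python
import pandas as pd

# Toy data standing in for the real `studio` column (names and counts are made up)
toy = pd.DataFrame({"studio": ["A"] * 5 + ["B"] * 2 + ["C"] + [None]})
threshold = 3  # the notebook uses 40; a small value keeps the example readable

# Same steps as in the notebook: fill missing studios, then count titles per studio
toy["studio"] = toy["studio"].fillna("unknown")
counts = toy["studio"].value_counts()
minor = counts[counts < threshold].index

# Vectorised replacement: isin() marks every row whose studio is in `minor`
toy.loc[toy["studio"].isin(minor), "studio"] = "SmallStudio"
grouped_counts = toy["studio"].value_counts().to_dict()
print(grouped_counts)
```

Note that in this toy example the "unknown" placeholder falls below the threshold and is folded into "SmallStudio" as well; whether that happens on the real data depends on how many titles lack a studio.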

### Drop the columns that are not needed
TODO: Add back producers

---
|                 | **Unnecessary columns** |                 |
|:---------------:|:---------------:|:---------------:|
| anime_id        | background      | opening_theme   |
| title_english   | premiered       | ending_theme    |
| title_japanese  | broadcast       | aired_string    |
| title_synonyms  | producer        |                 |
| image_url       | licensor        |                 |

In [None]:
## drop the columns that are not needed
animelist.drop(columns=['anime_id', 'title_english', 'title_japanese', 'title_synonyms', 'image_url', 'background',
       'premiered', 'broadcast', 'producer', 'licensor', 'opening_theme', 'ending_theme', 'aired_string'], inplace=True)

In [None]:
## after dropping the columns 
animelist.shape

In [None]:
animelist.columns

### Split aired date (from and to)
--- 
The `aired` column contains a dictionary-style string: `{'from': 'yyyy-mm-dd', 'to': 'yyyy-mm-dd'}`.

Split it into:  
`aired_from` -> yyyy-mm-dd  
`aired_to`   -> yyyy-mm-dd

Possible follow-ups: calculate the number of days over which the anime aired, and how frequently it aired.

In [None]:
# Splitting the 'aired' column into 'from' and 'to' columns
animelist[['aired_from', 'aired_to']] = animelist['aired'].str.extract(r"'from': '(.*?)', 'to': '(.*?)'")

# Displaying the DataFrame with the new columns
print(animelist[['aired_from', 'aired_to']])
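
The two follow-up calculations mentioned for this section (number of days aired, and airing frequency) are not implemented in the notebook. A minimal sketch, assuming the extracted strings are in `yyyy-mm-dd` form and using made-up rows with hypothetical column names `aired_days` and `days_per_episode`, might look like this:

```python
import pandas as pd

# Hypothetical rows mimicking the extracted columns; the second row has no dates
toy = pd.DataFrame({
    "aired_from": ["2012-01-13", None, "2007-04-02"],
    "aired_to": ["2012-03-30", None, "2007-10-01"],
    "episodes": [12, 1, 26],
})

# errors="coerce" turns missing or unparseable values into NaT instead of raising
start = pd.to_datetime(toy["aired_from"], errors="coerce")
end = pd.to_datetime(toy["aired_to"], errors="coerce")

# Days between first and last broadcast (NaN where either date is missing)
toy["aired_days"] = (end - start).dt.days

# Rough airing frequency: average gap in days between consecutive episodes
# (single-episode titles have no gap, so the divisor is masked to NaN)
gaps = (toy["episodes"] - 1).where(toy["episodes"] > 1)
toy["days_per_episode"] = toy["aired_days"] / gaps
print(toy)
```

Because of `errors="coerce"`, placeholder strings such as "Not aired" would also become `NaT` rather than raising an error.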

In [None]:
animelist.info()

### Split the Genres to columns
---

In [None]:
## fill missing values (NaN) with 'NA'
animelist.genre = animelist.genre.fillna("NA")

In [None]:
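## split the genres on ',' into one indicator column per genre
## (whitespace around the commas is removed first so that "Comedy" and " Comedy" do not become separate columns)
genre_animelist = animelist['genre'].str.replace(r'\s*,\s*', ',', regex=True).str.get_dummies(sep=',')
genre_animelist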

In [None]:
genre_animelist.info()
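
Whether the split yields exactly one column per genre depends on the separator used in the raw strings. The sketch below, on made-up genre strings (the ", " separator is an assumption about the raw file), shows how a space after the comma leads to duplicate columns, and how normalising the separator first avoids that.

```python
import pandas as pd

# Made-up genre strings; the ", " separator is an assumption about the raw file
genres = pd.Series(["Comedy, Romance", "Romance, Action", "NA"])

# Splitting on a bare comma keeps the leading space, so "Romance" and " Romance" differ
naive = genres.str.get_dummies(sep=",")
naive_cols = sorted(naive.columns)

# Normalising the separator first yields one column per genre
clean = genres.str.replace(r"\s*,\s*", ",", regex=True).str.get_dummies(sep=",")
clean_cols = sorted(clean.columns)

print(naive_cols)
print(clean_cols)
```

Comparing `genre_animelist.columns` against the expected genre names is a quick way to check which case applies to the real data.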

In [None]:
## combining the animelist data and genre data into animelist_df
animelist_df = pd.concat([animelist, genre_animelist], axis=1)
animelist_df.head()

In [None]:
## remove the original genre column
animelist_df.drop(columns=["genre"], inplace=True)
animelist_df

In [None]:
animelist_df.info()

### Check value contain any NULL
---

In [None]:
# count the null values in each column
for col in animelist_df:
    print(f" {col}         | has ({animelist_df[col].isnull().sum()})")

## from the output we can see:
## rating has 544 NaN
## rank has 1574 NaN
## aired_from has 2191 NaN
## aired_to has 2191 NaN

In [None]:
## count the total for each rating
animelist_df.rating.value_counts()

In [None]:
## treat missing ratings as "G - All Ages"
animelist_df['rating'] = animelist_df['rating'].fillna("G - All Ages")

## assign missing ranks the maximum rank (intended to prevent skewness)
animelist_df['rank'] = animelist_df['rank'].fillna(animelist_df['rank'].max())

## mark missing aired dates as "Not aired"
animelist_df['aired_from'] = animelist_df['aired_from'].fillna("Not aired")
animelist_df['aired_to'] = animelist_df['aired_to'].fillna("Not aired")

## TODO: find out whether the aired dates and the premiered season are related

In [None]:
# confirm that there are no null values left
for col in animelist_df:
    print(f" {col}         | has ({animelist_df[col].isnull().sum()})")
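
The column-by-column fills that precede this check can also be expressed as a single `DataFrame.fillna` call that takes a dictionary of per-column values. The sketch below uses made-up rows; only the fill values are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value in each column of interest
toy = pd.DataFrame({
    "rating": ["PG-13 - Teens 13 or older", np.nan],
    "rank": [10.0, np.nan],
    "aired_from": ["2012-01-13", np.nan],
})

# One fill value per column; columns not listed are left untouched
fill_values = {
    "rating": "G - All Ages",
    "rank": toy["rank"].max(),
    "aired_from": "Not aired",
}
toy = toy.fillna(value=fill_values)

# Total number of nulls remaining across the whole frame
remaining_nulls = int(toy.isnull().sum().sum())
print(remaining_nulls)
```

Assigning the result back, rather than calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment warnings that recent pandas versions raise.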
    

### Export to a new CSV file
---


In [None]:
#animelist_df.to_csv('outV2.csv', index=False) 

### Clean any other data if needed
---

## Dealing with Related Column

### Exploring JSON Structure for each data unit

In [None]:
related_cell0 = animelist_df["related"][0]
related_cell_mod = related_cell0.replace("'", "\"")
related_cell_mod

In [None]:
import json
related_dict = json.loads(related_cell_mod)
related_dict['Sequel'][0]['type']
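
Swapping quote characters works for the first cell, but it is fragile when a value itself contains a quote. Since the cells look like Python dictionary literals (an assumption based on the sample shown), `ast.literal_eval` from the standard library is one alternative. The cell below is made up to illustrate the difference.

```python
import ast
import json

# Made-up cell in the same shape as the `related` column; the title contains an apostrophe
cell = "{'Sequel': [{'mal_id': 1, 'type': 'anime', 'title': \"Don't Stop\"}]}"

# Naive quote swapping produces invalid JSON once a value contains a quote character
try:
    json.loads(cell.replace("'", '"'))
    swap_ok = True
except json.JSONDecodeError:
    swap_ok = False

# literal_eval parses Python literals (dicts, lists, strings, numbers) without executing code
parsed = ast.literal_eval(cell)
first_sequel_type = parsed["Sequel"][0]["type"]
print(swap_ok, first_sequel_type)
```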

### Testing with a sample df with 10 rows

In [144]:
related_df = animelist_df[["title","related"]]
test_related = related_df[10:20]
test_related

|    | title | related |
|:--:|:------|:--------|
| 10 | Junjou Romantica 2 | {'Adaptation': [{'mal_id': 765, 'type': 'manga... |
| 11 | Kaichou wa Maid-sama! | {'Adaptation': [{'mal_id': 2921, 'type': 'mang... |
| 12 | Sekaiichi Hatsukoi 2 | {'Adaptation': [{'mal_id': 10309, 'type': 'man... |
| 13 | Tonari no Kaibutsu-kun | {'Adaptation': [{'mal_id': 13702, 'type': 'man... |
| 14 | Bleach | {'Adaptation': [{'mal_id': 12, 'type': 'manga'... |
| 15 | Chobits | {'Adaptation': [{'mal_id': 107, 'type': 'manga... |
| 16 | Kimi ni Todoke | {'Adaptation': [{'mal_id': 3378, 'type': 'mang... |
| 17 | Naruto: Shippuuden | {'Adaptation': [{'mal_id': 11, 'type': 'manga'... |
| 18 | Ranma ½ | {'Adaptation': [{'mal_id': 23, 'type': 'manga'... |
| 19 | Toradora! | {'Adaptation': [{'mal_id': 7149, 'type': 'mang... |


In [167]:

related_dict = {}
related_dict['title']=[]

related_row_dict_list = []

for i, related_row in enumerate(related_df['related']):
    # The raw strings use single quotes as delimiters, with double quotes inside some values.
    # JSON requires double-quoted strings, so swap the two kinds of quote before parsing.
    related_row = related_row.replace("\"", "(temp_double_quotes)")
    related_row = related_row.replace("'", "\"")
    related_row = related_row.replace("(temp_double_quotes)", "'")
    
    # Convert each row into its own dictionary
    related_row_dict = json.loads(related_row)
    related_row_dict_list.append(related_row_dict)

    # Fill in title list
    related_dict['title'].append(related_df['title'].iloc[i])

    # Fill keys with all unique relations
    for relation in related_row_dict:
        if relation not in related_dict:
            related_dict[relation] = []

for related_row_dict in related_row_dict_list:
    for relation in related_dict:
        if relation=='title': 
            continue
        if relation in related_row_dict:
            related_dict[relation].append(len(related_row_dict[relation]))
        else:
            related_dict[relation].append(0)
related_df_separated = pd.DataFrame.from_dict(related_dict)
related_df_separated.head()

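|   | title | Adaptation | Sequel | Side story | Alternative version | Prequel | Summary | Other | Spin-off | Alternative setting | Character | Parent story | Full story |
|:-:|:------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | Inu x Boku SS | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Seto no Hanayome | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Shugo Chara!! Doki | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Princess Tutu | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Bakuman. 3rd Season | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |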


### Sum each column to find how many of each type of relation there are

In [168]:
# total number of each relation type across all titles
relation_counts = related_df_separated.drop(columns='title').sum().to_dict()
relation_counts

{'Adaptation': 4758,
 'Sequel': 2550,
 'Side story': 1700,
 'Alternative version': 1631,
 'Prequel': 2535,
 'Summary': 422,
 'Other': 2996,
 'Spin-off': 573,
 'Alternative setting': 715,
 'Character': 371,
 'Parent story': 1923,
 'Full story': 437}
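
The per-title relation counts and their totals can also be assembled more compactly by turning each cell into a small dictionary of counts and letting pandas align the columns. The sketch below runs on made-up rows; the helper `count_relations` is hypothetical, and the `"[]"` representation of "no relations" is an assumption about the raw data.

```python
import ast
import pandas as pd

# Made-up rows in the same shape as the `title` and `related` columns
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "related": [
        "{'Adaptation': [{'mal_id': 1}], 'Sequel': [{'mal_id': 2}, {'mal_id': 3}]}",
        "{'Prequel': [{'mal_id': 4}]}",
        "[]",  # assumed representation of "no relations"
    ],
})

def count_relations(cell):
    """Return {relation: number of related entries} for one cell (hypothetical helper)."""
    parsed = ast.literal_eval(cell)
    if not isinstance(parsed, dict):
        return {}
    return {relation: len(entries) for relation, entries in parsed.items()}

# One dict per row -> DataFrame; relations absent from a row become NaN, then 0
counts = pd.DataFrame(toy["related"].map(count_relations).tolist()).fillna(0).astype(int)
related_counts = pd.concat([toy[["title"]], counts], axis=1)

# Column sums give the total number of each relation type
relation_totals = counts.sum().to_dict()
print(related_counts)
print(relation_totals)
```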