<center>
    <h1 id='data-wrangling-and-feature-engineering' style='color:#7159c1'>🧼 Data Wrangling and Feature Engineering 🧼</h1>
    <i>Cleaning and applying transformations to the datasets</i>
</center>

> **Topics**

```
- 🧼 Data Wrangling;
- 🚿 Feature Engineering;
- 💾 Exporting Transformed Datasets.
```

In [1]:
# ---- Imports ----
import chardet                  # pip install chardet
import pandas as pd             # pip install pandas
import re                       # standard library, no installation needed

# ---- Settings ----
pd.set_option('display.max_columns', None)

# ---- Constants ----
DATASETS_PATH = ('./datasets')
CATEGORICAL_VARIABLES_REGEX_PATTERN = (r'[^A-Za-z0-9, -]')

# ---- Pertinent Variables ----
dropped_animes_ids = []
dropped_users_ids = []
dropped_scores_ids = []

# ---- Functions ----

# \ Description:
#    - replaces all characters that are not upper and lowercase letters, numbers, commas, spaces and hyphens by space;
#    - strips the parameter removing spaces from its beginning and ending;
#    - transforms the parameter to lowercase;
#    - splits the parameter removing all spaces;
#    - joins the parameter again, removing any extra space between each word;
#    - returns the result
#
# \ Parameters:
#    - categorical_variable: string
#
standardize_categorical_variables = lambda categorical_variable: ' '.join(re.sub(CATEGORICAL_VARIABLES_REGEX_PATTERN, ' ', categorical_variable).strip().lower().split())


# \ Description:
#    - strips the parameter removing spaces from its beginning and ending;
#    - transforms the parameter to lowercase;
#    - returns the result
#
# \ Parameters:
#    - categorical_variable: string
#
lower_case_categorical_variables = lambda categorical_variable: categorical_variable.strip().lower()

# \ Description:
#    - splits parameter by each comma-space;
#    - removes duplicated values converting the parameter into a set (the original order is not guaranteed);
#    - converts the set into a list;
#    - converts the list into a string joining with comma-space;
#    - returns the result 
#
# \ Parameters:
#    - categorical_variable: string
#
remove_duplicated_subvalues = lambda categorical_variable: ', '.join(list(set(categorical_variable.split(', '))))
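
To make the helpers above easier to follow, here is a minimal sketch that rewrites them as named functions and runs them on a few sample strings. The sample strings and the `remove_duplicated_subvalues_ordered` name are assumptions for illustration only; that variant swaps `set` for `dict.fromkeys` so the order of the subvalues is kept, which the original lambda does not guarantee.

```python
import re

CATEGORICAL_VARIABLES_REGEX_PATTERN = r'[^A-Za-z0-9, -]'


def standardize_categorical_variables(categorical_variable):
    # replace every character outside letters, digits, commas, spaces and hyphens by a space
    cleaned = re.sub(CATEGORICAL_VARIABLES_REGEX_PATTERN, ' ', categorical_variable)
    # strip, lower-case and collapse any run of spaces into a single one
    return ' '.join(cleaned.strip().lower().split())


def lower_case_categorical_variables(categorical_variable):
    # only strip and lower-case, keeping punctuation untouched
    return categorical_variable.strip().lower()


def remove_duplicated_subvalues_ordered(categorical_variable):
    # hypothetical variant: dict.fromkeys drops duplicates and keeps the first-occurrence order
    return ', '.join(dict.fromkeys(categorical_variable.split(', ')))


standardized = standardize_categorical_variables('  Cowboy Bebop: Tengoku no Tobira ')
lowered = lower_case_categorical_variables('  Action, Sci-Fi ')
deduplicated = remove_duplicated_subvalues_ordered('action, sci-fi, action')

print(standardized)   # cowboy bebop tengoku no tobira
print(lowered)        # action, sci-fi
print(deduplicated)   # action, sci-fi
```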

<h1 id='0-data-wrangling' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧼 | Data Wrangling</h1>

Before creating Recommender System Models, and even before exploring the datasets statistically, we need to clean up the variables by applying some methods, such as filtering features, dealing with missing values and standardizing texts. So, let's go!

<br />

<p style='color:#7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-size:150%'>0.1 | Anime Dataset</p>

> **Steps**

```
- Checking Charset Encoder;
- Changing Features Names;
- Filtering Not Available and Not Aired Animes;
- Filtering Features;
- Checking Data Types;
- Checking Score Range;
- Dealing with Missing Values;
- Dealing with UNKNOWN and No description available for this anime. Values;
- Cleaning Texts;
- Converting Texts to Lower Case;
- Dealing with Duplicated Texts.
```

---

**- Checking Charset Encoder**

First of all, we have to check out the dataset's charset encoding. Since `UTF-8` is the universal encoding format, we have two options here:

1 - `if the dataset charset is UTF-8 already, we can go straight to the next step`;

2 - `if the dataset charset is not UTF-8, we have to convert it to the desired encoding`.

In [2]:
# ---- Checking Charset Encoder ----
with open(f'{DATASETS_PATH}/anime-dataset-2023.csv', 'rb') as file:
    guessed_charset = chardet.detect(file.read(100000))

print(guessed_charset)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}


With `99.0%` of confidence, the dataset's encoding is `utf-8`!!
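
This dataset falls into option 1, but in case a file ever falls into option 2, here is a minimal sketch of the conversion. It assumes the source encoding is already known (for instance from `guessed_charset['encoding']`); the `convert_to_utf8` function and the sample file names are hypothetical and only use the standard library.

```python
import os
import tempfile


def convert_to_utf8(source_path, target_path, source_encoding):
    # read the text using the detected encoding...
    with open(source_path, 'r', encoding=source_encoding, newline='') as source_file:
        content = source_file.read()

    # ...and write it back as UTF-8 (newline='' keeps the line endings untouched)
    with open(target_path, 'w', encoding='utf-8', newline='') as target_file:
        target_file.write(content)


# ---- Usage on a throwaway Latin-1 file ----
with tempfile.TemporaryDirectory() as temporary_directory:
    latin1_path = os.path.join(temporary_directory, 'sample-latin1.csv')
    utf8_path = os.path.join(temporary_directory, 'sample-utf8.csv')

    with open(latin1_path, 'w', encoding='latin-1', newline='') as file:
        file.write('id,title\n1,Pokémon\n')

    convert_to_utf8(latin1_path, utf8_path, 'latin-1')

    with open(utf8_path, 'rb') as file:
        converted_bytes = file.read()

print(converted_bytes.decode('utf-8'))
```

Reading the whole file at once is fine for a sketch; for very large datasets it would probably be better to convert it in chunks.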

---

**- Changing Features Names**

In order to work better with the data, the feature names must be standardized. So let's do it following these rules:

1 - `the name must only contain lower case letters and underscores`;

2 - `the spaces between each word must be replaced by underscores`.

In [3]:
# ---- Reading Dataset ----
header = [
    'id', 'title', 'english_title', 'japanese_title', 'score',
    'genres', 'synopsis', 'type', 'episodes', 'aired',
    'premiered', 'status', 'producers', 'licensors', 'studios',
    'source', 'duration', 'rating', 'rank', 'popularity',
    'favorites', 'scored_by', 'members', 'image_url'
]

animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-dataset-2023.csv')

# ---- Renaming Header ----
#
# - rename method: maps old names to new ones (through a dict or a function), so only some columns can be renamed;
# - columns property: replaces all column names at once with a list of the same length;
# - set_axis method: replaces all labels at once and allows choosing whether they apply to columns or to the index.
#
# Since we are just replacing every column name by another string, we are using the 'columns property' way:
#
animes_df.columns = header
animes_df.head()

Unnamed: 0,id,title,english_title,japanese_title,score,genres,synopsis,type,episodes,aired,premiered,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members,image_url
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",spring 1998,Finished Airing,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",UNKNOWN,Finished Airing,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",spring 1998,Finished Airing,Victor Entertainment,"Funimation, Geneon Entertainment USA",Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",summer 2002,Finished Airing,"Bandai Visual, Dentsu, Victor Entertainment, T...","Funimation, Bandai Entertainment",Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",fall 2004,Finished Airing,"TV Tokyo, Dentsu",Illumitoon Entertainment,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...
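
The three renaming options mentioned in the cell above can be compared on a tiny dataframe. This is just a sketch: the original column names `anime_id` and `Name` are assumed for illustration.

```python
import pandas as pd

# toy dataframe with hypothetical original column names
sample_df = pd.DataFrame({'anime_id': [1, 6], 'Name': ['Cowboy Bebop', 'Trigun']})

# columns property: replaces every name at once, the list length must match
by_property = sample_df.copy()
by_property.columns = ['id', 'title']

# rename method: maps old names to new ones, unlisted columns stay as they are
by_rename = sample_df.rename(columns={'anime_id': 'id', 'Name': 'title'})

# set_axis method: replaces every label of the chosen axis and returns a new dataframe
by_set_axis = sample_df.set_axis(['id', 'title'], axis='columns')

print(list(by_property.columns))
print(list(by_rename.columns))
print(list(by_set_axis.columns))
```

All three lead to the same header here; `rename` is probably the safest choice when the column order of the file is not guaranteed, since it matches by name instead of by position.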


---

**- Filtering Not Available and Not Aired Animes**

So that the Recommender Models do not recommend animes that have not been released yet or are not available, let's drop all animes whose `Status` value is `Not yet aired` or whose `Aired` value is `Not available`.

In [4]:
# ---- Filtering Not Available and Not Aired Animes ----

# Info:
# 
# - index 0: original number of animes
# - index 1: current number of animes
# - index 2: number of dropped animes
# - index 3: dropped animes percentage
#
info = [0, 0, 0, 0]
info[0] = animes_df.shape[0]
dropped_animes_ids = animes_df.loc[(animes_df['status'] == 'Not yet aired') | (animes_df['aired'] == 'Not available')].id.to_list()
animes_df = animes_df.loc[~animes_df['id'].isin(dropped_animes_ids)]

info[1] = animes_df.shape[0]
info[2] = info[0] - info[1]
info[3] = round(info[2] * 100 / info[0], 4) # four decimal places

print(f'Original Number of Animes: {info[0]}')
print(f'Current Number of Animes: {info[1]}')
print(f'Number of Dropped Animes: {info[2]}')
print(f'Dropped Animes Percentage: {info[3]}%')

Original Number of Animes: 24905
Current Number of Animes: 23748
Number of Dropped Animes: 1157
Dropped Animes Percentage: 4.6457%
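
The filter above can be sketched on a toy dataframe to see what the boolean mask does. The rows below are made up for illustration; the column names and the two sentinel values come from the cell above.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'status': ['Finished Airing', 'Not yet aired', 'Currently Airing', 'Finished Airing'],
    'aired': ['Apr 3, 1998', 'Not available', 'Jan 5, 2023', 'Not available'],
})

# a row is dropped when at least one of the two conditions is true (| means "or")
drop_mask = (sample_df['status'] == 'Not yet aired') | (sample_df['aired'] == 'Not available')

dropped_ids = sample_df.loc[drop_mask, 'id'].to_list()
kept_df = sample_df.loc[~drop_mask]
dropped_percentage = round(len(dropped_ids) * 100 / len(sample_df), 4)

print(dropped_ids)          # [2, 4]
print(kept_df['id'].to_list())  # [1, 3]
print(dropped_percentage)   # 50.0
```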


---

**- Filtering Features**

Some variables will not be needed for our Statistical Exploratory Analysis and Recommender System Models, so we can just drop them.

Variables to Drop:

```
- English Title;
- Japanese Title;
- Aired;
- Premiered;
- Image URL.
```

In [5]:
# ---- Filtering Features ----
variables_to_drop = ['english_title', 'japanese_title', 'aired', 'premiered', 'image_url']
animes_df.drop(columns=variables_to_drop, inplace=True)
animes_df.head()

Unnamed: 0,id,title,score,genres,synopsis,type,episodes,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members
0,1,Cowboy Bebop,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,Finished Airing,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505
1,5,Cowboy Bebop: Tengoku no Tobira,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,Finished Airing,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978
2,6,Trigun,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,Finished Airing,Victor Entertainment,"Funimation, Geneon Entertainment USA",Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252
3,7,Witch Hunter Robin,7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,Finished Airing,"Bandai Visual, Dentsu, Victor Entertainment, T...","Funimation, Bandai Entertainment",Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931
4,8,Bouken Ou Beet,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,Finished Airing,"TV Tokyo, Dentsu",Illumitoon Entertainment,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001


---

**- Checking Data Types**

About the features, we have to make sure that each one has the desired data type:

```
- Object/String: title, genres, synopsis, type, status, producers, licensors, studios, source, duration, rating;
- Integer: id, episodes, rank, popularity, scored_by, members;
- Float: score.
```

In [6]:
# ---- Checking Data Types ----
animes_df.dtypes

id             int64
title         object
score         object
genres        object
synopsis      object
type          object
episodes      object
status        object
producers     object
licensors     object
studios       object
source        object
duration      object
rating        object
rank          object
popularity     int64
favorites      int64
scored_by     object
members        int64
dtype: object

So here are the variables we have to transform:

```
- Score: from object to float; replace 'UNKNOWN' by -1; UNKNOWN values tell that the anime did not reach the cut-off number of users scores;

- Episodes: from object to integer; replace 'UNKNOWN' by -1; UNKNOWN values tell that the anime is being released or will be released;

- Rank: from object to integer; replace 'UNKNOWN' by -1; UNKNOWN values tell that the anime is not released or is a hentai (+18);

- Scored By: from object to integer; replace 'UNKNOWN' by -1; UNKNOWN values tell that the anime did not reach the cut-off number of users scores.
```

In [7]:
# ---- Replacing UNKNOWN by -1 ----
variables_to_convert = ['score', 'episodes', 'rank', 'scored_by']
animes_df[variables_to_convert] = animes_df[variables_to_convert].replace('UNKNOWN', '-1')

# ---- Converting Data Types ----
animes_df[variables_to_convert[0]] = animes_df[variables_to_convert[0]].astype('float')
animes_df[variables_to_convert[1:]] = animes_df[variables_to_convert[1:]].astype('float').astype('int')

# ---- Checking Data Types ----
animes_df.dtypes

id              int64
title          object
score         float64
genres         object
synopsis       object
type           object
episodes        int32
status         object
producers      object
licensors      object
studios        object
source         object
duration       object
rating         object
rank            int32
popularity      int64
favorites       int64
scored_by       int32
members         int64
dtype: object
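
The integer variables above go through `float` before reaching `int`. A small sketch of why, assuming the raw columns hold strings such as `'26.0'` (as the dataframe preview suggests); the sample values are made up.

```python
import pandas as pd

episodes = pd.Series(['26.0', '1.0', 'UNKNOWN'])
replaced = episodes.replace('UNKNOWN', '-1')

# a string like '26.0' is not a valid integer literal, so the direct cast is expected to fail
try:
    replaced.astype('int')
    direct_cast_failed = False
except (ValueError, TypeError):
    direct_cast_failed = True

# going through float first parses '26.0' as 26.0 and then truncates it to 26
converted = replaced.astype('float').astype('int')

print(direct_cast_failed)
print(converted.to_list())
```

Note that the exact integer width (`int32` or `int64`) depends on the platform, which is probably why the output above shows `int32`.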

---

**- Checking Score Range**

Anime scores on MyAnimeList follow a scale from 1 to 10, except for the animes that did not reach the cut-off number of users scores, which got -1 as value. So, we have to check whether the values in the dataset follow this condition!!

In [8]:
# ---- Checking Score Range ----
#
# - must follow the rule: Scale from 1 to 10;
# - animes that did not reach the cut-off number of users scores have -1 as score value.
#
print(f'Minimum Score: {animes_df["score"].loc[animes_df["score"] != -1].min()}')
print(f'Maximum Score: {animes_df["score"].loc[animes_df["score"] != -1].max()}')

Minimum Score: 1.85
Maximum Score: 9.1


Hooray!! The score variable follows the expected range 🥳

---

**- Dealing with Missing Values**

The next step is to check whether the variables contain any missing values and deal with them properly. First, let's confirm whether there are any missing values and, depending on which variables are affected and how many values are missing, we choose the most convenient strategy.

In [9]:
# ---- Missing Values ----
animes_df.isnull().sum()

id            0
title         0
score         0
genres        0
synopsis      0
type          0
episodes      0
status        0
producers     0
licensors     0
studios       0
source        0
duration      0
rating        0
rank          0
popularity    0
favorites     0
scored_by     0
members       0
dtype: int64

Ok, there are no missing values 😐😐

---

**- Dealing with `UNKNOWN` and `No description available for this anime.` Values**

Even though there are no missing values, we have stumbled upon some features that use `UNKNOWN` values to represent the absence of data. Also, a tiny spoiler for you: the synopsis variable contains `No description available for this anime.` instead of `UNKNOWN`.

So, let's first get all variables that contain these values and then deal with them.

In [10]:
# ---- Unknown Values ----
#
# - counting unknown per variable
#
unknown_counts_per_variable = [(variable, (animes_df[variable] == 'UNKNOWN').sum()) for variable in animes_df.columns]
for variable, unknown_count in unknown_counts_per_variable: print(f'{variable} - {unknown_count}')

id - 0
title - 0
score - 0
genres - 4505
synopsis - 0
type - 4
episodes - 0
status - 0
producers - 12442
licensors - 19020
studios - 9671
source - 0
duration - 0
rating - 407
rank - 0
popularity - 0
favorites - 0
scored_by - 0
members - 0
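
The same count can also be done without a Python loop by comparing the whole dataframe at once. Here is a minimal sketch on a toy dataframe (the rows are made up); it also assigns the result of `replace` back instead of modifying a column selection in place, which avoids relying on chained in-place changes.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'id': [1, 2, 3],
    'genres': ['Action', 'UNKNOWN', 'UNKNOWN'],
    'rating': ['UNKNOWN', 'G - All Ages', 'PG - Children'],
})

# element-wise comparison gives a boolean dataframe, summing it counts the True values per column
unknown_counts = (sample_df == 'UNKNOWN').sum()
variables_with_unknown = unknown_counts[unknown_counts > 0].index.to_list()

# replace returns a new dataframe, so we keep the result explicitly
cleaned_df = sample_df.replace('UNKNOWN', '-')

print(unknown_counts.to_dict())
print(variables_with_unknown)
print((cleaned_df == 'UNKNOWN').sum().sum())
```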


In [11]:
# ---- Unknown Values ----
#
# - replacing unknown values by hyphens
#
for variable, unknown_count in unknown_counts_per_variable:
    if (unknown_count > 0): animes_df[variable] = animes_df[variable].replace('UNKNOWN', '-')
    print(f'{variable} - {len(animes_df[variable].loc[animes_df[variable] == "UNKNOWN"])}')

id - 0
title - 0
score - 0
genres - 0
synopsis - 0
type - 0
episodes - 0
status - 0
producers - 0
licensors - 0
studios - 0
source - 0
duration - 0
rating - 0
rank - 0
popularity - 0
favorites - 0
scored_by - 0
members - 0


In [12]:
# ---- Not Available Synopsis ----
#
# - counting not available synopsis
#
not_available_synopsis_count = len(animes_df['synopsis'].loc[animes_df['synopsis'] == 'No description available for this anime.'])
print(f'Not Available Synopsis: {not_available_synopsis_count}')

Not Available Synopsis: 3856


In [13]:
# ---- Not Available Synopsis ----
#
# - replacing by hyphens
#
animes_df['synopsis'] = animes_df['synopsis'].replace('No description available for this anime.', '-')
not_available_synopsis_count = len(animes_df['synopsis'].loc[animes_df['synopsis'] == 'No description available for this anime.'])
print(f'Not Available Synopsis: {not_available_synopsis_count}')

Not Available Synopsis: 0


---

**- Cleaning and Lower Casing Texts**

Now that all data are in their respective types and without any missing values, we can standardize the texts. Let's set our rules:

1 - `only letters (upper and lower case), numbers, commas, spaces and hyphens are accepted`;

2 - `all other characters must be replaced by spaces`;

3 - `all spaces in the beginning and ending of the text must be removed`;

4 - `all letters must be converted into lower case`.

These rules must be applied to all categorical variables, except for some that will skip the first rule (and therefore keep their punctuation).

```
- Variables That Must Follow All Rules: title, type, status and source;
- Variables That Will Skip Rule 1: genres, synopsis, producers, licensors and studios.
```

In [14]:
# ---- Variables That Must Follow All Rules ----
#
# - apply method: transforms a whole single row or a whole single column;
# - map method: transforms a series element-wise;
# - applymap method: transforms a dataframe element-wise.
#
# Since we are working with dataframes and we must transform multiple columns, we are going to
# use 'applymap' method. There would be no problem in using 'apply method' here, but it is costlier.
#
variables_to_follow_all_rules = ['title', 'type', 'status', 'source']
animes_df[variables_to_follow_all_rules] = animes_df[variables_to_follow_all_rules].applymap(standardize_categorical_variables)
animes_df[variables_to_follow_all_rules].head()

Unnamed: 0,title,type,status,source
0,cowboy bebop,tv,finished airing,original
1,cowboy bebop tengoku no tobira,movie,finished airing,original
2,trigun,tv,finished airing,manga
3,witch hunter robin,tv,finished airing,original
4,bouken ou beet,tv,finished airing,manga


In [15]:
# ---- Variables That Will Skip Rule 1 ----
variables_to_skip_first_rule = ['genres', 'synopsis', 'producers', 'licensors', 'studios']
animes_df[variables_to_skip_first_rule] = animes_df[variables_to_skip_first_rule].applymap(lower_case_categorical_variables)
animes_df[variables_to_skip_first_rule].head()

Unnamed: 0,genres,synopsis,producers,licensors,studios
0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",bandai visual,"funimation, bandai entertainment",sunrise
1,"action, sci-fi","another day, another bounty—such is the life o...","sunrise, bandai visual",sony pictures entertainment,bones
2,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",victor entertainment,"funimation, geneon entertainment usa",madhouse
3,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,"bandai visual, dentsu, victor entertainment, t...","funimation, bandai entertainment",sunrise
4,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,"tv tokyo, dentsu",illumitoon entertainment,toei animation
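
One caveat about `applymap`: pandas 2.1 deprecated it in favor of `DataFrame.map`, which does the same element-wise transformation. A small sketch of a helper that works with both; the `apply_elementwise` name and the sample rows are hypothetical.

```python
import pandas as pd


def apply_elementwise(frame, function):
    # DataFrame.map exists from pandas 2.1 on; older versions only have applymap
    if hasattr(frame, 'map'):
        return frame.map(function)
    return frame.applymap(function)


sample_df = pd.DataFrame({'genres': [' Action, Sci-Fi '], 'studios': ['Sunrise']})
lowered_df = apply_elementwise(sample_df, lambda value: value.strip().lower())

print(lowered_df)
```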


---

**- Dealing with Duplicated Texts**

In [16]:
# ---- Duplicated Titles ----
#
# - keep: determines which duplicates to mark:
#     \ first: marks duplicates as True except for the first occurrence;
#     \ last: marks duplicates as True except for the last occurrence;
#     \ False: marks all duplicates as True.
#
animes_df['title'].duplicated(keep=False).sum()

126

In [17]:
duplicated_titles_df = animes_df.copy() \
     .loc[animes_df['title'] \
     .duplicated(keep=False)] \
     .sort_values(by=['title'], ascending=True)

duplicated_titles_df

Unnamed: 0,id,title,score,genres,synopsis,type,episodes,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members
15695,39783,5-toubun no hanayome,8.08,"comedy, romance",through their tutor fuutarou uesugi's diligent...,tv,12,finished airing,"pony canyon, kodansha, dax production, bs11, n...",-,bibury animation studios,manga,24 min per ep,PG-13 - Teens 13 or older,488,285,11271,398456,665156
14630,38101,5-toubun no hanayome,7.66,"comedy, romance","fuutarou uesugi is an ace high school student,...",tv,12,finished airing,"pony canyon, kodansha, zero-a, gyao!",funimation,tezuka productions,manga,24 min per ep,PG-13 - Teens 13 or older,1238,171,13209,576270,890586
24840,55658,awakening,-1.00,-,music video for the song awakening by ken ishii.,music,1,finished airing,-,-,-,original,4 min,G - All Ages,-1,24580,0,-1,35
24137,54618,awakening,-1.00,-,-,music,1,finished airing,-,-,-,mixed media,3 min,-,0,0,0,-1,0
24586,55351,azur lane,-1.00,"action, slice of life",assorted commercials for the azur lane mobile ...,special,-1,currently airing,-,-,-,game,30 sec,PG-13 - Teens 13 or older,0,0,0,-1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20876,48365,youkai watch,6.21,"comedy, supernatural",the new show will feature unique and returning...,tv,98,finished airing,-,-,olm,game,24 min per ep,G - All Ages,7850,10397,29,552,2191
6219,10495,yuru yuri,7.57,"award winning, comedy, girls love",after a year in grade school without her child...,tv,12,finished airing,"tv tokyo, daume, sotsu, pony canyon, dax produ...","nis america, inc.",doga kobo,manga,23 min per ep,PG-13 - Teens 13 or older,1469,660,4426,148777,329593
6727,12403,yuru yuri,7.82,"comedy, girls love",the girls of the amusement club return in yuru...,tv,12,finished airing,"sotsu, pony canyon, dax production","nis america, inc.",doga kobo,manga,23 min per ep,PG-13 - Teens 13 or older,876,1222,940,99916,180613
10656,30902,yuru yuri nachuyachumi,7.77,"comedy, girls love",extra episodes of yuru yuri nachuyachumi!\n\ns...,special,2,finished airing,-,-,tyo animations,manga,24 min per ep,PG-13 - Teens 13 or older,991,2804,85,28207,53558


After analysing each duplicated anime in the dataset and searching for them on MyAnimeList, I figured out that they are not duplicates!! 😮

In reality, they are sequels, prequels, specials and OVAs of the same anime, so there is no problem in keeping them in the main dataset.
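
To see how the `keep` parameter changes what `duplicated` marks, here is a short sketch on a made-up series of titles.

```python
import pandas as pd

titles = pd.Series(['yuru yuri', 'trigun', 'yuru yuri', 'awakening', 'awakening'])

# 'first': every occurrence after the first one is marked
marked_except_first = titles.duplicated(keep='first').to_list()

# 'last': every occurrence before the last one is marked
marked_except_last = titles.duplicated(keep='last').to_list()

# False: every occurrence of a repeated value is marked
marked_all = titles.duplicated(keep=False).to_list()

print(marked_except_first)  # [False, False, True, False, True]
print(marked_except_last)   # [True, False, False, True, False]
print(marked_all)           # [True, False, True, True, False] is NOT the result, see below
print(sum(marked_all))      # 4
```

With `keep=False` the result is `[True, False, True, True, True]`, so summing it counts every row involved in a duplication, which is how the 126 above should be read: 126 rows share their title with at least one other row.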

<p style='color:#7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-size:150%'>0.2 | Users Details Dataset</p>

> **Steps**

```
- Checking Charset Encoder;
- Changing Features Names;
- Filtering Features;
- Filtering Closed Accounts;
- Checking Data Types;
- Checking Mean Score Range;
- Dealing with Missing Values;
- Cleaning Texts;
- Converting Texts to Lower Case;
- Dealing with Duplicated Texts.
```

---

**- Checking Charset Encoder**

In [18]:
# ---- Checking Charset Encoder ----
with open(f'{DATASETS_PATH}/users-details-2023.csv', 'rb') as file:
    guessed_charset = chardet.detect(file.read(100000))

print(guessed_charset)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}


---

**- Changing Features Names**

In [19]:
# ---- Reading Dataset ----
header = [
	'id', 'name', 'gender', 'birthday', 'location'
	, 'joined', 'days_watched', 'mean_score', 'watching'
	, 'completed', 'on_hold', 'dropped', 'plan_to_watch'
	, 'total_entries', 'rewatched', 'episodes_watched'
]

users_details_df = pd.read_csv(f'{DATASETS_PATH}/users-details-2023.csv')

# ---- Renaming Header ----
#
# - rename method: maps old names to new ones (through a dict or a function), so only some columns can be renamed;
# - columns property: replaces all column names at once with a list of the same length;
# - set_axis method: replaces all labels at once and allows choosing whether they apply to columns or to the index.
#
# Since we are just replacing every column name by another string, we are using the 'columns property' way:
#
users_details_df.columns = header
users_details_df.head()

Unnamed: 0,id,name,gender,birthday,location,joined,days_watched,mean_score,watching,completed,on_hold,dropped,plan_to_watch,total_entries,rewatched,episodes_watched
0,1,Xinil,Male,1985-03-04T00:00:00+00:00,California,2004-11-05T00:00:00+00:00,142.3,7.37,1.0,233.0,8.0,93.0,64.0,399.0,60.0,8458.0
1,3,Aokaado,Male,,"Oslo, Norway",2004-11-11T00:00:00+00:00,68.6,7.34,23.0,137.0,99.0,44.0,40.0,343.0,15.0,4072.0
2,4,Crystal,Female,,"Melbourne, Australia",2004-11-13T00:00:00+00:00,212.8,6.68,16.0,636.0,303.0,0.0,45.0,1000.0,10.0,12781.0
3,9,Arcane,,,,2004-12-05T00:00:00+00:00,30.0,7.71,5.0,54.0,4.0,3.0,0.0,66.0,0.0,1817.0
4,18,Mad,,,,2005-01-03T00:00:00+00:00,52.0,6.27,1.0,114.0,10.0,5.0,23.0,153.0,42.0,3038.0


---

**- Filtering Features**

Variables to Drop:

```
- Birthday;
- Location.
```

In [20]:
# ---- Filtering Features ----
variables_to_drop = ['birthday', 'location']
users_details_df.drop(columns=variables_to_drop, inplace=True)
users_details_df.head()

Unnamed: 0,id,name,gender,joined,days_watched,mean_score,watching,completed,on_hold,dropped,plan_to_watch,total_entries,rewatched,episodes_watched
0,1,Xinil,Male,2004-11-05T00:00:00+00:00,142.3,7.37,1.0,233.0,8.0,93.0,64.0,399.0,60.0,8458.0
1,3,Aokaado,Male,2004-11-11T00:00:00+00:00,68.6,7.34,23.0,137.0,99.0,44.0,40.0,343.0,15.0,4072.0
2,4,Crystal,Female,2004-11-13T00:00:00+00:00,212.8,6.68,16.0,636.0,303.0,0.0,45.0,1000.0,10.0,12781.0
3,9,Arcane,,2004-12-05T00:00:00+00:00,30.0,7.71,5.0,54.0,4.0,3.0,0.0,66.0,0.0,1817.0
4,18,Mad,,2005-01-03T00:00:00+00:00,52.0,6.27,1.0,114.0,10.0,5.0,23.0,153.0,42.0,3038.0


---

**- Filtering Closed Accounts**

Closed accounts on MyAnimeList are those that have been deactivated by their users, which turns the variables `Days Watched, Mean Score, Watching, Completed, On Hold, Dropped, Plan to Watch, Total Entries, Rewatched and Episodes Watched` into NaN.

Since these accounts do not contain statistical values, we have to drop them so that they do not influence the Collaborative Filtering Models.

In [21]:
# ---- Filtering Closed Accounts ----

# Info:
#
# - index 0: original number of users
# - index 1: current number of users
# - index 2: number of dropped users
# - index 3: dropped users percentage
#
info = [0, 0, 0, 0]
info[0] = users_details_df.shape[0]
dropped_users_ids = users_details_df.loc[users_details_df['watching'].isnull()].id.to_list()
users_details_df = users_details_df.loc[~users_details_df['id'].isin(dropped_users_ids)]

info[1] = users_details_df.shape[0]
info[2] = info[0] - info[1]
info[3] = round(info[2] * 100 / info[0], 4) # four decimal places

print(f'Original Number of Users: {info[0]}')
print(f'Current Number of Users: {info[1]}')
print(f'Number of Dropped Users: {info[2]}')
print(f'Dropped Users Percentage: {info[3]}%')

Original Number of Users: 731290
Current Number of Users: 731282
Number of Dropped Users: 8
Dropped Users Percentage: 0.0011%
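
A minimal sketch of the same idea on a toy dataframe, assuming (as the cell above does) that a NaN in `watching` is enough to identify a closed account. It also shows `dropna` with `subset` as an alternative way to remove those rows; the sample rows are made up.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'id': [1, 3, 4],
    'watching': [1.0, None, 16.0],
    'completed': [233.0, None, 636.0],
})

# keep the ids of the closed accounts so related observations can be dropped later
closed_accounts_mask = sample_df['watching'].isnull()
dropped_ids = sample_df.loc[closed_accounts_mask, 'id'].to_list()

# dropna with subset only looks at the listed columns when deciding which rows to drop
active_df = sample_df.dropna(subset=['watching'])

print(dropped_ids)
print(active_df['id'].to_list())
```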


---

**- Checking Data Types**

```
- Object/String: name, gender, joined;
- Integer: id, watching, completed, on_hold, dropped, plan_to_watch, total_entries, rewatched and episodes_watched;
- Float: days_watched, mean_score.
```

In [22]:
# ---- Checking Data Types ----
users_details_df.dtypes

id                    int64
name                 object
gender               object
joined               object
days_watched        float64
mean_score          float64
watching            float64
completed           float64
on_hold             float64
dropped             float64
plan_to_watch       float64
total_entries       float64
rewatched           float64
episodes_watched    float64
dtype: object

In [23]:
# ---- Converting Data Types ----
#
# - Convert from Float to Integer: watching, completed, on_hold, dropped, plan_to_watch,
# total_entries, rewatched and episodes_watched.
#
variables_to_convert = [
    'watching', 'completed', 'on_hold', 'dropped', 'plan_to_watch'
    , 'total_entries', 'rewatched', 'episodes_watched'
]

users_details_df[variables_to_convert] = users_details_df[variables_to_convert].astype('int')
    
# ---- Checking Data Type ----
users_details_df.dtypes

id                    int64
name                 object
gender               object
joined               object
days_watched        float64
mean_score          float64
watching              int32
completed             int32
on_hold               int32
dropped               int32
plan_to_watch         int32
total_entries         int32
rewatched             int32
episodes_watched      int32
dtype: object

---

**- Checking Mean Score Range**

In [24]:
# ---- Checking Mean Score Range ----
#
# - must follow the rule: Scale from 0 to 10
#
print(f'Minimum Score: {users_details_df["mean_score"].min()}')
print(f'Maximum Score: {users_details_df["mean_score"].max()}')

Minimum Score: 0.0
Maximum Score: 255.0


Wow, it seems that mean score does not follow the rule from 0 to 10. Let's check out all users with a mean score greater than 10.

In [25]:
# ---- Checking Mean Score Range ----
users_details_df.loc[users_details_df['mean_score'] > 10]

Unnamed: 0,id,name,gender,joined,days_watched,mean_score,watching,completed,on_hold,dropped,plan_to_watch,total_entries,rewatched,episodes_watched
699053,1241633,naneninonu,,2012-04-09T00:00:00+00:00,0.4,255.0,1,0,0,0,0,1,0,26


Looking at the user profile, we can confirm that the mean score of 255.0 is real!! The user found a way to break the system 😮

You can check out the profile here: [naneninonu](https://myanimelist.net/profile/naneninonu)

Since only one user is out of the range, let's change their mean score to 10.0.

In [26]:
# ---- Checking Mean Score Range ----
users_details_df.loc[users_details_df['mean_score'] > 10, 'mean_score'] = 10.0
print(f'Minimum Score: {users_details_df["mean_score"].min()}')
print(f'Maximum Score: {users_details_df["mean_score"].max()}')

Minimum Score: 0.0
Maximum Score: 10.0
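
If more out-of-range values ever show up, the same cap could be applied with `clip` instead of a boolean mask. A short sketch on made-up scores:

```python
import pandas as pd

mean_scores = pd.Series([0.0, 7.37, 255.0])

# clip caps every value above the upper bound and leaves the others untouched
capped = mean_scores.clip(upper=10.0)

print(capped.to_list())  # [0.0, 7.37, 10.0]
```

Capping at 10.0 is a design choice: it keeps the user in the dataset, but the capped value is no longer their real mean score.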


---

**- Dealing with Missing Values**

In [27]:
# ---- Dealing with Missing Values ----
users_details_df.isnull().sum()

id                       0
name                     0
gender              506907
joined                   0
days_watched             0
mean_score               0
watching                 0
completed                0
on_hold                  0
dropped                  0
plan_to_watch            0
total_entries            0
rewatched                0
episodes_watched         0
dtype: int64

In [28]:
# ---- Dealing with Missing Values: Gender ----
#
# - replacing them by hyphen
#
users_details_df.loc[users_details_df['gender'].isnull(), 'gender'] = '-'

# ---- Dealing with Missing Values: Name ----
#
# - the user with Null value on name is called "None";
# - recent pandas versions treat the string "None" as a missing value marker when reading CSV files, so the name is read as NaN/Null;
# - so let's replace this value by the string "None";
# - this is None's MyAnimeList Profile, thanks None :) https://myanimelist.net/profile/None
#
users_details_df.loc[users_details_df['name'].isnull(), 'name'] = 'None'

# ---- Dealing with Missing Values: Count ----
users_details_df.isnull().sum()

id                  0
name                0
gender              0
joined              0
days_watched        0
mean_score          0
watching            0
completed           0
on_hold             0
dropped             0
plan_to_watch       0
total_entries       0
rewatched           0
episodes_watched    0
dtype: int64
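
An alternative to patching the name afterwards would be to tell `read_csv` which strings count as missing. This is only a sketch under assumptions: the CSV content below is made up, and whether `"None"` is read as missing by default depends on the pandas version.

```python
import io

import pandas as pd

csv_text = 'id,name,gender\n1,Xinil,Male\n2,None,\n'

# keep_default_na=False disables the built-in list of missing value markers,
# na_values=[''] keeps only the empty field as a missing value
users_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

print(users_df['name'].to_list())
print(users_df.isnull().sum().to_dict())
```

This way the user called None keeps their name, while the empty gender is still reported as missing.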

---

**- Cleaning and Lower Casing Texts**

In [29]:
# ---- Cleaning and Lower Casing Texts ----
#
# - The variables will skip rule 1;
#
# - apply method: transforms a whole single row or a whole single column;
# - map method: transforms a series element-wise;
# - applymap method: transforms a dataframe element-wise.
#
# Since we are working with dataframes and we must transform multiple columns, we are going to
# use 'applymap' method. There would be no problem in using 'apply method' here, but it is costlier.
variables_to_skip_first_rule = ['name', 'gender', 'joined']
users_details_df[variables_to_skip_first_rule] = users_details_df[variables_to_skip_first_rule].applymap(lower_case_categorical_variables)
users_details_df[variables_to_skip_first_rule].head()

Unnamed: 0,name,gender,joined
0,xinil,male,2004-11-05t00:00:00+00:00
1,aokaado,male,2004-11-11t00:00:00+00:00
2,crystal,female,2004-11-13t00:00:00+00:00
3,arcane,-,2004-12-05t00:00:00+00:00
4,mad,-,2005-01-03t00:00:00+00:00
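
To make the difference between `apply`, `map` and the element-wise dataframe method concrete, here is a small sketch on made-up rows. It assumes that `DataFrame.map` exists on recent pandas versions and falls back to `applymap` otherwise. The helper name `lower_case` is just a local stand-in for `lower_case_categorical_variables`.

```python
import pandas as pd

# Made-up rows with stray spaces and mixed case.
sample_df = pd.DataFrame({
    'name': ['  Xinil', 'Aokaado '],
    'gender': ['Male', ' Female'],
})

# Local stand-in for lower_case_categorical_variables.
lower_case = lambda value: value.strip().lower()

# Element-wise over the whole dataframe: DataFrame.map when available, else applymap.
elementwise = sample_df.map if hasattr(pd.DataFrame, 'map') else sample_df.applymap
by_element = elementwise(lower_case)

# Column-wise alternative: apply receives one Series per column,
# so the vectorized string methods can be used inside it.
by_column = sample_df.apply(lambda column: column.str.strip().str.lower())

# Series.map works element-wise on a single column.
name_only = sample_df['name'].map(lower_case)

print(by_element)
```

All three routes give the same cleaned values here. Which one is faster depends on the data size, so it is worth timing them on the real dataset before choosing.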


---

**- Dealing with Duplicated Texts**

In [30]:
# ---- Duplicated Names ----
#
# - keep: determines which duplicates to mark:
#     \ first: marks duplicates as True except for the first occurrence;
#     \ last: marks duplicates as True except for the last occurrence;
#     \ False: marks all duplicates as True.
#
users_details_df['name'].duplicated(keep=False).sum()

0
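
The three `keep` options are easier to compare side by side. The sketch below uses a made-up series of names with one repeated value.

```python
import pandas as pd

# Made-up names: 'xinil' appears twice (positions 0 and 2).
names = pd.Series(['xinil', 'crystal', 'xinil', 'arcane'])

keep_first = names.duplicated(keep='first').tolist()  # first occurrence is not marked
keep_last = names.duplicated(keep='last').tolist()    # last occurrence is not marked
keep_none = names.duplicated(keep=False).tolist()     # every occurrence is marked

print(keep_first)  # [False, False, True, False]
print(keep_last)   # [True, False, False, False]
print(keep_none)   # [True, False, True, False]
```

Using `keep=False` before `sum()` counts every row involved in a duplication, which is why it is the option used for the check above.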

<p style='color:#7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-size:150%'>0.3 | Users Scores Dataset</p>

> **Steps**

```
- Checking Charset Encoder;
- Changing Features Names;
- Dropping Observations with Dropped, Unavailable Animes and Users;
- Checking Data Types;
- Checking Scores Range;
- Dealing with Missing Values;
- Cleaning Texts;
- Converting Texts to Lower Case;
- Dealing with Duplicated Pairs (user_id and anime_id).
```

---

**- Checking Charset Encoder**

In [31]:
# ---- Checking Charset Encoder ----
with open(f'{DATASETS_PATH}/users-score-2023.csv', 'rb') as file:
    guessed_charset = chardet.detect(file.read(100000))
    
print(guessed_charset)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
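
Since the detector only returns a guess with a confidence level, a cheap sanity check is to decode the same sample strictly with the guessed encoding. The sketch below uses only the standard library. The function name `sample_decodes_as` and the temporary files are hypothetical and are not part of this notebook.

```python
import codecs
import os
import tempfile

def sample_decodes_as(path, encoding, sample_size=100_000):
    """Return True when the first sample_size bytes decode strictly as encoding."""
    with open(path, 'rb') as file:
        sample = file.read(sample_size)
    # An incremental decoder tolerates a multi-byte character cut at the sample's end.
    decoder = codecs.getincrementaldecoder(encoding)(errors='strict')
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True

def write_temp_file(payload):
    """Write bytes to a temporary file and return its path (demo helper)."""
    with tempfile.NamedTemporaryFile(delete=False) as file:
        file.write(payload)
    return file.name

utf8_path = write_temp_file('Aa! ああっ女神さまっ'.encode('utf-8'))
latin1_path = write_temp_file('café au lait'.encode('latin-1'))

# sample_size=5 cuts the first Japanese character in half on purpose.
utf8_ok = sample_decodes_as(utf8_path, 'utf-8', sample_size=5)
latin1_ok = sample_decodes_as(latin1_path, 'utf-8')

os.remove(utf8_path)
os.remove(latin1_path)

print(utf8_ok, latin1_ok)  # True False
```

If the strict decode fails, the guess should not be trusted for `pd.read_csv`.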


---

**- Changing Features Names**

In [32]:
# ---- Reading Dataset ----
header = ['user_id', 'username', 'anime_id', 'anime_title', 'rating']
users_scores_df = pd.read_csv(f'{DATASETS_PATH}/users-score-2023.csv')

# ---- Renaming Header ----
#
# - rename method: allows more sophisticated name transformations (a mapping or a function) and
# can rename only some of the columns;
# - columns property: only allows replacing the names with other strings;
# - set_axis method: allows replacing all labels at once and choosing if the change will be
# applied to columns or indexes.
#
# Since we are just replacing each current column name with another string, we are using the 'columns property' way:
#
users_scores_df.columns = header
users_scores_df.head()

Unnamed: 0,user_id,username,anime_id,anime_title,rating
0,1,Xinil,21,One Piece,9
1,1,Xinil,48,.hack//Sign,7
2,1,Xinil,320,A Kite,5
3,1,Xinil,49,Aa! Megami-sama!,8
4,1,Xinil,304,Aa! Megami-sama! Movie,8
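
The three renaming options mentioned in the cell above can be compared on a one-row sketch. The original column names used here (`Username`, `Anime Title`) are made up for the illustration.

```python
import pandas as pd

header = ['user_id', 'username', 'anime_id', 'anime_title', 'rating']

# One made-up row with hypothetical original column names.
raw_df = pd.DataFrame(
    [[1, 'Xinil', 21, 'One Piece', 9]],
    columns=['user_id', 'Username', 'anime_id', 'Anime Title', 'rating'],
)

# 1) columns property: replaces every name, in place, by position.
via_columns = raw_df.copy()
via_columns.columns = header

# 2) rename: maps only the names that change and returns a new dataframe.
via_rename = raw_df.rename(columns={'Username': 'username', 'Anime Title': 'anime_title'})

# 3) set_axis: replaces all labels of the chosen axis and returns a new dataframe.
via_set_axis = raw_df.set_axis(header, axis=1)

print(via_columns.columns.tolist())
```

All three end up with the same header. `rename` is the safest one when the column order of the file is not guaranteed, because it matches by name instead of by position.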


---

**- Dropping Observations with Dropped, Unavailable Animes and Users**

The scores file still contains ratings from deleted user accounts. Since `users-details-2023.csv` does not contain deleted accounts, we will drop those ratings. We will also drop the ratings given to deleted animes.

In [33]:
# ---- Dropping Observations with Dropped Animes and Dropped Users ----

# Info:
#
# - index 0: original number of ratings;
# - index 1: current number of ratings;
# - index 2: number of dropped ratings;
# - index 3: dropped ratings percentage
#
info = [0, 0, 0, 0]
info[0] = users_scores_df.shape[0]

users_ids_list = users_details_df['id'].to_list()
animes_ids_list = animes_df['id'].to_list()

dropped_scores_ids = users_scores_df.loc[
    (users_scores_df['anime_id'].isin(dropped_animes_ids))
    | (users_scores_df['user_id'].isin(dropped_users_ids))
    | ~(users_scores_df['user_id'].isin(users_ids_list))
    | ~(users_scores_df['anime_id'].isin(animes_ids_list))
].index.to_list()
users_scores_df = users_scores_df.loc[
    ~(users_scores_df['anime_id'].isin(dropped_animes_ids))
    & ~(users_scores_df['user_id'].isin(dropped_users_ids))
    & (users_scores_df['user_id'].isin(users_ids_list))
    & (users_scores_df['anime_id'].isin(animes_ids_list))
]

info[1] = users_scores_df.shape[0]
info[2] = info[0] - info[1]
info[3] = round(info[2] * 100 / info[0], 4) # rounded to four decimal places

print(f'Original Number of Scores: {info[0]}')
print(f'Current Number of Scores: {info[1]}')
print(f'Number of Dropped Scores: {info[2]}')
print(f'Dropped Scores Percentage: {info[3]}%')

Original Number of Scores: 24325191
Current Number of Scores: 23796586
Number of Dropped Scores: 528605
Dropped Scores Percentage: 2.1731%
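
The filter above writes the same four conditions twice, once negated. A possible simplification is to build a single "keep" mask and reuse its negation, as in the sketch below. The ids and ratings are made up, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up ratings and reference lists.
scores_df = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'anime_id': [21, 48, 21, 99],
    'rating':   [9, 7, 8, 6],
})
valid_user_ids = [1, 2]      # ids present in the users dataset
valid_anime_ids = [21, 48]   # ids present in the animes dataset
dropped_user_ids = []        # ids dropped in earlier steps
dropped_anime_ids = [48]

# A rating is kept only when its user and its anime are both known and not dropped.
keep_mask = (
    scores_df['user_id'].isin(valid_user_ids)
    & scores_df['anime_id'].isin(valid_anime_ids)
    & ~scores_df['user_id'].isin(dropped_user_ids)
    & ~scores_df['anime_id'].isin(dropped_anime_ids)
)

# The negated mask selects exactly the rows that are dropped.
dropped_ids = scores_df.index[~keep_mask].to_list()
kept_df = scores_df.loc[keep_mask]
dropped_percentage = round(len(dropped_ids) * 100 / len(scores_df), 4)

print(dropped_ids, len(kept_df), dropped_percentage)  # [1, 3] 2 50.0
```

Because the dropped rows come from the negation of the keep mask, the two selections cannot drift apart if a condition is changed later.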


---

**- Checking Data Types**

```
- Object/String: username and anime_title;
- Integer: user_id, anime_id and rating.
```

In [34]:
# ---- Checking Data Types ----
users_scores_df.dtypes

user_id         int64
username       object
anime_id        int64
anime_title    object
rating          int64
dtype: object

---

**- Checking Scores Range**

In [35]:
# ---- Checking Scores Range ----
#
# - must follow the rule: Scale from 0 to 10;
# - animes that did not reach the cut-off number of users scores have -1 as score value.
#
print(f'Minimum Score: {users_scores_df["rating"].min()}')
print(f'Maximum Score: {users_scores_df["rating"].max()}')

Minimum Score: 1
Maximum Score: 10
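
Besides looking at the minimum and maximum, the range rule can be checked row by row with `Series.between`, which also shows which ratings break it. The values below are made up, with two invalid ratings added on purpose.

```python
import pandas as pd

# Made-up ratings: -1 and 11 fall outside the 0 to 10 scale.
ratings = pd.Series([9, 0, 10, -1, 11])

# between is inclusive on both ends by default.
in_range = ratings.between(0, 10)
out_of_range = ratings.loc[~in_range]

print(out_of_range.tolist())  # [-1, 11]
```

On the real dataset, an empty `out_of_range` would confirm what the minimum and maximum above already suggest.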


---

**- Dealing with Missing Values**

In [36]:
# ---- Dealing with Missing Values ----
users_scores_df.isnull().sum()

user_id        0
username       0
anime_id       0
anime_title    0
rating         0
dtype: int64

In [37]:
# ---- Dealing with Missing Values: Username ----
users_scores_df.loc[users_scores_df['username'].isnull()]

Unnamed: 0,user_id,username,anime_id,anime_title,rating


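The query above returns no rows in this run. When null usernames do show up, it seems they all come from our friend [None](https://myanimelist.net/profile/None), whose name is read as a missing value. So let's replace any such values with the string `None` and check whether there are any other missing values.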

In [38]:
# ---- Dealing with Missing Values: Username ----
users_scores_df.loc[users_scores_df['username'].isnull(), 'username'] = 'None'
users_scores_df.isnull().sum()

user_id        0
username       0
anime_id       0
anime_title    0
rating         0
dtype: int64

---

**- Cleaning and Lower Casing Texts**

In [39]:
# ---- Variables That Must Follow All Rules ----
#
# - apply method: transforms a whole single row or a whole single column;
# - map method: transforms a series element-wise;
# - applymap method: transforms a dataframe element-wise.
#
# Since we are working with dataframes, we are going to use 'applymap' method.
# There would be no problem in using 'apply method' here, but it is costlier.
#
# Also, since we are updating only one column, we assign it through the 'loc' indexer.
#
variables_to_follow_all_rules = ['anime_title']
users_scores_df.loc[:, variables_to_follow_all_rules] = users_scores_df[variables_to_follow_all_rules].applymap(standardize_categorical_variables)
users_scores_df[variables_to_follow_all_rules].head()

Unnamed: 0,anime_title
0,one piece
1,hack sign
2,a kite
3,aa megami-sama
4,aa megami-sama movie
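
To see where these cleaned titles come from, here is a step-by-step sketch of the standardization rule, using the same regex pattern defined at the top of the notebook. The function name `standardize` is a local stand-in for `standardize_categorical_variables`.

```python
import re

# Same pattern as CATEGORICAL_VARIABLES_REGEX_PATTERN in the notebook's constants.
CATEGORICAL_VARIABLES_REGEX_PATTERN = r'[^A-Za-z0-9, -]'

def standardize(text):
    # Rule 1: every character that is not a letter, digit, comma, space or hyphen becomes a space.
    replaced = re.sub(CATEGORICAL_VARIABLES_REGEX_PATTERN, ' ', text)
    # Remaining rules: trim, lower case, and collapse repeated spaces.
    return ' '.join(replaced.strip().lower().split())

titles = ['.hack//Sign', 'Aa! Megami-sama!', 'One Piece']
standardized = [standardize(title) for title in titles]

print(standardized)  # ['hack sign', 'aa megami-sama', 'one piece']
```

Note that the pattern only keeps ASCII letters, so accented or non-Latin characters in a title would also be replaced by spaces.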


In [40]:
# ---- Variables That Will Skip Rule 1 ----
variables_to_skip_first_rule = ['username']
users_scores_df.loc[:, variables_to_skip_first_rule] = users_scores_df[variables_to_skip_first_rule].applymap(lower_case_categorical_variables)
users_scores_df[variables_to_skip_first_rule].head()

Unnamed: 0,username
0,xinil
1,xinil
2,xinil
3,xinil
4,xinil


---

**- Dealing with Duplicated Pairs (user_id and anime_id)**

To finish this first part, let's check whether any user has rated the same anime more than once.

In [41]:
# ---- Dealing with Duplicated Pairs (user_id and anime_id) ----
users_scores_df.loc[:, ['user_id', 'anime_id']].duplicated(keep=False).sum()

0
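
The dataset has no duplicated pairs, but if it had, the offending rows could be inspected and resolved as in the sketch below. The ratings are made up, and keeping the last rating of each pair is only one possible policy, not something this notebook decides.

```python
import pandas as pd

# Made-up ratings: user 1 rated anime 21 twice (rows 0 and 3).
scores_df = pd.DataFrame({
    'user_id':  [1, 1, 2, 1],
    'anime_id': [21, 48, 21, 21],
    'rating':   [9, 7, 8, 10],
})
pair = ['user_id', 'anime_id']

# subset restricts the comparison to the pair, keep=False marks every occurrence.
duplicated_mask = scores_df.duplicated(subset=pair, keep=False)
duplicated_rows = scores_df.loc[duplicated_mask]

# One possible policy: keep only the last rating of each (user_id, anime_id) pair.
deduplicated_df = scores_df.drop_duplicates(subset=pair, keep='last')

print(duplicated_rows.index.tolist())      # [0, 3]
print(deduplicated_df['rating'].tolist())  # [7, 8, 10]
```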

<h1 id='1-feature-engineering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🚿 | Feature Engineering</h1>

In this section we will work only with the `anime-dataset-2023.csv` file in order to apply some variable transformations.

<br />

<p style='color:#7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-size:150%'>1.1 | Anime Dataset</p>

> **Steps**

```
- Avoiding Duplicated Genres, Producers, Licensors and Studios.
```

---

**- Avoiding Duplicated Genres, Producers, Licensors and Studios**

In [42]:
# ---- Avoiding Duplicated Genres, Producers, Licensors and Studios ----
variables_to_check = ['genres', 'producers', 'licensors', 'studios']
animes_df.loc[:, variables_to_check] = animes_df.loc[:, variables_to_check].applymap(remove_duplicated_subvalues)
animes_df.loc[:, variables_to_check].head()

Unnamed: 0,genres,producers,licensors,studios
0,"sci-fi, action, award winning",bandai visual,"bandai entertainment, funimation",sunrise
1,"sci-fi, action","bandai visual, sunrise",sony pictures entertainment,bones
2,"sci-fi, adventure, action",victor entertainment,"geneon entertainment usa, funimation",madhouse
3,"supernatural, drama, action, mystery","victor entertainment, dentsu, tv tokyo music, ...","bandai entertainment, funimation",sunrise
4,"supernatural, adventure, fantasy","dentsu, tv tokyo",illumitoon entertainment,toei animation
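
The helper described at the top of the notebook removes duplicates by going through a set, and a set does not guarantee any ordering, so the subvalues may come out in a different order than in the source. If the original order matters, an order-preserving variant could look like the sketch below. The function name `remove_duplicated_subvalues_ordered` is hypothetical and is not part of this notebook.

```python
def remove_duplicated_subvalues_ordered(value):
    """Remove repeated comma-separated subvalues, keeping the first occurrence order."""
    subvalues = value.split(', ')
    # dict.fromkeys keeps one key per distinct subvalue, in insertion order.
    return ', '.join(dict.fromkeys(subvalues))

genres = 'action, sci-fi, action, award winning'
unique_genres = remove_duplicated_subvalues_ordered(genres)

print(unique_genres)  # action, sci-fi, award winning
```

It can be passed to the same element-wise call as `remove_duplicated_subvalues`.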


<h1 id='2-exporting-transformed-datasets' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>💾 | Exporting Transformed Datasets</h1>

After all this work, let's wrap this notebook up by exporting the transformed datasets!

In [None]:
# ---- Exporting Transformed Datasets ----
animes_df.to_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index=False)
users_details_df.to_csv(f'{DATASETS_PATH}/users-details-transformed-2023.csv', index=False)
users_scores_df.to_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv', index=False)
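
Before relying on the exported files, a quick round-trip check can confirm that reading them back gives the same data. The sketch below works on a made-up dataframe and a temporary folder. It reads the file with `keep_default_na=False` so that text values such as the hypothetical username `null` are not turned into missing values.

```python
import os
import tempfile
import pandas as pd

# Made-up transformed scores, 'null' is a hypothetical username.
scores_df = pd.DataFrame({
    'user_id': [1, 2],
    'username': ['xinil', 'null'],
    'anime_id': [21, 48],
    'rating': [9, 7],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'users-scores-sample.csv')
    # index=False avoids writing the row index as an extra unnamed column.
    scores_df.to_csv(path, index=False)
    # keep_default_na=False keeps strings such as 'null' as plain text.
    restored_df = pd.read_csv(path, keep_default_na=False)

same_columns = restored_df.columns.tolist() == scores_df.columns.tolist()
same_rows = restored_df.to_dict('records') == scores_df.to_dict('records')

print(same_columns, same_rows)  # True True
```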

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).