<center>
    <h1 id='data-wrangling-and-feature-engineering' style='color:#7159c1'>🧼 Data Wrangling and Feature Engineering 🧼</h1>
    <i>Cleaning and applying transformations to the datasets</i>
</center>

> **Topics**

```
- 🧼 Data Wrangling;
- 🚿 Feature Engineering.
```

In [131]:
# ---- Imports ----
import chardet                  # pip install chardet
import fuzzywuzzy               # pip install fuzzywuzzy
from fuzzywuzzy import process  # pip install fuzzywuzzy
import pandas as pd             # pip install pandas
import re                       # standard library, no installation needed

# ---- Settings ----
pd.set_option('display.max_columns', None)

# ---- Constants ----
DATASETS_PATH = ('./datasets')
CATEGORICAL_VARIABLES_REGEX_PATTERN = (r'[^A-Za-z0-9, -]')

# ---- Functions ----

# \ Description:
#    - replaces every character that is not a letter, digit, comma, space or hyphen with a space;
#    - strips leading and trailing spaces from the result;
#    - converts the result to lowercase;
#    - returns the result.
#
# \ Parameters:
#    - categorical_variable: string
def standard_categorical_variables(categorical_variable):
    return re.sub(CATEGORICAL_VARIABLES_REGEX_PATTERN, ' ', categorical_variable).strip().lower()
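
As a quick illustration of what `standard_categorical_variables` does, the sketch below applies the same pattern to a made-up value. The sample string is an assumption, not a value taken from the dataset.

```python
import re

# Same pattern as the constant defined in the imports cell
CATEGORICAL_VARIABLES_REGEX_PATTERN = r'[^A-Za-z0-9, -]'

def standard_categorical_variables(categorical_variable):
    # Disallowed characters become spaces, then the text is stripped and lowercased
    return re.sub(CATEGORICAL_VARIABLES_REGEX_PATTERN, ' ', categorical_variable).strip().lower()

# Hypothetical sample value
cleaned = standard_categorical_variables('  Sci-Fi & Fantasy!  ')
print(repr(cleaned))
# 'sci-fi   fantasy'
```

Note that each removed character becomes a space, so runs of spaces can remain in the middle of the text. Only the leading and trailing ones are stripped.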

<h1 id='0-data-wrangling' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Data Wrangling</h1>

Before building Recommender System Models, and even before exploring the datasets statistically, we need to clean up the variables by filtering features, dealing with missing values and standardizing texts. So, let's go!

<br />

<p style='color:#7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-size:150%'>0.1 | Anime Dataset</p>

> **Steps**

```
- Check Charset Encoder;
- Change Features Names;
- Filter Features (english name, other/japanese name, aired, premiered, image URL);
- Check Data Types;
- Check Score Range (from 1 to 10);
- Missing Values;
- Deal with "UNKNOWN" and "NO DESCRIPTION AVAILABLE FOR THIS ANIME." Values;
- Clear Texts (A-Za-z0-9,- );
- Convert Texts to Lower Case;
- Duplicated Texts (name, synopsis, image URL) - fuzzywuzzy package.
```

---

**- Checking Charset Encoder**

First of all, we have to check the dataset's charset encoding. Since `UTF-8` is the most widely used encoding and the one `pd.read_csv` assumes by default, we have two options here:

1 - `if the dataset charset is UTF-8 already, we can go straight to the next step`;

2 - `if the dataset charset is not UTF-8, we have to convert it to UTF-8`.

In [132]:
# ---- Checking Charset Encoder ----
with open(f'{DATASETS_PATH}/anime-dataset-2023.csv', 'rb') as file:
    guessed_charset = chardet.detect(file.read(10000))

print(guessed_charset)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}


With `99%` confidence (based on the first 10,000 bytes of the file), the dataset's encoding is `utf-8`!!
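
Had the detected encoding been something other than `utf-8`, the second option would apply. Below is a minimal sketch of such a conversion using only the standard library. The `convert_to_utf8` helper, the file names and the `latin-1` sample are assumptions for illustration; in practice the source encoding would come from the `chardet` guess.

```python
import os
import tempfile

def convert_to_utf8(source_path, target_path, source_encoding):
    """Re-encode a text file from source_encoding to UTF-8."""
    # newline='' keeps the original line endings untouched
    with open(source_path, 'r', encoding=source_encoding, newline='') as source_file:
        text = source_file.read()
    with open(target_path, 'w', encoding='utf-8', newline='') as target_file:
        target_file.write(text)

# Hypothetical demo: write a small latin-1 file and convert it
with tempfile.TemporaryDirectory() as temporary_directory:
    source_path = os.path.join(temporary_directory, 'latin1.csv')
    target_path = os.path.join(temporary_directory, 'utf8.csv')
    with open(source_path, 'w', encoding='latin-1', newline='') as sample_file:
        sample_file.write('id,title\n1,Pokémon\n')

    convert_to_utf8(source_path, target_path, 'latin-1')

    with open(target_path, 'rb') as converted_file:
        converted_bytes = converted_file.read()

print(converted_bytes.decode('utf-8'))
```

Since the guess is only probabilistic, it is worth opening the converted file and checking a few rows with non-ASCII characters.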

---

**- Changing Feature Names**

To make the data easier to work with, the feature names must be standardized. So let's standardize them following the rules:

1 - `the name must only contain lower case letters and underscores`;

2 - `the spaces between each word must be replaced by underscores`.
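
The two rules above could also be applied programmatically. The sketch below is one possible way to do it; the `standardize_feature_name` helper and the sample names are hypothetical, and the notebook itself assigns a hand-written list of names instead.

```python
import re

def standardize_feature_name(name):
    """Lowercase the name and join its words with underscores."""
    lowered = name.strip().lower()
    # Any run of characters other than lowercase letters becomes a single underscore
    return re.sub(r'[^a-z]+', '_', lowered).strip('_')

# Hypothetical raw column names
sample_names = ['anime_id', 'English name', 'Scored By', 'Image URL']
standardized_names = [standardize_feature_name(name) for name in sample_names]
print(standardized_names)
# ['anime_id', 'english_name', 'scored_by', 'image_url']
```

A function like this can be passed to `DataFrame.rename(columns=...)`. A manual list remains the better choice when the new names differ in wording from the old ones, as with `title` here.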

In [133]:
# ---- Reading Dataset ----
header = [
    'id', 'title', 'english_title', 'japanese_title', 'score',
    'genres', 'synopsis', 'type', 'episodes', 'aired',
    'premiered', 'status', 'producers', 'licensors', 'studios',
    'source', 'duration', 'rating', 'rank', 'popularity',
    'favorites', 'scored_by', 'members', 'image_url'
]

animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-dataset-2023.csv')

# ---- Renaming Header ----
#
# - rename method: accepts a mapping or a function, so it allows more sophisticated name transformations and renaming
# only some of the columns;
# - columns property: only allows replacing all the names at once with a list of strings;
# - set_axis method: also replaces all the labels at once, and allows choosing whether the change is applied to
# columns or indexes.
#
# Since we are just replacing each current column name with another string, we are using the 'columns property' way:
animes_df.columns = header
animes_df.head()

,id,title,english_title,japanese_title,score,genres,synopsis,type,episodes,aired,premiered,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members,image_url
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",spring 1998,Finished Airing,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",UNKNOWN,Finished Airing,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",spring 1998,Finished Airing,Victor Entertainment,"Funimation, Geneon Entertainment USA",Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",summer 2002,Finished Airing,"Bandai Visual, Dentsu, Victor Entertainment, T...","Funimation, Bandai Entertainment",Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",fall 2004,Finished Airing,"TV Tokyo, Dentsu",Illumitoon Entertainment,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


---

**- Filtering Features**

Some variables will not be needed for our Statistical Exploratory Analysis and Recommender System Models, so we can just drop them.

Variables to Drop:

```
- English Title;
- Japanese Title;
- Aired;
- Premiered;
- Image URL.
```

In [134]:
# ---- Filtering Features ----
variables_to_drop = ['english_title', 'japanese_title', 'aired', 'premiered', 'image_url']
animes_df = animes_df.drop(columns=variables_to_drop)
animes_df.head()

,id,title,score,genres,synopsis,type,episodes,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members
0,1,Cowboy Bebop,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,Finished Airing,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505
1,5,Cowboy Bebop: Tengoku no Tobira,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,Finished Airing,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978
2,6,Trigun,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,Finished Airing,Victor Entertainment,"Funimation, Geneon Entertainment USA",Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252
3,7,Witch Hunter Robin,7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,Finished Airing,"Bandai Visual, Dentsu, Victor Entertainment, T...","Funimation, Bandai Entertainment",Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931
4,8,Bouken Ou Beet,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,Finished Airing,"TV Tokyo, Dentsu",Illumitoon Entertainment,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001


---

**- Checking Data Types**

Regarding the features, we have to ensure that each one has the desired data type:

```
- Object/String: title, genres, synopsis, type, status, producers, licensors, studios, source, duration, rating;
- Integer: id, episodes, rank, popularity, favorites, scored_by, members;
- Float: score.
```

In [135]:
# ---- Checking Data Types ----
animes_df.dtypes

id             int64
title         object
score         object
genres        object
synopsis      object
type          object
episodes      object
status        object
producers     object
licensors     object
studios       object
source        object
duration      object
rating        object
rank          object
popularity     int64
favorites      int64
scored_by     object
members        int64
dtype: object

So here are the variables we have to transform:

```
- Score: from object to float; replace 'UNKNOWN' by -1; UNKNOWN means that the anime did not reach the cut-off number of user scores;

- Episodes: from object to integer; replace 'UNKNOWN' by -1; UNKNOWN means that the anime is being released or will be released;

- Rank: from object to integer; replace 'UNKNOWN' by -1; UNKNOWN means that the anime is not released or is a hentai (+18);

- Scored By: from object to integer; replace 'UNKNOWN' by -1; UNKNOWN means that the anime did not reach the cut-off number of user scores.
```

In [136]:
# ---- Replacing UNKNOWN by -1 ----
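# (assigning the result back avoids relying on inplace changes to a selected column)
animes_df['score'] = animes_df['score'].replace('UNKNOWN', '-1')
animes_df['episodes'] = animes_df['episodes'].replace('UNKNOWN', '-1')
animes_df['rank'] = animes_df['rank'].replace('UNKNOWN', '-1')
animes_df['scored_by'] = animes_df['scored_by'].replace('UNKNOWN', '-1')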

# ---- Converting Data Types ----
animes_df['score'] = animes_df['score'].astype('float')
animes_df['episodes'] = animes_df['episodes'].astype('float').astype('int')
animes_df['rank'] = animes_df['rank'].astype('float').astype('int')
animes_df['scored_by'] = animes_df['scored_by'].astype('float').astype('int')

# ---- Checking Data Types ----
animes_df.dtypes

id              int64
title          object
score         float64
genres         object
synopsis       object
type           object
episodes        int32
status         object
producers      object
licensors      object
studios        object
source         object
duration       object
rating         object
rank            int32
popularity      int64
favorites       int64
scored_by       int32
members         int64
dtype: object
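
As an alternative to replacing the string and then casting, the same conversion can be sketched with `pd.to_numeric`. The sample values below are made up to mimic the `episodes` column, and this is only one possible approach, not the one used in the notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the 'episodes' column
episodes = pd.Series(['26.0', '1.0', 'UNKNOWN', '52.0'])

# errors='coerce' turns every non-numeric value into NaN
episodes_numeric = pd.to_numeric(episodes, errors='coerce')

# NaN is then replaced by the -1 sentinel before casting to integer
episodes_int = episodes_numeric.fillna(-1).astype('int64')
print(episodes_int.tolist())
# [26, 1, -1, 52]
```

One trade-off: `errors='coerce'` silently maps any unexpected string to -1, not only `UNKNOWN`, so the explicit replacement is stricter. Passing `'int64'` also makes the resulting integer width independent of the platform.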

---

**- Checking Score Range**

Anime scores on MyAnimeList follow a scale from 1 to 10. The only exception is the animes that did not reach the cut-off number of user scores, which received -1 as their score in the previous step. So, we have to check whether the values in the dataset follow this condition!!

In [137]:
# ---- Checking Score Range ----
#
# - must follow the rule: scale from 1 to 10;
# - animes that did not reach the cut-off number of user scores have -1 as score value.
#
animes_score_range_df = animes_df['score'].loc[animes_df['score'] != -1]

print(f'Minimum Score: {animes_score_range_df.min()}')
print(f'Maximum Score: {animes_score_range_df.max()}')

Minimum Score: 1.85
Maximum Score: 9.1


Hooray!! The score variable follows the expected range 🥳
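
Checking only the minimum and maximum works here. If the check had to be repeated, it could be wrapped in a small helper that returns the offending values. The sketch below is an assumption about how that could look; the `scores_out_of_range` name and the sample scores are hypothetical.

```python
def scores_out_of_range(scores, minimum=1.0, maximum=10.0, sentinel=-1.0):
    """Return the scores that are neither the sentinel nor inside [minimum, maximum]."""
    return [
        score for score in scores
        if score != sentinel and not (minimum <= score <= maximum)
    ]

# Hypothetical sample: two valid scores, one sentinel and two invalid values
sample_scores = [8.75, -1.0, 1.85, 9.1, 10.5, 0.0]
invalid_scores = scores_out_of_range(sample_scores)
print(invalid_scores)
# [10.5, 0.0]
```

The helper accepts any iterable of numbers, so a column such as `animes_df['score']` could be passed directly. An empty list means the range rule holds.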

---

**- Dealing with Missing Values**

The next step is to check whether the variables contain any missing values and deal with them properly. First let's confirm whether there are any missing values and then, depending on which variables are affected and how many values are missing, choose the most convenient strategy.

In [138]:
# ---- Missing Values ----
animes_df.isnull().sum()

id            0
title         0
score         0
genres        0
synopsis      0
type          0
episodes      0
status        0
producers     0
licensors     0
studios       0
source        0
duration      0
rating        0
rank          0
popularity    0
favorites     0
scored_by     0
members       0
dtype: int64

Ok, there are no missing values 😐😐

---

**- Dealing with `UNKNOWN` and `No description available for this anime.` Values**

Even though there are no missing values, we have stumbled upon some features that use `UNKNOWN` to represent the absence of data. Also, a tiny spoiler for you: the synopsis variable contains `No description available for this anime.` instead of `UNKNOWN`.

So, let's first get all variables that contain these values and then deal with them.

In [139]:
# ---- Unknown Values ----
#
# - counting unknown per variable
#
unknown_counts_per_variable = [(variable, (animes_df[variable] == 'UNKNOWN').sum()) for variable in animes_df.columns]
for variable, unknown_count in unknown_counts_per_variable: print(f'{variable} - {unknown_count}')

id - 0
title - 0
score - 0
genres - 4929
synopsis - 0
type - 74
episodes - 0
status - 0
producers - 13350
licensors - 20170
studios - 10526
source - 0
duration - 0
rating - 669
rank - 0
popularity - 0
favorites - 0
scored_by - 0
members - 0
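
Both placeholder texts can be counted in a single pass with `DataFrame.isin`. The sketch below uses a tiny made-up data frame; the `PLACEHOLDER_VALUES` name and the sample rows are assumptions, while the two placeholder strings are the ones mentioned in the text above.

```python
import pandas as pd

# Placeholder texts that stand for absent data
PLACEHOLDER_VALUES = ['UNKNOWN', 'No description available for this anime.']

# Hypothetical sample rows
sample_df = pd.DataFrame({
    'genres': ['Action, Sci-Fi', 'UNKNOWN', 'Comedy'],
    'synopsis': ['Crime is timeless.', 'No description available for this anime.', 'UNKNOWN'],
    'members': [1771505, 10, 25],
})

# isin marks the cells that exactly match any placeholder; sum counts them per column
placeholder_counts = sample_df.isin(PLACEHOLDER_VALUES).sum()
print(placeholder_counts.to_dict())
```

The match is exact and case-sensitive, so a variant such as an upper-case spelling of the synopsis placeholder would need to be added to the list explicitly.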


In [142]:
# ---- Unknown Values ----
#
# - replacing unknown values by empty string
#
for variable, unknown_count in unknown_counts_per_variable:
    if unknown_count > 0:
        animes_df[variable] = animes_df[variable].replace('UNKNOWN', '')
    print(f'{variable} - {(animes_df[variable] == "UNKNOWN").sum()}')

id - 0
title - 0
score - 0
genres - 0
synopsis - 0
type - 0
episodes - 0
status - 0
producers - 0
licensors - 0
studios - 0
source - 0
duration - 0
rating - 0
rank - 0
popularity - 0
favorites - 0
scored_by - 0
members - 0
