<center>
    <h1 id='statistic-analysis' style='color:#7159c1'>📊 Statistical Analysis 📊</h1>
    <i>Exploring and studying the Anime and User Details Datasets</i>
</center>

> **Topics**

```
- 🔎 Exploratory Analysis;
- 📊 Plots;
- 🧪 Hypothesis Testing.
```

In [54]:
# ---- Imports ----
import matplotlib.pyplot as plt  # pip install matplotlib
import numpy as np               # pip install numpy
import pandas as pd              # pip install pandas
import re                        # standard library, no install needed
import seaborn as sns            # pip install seaborn
import scipy.stats as stats      # pip install scipy

# ---- Constants ----
DATASETS_PATH = './datasets'
NEW_VARIABLES_REGEX = r'[^A-Za-z0-9 ]'
SEED = 20231105

# ---- Settings ----
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

# ---- Functions ----

# \ Description:
#    - gets all unique values of the variable in the dataset (each value is itself a comma-separated list);
#    - joins the values into a single string with a comma-space;
#    - splits that string on comma-space to get the individual items;
#    - transforms the result into a set in order to remove duplicated values;
#    - transforms the set into a list for better manipulation;
#    - returns the list
#
# \ Parameters:
#    - dataset: Pandas DataFrame;
#    - variable: string
#
get_unique_variable_values = lambda dataset, variable: list(set(', '.join(dataset[variable].unique()).split(', ')))

# \ Description:
#    - replaces every character that is not a letter, a digit or a space with a single space;
#    - strips the string in order to remove spaces from the beginning and end;
#    - converts the string to lower case;
#    - replaces all spaces by underscores;
#    - returns the result
#
# \ Parameters:
#    - variable: string
#
standardize_new_variable_name = lambda variable: re.sub(NEW_VARIABLES_REGEX, ' ', variable).strip().lower().replace(' ', '_')

# \ Description
#    - standardizes each variable in the 'variables_list' parameter using the 'standardize_new_variable_name' function;
#    - prefixes the result with the 'main_type' parameter and an underscore;
#    - stores each transformed variable name in a list;
#    - returns the list
#
# \ Parameters:
#    - main_type: string
#    - variables_list: list of strings
#
transform_new_variables_names = lambda main_type, variables_list: [main_type + '_' + standardize_new_variable_name(variable) for variable in variables_list]
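
To make the behaviour of these helpers concrete, here is a minimal sketch that runs them on a tiny, made-up DataFrame. The `toy_df` name and its two rows are illustrative assumptions, not part of the real dataset; the helper definitions are repeated only so the snippet runs on its own.

```python
import re
import pandas as pd

NEW_VARIABLES_REGEX = r'[^A-Za-z0-9 ]'

# Same logic as the notebook helpers, repeated so this snippet is self-contained.
get_unique_variable_values = lambda dataset, variable: list(set(', '.join(dataset[variable].unique()).split(', ')))
standardize_new_variable_name = lambda variable: re.sub(NEW_VARIABLES_REGEX, ' ', variable).strip().lower().replace(' ', '_')
transform_new_variables_names = lambda main_type, variables_list: [main_type + '_' + standardize_new_variable_name(variable) for variable in variables_list]

# Hypothetical two-row dataset: each cell holds a comma-separated list of genres.
toy_df = pd.DataFrame({'genres': ['sci-fi, action, award winning', 'sci-fi, action']})

# Sorted only to get a stable order; the helper itself returns an unordered list.
unique_genres = sorted(get_unique_variable_values(toy_df, 'genres'))
new_names = transform_new_variables_names('genre', unique_genres)

print(unique_genres)  # ['action', 'award winning', 'sci-fi']
print(new_names)      # ['genre_action', 'genre_award_winning', 'genre_sci_fi']
```

Note that each non-alphanumeric character becomes its own space, so a name such as 'heart & soul animation' ends up with several consecutive underscores.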

<h1 id='0-exploratory-analysis' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🔎 | Exploratory Analysis</h1>


In this section, we apply descriptive analysis to understand how the data are spread, inspect their frequencies and distributions, and check the correlations between the variables.

We also create a new anime dataset that transforms the genres, producers, licensors and studios variables for easier graphical exploration.

<br />

<p style='color:#7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-size:150%'>0.1 | Anime Dataset</p>

> **Steps**

```
- Separate Categorical Variables from Numerical Variables;
- Categorical Variables: Transform Genres, Producers, Licensors and Studios;
- Categorical Variables: Describe, Frequency Table and Conclusions;
- Numerical Variables: Describe, Histograms, Correlations, HeatMap Plot and Conclusions.
```

---

**- Separate Categorical Variables from Numerical Variables**

First of all, since statistical methods are applied differently to categorical and numerical variables, we have to identify and separate them, so let's go!

In [23]:
# ---- Reading Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv')
print(f'- Variables: {animes_df.shape[1]}')
print(f'- Observations: {animes_df.shape[0]}')
animes_df.head()

- Variables: 19
- Observations: 23748


Unnamed: 0,id,title,score,genres,synopsis,type,episodes,status,producers,licensors,studios,source,duration,rating,rank,popularity,favorites,scored_by,members
0,1,cowboy bebop,8.75,"sci-fi, action, award winning","crime is timeless. by the year 2071, humanity ...",tv,26,finished airing,bandai visual,"bandai entertainment, funimation",sunrise,original,24 min per ep,R - 17+ (violence & profanity),41,43,78525,914193,1771505
1,5,cowboy bebop tengoku no tobira,8.38,"sci-fi, action","another day, another bounty—such is the life o...",movie,1,finished airing,"bandai visual, sunrise",sony pictures entertainment,bones,original,1 hr 55 min,R - 17+ (violence & profanity),189,602,1448,206248,360978
2,6,trigun,8.22,"sci-fi, adventure, action","vash the stampede is the man with a $$60,000,0...",tv,26,finished airing,victor entertainment,"geneon entertainment usa, funimation",madhouse,manga,24 min per ep,PG-13 - Teens 13 or older,328,246,15035,356739,727252
3,7,witch hunter robin,7.25,"supernatural, drama, action, mystery",robin sena is a powerful craft user drafted in...,tv,26,finished airing,"victor entertainment, dentsu, tv tokyo music, ...","bandai entertainment, funimation",sunrise,original,25 min per ep,PG-13 - Teens 13 or older,2764,1795,613,42829,111931
4,8,bouken ou beet,6.94,"supernatural, adventure, fantasy",it is the dark century and the people are suff...,tv,52,finished airing,"dentsu, tv tokyo",illumitoon entertainment,toei animation,manga,23 min per ep,PG - Children,4240,5126,14,6413,15001


In [21]:
# ---- Separating Categorical from Numerical Variables ----
categorical_variables = [
    variable for variable in animes_df.columns
    if animes_df[variable].dtype in ['O', 'object']
]

numerical_variables = [
    variable for variable in animes_df.columns
    if animes_df[variable].dtype in ['int64', 'float64', 'int32', 'float32']
]

print(f'- Categorical Variables: {categorical_variables}')
print(f'- Numerical Variables: {numerical_variables}')

- Categorical Variables: ['title', 'genres', 'synopsis', 'type', 'status', 'producers', 'licensors', 'studios', 'source', 'duration', 'rating']
- Numerical Variables: ['id', 'score', 'episodes', 'rank', 'popularity', 'favorites', 'scored_by', 'members']
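
As an aside, the same split can be sketched with `DataFrame.select_dtypes`, which avoids listing dtype names by hand. The `toy_df` below is a made-up stand-in for `animes_df`, and treating every non-numeric column as categorical is an assumption of this sketch (it would also catch boolean or datetime columns, which the dtype check above would not).

```python
import pandas as pd

# Hypothetical miniature of the anime dataset, for illustration only.
toy_df = pd.DataFrame({
    'id': [1, 5],
    'title': ['cowboy bebop', 'cowboy bebop tengoku no tobira'],
    'score': [8.75, 8.38],
    'type': ['tv', 'movie'],
    'episodes': [26, 1],
})

# 'number' matches every integer and float dtype; column order is preserved.
numerical_variables = toy_df.select_dtypes(include='number').columns.tolist()

# Everything that is not numeric is treated as categorical in this sketch.
categorical_variables = toy_df.select_dtypes(exclude='number').columns.tolist()

print(f'- Categorical Variables: {categorical_variables}')
print(f'- Numerical Variables: {numerical_variables}')
```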


---

**- Categorical Variables: Transform Genres, Producers, Licensors and Studios**

Now, we are going to create a new dataset called `anime-exploratory-dataset-2023.csv` in which each value of the `genres`, `producers`, `licensors` and `studios` variables becomes its own binary variable. For instance, the genres of 'Cowboy Bebop' are 'sci-fi', 'action' and 'award winning', so the exploratory dataset will look like this:

<table style='border-style: solid'>
    <tr align='center' style='border-style: solid'>
        <th style='border-style: solid'>id</th>
        <th style='border-style: solid'>name</th>
        <th style='border-style: solid'>genre_sci_fi</th>
        <th style='border-style: solid'>genre_action</th>
        <th style='border-style: solid'>genre_mystery</th>
        <th style='border-style: solid'>genre_award_winning</th>
    </tr>
    <tr align='center' style='border-style: solid'>
        <td style='border-style: solid'>1</td>
        <td style='border-style: solid'>Cowboy Bebop</td>
        <td style='border-style: solid'>1</td>
        <td style='border-style: solid'>1</td>
        <td style='border-style: solid'>0</td>
        <td style='border-style: solid'>1</td>
    </tr>
</table>
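
One possible way to produce a layout like the table above is `Series.str.get_dummies`, shown here on a made-up two-row frame. This is only a sketch under stated assumptions: `toy_df`, `to_column_name` and `exploratory_df` are hypothetical names, and the notebook may build the real exploratory dataset differently.

```python
import re
import pandas as pd

# Hypothetical sample rows, for illustration only.
toy_df = pd.DataFrame({
    'id': [1, 7],
    'title': ['cowboy bebop', 'witch hunter robin'],
    'genres': ['sci-fi, action, award winning', 'supernatural, drama, action, mystery'],
})

def to_column_name(prefix, value):
    """Build a column name such as 'genre_sci_fi' from a prefix and a raw value."""
    cleaned = re.sub(r'[^A-Za-z0-9 ]', ' ', value).strip().lower().replace(' ', '_')
    return f'{prefix}_{cleaned}'

# One 0/1 indicator column per individual genre found in the comma-separated cells.
indicators = toy_df['genres'].str.get_dummies(sep=', ')
indicators.columns = [to_column_name('genre', column) for column in indicators.columns]

# Keep the identifying columns and append the indicator columns.
exploratory_df = pd.concat([toy_df[['id', 'title']], indicators], axis=1)
print(exploratory_df)
```

With four multi-valued variables and thousands of distinct producers and studios, this approach yields a very wide frame, so it may be worth restricting it to the most frequent values.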

In [25]:
# ---- Counting Unique Values of the Variables to Transform ----
variables_to_transform = ['genres', 'producers', 'licensors', 'studios']

for variable in variables_to_transform:
    print(f'{variable} - {animes_df[variable].nunique()} unique values')

genres - 992 unique values
producers - 4351 unique values
licensors - 265 unique values
studios - 1518 unique values
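
Keep in mind that `nunique` on the raw column counts distinct combined strings (for example 'sci-fi, action'), not individual items. The sketch below contrasts the two counts on a made-up column; `toy_df` and its rows are illustrative assumptions.

```python
import pandas as pd

# Hypothetical column: six distinct combinations built from only three genres.
toy_df = pd.DataFrame({
    'genres': [
        'sci-fi, action',
        'sci-fi',
        'action',
        'sci-fi, action, drama',
        'drama',
        'action, drama',
    ],
})

# Counts distinct cell contents, i.e. distinct combinations.
combined_count = toy_df['genres'].nunique()

# Splits each cell into a list, puts one item per row, then counts distinct items.
individual_count = toy_df['genres'].str.split(', ').explode().nunique()

print(f'genres - {combined_count} unique combinations, {individual_count} unique individual values')
```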


In [33]:
get_unique_variable_values(animes_df, 'studios')

['caviar',
 'gav video',
 'makaria',
 'bakken record',
 'opera house',
 'decovocal',
 'studio g-1neo',
 'at-2',
 'gemba',
 'studio 3hz',
 'maro studio',
 'feel.',
 'graphinica',
 '10gauge',
 'sunflowers',
 'asmik ace',
 'studio d-volt',
 'morie inc.',
 'gravity well',
 'y.o.u.c',
 'ai yume mai',
 'cg year',
 'enzo animation',
 'studio gooneys',
 'usagi ou',
 'okumaza',
 'studio bind',
 'lmd',
 'drawiz',
 'anime r',
 'trif studio',
 'xflag pictures',
 'spoon',
 'shochiku animation institute',
 'alpha animation',
 'point pictures',
 'l-a-unch・box',
 'doga productions',
 'orenda',
 'wong ping animation lab',
 'studio kafka',
 'studio unicorn',
 'forest hunting one',
 'momoi planning',
 'huamei animation',
 'actas',
 'oddjob',
 'yhkt entertainment',
 'heart & soul animation',
 'creators dot com',
 'aic frontier',
 'nippon animation',
 'rabbit gate',
 'mmt technology',
 'colored pencil animation japan',
 'acca effe',
 'animation staff room',
 'tsukimidou',
 'geno studio',
 'c and r',
 'saig