# Anime Rating Full Dataset - Exploratory Data Analysis
This notebook contains an exploratory data analysis (EDA) of the complete top anime dataset fetched from the Jikan API. The goal is to understand the structure, patterns, and distribution of the data before modeling.

In [None]:
# Standard library imports
import json
import sys
from pathlib import Path

# Third-party imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root (the directory that contains 'src') to sys.path
sys.path.append(str(Path().resolve().parent))

# Local application imports
from src.preprocessing import (
    fill_missing_values,
    convert_types,
    clean_string_columns,
    convert_list_columns,
    add_missing_indicators,
    drop_duplicates_by_title,
    filter_impossible_scores,
    reorder_columns
)

# Plot style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

## 1. Data Loading
We load the raw `top_anime_all.json` dataset, which contains over 29,000 anime records, and flatten it into a DataFrame for the steps that follow.

In [None]:
# Import display function for Jupyter Notebook (Path is already imported above)
from IPython.display import display

# Load `top_anime_all.json` from `data/raw`
data_path = Path("../data/raw/top_anime_all.json")
with data_path.open(encoding="utf-8") as f:
    anime_data = json.load(f)

# Convert list of dict to DataFrame
df = pd.json_normalize(anime_data)

# Show basic info and preview
df.info()
display(df.head())
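
To make the flattening step concrete, here is a minimal sketch of how `pd.json_normalize` treats nested records. The two records are made up for illustration, and their field layout (an `aired` dict, `genres` as a list of `{"name": ...}` dicts) is an assumption about the raw file, not a copy of it.

```python
import pandas as pd

# Made-up records; the field layout is an assumption, not the real file contents.
sample_records = [
    {
        "title": "Example A",
        "score": 8.5,
        "aired": {"from": "2020-01-01", "to": None},
        "genres": [{"name": "Action"}, {"name": "Drama"}],
    },
    {
        "title": "Example B",
        "score": None,
        "aired": {"from": None, "to": None},
        "genres": [],
    },
]

sample_df = pd.json_normalize(sample_records)

# Nested dicts are expanded into dotted column names ("aired.from", "aired.to"),
# while list values such as "genres" stay as Python lists inside a single column.
print(sorted(sample_df.columns))
print(type(sample_df.loc[0, "genres"]))
```

This is why list-valued columns such as `genres` need a separate conversion step later on: flattening alone does not turn them into plain strings.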

## 2. Data Overview
The shape and first rows were already shown by `df.info()` and `df.head()` above. Here we review summary statistics, missing values, and the distribution of scores.

In [None]:
# Summary statistics for numeric columns
df.describe()

In [None]:
# Check missing values
df.isna().sum().sort_values(ascending=False)
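
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on a small toy frame (the values are invented); the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, standing in for the real dataset.
toy = pd.DataFrame({
    "score": [8.1, np.nan, 7.4, np.nan],
    "year": [2001, 2005, np.nan, 2010],
    "title": ["A", "B", "C", "D"],
})

# isna().sum() counts missing cells; isna().mean() gives the missing fraction.
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```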

In [None]:
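# Plot distribution of anime scores (drop only rows where the score itself is missing;
# a bare dropna() would also discard rows with gaps in unrelated columns)
sns.histplot(data=df.dropna(subset=["score"]), x="score", bins=20, kde=True)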
plt.title("Distribution of Anime Scores")
plt.xlabel("Score")
plt.ylabel("Count")
plt.show()

## 3. Data Cleaning Process

Before summarizing the cleaned data, we apply the following preprocessing steps to prepare the dataset for analysis and modeling:

- **Missing Values Handling**: Filled missing values with standard placeholders (e.g., `-1` for year, `"Unknown"` for rating).
- **Type Conversion**: Converted numeric columns (e.g., `episodes`, `year`, `rank`) to integers.
- **String Cleaning**: Standardized string columns by trimming and capitalizing text.
- **List Column Conversion**: Converted nested lists (e.g., genres and demographics) to comma-separated strings.
- **Indicator Columns**: Added `has_year` and `has_season` columns to mark original presence of these values.
- **Duplicate Removal**: Dropped duplicate anime titles to ensure uniqueness.
- **Invalid Score Filtering**: Removed anime entries with scores outside the 0–10 range, while keeping the `-1` placeholder used for missing scores.
- **Column Reordering**: Ensured consistent column order for better readability.

All cleaning steps were modularized and handled in `src/preprocessing.py` for reuse and clarity.
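
Since `src/preprocessing.py` is not shown in this notebook, the following is only a sketch of what four of the steps above might look like. The `_sketch` function names, the `"Unknown"` placeholder for `season`, and the `{"name": ...}` shape of list items are assumptions made for illustration; the real module may differ.

```python
import pandas as pd


def add_missing_indicators_sketch(df, columns=("year", "season")):
    """Add has_<col> flags: 1 if the original value was present, 0 if missing."""
    out = df.copy()
    for col in columns:
        out[f"has_{col}"] = out[col].notna().astype(int)
    return out


def fill_missing_values_sketch(df):
    """Fill gaps with placeholders (-1 for numbers, "Unknown" for labels)."""
    out = df.copy()
    out["year"] = out["year"].fillna(-1)
    out["score"] = out["score"].fillna(-1)
    out["rating"] = out["rating"].fillna("Unknown")
    out["season"] = out["season"].fillna("Unknown")  # assumed placeholder
    return out


def convert_list_columns_sketch(df, columns=("genres",)):
    """Join lists of {"name": ...} dicts into comma-separated strings."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(
            lambda items: ", ".join(item["name"] for item in items)
            if isinstance(items, list) else ""
        )
    return out


def filter_impossible_scores_sketch(df):
    """Keep scores in [0, 10] plus the -1 missing-value placeholder."""
    valid = df["score"].between(0, 10) | (df["score"] == -1)
    return df[valid].reset_index(drop=True)


# Toy input with invented values: row B has gaps, row C has an impossible score.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2001.0, None, 2010.0],
    "season": ["spring", None, "fall"],
    "rating": ["PG-13", None, "R"],
    "score": [8.2, None, 11.5],
    "genres": [[{"name": "Action"}, {"name": "Drama"}], [], [{"name": "Comedy"}]],
})

cleaned = (
    toy.pipe(add_missing_indicators_sketch)
       .pipe(fill_missing_values_sketch)
       .pipe(convert_list_columns_sketch)
       .pipe(filter_impossible_scores_sketch)
)
print(cleaned)
```

The order matters in this sketch: the indicator columns have to be computed before the placeholders are filled in, otherwise every row would look as if its value had been present.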

In [None]:
# Select relevant columns only
columns_to_keep = [
    "title", "type", "source", "episodes", "status", "rating",
    "score", "scored_by", "rank", "popularity", "members", "favorites",
    "year", "season", "genres", "demographics"
]
# Take an explicit copy so later steps never modify a view of the raw DataFrame
df_selected = df[columns_to_keep].copy()

# Preprocessing pipeline using functions from preprocessing.py
df_selected = add_missing_indicators(df_selected)
df_selected = fill_missing_values(df_selected)
df_selected = convert_types(df_selected)
df_selected = clean_string_columns(df_selected)
df_selected = convert_list_columns(df_selected)
df_selected = drop_duplicates_by_title(df_selected)
df_selected = filter_impossible_scores(df_selected)

# Reorder columns if needed
desired_order = [
    "title", "type", "source", "episodes", "status", "rating",
    "score", "scored_by", "rank", "popularity", "members", "favorites",
    "year", "season", "has_year", "has_season", "genres", "demographics"
]
df_selected = reorder_columns(df_selected, desired_order)

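# Final check and save
display(df_selected.head())
df_selected.info()
output_path = Path("../data/processed/clean_anime_full.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
df_selected.to_csv(output_path, index=False)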

## 4. Data Cleaning Recap
This section summarizes the preprocessing steps applied to the raw dataset:

- Handling missing values (e.g., year, season, score)
- Converting data types (e.g., episodes to int)
- Transforming list-type columns into comma-separated strings
- Adding `has_year` and `has_season` flags that record whether the original value was present

These steps are implemented in `src/preprocessing.py`.

## 5. Missing Value Summary
Below is the summary of remaining missing values, if any, after preprocessing.

In [None]:
# Display missing values in descending order
df_selected.isna().sum().sort_values(ascending=False)

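Features such as `year`, `season`, and `score` were filled with placeholders during preprocessing, so they should show zero missing values here. Any column that still reports a non-zero count was not covered by those placeholders.
Keep in mind that a placeholder such as `-1` is not a real value and should be excluded before computing statistics on `year` or `score`.
With that caveat, the final dataset is ready for further analysis or modeling.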

## 6. Duplicate Check
We check for any duplicate entries based on the anime titles.

In [None]:
# Check for duplicates by title
duplicates = df_selected.duplicated(subset=["title"])
print("Total duplicates:", duplicates.sum())

# Optionally display duplicates (if any)
df_selected[duplicates]

No duplicate titles remain after `drop_duplicates_by_title`; each title appears exactly once in the cleaned dataset.
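
When a duplicate check does find something, it helps to see every copy of a repeated title rather than only the later ones. A minimal sketch on a toy frame with invented values:

```python
import pandas as pd

# Toy frame with invented values; "Alpha" appears twice with different types.
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Alpha", "Gamma"],
    "type": ["TV", "Movie", "OVA", "TV"],
})

# Default keep="first": only the second and later occurrences are flagged.
later_repeats = toy.duplicated(subset=["title"])

# keep=False: every row that shares its title with another row is flagged.
all_copies = toy.duplicated(subset=["title"], keep=False)

print(later_repeats.sum(), all_copies.sum())
print(toy[all_copies].sort_values("title"))
```

Looking at all copies side by side also shows whether two rows are true duplicates or distinct works that happen to share a title, such as a TV series and an OVA, which a title-only deduplication would merge.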

## 7. Feature Description

Below is a brief explanation of each column included in the final cleaned dataset:

| Feature        | Description |
|----------------|-------------|
| `title`        | Title of the anime. |
| `type`         | Type of media (e.g., TV, Movie, OVA). |
| `source`       | Original source material (e.g., Manga, Novel, Original). |
| `episodes`     | Total number of episodes. |
| `status`       | Airing status (e.g., Finished Airing, Currently Airing). |
| `rating`       | Age rating classification (e.g., PG-13, R). |
| `score`        | Average user score from MyAnimeList (range: 0–10; `-1` marks a missing score). |
| `scored_by`    | Number of users who rated the anime. |
| `rank`         | Overall rank based on score. |
| `popularity`   | Popularity rank based on number of members. |
| `members`      | Number of users who added the anime to their list. |
| `favorites`    | Number of users who marked the anime as a favorite. |
| `year`         | Release year of the anime (`-1` if the year was missing). |
| `season`       | Season of release (e.g., Spring, Fall). |
| `has_year`     | Binary indicator (1 if original `year` was present, 0 if filled). |
| `has_season`   | Binary indicator (1 if original `season` was present, 0 if filled). |
| `genres`       | Genres the anime belongs to, as a comma-separated string. |
| `demographics` | Target demographic groups (e.g., Shounen, Seinen), as a comma-separated string. |
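
The descriptions above can double as lightweight sanity checks. The sketch below is one possible way to encode a few of them; `check_cleaned_frame` is a hypothetical helper written for this example, not part of `src/preprocessing.py`, and it assumes the `-1` placeholder convention described earlier.

```python
import pandas as pd


def check_cleaned_frame(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # score must lie in 0-10 unless it is the -1 placeholder.
    score_ok = df["score"].between(0, 10) | (df["score"] == -1)
    if not score_ok.all():
        problems.append("score outside 0-10 (and not the -1 placeholder)")

    # Indicator columns must be binary.
    for flag in ("has_year", "has_season"):
        if not df[flag].isin([0, 1]).all():
            problems.append(f"{flag} is not binary")

    # Titles are expected to be unique after deduplication.
    if df["title"].duplicated().any():
        problems.append("duplicate titles")

    # A placeholder year should coincide with has_year == 0.
    if ((df["year"] == -1) & (df["has_year"] == 1)).any():
        problems.append("year is the -1 placeholder but has_year is 1")

    return problems


# Toy frames with invented values: one consistent, one with two violations.
good = pd.DataFrame({
    "title": ["A", "B"],
    "score": [8.0, -1.0],
    "year": [2001, -1],
    "has_year": [1, 0],
    "has_season": [1, 0],
})
bad = good.assign(score=[12.0, -1.0], has_year=[1, 1])

print(check_cleaned_frame(good))
print(check_cleaned_frame(bad))
```

Calling `check_cleaned_frame(df_selected)` after the pipeline would be one way to catch a regression in the cleaning code before the CSV is used downstream.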

## (Coming Next) Visual Explorations
In the next section, we will explore the dataset using various visualizations to uncover trends, relationships, and patterns.