# Practice 1 – Data Cleaning (Anime Dataset)

In this practice we will:

- Load the `Anime.csv` dataset.
- Explore its structure (rows, columns, data types).
- Identify missing values and possible errors.
- Clean and transform some key columns.
- Generate a new file, `Anime_clean.csv`, which we will use in the remaining practices.


In [8]:
import pandas as pd

df = pd.read_csv("../data/Anime.csv")
df.head()






Unnamed: 0,Rank,Name,Japanese_name,Type,Episodes,Studio,Release_season,Tags,Rating,Release_year,End_year,Description,Content_Warning,Related_Mange,Related_anime,Voice_actors,staff
0,1,Demon Slayer: Kimetsu no Yaiba - Entertainment...,Kimetsu no Yaiba: Yuukaku-hen,TV,,ufotable,Fall,"Action, Adventure, Fantasy, Shounen, Demons, H...",4.6,2021.0,,'Tanjiro and his friends accompany the Hashira...,Explicit Violence,Demon Slayer: Kimetsu no Yaiba,"Demon Slayer: Kimetsu no Yaiba, Demon Slayer: ...","Inosuke Hashibira : Yoshitsugu Matsuoka, Nezuk...","Koyoharu Gotouge : Original Creator, Haruo Sot..."
1,2,Fruits Basket the Final Season,Fruits Basket the Final,TV,13.0,TMS Entertainment,Spring,"Drama, Fantasy, Romance, Shoujo, Animal Transf...",4.6,2021.0,,'The final arc of Fruits Basket.',"Emotional Abuse,, Mature Themes,, Physical Abu...","Fruits Basket, Fruits Basket Another","Fruits Basket 1st Season, Fruits Basket 2nd Se...","Akito Sohma : Maaya Sakamoto, Kyo Sohma : Yuum...","Natsuki Takaya : Original Creator, Yoshihide I..."
2,3,Mo Dao Zu Shi 3,The Founder of Diabolism 3,Web,12.0,B.C MAY PICTURES,,"Fantasy, Ancient China, Chinese Animation, Cul...",4.58,2021.0,,'The third season of Mo Dao Zu Shi.',,Grandmaster of Demonic Cultivation: Mo Dao Zu ...,"Mo Dao Zu Shi 2, Mo Dao Zu Shi Q","Lan Wangji, Wei Wuxian, Jiang Cheng, Jin Guang...","Mo Xiang Tong Xiu : Original Creator, Xiong Ke..."
3,4,Fullmetal Alchemist: Brotherhood,Hagane no Renkinjutsushi: Full Metal Alchemist,TV,64.0,Bones,Spring,"Action, Adventure, Drama, Fantasy, Mystery, Sh...",4.58,2009.0,2010.0,"""The foundation of alchemy is based on the law...","Animal Abuse,, Mature Themes,, Violence,, Dome...","Fullmetal Alchemist, Fullmetal Alchemist (Ligh...","Fullmetal Alchemist: Brotherhood Specials, Ful...","Alphonse Elric : Rie Kugimiya, Edward Elric : ...","Hiromu Arakawa : Original Creator, Yasuhiro Ir..."
4,5,Attack on Titan 3rd Season: Part II,Shingeki no Kyojin Season 3: Part II,TV,10.0,WIT Studio,Spring,"Action, Fantasy, Horror, Shounen, Dark Fantasy...",4.57,2019.0,,'The battle to retake Wall Maria begins now! W...,"Cannibalism,, Explicit Violence","Attack on Titan, Attack on Titan: End of the W...","Attack on Titan, Attack on Titan 2nd Season, A...","Armin Arlelt : Marina Inoue, Eren Jaeger : Yuu...","Hajime Isayama : Original Creator, Tetsurou Ar..."
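The preview shows `Content_Warning` entries separated by doubled commas (for example `Emotional Abuse,, Mature Themes`). The following is a minimal sketch of one way to normalize such a column into lists. It assumes that any run of commas is a separator; the helper `split_warnings` and the toy values are hypothetical and are not part of the original notebook.

```python
import re

import pandas as pd


def split_warnings(value):
    """Split a Content_Warning string on runs of commas; return [] for missing values."""
    if pd.isna(value):
        return []
    parts = re.split(r",+", value)
    return [p.strip() for p in parts if p.strip()]


# Toy values modeled on the preview above (not the full dataset).
sample = pd.Series(
    ["Emotional Abuse,, Mature Themes", "Explicit Violence", None],
    name="Content_Warning",
)
warnings_list = sample.apply(split_warnings)
print(warnings_list.tolist())
```

Keeping missing values as empty lists avoids special-casing NaN later, for instance when counting warnings per title.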


In [9]:
print("Shape (rows, columns):", df.shape)
print("\nData types:")
print(df.dtypes)

print("\nFull info:")
df.info()


Shape (rows, columns): (18495, 17)

Data types:
Rank                 int64
Name                object
Japanese_name       object
Type                object
Episodes           float64
Studio              object
Release_season      object
Tags                object
Rating             float64
Release_year       float64
End_year           float64
Description         object
Related_Mange       object
Related_anime       object
Voice_actors        object
staff               object
dtype: object

Full info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18495 entries, 0 to 18494
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Rank             18495 non-null  int64  
 1   Name             18495 non-null  object 
 2   Japanese_name    7938 non-null   object 
 3   Type             18495 non-null  object 
 4   Episodes         9501 non-null   float64
 5   Studio           12018 non-null  object 


In [10]:
df.describe()


Unnamed: 0,Rank,Episodes,Rating,Release_year,End_year
count,18495.0,9501.0,15364.0,18112.0,2854.0
mean,9248.0,20.92085,3.355133,2006.520318,2004.256132
std,5339.19095,37.990858,0.400624,15.189537,13.257484
min,1.0,1.0,0.96,1907.0,1962.0
25%,4624.5,2.0,3.13,2001.0,1996.0
50%,9248.0,12.0,3.36,2012.0,2007.0
75%,13871.5,26.0,3.59,2017.0,2015.0
max,18495.0,800.0,4.6,2023.0,2022.0
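In the summary above, the mean of `End_year` (2004.26) is lower than the mean of `Release_year` (2006.52). That is not an error in itself, because only 2,854 rows have an `End_year`, but a row-level consistency check is cheap. The sketch below uses a toy frame with hypothetical values and flags rows whose end year precedes the release year.

```python
import pandas as pd

# Toy rows; the last one is deliberately inconsistent.
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Release_year": [2009.0, 2021.0, 2015.0],
        "End_year": [2010.0, None, 2012.0],
    }
)

# Comparisons against NaN evaluate to False, so rows without End_year are not flagged.
inconsistent = toy[toy["End_year"] < toy["Release_year"]]
print(inconsistent["Name"].tolist())
```

On the real data the same mask would be applied to `df`; whether flagged rows should be corrected or dropped is a separate decision.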


In [11]:
nulos = df.isna().sum().sort_values(ascending=False)
porcentaje_nulos = (df.isna().mean() * 100).sort_values(ascending=False)

missing_df = pd.DataFrame({
    "nulls": nulos,
    "% nulls": porcentaje_nulos.round(2)
})

missing_df


Unnamed: 0,nulls,% nulls
Content_Warning,16655,90.05
End_year,15641,84.57
Release_season,14379,77.75
Related_Mange,10868,58.76
Japanese_name,10557,57.08
Episodes,8994,48.63
Related_anime,8432,45.59
Studio,6477,35.02
staff,5490,29.68
Voice_actors,3186,17.23
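The missing-value report can also be wrapped in a small reusable function. This is a sketch only: the name `missing_summary` is hypothetical, and it sorts once after building the frame so that both columns stay aligned by construction.

```python
import pandas as pd


def missing_summary(frame):
    """Return null counts and percentages per column, sorted by count (descending)."""
    summary = pd.DataFrame(
        {
            "nulls": frame.isna().sum(),
            "% nulls": (frame.isna().mean() * 100).round(2),
        }
    )
    return summary.sort_values("nulls", ascending=False)


# Toy frame with 3, 1 and 0 missing values per column.
toy = pd.DataFrame(
    {
        "a": [1, None, None, None],
        "b": [1, 2, None, 4],
        "c": [1, 2, 3, 4],
    }
)
summary = missing_summary(toy)
print(summary)
```

Calling the function before and after cleaning gives two directly comparable reports.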


In [12]:

df_clean = df.copy()


In [13]:
# 1) Episodes: many NaN values; we will treat it as a numeric column.
#    Strategy: fill missing episode counts with the median per type (TV, Movie, OVA, etc.)

# Compute the median number of episodes per Type (ignoring NaN)
episodes_median_by_type = df_clean.groupby("Type")["Episodes"].median()

episodes_median_by_type


Type
DVD S     NaN
Movie     NaN
Music     NaN
OVA       1.0
Other     NaN
TV       24.0
TV Sp     NaN
Web       4.0
Name: Episodes, dtype: float64
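The medians above are NaN for `DVD S`, `Movie`, `Music`, `Other` and `TV Sp`, which means no row of those types has an episode count, so a per-type median cannot fill them. One possible way to handle this is a second fallback to the overall median. That fallback is an assumption made for this sketch, not the strategy used in this notebook, and the toy data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Type": ["TV", "TV", "TV", "Movie", "Movie", "Web"],
        "Episodes": [12.0, 24.0, None, None, None, 4.0],
    }
)

median_by_type = toy.groupby("Type")["Episodes"].median()
overall_median = toy["Episodes"].median()

# First try the per-type median, then fall back to the overall median
# for types whose median is itself NaN (here: Movie).
filled = (
    toy["Episodes"]
    .fillna(toy["Type"].map(median_by_type))
    .fillna(overall_median)
)
print(filled.tolist())
```

Whether an overall median is meaningful for a type such as `Movie` depends on the analysis; leaving those values as NaN, as this notebook does, is also a defensible choice.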

In [15]:
# Fill missing Episodes with the median of the row's Type.
# Types whose median is NaN (e.g. Movie) stay NaN.
df_clean["Episodes"] = df_clean["Episodes"].fillna(
    df_clean["Type"].map(episodes_median_by_type)
)

# Convert Episodes to a nullable integer (rounding first)
df_clean["Episodes"] = df_clean["Episodes"].round().astype("Int64")

df_clean[["Type", "Episodes"]].head()


Unnamed: 0,Type,Episodes
0,TV,24
1,TV,13
2,Web,12
3,TV,64
4,TV,10


In [16]:
# Count how many records are lost if rows without Rating or Release_year are dropped
print("Total rows:", len(df_clean))
print("Missing Rating:", df_clean["Rating"].isna().sum())
print("Missing Release_year:", df_clean["Release_year"].isna().sum())

# Simple strategy:
#   - Drop rows without Rating (they cannot be used in a popularity analysis)
#   - Drop rows without Release_year (needed for time-series work)
df_clean = df_clean.dropna(subset=["Rating", "Release_year"]).reset_index(drop=True)

print("Rows after dropping nulls in Rating and Release_year:", len(df_clean))


Total rows: 18495
Missing Rating: 3131
Missing Release_year: 383
Rows after dropping nulls in Rating and Release_year: 15344
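The two missing counts add up to more than the number of rows actually dropped, because some rows lack both fields. The sketch below works through that arithmetic with the counts printed above, then shows the same idea with boolean masks on a toy frame (hypothetical values).

```python
import pandas as pd

# Counts reported in the output above.
total_rows = 18495
missing_rating = 3131
missing_year = 383
rows_after = 15344

dropped = total_rows - rows_after
# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
missing_both = missing_rating + missing_year - dropped
print(dropped, missing_both)

# Same idea on a toy frame, using boolean masks.
toy = pd.DataFrame(
    {
        "Rating": [4.6, None, None, 3.1],
        "Release_year": [2021.0, 2019.0, None, None],
    }
)
no_rating = toy["Rating"].isna()
no_year = toy["Release_year"].isna()
print((no_rating | no_year).sum(), (no_rating & no_year).sum())
```

With these counts, 3,151 rows are dropped and 363 of them are missing both `Rating` and `Release_year`.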


In [17]:
# Make sure the numeric columns have the expected types
numeric_cols = ["Rank", "Episodes", "Rating", "Release_year", "End_year"]

for col in numeric_cols:
    if col in df_clean.columns:
        # End_year can stay as a nullable integer
        if col == "End_year":
            df_clean[col] = df_clean[col].astype("Int64")
        else:
            # Note: this also casts Rank and Episodes (previously Int64) to float64
            df_clean[col] = df_clean[col].astype(float)

df_clean.dtypes


Rank               float64
Name                object
Japanese_name       object
Type                object
Episodes           float64
Studio              object
Release_season      object
Tags                object
Rating             float64
Release_year       float64
End_year             Int64
Description         object
Related_Mange       object
Related_anime       object
Voice_actors        object
staff               object
dtype: object
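As the dtypes above show, `Rank` and `Episodes` end up as `float64`. If whole-number columns should stay integers, one option is to pass an explicit dtype mapping to `astype`. The sketch below uses a toy frame, and the `target_dtypes` mapping is a hypothetical choice rather than the one made in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Episodes": pd.array([24, None, 12], dtype="Int64"),
        "Rating": [4.6, 4.58, 4.57],
        "Release_year": [2021.0, 2009.0, 2019.0],
        "End_year": [None, 2010.0, None],
    }
)

# Hypothetical target dtypes: whole-number columns use integer types
# (nullable Int64 where values may be missing); only Rating stays float.
target_dtypes = {
    "Rank": "int64",
    "Episodes": "Int64",
    "Rating": "float64",
    "Release_year": "Int64",
    "End_year": "Int64",
}
typed = toy.astype(target_dtypes)
print(typed.dtypes)
```

The nullable `Int64` dtype keeps missing values as `<NA>` instead of forcing the whole column to float.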

In [18]:
print("Final shape:", df_clean.shape)

print("\nNulls after cleaning in key columns:")
print(df_clean[["Episodes", "Rating", "Release_year", "End_year"]].isna().sum())

df_clean.head()


Final shape: (15344, 17)

Nulls after cleaning in key columns:
Episodes         6862
Rating              0
Release_year        0
End_year        12753
dtype: int64


Unnamed: 0,Rank,Name,Japanese_name,Type,Episodes,Studio,Release_season,Tags,Rating,Release_year,End_year,Description,Content_Warning,Related_Mange,Related_anime,Voice_actors,staff
0,1.0,Demon Slayer: Kimetsu no Yaiba - Entertainment...,Kimetsu no Yaiba: Yuukaku-hen,TV,24.0,ufotable,Fall,"Action, Adventure, Fantasy, Shounen, Demons, H...",4.6,2021.0,,'Tanjiro and his friends accompany the Hashira...,Explicit Violence,Demon Slayer: Kimetsu no Yaiba,"Demon Slayer: Kimetsu no Yaiba, Demon Slayer: ...","Inosuke Hashibira : Yoshitsugu Matsuoka, Nezuk...","Koyoharu Gotouge : Original Creator, Haruo Sot..."
1,2.0,Fruits Basket the Final Season,Fruits Basket the Final,TV,13.0,TMS Entertainment,Spring,"Drama, Fantasy, Romance, Shoujo, Animal Transf...",4.6,2021.0,,'The final arc of Fruits Basket.',"Emotional Abuse,, Mature Themes,, Physical Abu...","Fruits Basket, Fruits Basket Another","Fruits Basket 1st Season, Fruits Basket 2nd Se...","Akito Sohma : Maaya Sakamoto, Kyo Sohma : Yuum...","Natsuki Takaya : Original Creator, Yoshihide I..."
2,3.0,Mo Dao Zu Shi 3,The Founder of Diabolism 3,Web,12.0,B.C MAY PICTURES,,"Fantasy, Ancient China, Chinese Animation, Cul...",4.58,2021.0,,'The third season of Mo Dao Zu Shi.',,Grandmaster of Demonic Cultivation: Mo Dao Zu ...,"Mo Dao Zu Shi 2, Mo Dao Zu Shi Q","Lan Wangji, Wei Wuxian, Jiang Cheng, Jin Guang...","Mo Xiang Tong Xiu : Original Creator, Xiong Ke..."
3,4.0,Fullmetal Alchemist: Brotherhood,Hagane no Renkinjutsushi: Full Metal Alchemist,TV,64.0,Bones,Spring,"Action, Adventure, Drama, Fantasy, Mystery, Sh...",4.58,2009.0,2010.0,"""The foundation of alchemy is based on the law...","Animal Abuse,, Mature Themes,, Violence,, Dome...","Fullmetal Alchemist, Fullmetal Alchemist (Ligh...","Fullmetal Alchemist: Brotherhood Specials, Ful...","Alphonse Elric : Rie Kugimiya, Edward Elric : ...","Hiromu Arakawa : Original Creator, Yasuhiro Ir..."
4,5.0,Attack on Titan 3rd Season: Part II,Shingeki no Kyojin Season 3: Part II,TV,10.0,WIT Studio,Spring,"Action, Fantasy, Horror, Shounen, Dark Fantasy...",4.57,2019.0,,'The battle to retake Wall Maria begins now! W...,"Cannibalism,, Explicit Violence","Attack on Titan, Attack on Titan: End of the W...","Attack on Titan, Attack on Titan 2nd Season, A...","Armin Arlelt : Marina Inoue, Eren Jaeger : Yuu...","Hajime Isayama : Original Creator, Tetsurou Ar..."


In [19]:
output_path = "../data/Anime_clean.csv"
df_clean.to_csv(output_path, index=False)

output_path


'../data/Anime_clean.csv'
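CSV files do not store dtypes, so a nullable integer column such as `End_year` is read back as `float64` by default. The sketch below illustrates this with an in-memory buffer instead of the real file; the toy frame is hypothetical, and passing `dtype` to `pd.read_csv` is one way to restore the type when later practices load the cleaned file.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["A", "B"],
        "End_year": pd.array([2010, None], dtype="Int64"),
    }
)

# Write to an in-memory buffer rather than ../data/Anime_clean.csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the missing value forces the column to float64.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: the column comes back as nullable Int64.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"End_year": "Int64"})

print(default_read["End_year"].dtype, typed_read["End_year"].dtype)
```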