# Hands-On Session 12: Data Cleaning and Preparation using Pandas

### Topics Covered
- Identifying and handling missing data.
- Data transformation and normalization.
- Data filtering and deduplication.
- Standardization of categorical data.
- Outlier detection and handling.

In [2]:
# Exercise 1: Identifying and Handling Missing Data
import pandas as pd

# Sample dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [24, 30, None, 22, 35],
    'Salary': [48000, None, 57000, None, 60000]
}
df = pd.DataFrame(data)

# Filling missing values and dropping rows
df.fillna({
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median()
}, inplace=True)

df.dropna(subset=['Name'], inplace=True)
print('After cleaning:\n', df)

After cleaning:
       Name    Age   Salary
0    Alice  24.00  48000.0
1      Bob  30.00  57000.0
2  Charlie  27.75  57000.0
3    David  22.00  57000.0
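
Exercise 1 fills and drops values straight away. Before doing that, it usually helps to see where the gaps are. The following is a minimal sketch on the same sample data; the name `missing_report` is only illustrative and not part of the exercise.

```python
import pandas as pd

# Same sample data as Exercise 1
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [24, 30, None, 22, 35],
    'Salary': [48000, None, 57000, None, 60000]
}
df = pd.DataFrame(data)

# Count of missing values and share of missing values per column
missing_report = pd.DataFrame({
    'missing': df.isna().sum(),
    'share': df.isna().mean()
})
print(missing_report)
```

A column with a large share of missing values may be better dropped than filled; that choice depends on the analysis and is left open here.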


In [3]:
# Exercise 2: Standardizing Categorical Data
# Sample dataset with inconsistent categorical values
data = {
    'Product': ['Laptop', 'Laptop', 'Desktop', 'Tablet', 'Tablet'],
    'Category': ['Electronics', 'electronics', 'Electronics', 'Gadgets', 'gadgets']
}
df = pd.DataFrame(data)

# Standardize category values
df['Category'] = df['Category'].str.capitalize()
print('Standardized Data:\n', df)

Standardized Data:
    Product     Category
0   Laptop  Electronics
1   Laptop  Electronics
2  Desktop  Electronics
3   Tablet      Gadgets
4   Tablet      Gadgets
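
Standardizing labels also matters for deduplication: rows that differ only in capitalization are not treated as duplicates until the labels agree. A short sketch on the same sample data (the variable names `rows_before` and `deduped` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Laptop', 'Desktop', 'Tablet', 'Tablet'],
    'Category': ['Electronics', 'electronics', 'Electronics', 'Gadgets', 'gadgets']
})

# Before standardizing, every row looks unique because of the spelling variants
rows_before = len(df.drop_duplicates())

# Trim whitespace and unify capitalization, then remove exact duplicate rows
df['Category'] = df['Category'].str.strip().str.capitalize()
deduped = df.drop_duplicates().reset_index(drop=True)

print(rows_before, len(deduped))
```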


### Practice Tasks

- Load a dataset of your choice and identify missing values.

In [2]:
# Load a dataset of your choice and identify missing values
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 1. Load dataset
df = pd.read_csv('/home/wann/Downloads/Titanic - Titanic.csv')

# 2. Identify missing values
print("Missing values per column:")
print(df.isna().sum())

Missing values per column:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


- Implement data transformations to normalize numerical columns.

In [3]:
# Implement data transformations to normalize numerical columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Fill missing numeric values with the column mean
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())

# Normalize with MinMaxScaler (values in the range 0 to 1)
scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print("\nNormalized Numerical Columns (first 5 rows):\n", df[num_cols].head())


Normalized Numerical Columns (first 5 rows):
    PassengerId  Survived  Pclass       Age  SibSp  Parch      Fare
0     0.000000       0.0     1.0  0.271174  0.125    0.0  0.014151
1     0.001124       1.0     0.0  0.472229  0.125    0.0  0.139136
2     0.002247       1.0     1.0  0.321438  0.000    0.0  0.015469
3     0.003371       1.0     0.0  0.434531  0.125    0.0  0.103644
4     0.004494       0.0     1.0  0.434531  0.000    0.0  0.015713
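
The cell above scales every numeric column, including the identifier `PassengerId` and the label `Survived`. In practice one would often scale only the continuous features. The sketch below shows the two usual formulas in plain pandas on a small hypothetical frame (the values are invented, not taken from the Titanic file, and `feature_cols` is an assumed selection).

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Age': [20.0, 30.0, 40.0, 60.0],
    'Fare': [10.0, 10.0, 30.0, 50.0],
})

feature_cols = ['Age', 'Fare']  # assumed feature columns; the identifier is left untouched

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
minmax = toy.copy()
col_min = toy[feature_cols].min()
col_range = toy[feature_cols].max() - col_min
minmax[feature_cols] = (toy[feature_cols] - col_min) / col_range

# Z-score standardization: (x - mean) / std, using the population std (ddof=0)
zscore = toy.copy()
col_mean = toy[feature_cols].mean()
col_std = toy[feature_cols].std(ddof=0)
zscore[feature_cols] = (toy[feature_cols] - col_mean) / col_std

print(minmax)
print(zscore)
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values; z-scores are a common alternative when the range is not meaningful.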


- Standardize categorical columns and remove duplicates.

In [8]:
# Standardize categorical columns and remove duplicates

# Standardize categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:", categorical_cols)

# Simple example: trim whitespace, then capitalize only the first letter.
# str.capitalize() already lowercases the rest of the string, which also
# flattens free-text columns such as Name and Ticket (see the output below).
for col in categorical_cols:
    df[col] = df[col].str.strip().str.capitalize()

# Remove duplicate rows
df = df.drop_duplicates()

print("After standardizing categorical columns:")
print(df[categorical_cols].head())

Categorical columns: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
After standardizing categorical columns:
                                                Name     Sex  \
0                            Braund, mr. owen harris    Male   
1  Cumings, mrs. john bradley (florence briggs th...  Female   
2                             Heikkinen, miss. laina  Female   
3       Futrelle, mrs. jacques heath (lily may peel)  Female   
4                           Allen, mr. william henry    Male   

             Ticket Cabin Embarked  
0         A/5 21171   NaN        S  
1          Pc 17599   C85        C  
2  Ston/o2. 3101282   NaN        S  
3            113803  C123        S  
4            373450   NaN        S  
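
The output above shows a side effect of capitalizing every text column: names become "Braund, mr. owen harris". One way to avoid this, sketched here on hypothetical rows, is to capitalize only the columns that hold real category labels and merely trim the free-text ones. The list `label_cols` is an assumption about which columns are labels.

```python
import pandas as pd

# Hypothetical rows shaped like a few Titanic columns
toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry', 'Moran, Mr. James'],
    'Sex': [' male', 'Female', 'MALE', 'female '],
    'Embarked': ['s', 'C', 'S ', 'c'],
})

label_cols = ['Sex', 'Embarked']  # assumed to be true category labels

# Unify whitespace and capitalization for label columns only
for col in label_cols:
    toy[col] = toy[col].str.strip().str.capitalize()

# Free-text columns only get their surrounding whitespace trimmed
toy['Name'] = toy['Name'].str.strip()

print(toy)
```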


### Homework for Students

- Clean a real-world dataset (from Kaggle or another source), perform normalization, handle outliers, and prepare the data for analysis.

In [11]:
import pandas as pd

df = pd.read_csv('/home/wann/Downloads/Demon Slayer.csv')
df.head()

Unnamed: 0,Character_ID,Name,Alias,Status,Gender,Age,Height,Weight,Race,Affiliation,...,Weapon,Special_Abilities,First_Appearance,Last_Appearance,Allies,Enemies,Personality,Voice_Actor_Japanese,Voice_Actor_English,Role
0,1,Tanjiro Kamado,Child of Brightness,Alive,Male,16,165 cm,61 kg,Human,Demon Slayer Corps,...,Nichirin Sword,"Enhanced smell,Demon Slayer Mark,Adaptive figh...",Manga Chapter 1,,"Nezuko Kamado,Zenitsu Agatsuma,Inosuke Hashibi...","Muzan Kibutsuji,Demons","Kind,Empathetic,Determined",Natsuki Hanae,Zach Aguilar,Protagonist
1,2,Nezuko Kamado,,Alive,Female,14,153 cm,45 kg,Hybrid,Demon Slayer Corps,...,,,Manga Chapter 1,,"Tanjiro Kamado, Zenitsu Agatsuma, Inosuke Hash...","Muzan Kibutsuji, Demons","Protective, Compassionate, Strong-willed",Akari Kito,Abby Trott,Deuteragonist
2,3,Zenitsu Agatsuma,,Alive,Male,16,164 cm,53 kg,Human,Demon Slayer Corps,...,Nichirin Sword,"Super hearing, Enhanced speed and reflexes, Li...",Manga Chapter 1,,"Tanjiro Kamado, Nezuko Kamado, Inosuke Hashibira",Demons,"Cowardly, Timid, Loyal, Compassionate",Hiro Shimono,Aleks Le,Tertiary Protagonist
3,4,Inosuke Hashibira,,Alive,Male,15,165 cm,55 kg,Human,Demon Slayer Corps,...,Dual Nichirin Swords,"Enhanced smell, Superhuman strength and agilit...",Manga Chapter 1,,"Tanjiro Kamado, Nezuko Kamado, Zenitsu Agatsuma",Demons,"Hot-headed, Reckless, Brave, Loyal",Yoshitsugu Matsuoka,Bryce Papenbrook,Tertiary Protagonist
4,5,Kanao Tsuyuri,,Alive,Female,16,158 cm,45 kg,Human,Demon Slayer Corps,...,Nichirin Sword,"Enhanced reflexes, Exceptional swordsmanship, ...",Manga Chapter 23,,"Tanjiro Kamado, Nezuko Kamado, Zenitsu Agatsum...",Demons,"Quiet, Observant, Disciplined, Loyal",Nao Toyama,Brianna Knickerbocker,Supporting Protagonist


In [12]:
import pandas as pd

df = pd.read_csv('/home/wann/Downloads/Demon Slayer.csv')

print("Missing values per column:")
print(df.isna().sum())

Missing values per column:
Character_ID             0
Name                     0
Alias                   59
Status                   0
Gender                   0
Age                     11
Height                   1
Weight                   8
Race                     0
Affiliation             36
Family                  59
Mentor                  74
Rank                    68
Breathing_Style         62
Demon_Art               75
Weapon                  61
Special_Abilities        1
First_Appearance         0
Last_Appearance         16
Allies                  15
Enemies                  9
Personality              0
Voice_Actor_Japanese    25
Voice_Actor_English     25
Role                     2
dtype: int64


In [29]:
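for col in df.columns:
    if df[col].dtype != 'object':   # numeric: fill with the median
        df[col] = df[col].fillna(df[col].median())
    else:                           # categorical: fill with the most frequent value
        df[col] = df[col].fillna(df[col].mode()[0])

    # This print sits inside the loop, so the DataFrame is printed after every
    # column; dedent it to print only once, after all columns are filled.
    print('Values:\n', df)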

Values:
     Character_ID               Name                Alias    Status  Gender  \
0              1     Tanjiro Kamado  Child of Brightness     Alive    Male   
1              2      Nezuko Kamado                  NaN     Alive  Female   
2              3   Zenitsu Agatsuma                  NaN     Alive    Male   
3              4  Inosuke Hashibira                  NaN     Alive    Male   
4              5      Kanao Tsuyuri                  NaN     Alive  Female   
..           ...                ...                  ...       ...     ...   
76            77           Susamaru                  NaN  Deceased  Female   
77            78      Spider Mother                  NaN  Deceased  Female   
78            79      Spider Father                  NaN  Deceased    Male   
79            80      Spider Sister                  NaN  Deceased  Female   
80            81     Spider Brother                  NaN  Deceased    Male   

    Age  Height Weight    Race         Affiliation  ..
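
The same fill logic can be wrapped in a small function that returns a new DataFrame and leaves the input untouched, so the result is printed once. This is a sketch under the same rule as the loop above (median for numeric columns, mode otherwise); the function name `fill_missing` and the sample values are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_missing(frame):
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to estimate from; leave the column as-is
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Hypothetical data for illustration only
toy = pd.DataFrame({
    'Age': [16.0, None, 14.0, 21.0],
    'Status': ['Alive', 'Alive', None, 'Deceased'],
})
filled = fill_missing(toy)
print(filled)
```

Filling with the mode can be misleading for sparse descriptive columns, where a missing value may simply mean "not applicable"; a placeholder such as `'Unknown'` is a possible alternative.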

In [32]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('/home/wann/Downloads/Demon Slayer.csv')

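# Clean numeric fields: keep only the digits, e.g. '165 cm' -> 165.
# Decimal points are dropped too, so this assumes whole-number measurements.
def extract_num(x):
    if isinstance(x, str):
        return pd.to_numeric(''.join([c for c in x if c.isdigit()]), errors='coerce')
    return x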

df['Height_cm'] = df['Height'].apply(extract_num)
df['Weight_kg'] = df['Weight'].apply(extract_num)

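# Drop the original text columns now that numeric versions exist
df_clean = df.drop(columns=['Height','Weight'])

# Select the numeric columns to scale
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns

# Min-max normalization
scaler = MinMaxScaler()
df_clean_norm = df_clean.copy()
df_clean_norm[numeric_cols] = scaler.fit_transform(df_clean[numeric_cols])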

df_clean_norm.head()

Unnamed: 0,Character_ID,Name,Alias,Status,Gender,Age,Race,Affiliation,Family,Mentor,...,First_Appearance,Last_Appearance,Allies,Enemies,Personality,Voice_Actor_Japanese,Voice_Actor_English,Role,Height_cm,Weight_kg
0,0.0,Tanjiro Kamado,Child of Brightness,Alive,Male,16,Human,Demon Slayer Corps,"Nezuko Kamado,Kie Kamado,Tanjuro Kamado,Hanako...",Sakonji Urokodaki,...,Manga Chapter 1,,"Nezuko Kamado,Zenitsu Agatsuma,Inosuke Hashibi...","Muzan Kibutsuji,Demons","Kind,Empathetic,Determined",Natsuki Hanae,Zach Aguilar,Protagonist,0.31746,0.190909
1,0.0125,Nezuko Kamado,,Alive,Female,14,Hybrid,Demon Slayer Corps,"Tanjiro Kamado (brother), Kie Kamado (mother),...",,...,Manga Chapter 1,,"Tanjiro Kamado, Zenitsu Agatsuma, Inosuke Hash...","Muzan Kibutsuji, Demons","Protective, Compassionate, Strong-willed",Akari Kito,Abby Trott,Deuteragonist,0.126984,0.045455
2,0.025,Zenitsu Agatsuma,,Alive,Male,16,Human,Demon Slayer Corps,,,...,Manga Chapter 1,,"Tanjiro Kamado, Nezuko Kamado, Inosuke Hashibira",Demons,"Cowardly, Timid, Loyal, Compassionate",Hiro Shimono,Aleks Le,Tertiary Protagonist,0.301587,0.118182
3,0.0375,Inosuke Hashibira,,Alive,Male,15,Human,Demon Slayer Corps,,,...,Manga Chapter 1,,"Tanjiro Kamado, Nezuko Kamado, Zenitsu Agatsuma",Demons,"Hot-headed, Reckless, Brave, Loyal",Yoshitsugu Matsuoka,Bryce Papenbrook,Tertiary Protagonist,0.31746,0.136364
4,0.05,Kanao Tsuyuri,,Alive,Female,16,Human,Demon Slayer Corps,,Shinobu Kocho,...,Manga Chapter 23,,"Tanjiro Kamado, Nezuko Kamado, Zenitsu Agatsum...",Demons,"Quiet, Observant, Disciplined, Loyal",Nao Toyama,Brianna Knickerbocker,Supporting Protagonist,0.206349,0.045455
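
The homework also asks for outlier handling, which the cells above do not cover. A common approach is the interquartile range (IQR) rule: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged. The sketch below uses invented heights, not values from the dataset, and shows both dropping and clipping the flagged values.

```python
import pandas as pd

# Hypothetical heights in cm; 300 is a planted outlier
heights = pd.Series([150, 155, 160, 165, 170, 175, 300], name='Height_cm')

# Quartiles and IQR fences
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (heights < lower) | (heights > upper)

# Option 1: drop the flagged rows
kept = heights[~is_outlier]

# Option 2: keep every row but clip values to the fences
clipped = heights.clip(lower=lower, upper=upper)

print(heights[is_outlier])
```

Whether to drop or clip depends on the data: an implausible measurement is usually removed, while a real but extreme value may be worth keeping in clipped form.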
