# Practicum Assignment: Wisconsin Breast Cancer Dataset

The dataset contains 569 records used to diagnose the type of cancer:
- **M** = Malignant
- **B** = Benign

## Tasks:
1. Separate the variables that can be used from those that cannot
2. Encode the "diagnosis" column
3. Standardize the numeric columns
4. Perform a stratified data split (80:20)

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('wbc.csv')

print("=" * 60)
print("WISCONSIN BREAST CANCER DATA EXPLORATION")
print("=" * 60)
print(f"\nShape: {df.shape}")
print("\nDataset columns:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nFirst 5 rows:")
df.head()

WISCONSIN BREAST CANCER DATA EXPLORATION

Shape: (569, 33)

Dataset columns:
['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']

Data types:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave point

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
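
The `Unnamed: 32` column in the table above is empty in every row shown. One common cause of such a column is a trailing comma at the end of each CSV line; whether `wbc.csv` is formatted that way is an assumption, not something verified here. A minimal sketch on a hypothetical in-memory CSV shows how pandas treats that case and how the empty column can be detected:

```python
import io
import pandas as pd

# Hypothetical three-column CSV in which every line ends with a trailing comma.
# This is illustrative data, not an excerpt of wbc.csv.
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"
demo = pd.read_csv(io.StringIO(csv_text))

# The empty header field after the last comma becomes an auto-named column.
print(demo.columns.tolist())

# Columns that contain no values at all
empty_cols = [col for col in demo.columns if demo[col].isna().all()]
print(empty_cols)
```

A column found this way carries no information, which is why the next task excludes it.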


---
## Task 1: Separate Usable and Unusable Variables

In [2]:
# ============================================================
# TASK 1: SEPARATE USABLE AND UNUSABLE VARIABLES
# ============================================================

# Variables that CANNOT be used for analysis:
# - 'id': identifier only, carries no predictive information
# - Columns that are entirely NaN or named 'Unnamed...' (if any)

unused_columns = ['id']
# Keep the reason for each exclusion so the report below is accurate per column
unused_reasons = {'id': 'identifier only, not informative for prediction'}

# Check for empty/unnamed columns
for col in df.columns:
    if col not in unused_columns and ('Unnamed' in col or df[col].isna().all()):
        unused_columns.append(col)
        unused_reasons[col] = 'unnamed or entirely empty column'

# Variables that CAN be used:
# - 'diagnosis': target variable (M/B)
# - All other numeric columns: features for prediction (30 features)

usable_columns = [col for col in df.columns if col not in unused_columns]

print("=" * 60)
print("TASK 1: VARIABLE SEPARATION")
print("=" * 60)

print("\n🚫 VARIABLES THAT CANNOT BE USED:")
print("-" * 40)
for col in unused_columns:
    print(f"  - '{col}': {unused_reasons[col]}")

print("\n✅ VARIABLES THAT CAN BE USED:")
print("-" * 40)
print("  - 'diagnosis': TARGET variable (M=Malignant, B=Benign)")
print(f"  - {len(usable_columns) - 1} numeric features for prediction")

print("\nList of usable columns (target + features):")
for i, col in enumerate(usable_columns, 1):
    print(f"  {i:2d}. {col}")

# Build a clean dataframe (without the unused columns)
df_clean = df[usable_columns].copy()
print(f"\nClean dataframe shape: {df_clean.shape}")

TASK 1: VARIABLE SEPARATION

🚫 VARIABLES THAT CANNOT BE USED:
----------------------------------------
  - 'id': identifier only, not informative for prediction
  - 'Unnamed: 32': unnamed or entirely empty column

✅ VARIABLES THAT CAN BE USED:
----------------------------------------
  - 'diagnosis': TARGET variable (M=Malignant, B=Benign)
  - 30 numeric features for prediction

List of usable columns (target + features):
   1. diagnosis
   2. radius_mean
   3. texture_mean
   4. perimeter_mean
   5. area_mean
   6. smoothness_mean
   7. compactness_mean
   8. concavity_mean
   9. concave points_mean
  10. symmetry_mean
  11. fractal_dimension_mean
  12. radius_se
  13. texture_se
  14. perimeter_se
  15. area_se
  16. smoothness_se
  17. compactness_se
  18. concavity_se
  19. concave points_se
  20. symmetry_se
  21. fractal_dimension_se
  22. radius_worst
  23. texture_worst
  24. perimeter_worst
  25. area_worst
  26. smoothness_worst
  27. compactness_worst
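
The same separation can also be written with pandas' built-in column operations instead of an explicit loop. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, chosen only to mirror the column roles above) and is one possible formulation, not the approach this notebook uses:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as the dataset
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
})

# Drop columns that are entirely NaN, then drop the identifier column
toy_clean = toy.dropna(axis=1, how="all").drop(columns=["id"])

# Separate the target from the features
target = toy_clean["diagnosis"]
features = toy_clean.drop(columns=["diagnosis"])

print(toy_clean.columns.tolist())
print(features.columns.tolist())
```

Note that `dropna(axis=1, how="all")` removes a column only when every value is missing, so partially filled columns are kept.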
  

---
## Task 2: Encoding the "diagnosis" Column

In [3]:
# ============================================================
# TASK 2: ENCODING THE "DIAGNOSIS" COLUMN
# ============================================================
# The diagnosis column holds categorical values: 'M' (Malignant) and 'B' (Benign)
# We convert them to numeric values using LabelEncoder
# LabelEncoder sorts the classes, so 'B' becomes 0 and 'M' becomes 1

encoder = LabelEncoder()
df_clean['diagnosis_encoded'] = encoder.fit_transform(df_clean['diagnosis'])

print("=" * 60)
print("TASK 2: ENCODING THE 'DIAGNOSIS' COLUMN")
print("=" * 60)

print(f"\nOriginal values: {df_clean['diagnosis'].unique()}")
print(f"Values after encoding: {df_clean['diagnosis_encoded'].unique()}")

print("\n📋 Encoding Mapping:")
print("-" * 40)
for i, label in enumerate(encoder.classes_):
    print(f"  '{label}' ({'Benign' if label == 'B' else 'Malignant'}) -> {i}")

print("\n📊 Diagnosis Distribution:")
print("-" * 40)
print(df_clean['diagnosis'].value_counts())
print(f"\nTotal records: {len(df_clean)}")

print("\n📝 Sample Encoding Results:")
print("-" * 40)
df_clean[['diagnosis', 'diagnosis_encoded']].head(10)

TASK 2: ENCODING THE 'DIAGNOSIS' COLUMN

Original values: ['M' 'B']
Values after encoding: [1 0]

📋 Encoding Mapping:
----------------------------------------
  'B' (Benign) -> 0
  'M' (Malignant) -> 1

📊 Diagnosis Distribution:
----------------------------------------
diagnosis
B    357
M    212
Name: count, dtype: int64

Total records: 569

📝 Sample Encoding Results:
----------------------------------------


Unnamed: 0,diagnosis,diagnosis_encoded
0,M,1
1,M,1
2,M,1
3,M,1
4,M,1
5,M,1
6,M,1
7,M,1
8,M,1
9,M,1
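
Because `LabelEncoder` assigns codes in sorted class order, the B -> 0, M -> 1 mapping above follows from the alphabet rather than from an explicit choice. The sketch below, on a short made-up label series, shows how the mapping can be read back from the encoder, how an explicit dictionary with `Series.map` gives the same codes, and how `inverse_transform` recovers the original labels. The names `labels` and `encoder_demo` are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not rows from the dataset
labels = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

encoder_demo = LabelEncoder()
encoded = encoder_demo.fit_transform(labels)

# Read the mapping back from the fitted encoder (classes_ is sorted)
mapping = {str(label): int(code) for code, label in enumerate(encoder_demo.classes_)}
print(mapping)

# An explicit dictionary makes the intended coding visible in the code itself
explicit = labels.map({"B": 0, "M": 1})

# Decode the integers back to the original labels
decoded = encoder_demo.inverse_transform(encoded)
print(list(decoded))
```

The explicit dictionary is a reasonable choice when the positive class should be fixed regardless of how the labels happen to sort.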


---
## Task 3: Standardizing the Numeric Columns

In [4]:
# ============================================================
# TASK 3: STANDARDIZING THE NUMERIC COLUMNS
# ============================================================
# Standardization with StandardScaler: z = (x - mean) / std
# Result: mean ≈ 0, std ≈ 1
# Note: StandardScaler divides by the population std (ddof=0), while
# describe() reports the sample std (ddof=1), so the std shown after
# scaling is slightly above 1

scaler = StandardScaler()

# Select the numeric columns ('diagnosis' is non-numeric, so it is skipped)
numeric_cols = df_clean.select_dtypes(include='number').columns.tolist()
# Remove columns that must not be standardized (the encoded target)
cols_to_exclude = ['diagnosis_encoded']
numeric_cols = [col for col in numeric_cols if col not in cols_to_exclude]

print("=" * 60)
print("TASK 3: STANDARDIZING THE NUMERIC COLUMNS")
print("=" * 60)
print(f"\nNumber of columns to standardize: {len(numeric_cols)}")

print("\n📋 Numeric Columns:")
print("-" * 40)
for i, col in enumerate(numeric_cols, 1):
    print(f"  {i:2d}. {col}")

# Build the standardized dataframe
df_standardized = df_clean.copy()
df_standardized[numeric_cols] = scaler.fit_transform(df_clean[numeric_cols])

# Show statistics before and after standardization
print("\n📊 COMPARISON BEFORE & AFTER STANDARDIZATION:")
print("=" * 60)
sample_cols = numeric_cols[:5]  # Use the first 5 columns as an example

print("\n🔹 Before Standardization (first 5 columns):")
print(df_clean[sample_cols].describe().loc[['mean', 'std', 'min', 'max']].round(4))

print("\n🔹 After Standardization (first 5 columns):")
print(df_standardized[sample_cols].describe().loc[['mean', 'std', 'min', 'max']].round(4))

print("\n✅ Standardization complete! Mean ≈ 0, Std ≈ 1")

TASK 3: STANDARDIZING THE NUMERIC COLUMNS

Number of columns to standardize: 30

📋 Numeric Columns:
----------------------------------------
   1. radius_mean
   2. texture_mean
   3. perimeter_mean
   4. area_mean
   5. smoothness_mean
   6. compactness_mean
   7. concavity_mean
   8. concave points_mean
   9. symmetry_mean
  10. fractal_dimension_mean
  11. radius_se
  12. texture_se
  13. perimeter_se
  14. area_se
  15. smoothness_se
  16. compactness_se
  17. concavity_se
  18. concave points_se
  19. symmetry_se
  20. fractal_dimension_se
  21. radius_worst
  22. texture_worst
  23. perimeter_worst
  24. area_worst
  25. smoothness_worst
  26. compactness_worst
  27. concavity_worst
  28. concave points_worst
  29. symmetry_worst
  30. fractal_dimension_worst

📊 COMPARISON BEFORE & AFTER STANDARDIZATION:

🔹 Before Standardization (first 5 columns):
      radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean
mean      14.1273       19.2896         
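
The formula z = (x - mean) / std can be checked by hand. The sketch below uses a made-up five-value column (`radius_demo` is hypothetical) and compares a manual computation with `StandardScaler`. It also illustrates a detail that affects how the comparison table reads: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `describe()` and `std()` default to the sample standard deviation (`ddof=1`), so the reported std after scaling is sqrt(n / (n - 1)) rather than exactly 1. For 569 rows that factor is about 1.0009.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, not taken from the dataset
values = pd.DataFrame({"radius_demo": [11.0, 13.0, 14.0, 18.0, 20.0]})

# Manual z-score with the population standard deviation (ddof=0)
manual = (values - values.mean()) / values.std(ddof=0)

# StandardScaler applies the same formula column by column
scaled = StandardScaler().fit_transform(values)
print(np.allclose(manual.to_numpy(), scaled))

# pandas' default sample std (ddof=1) of the scaled column is sqrt(n / (n - 1))
n = len(values)
sample_std_after = manual["radius_demo"].std(ddof=1)
print(sample_std_after, np.sqrt(n / (n - 1)))
```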

---
## Task 4: Stratified Data Split (80:20)

In [5]:
# ============================================================
# TASK 4: STRATIFIED DATA SPLIT (80:20)
# ============================================================
# A stratified split keeps the target class proportions approximately
# the same in the train and test sets

# Prepare the features (X) and the target (y)
X = df_standardized[numeric_cols]  # Standardized features
y = df_standardized['diagnosis_encoded']  # Encoded target

# Perform the stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for test, 80% for train
    stratify=y,         # Stratify on y
    random_state=42     # For reproducibility
)

print("=" * 60)
print("TASK 4: STRATIFIED DATA SPLIT (80:20)")
print("=" * 60)

print("\n📊 SPLIT RESULTS:")
print("-" * 40)
print(f"  Total data          : {len(X)} samples")
print(f"  Training data (80%) : {len(X_train)} samples")
print(f"  Test data (20%)     : {len(X_test)} samples")

print("\n📋 CLASS DISTRIBUTION (Stratified):")
print("-" * 40)

# Compute the class distribution
print("\n🔹 Distribution in the ORIGINAL data:")
print(f"  - Benign (0)    : {(y == 0).sum()} ({(y == 0).sum()/len(y)*100:.1f}%)")
print(f"  - Malignant (1) : {(y == 1).sum()} ({(y == 1).sum()/len(y)*100:.1f}%)")

print("\n🔹 Distribution in the TRAINING data:")
print(f"  - Benign (0)    : {(y_train == 0).sum()} ({(y_train == 0).sum()/len(y_train)*100:.1f}%)")
print(f"  - Malignant (1) : {(y_train == 1).sum()} ({(y_train == 1).sum()/len(y_train)*100:.1f}%)")

print("\n🔹 Distribution in the TEST data:")
print(f"  - Benign (0)    : {(y_test == 0).sum()} ({(y_test == 0).sum()/len(y_test)*100:.1f}%)")
print(f"  - Malignant (1) : {(y_test == 1).sum()} ({(y_test == 1).sum()/len(y_test)*100:.1f}%)")

print("\n✅ Stratified split complete!")
print("   Class proportions are approximately preserved in the train and test sets.")

TASK 4: STRATIFIED DATA SPLIT (80:20)

📊 SPLIT RESULTS:
----------------------------------------
  Total data          : 569 samples
  Training data (80%) : 455 samples
  Test data (20%)     : 114 samples

📋 CLASS DISTRIBUTION (Stratified):
----------------------------------------

🔹 Distribution in the ORIGINAL data:
  - Benign (0)    : 357 (62.7%)
  - Malignant (1) : 212 (37.3%)

🔹 Distribution in the TRAINING data:
  - Benign (0)    : 285 (62.6%)
  - Malignant (1) : 170 (37.4%)

🔹 Distribution in the TEST data:
  - Benign (0)    : 72 (63.2%)
  - Malignant (1) : 42 (36.8%)

✅ Stratified split complete!
   Class proportions are approximately preserved in the train and test sets.
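
The cells above follow the task order and standardize all 569 rows before splitting, which means the test rows contribute to the mean and standard deviation used for scaling. A commonly used alternative, when the test set is meant to estimate performance on unseen data, is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic stand-in data (random features with the same 357/212 class counts); `X_demo`, `y_demo`, and `scaler_demo` are hypothetical names, and this is not the procedure the assignment asks for:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 569 rows, 3 random features, class counts 357 / 212
X_demo = rng.normal(loc=10.0, scale=3.0, size=(569, 3))
y_demo = np.array([0] * 357 + [1] * 212)

# 1) Split first, stratifying on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2) Fit the scaler on the training rows only
scaler_demo = StandardScaler().fit(X_tr)

# 3) Apply the training mean/std to both sets
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # share of class 1 per set
```

With this ordering the training features have mean 0 exactly (up to floating-point error), while the test features are only close to 0, which is expected because their statistics were not used for fitting.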


In [6]:
# ============================================================
# PREPROCESSING SUMMARY
# ============================================================

print("=" * 60)
print("📋 PREPROCESSING SUMMARY")
print("=" * 60)

print(f"""
1️⃣  VARIABLE SEPARATION:
    - Unused variables: {unused_columns}
    - Used variables: {len(usable_columns)} columns (1 target + {len(numeric_cols)} features)

2️⃣  ENCODING:
    - The 'diagnosis' column is encoded with LabelEncoder
    - B (Benign) -> 0
    - M (Malignant) -> 1

3️⃣  STANDARDIZATION:
    - {len(numeric_cols)} numeric columns standardized
    - Using StandardScaler (z-score standardization)
    - Result: mean ≈ 0, std ≈ 1

4️⃣  STRATIFIED SPLIT:
    - Ratio: 80% train, 20% test
    - X_train shape: {X_train.shape}
    - X_test shape: {X_test.shape}
    - y_train shape: {y_train.shape}
    - y_test shape: {y_test.shape}
    - Class proportions preserved in both sets

✅ The data is ready to be used in a machine learning model!
""")

# Show a sample of the final data
print("=" * 60)
print("📊 FINAL DATA SAMPLE (X_train - first 5 rows):")
print("=" * 60)
X_train.head()

📋 PREPROCESSING SUMMARY

1️⃣  VARIABLE SEPARATION:
    - Unused variables: ['id', 'Unnamed: 32']
    - Used variables: 31 columns (1 target + 30 features)

2️⃣  ENCODING:
    - The 'diagnosis' column is encoded with LabelEncoder
    - B (Benign) -> 0
    - M (Malignant) -> 1

3️⃣  STANDARDIZATION:
    - 30 numeric columns standardized
    - Using StandardScaler (z-score standardization)
    - Result: mean ≈ 0, std ≈ 1

4️⃣  STRATIFIED SPLIT:
    - Ratio: 80% train, 20% test
    - X_train shape: (455, 30)
    - X_test shape: (114, 30)
    - y_train shape: (455,)
    - y_test shape: (114,)
    - Class proportions preserved in both sets

✅ The data is ready to be used in a machine learning model!

📊 FINAL DATA SAMPLE (X_train - first 5 rows):


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
10,0.537556,0.919273,0.442011,0.406453,-1.017686,-0.713542,-0.700684,-0.404686,-1.035476,-0.826124,...,0.604849,1.335771,0.492622,0.473611,-0.625477,-0.630828,-0.605872,-0.22621,0.076431,0.031819
170,-0.513297,-1.605595,-0.540376,-0.542624,0.458285,-0.654413,-0.614306,-0.307442,0.538081,-0.460382,...,-0.573451,-1.634499,-0.604391,-0.582718,0.268776,-0.812128,-0.709978,-0.315133,-0.119321,-0.899721
407,-0.362769,0.484112,-0.384677,-0.399281,-1.483819,-0.401411,-0.345755,-0.780246,-0.845627,-0.234983,...,-0.387077,0.217034,-0.465589,-0.412728,-1.681045,-0.385914,-0.424046,-0.892221,-0.667748,-0.134983
430,0.21946,0.754052,0.417297,0.085638,0.221305,2.239288,2.316401,1.243034,0.837458,0.876418,...,0.016734,0.308227,0.540279,-0.084174,0.417818,2.89275,3.021056,2.02352,-0.056227,1.748601
27,1.273153,0.22348,1.241101,1.248876,-0.139504,0.042812,0.755818,0.732313,-0.418466,-0.823289,...,1.043864,0.257745,0.972174,0.918363,0.062747,-0.270773,0.347396,0.5237,-0.905562,-0.539518
