# Data Preprocessing Steps

This notebook walks through the preprocessing steps that prepare the dataset for analysis or modeling. Each step lists its expected input and output.

In [1]:
import pandas as pd
import numpy as np
import re

## Importing Libraries
**Input:** None; this step only imports the required libraries.

**Output:** `pandas`, `numpy`, and `re` are available for data manipulation and text processing.

In [88]:
df = pd.read_csv("../output/bajuWanita_enriched.csv")
print(df.shape)
df.head()

(2136, 20)


,id,name,url,category_breadcrumb,price_number,price_original,discountPercentage,mediaURL_image,ratingAverage,shop_id,shop_name,shop_url,shop_city,shop_tier,countSold,isTopAds,labelGroups,label_titles,totalRating,countReview
0,100823236046,LBITZ- Arika Blouse Stripe Atasan Wanita Peplu...,https://www.tokopedia.com/lbitzofficial/lbitz-...,fashion-wanita/atasan-wanita/blouse-wanita,59999,Rp165.000,0,https://p16-images-sign-sg.tokopedia-static.ne...,4.8,7494900101784111742,Lbitzofficial,https://www.tokopedia.com/lbitzofficial,Pekalongan,3,7000.0,False,"[{""position"": ""ri_product_credibility"", ""title...",7rb+ terjual | PreOrder | Bisa COD | 64% | Rp5...,1153.0,454.0
1,102196209582,Abelia Blouse - Blouse Korea Atasan Baju Oversize,https://www.tokopedia.com/raagm/abelia-blouse-...,fashion-wanita/atasan-wanita/blouse-wanita,52900,Rp99.500,0,https://p16-images-sign-sg.tokopedia-static.ne...,4.7,7496146497900284331,raagm,https://www.tokopedia.com/raagm,Kab. Bandung,3,10000.0,False,"[{""position"": ""ri_product_credibility"", ""title...",10rb+ terjual | Hemat s.d 1% Pakai Bonus | 47...,911.0,281.0
2,100530142425,AIJO| (READY) Violet Longsleeve - Oneck Lenga...,https://www.tokopedia.com/aijostoreid-262/aijo...,fashion-wanita/atasan-wanita/blouse-wanita,59000,Rp169.000,0,https://p16-images-sign-sg.tokopedia-static.ne...,4.8,7494083746980398941,AIJOstoreid_NEW,https://www.tokopedia.com/aijostoreid-262,Kab. Tangerang,3,100000.0,False,"[{""position"": ""ri_product_credibility"", ""title...",100rb+ terjual | Hemat s.d 1% Pakai Bonus | 6...,16264.0,3587.0
3,8109591304,CoreNation Active Thea Jacket Woman,https://www.tokopedia.com/corenationactive/cor...,olahraga/pakaian-olahraga-wanita/jaket-windbre...,274750,Rp549.500,0,https://p16-images-sign-sg.tokopedia-static.ne...,5.0,3599573,CoreNation Active,https://www.tokopedia.com/corenationactive,Surabaya,2,4.0,False,"[{""position"": ""ri_product_credibility"", ""title...",4 terjual | Beli Lokal | Hemat s.d 1% Pakai B...,1.0,0.0
4,16650965846,Queenbeer - Tracktop Jacket Phillo Green,https://www.tokopedia.com/queenbeer/queenbeer-...,fashion-pria/outerwear-pria/jaket-pria,465000,,0,https://p16-images-sign-sg.tokopedia-static.ne...,0.0,2713357,QUEENBEER Official Shop,https://www.tokopedia.com/queenbeer,Bekasi,3,1.0,False,"[{""position"": ""ri_product_credibility"", ""title...",1 terjual | Hemat s.d 1% Pakai Bonus | Rp465....,0.0,0.0


## Reading the Dataset
**Input:** The CSV file `../output/bajuWanita_enriched.csv`, which holds the raw data.

**Output:** A DataFrame `df` loaded from the CSV file (2,136 rows, 20 columns), ready for further processing.

In [89]:
# Check for duplicates based on the 'id' column
duplicatedDataById = df['id'].duplicated().sum()
print("Duplicated rows based on 'id':", duplicatedDataById)

Duplicated rows based on 'id': 0


## Checking for and Removing Duplicates
**Input:** DataFrame `df` with an `id` column.

**Output:** The number of duplicate rows based on `id`, and a DataFrame `df` with those duplicates removed.

In [59]:
# Drop duplicates, if any
df = df.drop_duplicates(subset=["id"])
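
The duplicate check and removal above can be tried on a small frame first. The following is a minimal sketch; the `toy` data is hypothetical and only illustrates how `duplicated()` and `drop_duplicates()` treat repeated `id` values.

```python
import pandas as pd

# Toy data (hypothetical): id 2 appears twice
toy = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b-copy", "c"]})

# duplicated() marks every repeat after the first occurrence
n_duplicates = toy["id"].duplicated().sum()

# drop_duplicates keeps the first row for each id by default
deduped = toy.drop_duplicates(subset=["id"])

print(n_duplicates)
print(deduped)
```

Because the first occurrence is kept, the row order of the source file decides which of the duplicated rows survives.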

In [90]:
# Helper for cleaning text
def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower()                        # Lowercase everything
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # Replace anything other than a-z, 0-9, and whitespace with a space
    text = re.sub(r"\s+", " ", text).strip()        # Collapse runs of whitespace into one space and trim both ends
    return text

## Cleaning Text
**Input:** Text columns such as `name`, `category_breadcrumb`, `shop_city`, and `labelGroups`.

**Output:** New columns `name_clean`, `category_clean`, `city_clean`, and `labelGroups_clean`, lowercased and stripped of every character other than letters, digits, and single spaces.
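
To make the cleaning rules concrete, here is a small sketch that restates the same three steps (lowercase, replace non-alphanumerics, collapse whitespace) and applies them to a few sample strings. The sample strings are illustrative, loosely modeled on values in the dataset preview.

```python
import re

import pandas as pd


def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


samples = ["LBITZ- Arika Blouse", "Kab. Bandung", "fashion-wanita/atasan-wanita", None]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

Note that punctuation becomes a space rather than being dropped, so `fashion-wanita` turns into two separate tokens instead of one fused word.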

In [91]:
# Clean the text columns
df["name_clean"] = df["name"].apply(clean_text)
df["category_clean"] = df["category_breadcrumb"].apply(clean_text)
df["city_clean"] = df["shop_city"].apply(clean_text)
df["labelGroups_clean"] = df["labelGroups"].apply(clean_text)

In [92]:
df[["name_clean", "category_clean", "city_clean", "labelGroups_clean"]].head()

,name_clean,category_clean,city_clean,labelGroups_clean
0,lbitz arika blouse stripe atasan wanita peplum...,fashion wanita atasan wanita blouse wanita,pekalongan,position ri product credibility title 7rb terj...
1,abelia blouse blouse korea atasan baju oversize,fashion wanita atasan wanita blouse wanita,kab bandung,position ri product credibility title 10rb ter...
2,aijo ready violet longsleeve oneck lengan panj...,fashion wanita atasan wanita blouse wanita,kab tangerang,position ri product credibility title 100rb te...
3,corenation active thea jacket woman,olahraga pakaian olahraga wanita jaket windbre...,surabaya,position ri product credibility title 4 terjua...
4,queenbeer tracktop jacket phillo green,fashion pria outerwear pria jaket pria,bekasi,position ri product credibility title 1 terjua...


In [95]:
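# Normalize the price_original column: keep digits only, treat missing or empty values as 0
df["price_original"] = (
    df["price_original"]
        .astype("string")
        .str.replace(r"[^0-9]", "", regex=True)   # "Rp165.000" -> "165000"
        .fillna("0")                              # missing price -> "0"
        .replace("", "0")                         # no digits left -> "0"
        .astype("Int64")
)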

## Normalizing Prices
**Input:** The `price_original` column, which holds prices as formatted strings such as `Rp165.000`.

**Output:** `price_original` as a numeric (`Int64`) column, with missing prices set to 0, and a `discountPercentage` column holding the discount as a whole-number percentage.
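
The string-to-number conversion can be checked in isolation. The sketch below applies the same chain of steps to a hypothetical `raw` Series containing a formatted price, a missing value, and an empty string.

```python
import pandas as pd

# Hypothetical raw values: formatted prices, a missing value, and an empty string
raw = pd.Series(["Rp165.000", None, "Rp99.500", ""])

# Strip every non-digit character, then map missing/empty entries to "0"
digits = raw.astype("string").str.replace(r"[^0-9]", "", regex=True)
prices = digits.fillna("0").replace("", "0").astype("Int64")

print(prices.tolist())
```

Encoding a missing original price as 0 keeps the column purely numeric, at the cost of making "unknown" and "zero" indistinguishable; any later step has to treat 0 as "no original price".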

In [98]:
# Compute discountPercentage
df["discountPercentage"] = np.where(
    df["price_original"].notna() & (df["price_original"] > 0),
    ((df["price_original"] - df["price_number"]) / df["price_original"] * 100),
    0                   # fall back to 0 when price_original is missing or 0
)

# Tidy up the result
df["discountPercentage"] = (
    df["discountPercentage"]
        .round(0)        # use round(2) to keep two decimals
        .clip(0, 100)    # keep values within a sensible range
        .astype(int)     # optional: store as integer
)

# Check the result
df[["price_number", "price_original", "discountPercentage"]].head(10)


,price_number,price_original,discountPercentage
0,59999,165000,64
1,52900,99500,47
2,59000,169000,65
3,274750,549500,50
4,465000,0,0
5,218294,0,0
6,449000,0,0
7,37240,76000,51
8,44991,49990,10
9,63415,115300,45
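
As a worked check of the discount formula, the first row gives (165000 - 59999) / 165000 × 100 ≈ 63.6, which rounds to 64. The sketch below wraps the same arithmetic in a hypothetical helper, `discount_percentage`, that avoids dividing by zero by masking invalid rows first; it is one possible formulation, not the notebook's own code.

```python
import pandas as pd


def discount_percentage(price_original: pd.Series, price_number: pd.Series) -> pd.Series:
    """Whole-number discount in [0, 100]; 0 when the original price is missing or 0."""
    original = price_original.astype("float64")
    current = price_number.astype("float64")
    valid = original > 0
    # Use a dummy divisor of 1 on invalid rows, then overwrite those rows with 0
    safe_original = original.where(valid, 1.0)
    pct = ((safe_original - current) / safe_original * 100).where(valid, 0.0)
    return pct.round(0).clip(0, 100).astype(int)


# Values taken from the first rows of the output table above
original = pd.Series([165000, 99500, 549500, 0, 49990])
current = pd.Series([59999, 52900, 274750, 465000, 44991])
discounts = discount_percentage(original, current)
print(discounts.tolist())
```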


In [99]:
# Convert the count columns to nullable integers; missing values become 0
cols = ["countSold", "countReview", "totalRating"]

df[cols] = df[cols].fillna(0).astype("Int64")

In [100]:
df[["countSold", "countReview", "totalRating"]].head(10)

,countSold,countReview,totalRating
0,7000,454,1153
1,10000,281,911
2,100000,3587,16264
3,4,0,1
4,1,0,0
5,0,0,0
6,0,0,0
7,100,23,24
8,500,54,59
9,10000,450,1207


In [101]:
df.head()

,id,name,url,category_breadcrumb,price_number,price_original,discountPercentage,mediaURL_image,ratingAverage,shop_id,...,countSold,isTopAds,labelGroups,label_titles,totalRating,countReview,name_clean,category_clean,city_clean,labelGroups_clean
0,100823236046,LBITZ- Arika Blouse Stripe Atasan Wanita Peplu...,https://www.tokopedia.com/lbitzofficial/lbitz-...,fashion-wanita/atasan-wanita/blouse-wanita,59999,165000,64,https://p16-images-sign-sg.tokopedia-static.ne...,4.8,7494900101784111742,...,7000,False,"[{""position"": ""ri_product_credibility"", ""title...",7rb+ terjual | PreOrder | Bisa COD | 64% | Rp5...,1153,454,lbitz arika blouse stripe atasan wanita peplum...,fashion wanita atasan wanita blouse wanita,pekalongan,position ri product credibility title 7rb terj...
1,102196209582,Abelia Blouse - Blouse Korea Atasan Baju Oversize,https://www.tokopedia.com/raagm/abelia-blouse-...,fashion-wanita/atasan-wanita/blouse-wanita,52900,99500,47,https://p16-images-sign-sg.tokopedia-static.ne...,4.7,7496146497900284331,...,10000,False,"[{""position"": ""ri_product_credibility"", ""title...",10rb+ terjual | Hemat s.d 1% Pakai Bonus | 47...,911,281,abelia blouse blouse korea atasan baju oversize,fashion wanita atasan wanita blouse wanita,kab bandung,position ri product credibility title 10rb ter...
2,100530142425,AIJO| (READY) Violet Longsleeve - Oneck Lenga...,https://www.tokopedia.com/aijostoreid-262/aijo...,fashion-wanita/atasan-wanita/blouse-wanita,59000,169000,65,https://p16-images-sign-sg.tokopedia-static.ne...,4.8,7494083746980398941,...,100000,False,"[{""position"": ""ri_product_credibility"", ""title...",100rb+ terjual | Hemat s.d 1% Pakai Bonus | 6...,16264,3587,aijo ready violet longsleeve oneck lengan panj...,fashion wanita atasan wanita blouse wanita,kab tangerang,position ri product credibility title 100rb te...
3,8109591304,CoreNation Active Thea Jacket Woman,https://www.tokopedia.com/corenationactive/cor...,olahraga/pakaian-olahraga-wanita/jaket-windbre...,274750,549500,50,https://p16-images-sign-sg.tokopedia-static.ne...,5.0,3599573,...,4,False,"[{""position"": ""ri_product_credibility"", ""title...",4 terjual | Beli Lokal | Hemat s.d 1% Pakai B...,1,0,corenation active thea jacket woman,olahraga pakaian olahraga wanita jaket windbre...,surabaya,position ri product credibility title 4 terjua...
4,16650965846,Queenbeer - Tracktop Jacket Phillo Green,https://www.tokopedia.com/queenbeer/queenbeer-...,fashion-pria/outerwear-pria/jaket-pria,465000,0,0,https://p16-images-sign-sg.tokopedia-static.ne...,0.0,2713357,...,1,False,"[{""position"": ""ri_product_credibility"", ""title...",1 terjual | Hemat s.d 1% Pakai Bonus | Rp465....,0,0,queenbeer tracktop jacket phillo green,fashion pria outerwear pria jaket pria,bekasi,position ri product credibility title 1 terjua...


In [102]:
df.columns

Index(['id', 'name', 'url', 'category_breadcrumb', 'price_number',
       'price_original', 'discountPercentage', 'mediaURL_image',
       'ratingAverage', 'shop_id', 'shop_name', 'shop_url', 'shop_city',
       'shop_tier', 'countSold', 'isTopAds', 'labelGroups', 'label_titles',
       'totalRating', 'countReview', 'name_clean', 'category_clean',
       'city_clean', 'labelGroups_clean'],
      dtype='object')

In [103]:
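# Keywords that would mark a product as an ad in its label text
ADS_KEYWORDS = ["iklan", "sponsored", "ads"]

# Plain, case-sensitive substring match; non-string values (NaN) count as "not an ad"
df["isTopAds"] = df["label_titles"].apply(
    lambda x: isinstance(x, str) and any(k in x for k in ADS_KEYWORDS)
)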

In [104]:
df["isTopAds"].value_counts()

isTopAds
False    2136
Name: count, dtype: int64

“In the collected dataset, no product was explicitly labeled as an ad (TopAds) by the e-commerce platform in the search results. All products are therefore treated as non-ads in the analysis.”

“The isTopAds feature is kept as a precaution because, by design, ads are one source of bias in a recommender system. In the dataset used for this study, however, no product carries an ad label, so the feature does not contribute to the modeling.”
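
A plain substring test for a short keyword such as `ads` can also fire inside longer words. The sketch below shows one possible alternative, assuming whole-word, case-insensitive matching is what is wanted: it builds a regular expression with word boundaries and lets `str.contains` handle missing values. The `labels` Series is hypothetical.

```python
import re

import pandas as pd

ADS_KEYWORDS = ["iklan", "sponsored", "ads"]

# \b...\b restricts matches to whole words, so "ads" does not match inside "beads"
pattern = r"\b(?:" + "|".join(re.escape(k) for k in ADS_KEYWORDS) + r")\b"

# Hypothetical label strings, including a missing value
labels = pd.Series(["7rb+ terjual | Bisa COD", "Iklan | 10rb+ terjual", None, "Beads Shop"])

# case=False ignores letter case; na=False maps missing labels to False
is_ads = labels.str.contains(pattern, case=False, na=False, regex=True)
print(is_ads.tolist())
```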

In [105]:
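# A product counts as promoted if its discount is at least 40% or its labels
# mention a promo keyword. Note: str.contains is case-sensitive by default, so
# "hemat" does not match a label written as "Hemat".
df["has_promo"] = (
    (df["discountPercentage"] >= 40) |
    (df["label_titles"].str.contains("diskon|hemat|promo", regex=True))
)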

In [106]:
df["has_promo"].value_counts()

has_promo
False    1316
True      820
Name: count, dtype: int64
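
The label text in this dataset is capitalized (for example `Hemat s.d 1% Pakai Bonus`), while the promo keywords are lowercase. The sketch below contrasts the default case-sensitive match with a case-insensitive one on hypothetical, shortened label strings. Switching the notebook to the second form would change the `has_promo` counts shown above, so it is offered only as an option to consider.

```python
import pandas as pd

# Hypothetical, shortened label strings in the style of label_titles
labels = pd.Series([
    "1 terjual | Hemat s.d 1% Pakai Bonus",
    "4 terjual | Beli Lokal",
    None,
])

# Default behaviour: case-sensitive, so "hemat" misses "Hemat"
strict = labels.str.contains("diskon|hemat|promo", regex=True)

# Case-insensitive variant; missing labels are treated as "no promo"
loose = labels.str.contains("diskon|hemat|promo", case=False, na=False, regex=True)

print(strict.iloc[0], loose.tolist())
```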

In [107]:
UMKM_KEYWORDS = [
    "umkm", "lokal", "handmade", "kerajinan",
    "konveksi", "rumahan", "home industry",
    "custom", "distro"
]

In [108]:
# isTopAds: NA → False
df["isTopAds"] = df["isTopAds"].fillna(False)

# has_promo: NA → False
df["has_promo"] = df["has_promo"].fillna(False)

# shop_tier: NA → a large number (treated as non-UMKM)
df["shop_tier"] = df["shop_tier"].fillna(99)

# countSold: NA → 0
df["countSold"] = df["countSold"].fillna(0)

# label_titles: NA → empty string
df["label_titles"] = df["label_titles"].fillna("")

# name_clean & shop_name: NA → empty string
df["name_clean"] = df["name_clean"].fillna("")
df["shop_name"] = df["shop_name"].fillna("")


In [109]:
def umkm_score(row):
    score = 0

    # Small sales volume
    if int(row["countSold"]) < 50000:
        score += 1

    # Not a TopAds product
    if not row["isTopAds"]:
        score += 1

    # No promo
    if not row["has_promo"]:
        score += 1

    # Low shop tier
    if int(row["shop_tier"]) <= 2:
        score += 1

    # Local-product label (case-sensitive, so a label written as "Beli Lokal" does not match)
    if "beli lokal" in row["label_titles"]:
        score += 1

    # UMKM keyword in the product or shop name
    if any(
        k in row["name_clean"] or k in row["shop_name"]
        for k in UMKM_KEYWORDS
    ):
        score += 1

    return score

## Computing the UMKM Score
**Input:** The columns `countSold`, `isTopAds`, `has_promo`, `shop_tier`, `label_titles`, `name_clean`, and `shop_name`.

**Output:** An `umkm_score` column that awards one point per criterion met (0 to 6), and an `is_umkm` column that flags a product as UMKM (micro, small, and medium enterprise) when its score is at least 3.
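
The row-wise `umkm_score` function above can also be expressed column-wise, which is usually faster on larger frames. The sketch below is a hypothetical vectorized counterpart (`umkm_score_vectorized` is not part of the notebook) that adds up the same six boolean criteria. It assumes the missing values have already been filled as in the cell above, and the toy rows are illustrative.

```python
import re

import pandas as pd

UMKM_KEYWORDS = [
    "umkm", "lokal", "handmade", "kerajinan",
    "konveksi", "rumahan", "home industry",
    "custom", "distro",
]


def umkm_score_vectorized(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical vectorized counterpart of umkm_score: one point per criterion."""
    keyword_pattern = "|".join(re.escape(k) for k in UMKM_KEYWORDS)
    criteria = [
        frame["countSold"] < 50000,                                      # small sales volume
        ~frame["isTopAds"],                                              # not an ad
        ~frame["has_promo"],                                             # no promo
        frame["shop_tier"] <= 2,                                         # low shop tier
        frame["label_titles"].str.contains("beli lokal", regex=False),   # same case-sensitive check
        frame["name_clean"].str.contains(keyword_pattern)
        | frame["shop_name"].str.contains(keyword_pattern),              # UMKM keyword
    ]
    return sum(c.astype(int) for c in criteria)


# Toy rows (hypothetical values loosely modeled on the sample output)
toy = pd.DataFrame({
    "countSold": [4, 100000],
    "isTopAds": [False, False],
    "has_promo": [True, True],
    "shop_tier": [2, 3],
    "label_titles": ["4 terjual | Beli Lokal", "100rb+ terjual"],
    "name_clean": ["corenation active thea jacket woman", "aijo ready violet longsleeve"],
    "shop_name": ["CoreNation Active", "AIJOstoreid_NEW"],
})
toy["umkm_score"] = umkm_score_vectorized(toy)
toy["is_umkm"] = toy["umkm_score"] >= 3
print(toy[["umkm_score", "is_umkm"]])
```

The first toy row earns its three points from low sales, no ad label, and a low shop tier; the lowercase `beli lokal` check does not fire on `Beli Lokal`.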

In [110]:
df["umkm_score"] = df.apply(umkm_score, axis=1)
df["is_umkm"] = df["umkm_score"] >= 3

In [111]:
df["umkm_score"].describe()
df["is_umkm"].value_counts()

is_umkm
True     1535
False     601
Name: count, dtype: int64

In [112]:
df["is_umkm"].head(10)

0    False
1    False
2    False
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: is_umkm, dtype: bool

In [113]:
print("First 10 Rows of the Labeled Data")
print("=" * 100)
print(df[["name", "shop_name", "countSold", "isTopAds", "shop_tier", "umkm_score", "is_umkm"]].head(10))

First 10 Rows of the Labeled Data
                                                name                shop_name  \
0  LBITZ- Arika Blouse Stripe Atasan Wanita Peplu...            Lbitzofficial   
1  Abelia Blouse - Blouse Korea Atasan Baju Oversize                    raagm   
2  AIJO| (READY)  Violet Longsleeve - Oneck Lenga...          AIJOstoreid_NEW   
3                CoreNation Active Thea Jacket Woman        CoreNation Active   
4           Queenbeer - Tracktop Jacket Phillo Green  QUEENBEER Official Shop   
5  Inara Maxy Knit | Dress Wanita Kekinian | OOTD...       Feminskin Official   
6       QueenBeer - Heavyweight Zipper Hoodie Hugger  QUEENBEER Official Shop   
7  Baju Oversize Unisex - Kaos Oversize Cewek - B...               Goev Store   
8  Setelan Wanita Emma One Set Terbaru Lengan Pan...            Selena shop88   
9  Mireya Blouse Kerut Depan Rayon Atasan Wanita ...          Ab-fhasionstlye   

   countSold  isTopAds  shop_tier  umkm_score  is_umkm  
0       70

In [None]:
def umkm_score(row):
    score = 0

    # 1. Small sales volume
    if row.get("countSold", 0) < 50000:
        score += 1

    # 2. Not a TopAds product (a missing column earns no point)
    if not row.get("isTopAds", True):
        score += 1

    # 3. Low shop tier
    if row.get("shop_tier", 99) <= 2:
        score += 1

    # 4. Local-product label
    if "beli lokal" in row.get("label_titles", ""):
        score += 1

    # 5. UMKM keyword in the product or shop name
    name = row.get("name_clean", "")
    shop = row.get("shop_name", "")
    if any(k in name or k in shop for k in UMKM_KEYWORDS):
        score += 1

    return score

In [None]:
df.head()

In [None]:
df["umkm_score"] = df.apply(umkm_score, axis=1)
df["is_umkm"] = df["umkm_score"] >= 3

In [None]:
# Log-transform the price; log1p(x) computes log(1 + x)
df["price_log"] = np.log1p(df["price_number"])

In [None]:
df["price_log"].head(10)
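
A log transform compresses the long right tail of prices so that very expensive items do not dominate later computations. The short sketch below, using a few hypothetical price values, confirms that `np.log1p(x)` agrees with `np.log(x + 1)` and that a price of 0 maps to 0.

```python
import numpy as np

# Hypothetical prices, including 0 to show why the "+ 1" is needed
prices = np.array([0, 59999, 465000])

price_log = np.log1p(prices)          # log(1 + x), defined at x = 0
same_as_manual = np.allclose(price_log, np.log(prices + 1))

print(price_log, same_as_manual)
```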

In [None]:
# Normalize rating and discount to the range [0, 1]
df["rating_norm"] = (df["ratingAverage"].fillna(0).clip(0, 5)) / 5
df["discount_norm"] = (df["discountPercentage"].fillna(0).clip(0, 100)) / 100

In [None]:
df[["rating_norm", "discount_norm"]].head(10)

In [None]:
# Popularity: equal-weight average of the log-scaled sold and review counts
sold = df["countSold"].fillna(0)
rev  = df["countReview"].fillna(0)

df["sold_log"] = np.log(sold + 1)
df["review_log"] = np.log(rev + 1)

df["popularity"] = 0.5 * df["sold_log"] + 0.5 * df["review_log"]


In [None]:
df[["sold_log", "review_log", "popularity"]].head(10)
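
One way to read the popularity formula: averaging two logarithms with equal weights is the same as taking the logarithm of a geometric mean, since 0.5·log(a) + 0.5·log(b) = log(√(a·b)). The sketch below verifies this numerically on a few sold/review pairs chosen for illustration.

```python
import numpy as np

# Illustrative sold/review pairs, including an item with no activity
sold = np.array([7000, 4, 0])
reviews = np.array([454, 0, 0])

popularity = 0.5 * np.log1p(sold) + 0.5 * np.log1p(reviews)

# Equivalent form: log of the geometric mean of (sold + 1) and (reviews + 1)
log_geometric_mean = np.log(np.sqrt((sold + 1) * (reviews + 1)))

print(popularity, np.allclose(popularity, log_geometric_mean))
```

Under this reading, a product needs both sales and reviews to score high; a large count in one signal alone is damped by the other.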

In [None]:
df[["shop_tier", "labelGroups"]].head()

In [None]:
import ast
import json

def extract_label_titles(labelGroups):
    """
    Return the list of titles found in labelGroups.
    labelGroups may be a list[dict], a JSON string, or NaN.
    """
    # Already-parsed lists/dicts skip the missing-value check:
    # pd.isna() on a list returns an array, which cannot be used in `if`
    if not isinstance(labelGroups, (list, dict)) and pd.isna(labelGroups):
        return []

    obj = labelGroups

    # If it is a string, try to parse it
    if isinstance(obj, str):
        obj = obj.strip()
        if not obj:
            return []
        try:
            obj = json.loads(obj)
        except Exception:
            # Fall back to Python-literal syntax (e.g. single-quoted keys)
            try:
                obj = ast.literal_eval(obj)
            except Exception:
                return []

    # obj is now expected to be a list or a dict
    titles = []
    if isinstance(obj, list):
        for it in obj:
            if isinstance(it, dict):
                t = it.get("title")
                if t:
                    titles.append(str(t).lower())
    elif isinstance(obj, dict):
        t = obj.get("title")
        if t:
            titles.append(str(t).lower())

    return titles
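
The following sketch shows how such a parser behaves on the input shapes it is meant to handle. `parse_titles` is a condensed, hypothetical restatement of the same idea (JSON first, Python-literal syntax as a fallback), and the title values are made up for illustration; only the `position` and `title` keys come from the dataset preview.

```python
import ast
import json
import math


def parse_titles(value):
    """Return lowercase titles from a list[dict], a JSON/Python-literal string, or a missing value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except ValueError:
            try:
                value = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                return []
    items = value if isinstance(value, list) else [value]
    return [str(d["title"]).lower() for d in items if isinstance(d, dict) and d.get("title")]


from_json = parse_titles('[{"position": "ri_product_credibility", "title": "Beli Lokal"}]')
from_literal = parse_titles("[{'title': 'Bisa COD'}]")
from_list = parse_titles([{"title": "PreOrder"}, {"position": "x"}])
from_missing = parse_titles(float("nan"))
from_garbage = parse_titles("not a list")

print(from_json, from_literal, from_list, from_missing, from_garbage)
```

Because the titles are lowercased on the way out, lowercase keyword checks such as `"beli lokal" in titles` work on this output without further normalization.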


In [None]:
def umkm_score(row):
    score = 0

    # 1. Small sales volume
    if row["countSold"] < 50000:
        score += 1

    # 2. Not a TopAds product
    if row.get("isTopAds") == False:
        score += 1

    # 3. Low shop tier
    if row["shop_tier"] <= 2:
        score += 1

    # 4. Has a local-product label
    if "beli lokal" in row["label_titles"]:
        score += 1

    # 5. UMKM keyword in the product or shop name
    if any(k in row["name_clean"] or k in row["shop_name"] for k in UMKM_KEYWORDS):
        score += 1

    return score


In [None]:
df["umkm_score"] = df.apply(umkm_score, axis=1)

In [None]:
UMKM_KEYWORDS = [
    "umkm", "lokal", "handmade", "daerah",
    "kerajinan", "konveksi", "rumahan"
]

def label_umkm(row):
    score = 0

    # 1. Official store (if the column exists)
    if "is_official" in row.index:
        if row["is_official"] == False:
            score += 1
    else:
        # if the column is absent, assume non-official
        score += 1

    # 2. TopAds (column name must match the DataFrame: isTopAds)
    if "isTopAds" in row.index and row["isTopAds"] == False:
        score += 1

    # 3. Sales volume (column name must match the DataFrame: countSold)
    if "countSold" in row.index and row["countSold"] < 50000:
        score += 1

    # 4. UMKM keyword
    name = str(row.get("name_clean", "")).lower()
    shop = str(row.get("shop_name", "")).lower()
    if any(k in name or k in shop for k in UMKM_KEYWORDS):
        score += 1

    return score >= 3


In [None]:
df["umkm"] = df.apply(label_umkm, axis=1)


In [None]:
df["umkm"].value_counts()

In [None]:
def is_umkm(row):
    name = row.get("name", "")
    sold = row.get("countSold", 0)
    is_ads = row.get("isTopAds", False)     # existing column
    shop_tier = row.get("shop_tier", pd.NA) # numeric
    labelGroups = row.get("labelGroups", pd.NA)

    # Normalize missing values
    name = "" if pd.isna(name) else str(name).lower()
    sold = 0 if pd.isna(sold) else float(sold)
    is_ads = False if pd.isna(is_ads) else bool(is_ads)

    # Extract label titles
    titles = extract_label_titles(labelGroups)

    # Official-store heuristic from labelGroups (the most direct signal, since it carries the label text)
    official_markers = ["official", "official store", "tokopedia official", "mall"]
    is_official = any(any(m in t for m in official_markers) for t in titles)

    # Extra fallback from shop_tier (noted as taking values such as 1 and 3)
    # General assumption: a higher tier tends to mean a large/official merchant,
    # so a tier of 3 or above is treated as non-UMKM
    if (not pd.isna(shop_tier)) and int(shop_tier) >= 3:
        is_official = True

    # UMKM rule (adjust as needed): all four conditions must hold
    keywords = ["umkm", "usaha mikro", "homemade", "rumahan", "handmade", "lokal"]
    keyword_match = any(k in name for k in keywords)

    sold_limit = sold <= 1000
    not_ads = not is_ads
    not_official = not is_official

    return bool(keyword_match and sold_limit and not_ads and not_official)


df["umkm"] = df.apply(is_umkm, axis=1)

In [None]:
df.head()

In [None]:
df["umkm"].value_counts()

In [None]:
df_umkm = df[df["umkm"]]
df_umkm.head()