# i. Introduction

This program performs data cleaning and preprocessing on a dataset of __TWS__ products scraped from the e-commerce site __Amazon__. The steps cover column name normalization, missing value imputation according to data type, conversion of the _price_ and _rating_ columns to numeric format, and extraction of product feature information into a __feature__ column by splitting the text in the __product_name__ column.

These cleaning steps aim to ensure that the data is of good quality, consistently formatted, and compatible with subsequent statistical analysis and data modeling.

# ii. Import Libraries

In [None]:
# Import Libraries
import json
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

# iii. Data Loading

In [None]:
# Read each brand's JSON file and combine the records into one list.
# Note: 'jbl.json' is listed twice, so its records are loaded twice.
files = ['soundcore.json', 'bose.json', 'baseus.json', 'jbl.json', 'sennheiser.json',
         'realme.json', 'huawei.json', 'jbl.json', 'xiaomi.json']

data = []
for path in files:
    with open(path) as f:
        data += json.load(f)

# Save the combined data to a JSON file
with open("all_data.json", "w") as review_json:
    json.dump(data, review_json, indent=2, ensure_ascii=True)

In [182]:
# Read JSON file into DataFrame
with open('all_data.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
df

Unnamed: 0,product_name,rating,price,brand,color,ear_placement,form_factor,impedance,image_url,reviews
0,Soundcore P30i by Anker Noise Cancelling Earbu...,4.4 out of 5 stars,$29.99,Soundcore,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/51HbL1WS+K...,[{'review': 'I had a pair of the AirPod Pros f...
1,Soundcore P30i by Anker Noise Cancelling Earbu...,4.4 out of 5 stars,$29.99,Soundcore,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/51HbL1WS+K...,[{'review': 'I had a pair of the AirPod Pros f...
2,Soundcore P30i by Anker Noise Cancelling Earbu...,4.4 out of 5 stars,$7.79,Soundcore,Blue,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41Quiz3Dg-...,[{'review': 'I had a pair of the AirPod Pros f...
3,Soundcore P30i by Anker Noise Cancelling Earbu...,4.4 out of 5 stars,$7.59,Soundcore,Green,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/51013C6va8...,[{'review': 'I had a pair of the AirPod Pros f...
4,Soundcore P30i by Anker Noise Cancelling Earbu...,4.4 out of 5 stars,$29.99,Soundcore,Pink,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41KMo8Ldy8...,[{'review': 'I had a pair of the AirPod Pros f...
...,...,...,...,...,...,...,...,...,...,...
709,JBL Tune Buds - True wireless Noise Cancelling...,4.2 out of 5 stars,$59.95,JBL,Black,In Ear,In Ear,32 Ohms,https://m.media-amazon.com/images/I/51EZig0rEA...,"[{'review': 'Very comfortable, decent noise ca..."
710,"realme Air 6 Pro True Wireless Earbuds, 50dB N...",4.4 out of 5 stars,$87.82,realme,Grey,In Ear,In Ear,,https://m.media-amazon.com/images/I/61Ktz5wyEw...,[{'review': 'ConThe only bad part about these ...
711,"realme Air 6 Pro True Wireless Earbuds, 50dB N...",4.4 out of 5 stars,$87.82,realme,Grey,In Ear,In Ear,,https://m.media-amazon.com/images/I/61Ktz5wyEw...,[{'review': 'ConThe only bad part about these ...
712,"realme Buds Air 6 Pro True Wireless Earbuds, 5...",4.4 out of 5 stars,$84.99,realme,Silver Blue,In Ear,In Ear,1.4 Ohms,https://m.media-amazon.com/images/I/61e3G6oLxj...,[{'review': 'Amazing sound. The app was wonky ...
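Since `jbl.json` appears twice in the loading cell and the first two rows of the output look identical, a duplicate check may be worth running before cleaning. The sketch below is only an illustration on toy records (the product names and values are made up). It assumes rows count as duplicates when every field matches, and it serialises each row to JSON because the list-valued `reviews` column cannot be hashed by `DataFrame.duplicated()` directly.

```python
import json

import pandas as pd

# Toy records shaped like the scraped data (values are illustrative only).
records = [
    {"product_name": "Earbuds A", "price": "$29.99",
     "reviews": [{"review": "ok", "rating": "4.0 out of 5 stars"}]},
    {"product_name": "Earbuds A", "price": "$29.99",
     "reviews": [{"review": "ok", "rating": "4.0 out of 5 stars"}]},
    {"product_name": "Earbuds B", "price": "$7.79", "reviews": []},
]
toy = pd.DataFrame(records)

# Lists are unhashable, so build one canonical JSON string per row
# and let pandas compare those strings instead.
row_keys = toy.apply(lambda row: json.dumps(row.to_dict(), sort_keys=True), axis=1)
is_duplicate = row_keys.duplicated(keep="first")

n_duplicates = int(is_duplicate.sum())
deduplicated = toy[~is_duplicate].reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

Whether duplicates should actually be dropped depends on the analysis; listings that differ only in colour, for example, are not duplicates under this definition.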


In [183]:
# Save to CSV as backup
df.to_csv('data_before_clean.csv', index=False)

In [None]:
# Display DataFrame info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_name   714 non-null    object
 1   rating         714 non-null    object
 2   price          714 non-null    object
 3   brand          714 non-null    object
 4   color          714 non-null    object
 5   ear_placement  714 non-null    object
 6   form_factor    714 non-null    object
 7   impedance      714 non-null    object
 8   image_url      714 non-null    object
 9   reviews        714 non-null    object
dtypes: object(10)
memory usage: 55.9+ KB


# iv. Data Cleaning

In [None]:
# Capitalize First Letter in Brand Column
df['brand'] = df['brand'].str.title()
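One side effect of `str.title()` is that it lowercases everything after the first letter of each word, so a brand written as `JBL` becomes `Jbl`. If the official spelling matters downstream, one possible alternative is a small lookup table. The sketch below uses toy values, and the `CANONICAL` mapping is a hypothetical example rather than part of this notebook.

```python
import pandas as pd

brands = pd.Series(["JBL", "realme", "XIAOMI", "soundcore", "N/A"])

# What the notebook's approach produces.
titled = brands.str.title()

# Hypothetical lookup of preferred spellings, keyed on the lowercased name.
CANONICAL = {"jbl": "JBL", "realme": "realme"}

# Use the canonical spelling when one is known, otherwise fall back to title case.
normalized = brands.str.strip().str.lower().map(CANONICAL).fillna(titled)

print(titled.tolist())
print(normalized.tolist())
```

The `'N/A'` placeholder is left unchanged by both variants, so the replacement step that follows still finds it.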

In [None]:
# Drop rows with a missing brand (the scraper writes missing values as 'N/A')
df['brand'] = df['brand'].replace('N/A', np.nan)
df = df.dropna(subset=['brand']).copy()

In [None]:
# Sort DataFrame by Brand
df = df.sort_values(by='brand')

df

Unnamed: 0,product_name,rating,price,brand,color,ear_placement,form_factor,impedance,image_url,reviews
270,"Baseus Eli 2i Fit Open-Ear Headphones, 8.8g Se...",4.4 out of 5 stars,$15.99,Baseus,Black,Open Ear,Open Ear,16 Ohms,https://m.media-amazon.com/images/I/51XnDCIDaF...,[{'review': 'These are actually surprisingly n...
225,"Baseus E20 True Wireless Earbuds, 12mm Drivers...",4.6 out of 5 stars,$15.99,Baseus,White,In Ear,In Ear,,https://m.media-amazon.com/images/I/51q9mJAXJj...,[{'review': 'I am seriously impressed with the...
226,Baseus BS1 NC Semi-in-Ear Noise Cancelling Ear...,4.3 out of 5 stars,$21.99,Baseus,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/51V95ukJfM...,[{'review': 'absolutely amazing! Clear sound q...
227,Baseus Bass BS1 NC Semi-in-Ear True Wireless N...,4.3 out of 5 stars,$39.89,Baseus,White,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41a41P57iZ...,[{'review': 'absolutely amazing! Clear sound q...
228,Baseus Bass BP1 Pro Noise Cancelling Wireless ...,4.3 out of 5 stars,$27.99,Baseus,Ocean Blue,In Ear,In Ear,,https://m.media-amazon.com/images/I/41SsqSVMpT...,[{'review': 'Have you ever felt the gut rumbli...
...,...,...,...,...,...,...,...,...,...,...
595,Xiaomi Redmi Buds 6 Active Wireles Earbuds (Gl...,4.2 out of 5 stars,$24.99,Xiaomi,White,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41rWEasz8i...,"[{'review': 'Muy bien.', 'rating': '5.0 out of..."
596,"XIAOMI TWSEJ04LS Redmi Airdots Earphones, Blue...",4.2 out of 5 stars,$23.45,Xiaomi,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/61KDNdskQJ...,"[{'review': '¡Excelente!', 'rating': '5.0 out ..."
597,Xiaomi Sound Packet Mi Portable Bluetooth 5.4 ...,4.7 out of 5 stars,$25.99,Xiaomi,,,,,https://m.media-amazon.com/images/I/71oM-ntW-+...,[{'review': 'El pedido llegó en excelentes con...
599,"Xiaomi Wireless Earbuds Redmi Buds 3 Lite, Wir...",4.2 out of 5 stars,$17.89,Xiaomi,White,In Ear,In Ear,70 Ohms,https://m.media-amazon.com/images/I/31L0-uMto2...,"[{'review': 'Todo en bien estado', 'rating': '..."


In [190]:
# Split Product Name into Product and Feature

df[['product', 'feature']] = (
    df['product_name']
    .str.split(r',|\||\s[-–—]\s', n=1, expand=True)
)

df['product'] = df['product'].str.strip()
df['feature'] = df['feature'].fillna('N/A').str.strip()

# Drop the original column
df.drop(columns=['product_name'], inplace=True)

df

Unnamed: 0,rating,price,brand,color,ear_placement,form_factor,impedance,image_url,reviews,product,feature
270,4.4 out of 5 stars,$15.99,Baseus,Black,Open Ear,Open Ear,16 Ohms,https://m.media-amazon.com/images/I/51XnDCIDaF...,[{'review': 'These are actually surprisingly n...,Baseus Eli 2i Fit Open-Ear Headphones,"8.8g Secure Fit, 36H Battery, BassUp Technolog..."
225,4.6 out of 5 stars,$15.99,Baseus,White,In Ear,In Ear,,https://m.media-amazon.com/images/I/51q9mJAXJj...,[{'review': 'I am seriously impressed with the...,Baseus E20 True Wireless Earbuds,"12mm Drivers with Powerful Bass, 33H Long Play..."
226,4.3 out of 5 stars,$21.99,Baseus,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/51V95ukJfM...,[{'review': 'absolutely amazing! Clear sound q...,Baseus BS1 NC Semi-in-Ear Noise Cancelling Ear...,"No Ear Pressure, Hi-Res Audio (LDAC), 40H Play..."
227,4.3 out of 5 stars,$39.89,Baseus,White,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41a41P57iZ...,[{'review': 'absolutely amazing! Clear sound q...,Baseus Bass BS1 NC Semi-in-Ear True Wireless N...,"Hi-Res Audio (LDAC), 40H Playtime, 4g Ultra-Li..."
228,4.3 out of 5 stars,$27.99,Baseus,Ocean Blue,In Ear,In Ear,,https://m.media-amazon.com/images/I/41SsqSVMpT...,[{'review': 'Have you ever felt the gut rumbli...,Baseus Bass BP1 Pro Noise Cancelling Wireless ...,"Real-Time Adaptive Noise Cancelling, Adaptive ..."
...,...,...,...,...,...,...,...,...,...,...,...
595,4.2 out of 5 stars,$24.99,Xiaomi,White,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41rWEasz8i...,"[{'review': 'Muy bien.', 'rating': '5.0 out of...",Xiaomi Redmi Buds 6 Active Wireles Earbuds (Gl...,"Bluetooth 5.4 in-Ear Headphones, 30H Long Batt..."
596,4.2 out of 5 stars,$23.45,Xiaomi,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/61KDNdskQJ...,"[{'review': '¡Excelente!', 'rating': '5.0 out ...",XIAOMI TWSEJ04LS Redmi Airdots Earphones,"Bluetooth, Sweatproof, True Wireless Earbuds, ..."
597,4.7 out of 5 stars,$25.99,Xiaomi,,,,,https://m.media-amazon.com/images/I/71oM-ntW-+...,[{'review': 'El pedido llegó en excelentes con...,Xiaomi Sound Packet Mi Portable Bluetooth 5.4 ...,IP67 Waterproof Outdoor Bluetooth Speaker with...
599,4.2 out of 5 stars,$17.89,Xiaomi,White,In Ear,In Ear,70 Ohms,https://m.media-amazon.com/images/I/31L0-uMto2...,"[{'review': 'Todo en bien estado', 'rating': '...",Xiaomi Wireless Earbuds Redmi Buds 3 Lite,Wireless Earphones Bluetooth 5.2 Latency Stere...
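To make the splitting rule easier to follow, here is a small sketch of the same regular expression applied to three sample titles (shortened from names that appear in this dataset). It splits at the first comma, pipe, or space-surrounded dash, so a hyphen inside a word such as `Open-Ear` is not treated as a separator, and a title with no separator gets a missing feature.

```python
import pandas as pd

names = pd.Series([
    "Baseus Eli 2i Fit Open-Ear Headphones, 8.8g Secure Fit, 36H Battery",
    "JBL Tune Buds - True wireless Noise Cancelling earbuds",
    "Bose Ultra Open Earbuds",
])

# n=1 splits only at the first separator; everything after it becomes the feature text.
parts = names.str.split(r",|\||\s[-–—]\s", n=1, expand=True)
parts.columns = ["product", "feature"]
parts["product"] = parts["product"].str.strip()
parts["feature"] = parts["feature"].str.strip()

print(parts)
```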


In [191]:
# Reorder columns
df = df[['product', 'feature', 'brand', 'price', 'rating', 'color', 'ear_placement', 'form_factor', 'impedance', 'image_url', 'reviews']]

df

Unnamed: 0,product,feature,brand,price,rating,color,ear_placement,form_factor,impedance,image_url,reviews
270,Baseus Eli 2i Fit Open-Ear Headphones,"8.8g Secure Fit, 36H Battery, BassUp Technolog...",Baseus,$15.99,4.4 out of 5 stars,Black,Open Ear,Open Ear,16 Ohms,https://m.media-amazon.com/images/I/51XnDCIDaF...,[{'review': 'These are actually surprisingly n...
225,Baseus E20 True Wireless Earbuds,"12mm Drivers with Powerful Bass, 33H Long Play...",Baseus,$15.99,4.6 out of 5 stars,White,In Ear,In Ear,,https://m.media-amazon.com/images/I/51q9mJAXJj...,[{'review': 'I am seriously impressed with the...
226,Baseus BS1 NC Semi-in-Ear Noise Cancelling Ear...,"No Ear Pressure, Hi-Res Audio (LDAC), 40H Play...",Baseus,$21.99,4.3 out of 5 stars,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/51V95ukJfM...,[{'review': 'absolutely amazing! Clear sound q...
227,Baseus Bass BS1 NC Semi-in-Ear True Wireless N...,"Hi-Res Audio (LDAC), 40H Playtime, 4g Ultra-Li...",Baseus,$39.89,4.3 out of 5 stars,White,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41a41P57iZ...,[{'review': 'absolutely amazing! Clear sound q...
228,Baseus Bass BP1 Pro Noise Cancelling Wireless ...,"Real-Time Adaptive Noise Cancelling, Adaptive ...",Baseus,$27.99,4.3 out of 5 stars,Ocean Blue,In Ear,In Ear,,https://m.media-amazon.com/images/I/41SsqSVMpT...,[{'review': 'Have you ever felt the gut rumbli...
...,...,...,...,...,...,...,...,...,...,...,...
595,Xiaomi Redmi Buds 6 Active Wireles Earbuds (Gl...,"Bluetooth 5.4 in-Ear Headphones, 30H Long Batt...",Xiaomi,$24.99,4.2 out of 5 stars,White,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/41rWEasz8i...,"[{'review': 'Muy bien.', 'rating': '5.0 out of..."
596,XIAOMI TWSEJ04LS Redmi Airdots Earphones,"Bluetooth, Sweatproof, True Wireless Earbuds, ...",Xiaomi,$23.45,4.2 out of 5 stars,Black,In Ear,In Ear,16 Ohms,https://m.media-amazon.com/images/I/61KDNdskQJ...,"[{'review': '¡Excelente!', 'rating': '5.0 out ..."
597,Xiaomi Sound Packet Mi Portable Bluetooth 5.4 ...,IP67 Waterproof Outdoor Bluetooth Speaker with...,Xiaomi,$25.99,4.7 out of 5 stars,,,,,https://m.media-amazon.com/images/I/71oM-ntW-+...,[{'review': 'El pedido llegó en excelentes con...
599,Xiaomi Wireless Earbuds Redmi Buds 3 Lite,Wireless Earphones Bluetooth 5.2 Latency Stere...,Xiaomi,$17.89,4.2 out of 5 stars,White,In Ear,In Ear,70 Ohms,https://m.media-amazon.com/images/I/31L0-uMto2...,"[{'review': 'Todo en bien estado', 'rating': '..."


In [192]:
# Change price to numeric
df['price'] = (
    df['price']
    .astype(str)
    .str.replace(r'[$,]', '', regex=True)
)

df['price'] = pd.to_numeric(df['price'], errors='coerce')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        707 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    object 
 5   color          707 non-null    object 
 6   ear_placement  707 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(1), object(10)
memory usage: 66.3+ KB
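The price conversion can be sketched on toy values as follows. The sample strings are made up for illustration. Because `errors='coerce'` silently turns anything unparseable into NaN, counting the NaN values afterwards is a cheap way to see how many prices failed to convert.

```python
import pandas as pd

raw_prices = pd.Series(["$29.99", "$1,299.00", "N/A", None])

# Strip the currency symbol and thousands separator, then parse as numbers.
cleaned = raw_prices.astype(str).str.replace(r"[$,]", "", regex=True)
prices = pd.to_numeric(cleaned, errors="coerce")

# Values that could not be parsed end up as NaN.
n_failed = int(prices.isna().sum())
print(prices.tolist(), n_failed)
```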


In [193]:
df['product'].unique()

array(['Baseus Eli 2i Fit Open-Ear Headphones',
       'Baseus E20 True Wireless Earbuds',
       'Baseus BS1 NC Semi-in-Ear Noise Cancelling Earbuds with Adaptive Hybrid ANC',
       'Baseus Bass BS1 NC Semi-in-Ear True Wireless Noise Cancelling Earbuds with Adaptive Hybrid ANC',
       'Baseus Bass BP1 Pro Noise Cancelling Wireless Earbuds',
       'Baseus Inspire XP1 Adaptive Noise Cancelling Earbuds',
       'Baseus Inspire XC1 Open Ear Clip-On Earbuds',
       'Baseus Eli Sport 2 Open-Ear Headphones Wireless Earbuds',
       'Baseus Bowie MC1 Open Ear Clip-On Earbuds',
       'Baseus Bowie MC1 Pro Open Ear Earbuds',
       'Baseus Bowie 30 Max Active Noise Cancelling Headphones',
       'Baseus Bass BC1 Open Ear Earbuds Clip-On Headphones',
       'Baseus Bass BP1 NC Hybrid Active Noise Cancelling Wireless Earbuds',
       'Baseus Bowie MC1 Pro Open Ear Clip-On Headphones',
       'Baseus Inspire XH1 Adaptive Active Noise Cancelling Headphones',
       'Bose Ultra Open Earbuds', ...

In [None]:
# Check Rating Column
df.rating.value_counts()

rating
4.2 out of 5 stars    140
4.4 out of 5 stars    116
4.3 out of 5 stars    116
4.0 out of 5 stars     91
4.1 out of 5 stars     86
4.6 out of 5 stars     42
3.8 out of 5 stars     31
3.9 out of 5 stars     15
4.7 out of 5 stars     13
3.7 out of 5 stars     11
4.5 out of 5 stars     10
3.6 out of 5 stars     10
4.9 out of 5 stars      9
3.2 out of 5 stars      7
4.8 out of 5 stars      4
5.0 out of 5 stars      3
3.5 out of 5 stars      3
Name: count, dtype: int64

The rating column still has the object dtype, so it needs to be converted to float by extracting only the first number in each string.

In [None]:
# Rating to numeric
df['rating'] = pd.to_numeric(
    df['rating'].astype(str).str.extract(r'(\d+\.\d+)')[0],
    errors='coerce'
)

df.rating.value_counts()


rating
4.2    140
4.4    116
4.3    116
4.0     91
4.1     86
4.6     42
3.8     31
3.9     15
4.7     13
3.7     11
4.5     10
3.6     10
4.9      9
3.2      7
4.8      4
5.0      3
3.5      3
Name: count, dtype: int64
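The pattern `(\d+\.\d+)` works here because every rating in this dataset has a decimal part. As a hedged aside, the sketch below compares it with a slightly more lenient pattern on toy strings. The value `"5 out of 5 stars"` is a hypothetical input that does not appear in the output above.

```python
import pandas as pd

raw = pd.Series(["4.4 out of 5 stars", "5 out of 5 stars", "N/A"])

# Pattern used in the notebook: requires a decimal point.
strict = pd.to_numeric(raw.str.extract(r"(\d+\.\d+)")[0], errors="coerce")

# Lenient variant: the decimal part is optional.
lenient = pd.to_numeric(raw.str.extract(r"(\d+(?:\.\d+)?)")[0], errors="coerce")

print(strict.tolist())
print(lenient.tolist())
```

With the strict pattern a whole-number rating would become NaN, while the lenient one keeps it as `5.0`.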

# v. Handle Missing Value

In [216]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        707 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          707 non-null    object 
 6   ear_placement  707 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


According to df.info(), the data has no null values. However, the scraper wrote the string 'N/A' whenever a value was missing, so these entries are not counted as missing by df.info(). The 'N/A' values in each column therefore need to be converted to NaN.

In [None]:
# Change feature 'N/A' to NaN
df['feature'] = df['feature'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          707 non-null    object 
 6   ear_placement  707 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


In [None]:
# Change brand 'N/A' to NaN
df['brand'] = df['brand'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          707 non-null    object 
 6   ear_placement  707 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


In [None]:
# Change color 'N/A' to NaN
df['color'] = df['color'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          697 non-null    object 
 6   ear_placement  707 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


In [None]:
# Change ear_placement 'N/A' to NaN
df['ear_placement'] = df['ear_placement'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          697 non-null    object 
 6   ear_placement  695 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


In [None]:
# Change form_factor 'N/A' to NaN
df['form_factor'] = df['form_factor'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          697 non-null    object 
 6   ear_placement  695 non-null    object 
 7   form_factor    696 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


In [None]:
# Change impedance 'N/A' to NaN
df['impedance'] = df['impedance'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          697 non-null    object 
 6   ear_placement  695 non-null    object 
 7   form_factor    696 non-null    object 
 8   impedance      510 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB


In [None]:
# Change image_url 'N/A' to NaN
df['image_url'] = df['image_url'].replace('N/A', np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        691 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          697 non-null    object 
 6   ear_placement  695 non-null    object 
 7   form_factor    696 non-null    object 
 8   impedance      510 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB
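The per-column replacements above could also be written as a single call over a list of columns. The following is a minimal sketch on a toy frame, not the notebook's own code. The column list is given explicitly, and `isna().sum()` is used as a more compact missing-value summary than repeated `df.info()` calls.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "color": ["Black", "N/A", "White"],
    "impedance": ["16 Ohms", "N/A", "N/A"],
    "price": [29.99, 7.79, 15.99],
})

# Columns in which the scraper may have written the 'N/A' placeholder.
text_cols = ["color", "impedance"]

# Replace the placeholder in all listed columns at once.
toy[text_cols] = toy[text_cols].replace("N/A", np.nan)

# One line per column with its number of missing values.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```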


## a. Imputation

The price and rating columns have no null values, but median imputation is still applied to them as a safeguard. Missing values in feature are filled with 'N/A', and those in color and ear_placement with 'Unknown', since only a few values are missing. form_factor is filled with its mode. impedance is filled with 'Not Specified' as a new category: with 197 of 707 values missing, mode imputation would distort the distribution.

In [207]:
# Numeric
df['price'] = df['price'].fillna(df['price'].median())
df['rating'] = df['rating'].fillna(df['rating'].median())

# Categorical
df['feature'] = df['feature'].fillna('N/A')
df['color'] = df['color'].fillna('Unknown')
df['ear_placement'] = df['ear_placement'].fillna('Unknown')
df['form_factor'] = df['form_factor'].fillna(df['form_factor'].mode()[0])
df['impedance'] = df['impedance'].fillna('Not Specified')

In [208]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707 entries, 270 to 579
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product        707 non-null    object 
 1   feature        707 non-null    object 
 2   brand          707 non-null    object 
 3   price          707 non-null    float64
 4   rating         707 non-null    float64
 5   color          707 non-null    object 
 6   ear_placement  707 non-null    object 
 7   form_factor    707 non-null    object 
 8   impedance      707 non-null    object 
 9   image_url      707 non-null    object 
 10  reviews        707 non-null    object 
dtypes: float64(2), object(9)
memory usage: 66.3+ KB
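The three imputation strategies used in this section (median, mode, and a constant category) can be seen side by side on a toy frame. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 50.0],
    "form_factor": ["In Ear", "In Ear", np.nan, "Open Ear"],
    "impedance": ["16 Ohms", np.nan, np.nan, "32 Ohms"],
})

# Median imputation: robust to extreme prices (median of 10, 30, 50 is 30).
toy["price"] = toy["price"].fillna(toy["price"].median())

# Mode imputation: fill with the most frequent category.
toy["form_factor"] = toy["form_factor"].fillna(toy["form_factor"].mode()[0])

# Constant imputation: keep "missing" visible as its own category.
toy["impedance"] = toy["impedance"].fillna("Not Specified")

print(toy)
```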


# vi. Review Data Frame

Build a review data frame by separating out product, brand, and reviews; it will be used for the NLP model.

In [None]:
# Explode reviews into separate rows
review_df = (
    df[['product', 'brand', 'reviews']]
    .explode('reviews')
)

review_df['review_text'] = review_df['reviews'].apply(lambda x: x['review'])
review_df['review_rating'] = review_df['reviews'].apply(lambda x: x['rating'])

review_df = review_df.drop(columns=['reviews'])

review_df

Unnamed: 0,product,brand,review_text,review_rating
270,Baseus Eli 2i Fit Open-Ear Headphones,Baseus,These are actually surprisingly nice. To be cl...,5.0 out of 5 stars
270,Baseus Eli 2i Fit Open-Ear Headphones,Baseus,One of the best over the ear phones. It delive...,5.0 out of 5 stars
270,Baseus Eli 2i Fit Open-Ear Headphones,Baseus,Got mine today. Some quick observations• Diffi...,4.0 out of 5 stars
270,Baseus Eli 2i Fit Open-Ear Headphones,Baseus,I've had mixed experiences with Baseus earphon...,5.0 out of 5 stars
270,Baseus Eli 2i Fit Open-Ear Headphones,Baseus,"These are fine. Decent sound and battery life,...",3.0 out of 5 stars
...,...,...,...,...
579,Redmi Buds 6 Lite in-Ear Headphones,Xiaomi,The Redmi Buds 6 lite are a great low-cost opt...,5.0 out of 5 stars
579,Redmi Buds 6 Lite in-Ear Headphones,Xiaomi,They look and work really well,5.0 out of 5 stars
579,Redmi Buds 6 Lite in-Ear Headphones,Xiaomi,"I'm not an audiophile, I also don't tend to pu...",5.0 out of 5 stars
579,Redmi Buds 6 Lite in-Ear Headphones,Xiaomi,I bought a blue one but get a white one. Color...,1.0 out of 5 stars
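The explode step can be sketched on toy data as below. Two things here are assumptions that go beyond the cell above: a product with an empty review list is included to show that `explode` turns it into a NaN row, which is filtered out before the dictionary fields are read, and the review rating is converted to a number in the same way as the product rating.

```python
import pandas as pd

toy = pd.DataFrame({
    "product": ["Earbuds A", "Earbuds B"],
    "brand": ["BrandX", "BrandY"],
    "reviews": [
        [{"review": "Great sound", "rating": "5.0 out of 5 stars"},
         {"review": "Battery is weak", "rating": "3.0 out of 5 stars"}],
        [],  # product without any reviews
    ],
})

# One row per review; an empty list becomes a single NaN row.
exploded = toy.explode("reviews").reset_index(drop=True)

# Keep only rows that actually hold a review dictionary.
has_review = exploded["reviews"].apply(lambda r: isinstance(r, dict))
exploded = exploded[has_review].copy()

exploded["review_text"] = exploded["reviews"].apply(lambda r: r.get("review"))
rating_text = exploded["reviews"].apply(lambda r: r.get("rating"))
exploded["review_rating"] = pd.to_numeric(
    rating_text.str.extract(r"(\d+(?:\.\d+)?)")[0], errors="coerce"
)

review_toy = exploded.drop(columns=["reviews"]).reset_index(drop=True)
print(review_toy)
```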


# vii. Saving Cleaned Data to CSV

In [212]:
# Save cleaned data to CSV
df.to_csv('data_after_clean.csv', index=False)

In [213]:
# Save review data to CSV
review_df.to_csv('review_data.csv', index=False)
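One thing to keep in mind when these files are loaded again: by default `pd.read_csv` treats the string `N/A` as a missing value, so the `'N/A'` placeholder written into the feature column would come back as NaN. The sketch below demonstrates this on a toy frame, using an in-memory buffer instead of a file. Passing `keep_default_na=False` is one way to keep the literal string.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "product": ["Earbuds A", "Earbuds B"],
    "feature": ["N/A", "36H Battery"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: 'N/A' is parsed as a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps 'N/A' as a literal string.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

n_lost = int(default_read["feature"].isna().sum())
n_kept = int((literal_read["feature"] == "N/A").sum())
print(n_lost, n_kept)
```

Note also that list-valued columns such as `reviews` are written to CSV as their string representation, so they would need to be parsed again after loading.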