# Exploratory Data Analysis (EDA)
## YouTube Comment Analysis - Whoosh Project Corruption

This notebook performs *exploratory data analysis (EDA)*
on a dataset of YouTube comments about corruption in the Whoosh high-speed rail project.

EDA goals:
- Understand the initial structure of the data
- Check for missing values and duplicates
- Examine the distribution of comments per video
- Analyze comment length and like counts
- Identify potential noise (spam, empty comments, etc.)
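
The distribution and length goals listed above could be approached roughly as follows. This is only a sketch on a small hypothetical sample: the `sample` frame, its `vid_a`/`vid_b` IDs, and the `comment_length` column are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset.
sample = pd.DataFrame({
    "video_id": ["vid_a", "vid_a", "vid_b"],
    "comment": ["Proyek ini harus diaudit", "wkwk", "Setuju"],
    "likes": [3, 0, 1],
})

# Number of comments collected for each video.
comments_per_video = sample["video_id"].value_counts()

# Comment length in characters (assumed helper column).
sample["comment_length"] = sample["comment"].str.len()

# Summary statistics for comment length and like counts.
summary = sample[["comment_length", "likes"]].describe()

print(comments_per_video)
print(summary)
```

On the real data, the same calls would be applied to `df` instead of `sample`.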

---


### Import Libraries

In [1]:
import pandas as pd
import os

### Import Raw Dataset

In [5]:
df = pd.read_csv(r"D:\Arsip Hafizh Fadhl Muhammad\Project\project-sentimen-analisis-datmin\data\raw_dataset_whoosh.csv")
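
The absolute path above only resolves on the author's machine. One possible alternative, sketched here under assumptions, is to build the path from a project root with `os.path.join`. The `project_root` temporary folder and the one-row stand-in CSV are hypothetical and exist only so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical project root; in practice this might be the repository folder.
project_root = tempfile.mkdtemp()
data_dir = os.path.join(project_root, "data")
os.makedirs(data_dir, exist_ok=True)

# Write a tiny stand-in file so the sketch is runnable without the real dataset.
csv_path = os.path.join(data_dir, "raw_dataset_whoosh.csv")
pd.DataFrame({"comment": ["contoh"], "likes": [0]}).to_csv(csv_path, index=False)

# Load through the joined path instead of a hard-coded absolute path.
df_demo = pd.read_csv(csv_path)
print(df_demo.shape)
```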

### Check DataFrame Structure

In [8]:
df.head()

,video_id,video_title,comment_id,author,comment,likes,published_at
0,1_Xrj0mb7K4,Bedah KEGILAAN Project Whoosh,UgzlM3ejBnN9PKqB2sB4AaABAg,@MA_Alpha-l4q,SUDAH JELAS GENG SOLO YANG HARUS BERTANGGUNG J...,1,2025-12-01T06:59:37Z
1,1_Xrj0mb7K4,Bedah KEGILAAN Project Whoosh,UgzcxKsZ3V_n222CvEF4AaABAg,@nurhasanahssi2114,"Jokowi, Luhut, kroni2 yg harus bertanggungjaw...",0,2025-11-30T01:18:59Z
2,1_Xrj0mb7K4,Bedah KEGILAAN Project Whoosh,Ugx1UtqOkCAflakGpoh4AaABAg,@omsimon-k6k,Yg ditangkap gorengan yg makan duduk manis,0,2025-11-28T15:56:25Z
3,1_Xrj0mb7K4,Bedah KEGILAAN Project Whoosh,Ugz7lA1xVXYLNUEGjEp4AaABAg,@mohammadharriszulfika8486,buat bayar hutang whossh jual saja Aset tentar...,0,2025-11-28T14:11:29Z
4,1_Xrj0mb7K4,Bedah KEGILAAN Project Whoosh,UgzeBmAExGNOkXuCQO94AaABAg,@isaansyori8749,Pantas saja ngotot bgt lanjut 3 periode ternya...,1,2025-11-28T13:19:57Z


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   video_id      500 non-null    object
 1   video_title   500 non-null    object
 2   comment_id    500 non-null    object
 3   author        500 non-null    object
 4   comment       500 non-null    object
 5   likes         500 non-null    int64 
 6   published_at  500 non-null    object
dtypes: int64(1), object(6)
memory usage: 27.5+ KB


In [10]:
df.shape

(500, 7)
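
The `df.info()` output lists `published_at` as `object`, meaning the timestamps are still plain strings. If time-based analysis is needed later, they could be parsed with `pd.to_datetime`. This is a minimal sketch on two timestamps taken from the preview rows; the `sample` frame and `comments_per_day` are illustrative names.

```python
import pandas as pd

# Two timestamps copied from the df.head() preview; the frame itself is a stand-in.
sample = pd.DataFrame({
    "published_at": ["2025-12-01T06:59:37Z", "2025-11-30T01:18:59Z"],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) datetimes.
sample["published_at"] = pd.to_datetime(sample["published_at"], utc=True)

# Example follow-up: count comments per calendar day.
comments_per_day = sample["published_at"].dt.date.value_counts()

print(sample.dtypes)
print(comments_per_day)
```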

### Check Missing Values

In [12]:
df.isnull().sum().to_frame("Missing Value Count")

,Missing Value Count
video_id,0
video_title,0
comment_id,0
author,0
comment,0
likes,0
published_at,0
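
Note that `isnull()` only detects true nulls. A comment that is an empty string or whitespace only would still count as non-null. A possible extra check, sketched on hypothetical sample values, is shown here.

```python
import pandas as pd

# Hypothetical comments: one normal, one empty, one whitespace-only, one null.
sample = pd.DataFrame({"comment": ["Setuju", "", "   ", None]})

# True nulls only.
is_null = sample["comment"].isnull()

# Nulls plus strings that are empty after trimming whitespace.
is_blank = sample["comment"].fillna("").str.strip().eq("")

print("Null comments:", is_null.sum())
print("Blank or null comments:", is_blank.sum())
```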


### Check for Duplicate Data

In [13]:
if "comment_id" in df.columns:
    dup_comment_id = df.duplicated(subset="comment_id").sum()
    print("Number of duplicates based on comment_id:", dup_comment_id)
else:
    print("Column 'comment_id' not found; skipping the ID-based duplicate check.")

Number of duplicates based on comment_id: 0


In [14]:
dup_combo = df.duplicated(subset=["video_id", "comment"]).sum()
print("Number of duplicates based on (video_id, comment):", dup_combo)

Number of duplicates based on (video_id, comment): 1
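
The count alone does not show which rows are duplicated. One way to inspect them, and optionally drop the repeats, is sketched here on hypothetical sample rows. Passing `keep=False` marks every member of a duplicate group, not just the later copies. Whether duplicates should actually be removed is a decision for the cleaning step.

```python
import pandas as pd

# Hypothetical rows: the first two share the same video_id and comment text.
sample = pd.DataFrame({
    "video_id": ["vid_a", "vid_a", "vid_a", "vid_b"],
    "comment_id": ["c1", "c2", "c3", "c4"],
    "comment": ["Mantap", "Mantap", "Setuju", "Mantap"],
})

# keep=False flags all rows in each duplicate group for inspection.
dup_mask = sample.duplicated(subset=["video_id", "comment"], keep=False)
print(sample[dup_mask])

# Keep the first occurrence of each (video_id, comment) pair.
deduped = sample.drop_duplicates(subset=["video_id", "comment"], keep="first")
print(len(deduped))
```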
