# Indonesian Hate Speech Detection - Data Retrieval

This notebook handles the data retrieval and initial loading for the Indonesian hate speech detection project.

## Overview
- Load raw datasets from multiple sources
- Combine and prepare data for preprocessing  
- Initial data inspection and validation
- Save processed datasets for further analysis

## Dataset Sources
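1. **Main Dataset**: `data/raw/data.csv` - Primary hate speech dataset (tweets with `HS`, `Abusive`, and `HS_*` labels)
2. **Abusive Words**: `IndonesianAbusiveWords/data.csv` - Intended as the Indonesian abusive word dictionary; when loaded below, it has the same shape and columns as the main dataset
3. **Additional Abusive Data**: `data/raw/abusive.csv` - Single-column list of abusive words (column `ABUSIVE`)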


In [249]:
# Import required libraries
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully!")


Libraries imported successfully!


## Step 1: Data Loading and Initial Inspection

Let's start by loading all available datasets and understanding their structure.


In [250]:
# Function to safely load CSV with different encodings
def load_csv_safe(filepath, encodings=('utf-8', 'latin-1', 'cp1252', 'iso-8859-1')):
    """
    Load a CSV file, trying each encoding in order until one succeeds.

    latin-1 (alias iso-8859-1) decodes any byte sequence, so it never raises
    UnicodeDecodeError; later entries are reached only if it fails for another reason.
    Returns the DataFrame, or None if the file is missing or cannot be parsed.
    """
    # Try both relative to current dir and relative to parent dir (in case running from notebooks/)
    possible_paths = [filepath, os.path.join('..', filepath)]
    file_found = False
    
    for path in possible_paths:
        if os.path.exists(path):
            file_found = True
            for encoding in encodings:
                try:
                    df = pd.read_csv(path, encoding=encoding)
                    print(f"Successfully loaded {path} with encoding: {encoding}")
                    print(f"  Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
                    return df
                except UnicodeDecodeError:
                    print(f"  Failed with {encoding} encoding")
                except Exception as e:
                    print(f"  Error with {encoding}: {e}")
    
    # Distinguish a missing file from one that exists but could not be parsed
    reason = "could not be parsed with any encoding" if file_found else "file not found in any location"
    print(f"Failed to load {filepath} - {reason}")
    return None

# Check current working directory and available data files
print(f"Current working directory: {os.getcwd()}")
print("\n" + "="*60)
print("CHECKING DATA DIRECTORIES AND FILES")
print("="*60)

# Define possible paths (current dir and parent dir)
data_dirs = ["data/raw/", "../data/raw/"]
abusive_dirs = ["IndonesianAbusiveWords/", "../IndonesianAbusiveWords/"]

found_data_dir = None
found_abusive_dir = None

# Check for data/raw directory
for data_dir in data_dirs:
    if os.path.exists(data_dir):
        found_data_dir = data_dir
        print(f"\nData/raw directory found at: {data_dir}")
        for root, dirs, files in os.walk(data_dir):
            for file in files:
                filepath = os.path.join(root, file)
                if file.endswith('.csv'):
                    file_size = os.path.getsize(filepath) / 1024 / 1024  # MB
                    print(f"  {filepath} ({file_size:.2f} MB)")
        break

if not found_data_dir:
    print("\nData/raw directory: NOT FOUND")

# Check for IndonesianAbusiveWords directory
for abusive_dir in abusive_dirs:
    if os.path.exists(abusive_dir):
        found_abusive_dir = abusive_dir
        print(f"\nIndonesian abusive words directory found at: {abusive_dir}")
        for root, dirs, files in os.walk(abusive_dir):
            for file in files:
                filepath = os.path.join(root, file)
                if file.endswith('.csv'):
                    file_size = os.path.getsize(filepath) / 1024 / 1024  # MB
                    print(f"  {filepath} ({file_size:.2f} MB)")
        break

if not found_abusive_dir:
    print("\nIndonesian abusive words directory: NOT FOUND")

print("\n" + "="*60)


Current working directory: c:\Users\andre\Documents\GithubRepo\Data Science\indonesian-hate-speech-detection\notebooks

CHECKING DATA DIRECTORIES AND FILES

Data/raw directory found at: ../data/raw/
  ../data/raw/abusive.csv (0.00 MB)
  ../data/raw/data.csv (1.77 MB)

Indonesian abusive words directory found at: ../IndonesianAbusiveWords/
  ../IndonesianAbusiveWords/abusive.csv (0.00 MB)
  ../IndonesianAbusiveWords/data.csv (1.77 MB)
  ../IndonesianAbusiveWords/new_kamusalay.csv (0.27 MB)
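
One detail of the encoding fallback in `load_csv_safe` is worth making explicit: `latin-1` assigns a character to every possible byte value, so decoding with it cannot raise `UnicodeDecodeError`. The sketch below illustrates this on a made-up byte string (`raw` and the helper `try_decode` are hypothetical, not part of the project). A successful `latin-1` load therefore does not prove that the file was actually written in that encoding.

```python
# Hypothetical bytes: "café" as written by a cp1252/latin-1 editor.
# The trailing 0xE9 is not a valid UTF-8 sequence.
raw = b"caf\xe9"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for this encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

utf8_result = try_decode(raw, "utf-8")      # decoding fails, so this is None
latin1_result = try_decode(raw, "latin-1")  # decoding succeeds

# latin-1 accepts every one of the 256 byte values, so it never fails to decode
all_bytes = bytes(range(256))
latin1_always_decodes = try_decode(all_bytes, "latin-1") is not None

print(utf8_result, latin1_result, latin1_always_decodes)
```

If text columns show mojibake after loading, trying `cp1252` before `latin-1` is one option to consider, since `cp1252` can still fail on a few undefined bytes.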



## Step 2: Loading Main Dataset

Loading the primary Indonesian hate speech dataset.


In [251]:
# Load main dataset
print("="*60)
print("LOADING MAIN DATASET")
print("="*60)
main_df = load_csv_safe("data/raw/data.csv")

if main_df is not None:
    print(f"\nMAIN DATASET SUMMARY:")
    print(f"   Shape: {main_df.shape}")
    print(f"   Columns: {list(main_df.columns)}")
    print(f"   Memory: {main_df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
    
    print(f"\nFIRST 3 ROWS:")
    print(main_df.head(3))
    
    print(f"\nDATASET INFO:")
    main_df.info()  # info() prints its report directly and returns None
    
    # Check for common text and label columns
    text_cols = [col for col in main_df.columns if any(word in col.lower() for word in ['text', 'tweet', 'comment', 'content', 'message'])]
    label_cols = [col for col in main_df.columns if any(word in col.lower() for word in ['label', 'class', 'category', 'target', 'abusive', 'hate'])]
    
    print(f"\nIDENTIFIED COLUMNS:")
    print(f"   Potential text columns: {text_cols}")
    print(f"   Potential label columns: {label_cols}")
else:
    print("FAILED TO LOAD MAIN DATASET")


LOADING MAIN DATASET
  Failed with utf-8 encoding
Successfully loaded ..\data/raw/data.csv with encoding: latin-1
  Shape: (13169, 13), Memory: 3.26 MB

MAIN DATASET SUMMARY:
   Shape: (13169, 13)
   Columns: ['Tweet', 'HS', 'Abusive', 'HS_Individual', 'HS_Group', 'HS_Religion', 'HS_Race', 'HS_Physical', 'HS_Gender', 'HS_Other', 'HS_Weak', 'HS_Moderate', 'HS_Strong']
   Memory: 3.26 MB

FIRST 3 ROWS:
                                               Tweet  HS  Abusive  \
0  - disaat semua cowok berusaha melacak perhatia...   1        1   
1  RT USER: USER siapa yang telat ngasih tau elu?...   0        1   
2  41. Kadang aku berfikir, kenapa aku tetap perc...   0        0   

   HS_Individual  HS_Group  HS_Religion  HS_Race  HS_Physical  HS_Gender  \
0              1         0            0        0            0          0   
1              0         0            0        0            0          0   
2              0         0            0        0            0          0   

   HS_Other  H

## Step 3: Loading Additional Abusive Dataset

Loading the additional abusive dataset, a single-column list of abusive words, for comparison.


In [252]:
# Load additional abusive dataset
print("="*60)
print("LOADING ADDITIONAL ABUSIVE DATASET")
print("="*60)
abusive_df = load_csv_safe("data/raw/abusive.csv")

if abusive_df is not None:
    print(f"\nABUSIVE DATASET SUMMARY:")
    print(f"   Shape: {abusive_df.shape}")
    print(f"   Columns: {list(abusive_df.columns)}")
    print(f"   Memory: {abusive_df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
    
    print(f"\nFIRST 3 ROWS:")
    print(abusive_df.head(3))
    
    print(f"\nDATASET INFO:")
    abusive_df.info()  # info() prints its report directly and returns None
else:
    print("FAILED TO LOAD ADDITIONAL ABUSIVE DATASET")


LOADING ADDITIONAL ABUSIVE DATASET
Successfully loaded ..\data/raw/abusive.csv with encoding: utf-8
  Shape: (125, 1), Memory: 0.01 MB

ABUSIVE DATASET SUMMARY:
   Shape: (125, 1)
   Columns: ['ABUSIVE']
   Memory: 0.01 MB

FIRST 3 ROWS:
  ABUSIVE
0    alay
1   ampas
2    buta

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ABUSIVE  125 non-null    object
dtypes: object(1)
memory usage: 1.1+ KB


## Step 4: Loading Indonesian Abusive Words Dictionary

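Loading `IndonesianAbusiveWords/data.csv`, which is expected to hold the Indonesian abusive words dictionary. Note that the output of this cell reports the same shape and columns as the main dataset.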


In [253]:
# Load Indonesian abusive words dictionary
print("="*60)
print("LOADING INDONESIAN ABUSIVE WORDS DICTIONARY")
print("="*60)
abusive_words_df = load_csv_safe("IndonesianAbusiveWords/data.csv")

if abusive_words_df is not None:
    print(f"\nABUSIVE WORDS DATASET SUMMARY:")
    print(f"   Shape: {abusive_words_df.shape}")
    print(f"   Columns: {list(abusive_words_df.columns)}")
    print(f"   Memory: {abusive_words_df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
    
    print(f"\nFIRST 5 ROWS:")
    print(abusive_words_df.head(5))
    
    print(f"\nDATASET INFO:")
    abusive_words_df.info()  # info() prints its report directly and returns None
    
    # Check for unique values in each column to understand the structure
    print(f"\nCOLUMN ANALYSIS:")
    for col in abusive_words_df.columns:
        unique_count = abusive_words_df[col].nunique()
        print(f"   {col}: {unique_count} unique values")
        if unique_count <= 10:  # Show unique values if few
            print(f"      Values: {abusive_words_df[col].unique()}")
else:
    print("FAILED TO LOAD INDONESIAN ABUSIVE WORDS DATASET")


LOADING INDONESIAN ABUSIVE WORDS DICTIONARY
  Failed with utf-8 encoding
Successfully loaded ..\IndonesianAbusiveWords/data.csv with encoding: latin-1
  Shape: (13169, 13), Memory: 3.26 MB

ABUSIVE WORDS DATASET SUMMARY:
   Shape: (13169, 13)
   Columns: ['Tweet', 'HS', 'Abusive', 'HS_Individual', 'HS_Group', 'HS_Religion', 'HS_Race', 'HS_Physical', 'HS_Gender', 'HS_Other', 'HS_Weak', 'HS_Moderate', 'HS_Strong']
   Memory: 3.26 MB

FIRST 5 ROWS:
                                               Tweet  HS  Abusive  \
0  - disaat semua cowok berusaha melacak perhatia...   1        1   
1  RT USER: USER siapa yang telat ngasih tau elu?...   0        1   
2  41. Kadang aku berfikir, kenapa aku tetap perc...   0        0   
3  USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...   0        0   
4  USER USER Kaum cebong kapir udah keliatan dong...   1        1   

   HS_Individual  HS_Group  HS_Religion  HS_Race  HS_Physical  HS_Gender  \
0              1         0            0        0           
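
The shape and column list printed here match those of the main dataset, which suggests the two paths may point at copies of the same file. A minimal way to check this, sketched on small stand-in frames (`frame_a`, `frame_b`, `frame_c`, and the helper `same_content` are hypothetical, not the project data), is `DataFrame.equals`:

```python
import pandas as pd

# Stand-in frames for illustration only; in the notebook the comparison
# would be between main_df and abusive_words_df.
frame_a = pd.DataFrame({"Tweet": ["contoh satu", "contoh dua"], "HS": [1, 0]})
frame_b = frame_a.copy()
frame_c = frame_a.assign(HS=[0, 0])

def same_content(left, right):
    """True if both frames have the same shape, labels, dtypes, and values."""
    return left.equals(right)

print(same_content(frame_a, frame_b))
print(same_content(frame_a, frame_c))
```

If the comparison returns `True` for the real frames, the word-level dictionary would have to come from a different file.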

## Step 5: Data Quality Assessment

Analyzing the quality and characteristics of our datasets.


In [254]:
# Analyze main dataset
if main_df is not None:
    print("="*60)
    print("MAIN DATASET QUALITY ANALYSIS")
    print("="*60)
    print(f"Shape: {main_df.shape}")
    print(f"Memory usage: {main_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Check for missing values
    print("\nMISSING VALUES:")
    missing_data = main_df.isnull().sum()
    has_missing = False
    for col, missing in missing_data.items():
        if missing > 0:
            print(f"  {col}: {missing} ({missing/len(main_df)*100:.2f}%)")
            has_missing = True
    if not has_missing:
        print("  No missing values found")
    
    # Check data types
    print(f"\nDATA TYPES:")
    for col, dtype in main_df.dtypes.items():
        print(f"  {col}: {dtype}")
    
    # Check for duplicate rows
    duplicates = main_df.duplicated().sum()
    print(f"\nDUPLICATE ROWS: {duplicates} ({duplicates/len(main_df)*100:.2f}%)")
    
    # Basic statistics for numerical columns
    numeric_cols = main_df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(f"\nNUMERICAL COLUMNS STATISTICS:")
        print(main_df[numeric_cols].describe())
else:
    print("Main dataset not available for analysis")


MAIN DATASET QUALITY ANALYSIS
Shape: (13169, 13)
Memory usage: 3.26 MB

MISSING VALUES:
  No missing values found

DATA TYPES:
  Tweet: object
  HS: int64
  Abusive: int64
  HS_Individual: int64
  HS_Group: int64
  HS_Religion: int64
  HS_Race: int64
  HS_Physical: int64
  HS_Gender: int64
  HS_Other: int64
  HS_Weak: int64
  HS_Moderate: int64
  HS_Strong: int64

DUPLICATE ROWS: 125 (0.95%)

NUMERICAL COLUMNS STATISTICS:
                 HS       Abusive  HS_Individual      HS_Group   HS_Religion  \
count  13169.000000  13169.000000   13169.000000  13169.000000  13169.000000   
mean       0.422280      0.382945       0.271471      0.150809      0.060217   
std        0.493941      0.486123       0.444735      0.357876      0.237898   
min        0.000000      0.000000       0.000000      0.000000      0.000000   
25%        0.000000      0.000000       0.000000      0.000000      0.000000   
50%        0.000000      0.000000       0.000000      0.000000      0.000000   
75%        1.0
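
The duplicate count reported above only covers rows that are identical in every column. As a rough sketch on toy data (the frame `toy` is made up for illustration), the snippet below shows how such rows could be inspected and dropped, and how to spot a related problem that `duplicated()` does not catch: the same text carrying different labels.

```python
import pandas as pd

# Toy data for illustration; the real notebook would use main_df
toy = pd.DataFrame({
    "Tweet": ["a", "b", "a", "c", "b"],
    "HS":    [1,   0,   1,   0,   1],
})

# duplicated() flags the second and later copies of fully identical rows
n_duplicates = int(toy.duplicated().sum())

# keep=False marks every member of a duplicate group, useful for inspection
duplicate_groups = toy[toy.duplicated(keep=False)]

# Same text with different labels is a separate issue (conflicting annotations)
label_variants = toy.groupby("Tweet")["HS"].nunique()
conflicting_text = label_variants[label_variants > 1].index.tolist()

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(duplicate_groups), conflicting_text, len(deduplicated))
```

Whether duplicates should be dropped at this stage or left for preprocessing is a project decision; the sketch only shows the mechanics.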

## Step 6: Target Variable Analysis

Analyzing the distribution of hate speech and abusive content labels.


In [255]:
# Check target variable distribution (if available)
if main_df is not None:
    print("="*60)
    print("TARGET VARIABLE ANALYSIS")
    print("="*60)
    
    # Look for potential target columns (common names for hate speech classification)
    potential_targets = ['label', 'class', 'category', 'abusive', 'hate_speech', 'target', 'hs']
    target_cols = []
    
    for col in main_df.columns:
        col_lower = col.lower()
        if any(target in col_lower for target in potential_targets):
            target_cols.append(col)
    
    if target_cols:
        print(f"Found potential target columns: {target_cols}")
        
        # Analyze each target column
        for target_col in target_cols:
            print(f"\n--- Analysis for '{target_col}' ---")
            print(f"VALUE COUNTS:")
            value_counts = main_df[target_col].value_counts()
            print(value_counts)
            
            print(f"\nPERCENTAGE DISTRIBUTION:")
            percentage_dist = main_df[target_col].value_counts(normalize=True) * 100
            for val, pct in percentage_dist.items():
                print(f"  {val}: {pct:.2f}%")
                
            print(f"\nUnique values: {main_df[target_col].nunique()}")
            print(f"Data type: {main_df[target_col].dtype}")
    else:
        print("No clear target column identified. Available columns:")
        for col in main_df.columns:
            print(f"  - {col}")
        print("\nPlease verify which column contains the classification labels.")


TARGET VARIABLE ANALYSIS
Found potential target columns: ['HS', 'Abusive', 'HS_Individual', 'HS_Group', 'HS_Religion', 'HS_Race', 'HS_Physical', 'HS_Gender', 'HS_Other', 'HS_Weak', 'HS_Moderate', 'HS_Strong']

--- Analysis for 'HS' ---
VALUE COUNTS:
HS
0    7608
1    5561
Name: count, dtype: int64

PERCENTAGE DISTRIBUTION:
  0: 57.77%
  1: 42.23%

Unique values: 2
Data type: int64

--- Analysis for 'Abusive' ---
VALUE COUNTS:
Abusive
0    8126
1    5043
Name: count, dtype: int64

PERCENTAGE DISTRIBUTION:
  0: 61.71%
  1: 38.29%

Unique values: 2
Data type: int64

--- Analysis for 'HS_Individual' ---
VALUE COUNTS:
HS_Individual
0    9594
1    3575
Name: count, dtype: int64

PERCENTAGE DISTRIBUTION:
  0: 72.85%
  1: 27.15%

Unique values: 2
Data type: int64

--- Analysis for 'HS_Group' ---
VALUE COUNTS:
HS_Group
0    11183
1     1986
Name: count, dtype: int64

PERCENTAGE DISTRIBUTION:
  0: 84.92%
  1: 15.08%

Unique values: 2
Data type: int64

--- Analysis for 'HS_Religion' ---
VALUE COU
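
Because every label column is binary, the per-column loop above can be condensed: the mean of a 0/1 column is its share of positive rows. The sketch below shows this on toy labels, plus one consistency check. The check assumes that a tweet with `HS == 0` should carry no `HS_*` sub-label, which is an assumption about the labelling scheme rather than something stated in this notebook.

```python
import pandas as pd

# Toy labels for illustration; column names follow the dataset shown above
toy = pd.DataFrame({
    "HS":            [1, 0, 1, 0],
    "Abusive":       [1, 1, 0, 0],
    "HS_Individual": [1, 0, 0, 0],
    "HS_Group":      [0, 0, 1, 0],
})

label_cols = ["HS", "Abusive", "HS_Individual", "HS_Group"]

# For 0/1 labels the column mean equals the share of positive rows
positive_rate = (toy[label_cols].mean() * 100).round(2)

# Assumed rule: rows with HS == 0 should not carry any HS_* sub-label
sub_cols = [c for c in label_cols if c.startswith("HS_")]
inconsistent = toy[(toy["HS"] == 0) & (toy[sub_cols].sum(axis=1) > 0)]

print(positive_rate.to_dict())
print(len(inconsistent))
```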

## Step 7: Data Export and Summary

Saving processed datasets and providing a comprehensive summary.


## 1. Data Loading and Initial Inspection

Let's start by loading all available datasets and understanding their structure.


In [258]:
# Load main dataset
print("\n" + "="*60)
print("STEP 1: LOADING MAIN DATASET")
print("="*60)
main_df = load_csv_safe("data/raw/data.csv")

if main_df is not None:
    print(f"\nMAIN DATASET SUMMARY:")
    print(f"   Shape: {main_df.shape}")
    print(f"   Columns: {list(main_df.columns)}")
    print(f"   Memory: {main_df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
    
    print(f"\nFIRST 3 ROWS:")
    print(main_df.head(3))
    
    print(f"\nDATASET INFO:")
    main_df.info()  # info() prints its report directly and returns None
    
    # Check for common text and label columns
    text_cols = [col for col in main_df.columns if any(word in col.lower() for word in ['text', 'tweet', 'comment', 'content', 'message'])]
    label_cols = [col for col in main_df.columns if any(word in col.lower() for word in ['label', 'class', 'category', 'target', 'abusive', 'hate'])]
    
    print(f"\nIDENTIFIED COLUMNS:")
    print(f"   Potential text columns: {text_cols}")
    print(f"   Potential label columns: {label_cols}")
else:
    print("FAILED TO LOAD MAIN DATASET")



STEP 1: LOADING MAIN DATASET
  Failed with utf-8 encoding
Successfully loaded ..\data/raw/data.csv with encoding: latin-1
  Shape: (13169, 13), Memory: 3.26 MB

MAIN DATASET SUMMARY:
   Shape: (13169, 13)
   Columns: ['Tweet', 'HS', 'Abusive', 'HS_Individual', 'HS_Group', 'HS_Religion', 'HS_Race', 'HS_Physical', 'HS_Gender', 'HS_Other', 'HS_Weak', 'HS_Moderate', 'HS_Strong']
   Memory: 3.26 MB

FIRST 3 ROWS:
                                               Tweet  HS  Abusive  \
0  - disaat semua cowok berusaha melacak perhatia...   1        1   
1  RT USER: USER siapa yang telat ngasih tau elu?...   0        1   
2  41. Kadang aku berfikir, kenapa aku tetap perc...   0        0   

   HS_Individual  HS_Group  HS_Religion  HS_Race  HS_Physical  HS_Gender  \
0              1         0            0        0            0          0   
1              0         0            0        0            0          0   
2              0         0            0        0            0          0   

   HS

In [259]:
# Load additional abusive dataset
print("\n" + "="*60)
print("STEP 2: LOADING ADDITIONAL ABUSIVE DATASET")
print("="*60)
abusive_df = load_csv_safe("data/raw/abusive.csv")

if abusive_df is not None:
    print(f"\nABUSIVE DATASET SUMMARY:")
    print(f"   Shape: {abusive_df.shape}")
    print(f"   Columns: {list(abusive_df.columns)}")
    print(f"   Memory: {abusive_df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
    
    print(f"\nFIRST 3 ROWS:")
    print(abusive_df.head(3))
    
    print(f"\nDATASET INFO:")
    abusive_df.info()  # info() prints its report directly and returns None
else:
    print("FAILED TO LOAD ADDITIONAL ABUSIVE DATASET")



STEP 2: LOADING ADDITIONAL ABUSIVE DATASET
Successfully loaded ..\data/raw/abusive.csv with encoding: utf-8
  Shape: (125, 1), Memory: 0.01 MB

ABUSIVE DATASET SUMMARY:
   Shape: (125, 1)
   Columns: ['ABUSIVE']
   Memory: 0.01 MB

FIRST 3 ROWS:
  ABUSIVE
0    alay
1   ampas
2    buta

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ABUSIVE  125 non-null    object
dtypes: object(1)
memory usage: 1.1+ KB


In [260]:
# Load Indonesian abusive words dictionary
print("\n" + "="*60)
print("STEP 3: LOADING INDONESIAN ABUSIVE WORDS DICTIONARY")
print("="*60)
abusive_words_df = load_csv_safe("IndonesianAbusiveWords/data.csv")

if abusive_words_df is not None:
    print(f"\nABUSIVE WORDS DATASET SUMMARY:")
    print(f"   Shape: {abusive_words_df.shape}")
    print(f"   Columns: {list(abusive_words_df.columns)}")
    print(f"   Memory: {abusive_words_df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")
    
    print(f"\nFIRST 5 ROWS:")
    print(abusive_words_df.head(5))
    
    print(f"\nDATASET INFO:")
    abusive_words_df.info()  # info() prints its report directly and returns None
    
    # Check for unique values in each column to understand the structure
    print(f"\nCOLUMN ANALYSIS:")
    for col in abusive_words_df.columns:
        unique_count = abusive_words_df[col].nunique()
        print(f"   {col}: {unique_count} unique values")
        if unique_count <= 10:  # Show unique values if few
            print(f"      Values: {abusive_words_df[col].unique()}")
else:
    print("FAILED TO LOAD INDONESIAN ABUSIVE WORDS DATASET")



STEP 3: LOADING INDONESIAN ABUSIVE WORDS DICTIONARY
  Failed with utf-8 encoding


Successfully loaded ..\IndonesianAbusiveWords/data.csv with encoding: latin-1
  Shape: (13169, 13), Memory: 3.26 MB

ABUSIVE WORDS DATASET SUMMARY:
   Shape: (13169, 13)
   Columns: ['Tweet', 'HS', 'Abusive', 'HS_Individual', 'HS_Group', 'HS_Religion', 'HS_Race', 'HS_Physical', 'HS_Gender', 'HS_Other', 'HS_Weak', 'HS_Moderate', 'HS_Strong']
   Memory: 3.26 MB

FIRST 5 ROWS:
                                               Tweet  HS  Abusive  \
0  - disaat semua cowok berusaha melacak perhatia...   1        1   
1  RT USER: USER siapa yang telat ngasih tau elu?...   0        1   
2  41. Kadang aku berfikir, kenapa aku tetap perc...   0        0   
3  USER USER AKU ITU AKU\n\nKU TAU MATAMU SIPIT T...   0        0   
4  USER USER Kaum cebong kapir udah keliatan dong...   1        1   

   HS_Individual  HS_Group  HS_Religion  HS_Race  HS_Physical  HS_Gender  \
0              1         0            0        0            0          0   
1              0         0            0        0        

## 2. Data Quality Assessment

Now let's examine the quality and characteristics of our datasets.


In [261]:
# Analyze main dataset
if main_df is not None:
    print("\n" + "="*60)
    print("STEP 4: MAIN DATASET QUALITY ANALYSIS")
    print("="*60)
    print(f"Shape: {main_df.shape}")
    print(f"Memory usage: {main_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Check for missing values
    print("\nMISSING VALUES:")
    missing_data = main_df.isnull().sum()
    has_missing = False
    for col, missing in missing_data.items():
        if missing > 0:
            print(f"  {col}: {missing} ({missing/len(main_df)*100:.2f}%)")
            has_missing = True
    if not has_missing:
        print("  No missing values found")
    
    # Check data types
    print(f"\nDATA TYPES:")
    for col, dtype in main_df.dtypes.items():
        print(f"  {col}: {dtype}")
    
    # Check for duplicate rows
    duplicates = main_df.duplicated().sum()
    print(f"\nDUPLICATE ROWS: {duplicates} ({duplicates/len(main_df)*100:.2f}%)")
    
    # Basic statistics for numerical columns
    numeric_cols = main_df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(f"\nNUMERICAL COLUMNS STATISTICS:")
        print(main_df[numeric_cols].describe())
else:
    print("\nMain dataset not available for analysis")



STEP 4: MAIN DATASET QUALITY ANALYSIS
Shape: (13169, 13)
Memory usage: 3.26 MB

MISSING VALUES:
  No missing values found

DATA TYPES:
  Tweet: object
  HS: int64
  Abusive: int64
  HS_Individual: int64
  HS_Group: int64
  HS_Religion: int64
  HS_Race: int64
  HS_Physical: int64
  HS_Gender: int64
  HS_Other: int64
  HS_Weak: int64
  HS_Moderate: int64
  HS_Strong: int64



DUPLICATE ROWS: 125 (0.95%)

NUMERICAL COLUMNS STATISTICS:
                 HS       Abusive  HS_Individual      HS_Group   HS_Religion  \
count  13169.000000  13169.000000   13169.000000  13169.000000  13169.000000   
mean       0.422280      0.382945       0.271471      0.150809      0.060217   
std        0.493941      0.486123       0.444735      0.357876      0.237898   
min        0.000000      0.000000       0.000000      0.000000      0.000000   
25%        0.000000      0.000000       0.000000      0.000000      0.000000   
50%        0.000000      0.000000       0.000000      0.000000      0.000000   
75%        1.000000      1.000000       1.000000      0.000000      0.000000   
max        1.000000      1.000000       1.000000      1.000000      1.000000   

            HS_Race   HS_Physical     HS_Gender      HS_Other       HS_Weak  \
count  13169.000000  13169.000000  13169.000000  13169.000000  13169.000000   
mean       0.042980      0.024527      0.023236      0.284000

In [262]:
# Check target variable distribution (if available)
if main_df is not None:
    print("\n" + "="*60)
    print("STEP 5: TARGET VARIABLE ANALYSIS")
    print("="*60)
    
    # Look for potential target columns (common names for hate speech classification)
    potential_targets = ['label', 'class', 'category', 'abusive', 'hate_speech', 'target']
    target_col = None
    
    for col in main_df.columns:
        col_lower = col.lower()
        if any(target in col_lower for target in potential_targets):
            target_col = col
            break
    
    if target_col:
        print(f"Target column: '{target_col}'")
        print(f"\nVALUE COUNTS:")
        value_counts = main_df[target_col].value_counts()
        print(value_counts)
        
        print(f"\nPERCENTAGE DISTRIBUTION:")
        percentage_dist = main_df[target_col].value_counts(normalize=True) * 100
        for val, pct in percentage_dist.items():
            print(f"  {val}: {pct:.2f}%")
            
        print(f"\nUnique values: {main_df[target_col].nunique()}")
        print(f"Unique values list: {main_df[target_col].unique()}")
    else:
        print("No clear target column identified. Available columns:")
        for col in main_df.columns:
            print(f"  - {col}")
        print("\nPlease verify which column contains the classification labels.")



STEP 5: TARGET VARIABLE ANALYSIS
Target column: 'Abusive'

VALUE COUNTS:
Abusive
0    8126
1    5043
Name: count, dtype: int64

PERCENTAGE DISTRIBUTION:
  0: 61.71%
  1: 38.29%

Unique values: 2
Unique values list: [1 0]


## 3. Data Combination and Standardization

Let's combine the datasets and create a unified format for further processing.


In [263]:
# Create standardized dataset
print("\n" + "="*60)
print("STEP 6: DATA COMBINATION AND STANDARDIZATION")
print("="*60)

combined_datasets = []

# Process main dataset
if main_df is not None:
    main_processed = main_df.copy()
    main_processed['source'] = 'main_dataset'
    combined_datasets.append(main_processed)
    print(f"Added main dataset: {len(main_processed)} samples")

# Process additional abusive dataset
if abusive_df is not None:
    abusive_processed = abusive_df.copy()
    abusive_processed['source'] = 'abusive_dataset'
    combined_datasets.append(abusive_processed)
    print(f"Added abusive dataset: {len(abusive_processed)} samples")

# Combine all datasets
if combined_datasets:
    # Find common columns
    all_columns = set()
    for df in combined_datasets:
        all_columns.update(df.columns)
    
    print(f"\nAll available columns across datasets: {sorted(all_columns)}")
    
    # Try to combine datasets
    try:
        combined_df = pd.concat(combined_datasets, ignore_index=True, sort=False)
        print(f"\nSuccessfully combined datasets")
        print(f"Combined dataset shape: {combined_df.shape}")
        print(f"Source distribution:")
        print(combined_df['source'].value_counts())
        
    except Exception as e:
        print(f"\nError combining datasets: {e}")
        combined_df = None
else:
    print("No datasets available for combination")
    combined_df = None



STEP 6: DATA COMBINATION AND STANDARDIZATION
Added main dataset: 13169 samples
Added abusive dataset: 125 samples

All available columns across datasets: ['ABUSIVE', 'Abusive', 'HS', 'HS_Gender', 'HS_Group', 'HS_Individual', 'HS_Moderate', 'HS_Other', 'HS_Physical', 'HS_Race', 'HS_Religion', 'HS_Strong', 'HS_Weak', 'Tweet', 'source']

Successfully combined datasets
Combined dataset shape: (13294, 15)
Source distribution:
source
main_dataset       13169
abusive_dataset      125
Name: count, dtype: int64


In [264]:
# Create abusive words list for reference
print("\n" + "="*60)
print("STEP 7: CREATING ABUSIVE WORDS REFERENCE LIST")
print("="*60)

abusive_words_list = []

if abusive_words_df is not None:
    # Try to identify the column containing abusive words
    potential_word_cols = ['word', 'abusive_word', 'kata', 'term', 'text']
    word_col = None
    
    for col in abusive_words_df.columns:
        col_lower = col.lower()
        if any(word_col_name in col_lower for word_col_name in potential_word_cols):
            word_col = col
            break
    
    if word_col:
        abusive_words_list = abusive_words_df[word_col].dropna().unique().tolist()
        print(f"Extracted {len(abusive_words_list)} unique abusive words")
        print(f"Sample abusive words: {abusive_words_list[:10]}")
    else:
        print("Available columns in abusive words dataset:")
        for col in abusive_words_df.columns:
            print(f"  - {col}")
        print("Please identify which column contains the abusive words")
else:
    print("Abusive words dataset not available")

print(f"\nTotal abusive words available: {len(abusive_words_list)}")



STEP 7: CREATING ABUSIVE WORDS REFERENCE LIST
Available columns in abusive words dataset:
  - Tweet
  - HS
  - Abusive
  - HS_Individual
  - HS_Group
  - HS_Religion
  - HS_Race
  - HS_Physical
  - HS_Gender
  - HS_Other
  - HS_Weak
  - HS_Moderate
  - HS_Strong
Please identify which column contains the abusive words

Total abusive words available: 0
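
Step 7 found no word column because none of the candidate names appear in the loaded file, whose columns match the main dataset. The combined column listing from Step 6 shows an `ABUSIVE` column arriving with the 125-sample abusive dataset, which suggests (but does not confirm) that the word list lives in that file instead. A sketch of a stricter lookup, under that assumption: match candidate names exactly and case-insensitively, accept only text columns, and fall back to a single-column frame. The function `find_word_column` and the demo frames are hypothetical.

```python
import pandas as pd

def _is_text(series):
    """True if the column has at least one value and all non-null values are strings."""
    values = series.dropna()
    return len(values) > 0 and all(isinstance(v, str) for v in values)

def find_word_column(df, candidates=("word", "abusive_word", "kata", "term", "text", "abusive")):
    """Return the likely word column name, or None if there is no safe guess."""
    by_lower = {str(c).lower(): c for c in df.columns}
    for name in candidates:
        col = by_lower.get(name)
        # Reject numeric label columns such as a 0/1 "Abusive" flag
        if col is not None and _is_text(df[col]):
            return col
    # Fallback: a frame with exactly one text column is treated as a word list
    if len(df.columns) == 1 and _is_text(df.iloc[:, 0]):
        return df.columns[0]
    return None

# Hypothetical demo frames: a one-column word list and a labelled tweet table
words_demo = pd.DataFrame({"ABUSIVE": ["contoh", "kata"]})
labels_demo = pd.DataFrame({"Tweet": ["teks"], "Abusive": [1]})
print(find_word_column(words_demo), find_word_column(labels_demo))
```

Returning `None` for the labelled table keeps the tweet text from being mistaken for a dictionary of words.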


## 4. Data Export and Preparation

Save unmodified copies of the loaded datasets to `data/processed/`, under `raw_`-prefixed filenames, for the next stages of the pipeline.


In [265]:
# Save datasets to existing main data directory ONLY
print("="*60)
print("DATA EXPORT AND PREPARATION")
print("="*60)

# Always use the main project data directory (never create local ones)
# When running from notebooks/, the main data folder is at ../data/processed
processed_dir = "../data/processed"

# Verify the main data directory exists
if os.path.exists(processed_dir):
    print(f"Using main project data directory: {processed_dir}")
else:
    print(f"ERROR: Main data directory not found at {processed_dir}")
    print("Please ensure you're running from the notebooks/ folder and data/processed exists at project root")
    processed_dir = None

if processed_dir is not None:
    # Save individual datasets
    if main_df is not None:
        filepath = os.path.join(processed_dir, "raw_main_data.csv")
        main_df.to_csv(filepath, index=False, encoding='utf-8')
        print(f"Saved raw main data to {filepath}")

    if abusive_df is not None:
        filepath = os.path.join(processed_dir, "raw_abusive_data.csv")
        abusive_df.to_csv(filepath, index=False, encoding='utf-8')
        print(f"Saved raw abusive data to {filepath}")

    if abusive_words_df is not None:
        filepath = os.path.join(processed_dir, "raw_abusive_words_data.csv")
        abusive_words_df.to_csv(filepath, index=False, encoding='utf-8')
        print(f"Saved raw abusive words data to {filepath}")

print("\n" + "="*60)
print("DATA RETRIEVAL SUMMARY")
print("="*60)
print(f"Main dataset: {'SUCCESS' if main_df is not None else 'FAILED'}")
print(f"Additional abusive dataset: {'SUCCESS' if abusive_df is not None else 'FAILED'}")
print(f"Abusive words dictionary: {'SUCCESS' if abusive_words_df is not None else 'FAILED'}")

if main_df is not None:
    print(f"Total samples in main dataset: {len(main_df)}")
    print(f"Features available: {len(main_df.columns)}")
    
print("\nDataset ready for the next phase: DATA PREPARATION!")
print("="*60)

DATA EXPORT AND PREPARATION
Using main project data directory: ../data/processed
Saved raw main data to ../data/processed\raw_main_data.csv
Saved raw abusive data to ../data/processed\raw_abusive_data.csv
Saved raw abusive words data to ../data/processed\raw_abusive_words_data.csv

DATA RETRIEVAL SUMMARY
Main dataset: SUCCESS
Additional abusive dataset: SUCCESS
Abusive words dictionary: SUCCESS
Total samples in main dataset: 13169
Features available: 13

Dataset ready for the next phase: DATA PREPARATION!
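
The export cell writes only the three loaded datasets; `combined_df` and `abusive_words_list` are not saved. A minimal sketch of how they could be exported, assuming the filenames `raw_combined_data.csv` and `abusive_words_list.txt` and skipping anything that is empty. The helper `export_extras` is hypothetical, and the demo writes to a temporary directory rather than `../data/processed`.

```python
from pathlib import Path
import tempfile

import pandas as pd

def export_extras(combined_df, words, out_dir):
    """Write the combined dataset and word list; skip anything that is empty."""
    out_dir = Path(out_dir)
    written = []
    if combined_df is not None and not combined_df.empty:
        path = out_dir / "raw_combined_data.csv"
        combined_df.to_csv(path, index=False, encoding="utf-8")
        written.append(path)
    if words:
        path = out_dir / "abusive_words_list.txt"
        # One word per line, UTF-8 encoded
        path.write_text("\n".join(str(w) for w in words) + "\n", encoding="utf-8")
        written.append(path)
    return written

# Demo with a temporary directory and toy data (hypothetical values)
with tempfile.TemporaryDirectory() as tmp:
    demo_df = pd.DataFrame({"Tweet": ["contoh"], "source": ["main_dataset"]})
    written = export_extras(demo_df, [], tmp)
    print([p.name for p in written])
```

Passing an empty word list, as Step 7 currently produces, writes only the combined CSV and leaves no empty text file behind.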


## Summary

This notebook completed the data retrieval phase:

1. **Data Loading**: Loaded the raw datasets, falling back across several encodings
2. **Quality Assessment**: Analyzed dataset structure, missing values, and distributions
3. **Data Combination**: Concatenated the main dataset (13,169 samples) and the abusive dataset (125 samples) into `combined_df` (13,294 rows, 15 columns); the `ABUSIVE` and `Abusive` columns remain separate
4. **Export**: Saved unmodified copies of the three loaded datasets to `data/processed/`

### Next Steps:
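- Proceed to `02_data_preparation.ipynb` for text cleaning and preprocessing
- The exported `raw_*.csv` files are available in the `data/processed/` directory
- The abusive words reference list is still empty: Step 7 found no word column in the file loaded as the dictionary, so the correct source file and column must be identified before the list can be used in text analysis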

### Key Outputs:
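- `raw_main_data.csv`: Primary dataset (13,169 rows, 13 columns)
- `raw_abusive_data.csv`: Additional abusive content
- `raw_abusive_words_data.csv`: Copy of the file loaded as the abusive words dictionary; its columns match the main dataset rather than a word list

`raw_combined_data.csv` and `abusive_words_list.txt` are not written by this notebook as run: the combined dataset exists only in memory as `combined_df`, and Step 7 extracted no words.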
