<a href="https://colab.research.google.com/github/LatiefDataVisionary/deep-learning-college-task/blob/main/03_data_preprocessing_and_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 03_data_preprocessing_and_cleaning.ipynb

This notebook focuses on applying intensive cleaning and preprocessing steps to the raw Spotify app review data. The goal is to transform the raw text into a format suitable for advanced modeling, handling both English and Indonesian text.

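**Input:** `/content/spotify_reviews_english_combine.csv.gz` (English) and `spotify_reviews_indonesian_combine.csv` (Indonesian, loaded from GitHub)
**Output:** `../data/processed/reviews_cleaned_en.csv` and `../data/processed/reviews_cleaned_id.csv`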

## 1. Setup and Data Loading

This section imports all necessary libraries and loads the raw dataset into a pandas DataFrame.

In [3]:
%pip install langid Sastrawi

Collecting langid
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Building wheels for collected packages: langid
  Building wheel for langid (setup.py) ... done
  Created wheel for langid: filename=langid-1.1.6-py3-none-any.whl size=1941171 sha256=efc499b84d68806b1bcee5f25687cd6a89bc9d4f29df6cf5b59d4ce4fdc94c01
  Stored in directory: /root/.cache/pip/wheels/3c/bc/9d/266e27289b9019680d65d9b608c37bff1eff565b001c977ec5
Successfully built langid
Installing collected packages: Sastrawi, langid
Successfully installed Sastrawi-1.0.1 langid-1

In [30]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import os

# Ensure the required NLTK corpora are available, downloading any that are missing
for resource in ('stopwords', 'wordnet', 'omw-1.4'):
    try:
        nltk.data.find(f'corpora/{resource}')
    except LookupError:
        nltk.download(resource)


# Language detection library (langid is used here; langdetect is kept as a commented-out alternative)
# try:
#     from langdetect import detect, DetectorFactory
#     DetectorFactory.seed = 0 # for reproducible results
# except ImportError:
#     print("langdetect not installed. Please install it using: pip install langdetect")
try:
    import langid
except ImportError:
    print("langid not installed. Please install it using: pip install langid")

# Indonesian Stemmer
try:
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
except ImportError:
    print("Sastrawi not installed. Please install it using: pip install Sastrawi")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Load the raw English and Indonesian datasets into separate pandas DataFrames.

In [10]:
id_path = 'https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/raw/combined-raw-dataset/spotify_reviews_indonesian_combine.csv'

In [17]:
# Load the datasets
en_path = '/content/spotify_reviews_english_combine.csv.gz'  # defined so the error message below can reference it
try:
    df_en = pd.read_csv(en_path, compression='gzip')
    print("English dataset loaded successfully.")
    display(df_en.head())
except FileNotFoundError:
    print(f"Error: {en_path} not found. Please ensure the file is accessible.")
    df_en = None # Set df_en to None to avoid errors

try:
    df_id = pd.read_csv(id_path)
    print("Indonesian dataset loaded successfully.")
    display(df_id.head())
except OSError:  # also covers HTTP/URL errors, since id_path is a remote URL
    print(f"Error: {id_path} not found. Please ensure the file is accessible.")
    df_id = None # Set df_id to None to avoid errors

# Do not concatenate the dataframes as they are in different languages
# df = pd.concat([df_en, df_id], ignore_index=True)
# print("\nDatasets concatenated successfully.")
# display(df.head())
# display(df.info())

# Now you have two separate dataframes: df_en and df_id
if df_en is not None and df_id is not None:
    print("\nEnglish and Indonesian datasets loaded into separate dataframes (df_en and df_id).")
else:
    print("\nAt least one dataset could not be loaded; the corresponding dataframe is None.")

English dataset loaded successfully.


Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,language,char_length,word_count,review_month
0,97b3a330-8135-4961-963b-d5b40aeaa80a,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"they fixed it, I was just really pissy yesterd...",5,1,9.0.82.1032,2025-09-27 22:44:08,,,9.0.82.1032,English,108,19,2025-09
1,39a62a75-3998-483e-aca2-8719d3f8dd57,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,"Offline doesnt work, support doesnt help, just...",1,0,9.0.82.1032,2025-09-27 21:48:01,Hello. Thanks for bringing this to our attenti...,2025-09-27 21:52:49,9.0.82.1032,English,85,14,2025-09
2,03c513ab-df54-42a2-aa3c-7502c086d875,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,Super annoying ad experience! It feels like th...,1,5,9.0.56.591,2025-09-27 06:05:50,,,9.0.56.591,English,420,66,2025-09
3,5f769b79-7d95-4129-8078-1dc62d8f2b2b,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,👍,5,1,9.0.82.1032,2025-09-27 06:02:10,,,9.0.82.1032,English,1,1,2025-09
4,fd4665c6-ed09-4af0-94a2-06b8f5e2d8c4,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,super song for everything,5,0,9.0.80.1343,2025-09-27 06:01:26,,,9.0.80.1343,English,25,4,2025-09


Indonesian dataset loaded successfully.


Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,language,char_length,word_count,review_month
0,427e2299-b27a-4f53-89ac-15f9885207c8,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,lagu bukan hanya alunan nada tapi bisa jadi un...,1,2,,2025-09-28 05:29:11,,,,Indonesian,91,14,2025-09
1,790cee66-b937-41ba-b7a8-e831c60f63a4,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,iklan Mulu gak jelass apa apa harus premium ko...,1,0,9.0.82.1032,2025-09-28 03:08:45,,,9.0.82.1032,Indonesian,50,9,2025-09
2,2a59f6d0-fa47-4e26-b097-9ef0ddc824f6,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Terima kasih banyak 🙏👍👍👍,5,0,,2025-09-27 06:05:37,,,,Indonesian,24,4,2025-09
3,0d356a00-bd64-4539-a23f-aaa7e46c98d8,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,kok di aku mah gk bisa ada lirik ya sih tolong...,4,0,9.0.82.1032,2025-09-27 06:03:26,,,9.0.82.1032,Indonesian,116,19,2025-09
4,0349c341-fad0-4e60-8952-28dc1cb43659,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,sangat banyak lagu nya,5,0,9.0.80.1343,2025-09-27 05:56:16,,,9.0.80.1343,Indonesian,22,4,2025-09



English and Indonesian datasets loaded into separate dataframes (df_en and df_id).


Create a working copy of each DataFrame so that the cleaning and preprocessing steps never modify the original data.

In [18]:
# Create working copies of the DataFrames
df_en_cleaned = None
df_id_cleaned = None

if df_en is not None:
    df_en_cleaned = df_en.copy()
    print("Working copy of English DataFrame created.")
else:
    print("English DataFrame is None. Could not create working copy.")

if df_id is not None:
    df_id_cleaned = df_id.copy()
    print("Working copy of Indonesian DataFrame created.")
else:
    print("Indonesian DataFrame is None. Could not create working copy.")

Working copy of English DataFrame created.
Working copy of Indonesian DataFrame created.


## 2. Basic Text Cleaning

This section covers fundamental text cleaning steps that are universally applicable to most text data, regardless of language.

### Subsection 2.1: Lowercasing

Converting all text to lowercase is crucial for ensuring uniformity. This prevents the model from treating words like "Apple" and "apple" as distinct entities, which can negatively impact analysis and modeling.

In [19]:
if df_en_cleaned is not None and not df_en_cleaned.empty:
    # Convert the 'content' column to lowercase for English reviews
    df_en_cleaned['cleaned_content'] = df_en_cleaned['content'].str.lower()
    print("English text converted to lowercase.")
    display(df_en_cleaned[['content', 'cleaned_content']].head())
else:
    print("English DataFrame is empty or None. Skipping lowercasing for English.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Convert the 'content' column to lowercase for Indonesian reviews
    df_id_cleaned['cleaned_content'] = df_id_cleaned['content'].str.lower()
    print("Indonesian text converted to lowercase.")
    display(df_id_cleaned[['content', 'cleaned_content']].head())
else:
    print("Indonesian DataFrame is empty or None. Skipping lowercasing for Indonesian.")

English text converted to lowercase.


Unnamed: 0,content,cleaned_content
0,"they fixed it, I was just really pissy yesterd...","they fixed it, i was just really pissy yesterd..."
1,"Offline doesnt work, support doesnt help, just...","offline doesnt work, support doesnt help, just..."
2,Super annoying ad experience! It feels like th...,super annoying ad experience! it feels like th...
3,👍,👍
4,super song for everything,super song for everything


Indonesian text converted to lowercase.


Unnamed: 0,content,cleaned_content
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu bukan hanya alunan nada tapi bisa jadi un...
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu gak jelass apa apa harus premium ko...
2,Terima kasih banyak 🙏👍👍👍,terima kasih banyak 🙏👍👍👍
3,kok di aku mah gk bisa ada lirik ya sih tolong...,kok di aku mah gk bisa ada lirik ya sih tolong...
4,sangat banyak lagu nya,sangat banyak lagu nya


### Subsection 2.2: Removing URLs

URLs typically carry little sentiment or topical information in review text and can introduce noise. Removing them simplifies the text and reduces the vocabulary size.

In [21]:
# Regex to find URLs
url_pattern = re.compile(r'https?://\S+|www\.\S+')

if df_en_cleaned is not None and not df_en_cleaned.empty:
    # .str.replace keeps missing values as NaN (str(x) would turn them into the text 'nan')
    df_en_cleaned['cleaned_content'] = df_en_cleaned['cleaned_content'].str.replace(url_pattern, '', regex=True)
    print("URLs removed from English DataFrame.")
    display(df_en_cleaned[['content', 'cleaned_content']].head())
else:
    print("English DataFrame is empty or None. Skipping URL removal for English.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    df_id_cleaned['cleaned_content'] = df_id_cleaned['cleaned_content'].str.replace(url_pattern, '', regex=True)
    print("URLs removed from Indonesian DataFrame.")
    display(df_id_cleaned[['content', 'cleaned_content']].head())
else:
    print("Indonesian DataFrame is empty or None. Skipping URL removal for Indonesian.")

URLs removed from English DataFrame.


Unnamed: 0,content,cleaned_content
0,"they fixed it, I was just really pissy yesterd...","they fixed it, i was just really pissy yesterd..."
1,"Offline doesnt work, support doesnt help, just...","offline doesnt work, support doesnt help, just..."
2,Super annoying ad experience! It feels like th...,super annoying ad experience! it feels like th...
3,👍,👍
4,super song for everything,super song for everything


URLs removed from Indonesian DataFrame.


Unnamed: 0,content,cleaned_content
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu bukan hanya alunan nada tapi bisa jadi un...
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu gak jelass apa apa harus premium ko...
2,Terima kasih banyak 🙏👍👍👍,terima kasih banyak 🙏👍👍👍
3,kok di aku mah gk bisa ada lirik ya sih tolong...,kok di aku mah gk bisa ada lirik ya sih tolong...
4,sangat banyak lagu nya,sangat banyak lagu nya
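
To make the behavior of the URL pattern concrete, here is a minimal standalone sketch that applies the same regular expression to a few made-up strings. The sample texts and the helper name `strip_urls` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

# Same pattern as in the cell above
url_pattern = re.compile(r'https?://\S+|www\.\S+')

def strip_urls(text: str) -> str:
    """Remove http(s):// and www. URLs from a single string."""
    return url_pattern.sub('', text)

# Hypothetical sample reviews, for illustration only
samples = [
    "see https://example.com/help for details",
    "visit www.example.com now",
    "no link here",
]
for sample in samples:
    print(repr(strip_urls(sample)))
```

Each removed URL leaves its surrounding spaces behind, so a double space remains where the link used to be; the whitespace step in Subsection 2.4 collapses it.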


### Subsection 2.3: Removing Unnecessary Characters

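This step removes numbers, punctuation marks, emojis, and other special characters that are generally not relevant for sentiment analysis or topic modeling. Only lowercase ASCII letters and whitespace are kept, so accented letters are dropped as well. Because matched characters are deleted rather than replaced with a space, words joined only by punctuation (for example `good,but`) are merged into a single token.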

In [22]:
# Regex matching everything except lowercase ASCII letters and whitespace (defined once for both DataFrames)
alpha_pattern = re.compile(r'[^a-z\s]')

if df_en_cleaned is not None and not df_en_cleaned.empty:
    # .str.replace keeps missing values as NaN (str(x) would turn them into the text 'nan')
    df_en_cleaned['cleaned_content'] = df_en_cleaned['cleaned_content'].str.replace(alpha_pattern, '', regex=True)
    print("Unnecessary characters removed from English DataFrame.")
    display(df_en_cleaned[['content', 'cleaned_content']].head())
else:
    print("English DataFrame is empty or None. Skipping unnecessary character removal for English.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    df_id_cleaned['cleaned_content'] = df_id_cleaned['cleaned_content'].str.replace(alpha_pattern, '', regex=True)
    print("Unnecessary characters removed from Indonesian DataFrame.")
    display(df_id_cleaned[['content', 'cleaned_content']].head())
else:
    print("Indonesian DataFrame is empty or None. Skipping unnecessary character removal for Indonesian.")

Unnecessary characters removed from English DataFrame.


Unnamed: 0,content,cleaned_content
0,"they fixed it, I was just really pissy yesterd...",they fixed it i was just really pissy yesterda...
1,"Offline doesnt work, support doesnt help, just...",offline doesnt work support doesnt help just a...
2,Super annoying ad experience! It feels like th...,super annoying ad experience it feels like the...
3,👍,
4,super song for everything,super song for everything


Unnecessary characters removed from Indonesian DataFrame.


Unnamed: 0,content,cleaned_content
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu bukan hanya alunan nada tapi bisa jadi un...
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu gak jelass apa apa harus premium ko...
2,Terima kasih banyak 🙏👍👍👍,terima kasih banyak
3,kok di aku mah gk bisa ada lirik ya sih tolong...,kok di aku mah gk bisa ada lirik ya sih tolong...
4,sangat banyak lagu nya,sangat banyak lagu nya
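
The character filter can be tried on individual strings to see what it keeps and what it drops. The sketch below reuses the pattern `[^a-z\s]` from the cell above; the helper `keep_letters`, its `replacement` parameter, and the sample inputs are assumptions added for illustration.

```python
import re

# Pattern from the cell above: anything that is not a lowercase ASCII letter or whitespace
alpha_pattern = re.compile(r'[^a-z\s]')

def keep_letters(text: str, replacement: str = '') -> str:
    """Replace every character that is not a-z or whitespace with `replacement`."""
    return alpha_pattern.sub(replacement, text)

# Hypothetical, already lowercased inputs
print(repr(keep_letters("great app!!! 10/10 👍")))          # digits, punctuation and emoji are dropped
print(repr(keep_letters("good,but ads")))                   # deleting the comma merges two words
print(repr(keep_letters("good,but ads", replacement=' ')))  # substituting a space keeps them apart
```

Substituting a space instead of the empty string is one possible variation when reviews often omit the space after punctuation; the extra spaces it creates are removed by the whitespace step that follows.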


### Subsection 2.4: Removing Extra Whitespace

After removing various characters, there might be leading, trailing, or multiple consecutive spaces. Cleaning these ensures consistent spacing and prevents issues in subsequent tokenization steps.

In [23]:
if df_en_cleaned is not None and not df_en_cleaned.empty:
    # Remove leading/trailing whitespace
    df_en_cleaned['cleaned_content'] = df_en_cleaned['cleaned_content'].str.strip()
    # Replace multiple spaces with a single space
    df_en_cleaned['cleaned_content'] = df_en_cleaned['cleaned_content'].str.replace(r'\s+', ' ', regex=True)
    print("Extra whitespace removed from English DataFrame.")
    display(df_en_cleaned[['content', 'cleaned_content']].head())
else:
    print("English DataFrame is empty or None. Skipping whitespace removal for English.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Remove leading/trailing whitespace
    df_id_cleaned['cleaned_content'] = df_id_cleaned['cleaned_content'].str.strip()
    # Replace multiple spaces with a single space
    df_id_cleaned['cleaned_content'] = df_id_cleaned['cleaned_content'].str.replace(r'\s+', ' ', regex=True)
    print("Extra whitespace removed from Indonesian DataFrame.")
    display(df_id_cleaned[['content', 'cleaned_content']].head())
else:
    print("Indonesian DataFrame is empty or None. Skipping whitespace removal for Indonesian.")

Extra whitespace removed from English DataFrame.


Unnamed: 0,content,cleaned_content
0,"they fixed it, I was just really pissy yesterd...",they fixed it i was just really pissy yesterda...
1,"Offline doesnt work, support doesnt help, just...",offline doesnt work support doesnt help just a...
2,Super annoying ad experience! It feels like th...,super annoying ad experience it feels like the...
3,👍,
4,super song for everything,super song for everything


Extra whitespace removed from Indonesian DataFrame.


Unnamed: 0,content,cleaned_content
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu bukan hanya alunan nada tapi bisa jadi un...
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu gak jelass apa apa harus premium ko...
2,Terima kasih banyak 🙏👍👍👍,terima kasih banyak
3,kok di aku mah gk bisa ada lirik ya sih tolong...,kok di aku mah gk bisa ada lirik ya sih tolong...
4,sangat banyak lagu nya,sangat banyak lagu nya
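
The four basic steps of this section can also be chained on a single string, which makes their order easier to reason about. This is a minimal sketch outside pandas; the function name `basic_clean` and the second sample string are assumptions, while the patterns are the ones used in the cells above.

```python
import re

url_pattern = re.compile(r'https?://\S+|www\.\S+')
alpha_pattern = re.compile(r'[^a-z\s]')

def basic_clean(text: str) -> str:
    """Apply steps 2.1 to 2.4, in order, to a single string."""
    text = text.lower()                        # 2.1 lowercasing
    text = url_pattern.sub('', text)           # 2.2 URL removal
    text = alpha_pattern.sub('', text)         # 2.3 keep letters and whitespace only
    return re.sub(r'\s+', ' ', text.strip())   # 2.4 whitespace normalization

print(repr(basic_clean("Terima kasih banyak 🙏👍👍👍")))
print(repr(basic_clean("  Super annoying AD experience!  See https://example.com  ")))
print(repr(basic_clean("👍")))
```

The order matters: URL removal has to run before the character filter, because the filter strips `:`, `/` and `.`, after which the URL pattern can no longer match. A review consisting only of an emoji ends up as an empty string.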


## 3. Advanced Language-Specific Preprocessing

This section addresses the nuances of English and Indonesian text, applying language-specific techniques for more effective preprocessing.

### Subsection 3.1: Language Detection

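To apply the correct language-specific tools (like stopword removal, stemming, or lemmatization), we need to know the language of each review. Because the English and Indonesian reviews come from separate files and already carry a `language` column, no detection library is run here; the cell below only checks the label distribution in each DataFrame.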

In [24]:
# Language detection is not needed as datasets are already separated by language.
# The 'language' column from the original dataframes is already present.

if df_en_cleaned is not None and not df_en_cleaned.empty:
    print("English DataFrame language distribution:")
    display(df_en_cleaned['language'].value_counts())

if df_id_cleaned is not None and not df_id_cleaned.empty:
    print("\nIndonesian DataFrame language distribution:")
    display(df_id_cleaned['language'].value_counts())

if (df_en_cleaned is None or df_en_cleaned.empty) and (df_id_cleaned is None or df_id_cleaned.empty):
    print("Both DataFrames are empty or None. Skipping language distribution display.")

English DataFrame language distribution:


language,count
English,77078



Indonesian DataFrame language distribution:


language,count
Indonesian,59246


### Subsection 3.2: Stopword Removal

Stopwords are common words (like articles, prepositions, conjunctions) that appear frequently but usually do not contribute much meaning to the text's core message. Removing them reduces noise and helps focus on more significant terms. We will use NLTK's stopword lists for both English and Indonesian.

In [25]:
# Load NLTK's English and Indonesian stopword lists (downloaded in the setup cell)
english_stopwords = set(stopwords.words('english'))
indonesian_stopwords = set(stopwords.words('indonesian'))

# Function to remove stopwords based on the review's language label
def remove_stopwords(text, language):
    if pd.isna(text):
        return ""
    words = str(text).split() # Ensure text is string
    if language == 'English': # Use 'English' as per the actual data
        return ' '.join([word for word in words if word not in english_stopwords])
    elif language == 'Indonesian': # Use 'Indonesian' as per the actual data
        return ' '.join([word for word in words if word not in indonesian_stopwords])
    else:
        return text # Keep text as is for unknown or other languages

if df_en_cleaned is not None and not df_en_cleaned.empty:
    # Apply stopword removal to English reviews
    df_en_cleaned['cleaned_content'] = df_en_cleaned.apply(lambda row: remove_stopwords(row['cleaned_content'], row['language']), axis=1)
    print("Stopwords removed from English DataFrame.")
    display(df_en_cleaned[['content', 'cleaned_content', 'language']].head())
else:
    print("English DataFrame is empty or None. Skipping stopword removal for English.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Apply stopword removal to Indonesian reviews
    df_id_cleaned['cleaned_content'] = df_id_cleaned.apply(lambda row: remove_stopwords(row['cleaned_content'], row['language']), axis=1)
    print("Stopwords removed from Indonesian DataFrame.")
    display(df_id_cleaned[['content', 'cleaned_content', 'language']].head())
else:
    print("Indonesian DataFrame is empty or None. Skipping stopword removal for Indonesian.")

Stopwords removed from English DataFrame.


Unnamed: 0,content,cleaned_content,language
0,"they fixed it, I was just really pissy yesterd...",fixed really pissy yesterday cause spent awhil...,English
1,"Offline doesnt work, support doesnt help, just...",offline doesnt work support doesnt help anothe...,English
2,Super annoying ad experience! It feels like th...,super annoying ad experience feels like app de...,English
3,👍,,English
4,super song for everything,super song everything,English


Stopwords removed from Indonesian DataFrame.


Unnamed: 0,content,cleaned_content,language
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu alunan nada ungkapan kebebasan perdamaian...,Indonesian
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu gak jelass premium kocakk,Indonesian
2,Terima kasih banyak 🙏👍👍👍,terima kasih,Indonesian
3,kok di aku mah gk bisa ada lirik ya sih tolong...,mah gk lirik ya sih tolong apk ya bagussss bgtttt,Indonesian
4,sangat banyak lagu nya,lagu nya,Indonesian
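
The filtering logic itself is independent of NLTK and can be sketched with plain Python sets. In the example below the two tiny stopword sets are hand-picked stand-ins for NLTK's full lists (an assumption made so the block runs without downloads), and `remove_stopwords_sketch` is a hypothetical name.

```python
# Tiny hand-picked stopword sets standing in for NLTK's full lists (illustration only)
STOPWORDS = {
    'English': {'for', 'the', 'it', 'was', 'i', 'just'},
    'Indonesian': {'sangat', 'banyak', 'yang', 'di'},
}

def remove_stopwords_sketch(text: str, language: str) -> str:
    """Drop tokens found in the stopword set for `language`; other languages pass through."""
    stop = STOPWORDS.get(language)
    if stop is None:
        return text
    return ' '.join(word for word in text.split() if word not in stop)

print(remove_stopwords_sketch('super song for everything', 'English'))
print(remove_stopwords_sketch('sangat banyak lagu nya', 'Indonesian'))
print(remove_stopwords_sketch('hola mundo', 'Spanish'))
```

Matching is done on exact whitespace-separated tokens, so it depends on the earlier cleaning steps. Since apostrophes were already removed, a token such as `doesnt` only matches if that exact spelling is in the list, which is consistent with `doesnt` surviving in the English output above.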


### Subsection 3.3: Normalization of Indonesian Slang (Kamus Alay)

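Indonesian text, especially in informal contexts like app reviews, often contains slang, abbreviations, and informal spellings (known as 'bahasa gaul' or 'bahasa alay'). Normalizing these to their standard forms is essential for consistent analysis. Below is a small example dictionary; a comprehensive dictionary is needed for robust normalization. Note that this step runs after stopword removal, so slang that maps to a common function word (for example `gak` to `tidak`, or `yg` to `yang`) puts back words that stopword lists typically contain; running normalization before stopword removal would avoid this.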

In [26]:
if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Example slang dictionary (this is a small sample)
    slang_dict = {
        'ga': 'tidak', 'gak': 'tidak', 'nggak': 'tidak',
        'bgt': 'banget', 'bangett': 'banget',
        'yg': 'yang', 'iy': 'iya', 'aja': 'saja',
        'udh': 'sudah', 'dah': 'sudah',
        'kt': 'kita', 'dr': 'dari',
        'untukku': 'untuk saya', # Example of possessive normalization if needed
        'makin': 'semakin',
        'trs': 'terus'
        # Add more slang words and their formal equivalents here
    }

    # Function to normalize Indonesian slang
    def normalize_slang(text, slang_dictionary):
        if pd.isna(text):
            return ""
        words = str(text).split()
        normalized_words = [slang_dictionary.get(word, word) for word in words]
        return ' '.join(normalized_words)

    # Apply slang normalization to Indonesian reviews
    df_id_cleaned['cleaned_content'] = df_id_cleaned.apply(lambda row: normalize_slang(row['cleaned_content'], slang_dict), axis=1)
    print("Indonesian slang normalized (using a small dictionary).")
    display(df_id_cleaned[['content', 'cleaned_content', 'language']].head())
else:
    print("Indonesian DataFrame is empty or None. Skipping slang normalization.")

Indonesian slang normalized (using a small dictionary).


Unnamed: 0,content,cleaned_content,language
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu alunan nada ungkapan kebebasan perdamaian...,Indonesian
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu tidak jelass premium kocakk,Indonesian
2,Terima kasih banyak 🙏👍👍👍,terima kasih,Indonesian
3,kok di aku mah gk bisa ada lirik ya sih tolong...,mah gk lirik ya sih tolong apk ya bagussss bgtttt,Indonesian
4,sangat banyak lagu nya,lagu nya,Indonesian
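
The output above still contains lengthened spellings such as `bagussss` and `bgtttt`, which a plain dictionary lookup cannot match. One possible extension, sketched below, collapses runs of three or more identical letters before the lookup. This is not part of the notebook's method: the rule "three or more repeats means lengthening" and the name `normalize_slang_sketch` are assumptions, and the dictionary is a three-entry excerpt of the sample dictionary above.

```python
import re

# Excerpt of the sample dictionary from the cell above
slang_dict = {'gak': 'tidak', 'bgt': 'banget', 'yg': 'yang'}

# Assumption: a run of three or more identical letters is expressive lengthening
elongation = re.compile(r'([a-z])\1{2,}')

def normalize_slang_sketch(text: str) -> str:
    """Collapse lengthened letters, then map known slang tokens to standard forms."""
    words = []
    for word in text.split():
        word = elongation.sub(r'\1', word)        # collapse the run to a single letter
        words.append(slang_dict.get(word, word))  # unknown words are kept unchanged
    return ' '.join(words)

print(normalize_slang_sketch('apk ya bagussss bgtttt'))
print(normalize_slang_sketch('iklan mulu gak jelass'))
```

Doubled letters such as the `ss` in `jelass` are left alone on purpose, because many legitimate words contain double letters; handling them would need a dictionary check rather than a regex.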


### (Optional but Recommended) Subsection 3.4: Lemmatization/Stemming

Lemmatization (for English) and Stemming (for Indonesian) are techniques to reduce words to their base or root form.
- **Lemmatization:** Reduces words to their dictionary form (e.g., "running" -> "run", "better" -> "good"). This is generally preferred for English because the result is always a valid dictionary word. NLTK's WordNetLemmatizer is commonly used; note that its `lemmatize` method treats every word as a noun unless a `pos` argument is supplied, so the two examples above require `pos='v'` and `pos='a'` respectively.
- **Stemming:** Reduces words to their root form by stripping affixes (e.g., "melewati" -> "lewat", "pembelian" -> "beli"). This is a more aggressive approach and might result in non-dictionary words. Sastrawi is a popular library for Indonesian stemming.

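This step can be computationally intensive, especially on large datasets. In the run recorded below, English lemmatization finished, but the cell was stopped with a `KeyboardInterrupt` before Indonesian stemming completed, so the Indonesian text in the final output is not stemmed. The cell is shown commented out.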

In [31]:
# if df_en_cleaned is not None and not df_en_cleaned.empty or df_id_cleaned is not None and not df_id_cleaned.empty:
#     from nltk.stem import WordNetLemmatizer
#     # Initialize WordNetLemmatizer for English
#     lemmatizer = WordNetLemmatizer()

#     # Function to apply lemmatization (English) or stemming (Indonesian)
#     def normalize_word(text, language):
#         if pd.isna(text):
#             return ""
#         words = str(text).split() # Ensure text is string
#         normalized_words = []
#         for word in words:
#             if language == 'English': # Use 'English' as per the actual data
#                 # Apply lemmatization for English
#                 normalized_words.append(lemmatizer.lemmatize(word))
#             elif language == 'Indonesian': # Use 'Indonesian' as per the actual data
#                 # Apply stemming for Indonesian using Sastrawi
#                 normalized_words.append(stemmer.stem(word))
#             else:
#                 normalized_words.append(word) # Keep word as is for other languages
#         return ' '.join(normalized_words)

#     if df_en_cleaned is not None and not df_en_cleaned.empty:
#         # Apply lemmatization to English reviews
#         df_en_cleaned['cleaned_content'] = df_en_cleaned.apply(lambda row: normalize_word(row['cleaned_content'], row['language']), axis=1)
#         print("English words lemmatized.")
#         display(df_en_cleaned[['content', 'cleaned_content', 'language']].head())
#     else:
#         print("English DataFrame is empty or None. Skipping lemmatization.")

#     if df_id_cleaned is not None and not df_id_cleaned.empty:
#         # Apply stemming to Indonesian reviews
#         df_id_cleaned['cleaned_content'] = df_id_cleaned.apply(lambda row: normalize_word(row['cleaned_content'], row['language']), axis=1)
#         print("Indonesian words stemmed.")
#         display(df_id_cleaned[['content', 'cleaned_content', 'language']].head())
#     else:
#         print("Indonesian DataFrame is empty or None. Skipping stemming.")

# else:
#     print("Both DataFrames are empty or None. Skipping normalization.")

English words lemmatized.


Unnamed: 0,content,cleaned_content,language
0,"they fixed it, I was just really pissy yesterd...",fixed really pissy yesterday cause spent awhil...,English
1,"Offline doesnt work, support doesnt help, just...",offline doesnt work support doesnt help anothe...,English
2,Super annoying ad experience! It feels like th...,super annoying ad experience feel like app des...,English
3,👍,,English
4,super song for everything,super song everything,English


KeyboardInterrupt: 
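
Because reviews reuse a limited vocabulary, one common way to reduce the cost of a per-word operation is to memoize it so that each distinct word is processed only once. The sketch below illustrates the idea with `functools.lru_cache`; `slow_stem` is a hypothetical stand-in for an expensive call such as `stemmer.stem(word)`, and its suffix rule is a toy, not real Indonesian stemming. Some libraries already cache results internally, in which case the gain from an extra cache is small.

```python
from functools import lru_cache

call_log = []  # records every word that reaches the expensive function

def slow_stem(word: str) -> str:
    """Hypothetical stand-in for an expensive call such as stemmer.stem(word)."""
    call_log.append(word)
    return word[:-3] if word.endswith('nya') else word  # toy rule, not real stemming

@lru_cache(maxsize=None)
def cached_stem(word: str) -> str:
    """Memoized wrapper: each distinct word is passed to slow_stem only once."""
    return slow_stem(word)

def stem_text(text: str) -> str:
    """Stem a whitespace-tokenized string word by word through the cache."""
    return ' '.join(cached_stem(word) for word in text.split())

reviews = ['lagunya bagus', 'bagus lagunya', 'lagunya lagunya bagus']
stemmed = [stem_text(review) for review in reviews]
print(stemmed)
print(len(call_log), 'expensive calls for', sum(len(r.split()) for r in reviews), 'tokens')
```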

## 4. Finalizing and Saving the Clean Dataset

This section prepares the cleaned DataFrames for the next steps by inspecting the results, dropping unnecessary columns, and saving the processed data to new CSV files.

Inspect each DataFrame to see the original content, the cleaned content, and the language label after all preprocessing steps.

In [32]:
if df_en_cleaned is not None and not df_en_cleaned.empty:
    # Display the first few rows of the cleaned English DataFrame
    print("Final cleaned English DataFrame structure:")
    display(df_en_cleaned[['content', 'cleaned_content', 'language']].head())
else:
    print("English DataFrame is empty or None. Cannot display final structure.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Display the first few rows of the cleaned Indonesian DataFrame
    print("\nFinal cleaned Indonesian DataFrame structure:")
    display(df_id_cleaned[['content', 'cleaned_content', 'language']].head())
else:
    print("Indonesian DataFrame is empty or None. Cannot display final structure.")

Final cleaned English DataFrame structure:


Unnamed: 0,content,cleaned_content,language
0,"they fixed it, I was just really pissy yesterd...",fixed really pissy yesterday cause spent awhil...,English
1,"Offline doesnt work, support doesnt help, just...",offline doesnt work support doesnt help anothe...,English
2,Super annoying ad experience! It feels like th...,super annoying ad experience feel like app des...,English
3,👍,,English
4,super song for everything,super song everything,English



Final cleaned Indonesian DataFrame structure:


Unnamed: 0,content,cleaned_content,language
0,lagu bukan hanya alunan nada tapi bisa jadi un...,lagu alunan nada ungkapan kebebasan perdamaian...,Indonesian
1,iklan Mulu gak jelass apa apa harus premium ko...,iklan mulu tidak jelass premium kocakk,Indonesian
2,Terima kasih banyak 🙏👍👍👍,terima kasih,Indonesian
3,kok di aku mah gk bisa ada lirik ya sih tolong...,mah gk lirik ya sih tolong apk ya bagussss bgtttt,Indonesian
4,sangat banyak lagu nya,lagu nya,Indonesian


Drop columns that are no longer needed after preprocessing. We will keep the original `content` for reference, the `cleaned_content`, and the `language` column. You might choose to drop other columns depending on your needs.

In [33]:
columns_to_keep = ['content', 'cleaned_content', 'language']

if df_en_cleaned is not None and not df_en_cleaned.empty:
    # Drop columns not in the list of columns to keep for English DataFrame
    columns_to_drop_en = [col for col in df_en_cleaned.columns if col not in columns_to_keep]
    df_en_cleaned = df_en_cleaned.drop(columns=columns_to_drop_en)
    print(f"Dropped columns from English DataFrame: {columns_to_drop_en}")
    display(df_en_cleaned.head())
else:
    print("English DataFrame is empty or None. Skipping column dropping for English.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Drop columns not in the list of columns to keep for Indonesian DataFrame
    columns_to_drop_id = [col for col in df_id_cleaned.columns if col not in columns_to_keep]
    df_id_cleaned = df_id_cleaned.drop(columns=columns_to_drop_id)
    print(f"Dropped columns from Indonesian DataFrame: {columns_to_drop_id}")
    display(df_id_cleaned.head())
else:
    print("Indonesian DataFrame is empty or None. Skipping column dropping for Indonesian.")

Dropped columns from English DataFrame: ['reviewId', 'userName', 'userImage', 'score', 'thumbsUpCount', 'reviewCreatedVersion', 'at', 'replyContent', 'repliedAt', 'appVersion', 'char_length', 'word_count', 'review_month']


Unnamed: 0,content,language,cleaned_content
0,"they fixed it, I was just really pissy yesterd...",English,fixed really pissy yesterday cause spent awhil...
1,"Offline doesnt work, support doesnt help, just...",English,offline doesnt work support doesnt help anothe...
2,Super annoying ad experience! It feels like th...,English,super annoying ad experience feel like app des...
3,👍,English,
4,super song for everything,English,super song everything


Dropped columns from Indonesian DataFrame: ['reviewId', 'userName', 'userImage', 'score', 'thumbsUpCount', 'reviewCreatedVersion', 'at', 'replyContent', 'repliedAt', 'appVersion', 'char_length', 'word_count', 'review_month']


Unnamed: 0,content,language,cleaned_content
0,lagu bukan hanya alunan nada tapi bisa jadi un...,Indonesian,lagu alunan nada ungkapan kebebasan perdamaian...
1,iklan Mulu gak jelass apa apa harus premium ko...,Indonesian,iklan mulu tidak jelass premium kocakk
2,Terima kasih banyak 🙏👍👍👍,Indonesian,terima kasih
3,kok di aku mah gk bisa ada lirik ya sih tolong...,Indonesian,mah gk lirik ya sih tolong apk ya bagussss bgtttt
4,sangat banyak lagu nya,Indonesian,lagu nya


Save each finalized, clean DataFrame to its own CSV file. The index is not saved, as it is not part of the dataset.

In [34]:
import os

output_dir = '../data/processed/'
os.makedirs(output_dir, exist_ok=True)

if df_en_cleaned is not None and not df_en_cleaned.empty:
    # Define the output file path for English data
    output_file_path_en = os.path.join(output_dir, 'reviews_cleaned_en.csv')
    # Save the cleaned English DataFrame
    df_en_cleaned.to_csv(output_file_path_en, index=False)
    print(f"Cleaned English dataset saved to {output_file_path_en}")
else:
    print("English DataFrame is empty or None. Cannot save to CSV.")

if df_id_cleaned is not None and not df_id_cleaned.empty:
    # Define the output file path for Indonesian data
    output_file_path_id = os.path.join(output_dir, 'reviews_cleaned_id.csv')
    # Save the cleaned Indonesian DataFrame
    df_id_cleaned.to_csv(output_file_path_id, index=False)
    print(f"Cleaned Indonesian dataset saved to {output_file_path_id}")
else:
    print("Indonesian DataFrame is empty or None. Cannot save to CSV.")

Cleaned English dataset saved to ../data/processed/reviews_cleaned_en.csv
Cleaned Indonesian dataset saved to ../data/processed/reviews_cleaned_id.csv


The data preprocessing and cleaning steps are now complete. The `reviews_cleaned_en.csv` and `reviews_cleaned_id.csv` datasets are ready for the next stages of analysis, such as advanced Exploratory Data Analysis (EDA) and feature engineering for natural language processing tasks.