# Traveloka Sentiment Analysis

![Traveloka Logo](https://console.kr-asia.com/wp-content/uploads/2020/12/traveloka.jpg)

![Python](https://img.shields.io/badge/Python-3.12-blue)
![Machine Learning](https://img.shields.io/badge/Machine%20Learning-Sentiment%20Analysis-orange)
![Scikit-Learn](https://img.shields.io/badge/Scikit--Learn-Modeling-yellow)
![Status](https://img.shields.io/badge/Status-Completed-brightgreen)

---

### 👤 My Identity
- **Name** : Indra Styawan  
- **Role** : Data Science  
- **Domicile** : Yogyakarta, Indonesia  
- **Email** : indrastyawan0925@gmail.com  
- **LinkedIn** : www.linkedin.com/in/indrastyawan25

---

<h1><center>📈 Analysis of App Reviews for the Traveloka Application </center></h1>

### 📝 Introduction
<p align="justify">This project is a machine learning program that analyzes Play Store reviews of the Traveloka application. Its general purpose is to classify each review as positive, negative, or neutral, giving a picture of how the public responds to Traveloka, an application used to book flights, hotels, and various travel-related services.</p>

### 🎯 Objective
<p align="justify">To analyze and evaluate the sentiment that Traveloka users express about the application's services in their Play Store reviews.</p>

### 🔍 Process
- 📥 **Data Collection**: Reviews were collected from the Play Store by Traveloka application ID, using the Google-Play-Scraper library.  
- 🧹 **Data Preprocessing**: `cleaningText`, `casefoldingText`, `tokenizingText`, `filteringText`, stemming/lemmatization, and `toSentence`.  
- 🏷️ **Data Labeling**: Assigning a category or label to each review based on the available information.  
- ☁️ **Label Exploration**: Visualizing the words in each label with WordCloud.  
- ✂️ **Dataset Splitting**: Splitting the dataset into training, validation, and test sets for the model training process.  
- 🧠 **Model Building**: Building classification models with random forest, support vector machine, gradient boosting machine, and XGBoost.  
- 🏋️‍♂️ **Model Training**: Training each model on the training set, optimizing its parameters and weights so that it can recognize patterns in text.  
- ✅ **Model Validation**: Validating the model on the validation set to measure its performance and prevent overfitting.  
- 📊 **Evaluation and Tuning**: Evaluating the model on the test set and adjusting parameters if necessary.
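
The dataset splitting step can be illustrated with a small, library-free sketch. It is only an illustration: the notebook imports `train_test_split` from scikit-learn, while the helper `split_dataset`, the 80/10/10 ratios, and the sample rows below are assumptions that do not come from the project. Only the seed value `0` mirrors the notebook.

```python
import random


def split_dataset(rows, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists.

    Hypothetical helper: the ratios are assumed, not taken from the notebook.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Invented (text, label) pairs standing in for preprocessed reviews.
rows = [(f"review {i}", i % 3) for i in range(100)]
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

This sketch does not stratify by label, so with an imbalanced dataset the class proportions may differ between the three sets.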


## 1. Importing Packages

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| This section imports, and briefly describes, the packages that will be used throughout the analysis and modelling. |

In [3]:
!pip install sastrawi

Collecting sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: sastrawi
Successfully installed sastrawi-1.0.1


In [4]:
# Import library for data cleaning
import pandas as pd  # Pandas for data manipulation and analysis
pd.options.mode.chained_assignment = None  # Disable chained assignment warning
import numpy as np  # NumPy for numerical computation
seed = 0
np.random.seed(seed)  # Set seed for reproducibility
import re  # Module for working with regular expressions
import string  # Contains string constants such as punctuation marks
import nltk  # Import NLTK (Natural Language Toolkit) library
nltk.download('punkt')  # Download dataset required for text tokenization
nltk.download('stopwords')  # Download dataset containing stopword lists in various languages
from nltk.tokenize import word_tokenize  # Text tokenization
from nltk.corpus import stopwords  # Stopword list in text
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory  # Stemming (removing word affixes) for Indonesian language
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory  # Remove stopwords in Indonesian language
import csv
import requests
from io import StringIO

# Import library for visualization
import matplotlib.pyplot as plt  # Matplotlib for data visualization
import seaborn as sns  # Seaborn for statistical data visualization and style setting
from wordcloud import WordCloud  # Create a word cloud visualization from text

# Import library for preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.preprocessing import StandardScaler

# Import library for processing
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 2. Loading Data

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section the review data is loaded from a local CSV file into a DataFrame. |

---

In [5]:
app_reviews_df = pd.read_csv('traveloka_review_data.csv')
app_reviews_df.shape

(15000, 11)

In [6]:
app_reviews_df.head()

|   | reviewId | userName | userImage | content | score | thumbsUpCount | reviewCreatedVersion | at | replyContent | repliedAt | appVersion |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 249bad51-ff7c-41a2-8769-9c93164a9a2b | Pengguna Google | https://play-lh.googleusercontent.com/EGemoI2N... | kenapa kalau mau melakukan pembayaran harganya... | 1 | 31 | 5.21.0 | 2025-06-12 13:20:07 | Hai Dina, kami mohon maaf atas kekecewaan Anda... | 2025-06-12 13:56:27 | 5.21.0 |
| 1 | ac1f0759-b36e-40d1-af4f-f1a5e65d226e | Pengguna Google | https://play-lh.googleusercontent.com/EGemoI2N... | Salah satu fitur terbaik adalah informasi posi... | 5 | 45 | 5.21.0 | 2025-06-16 12:12:49 | Halo Kak, terima kasih banyak atas ratingnya. ... | 2025-06-16 12:35:06 | 5.21.0 |
| 2 | 18b085b3-547e-47c6-a095-54d6d7ccef07 | Pengguna Google | https://play-lh.googleusercontent.com/EGemoI2N... | kok traveloka skrg jd aneh, tiba2 saja limit t... | 1 | 47 | 5.20.0 | 2025-06-07 04:56:51 | Hai Wida, kami mohon maaf atas ketidaknyamanan... | 2025-06-07 05:17:08 | 5.20.0 |
| 3 | 6f5b0ca2-b296-45e8-9592-553632b91387 | Pengguna Google | https://play-lh.googleusercontent.com/EGemoI2N... | menginap di hotel o surabaya bayar by aplikasi... | 1 | 8 | 5.21.0 | 2025-06-12 14:56:08 | Hai Youle, kami mohon maaf terkait masalah den... | 2025-06-12 15:10:33 | 5.21.0 |
| 4 | 2bf470ad-aeac-4f0b-9c93-39f48f384009 | Pengguna Google | https://play-lh.googleusercontent.com/EGemoI2N... | saya mau pesan tiket kereta dari jember ke jak... | 3 | 4 | 5.21.0 | 2025-06-13 05:41:36 | Hai Pengguna Setia Traveloka, kami mohon maaf ... | 2025-06-13 06:28:55 | 5.21.0 |


In [7]:
app_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   reviewId              15000 non-null  object
 1   userName              15000 non-null  object
 2   userImage             15000 non-null  object
 3   content               15000 non-null  object
 4   score                 15000 non-null  int64 
 5   thumbsUpCount         15000 non-null  int64 
 6   reviewCreatedVersion  12333 non-null  object
 7   at                    15000 non-null  object
 8   replyContent          11153 non-null  object
 9   repliedAt             11153 non-null  object
 10  appVersion            12333 non-null  object
dtypes: int64(2), object(9)
memory usage: 1.3+ MB


## 3. Data Preprocessing

---
    
| ⚡ Description: Data preprocessing ⚡ |
| :--------------------------- |
| These preprocessing steps aim to remove noise, convert text to a consistent format, and extract important features for further analysis. |

---
- <p align = "justify"><code>cleaningText(text)</code>: Cleans the text by removing mentions, hashtags, RTs (retweets), links, numbers, and punctuation. Newline characters are also replaced with spaces, and extra spaces at the start and end of the text are removed.</p>

- <p align = "justify"><code>casefoldingText</code>: Converts all characters in the text to lowercase to make the text uniform.</p>

- <p align = "justify"><code>tokenizingText(text)</code>: Breaks the text into a list of words, or tokens, which are the basic components used in further analysis.</p>

- <p align = "justify"><code>filteringText</code>: Removes stop words from the text. The list of stop words has been extended with some additional words.</p>

- <p align = "justify">Stemming: Reduces words to their base forms. The Sastrawi library is used for stemming in Indonesian.</p>

- <p align = "justify"><code>toSentence</code>: Combines a list of words back into a sentence.</p>
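
The steps after cleaning can be sketched in plain Python to show how they chain together. This is a simplified stand-in, not the notebook's implementation: the function bodies, the regex tokenizer used in place of NLTK's `word_tokenize`, the tiny `STOPWORDS` set, and the sample review are all assumptions. The stemming step is left out because it depends on Sastrawi.

```python
import re

# Hypothetical miniature stop word list; the notebook's real list is built
# from library stop words plus extra words that are not shown here.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "sangat"}


def casefoldingText(text):
    """Lowercase the text so that 'Bagus' and 'bagus' become the same token."""
    return text.lower()


def tokenizingText(text):
    """Split the text into word tokens (regex stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text)


def filteringText(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOPWORDS]


def toSentence(tokens):
    """Join the remaining tokens back into a single string."""
    return " ".join(tokens)


# Invented sample review, not taken from the dataset.
review = "Aplikasi ini SANGAT membantu dan mudah dipakai"
tokens = filteringText(tokenizingText(casefoldingText(review)))
sentence = toSentence(tokens)
print(tokens)
print(sentence)
```

Keeping each step as its own function makes it easy to apply them one column at a time to a DataFrame and to inspect the intermediate results.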

In [8]:
# Create a new DataFrame (clean_df) by dropping rows whose 'content' value is missing (NaN)
clean_df = app_reviews_df.dropna(subset=['content'])

In [9]:
# Remove rows whose 'content' duplicates an earlier review, keeping the first occurrence
clean_df = clean_df.drop_duplicates(subset=['content'])

clean_df.shape

(14945, 11)

In [10]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14945 entries, 0 to 14999
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   reviewId              14945 non-null  object
 1   userName              14945 non-null  object
 2   userImage             14945 non-null  object
 3   content               14945 non-null  object
 4   score                 14945 non-null  int64 
 5   thumbsUpCount         14945 non-null  int64 
 6   reviewCreatedVersion  12282 non-null  object
 7   at                    14945 non-null  object
 8   replyContent          11103 non-null  object
 9   repliedAt             11103 non-null  object
 10  appVersion            12282 non-null  object
dtypes: int64(2), object(9)
memory usage: 1.4+ MB


#### Cleaning dataframe

In [11]:
def cleaningText(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # remove hashtags
    text = re.sub(r'RT[\s]', '', text)  # remove retweet markers ("RT" followed by whitespace)
    text = re.sub(r"http\S+", '', text)  # remove URLs
    text = re.sub(r'[0-9]+', '', text)  # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove everything except word characters and whitespace
    text = re.sub(r'(.)\1+', r'\1\1', text)  # reduce repeated characters to at most two
    text = re.sub(r'\b(\w+)(?:\W\1\b)+', r'\1', text, flags=re.IGNORECASE)  # collapse consecutive repeats of a word
    text = re.sub(r'\b\w{1,3}\b', '', text)  # remove words with 1 to 3 characters
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove leftover punctuation such as underscores

    text = re.sub(r'\s+', ' ', text)  # collapse whitespace last, so newlines and gaps left by removed words become single spaces
    text = text.strip()  # remove leading and trailing whitespace
    return text
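
To see what some of these substitutions do in isolation, the short sketch below applies three of the patterns from `cleaningText` to invented sample strings (the strings are assumptions, not rows from the dataset). It also shows that removing short words leaves a doubled space behind, which a whitespace collapse afterwards closes.

```python
import re

# Invented sample strings for illustration only.
stretched = "bagusssss bangettt"
repeated = "bagus bagus sekali"
short_words = "saya mau pesan tiket"

# Runs of the same character are cut down to two.
squeezed = re.sub(r'(.)\1+', r'\1\1', stretched)

# A word repeated back to back is kept once.
deduplicated = re.sub(r'\b(\w+)(?:\W\1\b)+', r'\1', repeated, flags=re.IGNORECASE)

# Words of one to three characters are dropped, leaving a gap behind.
without_short = re.sub(r'\b\w{1,3}\b', '', short_words)

# Collapsing whitespace afterwards closes that gap.
collapsed = re.sub(r'\s+', ' ', without_short).strip()

print(squeezed)
print(deduplicated)
print(repr(without_short))
print(collapsed)
```

Note that the short-word rule removes every word of up to three characters, so its effect on a sample of real reviews is worth checking before relying on it.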
