<a href="https://colab.research.google.com/github/Pras89tyo/BigData/blob/main/UASBigData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark



**Import Modules and Create the Spark Session**

In [27]:
# Import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize the Spark session
spark = SparkSession.builder \
    .appName("MarketplaceSentimentAnalysis") \
    .getOrCreate()

**1. Load and Read Data**

In [28]:
print("Step 1: Load and read the data")
# Read the dataset; the file is semicolon-delimited, so set sep explicitly
df = spark.read.csv("/content/dataset_final.csv", header=True, inferSchema=True, sep=";")

Step 1: Load and read the data


**Show the Schema and Row Count**

In [29]:
# Show the dataset schema
print("\nDataset schema:")
df.printSchema()

# Show the number of rows
print("\nTotal number of rows:", df.count())


Dataset schema:
root
 |-- userName: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- content: string (nullable = true)
 |-- Layanan: integer (nullable = true)
 |-- Fitur: integer (nullable = true)
 |-- Kebermanfaatan: integer (nullable = true)
 |-- Bisnis: integer (nullable = true)
 |-- Non Aspek: integer (nullable = true)


Total number of rows: 3000


**Show the First Rows**

In [30]:
# Show the first few rows
print("\nFirst 5 rows:")
df.show(5, truncate=False)


First 5 rows:
+-------------+-----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-----+--------------+------+---------+
|userName     |score|content                                                                                                                                                              |Layanan|Fitur|Kebermanfaatan|Bisnis|Non Aspek|
+-------------+-----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-----+--------------+------+---------+
|Kusyati Nisa |5    |Mantaf skali mudah juga buat ikut pelatihan prakerja sangat mudah skali 👍sungguh bermanfaat bagi kami                                                               |0      |0    |1             |0     |0        |
|Nurul Latifah|5    |keren buat beli kar

**Create a Sentiment Column Based on the Score Column**

In [31]:
# Creating sentiment column based on the score column
df = df.withColumn("sentiment",
    when(col("score") >= 4, "positive")
    .when(col("score") == 3, "neutral")
    .otherwise("negative")
)
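
The `when`/`otherwise` chain above amounts to a three-way bucketing of the star rating. A minimal plain-Python sketch of the same rule follows; the function name `score_to_sentiment` is only illustrative and is not part of the notebook. It assumes that a missing score ends up in the `otherwise` branch, since a null comparison does not satisfy either `when` condition.

```python
def score_to_sentiment(score):
    """Mirror the when/otherwise chain: >= 4 positive, == 3 neutral, else negative."""
    # A missing score matches neither condition, so it falls through to "negative".
    if score is not None and score >= 4:
        return "positive"
    if score is not None and score == 3:
        return "neutral"
    return "negative"


for s in range(1, 6):
    print(s, score_to_sentiment(s))
```

Note that ratings 1 and 2 share the negative bucket while only a rating of exactly 3 counts as neutral, which helps explain why the neutral class is the smallest one.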

**2. Split Data**

In [32]:
# Split the data into training and test sets
print("\nStep 2: Split the data into training and test sets")
training_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
print(f"Training rows: {training_data.count()}")
print(f"Test rows: {test_data.count()}")


Step 2: Split the data into training and test sets
Training rows: 2451
Test rows: 549
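
The split is 2451/549 rather than an exact 2400/600 because `randomSplit` assigns each row to a split at random according to the weights, so the sizes are only approximately 80/20. The sketch below illustrates that idea in plain Python. It is not Spark's sampler, the helper `random_split` is hypothetical, and it will not reproduce the counts above.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.8, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point remainder.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(3000), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, but the sizes still fluctuate around the requested proportions.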


**3. Tokenization**

In [33]:
print("\nStep 3: Tokenization")
# Use the 'content' column as input
tokenizer = Tokenizer(inputCol="content", outputCol="words")
# Preview only: the pipeline in Step 6 applies the tokenizer again to the training data
tokenized = tokenizer.transform(df)
print("\nTokenization result (first 5 rows):")
tokenized.select("content", "words").show(5, truncate=False)


Step 3: Tokenization

Tokenization result (first 5 rows):
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|content                                                                                                                                                              |words                                                                                                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------
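
`Tokenizer` lowercases the text and splits it on whitespace, with no other normalization. The sketch below approximates that behavior in plain Python on a shortened version of the first review; `simple_tokenize` is an illustrative name, not a Spark function.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split on single whitespace characters."""
    return re.split(r"\s", text.lower())


print(simple_tokenize("Mantaf skali mudah 👍sungguh bermanfaat"))
# ['mantaf', 'skali', 'mudah', '👍sungguh', 'bermanfaat']
```

Emoji and punctuation stay attached to neighboring words (`👍sungguh`), so such tokens are treated as distinct vocabulary entries. If that matters, `RegexTokenizer` from `pyspark.ml.feature` could be considered as an alternative.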

**4. Stopword Removal**

In [34]:
print("\nStep 4: Stopword removal")
# Extended Indonesian stopword list (formal words plus common informal spellings)
indonesian_stopwords = [
    "yang", "di", "ke", "dari", "pada", "dalam", "untuk", "dengan", "dan", "atau",
    "ini", "itu", "juga", "sudah", "saya", "anda", "dia", "mereka", "kita", "akan",
    "bisa", "ada", "tidak", "saat", "oleh", "setelah", "para", "seperti",
    "hal", "ketika", "bagi", "sampai", "tentang", "hingga", "sebuah", "yakni",
    "maupun", "selama", "dimana", "tetap", "masih", "lalu", "telah", "tapi",
    "nya", "ya", "sih", "kok", "gak", "ga", "tuh", "si", "deh", "tau", "kan",
    "kalo", "kalau", "yg", "jd", "dgn", "gue", "aja"
]
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words", stopWords=indonesian_stopwords)
filtered = remover.transform(tokenized)
print("\nStopword removal result (first 5 rows):")
filtered.select("words", "filtered_words").show(5, truncate=False)


Step 4: Stopword removal

Stopword removal result (first 5 rows):
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|words                                                                                                                                                                                        |filtered_words                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------
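
Stopword removal is a plain membership filter over the token list. The sketch below shows the idea with a small subset of the list above; `remove_stopwords` is an illustrative helper and the token lists are shortened examples.

```python
# A small subset of the notebook's stopword list, for illustration only.
STOPWORDS = {"yang", "juga", "bagi", "tidak"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["mantaf", "skali", "mudah", "juga", "buat", "ikut", "pelatihan", "prakerja"]
print(remove_stopwords(tokens))
# ['mantaf', 'skali', 'mudah', 'buat', 'ikut', 'pelatihan', 'prakerja']

print(remove_stopwords(["tidak", "bagus"]))
# ['bagus']
```

The second call shows one consequence of the chosen list: because negations such as "tidak", "gak", and "ga" are treated as stopwords, "tidak bagus" (not good) is reduced to "bagus" (good). For sentiment analysis it may be worth keeping negation words, though that is a design choice the notebook does not explore.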

**5. Conversion to Numeric Features**

In [35]:
# 5. Convert text to numeric features
print("\nStep 5: Convert text to numeric features")
# minDF=2.0 keeps only terms that appear in at least two documents
vectorizer = CountVectorizer(inputCol="filtered_words", outputCol="raw_features", minDF=2.0)
idf = IDF(inputCol="raw_features", outputCol="features")

# Convert the sentiment label to a numeric index
label_indexer = StringIndexer(inputCol="sentiment", outputCol="label")


Step 5: Convert text to numeric features
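
To make the two stages concrete, the sketch below reproduces the arithmetic on three toy documents: a vocabulary filtered by a minimum document frequency of 2, then term counts scaled by the smoothed IDF that Spark documents, `log((m + 1) / (df + 1))`, where `m` is the number of documents. The documents and helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already tokenized and filtered).
docs = [
    ["aplikasi", "bagus", "mudah"],
    ["aplikasi", "lambat"],
    ["aplikasi", "bagus", "bagus"],
]


def build_vocab(docs, min_df=2):
    """Document frequency per term, keeping terms found in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: n for term, n in df.items() if n >= min_df}


def tf_idf(doc, doc_freq, n_docs):
    """Raw term count times log((n_docs + 1) / (df + 1))."""
    counts = Counter(t for t in doc if t in doc_freq)
    return {t: c * math.log((n_docs + 1) / (doc_freq[t] + 1)) for t, c in counts.items()}


doc_freq = build_vocab(docs)
print(doc_freq)
for doc in docs:
    print(tf_idf(doc, doc_freq, len(docs)))
```

"mudah" and "lambat" each appear in a single document, so they are dropped from the vocabulary. "aplikasi" appears in every document and receives a weight of zero, so it carries no signal for the classifier.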


**6. Modeling and Train Model**

In [36]:
# 6. Modeling
print("\nStep 6: Modeling with Logistic Regression")
# elasticNetParam=0 selects a pure L2 (ridge) penalty
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

# Build the pipeline
pipeline = Pipeline(stages=[
    tokenizer,
    remover,
    vectorizer,
    idf,
    label_indexer,
    lr
])
# Train the model
print("\nTraining the model...")
model = pipeline.fit(training_data)


Step 6: Modeling with Logistic Regression

Training the model...
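
The two regularization settings combine into an elastic-net penalty of the form `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The sketch below evaluates only that penalty term for a hypothetical weight vector; it leaves out the log-loss itself and any feature standardization Spark applies during fitting.

```python
def elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0):
    """Penalty added to the loss: a mix of L1 and L2 terms scaled by reg_param."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared)


weights = [0.5, -1.0, 2.0]  # hypothetical coefficients, not taken from the trained model
print(elastic_net_penalty(weights))                          # pure L2, as in the notebook
print(elastic_net_penalty(weights, elastic_net_param=1.0))   # pure L1, for comparison
```

With `elasticNetParam=0` the L1 term vanishes, so coefficients are shrunk toward zero but not set exactly to zero. A `regParam` of 0.3 is fairly strong, which may pull predictions toward the majority class.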


**7. Prediction**

In [37]:
# 7. Prediction
print("\nStep 7: Prediction")
predictions = model.transform(test_data)
print("\nPrediction result (first 5 rows):")
predictions.select("content", "score", "sentiment", "prediction").show(5, truncate=False)


Step 7: Prediction

Prediction result (first 5 rows):
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+---------+----------+
|content                                                                                                                                                                                                                                              |score|sentiment|prediction|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+---------+----------+
|Terbaik dari aplikasi yg lainnya...Alhamdulillah penghasilan bertambah ..yg saya suka dari aplikasi ini ada fitur untung 
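
The `prediction` column holds a numeric label index rather than a sentiment string. By default `StringIndexer` orders labels by descending frequency, so the most frequent label becomes 0.0. The sketch below derives that mapping from the class counts reported in the distribution output of Step 8. Those counts cover the full dataset, whereas the indexer in the pipeline is fit on the training split, so the mapping is an assumption that should be checked against the fitted model.

```python
def frequency_desc_index(counts):
    """Index labels by descending count, breaking ties alphabetically."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {label: float(i) for i, (label, _) in enumerate(ordered)}


counts = {"positive": 1738, "neutral": 199, "negative": 1063}
label_to_index = frequency_desc_index(counts)
index_to_label = {index: label for label, index in label_to_index.items()}

print(label_to_index)
# {'positive': 0.0, 'negative': 1.0, 'neutral': 2.0}
print(index_to_label[1.0])
# negative
```

Under this assumption a prediction of 0.0 reads as positive, 1.0 as negative, and 2.0 as neutral. Inside Spark, `IndexToString` from `pyspark.ml.feature` can attach readable labels to the predictions.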

**8. Evaluation**

In [38]:
# 8. Evaluation
print("\nStep 8: Evaluation")
evaluator_accuracy = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)
accuracy = evaluator_accuracy.evaluate(predictions)

# Precision and recall are averaged over classes, weighted by class frequency
evaluator_precision = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="weightedPrecision"
)
precision = evaluator_precision.evaluate(predictions)

evaluator_recall = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="weightedRecall"
)
recall = evaluator_recall.evaluate(predictions)

print("\nEvaluation metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

# Show the sentiment distribution over the full dataset
print("\nSentiment distribution:")
df.groupBy("sentiment").count().show()

# Stop the Spark session
spark.stop()


Step 8: Evaluation

Evaluation metrics:
Accuracy: 0.7887
Precision: 0.7390
Recall: 0.7887

Sentiment distribution:
+---------+-----+
|sentiment|count|
+---------+-----+
| positive| 1738|
|  neutral|  199|
| negative| 1063|
+---------+-----+
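
Accuracy and recall are identical (0.7887) because recall here is the frequency-weighted average of per-class recall, which always reduces to accuracy. The sketch below computes the three metrics from scratch on a small made-up example in which class 2 is never predicted; the labels, predictions, and helper name are illustrative only.

```python
from collections import Counter


def weighted_metrics(labels, preds):
    """Return (accuracy, weighted precision, weighted recall)."""
    n = len(labels)
    support = Counter(labels)      # true rows per class
    predicted = Counter(preds)     # predicted rows per class
    correct = Counter(l for l, p in zip(labels, preds) if l == p)
    accuracy = sum(correct.values()) / n
    weighted_precision = sum(
        support[c] / n * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    weighted_recall = sum(support[c] / n * (correct[c] / support[c]) for c in support)
    return accuracy, weighted_precision, weighted_recall


# Hypothetical labels and predictions: class 2 is never predicted.
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
preds = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

acc, w_prec, w_rec = weighted_metrics(labels, preds)
print(f"accuracy={acc:.2f} weighted_precision={w_prec:.2f} weighted_recall={w_rec:.2f}")
# accuracy=0.50 weighted_precision=0.35 weighted_recall=0.50
```

A class that is never predicted contributes zero precision and zero recall, which lowers weighted precision below accuracy. With only 199 neutral reviews out of 3000, a similar effect is plausible in the results above, though per-class metrics would be needed to confirm it.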

