# Lab Assignment: Sentiment Analysis on IMDB Movie Reviews 

## Objective 
- **Load and preprocess** an IMDB movie reviews dataset using PySpark MLlib.  
- **Train a classifier** to predict the sentiment of movie reviews as positive or negative.  
- **Evaluate model performance** using Accuracy, Precision, Recall, and F1-score.  

---

## Instructions 

### Download the IMDB Reviews Dataset:  
This dataset contains 50,000 movie reviews labeled as positive or negative, which will be used to build a sentiment classification model.  

### Your goal:  
1. **Load and preprocess** the dataset, ensuring valid movie reviews and sentiment labels.  
2. **Convert text labels** into binary format (0 = negative, 1 = positive).  
3. **Clean the text data** by removing stopwords, punctuation, and lowercasing.  
4. **Convert text reviews** into numerical features using TF-IDF or Word2Vec.  
5. **Split the dataset** into training (80%) and testing (20%) sets.  
6. **Train a classification model** in PySpark MLlib.  
7. **Evaluate the model** using Accuracy, Precision, Recall, and F1-score.  

---

## Submission 
- **Submission deadline**: 2 weeks from the assignment date.  
- **Submission format**: Upload the executed notebook (or an equivalent file) to the LMS ([lms.siu.edu.vn](https://lms.siu.edu.vn)).  

---

## Suggested Resources 
- [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
- [PySpark SQL Guide](https://spark.apache.org/sql/)  

---

## Student Information 
- **Name**: Thái Hồ Phú Gia  
- **Class**: 23MMT  
- **Student ID**: 11012302891  

### Step 0: Download the dataset (run only if the dataset has not been downloaded yet)

In [None]:
# %%bash
# mkdir -p data && cd data

# wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

# tar -xzf aclImdb_v1.tar.gz

# rm aclImdb_v1.tar.gz

### Step 1: Initialize Spark and import libraries

In [2]:
%pip install findspark pyspark numpy pandas matplotlib seaborn scikit-learn

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import lit

# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder \
    .appName("IMDB_Sentiment") \
    .getOrCreate()

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


25/05/07 08:14:21 WARN Utils: Your hostname, codespaces-41a89b resolves to a loopback address: 127.0.0.1; using 10.0.0.208 instead (on interface eth0)
25/05/07 08:14:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/07 08:14:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Step 2: Load the dataset

We read the `.txt` files directly from:
- `data/aclImdb/train/pos/` (label = 1)
- `data/aclImdb/train/neg/` (label = 0)
- and likewise for `data/aclImdb/test/`
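
The directory-to-label mapping listed above can also be written as a small function. The following plain-Python sketch is only an illustration: `label_from_path` is a hypothetical helper, and the file names in the example calls are made up to resemble the dataset's layout.

```python
import re
from typing import Optional

# Binary encoding used throughout this notebook: 0 = negative, 1 = positive
LABEL_BY_DIR = {"neg": 0, "pos": 1}

# Matches paths such as .../train/pos/<name>.txt or .../test/neg/<name>.txt
PATH_PATTERN = re.compile(r"/(train|test)/(pos|neg)/[^/]+\.txt$")


def label_from_path(path: str) -> Optional[int]:
    """Hypothetical helper: return 1 for pos/, 0 for neg/, None for anything else."""
    match = PATH_PATTERN.search(path)
    if match is None:
        return None
    return LABEL_BY_DIR[match.group(2)]


# Illustrative paths only; the exact file names are assumptions
print(label_from_path("data/aclImdb/train/pos/0_9.txt"))
print(label_from_path("data/aclImdb/test/neg/12_1.txt"))
print(label_from_path("data/aclImdb/train/unsup/3_0.txt"))
```

Returning `None` for unmatched paths makes it easy to filter out files that carry no sentiment label before training.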

In [3]:
# Positive training reviews: one row per line of text, labeled 1
train_pos = spark.read.text("data/aclImdb/train/pos/*.txt") \
    .withColumnRenamed("value", "review") \
    .withColumn("label", lit(1))

# Negative training reviews, labeled 0
train_neg = spark.read.text("data/aclImdb/train/neg/*.txt") \
    .withColumnRenamed("value", "review") \
    .withColumn("label", lit(0))

# Same layout for the test split
test_pos = spark.read.text("data/aclImdb/test/pos/*.txt") \
    .withColumnRenamed("value", "review") \
    .withColumn("label", lit(1))

test_neg = spark.read.text("data/aclImdb/test/neg/*.txt") \
    .withColumnRenamed("value", "review") \
    .withColumn("label", lit(0))

# Combine positive and negative reviews within each split
train_df = train_pos.union(train_neg)
test_df  = test_pos.union(test_neg)
# All 50,000 reviews together (not used by the later cells)
full_df  = train_df.union(test_df)

print(f"Train size = {train_df.count()}, Test size = {test_df.count()}")
train_df.show(5, truncate=100)


                                                                                

Train size = 25000, Test size = 25000
+----------------------------------------------------------------------------------------------------+-----+
|                                                                                              review|label|
+----------------------------------------------------------------------------------------------------+-----+
|Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba...|    1|
|**Attention Spoilers**<br /><br />First of all, let me say that Rob Roy is one of the best films ...|    1|
|Titanic directed by James Cameron presents a fictional love story on the historical setting of th...|    1|
|By now you've probably heard a bit about the new Disney dub of Miyazaki's classic film, Laputa: C...|    1|
|*!!- SPOILERS - !!*<br /><br />Before I begin this, let me say that I have had both the advantage...|    1|
+----------------------------------------------------------------------------------------------------+-----+
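
The cell above keeps the dataset's predefined 25,000/25,000 split and builds `full_df` without using it afterwards. The 80%/20% split named in the assignment could instead be drawn from the combined data; in PySpark this is typically done with `DataFrame.randomSplit([0.8, 0.2], seed=...)`, which samples row by row and therefore gives only approximately 80/20 sizes. The plain-Python sketch below illustrates the idea of a seeded, reproducible split; `split_rows` is a hypothetical helper, not Spark code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_rows(rows: Sequence[T], train_fraction: float = 0.8,
               seed: int = 42) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: shuffle with a fixed seed, then cut at train_fraction."""
    rng = random.Random(seed)      # local generator keeps the split reproducible
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the 50,000 reviews
rows = list(range(100))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed matters for a graded notebook: rerunning the cell yields the same split, so the reported metrics stay comparable between runs.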

### Step 3: Text preprocessing

- Tokenization
- Lowercasing and punctuation removal (Spark's `Tokenizer` lowercases automatically, but it splits on whitespace only and does not strip punctuation)
- Stopword removal
- Conversion to numeric vectors (TF-IDF or Word2Vec)
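
The cleaning steps listed above can be illustrated in plain Python. This is a minimal sketch rather than the pipeline's actual behavior: `clean_review` is a hypothetical helper, and the stopword set is a tiny assumed sample (Spark's `StopWordsRemover` ships a much longer default English list). In Spark, a comparable effect could likely be obtained by swapping `Tokenizer` for `RegexTokenizer` with a suitable `pattern`.

```python
import re
from typing import List

# Tiny illustrative stopword set (an assumption, not Spark's default list)
STOPWORDS = {"a", "an", "the", "is", "it", "of", "and", "this", "was", "i"}


def clean_review(text: str) -> List[str]:
    """Hypothetical helper: strip HTML tags, lowercase, drop punctuation and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove tags such as <br />
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]


sample = "**Attention Spoilers**<br /><br />First of all, Rob Roy is one of the best films!"
print(clean_review(sample))
```

Removing the `<br />` tags before stripping punctuation avoids leaving a stray `br` token in every review that contains a line break.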

In [4]:
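# Split each review into lowercase, whitespace-separated tokens
tokenizer = Tokenizer(inputCol="review", outputCol="words")
# Drop common English stopwords
remover   = StopWordsRemover(inputCol="words", outputCol="filtered")
# Hash tokens into a fixed-size term-frequency vector
hashTF    = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=10000)
# Rescale term frequencies by inverse document frequency
idf       = IDF(inputCol="rawFeatures", outputCol="features")
lr        = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)

# Chain preprocessing and the classifier so they are fitted together
pipeline  = Pipeline(stages=[tokenizer, remover, hashTF, idf, lr])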
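
To make the `HashingTF` and `IDF` stages less opaque, here is a small plain-Python sketch of the same idea on toy data. It uses the IDF formula given in Spark's MLlib guide, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term. The hash function is a stand-in (CRC32, whereas Spark uses MurmurHash3), and the bucket count is kept tiny on purpose, so the indices will not match Spark's.

```python
import math
import zlib
from collections import Counter
from typing import Dict, List

NUM_FEATURES = 16  # tiny for illustration; the pipeline above uses 10000


def bucket(term: str) -> int:
    """Map a term to a feature index (CRC32 stand-in for Spark's MurmurHash3)."""
    return zlib.crc32(term.encode("utf-8")) % NUM_FEATURES


def hashing_tf(tokens: List[str]) -> Dict[int, int]:
    """Count how many tokens of one document fall into each bucket."""
    return dict(Counter(bucket(tok) for tok in tokens))


def fit_idf(docs_tf: List[Dict[int, int]]) -> Dict[int, float]:
    """Compute log((m + 1) / (df + 1)) per bucket."""
    m = len(docs_tf)
    doc_freq = Counter(idx for tf in docs_tf for idx in tf)
    return {idx: math.log((m + 1) / (n + 1)) for idx, n in doc_freq.items()}


def tf_idf(tf: Dict[int, int], idf: Dict[int, float]) -> Dict[int, float]:
    """Scale raw term counts by the fitted IDF weights."""
    return {idx: count * idf[idx] for idx, count in tf.items()}


# Toy corpus (made-up tokens); "film" appears in every document
docs = [["great", "film"], ["bad", "film"], ["film", "great", "great"]]
docs_tf = [hashing_tf(doc) for doc in docs]
idf_weights = fit_idf(docs_tf)
vectors = [tf_idf(tf, idf_weights) for tf in docs_tf]
print(vectors[2])
```

A term that occurs in every document, like `film` here, receives an IDF of `log(1) = 0` and so contributes nothing to the feature vector, which is the intended effect of the rescaling.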


### Step 4: Training and evaluation

- Fit the pipeline on `train_df`  
- Predict on `test_df`  
- Compute Accuracy, Precision, Recall, and F1-score

In [5]:
from pyspark.sql import Row

# Fit the full pipeline on the training set, then score the held-out test set
model       = pipeline.fit(train_df)
predictions = model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(labelCol="label")

# Collect accuracy and the class-frequency-weighted precision, recall, and F1
metrics = []
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    score = evaluator.setMetricName(metric).evaluate(predictions)
    metrics.append(Row(Metric=metric, Score=score))

metrics_df = spark.createDataFrame(metrics)
# Keep only the first 6 characters of each score for display (truncation, not rounding)
metrics_df = metrics_df.withColumn("Score", metrics_df["Score"].cast("double").alias("Score").substr(0, 6))
metrics_df.show(truncate=False)

spark.stop()

25/05/07 08:19:01 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
                                                                                

+-----------------+------+
|Metric           |Score |
+-----------------+------+
|accuracy         |0.7869|
|weightedPrecision|0.7869|
|weightedRecall   |0.7869|
|f1               |0.7869|
+-----------------+------+
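
The four metrics in the table are class-frequency-weighted averages, and the identical values for `accuracy` and `weightedRecall` are expected: recall weighted by each class's share of the data always reduces to overall accuracy. The plain-Python sketch below shows the arithmetic on a handful of made-up (label, prediction) pairs; `weighted_metrics` is a hypothetical helper and the toy data is not taken from the model's output.

```python
from typing import Dict, List, Tuple


def weighted_metrics(pairs: List[Tuple[int, int]]) -> Dict[str, float]:
    """Hypothetical helper: accuracy plus weighted precision, recall, and F1
    for binary (label, prediction) pairs."""
    total = len(pairs)
    accuracy = sum(1 for y, p in pairs if y == p) / total
    w_precision = w_recall = w_f1 = 0.0
    for cls in (0, 1):
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = (tp + fn) / total  # share of examples whose true label is cls
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return {"accuracy": accuracy, "weightedPrecision": w_precision,
            "weightedRecall": w_recall, "f1": w_f1}


# Made-up toy data: five positive and five negative reviews
pairs = [(1, 1)] * 4 + [(1, 0)] + [(0, 0)] * 3 + [(0, 1)] * 2
metrics = weighted_metrics(pairs)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If per-class figures are needed (for example precision and recall for the positive class only), the same per-class loop values can be reported directly instead of being averaged.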

