# Abstract for model

## Dataset and Approach
- The **IMDB Movie Reviews Dataset** from **Kaggle** is used for sentiment analysis.
- The dataset contains **50,000 reviews**, evenly split between **positive** and **negative** sentiments.
- The pipeline has three stages:
  - **Text Tokenization**: Lowercasing each review and splitting it on whitespace into individual words.
  - **Feature Engineering**: Using **HashingTF** to map each review's words to a fixed-length term-frequency vector.
  - **Classification Model**: **Logistic Regression** is trained to classify reviews as **positive (1)** or **negative (0)**.
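
The tokenize-then-hash step can be illustrated without Spark. The following is a minimal sketch of the hashing trick only: `hashing_tf` and `num_features=16` are hypothetical names and values chosen for illustration, and `zlib.crc32` is a stand-in hash (Spark's `HashingTF` uses MurmurHash3 and defaults to 2^18 = 262,144 features).

```python
import zlib
from collections import Counter


def hashing_tf(words, num_features=16):
    """Map a list of tokens to a sparse {index: count} term-frequency vector."""
    counts = Counter()
    for word in words:
        # Stand-in hash for this sketch; Spark uses MurmurHash3 instead.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# Tokenize the way Spark's Tokenizer does: lowercase, then split on whitespace.
tokens = "This movie was great great fun".lower().split()
vector = hashing_tf(tokens)
print(vector)
```

Because different words can hash to the same index, a small `num_features` causes collisions; a larger vector reduces them at the cost of more model parameters.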

---

## Model Performance Summary

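- **Area under ROC (90.90%)**: The value reported by `BinaryClassificationEvaluator`, whose default metric is `areaUnderROC`. The notebook prints it as "Accuracy", but it is not the accuracy.
- **Accuracy (84.34%)**: From the confusion matrix, (4162 + 4156) / 9863 = **84.34%** of the test reviews are classified correctly.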
- **Precision (83.06%)**: When the model predicts **positive**, it’s correct **83.06%** of the time.  
  - However, **849 negative reviews were misclassified as positive** (**False Positives**).
- **Recall (85.67%)**: The model correctly identifies **85.67%** of all actual **positive reviews**.  
  - **696 positive reviews were misclassified as negative** (**False Negatives**).
- **F1-Score (84.34%)**: The harmonic mean of **precision and recall**, indicating a **balanced overall performance**.
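
The precision, recall, and F1 figures follow directly from the confusion-matrix counts in the notebook output (TP = 4162, TN = 4156, FP = 849, FN = 696). The short sketch below recomputes them in plain Python and also derives accuracy as (TP + TN) / total from the same counts.

```python
# Confusion-matrix counts taken from the notebook's Step 9 output
TP, TN, FP, FN = 4162, 4156, 849, 696

total = TP + TN + FP + FN
accuracy = (TP + TN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1-Score", f1),
]:
    print(f"{name}: {value:.2%}")
```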

---

## Insights and Improvement Areas

✅ **Strengths:**
- A high area under ROC (**90.90%**) indicates the model ranks positive reviews above negative ones well.
- Precision (**83.06%**) and recall (**85.67%**) are close, so the model is not strongly biased toward either class.
- Simple **TF-based features** are enough to capture much of the sentiment signal.

⚠ **Areas for Improvement:**
- **Reduce False Positives (FP)**: **849** negative reviews in the test set are incorrectly predicted as positive.
- **Reduce False Negatives (FN)**: **696** positive reviews in the test set are incorrectly predicted as negative.
- **Potential Enhancements:**
  - Apply **IDF** weighting to the HashingTF output (**TF-IDF**) to down-weight words that appear in most reviews.
  - Try more complex models like **Random Forest** or **Deep Learning (BERT)**.
  - Tune **Logistic Regression** hyperparameters (for example `regParam` and `elasticNetParam`) to optimize classification.
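
To show why IDF weighting could help, here is a small sketch in plain Python on a made-up three-review corpus. The corpus and the helper `tf_idf` are hypothetical. The smoothed formula log((m + 1) / (df + 1)) is the one documented for Spark's `IDF`, where m is the number of documents and df is the number of documents containing the term.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "the acting was great".split(),
]

m = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}


def tf_idf(doc):
    """Weight each term's raw count in `doc` by its corpus-level IDF."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf}


print(tf_idf(docs[1]))
```

Words that occur in every review, such as "the", receive a weight of zero, while rarer sentiment-bearing words such as "terrible" keep a larger weight. In the Spark pipeline, the analogous change would presumably be an `IDF` stage placed between `hashingTF` and `lr`.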

---

## Conclusion

The model provides **solid sentiment classification (84.34% accuracy, 90.90% area under ROC)** but can be improved by enhancing **feature extraction techniques** and **model tuning**. Future enhancements could involve **advanced NLP techniques** to boost precision and recall.


In [None]:
# Step 1: Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr, when
import pandas as pd

spark = SparkSession.builder.appName("SentimentAnalysis").getOrCreate()

In [19]:
# Step 2: Load the dataset
df = pd.read_csv("/content/drive/MyDrive/IMDB Dataset.csv")
data = spark.createDataFrame(df)

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
# Step 3: Preprocessing - Map sentiments to numeric labels
data = data.withColumn("label", (data["sentiment"] == "positive").cast("int"))

In [22]:
# Step 4: Create an ML pipeline
tokenizer = Tokenizer(inputCol="review", outputCol="words")  # Tokenize text
hashingTF = HashingTF(inputCol="words", outputCol="features")  # Convert words to feature vectors
lr = LogisticRegression(featuresCol="features", labelCol="label")  # Logistic regression model
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

In [23]:
# Step 5: Split data into training and testing sets
(trainingData, testData) = data.randomSplit([0.8, 0.2], seed=42)

In [24]:
# Step 6: Train the model
model = pipeline.fit(trainingData)

In [25]:
# Step 7: Make predictions on the test set
predictions = model.transform(testData)

In [26]:
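# Step 8: Evaluate the model
# BinaryClassificationEvaluator defaults to metricName="areaUnderROC",
# so the value below is the area under the ROC curve, not accuracy.
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc}")

Area under ROC: 0.9090484649150764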
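
`BinaryClassificationEvaluator` uses `areaUnderROC` as its default `metricName`, so the value printed above is an area under the ROC curve rather than an accuracy. The sketch below uses made-up labels and scores (not the notebook's data) to show that the two metrics can differ on the same predictions. AUC is computed here as the fraction of (positive, negative) pairs that are ranked correctly.

```python
# Hypothetical labels and predicted probabilities for six reviews
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.55, 0.3, 0.2]

# Accuracy: share of correct predictions at a 0.5 threshold
predicted = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# AUC: share of (positive, negative) pairs where the positive scores higher;
# ties count as half a win
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))

print(f"Accuracy: {accuracy:.4f}")
print(f"AUC: {auc:.4f}")
```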


In [27]:
# Step 9: Confusion Matrix
confusion_matrix = predictions.withColumn(
    "confusion",
    when((col("label") == 1) & (col("prediction") == 1), "TP")
    .when((col("label") == 0) & (col("prediction") == 0), "TN")
    .when((col("label") == 1) & (col("prediction") == 0), "FN")
    .when((col("label") == 0) & (col("prediction") == 1), "FP")
)

confusion_counts = confusion_matrix.groupBy("confusion").count()
confusion_counts.show()

+---------+-----+
|confusion|count|
+---------+-----+
|       TP| 4162|
|       TN| 4156|
|       FN|  696|
|       FP|  849|
+---------+-----+



In [28]:
# Step 10: Calculate Precision, Recall, and F1-Score
TP = confusion_matrix.filter(col("confusion") == "TP").count()
TN = confusion_matrix.filter(col("confusion") == "TN").count()
FP = confusion_matrix.filter(col("confusion") == "FP").count()
FN = confusion_matrix.filter(col("confusion") == "FN").count()

precision = TP / (TP + FP) if (TP + FP) != 0 else 0
recall = TP / (TP + FN) if (TP + FN) != 0 else 0
f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) != 0 else 0

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1_score}")

Precision: 0.8305727399720615
Recall: 0.8567311650885138
F1-Score: 0.8434491843145202


In [29]:
# Step 11: Analyze Misclassified Examples
print("Misclassified Examples:")
misclassified = predictions.filter(col("label") != col("prediction"))
misclassified.select(
    col("review").substr(1, 100).alias("truncated_review"),
    "label",
    "prediction"
).show(10, truncate=False)

Misclassified Examples:
+----------------------------------------------------------------------------------------------------+-----+----------+
|truncated_review                                                                                    |label|prediction|
+----------------------------------------------------------------------------------------------------+-----+----------+
|!!!! POSSIBLE MILD SPOILER !!!!!<br /><br />As I watched the first half of GUILTY AS SIN I couldn`t |0    |1.0       |
|"... the beat is too strong ... we're deaf mutants now--like them", Rex Voorhas Ormine<br /><br />I |1    |0.0       |
|"Bell Book and Candle" was shown recently on cable. Not having seen it for a while, we decided to ta|1    |0.0       |
|"Die Sieger" was highly recommended to be one of the few good action movies made in Germany. I watch|0    |1.0       |
|"I thought I'd be locked away in a padded cell and they'd throw away the key" (Thus is a paraphrased|0    |1.0       |
|"Nada" was the 