# Baseline Model (Logistic Regression)

---

**Author:** Diego Antonio García Padilla

**Date:** Nov 3, 2025

## Environment setup

In [13]:
#@title Setup & Environment Verification

import warnings
warnings.filterwarnings('ignore')

import os
import sys

print("=== ENVIRONMENT CHECK ===")
print(f"Python: {sys.version.split()[0]}")
print(f"JAVA_HOME: {os.environ.get('JAVA_HOME')}")
print(f"SPARK_HOME: {os.environ.get('SPARK_HOME')}")
print(f"Driver Memory: {os.environ.get('SPARK_DRIVER_MEMORY')}")
print(f"Executor Memory: {os.environ.get('SPARK_EXECUTOR_MEMORY')}")
print("=" * 50)

=== ENVIRONMENT CHECK ===
Python: 3.10.12
JAVA_HOME: /usr/lib/jvm/java-8-openjdk-arm64/jre
SPARK_HOME: /opt/spark
Driver Memory: 12g
Executor Memory: 8g


In [14]:
#@title Import Libraries

# PySpark
from pyspark import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window

# SciKit Learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# TensorFlow
import tensorflow as tf
# Aliased so it does not shadow pyspark.ml.feature.Tokenizer imported above
from tensorflow.keras.preprocessing.text import Tokenizer as KerasTokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, Bidirectional, GlobalMaxPooling1D, Concatenate
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Data manipulation
import pandas as pd
import numpy as np

# Financial data
import yfinance as yf

# Hugging Face
from huggingface_hub import hf_hub_download

# Kaggle
import kagglehub

# Utilities
from datetime import datetime, timedelta
import json
import requests
import logging
from tqdm import tqdm
import time
import subprocess
from pathlib import Path

In [15]:
#@title Start Spark session

print("=== PRE-FLIGHT CHECK ===")

# Verify Java is available
try:
    java_version = subprocess.check_output(['java', '-version'], stderr=subprocess.STDOUT)
    print("Java: ✅ Available")
except Exception as e:
    print(f"Java: ❌ Not available - {e}")

print("=" * 50)

# 🔥 STOP any existing Spark sessions first
try:
    SparkContext.getOrCreate().stop()
    print("🧹 Cleaned up existing Spark session")
except Exception:
    print("🆕 No existing session to clean")

print("=" * 50)

# Create fresh Spark session
spark = SparkSession.builder \
    .appName("Yelp_Sentiment_Analysis") \
    .master("local[*]") \
    .config("spark.driver.memory", "12g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.maxResultSize", "4g") \
    .config("spark.memory.fraction", "0.8") \
    .config("spark.memory.storageFraction", "0.3") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.default.parallelism", "16") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

print("✅ Spark session configured with:")
print(f"   - Driver Memory: 12GB")
print(f"   - Executor Memory: 8GB")
print(f"   - Max Result Size: 4GB")
print(f"   - Parallelism: 16 cores")
print(f"   - Shuffle Partitions: 200")

=== PRE-FLIGHT CHECK ===
Java: ✅ Available
🧹 Cleaned up existing Spark session
✅ Spark session configured with:
   - Driver Memory: 12GB
   - Executor Memory: 8GB
   - Max Result Size: 4GB
   - Parallelism: 16 cores
   - Shuffle Partitions: 200


## Logistic Regression with MLlib (baseline)

In [16]:
#@title Load dataset

# Parquet path
parquet_path = "../data/clean/yelp_reviews_tokenized.parquet"

yelp_df = spark.read.parquet(parquet_path)

# Show schema to understand structure
print("📋 Schema of Yelp Reviews:")
yelp_df.printSchema()

# Sample
print("\n📋 Sample:")
yelp_df.show(5, truncate=80)

📋 Schema of Yelp Reviews:
root
 |-- text: string (nullable = true)
 |-- sentiment: string (nullable = true)
 |-- text_length: integer (nullable = true)
 |-- word_count: integer (nullable = true)
 |-- text_clean: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tokens_filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)


📋 Sample:
+--------------------------------------------------------------------------------+---------+-----------+----------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                            text|sentiment|text_length|word_count|                                                                      text_clean|           

In [17]:
#@title Split data into train, validation, and test

# Split: 70% train, 15% validation, 15% test
train_df, temp_df = yelp_df.randomSplit([0.7, 0.3], seed=42)
val_df, test_df = temp_df.randomSplit([0.5, 0.5], seed=42)

print("📊 Train: ", train_df.count())
print("📊 Validation: ", val_df.count())
print("📊 Test: ", test_df.count())

                                                                                

📊 Train:  291152
📊 Validation:  62355
📊 Test:  61924
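
`randomSplit` assigns rows to splits at random according to the weights, so the resulting sizes only approximate 70/15/15. The following plain-Python sketch simply recomputes the realized fractions from the counts printed above; it does not touch Spark.

```python
# Counts copied from the cell output above.
counts = {"train": 291152, "validation": 62355, "test": 61924}

total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {counts[name]:>7,}  {frac:.4f}")
print(f"{'total':<10} {total:>7,}")
```

The realized fractions land within a fraction of a percentage point of the requested weights, which is the expected behavior for a random split of this size.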


                                                                                

In [18]:
#@title Logistic Regression with TF-IDF (MLlib)

print("=" * 80)
print("🎯 BASELINE MODEL: Logistic Regression with TF-IDF")
print("=" * 80)

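# Step 1: Convert sentiment labels to numerical indices
# StringIndexer orders labels by descending frequency by default, so the
# mapping below is the one assumed later in this notebook and should be
# checked against the fitted indexer's `labels` attribute:
# 0 -> Negative
# 1 -> Neutral
# 2 -> Positive
label_indexer = StringIndexer(inputCol="sentiment", outputCol="label")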

# Step 2: TF-IDF feature extraction
# HashingTF: converts tokens to term frequency vectors
hashingTF = HashingTF(inputCol="tokens_filtered", outputCol="raw_features", numFeatures=10000)

# IDF: applies inverse document frequency weighting
idf = IDF(inputCol="raw_features", outputCol="features")

# Step 3: Logistic Regression classifier
lr = LogisticRegression(
    maxIter=20,
    regParam=0.01,  # L2 regularization
    elasticNetParam=0.0  # Pure L2 (ridge)
)

# Create pipeline
baseline_pipeline = Pipeline(stages=[label_indexer, hashingTF, idf, lr])

# Train model
print("\n⏳ Training baseline model...")
baseline_model = baseline_pipeline.fit(train_df)

# Make predictions
print("\n📊 Making predictions on validation set...")
val_predictions = baseline_model.transform(val_df)
test_predictions = baseline_model.transform(test_df)

# Evaluate
evaluator_accuracy = MulticlassClassificationEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="accuracy"
)

evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="f1"
)

val_accuracy = evaluator_accuracy.evaluate(val_predictions)
val_f1 = evaluator_f1.evaluate(val_predictions)

test_accuracy = evaluator_accuracy.evaluate(test_predictions)
test_f1 = evaluator_f1.evaluate(test_predictions)

print("\n" + "=" * 80)
print("📈 BASELINE RESULTS")
print("=" * 80)
print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Validation F1-Score: {val_f1:.4f}")
print(f"Test Accuracy:       {test_accuracy:.4f}")
print(f"Test F1-Score:       {test_f1:.4f}")
print("=" * 80)

# Show sample predictions (validation)
print("\n🔍 Sample predictions (validation):")
val_predictions.select('text', 'sentiment', 'prediction').show(5, truncate=80)

# Save baseline model
baseline_model_path = "../models/baseline_lr_tfidf"
baseline_model.write().overwrite().save(baseline_model_path)
print(f"\n💾 Baseline model saved to: {baseline_model_path}")

🎯 BASELINE MODEL: Logistic Regression with TF-IDF

⏳ Training baseline model...

📊 Making predictions on validation set...

📈 BASELINE RESULTS
Validation Accuracy: 0.7533
Validation F1-Score: 0.7520
Test Accuracy:       0.7542
Test F1-Score:       0.7532

🔍 Sample predictions (validation):


                                                                                

+--------------------------------------------------------------------------------+---------+----------+
|                                                                            text|sentiment|prediction|
+--------------------------------------------------------------------------------+---------+----------+
|" after a previous that was awful I decided to give the bridal garden one mor...| negative|       0.0|
|" you are killing me Larry". The ice cream is amazing. The staff was amazing,...| negative|       2.0|
|"Drishti is a point of gaze or focus. Drishti mainly means not looking at an ...| negative|       0.0|
|"Flavor of Indian" huh? What flavor would that be? This place is as blah as t...| negative|       0.0|
|"Loss prevention" undercover in that store must have followed me through the ...| negative|       0.0|
+--------------------------------------------------------------------------------+---------+----------+
only showing top 5 rows


💾 Baseline model saved to: ../model
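
To make the `HashingTF` + `IDF` stages above more concrete, here is a minimal pure-Python sketch of the same idea: tokens are hashed into a fixed number of buckets to form term-frequency vectors, which are then reweighted by a smoothed inverse document frequency, log((m + 1) / (df + 1)), the formula given in the MLlib documentation. The hash function (`zlib.crc32`), the tiny `NUM_FEATURES`, and the three toy documents are assumptions chosen for illustration; Spark uses MurmurHash3 and the notebook uses `numFeatures=10000`, so bucket indices will not match Spark's.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; the notebook uses numFeatures=10000


def bucket(token, num_features=NUM_FEATURES):
    # Stand-in hash for illustration: Spark's HashingTF uses MurmurHash3, not CRC32.
    return zlib.crc32(token.encode("utf-8")) % num_features


def hashing_tf(tokens, num_features=NUM_FEATURES):
    # Sparse term-frequency vector as {bucket index: count}.
    # Distinct tokens may collide in one bucket; their counts then add up.
    return dict(Counter(bucket(t, num_features) for t in tokens))


def fit_idf(tf_vectors, num_features=NUM_FEATURES):
    # Smoothed IDF per bucket: log((m + 1) / (df + 1)), m = number of documents.
    m = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return [math.log((m + 1) / (doc_freq.get(i, 0) + 1)) for i in range(num_features)]


def tf_idf(tf_vector, idf):
    # Scale each term frequency by the IDF weight of its bucket.
    return {idx: tf * idf[idx] for idx, tf in tf_vector.items()}


# Toy documents (hypothetical, not taken from the Yelp data).
docs = [
    ["great", "food", "great", "service"],
    ["terrible", "service"],
    ["food", "okay"],
]

tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors)
features = [tf_idf(v, idf) for v in tf_vectors]

for doc, vec in zip(docs, features):
    print(doc, {i: round(w, 3) for i, w in sorted(vec.items())})
```

A bucket that occurs in every document gets weight log(1) = 0, which is how IDF discounts terms that carry little class information. A smaller `numFeatures` saves memory at the cost of more hash collisions.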

In [19]:
#@title Export predictions for Tableau 

print("=" * 80)
print("📊 PREPARING DATA FOR TABLEAU")
print("=" * 80)

# 1. Get predictions with all relevant info
tableau_predictions = test_predictions.select(
    F.col('text'),
    F.col('sentiment').alias('true_sentiment'),
    F.col('label').alias('true_label'),
    F.col('prediction').alias('predicted_label'),
    F.col('text_length'),
    F.col('word_count')
).withColumn('predicted_sentiment',
    F.when(F.col('predicted_label') == 0.0, 'negative')
    .when(F.col('predicted_label') == 1.0, 'neutral')
    .when(F.col('predicted_label') == 2.0, 'positive')
    .otherwise('unknown')
).withColumn('is_correct',
    F.when(F.col('true_label') == F.col('predicted_label'), 1).otherwise(0)
)

# Cache it for reuse
tableau_predictions.cache()

print("\n📋 Sample data for Tableau:")
tableau_predictions.show(5, truncate=80)

# VERIFY counts first
total_count = tableau_predictions.count()
correct_count = tableau_predictions.filter(F.col('is_correct') == 1).count()
incorrect_count = tableau_predictions.filter(F.col('is_correct') == 0).count()

print(f"\n🔍 VERIFICATION:")
print(f"Total predictions: {total_count:,}")
print(f"Correct predictions: {correct_count:,} ({correct_count/total_count*100:.2f}%)")
print(f"Incorrect predictions: {incorrect_count:,} ({incorrect_count/total_count*100:.2f}%)")
print(f"Test Accuracy from model: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"Match? {abs((correct_count/total_count) - test_accuracy) < 0.01}")

# 2. Create BALANCED & CLEAN predictions sample for Tableau
print("\n🎯 Creating balanced sample (equal representation of all sentiments)...")

# Take equal samples from each sentiment class
# Note: limit() keeps the first rows Spark returns for each class, so this is
# a balanced subset rather than a random sample
sample_size_per_class = 3333  # 3333 * 3 = ~10,000 total

negative_sample = tableau_predictions.filter(F.col('true_sentiment') == 'negative').limit(sample_size_per_class)
neutral_sample = tableau_predictions.filter(F.col('true_sentiment') == 'neutral').limit(sample_size_per_class)
positive_sample = tableau_predictions.filter(F.col('true_sentiment') == 'positive').limit(sample_size_per_class)

# Union all samples
balanced_sample = negative_sample.union(neutral_sample).union(positive_sample)

print(f"\n📊 Sample distribution:")
balanced_sample.groupBy('true_sentiment').count().orderBy('true_sentiment').show()

# Clean text: remove newlines, tabs, extra spaces, and double quotes
print("\n🧹 Cleaning text for CSV export...")

predictions_clean = balanced_sample.withColumn('text_clean',
    F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace(F.col('text'), r'[\n\r\t]+', ' '),  # Replace newlines/tabs with space
        r'\s+', ' '),  # Replace multiple spaces with single space
    r'[\"]+', '')  # Remove double quotes
).withColumn('text_truncated',
    F.substring(F.col('text_clean'), 1, 500)  # Limit to 500 chars for Tableau
).select(
    'text_truncated',
    'true_sentiment',
    'true_label',
    'predicted_sentiment',
    'predicted_label',
    'text_length',
    'word_count',
    'is_correct'
)

print("\n✅ Balanced sample created with all 3 sentiments!")
predictions_clean.show(5, truncate=80)

# 3. Create confusion matrix data
print("\n📊 Creating confusion matrix data...")
confusion_matrix = tableau_predictions.groupBy('true_sentiment', 'predicted_sentiment').count()

print("\n🔢 Confusion Matrix:")
confusion_matrix.orderBy('true_sentiment', 'predicted_sentiment').show()

# 4. Calculate metrics per class
print("\n📈 Creating per-class metrics...")

metrics_list = []
for class_idx, class_name in enumerate(['negative', 'neutral', 'positive']):
    tp = tableau_predictions.filter(
        (F.col('true_label') == class_idx) & (F.col('predicted_label') == class_idx)
    ).count()
    
    fp = tableau_predictions.filter(
        (F.col('true_label') != class_idx) & (F.col('predicted_label') == class_idx)
    ).count()
    
    fn = tableau_predictions.filter(
        (F.col('true_label') == class_idx) & (F.col('predicted_label') != class_idx)
    ).count()
    
    tn = tableau_predictions.filter(
        (F.col('true_label') != class_idx) & (F.col('predicted_label') != class_idx)
    ).count()
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    metrics_list.append({
        'sentiment': class_name,
        'true_positives': tp,
        'false_positives': fp,
        'false_negatives': fn,
        'true_negatives': tn,
        'precision': round(precision, 4),
        'recall': round(recall, 4),
        'f1_score': round(f1, 4)
    })

metrics_df = spark.createDataFrame(metrics_list)
print("\n📊 Per-class metrics:")
metrics_df.show()

# 5. Text length analysis
print("\n📏 Creating text length analysis...")
length_analysis = tableau_predictions.groupBy('true_sentiment', 'is_correct').agg(
    F.avg('text_length').alias('avg_text_length'),
    F.avg('word_count').alias('avg_word_count'),
    F.count('*').alias('count')
)

print("\n📊 Text length by sentiment and correctness:")
length_analysis.show()

# 6. Export everything using Spark with proper CSV options
print("\n💾 Exporting to CSV for Tableau using Spark...")

output_dir = "../data/tableau"

# CSV options for clean export
csv_options = {
    'header': 'true',
    'quote': '"',
    'escape': '"',
    'quoteAll': 'true'  # Quote all fields to avoid issues
}

print("\n⏳ Saving balanced predictions sample...")
predictions_clean.coalesce(1).write.mode('overwrite').options(**csv_options).csv(f"{output_dir}/predictions_sample")

print("\n⏳ Saving confusion matrix...")
confusion_matrix.coalesce(1).write.mode('overwrite').options(**csv_options).csv(f"{output_dir}/confusion_matrix")

print("\n⏳ Saving per-class metrics...")
metrics_df.coalesce(1).write.mode('overwrite').options(**csv_options).csv(f"{output_dir}/metrics_per_class")

print("\n⏳ Saving text length analysis...")
length_analysis.coalesce(1).write.mode('overwrite').options(**csv_options).csv(f"{output_dir}/text_length_analysis")

# 7. Create and save summary - WITH CORRECT COUNTS
print("\n⏳ Saving model summary...")
overall_summary = spark.createDataFrame([{
    'model_name': 'Baseline (LR + TF-IDF)',
    'accuracy': round(test_accuracy, 4),
    'f1_score': round(test_f1, 4),
    'total_predictions': total_count,
    'correct_predictions': correct_count,
    'incorrect_predictions': incorrect_count,
    'correct_percentage': round((correct_count/total_count)*100, 2),
    'incorrect_percentage': round((incorrect_count/total_count)*100, 2)
}])

print("\n📊 Model Summary:")
overall_summary.show(truncate=False)
overall_summary.coalesce(1).write.mode('overwrite').options(**csv_options).csv(f"{output_dir}/model_summary")

print("\n" + "=" * 80)
print("✅ ALL DATA EXPORTED FOR TABLEAU!")
print("=" * 80)
print(f"\n📁 Files saved to: {output_dir}/")
print("\n🎯 Predictions sample now contains BALANCED data:")
print("   - ~3,333 negative reviews")
print("   - ~3,333 neutral reviews")
print("   - ~3,333 positive reviews")

📊 PREPARING DATA FOR TABLEAU

📋 Sample data for Tableau:


                                                                                

+--------------------------------------------------------------------------------+--------------+----------+---------------+-----------+----------+-------------------+----------+
|                                                                            text|true_sentiment|true_label|predicted_label|text_length|word_count|predicted_sentiment|is_correct|
+--------------------------------------------------------------------------------+--------------+----------+---------------+-----------+----------+-------------------+----------+
|!! FRAUD ALERT / SCAM / BUYER BEWARE !!\n\nUses cross state Craigslist posts ...|      negative|       0.0|            0.0|       2017|       322|           negative|         1|
|" Venture N is an absolute horrible place to go if you want to have good cust...|      negative|       0.0|            0.0|        345|        67|           negative|         1|
|"Bouncer" at the door was a tool. Wouldn't let us in because we weren't on so...|      negative|       0

                                                                                


⏳ Saving text length analysis...

⏳ Saving model summary...

📊 Model Summary:
+--------+------------------+-------------------+--------+--------------------+---------------------+----------------------+-----------------+
|accuracy|correct_percentage|correct_predictions|f1_score|incorrect_percentage|incorrect_predictions|model_name            |total_predictions|
+--------+------------------+-------------------+--------+--------------------+---------------------+----------------------+-----------------+
|0.7542  |75.42             |46705              |0.7532  |24.58               |15219                |Baseline (LR + TF-IDF)|61924            |
+--------+------------------+-------------------+--------+--------------------+---------------------+----------------------+-----------------+
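
The export cell derives per-class precision, recall, and F1 with four Spark `count()` jobs per class. As a hedged alternative sketch, the same quantities can be computed in one pass from confusion-matrix counts, for example after collecting the rows of `confusion_matrix` to the driver. The counts below are hypothetical and are not the notebook's results. The last lines also compute a support-weighted F1, which, to my understanding, is what `MulticlassClassificationEvaluator` reports for `metricName="f1"`.

```python
from collections import defaultdict

# Hypothetical counts keyed by (true label, predicted label); NOT the notebook's results.
confusion = {
    ("negative", "negative"): 80, ("negative", "neutral"): 15, ("negative", "positive"): 5,
    ("neutral", "negative"): 20, ("neutral", "neutral"): 60, ("neutral", "positive"): 20,
    ("positive", "negative"): 5, ("positive", "neutral"): 10, ("positive", "positive"): 85,
}


def per_class_metrics(confusion):
    # Row totals give the support of each true class, column totals the predicted counts.
    true_totals, pred_totals = defaultdict(int), defaultdict(int)
    for (true, pred), n in confusion.items():
        true_totals[true] += n
        pred_totals[pred] += n

    metrics = {}
    for cls in sorted(true_totals):
        tp = confusion.get((cls, cls), 0)
        fp = pred_totals[cls] - tp   # predicted as cls, but actually another class
        fn = true_totals[cls] - tp   # actually cls, but predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {
            "tp": tp, "fp": fp, "fn": fn, "support": true_totals[cls],
            "precision": precision, "recall": recall, "f1": f1,
        }
    return metrics


metrics = per_class_metrics(confusion)
total = sum(confusion.values())
accuracy = sum(m["tp"] for m in metrics.values()) / total
weighted_f1 = sum(m["f1"] * m["support"] for m in metrics.values()) / total

for cls, m in metrics.items():
    print(f"{cls:<9} P={m['precision']:.4f} R={m['recall']:.4f} F1={m['f1']:.4f}")
print(f"accuracy={accuracy:.4f} weighted_f1={weighted_f1:.4f}")
```

Because the confusion matrix has only nine cells for three classes, collecting it is cheap, and every metric then becomes simple arithmetic on the driver.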


✅ ALL DATA EXPORTED FOR TABLEAU!

📁 Files saved to: ../data/tableau/

🎯 Predictions sample now contains BALANCED data:
   - ~3,333 negative reviews
   - ~3,333 neutral reviews
   - ~3,333 p
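
The export cell above hard-codes the mapping 0/1/2 to negative/neutral/positive. `StringIndexer` assigns indices by descending label frequency by default, so that mapping depends on the class distribution of the training data. The sketch below imitates that ordering in plain Python on hypothetical labels to show how the indices can differ from alphabetical order; the function name `frequency_desc_index` and the toy label list are illustrative assumptions. In the notebook itself, the fitted indexer stage (presumably `baseline_model.stages[0]`) exposes a `labels` list that could be used to build the mapping instead of hard-coding it.

```python
from collections import Counter


def frequency_desc_index(labels):
    # Most frequent label gets index 0.0; ties are broken alphabetically in this sketch.
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Hypothetical label column in which "positive" is the most frequent class.
labels = ["positive"] * 5 + ["negative"] * 3 + ["neutral"] * 2

mapping = frequency_desc_index(labels)
index_to_label = {idx: lab for lab, idx in mapping.items()}

print(mapping)
print(index_to_label)
```

With this toy distribution, index 0.0 decodes to "positive" rather than "negative", so a hard-coded 0/1/2 mapping would mislabel every prediction. Decoding predictions through the fitted label list avoids that dependency.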

In [20]:
#@title Consolidate CSVs

import shutil
import glob

# Move CSV files out of Spark folders
output_dir = "../data/tableau"
folders = ['predictions_sample', 'confusion_matrix', 'metrics_per_class', 
           'text_length_analysis', 'model_summary']

for folder in folders:
    folder_path = f"{output_dir}/{folder}"
    csv_files = glob.glob(f"{folder_path}/part-*.csv")
    
    if csv_files:
        csv_file = csv_files[0]
        new_path = f"{output_dir}/{folder}.csv"
        shutil.copy(csv_file, new_path)
        print(f"✅ {folder}.csv created")

print("\n🎉 Clean CSVs ready for Tableau!")

✅ predictions_sample.csv created
✅ confusion_matrix.csv created
✅ metrics_per_class.csv created
✅ text_length_analysis.csv created
✅ model_summary.csv created

🎉 Clean CSVs ready for Tableau!
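
The consolidation loop above silently skips any folder that has no `part-*.csv` file. A small variant, sketched here with `pathlib`, returns both the files it created and the folders it could not resolve, which makes a failed Spark write easier to notice. The helper name `consolidate_csvs` is hypothetical, and the demo runs against a temporary directory that imitates Spark's output layout rather than the real `../data/tableau` folder.

```python
import shutil
import tempfile
from pathlib import Path


def consolidate_csvs(output_dir, folders):
    """Copy the first part-*.csv of each Spark output folder to <folder>.csv."""
    output_dir = Path(output_dir)
    created, missing = [], []
    for folder in folders:
        folder_path = output_dir / folder
        part_files = sorted(folder_path.glob("part-*.csv")) if folder_path.is_dir() else []
        if part_files:
            target = output_dir / f"{folder}.csv"
            shutil.copy(part_files[0], target)
            created.append(target.name)
        else:
            missing.append(folder)
    return created, missing


# Demo on a throwaway directory that imitates Spark's layout (part file plus _SUCCESS marker).
with tempfile.TemporaryDirectory() as tmp:
    spark_dir = Path(tmp) / "model_summary"
    spark_dir.mkdir()
    (spark_dir / "part-00000-demo.csv").write_text('"accuracy"\n"0.7542"\n')
    (spark_dir / "_SUCCESS").touch()

    created, missing = consolidate_csvs(tmp, ["model_summary", "confusion_matrix"])
    copied_text = (Path(tmp) / "model_summary.csv").read_text()

print("created:", created)
print("missing:", missing)
```

Returning the missing folders, rather than only printing successes, lets a calling cell decide whether an incomplete export should stop the notebook.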
