# Sentiment Analysis of Book Reviews with PySpark

The dataset is 2.86 GB and is available at: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews/data

Link to the Tableau dashboard: https://public.tableau.com/views/AmazonBookDashboard/Dashboard1?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link

Presented by: Ricardo Andrés Cáceres Villibord A01706972

In [3]:
# Dependencies needed to work with Spark: Java and the Spark distribution
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz
# Set up Spark for Python
!pip install -q findspark
!pip install pyspark

# Set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

# Locate and initialize the Spark installation
import findspark
findspark.init()
findspark.find()

'/content/spark-3.5.0-bin-hadoop3'

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [5]:
# Create a Spark session
spark = SparkSession.builder.appName("ReviewSentimentAnalyzer").getOrCreate()

In [6]:
# Load the dataset
data = spark.read.csv('/content/drive/My Drive/Colab Notebooks/Books_rating.csv', header=True, inferSchema=True)
data.show()

+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|        Id|               Title|Price|       User_id|         profileName|review/helpfulness|review/score|review/time|      review/summary|         review/text|
+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|1882931173|Its Only Art If I...| NULL| AVCGYZL8FQQTD|"Jim of Oz ""jim-...|               7/7|         4.0|  940636800|Nice collection o...|This is only for ...|
|0826414346|Dr. Seuss: Americ...| NULL|A30TK6U7DNS82R|       Kevin Killian|             10/10|         5.0| 1095724800|   Really Enjoyed It|I don't care much...|
|0826414346|Dr. Seuss: Americ...| NULL|A3UH4UZ4RSVO82|        John Granger|             10/11|         5.0| 1078790400|Essential for eve...|"If people become...|
|0826414346|Dr. Seuss: Ameri
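
The `profileName` and `review/text` values in the table above still show raw quote characters, which suggests that some fields in the file use doubled quotes (`""`) to escape a quote inside a quoted field. The sketch below uses Python's standard `csv` module on a hypothetical row (not taken from the dataset) to show how that convention is meant to be parsed and why a naive split on commas breaks it. In Spark, the `quote`, `escape`, and `multiLine` reader options control the same behavior; whether setting `escape='"'` is the right choice for this particular file is an assumption that would need to be checked against the data.

```python
import csv
import io

# Hypothetical row in the style of Books_rating.csv: the second field
# contains quotes written as "" and the third field contains a comma.
raw_line = '1,"Jim ""jim"" Oz","Great, really"\n'

# Splitting on commas ignores the quoting and produces too many fields.
naive_fields = raw_line.strip().split(",")

# csv.reader treats "" inside a quoted field as one literal quote character.
parsed_fields = next(csv.reader(io.StringIO(raw_line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

Rows that are split in the wrong place shift values into the wrong columns, which is one possible source of unexpected nulls in columns such as `review/score`.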

In [7]:
# Count the null values in each column
null_counts = [data.where(col(c).isNull()).count() for c in data.columns]

# Print the null count per column
for column, null_count in zip(data.columns, null_counts):
    print(f"Column '{column}': {null_count} null values")

Column 'Id': 0 null values
Column 'Title': 208 null values
Column 'Price': 2517579 null values
Column 'User_id': 562250 null values
Column 'profileName': 562200 null values
Column 'review/helpfulness': 367 null values
Column 'review/score': 130 null values
Column 'review/time': 27 null values
Column 'review/summary': 65 null values
Column 'review/text': 43 null values


In [8]:
# Add a 'sentiment' column derived from 'review/score':
# 1 if 'review/score' is greater than 3.0, and 0 otherwise
data = data.withColumn("sentiment", when(col("review/score") > 3.0, 1).otherwise(0))

In [9]:
# Keep only the review text and the label, then drop rows with null values
data = data.select("review/text", "sentiment").na.drop()
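
Two consequences of this labeling rule are easy to miss, so here is a plain-Python sketch of it. The helper `label_sentiment` is hypothetical and only mirrors the `when(...).otherwise(0)` expression: a score of exactly 3.0 is labeled 0, and a missing score makes the comparison null, so the row also falls through to 0. Because `sentiment` is therefore never null, the rows that reported a null `review/score` in the count above would not be removed by `na.drop()` as long as their text is present.

```python
def label_sentiment(score):
    # Mirrors when(col("review/score") > 3.0, 1).otherwise(0):
    # a missing score never satisfies the condition, so it is labeled 0.
    if score is not None and score > 3.0:
        return 1
    return 0

sample_scores = [5.0, 4.0, 3.0, 1.0, None]
labels = [label_sentiment(s) for s in sample_scores]
print(labels)  # [1, 1, 0, 0, 0]
```

If neutral or unscored reviews should not count as negative, one option is to filter those rows out before the label is created.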

## Data Preprocessing

In [10]:
# Tokenizer splits the text into words
tokenizer = Tokenizer(inputCol="review/text", outputCol="words")
# StopWordsRemover removes common words
stop_words_remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
# HashingTF converts the words into numeric features
hashing_tf = HashingTF(inputCol="filtered_words", outputCol="raw_features")
# IDF rescales the feature weights, with a minimum document frequency of 5
idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=5)
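
To make the four stages concrete, the following is a minimal plain-Python sketch of the same idea on three made-up reviews. It is an illustration under stated assumptions, not Spark's implementation: the stop-word list is a short hypothetical one, `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, and the vector has 16 buckets instead of Spark's default of 2^18. The IDF weight follows the formula given in the Spark documentation, log((m + 1) / (df + 1)), with a weight of 0 for terms seen in fewer than `min_doc_freq` documents.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; HashingTF defaults to 2**18 buckets
STOP_WORDS = {"the", "is", "a", "it", "this", "i"}  # hypothetical short list

def tokenize(text):
    # Lowercase and split on whitespace, as Spark's Tokenizer does
    return text.lower().split()

def hashed_term_frequencies(words):
    # Map each word to a bucket index and count occurrences per bucket.
    # crc32 is a stand-in hash chosen for this sketch.
    return Counter(zlib.crc32(w.encode("utf-8")) % NUM_FEATURES for w in words)

def inverse_document_frequencies(tf_vectors, min_doc_freq=1):
    # df counts in how many documents each bucket appears at least once
    num_docs = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {
        bucket: math.log((num_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for bucket, df in doc_freq.items()
    }

reviews = ["This book is great", "This book is terrible", "Great story"]
tf_vectors = [
    hashed_term_frequencies([w for w in tokenize(r) if w not in STOP_WORDS])
    for r in reviews
]
# min_doc_freq stays at 1 here; the notebook's value of 5 would zero out
# every term in a three-document corpus.
idf_weights = inverse_document_frequencies(tf_vectors)
tfidf_vectors = [{b: n * idf_weights[b] for b, n in tf.items()} for tf in tf_vectors]
print(tfidf_vectors)
```

Because words are hashed into a fixed number of buckets, two different words can collide in the same bucket; a large bucket count keeps that rare.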

## Model Training

In [11]:
# The classifier is LogisticRegression
# 'featuresCol' is set to the preprocessed features and 'labelCol' to the 'sentiment' column
# The maximum number of iterations is 10 and the regularization parameter is 0.01
lr = LogisticRegression(featuresCol="features", labelCol="sentiment", maxIter=10, regParam=0.01)
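
As a rough picture of what the fitted classifier does with a TF-IDF vector, the sketch below computes a weighted sum of the features, passes it through the logistic function, and compares the result with a 0.5 threshold. The weights, intercept, and feature indices are hypothetical values chosen for illustration; in the notebook they are learned during training, with `regParam=0.01` penalizing large weights.

```python
import math

def sigmoid(margin):
    # Numerically stable logistic function
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)

def predict(features, weights, intercept, threshold=0.5):
    # features and weights are sparse: {feature_index: value}
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    probability = sigmoid(margin)
    label = 1.0 if probability > threshold else 0.0
    return probability, label

# Hypothetical weights: index 3 pushes toward positive, index 7 toward negative
weights = {3: 1.5, 7: -2.0}
positive_prob, positive_label = predict({3: 2.0}, weights, intercept=-1.0)
negative_prob, negative_label = predict({7: 1.0}, weights, intercept=0.0)
print(positive_prob, positive_label)
print(negative_prob, negative_label)
```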

In [12]:
# Create a Pipeline that chains the preprocessing stages and the model training
pipeline = Pipeline(stages=[tokenizer, stop_words_remover, hashing_tf, idf, lr])

In [13]:
# Split the data into training (80%) and test (20%) sets
(training_data, test_data) = data.randomSplit([0.8, 0.2], seed=42)
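
The 80/20 figures are targets rather than exact sizes, because a weighted random split assigns each row independently. The sketch below shows that idea in plain Python with a hypothetical helper, `random_split`; it does not reproduce the rows Spark would pick for `seed=42`, only the mechanism.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    # Each row goes to the training set with probability train_weight,
    # so the resulting sizes are close to, but not exactly, 80/20.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

train_rows, test_rows = random_split(list(range(1000)))
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which keeps the evaluation comparable between runs.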

In [14]:
# Train the model on the training data
model = pipeline.fit(training_data)

## Model Evaluation

In [15]:
# Generate predictions on the test set with the trained model
predictions = model.transform(test_data)

In [16]:
# BinaryClassificationEvaluator scores the model on the test set.
# Its default metric is the area under the ROC curve (areaUnderROC), not accuracy.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="sentiment", metricName="areaUnderROC")
# Compute the area under the ROC curve
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc}")

Area under ROC: 0.8925118254093743
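
`BinaryClassificationEvaluator` reports the area under the ROC curve unless another metric is requested, so the value above measures how well the model ranks positive reviews above negative ones rather than the share of correct predictions. The plain-Python sketch below, using made-up labels and scores, shows that the two numbers can differ on the same predictions.

```python
def area_under_roc(labels, scores):
    # Fraction of (positive, negative) pairs in which the positive example
    # receives the higher score; ties count as half.
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    # Share of examples whose thresholded score matches the label
    predictions = [1 if s > threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc_value = area_under_roc(labels, scores)
accuracy_value = accuracy(labels, scores)
print(auc_value, accuracy_value)  # 0.75 0.5
```

To report accuracy in Spark, one option is `MulticlassClassificationEvaluator` with `metricName="accuracy"` on the `prediction` column.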


## Predicting on User Input

In [17]:
# Ask the user for a review
user_review = input("Enter your review: ")

# Put the user's input in a Spark DataFrame
user_data = spark.createDataFrame([(user_review,)], ["review/text"])

# Make the prediction
prediction = model.transform(user_data)

# Extract the predicted sentiment
predicted_sentiment = prediction.select("prediction").collect()[0][0]

# Show the predicted sentiment
if predicted_sentiment == 1.0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")

Enter your review: I like this book, it is very fun!
Sentiment: Positive


In [18]:
# Save the fitted pipeline (preprocessing stages and classifier) to Google Drive
model.save("/content/drive/My Drive/Colab Notebooks/BookReview_Sentiment_Analysis")

## Predicting with the Saved Model
There is no need to retrain from scratch.

In [19]:
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

In [21]:
# Reuse the active Spark session, or create one if this runs in a fresh session
spark = SparkSession.builder.appName("ReviewSentimentAnalyzer").getOrCreate()

# Load the saved pipeline; it already contains the fitted preprocessing stages,
# so Tokenizer, StopWordsRemover, HashingTF, and IDF do not need to be rebuilt
saved_model_path = "/content/drive/MyDrive/Colab Notebooks/BookReview_Sentiment_Analysis"
loaded_model = PipelineModel.load(saved_model_path)

# Ask the user for a review
user_review = input("Enter your review: ")

# Put the user's input in a Spark DataFrame
user_data = spark.createDataFrame([(user_review,)], ["review/text"])

# Make the prediction
prediction = loaded_model.transform(user_data)

# Extract the predicted sentiment
predicted_sentiment = prediction.select("prediction").collect()[0][0]

# Show the predicted sentiment
if predicted_sentiment == 1.0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")

Enter your review: This book sucks, its terrible. I want to give it back.
Sentiment: Negative
