<a href="https://colab.research.google.com/github/Terry-Migwi/Sentiment_Analysis/blob/main/ML_CDs_Vinyl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Defining the data analytic question

The objective of this notebook is to classify the sentiment of reviews for CDs and Vinyl products using machine learning algorithms. The algorithms compared are Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machines.

In [None]:
# Download Java and Spark

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [None]:
# Set up the paths

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

In [None]:
# Create a Spark session

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark.conf.set("spark.sql.caseSensitive", True) # Avoid error "Found duplicate column(s) in the data schema"
spark

In [None]:
myReview = spark.read.json("/content/drive/MyDrive/Colab Notebooks/CDs_and_Vinyl_5.json.gz")

# Take a look
myReview.show(5)

+----------+-----+-------+--------------------+-----------+--------------+----------------+-----+--------------------+--------------+--------+----+
|      asin|image|overall|          reviewText| reviewTime|    reviewerID|    reviewerName|style|             summary|unixReviewTime|verified|vote|
+----------+-----+-------+--------------------+-----------+--------------+----------------+-----+--------------------+--------------+--------+----+
|0001393774| null|    5.0|Love it!!  Great ...|04 29, 2016|A1H1DL4K669VQ9| Judith Paladino| null|          Five Stars|    1461888000|    true|null|
|0001393774| null|    5.0|One of my very fa...|02 23, 2016|A3V5XBBT7OZG5G|          gflady| null|One of my very fa...|    1456185600|    true|null|
|0001393774| null|    5.0|THank you Jesus L...|02 11, 2016|A3SNL7UJY7GWBI|Lady Leatherneck| null|          Five Stars|    1455148800|    true|null|
|0001393774| null|    5.0|I recall loving h...|11 28, 2015|A3478QRKQDOPQ2|           jacki| null|forgot but I fi

In [None]:
# Label each review as positive, neutral, or negative from its star rating
from pyspark.sql import functions as f

# Set up sentiment column based on rating
myData = myReview.withColumn(
    # Name a new column
    "sentiment",
    # Use "when" for the conditional labels: 4-5 stars are positive
    f.when((f.col("overall") >=4),"positive")
    # 3 stars are neutral
    .when((f.col("overall") == 3), "neutral")
    # 1-2 stars are negative
    .when((f.col("overall") <=2),"negative")
    )
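
The labelling rule above can be checked in plain Python without a Spark session. This is only an illustrative sketch: `label_sentiment` is a hypothetical helper that is not part of the notebook. It mirrors the `when` chain, including the fact that a rating matching none of the conditions (for example a missing value) stays unlabelled.

```python
from typing import Optional


def label_sentiment(overall: Optional[float]) -> Optional[str]:
    """Mirror the notebook's when-chain: the first matching condition wins."""
    if overall is None:
        return None  # a missing rating matches no condition, so it stays unlabelled
    if overall >= 4:
        return "positive"
    if overall == 3:
        return "neutral"
    if overall <= 2:
        return "negative"
    return None  # e.g. a fractional rating such as 2.5 matches no branch


for rating in (5.0, 4.0, 3.0, 2.0, 1.0, None):
    print(rating, "->", label_sentiment(rating))
```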

In [None]:
# Take a look

myData.show()

+----------+-----+-------+--------------------+-----------+--------------+-------------------+------------+--------------------+--------------+--------+----+---------+
|      asin|image|overall|          reviewText| reviewTime|    reviewerID|       reviewerName|       style|             summary|unixReviewTime|verified|vote|sentiment|
+----------+-----+-------+--------------------+-----------+--------------+-------------------+------------+--------------------+--------------+--------+----+---------+
|0001393774| null|    5.0|Love it!!  Great ...|04 29, 2016|A1H1DL4K669VQ9|    Judith Paladino|        null|          Five Stars|    1461888000|    true|null| positive|
|0001393774| null|    5.0|One of my very fa...|02 23, 2016|A3V5XBBT7OZG5G|             gflady|        null|One of my very fa...|    1456185600|    true|null| positive|
|0001393774| null|    5.0|THank you Jesus L...|02 11, 2016|A3SNL7UJY7GWBI|   Lady Leatherneck|        null|          Five Stars|    1455148800|    true|null| po

In [None]:
# explore the sentiment column

# Get an idea of sentiment distribution

myData.groupBy("sentiment").count()

sentiment,count
positive,1243486
neutral,110407
negative,89862


In [None]:
# view review text
myData.select("reviewText").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|reviewText                                                                                                                                                                                                                                                                                                                  

In [None]:
# view summary
myData.select("summary").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------+
|summary                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------+
|Five Stars                                                                                                                   |
|One of my very favourite albums from one of my very favourite singers                                                        |
|Five Stars                                                                                                                   |
|forgot but I figured on some of these artists seems like one good album and all good albums                                  |
|and I have loved every album he did                                                                    

In [None]:
# Tidy up the text data

myData = (myData
          # Remove handles: delete every token that starts with @
          .withColumn("reviewText", f.regexp_replace(f.col("reviewText"), r"@[\w]*", ""))
          # Replace every character that is not a letter or an apostrophe (digits, punctuation, URL symbols) with a space
          .withColumn("reviewText", f.regexp_replace(f.col("reviewText"), r"[^a-zA-Z']", " "))
          # Remove leading and trailing whitespace
          .withColumn("reviewText", f.trim(f.col("reviewText")))
          # Keep only reviews longer than 5 characters
          .filter(f.length("reviewText")>5)
          )

# Note: withColumn with an existing column name (here reviewText) replaces that column
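
To see what the cleaning steps do to a single review, the same regular expressions can be applied with Python's `re` module. This is a sketch for experimentation only: `clean_review` is a hypothetical helper and the sample strings are made up. It assumes the two patterns behave the same in Python as in Spark's regex engine, which should hold for character classes this simple.

```python
import re
from typing import Optional


def clean_review(text: str, min_length: int = 5) -> Optional[str]:
    """Apply the same four steps as the Spark cell to one string."""
    text = re.sub(r"@[\w]*", "", text)       # drop @handles
    text = re.sub(r"[^a-zA-Z']", " ", text)  # non-letters (except apostrophes) become spaces
    text = text.strip()                      # trim leading and trailing whitespace
    # Mimic the length filter: reviews of min_length characters or fewer are dropped
    return text if len(text) > min_length else None


print(clean_review("@seller Great CD!! 10/10, don't miss it"))
print(clean_review("ok!!"))  # too short after cleaning, so it is dropped
```

Note that digits are removed along with punctuation, so a phrase like "10/10" leaves no trace in the cleaned text.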

In [None]:
# Collect a sample for modelling

# Get the positive ones
myDataPos = myData.filter("sentiment = 'positive'")

# Get the negative ones
myDataNeg = myData.filter("sentiment = 'negative'")

# Get a random sample from positive
myDataPosSample = myDataPos.sample(fraction=200/myDataPos.count(), seed=9165)

# Get a random sample from negative
myDataNegSample = myDataNeg.sample(fraction=200/myDataNeg.count(), seed=9165)

# Combine into a single sample
mySample = myDataPosSample.union(myDataNegSample)

In [None]:
# Take a look

mySample.groupBy("sentiment").count()

sentiment,count
positive,212
negative,228
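
The counts are 212 and 228 rather than exactly 200 because `sample` takes a fraction, not a row count, and Spark does not guarantee that the sampled size matches the fraction exactly. The sketch below illustrates the idea under a simplifying assumption: each row is kept independently with probability `fraction`. The table size and seeds are made up, and `bernoulli_sample` is a hypothetical helper, not Spark's sampler.

```python
import random


def bernoulli_sample(n_rows: int, fraction: float, seed: int) -> int:
    """Keep each row independently with probability `fraction`; return how many were kept."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)


n_rows = 100_000  # hypothetical table size, not the notebook's
target = 200
for seed in (1, 2, 3):
    kept = bernoulli_sample(n_rows, target / n_rows, seed)
    print(f"seed={seed}: kept {kept} rows (target {target})")
```

The kept count varies around the target from seed to seed. If an exact sample size mattered, one option would be to shuffle and truncate instead, for example with `orderBy(f.rand(seed)).limit(200)`.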


In [None]:
# Split the sample 80/20 into training and test sets

(training, test) = mySample.randomSplit([0.8, 0.2],seed = 9165)

In [None]:
training

asin,image,overall,reviewText,reviewTime,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote,sentiment
6305394776,,5.0,MAGNIFICENT movie...,"11 5, 2016",A20GQVYYTLJTIW,suomi,{ DVD},Brilliantly direc...,1478304000,True,,positive
B000000H85,,5.0,This album is rig...,"03 3, 2003",A3KYERW5V9TQUV,Total Scumbag,{ Audio CD},INSANE!!!!,1046649600,False,,positive
B000000HQR,,4.0,solid album,"10 9, 2015",AKNPR99HNY7KM,beerguy,{ Audio CD},Japan is worth th...,1444348800,True,,positive
B000000OC2,,5.0,The movie Sister ...,"04 13, 2010",A16C9QBZHS9UDO,Curst Saden,{ Audio CD},Really Good,1271116800,True,,positive
B0000012ZQ,,5.0,We already have s...,"12 21, 2004",A1THJ5GJF9NLCS,BLee,{ Audio CD},A Life Long Enjoy...,1103587200,False,30.0,positive
B000001EMW,,5.0,What a great cd ...,"10 2, 2010",A1YMOZFIXPKU73,D. Ryan,{ Audio CD},CD,1285977600,True,,positive
B000001ESB,,5.0,The second Album ...,"08 30, 2015",AU4DJA0QUTAPS,Seanus Groovus,,Perfection..,1440892800,True,,positive
B000001F62,,5.0,This CD is likely...,"05 12, 2003",A2AOZQ3WTNVVOK,Lonnie E. Holder,{ Audio CD},The Most Progress...,1052697600,False,39.0,positive
B000001F68,,5.0,Awesome The new...,"04 15, 2018",A1NLZWLIP2GC6W,R Dee Dee.,{ Audio CD},Awesome! The new ...,1523750400,True,,positive
B000001FKG,,5.0,So what if this a...,"04 4, 2005",A31HTN51QNSQ3F,Ben Kizer,{ Audio CD},His best album,1112572800,False,,positive


In [None]:
# Check the training data

# training.groupBy("sentiment").count()

In [None]:
# Check the test data

# test.groupBy("sentiment").count()

In [None]:
# Alias the feature subpackage as feat
import pyspark.ml.feature as feat

# We need Pipeline to streamline the workflow
from pyspark.ml import Pipeline

# Classifiers: logistic regression, naive Bayes, random forest, and linear SVM (MultilayerPerceptronClassifier is imported but not used below)
from pyspark.ml.classification import LogisticRegression, NaiveBayes, RandomForestClassifier, LinearSVC, MultilayerPerceptronClassifier

# Import an evaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Additional functions for tuning parameters
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

### Logistic Regression

In [None]:
# Build the TF-IDF pipeline for logistic regression

# Split the reviews into words
splitter = feat.RegexTokenizer(
    inputCol='reviewText'
    , outputCol='text_split'
    , pattern=r'\s+'
)

# Remove stop words
sw_remover = feat.StopWordsRemover(
    inputCol=splitter.getOutputCol()
    , outputCol='text_noSW'
)

# Count word frequency
count_vec = feat.CountVectorizer(
    inputCol=sw_remover.getOutputCol()
    , outputCol='vector'
    , vocabSize=5000
)

# Calculate IDF
idf_cal = feat.IDF(
    inputCol=count_vec.getOutputCol()
    , outputCol='features'
    , minDocFreq=5
)

# Prepare the target variable
label_string = feat.StringIndexer(
    inputCol = "sentiment"
    , outputCol = "label"
)

# Logistic regression model
lr = LogisticRegression(
    maxIter=100
)


# Finally, set up the pipeline
sentiment_pipeline_idf_lr = Pipeline(
    stages=[
            splitter
            , sw_remover
            , count_vec
            , idf_cal
            , label_string
            , lr
            ]
)
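
The `CountVectorizer` and `IDF` stages together turn each tokenised review into a TF-IDF vector. The following plain-Python sketch shows the arithmetic on toy data. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)) for m documents, with idf set to 0 for terms that appear in fewer than `minDocFreq` documents. The function name `tf_idf` and the three toy reviews are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]], min_doc_freq: int = 0) -> List[Dict[str, float]]:
    """Weight raw term counts by idf = log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur at least once?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {
        term: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }
    return [{term: count * idf[term] for term, count in Counter(doc).items()} for doc in docs]


# Toy reviews, already tokenised and with stop words removed (made-up data)
docs = [["great", "album"], ["great", "sound"], ["bad", "album"]]
for weights in tf_idf(docs, min_doc_freq=2):
    print(weights)
```

With `min_doc_freq=2`, the terms "sound" and "bad" occur in only one toy review each, so their weights are zeroed out, which is the same effect `minDocFreq=5` has on rare words in the pipeline.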

In [None]:
# Set up the parameters to tune
parGrid = ParamGridBuilder() \
          .addGrid(count_vec.vocabSize, [3000, 5000]) \
          .addGrid(lr.regParam, [0.1, 5]) \
          .build()

# Set up the cross validation
crossVal = CrossValidator(estimator=sentiment_pipeline_idf_lr,
                          estimatorParamMaps=parGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                          numFolds=10,
                          seed=9165)
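
It helps to know how much work this cross-validation set-up implies before fitting. The sketch below enumerates the grid in plain Python: every combination of candidate values becomes one parameter map, and each map is fitted once per fold. The dictionary layout is an illustration, not how `ParamGridBuilder` stores its maps internally.

```python
from itertools import product

# Same candidate values as the notebook's logistic regression grid
grid_values = {
    "vocabSize": [3000, 5000],
    "regParam": [0.1, 5],
}
num_folds = 10

# Every combination of the candidate values becomes one parameter map
param_maps = [dict(zip(grid_values, combo)) for combo in product(*grid_values.values())]
for param_map in param_maps:
    print(param_map)

# Each parameter map is fitted once per fold during cross-validation
fits = len(param_maps) * num_folds
print(f"{len(param_maps)} combinations x {num_folds} folds = {fits} pipeline fits")
```

`CrossValidator` additionally refits the best combination on the full training set, so the total is one fit more than the number printed here.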

In [None]:
# Fit the process to the training data set

cvModel = crossVal.fit(training)

In [None]:
# Summarise nicely the results of different parameter combinations

for i in range(len(cvModel.avgMetrics)):
  myParam = parGrid[i]
  myModel = "Model parameters: "
  for key, value in myParam.items():
    myModel += (key.name + '=' + str(value) + ' ')
  print(myModel+"has average accuracy: "+str(cvModel.avgMetrics[i]))

Model parameters: vocabSize=3000 regParam=0.1 has average accuracy: 0.6968558969090903
Model parameters: vocabSize=3000 regParam=5.0 has average accuracy: 0.631015363747379
Model parameters: vocabSize=5000 regParam=0.1 has average accuracy: 0.6968558969090903
Model parameters: vocabSize=5000 regParam=5.0 has average accuracy: 0.631015363747379
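
The printed averages can be compared programmatically. The sketch below copies the four values from the output above into a plain list and picks the highest one. It also groups the scores by `regParam`, which makes it easy to see that `vocabSize` had no effect in this run. How Spark itself breaks the tie between the two best combinations is not shown in the notebook, so the tie-breaking here (first entry wins) is an assumption of this sketch.

```python
# Average cross-validation accuracies copied from the output above
cv_results = [
    ({"vocabSize": 3000, "regParam": 0.1}, 0.6968558969090903),
    ({"vocabSize": 3000, "regParam": 5.0}, 0.631015363747379),
    ({"vocabSize": 5000, "regParam": 0.1}, 0.6968558969090903),
    ({"vocabSize": 5000, "regParam": 5.0}, 0.631015363747379),
]

# max() returns the first entry with the highest score, so a tie keeps the earlier combination
best_params, best_score = max(cv_results, key=lambda item: item[1])
print(best_params, round(best_score, 4))

# Group the scores by regParam to check whether vocabSize changed anything
by_reg = {}
for params, score in cv_results:
    by_reg.setdefault(params["regParam"], set()).add(score)
print(by_reg)
```

In the notebook itself, `cvModel.bestModel` holds the pipeline that was refitted with the winning combination.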


In [None]:
# Apply the best model to the test data set

cv_prediction = cvModel.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
cv_accuracy = evaluator.evaluate(cv_prediction)
print("Accuracy of the best Logistic Regression model with the test data is %g"% (cv_accuracy))

Accuracy of the best Logistic Regression model with the test data is 0.736264
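
The accuracy metric reported here is simply the share of test reviews whose predicted label equals the true label. A minimal sketch of that calculation, using made-up labels and predictions rather than the notebook's data (`accuracy` is a hypothetical helper, not the evaluator's implementation):

```python
from typing import Sequence


def accuracy(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Share of rows whose predicted label equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Made-up labels and predictions (0.0 and 1.0 stand in for the two indexed classes)
labels = [0.0, 0.0, 1.0, 1.0, 1.0]
predictions = [0.0, 1.0, 1.0, 1.0, 0.0]
print(accuracy(labels, predictions))
```

Because the two classes in the sample are roughly balanced, accuracy is a reasonable headline metric here; with the full, heavily positive dataset it would be much less informative.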


### Naive Bayes

In [None]:
# Build the pipeline for Naive Bayes

# Split the reviews into words
splitter = feat.RegexTokenizer(
    inputCol='reviewText'
    , outputCol='text_split'
    , pattern=r'\s+'
)

# Remove stop words
sw_remover = feat.StopWordsRemover(
    inputCol=splitter.getOutputCol()
    , outputCol='text_noSW'
)

# Count word frequency
count_vec = feat.CountVectorizer(
    inputCol=sw_remover.getOutputCol()
    , outputCol='vector'
)

# Calculate IDF
idf_cal = feat.IDF(
    inputCol=count_vec.getOutputCol()
    , outputCol='features'
    , minDocFreq=5
)

# Prepare the target variable
label_string = feat.StringIndexer(
    inputCol = "sentiment"
    , outputCol = "label"
)

# Naive Bayes model
nb = NaiveBayes(
)


# Finally, set up the pipeline
sentiment_pipeline_idf_nb = Pipeline(
    stages=[
            splitter
            , sw_remover
            , count_vec
            , idf_cal
            , label_string
            , nb
            ]
)

In [None]:
# Set up the parameters to tune
parGrid = ParamGridBuilder() \
          .addGrid(count_vec.vocabSize, [3000, 5000]) \
          .addGrid(nb.smoothing, [1, 0]) \
          .build()

# Set up the cross validation
crossVal = CrossValidator(estimator=sentiment_pipeline_idf_nb,
                          estimatorParamMaps=parGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                          numFolds=10,
                          seed=9165)

In [None]:
# Fit the process to the training data set

cvModel = crossVal.fit(training)

In [None]:
# Summarise nicely the results of different parameter combinations

for i in range(len(cvModel.avgMetrics)):
  myParam = parGrid[i]
  myModel = "Model parameters: "
  for key, value in myParam.items():
    myModel += (key.name + '=' + str(value) + ' ')
  print(myModel+"has average accuracy: "+str(cvModel.avgMetrics[i]))

Model parameters: vocabSize=3000 smoothing=1.0 has average accuracy: 0.6486509720946829
Model parameters: vocabSize=3000 smoothing=0.0 has average accuracy: 0.4962947666479634
Model parameters: vocabSize=5000 smoothing=1.0 has average accuracy: 0.6408432841099689
Model parameters: vocabSize=5000 smoothing=0.0 has average accuracy: 0.4962947666479634
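
With `smoothing=0` the accuracy drops to about 0.496, close to chance for two roughly balanced classes. A plausible explanation, which the notebook does not verify, is that without smoothing any word unseen in a class gets probability zero and wipes out that class's score. The sketch below shows additive smoothing on raw word counts. It is a simplification: the pipeline feeds TF-IDF weights rather than raw counts to `NaiveBayes`, and the tokens, vocabulary size, and helper name are made up.

```python
from collections import Counter
from typing import List


def word_likelihood(word: str, class_tokens: List[str], vocab_size: int, smoothing: float) -> float:
    """Estimate P(word | class) with additive smoothing."""
    counts = Counter(class_tokens)
    return (counts[word] + smoothing) / (len(class_tokens) + smoothing * vocab_size)


# Made-up tokens from "positive" training reviews and a made-up vocabulary size
positive_tokens = ["great", "album", "great", "sound"]
vocab_size = 5

for smoothing in (1.0, 0.0):
    p = word_likelihood("scratched", positive_tokens, vocab_size, smoothing)
    print(f"smoothing={smoothing}: P('scratched' | positive) = {p}")
```

With smoothing the unseen word keeps a small non-zero probability; without it the probability is exactly zero, and the product over all words in a review collapses to zero as well.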


In [None]:
# Apply the best model to the test data set

cv_prediction = cvModel.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
cv_accuracy = evaluator.evaluate(cv_prediction)
print("Accuracy of the best Naive Bayes model with the test data is %g"% (cv_accuracy))

Accuracy of the best Naive Bayes model with the test data is 0.648352


### Random Forests

In [None]:
# Build the pipeline for Random Forest

# Split the reviews into words
splitter = feat.RegexTokenizer(
    inputCol='reviewText'
    , outputCol='text_split'
    , pattern=r'\s+'
)

# Remove stop words
sw_remover = feat.StopWordsRemover(
    inputCol=splitter.getOutputCol()
    , outputCol='text_noSW'
)

# Count word frequency
count_vec = feat.CountVectorizer(
    inputCol=sw_remover.getOutputCol()
    , outputCol='vector'
)

# Calculate IDF
idf_cal = feat.IDF(
    inputCol=count_vec.getOutputCol()
    , outputCol='features'
    , minDocFreq=5
)

# Prepare the target variable
label_string = feat.StringIndexer(
    inputCol = "sentiment"
    , outputCol = "label"
)

# Random forest model
rf = RandomForestClassifier()


# Finally, set up the pipeline
sentiment_pipeline_idf_rf = Pipeline(
    stages=[
            splitter
            , sw_remover
            , count_vec
            , idf_cal
            , label_string
            , rf
            ]
)

In [None]:
# Set up the parameters to tune
parGrid = ParamGridBuilder() \
          .addGrid(idf_cal.minDocFreq, [5, 10]) \
          .addGrid(rf.numTrees, [20, 40]) \
          .addGrid(rf.maxDepth, [5, 4]) \
          .build()

# Set up the cross validation
crossVal_rf = CrossValidator(estimator=sentiment_pipeline_idf_rf,
                          estimatorParamMaps=parGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                          numFolds=10,
                          seed=9165)

In [None]:
# Fit the process to the training data set

cvModel_rf = crossVal_rf.fit(training)

In [None]:
# Summarise nicely the results of different parameter combinations

for i in range(len(cvModel_rf.avgMetrics)):
  myParam = parGrid[i]
  myModel = "Model parameters: "
  for key, value in myParam.items():
    myModel += (key.name + '=' + str(value) + ' ')
  print(myModel+"has average accuracy: "+str(cvModel_rf.avgMetrics[i]))

Model parameters: minDocFreq=5 numTrees=20 maxDepth=5 has average accuracy: 0.6176726442161053
Model parameters: minDocFreq=5 numTrees=20 maxDepth=4 has average accuracy: 0.5778559786526811
Model parameters: minDocFreq=5 numTrees=40 maxDepth=5 has average accuracy: 0.5828305863093602
Model parameters: minDocFreq=5 numTrees=40 maxDepth=4 has average accuracy: 0.5586308550759684
Model parameters: minDocFreq=10 numTrees=20 maxDepth=5 has average accuracy: 0.5975431507188426
Model parameters: minDocFreq=10 numTrees=20 maxDepth=4 has average accuracy: 0.6030019847501803
Model parameters: minDocFreq=10 numTrees=40 maxDepth=5 has average accuracy: 0.5949051668218364
Model parameters: minDocFreq=10 numTrees=40 maxDepth=4 has average accuracy: 0.5720242842635344


In [None]:
# Apply the best model to the test data set

cv_prediction_rf = cvModel_rf.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
cv_accuracy = evaluator.evaluate(cv_prediction_rf)
print("Accuracy of the best Random Forest model with the test data is %g"% (cv_accuracy))

Accuracy of the best Random Forest model with the test data is 0.637363


### Support Vector Machine

In [None]:
# Build the pipeline for the linear SVM

# Split the reviews into words
splitter = feat.RegexTokenizer(
    inputCol='reviewText'
    , outputCol='text_split'
    , pattern=r'\s+'
)

# Remove stop words
sw_remover = feat.StopWordsRemover(
    inputCol=splitter.getOutputCol()
    , outputCol='text_noSW'
)

# Count word frequency
count_vec = feat.CountVectorizer(
    inputCol=sw_remover.getOutputCol()
    , outputCol='vector'
    , vocabSize=5000
)

# Calculate IDF
idf_cal = feat.IDF(
    inputCol=count_vec.getOutputCol()
    , outputCol='features'
    , minDocFreq=5
)

# Prepare the target variable
label_string = feat.StringIndexer(
    inputCol = "sentiment"
    , outputCol = "label"
)

# Linear SVM model
lsvc = LinearSVC(maxIter=100)

# Finally, set up the pipeline
sentiment_pipeline_idf_lsvc = Pipeline(
    stages=[
            splitter
            , sw_remover
            , count_vec
            , idf_cal
            , label_string
            , lsvc
            ]
)

In [None]:
# Set up the parameters to tune
parGrid = ParamGridBuilder() \
          .addGrid(count_vec.vocabSize, [3000, 5000]) \
          .addGrid(lsvc.regParam, [0.1, 5]) \
          .build()

# Set up the cross validation
crossVal_lsvc = CrossValidator(estimator=sentiment_pipeline_idf_lsvc,
                          estimatorParamMaps=parGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                          numFolds=10,
                          seed=9165)

In [None]:
# Fit the process to the training data set

cvModel_lsvc = crossVal_lsvc.fit(training)

In [None]:
# Summarise nicely the results of different parameter combinations

for i in range(len(cvModel_lsvc.avgMetrics)):
  myParam = parGrid[i]
  myModel = "Model parameters: "
  for key, value in myParam.items():
    myModel += (key.name + '=' + str(value) + ' ')
  print(myModel+"has average accuracy: "+str(cvModel_lsvc.avgMetrics[i]))

Model parameters: vocabSize=3000 regParam=0.1 has average accuracy: 0.6930785095359377
Model parameters: vocabSize=3000 regParam=5.0 has average accuracy: 0.6122528481791985
Model parameters: vocabSize=5000 regParam=0.1 has average accuracy: 0.6930785095359377
Model parameters: vocabSize=5000 regParam=5.0 has average accuracy: 0.6122528481791985


In [None]:
# Apply the best model to the test data set

cv_prediction_lsvc = cvModel_lsvc.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
cv_accuracy = evaluator.evaluate(cv_prediction_lsvc)
print("Accuracy of the best SVM model with the test data is %g"% (cv_accuracy))

Accuracy of the best SVM model with the test data is 0.703297


Summary of model performance based on test-set accuracy:
1. Logistic Regression - 0.736264
2. Support Vector Machines - 0.703297
3. Naive Bayes - 0.648352
4. Random Forests - 0.637363

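Logistic regression was the best-performing model and random forest the worst on this dataset. Because the models were trained and tested on a sample of only 440 reviews (212 positive, 228 negative), the gaps between them should be read as indicative rather than conclusive.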
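
The ranking can be reproduced from the printed test accuracies with a few lines of plain Python. This sketch only re-sorts the four numbers reported by the evaluation cells; the dictionary is assembled by hand, not read from the fitted models.

```python
# Test accuracies as printed by the four evaluation cells
test_accuracy = {
    "Logistic Regression": 0.736264,
    "Support Vector Machines": 0.703297,
    "Naive Bayes": 0.648352,
    "Random Forests": 0.637363,
}

# Sort from highest to lowest accuracy
ranked = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
for rank, (model, acc) in enumerate(ranked, start=1):
    print(f"{rank}. {model} - {acc:.6f}")

best, worst = ranked[0], ranked[-1]
print(f"Gap between best and worst: {best[1] - worst[1]:.6f}")
```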