# Critical Thinking 3 
## Sentiment Analysis

<br>
Course Code: DS520 <br>
Course Name: Big Data processing and Analysis <br>
CRN: 24541 <br>
Dr. Rudra S Bandhu

Student ID: G200007615 <br>
Student Name: Abdulaziz Alqumayzi<br>
Date: 03/04/2021

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q ftp://mirror.klaus-uwe.me/apache/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

# Importing files here in colab:

In [4]:
# Upload the train, validation, and test datasets
from google.colab import files
uploaded = files.upload()

Saving Test.csv to Test (1).csv
Saving Train.csv to Train (1).csv
Saving Valid.csv to Valid.csv


# Importing needed packages:

In [6]:
import pandas as pd
import findspark
findspark.init()  # findspark adds PySpark to sys.path so it can be imported in this notebook
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, regexp_replace
from pyspark.ml.feature import RegexTokenizer
import requests
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

# Initializing Spark Session as spark_session:

In [7]:
spark_session = SparkSession.builder.appName("Sentiment-analysis").master("local[4]").getOrCreate()

## loading Data:

In [9]:
df_train = pd.read_csv("Train (1).csv")
df_valid = pd.read_csv("Valid.csv")
df_test  = pd.read_csv("Test (1).csv")

## Converting to Spark DataFrames and dropping null values

In [11]:
df_train = spark_session.createDataFrame(df_train).dropna()
df_test  = spark_session.createDataFrame(df_test).dropna()
df_valid = spark_session.createDataFrame(df_valid).dropna()

## Number of rows in each file:

In [12]:
print("Train dataframe:",df_train.count())
print("Test dataframe:",df_test.count())
print("Validation dataframe:",df_valid.count())

Train dataframe: 40000
Test dataframe: 5000
Validation dataframe: 5000


# Pre-processing:

Cleaning the text column: replacing non-alphanumeric characters with spaces, lowercasing, and removing standalone numbers

In [13]:
df_train = df_train.withColumn('text', regexp_replace('text',r"[^0-9a-zA-Z]+", " "))\
                        .withColumn('text', lower(col("text")))\
                        .withColumn('text', regexp_replace('text',r"\b\d+\b", ""))



In [14]:
df_test = df_test.withColumn('text', regexp_replace('text',r"[^0-9a-zA-Z]+", " "))\
                        .withColumn('text', lower(col("text")))\
                        .withColumn('text', regexp_replace('text',r"\b\d+\b", ""))

In [15]:
df_valid = df_valid.withColumn('text', regexp_replace('text',r"[^0-9a-zA-Z]+", " "))\
                        .withColumn('text', lower(col("text")))\
                        .withColumn('text', regexp_replace('text',r"\b\d+\b", "")) 
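
The three cleaning steps above can be tried on a single string without a Spark session. The following is a minimal plain-Python sketch using the standard `re` module, not the notebook's Spark code; the helper name `clean_text` and the sample sentence are hypothetical and only illustrate the effect of the same three patterns.

```python
import re

def clean_text(text: str) -> str:
    """Mirror the three Spark column operations on a plain Python string."""
    # 1. Replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^0-9a-zA-Z]+", " ", text)
    # 2. Lowercase the text.
    text = text.lower()
    # 3. Remove standalone numbers; digits inside a word such as "1st" stay.
    return re.sub(r"\b\d+\b", "", text)

sample = "It's been about 14 years since I 1st watched this!"
print(repr(clean_text(sample)))
```

Removing a number leaves the surrounding spaces in place, which is consistent with the double spaces visible in some rows of the output tables (for example `it s been about  ...`).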

### Downloading stopwords:

This list of English stop words is taken from the "Glasgow Information
Retrieval Group". The original list can be found at
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words

---



In [16]:
stop_words = requests.get('http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words').text.split()

### Removing stop words:

StopWordsRemover is a Transformer that takes an array of string tokens and returns the array with all the configured stop words removed.

In [17]:
remove_stopwords = StopWordsRemover(inputCol="words", outputCol="filtered_words", stopWords=stop_words, caseSensitive=False)
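
As a rough illustration of what this transformer does to one row, the sketch below filters a token list in plain Python. It is not Spark's implementation; `remove_stop_words` and the tiny stop-word list are hypothetical stand-ins for the downloaded list.

```python
def remove_stop_words(words, stop_words, case_sensitive=False):
    """Return the tokens that are not in the stop-word collection."""
    if case_sensitive:
        blocked = set(stop_words)
        return [w for w in words if w not in blocked]
    # Case-insensitive matching: compare lowercased forms on both sides.
    blocked = {s.lower() for s in stop_words}
    return [w for w in words if w.lower() not in blocked]

tokens = ["this", "movie", "is", "The", "best"]
print(remove_stop_words(tokens, ["this", "is", "the"]))  # ['movie', 'best']
```

With `case_sensitive=True` the token `"The"` would be kept, because it no longer matches the lowercase entry `"the"`.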

# Building the model

Here we build the sentiment analysis pipeline: tokenization, stop-word removal, TF-IDF features, and logistic regression.

In [18]:
# Regex-based tokenizer: with gaps=False the pattern matches the tokens themselves (runs of letters) instead of the separators
tokenizer = RegexTokenizer(gaps=False, pattern='\\p{L}+', inputCol="text", outputCol="words", toLowercase=True)
# Convert a collection of text documents to a matrix of token counts
cv = CountVectorizer(minTF=1.0, minDF=1.0, vocabSize=2**17, inputCol="filtered_words", outputCol="tf")
# Rescale the term counts by inverse document frequency, which down-weights terms that appear in many documents
idf = IDF(minDocFreq=0, inputCol='tf', outputCol='tfidf')
# Logistic regression with elastic-net regularization (an equal mix of L1 and L2 penalties)
lr = LogisticRegression(featuresCol='tfidf', labelCol='label', regParam=0.01, elasticNetParam=0.5, maxIter=100)
# Chain all stages into one pipeline that can be fitted and applied as a single model
pipeline = Pipeline(stages=[tokenizer, remove_stopwords, cv, idf, lr])
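
To make the `CountVectorizer` and `IDF` stages more concrete, here is a small plain-Python sketch of the weighting they produce together. It assumes the smoothed formula given in the Spark MLlib documentation, idf(t) = ln((N + 1) / (df(t) + 1)), where N is the number of documents and df(t) is the number of documents containing term t. The function name `tf_idf` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Term frequency is the raw count of the term within the document.
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
for weights in tf_idf(docs):
    print(weights)
```

In this toy corpus `movie` and `good` each occur in two of the three documents and get the lower weight ln(4/3) per occurrence, while `bad` and `film` occur in one document and get ln(2).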

## Fitting the model and transforming the validation set

In [20]:
fit_model = pipeline.fit(df_train)
validation = fit_model.transform(df_valid)

# Measure the performance of the model

Showing the labels and the model's predictions for the validation set

In [21]:
validation.select("text","label","prediction").show()

+--------------------+-----+----------+
|                text|label|prediction|
+--------------------+-----+----------+
|it s been about  ...|    0|       0.0|
|someone needed to...|    0|       0.0|
|the guidelines st...|    0|       0.0|
|this movie is a m...|    0|       0.0|
|before stan laure...|    0|       0.0|
|this is the best ...|    1|       1.0|
|the morbid cathol...|    1|       1.0|
| semana santa or ...|    0|       0.0|
|somebody mastered...|    1|       1.0|
|why did i waste  ...|    0|       0.0|
|this film takes y...|    1|       1.0|
|the russian space...|    0|       0.0|
|the more i think ...|    0|       0.0|
|this is very date...|    1|       1.0|
|i had seen kalifo...|    1|       1.0|
|a powerful adapta...|    1|       1.0|
|this movie s orig...|    1|       1.0|
|i really enjoyed ...|    0|       0.0|
|hi everyone oh bo...|    0|       0.0|
|it takes a while ...|    1|       1.0|
+--------------------+-----+----------+
only showing top 20 rows



Measuring the RMSE of the trained model on the validation set

In [22]:
evaluator = RegressionEvaluator(labelCol='label',predictionCol='prediction')
print("RMSE:",evaluator.evaluate(validation))

RMSE: 0.363868107973205


Because both the labels and the predictions are either 0 or 1, the RMSE lies between 0 and 1. A lower RMSE (closer to zero) indicates a better result.
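
Because the labels and predictions here are both 0 or 1, each squared error is 1 for a misclassified row and 0 otherwise, so the mean squared error equals the error rate and accuracy = 1 - RMSE². The short sketch below applies this identity to the validation RMSE printed above. The helper name `accuracy_from_rmse` is introduced here for illustration, and the row count of 5000 is the validation count reported earlier.

```python
def accuracy_from_rmse(rmse: float) -> float:
    """Accuracy implied by an RMSE computed on 0/1 labels and predictions."""
    return 1.0 - rmse ** 2

validation_rmse = 0.363868107973205  # value printed by the evaluator
n_rows = 5000                        # validation rows after dropna()

accuracy = accuracy_from_rmse(validation_rmse)
misclassified = round((1.0 - accuracy) * n_rows)
print(f"accuracy: {accuracy:.4f}, misclassified rows: {misclassified}")
```

Under this identity the validation RMSE corresponds to an accuracy of about 0.8676, that is, 662 of the 5000 validation rows misclassified.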

## Make a prediction on the test set 

In [23]:
predictions = fit_model.transform(df_test)
predictions.select("text","label","prediction").show()

+--------------------+-----+----------+
|                text|label|prediction|
+--------------------+-----+----------+
|i always wrote th...|    0|       1.0|
|1st watched     o...|    0|       0.0|
|this movie was so...|    0|       0.0|
|the most interest...|    1|       0.0|
|when i first read...|    0|       0.0|
|i saw this film o...|    1|       1.0|
|i saw a screening...|    0|       0.0|
|william hurt may ...|    1|       0.0|
|it is a piece of ...|    0|       0.0|
|i m bout it  br b...|    0|       0.0|
|i had a recent sp...|    0|       0.0|
|i really enjoyed ...|    1|       1.0|
|didn t the writer...|    0|       0.0|
|this movie was re...|    0|       0.0|
|i think i watched...|    0|       0.0|
|uwe boll has done...|    0|       0.0|
|i felt asleep wat...|    0|       0.0|
|brass pictures mo...|    0|       1.0|
|my interest was r...|    1|       1.0|
|pity the monkees ...|    1|       1.0|
+--------------------+-----+----------+
only showing top 20 rows



## Evaluating the test set 

In [24]:
evaluator = RegressionEvaluator(labelCol='label',predictionCol='prediction')
print("RMSE:",evaluator.evaluate(predictions))

RMSE: 0.35242020373412186


The RMSE on the test set (0.352) is close to the RMSE on the validation set (0.364), which suggests that the model generalizes well to unseen data. However, a single metric is not enough to decide whether the model is good; other metrics should be evaluated as well.
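
As one way to look at other metrics, the sketch below computes accuracy, precision, recall, and F1 in plain Python. The label and prediction lists are copied from the 20 test rows displayed above, so the numbers describe only that small sample, not the full test set. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Labels and predictions of the 20 displayed test rows, in order.
labels      = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
print(binary_metrics(labels, predictions))
```

On the full DataFrame, Spark's `MulticlassClassificationEvaluator` (with `metricName` set to `"accuracy"` or `"f1"`) or `BinaryClassificationEvaluator` could be used in place of `RegressionEvaluator`. This is a suggestion, not something the notebook ran.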

# References
- Kim, R. (2018, March 14). Sentiment analysis with pyspark. Retrieved April 03, 2021, from https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35

- VM@ai. (2020, May 12). Tokenizer & RegexTokenizer in PySpark. Retrieved April 03, 2021, from https://medium.com/@harinata.0624/tokenizer-regextokenizer-in-pyspark-51a7c9b33132

- Fan, T. (2019, October 25). MNT Make modules private in feature_extraction_stop_words. Retrieved April 03, 2021, from https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/_stop_words.py

- Karim, M. R., & Alla, S. (2017). Scala and Spark for big data analytics: Explore the concepts of functional programming, data streaming, and machine learning. Birmingham: Packt Publishing.

- Bengfort, B., Bilbro, R., & Ojeda, T. (n.d.). Applied text analysis with python. Retrieved April 03, 2021, from https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html

- Li, S. (2018, May 07). Building a linear regression With PySpark and mllib. Retrieved April 03, 2021, from https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a
