# Big Data Analysis - a review classification project

## Introduction

The way businesses connect with their clients has permanently changed as a result of social media. Thousands of tweets or reviews might be trending on social media in a matter of minutes. It would be extremely helpful for a company to be able to examine reviews in real time and discover the sentiment that underpins each one.

Nowadays, a brand's internet reputation is one of its most precious assets. If a negative review or a blunder on social media is not addressed swiftly, it can be costly. Social media and review sentiment analysis helps a firm to keep track of what people are saying about it, its products, and services, as well as discover negative sentiment and the reasons for it.

To deescalate the problem and minimize future unfavorable mentions, it's critical to spot negative patterns or irate consumers quickly.

However, not only can social media/review sentiment analysis help with brand management, but it may also provide insight into customer preferences. Customers' ratings and opinions are incredibly valuable to businesses. 

Companies can use feedback to tailor their product or service to the preferences of their customers. People nowadays find it far easier to tweet about their satisfaction or dissatisfaction with a service or product than to leave a review on the company's website.

Because social media posts and reviews are not designed to be well-written with a clear structure and a well summarized thought process, analyzing them can be difficult. Instead, reviews/posts are a relatively casual expression of an individual's thoughts.

Furthermore, posts and reviews frequently contain spelling errors, making the process even more difficult. Finally, sentiment analysis/text classification allows a company to track its clients' emotions and comprehend how they feel. It adds a new dimension to the standard measures for analyzing brand performance and gives businesses new chances.

Businesses might manually classify data by sentiment/ratings, but because the internet moves so quickly and thousands of customers can engage in minutes, this work must be automated. Review classification and sentiment analysis must be both quick and scalable in order to produce consistent, high-quality results.[3]

Businesses, and Amazon in particular, are also having trouble with fake reviews. On Amazon's marketplace, where products are rated on a five-star scale and a large number of positive reviews can help a brand stand out from a crowd of competitors, not everything is as it seems. Amazon has admitted to having a fake review problem as it tries to rein in coordinated activities on other websites to flood product listings with positive ratings in exchange for money, which is against the company's terms of service.

The overflow of stars and comments — real and fraudulent — might be daunting for customers evaluating 15 variants of a product of their liking. 

When Amazon discovers that a seller has broken the rules, it bans them from the marketplace. It took down listings for Aukey and Mpow electronics in May after reports that the companies had engaged in paid review schemes.

Amazon also claims to devote efforts to deleting fraudulent reviews and the accounts that post them, stating that 200 million suspected phony reviews were removed before they were published in 2020.

According to an Amazon spokesman, 99 percent of the company's actions on incentivized reviews are taken proactively, before concerns are identified. According to a spokeswoman, Amazon wants its customers to shop with confidence, knowing that reviews are real and genuine.

However, because many businesses are anxious to outperform their competitors, buyers are unable to tell whether a product's number of five-star reviews is genuine or artificially inflated. When faced with the prospect of dozens of knockoff items on an Amazon marketplace with over 2 million sellers worldwide, shoppers are unsure what to believe. Amazon also has a hard time distinguishing fake reviewers from real customers who have purchased and used a product.
The fake reviewers' actions appear to be legitimate, and the same client may write both paid and unpaid reviews.

Fake reviews are frequently planned on social media sites that Amazon does not control, which is another huge difficulty for the corporation. A UK regulator warned in May that it will continue to investigate these Facebook and Instagram groups, noting that 16,000 social media groups that organized refunds for phony Amazon reviews had been removed.

The business that owns those social media networks, Meta, prohibits the trading of reviews and has automatic systems in place to detect such schemes. People can report this type of group, and the firm will remove groups and content if they are deemed to be in violation of the regulations, according to the corporation. Amazon also keeps an eye on social media for groups that are coordinating reviews, and 6,000 of them were reported to social media companies last year.

The issue is circular in nature. The more positive reviews a product receives, the more attention it gains as an Amazon "best seller," and the faster it earns the trust of customers who have never purchased from the firm before. As the company's client base grows, so does the number of people from whom it may solicit paid evaluations, accelerating its ratings success even more.
[1]-[2]


        
## Data

Initially, I wanted to do a Twitter sentiment analysis project with the 'Twitter US Airline Sentiment' dataset found on Kaggle (https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment).

However, since this dataset is only 3 megabytes in size, it was not suitable for a Big Data project. Instead, I chose the 'Amazon Review Dataset (2018)' (https://nijianmo.github.io/amazon/index.html).

This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
From it, I downloaded the 'Video Games' subset, which contains 2,565,349 reviews about video games.
The data was downloaded and converted into a CSV file via an online converter. The CSV file is 1.34 gigabytes in size and was uploaded to the Hadoop file system under the name 'amazon.csv'.

The video games subcategory is only a small part of the full dataset, so this project could be scaled up by using more of its data. I found a sample size of over 2 million reviews and 1.34 GB sufficient for this project.
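The conversion step could also be done locally instead of with an online converter. The following is only a minimal sketch, assuming the download is a file with one JSON object per line and that only the four fields listed in `FIELDS` are needed; the helper name `json_lines_to_csv` and the sample record are hypothetical. Quoting every field keeps commas and line breaks inside review texts from spilling into other columns.

```python
import csv
import io
import json

# Columns to keep; the names follow the column list of the dataset.
FIELDS = ["overall", "verified", "reviewText", "summary"]

def json_lines_to_csv(lines, out):
    """Write one CSV row per JSON object; missing fields become empty cells."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        writer.writerow({name: record.get(name, "") for name in FIELDS})

# Tiny in-memory example (hypothetical record) instead of the real file.
sample = ['{"overall": 5.0, "verified": true, '
          '"reviewText": "Great game,\\nlots of fun", "summary": "Five Stars"}']
buffer = io.StringIO()
json_lines_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way contains quoted fields with embedded line breaks, so the reader on the Spark side would have to be configured for multi-line records.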

## Hypotheses

My hypothesis is that it is possible to infer sentiment and review rating from the text and summary of reviews. Of course, when someone leaves a review on Amazon, there is no need to detect the rating, as the user rates the product from 1 to 5 stars anyway, so inferring sentiment and rating there would be redundant.

However, there are of course smaller e-commerce businesses that might not have a rating system in place. Also, there is no 1 to 5 star rating system on social media. If a user complains about or praises a product on Twitter, there won't be a rating attached.

And nowadays it is far more likely for a user to publish their sentiment about a product on Twitter or Instagram than to leave proper feedback on Amazon or another website. So there is a need to infer sentiment from product review texts and summaries.

Furthermore, and this is by far the most challenging task of this project, there is a need to detect fake reviews. My hypothesis is that it is possible to detect fake reviews to some extent.

## Planned analysis

I will try to create a machine learning model that predicts the correct rating of a given review. Two iterations of this model will be tested: 1. predicting the correct rating from the review text and 2. predicting the correct rating from the review summary, which is often much more concise.

The last step is to try to identify fake reviews. The dataset has a column 'verified', which states whether the user who left the review was verified or not. Since Amazon deletes fake reviews as soon as it detects them, this dataset might not contain many fake reviews. Also, I did not find another suitable dataset containing labeled fake reviews.

So, following the logic that a user who is not verified might be more likely to leave a fake review, the machine learning model will try to predict whether a user was verified and therefore whether the review was fake or not.

# Implementation

In [1]:
# importing libraries
import random
import re
import string

import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import isnan, when, count, col, udf
from pyspark.sql.session import SparkSession
from pyspark.sql.types import IntegerType, BooleanType, DateType, StringType

In [2]:
# build spark session
spark = SparkSession.builder.appName("amazonjm2").getOrCreate()

### reading file from HDFS path

path = hdfs:///user/jmohr001/amazon.csv

In [3]:
# read the amazon.csv from HDFS
data = spark.read.format("csv").option("header", "true").load("hdfs:///user/jmohr001/amazon.csv")

In [4]:
# inspect data
data.show(5)

+-------+--------+-----------+--------------+----------+----------------+--------------------+--------------------+--------------+----+-----+-----+
|overall|verified| reviewTime|    reviewerID|      asin|    reviewerName|          reviewText|             summary|unixReviewTime|vote|style|image|
+-------+--------+-----------+--------------+----------+----------------+--------------------+--------------------+--------------+----+-----+-----+
|    1.0|    True| 06 9, 2014|A21ROB4YDOZA5P|0439381673|   Mary M. Clark|I used to play th...|   Did not like this|    1402272000|null| null| null|
|    3.0|    True|05 10, 2014|A3TNZ2Q5E7HTHD|0439381673|       Sarabatya|The game itself w...|      Almost Perfect|    1399680000|null| null| null|
|    4.0|    True| 02 7, 2014|A1OKRM3QFEATQO|0439381673| Amazon Customer|I had to learn th...|DOES NOT WORK WIT...|    1391731200|  15| null| null|
|    1.0|    True| 02 7, 2014|A2XO1JFCNEYV3T|0439381673|ColoradoPartyof5|The product descr...|does not work on .

In [5]:
# inspect columns
data.columns

['overall',
 'verified',
 'reviewTime',
 'reviewerID',
 'asin',
 'reviewerName',
 'reviewText',
 'summary',
 'unixReviewTime',
 'vote',
 'style',
 'image']

### ML-model with summary column

In [6]:
# get only relevant columns
df = data.select('overall','summary')
df.show(25)

+--------------------+--------------------+
|             overall|             summary|
+--------------------+--------------------+
|                 1.0|   Did not like this|
|                 3.0|      Almost Perfect|
|                 4.0|DOES NOT WORK WIT...|
|                 1.0|does not work on ...|
|                 4.0|                null|
|I really like pla...|                null|
|                 5.0|Love this game!  ...|
|                 3.0|Would like it mor...|
|                 5.0|                null|
|               [...]|                null|
|Just the patch al...|                null|
|Classic game that...|                null|
|                 5.0|The Oregon Trail-...|
|                 5.0|                null|
|I got this game w...|                null|
|Graphics: The gra...|                null|
|Sound: The sound ...|                null|
|Pros: Help ppl le...|                null|
|Is really fun to ...|                null|
|Has really cool s...|          

In [7]:
# changing column names
newcolnames = ['rating','summary']

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)
    
df.show(25)

+--------------------+--------------------+
|              rating|             summary|
+--------------------+--------------------+
|                 1.0|   Did not like this|
|                 3.0|      Almost Perfect|
|                 4.0|DOES NOT WORK WIT...|
|                 1.0|does not work on ...|
|                 4.0|                null|
|I really like pla...|                null|
|                 5.0|Love this game!  ...|
|                 3.0|Would like it mor...|
|                 5.0|                null|
|               [...]|                null|
|Just the patch al...|                null|
|Classic game that...|                null|
|                 5.0|The Oregon Trail-...|
|                 5.0|                null|
|I got this game w...|                null|
|Graphics: The gra...|                null|
|Sound: The sound ...|                null|
|Pros: Help ppl le...|                null|
|Is really fun to ...|                null|
|Has really cool s...|          

In [8]:
# change dtype to int for rating column
df = df.withColumn("rating",df.rating.cast('int'))

In [9]:
# drop rows that contain a null value
df = df.na.drop()
df.show(25)

+------+--------------------+
|rating|             summary|
+------+--------------------+
|     1|   Did not like this|
|     3|      Almost Perfect|
|     4|DOES NOT WORK WIT...|
|     1|does not work on ...|
|     5|Love this game!  ...|
|     3|Would like it mor...|
|     5|The Oregon Trail-...|
|     5|           Still fun|
|     5|         Fun to Play|
|     1|           Cant play|
|     1| NOT FOR A newer MAC|
|     5| A must have game!!!|
|     5|          Five Stars|
|     5|          Five Stars|
|     2|   Very disappointed|
|     5|          Five Stars|
|     5|Believe it is bei...|
|     4|          Four Stars|
|     3|            So. Old.|
|     5|          Still Fun!|
|     1|       what a waste.|
|     1|        Disappointed|
|     1|I have been unabl...|
|     4| Blast from the Past|
|     1|... work in my co...|
+------+--------------------+
only showing top 25 rows



In [10]:
# groupby rating
df.groupBy("rating") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+------+-------+
|rating|  count|
+------+-------+
|     5|1284099|
|     4| 309921|
|     1| 254602|
|     3| 157666|
|     2| 103753|
|  1985|      4|
|   360|      2|
|    10|      2|
|     6|      1|
|    -1|      1|
|   552|      1|
|     7|      1|
|   343|      1|
|   100|      1|
|    20|      1|
|   250|      1|
+------+-------+



In [11]:
# keep only the 1-5 star ratings (the rating column is an integer at this point)
df = df.filter(df.rating.isin(1, 2, 3, 4, 5))
df.show(25)

+------+--------------------+
|rating|             summary|
+------+--------------------+
|     1|   Did not like this|
|     3|      Almost Perfect|
|     4|DOES NOT WORK WIT...|
|     1|does not work on ...|
|     5|Love this game!  ...|
|     3|Would like it mor...|
|     5|The Oregon Trail-...|
|     5|           Still fun|
|     5|         Fun to Play|
|     1|           Cant play|
|     1| NOT FOR A newer MAC|
|     5| A must have game!!!|
|     5|          Five Stars|
|     5|          Five Stars|
|     2|   Very disappointed|
|     5|          Five Stars|
|     5|Believe it is bei...|
|     4|          Four Stars|
|     3|            So. Old.|
|     5|          Still Fun!|
|     1|       what a waste.|
|     1|        Disappointed|
|     1|I have been unabl...|
|     4| Blast from the Past|
|     1|... work in my co...|
+------+--------------------+
only showing top 25 rows



In [12]:
# group by rating and check that only ratings from 1-5 remain
df.groupBy("rating") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+------+-------+
|rating|  count|
+------+-------+
|     5|1284099|
|     4| 309921|
|     1| 254602|
|     3| 157666|
|     2| 103753|
+------+-------+



In [13]:
# counting NaN values per column (null values were already dropped above)
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+------+-------+
|rating|summary|
+------+-------+
|     0|      0|
+------+-------+



In [14]:
# function for lowercasing text
def lower_case(text):
    text = text.lower()
    return text

In [15]:
# function for expanding contractions like can't into can not
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [16]:
# function for removing punctuation
def process_text_basic(text):
    exclude = set(string.punctuation)
    text = ''.join(ch for ch in text if ch not in exclude)
    return text

In [17]:
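# function intended for filtering empty strings
# note: filter() returns a new iterator that is discarded here, so the text is
# returned unchanged; empty strings are removed in the cell that replaces "" with None
def empty_string(text):
    filter(None, text)
    return text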

In [18]:
# making udf functions out of normal python functions

# convert text to lowercase
lower_case_udf=F.udf(f=lambda row: lower_case(row), returnType=T.StringType())

# decontract phrases like can't -> can not
decontracted_udf=F.udf(f=lambda row: decontracted(row), returnType=T.StringType())

# process text function, basically only removing punctuation
process_text_basic_udf=F.udf(f=lambda row: process_text_basic(row), returnType=T.StringType())

# empty string function
empty_string_udf=F.udf(f=lambda row: empty_string(row), returnType=T.StringType())

In [19]:
# applying udf functions to df

df = df.withColumn("summary",lower_case_udf(F.col("summary")))

df = df.withColumn("summary",decontracted_udf(F.col("summary")))

df = df.withColumn("summary",process_text_basic_udf(F.col("summary")))

df = df.withColumn("summary",empty_string_udf(F.col("summary")))

In [20]:
# where empty string replace with none
df=df.select([when(col(c)=="",None).otherwise(col(c)).alias(c) for c in df.columns])

# drop none values
df = df.na.drop()

In [21]:
# split dataset into training and testing datasets
train_df,test_df = df.randomSplit([0.75,0.25])

In [22]:
# show 25 entries of train_df
train_df.show(25)

+------+--------------------+
|rating|             summary|
+------+--------------------+
|     1|  i used to enjoy...|
|     1|  it is terrible ...|
|     1|  keep on dreamin...|
|     1|        we need wood|
|     1|  you hit a butto...|
|     1|          128meg ram|
|     1|   18 wheeler racing|
|     1| 2001 we should s...|
|     1| 2gb ram w 256mb ...|
|     1| 5 headshots with...|
|     1| 9800 radian vide...|
|     1| a day which will...|
|     1| a game using the...|
|     1| a helpless consumer|
|     1| a lot more fight...|
|     1| a message pops u...|
|     1| a new patch came...|
|     1| a physical pain ...|
|     1| a term i use to ...|
|     1| a timer starts t...|
|     1| accounts for app...|
|     1|           after all|
|     1| after reaching d...|
|     1| after that come ...|
|     1| alexi123 and man...|
+------+--------------------+
only showing top 25 rows



In [23]:
# check if everything went right, when splitting into training data
train_df.groupBy("rating") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+------+------+
|rating| count|
+------+------+
|     5|962076|
|     4|232554|
|     1|190590|
|     3|118564|
|     2| 77596|
+------+------+



### Using Spark MLlib to develop a machine learning pipeline

In [24]:
#  stages for the pipeline 
tokenizer = Tokenizer(inputCol='summary',outputCol='tokens')
stopwords_remover= StopWordsRemover(inputCol='tokens',outputCol='filtered_tokens')
vectorizer=CountVectorizer(inputCol='filtered_tokens',outputCol='features')
idf = IDF(inputCol='features',outputCol='vectorized_features')

In [25]:
# logistic regression estimator
lr = LogisticRegression(featuresCol='vectorized_features',labelCol='rating')

In [26]:
# build pipeline
pipeline = Pipeline(stages=[tokenizer,stopwords_remover,vectorizer,idf,lr])

In [27]:
# building model
lr_model = pipeline.fit(train_df)

In [28]:
# predictions on test dataset
predictions = lr_model.transform(test_df)

In [29]:
# select columns
predictions.columns

['rating',
 'summary',
 'tokens',
 'filtered_tokens',
 'features',
 'vectorized_features',
 'rawPrediction',
 'probability',
 'prediction']

In [30]:
# show summary rating and prediction
predictions.select('summary','rating','prediction').show(25)

+--------------------+------+----------+
|             summary|rating|prediction|
+--------------------+------+----------+
|  but that was li...|     1|       5.0|
|  chessmaster 800...|     1|       1.0|
| 15 bucks but the...|     1|       1.0|
| 2 minutes rounds...|     1|       3.0|
| 98 which is neve...|     1|       1.0|
| a baldur is gate...|     1|       5.0|
| a nfl blitz styl...|     1|       5.0|
| a scam it does n...|     1|       1.0|
| a serious error ...|     1|       1.0|
| acof ran like a ...|     1|       1.0|
| after spending a...|     1|       2.0|
|               again|     1|       5.0|
| all look like th...|     1|       5.0|
| and 4 wide but i...|     1|       4.0|
|      and a snickers|     1|       5.0|
| and boy does it ...|     1|       3.0|
| and cause gamepl...|     1|       2.0|
| and doing ridicu...|     1|       1.0|
| and i have heard...|     1|       5.0|
| and i liked the ...|     1|       5.0|
| and i was very s...|     1|       5.0|
| and in large g

In [31]:
#  model evaluation
evaluator=MulticlassClassificationEvaluator(labelCol='rating',predictionCol='prediction',metricName='accuracy')

In [32]:
# model accuracy on summary column
accuracy_summarycol = evaluator.evaluate(predictions)
print(accuracy_summarycol)

0.7443476478814709


Ok! An accuracy of almost 75% has been achieved. Considering that the model has to assign the given text to one of 5 labels, this is a very good result. The review summary gives a good indication of whether a review was good or bad, as it is supposed to be concise. Let's see if the model can achieve the same accuracy with the text column as input.
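To put the accuracy into context, it can help to compare it with a majority-class baseline, meaning a model that always predicts the most frequent rating. The sketch below uses the per-rating counts from the groupBy output after filtering. It assumes that the test split has roughly the same class distribution as the whole dataset, and the helper name `majority_baseline` is hypothetical.

```python
# Review counts per star rating, copied from the groupBy output after filtering.
rating_counts = {5: 1284099, 4: 309921, 1: 254602, 3: 157666, 2: 103753}

def majority_baseline(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

baseline = majority_baseline(rating_counts)
print(f"majority-class baseline: {baseline:.3f}")
```

Because 5-star reviews make up roughly 61% of the data, the accuracy is best read relative to this baseline rather than relative to the 20% that random guessing among five equally frequent classes would give.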

### ML-model with text column

In [33]:
# creating new df from original data
df2 = data.select('overall','reviewText')
df2.show(10)

+--------------------+--------------------+
|             overall|          reviewText|
+--------------------+--------------------+
|                 1.0|I used to play th...|
|                 3.0|The game itself w...|
|                 4.0|I had to learn th...|
|                 1.0|The product descr...|
|                 4.0|I would recommend...|
|I really like pla...|                null|
|                 5.0|Choose your caree...|
|                 3.0|Would like it mor...|
|                 5.0|It took a few hou...|
|               [...]|                null|
+--------------------+--------------------+
only showing top 10 rows



In [35]:
# changing column names
newcolnames = ['rating','text']

for c,n in zip(df2.columns,newcolnames):
    df2=df2.withColumnRenamed(c,n)

# filtering only rated reviews
df2 = df2.filter((df2.rating == '5.0') | (df2.rating == '4.0') | (df2.rating == '3.0') | (df2.rating == '2.0') | (df2.rating == '1.0'))

df2 = df2.withColumn("rating",df2.rating.cast('int'))

# drop rows that contain a null value
df2 = df2.na.drop()

In [36]:
# group by rating and check that only ratings from 1-5 remain
df2.groupBy("rating") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+------+-------+
|rating|  count|
+------+-------+
|     5|1485936|
|     4| 412278|
|     1| 311808|
|     3| 212302|
|     2| 141306|
+------+-------+



In [37]:
# applying udf functions to df

df2 = df2.withColumn("text",lower_case_udf(F.col("text")))

df2 = df2.withColumn("text",decontracted_udf(F.col("text")))

df2 = df2.withColumn("text",process_text_basic_udf(F.col("text")))

df2 = df2.withColumn("text",empty_string_udf(F.col("text")))

In [38]:
# where empty string replace with none
df2=df2.select([when(col(c)=="",None).otherwise(col(c)).alias(c) for c in df2.columns])

# drop none values
df2 = df2.na.drop()

In [39]:
# split dataset into training and testing datasets
train_df2,test_df2 = df2.randomSplit([0.75,0.25])

In [41]:
#  stages for the pipeline 
tokenizer = Tokenizer(inputCol='text',outputCol='tokens')
stopwords_remover= StopWordsRemover(inputCol='tokens',outputCol='filtered_tokens')
vectorizer=CountVectorizer(inputCol='filtered_tokens',outputCol='features')
idf = IDF(inputCol='features',outputCol='vectorized_features')

In [42]:
pipeline = Pipeline(stages=[tokenizer,stopwords_remover,vectorizer,idf,lr])

In [None]:
# logistic regression estimator
lr = LogisticRegression(featuresCol='vectorized_features',labelCol='rating')

In [43]:
# building model
lr_model = pipeline.fit(train_df2)

# predictions on test dataset
predictions = lr_model.transform(test_df2)

In [44]:
predictions.select('text','rating','prediction').show(25)

+--------------------+------+----------+
|                text|rating|prediction|
+--------------------+------+----------+
|  it is too small...|     1|       1.0|
| i am quite sorry...|     1|       5.0|
| noty the tasmani...|     1|       1.0|
| the cool box art...|     1|       5.0|
| the graphics are...|     1|       5.0|
|1 ai not n way no...|     1|       5.0|
|18900 are you kid...|     1|       5.0|
|1who does this qu...|     1|       2.0|
|2 days after arri...|     1|       1.0|
|2 player does not...|     1|       1.0|
|29940 for only th...|     1|       5.0|
|4 short games 50 ...|     1|       1.0|
|a 60fps original ...|     1|       3.0|
|a bit of a snoore...|     1|       5.0|
|a datahookproduct...|     1|       5.0|
|a datahookproduct...|     1|       3.0|
|a few words befor...|     1|       5.0|
|a friend lent me ...|     1|       5.0|
|a fun game to pla...|     1|       1.0|
|a great looking g...|     1|       1.0|
|a longtime fan of...|     1|       1.0|
|a lot of good r

In [45]:
#  model evaluation
evaluator=MulticlassClassificationEvaluator(labelCol='rating',predictionCol='prediction',metricName='accuracy')

In [46]:
# model accuracy on text column
accuracy_textcol = evaluator.evaluate(predictions)
print(accuracy_textcol)

0.6550704475768342


As expected, the accuracy decreased. The review texts are longer and more complicated than the summaries. 65% is still decent though, since this is a multiclass classification task.

Finally, fake reviews will be predicted.

### ML-model verified column

In [113]:
# creating new df from original data
df3 = data.select('overall','summary','verified')

# changing column names
newcolnames = ['rating','summary','verified']

for c,n in zip(df3.columns,newcolnames):
    df3=df3.withColumnRenamed(c,n)

# filtering only rated reviews
df3 = df3.filter((df3.rating == '5.0') | (df3.rating == '4.0') | (df3.rating == '3.0') | (df3.rating == '2.0') | (df3.rating == '1.0'))

In [114]:
# groupby verified and see 
df3.groupBy("verified") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------+-------+
|verified|  count|
+--------+-------+
|    True|1948309|
|   False| 617040|
+--------+-------+



In [115]:
# encoding labels: verified == True -> 0, verified == False -> 1
df3=df3.select([when(col(c)=="True",0).otherwise(col(c)).alias(c) for c in df3.columns])
df3=df3.select([when(col(c)=="False",1).otherwise(col(c)).alias(c) for c in df3.columns])

In [116]:
# where empty string replace with none
df3=df3.select([when(col(c)=="",None).otherwise(col(c)).alias(c) for c in df3.columns])

# drop none values
df3 = df3.na.drop()

In [117]:
# applying udf functions to df

df3 = df3.withColumn("summary",lower_case_udf(F.col("summary")))

df3 = df3.withColumn("summary",decontracted_udf(F.col("summary")))

df3 = df3.withColumn("summary",process_text_basic_udf(F.col("summary")))

df3 = df3.withColumn("summary",empty_string_udf(F.col("summary")))

df3 = df3.withColumn("verified",empty_string_udf(F.col("verified")))

In [118]:
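# converting labels to integer dtype
# note: this casts the rating column, so "verified" now holds the 1-5 star rating
# rather than the 0/1 flag; df3.verified.cast('int') would keep the verified label
df3 = df3.withColumn("verified",df3.rating.cast('int'))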

In [119]:
# split dataset into training and testing datasets
train_df3,test_df3 = df3.randomSplit([0.75,0.25])

In [120]:
# stages for the pipeline 
tokenizer = Tokenizer(inputCol='summary',outputCol='tokens')
stopwords_remover= StopWordsRemover(inputCol='tokens',outputCol='filtered_tokens')
vectorizer=CountVectorizer(inputCol='filtered_tokens',outputCol='features')
idf = IDF(inputCol='features',outputCol='vectorized_features')

In [121]:
pipeline = Pipeline(stages=[tokenizer,stopwords_remover,vectorizer,idf,lr])

In [122]:
# logistic regression estimator
lr = LogisticRegression(featuresCol='vectorized_features',labelCol='verified')

In [123]:
# building model
lr_model = pipeline.fit(train_df3)

# predictions on test dataset
predictions = lr_model.transform(test_df3)

In [124]:
predictions.select('summary','rating','verified','prediction').show(25)

+--------------------+------+--------+----------+
|             summary|rating|verified|prediction|
+--------------------+------+--------+----------+
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|                    |   1.0|       1|       5.0|
|        we need wood|   1.0|       1|       5.0|
|          128meg ram|   1.0|       1|       5.0|
| a day which will...|   1.0|       1|       4.0|
| a new patch came...|   1.0|       1|       1.0|


In [125]:
#  model evaluation
evaluator=MulticlassClassificationEvaluator(labelCol='verified',predictionCol='prediction',metricName='accuracy')

In [126]:
# model accuracy on verified column
accuracy_fakereview = evaluator.evaluate(predictions)
print(accuracy_fakereview)

0.7440916238798052


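Again almost 75%! This result has to be read with care, though. In cell In [118] the cast is applied to the rating column, so the 'verified' label that the model was trained on holds the star rating rather than the 0/1 flag. The predictions shown above (values such as 5.0 and 4.0) and an accuracy almost identical to that of the summary model fit this explanation, which means the figure does not measure fake review detection. Beyond that, the result would be hard to interpret in any case, because there are probably few, if any, fake reviews in the dataset. So what makes a review from an unverified user different from one by a verified user? Possibly nothing. Nonetheless, an attempt has been made to classify fake reviews.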
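Independent of the accuracy figure, the counts from the groupBy output on the verified column give a feel for how balanced this proxy label is. The following is a minimal sketch in plain Python, assuming the encoding described in In [115] (True becomes 0, False becomes 1); the helper name `encode_verified` is hypothetical.

```python
# Counts per verified flag, copied from the groupBy output on the verified column.
verified_counts = {"True": 1948309, "False": 617040}

def encode_verified(flag):
    """Map the string flag to a binary label: verified -> 0, unverified -> 1."""
    if flag == "True":
        return 0
    if flag == "False":
        return 1
    return None  # anything else is treated as missing

labels = [encode_verified(f) for f in ["True", "False", "oops"]]
unverified_share = verified_counts["False"] / sum(verified_counts.values())
print(labels, round(unverified_share, 3))
```

With about 24% unverified reviews, a classifier that always answers "verified" would already be right in roughly 76% of cases, so any model for this label would need to be compared against that number.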


# Summary and Conclusions

First, I had to load the data from HDFS. Afterwards, I needed to inspect the data and explore the columns.
For a task like this, it is necessary to know which columns need to be used and which are irrelevant to the project.

By looking at the data via df.show(), I was able to see that there was a lot of noise in the data. With datasets as big as this one, it is very common to have noise that needs to be cleaned up.

In the process of cleaning, I renamed the columns of the dataframe to more meaningful names. Furthermore, the datatypes of the columns were adjusted so that numbers were actually integers instead of strings.

Then I kept only the rows with ratings from 1 to 5, because the rating column also contained other values that were not valid ratings.
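The idea behind this filtering step can be illustrated in plain Python. This is only a sketch and not the Spark code used in the notebook: the helper name `parse_rating` is hypothetical, and unlike an integer cast it also rejects fractional values instead of truncating them. The sample values are taken from the noisy rows and counts shown in the notebook output.

```python
def parse_rating(value):
    """Return the rating as an int if it is a whole number from 1 to 5, else None."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None  # text fragments or missing values
    if number.is_integer() and 1 <= number <= 5:
        return int(number)
    return None  # out-of-range values such as 1985 or -1

raw = ["1.0", "5.0", "1985", "-1", "I really like pla...", None, "[...]"]
cleaned = [parse_rating(v) for v in raw]
print(cleaned)
```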

With the groupBy function I always confirmed manually that everything was filtered the right way. Afterwards, NaN or missing values were removed from the dataframe.

In a text classification task like this, it is of the utmost importance that the text itself is cleaned properly.
That is why I implemented four typical Python functions to preprocess the text:

    1. The decontracted function, which expands contractions such as can't into can not
    2. The lowercase function, which converts the whole input text to lowercase
    3. The process_text_basic function, which removes punctuation
    4. The empty_string function, which removes texts that consist only of empty strings
    
These functions were applied via user-defined functions (udf) from pyspark.
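The four helpers could look roughly like the sketch below. This is an illustration under assumptions, not the notebook's implementation: the contraction table is a deliberately short, hypothetical one, the regex details are my own, and I assume that empty_string marks empty texts with None so they can be filtered out afterwards. The `clean` helper that chains the four steps is also hypothetical.

```python
import re
import string

# Hypothetical, deliberately short contraction table; specific forms come first.
CONTRACTIONS = [
    (r"can't", "can not"),
    (r"won't", "will not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ve", " have"),
    (r"'ll", " will"),
    (r"'m", " am"),
]

def decontracted(text):
    """Expand contractions, e.g. can't -> can not."""
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

def lowercase(text):
    return text.lower()

def process_text_basic(text):
    """Replace punctuation with spaces and collapse repeated whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())

def empty_string(text):
    """Assumed behaviour: return None for texts that are empty after cleaning."""
    return text if text.strip() else None

def clean(text):
    # Hypothetical helper that applies the four steps in order.
    return empty_string(process_text_basic(lowercase(decontracted(text))))

print(clean("I CAN'T"))       # i can not
print(clean("Don't buy!!!"))  # do not buy
print(clean(" ... "))         # None
```

Each plain function can then be wrapped with pyspark's udf and applied to the text column.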

Preprocessing text like this is important because, in the ML pipeline, the text is tokenized and then TF-IDF is applied, which means the tokens are weighted by term frequency and inverse document frequency. It is therefore necessary to map similar phrases to identical tokens. For example, "I CAN'T" and "i can not" would produce different tokens even though they say exactly the same thing. The functions mentioned above take care of this problem, so that both phrases yield the same three tokens: 'i', 'can', 'not'.
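To illustrate why identical tokens matter, here is a small TF-IDF sketch in plain Python. It is not the notebook's code: it assumes simple whitespace tokenization, raw counts as term frequency, and the smoothed inverse document frequency log((N + 1) / (df + 1)) that Spark documents for its IDF stage. The example documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((n_docs + 1) / (count + 1)) for t, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

# Without preprocessing, the two phrases get different representations.
raw = tf_idf(["I CAN'T".split(), "i can not".split(), "great game".split()])
print(raw[0] == raw[1])          # False

# After preprocessing, both phrases are the same token sequence.
cleaned = tf_idf(["i can not".split(), "i can not".split(), "great game".split()])
print(cleaned[0] == cleaned[1])  # True

# Rarer tokens receive a higher weight than tokens that occur in many documents.
weights = tf_idf([["great", "game"], ["bad", "game"], ["great", "fun"]])
print(weights[1]["bad"] > weights[1]["game"])  # True
```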

After the text/summary preprocessing, the dataframe was split into training and testing parts. This is needed to train the model on the training data and then test it on unseen data. Then the ML pipeline, consisting of tokenizer, stopwords_remover, vectorizer and idf, was applied. The stop words remover is an important part of this pipeline, as it filters out frequent words that carry no meaning with respect to sentiment. The most common stop words are pronouns, articles, prepositions, and conjunctions. This includes words like a, an, the, and, it, for, or, but, in, my, your, our, and their. In this pipeline, the tokenized and filtered text is vectorized, because the machine learning model cannot work with words, only with arrays of numbers.
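The split and the stop-word step can be pictured with the following plain-Python sketch. It only mirrors the idea and is not the Spark pipeline itself: the stop-word set is limited to the examples named above, and the 80/20 ratio and the seed are assumptions, since the actual split parameters are not stated here.

```python
import random

# Illustrative stop-word set built from the examples above; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "it", "for", "or", "but",
              "in", "my", "your", "our", "their"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

tokens = "the game and the controller broke in a week".split()
print(remove_stop_words(tokens))  # ['game', 'controller', 'broke', 'week']

train, test = train_test_split(range(10))
print(len(train), len(test))      # 8 2
```

Fixing the seed keeps the split reproducible, so accuracy values from different runs stay comparable.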

I used logistic regression for this task. The model was trained and then evaluated on the test set. The predictions were displayed, and the accuracy, i.e. the ratio of correctly predicted labels to total labels, was calculated.

This whole process was repeated three times, once for each objective of this project:
    
    1. For the summary column and rating classification
    2. For the text column and rating classification
    3. For the verified column to predict whether a review was verified
    
The results were the following:

    1. ~75% accuracy in predicting the rating from the summary text of a review
    2. ~65% accuracy in predicting the rating from the complete review text
    3. ~75% accuracy in predicting whether the summary text belonged to a verified user or not
    
Since this dataset contained well above 250,000 reviews after cleaning, and the first two objectives were multiclass classification tasks, the results were impressive! It is not easy to differentiate between a 2-star and a 3-star review, for example, and that is exactly the type of work the model had to carry out.

So my hypothesis that a machine learning model can accurately predict the rating, and therefore the sentiment, of a given text is supported by these results.

As the common-sense baseline for this task would be 20% accuracy (5 labels, so a 20% chance of guessing right at random), the model beat that baseline handily in both rating tasks.
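The 20% figure corresponds to guessing uniformly at random. A second reference point, which can be stricter if the ratings are unevenly distributed, is a classifier that always predicts the most frequent rating. The sketch below computes both; the rating list is made up for illustration and does not reflect the real distribution in the dataset.

```python
from collections import Counter

def uniform_baseline(n_classes):
    """Expected accuracy when guessing uniformly at random among n_classes labels."""
    return 1 / n_classes

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

print(uniform_baseline(5))         # 0.2

# Made-up rating distribution for illustration only, not the dataset's real counts.
ratings = [5] * 6 + [4] * 2 + [3, 1]
print(majority_baseline(ratings))  # 0.6
```

Comparing the model's accuracy against both values gives a fuller picture of how much it has actually learned from the text.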

The fake review/verified user task was somewhat unclear because, as I said earlier, there are probably not many fake reviews in this dataset, and even if there are some, my theory that they all belong to unverified users might not be true. However, the accuracy of detecting whether a review text came from a verified source or not was impressive as well.


## References

[1] https://www.cnet.com/tech/services-and-software/features/amazons-never-ending-fake-reviews-problem-explained/

[2] https://www.aboutamazon.com/news/how-amazon-works/creating-a-trustworthy-reviews-experience?tag=cnet-buy-button-20&ascsubtag=4066e7bd-1a2e-46f1-969f-f45464fec7dc%7C%7Cdtp

[3] https://monkeylearn.com/sentiment-analysis/
