**Data collection**:  
Users' reviews and ratings of Shopee's app were scraped with [Google-Play-Scraper](https://github.com/JoMingyu/google-play-scraper). After removing duplicates, 3,800 unique reviews remained.

**Data cleaning**:
- Remove duplicated reviews
- Remove reviews that do not have any meaningful words
- Remove reviews that are non-English or gibberish

**Pre-processing**:
- Remove HTML tags
- Use a regular expression to remove special characters and numbers
- Lowercase words
- Use NLTK to remove stopwords
- Remove frequently occurring words that appear in both positive and negative sentiments, like 'app', 'shopee', 'item', 'seller', 'bad'.
- Use NLTK to stem words to their root form

**Time frame of the reviews written**  
The number of reviews for Shopee on Google Play increased across all ratings (1-5 stars) between January and April 2020. This is likely a result of the rise in e-commerce purchases over the same period: the COVID-19 outbreak forced many brick-and-mortar stores to close during Singapore's circuit breaker period, and many consumers turned to online shopping instead.

**Number of thumbs up received**  
On average, negative reviews with 1 or 2-star ratings receive more thumbs up than positive reviews. This may suggest that other users face the same issues described in these negative reviews.

**Number of meaningful words**  
The average number of meaningful words in a negative review (15 words) is higher than that in a positive review (7 words), suggesting that dissatisfied customers tend to write longer reviews. The variance in the number of meaningful words is also noticeably higher among negative reviews than among positive reviews.
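As a rough illustration of how this comparison can be computed, the sketch below counts the words in each pre-processed review and reports the mean and variance per class. The sample rows and the helper name `word_count_stats` are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, pvariance

# Hypothetical pre-processed reviews (space-separated stems) with their class:
# 1 = negative, 0 = positive. These rows are illustrative only.
samples = [
    ("order cancel refund never receiv custom servic repli", 1),
    ("keep crash updat cannot login", 1),
    ("good", 0),
    ("easi use", 0),
    ("fast deliveri", 0),
]

def word_count_stats(rows):
    """Return {label: (mean, variance)} of the number of words per review."""
    counts = {}
    for text, label in rows:
        counts.setdefault(label, []).append(len(text.split()))
    return {label: (mean(c), pvariance(c)) for label, c in counts.items()}

stats = word_count_stats(samples)
for label, (avg, var) in stats.items():
    print(f"class {label}: mean={avg:.2f}, variance={var:.2f}")
```

In this toy sample the negative class has both the higher mean and the higher variance, which is the pattern described above.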

**Barplots: Top uni-grams and bi-grams**  
'Use', 'time' and 'order' are the top 3 most frequently occurring uni-grams in negative reviews. 'Customer service' is the top bi-gram seen among negative reviews. We can thus infer that users are somewhat dissatisfied with Shopee's customer service.   

'Good', 'shop' and 'easi' are the top 3 most frequently seen uni-grams in positive reviews. The bi-grams give us some context to the word 'easi', where it probably refers to an 'easy to use online shopping platform'.  
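For readers who want to reproduce this kind of count, here is a minimal sketch of n-gram counting with the standard library. The function `top_ngrams` and the three sample documents are hypothetical, and the sketch assumes the text has already been pre-processed into space-separated stems.

```python
from collections import Counter

def top_ngrams(docs, n, k=3):
    """Return the k most common n-grams across whitespace-tokenised docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of size n over the tokens of each document
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Hypothetical stemmed negative reviews, for illustration only
negative_docs = [
    "custom servic slow repli",
    "custom servic never repli order",
    "order cancel custom servic",
]

print(top_ngrams(negative_docs, n=1))
print(top_ngrams(negative_docs, n=2, k=1))
```

Counting n-grams within each document, rather than over one concatenated string, avoids creating bi-grams that span two different reviews.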

**VADER sentiment analysis**  
Given a compound score threshold of 0.175, VADER is able to correctly classify 80% of sentiments. As Shopee's product managers would also like to prioritise the identification of negative reviews so that they can fix immediate problems if necessary, achieving a decent recall rate is important. VADER is able to correctly classify 74% of actual negative reviews.
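The thresholding step can be sketched as follows. This assumes that a review is labelled negative when its compound score falls below the threshold, which the summary above does not state explicitly. The compound scores and labels are invented, and the helper names are hypothetical; in practice the scores would come from a VADER analyser.

```python
def predict_negative(compound, threshold=0.175):
    """Label a review negative (1) when its compound score is below the threshold."""
    return 1 if compound < threshold else 0

def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all reviews and recall for the negative class (label 1)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return correct / len(y_true), true_pos / actual_pos

# Hypothetical compound scores and true labels (1 = negative, 0 = positive)
compound_scores = [-0.62, 0.10, 0.40, 0.80, 0.05, 0.30]
y_true = [1, 1, 1, 0, 0, 0]

y_pred = [predict_negative(s) for s in compound_scores]
accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Raising the threshold labels more reviews as negative, which tends to increase recall on negative reviews at the cost of more positive reviews being misclassified.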

## Contents:
- [Import Libraries](#Import-Libraries)
- [Data Collection](#Data-Collection)
- [Data Cleaning & Pre-processing](#Data-Cleaning-&-Pre-processing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)

## Import Libraries

In [11]:
# import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
import re

# Library to scrape Google Play
# from google_play_scraper import Sort, reviews

from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

%matplotlib inline

## Data Collection

Users can rate [Shopee's app](https://play.google.com/store/apps/details?id=com.shopee.sg&hl=en_SG) on Google Play with a star rating and review. The ratings are on a 5-point scale, with 1 being the lowest score and 5 being the highest score one could possibly give. Since the goal of our project is to predict if a review has a positive or negative sentiment based on textual data, we will scrape real user reviews on Google Play.

[Google-Play-Scraper](https://github.com/JoMingyu/google-play-scraper) provides an API to crawl through Google Play. We used `pip install google-play-scraper` to install the package and scraped users' reviews and rating scores on Shopee's app.  

The reviews were collected in batches according to their scores (1-5), in an attempt to achieve a balanced dataset with roughly the same number of reviews for each score. To gather reviews that had more text as well as reviews that were written recently, we set up Google-Play-Scraper to scrape under both sort orders, 'Most relevant' and 'Newest'.


In [12]:
## Commenting this out so that we don't re-run the cells and accidentally re-collect the data
## Create an empty list to store the reviews that we are about to collect
# app_reviews = []

In [13]:
# # Function to scrape reviews on google play store
# # app: the ID (package name) of the app we want to scrape, e.g. 'com.shopee.sg'
# # score: number of stars rated by users
# # n_loops: the number of loops to collect reviews in batches of 200

# def reviews_scraper(app, score, n_loops):
#     for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]: # Collect both reviews types - 'most relevant' and 'newest'
#         for i in range(n_loops):
#             rvs, continuation_token = reviews(app,
#                                               lang='en',
#                                               country='sg',
#                                               sort=sort_order,
#                                               count=200, # 200 is the maximum number of reviews per page supported by Google Play
#                                               filter_score_with=score,
#                                               continuation_token=None if i==0 else continuation_token) # To begin crawling from where it last left off
#             for r in rvs:
#                 r['sort_order'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'
#                 r['app_id'] = app
#             app_reviews.extend(rvs)
#             print('No. of reviews collected: ' + str(len(rvs)))

After setting up our scraping function, we will now collect the reviews in batches, based on their scores. Few users left 2-star or 3-star reviews on Shopee, and requesting more than 2 loops for these scores returned an error. Hence, `n_loops` was set to 2 for 2-star and 3-star reviews, rather than 5.
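One way to avoid hard-coding a smaller `n_loops` is to stop paging as soon as the results run out. The sketch below shows the idea with a hypothetical `fetch_page` callable standing in for the scraper call. How the real scraper signals that there are no more reviews (an empty batch, a `None` token, or an error) is an assumption here and should be checked against its documentation.

```python
def collect_batches(fetch_page, max_loops):
    """Collect up to max_loops pages, stopping early when results run out.

    fetch_page is a hypothetical callable: it takes a continuation token
    (None for the first page) and returns (list_of_reviews, next_token).
    """
    collected, token = [], None
    for _ in range(max_loops):
        batch, token = fetch_page(token)
        collected.extend(batch)
        if not batch or token is None:
            break  # nothing left to fetch, so do not request another page
    return collected

# Stand-in for the scraper: two pages of fake reviews, then nothing
def fake_fetch(token):
    pages = {None: (["r1", "r2"], "page2"), "page2": (["r3"], None)}
    return pages[token]

all_reviews = collect_batches(fake_fetch, max_loops=5)
print(all_reviews)
```

With this pattern the same `max_loops` could be passed for every score, and the loop would simply end sooner for scores with fewer reviews.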

In [14]:
## Collect reviews that were rated 1 star
# reviews_scraper(app='com.shopee.sg', score=1, n_loops=5)

In [15]:
## Collect reviews that were rated 2 stars
# reviews_scraper(app='com.shopee.sg', score=2, n_loops=2)

In [16]:
## Collect reviews that were rated 3 stars
# reviews_scraper(app='com.shopee.sg', score=3, n_loops=2)

In [17]:
## Collect reviews that were rated 4 stars
# reviews_scraper(app='com.shopee.sg', score=4, n_loops=5)

In [18]:
## Collect reviews that were rated 5 stars
# reviews_scraper(app='com.shopee.sg', score=5, n_loops=5)

In [19]:
# # Save reviews to csv file
# pd.DataFrame(app_reviews).to_csv('../data/shopee_reviews.csv', index=False)

In [20]:
# Read in shopee csv file 
# Datetime parsing for 'at' and 'repliedAt' columns
reviews = pd.read_csv('../SentimentModel/shopee_reviews.csv', parse_dates=['at','repliedAt'])

In [21]:
# We have collected 7600 reviews
reviews.shape

(7600, 12)

In [22]:
# View first 5 rows
reviews.head()

|   | reviewId | userName | userImage | content | score | thumbsUpCount | reviewCreatedVersion | at | replyContent | repliedAt | sort_order | app_id |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 28103537-a7f2-4748-ab09-00eae61061ed | Rei Asuka | https://play-lh.googleusercontent.com/a/AGNmyx... | The app is workng fine. The issue i had was th... | 1 | 320 | 2.91.31 | 2022-08-14 06:11:42 | Thank you for bringing this issue to our atten... | 2022-08-14 09:21:57 | most_relevant | com.shopee.sg |
| 1 | 89137ef2-561d-4f36-83b6-3b48cfaa8ac8 | Alexander 1995 | https://play-lh.googleusercontent.com/a/AGNmyx... | The issue of this rated 🌟 with genuine conside... | 1 | 4 | 2.54.04 | 2020-08-13 20:37:40 | Thank you for bringing this issue to our atten... | 2020-08-13 22:16:39 | most_relevant | com.shopee.sg |
| 2 | c9b45b86-2e35-4ab2-9b27-152881c4e4ff | Jaymes Chiang | https://play-lh.googleusercontent.com/a/AGNmyx... | Latest version of the app keeps crashing whene... | 1 | 22 |  | 2022-03-02 00:10:18 |  | NaT | most_relevant | com.shopee.sg |
| 3 | 31d98c71-a1d2-4f7a-b4c8-f3cc1d7922f3 | A Google user | https://play-lh.googleusercontent.com/EGemoI2N... | Payment page is such a disaster. Full of verif... | 1 | 22 |  | 2019-10-29 23:45:11 | Hey,\r\n\r\nThank you for the review.\r\n\r\nW... | 2019-10-30 09:45:11 | most_relevant | com.shopee.sg |
| 4 | 28430f95-0409-429c-ad10-e12a08be6c24 | A Google user | https://play-lh.googleusercontent.com/EGemoI2N... | Be very careful when you use this platform. Ma... | 1 | 9 | 2.23.31 | 2018-12-06 14:43:45 | Hi Royis! We're so sorry for the unpleasant ex... | 2018-12-06 22:16:34 | most_relevant | com.shopee.sg |


In [23]:
# Check that the datatypes are correct eg. 'at' and 'repliedAt' are datetime
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7600 entries, 0 to 7599
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   reviewId              7600 non-null   object        
 1   userName              7600 non-null   object        
 2   userImage             7600 non-null   object        
 3   content               7600 non-null   object        
 4   score                 7600 non-null   int64         
 5   thumbsUpCount         7600 non-null   int64         
 6   reviewCreatedVersion  6948 non-null   object        
 7   at                    7600 non-null   datetime64[ns]
 8   replyContent          6468 non-null   object        
 9   repliedAt             6468 non-null   datetime64[ns]
 10  sort_order            7600 non-null   object        
 11  app_id                7600 non-null   object        
dtypes: datetime64[ns](2), int64(2), object(8)
memory usage: 712.6+ KB


In [24]:
# Some null values in app version & developer replies
reviews.isnull().sum()

reviewId                   0
userName                   0
userImage                  0
content                    0
score                      0
thumbsUpCount              0
reviewCreatedVersion     652
at                         0
replyContent            1132
repliedAt               1132
sort_order                 0
app_id                     0
dtype: int64

In [25]:
# Summary statistics for numerical variables
reviews.describe()

|       | score    | thumbsUpCount | at                            | repliedAt                     |
|:------|:---------|:--------------|:------------------------------|:------------------------------|
| count | 7600.0   | 7600.0        | 7600                          | 6468                          |
| mean  | 3.157895 | 5.046579      | 2021-08-06 04:36:43.404210432 | 2021-09-12 08:57:26.810141952 |
| min   | 1.0      | 0.0           | 2018-09-13 15:34:41           | 2018-09-13 16:35:05           |
| 25%   | 1.0      | 0.0           | 2020-08-05 21:24:52           | 2020-09-05 21:19:48           |
| 50%   | 4.0      | 1.0           | 2021-08-12 02:46:06           | 2021-10-17 10:05:44           |
| 75%   | 5.0      | 3.0           | 2022-10-09 06:00:56.249999872 | 2022-10-30 14:33:54           |
| max   | 5.0      | 405.0         | 2023-04-08 19:32:15           | 2023-04-09 09:59:44           |
| std   | 1.564952 | 18.131879     |                               |                               |


In [26]:
# Check how many reviews were retrieved from each score
reviews['score'].value_counts().sort_index()

score
1    2000
2     800
3     800
4    2000
5    2000
Name: count, dtype: int64

## Data Dictionary

The data dictionary below provides an overview of the features in our dataset.

| Feature              | Type     | Description                                                                                        |
|:---------------------|:---------|:---------------------------------------------------------------------------------------------------|
| reviewId             | obj      | Unique review ID                                                                                   |
| userName             | obj      | Username of the reviewer                                                                           |
| userImage            | obj      | URL of the user's profile photo                                                                    |
| content              | obj      | Text of the review                                                                                 |
| score                | int      | No. of stars the user gave (1-5)                                                                   |
| thumbsUpCount        | int      | No. of thumbs up the review received from other users                                              |
| reviewCreatedVersion | obj      | App version                                                                                        |
| at                   | datetime | Date and time at which the review was written                                                      |
| replyContent         | obj      | Shopee's reply to the review                                                                       |
| repliedAt            | datetime | Date and time of Shopee's reply                                                                    |
| sort_order           | obj      | Indicates whether the data was scraped from the 'Most relevant' or 'Newest' section in Google Play |
| app_id               | obj      | ID (package name) of the app from which the reviews were collected, e.g. `com.shopee.sg`           |

## Data Cleaning & Pre-processing

### Remove duplicated reviews

Since we scraped reviews sorted by both 'Most relevant' and 'Newest', the same review can appear under both sort orders, so the dataset contains duplicates. We drop these duplicates to ensure that we train and test our model on unique reviews.
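To make the de-duplication rule concrete, here is a plain-Python sketch of the same logic: two rows count as duplicates when they share the same `userName`, `content` and `at` values, and only the first occurrence is kept. The rows and the helper name `drop_duplicate_reviews` are hypothetical; the notebook itself does this with pandas.

```python
def drop_duplicate_reviews(rows, keys=("userName", "content", "at")):
    """Keep the first occurrence of each (userName, content, at) combination."""
    seen, unique = set(), []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique

# Hypothetical rows: the first two are the same review scraped under both sort orders
rows = [
    {"userName": "A", "content": "Great app", "at": "2022-01-01 10:00:00", "sort_order": "most_relevant"},
    {"userName": "A", "content": "Great app", "at": "2022-01-01 10:00:00", "sort_order": "newest"},
    {"userName": "B", "content": "Great app", "at": "2022-01-02 09:30:00", "sort_order": "newest"},
]

unique_rows = drop_duplicate_reviews(rows)
print(len(unique_rows))
```

Matching on all three fields, rather than on `content` alone, keeps short reviews such as "Great app" that different users happened to write identically.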

In [27]:
# 7600 reviews in our dataframe
reviews.shape

(7600, 12)

In [28]:
# There are 3800 duplicates
reviews[reviews.duplicated(['userName', 'content', 'at'])].shape

(3800, 12)

In [29]:
reviews[reviews.duplicated(['userName', 'content', 'at'])]['content']

1000    The app is workng fine. The issue i had was th...
1001    The issue of this rated 🌟 with genuine conside...
1002    Latest version of the app keeps crashing whene...
1003    Payment page is such a disaster. Full of verif...
1004    Be very careful when you use this platform. Ma...
                              ...                        
7595    Sellers are all very friendly, app will always...
7596    Can get very cheap price compared to buy from ...
7597    Fast service, low shipping fees, highly recomm...
7598    Quite a good application. Have lots of items f...
7599    It is recommended app / platform to use Shopee...
Name: content, Length: 3800, dtype: object

In [30]:
# Drop duplicates as we only want unique reviews
reviews.drop_duplicates(['userName', 'content', 'at'], inplace=True)

In [31]:
# Reindex the dataframe
reviews.reset_index(drop=True, inplace=True)

In [32]:
# Check that we have dropped these duplicates
reviews.shape

(3800, 12)

In [33]:
# Check how many reviews we have for each score after dropping the duplicates
reviews['score'].value_counts().sort_index()

score
1    1000
2     400
3     400
4    1000
5    1000
Name: count, dtype: int64

### Rename columns

It is good practice to use snake case when naming our columns.

In [34]:
# Rename the columns to lowercase and use underscores
reviews.rename(columns={'reviewId': 'review_id', 
                        'userName': 'username', 
                        'userImage': 'user_image', 
                        'thumbsUpCount': 'thumbs_up_count', 
                        'reviewCreatedVersion': 'review_created_version', 
                        'replyContent': 'reply_content',
                        'repliedAt': 'replied_at'},
                inplace=True) 

In [35]:
# Check that columns have been correctly renamed
reviews.columns

Index(['review_id', 'username', 'user_image', 'content', 'score',
       'thumbs_up_count', 'review_created_version', 'at', 'reply_content',
       'replied_at', 'sort_order', 'app_id'],
      dtype='object')

### Create a target variable

The goal is to classify positive and negative app reviews. Negative reviews can reveal critical features that are missing from Shopee's app or even bring to light the presence of bugs on the app. This will require immediate action from Shopee. As such, priority will be placed on the prediction of negative sentiment. We will assign the negative reviews (scores 1-3) to class 1, and the positive reviews (scores 4-5) to class 0.

In [36]:
# Defining the target variable using scores
reviews['target'] = reviews['score'].map(lambda x: 1 if x < 4 else 0)

In [37]:
# Check the count of our target variable
reviews['target'].value_counts()

target
0    2000
1    1800
Name: count, dtype: int64

In [38]:
# Check that the target variable has been added to our dataframe
reviews.head()

|   | review_id | username | user_image | content | score | thumbs_up_count | review_created_version | at | reply_content | replied_at | sort_order | app_id | target |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 28103537-a7f2-4748-ab09-00eae61061ed | Rei Asuka | https://play-lh.googleusercontent.com/a/AGNmyx... | The app is workng fine. The issue i had was th... | 1 | 320 | 2.91.31 | 2022-08-14 06:11:42 | Thank you for bringing this issue to our atten... | 2022-08-14 09:21:57 | most_relevant | com.shopee.sg | 1 |
| 1 | 89137ef2-561d-4f36-83b6-3b48cfaa8ac8 | Alexander 1995 | https://play-lh.googleusercontent.com/a/AGNmyx... | The issue of this rated 🌟 with genuine conside... | 1 | 4 | 2.54.04 | 2020-08-13 20:37:40 | Thank you for bringing this issue to our atten... | 2020-08-13 22:16:39 | most_relevant | com.shopee.sg | 1 |
| 2 | c9b45b86-2e35-4ab2-9b27-152881c4e4ff | Jaymes Chiang | https://play-lh.googleusercontent.com/a/AGNmyx... | Latest version of the app keeps crashing whene... | 1 | 22 |  | 2022-03-02 00:10:18 |  | NaT | most_relevant | com.shopee.sg | 1 |
| 3 | 31d98c71-a1d2-4f7a-b4c8-f3cc1d7922f3 | A Google user | https://play-lh.googleusercontent.com/EGemoI2N... | Payment page is such a disaster. Full of verif... | 1 | 22 |  | 2019-10-29 23:45:11 | Hey,\r\n\r\nThank you for the review.\r\n\r\nW... | 2019-10-30 09:45:11 | most_relevant | com.shopee.sg | 1 |
| 4 | 28430f95-0409-429c-ad10-e12a08be6c24 | A Google user | https://play-lh.googleusercontent.com/EGemoI2N... | Be very careful when you use this platform. Ma... | 1 | 9 | 2.23.31 | 2018-12-06 14:43:45 | Hi Royis! We're so sorry for the unpleasant ex... | 2018-12-06 22:16:34 | most_relevant | com.shopee.sg | 1 |


### Pre-processing

Next, we will perform pre-processing to transform our text into a more digestible form for our classifier. The steps are as follows:
- Remove HTML tags
- Use a regular expression to remove special characters and numbers
- Lowercase words
- Use NLTK to remove stopwords
- Remove frequently occurring words that appear in both positive and negative sentiments, like 'app', 'shopee', 'item', 'seller', 'bad'. Removing these words led to a 1 and 2 percentage point improvement in our model's accuracy and recall rate, respectively.
- Use NLTK to stem words to their root form. Note that the model returned better accuracy when we used stemming, rather than lemmatizing.

It is also good to note that we have tried using SpaCy to remove stopwords and for lemmatizing. However, our model's performance was much better when the text was pre-processed using NLTK.

In [39]:
# Build the stopword set and the stemmer once, rather than on every call
# Searching through a set is faster than searching through a list
stops = set(stopwords.words('english'))

# Add words that appear frequently in both positive and negative reviews
stops.update(['app','shopee','shoppee','item','items','seller','sellers','bad'])

# Instantiate PorterStemmer
p_stemmer = PorterStemmer()

# Write a function to convert raw text to a string of meaningful words
def stem_text(raw_text):

    # Remove HTML tags (the parser is named explicitly so that the result
    # does not depend on which parsers are installed)
    review_text = BeautifulSoup(raw_text, 'html.parser').get_text()

    # Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # Convert words to lower case and split each word up
    words = letters_only.lower().split()

    # Remove stopwords, then stem the remaining words
    meaningful_words = [p_stemmer.stem(w) for w in words if w not in stops]

    # Join words back into one string, with a space in between each word
    return " ".join(meaningful_words)

In [40]:
# Pre-process the raw text
reviews['content_stem'] = reviews['content'].map(stem_text)


Let us compare our original text with the pre-processed version.

In [41]:
# This is the original text of the first review in our dataset
reviews.loc[0, 'content']

"The app is workng fine. The issue i had was the shopee pet game. 1) Half the time, it will get stuck at 99% loading. 2) 20% of the time, it will not be able to connect to server after you completeed the music minigame and have to replay the minigame again. 3) Worst of all is the soccer minigame. Even if the goalie is at one end of the goal, it can still block the ball at the other end, making it totally based on luck. Sometimes, it doesn't even need to touch the ball to block it."

In [42]:
# This is what the text looks like after stemming
reviews.loc[0, 'content_stem']

'workng fine issu pet game half time get stuck load time abl connect server complete music minigam replay minigam worst soccer minigam even goali one end goal still block ball end make total base luck sometim even need touch ball block'

### Remove reviews that do not have any meaningful words

After pre-processing, some reviews may have no meaningful words left. Reviews that consist largely of emojis or Chinese characters, for example, return blank fields after stemming; in other words, their length after pre-processing is 0. Since such reviews add no value to our model's training, we remove them from our dataset. In the run shown here, no reviews were left empty.
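The short sketch below shows why such reviews end up empty: the letters-only filter replaces every character outside `a-z` and `A-Z` with a space, so text made up entirely of emojis or Chinese characters leaves no tokens behind. It applies only the regular expression, lowercasing and splitting steps; stopword removal and stemming are left out for brevity. The helper name `meaningful_tokens` and the example strings are hypothetical.

```python
import re

def meaningful_tokens(raw_text):
    """Apply the letters-only filter, then lowercase and split the text."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    return letters_only.lower().split()

# Hypothetical reviews: emoji only, Chinese only, and mixed
examples = ["👍👍👍", "很好", "Good 👍"]

token_counts = [len(meaningful_tokens(text)) for text in examples]
print(token_counts)
```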

In [43]:
# Find the number of meaningful words in each review
reviews['content_clean_len'] = reviews['content_stem'].str.split().map(len)

In [44]:
# Count the reviews that do not have any meaningful words (none in this run)
reviews[reviews['content_clean_len']==0].shape

(0, 15)

In [45]:
# View reviews that do not have any meaningful words
reviews[reviews['content_clean_len']==0]['content']

Series([], Name: content, dtype: object)

In [46]:
# Drop these reviews that do not have any meaningful words
reviews = reviews.drop(reviews[reviews['content_clean_len']==0].index)

In [47]:
# Reindex the dataframe
reviews.reset_index(drop=True, inplace=True)

### Create a train and test set

20% of the original dataset will be set aside and used as a test set. This will be useful in evaluating our model's performance on unseen data.  

We will stratify the split on the target variable to preserve the class representation in our train and test sets.
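To illustrate what stratifying does, the sketch below draws 20% of each class separately, so the test set keeps the same class proportions as the full dataset. It is a simplified stand-in written with the standard library, not scikit-learn's actual implementation, and the function name `stratified_split` is hypothetical. The class counts are the ones reported for the target variable earlier in this notebook.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 2000 positive (class 0) and 1800 negative (class 1) reviews
labels = [0] * 2000 + [1] * 1800

train_idx, test_idx = stratified_split(labels)
test_negative_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_negative_share, 4))
```

Sampling within each class is what keeps the share of negative reviews in the test set equal to the share in the full dataset.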

In [48]:
# As we would like to stratify our target variable, we will need to first assign X and y
X = reviews.drop(columns='target')
y = reviews['target']

In [49]:
# Perform a train_test_split to create a train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [50]:
# Merge X_train and y_train back together using index
train = pd.merge(X_train, y_train, left_index=True, right_index=True)

# Merge X_test and y_test back together using index
test = pd.merge(X_test, y_test, left_index=True, right_index=True)

In [51]:
# Reindex the train and test set
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

In [52]:
# 3040 documents in our training set
train.shape

(3040, 15)

In [53]:
# 760 documents in our test set
test.shape

(760, 15)

In [54]:
# Check split in class labels for training set
train['target'].value_counts(normalize=True)

target
0    0.526316
1    0.473684
Name: proportion, dtype: float64

In [55]:
# Check split in class labels for test set
test['target'].value_counts(normalize=True)

target
0    0.526316
1    0.473684
Name: proportion, dtype: float64

Finally, after data cleaning, we have 3,040 reviews for training and 760 reviews in our test set. The class representation is consistent across the train and test sets, with about 53% of the data belonging to class 0 (positive sentiment) and 47% belonging to class 1 (negative sentiment).

## Save clean datasets for modeling

In [56]:
# Keep only the columns that we need for modeling and interpretation
train = train[['content','content_stem','score','target']]
test = test[['content','content_stem','score','target']]

In [57]:
# Save clean training set
train.to_csv('../SentimentModel/clean_train.csv', index=False)

In [58]:
# Save clean test set
test.to_csv('../SentimentModel/clean_test.csv', index=False)