# Shopping Apps, Rating for Google Play Store and Apple AppStore Users

<img src="https://image.freepik.com/free-vector/cartoon-delivery-man-brings-goods-customer-from-laptop-vector-illustration-concept-with-online-shopping-services_46527-344.jpg" />

## Introduction

Users download apps for various purposes. Online shopping has grown during the Covid-19 pandemic, so improving the shopping experience has become more important than before. With that in mind, which features should we look at to improve a shopping app?

## Problem Statement

- How do the app ratings differ across different shopping apps?
- Is there any specific group of users we can look out for to improve the app?
- Are there any specific improvements we can work on to further improve the user experience of the app?

To explore and answer the above questions, we will scrape reviews from Google Play Store and Apple AppStore and conduct analysis and modelling.

## Executive Summary

The data is web-scraped from the Shopping category in Google Play Store and Apple AppStore. Reviews of 8 apps were chosen for this project (Amazon, Wish, ASOS, Lazada, eBay, Shopee, AliExpress, Carousell). Only reviews dated in 2020 are used, as the majority of the scraped data is from 2020. Data cleaning consisted of removing stopwords and lemmatizing the raw text, which was then vectorized to create a bag-of-words.
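As a rough illustration of what a bag-of-words representation looks like, the sketch below counts words per review using only the standard library. The two sample reviews and the helper name `vectorize` are hypothetical, and the project itself may have used a library vectorizer instead.

```python
from collections import Counter

# Hypothetical cleaned reviews (already lowercased, stop words removed).
cleaned_reviews = [
    "good app fast delivery",
    "bad app slow delivery bad service",
]

# The vocabulary is every distinct word, in a fixed (sorted) column order.
vocab = sorted({word for doc in cleaned_reviews for word in doc.split()})

def vectorize(doc):
    """Return the word counts of one review, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [vectorize(doc) for doc in cleaned_reviews]

print(vocab)
for row in matrix:
    print(row)
```

Each row is one review and each column is one vocabulary word, so word order is discarded and only the counts remain.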

There are 3 steps in our modelling process. The first step classifies whether the text is a good or bad review. The second and third steps classify the bad reviews and the good reviews, respectively, into categories created through topic modelling.

Several classification models were tried, namely LogisticRegression, MultinomialNB, SGDClassifier, RandomForest and ADABoost. LogisticRegression gave the best results in classifying our data and was therefore used as the final model.

As the data set is quite big, RandomizedSearch was used instead of GridSearch to find the best hyperparameters, since it evaluates only a sample of the parameter combinations rather than all of them.
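The difference in cost between the two search strategies can be sketched with the standard library alone. The parameter grid below is hypothetical (the actual search space is not shown in this notebook), and the helper names are made up for illustration. In scikit-learn, the number of sampled candidates corresponds to the `n_iter` argument of `RandomizedSearchCV`.

```python
import itertools
import random

# Hypothetical search space; the project's real grid is not shown here.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "max_iter": [100, 500, 1000],
}

def grid_candidates(grid):
    """Every combination in the grid: what an exhaustive grid search would fit."""
    keys = sorted(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]

def random_candidates(grid, n_iter, seed=0):
    """A fixed-size random subset of the grid: what a randomized search would fit."""
    rng = random.Random(seed)
    return rng.sample(grid_candidates(grid), n_iter)

all_combos = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=10)

print(len(all_combos), len(sampled))  # 30 10
```

With cross-validation, each candidate is fitted once per fold, so sampling 10 of 30 candidates cuts the number of model fits to a third in this toy case.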



### Content Summary
- Web-scraped reviews of 8 apps from Google Play Store & Apple App Store
- Data Cleaning 
  - Removing data not in year 2020
  - Removing emoji and punctuation
  - Removing non-English words
  - Lemmatization
  - Compound score calculation using VaderSentiment
- EDA
  - Plotting distribution of features
  - Topic modelling of good reviews
  - Topic modelling of bad reviews
- Machine Learning Model 
  - LogisticRegression
  - MultinomialNB
  - SGDClassifier
  - RandomForest
  - ADABoost
- Deep Learning Model
  - Convolutional Neural Network
  
### Key Findings
- Most complaints are about bad user experience, while most good reviews praise the overall service of the app, which is rather vague
- There are more negative reviews in the 9am - 3pm period, and on Tuesdays
- Quite a number of reviews are just 1 word, or have a rating that contradicts the text (e.g. review: Excellent, Rating: 1)
- The multiclass model seems to assign categories better than the topic labels originally produced by topic modelling, which is interesting, as the model is able to differentiate the categories clearly based on the keywords.

### Metrics
The following metrics are used to evaluate the models:
- ROC AUC (for binary classification)
  - The area under the ROC curve tells how well the model is capable of distinguishing between classes. It ranges from 0 to 1, with 1 being perfectly classified and 0.5 being no better than random guessing.
- MCC Score
  - The Matthews correlation coefficient (MCC) produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. It ranges from -1 to 1, with 1 being a perfect prediction and 0 being no better than random.
- Kappa Score (for multiclass classification)
  - Cohen’s Kappa is a quantitative measure of agreement between two raters that are rating the same thing (here, the model's predictions and the true labels), corrected for how often the raters may agree by chance. It ranges from -1 to 1, with 1 being perfect agreement and 0 being chance-level agreement.
- F1 score (weighted)
  - The F1 scores are calculated for each label and then averaged, weighted by support, which is the number of true instances for each label. It can result in an F-score that is not between precision and recall.
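To make the MCC and Kappa definitions concrete, here is a minimal sketch that computes both from binary confusion-matrix counts using only the standard library. The labels are made-up toy data, not project results, and the function names are hypothetical. The Kappa shown here is the two-class case. In practice, scikit-learn's `matthews_corrcoef` and `cohen_kappa_score` cover the multiclass case as well.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cohen_kappa(tp, tn, fp, fn):
    """Observed agreement corrected for the agreement expected by chance."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    # Returning 0.0 when expected == 1 is a convention chosen for this sketch.
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

# Toy data: 50 positives and 50 negatives, 10 mistakes in each class.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10

counts = confusion_counts(y_true, y_pred)
print(counts)
print(round(mcc(*counts), 3), round(cohen_kappa(*counts), 3))
```

On this balanced toy example accuracy is 0.8, while both MCC and Kappa come out at 0.6 because they discount the agreement that chance alone would produce.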

### Final Results
**Classification (Good & Bad Reviews)**
- LogisticRegression
  - Train data AUC: 0.947
  - Test data AUC: 0.945
  - MCC Score: 0.729
  
  
**Multi Classification (Bad Review categories)**
- LogisticRegression
    - Train Data f1 weighted score: 0.867	
    - Test Data f1 weighted score: 0.862	
    - MCC Score: 0.798336	
    - Kappa Score: 0.796888


**Multi Classification (Good Review categories)**
- LogisticRegression
    - Train Data f1 weighted score: 0.937	
    - Test Data f1 weighted score: 0.944	
    - MCC Score: 0.907	
    - Kappa Score: 0.906

### Limitations
- The data set was mostly collected in August and September, which means the model is likely to predict reviews from this period better than reviews from earlier periods.
- More data could be collected, as there are far fewer Apple AppStore reviews than Google Play Store reviews.

### Further research
- Try using the compound score from VaderSentiment for the classification instead of the star rating, as we know some reviews were rated wrongly by users. This will hopefully give us better accuracy.
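One way this idea could be tried is to turn the compound score into a label with fixed cut-offs. The sketch below assumes the commonly used VADER thresholds of 0.05 and -0.05. The function name and the label names are hypothetical, and the cut-offs would need tuning on the actual data.

```python
def label_from_compound(compound, pos_cut=0.05, neg_cut=-0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label.

    The cut-offs are assumptions for illustration, not values from the project.
    """
    if compound >= pos_cut:
        return "good"
    if compound <= neg_cut:
        return "bad"
    return "neutral"

# Hypothetical compound scores for three reviews.
for score in (0.62, -0.31, 0.0):
    print(score, label_from_compound(score))
```

Reviews falling in the neutral band would have to be either dropped or assigned to one of the two classes before training a binary classifier.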

### Content
1. Webscrap data
2. Data Cleaning
3. EDA
4. Model Part 1, Classification (Good & Bad Reviews)
5. Model Part 2, Multi Classification (Bad Review categories) 
6. Model Part 3, Multi Classification (Good Review categories)
7. Deep Learning Model

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

from PIL import Image
from wordcloud import WordCloud

from bs4 import BeautifulSoup
import re
import spacy
from spacymoji import Emoji
from nltk.corpus import stopwords, words
from nltk import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Import data

In [2]:
df_info = pd.read_csv('../data/shoppingapps_info.csv')
df_google = pd.read_csv('../data/google_apps_reviews.csv')
df_apple = pd.read_csv('../data/apple_apps_reviews.csv')

## Checking info 

In [3]:
df_info.head()

Unnamed: 0,title,description,descriptionHTML,summary,summaryHTML,installs,minInstalls,score,ratings,reviews,...,contentRatingDescription,adSupported,containsAds,released,updated,version,recentChanges,recentChangesHTML,appId,url
0,Shopee: #1 Online Platform,*For usage in Singapore only\r\n\r\nShopee is ...,*For usage in Singapore only<br><br>Shopee is ...,#1 Online Shopping Platform\r\n15% Cashback | ...,#1 Online Shopping Platform<br>15% Cashback | ...,"1,000,000+",1000000,4.776189,75281,20125,...,,,False,"May 24, 2015",1600059737,2.60.11,Thanks for using Shopee! We’ve fixed some bugs...,Thanks for using Shopee! We’ve fixed some bugs...,com.shopee.sg,https://play.google.com/store/apps/details?id=...
1,"Carousell: Snap-Sell, Chat-Buy",Carousell is a community marketplace that lets...,Carousell is a community marketplace that lets...,"Carousell: Snap to Sell, Chat to Buy. And it's...","Carousell: Snap to Sell, Chat to Buy. And it&#...","10,000,000+",10000000,4.542936,186438,70213,...,Parental Guidance Recommended,True,True,"Jan 15, 2013",1600053888,2.179.753.778,"Every week, we polish the app to help you buy,...","Every week, we polish the app to help you buy,...",com.thecarousell.Carousell,https://play.google.com/store/apps/details?id=...
2,Lazada & RedMart - Online Shopping & Groceries,Welcome to the brand new Lazada™ mobile app! J...,Welcome to the brand new Lazada™ mobile app! J...,Shop Countless Deals Online at Lazada Singapore!,Shop Countless Deals Online at Lazada Singapore!,"100,000,000+",100000000,4.441479,7051435,2646815,...,,,False,"Jun 8, 2013",1600176475,6.52.0,Thanks for using LAZADA! We've enhanced the pe...,Thanks for using LAZADA! We&#39;ve enhanced th...,com.lazada.android,https://play.google.com/store/apps/details?id=...
3,"AliExpress - Smarter Shopping, Better Living","Ever wanted to shop everything in one place, a...","Ever wanted to shop everything in one place, a...","AliExpress - Smarter Shopping, Better Living","AliExpress - Smarter Shopping, Better Living","100,000,000+",100000000,4.509242,10217530,3144111,...,Parental Guidance Recommended,,False,"Sep 27, 2012",1600157563,8.16.0,We're always looking for ways to further optim...,We&#39;re always looking for ways to further o...,com.alibaba.aliexpresshd,https://play.google.com/store/apps/details?id=...
4,eBay: Discover great deals on the brands you love,"Buy, sell and save with the eBay app! Shop dea...","Buy, sell and save with the eBay app! Shop dea...",Buy and sell on the world’s largest marketplac...,Buy and sell on the world’s largest marketplac...,"100,000,000+",100000000,4.716624,3436915,1263982,...,Parental Guidance Recommended,True,True,"Feb 17, 2010",1599781634,Varies with device,Tap “lockable filters” in the “customize” sect...,Tap “lockable filters” in the “customize” sect...,com.ebay.mobile,https://play.google.com/store/apps/details?id=...


In [4]:
df_info.title

0                           Shopee: #1 Online Platform
1                       Carousell: Snap-Sell, Chat-Buy
2       Lazada & RedMart - Online Shopping & Groceries
3         AliExpress - Smarter Shopping, Better Living
4    eBay: Discover great deals on the brands you love
5       Amazon Shopping - Search, Find, Ship, and Save
6                                                 ASOS
7                             Wish - Shopping Made Fun
Name: title, dtype: object

## Checking Google Playstore Review

In [5]:
df_google.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,sortOrder,appId
0,gp:AOqpTOFMK_YUOfYqWluUehP3lajbdBztb0kaA_oinNC...,songsin12,https://lh3.googleusercontent.com/-_Z0Ydwm7Xcc...,Orders mostly came early and products are good.,5,0,2.60.11,2020-09-16 20:26:28,"""Thank you for giving Shopee a 5-star review! ...",2020-07-09 10:04:23,newest,com.shopee.sg
1,gp:AOqpTOECqGI7ocjdKrG5PykxhWmBG2wp1HzwO5zxrvj...,Ho Soh Fong,https://lh3.googleusercontent.com/a-/AOh14GhHM...,Good and convenient,4,0,2.60.08,2020-09-16 20:13:46,Thank you for your review. We're excited to be...,2020-09-16 20:17:31,newest,com.shopee.sg
2,gp:AOqpTOFK6om-GRJCgm-WXGyf_nurLs1YXL3FSoLRr5b...,Yasohthah Devadas,https://lh3.googleusercontent.com/-sji2OhurxhM...,Gd...........,5,0,2.60.11,2020-09-16 20:13:05,Thank you for giving Shopee a 5-star review! W...,2020-09-16 20:18:24,newest,com.shopee.sg
3,gp:AOqpTOHfstsd3G5jEoFQn62yZ9gpYcVj2oRlZZkKWBn...,May Han,https://play-lh.googleusercontent.com/-RhVs3Za...,My first purchase experience...Happy with purc...,4,0,2.60.11,2020-09-16 20:11:18,Thank you for your review. We're excited to be...,2020-09-16 20:28:10,newest,com.shopee.sg
4,gp:AOqpTOGtT_ODk0PZdHD0m_phZw4fFng1RvxZsM9Gk_v...,fauziah ata,https://lh3.googleusercontent.com/-NhTM2s673Pw...,A lot of items at a very good deal.,5,0,,2020-09-16 20:08:54,Thank you for giving Shopee a 5-star review! W...,2020-09-16 20:29:35,newest,com.shopee.sg


In [6]:
labels = {'appId' : {'com.amazon.mShop.android.shopping' : 'amazon',
                    'com.ebay.mobile' : 'ebay',
                     'com.shopee.sg': 'shoppee',
                     'com.alibaba.aliexpresshd': 'aliexpress',
                     'com.thecarousell.Carousell': 'carousell',
                     'com.asos.app': 'asos',
                     'com.lazada.android': 'lazada',
                     'com.contextlogic.wish': 'wish',
                    }}
df_google.replace(labels, inplace=True)
df_google['appId'].value_counts()

carousell     10000
amazon        10000
wish          10000
aliexpress    10000
shoppee       10000
asos          10000
lazada        10000
ebay          10000
Name: appId, dtype: int64

## Checking Apple Appstore Review

In [7]:
df_apple.head()

Unnamed: 0,date,title,userName,rating,review,developerResponse,isEdited,appid
0,2020-08-05 05:32:41,Good in price and customers’ interest platform,Little tortoise,5,Shopee is a platform that protecting customers...,"{'id': 17097958, 'body': ""Thank you for giving...",False,shoppee
1,2019-12-17 15:59:30,Bad experience,Tujimu,1,I have been used this app for years. However t...,"{'id': 12456571, 'body': 'Hey,\n\nThank you fo...",False,shoppee
2,2020-08-03 09:08:40,Waste of time,Xed82,1,I regretted choosing shopee. Reason being: Pur...,"{'id': 17059815, 'body': 'Thank you for bringi...",False,shoppee
3,2019-12-09 08:02:35,Shoppee is not one of the best shopping app...!,Cartoonfreak1980,5,Who says shopee is one of the best shopping ap...,"{'id': 11615436, 'body': 'Hey, :D\n\nThank you...",True,shoppee
4,2020-04-10 04:39:06,Disappointed,Raymond Koh,1,"Initially when Shopee started, transactions ar...","{'id': 14589371, 'body': 'Thank you for bringi...",False,shoppee


## Adding new column to df

In [8]:
df_google['store'] = 'google'
df_apple['store'] = 'apple'

## Renaming columns

In [9]:
df_google.columns

Index(['reviewId', 'userName', 'userImage', 'content', 'score',
       'thumbsUpCount', 'reviewCreatedVersion', 'at', 'replyContent',
       'repliedAt', 'sortOrder', 'appId', 'store'],
      dtype='object')

In [10]:
df_google.rename(columns = {'score' : 'rating', 'at': 'date', 'appId': 'app', 'content': 'review'}, inplace = True)
df_google.head()

Unnamed: 0,reviewId,userName,userImage,review,rating,thumbsUpCount,reviewCreatedVersion,date,replyContent,repliedAt,sortOrder,app,store
0,gp:AOqpTOFMK_YUOfYqWluUehP3lajbdBztb0kaA_oinNC...,songsin12,https://lh3.googleusercontent.com/-_Z0Ydwm7Xcc...,Orders mostly came early and products are good.,5,0,2.60.11,2020-09-16 20:26:28,"""Thank you for giving Shopee a 5-star review! ...",2020-07-09 10:04:23,newest,shoppee,google
1,gp:AOqpTOECqGI7ocjdKrG5PykxhWmBG2wp1HzwO5zxrvj...,Ho Soh Fong,https://lh3.googleusercontent.com/a-/AOh14GhHM...,Good and convenient,4,0,2.60.08,2020-09-16 20:13:46,Thank you for your review. We're excited to be...,2020-09-16 20:17:31,newest,shoppee,google
2,gp:AOqpTOFK6om-GRJCgm-WXGyf_nurLs1YXL3FSoLRr5b...,Yasohthah Devadas,https://lh3.googleusercontent.com/-sji2OhurxhM...,Gd...........,5,0,2.60.11,2020-09-16 20:13:05,Thank you for giving Shopee a 5-star review! W...,2020-09-16 20:18:24,newest,shoppee,google
3,gp:AOqpTOHfstsd3G5jEoFQn62yZ9gpYcVj2oRlZZkKWBn...,May Han,https://play-lh.googleusercontent.com/-RhVs3Za...,My first purchase experience...Happy with purc...,4,0,2.60.11,2020-09-16 20:11:18,Thank you for your review. We're excited to be...,2020-09-16 20:28:10,newest,shoppee,google
4,gp:AOqpTOGtT_ODk0PZdHD0m_phZw4fFng1RvxZsM9Gk_v...,fauziah ata,https://lh3.googleusercontent.com/-NhTM2s673Pw...,A lot of items at a very good deal.,5,0,,2020-09-16 20:08:54,Thank you for giving Shopee a 5-star review! W...,2020-09-16 20:29:35,newest,shoppee,google


In [11]:
df_apple.columns

Index(['date', 'title', 'userName', 'rating', 'review', 'developerResponse',
       'isEdited', 'appid', 'store'],
      dtype='object')

In [12]:
df_apple.rename(columns = {'appid': 'app'}, inplace = True)
df_apple.head()
df_apple['app'].value_counts()

## Combining dataframe

In [None]:
# DataFrame.append was removed in pandas 2.0, so the two frames are stacked with pd.concat
cols = ['rating', 'date', 'app', 'store', 'review']
df = pd.concat([df_google[cols], df_apple[cols]], ignore_index = True)

In [None]:
df.head()

In [None]:
df.shape

### Comments
- More data was collected from Google Play Store than from Apple AppStore. This is not a problem, as the project looks at the app reviews directly, regardless of which platform they come from.

# Checking datetime Data

In [None]:
df['date'] = pd.to_datetime(df['date'])
df['date'].dt.year.value_counts()

In [None]:
df.info()

In [None]:
plt.figure(figsize = (10,10))
sns.boxplot(x = df['date'].dt.year, y = 'app', data = df);

In [None]:
plt.figure(figsize = (10,10))
sns.boxenplot(x = df['date'].dt.month, y = 'app', data = df[df['date'].dt.year == 2020]);

### Comments
More data was collected for 2020 than for the other years, possibly because the platforms remove older reviews once they reach a certain amount. Only 2020 reviews will be used for the modelling, as most of the earlier data is missing.

# Using only 2020 data

In [None]:
# .copy() so that columns added later do not trigger a SettingWithCopyWarning
df = df[df['date'].dt.year == 2020].copy()

In [None]:
# check the number of reviews per app
df['app'].value_counts()

In [None]:
df.columns

In [None]:
df['rating'].value_counts(normalize=True)

In [None]:
df['review'][:5]

# Cleaning of Text data for Modeling

In [None]:
# append a number and an alphanumeric token to a sample review to test the cleaning steps
text1 = df['review'][20].lower() + ' 123' + ' 7eleven'
text1

In [None]:
sp = spacy.load('en_core_web_sm')
emoji = Emoji(sp, merge_spans = False)
sp.add_pipe(emoji, first = True)

### Comments
Using spaCy and spacymoji to extract all the features (adjectives, nouns, verbs and emoji) that could help our models' predictive power.

In [None]:
sen = sp(text1)
for token in sen:
    print(token.text)

In [None]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

In [None]:
adj_list = ''
noun_list = ''
verb_list = ''
emoji_list = ''

for word in sen:
    if word._.is_emoji:
        emoji_list += str(word)
    else:
        if word.pos_ == 'ADJ':
            adj_list += str(word) + ' '
        
        elif word.pos_ == 'NOUN':
            noun_list += str(word) + ' '
        
        elif word.pos_ == 'VERB':
            verb_list += str(word) + ' '
        

In [None]:
print(adj_list)
print(noun_list)
print(verb_list)
print(emoji_list)

# Removing emojis and punctuation

In [None]:
# keep the lowercased lemma of every token that is not a stop word, emoji, punctuation mark or number
string = [str(word.lemma_).lower()
          for word in sen
          if not word.is_stop
          and not word._.is_emoji
          and word.pos_ not in ('PUNCT', 'NUM')]
string

### Comments
Removing stop words, emoji, punctuation and numbers from the raw data

# Removing non-English words

In [None]:
test_words = set(words.words())
en_words = [w for w in string if w.lower() in test_words or not w.isalpha()]
en_words

### Comments
Quite a few reviews are not in English, so words that are not in the NLTK English word list are removed. Tokens that are not purely alphabetic (e.g. `7eleven`) are kept.
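The behaviour of this filter can be checked on a toy example without downloading the NLTK corpus. The miniature vocabulary, the sample tokens and the helper name `keep_english` below are hypothetical stand-ins for `words.words()` and real review tokens.

```python
# Hypothetical miniature vocabulary standing in for nltk.corpus.words.
vocabulary = {"good", "app", "fast", "delivery", "bad", "service"}

def keep_english(tokens, vocab):
    """Keep tokens found in vocab, plus any token that is not purely alphabetic."""
    return [t for t in tokens if t.lower() in vocab or not t.isalpha()]

tokens = ["good", "bagus", "delivery", "7eleven", "cepat"]
print(keep_english(tokens, vocabulary))  # ['good', 'delivery', '7eleven']
```

Note the side effect of the `not t.isalpha()` clause: alphanumeric tokens such as `7eleven` pass through even though they are not dictionary words.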

# Lemmatize words

In [None]:
lemma = WordNetLemmatizer()
words_lemma = [lemma.lemmatize(i) for i in en_words]
words_lemma

# Get sentiment scores

In [None]:
analyser = SentimentIntensityAnalyzer()
print(df['review'][1])
analyser.polarity_scores(df['review'][1])

### Comments
Using VaderSentiment to get the sentiment scores of the reviews. These are mainly used in the EDA.
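Since some reviews carry a rating that contradicts their text, one possible EDA check is to flag reviews where the star rating and the compound score disagree. The sketch below is only an illustration: the rating bands (2 stars or fewer as bad, 4 or more as good), the 0.05 cut-offs, the function name and the sample pairs are all assumptions, not values taken from the project.

```python
def is_suspect(rating, compound, pos_cut=0.05, neg_cut=-0.05):
    """Flag a review whose star rating contradicts its VADER compound score."""
    if rating <= 2 and compound >= pos_cut:
        return True   # low rating but positive-sounding text
    if rating >= 4 and compound <= neg_cut:
        return True   # high rating but negative-sounding text
    return False

# Hypothetical (rating, compound_score) pairs.
samples = [(1, 0.57), (5, 0.62), (5, -0.48), (3, 0.0), (1, -0.7)]
flags = [is_suspect(rating, compound) for rating, compound in samples]
print(flags)  # [True, False, True, False, False]
```

Flagged rows could then be inspected by hand before deciding whether to relabel or drop them.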

In [None]:
sp = spacy.load('en_core_web_sm')
emoji = Emoji(sp, merge_spans = False)
sp.add_pipe(emoji, first = True)
analyser = SentimentIntensityAnalyzer()
tokenizer = RegexpTokenizer(r'\w+')
test_words = set(words.words())
lemma = WordNetLemmatizer()

def words_cleaning(text):
    """Return (cleaned text, adjectives, nouns, verbs, emoji, VADER scores) for one review."""
    
    # get sentiment scores from the raw text
    scores = analyser.polarity_scores(str(text))
    
    # use spaCy to tag each token
    sen = sp(str(text))
    
    # space-separated strings holding the words of each part of speech
    adj_list = ''
    noun_list = ''
    verb_list = ''
    emoji_list = ''
    
    for word in sen:
        if word._.is_emoji:
            emoji_list += str(word)
        elif word.pos_ == 'ADJ':
            adj_list += str(word.lemma_).lower() + ' '
        elif word.pos_ == 'NOUN':
            noun_list += str(word.lemma_).lower() + ' '
        elif word.pos_ == 'VERB':
            verb_list += str(word.lemma_).lower() + ' '
                
    # keep lemmas of tokens that are not stop words, emoji, punctuation or numbers
    string = [str(word.lemma_).lower()
              for word in sen
              if not word.is_stop
              and not word._.is_emoji
              and word.pos_ not in ('PUNCT', 'NUM')]
    
    # remove non-English words (tokens that are not purely alphabetic are kept)
    en_words = [w for w in string if w.lower() in test_words or not w.isalpha()]

    # lemmatize tokens with WordNet
    words_lemma = [lemma.lemmatize(i) for i in en_words]
    
    # join words back into a single string
    join_words = " ".join(words_lemma)

    return join_words, adj_list, noun_list, verb_list, emoji_list, scores

In [None]:
words_cleaning(df['review'][300])

In [None]:
words_cleaning(df['review'][300])[5]['neg']

# Applying the cleaning function to the text data

In [None]:
total_text = len(df['review'])
print(f'There is a total of {total_text} reviews.')

# instantiate empty lists to hold the cleaned data
clean_text = []
adj_text = []
noun_text = []
verb_list = []
emoji_icons = []
neg_scores = []
neu_scores = []
pos_scores = []
compound_scores = []

t0 = time()
print("Cleaning and parsing the review text...")

# For every review in the data set...
for j, text in enumerate(df['review'], start = 1):
    
    # named emoji_text so the spacymoji `emoji` component is not overwritten
    join, adj, noun, verb, emoji_text, scores = words_cleaning(text)
    
    # Append the cleaned text, part-of-speech strings, emoji and sentiment scores.
    clean_text.append(str(join).lower())
    adj_text.append(adj)
    noun_text.append(noun)
    verb_list.append(verb)
    emoji_icons.append(emoji_text)
    neg_scores.append(scores['neg'])
    neu_scores.append(scores['neu'])
    pos_scores.append(scores['pos'])
    compound_scores.append(scores['compound'])
    
    # Every 10,000 reviews, print progress and the elapsed time.
    if j % 10000 == 0:
        print(f'review {j} of {total_text}.')
        elapsed = time() - t0
        print('Elapsed time:  %0.3fs' % elapsed)

    
print(f'review {total_text} of {total_text}.')    
print('Cleaning complete')

df['clean_content'] = clean_text
df['adj'] = adj_text
df['noun'] = noun_text
df['verb'] = verb_list
df['emoji'] = emoji_icons
df['neg_score'] = neg_scores
df['neu_score'] = neu_scores
df['pos_score'] = pos_scores
df['compound_score'] = compound_scores

In [None]:
df.head()

In [None]:
df.to_csv('../data/cleaned_reviews.csv', index=False, header=True)