3.1 Getting review data by web scraping Google Play app pages
======

**This section demonstrates how to scrape a Google Play app page, particularly one with a large number of reviews such as WhatsApp. A customised reviews_all function is used below to save the retrieved data into separate CSV files, with 100,000 reviews per file in the example run. Be aware that the process may take a long time and use a lot of disk space, depending on the number of reviews for the requested Google Play app. For example, WhatsApp returns around 12 million reviews and 3 GB of data.**
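
The figures above can be turned into a rough estimate before starting a long run. The sketch below is back-of-the-envelope arithmetic only: `estimate_scrape` is a hypothetical helper, not part of google_play_scraper, and its defaults (199 reviews per request, a 100 ms pause between requests) are assumptions taken from the cells later in this section. It ignores network latency and the longer random pauses, so a real run takes longer than the pause time it reports.

```python
import math

def estimate_scrape(total_reviews, reviews_per_csv=100_000, batch_size=199, sleep_ms=100):
    """Rough estimate of CSV files, requests and minimum pause time for a scrape."""
    csv_files = math.ceil(total_reviews / reviews_per_csv)
    requests = math.ceil(total_reviews / batch_size)
    # Only the fixed pause between requests is counted here.
    pause_hours = requests * sleep_ms / 1000 / 3600
    return {
        "csv_files": csv_files,
        "requests": requests,
        "pause_hours": round(pause_hours, 2),
    }

# Assumed input: the approximate WhatsApp review count quoted above.
estimate = estimate_scrape(12_000_000)
print(estimate)
```

The file count is approximate, because a file is written only once a whole batch pushes the buffer past the row limit.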

_The following Python modules need to be installed to run this notebook (run the command without ! on the command line, or with ! in the notebook; for SWAN, see https://support.aarnet.edu.au/hc/en-us/articles/360000668076-How-do-I-add-code-libraries-to-my-Notebook-):_

In [None]:
# uncomment the next line to run the install
#!pip install google_play_scraper

**Import required libraries.**

In [1]:
from google_play_scraper import app
from google_play_scraper import Sort, reviews, reviews_all
import json
import pandas as pd
from time import sleep
import os
import random
import math


**Get app information from app id.**

In [2]:
result = app(
    'com.yummly.android',
    lang='en', # defaults to 'en'
    country='us' # defaults to 'us'
)
#print(result)
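
The dictionary returned by app() is large, so printing it whole is hard to read. A minimal sketch of pulling out a few fields follows. `summarise_app` is a hypothetical helper, the key names are assumed from the library's README, and `sample_info` is a made-up stand-in so the example runs without a network call. In the notebook, pass `result` instead.

```python
def summarise_app(info, keys=("title", "installs", "score", "ratings", "reviews")):
    """Return only the selected metadata fields; missing keys map to None."""
    return {key: info.get(key) for key in keys}

# Hypothetical stand-in for the dictionary returned by app(); the values are made up.
sample_info = {"title": "Example App", "score": 4.5, "ratings": 1000, "reviews": 250}

summary = summarise_app(sample_info)
for key, value in summary.items():
    print(f"{key}: {value}")
```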

**Test by fetching the 3 most relevant reviews. Read more at https://github.com/JoMingyu/google-play-scraper.**

In [3]:


result, continuation_token = reviews(
    'com.whatsapp', # same app ID as the follow-up call below, so the token stays valid
    lang='en', # defaults to 'en'
    country='us', # defaults to 'us'
    sort=Sort.MOST_RELEVANT, # defaults to Sort.MOST_RELEVANT
    count=3, # defaults to 100
    #filter_score_with=5 # defaults to None (means all scores)
)

# If you pass `continuation_token` to the reviews function at this point,
# it continues with the reviews that follow the first 3.

result, _ = reviews(
    'com.whatsapp',
    continuation_token=continuation_token # defaults to None (load from the beginning)
)

**Print and check result.**

In [4]:
print(len(result))
print(result)


3
[{'reviewId': 'gp:AOqpTOH3aQnZIPouGz_RGKngcb5mvpXYD9dVMiYjvfepoAYw_cVlKClpF1nOFAP3bwsu1LJK4B-xAFKYQCqN9DQ', 'userName': 'Joseph Antony', 'userImage': 'https://play-lh.googleusercontent.com/a-/AOh14GipzrgTfmJFesmFKrPF5St1f_DSaEMUHwP-apCE', 'content': "Watts app is waste app it doesn't support to send large video files. only 2 minutes 50 seconds can be send at a time. It is not worth for video call and voice call. Often the video is blurry and voice quality is too bad. It need more improvement. It doesn't allow to forward chats in large. When i hear voice message it is too much annoying me. Watts app is utilizing more battery and running in background. I don't want it. I don't want to give any stars.", 'score': 1, 'thumbsUpCount': 452, 'reviewCreatedVersion': '2.20.206.24', 'at': datetime.datetime(2021, 1, 9, 6, 58, 34), 'replyContent': None, 'repliedAt': None}, {'reviewId': 'gp:AOqpTOHhVYr5eMSffUnlrkZYjRA8IblE9FYFP_s9YmggNTJtTRGg1_05tS_8Nq_-t3bj7gvTxxWf0KoqAZNO04c', 'userName': 'ΜŘ ŘΔ
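
The raw list above is hard to scan. The following is a minimal sketch of a more readable check, using only the `score`, `content` and `at` keys visible in the output. `summarise_reviews` is a hypothetical helper and `sample_reviews` is made-up data with the same shape. In the notebook, pass `result` instead.

```python
from collections import Counter
from datetime import datetime

def summarise_reviews(review_list, preview_chars=60):
    """Count reviews per star score and build one short preview line per review."""
    score_counts = Counter(review["score"] for review in review_list)
    previews = [
        f"{review['at']:%Y-%m-%d} | {review['score']} stars | {review['content'][:preview_chars]}"
        for review in review_list
    ]
    return score_counts, previews

# Made-up reviews that mimic the structure returned by reviews().
sample_reviews = [
    {"score": 1, "content": "Crashes on start.", "at": datetime(2021, 1, 9, 6, 58, 34)},
    {"score": 5, "content": "Works well for me.", "at": datetime(2021, 1, 10, 12, 0, 0)},
    {"score": 1, "content": "Too many ads.", "at": datetime(2021, 1, 11, 8, 30, 0)},
]

score_counts, previews = summarise_reviews(sample_reviews)
print(dict(score_counts))
for line in previews:
    print(line)
```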

**Customise the reviews_all function to save reviews as CSV files. Read more at https://github.com/JoMingyu/google-play-scraper.**

In [5]:
# define the customised reviews_all function
def reviews_all_customise(app_id, sleep_milliseconds=0, max_csvrows=10000, folder_name="com.yummly.android", **kwargs):
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    _csvcount = 0
    _count = 199 # Google Play Store limit (up to 200 reviews can be fetched at a time)
    _continuation_token = None # start position of the next batch, updated after each batch of _count reviews
    result = []
    totalcount = 0
    _lastvalid_continuation_token = None # last token returned together with a non-empty result_; used to resume scraping after a sleep
    
    while True:
        try:
            result_, _continuation_token = reviews(
                app_id, count=_count, continuation_token=_continuation_token, **kwargs
            )
            if _continuation_token is not None and _continuation_token.token is not None and str(result_)!='[]':
                _lastvalid_continuation_token = _continuation_token
            #print total count and token
            totalcount+=len(result_)
            print("totalcount so far:"+str(totalcount))
            print("token:"+str(_continuation_token.token))
            result += result_
            # check if the return results reach the max rows per csv file, save the current results to csv file if it does and reset the result array 
            if(len(result)>=max_csvrows):
                app_reviews_df = pd.DataFrame(result)
                app_reviews_df.to_csv(folder_name+'/'+folder_name+'_reviews_'+str(_csvcount)+'.csv', index=None, header=True)
                #print last csv file number
                print("last csv file number:"+str(_csvcount))
                _csvcount = _csvcount+1
                result = []
                #sleep 0-1 random mins every csv file
                sleep(random.randint(0,60))
            # If there is no further continuation token, save the current results to a CSV file.
            # If the last result_ is also empty, the scraping may have been cut off: sleep a random
            # 30-45 mins before resuming. Otherwise, stop the scraping process.
            if _continuation_token.token is None:
                app_reviews_df = pd.DataFrame(result)
                app_reviews_df.to_csv(folder_name+'/'+folder_name+'_reviews_'+str(_csvcount)+'.csv', index=None, header=True)
                #print last csv file number
                print("last csv file number:"+str(_csvcount))
                #print warning re cut off
                print("returned continuation token is None; if the last result_ below is also empty, the scraping may have been cut off, unless totalcount so far is 0, which means the app has no reviews.")
                print("last result_ number:"+str(len(result_)))
                print("last result_ :"+str(result_))
                # if the last result_ is empty, sleep a random 30-45 mins before resuming the scrape
                if(str(result_)=='[]' and totalcount!=0):
                    _continuation_token = _lastvalid_continuation_token
                    print("sleep 30-45 random mins before resuming scraping using the last valid token: "+str(_lastvalid_continuation_token.token))
                    sleep(random.randint(30,45)*60)
                    continue
                else:
                    print("finished.")
                    break
            # sleep a random 0-0.3 seconds if sleep_milliseconds is "random", otherwise the fixed number of milliseconds
            if sleep_milliseconds=="random":
                sleep(round(random.uniform(0,0.3), 3))
            else:
                sleep(sleep_milliseconds / 1000)
            
            
            
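        # If an error occurs, print it and retry from the last valid token
        except Exception as e:
            print(e)
            _continuation_token = _lastvalid_continuation_token
            # _lastvalid_continuation_token is still None if the very first request failed
            last_token = getattr(_lastvalid_continuation_token, "token", None)
            print("sleep 5-15 mins before resuming scraping using the last valid token: "+str(last_token))
            sleep(random.randint(5,15)*60)
            continue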
            

    return result
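
Because the function writes numbered part files, a later step usually needs to read them back in order. The sketch below assumes the naming scheme used in the function above, `<folder_name>/<folder_name>_reviews_<n>.csv`. `list_review_parts` and `count_review_rows` are hypothetical helpers that use only the standard library. Sorting by the numeric suffix matters because a plain alphabetical sort would place part 10 before part 2.

```python
import csv
import glob
import os
import re

def list_review_parts(folder_name):
    """Return the part files of one app in numeric order (part 2 before part 10)."""
    pattern = os.path.join(folder_name, folder_name + "_reviews_*.csv")

    def part_number(path):
        match = re.search(r"_reviews_(\d+)\.csv$", path)
        return int(match.group(1)) if match else -1

    return sorted(glob.glob(pattern), key=part_number)

def count_review_rows(folder_name):
    """Count data rows across all part files, excluding each file's header row."""
    total = 0
    for path in list_review_parts(folder_name):
        # newline="" lets the csv module handle line breaks inside review text
        with open(path, newline="", encoding="utf-8") as handle:
            total += sum(1 for _ in csv.DictReader(handle))
    return total
```

Comparing `count_review_rows(folder_name)` with the last printed totalcount is a quick sanity check that no part file was lost or overwritten.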

You may see a larger number of stars and reviews on the app page than in the returned data. The library's author explains the reason:
"reviews and reviews_all function scrape reviews with content. People can post review with star rating only(without review contents). These reviews aggregated for total reviews in Google Play, but don't displayed to review list." https://github.com/JoMingyu/google-play-scraper/issues/65 

**Get all reviews and save them in CSV files. Remove the surrounding triple quotes to run the cell and fetch all the review data. Be aware that it may take a long time and use a lot of disk space, depending on the number of reviews for the requested Google Play app. For example, WhatsApp returns around 12 million reviews and 3 GB of data.**

In [7]:
'''app_name = "com.yummly.android"
#create app folder
path = app_name
try:
    os.mkdir(path)
except OSError:
    print ("Creation of the app folder %s failed" % path)
    raise
else:
    print ("Successfully created the app folder %s " % path)


result = reviews_all_customise(
    app_name,
    sleep_milliseconds=100, # defaults to 0, 
    max_csvrows=100000, # maximum rows in each csv file
    folder_name=app_name,
    lang='en', # defaults to 'en'
    country='us', # defaults to 'us'
    sort=Sort.NEWEST, # defaults to Sort.MOST_RELEVANT
    #filter_score_with=5 # defaults to None(means all score)
)
'''

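
The cell above uses os.mkdir, which raises an error when the folder already exists, so re-running it after an interrupted scrape stops at the folder step. A minimal sketch of a more forgiving variant follows. `ensure_app_folder` is a hypothetical helper, not part of the original notebook. Note that a second run of reviews_all_customise starts again at file number 0 and would overwrite earlier part files, so the return value can be used to decide whether to move old files away first.

```python
import os

def ensure_app_folder(path):
    """Create the folder if needed; return True if it was newly created."""
    already_there = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)
    return not already_there
```

For example, `ensure_app_folder("com.yummly.android")` returns True on the first call and False on later calls, without raising.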