1.2 Getting data via Web scraping
------

### 1.2.2 Scraping a Google Play app page

**This section demonstrates how to scrape a Google Play app page, in particular one with a large number of reviews such as WhatsApp. A customised version of the reviews_all function is used below to save the retrieved data into separate CSV files with 100000 reviews per file. Be mindful that the process may take a long time and use a lot of disk space, depending on the number of reviews for the requested app. For example, WhatsApp returns around 7 million reviews and 2 GB of data.**
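To get a feel for the scale, the figures quoted above can be turned into a rough estimate. This is only a back-of-the-envelope sketch: the batch size of 199 reviews per request and the 100 ms pause are the values used in the code of this section, and the time shown counts only the pauses, not the network time of each request.

```python
import math

# Approximate figures quoted in this section
total_reviews = 7_000_000   # reviews returned for WhatsApp
total_bytes = 2 * 10**9     # about 2 GB of CSV data
max_csvrows = 100_000       # reviews per CSV file

# Assumed request settings (the values used in this section's code)
batch_size = 199            # reviews requested per call
sleep_milliseconds = 100    # pause between calls

csv_files = math.ceil(total_reviews / max_csvrows)
requests_needed = math.ceil(total_reviews / batch_size)
sleep_minutes = requests_needed * sleep_milliseconds / 1000 / 60
bytes_per_review = total_bytes / total_reviews

print('CSV files:', csv_files)
print('requests:', requests_needed)
print('minutes spent sleeping:', round(sleep_minutes, 1))
print('average bytes per review:', round(bytes_per_review))
```

The pauses alone add up to close to an hour, so the full run is best started when the machine can be left unattended.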

**Import required libraries.**

In [3]:
from google_play_scraper import app
from google_play_scraper import Sort, reviews, reviews_all
import json
import pandas as pd
from time import sleep


**Get app information from app id.**

In [4]:
result = app(
    'com.whatsapp',
    lang='en', # defaults to 'en'
    country='us' # defaults to 'us'
)
#print(result)

**Test the scraper by fetching the 3 most relevant reviews, then continue from where that request stopped. Read more at https://github.com/JoMingyu/google-play-scraper.**

In [5]:


result, continuation_token = reviews(
    'com.whatsapp',
    lang='en', # defaults to 'en'
    country='us', # defaults to 'us'
    sort=Sort.MOST_RELEVANT, # defaults to Sort.MOST_RELEVANT
    count=3, # defaults to 100
    #filter_score_with=5 # defaults to None(means all score)
)

# Passing `continuation_token` to the next call of reviews() resumes the crawl
# from the item after the 3 reviews fetched above. Note that `result` is overwritten.

result, _ = reviews(
    'com.whatsapp',
    continuation_token=continuation_token # defaults to None(load from the beginning)
)

**Print and check result.**

In [8]:
print(len(result))
print(result)


[{'reviewId': 'gp:AOqpTOHHgOCMQc40y-SvzC0ywAIHRlAZs5vi2UhzMD8fU8vXp8RPxUtz_x3WWrnWA7iZXYrUeXiohG-sd0RD-CU', 'userName': 'mathewsyriac syriac', 'userImage': 'https://play-lh.googleusercontent.com/-xGfXIuvKHxQ/AAAAAAAAAAI/AAAAAAAAAAA/AMZuuclG5F_WVk2713JFh9MadPON5fEvxA/photo.jpg', 'content': 'Whatsapp has been my all time favorite. Adfree and convenient chatting and calling is something most of the apps lack. I really appreciate the team for their hard work. But the only problem I face while using Whatsapp is that all the backup files, voice notes, images and documents are all saved in the internal storage which decreases all the space left. I hope the Whatsapp team would bring an feature to store all multimedia files in an external storage such as an SD card.', 'score': 4, 'thumbsUpCount': 2, 'reviewCreatedVersion': '2.20.202.21', 'at': datetime.datetime(2020, 11, 9, 1, 3, 7), 'replyContent': None, 'repliedAt': None}, {'reviewId': 'gp:AOqpTOH0xoNRDw2EMfBWXAXI5ACmGc_hy22a6ULXnfZQ_3dZRkK3C

**Customise the reviews_all function to save reviews as CSV files. Read more at https://github.com/JoMingyu/google-play-scraper.**

In [32]:
# Define the customised reviews_all function
def reviews_all_customise(app_id, sleep_milliseconds=0, max_csvrows=100000, **kwargs):
    # These two arguments are managed by the loop below, so drop any caller-supplied values
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    _csvcount = 0  # index of the next CSV file to write
    _count = 199  # batch size per request (Google Play Store limit: up to 200 reviews can be fetched at a time)
    _continuation_token = None  # where the next batch starts; None means start from the beginning
    result = []

    while True:
        result_, _continuation_token = reviews(
            app_id, count=_count, continuation_token=_continuation_token, **kwargs
        )

        result += result_
        # Once the buffer reaches the maximum rows per CSV file, write it out and reset the buffer.
        # A file can hold slightly more than max_csvrows because reviews arrive in batches.
        if len(result) >= max_csvrows:
            app_reviews_df = pd.DataFrame(result)
            app_reviews_df.to_csv('reviews_' + str(_csvcount) + '.csv', index=False, header=True)
            _csvcount += 1
            result = []
        # No continuation token means there is nothing left to fetch:
        # save any remaining reviews and stop scraping.
        if _continuation_token.token is None:
            if result:  # avoid writing an empty file right after a flush
                app_reviews_df = pd.DataFrame(result)
                app_reviews_df.to_csv('reviews_' + str(_csvcount) + '.csv', index=False, header=True)
            break

        if sleep_milliseconds:
            sleep(sleep_milliseconds / 1000)

    # Only the reviews of the last (partial) chunk are returned; earlier chunks are on disk
    return result

The app page may show a larger number of ratings and reviews than the scraper returns. The library's issue tracker explains why:
"reviews and reviews_all function scrape reviews with content. People can post review with star rating only(without review contents). These reviews aggregated for total reviews in Google Play, but don't displayed to review list." https://github.com/JoMingyu/google-play-scraper/issues/65 

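One way to see this gap is to put the totals from the app page next to the number of scraped reviews. The sketch below assumes that the dictionary returned by `app()` exposes `ratings` and `reviews` keys, which may differ between library versions; `count_summary` is a hypothetical helper and the sample numbers are made up for illustration.

```python
def count_summary(app_info, scraped_reviews):
    """Put the app page totals next to the number of scraped reviews."""
    return {
        # .get() returns None if the assumed key is missing in this library version
        'ratings_on_page': app_info.get('ratings'),
        'reviews_on_page': app_info.get('reviews'),
        'reviews_scraped': len(scraped_reviews),
    }


# Made-up data; in practice pass the dict returned by app('com.whatsapp')
# and the list returned by reviews() or read back from the CSV files.
sample_info = {'ratings': 1000, 'reviews': 400}
sample_reviews = [{'content': 'Great app', 'score': 5}] * 350

print(count_summary(sample_info, sample_reviews))
```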
**Get all reviews and save them in CSV files. The call below is disabled by wrapping it in triple quotes; remove the quotes to fetch all the review data. Be mindful that it may take a long time and use a lot of disk space, depending on the number of reviews for the requested Google Play app. For example, WhatsApp returns around 7 million reviews and 2 GB of data.**

In [6]:
'''result = reviews_all_customise(
    'com.whatsapp',
    sleep_milliseconds=100, # defaults to 0
    max_csvrows=100000, # maximum rows in each csv file
    lang='en', # defaults to 'en'
    country='us', # defaults to 'us'
    sort=Sort.MOST_RELEVANT, # defaults to Sort.MOST_RELEVANT
    #filter_score_with=5 # defaults to None(means all score)
)'''