# Web scraping with `google_play_scraper`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**Web scraping**, **web harvesting**, or **web data extraction** is [data scraping](https://en.wikipedia.org/wiki/Data_scraping "Data scraping") used for [extracting data](https://en.wikipedia.org/wiki/Data_extraction "Data extraction") from [websites](https://en.wikipedia.org/wiki/Website "Website"). Web scraping software may directly access the [World Wide Web](https://en.wikipedia.org/wiki/World_Wide_Web "World Wide Web") using the [Hypertext Transfer Protocol](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol "Hypertext Transfer Protocol") or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a [bot](https://en.wikipedia.org/wiki/Internet_bot "Internet bot") or [web crawler](https://en.wikipedia.org/wiki/Web_crawler "Web crawler"). It is a form of copying in which specific data is gathered and copied from the web, typically into a central local [database](https://en.wikipedia.org/wiki/Database "Database") or spreadsheet, for later [retrieval](https://en.wikipedia.org/wiki/Data_retrieval "Data retrieval") or [analysis](https://en.wikipedia.org/wiki/Data_analysis "Data analysis") (and in this case, _training an ML model_).

- In this notebook, we use the `google_play_scraper` library to build and automate our _crawler_, which collects app details and user reviews from Google Play.

![scrappe](https://contraponto.digital/wp-content/uploads/2022/02/web-.jpg)


In [None]:
from cleantext import clean
from langdetect import detect
import pandas as pd
from google_play_scraper import Sort, reviews, app
from tqdm import tqdm
import string
import unidecode

# ----------------------------------------------------------------------------------------#
#
# Scrape apps
#
# ----------------------------------------------------------------------------------------#

apps_ids = [
    'br.com.brainweb.ifood',
    'com.cerveceriamodelo.modelonow',
    'com.mcdo.mcdonalds',
    'habibs.alphacode.com.br',
    'com.xiaojukeji.didi.brazil.customer',
    'com.ubercab.eats',
    'com.grability.rappi',
    'burgerking.com.br.appandroid',
    'com.instagram.android',
    'com.tinder',
    'com.facebook.katana',
    'com.google.android.youtube',
    'com.zhiliaoapp.musically',
    'com.ubercab',
    'com.twitter.android',
    'org.telegram.messenger',
]

# ----------------------------------------------------------------------------------------#
#
# Raw Data
#
# ----------------------------------------------------------------------------------------#

app_infos = []

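# Fetch the store listing (title, rating, installs, ...) for each app.
for ap in tqdm(apps_ids):
    info = app(ap, lang='pt', country='br')
    # Drop the embedded sample comments; pop() avoids a KeyError if the key is absent.
    info.pop('comments', None)
    app_infos.append(info)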

app_reviews = []

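# Fetch up to 1000 "most relevant" reviews per app and star rating.
# 3-star reviews are skipped (count=0 below), so only 1-2 and 4-5 star reviews are kept.
for ap in tqdm(apps_ids):
    for score in range(1, 6):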
        for sort_order in [Sort.MOST_RELEVANT]:
            rvs, _ = reviews(
                ap,
                lang='pt',
                country='br',
                sort=sort_order,
                count=0 if score == 3 else 1000,
                filter_score_with=score
            )
            for r in rvs:
                r['sortOrder'] = 'most_relevant'
                r['appId'] = ap
            app_reviews.extend(rvs)

data = pd.DataFrame(app_reviews)

# ----------------------------------------------------------------------------------------#
#
# Clean Data
# Here are a couple of simple preprocessing rules for working with scraped text.
#
# ----------------------------------------------------------------------------------------#


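from langdetect.lang_detect_exception import LangDetectException


def detect_lang(text):
    # Very short reviews are too ambiguous for reliable language detection.
    if len(text) <= 10:
        return 'NaN'
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g., only digits).
        return 'NaN'


# Column-wise assignment avoids chained indexing (data['content'][i] = ...),
# which pandas may apply to a temporary copy instead of the original frame.
data['content'] = data['content'].apply(lambda text: clean(text, no_emoji=True))
data['lang'] = data['content'].apply(detect_lang)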

# ----------------------------------------------------------------------------------------#
#
# Select only reviews in Portuguese (some other languages may have found their way in...)
#
# ----------------------------------------------------------------------------------------#

data = data[data['lang'] == 'pt']


data = data.drop(['reviewId', 'userName', 'userImage',
                  'thumbsUpCount', 'reviewCreatedVersion', 'at',
                  'replyContent', 'repliedAt', 'sortOrder'], axis=1)


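def to_target(rating):
    # Binary label: 1-2 stars -> 0 (negative), anything higher -> 1 (positive).
    return 0 if int(rating) <= 2 else 1


# Keep the original star rating as 'review' and store the binary label in 'score'.
data = data.rename(columns={'score': 'review'})
data['score'] = data['review'].apply(to_target)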


# Normalize the text: strip ASCII punctuation, lowercase, and transliterate accents.
punctuation_table = str.maketrans('', '', string.punctuation)
data['content'] = [
    unidecode.unidecode(review.translate(punctuation_table).lower())
    for review in data['content']
]

# Writing .xlsx files requires an Excel engine such as openpyxl to be installed.
data.to_excel('google_play_apps_review(pt).xlsx', index=False, header=True)

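The labeling rule in `to_target` can be checked on its own, without scraping anything. The sketch below reimplements the same rule and applies it to a handful of made-up ratings; the `sample_ratings` list is purely illustrative and is not scraped data.

```python
from collections import Counter


def to_target(rating):
    """Map a 1-5 star rating to a binary label: 0 for 1-2 stars, 1 otherwise."""
    return 0 if int(rating) <= 2 else 1


# Hypothetical ratings for illustration only; 3-star reviews are absent
# because the scraping loop requests none of them.
sample_ratings = [1, 2, 4, 5, 5, 1]

labels = [to_target(r) for r in sample_ratings]
class_counts = Counter(labels)

print(labels)
print(dict(class_counts))
```

Note that a 3-star rating would map to 1 under this rule, so leaving neutral reviews out at scraping time is what keeps them from being labeled as positive.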

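The final normalization step (strip punctuation, lowercase, remove accents) can also be sketched with the standard library alone. This version swaps `unidecode` for `unicodedata`, which is an assumption made for illustration: it only removes combining accent marks, whereas `unidecode` also transliterates many non-Latin characters. The `normalize_review` name and the sample sentence are hypothetical.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def normalize_review(text):
    """Strip ASCII punctuation, lowercase, and drop accent marks."""
    text = text.translate(PUNCTUATION_TABLE).lower()
    # NFKD splits each accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only the base characters, discarding the combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_review('Ótimo app, recomendo! Não trava.'))
```

For Portuguese reviews the two approaches should give the same result on accented Latin letters, but the outputs can differ on emoji remnants or non-Latin scripts, so this is a sketch rather than a drop-in replacement.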
