# Web scraping with `google_play_scraper`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**`Web scraping` is the process of extracting data from websites using automated software tools. It is a method of collecting information from the internet that would otherwise require manual copying and pasting. `Web scraping` is used by businesses, researchers, and individuals to gather data for a variety of purposes, including market research, price monitoring, content aggregation, and social media analysis.**

**With the right tools and techniques, `web scraping` can be an efficient and effective way to collect large amounts of data from the web. However, it is important to be aware of the legal and ethical considerations involved in web scraping and to ensure that data is collected responsibly and transparently. Always respect the rules established by the `robots.txt` file of the website you wish to crawl.**
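
**As a minimal sketch of what respecting `robots.txt` can look like in practice, the snippet below uses the standard-library `urllib.robotparser` to check whether a URL may be fetched. The rules, the `my-crawler` user agent, and the `example.com` URLs are hypothetical placeholders, chosen so the example runs offline.**

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/public/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data"))
```

**Against a live site, you would instead point the parser at the real file with `parser.set_url(...)` followed by `parser.read()` before calling `can_fetch`.**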

![scrappe](https://contraponto.digital/wp-content/uploads/2022/02/web-.jpg)

**`Python` has several libraries and frameworks, such as `Beautiful Soup` and `Scrapy`, that make web scraping easier and let developers access, extract, and manipulate web data in a structured way. In [this notebook](xxx), we use `Beautiful Soup` to create a dataset to train a model for language generation.**

**In this notebook, however, we use the `google_play_scraper` library to build and automate our crawler. First, let us create a list of apps to scrape from `Google Play`.**

In [1]:
apps_ids = [
    'com.mcdo.mcdonalds',
    'com.ubercab.eats',
    'com.instagram.android',
    'com.tinder',
    'com.facebook.katana',
    'com.google.android.youtube',
    'com.ubercab',
    'com.twitter.android',
]

**Second, we collect some information/metadata about the apps we want to scrape.**

**If you want to scrape reviews in Portuguese, you can try the following apps:**

```python
apps_ids = [
    'br.com.brainweb.ifood',
    'com.cerveceriamodelo.modelonow',
    'com.mcdo.mcdonalds',
    'habibs.alphacode.com.br',
    'com.xiaojukeji.didi.brazil.customer',
    'com.ubercab.eats',
    'com.grability.rappi',
    'burgerking.com.br.appandroid',
    'com.instagram.android',
    'com.tinder',
    'com.facebook.katana',
    'com.google.android.youtube',
    'com.zhiliaoapp.musically',
    'com.ubercab',
    'com.twitter.android',
    'org.telegram.messenger',
]
```

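In [7]:
from google_play_scraper import Sort, reviews, app

app_infos = []

for ap in apps_ids:
    # Fetch the store metadata for each app (English listing, US store).
    info = app(ap, lang='en', country='us')
    # Drop the bundled sample comments, if present; we only need the metadata here.
    info.pop('comments', None)
    app_infos.append(info)

# Print the title of every app we collected metadata for.
for info in app_infos:
    print(info['title'])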

McDonald's Offers and Delivery
Uber Eats: Food Delivery
Instagram
Tinder: Dating app. Meet. Chat
Facebook
YouTube
Uber - Request a ride
Twitter
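
**Each metadata record is a plain dictionary, so it is easy to keep only the fields you care about. The sketch below assumes the keys `appId`, `title`, `score`, and `installs` are present (missing keys simply become `None`); the two sample records are invented placeholders, not real scraper output.**

```python
# Placeholder records standing in for the dictionaries returned by `app()`.
# The values are invented for illustration; only the key names matter here.
sample_infos = [
    {"appId": "com.example.one", "title": "Example One", "score": 4.2,
     "installs": "1,000+", "description": "..."},
    {"appId": "com.example.two", "title": "Example Two", "score": 3.7,
     "installs": "500+", "description": "..."},
]

# Assumed field names; adjust to whatever your records actually contain.
FIELDS = ("appId", "title", "score", "installs")


def select_fields(info, fields=FIELDS):
    """Keep only the requested keys, using None for any that are missing."""
    return {key: info.get(key) for key in fields}


summary = [select_fields(info) for info in sample_infos]
for row in summary:
    print(row)
```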


**Next, we set the targets for our crawler. We want only English reviews (`lang='en'`) from the US store (`country='us'`), prioritizing the most relevant ones (`Sort.MOST_RELEVANT`), for every score from 1 to 5 stars (`for score in list(range(1, 6)):`). We skip 3-star reviews and request up to one thousand reviews for each of the other scores (`count=0 if score == 3 else 1000`).**

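In [9]:
app_reviews = []
# Use `app_id` as the loop variable so the imported `app` function is not shadowed.
for app_id in apps_ids:
    print(f'Scraping {app_id}...')
    for score in list(range(1, 6)):
        for sort_order in [Sort.MOST_RELEVANT]:
            # `reviews` returns the reviews plus a continuation token (unused here).
            rvs, _ = reviews(
                app_id,
                lang='en',
                country='us',
                sort=sort_order,
                count=0 if score == 3 else 1000,
                filter_score_with=score
            )
            for r in rvs:
                # Tag each review with its sort order and the app it belongs to.
                r['sortOrder'] = 'most_relevant'
                r['appId'] = app_id
            app_reviews.extend(rvs)
print('Got your data!')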

Scraping com.mcdo.mcdonalds...
Scraping com.ubercab.eats...
Scraping com.instagram.android...
Scraping com.tinder...
Scraping com.facebook.katana...
Scraping com.google.android.youtube...
Scraping com.ubercab...
Scraping com.twitter.android...
Got your data!
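
**With a single sort order and one request per score, duplicate reviews should be rare, but they can appear if you later add more sort orders or rerun the crawler. The helper below is a small sketch of de-duplicating by the `reviewId` field; `deduplicate_reviews` is a hypothetical name and the sample records are made up.**

```python
def deduplicate_reviews(reviews_list, key="reviewId"):
    """Return reviews with unique `key` values, keeping the first occurrence.

    Records that lack the key are always kept.
    """
    seen = set()
    unique = []
    for review in reviews_list:
        identifier = review.get(key)
        if identifier is not None and identifier in seen:
            continue  # already collected this review
        seen.add(identifier)
        unique.append(review)
    return unique


# Made-up records, for illustration only.
sample = [
    {"reviewId": "a1", "content": "Great app", "score": 5},
    {"reviewId": "b2", "content": "Keeps crashing", "score": 1},
    {"reviewId": "a1", "content": "Great app", "score": 5},
]

unique = deduplicate_reviews(sample)
print(len(sample), "->", len(unique))
```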


**Let us now turn our data into a `DataFrame`.**

In [20]:
import pandas as pd

data = pd.DataFrame(app_reviews)

display(data.content)

0        Useless, when I choose my country it takes me ...
1        This app is horrible it constantly freezes up ...
2        After the last update, the name of the app eve...
3        Really slow, crashed my brand new phone. Got s...
4        Worst service ever. They have so many issues w...
                               ...                        
30413    Suddenly is just so much better. Great app to ...
30414    The best app to get all latest trending topics...
30415    Love it , especially since musk took over. You...
30416    It's a nice app and improved since the takeove...
30417    My new favourite social. Feed much more taylor...
Name: content, Length: 30418, dtype: object

**Let us clean our text with a `custom_standardization` function (lowercase everything, replace `<br />` tags, remove symbols, collapse repeated spaces, remove accents, etc.).**

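In [21]:
import re
from unidecode import unidecode

def custom_standardization(input_data):
    # Lowercase the text and replace HTML line breaks with spaces.
    clean_text = input_data.lower().replace("<br />", " ")
    # Replace punctuation and other symbols with spaces.
    clean_text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", ' ', clean_text)
    # Collapse runs of spaces into a single space.
    clean_text = re.sub(' +', ' ', clean_text)
    # Transliterate accented characters to their closest ASCII form.
    return unidecode(clean_text)

data.content = data.content.apply(custom_standardization)
display(data.content)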

0        useless when i choose my country it takes me t...
1        this app is horrible it constantly freezes up ...
2        after the last update the name of the app even...
3        really slow crashed my brand new phone got stu...
4        worst service ever they have so many issues wi...
                               ...                        
30413    suddenly is just so much better great app to c...
30414    the best app to get all latest trending topics...
30415    love it especially since musk took over you ca...
30416    it's a nice app and improved since the takeove...
30417    my new favourite social feed much more taylore...
Name: content, Length: 30418, dtype: object
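
**If you would rather avoid the `unidecode` dependency, or want to strip any HTML tag instead of only `<br />`, the following is one possible standard-library variant. It is a sketch, not a drop-in replacement: `standardize_text` is a hypothetical name, and `unicodedata` only removes diacritics, whereas `unidecode` also transliterates non-Latin scripts.**

```python
import re
import unicodedata


def standardize_text(text: str) -> str:
    """Lowercase, drop HTML tags and symbols, strip accents, collapse spaces."""
    text = text.lower()
    # Remove any HTML tag, not only <br />.
    text = re.sub(r"<[^>]+>", " ", text)
    # Same symbol set as in `custom_standardization`.
    text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", text)
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(standardize_text("Ótimo app!<br />Entrega <b>rápida</b>,  recomendo."))
# otimo app! entrega rapida recomendo
```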

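**Now, let us keep only the review text and a binary sentiment label (1 or 2 stars become 0, anything higher becomes 1) and save the result to a CSV file for later use. This dataset could be used, for example, to train an ML model for sentiment analysis.**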

In [22]:
import os

# Keep only the review text and its star rating; every other column is discarded.
data = data[['content', 'score']].copy()

def to_sentiment(rating):
    # 1-2 stars -> 0 (negative); anything higher -> 1 (positive).
    rating = int(rating)
    return 0 if rating <= 2 else 1

data.score = data.score.apply(to_sentiment)
data.columns = ['review', 'sentiment']

# Make sure the output folder exists before writing the CSV.
os.makedirs('data', exist_ok=True)
data.to_csv('data/google_play_apps_review_en.csv', index=False)
display(data)

| | review | sentiment |
|---|---|---|
| 0 | useless when i choose my country it takes me t... | 0 |
| 1 | this app is horrible it constantly freezes up ... | 0 |
| 2 | after the last update the name of the app even... | 0 |
| 3 | really slow crashed my brand new phone got stu... | 0 |
| 4 | worst service ever they have so many issues wi... | 0 |
| ... | ... | ... |
| 30413 | suddenly is just so much better great app to c... | 1 |
| 30414 | the best app to get all latest trending topics... | 1 |
| 30415 | love it especially since musk took over you ca... | 1 |
| 30416 | it's a nice app and improved since the takeove... | 1 |
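
**Note that `to_sentiment` maps a 3-star rating to 1. That is harmless here because the crawler skips 3-star reviews, but if you decide to collect them you may want to treat them as neutral. The sketch below shows one way to do that and to check the class balance; `to_sentiment_label` is a hypothetical variant and the ratings are invented for illustration.**

```python
from collections import Counter


def to_sentiment_label(rating):
    """Map a 1-5 star rating to 0 (negative), 1 (positive), or None (neutral)."""
    rating = int(rating)
    if rating <= 2:
        return 0
    if rating >= 4:
        return 1
    return None  # 3 stars: treated as neutral and left unlabeled


# Invented ratings, for illustration only.
ratings = [1, 2, 2, 4, 5, 5, 5, 3]
labels = [to_sentiment_label(r) for r in ratings]

# Count how many examples fall into each labeled class, ignoring neutral ones.
counts = Counter(label for label in labels if label is not None)
print(counts)
```

**A strongly unbalanced count is worth knowing about before training, since it affects how you should read accuracy scores.**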


**Congratulations, you scraped your own sentiment dataset. 🙃**

