<a href="https://colab.research.google.com/github/Daisyzhao21/Data-Ingestion-System/blob/main/GooglePlayStoreReviews_ChatGPT_Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Ingestion System


#### 1. Data Collection:
Use the google_play_scraper package to collect reviews for the ChatGPT app.

#### 2. EDA:
Perform exploratory and statistical analysis on the sample dataset.

#### 3. Evaluation:
- Comment quality and richness (length, depth, relevance)
- Language consistency (ideally mostly English)
- Supporting metadata availability (timestamps, ratings, IDs, etc.)
- Rating skewness (balanced vs. heavily biased distributions)
- Review volume and update speed (is it refreshed frequently enough?)
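
As a rough illustration of how some of these criteria could be quantified, the sketch below computes text length, a crude language proxy, metadata completeness, and review frequency. It is not the notebook's own evaluation code. The function name `summarize_review_quality` is hypothetical, the column names `content` and `at` are assumptions about the scraped data, the all-ASCII check is only a stand-in for real language detection, and the toy rows and dates are made up. Rating skew is left out here because it depends on the collected rating counts.

```python
import pandas as pd


def summarize_review_quality(df: pd.DataFrame) -> dict:
    """Rough quality indicators for a review DataFrame.

    Assumes a text column `content` and a timestamp column `at`.
    """
    text = df["content"].fillna("").astype(str)
    words = text.str.split().str.len()

    # Crude language proxy: share of reviews made up solely of ASCII characters.
    # (Empty or missing reviews count as ASCII.)
    ascii_share = text.map(lambda s: s.isascii()).mean()

    # Metadata completeness: fraction of missing values per column.
    missing = df.isna().mean().to_dict()

    # Update speed: average reviews per day across the observed time span,
    # with the span floored at one day to avoid division by zero.
    timestamps = pd.to_datetime(df["at"])
    span_days = max((timestamps.max() - timestamps.min()).total_seconds() / 86400, 1.0)

    return {
        "median_words": float(words.median()),
        "share_under_3_words": float((words < 3).mean()),
        "ascii_share": float(ascii_share),
        "missing_by_column": missing,
        "reviews_per_day": len(df) / span_days,
    }


# Toy data for illustration only.
sample = pd.DataFrame({
    "content": ["good", "It's Amazing 🤩🤩", "Just pathetic. had to ask for one thing 10 times", None],
    "at": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-05"]),
})
summary = summarize_review_quality(sample)
print(summary)
```

The same function could be called on the full scraped DataFrame once it exists, provided the column names match.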

#### Install the Required Packages

In [1]:
pip install google-play-scraper

Collecting google-play-scraper
  Downloading google_play_scraper-1.2.7-py3-none-any.whl.metadata (50 kB)
Downloading google_play_scraper-1.2.7-py3-none-any.whl (28 kB)
Installing collected packages: google-play-scraper
Successfully installed google-play-scraper-1.2.7


In [2]:
pip install matplotlib seaborn



In [3]:
pip install tqdm # used to display a progress bar



In [4]:
# app fetches basic app information, reviews fetches user reviews,
# and Sort is an enum that selects the review sort order
from google_play_scraper import app, reviews, Sort
import pandas as pd
import time
from tqdm import tqdm  # progress bar

In [5]:
# Package name of the official ChatGPT app
app_package = "com.openai.chatgpt"

# Get app information first
app_info = app(
    app_package,
    lang='en',  # language
    country='us'  # country
)

print(f"App Name: {app_info['title']}")
print(f"Installs: {app_info['installs']}")
print(f"Score: {app_info['score']}")
print(f"Number of Reviews: {app_info['reviews']}\n")



App Name: ChatGPT
Installs: 500,000,000+
Score: 4.7592316
Number of Reviews: 110340



In [6]:
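# Scraping parameters
desired_count = 20000  # total number of reviews to collect
batch_size = 100  # reviews requested per call; 100 is treated here as the per-request ceiling
sleep_interval = 2  # pause between requests (seconds) to avoid sending them too frequently

reviews_data = []  # accumulates every review fetched
continuation_token = None  # token used to request the next page of reviews
error_count = 0
max_errors = 5  # stop once this many errors have occurred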


In [7]:
# Wrap the loop in a tqdm progress bar (optional)
with tqdm(total=desired_count, desc="Scraping reviews") as pbar:
    while len(reviews_data) < desired_count and error_count < max_errors:
        try:
            # Fetch one batch of reviews
            batch_result, continuation_token = reviews(
                app_package,
                lang='en',
                country='us',
                sort=Sort.NEWEST,  # newest first, so the most recent reviews are collected
                count=batch_size,
                continuation_token=continuation_token  # used for pagination
            )

            # Append this batch to the full list
            reviews_data.extend(batch_result)

            # Advance the progress bar
            pbar.update(len(batch_result))

            # Stop when there is no continuation token or the batch came back empty
            if continuation_token is None or not batch_result:
                print("No more reviews available.")
                break

            # Pause between requests to reduce load on the server
            time.sleep(sleep_interval)

        except Exception as e:
            error_count += 1
            print(f"Error while fetching reviews (error #{error_count}): {e}")
            time.sleep(5)  # wait longer after an error

# Check whether any data was retrieved
if not reviews_data:
    print("No review data was retrieved. Check the app package name and network connection, or try again later.")
else:
    print(f"Successfully retrieved {len(reviews_data)} reviews.")

    # Convert to a pandas DataFrame
    reviews_df = pd.DataFrame(reviews_data)


Scraping reviews: 100%|██████████| 20000/20000 [08:07<00:00, 41.02it/s]

Successfully retrieved 20000 reviews.
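
The pagination pattern used in the loop above (fetch a batch, keep the returned token, stop on an empty batch or after too many errors) can be separated from the network call so that it is testable offline. The following is a minimal sketch under stated assumptions, not part of the original notebook: `collect_paginated`, `FetchPage`, and `fake_fetch` are hypothetical names, and the fake fetcher merely imitates a paged API with integer tokens.

```python
import time
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical signature: takes the previous token, returns (batch, next_token).
FetchPage = Callable[[Optional[Any]], Tuple[List[dict], Optional[Any]]]


def collect_paginated(
    fetch_page: FetchPage,
    desired_count: int,
    max_errors: int = 5,
    sleep_interval: float = 0.0,
) -> List[dict]:
    """Collect up to `desired_count` items, stopping early on an empty batch,
    a missing token, or once `max_errors` failures have accumulated."""
    items: List[dict] = []
    token: Optional[Any] = None
    errors = 0
    while len(items) < desired_count and errors < max_errors:
        try:
            batch, token = fetch_page(token)
        except Exception as exc:
            errors += 1
            print(f"fetch failed (error #{errors}): {exc}")
            continue
        items.extend(batch)
        if token is None or not batch:
            break
        time.sleep(sleep_interval)
    return items[:desired_count]  # trim any overshoot from the last batch


def fake_fetch(token):
    """Stand-in for a real API: serves 250 fake reviews in pages of 100."""
    start = token or 0
    batch = [{"reviewId": i} for i in range(start, min(start + 100, 250))]
    next_token = start + 100 if start + 100 < 250 else None
    return batch, next_token


print(len(collect_paginated(fake_fetch, desired_count=150)))
print(len(collect_paginated(fake_fetch, desired_count=1000)))
```

With the real scraper, `fetch_page` could be a small lambda that forwards the token to `reviews(...)` with the same arguments used in the cell above.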





In [8]:
# Display the first few reviews
print("\nSample reviews:")
for i, (_, review) in enumerate(reviews_df.head(10).iterrows()):
    print(f"{i+1}. Rating: {review['score']} - {review['content']}")



Sample reviews:
1. Rating: 4 - good
2. Rating: 1 - Just pathetic. had to ask for one thing 10 times and still didn't get the results! Deleted immediately after downloading it.
3. Rating: 5 - extremely helpful
4. Rating: 4 - thiss good
5. Rating: 5 - It's Amazing 🤩🤩
6. Rating: 4 - Best app
7. Rating: 5 - love
8. Rating: 5 - exceptionally good
9. Rating: 5 - best happy
10. Rating: 5 - great and helpful


In [9]:
# Display basic information
print(f"Successfully collected {len(reviews_df)} reviews")
print("\nRating distribution:")
print(reviews_df['score'].value_counts().sort_index())

Successfully collected 20000 reviews

Rating distribution:
score
1     1575
2      419
3      894
4     1928
5    15184
Name: count, dtype: int64
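
To put a number on the rating skew, the counts printed above can be turned into shares, a mean rating, and a simple imbalance ratio. This is a small sketch rather than part of the original analysis. The counts are copied from the output above, and the choice of "largest class divided by smallest class" as an imbalance measure is an assumption made for illustration.

```python
import pandas as pd

# Counts copied from the rating distribution printed above.
counts = pd.Series({1: 1575, 2: 419, 3: 894, 4: 1928, 5: 15184}, name="count")

# Fraction of reviews per star rating.
shares = counts / counts.sum()

# Count-weighted average of the star ratings.
mean_rating = (counts.index.to_numpy() * counts.to_numpy()).sum() / counts.sum()

# Largest class relative to the smallest class.
imbalance = counts.max() / counts.min()

print(shares.round(4))
print(f"Mean rating: {mean_rating:.3f}")
print(f"Imbalance ratio: {imbalance:.1f}")
```

Roughly three quarters of the sampled reviews are 5-star, so any model or summary built on this sample would likely need rebalancing or stratified sampling.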


In [10]:
# Save to CSV
reviews_df.to_csv('chatgpt_reviews.csv', index=False)
print("\nReviews saved to 'chatgpt_reviews.csv'")


Reviews saved to 'chatgpt_reviews.csv'
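
One thing to watch when reloading the CSV later: datetime columns are written as text, so they come back as strings unless parsing is requested. The sketch below demonstrates this on a tiny stand-in DataFrame written to a temporary file. The column names (`reviewId`, `content`, `score`, `at`) are assumptions about the scraped data, and the rows are made up.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for reviews_df; column names and values are illustrative.
original = pd.DataFrame({
    "reviewId": ["a1", "b2"],
    "content": ["good", "extremely helpful"],
    "score": [4, 5],
    "at": pd.to_datetime(["2025-01-01 10:00:00", "2025-01-02 12:30:00"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reviews_sample.csv")
    original.to_csv(path, index=False)

    naive = pd.read_csv(path)                       # `at` comes back as plain strings
    parsed = pd.read_csv(path, parse_dates=["at"])  # `at` restored as a datetime column

print(naive["at"].dtype, parsed["at"].dtype)
```

For the real file, the same `parse_dates` argument could be passed when reading `chatgpt_reviews.csv`, listing whichever timestamp columns it contains.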
