# Mobile App Data Science and Analysis Project
(Rough Draft Phase)
By: Angela Cao

## Overview
This synthetic dataset contains 2,514 mobile app reviews spanning 40+ popular applications in 24 languages, which makes it suitable for multilingual NLP, sentiment analysis, and cross-cultural user behavior research.

### Language Distribution
The dataset includes reviews in 24 languages:
- **European**: English (en), Spanish (es), French (fr), German (de), Italian (it), Russian (ru), Polish (pl), Dutch (nl), Swedish (sv), Danish (da), Norwegian (no), Finnish (fi)
- **Asian**: Chinese (zh), Hindi (hi), Japanese (ja), Korean (ko), Thai (th), Vietnamese (vi), Indonesian (id), Malay (ms)
- **Other**: Arabic (ar), Turkish (tr), Filipino (tl)

### Application Categories
Reviews cover 18 distinct categories:
- Social Networking
- Entertainment
- Productivity
- Travel & Local
- Music & Audio
- Video Players & Editors
- Shopping
- Navigation
- Finance
- Communication
- Education
- Photography
- Dating
- Business
- Utilities
- Health & Fitness
- Games
- News & Magazines

### Popular Apps Included
40+ applications including:
- **Social**: WhatsApp, Instagram, Facebook, Snapchat, TikTok, LinkedIn, Twitter, Reddit, Pinterest
- **Entertainment**: YouTube, Netflix, Spotify
- **Productivity**: Microsoft Office, Google Drive, Dropbox, OneDrive, Zoom, Discord
- **Travel**: Uber, Lyft, Airbnb, Booking.com, Google Maps, Waze
- **Finance**: PayPal, Venmo
- **Education**: Duolingo, Khan Academy, Coursera, Udemy
- **Tools**: Grammarly, Canva, Adobe Photoshop, VLC, MX Player

### Geographic Distribution
Reviews come from 24 countries across Asia, Europe, the Americas, Oceania, and Africa:
- **Asia**: China, India, Japan, South Korea, Thailand, Vietnam, Indonesia, Malaysia, Philippines, Pakistan, Bangladesh
- **Europe**: Germany, United Kingdom, France, Italy, Spain, Russia, Turkey, Poland
- **Americas**: United States, Canada, Brazil, Mexico
- **Oceania**: Australia
- **Africa**: Nigeria

## Project Questions
- Which apps and app categories have the best ratings and reviews, and which have the worst?
- Which apps and app categories have improved in ratings and reviews over time, and which have declined?
- Does the sentiment of a review's text match the rating given?
- Do user demographics (age, country, gender) influence ratings and reviews?
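
As a rough illustration of how the first question might be approached, the sketch below ranks categories by mean rating. The rows in `toy` are made up for the example; only the column names `app_category` and `rating` are taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
toy = pd.DataFrame({
    "app_category": ["Finance", "Finance", "Games", "Games", "Dating"],
    "rating": [4.0, 3.0, 2.0, None, 5.0],
})

# Mean rating and number of rated reviews per category.
# Missing ratings are ignored by both "mean" and "count".
category_summary = (
    toy.groupby("app_category")["rating"]
    .agg(mean_rating="mean", n_reviews="count")
    .sort_values("mean_rating", ascending=False)
)
print(category_summary)
```

Keeping the review count next to the mean helps avoid over-reading a category that has only a handful of rated reviews. Grouping by `app_name` instead of `app_category` would give the per-app view.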

## Objectives
- Clean, wrangle, and engineer the dataset so that it is suitable for analysis and for ML training and testing
- Determine which factors influence the sentiment of a review or the rating given
- Determine which keywords or phrases influence the sentiment of a review or the rating given
- Develop an ML model that predicts a review's rating (as either a categorical or a numerical target) from certain keywords
- Develop a dashboard that presents the key statistics and visualizations
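
One simple starting point for the keyword objective, before any model is trained, is to compare the mean rating of reviews that contain a given keyword. This is only a sketch: the review texts, ratings, and the `keywords` list below are hypothetical, and plain substring matching is a crude stand-in for proper tokenization.

```python
import pandas as pd

# Made-up reviews and ratings (illustration only).
toy = pd.DataFrame({
    "review_text": [
        "Love it, works great",
        "Crashes after the update",
        "Too many ads",
        "Great for daily use",
    ],
    "rating": [4.5, 1.5, 3.0, 5.0],
})

keywords = ["great", "ads", "crashes"]  # hypothetical keyword list

rows = []
for kw in keywords:
    # Case-insensitive literal substring match.
    mask = toy["review_text"].str.contains(kw, case=False, regex=False)
    rows.append({
        "keyword": kw,
        "n_reviews": int(mask.sum()),
        "mean_rating": toy.loc[mask, "rating"].mean(),
    })

keyword_stats = pd.DataFrame(rows).set_index("keyword")
print(keyword_stats)
```

Because the dataset is multilingual, a keyword list like this would need to be built per language, or the reviews translated first.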

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/multilingual-mobile-app-reviews-dataset-2025/multilingual_mobile_app_reviews_2025.csv


In [2]:
df = pd.read_csv('/kaggle/input/multilingual-mobile-app-reviews-dataset-2025/multilingual_mobile_app_reviews_2025.csv')
df

| | review_id | user_id | app_name | app_category | review_text | review_language | rating | review_date | verified_purchase | device_type | num_helpful_votes | user_age | user_country | user_gender | app_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1967825 | MX Player | Travel & Local | Qui doloribus consequuntur. Perspiciatis tempo... | no | 1.3 | 2024-10-09 19:26:40 | True | Android Tablet | 65 | 14.0 | China | Female | 1.4 |
| 1 | 2 | 9242600 | Tinder | Navigation | Great app but too many ads, consider premium v... | ru | 1.6 | 2024-06-21 17:29:40 | True | iPad | 209 | 18.0 | Germany | Male | 8.9 |
| 2 | 3 | 7636477 | Netflix | Dating | The interface could be better but overall good... | es | 3.6 | 2024-10-31 13:47:12 | True | iPad | 163 | 67.0 | Nigeria | Male | 2.8.37.5926 |
| 3 | 4 | 209031 | Venmo | Productivity | Latest update broke some features, please fix ... | vi | 3.8 | 2025-03-12 06:16:22 | True | iOS | 664 | 66.0 | India | Female | 10.2 |
| 4 | 5 | 7190293 | Google Drive | Education | Perfect for daily use, highly recommend to eve... | tl | 3.2 | 2024-04-21 03:48:27 | True | iPad | 1197 | 40.0 | South Korea | Prefer not to say | 4.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2509 | 2510 | 2322118 | OneDrive | Business | Счастье низкий пастух. Нож неожиданно поезд тр... | nl | 3.0 | 2023-11-23 01:07:30 | False | iOS | 635 | 21.0 | Malaysia | Non-binary | 1.1.2-beta |
| 2510 | 2511 | 2167693 | Signal | Finance | This app is amazing! Really love the new featu... | ms | 1.9 | 2025-06-05 16:42:20 | True | Windows Phone | 1127 | 38.0 | Bangladesh | NaN | v12.0.80 |
| 2511 | 2512 | 5554467 | OneDrive | Social Networking | This app is amazing! Really love the new featu... | zh | 3.4 | 2024-06-15 05:02:18 | True | Android Tablet | 677 | 27.0 | Pakistan | NaN | 9.1.32.4821 |
| 2512 | 2513 | 8805125 | Coursera | Social Networking | Invitare convincere pericoloso corsa fortuna. ... | da | 2.7 | 2023-12-02 01:41:31 | True | Android | 155 | 35.0 | India | NaN | v8.9.13 |
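
The preview pairs apps with categories that look unrelated (for example MX Player with Travel & Local, and Netflix with Dating), so it may be worth checking whether each app maps to a single category before aggregating by category. A minimal sketch of such a check, using made-up rows and only the `app_name` and `app_category` column names from the dataset:

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    "app_name": ["Netflix", "Netflix", "Venmo", "Venmo", "Waze"],
    "app_category": ["Dating", "Entertainment", "Finance", "Finance", "Navigation"],
})

# Number of distinct categories each app appears under.
categories_per_app = toy.groupby("app_name")["app_category"].nunique()

# Apps that appear under more than one category.
inconsistent_apps = categories_per_app[categories_per_app > 1]
print(inconsistent_apps)
```

If many apps turn out to span several categories in the real data, category-level conclusions would need to be treated with caution.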


In [3]:
df.columns

Index(['review_id', 'user_id', 'app_name', 'app_category', 'review_text',
       'review_language', 'rating', 'review_date', 'verified_purchase',
       'device_type', 'num_helpful_votes', 'user_age', 'user_country',
       'user_gender', 'app_version'],
      dtype='object')

In [4]:
df.isna().any()

review_id            False
user_id              False
app_name             False
app_category         False
review_text           True
review_language      False
rating                True
review_date          False
verified_purchase    False
device_type          False
num_helpful_votes    False
user_age             False
user_country          True
user_gender           True
app_version           True
dtype: bool
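
`isna().any()` only says whether a column has missing values, not how many. A small sketch of how the counts and percentages could be tabulated, shown on a made-up frame (on the real data, `toy` would be replaced by `df`):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values (illustration only).
toy = pd.DataFrame({
    "rating": [4.0, np.nan, 2.5, np.nan],
    "user_gender": ["Female", None, "Male", "Non-binary"],
    "app_name": ["Zoom", "Uber", "Lyft", "VLC"],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# Keep only columns that actually have gaps, largest first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```

The percentages help decide between dropping rows and imputing values for each affected column.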

In [5]:
df.nunique()

review_id            2514
user_id              2514
app_name               41
app_category           18
review_text           739
review_language        24
rating                 41
review_date          2514
verified_purchase       2
device_type             5
num_helpful_votes    1078
user_age               63
user_country           24
user_gender             4
app_version          2081
dtype: int64
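
With 739 unique `review_text` values, many reviews share identical text. The sketch below shows one way to quantify that repetition; the rows are made up and only the `review_text` column name comes from the dataset.

```python
import pandas as pd

# Made-up rows, including one missing text (illustration only).
toy = pd.DataFrame({
    "review_text": ["Good app", "Good app", "Too slow", None, "Good app"],
})

# How often each text occurs (missing texts are excluded by value_counts).
text_counts = toy["review_text"].value_counts()
repeated = text_counts[text_counts > 1]

# Share of rows whose text also appears in at least one other row.
share_repeated = toy["review_text"].duplicated(keep=False).mean()

print(repeated)
print(f"{share_repeated:.0%} of rows share their text with another row")
```

Heavy repetition matters for the modeling objective: if identical texts carry very different ratings, a text-only model has limited signal to learn from.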

In [6]:
df.dtypes

review_id              int64
user_id                int64
app_name              object
app_category          object
review_text           object
review_language       object
rating               float64
review_date           object
verified_purchase       bool
device_type           object
num_helpful_votes      int64
user_age             float64
user_country          object
user_gender           object
app_version           object
dtype: object
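
The dtypes show `review_date` stored as `object` and `user_age` as `float64`. A possible first cleaning step is to convert these to more suitable types. The sketch below uses a made-up two-row frame with dates in the same format as the preview; the choice of `Int64` and `category` is an assumption about what suits later analysis, not something the dataset prescribes.

```python
import pandas as pd

# Made-up frame mirroring three of the dataset's columns (illustration only).
toy = pd.DataFrame({
    "review_date": ["2024-10-09 19:26:40", "2025-03-12 06:16:22"],
    "user_age": [14.0, 66.0],
    "app_category": ["Finance", "Games"],
})

typed = toy.assign(
    # Parse timestamps so that time-based grouping becomes possible.
    review_date=pd.to_datetime(toy["review_date"], format="%Y-%m-%d %H:%M:%S"),
    # Nullable integer dtype: whole-number ages, tolerant of missing values.
    user_age=toy["user_age"].astype("Int64"),
    # Low-cardinality text columns can be stored as categoricals.
    app_category=toy["app_category"].astype("category"),
)
print(typed.dtypes)
```

Once `review_date` is a datetime column, the time-trend question can be addressed by grouping on components such as `typed["review_date"].dt.year`.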

## Data Engineering