In [1]:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
df = pd.read_csv('cleaned_googleplaystore_data(1).csv')
df.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Last Updated Date | Last Updated Month | Last Updated Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159.0 | 19.0 | 10000.0 | Free | 0.0 | Everyone | 2018-01-07 | January | 2018.0 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967.0 | 14.0 | 500000.0 | Free | 0.0 | Everyone | 2018-01-15 | January | 2018.0 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510.0 | 8.7 | 5000000.0 | Free | 0.0 | Everyone | 2018-08-01 | August | 2018.0 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644.0 | 25.0 | 50000000.0 | Free | 0.0 | Teen | 2018-06-08 | June | 2018.0 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967.0 | 2.8 | 100000.0 | Free | 0.0 | Everyone | 2018-06-20 | June | 2018.0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10839 entries, 0 to 10838
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   App                 10839 non-null  object 
 1   Category            10839 non-null  object 
 2   Rating              10839 non-null  float64
 3   Reviews             10839 non-null  float64
 4   Size                10839 non-null  float64
 5   Installs            10839 non-null  float64
 6   Type                10839 non-null  object 
 7   Price               10839 non-null  float64
 8   Content Rating      10839 non-null  object 
 9   Last Updated Date   10839 non-null  object 
 10  Last Updated Month  10839 non-null  object 
 11  Last Updated Year   10839 non-null  float64
dtypes: float64(6), object(6)
memory usage: 1016.3+ KB


In [4]:
df.isnull().sum()

App                   0
Category              0
Rating                0
Reviews               0
Size                  0
Installs              0
Type                  0
Price                 0
Content Rating        0
Last Updated Date     0
Last Updated Month    0
Last Updated Year     0
dtype: int64

In [5]:
df.shape

(10839, 12)

##### Converting column types

In [6]:
df['Last Updated Date'] = pd.to_datetime(df['Last Updated Date'])
df['Category'] = df['Category'].astype('category')
df['Installs'] = df['Installs'].astype(int)
df.dtypes

App                           object
Category                    category
Rating                       float64
Reviews                      float64
Size                         float64
Installs                       int64
Type                          object
Price                        float64
Content Rating                object
Last Updated Date     datetime64[ns]
Last Updated Month            object
Last Updated Year            float64
dtype: object

##### Dropping Duplicates

In [7]:
df = df.drop_duplicates(subset='App')
df.shape

(9658, 12)
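By default, `drop_duplicates(subset='App')` keeps the first row it encounters for each app name. If duplicate rows differ in other columns, which row survives can matter. The sketch below uses made-up rows (not the Play Store data) to contrast the default with one possible alternative, keeping the row with the most reviews.

```python
import pandas as pd

# Made-up rows: "Alpha" appears twice with different review counts.
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha"],
    "Reviews": [100, 50, 250],
})

# Default behaviour: keep the first occurrence of each app name.
first_kept = toy.drop_duplicates(subset="App")

# Alternative: sort so the most-reviewed row comes first, then deduplicate.
most_reviewed = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App")
       .sort_index()
)

print(first_kept)
print(most_reviewed)
```

Whether the most-reviewed row is the right one to keep is an assumption; it depends on how the duplicates arose.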

# **Finding Key Insights from Google Play Store Dataset**

## **App Popularity & User Engagement**

##### Which app category has the most apps listed?

In [8]:
df['Category'].value_counts().head()

Category
FAMILY      1831
GAME         959
TOOLS        827
BUSINESS     420
MEDICAL      395
Name: count, dtype: int64

##### Which app has the highest number of installs overall?

In [9]:
top_installed_apps = df.sort_values(by='Installs', ascending=False).head(1)
print(top_installed_apps[['App', 'Installs']])

                   App    Installs
865  Google Play Games  1000000000


##### Which category has the most apps with 1M+ installs?

In [10]:
filtered_df = df[df['Installs'] >= 1_000_000]
category_counts = filtered_df['Category'].value_counts()
most_common_category = category_counts.idxmax()
most_common_count = category_counts.max()

print(f"The category with the most apps having 1M+ installs is '{most_common_category}' with {most_common_count} apps.")

The category with the most apps having 1M+ installs is 'GAME' with 553 apps.
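Raw counts favour categories that simply contain more apps, so it can help to look at the share of apps in each category that reach 1M+ installs as well. A minimal sketch on made-up data (the column names match the notebook; the values do not):

```python
import pandas as pd

# Made-up data: FAMILY has more apps, GAME has a higher hit rate.
toy = pd.DataFrame({
    "Category": ["FAMILY"] * 4 + ["GAME"] * 2,
    "Installs": [1_000, 5_000_000, 10_000, 500, 1_000_000, 10_000_000],
})

# Boolean flag per app: does it have at least one million installs?
is_popular = toy["Installs"] >= 1_000_000

# Summing the flag counts popular apps; averaging it gives their share.
counts = is_popular.groupby(toy["Category"]).sum()
shares = is_popular.groupby(toy["Category"]).mean()

print(counts)
print(shares)
```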


##### Which app has received the most reviews?

In [11]:
top_review_apps = df.sort_values(by='Reviews', ascending=False).head(1)
print(top_review_apps[['App', 'Reviews']])

           App     Reviews
2544  Facebook  78158306.0


##### Which category has the highest average number of installs?

In [12]:
df.groupby('Category')['Installs'].mean().sort_values(ascending=False).astype(int).head()

Category
COMMUNICATION    35042146
VIDEO_PLAYERS    24091427
SOCIAL           22961790
ENTERTAINMENT    20722156
PHOTOGRAPHY      16545009
Name: Installs, dtype: int64

##### Is there any correlation between high ratings and high installs?

In [13]:
df[['Rating', 'Installs']].corr()

| | Rating | Installs |
|---|---|---|
| Rating | 1.0 | 0.03431 |
| Installs | 0.03431 | 1.0 |


*The Pearson correlation between rating and installs is about 0.03, so there is essentially no linear relationship between the two.*

*A high rating does not imply more installs, and a large install count does not imply a higher rating.*
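`DataFrame.corr()` computes the Pearson coefficient by default, which only measures linear association. The sketch below writes the formula out in plain Python and applies it to made-up numbers, to show that heavily skewed values (like install counts) can give a coefficient well below 1 even when one variable always rises with the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, not taken from the dataset.
ratings = [3.0, 3.5, 4.0, 4.5, 5.0]
installs_linear = [100, 200, 300, 400, 500]
installs_skewed = [10, 100, 1_000, 10_000, 1_000_000_000]

r_linear = pearson(ratings, installs_linear)   # perfectly linear
r_skewed = pearson(ratings, installs_skewed)   # increasing, but far from linear

print(round(r_linear, 3), round(r_skewed, 3))
```

A near-zero Pearson value therefore rules out a linear relationship, not every kind of relationship.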

##### Is there a relationship between size of the app and number of installs?

In [28]:
df[['Size', 'Installs']].corr()

| | Size | Installs |
|---|---|---|
| Size | 1.0 | 0.033832 |
| Installs | 0.033832 | 1.0 |


*There is almost no linear relationship between the size of an app and how many times it has been installed.*

*Size doesn't play a major role in how popular the app is.*
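Because install counts span many orders of magnitude, a rank-based (Spearman) correlation is a reasonable complement to the linear one. Spearman's coefficient is the Pearson coefficient of the ranks, which the sketch below computes with `rank()` on made-up data; pandas also accepts `method="spearman"` in `DataFrame.corr`.

```python
import pandas as pd

# Made-up data: installs always grow with size, but not linearly.
toy = pd.DataFrame({
    "Size": [2.0, 5.0, 10.0, 20.0, 40.0],
    "Installs": [10, 100, 1_000, 10_000, 1_000_000_000],
})

# Linear (Pearson) correlation on the raw values.
pearson_r = toy.corr().loc["Size", "Installs"]

# Spearman correlation: Pearson correlation of the ranked values.
spearman_r = toy.rank().corr().loc["Size", "Installs"]

print(round(pearson_r, 3), round(spearman_r, 3))
```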

## **Ratings Analysis**

##### What is the average rating of apps in each category?

In [14]:
avg_rating_by_category = df.groupby('Category')['Rating'].mean().sort_values(ascending=False).round(2)
print(avg_rating_by_category)

Category
EVENTS                 4.40
EDUCATION              4.36
ART_AND_DESIGN         4.35
BOOKS_AND_REFERENCE    4.33
PERSONALIZATION        4.33
PARENTING              4.30
BEAUTY                 4.28
SOCIAL                 4.26
HEALTH_AND_FITNESS     4.25
GAME                   4.25
WEATHER                4.25
SHOPPING               4.24
SPORTS                 4.23
LIBRARIES_AND_DEMO     4.21
PRODUCTIVITY           4.21
AUTO_AND_VEHICLES      4.21
MEDICAL                4.20
FAMILY                 4.19
FOOD_AND_DRINK         4.19
COMICS                 4.19
HOUSE_AND_HOME         4.17
BUSINESS               4.17
PHOTOGRAPHY            4.17
NEWS_AND_MAGAZINES     4.16
COMMUNICATION          4.15
FINANCE                4.14
ENTERTAINMENT          4.14
LIFESTYLE              4.13
TRAVEL_AND_LOCAL       4.10
TOOLS                  4.07
VIDEO_PLAYERS          4.07
MAPS_AND_NAVIGATION    4.06
DATING                 4.04
Name: Rating, dtype: float64


##### Which app categories have the most apps rated above 4.5?

In [15]:
df[df['Rating'] > 4.5]['Category'].value_counts()

Category
FAMILY                 342
GAME                   159
TOOLS                  111
HEALTH_AND_FITNESS      89
LIFESTYLE               81
MEDICAL                 81
PERSONALIZATION         78
FINANCE                 70
BOOKS_AND_REFERENCE     64
PRODUCTIVITY            63
BUSINESS                61
SPORTS                  55
SOCIAL                  47
NEWS_AND_MAGAZINES      38
PHOTOGRAPHY             37
EDUCATION               33
SHOPPING                32
COMMUNICATION           24
FOOD_AND_DRINK          23
ART_AND_DESIGN          22
TRAVEL_AND_LOCAL        22
AUTO_AND_VEHICLES       21
VIDEO_PLAYERS           21
DATING                  20
EVENTS                  19
PARENTING               19
COMICS                  13
MAPS_AND_NAVIGATION     12
BEAUTY                  11
LIBRARIES_AND_DEMO      10
WEATHER                  9
ENTERTAINMENT            8
HOUSE_AND_HOME           7
Name: count, dtype: int64

##### Are paid apps rated higher than free apps on average?

In [16]:
df['Type'].unique()

array(['Free', 'Paid'], dtype=object)

In [17]:
df.groupby('Type')['Rating'].mean().round(2)

Type
Free    4.19
Paid    4.27
Name: Rating, dtype: float64

*Paid apps are rated slightly higher on average than Free apps (4.27 vs 4.19).*

## **Free v/s Paid Apps**

##### What percentage of apps are Free vs Paid?

In [18]:
df['Type'].value_counts(normalize=True).round(2) * 100

Type
Free    92.0
Paid     8.0
Name: proportion, dtype: float64

*About 92% of apps are Free and about 8% are Paid (the cell above rounds to whole percentages).*
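The cell above rounds the proportions to two decimals before multiplying by 100, which leaves whole-number percentages. A small sketch on made-up data showing that multiplying first keeps two decimals of the percentage:

```python
import pandas as pd

# Made-up data: 12 free apps and 1 paid app.
types = pd.Series(["Free"] * 12 + ["Paid"])
shares = types.value_counts(normalize=True)

# Rounding the proportion first discards everything below one percent.
rounded_first = shares.round(2) * 100

# Converting to a percentage first keeps two decimal places.
rounded_last = (shares * 100).round(2)

print(rounded_first)
print(rounded_last)
```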

##### What is the average price of paid apps by category?

In [19]:
paid_apps = df[df['Type'] == 'Paid']

average_price_by_category = paid_apps.groupby('Category')['Price'].mean().sort_values(ascending=False).round(2)
print(average_price_by_category)

Category
FINANCE                170.64
LIFESTYLE              124.26
EVENTS                 109.99
BUSINESS                14.61
FAMILY                  13.11
MEDICAL                 12.00
PRODUCTIVITY             8.96
PHOTOGRAPHY              6.23
MAPS_AND_NAVIGATION      5.39
SOCIAL                   5.32
PARENTING                4.79
DATING                   4.57
EDUCATION                4.49
AUTO_AND_VEHICLES        4.49
HEALTH_AND_FITNESS       4.29
BOOKS_AND_REFERENCE      4.28
FOOD_AND_DRINK           4.24
SPORTS                   4.17
TRAVEL_AND_LOCAL         4.16
WEATHER                  4.05
ENTERTAINMENT            3.99
GAME                     3.47
TOOLS                    3.43
COMMUNICATION            3.08
SHOPPING                 2.74
VIDEO_PLAYERS            2.62
NEWS_AND_MAGAZINES       1.99
ART_AND_DESIGN           1.99
PERSONALIZATION          1.86
LIBRARIES_AND_DEMO       0.99
BEAUTY                    NaN
COMICS                    NaN
HOUSE_AND_HOME            NaN
Name: Price, dtype: float64
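The `NaN` rows are most likely a side effect of `Category` being a categorical column: when unobserved categories are included in a group-by, categories with no paid apps still appear and their mean is `NaN`. The sketch below reproduces this on made-up data and shows that passing `observed=True` drops the empty groups.

```python
import pandas as pd

# Made-up data: BEAUTY has no paid apps.
toy = pd.DataFrame({
    "Category": pd.Categorical(["FINANCE", "BEAUTY", "FINANCE", "GAME"]),
    "Type": ["Paid", "Free", "Paid", "Paid"],
    "Price": [4.99, 0.0, 9.99, 2.99],
})
paid = toy[toy["Type"] == "Paid"]

# observed=False keeps every category, even those absent from `paid`.
with_empty = paid.groupby("Category", observed=False)["Price"].mean()

# observed=True keeps only the categories that actually occur.
only_seen = paid.groupby("Category", observed=True)["Price"].mean()

print(with_empty)
print(only_seen)
```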

##### Which are the most expensive apps, and do they have high ratings or installs?

In [20]:
most_expensive = df.sort_values(by='Price', ascending=False)

most_expensive[['App', 'Price', 'Rating', 'Installs']].head(10)

| | App | Price | Rating | Installs |
|---|---|---|---|---|
| 4367 | I'm Rich - Trump Edition | 400.0 | 3.6 | 10000 |
| 9933 | I'm Rich/Eu sou Rico/أنا غني/我很有錢 | 399.99 | 4.3 | 0 |
| 5354 | I am Rich Plus | 399.99 | 4.0 | 10000 |
| 5351 | I am rich | 399.99 | 3.8 | 100000 |
| 5369 | I am Rich | 399.99 | 4.3 | 5000 |
| 5373 | I AM RICH PRO PLUS | 399.99 | 4.0 | 1000 |
| 4197 | most expensive app (H) | 399.99 | 4.3 | 100 |
| 5359 | I am rich(premium) | 399.99 | 3.5 | 5000 |
| 5362 | I Am Rich Pro | 399.99 | 4.4 | 5000 |
| 5356 | I Am Rich Premium | 399.99 | 4.1 | 50000 |


*The most expensive apps (around $400) are novelty apps such as the “I Am Rich” variants. Their ratings are moderate (3.5 to 4.4), but installs are low: eight of the ten have 10,000 installs or fewer, and one shows 0 installs.*
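A few entries priced near $400 can pull a category's mean price far above what a typical paid app costs. The sketch below uses made-up prices to show how the median stays close to the typical value while the mean does not; it is an illustration, not a recomputation of the table above.

```python
from statistics import mean, median

# Made-up prices: four ordinary paid apps and one novelty app.
prices = [0.99, 1.99, 2.99, 4.99, 399.99]

mean_price = mean(prices)      # pulled upward by the single outlier
median_price = median(prices)  # the middle value, unaffected by the outlier

print(round(mean_price, 2), median_price)
```

In pandas the same comparison could be made by swapping `.mean()` for `.median()` in the group-by.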

##### Do paid apps receive more reviews than free apps on average?

In [21]:
df.groupby('Type')['Reviews'].mean().round(0)

Type
Free    234270.0
Paid      8725.0
Name: Reviews, dtype: float64

*No, free apps have more reviews than paid apps on average.*

## **Category-Wise Analysis**

##### Which category has the highest number of apps?

In [22]:
df['Category'].value_counts().head()

Category
FAMILY      1831
GAME         959
TOOLS        827
BUSINESS     420
MEDICAL      395
Name: count, dtype: int64

## **Update Frequency Trends**

##### Which year had the most app updates?

In [24]:
df['Last Updated Year'].value_counts().sort_values(ascending=False).head()

Last Updated Year
2018.0    6283
2017.0    1794
2016.0     779
2015.0     449
2014.0     203
Name: count, dtype: int64

##### Which month and year had the highest number of app updates?

In [25]:
df[['Last Updated Month', 'Last Updated Year']].value_counts().sort_values(ascending=False).head()

Last Updated Month  Last Updated Year
July                2018.0               2320
August              2018.0                977
June                2018.0                912
May                 2018.0                691
March               2018.0                407
Name: count, dtype: int64
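Since `Last Updated Date` has already been converted to a datetime column, the month and year can also be derived from it directly instead of relying on the separate month and year columns. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up dates standing in for the 'Last Updated Date' column.
dates = pd.to_datetime(pd.Series(
    ["2018-07-03", "2018-07-21", "2018-08-01", "2017-07-15"]
))

# Collapse each date to its calendar month, then count apps per month.
per_month = dates.dt.to_period("M").value_counts()
busiest = per_month.idxmax()

print(per_month)
print(busiest)
```

A monthly period keeps July 2017 and July 2018 apart without needing two columns.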

## **Data Quality Checks**

##### Do any apps have missing or anomalous data like extremely high price or negative values?

In [26]:
anomalous_prices = df[df['Price'] > 300][['App', 'Price', 'Rating', 'Installs']]
print(anomalous_prices)

                                    App   Price  Rating  Installs
4197             most expensive app (H)  399.99     4.3       100
4362                         💎 I'm rich  399.99     3.8     10000
4367           I'm Rich - Trump Edition  400.00     3.6     10000
5351                          I am rich  399.99     3.8    100000
5354                     I am Rich Plus  399.99     4.0     10000
5356                  I Am Rich Premium  399.99     4.1     50000
5357                I am extremely Rich  379.99     2.9      1000
5358                         I am Rich!  399.99     3.8      1000
5359                 I am rich(premium)  399.99     3.5      5000
5362                      I Am Rich Pro  399.99     4.4      5000
5364     I am rich (Most expensive app)  399.99     4.1      1000
5366                          I Am Rich  389.99     3.6     10000
5369                          I am Rich  399.99     4.3      5000
5373                 I AM RICH PRO PLUS  399.99     4.0      1000
9916      

In [27]:
negative_prices = df[df['Price'] < 0]
print(negative_prices)

Empty DataFrame
Columns: [App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Last Updated Date, Last Updated Month, Last Updated Year]
Index: []
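The same kind of check can be extended beyond `Price`. The sketch below, on made-up rows, counts negative values in every numeric column at once and flags ratings outside the range 1 to 5 (assumed here to be the valid Play Store rating scale).

```python
import pandas as pd

# Made-up rows containing one out-of-range rating and one negative count.
toy = pd.DataFrame({
    "Rating": [4.1, 5.6, 3.9],
    "Reviews": [10, 20, -1],
    "Price": [0.0, 399.99, 1.99],
})

# Count negative values in every numeric column.
negatives = (toy.select_dtypes("number") < 0).sum()

# Flag ratings outside the assumed valid range of 1 to 5.
bad_ratings = toy[~toy["Rating"].between(1, 5)]

print(negatives)
print(bad_ratings)
```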



# **📊 Google Play Store Data Analysis Summary**


---

## 🧩 1. App Popularity & User Engagement

### Q1) Which app category has the most apps listed?
Top categories with the highest number of apps:
- **FAMILY** — 1831 apps
- **GAME** — 959 apps
- **TOOLS** — 827 apps  
*These categories account for the most listings in this dataset.*

---

### Q2) Which app has the highest number of installs overall?
- **Google Play Games** with **1,000,000,000 installs**

---

### Q3) Which category has the most apps with 1M+ installs?
- **GAME** — 553 apps  
*A reflection of how popular mobile gaming is.*

---

### Q4) Which app has received the most reviews?
- **Facebook** — **78,158,306 reviews**

---

### Q5) Which category has the highest average number of installs?
Top 3 categories by average installs:
- **COMMUNICATION** — 35M
- **VIDEO_PLAYERS** — 24M
- **SOCIAL** — 23M  
*Communication and social apps dominate user engagement.*

---

### Q6) Is there any correlation between high ratings and high installs?
- Correlation coefficient: **~0.03**
- *There is **no strong correlation** between app ratings and install counts.*

---

### Q7) Is there a relationship between app size and installs?
- Correlation coefficient: **~0.03**
- *App size shows **almost no linear relationship** with the number of installs.*

---

## ⭐ 2. Ratings Analysis

### Q1) What is the average rating of apps in each category?
Top-rated categories:
- **EVENTS** — 4.40
- **EDUCATION** — 4.36
- **ART_AND_DESIGN** — 4.35

*These app types are well-liked among users.*

---

### Q2) Which categories have the most apps rated above 4.5?
- **FAMILY** — 342 apps
- **GAME** — 159 apps
- **TOOLS** — 111 apps

*High ratings are also seen in health, personalization, and finance categories.*

---

### Q3) Are paid apps rated higher than free apps on average?
- **Paid apps** — 4.27 average rating
- **Free apps** — 4.19 average rating  
*Paid apps are slightly better rated on average.*

---

## 💸 3. Free vs Paid Apps

### Q1) What percentage of apps are Free vs Paid?
- **Free**: 92.17%
- **Paid**: 7.83%

---

### Q2) Average price of paid apps by category
Highest average prices:
- **FINANCE** — $170.64
- **LIFESTYLE** — $124.26
- **EVENTS** — $109.99

---

### Q3) Most expensive apps and their popularity:
All priced at ~$400:
- **I am Rich**, **I'm Rich - Trump Edition**, etc.
- Ratings: 3.5–4.4
- Installs: mostly 10,000 or fewer

*These are novelty apps, not widely used despite high cost.*

---

### Q4) Do paid apps get more reviews?
- **Free apps** — Avg. 234,270 reviews
- **Paid apps** — Avg. 8,725 reviews

*No — free apps receive far more reviews.*

---

## 🗂️ 4. Category-Wise Summary

### Which category has the highest number of apps?
- **FAMILY**, **GAME**, **TOOLS** again top the list.

---

## 🔄 5. App Update Trends

### Q1) Which year had the most app updates?
- **2018**: 6,283 apps last updated
- **2017**: 1,794 apps last updated
- *Most apps were last updated in 2018.*

### Q2) Which month/year had the most updates?
- **July 2018**: 2,320 apps last updated

---

## 🚨 6. Data Quality Issues

### Q1) Are there any apps with extreme prices or anomalies?
Apps with prices close to $400 include:
- *I'm Rich*, *I Am Rich Plus*, etc.
- Ratings between 2.9 and 4.4, installs between 0 and 100,000

*Anomalies: high price, low install apps — mostly novelty entries.*

---