# Google Play Store Dataset – Data Overview & Quality Analysis

## Final Insights & Notebook Cleanup

#### Top Key Insights (Business + Analytics)


* Free apps receive significantly more installs than paid apps, as shown by boxplots and median comparisons, which is consistent with the freemium business model.

* Paid apps tend to have slightly higher average ratings, which may indicate perceived value or a quality bias.

* Ratings are clustered between 4.0–4.5, suggesting positive review bias and limited differentiation.

* Category has a statistically significant impact on ratings (confirmed via ANOVA).

* The Game and Family categories lead in average installs.

* App size does not strongly influence ratings, as seen from size vs rating scatterplots.

* Reviews are strongly correlated with installs, making them a key growth indicator for marketing teams.

* App update frequency increased sharply after 2016, reflecting market expansion and shorter app lifecycles.

* App installs are highly right-skewed: a small number of apps capture the majority of downloads.
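The skew and the reviews-to-installs relationship noted above can be checked with a few lines of pandas. The sketch below uses made-up toy values, not rows from the Play Store data, and assumes cleaned numeric `Installs` and `Reviews` columns. It uses a rank-based (Spearman-style) correlation because the heavy right tail would otherwise dominate a plain Pearson coefficient.

```python
import pandas as pd

# Toy values for illustration only; these are NOT taken from the real dataset.
toy = pd.DataFrame({
    "Installs": [100, 500, 1_000, 5_000, 10_000, 50_000, 1_000_000, 100_000_000],
    "Reviews":  [3, 10, 40, 120, 300, 2_000, 45_000, 3_000_000],
})

# Positive skewness indicates a long right tail (a few very large values).
install_skew = toy["Installs"].skew()

# Share of all installs captured by the two largest apps.
top_share = toy["Installs"].nlargest(2).sum() / toy["Installs"].sum()

# Pearson correlation of the ranks is the Spearman rank correlation.
rank_corr = toy["Reviews"].rank().corr(toy["Installs"].rank())

print(f"skew={install_skew:.2f}, top-2 share={top_share:.2%}, rank corr={rank_corr:.2f}")
```

On the real data, the same three numbers would quantify how concentrated downloads are and how closely review counts track installs.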

#### Segment-Level Findings (Supported by Plots)

##### Free vs Paid Apps

* Plots used: Boxplot, Bar chart

* Free apps dominate installs.

* Paid apps show marginally higher ratings but limited reach.
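A comparison like this one could be reproduced with a grouped summary. The following is a minimal sketch on invented rows; the column name `Type` with values `Free` and `Paid` is an assumption, since the notebook's actual column is not shown here.

```python
import pandas as pd

# Toy rows for illustration only; values are not from the real dataset.
# The "Type" column name is an assumption.
apps = pd.DataFrame({
    "Type":     ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [10_000, 1_000_000, 500_000, 1_000, 5_000, 10_000],
    "Rating":   [4.1, 4.3, 4.2, 4.4, 4.5, 4.3],
})

# Median installs resist the extreme outliers that distort the mean.
summary = apps.groupby("Type").agg(
    median_installs=("Installs", "median"),
    mean_rating=("Rating", "mean"),
    n_apps=("Installs", "size"),
)
print(summary)
```

Reporting the median rather than the mean for installs keeps a handful of blockbuster apps from deciding the comparison.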

##### Category Segment

* Plots used: Bar chart, Boxplot

* High-volume categories face intense competition.
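The key insights cite an ANOVA on ratings by category, but the notebook code for it is not shown. As a hedged sketch of what such a test computes, the helper below (a hypothetical function, `one_way_anova_f`) derives the one-way ANOVA F statistic by hand with pandas. In practice `scipy.stats.f_oneway` returns the same statistic together with a p-value. The category names and ratings are toy values.

```python
import pandas as pd

def one_way_anova_f(df, group_col, value_col):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    data = df[[group_col, value_col]].dropna()
    grand_mean = data[value_col].mean()
    groups = data.groupby(group_col)[value_col]

    # Between-group variation: group size times squared gap to the grand mean.
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    # Within-group variation: squared gap of each value to its own group mean.
    ss_within = ((data[value_col] - groups.transform("mean")) ** 2).sum()

    df_between = groups.ngroups - 1
    df_within = len(data) - groups.ngroups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy ratings for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Category": ["GAME"] * 3 + ["FAMILY"] * 3 + ["TOOLS"] * 3,
    "Rating":   [1.5, 2.0, 2.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
})

f_stat, df_between, df_within = one_way_anova_f(toy, "Category", "Rating")
print(f_stat, df_between, df_within)
```

A large F relative to the F distribution with these degrees of freedom is what supports a claim that category affects ratings.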

##### Size Segment

* Plots used: Scatterplot, Boxplot by Size_Category

* No strong linear relationship between size and success.
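One way to back a "no strong linear relationship" statement is to pair a correlation coefficient with a per-bin summary. The sketch below assumes the `Size_MB` column from the cleaning step. The bin edges for `Size_Category` are hypothetical, because the notebook's actual thresholds are not shown, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Size_MB": [2.0, 8.0, 15.0, 40.0, 75.0, 120.0],
    "Rating":  [4.2, 4.0, 4.4, 4.1, 4.3, 4.2],
})

# Hypothetical bin edges: (0, 10], (10, 50], (50, inf).
toy["Size_Category"] = pd.cut(
    toy["Size_MB"],
    bins=[0, 10, 50, float("inf")],
    labels=["Small", "Medium", "Large"],
)

# Pearson r near 0 means little linear association between size and rating.
pearson_r = toy["Size_MB"].corr(toy["Rating"])

# Median rating per size bin, mirroring the boxplot by Size_Category.
median_by_bin = toy.groupby("Size_Category", observed=True)["Rating"].median()

print(round(pearson_r, 3))
print(median_by_bin)
```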

##### Time Segment

* Plots used: Line plot (updates over years)

* Rapid increase in app updates post-2016.
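The yearly counts behind such a line plot could be derived as follows. This is a minimal sketch on invented strings; the date format `%B %d, %Y` is an assumption about how `Last Updated` is stored, and unparseable values become `NaT` through `errors="coerce"`, matching the action recorded in the data quality log.

```python
import pandas as pd

# Toy strings for illustration only; the date format is an assumption.
raw = pd.Series(
    ["January 7, 2018", "June 15, 2016", "not a date", "March 3, 2018", "August 1, 2015"],
    name="Last Updated",
)

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

# Count apps by the year of their last update, in chronological order.
updates_per_year = parsed.dropna().dt.year.value_counts().sort_index()
print(updates_per_year)
```

Note that `Last Updated` records only the most recent update of each app, so these counts describe when apps were last touched rather than every update ever shipped.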

In [6]:
import numpy as np 
import pandas as pd

In [7]:
data_quality_log = pd.DataFrame({
    'Column': [
        'Rating', 'Size', 'Installs', 'Reviews',
        'Last Updated', 'App', 'Price'
    ],
    'Issue Type': [
        'Missing values', 'Inconsistent units',
        'Non-numeric characters', 'Incorrect datatype',
        'Invalid date format', 'Duplicate records', 'Incorrect datatype'
    ],
    'Description': [
        'Some apps do not have user ratings',
        'Size stored in M, K, and text',
        'Installs contain + and commas',
        'Reviews stored as object instead of numeric',
        'Some dates could not be parsed',
        'Same app appears multiple times',
        'Price stored as object instead of float'
    ],
    'Action Taken': [
        'Retained as NaN',
        'Converted to Size_MB',
        'Cleaned and converted to numeric',
        'Converted to integer',
        'Parsed with errors=coerce',
        'Removed duplicates',
        'Converted to float'
    ]
})

data_quality_log

| | Column | Issue Type | Description | Action Taken |
|---|---|---|---|---|
| 0 | Rating | Missing values | Some apps do not have user ratings | Retained as NaN |
| 1 | Size | Inconsistent units | Size stored in M, K, and text | Converted to Size_MB |
| 2 | Installs | Non-numeric characters | Installs contain + and commas | Cleaned and converted to numeric |
| 3 | Reviews | Incorrect datatype | Reviews stored as object instead of numeric | Converted to integer |
| 4 | Last Updated | Invalid date format | Some dates could not be parsed | Parsed with errors=coerce |
| 5 | App | Duplicate records | Same app appears multiple times | Removed duplicates |
| 6 | Price | Incorrect datatype | Price stored as object instead of float | Converted to float |
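The actions in this log can be sketched as a short cleaning pass. This is an illustration under stated assumptions, not the notebook's actual code: `size_to_mb` is a hypothetical helper, kilobytes are divided by 1024, duplicates are dropped on `App` alone, prices are assumed to carry a leading `$`, and the input rows are toy values.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '19M' or '512k' to megabytes; anything else becomes NaN."""
    text = str(value).strip()
    try:
        if text[-1:].upper() == "M":
            return float(text[:-1])
        if text[-1:].upper() == "K":
            return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 KB
    except ValueError:
        pass
    return np.nan

# Toy rows for illustration only; not taken from the real dataset.
raw = pd.DataFrame({
    "App":      ["A", "A", "B", "C"],
    "Size":     ["19M", "19M", "512k", "Varies with device"],
    "Installs": ["10,000+", "10,000+", "500+", "1,000,000+"],
    "Reviews":  ["159", "159", "12", "87510"],
    "Price":    ["0", "0", "$4.99", "0"],
})

# Assumption: an app counts as a duplicate when its name repeats.
clean = raw.drop_duplicates(subset="App").copy()
clean["Size_MB"] = clean["Size"].map(size_to_mb)
# Strip '+' and ',' before converting install counts to numbers.
clean["Installs"] = pd.to_numeric(
    clean["Installs"].str.replace(r"[+,]", "", regex=True), errors="coerce"
)
clean["Reviews"] = pd.to_numeric(clean["Reviews"], errors="coerce").astype("Int64")
clean["Price"] = pd.to_numeric(
    clean["Price"].str.replace("$", "", regex=False), errors="coerce"
)
print(clean)
```

Using `errors="coerce"` throughout turns any value that still fails to parse into a missing value, which keeps bad rows visible in later null counts instead of stopping the run.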
