## Google Play Store Dataset – Data Overview & Quality Analysis

## Dataset Description 

This dataset contains metadata of mobile applications available on the Google Play Store. Each row represents one app and includes information such as app category, rating, number of reviews, installs, size, pricing, and Android version compatibility.

* Source: Google Play Store (via GitHub)
* Rows: 10,841
* Columns: 13

## Business Understanding

From a business and analytics perspective, this dataset can help:

* Product managers understand which app categories perform better

* Marketing teams identify factors driving higher installs

* Developers analyze how pricing, size, and ratings affect app success

* Business analysts assess data quality issues common in real-world datasets

Typical business questions:

* Do free apps get more installs than paid apps?

* Which categories have the highest-rated apps?

* Does app size impact ratings or installs?

#### Importing Essential Libraries

In [30]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

#### Loading the dataset

In [31]:
df = pd.read_csv("C:\\Users\\anura\\Downloads\\googleplaystore_data.csv") 
df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


#### Exploring the data

In [32]:
df.shape

(10841, 13)

In [33]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [34]:
df.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [36]:
# Summary statistics; describe() covers numeric columns only, so just Rating here

df.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0
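
The maximum of 19.0 in the summary above falls outside the 1 to 5 rating scale used on Google Play, so at least one row is likely malformed. A minimal sketch of how such rows could be located, using a small made-up frame (`sample`, hypothetical values) in place of the real `df`:

```python
import pandas as pd

# Hypothetical sample standing in for the real `df`; values are illustrative only.
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, 19.0, None, 5.0],
})

# Flag ratings outside the valid 1-5 range.
# NaN compares False on both sides, so missing ratings are not flagged here.
out_of_range = sample[(sample["Rating"] < 1) | (sample["Rating"] > 5)]
print(out_of_range)
```

Running the same filter against `df` would show which rows need to be inspected before any averages are computed.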


In [37]:
df.nunique()

App               9660
Category            34
Rating              40
Reviews           6002
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             120
Last Updated      1378
Current Ver       2832
Android Ver         33
dtype: int64

In [38]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [39]:
# Missing values
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [40]:
# Percentage of missing values per column
(df.isnull().sum() / len(df))*100

App                0.000000
Category           0.000000
Rating            13.596532
Reviews            0.000000
Size               0.000000
Installs           0.000000
Type               0.009224
Price              0.000000
Content Rating     0.009224
Genres             0.000000
Last Updated       0.000000
Current Ver        0.073794
Android Ver        0.027673
dtype: float64

In [41]:
# Per-column count of repeated values (this is not a count of duplicate rows)
duplicate_count = df.apply(lambda x: x.duplicated().sum())
duplicate_count

App                1181
Category          10807
Rating            10800
Reviews            4839
Size              10379
Installs          10819
Type              10837
Price             10748
Content Rating    10834
Genres            10721
Last Updated       9463
Current Ver        8008
Android Ver       10807
dtype: int64

## Data Quality Issue Log

##### This point-wise Data Quality Issue Log highlights the problems found in the dataset and proposes a clear, actionable fix for each. Addressing these issues is critical before proceeding to cleaning, feature engineering, and EDA.

* 'Rating' - This column has missing values (1,474 rows, about 13.6%). Many apps have no user rating, mainly because they are new or have insufficient user feedback. This affects average-rating calculations and category-wise performance analysis. It can be addressed by imputing missing values with the median rating. The summary statistics also show a maximum of 19.0, which is outside the 1 to 5 rating scale and should be investigated as a likely malformed row.
* 'Reviews' - This column has a data type mismatch: review counts are stored as text (object) instead of numeric values. This prevents numerical aggregation and correlation analysis. It can be resolved by converting the column to integer after validating that the values are numeric.
* 'Size' - This column contains inconsistent, mixed data: sizes are stored with 'M' and 'K' suffixes or as 'Varies with device'. App sizes therefore cannot be compared accurately, which limits any analysis involving storage requirements. It can be resolved by converting all sizes to MB and replacing 'Varies with device' with NaN.
* 'App' - This column contains duplicate records: the same app appears more than once, sometimes with slight variations (9,660 unique names across 10,841 rows). This biases aggregate statistics and over-represents popular apps. It can be resolved by removing duplicates, or by keeping only the most recent record per app based on review count or last update date.
* 'Installs' - This column contains non-numeric characters such as '+' and ',', so it cannot be used for numeric analysis. It can be resolved by removing the special characters and converting the column to integer.
* 'Price' - This column has formatting issues: values include the '$' symbol and are stored as text, which prevents price-based statistical analysis. It can be resolved by removing the currency symbol and converting the column to float.
* 'Last Updated' - Dates are stored as strings instead of datetime objects, which prevents time-based analysis. Converting the column to datetime format resolves the issue.
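
The fixes listed in the log could be sketched roughly as follows. This is an illustrative outline on a small made-up frame, not the notebook's actual cleaning step: `raw`, `clean`, `size_to_mb`, and the `Size_MB` column are hypothetical names, the sample values are invented, and the 1 MB = 1024 KB conversion is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw column formats; not real data.
raw = pd.DataFrame({
    "App": ["A", "B", "C", "C"],
    "Rating": [4.1, np.nan, 4.5, 4.5],
    "Reviews": ["159", "967", "87510", "87512"],
    "Size": ["19M", "512k", "Varies with device", "Varies with device"],
    "Installs": ["10,000+", "500,000+", "5,000,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "0", "0"],
    "Last Updated": ["January 7, 2018", "June 20, 2018",
                     "August 1, 2018", "August 3, 2018"],
})


def size_to_mb(value):
    """Convert '19M' or '512k' to megabytes; anything else becomes NaN."""
    if isinstance(value, str) and value[-1:] in ("M", "k", "K"):
        number = float(value[:-1])
        # Assumption: 1 MB = 1024 KB.
        return number if value[-1] == "M" else number / 1024
    return np.nan


clean = raw.copy()

# Text to numeric; values that cannot be parsed become NaN instead of raising.
clean["Reviews"] = pd.to_numeric(clean["Reviews"], errors="coerce")
clean["Installs"] = pd.to_numeric(
    clean["Installs"].str.replace(r"[+,]", "", regex=True), errors="coerce"
)
clean["Price"] = pd.to_numeric(
    clean["Price"].str.replace("$", "", regex=False), errors="coerce"
)

# Sizes in MB; 'Varies with device' becomes NaN.
clean["Size_MB"] = clean["Size"].map(size_to_mb)

# Strings such as 'January 7, 2018' to datetime.
clean["Last Updated"] = pd.to_datetime(
    clean["Last Updated"], format="%B %d, %Y", errors="coerce"
)

# Keep the most recently updated record per app.
clean = clean.sort_values("Last Updated").drop_duplicates(subset="App", keep="last")

# Impute missing ratings with the median, computed after de-duplication.
clean["Rating"] = clean["Rating"].fillna(clean["Rating"].median())

print(clean)
```

De-duplicating before imputing keeps repeated apps from skewing the median. Using `errors="coerce"` turns unexpected values into NaN so they can be reviewed afterwards rather than stopping the conversion.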
