# 📊 Exploratory Data Analysis on Google Play Store Dataset

- This notebook performs exploratory data analysis (EDA) on a dataset of over 10,000 Android apps listed on the Google Play Store.
- The goal is to understand app trends, clean the data, and uncover patterns in ratings, installs, and other features.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# load the dataset
data = pd.read_csv('google_playstore_dataset_raw.csv') 

In [None]:
data.head(10)

In [None]:
data['App']

In [None]:
data.isnull()

In [None]:
data.isnull().sum()

In [None]:
df = data.dropna()

In [None]:
df.isnull().sum()
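
To make the effect of `isnull().sum()` and `dropna()` concrete, here is a minimal sketch on a three-row frame. The frame `sample` and its values are hypothetical and stand in for the real CSV, which has more columns.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; not taken from the real dataset.
sample = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Rating": [4.5, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# Count the missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
complete_rows = sample.dropna()
print(len(sample), "rows before,", len(complete_rows), "rows after")
```

Because `dropna()` discards a whole row for a single missing field, it can shrink the dataset considerably when one column is sparsely populated.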

# Removing Null Values – Numeric

## Step 1: Handling Missing Values in Numeric Columns
Identifying and removing null values in key numeric fields to ensure accurate statistical analysis.

In [None]:
data = pd.read_csv('google_playstore_dataset_raw.csv').dropna()
data

## Step 2: Calculating the Average App Rating
Computed the overall average user rating across all apps in the dataset.

In [None]:
data['Rating']

In [None]:
# 9360 is a hard-coded row count; the cells below derive it with len() instead
round(sum(data['Rating'])/9360, 2)

In [None]:
s = 0
for i in data['Rating']:
    s += i
s = int(s)
print(s)

In [None]:
int(sum(data['Rating']))

In [None]:
len(data['Rating'])

In [None]:
# Divide the untruncated sum so the fractional part is not lost before rounding
round(sum(data['Rating'])/len(data['Rating']), 2)

In [None]:
print("Average rating of these apps:", round(sum(data['Rating'])/len(data['Rating']), 2))
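
The same average can be obtained with `Series.mean()`. The sketch below uses a hypothetical `ratings` Series and also shows why casting the sum to `int` before dividing is worth avoiding: the cast drops the fractional part of the total.

```python
import pandas as pd

# Hypothetical ratings; not taken from the real dataset.
ratings = pd.Series([4.0, 4.5, 3.5, 4.8], name="Rating")

# Casting the sum to int first discards the fractional part of the total.
truncated_average = round(int(sum(ratings)) / len(ratings), 2)

# Series.mean() divides the exact sum by the number of non-null values.
exact_average = round(ratings.mean(), 2)

print("truncated:", truncated_average)
print("exact:", exact_average)
```

On a dataset with thousands of rows the difference is small, but `mean()` is both shorter and free of the rounding artifact.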

## Step 3: Counting Apps with Perfect Ratings
Determined the total number of apps that received a perfect rating of 5.0 from users.

In [None]:
data['Rating']

In [None]:
c = 0
for i in data['Rating']:
    if(i == 5.0):
        c += 1
print("There are", c, "applications with a rating of 5.0")
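
A vectorized alternative to the counting loop, sketched on a hypothetical `ratings` Series: comparing the Series to 5.0 yields a boolean mask, and summing the mask counts the `True` entries.

```python
import pandas as pd

# Hypothetical ratings; not taken from the real dataset.
ratings = pd.Series([5.0, 4.2, 5.0, 3.8, 4.9], name="Rating")

# ratings == 5.0 is a boolean Series; True counts as 1 when summed.
perfect_count = int((ratings == 5.0).sum())
print("Apps with a rating of 5.0:", perfect_count)
```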

## Step 4: Analyzing Rating Distributions Between 4.0–4.5 and 4.0–5.0
Filtered and counted apps with ratings within specified ranges to assess popularity and trustworthiness.

In [None]:
c = 0
for i in data['Rating']:
    if(i>=4.0 and i<=5.0):
        c +=1
print(c)


In [None]:
c = 0
for i in data['Rating']:
    if(i>=4.0 and i<=4.5):
        c +=1
print(c)
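
Both range counts can also be written with `Series.between`, which includes both endpoints by default. This is a minimal sketch on hypothetical values, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical ratings; not taken from the real dataset.
ratings = pd.Series([3.9, 4.0, 4.3, 4.5, 4.8, 5.0], name="Rating")

# between(left, right) is inclusive on both ends by default.
count_4_to_5 = int(ratings.between(4.0, 5.0).sum())
count_4_to_4_5 = int(ratings.between(4.0, 4.5).sum())

print("4.0 to 5.0:", count_4_to_5)
print("4.0 to 4.5:", count_4_to_4_5)
```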


## Step 5: Calculating the Average Number of User Reviews
Analyzed the distribution and average count of user-submitted reviews per app.

In [None]:
s = 0
for i in data['Reviews']:
    s += int(i)
print(int(s/len(data['Reviews'])))
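
The loop above converts each value with `int()`, which suggests the Reviews column is read as text. Assuming that is the case, `pd.to_numeric` can convert the whole column at once; the sample strings below are hypothetical.

```python
import pandas as pd

# Hypothetical review counts stored as strings, as a text column would hold them.
reviews = pd.Series(["159", "967", "87510"], name="Reviews")

# errors="coerce" turns any unparseable entry into NaN instead of raising.
review_counts = pd.to_numeric(reviews, errors="coerce")

# mean() skips NaN values by default.
average_reviews = int(review_counts.mean())
print("Average number of reviews:", average_reviews)
```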

# Removing Null Values – Categorical

In [None]:
df = pd.read_csv('google_playstore_dataset_raw.csv').dropna()

In [None]:
df.head(5)

# Q1: How many unique app categories are present in the dataset?

In [None]:
df['Category'].unique()

In [None]:
# for i in df['Category'].unique():
#     print(i)

In [None]:
len(df['Category'].unique())
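
`Series.nunique()` returns the same count in one call. A small sketch with hypothetical category labels:

```python
import pandas as pd

# Hypothetical category labels; not taken from the real dataset.
category = pd.Series(["ART_AND_DESIGN", "GAME", "GAME", "TOOLS"], name="Category")

# nunique() counts distinct non-null values.
unique_categories = category.nunique()
print("Unique categories:", unique_categories)
```

Unlike `len(series.unique())`, `nunique()` ignores missing values by default, which only matters if the column still contains nulls.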

# Q2: How many applications belong to the "ART_AND_DESIGN" category?

In [None]:
c = 0
for i in df['Category']:
    if(i == 'ART_AND_DESIGN'):
        c +=1
print(c)

# Q3: What types of apps are available on the Play Store?

In [None]:
data = df

In [None]:
data['Type'].unique()

# Q4: What is the distribution of Free and Paid applications?

In [None]:
f = 0
for i in data['Type']:
    if(i == 'Free'):
        f +=1
print("There are", f, "free", end=' ')

p = 0
for i in data['Type']:
    if(i == 'Paid'):
        p +=1
print("and", p, "paid applications")

# Q5: What percentage of apps in the dataset are free?

In [None]:
print(int(f/(f + p)*100), "% of applications are free")
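
The share of free apps can also be read directly from `value_counts(normalize=True)`, which returns proportions instead of raw counts. The `app_type` values below are hypothetical.

```python
import pandas as pd

# Hypothetical Type column; not taken from the real dataset.
app_type = pd.Series(["Free", "Free", "Free", "Paid"], name="Type")

# normalize=True returns each value's fraction of the total.
type_share = app_type.value_counts(normalize=True)
free_percentage = type_share["Free"] * 100

print(f"{free_percentage:.0f}% of applications are free")
```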

# Q6: What are the different content rating classifications in the dataset?

In [None]:
data['Content Rating'].unique()

In [None]:
for i in data['Content Rating'].unique():
    print(i)

# Exploring Categories Automatically

Instead of manually checking the number of apps in each category by filtering them one at a time (e.g., "ART_AND_DESIGN", "GAME"), I used a more scalable approach that summarizes all categories at once: loop over the unique values of the Category column and count the rows for each one. The `value_counts()` method on the Category column produces the same summary in a single call.
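
As a sketch of the `value_counts()` approach, the snippet below summarizes a hypothetical Category column and converts the result to a dictionary so that individual categories can be looked up by name.

```python
import pandas as pd

# Hypothetical Category column; not taken from the real dataset.
category = pd.Series(
    ["GAME", "TOOLS", "GAME", "ART_AND_DESIGN", "GAME", "TOOLS"],
    name="Category",
)

# value_counts() returns one row per distinct value, most frequent first.
category_counts = category.value_counts()
print(category_counts)

# to_dict() gives a plain {category: count} mapping for lookups.
counts_by_name = category_counts.to_dict()
print(counts_by_name["ART_AND_DESIGN"])
```

The same call works unchanged on the Type and Content Rating columns.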

In [None]:
df

# Q1: What is the total number of apps in each category?

In [None]:
for name in df['Category'].unique():
    ct = 0
    for i in df['Category']:
        if(i == name):
            ct +=1
    print(name, ':' ,ct)

In [None]:
# In Dictionary
categories = {}

for name in df['Category'].unique():
    ct = 0
    for i in df['Category']:
        if(i == name):
            ct +=1
    categories[name] = ct

In [None]:
categories

# Q2: How many applications belong to the "ART_AND_DESIGN" category?

In [None]:
categories['ART_AND_DESIGN']

In [None]:
for i in df['Category'].unique():
    print(i,categories[i])

# Q3: What is the total number of apps by type (Free vs Paid)?

In [None]:
types = {}
for name in df['Type'].unique():
    ct = 0
    for i in df['Type']:
        if(i == name):
            ct +=1
    print(name, ':' ,ct)

In [None]:
# In Dictionary 
types = {}
for name in df['Type'].unique():
    ct = 0
    for i in df['Type']:
        if(i == name):
            ct +=1
    types[name] = ct
print(types)

# Q4: What is the total number of apps for each content rating classification?

In [None]:
content_rating = {}
for name in df['Content Rating'].unique():
    ct = 0
    for i in df['Content Rating']:
        if(i == name):
            ct +=1
    print(name, ':' ,ct)

In [None]:
# In Dictionary 
content_rating = {}
for name in df['Content Rating'].unique():
    ct = 0
    for i in df['Content Rating']:
        if(i == name):
            ct +=1
    content_rating[name] = ct
print(content_rating)

In [None]:
# Rating Distribution / Summary Statistics for App Ratings
df['Rating'].describe()

In [None]:
# Summary Statistics for App Type
df['Type'].describe()

In [None]:
# Summary Statistics for Content Rating
df['Content Rating'].describe()

In [None]:
# Summary Statistics for App Categories
df['Category'].describe()
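
For a text column, `describe()` reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs) rather than the mean and quartiles shown for numeric columns. A minimal sketch on a hypothetical Type column:

```python
import pandas as pd

# Hypothetical Type column; not taken from the real dataset.
app_type = pd.Series(["Free", "Free", "Paid"], name="Type")

# For non-numeric data, describe() returns count, unique, top, and freq.
summary = app_type.describe()
print(summary)

most_common = summary["top"]
most_common_count = summary["freq"]
```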

# Handling Missing (Null) Values

In [None]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer

In [None]:
df = pd.read_csv('google_playstore_dataset_raw.csv')
df

In [None]:
# missing values
df.isnull().sum()

In [None]:
df.iloc[ : , 2:3].values

In [None]:
# Create an imputer that replaces NaN with the column mean
impute = SimpleImputer(missing_values = np.nan, strategy = 'mean')
impute.fit(df.iloc[ : , 2:3].values) # calculates mean

In [None]:
# Replace missing values in the Rating column with the mean value computed above
df.iloc[ : , 2:3] = impute.transform(df.iloc[ : , 2:3].values)
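
For a single column, a pandas-only alternative to `SimpleImputer` is `fillna` with the column mean. This sketch swaps in that technique on a hypothetical frame and is not the notebook's own method; it selects the column by name, which avoids relying on Rating being at position 2.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing rating; not taken from the real dataset.
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Rating": [4.0, np.nan, 5.0],
})

# mean() ignores NaN, so the fill value is the average of the known ratings.
rating_mean = apps["Rating"].mean()
apps["Rating"] = apps["Rating"].fillna(rating_mean)

print(apps)
```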

In [None]:
df

In [None]:
# impute = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# impute.fit(df.iloc[ : , 2:3].values)
# df.iloc[ : , 2:3] = impute.transform(df.iloc[ : , 2:3].values)
# df.head()

In [None]:
# Remove any rows that still contain NaN values (from other columns)
df = df.dropna()

In [None]:
# final check of missing values in each column.
df.isnull().sum()

# Exporting the Cleaned Dataset to CSV

In [None]:
# Export df, the frame imputed and cleaned in the section above
df.to_csv('google_play_store_data_cleaned.csv', index=False)
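
A quick way to check an export is to read it back and compare its shape and columns. The sketch below writes a hypothetical frame to an in-memory buffer instead of a file so that it runs without touching disk.

```python
import io

import pandas as pd

# Hypothetical cleaned frame; not taken from the real dataset.
cleaned = pd.DataFrame({"App": ["Alpha", "Beta"], "Rating": [4.5, 3.9]})

# index=False keeps the row index out of the CSV, so no extra column appears on reload.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.shape)
print(reloaded.columns.tolist())
```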