# Apple App Store EDA

#### Importing the libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#### Loading the dataset

In [3]:
df = pd.read_csv('AppleStore.csv')

#### Displaying the first 10 rows of the dataset to get a quick look at the data

In [4]:
df.head(10)

,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
5,6,283619399,Shanghai Mahjong,10485713,USD,0.99,8253,5516,4.0,4.0,1.8,4+,Games,47,5,1,1
6,7,283646709,PayPal - Send and request money safely,227795968,USD,0.0,119487,879,4.0,4.5,6.12.0,4+,Finance,37,0,19,1
7,8,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
8,9,284666222,PCalc - The Best Calculator,49250304,USD,9.99,1117,4,4.5,5.0,3.6.6,4+,Utilities,37,5,1,1
9,10,284736660,Ms. PAC-MAN,70023168,USD,3.99,7885,40,4.0,4.0,4.0.4,4+,Games,38,0,10,1


#### Displaying the total number of rows and columns in the dataset

In [5]:
rows, columns = df.shape
print(f'There are {rows} rows and {columns} columns in this Dataset.')

There are 7197 rows and 17 columns in this Dataset.


#### Displaying column names

In [6]:
df.columns

Index(['Unnamed: 0', 'id', 'track_name', 'size_bytes', 'currency', 'price',
       'rating_count_tot', 'rating_count_ver', 'user_rating',
       'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
       'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'],
      dtype='str')

#### Based on the publicly available iOS App Store dataset, the meaning of each column is as follows:

- id: Unique identifier of the app in the App Store.
- track_name: Name of the application as listed on the App Store.
- size_bytes: Size of the app in bytes.
- currency: Currency type used for the app price (e.g., USD).
- price: Price of the app (0 indicates a free app).
- rating_count_tot: Total number of user ratings across all versions of the app.
- rating_count_ver: Number of user ratings for the current version of the app.
- user_rating: Average user rating for all versions of the app.
- user_rating_ver: Average user rating for the current version of the app.
- ver: Latest version code of the app.
- cont_rating: Content rating indicating target age group suitability (e.g., 4+, 12+, 17+).
- prime_genre: Primary genre/category of the app.
- sup_devices.num: Number of devices supported by the app.
- ipadSc_urls.num: Number of screenshots available for display on iPad.
- lang.num: Number of supported languages in the app.
- vpp_lic: Indicates if VPP (Volume Purchase Program) device-based licensing is enabled.

#### Displaying summary statistics of the numeric columns

In [8]:
df.describe()

,Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0
mean,4759.069612,863131000.0,199134500.0,1.726218,12892.91,460.373906,3.526956,3.253578,37.361817,3.7071,5.434903,0.993053
std,3093.625213,271236800.0,359206900.0,5.833006,75739.41,3920.455183,1.517948,1.809363,3.737715,1.986005,7.919593,0.083066
min,1.0,281656500.0,589824.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
25%,2090.0,600093700.0,46922750.0,0.0,28.0,1.0,3.5,2.5,37.0,3.0,1.0,1.0
50%,4380.0,978148200.0,97153020.0,0.0,300.0,23.0,4.0,4.0,37.0,5.0,1.0,1.0
75%,7223.0,1082310000.0,181924900.0,1.99,2793.0,140.0,4.5,4.5,38.0,5.0,8.0,1.0
max,11097.0,1188376000.0,4025970000.0,299.99,2974676.0,177050.0,5.0,5.0,47.0,5.0,75.0,1.0


#### Displaying columns along with their data types

In [9]:
df.dtypes

Unnamed: 0            int64
id                    int64
track_name              str
size_bytes            int64
currency                str
price               float64
rating_count_tot      int64
rating_count_ver      int64
user_rating         float64
user_rating_ver     float64
ver                     str
cont_rating             str
prime_genre             str
sup_devices.num       int64
ipadSc_urls.num       int64
lang.num              int64
vpp_lic               int64
dtype: object

#### Observations on Data Types

Upon inspecting the dataset, every column already has a data type suitable for analysis, so no conversion is required. Only the index-like columns need attention:

- track_name, ver, cont_rating, prime_genre, currency – Correctly stored as str (string). No conversion needed.

- size_bytes, price, rating_count_tot, rating_count_ver, user_rating, user_rating_ver, sup_devices.num, ipadSc_urls.num, lang.num, vpp_lic – Stored as numeric types (int64 or float64) and ready for analysis.

- id, Unnamed: 0 – Stored as integers; Unnamed: 0 is likely just an index column and can be dropped.
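
The note about `Unnamed: 0` could be acted on roughly as follows. This is a minimal sketch, not the notebook's own code: `sample` and `cleaned` are illustrative names, and the toy frame only reuses the first three rows shown by `df.head(10)` so the snippet runs without `AppleStore.csv`.

```python
import pandas as pd

# Toy frame built from the first three rows of the df.head(10) output;
# in the notebook, the same call would be applied to the full `df`.
sample = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "id": [281656475, 281796108, 281940292],
    "track_name": [
        "PAC-MAN Premium",
        "Evernote - stay organized",
        "WeatherBug - Local Weather, Radar, Maps, Alerts",
    ],
    "price": [3.99, 0.0, 0.0],
})

# Drop the leftover index column. errors="ignore" keeps the call safe
# to re-run after the column has already been removed.
cleaned = sample.drop(columns=["Unnamed: 0"], errors="ignore")

print(cleaned.columns.tolist())
```

Assigning the result to a new name instead of using `inplace=True` keeps the original frame available if the cell is re-run.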

#### Checking all unique values of each column

In [10]:
for col in df.columns:
    unique_vals = df[col].unique()
    print(f'Name of the Column : {col}')
    print(f'Number of unique values : {len(unique_vals)}')
    print(f'Unique Values : {unique_vals}')
    print('-'*50)
    print('\n')

Name of the Column : Unnamed: 0
Number of unique values : 7197
Unique Values : [    1     2     3 ... 11087 11089 11097]
--------------------------------------------------


Name of the Column : id
Number of unique values : 7197
Unique Values : [ 281656475  281796108  281940292 ... 1187779532 1187838770 1188375727]
--------------------------------------------------


Name of the Column : track_name
Number of unique values : 7195
Unique Values : <StringArray>
[                                   'PAC-MAN Premium',
                          'Evernote - stay organized',
    'WeatherBug - Local Weather, Radar, Maps, Alerts',
 'eBay: Best App to Buy, Sell, Save! Online Shopping',
                                              'Bible',
                                   'Shanghai Mahjong',
             'PayPal - Send and request money safely',
                            'Pandora - Music & Radio',
                        'PCalc - The Best Calculator',
                                        'M

#### Summary of Data Observations

Columns with correct types
- track_name, cont_rating, ver, prime_genre, currency, price, size_bytes, rating_count_tot, rating_count_ver, user_rating, user_rating_ver, sup_devices.num, ipadSc_urls.num, lang.num, vpp_lic – All stored in appropriate types (str, int64, float64) for analysis.
- No major type conversion required.

Columns needing cleaning or further preprocessing
- track_name – Mostly fine, but check for duplicates (7197 unique IDs vs 7195 unique names).
- ver – Version codes are strings; may need parsing if numerical comparison or semantic versioning is required.
- size_bytes – Already numeric, but consider converting to MB/GB for readability.
- price – Correct numeric type, but 0 indicates free apps; may need to filter for paid apps in some analyses.
- user_rating, user_rating_ver – Correct numeric type; some values are 0 (likely missing or unrated).
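
The preprocessing ideas in this list could look something like the sketch below. The derived column names (`size_mb`, `is_paid`, `ver_major`) are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and extracting only the leading digits of `ver` is a simplification rather than full semantic-version parsing. The toy rows are taken from the `df.head(10)` output.

```python
import pandas as pd

# Three rows copied from the df.head(10) output so the example is self-contained.
sample = pd.DataFrame({
    "track_name": ["PAC-MAN Premium", "Evernote - stay organized", "Shanghai Mahjong"],
    "size_bytes": [100788224, 158578688, 10485713],
    "price": [3.99, 0.0, 0.99],
    "ver": ["6.3.5", "8.2.2", "1.8"],
})

# Size in megabytes (assumption: 1 MB = 1024 * 1024 bytes).
sample["size_mb"] = sample["size_bytes"] / (1024 ** 2)

# Flag paid apps; a price of 0 marks a free app.
sample["is_paid"] = sample["price"] > 0

# Major version = leading digits of the version string.
# Strings that do not start with digits become NaN instead of raising.
sample["ver_major"] = pd.to_numeric(
    sample["ver"].str.extract(r"^(\d+)", expand=False),
    errors="coerce",
)

# Subset for analyses restricted to paid apps.
paid_apps = sample[sample["is_paid"]]

print(sample[["track_name", "size_mb", "is_paid", "ver_major"]])
```

Keeping `size_bytes` and adding `size_mb` alongside it avoids losing precision while still giving a readable column for plots and tables.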

Columns with anomalies or special observations
- track_name – 7195 unique names across 7197 rows, so 2 rows reuse a name that already appears elsewhere.
- user_rating & user_rating_ver – Some apps have a rating of 0, which may indicate unrated apps.
- ipadSc_urls.num – Includes 0 screenshots for some apps.
- lang.num – Wide variation (0–75 languages); 0 may indicate missing info.
- vpp_lic – Binary (0/1) indicates VPP license availability; no anomalies.
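
Two of these observations can be checked directly: which rows share a `track_name`, and whether a rating of 0 coincides with a rating count of 0. The sketch below uses invented rows (the names `App A` to `App C` and all values are hypothetical, since the actual duplicated apps are not shown above), so it only illustrates the technique.

```python
import pandas as pd

# Hypothetical rows, invented purely to exercise the two checks.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "track_name": ["App A", "App B", "App A", "App C"],
    "rating_count_tot": [120, 0, 45, 0],
    "user_rating": [4.0, 0.0, 3.5, 0.0],
})

# keep=False marks every row whose name occurs more than once,
# so all copies are visible side by side rather than only the later ones.
dup_rows = toy[toy["track_name"].duplicated(keep=False)].sort_values(
    ["track_name", "id"]
)

# If every zero-rated app also has zero ratings in total, a rating of 0
# means "unrated" rather than a genuinely poor score.
zero_rated = toy[toy["user_rating"] == 0]
all_zero_are_unrated = bool(zero_rated["rating_count_tot"].eq(0).all())

print(dup_rows)
print(f"Every zero-rated app has no ratings: {all_zero_are_unrated}")
```

On the real dataset, replacing `toy` with `df` would show whether the repeated names belong to distinct apps (different `id` values) or to true duplicate records.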

Columns likely not needed for analysis
- Unnamed: 0 – Index column; can be dropped.
- id – Unique identifier; useful for merging or reference but not for feature analysis.

#### Checking for missing values in each column

In [11]:
# Count of NaN / None values per column
nan_count = df.isnull().sum()

# Count of empty strings per column
empty_count = (df == '').sum()

# Total missing values (NaN + empty strings)
total_missing = nan_count + empty_count

# Percentage of missing values
missing_percent = (total_missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Values': total_missing,
    'Percentage (%)': missing_percent
}).sort_values(by='Percentage (%)', ascending=False)

print(missing_df)

                  Missing Values  Percentage (%)
Unnamed: 0                     0             0.0
id                             0             0.0
track_name                     0             0.0
size_bytes                     0             0.0
currency                       0             0.0
price                          0             0.0
rating_count_tot               0             0.0
rating_count_ver               0             0.0
user_rating                    0             0.0
user_rating_ver                0             0.0
ver                            0             0.0
cont_rating                    0             0.0
prime_genre                    0             0.0
sup_devices.num                0             0.0
ipadSc_urls.num                0             0.0
lang.num                       0             0.0
vpp_lic                        0             0.0
