# EDA on Google Play Store Apps

**Author Name:** Syed Ghazi Ali Zaidi\
**Email:** sghazializaidi@gmail.com

The data was collected from the [Google Play Store Apps dataset on Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

##  *The dataset collected from the source has the following description:*

`About Dataset`

**Context**

While many public datasets (on Kaggle and the like) provide Apple App Store data, there are not many counterpart datasets available for Google Play Store apps anywhere on the web. On digging deeper, I found out that the iTunes App Store page deploys a nicely indexed appendix-like structure to allow for simple and easy web scraping. On the other hand, Google Play Store uses sophisticated modern-day techniques (like dynamic page load) using jQuery, making scraping more challenging.

**Content**

Each app (row) has values for category, rating, size, and more.

**Acknowledgements**

This information is scraped from the Google Play Store. This app information would not be available without it.

**Inspiration**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

In [1]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ydata_profiling as yd
import numpy as np


In [2]:
df = pd.read_csv('./Datasets/googleplaystore.csv')

In [3]:
# profile = yd.ProfileReport(df)
# profile.to_file(output_file="./outputs/GoogleApp_profiling.html")

In [4]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            10840 non-null  object 
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB


In [6]:
df.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
3678,XX HD Video downloader-Free Video Downloader,VIDEO_PLAYERS,4.1,7624,6.0M,"1,000,000+",Free,0,Everyone,Video Players & Editors,"April 20, 2018",1.9,4.0.3 and up
7965,CV Builder,BUSINESS,3.8,203,5.9M,"10,000+",Free,0,Everyone,Business,"September 30, 2016",2.2.1.0,2.3 and up
2621,LinkedIn,SOCIAL,4.2,1225367,Varies with device,"100,000,000+",Free,0,Everyone,Social,"August 2, 2018",4.1.202,5.0 and up
8819,Synology Drive,PRODUCTIVITY,3.2,368,Varies with device,"100,000+",Free,0,Everyone,Productivity,"August 2, 2018",Varies with device,5.0 and up
9006,Hausa Radio,NEWS_AND_MAGAZINES,4.4,3375,4.2M,"100,000+",Free,0,Everyone,News & Magazines,"July 31, 2018",4.1,4.1 and up
10425,FH CODE,TOOLS,,13,3.6M,100+,Free,0,Everyone,Tools,"February 26, 2018",FH CODE 1.0,2.1 and up
1266,Sportractive GPS Running Cycling Distance Tracker,HEALTH_AND_FITNESS,4.8,48276,7.5M,"1,000,000+",Free,0,Everyone,Health & Fitness,"August 4, 2018",3.0.6,4.4 and up
306,The Vietnam Story - Fun Stories,COMICS,4.5,438,3.4M,"10,000+",Free,0,Everyone,Comics,"December 27, 2017",1.0,2.3 and up
9002,DW Security,BUSINESS,5.0,6,15M,100+,Free,0,Everyone,Business,"July 25, 2018",69.1,4.1 and up
4976,"WeatherClear - Ad-free Weather, Minute forecast",WEATHER,4.5,3252,3.8M,"50,000+",Free,0,Everyone,Weather,"June 25, 2017",1.2.6,4.1 and up


In [7]:
df.describe()

Unnamed: 0,Rating,Reviews
count,9366.0,10840.0
mean,4.191757,444152.9
std,0.515219,2927761.0
min,1.0,0.0
25%,4.0,38.0
50%,4.3,2094.0
75%,4.5,54775.5
max,5.0,78158310.0


# The columns below need to be made numeric

In [8]:
df[['Size','Installs','Price']].head()

Unnamed: 0,Size,Installs,Price
0,19M,"10,000+",0
1,14M,"500,000+",0
2,8.7M,"5,000,000+",0
3,25M,"50,000,000+",0
4,2.8M,"100,000+",0


### Starting with Size: first, look at the unique values

In [9]:
df['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

### Count how often each value occurs

In [10]:
df['Size'].value_counts()

Size
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
                      ... 
430k                     1
429k                     1
200k                     1
460k                     1
619k                     1
Name: count, Length: 461, dtype: int64

# How many KB are in one MB, and how do we convert?
Taking 1 MB = 1024 KB, a size given in KB is divided by 1024 to express it in MB.

Steps:
1. Convert the `k` (KB) values to MB
2. Remove the `M` suffix from the MB values
3. Handle `Varies with device`
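
The three steps above can also be sketched as a single vectorized helper. This is only an illustration, not the code used in the cells of this notebook: the function name `size_to_mb` is hypothetical, and the sketch assumes that 1 MB = 1024 KB and that every value is either `Varies with device` or a number followed by `M` or `k`.

```python
import numpy as np
import pandas as pd


def size_to_mb(size: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '201k' to megabytes as floats."""
    # Step 3: 'Varies with device' carries no size information, so mask it to NaN.
    cleaned = size.mask(size == "Varies with device")
    # Split each value into its numeric part and its unit suffix (M or k).
    parts = cleaned.str.extract(r"^(?P<value>[\d.]+)(?P<unit>[Mk])$")
    value = pd.to_numeric(parts["value"], errors="coerce")
    # Steps 1 and 2: kilobytes are divided by 1024, megabytes are kept unchanged.
    factor = parts["unit"].map({"M": 1.0, "k": 1 / 1024})
    return value * factor


sample = pd.Series(["19M", "8.7M", "201k", "Varies with device"])
print(size_to_mb(sample))
```

Values that match neither pattern come back as `NaN` rather than raising, so it may be worth comparing the number of missing values before and after the conversion.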

#### Let's check for missing values

In [11]:
df['Size'].isnull().sum()

0

# Now checking for Installs
1. We need to remove the `+` sign
2. We need to remove the `,` thousands separators
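
As a rough sketch of this cleanup on a few sample values in the dataset's format, the helper below strips the `+` sign and the `,` thousands separators in one pass before casting. The name `installs_to_int` is hypothetical and exists only for this illustration.

```python
import pandas as pd


def installs_to_int(installs: pd.Series) -> pd.Series:
    """Convert strings such as '10,000+' to integers such as 10000."""
    # Strip '+' and ',' in one pass; the character class matches either symbol.
    digits = installs.str.replace(r"[+,]", "", regex=True)
    return digits.astype("int64")


sample = pd.Series(["10,000+", "1,000,000,000+", "0+", "0"])
print(installs_to_int(sample).tolist())  # [10000, 1000000000, 0, 0]
```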

In [12]:
df['Installs'].value_counts()

Installs
1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
0                    1
Name: count, dtype: int64

# Checking the Price column (to make it numeric)
1. Removing the `$` sign should allow the column to be converted to numeric
2. After that, we can bin the prices
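
Both steps can be sketched on a handful of sample prices. This is a minimal illustration under stated assumptions: the bin edges and labels are hypothetical choices, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

sample = pd.Series(["0", "$0.99", "$4.99", "$19.90"])

# regex=False treats '$' as a literal character rather than a regex anchor.
price = sample.str.replace("$", "", regex=False).astype("float64")

# Hypothetical bin edges and labels, chosen only for illustration.
bins = [-np.inf, 0, 1, 5, np.inf]
labels = ["Free", "Up to $1", "$1 to $5", "Over $5"]

# pd.cut uses right-closed intervals by default, so 0 falls in the first bin.
price_band = pd.cut(price, bins=bins, labels=labels)
print(price_band.tolist())  # ['Free', 'Up to $1', '$1 to $5', 'Over $5']
```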

In [13]:
df['Price'].value_counts()

Price
0         10040
$0.99       148
$2.99       129
$1.99        73
$4.99        72
          ...  
$19.90        1
$1.75         1
$14.00        1
$4.85         1
$1.04         1
Name: count, Length: 92, dtype: int64

### Making the Size column numeric

In [14]:
df['Size'] = df['Size'].replace('Varies with device', np.nan)

In [15]:
# df['Size_Mb'] = pd.to_numeric(df['Size']).str.replace('M','')
df['Size'] = df['Size'].str.replace('M', '')


In [16]:
# Rows whose size is given in kilobytes (for example '201k')
is_kb = df['Size'].str.endswith('k', na=False)
# Strip the 'k' suffix and divide by 1024 to express the size in MB
df.loc[is_kb, 'Size'] = df.loc[is_kb, 'Size'].str.rstrip('k').astype(float) / 1024

In [17]:
df['Size'].unique()

array(['19', '14', '8.7', '25', '2.8', '5.6', '29', '33', '3.1', '28',
       '12', '20', '21', '37', '2.7', '5.5', '17', '39', '31', '4.2',
       '7.0', '23', '6.0', '6.1', '4.6', '9.2', '5.2', '11', '24', nan,
       '9.4', '15', '10', '1.2', '26', '8.0', '7.9', '56', '57', '35',
       '54', 0.1962890625, '3.6', '5.7', '8.6', '2.4', '27', '2.5', '16',
       '3.4', '8.9', '3.9', '2.9', '38', '32', '5.4', '18', '1.1', '2.2',
       '4.5', '9.8', '52', '9.0', '6.7', '30', '2.6', '7.1', '3.7', '22',
       '7.4', '6.4', '3.2', '8.2', '9.9', '4.9', '9.5', '5.0', '5.9',
       '13', '73', '6.8', '3.5', '4.0', '2.3', '7.2', '2.1', '42', '7.3',
       '9.1', '55', 0.0224609375, '6.5', '1.5', '7.5', '51', '41', '48',
       '8.5', '46', '8.3', '4.3', '4.7', '3.3', '40', '7.8', '8.8', '6.6',
       '5.1', '61', '66', 0.0771484375, '8.4', 0.115234375, '44',
       0.6787109375, '1.6', '6.2', 0.017578125, '53', '1.4', '3.0', '5.8',
       '3.8', '9.6', '45', '63', '49', '77', '4.4', '4.8', '7

In [18]:
df['Size'].value_counts()

Size
11              198
12              196
14              194
13              191
15              184
               ... 
0.419921875       1
0.4189453125      1
0.1953125         1
0.44921875        1
0.6044921875      1
Name: count, Length: 460, dtype: int64

## Converting the column to float datatype

In [19]:
# df['Size'] = pd.to_numeric(df['Size'],errors='coerce',downcast='float')

In [20]:
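# Cast to float; without errors='ignore', an unparseable value raises
# instead of silently leaving the column as object
df['Size'] = df['Size'].astype(float)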

In [21]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [22]:
df['Size'].isna().sum()

1695

## Successfully converted to float64

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(2), int64(1), object(10)
memory usage: 1.1+ MB


#### Double-checking one app: its size was 892k, so it should now be 892 / 1024 ≈ 0.871 MB

In [24]:
df.loc[df['App'] == 'EP RSS Reader']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9696,EP RSS Reader,COMMUNICATION,3.8,4,0.871094,100+,Free,0,Everyone,Communication,"July 16, 2018",0.99,4.0.3 and up


# Now we will look into Installs

In [25]:
df['Installs'].value_counts()

Installs
1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
0                    1
Name: count, dtype: int64

We need to remove the `+`.

In [26]:
# regex=False treats '+' as a literal character, not a regex quantifier
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)

In [27]:
df['Installs'].value_counts()

Installs
1,000,000        1579
10,000,000       1252
100,000          1169
10,000           1054
1,000             907
5,000,000         752
100               719
500,000           539
50,000            479
5,000             477
100,000,000       409
10                386
500               330
50,000,000        289
50                205
5                  82
500,000,000        72
1                  67
1,000,000,000      58
0                  15
Name: count, dtype: int64

Having removed `+`, we now remove `,` and convert the column to integers.

In [28]:
# df['Installs'] = df['Installs'].astype('Int64')  # fails while the values still contain ','

# Remove the thousands separators, then cast to integer
df['Installs'] = df['Installs'].str.replace(',', '', regex=False)
df['Installs'] = df['Installs'].astype('int64')

# Lastly, we will make Price numeric

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int64  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(2), int64(2), object(9)
memory usage: 1.1+ MB


In [30]:
df['Price'].value_counts()

Price
0         10040
$0.99       148
$2.99       129
$1.99        73
$4.99        72
          ...  
$19.90        1
$1.75         1
$14.00        1
$4.85         1
$1.04         1
Name: count, Length: 92, dtype: int64

In [31]:
# regex=False treats '$' as a literal character, not the end-of-string anchor
df['Price'] = df['Price'].str.replace('$', '', regex=False)

In [32]:
df['Price'] = df['Price'].astype('float64')

## Successfully converted to numeric
1. Size
2. Installs
3. Price
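
Besides reading the `df.info()` output, the conversion can be checked programmatically. The sketch below uses a tiny stand-in frame named `demo` (hypothetical, for illustration only); in the notebook the same check would run on `df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Tiny stand-in frame; in the notebook this check would run on `df` itself.
demo = pd.DataFrame(
    {"Size": [19.0, None], "Installs": [10000, 500000], "Price": [0.0, 0.99]}
)

# Collect any of the three columns that are still not numeric.
non_numeric = [
    col for col in ["Size", "Installs", "Price"] if not is_numeric_dtype(demo[col])
]
assert not non_numeric, f"still non-numeric: {non_numeric}"
```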

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int64  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 1.1+ MB
