### **Author**: Shehzad Ahmad
### **LinkedIn**: [ds-Shehzad](https://www.linkedin.com/in/ds-shehzad/)
### **Date**: 2020-07-04

### My Work
I will analyze this dataset with pandas and visualize it with Matplotlib and seaborn. I am making this notebook to practice EDA as taught at Codanics; my mentor is [Dr. Aammar Tufail](https://www.codanics.com).

### **About Dataset**
**Source**\
[Lavanya](https://www.kaggle.com/datasets/lava18/google-play-store-apps)\
**Context**\
While many public datasets (on Kaggle and the like) provide Apple App Store data, few counterpart datasets are available for Google Play Store apps. On digging deeper, I found that the iTunes App Store page uses a nicely indexed, appendix-like structure that makes web scraping simple. The Google Play Store, on the other hand, uses modern techniques such as dynamic page loading with jQuery, which makes scraping more challenging.

**Content**\
Each app (row) has values for category, rating, size, and more.

**Acknowledgements**\
This information is scraped from the Google Play Store. This app information would not be available without it.

**Inspiration**\
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

**Importing Libraries**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the data
df = pd.read_csv('./Datasets/googleplaystore.csv')


In [4]:
# Let's get an overview of the columns, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10841 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10839 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB
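
The `info()` output above shows that `Rating` has 9367 non-null values out of 10841 rows, so 1474 ratings are missing. A minimal sketch of how missing values could be counted per column follows. It uses a small made-up frame named `toy` (an assumption for illustration, not the real dataset); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Number of missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)

# The same counts expressed as a percentage of all rows.
missing_pct = (missing / len(toy) * 100).round(2)

print(missing)
print(missing_pct)
```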


In [5]:
# Let's have a look at the first five rows
df.head()

|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | 1-Aug-18 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | 8-Jun-18 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | 20-Jun-18 | 1.1 | 4.4 and up |


In [6]:
df.sample(10)

|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1994 | Fruit Block - Puzzle Legend | GAME | 4.6 | 125647 | 33M | 10,000,000+ | Free | 0 | Everyone | Casual | 23-Jul-18 | 60 | 4.1 and up |
| 4537 | Learn R Programming Full | BOOKS_AND_REFERENCE | NaN | 11 | 9.2M | 5,000+ | Free | 0 | Everyone | Books & Reference | 30-Jan-18 | 1.5 | 5.0 and up |
| 3427 | Butterfly Live Wallpaper | PERSONALIZATION | 4.1 | 35771 | 3.5M | 5,000,000+ | Free | 0 | Everyone | Personalization | 4-May-18 | 4.1 | 4.1 and up |
| 4583 | SH Script Runner | PRODUCTIVITY | 3.9 | 334 | 3.7M | 50,000+ | Free | 0 | Everyone | Productivity | 2-Jan-18 | 1.07 | 4.1 and up |
| 10195 | Flash Light on Call & SMS | TOOLS | 4.1 | 25438 | 3.2M | 1,000,000+ | Free | 0 | Everyone | Tools | 24-Apr-18 | 1.2.3 | 4.0.3 and up |
| 781 | SAT Flashcards: Prep & Vocabulary | EDUCATION | 4.2 | 2277 | Varies with device | 100,000+ | Free | 0 | Everyone | Education | 7-Nov-17 | Varies with device | Varies with device |
| 3594 | BookBaby - Baby Development | PARENTING | 4.6 | 1002 | 5.0M | 50,000+ | Free | 0 | Everyone | Parenting | 18-Aug-16 | 3 | 4.0.3 and up |
| 349 | Call Free – Free Call | COMMUNICATION | 4.3 | 30209 | 15M | 5,000,000+ | Free | 0 | Everyone | Communication | 28-Jul-18 | 1.3.4 | 4.1 and up |
| 8462 | Quik – Free Video Editor for photos, clips, music | VIDEO_PLAYERS | 4.7 | 696665 | 91M | 10,000,000+ | Free | 0 | Everyone | Video Players & Editors | 5-Jul-18 | 4.7.4.3869-7b2372c3e | 5.0 and up |
| 4336 | Lollipop Keyboard L Pro | PERSONALIZATION | 3.6 | 5525 | 2.6M | 500,000+ | Free | 0 | Everyone | Personalization | 6-Jun-16 | 1.6 | 2.3 and up |


In [7]:
# describe() summarizes only the numeric columns by default
df.describe()

|  | Rating | Reviews |
| --- | --- | --- |
| count | 9367.0 | 10841.0 |
| mean | 4.191513 | 444111.9 |
| std | 0.515735 | 2927629.0 |
| min | 1.0 | 0.0 |
| 25% | 4.0 | 38.0 |
| 50% | 4.3 | 2094.0 |
| 75% | 4.5 | 54768.0 |
| max | 5.0 | 78158310.0 |
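
By default `describe()` covers only the numeric columns, which is why just `Rating` and `Reviews` appear above. As a rough sketch, passing `include="all"` also summarizes text columns (count, number of unique values, most frequent value and its frequency). The frame `toy` below is made up for illustration.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (values are made up).
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Rating": [4.5, 4.0, 3.9],
})

# Default behaviour: numeric columns only.
numeric_summary = toy.describe()

# include="all" adds unique/top/freq rows for the non-numeric columns.
all_summary = toy.describe(include="all")

print(numeric_summary)
print(all_summary)
```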


In [8]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

#### Convert Size, Installs, and Price to numeric data types

In [9]:
# Strip the literal dollar sign, then cast to float
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype('float32')
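
As a small illustration of the price cleanup, the sketch below applies the same idea to a made-up Series named `prices`. It passes `regex=False` so that `$` is treated as a literal character, and then derives a hypothetical `is_paid` flag that is not part of the notebook.

```python
import pandas as pd

# Toy prices in the same string format as the Price column (values are made up).
prices = pd.Series(["0", "$4.99", "$19.99"])

# regex=False makes pandas treat "$" as a plain character rather than a regex anchor.
cleaned = prices.str.replace("$", "", regex=False).astype("float32")

# Hypothetical follow-up: flag paid apps once the column is numeric.
is_paid = cleaned > 0

print(cleaned.tolist())
print(is_paid.tolist())
```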

In [10]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price             float32
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [11]:
df.Installs.unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)

In [12]:
# Remove '+' and ',' so that values such as '10,000+' can be cast to integers
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype('int64')
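
The direct `astype('int64')` cast works here because every value in the column is a number followed by optional `+` and `,` characters. If a raw copy of the data contained an unexpected entry, the cast would raise an error. A defensive sketch, assuming a made-up Series `installs` with one hypothetical bad value, could use `pd.to_numeric` with `errors="coerce"` instead:

```python
import pandas as pd

# Toy install strings; "Free" is a hypothetical malformed entry added for illustration.
installs = pd.Series(["10,000+", "1,000,000,000+", "0+", "0", "Free"])

# Drop the '+' and ',' characters, as in the notebook.
stripped = installs.str.replace("[+,]", "", regex=True)

# Values that cannot be parsed become NaN instead of raising an error.
numeric = pd.to_numeric(stripped, errors="coerce")

print(numeric)
```

The result is a float column because NaN cannot be stored in a plain `int64` column.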

In [13]:
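# pd.cut uses right-closed intervals: (0, 1000], (1000, 100000], and so on.
# Note: the second label reads '101-100000' although its bin starts just above 1000,
# and apps with 0 installs fall outside the first bin, so their 'cuts' value is NaN.
bins = [0, 1000, 100000, 10000000, 1000000000]
labels = ['0-1000', '101-100000', '100001-10000000', '10000001-1000000000']
df['cuts'] = pd.cut(df['Installs'], bins=bins, labels=labels)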

In [14]:
df.head(2)

|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | cuts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10000 | Free | 0.0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up | 101-100000 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | 4.0.3 and up | 100001-10000000 |


In [15]:
df.cuts.value_counts()

cuts
100001-10000000        4122
101-100000             3179
0-1000                 2697
10000001-1000000000     828
Name: count, dtype: int64
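
The four counts above sum to 10826, which is 15 fewer than the 10841 rows. This is consistent with how `pd.cut` treats the lowest edge: with the default settings the first interval is `(0, 1000]`, so a value of exactly 0 falls outside every bin and becomes NaN. The sketch below, on made-up install counts, compares the default with `include_lowest=True`. The label `'1001-100000'` is a suggested relabelling, not the label used in the notebook.

```python
import pandas as pd

# Toy install counts chosen to sit on and around the bin edges.
installs = pd.Series([0, 1, 1000, 1001, 100000, 500000, 1000000000])

bins = [0, 1000, 100000, 10000000, 1000000000]
# Suggested labels: the second one starts at 1001 to match the bin (1000, 100000].
labels = ["0-1000", "1001-100000", "100001-10000000", "10000001-1000000000"]

# Default: intervals are open on the left, so 0 is not assigned to any bin.
default_cut = pd.cut(installs, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept.
inclusive_cut = pd.cut(installs, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"installs": installs,
                    "default": default_cut,
                    "inclusive": inclusive_cut}))
```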

In [16]:
df.Size.value_counts()

Size
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
                      ... 
430k                     1
429k                     1
200k                     1
460k                     1
619k                     1
Name: count, Length: 461, dtype: int64

In [17]:
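def convert_size(size):
    # Convert a size string such as '19M' to a number of bytes
    if 'M' in size:
        return float(size.replace('M', '')) * 1000000
    elif 'K' in size:
        # Note: the data writes kilobytes with a lowercase 'k' (e.g. '430k'),
        # so this branch never matches and those rows end up as NaN
        return float(size.replace('K', '')) * 1000
    elif 'Varies with device' in size:
        # Note: 0 is a placeholder here; describe() counts it as a real size
        return 0
    else:
        return np.nan

# Apply the conversion function to the 'Size' column
df['Size'] = df['Size'].apply(convert_size)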
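
The earlier `value_counts()` output lists sizes such as `430k` with a lowercase `k`, while the function above checks for an uppercase `K`. A possible variant is sketched below under two assumptions: the unit letter is matched case-insensitively, and "Varies with device" is mapped to NaN rather than 0 so that it does not pull down the minimum and mean. The name `convert_size_to_bytes` is hypothetical.

```python
import math

def convert_size_to_bytes(size):
    """Hypothetical variant of convert_size: bytes as a float, NaN when unknown."""
    if not isinstance(size, str):
        return math.nan
    text = size.strip()
    # Decimal multipliers, matching the factors used in the notebook.
    multipliers = {"k": 1_000, "m": 1_000_000}
    unit = text[-1:].lower()
    if unit in multipliers:
        try:
            return float(text[:-1]) * multipliers[unit]
        except ValueError:
            # The text ended in k/m but the rest was not a number.
            return math.nan
    # Covers 'Varies with device' and any other unrecognized text.
    return math.nan

for raw in ["19M", "8.7M", "430k", "Varies with device"]:
    print(raw, "->", convert_size_to_bytes(raw))
```

On the real data this could be applied with `df['Size'].apply(convert_size_to_bytes)` before the column is overwritten. The summary statistics would then differ from the ones shown in this notebook.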

In [18]:
df.head()

|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | cuts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19000000.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up | 101-100000 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14000000.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | 4.0.3 and up | 100001-10000000 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8700000.0 | 5000000 | Free | 0.0 | Everyone | Art & Design | 1-Aug-18 | 1.2.4 | 4.0.3 and up | 100001-10000000 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25000000.0 | 50000000 | Free | 0.0 | Teen | Art & Design | 8-Jun-18 | Varies with device | 4.2 and up | 10000001-1000000000 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2800000.0 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 20-Jun-18 | 1.1 | 4.4 and up | 101-100000 |


In [19]:
df.Size.value_counts()

Size
0.0           1695
11000000.0     198
12000000.0     196
14000000.0     194
13000000.0     191
              ... 
84000000.0       9
89000000.0       9
86000000.0       8
1000000.0        7
90000000.0       6
Name: count, Length: 182, dtype: int64

In [20]:
df.sample(10)

|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | cuts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6883 | OLX Uganda Sell Buy Cellphones | SHOPPING | 4.2 | 4977 | 9800000.0 | 100000 | Free | 0.0 | Everyone | Shopping | 25-Jul-18 | 11.15.03 | 4.1 and up | 101-100000 |
| 4270 | Guess the song of J Balvin | GAME | NaN | 28 | 8900000.0 | 1000 | Free | 0.0 | Everyone | Trivia | 24-Dec-17 | 1.1 | 4.1 and up | 0-1000 |
| 10436 | Forgotten Hill: Surgery | GAME | 4.5 | 2431 | 24000000.0 | 100000 | Free | 0.0 | Teen | Adventure | 3-Nov-17 | 1.4 | 3.0 and up | 101-100000 |
| 957 | Telemundo Now | ENTERTAINMENT | 3.9 | 8674 | 19000000.0 | 1000000 | Free | 0.0 | Teen | Entertainment | 25-May-18 | 3.3.1.42 | 4.4 and up | 100001-10000000 |
| 8468 | Resources For Virtual DJ | FAMILY | 3.6 | 21095 | 7900000.0 | 1000000 | Free | 0.0 | Everyone | Entertainment | 16-Dec-16 | 4 | 4.0.3 and up | 100001-10000000 |
| 4848 | Z PIVOT | COMMUNICATION | NaN | 0 | 5700000.0 | 10 | Paid | 19.99 | Everyone | Communication | 11-Jul-18 | 1.3 | 4.2 and up | 0-1000 |
| 6885 | Smashy Road: Arena | GAME | 4.3 | 45558 | 53000000.0 | 5000000 | Free | 0.0 | Everyone 10+ | Action | 24-May-18 | 1.1.9 | 4.1 and up | 100001-10000000 |
| 2174 | Steam: Rails to Riches | FAMILY | 4.2 | 214 | 67000000.0 | 5000 | Paid | 4.99 | Everyone | Board;Brain Games | 13-Dec-17 | 3.3.1 | 4.1 and up | 101-100000 |
| 4861 | Angry Birds POP Bubble Shooter | FAMILY | 4.5 | 596628 | 99000000.0 | 10000000 | Free | 0.0 | Everyone | Casual | 2-Aug-18 | 3.39.0 | 4.1 and up | 100001-10000000 |
| 989 | Kidjo TV Kids Have Fun & Learn | ENTERTAINMENT | 4.6 | 732 | 0.0 | 100000 | Free | 0.0 | Everyone | Entertainment;Music & Video | 1-Aug-18 | Varies with device | Varies with device | 101-100000 |


In [21]:
df.describe()

|  | Rating | Reviews | Size | Installs | Price |
| --- | --- | --- | --- | --- | --- |
| count | 9367.0 | 10841.0 | 10525.0 | 10841.0 | 10841.0 |
| mean | 4.191513 | 444111.9 | 18682150.0 | 15462910.0 | 1.027273 |
| std | 0.515735 | 2927629.0 | 22284000.0 | 85025570.0 | 15.94897 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.0 | 38.0 | 3000000.0 | 1000.0 | 0.0 |
| 50% | 4.3 | 2094.0 | 9800000.0 | 100000.0 | 0.0 |
| 75% | 4.5 | 54768.0 | 26000000.0 | 5000000.0 | 0.0 |
| max | 5.0 | 78158310.0 | 100000000.0 | 1000000000.0 | 400.0 |


In [22]:
df['Content Rating'].value_counts()

Content Rating
Everyone           8715
Teen               1208
Mature 17+          499
Everyone 10+        414
Adults only 18+       3
Unrated               2
Name: count, dtype: int64
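
Raw counts like the ones above are often easier to compare as shares. A minimal sketch, using a made-up Series `ratings` in place of `df['Content Rating']`, relies on the `normalize=True` option of `value_counts()`:

```python
import pandas as pd

# Toy content ratings (values are made up, not taken from the dataset).
ratings = pd.Series(["Everyone", "Everyone", "Everyone", "Teen"],
                    name="Content Rating")

# normalize=True returns fractions of the total instead of raw counts.
shares = ratings.value_counts(normalize=True)

# Convert to percentages for display.
print((shares * 100).round(1))
```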