# EDA on Google Play Store Apps

**Author Name:** Syed Ghazi Ali Zaidi \
**Email:** sghazializaidi@gmail.com

The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

## *The data collected from the source has the following description:*

## About the Dataset

### Context
While numerous public datasets, such as those on Kaggle, provide Apple App Store data, equivalent datasets for Google Play Store apps are scarce across the web. Upon closer examination, it becomes evident that the iTunes App Store employs a well-organized, index-like structure, which makes web scraping straightforward. In contrast, the Google Play Store uses more sophisticated techniques, including dynamic page loading with jQuery, which makes scraping more challenging.

### Content
Each entry in the dataset represents an app, featuring information such as category, rating, size, and more.

### Acknowledgements
The data presented here has been scraped from the Google Play Store. This valuable app information would not be accessible without the efforts involved in scraping.

### Inspiration
The dataset containing information on Play Store apps holds immense potential for guiding app development businesses toward success. Developers can extract actionable insights to target specific areas and capitalize on opportunities within the Android market!

# 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2. Data Loading and Exploration | Cleaning

- Let's load the CSV file

In [2]:
df = pd.read_csv('./Datasets/googleplaystore.csv')

- Display and warning settings (run this before exploring the data)

In [3]:
# Show every column and row when displaying a DataFrame (no truncation)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Hide all warnings
import warnings
warnings.filterwarnings('ignore')

- Looking at the first 5 rows

In [4]:
df.head(5)

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [5]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset.")

There are 10840 rows and 13 columns in the dataset.


In [6]:
df.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            10840 non-null  object 
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB


- Summary statistics for the columns that pandas currently treats as numeric (`describe()` reports only numeric columns by default)

In [8]:
df.describe()

| | Rating | Reviews |
|---|---|---|
| count | 9366.0 | 10840.0 |
| mean | 4.191757 | 444152.9 |
| std | 0.515219 | 2927761.0 |
| min | 1.0 | 0.0 |
| 25% | 4.0 | 38.0 |
| 50% | 4.3 | 2094.0 |
| 75% | 4.5 | 54775.5 |
| max | 5.0 | 78158310.0 |


#### Five columns should be numeric (`Rating`, `Reviews`, `Size`, `Installs`, `Price`), but only two are currently parsed as numeric

In [9]:
df[['Size','Installs','Price']].sample(5)

| | Size | Installs | Price |
|---|---|---|---|
| 911 | 13M | 1,000,000+ | 0 |
| 5052 | 53M | 100+ | 0 |
| 7267 | 29M | 10,000,000+ | 0 |
| 6763 | 373k | 5,000,000+ | 0 |
| 5198 | 26M | 1,000+ | 0 |
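
As a small illustration of why only two columns appear in `describe()`, the sketch below uses `select_dtypes` to split a toy frame into numeric and non-numeric columns. The frame and its values (including the `$4.99` price) are made up for illustration and only mimic the raw formats seen above.

```python
import pandas as pd

# Toy frame with hypothetical values that mimic the raw column formats
toy = pd.DataFrame({
    "Rating": [4.1, 3.9],
    "Reviews": [159, 967],
    "Size": ["19M", "373k"],
    "Installs": ["10,000+", "500,000+"],
    "Price": ["0", "$4.99"],
})

# Columns pandas parsed as numbers versus columns still stored as text
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```

The text columns are the ones that still need their units and symbols stripped before they can be converted.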


# Making columns numeric

## 1. Size

In [10]:
df['Size'].value_counts()

Size
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
15M                    184
17M                    160
19M                    154
26M                    149
16M                    149
25M                    143
20M                    139
21M                    138
10M                    136
24M                    136
18M                    133
23M                    117
22M                    114
29M                    103
27M                     97
28M                     95
30M                     84
33M                     79
3.3M                    77
37M                     76
35M                     72
31M                     70
2.9M                    69
2.5M                    68
2.3M                    68
2.8M                    65
3.4M                    65
32M                     63
34M                     63
3.7M                    63
40M                     62
3.9M                   

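**Observations:** the `Size` column contains three kinds of values:
1. `Varies with device`
2. Values ending in `M` (megabytes)
3. Values ending in `k` (kilobytes)

- Replace `Varies with device` with `NaN`
- Strip the `M` suffix, leaving the size in megabytes
- Strip the `k` suffix and divide by 1024 to convert kilobytes to megabytes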
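
The three steps above can also be sketched in vectorized form on a toy Series. This is only an illustration under the assumption that every value is either `Varies with device` or a number followed by `M` or `k`; the sample values and the variable names (`raw`, `number`, `is_kb`, `size_mb`) are hypothetical.

```python
import pandas as pd

# Hypothetical sample covering the three kinds of values
raw = pd.Series(["19M", "8.7M", "373k", "Varies with device"])

# Strip a trailing M or k and parse the rest; anything non-numeric becomes NaN
number = pd.to_numeric(raw.str.rstrip("Mk"), errors="coerce")

# Rows given in kilobytes are divided by 1024 so the whole column is in megabytes
is_kb = raw.str.endswith("k")
size_mb = number.where(~is_kb, number / 1024)

print(size_mb)
```

The result is a `float64` Series in megabytes, with `NaN` where the size varies with the device.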

In [11]:
df['Size'].isna().sum()

0

In [12]:
# Number of rows in the Size column that contain 'Varies with device'
df['Size'].str.contains('Varies with device').sum()

1695

In [13]:
# Number of rows in the Size column that contain 'M'
df['Size'].str.contains('M').sum()

8829

In [14]:
# Number of rows in the Size column that contain 'k'
df['Size'].str.contains('k').sum()

316

In [15]:
df['Size'].value_counts().sum()

10840
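
The three counts add up to the full column (1695 + 8829 + 316 = 10840), so every value falls into exactly one of the three categories and no other format is hiding in `Size`. A minimal sketch of the same check on a toy Series (hypothetical values; it matches on the suffix rather than anywhere in the string, which is slightly stricter than `str.contains`):

```python
import pandas as pd

# Hypothetical sample of raw Size values
sizes = pd.Series(["19M", "373k", "Varies with device", "2.8M", "Varies with device"])

n_varies = sizes.eq("Varies with device").sum()
n_mb = sizes.str.endswith("M").sum()
n_kb = sizes.str.endswith("k").sum()

# If the categories cover everything, their counts sum to the number of rows
assert n_varies + n_mb + n_kb == len(sizes)
print(n_varies, n_mb, n_kb)
```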

#### Replacing `Varies with device` with NumPy `NaN`

In [16]:
df['Size'] = df['Size'].replace('Varies with device',np.nan)

In [17]:
df['Size'].isnull().sum()

1695

#### Removing the `M` suffix (replacing it with an empty string)

In [18]:
df['Size'] = df['Size'].str.replace('M','')

### Defining a function that converts `k` values to megabytes by dividing by 1024

In [19]:
def convert_k(size):
    # Convert sizes such as '373k' from kilobytes to megabytes; leave other values unchanged
    if isinstance(size, str) and 'k' in size:
        return float(size.replace('k', '')) / 1024
    return size

- Calling the function

In [20]:
df['Size'] = df['Size'].apply(convert_k)
df['Size'].unique()

array(['19', '14', '8.7', '25', '2.8', '5.6', '29', '33', '3.1', '28',
       '12', '20', '21', '37', '2.7', '5.5', '17', '39', '31', '4.2',
       '7.0', '23', '6.0', '6.1', '4.6', '9.2', '5.2', '11', '24', nan,
       '9.4', '15', '10', '1.2', '26', '8.0', '7.9', '56', '57', '35',
       '54', 0.1962890625, '3.6', '5.7', '8.6', '2.4', '27', '2.5', '16',
       '3.4', '8.9', '3.9', '2.9', '38', '32', '5.4', '18', '1.1', '2.2',
       '4.5', '9.8', '52', '9.0', '6.7', '30', '2.6', '7.1', '3.7', '22',
       '7.4', '6.4', '3.2', '8.2', '9.9', '4.9', '9.5', '5.0', '5.9',
       '13', '73', '6.8', '3.5', '4.0', '2.3', '7.2', '2.1', '42', '7.3',
       '9.1', '55', 0.0224609375, '6.5', '1.5', '7.5', '51', '41', '48',
       '8.5', '46', '8.3', '4.3', '4.7', '3.3', '40', '7.8', '8.8', '6.6',
       '5.1', '61', '66', 0.0771484375, '8.4', 0.115234375, '44',
       0.6787109375, '1.6', '6.2', 0.017578125, '53', '1.4', '3.0', '5.8',
       '3.8', '9.6', '45', '63', '49', '77', '4.4', '4.8', '7

In [21]:
df['Size'] = df['Size'].astype('float64')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(2), int64(1), object(10)
memory usage: 1.1+ MB


## 2. Installs

In [22]:
df['Installs'].sample(5)

2323        500,000+
6976      1,000,000+
7679          1,000+
1986    100,000,000+
4090     10,000,000+
Name: Installs, dtype: object

**Observations:** the values carry a trailing `+` and comma thousands separators, so we need to:
1. Remove `+`
2. Remove `,`
3. Convert the column to `int64`

In [23]:
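# regex=False treats '+' and ',' as literal characters on every pandas version
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)
df['Installs'] = df['Installs'].str.replace(',', '', regex=False)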
df['Installs'] = df['Installs'].astype('int64')
df['Installs'].sample(10)

10554           5
9442         1000
8553        10000
6433          100
5845           10
9648           10
6264       100000
1323     10000000
8885          100
2165      5000000
Name: Installs, dtype: int64
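
As an alternative sketch, both characters can be removed in a single pass with a regular-expression character class (inside `[...]`, `+` is a literal character). The sample values and the names `installs` and `cleaned` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw Installs values
installs = pd.Series(["10,000+", "500,000+", "5+", "1,000,000+"])

# Remove every '+' and ',' in one pass, then convert to integers
cleaned = installs.str.replace(r"[+,]", "", regex=True).astype("int64")

print(cleaned.tolist())
```

Note that each value is a bucket label, so `10000` here means "at least 10,000 installs" rather than an exact count.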

## 3. Price

In [24]:
df['Price'].sample(5)

9696     0
2852     0
10503    0
4683     0
7814     0
Name: Price, dtype: object

**Observations:**
1. The `$` sign in paid apps' prices keeps the column from being parsed as a number, so we need to remove it

In [25]:
# regex=False treats '$' as a literal character rather than the end-of-string anchor
df['Price'] = df['Price'].str.replace('$', '', regex=False)
df['Price'] = df['Price'].astype('float64')
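
`astype('float64')` raises an error if any value still fails to parse. A more defensive sketch, swapping in `pd.to_numeric` with `errors="coerce"`, turns unparseable values into `NaN` so they can be inspected. The sample prices below, including the deliberately bad `Everyone` entry, are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the last value is a deliberately malformed entry
prices = pd.Series(["0", "$4.99", "$0.99", "Everyone"])

# Strip the literal '$' and parse; values that are not numbers become NaN
stripped = prices.str.replace("$", "", regex=False)
parsed = pd.to_numeric(stripped, errors="coerce")

# Original values that could not be converted
bad_rows = prices[parsed.isna()]

print(parsed)
print(bad_rows)
```

An empty `bad_rows` would indicate that every price converted cleanly.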