# **Google Play Store App Data**

#### **Complete Exploratory Data Analysis 🔎**

---
**Author:** `Syed Ghazi Ali Zaidi`

* Contact: _sghazializaidi@gmail.com_
* Explore my code: _https://github.com/Ghazi-work_
* Connect with me: _https://www.linkedin.com/in/syed-ghazi-ali-zaidi-405931217_


---

# About the data 

This dataset describes apps listed on the Google Play Store, with per-app details that are useful to developers and to anyone studying the Android app market.

## Data Overview

* **Attributes:** The dataset includes information like category, rating, size, installs, price, and more for each app.
* **Record Structure:** Each row represents a unique app with its corresponding attributes.

| Attribute | Data Type | Description |
|---|---|---|
| App | Object | Name of the app |
| Category | Object | Category of the app |
| Rating | Numeric | Average user rating of the app |
| Reviews | Numeric | Number of user reviews for the app |
| Size | Object | Size of the app, stored as text (e.g. `19M`, `201k`, or `Varies with device`) |
| Installs | Object | Number of times the app has been installed, stored as text (e.g. `10,000+`) |
| Type | Object | Type of app (Free or Paid) |
| Price | Object | Price of the app (if it's a paid app) |
| Content Rating | Object | Age-based content rating of the app |
| Genres | Object | Genre of the app |
| Last Updated | Object | Date on which the app was last updated |
| Current Ver | Object | Current app version |
| Android Ver | Object | Minimum Android version required by the app |


## Motivation

While Apple App Store data is readily available, Google Play Store data often proves elusive. This dataset bridges that gap, allowing us to analyze app trends, user preferences, and potential market opportunities. Scraping Google Play Store data presented its own set of challenges due to its dynamic nature, making this resource even more valuable.

## Inspiration

The potential of this data is vast. Developers can leverage it to:

* Understand user behavior and preferences
* Optimize app strategies for better visibility and engagement
* Identify market gaps and opportunities
* Gain a competitive edge in the Android market

Researchers and enthusiasts can explore:

* App usage patterns across different categories and demographics
* Factors influencing app success and failure
* The evolution of the app ecosystem over time

## Acknowledgements

A huge thank you to the original data providers on [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps/) and the Google Play Store itself. Without their efforts, this valuable resource wouldn't be available.


---

## **1. Importing Libraries**

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

## **2. Loading Dataset and Exploring**

↪ Load the CSV file from the `Datasets` folder into a pandas DataFrame.\
↪ Explore the data and record observations for data wrangling.

In [3]:
df = pd.read_csv("Datasets/googleplaystore.csv")

* Viewing the first 5 rows of the data.

In [5]:
df.head(5)

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


* Looking at the total rows and columns of the dataset.

In [8]:
print(f"There are a total of {df.shape[0]} rows and {df.shape[1]} columns in the dataset.")

There are a total of 10840 rows and 13 columns in the dataset.


* Looking at the datatypes of the features.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10840 entries, 0 to 10839
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            10840 non-null  object 
 5   Installs        10840 non-null  object 
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB
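
The non-null counts above imply missing values in a few columns (for example, `Rating` has 9366 non-null entries out of 10840). A minimal sketch of how such gaps could be tabulated directly is shown here. It uses a small hypothetical frame (`sample`) rather than the real dataset, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

print(missing)
```

Running the same two expressions on `df` would give the per-column gaps without reading them off the `df.info()` output by hand.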


* Basic descriptive statistics on the data.

In [10]:
df.describe()

| | Rating | Reviews |
|---|---|---|
| count | 9366.0 | 10840.0 |
| mean | 4.191757 | 444152.9 |
| std | 0.515219 | 2927761.0 |
| min | 1.0 | 0.0 |
| 25% | 4.0 | 38.0 |
| 50% | 4.3 | 2094.0 |
| 75% | 4.5 | 54775.5 |
| max | 5.0 | 78158310.0 |


### **Observations:** 
----
1. There are 10840 rows and 13 columns.
2. The columns in the dataset are:
    - `'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'`
3. Only **2 columns** (`Rating` and `Reviews`) are stored as _numeric_, but on inspection **5 columns** hold numeric information.
4. The `'Size'`, `'Installs'` and `'Price'` columns need to be transformed into numeric types.
----
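
One way to check which `object` columns are numeric in disguise is to strip the known decorations and measure how many values then parse as numbers. The sketch below is an illustration under assumptions: `sample` is a hypothetical toy frame, `numeric_share` is a helper introduced here, and the stripped characters (`+`, `,`, `M`, `k`) are taken from the values shown in `df.head()`. The `$` prefix on prices is an assumption, since the rows shown above only contain free apps.

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw string formats
sample = pd.DataFrame({
    "Size": ["19M", "201k", "Varies with device"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "0"],  # the '$' prefix is an assumption
    "Genres": ["Art & Design", "Tools", "Education"],
})

def numeric_share(col: pd.Series) -> float:
    """Fraction of values that parse as numbers once decoration characters are stripped."""
    stripped = col.astype(str).str.replace(r"[+,$Mk]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce").notna().mean()

# Share of numeric-like values for every column
shares = {name: numeric_share(sample[name]) for name in sample.columns}
print(shares)
```

A share close to 1 suggests the column is a candidate for conversion, while a share near 0 suggests genuinely categorical text.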

## **3. Data Pre-processing**

↪ We'll transform these columns to numeric, first checking each one for missing values:\
a. **Size**\
b. **Installs**\
c. **Price**

### a. Size

* Checking any null values.

In [24]:
print(f"There are {df['Size'].isnull().sum()} missing values in `Size` column.")

There are 0 missing values in `Size` column.


* Checking the unique values in the `Size` column.

In [18]:
df['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

* Checking how often each unique value occurs.

In [19]:
df['Size'].value_counts()

Size
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
                      ... 
430k                     1
429k                     1
200k                     1
460k                     1
619k                     1
Name: count, Length: 461, dtype: int64

### **Observations:** 
----
* The `Size` column has **0 missing values**.
* The `Size` column contains text that has to be handled before conversion:
    1. `Varies with device` has to be replaced with numpy `NaN`.
    2. `M` has to be replaced with an empty string, since those values are already in MB.
    3. `k` has to be replaced with an empty string and the value divided by **`1024`** to convert it from KB to MB.
* Lastly, the column can be converted to float64.
----
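
The steps listed above can be sketched on a few sample values. This is a vectorized illustration on a hypothetical Series (`sizes`), not the notebook's own cells; the names `parts`, `number`, `unit` and `size_mb` are introduced here for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw Size strings
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Step 1: treat 'Varies with device' as missing
sizes = sizes.replace("Varies with device", np.nan)

# Step 2: split each value into its number and its unit suffix
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")
number = parts[0].astype("float64")
unit = parts[1]

# Step 3: kilobytes become megabytes (divide by 1024); megabytes stay unchanged
size_mb = number.where(unit != "k", number / 1024)

print(size_mb)
```

Here `201k` becomes 201 / 1024, roughly 0.196 MB, while the `M` values keep their numeric part and the missing entry stays `NaN`.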

* Total number of rows before any changes.

In [23]:
df['Size'].value_counts().sum()

10840

* Number of `Varies with device` entries, which will become numpy NaN later.

In [26]:
(df['Size'] == 'Varies with device').sum()

1695

* Replacing `Varies with device` with numpy NaN.

In [33]:
# Replacing `Varies with device` with numpy NaN
df['Size'] = df['Size'].replace('Varies with device', np.nan)

* The null count confirms that those entries are now NaN.

In [34]:
df['Size'].isnull().sum()

1695

* Handling `M` and `k` in the `Size` column and converting it to float64.

In [38]:
# Strip the 'M' suffix; these values are already in megabytes
df['Size'] = df['Size'].str.replace('M', '', regex=False)

# For values in kilobytes, strip the 'k' suffix and divide by 1024 to get megabytes
for index, row in df.iterrows():
    if 'k' in str(row['Size']):
        df.at[index, 'Size'] = pd.to_numeric(str(row['Size']).replace('k', '')) / 1024

# Converting Size to float64 and assigning the result back to the column
df['Size'] = df['Size'].astype('float64')
df['Size']



0        19.0
1        14.0
2         8.7
3        25.0
4         2.8
         ... 
10835    53.0
10836     3.6
10837     9.5
10838     NaN
10839    19.0
Name: Size, Length: 10840, dtype: float64