# **Google Play Store Data**
**Complete Exploratory Data Analysis**

## About Dataset

>- **`Description`**\
> The Data Set was downloaded from Kaggle, from the following [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps/)

- `Context`
While many public datasets (on Kaggle and the like) provide Apple App Store data, there are not many counterpart datasets available for Google Play Store apps anywhere on the web. On digging deeper, I found out that the iTunes App Store page uses a nicely indexed, appendix-like structure that allows for simple and easy web scraping. The Google Play Store, on the other hand, uses sophisticated modern-day techniques (like dynamic page load) based on jQuery, which makes scraping more challenging.

- `Content`
Each app (row) has values for category, rating, size, and more.

- `Acknowledgements`
This information is scraped from the Google Play Store. This app information would not be available without it.

- `Inspiration`
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

## 1. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

## 2. **Data Loading and exploration and cleaning**
 ↪ Load the CSV file with pandas

 ↪ Create the DataFrame and explore the data it contains

 ↪ Deal with missing data, outliers, and incorrect records

In [None]:
data = pd.read_csv('googleplaystore.csv')

In [None]:
data.head()

> **Note**: By default, the notebook truncates wide or long outputs. We can raise the limits on displayed columns and rows with these commands:


In [None]:
pd.set_option('display.max_columns', None) # this is to display all the columns in the dataframe
pd.set_option('display.max_rows', None) # this is to display all the rows in the dataframe
# hide all warnings at runtime
import warnings
warnings.filterwarnings('ignore')

- Let's see the exact column names of the Google Play Store dataset, so they can be copied easily later on

In [None]:
data.columns

- Let's check the number of rows and columns in the dataset

In [None]:
data.shape

That is not enough, so let's look at the columns and their data types using the detailed `info()` function

In [None]:
data.info()

# **Observations**
---
1. There are 10841 rows and 13 columns in the dataset
2. The columns are of different data types
3. The columns in the datasets are:
   - `'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'`
4. There are some missing values in the dataset, which we will examine in detail and deal with later in the notebook.
5. Some columns are of object data type although they should be numeric; we will convert them later in the notebook.
   - `'Size', 'Installs', 'Price'` 

In [None]:
data.describe()

## Observations:
---
- Only 2 columns are numeric; all the rest are of object data type (according to pandas). However, `'Size', 'Installs', 'Price'` also hold numeric information, so we must convert them to a numeric data type during data wrangling.
---

- Let's clean the `Size` column first

In [None]:
# First check null values
data['Size'].isnull().sum()

In [None]:
# Check Unique Values 
data['Size'].unique()

- There are several unique values in the `Size` column. We first have to bring them to one common unit by converting the `M` and `k` values to bytes, then remove the `M` and `k` suffixes and convert the column to a numeric data type.

In [None]:
# find the values in size column which has 'M' in it
data['Size'].loc[data['Size'].str.contains('M')].value_counts().sum()

In [None]:
# find the values in size column which has 'k' in it
data['Size'].loc[data['Size'].str.contains('k')].value_counts().sum()

In [None]:
# find the values in size column which has 'Varies with device' in it
data['Size'].loc[data['Size'].str.contains('Varies with device')].value_counts().sum()

In [None]:
# Total Values in Size column
data['Size'].value_counts().sum()

- We have `8830` values in `M` units
- We have `316` values in `k` units
- We have `1695` values of `Varies with device`

> Let's convert the `M` and `k` units into bytes, remove the suffixes, and convert the values to a numeric data type.

In [None]:
# this function will convert the size column to numeric
def convert_size(size):
    if isinstance(size, str):
        if 'k' in size:
            return float(size.replace('k', '')) * 1024
        elif 'M' in size:
            return float(size.replace('M', '')) * 1024 * 1024
        elif 'Varies with device' in size:
            return np.nan
    return size

data['Size'] = data['Size'].apply(convert_size)

In [None]:
# rename the column name 'Size' to 'Size_in_bytes'
data.rename(columns={'Size': 'Size_in_bytes'}, inplace=True)

- Now every value is expressed in bytes, the `M` and `k` suffixes are gone, and the column holds numeric values.
- 'Varies with device' was a string value, so we intentionally converted it to null values, which we can fill later according to our needs.
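
As an alternative to the row-by-row `apply`, the same conversion can be sketched with vectorized string methods. This is only an illustration on a hypothetical sample `Series` (the values below are made up to match the format of the raw `Size` column), and it assumes every size is either a number followed by `k` or `M`, or the text `Varies with device`.

```python
import pandas as pd

# Hypothetical sample values in the same format as the raw `Size` column
sizes = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])

# Split each value into its numeric part and its unit suffix (k or M);
# rows that do not match the pattern become NaN in both parts
parts = sizes.str.extract(r'^([\d.]+)([kM])$')
number = pd.to_numeric(parts[0], errors='coerce')

# Map each unit to its multiplier in bytes; unmatched rows stay NaN
multiplier = parts[1].map({'k': 1024, 'M': 1024 * 1024})

size_in_bytes = number * multiplier
print(size_in_bytes)
```

Rows such as `Varies with device` fall through as missing values, which mirrors the behavior of `convert_size` above.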

---
- Let's have a look at the `Installs` column

In [None]:
# check the unique values in the 'Installs' column
data['Installs'].unique()

In [None]:
# let's look at the value counts
data['Installs'].value_counts()

In [None]:
# find how many values contain '+' (regex=False treats '+' as a literal character)
data['Installs'].loc[data['Installs'].str.contains('+', regex=False)].value_counts().sum()

In [None]:
# Total values in Installs column
data['Installs'].value_counts().sum()

- The `Installs` column has `10841` values in total and no null values.
- However, one value, `0`, has no plus sign.

- Let's remove the plus sign `+` and the comma `,` from the values and convert them to a numeric data type

In [None]:
# remove the plus sign from install column and convert it to numeric
data['Installs'] = data['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
# also remove the commas from the install column
data['Installs'] = data['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
# convert the install column to numeric (integers because this is the number of installs/count)
data['Installs'] = data['Installs'].apply(lambda x: int(x))

- Let's verify that the dtype has changed and that the `+` and `,` characters have been removed


In [None]:
data['Installs'].dtype

In [None]:
data['Installs'].max()
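
The same clean-up can also be sketched with vectorized string methods instead of three `apply` calls. The sample values below are hypothetical and only mimic the format of the raw `Installs` column; the sketch assumes nothing but digits remains once `+` and `,` are stripped.

```python
import pandas as pd

# Hypothetical sample values in the same format as the raw `Installs` column
installs = pd.Series(['10,000+', '5,000,000+', '0', '1,000,000,000+'])

# Strip '+' and ',' as literal characters (regex=False avoids escaping '+')
cleaned = (
    installs.str.replace('+', '', regex=False)
            .str.replace(',', '', regex=False)
)

# Convert to integers; this raises if any non-numeric string remains
installs_int = cleaned.astype('int64')
print(installs_int.tolist())
```

Because `astype` raises on any leftover non-numeric text, a malformed row would surface immediately instead of passing through silently.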

- We can generate a new column based on the installation values, which will be helpful in our analysis

In [None]:
# making a new column called 'Installs_category' which will have the category of the installs
bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels=['no', 'Very low', 'Low', 'Moderate', 'More than moderate', 'High', 'Very High', 'Top Notch']
data['Installs_category'] = pd.cut(data['Installs'], bins=bins, labels=labels)

In [None]:
data['Installs_category'].value_counts() # check the value counts of the new column
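
To see how the bin edges map to labels, here is a small sketch that reuses the bins and labels from the cell above on a few hypothetical install counts placed on or next to the edges. By default `pd.cut` builds right-closed intervals, so a value equal to an edge falls into the lower category.

```python
import pandas as pd

# Same bin edges and labels as the cell above
bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels = ['no', 'Very low', 'Low', 'Moderate', 'More than moderate',
          'High', 'Very High', 'Top Notch']

# Hypothetical install counts chosen to sit on or next to bin edges
sample = pd.Series([0, 1, 10, 11, 1000, 1001, 10000000, 10000001])

# pd.cut uses right-closed intervals by default: (left, right]
categories = pd.cut(sample, bins=bins, labels=labels)
print(pd.DataFrame({'Installs': sample, 'Installs_category': categories}))
```

For example, an app with exactly 1,000 installs is labeled `Low`, while 1,001 installs is labeled `Moderate`.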

- Let's have a look at the `Price` column

In [None]:
# check the unique values in the 'Price' column
data['Price'].unique()

In [None]:
data['Price'].isnull().sum()

- No Null Values

In [None]:
data['Price'].value_counts()

- We need to confirm whether every non-zero value in the `Price` column carries a `$` sign

In [None]:
# count the values containing '$' in the 'Price' column (regex=False treats '$' as a literal character)
data['Price'].loc[data['Price'].str.contains('$', regex=False)].value_counts().sum()

- Now we can confirm that every value in the `Price` column is either a `$` amount or `0`, since `800 + 10041 = 10841` total values.
- The only problem is the `$` sign, so let's remove it and convert the column to a numeric data type.

In [None]:
# remove the dollar sign from the price column and convert it to numeric
data['Price'] = data['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
# convert the price column to numeric (float because this is the price)
data['Price'] = data['Price'].apply(lambda x: float(x))

In [None]:
# using f string to print the min, max and average price of the apps
print(f"Min price is: {data['Price'].min()} $")
print(f"Max price is: {data['Price'].max()} $")
print(f"Average price is: {data['Price'].mean()} $")
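
The count check above (paid plus free entries adding up to the total) can be sketched as an explicit assertion. The sample values are hypothetical and only mimic the format of the raw `Price` column; the sketch assumes free apps are stored as the string `'0'`.

```python
import pandas as pd

# Hypothetical sample values in the same format as the raw `Price` column
price = pd.Series(['0', '$4.99', '0', '$399.99', '$0.99'])

# Flag paid entries (leading '$') and free entries (exactly '0')
paid = price.str.startswith('$')
free = price.eq('0')

# Every row should fall into exactly one of the two groups
assert (paid ^ free).all(), "unexpected price format found"
print(f"paid: {paid.sum()}, free: {free.sum()}, total: {len(price)}")

# Strip the literal '$' and convert to float
price_float = price.str.replace('$', '', regex=False).astype(float)
```

If a value in another currency or format were present, the assertion would fail before the conversion runs.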

### **2.1. Descriptive Statistics**

In [None]:
data.describe()

## Observations:
---
- Now we have 5 numeric columns: `Rating`, `Reviews`, `Size_in_bytes`, `Installs` and `Price`.
- We can inspect their descriptive statistics and make many observations, depending on our hypotheses.
- The `Rating` column ranges from `1` to `5`, which is the full rating scale, and its mean is `4.19`, so on average users rate apps favorably.
- The `Reviews` column ranges from `0` to `78,158,306` (78+ million), and its mean is `444,111.93`. On its own this average is not very meaningful, because it mixes apps from very different categories.
- Similarly, we can observe the other columns as well.

Therefore, the most important step is to classify the apps with the help of the correlation matrix and then look at the descriptive statistics per app category: number of installs, reviews, ratings, etc.

But even before that, we have to think about the missing values in the dataset.

---

### **2.2. Dealing with the missing values**
Dealing with missing values is one of the most important parts of the data wrangling process; we must handle them in order to get correct insights from the data.

## Where to Learn more about Missing Values?
In the following blog [Missing Values k Rolay](https://codanics.com/missing-values-k-rolay/) you will understand how missing values can change your output if you ignore them and how to deal with them.

- Let's have a look at the missing values in the dataset

In [None]:
data.isnull().sum().sort_values(ascending=True)

In [None]:
data.isnull().sum().sum()

In [None]:
(data.isnull().sum() / len(data) * 100).sort_values(ascending=True)

- Let's plot the missing values

In [None]:
plt.figure(figsize=(16,6))
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap="viridis")

- Let's plot the missing values as a percentage of each column

In [None]:
# make figure size
plt.figure(figsize=(16, 6))
# plot the null values by their percentage in each column
missing_percentage = data.isnull().sum()/len(data)*100
missing_percentage.plot(kind='bar')
# add the labels
plt.xlabel('Columns')
plt.ylabel('Percentage')
plt.title('Percentage of Missing Values in each Column')

## Observations:
---
- We have 1695 missing values in the `'Size_in_bytes'` column, which is 15.6% of the total values in the column.
- We have 1474 missing values in the `'Rating'` column, which is 13.6% of the total values in the column.
- We have 8 missing values in the `'Current Ver'` column, which is 0.07% of the total values in the column.
- We have 2 missing values in the `'Android Ver'` column, which is 0.02% of the total values in the column.
- We have only 1 missing value in each of the `Category`, `Type` and `Genres` columns, which is 0.009% of the total values in the column.
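
The counts and percentages listed above can be gathered into a single table. The sketch below uses a hypothetical miniature frame rather than the real dataset, and the column names `missing_count` and `missing_percent` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps in two columns
df = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Rating': [4.1, np.nan, 3.9, np.nan],
    'Size_in_bytes': [1024.0, 2048.0, np.nan, 4096.0],
})

# Count and percentage of missing values per column, in one table
missing = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_percent': (df.isnull().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, largest first
missing = missing[missing['missing_count'] > 0].sort_values(
    'missing_count', ascending=False
)
print(missing)
```

Applied to `data`, the same few lines would produce the figures quoted in the observations without reading them off separate outputs.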

### **2.3. Dealing with the missing values**
- We cannot simply impute the `Rating` column, as it may be directly linked with the `Installs` column. To test this hypothesis, we plot `Rating` against the `Installs` and `Size_in_bytes` columns and test the relationship statistically using the Pearson correlation test.
---

- Let's start with the correlation analysis

In [None]:
# Make a correlation matrix of numeric columns
plt.figure(figsize=(16, 10)) # make figure size  
numeric_cols = ['Rating', 'Reviews', 'Size_in_bytes', 'Installs', 'Price'] # make a list of numeric columns
sns.heatmap(data[numeric_cols].corr(), annot=True) # plot the correlation matrix

In [None]:
data[numeric_cols].corr() # this will show the correlation matrix

In [None]:
from scipy import stats

data_clean = data.dropna()

# calculate Pearson's R between Reviews and Installs
pearson_r, _ = stats.pearsonr(data_clean['Reviews'], data_clean['Installs'])
print(f"Pearson's R between Reviews and Installs: {pearson_r:.4f}")
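
The cell above measures `Reviews` against `Installs`, while the hypothesis stated earlier concerns `Rating`. As a sketch under assumptions, the snippet below runs the Pearson test on `Rating` and `Installs` and drops rows only where one of those two columns is missing. The frame `df` is a hypothetical stand-in for `data`, and its values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical miniature frame; the real notebook would use `data`
df = pd.DataFrame({
    'Rating': [4.5, 4.0, np.nan, 3.5, 4.8, 4.2],
    'Installs': [1000, 500, 10, 100, 100000, 5000],
    'Size_in_bytes': [np.nan, 2048.0, 1024.0, np.nan, 4096.0, 8192.0],
})

# Drop rows only where the two columns under test are missing,
# so gaps in unrelated columns (e.g. Size_in_bytes) do not cost rows
pair = df[['Rating', 'Installs']].dropna()

r, p_value = stats.pearsonr(pair['Rating'], pair['Installs'])
print(f"rows used: {len(pair)}, r = {r:.4f}, p = {p_value:.4f}")
print(f"rows kept by a blanket dropna(): {len(df.dropna())}")
```

A blanket `dropna()` also discards rows whose only gap is in `Size_in_bytes`, so restricting the drop to the tested pair keeps more observations in the test.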

- Before going ahead, let's remove the rows with missing values in the `Current Ver`, `Android Ver`, `Category`, `Type` and `Genres` columns, as they are very few in number and will not affect our analysis.

In [None]:
# length before removing null values
print(f"Length of the dataframe before removing null values: {len(data)}")

In [None]:
data.dropna(subset=['Current Ver', 'Android Ver', 'Category', 'Type', 'Genres'], inplace=True)

In [None]:
# length after removing null values
print(f"Length of the dataframe after removing null values: {len(data)}")

- We have removed `12` rows having null values in the `Current Ver`, `Android Ver`, `Category`, `Type` and `Genres` columns.

In [None]:
# let's check the null values again
data.isnull().sum().sort_values(ascending=False)

---
## **Observations**
- Only the `Rating` and `Size_in_bytes` columns are left with missing values.
  - We know that we have to be careful while dealing with the `Rating` column, as it is directly linked with the `Installs` column.
  - For the size column, we already know that the nulls come from the `Varies with device` values. We do not need to impute them at the moment, as every app has a different size and it cannot be predicted reliably.
---

In [None]:
data.columns

In [None]:
# use groupby function to find the trend of Rating in each Installs_category
data.groupby('Installs_category')['Rating'].describe()

In [None]:
data['Rating'].isnull().sum()

In [None]:
data['Installs_category'].loc[data['Rating'].isnull()].value_counts()

- Let's plot this and have a look

In [None]:
# plot the boxplot of Rating in each Installs_category
plt.figure(figsize=(16, 6)) # make figure size
sns.boxplot(x='Installs_category', y='Rating', hue='Installs_category', data=data) # plot the boxplot
# add the text of number of null values in each category
plt.text(0, 3.5, 'Null values: 14')
plt.text(1, 3.5, 'Null values: 874')
plt.text(2, 3.5, 'Null values: 86')
plt.text(3, 3.5, 'Null values: 31')
plt.text(4, 3.5, 'Null values: 3')
plt.text(5, 3.5, 'Null values: 0')
plt.text(6, 3.5, 'Null values: 0')
plt.text(7, 3.5, 'Null values: 0')
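
The annotation values in the plot above are typed in by hand. As a sketch, they could instead be derived from the data, so they stay in sync if rows are dropped. The frame below is a hypothetical stand-in for `data` with only three of the categories.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `data`
labels = ['no', 'Very low', 'Low']
df = pd.DataFrame({
    'Installs_category': pd.Categorical(
        ['no', 'Very low', 'Very low', 'Low', 'Low', 'Low'],
        categories=labels, ordered=True),
    'Rating': [np.nan, np.nan, np.nan, 4.5, np.nan, 3.8],
})

# Count missing ratings per category; observed=False keeps empty categories
null_counts = (
    df['Rating'].isnull()
      .groupby(df['Installs_category'], observed=False)
      .sum()
)

# Build one annotation string per category, in category order
annotations = [f"Null values: {int(null_counts[label])}" for label in labels]
print(annotations)
```

Each string could then be placed with `plt.text(i, 3.5, text)` while looping over `enumerate(annotations)`.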

In [None]:
data['Installs_category'].loc[data['Reviews'].isnull()].value_counts()

- There are no null values in the `Reviews` column

In [None]:
# let's plot the same plots for Reviews column as well
plt.figure(figsize=(16, 6)) # make figure size
sns.boxplot(x='Installs_category', y= 'Reviews', data=data) # plot the boxplot

- We also draw scatter plots of the `Rating` and `Reviews` columns against the `Installs` column

In [None]:
# Draw a scatter plot between Rating, Reviews and Installs
plt.figure(figsize=(16, 6)) # make figure size
sns.scatterplot(x='Rating', y='Reviews', hue='Installs_category', data=data) # plot the scatter plot

In [None]:
# plot reviews and installs in a scatter plot
plt.figure(figsize=(16, 6)) # make figure size
sns.scatterplot(x='Reviews', y='Installs', data=data) # plot the scatter plot

---
## **Observation**
- We can see that most of the null values in the `Rating` column belong to apps in the `no` to `Moderate` install categories, which makes sense: an app with few installations also has few ratings and reviews.

---

### **2.4. Duplicates**

* Removing duplicates is one of the most important parts of the data wrangling process; we must remove them in order to get correct insights from the data.
* If duplicates are left in a dataset, they can lead to incorrect insights and analysis.
* Duplicates can skew statistical measures such as the mean, median, and standard deviation, and can also lead to over-representation of certain data points.
* Removing them helps ensure the accuracy and reliability of the analysis.


In [None]:
data.duplicated().sum()

* Let's check the number of duplicates in each column using a for loop and print the output

In [None]:
for col in data.columns:
    print(f"Number of duplicates in the {col} column: {data[col].duplicated().sum()}")

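- Per-column counts are not meaningful here, because values such as a category or a rating legitimately repeat across apps. The reliable way to find duplicates is to compare entire rows.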

In [None]:
print(f"Number of duplicated rows in the whole dataset: {data.duplicated().sum()}")

- Let's inspect the duplicated entries to check whether they are real duplicates

In [None]:
# show every row whose app name appears more than once, sorted by name
data[data['App'].duplicated(keep=False)].sort_values(by='App')

* Remove all duplicated rows

In [None]:
data.drop_duplicates(inplace=True)

In [None]:
# print the number of rows and columns after removing duplicates
print(f"Number of rows after removing duplicates: {data.shape[0]}")

- We have now removed 483 duplicate rows from the dataset, leaving 10346 rows.
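
To make the difference between the duplicate checks used above concrete, here is a small sketch on a hypothetical frame. It contrasts full-row duplicates, which `drop_duplicates()` removes by default, with rows that merely share an app name.

```python
import pandas as pd

# Hypothetical miniature frame: rows 0 and 1 are identical,
# row 2 shares the app name but differs in Reviews
df = pd.DataFrame({
    'App': ['Notes', 'Notes', 'Notes', 'Maps'],
    'Category': ['TOOLS', 'TOOLS', 'TOOLS', 'TRAVEL'],
    'Reviews': [100, 100, 120, 900],
})

# Full-row duplicates: only the repeated copy of row 0 is flagged
full_row_dupes = df.duplicated().sum()

# Name-only duplicates: every later row that repeats an app name is flagged
name_dupes = df.duplicated(subset='App').sum()

# drop_duplicates() with no arguments removes full-row duplicates only
deduped = df.drop_duplicates()
print(full_row_dupes, name_dupes, len(deduped))
```

In this sketch the app `Notes` still appears twice after `drop_duplicates()`, because its remaining rows differ in `Reviews`. Whether such rows should also be merged in the real dataset is a separate decision that the notebook does not make here.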

---