# Google Playstore Data Cleaning
* In this notebook i perform data changes and preprocesiing on Google Playstore data.
* The cleaned data will be used to build a power BI dashboard for visual insights.

# Importing Required Libararies and Loading the Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "whitegrid")
playstore_datas = pd.read_csv("google_play_store_dataset.csv")
playstore_datas

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


# Data Cleaning Steps 
* Handling missing values
  

In [33]:
playstore_datas.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

* Drop duplicates

In [34]:
playstore_datas.drop_duplicates(inplace = True)

* Format Installs,price columns
* convert Installs,price to numeric

In [35]:
playstore_datas['Installs'] = playstore_datas['Installs'].str.replace('[+]','',regex=True)
playstore_datas['Installs'] = pd.to_numeric(playstore_datas['Installs'],errors = 'coerce')


In [36]:
playstore_datas['Price'] = playstore_datas['Price'].str.replace('$','',regex=True)
playstore_datas['Price'] = pd.to_numeric(playstore_datas['Price'],errors = 'coerce')

## Standardizing size column (MB conversion)
### The size column has values like 23M,1.5.So we define a custome function to,
* Remove Text
* Convert M to MB directly
* Replace varies with device with NAN

In [37]:
def convert_size(size):
    if 'M' in size:
        return float(size.replace('M',''))
    elif 'k' in size:
        return float(size.replace('k','')) / 1024
    else:
        return np.nan

playstore_datas['Size'] = playstore_datas['Size'].replace('Varies with device', np.nan)
playstore_datas['Size'] = playstore_datas['Size'].dropna().apply(convert_size)

In [4]:
playstore_datas.dropna(subset=['Rating','Installs','Price'],inplace=True)

In [40]:
playstore_datas.describe()

Unnamed: 0,Rating,Size,Installs,Price
count,503.0,494.0,503.0,503.0
mean,4.37674,12.679041,214.425447,0.0
std,0.881187,14.816607,200.115778,0.0
min,1.0,0.053711,1.0,0.0
25%,4.0,3.4,100.0,0.0
50%,4.7,7.3,100.0,0.0
75%,5.0,16.0,500.0,0.0
max,5.0,94.0,500.0,0.0


### Fetching some random columns to watch 
#### Performing sorting also 

In [None]:
playstore_datas[['App','Installs','Price','Size','Rating']].head(10)

In [43]:
top_installed_apps = playstore_datas.sort_values(by = 'Installs',ascending = False).head(10)
top_installed_apps[['App','Installs']]

Unnamed: 0,App,Installs
605,Cardi B Live Stream Video Chat - Prank,500.0
8859,DT Baby Cam,500.0
6474,Mobilight-BM,500.0
9059,MBU DX Cluster,500.0
9055,Human Dx,500.0
9042,Tricky Bike Stunt Rider DX,500.0
6495,Word Hunt,500.0
8998,DW Witness,500.0
6557,Casual Sic Bo (骰寶),500.0
6602,Wireless BP,500.0


In [44]:
playstore_datas['Category'].value_counts().head(10)

Category
FAMILY                108
BUSINESS               44
MEDICAL                41
TOOLS                  41
LIFESTYLE              32
GAME                   22
PRODUCTIVITY           22
HEALTH_AND_FITNESS     20
COMMUNICATION          17
FINANCE                16
Name: count, dtype: int64

In [45]:
avg_rating = playstore_datas.groupby('Category')['Rating'].mean().sort_values(ascending = False)
avg_rating.head(10)

Category
ART_AND_DESIGN         5.000000
PARENTING              5.000000
BOOKS_AND_REFERENCE    4.833333
FOOD_AND_DRINK         4.775000
HEALTH_AND_FITNESS     4.675000
EVENTS                 4.672727
TRAVEL_AND_LOCAL       4.642857
SOCIAL                 4.625000
NEWS_AND_MAGAZINES     4.571429
SHOPPING               4.566667
Name: Rating, dtype: float64

In [46]:
playstore_datas['Content Rating'].value_counts()

Content Rating
Everyone        436
Teen             40
Mature 17+       23
Everyone 10+      4
Name: count, dtype: int64

In [47]:
top_installs = playstore_datas.drop_duplicates(subset='App')
top_installs = top_installs.sort_values(by='Installs', ascending=False).head(10)
top_installs[['App','Installs']]

Unnamed: 0,App,Installs
605,Cardi B Live Stream Video Chat - Prank,500.0
8859,DT Baby Cam,500.0
6474,Mobilight-BM,500.0
9059,MBU DX Cluster,500.0
9055,Human Dx,500.0
9042,Tricky Bike Stunt Rider DX,500.0
6495,Word Hunt,500.0
8998,DW Witness,500.0
6557,Casual Sic Bo (骰寶),500.0
6602,Wireless BP,500.0


#### We review the cleaned dataset with describe() to verify proper formatting and no missing values

In [48]:
playstore_datas['Reviews'].describe()

count     503
unique     44
top         2
freq       62
Name: Reviews, dtype: object

Export Clean data to csv

In [7]:
playstore_datas.to_csv('Cleaned_playstore_data.csv',index = False) # I can used this code to create the cleaned playstore data file