# Google Play Dataset

In this notebook, I clean the Google Play dataset and carry out some exploratory analysis. The first step is to import the necessary packages.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sb
from scipy import stats
from scipy.stats import ttest_ind

The second step is to load the data.

In [2]:
GP = pd.read_csv("C:/Users/brend/Documents/GitHub/Play-Store-Exploratory-Analysis/Google_data_cleaned.csv")
GP = GP.iloc[:, 1:]  # Drops the leading "Unnamed: 0" column left over from a saved index.
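A leading "Unnamed: 0" column usually means the file was written with its row index included. As a minimal sketch, assuming a tiny in-memory CSV as a stand-in for the real file (the rows and the `csv_text` name are hypothetical), the following compares dropping the column after loading with reading it as the index via `index_col=0`:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the real CSV. The empty first header
# cell is what pandas labels "Unnamed: 0" on load.
csv_text = ",app,rating\n0,App A,4.1\n1,App B,3.9\n"

# Approach used in this notebook: load everything, then drop the first column.
dropped = pd.read_csv(io.StringIO(csv_text)).iloc[:, 1:]

# Alternative: treat the first column as the row index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(dropped.columns))
print(list(indexed.columns))
```

Both approaches leave the same data columns. Which one to use is a matter of taste, although `index_col=0` avoids relying on the column's position after loading.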

In [3]:
GP.head() # Takes a look at the first 5 rows of data.

,app,category,rating,reviews,installs,type,price,content_rating,genres,current_ver,android_ver,size(kb),update_month,update_year
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,10000,0,0.0,Everyone,Art & Design,1.0.0,4.0.3,19000.0,1,2018
1,Coloring book moana,ART_AND_DESIGN,3.9,967,500000,0,0.0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3,14000.0,1,2018
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,5000000,0,0.0,Everyone,Art & Design,1.2.4,4.0.3,8.7,8,2018
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,50000000,0,0.0,Teen,Art & Design,,4.2,25000.0,6,2018
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,100000,0,0.0,Everyone,Art & Design;Creativity,1.1,4.4,2.8,6,2018


# Data Wrangling

Before the exploratory analysis, the data needs some cleaning.

In [5]:
GP.columns # Takes a look at all the columns in our dataset.

Index(['app', 'category', 'rating', 'reviews', 'installs', 'type', 'price',
       'content_rating', 'genres', 'current_ver', 'android_ver', 'size(kb)',
       'update_month', 'update_year'],
      dtype='object')

In [6]:
# 'genres' largely repeats 'category'; the other dropped columns are not needed for this analysis.
GP1 = GP.drop(['current_ver', 'genres', 'app', 'type', 'android_ver', 'size(kb)', 'update_month'], axis=1)
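Dropping columns and selecting the columns to keep are two ways of expressing the same step. The sketch below uses the column names from the `GP.columns` output on a one-row dummy frame (the `toy` frame and its empty values are placeholders, not the real data) to check that both give the same result:

```python
import pandas as pd

# Column names copied from the GP.columns output; the single row is dummy data.
all_cols = ['app', 'category', 'rating', 'reviews', 'installs', 'type', 'price',
            'content_rating', 'genres', 'current_ver', 'android_ver', 'size(kb)',
            'update_month', 'update_year']
toy = pd.DataFrame([[None] * len(all_cols)], columns=all_cols)

drop_cols = ['current_ver', 'genres', 'app', 'type', 'android_ver',
             'size(kb)', 'update_month']
keep_cols = ['category', 'rating', 'reviews', 'installs', 'price',
             'content_rating', 'update_year']

by_drop = toy.drop(columns=drop_cols)  # remove what is not needed
by_keep = toy[keep_cols]               # or name what is needed

same_columns = list(by_drop.columns) == list(by_keep.columns)
print(same_columns)
```

Listing the columns to keep can be easier to read when most of the columns are being discarded.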

In [7]:
GP1.head()

,category,rating,reviews,installs,price,content_rating,update_year
0,ART_AND_DESIGN,4.1,159,10000,0.0,Everyone,2018
1,ART_AND_DESIGN,3.9,967,500000,0.0,Everyone,2018
2,ART_AND_DESIGN,4.7,87510,5000000,0.0,Everyone,2018
3,ART_AND_DESIGN,4.5,215644,50000000,0.0,Teen,2018
4,ART_AND_DESIGN,4.3,967,100000,0.0,Everyone,2018


In [8]:
null_values = GP1.isna().sum().sort_values(ascending=False) # Takes a look at null values
null_values.head(10)

rating            1462
category             0
reviews              0
installs             0
price                0
content_rating       0
update_year          0
dtype: int64
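Raw null counts are easier to judge as a share of rows, and it can help to see whether the missing ratings cluster in particular categories. A small sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: category A has 1 of 2 ratings missing, B has 2 of 3.
toy = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "rating": [4.0, np.nan, 3.0, np.nan, np.nan],
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = toy.isna().mean().sort_values(ascending=False)

# Fraction of missing ratings within each category.
missing_by_category = toy["rating"].isna().groupby(toy["category"]).mean()

print(missing_share)
print(missing_by_category)
```

If the missing share varies a lot across categories, a per-category fill is likely to behave differently from a single global fill.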

In [9]:
print(f'Rows: {GP1.shape[0]}') # Number of rows and columns in the data.
print(f'Columns: {GP1.shape[1]}')

Rows: 9658
Columns: 7


In [10]:
# Replace null ratings with the mean rating of each category.

In [11]:
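# transform('mean') keeps the original row index, so the fill aligns row by row (groupby.apply can return a MultiIndex in newer pandas).
GP1['ratingR'] = GP1['rating'].fillna(GP1.groupby('category')['rating'].transform('mean'))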
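To make the per-category fill concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical). It uses `groupby(...).transform("mean")`, which is one way to broadcast each category's mean rating back onto the original rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "rating": [4.0, np.nan, 5.0, 3.0, np.nan],
})

# Mean rating of each row's category, aligned to the original index.
# NaN values are skipped when the mean is computed.
category_mean = toy.groupby("category")["rating"].transform("mean")

# Fill only the missing ratings; observed ratings are left unchanged.
toy["ratingR"] = toy["rating"].fillna(category_mean)

print(toy["ratingR"].tolist())  # [4.0, 4.5, 5.0, 3.0, 3.0]
```

A category with no observed ratings at all would still be left as NaN by this approach, so it is worth re-checking the null count afterwards.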

In [12]:
# dropping rating column

In [13]:
GP2 = GP1.drop(['rating'], axis=1)

In [15]:
# check null values and new column

In [16]:
GP2.isna().sum()

category          0
reviews           0
installs          0
price             0
content_rating    0
update_year       0
ratingR           0
dtype: int64

In [17]:
GP2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9658 entries, 0 to 9657
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   category        9658 non-null   object 
 1   reviews         9658 non-null   int64  
 2   installs        9658 non-null   int64  
 3   price           9658 non-null   float64
 4   content_rating  9658 non-null   object 
 5   update_year     9658 non-null   int64  
 6   ratingR         9658 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 528.3+ KB


In [18]:
# Export clean dataset.

In [19]:
GP2.to_csv("C:/Users/brend/OneDrive/Desktop/Data Sciences SCI/Google Play/CleanData.csv")
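By default, `to_csv` also writes the row index, which comes back as an "Unnamed: 0" column when the file is read again. The sketch below shows the difference using in-memory buffers instead of real files (the `toy` frame and the buffer names are hypothetical):

```python
import io

import pandas as pd

toy = pd.DataFrame({"category": ["A", "B"], "ratingR": [4.5, 3.0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'category', 'ratingR']
print(cols_without_index)  # ['category', 'ratingR']
```

Passing `index=False` on export is one way to avoid dropping the extra column each time the file is loaded.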