# Exploratory Analysis

After obtaining a clean dataset, I will explore it. My exploratory analysis aims to give me a better understanding of the data.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sb
from scipy import stats
from scipy.stats import ttest_ind

In [2]:
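# Load the cleaned dataset produced in the previous step.
GooglePlay = pd.read_csv("C:/Users/brend/Documents/GitHub/Play-Store-Exploratory-Analysis/Datasets/CleanData.csv")
# Drop the first column, which holds the row index saved with the CSV.
GooglePlay = GooglePlay.iloc[:,1:]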

In [3]:
GooglePlay.head()

Unnamed: 0,category,reviews,installs,price,content_rating,update_year,ratingR
0,ART_AND_DESIGN,159,10000,0.0,Everyone,2018,4.1
1,ART_AND_DESIGN,967,500000,0.0,Everyone,2018,3.9
2,ART_AND_DESIGN,87510,5000000,0.0,Everyone,2018,4.7
3,ART_AND_DESIGN,215644,50000000,0.0,Teen,2018,4.5
4,ART_AND_DESIGN,967,100000,0.0,Everyone,2018,4.3


In [17]:
# Used to look at the data type of each column.

In [18]:
GooglePlay.info()    

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9658 entries, 0 to 9657
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   category        9658 non-null   object 
 1   reviews         9658 non-null   int64  
 2   installs        9658 non-null   int64  
 3   price           9658 non-null   float64
 4   content_rating  9658 non-null   object 
 5   update_year     9658 non-null   int64  
 6   ratingR         9658 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 528.3+ KB


It might be necessary to change the data type for some analyses, but I will move forward with my exploratory analysis.
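As a minimal sketch of what such a type change could look like, the two `object` columns can be converted to the pandas categorical dtype. The `sample` frame below is a hypothetical stand-in for `GooglePlay`, built from a few values in the `head()` output; it is not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of GooglePlay (values from the head() output).
sample = pd.DataFrame({
    "category": ["ART_AND_DESIGN", "ART_AND_DESIGN", "ART_AND_DESIGN"],
    "content_rating": ["Everyone", "Everyone", "Teen"],
    "installs": [10000, 500000, 50000000],
})

# Convert the two text columns from object to the categorical dtype.
# The numeric column is left untouched.
converted = sample.astype({"category": "category", "content_rating": "category"})
print(converted.dtypes)
```

Whether this conversion is worthwhile depends on the analysis; it mainly helps with memory use and with methods that expect categorical inputs.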

In [19]:
# Shows how many rows and columns the dataset has.

In [20]:
print(f'Rows: {GooglePlay.shape[0]}')
print(f'Columns: {GooglePlay.shape[1]}')  

Rows: 9658
Columns: 7


In [21]:
# Shows the range of years the apps were last updated.

In [6]:
GooglePlay.update_year.unique() 

array([2018, 2017, 2014, 2016, 2015, 2013, 2012, 2011, 2010], dtype=int64)

In [7]:
GooglePlay.update_year.value_counts()

2018    6283
2017    1794
2016     779
2015     449
2014     203
2013     108
2012      26
2011      15
2010       1
Name: update_year, dtype: int64

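In this dataset, every app was last updated between 2010 and 2018, and most of them (6,283 of 9,658, about 65%) were last updated in 2018. Note that `update_year` records the last update, not the year the app was created.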
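To put the yearly counts on a common scale, they can be turned into shares of the whole dataset. The sketch below rebuilds the counts by hand from the `value_counts()` output above so that it runs on its own; `year_counts` and `year_share` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
year_counts = pd.Series(
    {2018: 6283, 2017: 1794, 2016: 779, 2015: 449, 2014: 203,
     2013: 108, 2012: 26, 2011: 15, 2010: 1},
    name="update_year",
)

# Share of apps whose last update falls in each year.
year_share = year_counts / year_counts.sum()
print(year_share.round(3))
```

On the full frame, `GooglePlay.update_year.value_counts(normalize=True)` should give the same shares directly.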

In [22]:
# Looks at the different content ratings

In [8]:
GooglePlay.content_rating.unique() 

array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

In [23]:
 # Each category is shown along with how many values it has.

In [9]:
GooglePlay.content_rating.value_counts()

Everyone           7903
Teen               1036
Mature 17+          393
Everyone 10+        321
Adults only 18+       3
Unrated               2
Name: content_rating, dtype: int64

In this dataset, most apps are rated 'Everyone', only three are rated 'Adults only 18+', and only two are unrated. If an analysis uses content ratings, 'Adults only 18+' and 'Unrated' can be grouped with other ratings.
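One possible way to group the rare ratings is a simple replacement mapping. This is only a sketch: the `ratings` series is a hypothetical stand-in for `GooglePlay.content_rating`, and the choice of target level for each rare rating (the `rare_to_common` dictionary) is an assumption, not something the data dictates.

```python
import pandas as pd

# Hypothetical stand-in for GooglePlay.content_rating.
ratings = pd.Series(
    ["Everyone", "Teen", "Adults only 18+", "Unrated", "Mature 17+", "Everyone"],
    name="content_rating",
)

# Assumed mapping: fold each rare level into a more common one.
# Which level a rare rating should join is a judgment call.
rare_to_common = {"Adults only 18+": "Mature 17+", "Unrated": "Everyone"}

# Values not listed in the mapping are left unchanged.
grouped = ratings.replace(rare_to_common)
print(grouped.value_counts())
```

An alternative would be to map both rare levels to a single catch-all label such as 'Other', which avoids guessing where unrated apps belong.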

In [24]:
 # Each category grouped along with how many values it has.

In [10]:
GooglePlay.category.value_counts()

FAMILY                 1831
GAME                    959
TOOLS                   827
BUSINESS                420
MEDICAL                 395
PERSONALIZATION         376
PRODUCTIVITY            374
LIFESTYLE               369
FINANCE                 345
SPORTS                  325
COMMUNICATION           315
HEALTH_AND_FITNESS      288
PHOTOGRAPHY             281
NEWS_AND_MAGAZINES      254
SOCIAL                  239
BOOKS_AND_REFERENCE     222
TRAVEL_AND_LOCAL        219
SHOPPING                202
DATING                  171
VIDEO_PLAYERS           163
MAPS_AND_NAVIGATION     131
EDUCATION               119
FOOD_AND_DRINK          112
ENTERTAINMENT           102
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       84
WEATHER                  79
HOUSE_AND_HOME           74
EVENTS                   64
ART_AND_DESIGN           64
PARENTING                60
COMICS                   56
BEAUTY                   53
Name: category, dtype: int64

The 'Family', 'Game', and 'Tools' categories contain the most apps in this dataset.

In [25]:
# Shows each rating and how many apps have that rating.

In [11]:
GooglePlay.ratingR.value_counts() 

4.300000    907
4.400000    895
4.500000    848
4.200000    810
4.600000    683
           ... 
1.400000      3
1.500000      3
4.181481      2
4.364407      1
1.200000      1
Name: ratingR, Length: 70, dtype: int64

In [12]:
GooglePlay.price.value_counts()

0.00      8902
0.99       145
2.99       124
1.99        73
4.99        70
          ... 
18.99        1
389.99       1
19.90        1
1.75         1
1.04         1
Name: price, Length: 92, dtype: int64

In [26]:
# Used to look at the mean, std, min, max, etc. of continuous values

In [13]:
GooglePlay.describe()         

Unnamed: 0,reviews,installs,price,update_year,ratingR
count,9658.0,9658.0,9658.0,9658.0,9658.0
mean,216615.0,7778312.0,1.099413,2017.34562,4.172229
std,1831413.0,53761000.0,16.853021,1.13764,0.495618
min,0.0,0.0,0.0,2010.0,1.0
25%,25.0,1000.0,0.0,2017.0,4.0
50%,967.0,100000.0,0.0,2018.0,4.2
75%,29408.0,1000000.0,0.0,2018.0,4.5
max,78158310.0,1000000000.0,400.0,2018.0,5.0


In [27]:
# Shows each category with its average rating.

In [14]:
GooglePlay.groupby('category')['ratingR'].mean().sort_values(ascending=False) 

category
EVENTS                 4.435556
EDUCATION              4.364407
ART_AND_DESIGN         4.357377
BOOKS_AND_REFERENCE    4.344970
PERSONALIZATION        4.332215
PARENTING              4.300000
BEAUTY                 4.278571
GAME                   4.247368
SOCIAL                 4.247291
WEATHER                4.243056
HEALTH_AND_FITNESS     4.243033
SHOPPING               4.230000
SPORTS                 4.216154
AUTO_AND_VEHICLES      4.190411
PRODUCTIVITY           4.183389
COMICS                 4.181481
FAMILY                 4.179664
LIBRARIES_AND_DEMO     4.178125
FOOD_AND_DRINK         4.172340
MEDICAL                4.166552
PHOTOGRAPHY            4.157414
HOUSE_AND_HOME         4.150000
ENTERTAINMENT          4.135294
NEWS_AND_MAGAZINES     4.121569
COMMUNICATION          4.121484
FINANCE                4.115563
BUSINESS               4.098479
LIFESTYLE              4.093355
TRAVEL_AND_LOCAL       4.069519
VIDEO_PLAYERS          4.044595
TOOLS                  4.039554

The categories with the highest ratings are 'Events','Education', and 'Art_and_Design'.
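Because category sizes differ widely (from 53 to 1,831 apps in the counts shown earlier), it may help to read each average next to the number of apps behind it. The sketch below does this with a hypothetical `sample` frame that reuses the `category` and `ratingR` column names; the rating values are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as GooglePlay.
sample = pd.DataFrame({
    "category": ["EVENTS", "EVENTS", "TOOLS", "TOOLS", "TOOLS"],
    "ratingR": [4.5, 4.3, 4.0, 4.1, 3.9],
})

# Mean rating and number of apps per category, highest mean first.
summary = (
    sample.groupby("category")["ratingR"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

A high mean from a small category is less stable than the same mean from a large one, so the `count` column gives useful context for the ranking.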

In [28]:
# Shows each category with total installs.

In [15]:
GooglePlay.groupby('category')['installs'].sum().sort_values(ascending=False) 

category
GAME                   13878924415
COMMUNICATION          11038276251
TOOLS                   8001771915
PRODUCTIVITY            5793091369
SOCIAL                  5487867902
PHOTOGRAPHY             4649147655
FAMILY                  4427941505
VIDEO_PLAYERS           3926902720
TRAVEL_AND_LOCAL        2894887146
NEWS_AND_MAGAZINES      2369217760
ENTERTAINMENT           2113660000
BOOKS_AND_REFERENCE     1665969576
PERSONALIZATION         1532494782
SHOPPING                1400348785
HEALTH_AND_FITNESS      1144022512
SPORTS                  1096474498
BUSINESS                 697164865
LIFESTYLE                503823539
MAPS_AND_NAVIGATION      503281890
FINANCE                  455348734
WEATHER                  361100520
EDUCATION                352952000
FOOD_AND_DRINK           211798751
DATING                   140926107
ART_AND_DESIGN           114338100
HOUSE_AND_HOME            97212461
AUTO_AND_VEHICLES         53130211
LIBRARIES_AND_DEMO        52995910
COMICS     

The categories with the most installs are 'Game', 'Communication', and 'Tools'.

In [29]:
# Shows total reviews for each category. 

In [16]:
GooglePlay.groupby('category')['reviews'].sum().sort_values(ascending=False) 

category
GAME                   622298709
COMMUNICATION          285811368
TOOLS                  229356578
SOCIAL                 227927801
FAMILY                 143825488
PHOTOGRAPHY            105351270
VIDEO_PLAYERS           67484568
PRODUCTIVITY            55590649
PERSONALIZATION         53543080
SHOPPING                44551730
SPORTS                  35348813
ENTERTAINMENT           34762650
TRAVEL_AND_LOCAL        26819741
NEWS_AND_MAGAZINES      23130228
HEALTH_AND_FITNESS      21361355
MAPS_AND_NAVIGATION     17729148
BOOKS_AND_REFERENCE     16721314
EDUCATION               13364148
FINANCE                 12662106
WEATHER                 12295164
LIFESTYLE               11832671
BUSINESS                 9890245
FOOD_AND_DRINK           6325028
DATING                   3623544
COMICS                   2342071
HOUSE_AND_HOME           1929847
ART_AND_DESIGN           1419203
MEDICAL                  1182971
AUTO_AND_VEHICLES        1163666
PARENTING                 958331
L

The categories with the most reviews are 'Game', 'Communication', and 'Tools'.
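The install and review totals can also be computed in one pass, which makes it easier to compare the two rankings side by side. This is a sketch on a hypothetical `apps` frame with made-up numbers; only the column names match `GooglePlay`.

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as GooglePlay.
apps = pd.DataFrame({
    "category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY"],
    "installs": [1000, 5000, 100, 200, 50],
    "reviews": [10, 40, 1, 3, 2],
})

# Sum installs and reviews per category, ordered by total installs.
totals = (
    apps.groupby("category")[["installs", "reviews"]]
    .sum()
    .sort_values("installs", ascending=False)
)
print(totals)
```

Sorting by `"reviews"` instead gives the review ranking from the same table.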