<a href="https://colab.research.google.com/github/SairajNeelam/EDA---Google-Play-Store/blob/main/EDA_Google_Play_Store_App.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem Statement :**

### <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

### <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.</b>

### <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

## **About Google Play :**

> ### ▶ **Google Play is an online store where people go to find and enjoy their favorite apps, games, movies, TV shows, books, and more on their Android devices.**


> ### ▶ **Google Play offers millions of apps, games, and other content for people to choose from. The store provides a search function, editorial content, user reviews, and more to help people find the best content for their individual needs.**

> ### ▶ **To be successful, Google Play must work to meet the needs of both users and developers. To help users find high quality, engaging apps Google works hard to ensure the Play Store is safe, secure, and convenient.**

## **Why is Exploratory Data Analysis important?**

> **EDA is a way of visualizing, summarizing, and interpreting the information hidden in row-and-column data. It is one of the crucial steps in data science: it yields the insights and statistical measures that are essential for business continuity, stakeholders, and data scientists. It also helps define and refine the selection of important feature variables that will be used in a model.**




## **Content Table :**

### **1. Exploring the Data**
### **2. Data Wrangling (Check sanity of data and clean the data)**
### **3. Feature Engineering**
### **4. Data Visualization**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# importing libraries
import pandas as pd               # for data manipulation
import numpy as np                # for mathematical operations and linear algebra
import matplotlib.pyplot as plt   # for data visualization
# %matplotlib inline                # It is a magic function that renders the figure in a notebook
import seaborn as sns             # for data visualization

In [3]:
GPStore = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Capstone Project/Datasets/Google Play Store/Play Store Data.csv')
user_review = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Capstone Project/Datasets/Google Play Store/User Reviews.csv')

In [4]:
GPStore.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [5]:
user_review.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [6]:
print(GPStore.shape)
print(user_review.shape)

(10841, 13)
(64295, 5)


## **GPStore Data**

In [7]:
# prints a summary of the dataframe rows and columns, including information on the datatypes and non-null values
GPStore.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [8]:
GPStore.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


## ***Data Exploration and Data Wrangling***


> Explore all columns one by one and check for invalid data and clean data accordingly.



In [9]:
list(GPStore.columns)

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

### Column ▶ App



In [10]:
# get the count/frequency of all the unique values of the specified column
GPStore['App'].value_counts()

ROBLOX                                               9
CBS Sports App - Scores, News, Stats & Watch Live    8
ESPN                                                 7
Candy Crush Saga                                     7
8 Ball Pool                                          7
                                                    ..
Narrator's Voice                                     1
Random Love (BF)                                     1
17th Edition Cable Sizer                             1
L.O.L. Surprise Ball Pop                             1
Despegar.com Hotels and Flights                      1
Name: App, Length: 9660, dtype: int64

**As we can see, the same app appears multiple times, so we need to remove the duplicate rows.**

In [11]:
# display the shape of the dataframe, i.e., the no. of rows and columns 
print(GPStore.shape)

# remove the duplicate values from the dataframe, specifying the column name in the subset parameter 
GPStore = GPStore.drop_duplicates(subset=['App'], keep = 'first')

# display the shape of the dataframe, i.e., the no. of rows and columns 
print(GPStore.shape)

(10841, 13)
(9660, 13)
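
Note that `keep='first'` retains whichever copy of an app happens to come first in the file. As a minimal sketch on made-up data, assuming one would rather keep the copy with the most reviews (and that `Reviews` is already numeric, which it is not yet at this point in the notebook), the deduplication could look like this. The toy frame and its numbers are hypothetical.

```python
import pandas as pd

# hypothetical toy data: 'App A' appears twice with different review counts
toy = pd.DataFrame({
    "App": ["App A", "App A", "App B"],
    "Reviews": [100, 250, 40],
})

# number of rows whose App name has already appeared earlier
n_dupes = int(toy.duplicated(subset=["App"]).sum())

# sort so the most-reviewed copy comes first, drop the rest, restore row order
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset=["App"], keep="first")
       .sort_index()
)

print(n_dupes)
print(deduped)
```

Which copy to keep is a judgment call; the notebook's `keep='first'` is the simpler choice and is what the rest of the analysis is based on.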


### Column ▶ Category

In [12]:
# get all the unique values present in the specified column
GPStore.Category.unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       '1.9'], dtype=object)

**In the 'Category' column we have one value as '1.9' which seems to be invalid. Let's have a look at that data entry.**

In [13]:
# dataframe filtering based on a condition
GPStore[GPStore.Category == '1.9']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


**We observe that the column values are shifted left by one position, i.e. 1.9 should be the rating, 19.0 the number of reviews, 3.0M the size, and so on.**

**There is no category for this app, so we can drop this row.**

In [14]:
# remove the row with the specified index; axis 0 implies along the rows; axis 1 along the columns
GPStore=GPStore.drop([10472],axis=0)

In [15]:
# display the shape of the dataframe, i.e., the no. of rows and columns 
GPStore.shape

(9659, 13)
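
Dropping the row is the simplest fix. If such a record were worth keeping, the shifted values could instead be moved back into place. The sketch below illustrates the idea on a shortened, hypothetical row; `realign_row` is an illustrative helper, not part of the notebook or of pandas.

```python
def realign_row(values, missing_index, fill=None):
    """Insert `fill` where a field is missing and drop the trailing padding value."""
    values = list(values)
    return values[:missing_index] + [fill] + values[missing_index:-1]


# hypothetical, shortened version of the shifted record
columns = ["App", "Category", "Rating", "Reviews", "Size"]
shifted = ["Photo Frame App", "1.9", "19", "3.0M", None]

# the 'Category' field (position 1) is the one that went missing
fixed = dict(zip(columns, realign_row(shifted, missing_index=1)))
print(fixed)
```

The realigned values are still strings and would need the same type conversions as the rest of the column.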

### Column ▶ Rating

**Rating column is the only feature having float datatype, so let us check statistical summary for it**

In [16]:
# statistical summary of the specified numerical variable
GPStore['Rating'].describe()

count    8196.000000
mean        4.173243
std         0.536625
min         1.000000
25%         4.000000
50%         4.300000
75%         4.500000
max         5.000000
Name: Rating, dtype: float64

In [17]:
GPStore['Rating'].unique()

array([4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.2, 4.6, 3.2, 4. , nan, 4.8,
       4.9, 3.6, 3.7, 3.3, 3.4, 3.5, 3.1, 5. , 2.6, 3. , 1.9, 2.5, 2.8,
       2.7, 1. , 2.9, 2.3, 2.2, 1.7, 2. , 1.8, 2.4, 1.6, 2.1, 1.4, 1.5,
       1.2])

**All rating values fall within the range 1 to 5, so the 'Rating' column contains no invalid data. However, only 8196 ratings are present while the dataset has 9659 entries, which means the 'Rating' column has missing values. Let's check for them.**

In [18]:
# find the total no. of missing values present in the specified column
GPStore.Rating.isnull().sum()

1463

In [19]:
# let's see what percentage of data is missing from this feature
print(f'{round((GPStore.Rating.isnull().sum()*100)/GPStore.shape[0],2)}% of data is missing from Rating Column')

15.15% of data is missing from Rating Column


**We will impute the missing values later**
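
The same share of missing values can be computed for every column at once, since the mean of a boolean mask is the fraction of `True` entries. A small sketch on hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "Rating": [4.1, None, 3.9, None],
    "Type": ["Free", "Paid", None, "Free"],
})

# isnull() gives a boolean mask; its column-wise mean is the missing fraction
missing_pct = toy.isnull().mean().mul(100).round(2)
print(missing_pct)
```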

### Column ▶ Reviews

In [20]:
# displays frequency measures for a non-numerical column
GPStore.Reviews.describe()    # The datatype for the reviews column is string 

count     9659
unique    5330
top          0
freq       593
Name: Reviews, dtype: object

In [21]:
# check for any non numeric value 
GPStore.Reviews.str.isnumeric().sum()   

9659

In [22]:
# convert the 'Reviews' column to numeric
GPStore.Reviews=pd.to_numeric(GPStore.Reviews)
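
Here every value is numeric, so the conversion succeeds. When it does not, `pd.to_numeric(..., errors="coerce")` is one way to locate the offending entries. A sketch on hypothetical values:

```python
import pandas as pd

# hypothetical values; '3.0M' is not a valid review count
reviews = pd.Series(["159", "967", "3.0M", "87510"])

# unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(reviews, errors="coerce")

# select the original strings that failed to convert
bad_values = reviews[converted.isna()]
print(bad_values.tolist())
```

Entries that were already missing before the conversion would show up in `bad_values` as well, so they may need to be filtered out first.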

In [23]:
# statistical summary of the specified numerical variable
GPStore.Reviews.describe()

count    9.659000e+03
mean     2.165926e+05
std      1.831320e+06
min      0.000000e+00
25%      2.500000e+01
50%      9.670000e+02
75%      2.940100e+04
max      7.815831e+07
Name: Reviews, dtype: float64

### Column ▶ Size

In [24]:
# get the count/frequency of all the unique values of the specified column
GPStore.Size.value_counts()

Varies with device    1227
11M                    182
12M                    181
13M                    177
14M                    177
                      ... 
467k                     1
411k                     1
209k                     1
55k                      1
916k                     1
Name: Size, Length: 461, dtype: int64

**In the 'Size' column, values such as '11M' and '467k' give the app size in megabytes and kilobytes respectively. So we strip the 'M' and 'k' suffixes and express every size as a number in MB.**

In [25]:
# replace 'Varies with device' with '0' (a placeholder, since the actual size is unknown)
GPStore.Size = GPStore.Size.apply(lambda x: x.replace('Varies with device','0') if 'Varies with device' in x else x)

# strip the 'M' suffix (megabytes), leaving only the numeric part
GPStore.Size = GPStore.Size.apply(lambda x: x.replace('M','') if 'M' in x else x)

# values with a 'k' suffix are in kilobytes; convert them to MB
GPStore.Size = GPStore.Size.apply(lambda x: round(float(x.replace('k',''))/1024,1) if 'k' in x else x)

In [26]:
# convert to float datatype
GPStore.Size = GPStore.Size.apply(lambda x: float(x))
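
The three passes above can also be written as a single parsing function. The sketch below is an alternative, not the notebook's method: it assumes 'Varies with device' is better represented as NaN than as 0, so that unknown sizes do not pull the mean and quartiles down. `size_to_mb` is a hypothetical helper name.

```python
import math


def size_to_mb(size):
    """Convert a size string such as '19M' or '467k' to megabytes."""
    size = size.strip()
    if size == "Varies with device":
        return math.nan                      # unknown size, kept as missing
    if size.endswith("M"):
        return float(size[:-1])              # already in MB
    if size.endswith("k"):
        return round(float(size[:-1]) / 1024, 1)   # KB -> MB
    raise ValueError(f"unrecognized size value: {size!r}")


print(size_to_mb("19M"), size_to_mb("512k"), size_to_mb("Varies with device"))
```

Such a function could be passed to `Series.apply`. With NaN in place of 0, the summary statistics would differ from the ones shown in this notebook.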

In [27]:
GPStore.Size

0        19.0
1        14.0
2         8.7
3        25.0
4         2.8
         ... 
10836    53.0
10837     3.6
10838     9.5
10839     0.0
10840    19.0
Name: Size, Length: 9659, dtype: float64

In [28]:
# statistical summary of the specified numerical variable
GPStore.Size.describe()

count    9659.000000
mean       17.804058
std        21.495550
min         0.000000
25%         2.900000
50%         9.100000
75%        25.000000
max       100.000000
Name: Size, dtype: float64

In [29]:
GPStore=GPStore.rename(columns={'Size':'Size_in_MB'})   # rename the Size column to Size_in_MB 

### Column ▶ Installs

In [30]:
GPStore.Installs

0            10,000+
1           500,000+
2         5,000,000+
3        50,000,000+
4           100,000+
            ...     
10836         5,000+
10837           100+
10838         1,000+
10839         1,000+
10840    10,000,000+
Name: Installs, Length: 9659, dtype: object

**The Installs column shows the number of installations for an app. The values contain '+' and ',' characters, so we remove them and convert the column to numeric.**

In [31]:
# values are given as, for example, '1,000+'. Removes the '+' sign from the end of the string
GPStore.Installs=GPStore.Installs.apply(lambda x: x.strip('+'))

# numbers have commas in them, for eg., 100,000. Removes all the commas from the strings.
GPStore.Installs=GPStore.Installs.apply(lambda x: x.replace(',',''))

# get the count/frequency of all the unique values of the specified column
GPStore.Installs.value_counts()

1000000       1417
100000        1112
10000         1031
10000000       937
1000           888
100            710
5000000        607
500000         505
50000          469
5000           468
10             385
500            328
50             204
50000000       202
100000000      188
5               82
1               67
500000000       24
1000000000      20
0               15
Name: Installs, dtype: int64

In [32]:
# convert to numeric datatype
GPStore.Installs=pd.to_numeric(GPStore.Installs)
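
The same clean-up can be done in one vectorized step with a regular expression instead of two `apply` calls. A minimal sketch on a few sample strings in the dataset's format:

```python
import pandas as pd

# sample strings in the same format as the Installs column
installs = pd.Series(["10,000+", "500,000+", "0"])

# remove every '+' and ',' in one pass, then convert to integers
cleaned = pd.to_numeric(installs.str.replace(r"[+,]", "", regex=True))
print(cleaned.tolist())
```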

In [33]:
GPStore['Installs'].describe()

count    9.659000e+03
mean     7.777507e+06
std      5.375828e+07
min      0.000000e+00
25%      1.000000e+03
50%      1.000000e+05
75%      1.000000e+06
max      1.000000e+09
Name: Installs, dtype: float64

### Column ▶ Type

In [34]:
# get the count/frequency of all the unique values of the specified column
GPStore.Type.value_counts()

Free    8902
Paid     756
Name: Type, dtype: int64

**The app type is categorized as "Free" or "Paid", and these are the only values present. Note that the counts sum to 9658 while the dataframe has 9659 rows, so one entry has a missing 'Type'. Apart from that, no cleaning is required for this column.**

### Column ▶ Price

In [36]:
GPStore.Price.value_counts()

0         8903
$0.99      145
$2.99      124
$1.99       73
$4.99       70
          ... 
$1.76        1
$1.04        1
$2.60        1
$19.90       1
$1.20        1
Name: Price, Length: 92, dtype: int64

**The prices of paid apps are prefixed with a "\$" character. Remove the "\$" and convert the column to numeric.**

In [37]:
# removing the dollar sign from the string
GPStore.Price=GPStore.Price.apply(lambda x: x.strip('$'))

In [38]:
# converting to numeric datatype
GPStore.Price=pd.to_numeric(GPStore.Price)

In [39]:
# statistical summary of the specified numerical variable
GPStore.Price.describe()

count    9659.000000
mean        1.099299
std        16.852152
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       400.000000
Name: Price, dtype: float64
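
The value counts show 8903 apps with a price of 0 but only 8902 apps of type "Free", so the two columns disagree for at least one row. A sketch of a consistency check on hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data: App C has no Type, App D is 'Free' with a non-zero price
toy = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Type": ["Free", "Paid", None, "Free"],
    "Price": [0.0, 4.99, 0.0, 1.99],
})

is_free_type = toy["Type"] == "Free"    # a missing Type compares as False
is_zero_price = toy["Price"] == 0

# rows where the two columns tell different stories
mismatches = toy[is_free_type != is_zero_price]
print(mismatches)
```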

### Column ▶ Content Rating

In [40]:
# get all the unique values present in the specified column
GPStore['Content Rating'].unique()

array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

In [41]:
GPStore['Content Rating'].value_counts()

Everyone           7903
Teen               1036
Mature 17+          393
Everyone 10+        322
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

### Column ▶ Genres

In [42]:
# get all the unique values present in the specified column
GPStore.Genres.unique()

array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity', 'Communication', 'Dating',
       'Education;Education', 'Education', 'Education;Creativity',
       'Education;Music & Video', 'Education;Action & Adventure',
       'Education;Pretend Play', 'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports',
       'Music', 'Word', 'Racing', 'Casual;Creativity',
       'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board',
       'Trivia', 'Role 

**Nothing looks unusual here, so no cleaning is required for this column.**

**We can, however, split it into genres and sub-genres.**

In [52]:
# in case an app belongs to 2 genres, the values are separated by ';'. Split the values with ; as a separator
Genre_split = GPStore.Genres.str.split(';',expand=True)

# add column names
Genre_split.columns = ['Genres', 'Sub-Genres']

# display the first 5 rows of the dataframe
Genre_split.head()

Unnamed: 0,Genres,Sub-Genres
0,Art & Design,
1,Art & Design,Pretend Play
2,Art & Design,
3,Art & Design,
4,Art & Design,Creativity


In [53]:
# remove the 'Genres' column from the dataframe
GPStore.drop('Genres', axis=1,inplace=True)

In [54]:
# merge the two dataframes
GPStore=GPStore.merge(Genre_split, left_index=True, right_index=True)
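
As an alternative to dropping the column and merging, the split parts can be assigned straight back to the dataframe. This is a sketch on hypothetical toy data; `n=1` limits the split to the first `;`, which guarantees at most two resulting columns.

```python
import pandas as pd

# hypothetical toy data in the same 'Genre;Sub-Genre' format
toy = pd.DataFrame({
    "App": ["App A", "App B"],
    "Genres": ["Art & Design", "Art & Design;Pretend Play"],
})

# split on the first ';' only; rows without a sub-genre get a missing value
parts = toy["Genres"].str.split(";", n=1, expand=True)

# column 1 exists only if at least one value contains ';'
toy["Genres"] = parts[0]
toy["Sub-Genres"] = parts[1]
print(toy)
```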

### Column ▶ Last Updated

**Inspect the values and convert them to datetime format.**

In [43]:
# display the first 5 values of the column
GPStore['Last Updated'].head()

0     January 7, 2018
1    January 15, 2018
2      August 1, 2018
3        June 8, 2018
4       June 20, 2018
Name: Last Updated, dtype: object

In [44]:
# convert to datetime datatype
GPStore['Last Updated']=pd.to_datetime(GPStore['Last Updated'])

In [45]:
# display the first 5 values of the column
GPStore['Last Updated'].head()

0   2018-01-07
1   2018-01-15
2   2018-08-01
3   2018-06-08
4   2018-06-20
Name: Last Updated, dtype: datetime64[ns]
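
`pd.to_datetime` infers the format here. Passing the format explicitly is a stricter option, since any value that does not match raises an error instead of being guessed. A sketch on two of the sample values shown above, assuming English month names:

```python
import pandas as pd

# sample strings in the same format as the 'Last Updated' column
last_updated = pd.Series(["January 7, 2018", "August 1, 2018"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(last_updated, format="%B %d, %Y")

# once parsed, date parts are available through the .dt accessor
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```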

### Column ▶ Current Version

In [46]:
# count the no. of null values in the specified column
GPStore['Current Ver'].isnull().sum()

8

In [47]:
# get the count/frequency of all the unique values of the specified column
GPStore['Current Ver'].value_counts()

Varies with device    1055
1.0                    799
1.1                    260
1.2                    176
2.0                    149
                      ... 
4.1.21                   1
8.5                      1
1.25.4                   1
3.1.16                   1
5.16.0                   1
Name: Current Ver, Length: 2817, dtype: int64

### Column ▶ Android version

In [50]:
# get all the unique values present in the specified column
GPStore['Android Ver'].unique()

array(['4.0.3 and up', '4.2 and up', '4.4 and up', '2.3 and up',
       '3.0 and up', '4.1 and up', '4.0 and up', '2.3.3 and up',
       'Varies with device', '2.2 and up', '5.0 and up', '6.0 and up',
       '1.6 and up', '1.5 and up', '2.1 and up', '7.0 and up',
       '5.1 and up', '4.3 and up', '4.0.3 - 7.1.1', '2.0 and up',
       '3.2 and up', '4.4W and up', '7.1 and up', '7.0 - 7.1.1',
       '8.0 and up', '5.0 - 8.0', '3.1 and up', '2.0.1 and up',
       '4.1 - 7.1.1', nan, '5.0 - 6.0', '1.0 and up', '2.2 - 7.1.1',
       '5.0 - 7.1.1'], dtype=object)

### ***Saving the cleaned dataset in csv format***

In [55]:
# save the dataframe
GPStore.to_csv('Play Store Data Clean.csv',index=False)