# Exploratory Data Analysis on Google Play Store Apps 

![Play Store](https://i.imgur.com/5TFmHcV.png)

### INTRODUCTION

In this project, we will analyze the Google Play Store dataset from Kaggle. The dataset has over 450,000 rows and 29 columns, of which we'll use 16 for our analysis. The dataset is available at: https://www.kaggle.com/datasets/geothomas/playstore-dataset?resource=download&select=Playstore_final.csv

The main objective of this project is to better understand Google Play Store apps by applying data analysis and visualization skills to a real-world dataset.

**Here is an outline of the steps we'll follow:**

1.  Downloading a dataset from an online source
2.  Data preparation and cleaning
3.  Exploratory analysis and visualization
4.  Asking and answering interesting questions
5.  Summary
6.  Inferences and conclusion
7.  References

**Exploratory Data Analysis (EDA)** is the process of exploring, investigating and gathering insights from data using statistical measures and visualizations. The objective of EDA is to develop an understanding of the data by uncovering trends, relationships and patterns.

EDA is both a science and an art. On the one hand, it requires knowledge of statistics, visualization techniques and data analysis tools such as NumPy, Pandas and Seaborn. On the other hand, it requires asking interesting questions to guide the investigation, and interpreting numbers and figures to generate useful insights.


![EDA](https://i.imgur.com/FT73bKp.png)

### __1. DOWNLOADING DATASET FROM AN ONLINE SOURCE__

#### Installing and importing all required Libraries 

In [None]:
!pip install opendatasets --upgrade --quiet
!pip install matplotlib seaborn --upgrade --quiet
!pip install plotly --upgrade --quiet
!pip install folium --upgrade --quiet
!pip install numpy --upgrade --quiet 
import os
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import opendatasets as od
import folium
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

**Downloading the Dataset**

- **We will download the Google Play Store apps dataset from Kaggle using the `opendatasets` library**

In [None]:
download_url = 'https://www.kaggle.com/datasets/geothomas/playstore-dataset?resource=download&select=Playstore_final.csv'
od.download(download_url)

Downloading playstore-dataset.zip to ./playstore-dataset


100%|██████████| 41.9M/41.9M [00:02<00:00, 15.9MB/s]





In [None]:
os.listdir('playstore-dataset')  # relative path, matching the download location printed above

['Playstore_final.csv']

#### __Loading the CSV file into a Pandas DataFrame__

In [None]:
# on_bad_lines (pandas 1.3+) replaces the removed error_bad_lines flag; malformed rows are reported and skipped
playstore_apps_df = pd.read_csv('playstore-dataset/Playstore_final.csv', on_bad_lines='warn')



b'Skipping line 155239: expected 29 fields, saw 57\n'
b'Skipping line 293106: expected 29 fields, saw 57\n'
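
How pandas handles malformed rows can be seen on a tiny in-memory CSV. The sketch below assumes pandas 1.3 or later, where the `on_bad_lines` parameter is available; the sample rows are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-column sample; the last data row has too many fields.
sample_csv = io.StringIO(
    "App Name,Category,Rating\n"
    "App A,Education,4.1\n"
    "App B,Tools,3.9\n"
    "App C,Tools,4.5,unexpected,extra\n"
)

# on_bad_lines='skip' drops rows with too many fields instead of raising an error.
sample_df = pd.read_csv(sample_csv, on_bad_lines='skip')
print(sample_df.shape)
```

Using `on_bad_lines='warn'` instead would also skip the row, but report it while doing so.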


**Displaying the dataframe using the Pandas Library**

In [None]:
playstore_apps_df[:5]

|  | App Name | App Id | Category | Rating | Rating Count | Installs | Minimum Installs | Free | Price | Currency | ... | Ad Supported | In app purchases | Editor Choice | Summary | Reviews | Android version Text | Developer | Developer Address | Developer Internal ID | Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistics Management | com.eniseistudio.logistics_management | Education | 4.090909 | 66.0 | 10,000+ | 10000.0 | True | 0.0 | USD | ... | True | False | False | Leading Online Learning and Training System in... | 28.0 | 4.0 and up | eniseistudio | 7115 N Muscatel Ave San Gabriel, CA 91775 Unit... | 4.656447e+18 | 1.1.5 |
| 1 | Estados Unidos Noticias | com.eniseistudio.news.estados_unidos | News & Magazines | 4.0 | 8.0 | 1,000+ | 1000.0 | True | 0.0 | USD | ... | True | False | False | Top Stories\r\nWorld\r\nEntertainment\r\nSport... | 3.0 | 4.0 and up | eniseistudio | 7115 N Muscatel Ave San Gabriel, CA 91775 Unit... | 4.656447e+18 | 1.2.3 |
| 2 | Dental Assistant | com.eniseistudio.dental_assistant | Education | 3.866667 | 15.0 | 10,000+ | 10000.0 | True | 0.0 | USD | ... | True | False | False | Dental Assistant: Study Dental Assistant, Dent... | 3.0 | 4.0 and up | eniseistudio | 7115 N Muscatel Ave San Gabriel, CA 91775 Unit... | 4.656447e+18 | 1.1.5 |
| 3 | Medical Assistant | com.eniseistudio.course.medical_assistant | Education | 4.0 | 18.0 | 5,000+ | 5000.0 | True | 0.0 | USD | ... | True | False | False | Medical Assistant Degree Medical Assistant Job... | 7.0 | 4.0 and up | eniseistudio | 7115 N Muscatel Ave San Gabriel, CA 91775 Unit... | 4.656447e+18 | 1.1.4 |
| 4 | Business Administration | com.eniseistudio.majors.course.business_admini... | Education | 4.023256 | 86.0 | 50,000+ | 50000.0 | True | 0.0 | USD | ... | True | False | False | Business Administration Learning, Business Adm... | 29.0 | 4.0 and up | eniseistudio | 7115 N Muscatel Ave San Gabriel, CA 91775 Unit... | 4.656447e+18 | 1.1.6 |


#### Checking the number of rows and columns in the dataset downloaded

In [None]:
playstore_apps_df.shape

(450793, 29)

**Checking the info of all columns to see the data types and non-null counts**

In [None]:
playstore_apps_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450793 entries, 0 to 450792
Data columns (total 29 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   App Name               450780 non-null  object 
 1   App Id                 450793 non-null  object 
 2   Category               450780 non-null  object 
 3   Rating                 447981 non-null  float64
 4   Rating Count           332025 non-null  float64
 5   Installs               450702 non-null  object 
 6   Minimum Installs       450780 non-null  float64
 7   Free                   450701 non-null  object 
 8   Price                  450701 non-null  float64
 9   Currency               450701 non-null  object 
 10  Size                   450765 non-null  object 
 11  Minimum Android        449711 non-null  object 
 12  Developer Id           450777 non-null  object 
 13  Developer Website      341067 non-null  object 
 14  Developer Email        450767 non-nu

**The type of the object is found using the built-in `type()` function**

In [None]:
type(playstore_apps_df)

pandas.core.frame.DataFrame

**The data type of each column is found using the `dtypes` attribute**

In [None]:
playstore_apps_df.dtypes

App Name                  object
App Id                    object
Category                  object
Rating                   float64
Rating Count             float64
Installs                  object
Minimum Installs         float64
Free                      object
Price                    float64
Currency                  object
Size                      object
Minimum Android           object
Developer Id              object
Developer Website         object
Developer Email           object
Released                  object
Last update               object
Privacy Policy            object
Content Rating            object
Ad Supported              object
In app purchases            bool
Editor Choice               bool
Summary                   object
Reviews                  float64
Android version Text      object
Developer                 object
Developer Address         object
Developer Internal ID    float64
Version                   object
dtype: object

**Displaying only the column names of the dataset**

In [None]:
playstore_apps_df.columns

Index(['App Name', 'App Id', 'Category', 'Rating', 'Rating Count', 'Installs',
       'Minimum Installs', 'Free', 'Price', 'Currency', 'Size',
       'Minimum Android', 'Developer Id', 'Developer Website',
       'Developer Email', 'Released', 'Last update', 'Privacy Policy',
       'Content Rating', 'Ad Supported', 'In app purchases', 'Editor Choice',
       'Summary', 'Reviews', 'Android version Text', 'Developer',
       'Developer Address', 'Developer Internal ID', 'Version'],
      dtype='object')

### __2. DATA PREPARATION AND CLEANING__

**Next, we reload the data, keeping only the columns needed for our analysis. The columns are listed in `selected_cols` and their data types in `selected_dtypes`.**

In [None]:
selected_dtypes ={
    'App Name': 'str',
    'Category': 'object',
    'Rating':'float64',
    'Installs': 'object',
    'Minimum Installs':'float64',
    'Free': 'object',
    'Price': 'float64',
    'Currency': 'object',
    'Size': 'object',
    'Released': 'object',
    'Last update': 'object',
    'Content Rating': 'object',
    'In app purchases':'bool',
    'Editor Choice':'bool',
    'Reviews':'float64',
    'Developer': 'object',
}
selected_cols=['App Name', 'Category','Rating','Installs',
       'Minimum Installs', 'Free', 'Price', 'Currency', 'Size',
       'Released', 'Last update', 'Content Rating',
       'In app purchases', 'Editor Choice','Reviews','Developer']

# Reload only the selected columns with explicit dtypes; malformed rows are reported and skipped
Apps_df = pd.read_csv('playstore-dataset/Playstore_final.csv', on_bad_lines='warn',
                      usecols=selected_cols, dtype=selected_dtypes)



b'Skipping line 155239: expected 29 fields, saw 57\n'
b'Skipping line 293106: expected 29 fields, saw 57\n'


#### **Descriptive Statistics**

In [None]:
Apps_df.describe()

|  | Rating | Minimum Installs | Price | Reviews |
|---|---|---|---|---|
| count | 447983.0 | 450782.0 | 450703.0 | 447983.0 |
| mean | 3.018803 | 883648.2 | 0.315508 | 5118.334 |
| std | 1.860017 | 36076260.0 | 4.110261 | 199311.3 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 500.0 | 0.0 | 0.0 |
| 50% | 3.933333 | 5000.0 | 0.0 | 17.0 |
| 75% | 4.39 | 50000.0 | 0.0 | 178.0 |
| max | 5.0 | 10000000000.0 | 400.0 | 52377200.0 |


 __Let us look at the first five rows. Note that `Released` and `Last update` are stored as text; they are converted to datetime format further below.__

In [None]:
Apps_df[:5]  # Display the first 5 rows

|  | App Name | Category | Rating | Installs | Minimum Installs | Free | Price | Currency | Size | Released | Last update | Content Rating | In app purchases | Editor Choice | Reviews | Developer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistics Management | Education | 4.090909 | 10,000+ | 10000.0 | True | 0.0 | USD | 5.8M | Jul 19, 2017 | July 19, 2017 | Everyone | False | False | 28.0 | eniseistudio |
| 1 | Estados Unidos Noticias | News & Magazines | 4.0 | 1,000+ | 1000.0 | True | 0.0 | USD | 5.3M | May 5, 2017 | May 5, 2017 | Everyone | False | False | 3.0 | eniseistudio |
| 2 | Dental Assistant | Education | 3.866667 | 10,000+ | 10000.0 | True | 0.0 | USD | 5.7M | Jul 18, 2017 | July 18, 2017 | Everyone | False | False | 3.0 | eniseistudio |
| 3 | Medical Assistant | Education | 4.0 | 5,000+ | 5000.0 | True | 0.0 | USD | 5.8M | Jun 24, 2017 | June 24, 2017 | Everyone | False | False | 7.0 | eniseistudio |
| 4 | Business Administration | Education | 4.023256 | 50,000+ | 50000.0 | True | 0.0 | USD | 5.7M | Jun 13, 2017 | October 6, 2017 | Everyone | False | False | 29.0 | eniseistudio |


In [None]:
Apps_df.shape # To find the number of rows and columns of the dataframe

(450795, 16)

- **To find the missing values in each column, we use `isna().sum()` and sort the result in descending order with `sort_values(ascending=False)`**

In [None]:
Apps_df.isna().sum().sort_values(ascending=False)

Released            3447
Rating              2812
Reviews             2812
Free                  92
Price                 92
Currency              92
Installs              91
Size                  28
Developer             16
App Name              13
Category              13
Minimum Installs      13
Content Rating        13
Last update            0
In app purchases       0
Editor Choice          0
dtype: int64
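
Absolute counts are easier to judge next to the size of the dataset. A minimal sketch, using a made-up miniature frame (`demo_df` is a hypothetical stand-in for `Apps_df`), of expressing missing values as a percentage per column:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for Apps_df.
demo_df = pd.DataFrame({
    'Rating': [4.1, np.nan, 3.9, np.nan],
    'Released': ['Jul 19, 2017', None, 'May 5, 2017', 'Jun 24, 2017'],
    'Category': ['Education', 'Tools', 'Tools', 'Card'],
})

# isna().mean() is the fraction of missing values in each column.
missing_pct = (demo_df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```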

**Dropping rows with missing values in columns that are important for the analysis, by applying the `notnull()` function to particular columns**

In [None]:
# Keep rows where the given column is not None/NaN.
# Note: each statement filters the original Apps_df, so a later assignment to Apps_df2 replaces the earlier one.
Apps_df2= Apps_df[Apps_df['App Name'].notnull()]
Apps_df2= Apps_df[Apps_df['Developer'].notnull()]

In [None]:
Apps_df2= Apps_df[Apps_df['Size'].notnull()]

In [None]:
Apps_df2.isna().sum().sort_values(ascending=False)

Released            3419
Rating              2784
Reviews             2784
Free                  78
Price                 78
Currency              78
Installs              77
Developer              3
App Name               0
Category               0
Minimum Installs       0
Size                   0
Last update            0
Content Rating         0
In app purchases       0
Editor Choice          0
dtype: int64

In [None]:
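# Both filters start from Apps_df again; 'Last update' has no missing values, so Apps_df2 ends up with every row of Apps_df.
# The remaining missing values are dropped in a later cell.
Apps_df2= Apps_df[Apps_df['Released'].notnull()]
Apps_df2= Apps_df[Apps_df['Last update'].notnull()].copy()  # copy() avoids SettingWithCopyWarning on later column assignments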
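
Filtering on several columns can also be expressed in a single call. The following is a minimal sketch on a made-up frame (`demo_df` is hypothetical, not the real dataset): `dropna(subset=...)` removes a row if any of the listed columns is missing, whereas repeatedly filtering the original frame keeps only the last condition.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'App Name': ['A', None, 'C', 'D'],
    'Developer': ['dev1', 'dev2', None, 'dev4'],
    'Size': ['5.8M', '5.3M', '5.7M', None],
})

# Each line below starts again from demo_df, so only the 'Size' filter survives.
last_filter_only = demo_df[demo_df['App Name'].notnull()]
last_filter_only = demo_df[demo_df['Size'].notnull()]

# dropna(subset=...) applies all three conditions at once.
all_filters = demo_df.dropna(subset=['App Name', 'Developer', 'Size'])

print(len(last_filter_only), len(all_filters))
```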

In [None]:
Apps_df2[:5]

|  | App Name | Category | Rating | Installs | Minimum Installs | Free | Price | Currency | Size | Released | Last update | Content Rating | In app purchases | Editor Choice | Reviews | Developer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistics Management | Education | 4.090909 | 10,000+ | 10000.0 | True | 0.0 | USD | 5.8M | Jul 19, 2017 | July 19, 2017 | Everyone | False | False | 28.0 | eniseistudio |
| 1 | Estados Unidos Noticias | News & Magazines | 4.0 | 1,000+ | 1000.0 | True | 0.0 | USD | 5.3M | May 5, 2017 | May 5, 2017 | Everyone | False | False | 3.0 | eniseistudio |
| 2 | Dental Assistant | Education | 3.866667 | 10,000+ | 10000.0 | True | 0.0 | USD | 5.7M | Jul 18, 2017 | July 18, 2017 | Everyone | False | False | 3.0 | eniseistudio |
| 3 | Medical Assistant | Education | 4.0 | 5,000+ | 5000.0 | True | 0.0 | USD | 5.8M | Jun 24, 2017 | June 24, 2017 | Everyone | False | False | 7.0 | eniseistudio |
| 4 | Business Administration | Education | 4.023256 | 50,000+ | 50000.0 | True | 0.0 | USD | 5.7M | Jun 13, 2017 | October 6, 2017 | Everyone | False | False | 29.0 | eniseistudio |


#### **Changing App size to MB**

In [None]:
# 'Varies with device' becomes the string 'NaN' (not a real missing value, so isna() does not count it)
Apps_df2['Size'] = Apps_df2['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
# Strip thousands separators first, then the 'M' suffix (these sizes are already in MB)
Apps_df2['Size'] = Apps_df2['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
Apps_df2['Size'] = Apps_df2['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
# Convert sizes given in KB to MB
Apps_df2['Size'] = Apps_df2['Size'].apply(lambda x: float(str(x).replace('K', '')) / 1000 if 'K' in str(x) else x)
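
The same conversion can be written as one small function that always returns a number. This is a sketch under assumptions rather than the notebook's method: `size_to_mb` is a hypothetical helper, it treats the suffix case-insensitively, and it returns a real `NaN` for 'Varies with device', so missing-value counts computed after it would differ from the ones shown in this notebook.

```python
import math

def size_to_mb(size):
    """Convert a size string such as '5.8M' or '512K' to megabytes (float)."""
    text = str(size).replace(',', '').strip().upper()
    try:
        if text.endswith('M'):
            return float(text[:-1])
        if text.endswith('K'):
            # Same convention as the cell above: 1000 KB = 1 MB.
            return float(text[:-1]) / 1000
    except ValueError:
        pass
    # 'Varies with device' and anything else unparseable become NaN.
    return math.nan

print(size_to_mb('5.8M'), size_to_mb('512K'), size_to_mb('Varies with device'))
```

It could be applied with `Apps_df2['Size'].apply(size_to_mb)`, which yields a purely numeric column.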

#### **Converting the `Released` and `Last update` columns to datetime using `pd.to_datetime`**

In [None]:
Apps_df2['Last update']=pd.to_datetime(Apps_df2["Last update"],errors= 'coerce')

In [None]:
Apps_df2['Released']=pd.to_datetime(Apps_df2["Released"],errors= 'coerce')
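
With `errors='coerce'`, any value that cannot be parsed becomes `NaT`, which then counts as missing. A minimal sketch on made-up values (not taken from the dataset) of how to see which entries failed to parse:

```python
import pandas as pd

raw_dates = pd.Series(['Jul 19, 2017', 'May 5, 2017', 'not a date', None])

# errors='coerce' turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw_dates, errors='coerce')

# Values that were present in the raw data but failed to parse.
failed = raw_dates[parsed.isna() & raw_dates.notna()]
print(parsed.isna().sum(), list(failed))
```

Inspecting `failed` on the real columns is one way to find out why the number of missing dates grows after the conversion.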

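#### **Removing the '+' sign from the `Installs` column**

In [None]:
# Without regex=True, Series.replace matches whole cell values only, so values such as '10,000+' are left unchanged;
# the '+' and ',' characters are stripped in a later cell.
Apps_df2['Installs'] = Apps_df2['Installs'].replace(r'\D', '')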
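
Stripping every non-digit character requires a regular-expression replacement on the string values. A minimal sketch on made-up values, using `Series.str.replace` with `regex=True`:

```python
import pandas as pd

installs = pd.Series(['10,000+', '1,000+', '50,000,000+', None])

# regex=True makes '\D' match every non-digit character ('+' and ',').
digits_only = installs.str.replace(r'\D', '', regex=True)

# to_numeric keeps missing values as NaN, so the result is a float column.
install_counts = pd.to_numeric(digits_only)
print(install_counts.tolist())
```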

In [None]:
missing_value=Apps_df2.isna().sum().sort_values(ascending=False) # Count the missing values in each column, in descending order
missing_value

Released            15331
Last update         12085
Rating               2812
Reviews              2812
Free                   92
Price                  92
Currency               92
Installs               91
Size                   28
Developer              16
App Name               13
Category               13
Minimum Installs       13
Content Rating         13
In app purchases        0
Editor Choice           0
dtype: int64

#### **Dropping all NaN/None values for accurate data visualization**

**The rows with missing values are few compared to the size of the dataset (450,795 rows before cleaning, 435,312 after, a loss of about 3.4%), so removing them has no significant impact on our analysis. Hence, we delete these rows.**

In [None]:
# Drop every row that has a None/NaN value in any of these columns
required_cols = ['Free', 'Price', 'Rating', 'Reviews', 'Last update', 'Released', 'Developer']
Apps_df2 = Apps_df2.dropna(subset=required_cols)
Apps_df2.isna().sum().sort_values(ascending=False)

App Name            0
Category            0
Rating              0
Installs            0
Minimum Installs    0
Free                0
Price               0
Currency            0
Size                0
Released            0
Last update         0
Content Rating      0
In app purchases    0
Editor Choice       0
Reviews             0
Developer           0
dtype: int64

In [None]:
Apps_df2['App Name'].nunique() # Number of distinct app names

391598

In [None]:
# True if at least one app name occurs more than once
has_duplicate_names = Apps_df2['App Name'].duplicated().any()
has_duplicate_names

True

In [None]:
Apps_df2['App Name'].value_counts()

Solitaire                                     78
Flashlight                                    68
Gallery                                       64
Music Player                                  56
Sudoku                                        55
                                              ..
BrainwaveX Third Eye Chakra                    1
BrainwaveX Focus Pro                           1
BrainwaveX Study                               1
BrainwaveX Focus                               1
Wholesale Clothing Online wholesale7 Store     1
Name: App Name, Length: 391598, dtype: int64
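
A repeated app name does not necessarily mean a repeated app, since different developers can publish apps under the same name. A minimal sketch on made-up rows (`demo_df` is hypothetical) comparing duplicates by name alone with duplicates by name and developer:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'App Name': ['Solitaire', 'Solitaire', 'Solitaire', 'Sudoku'],
    'Developer': ['Zynga', 'Gamma Play', 'Zynga', 'Zynga'],
})

# Rows whose name has already appeared earlier, regardless of developer.
by_name = demo_df['App Name'].duplicated().sum()

# Rows whose (name, developer) pair has already appeared: a stricter check.
by_name_and_dev = demo_df.duplicated(subset=['App Name', 'Developer']).sum()
print(by_name, by_name_and_dev)
```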

In [None]:
Apps_df2[Apps_df2['App Name']=='Solitaire']

|  | App Name | Category | Rating | Installs | Minimum Installs | Free | Price | Currency | Size | Released | Last update | Content Rating | In app purchases | Editor Choice | Reviews | Developer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8138 | Solitaire | Card | 4.308056 | 50,000,000+ | 50000000.0 | TRUE | 0.0 | USD | 46 | 2013-12-26 | 2021-05-11 | Everyone | False | False | 51546.0 | Zynga |
| 8714 | Solitaire | Card | 4.702971 | 100,000+ | 100000.0 | TRUE | 0.0 | USD | 35 | 2020-10-15 | 2021-04-14 | Everyone | True | False | 198.0 | Gamma Play |
| 16408 | Solitaire | Card | 4.356895 | 1,000,000+ | 1000000.0 | TRUE | 0.0 | USD |  | 2015-04-23 | 2019-09-10 | Everyone | True | False | 2366.0 | BlackLight Studio Games |
| 16617 | Solitaire | Card | 4.170125 | 5,000,000+ | 5000000.0 | TRUE | 0.0 | USD | 26 | 2017-05-03 | 2020-10-15 | Everyone | False | False | 6968.0 | Big Cat Studio - we make brain games |
| 17979 | Solitaire | Card | 4.551285 | 1,000,000+ | 1000000.0 | TRUE | 0.0 | USD | 78 | 2014-04-22 | 2021-03-25 | Everyone | True | False | 10503.0 | Green Panda Games |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 353767 | Solitaire | Card | 4.683168 | 10,000+ | 10000.0 | TRUE | 0.0 | USD | 14 | 2020-11-20 | 2021-04-27 | Everyone | False | False | 46.0 | foo Game Group |
| 380544 | Solitaire | Card | 4.170000 | 10,000+ | 10000.0 | TRUE | 0.0 | USD |  | 2015-12-12 | 2017-01-23 | Everyone | False | False | 49.0 | TLCM free apps & games |
| 384270 | Solitaire | Strategy | 4.393940 | 10,000+ | 10000.0 | TRUE | 0.0 | USD | 16 | 2019-12-11 | 2020-01-22 | Everyone | False | False | 12.0 | Apps Specials |
| 432908 | Solitaire | Card | 4.310000 | 100,000+ | 100000.0 | TRUE | 0.0 | USD | 9.2 | 2016-04-14 | 2021-04-07 | Everyone | False | False | 132.0 | AppAgency Labs |


**Removing the '+' symbol and the thousands separators from the `Installs` column, then converting it to numbers**

In [None]:
Apps_df2['Installs'] = Apps_df2['Installs'].map(lambda x: x.rstrip('+'))
Apps_df2['Installs'] = pd.to_numeric(Apps_df2['Installs'].str.replace(',',''))
Apps_df2[:100]


|  | App Name | Category | Rating | Installs | Minimum Installs | Free | Price | Currency | Size | Released | Last update | Content Rating | In app purchases | Editor Choice | Reviews | Developer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistics Management | Education | 4.090909 | 10000 | 10000.0 | TRUE | 0.0 | USD | 5.8 | 2017-07-19 | 2017-07-19 | Everyone | False | False | 28.0 | eniseistudio |
| 1 | Estados Unidos Noticias | News & Magazines | 4.000000 | 1000 | 1000.0 | TRUE | 0.0 | USD | 5.3 | 2017-05-05 | 2017-05-05 | Everyone | False | False | 3.0 | eniseistudio |
| 2 | Dental Assistant | Education | 3.866667 | 10000 | 10000.0 | TRUE | 0.0 | USD | 5.7 | 2017-07-18 | 2017-07-18 | Everyone | False | False | 3.0 | eniseistudio |
| 3 | Medical Assistant | Education | 4.000000 | 5000 | 5000.0 | TRUE | 0.0 | USD | 5.8 | 2017-06-24 | 2017-06-24 | Everyone | False | False | 7.0 | eniseistudio |
| 4 | Business Administration | Education | 4.023256 | 50000 | 50000.0 | TRUE | 0.0 | USD | 5.7 | 2017-06-13 | 2017-10-06 | Everyone | False | False | 29.0 | eniseistudio |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Teach Your Kids Animal Sounds | Educational | 4.250000 | 10000 | 10000.0 | TRUE | 0.0 | USD |  | 2015-05-23 | 2020-06-18 | Everyone | True | False | 23.0 | Blion Games |
| 96 | Vegan Defense | Strategy | 4.634783 | 10000 | 10000.0 | TRUE | 0.0 | USD | 11 | 2015-03-20 | 2020-11-03 | Everyone | True | False | 222.0 | Blion Games |
| 97 | Carnevale Maschere e Ricette | Entertainment | 4.140000 | 10000 | 10000.0 | TRUE | 0.0 | USD |  | 2014-01-26 | 2020-06-19 | Everyone | False | False | 14.0 | Blion Games |
| 98 | La Befana Storie e Leggende | Entertainment | 4.080000 | 10000 | 10000.0 | TRUE | 0.0 | USD |  | 2013-12-29 | 2020-06-20 | Everyone | False | False | 28.0 | Blion Games |


### __3. EXPLORATORY DATA ANALYSIS AND VISUALIZATION__

#### **Descriptive Statistics**

In [None]:
Apps_df2.describe()

|  | Rating | Installs | Minimum Installs | Price | Reviews |
|---|---|---|---|---|---|
| count | 435312.0 | 435312.0 | 435312.0 | 435312.0 | 435312.0 |
| mean | 3.000458 | 817747.9 | 817747.9 | 0.313957 | 4942.956 |
| std | 1.867864 | 29729760.0 | 29729760.0 | 4.122183 | 193547.9 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 500.0 | 500.0 | 0.0 | 0.0 |
| 50% | 3.92385 | 5000.0 | 5000.0 | 0.0 | 16.0 |
| 75% | 4.386789 | 50000.0 | 50000.0 | 0.0 | 171.0 |
| max | 5.0 | 5000000000.0 | 5000000000.0 | 400.0 | 52377200.0 |


In [None]:
# value_counts() already sorts in descending order; naming the index and the values explicitly works across pandas versions
Apps_Category = Apps_df2['Category'].value_counts().rename_axis('Category').reset_index(name='Count')
Apps_Category[:25]

|  | Category | Count |
|---|---|---|
| 0 | Education | 45528 |
| 1 | Tools | 32797 |
| 2 | Personalization | 29227 |
| 3 | Books & Reference | 28914 |
| 4 | Music & Audio | 27798 |
| 5 | Entertainment | 21579 |
| 6 | Lifestyle | 18254 |
| 7 | Business | 15807 |
| 8 | Productivity | 14136 |
| 9 | Health & Fitness | 13691 |


#### 1. **What are the top 25 app categories in the Google Play Store by app count?**

**Plotting a bar chart of the top 25 app categories by app count**

**Setting up the parameters for the plot**

In [None]:
fig=px.bar(Apps_Category.head(25),
          y='Category',
          x='Count',
          width=900, height=750,
           color='Count')

fig.update_layout(title="Top 25 App Categories",
                 xaxis_title="App Count",
                 yaxis_title="App Category Name",
                 yaxis=dict(tickfont=dict(size=10)))
fig.show()

**The graph shows that `Education` and `Tools` are the leading app categories, with the most apps in the Google Play Store**

#### 2. **What are the top 25 apps with the most installs by users?**

In [None]:
Apps_Installs= Apps_df2.Installs.value_counts().sort_values(ascending=True)
Apps_Installs[:25]

5000000000       12
500000000        62
1000000000       62
100000000       491
50000000        784
0              1585
1              5045
5000000        5082
5              5120
10000000       5210
500000        15288
50            15511
1000000       22066
10            22839
500           26785
50000         28324
5000          33171
100000        50312
100           50602
1000          71873
10000         75088
Name: Installs, dtype: int64
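
The output above counts how many apps fall into each install bracket. To list individual apps instead, one option is `DataFrame.nlargest`. A minimal sketch on made-up data (`demo_df` is hypothetical); on the cleaned data the equivalent call would be `Apps_df2.nlargest(25, 'Installs')`:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'App Name': ['App A', 'App B', 'App C', 'App D'],
    'Installs': [5000, 1000000, 100, 50000],
})

# nlargest returns the rows with the highest values in the given column.
top_apps = demo_df.nlargest(2, 'Installs')[['App Name', 'Installs']]
print(top_apps)
```

Because installs are reported in brackets, many apps share the same value, so ties at the cut-off are to be expected on the real data.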


#### 5. **Which applications have a 5-star rating? (Top 25 rated apps)**

### __4. ASKING AND ANSWERING INTERESTING QUESTIONS__

QUESTIONS

1. Which categories have the most apps in the Google Play Store? (Top 25 categories by app count; also shown as a word cloud)
2. Which categories have the most installs? (Top 25 categories by installs)
3. In which year were the most applications released? (Number of apps released per year)
4. Which applications have a 5-star rating? (Top 25 rated apps)
5. Does the number of installs increase when the app is free?
6. Which are the costliest apps in the Play Store? (Top 25 costliest apps)
7. Which applications have the most reviews? (Top 25 apps by reviews)
8. Which developer has developed the most apps?
9. Is there a correlation between reviews and rating?
10. What is the latest trend in app categories?
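
Two of these questions can be sketched in a few lines. The example below uses a made-up frame (`demo_df` is hypothetical, with the same column names as the cleaned data) and shows one possible approach to questions 3 and 9; it is not a result from the dataset.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Released': pd.to_datetime(['2017-07-19', '2017-05-05', '2020-10-15',
                                '2020-11-20', '2020-01-02']),
    'Rating': [4.0, 3.5, 4.5, 5.0, 3.0],
    'Reviews': [20.0, 10.0, 30.0, 40.0, 5.0],
})

# Question 3: number of apps released per year, ordered by year.
apps_per_year = demo_df['Released'].dt.year.value_counts().sort_index()

# Question 9: Pearson correlation between review count and rating.
review_rating_corr = demo_df['Reviews'].corr(demo_df['Rating'])
print(apps_per_year.to_dict(), round(review_rating_corr, 3))
```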





### __5. SUMMARY__

### __6. INFERENCES AND CONCLUSION__

### __7. REFERENCES__