# Top Google Play Store Apps

# Introduction

In this analysis we review how users download apps from the Google Play Store. The dataset contains the following columns:

* App: The application name.
* Category: The category the app belongs to.
* Rating: Overall user rating of the app.
* Reviews: Number of user reviews for the app.
* Size: The size of the app.
* Installs: Number of user installs for the app.
* Type: Either "Paid" or "Free".
* Price: The price of the app.
* Content Rating: The age group the app is targeted at. Values that appear in the data include "Everyone", "Everyone 10+", "Teen" and "Mature 17+".
* Genres: Possibly multiple genres the app belongs to.
* Last Updated: The date the app was last updated.
* Current Ver: The current version of the app.
* Android Ver: The minimum Android version required by the app.

To accomplish this objective we are going to follow these steps:

1. Data preprocessing: 

    * Loading the dataset 
    * Data description
    * Review duplicates, column names and null information. 

2. Data analysis

    * Top categories by number of apps
    * Classify install segments

3. Conclusions




In [1]:
# Loading libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [2]:
# Additional libraries
import personal_library as per_lib

# Data Preprocessing

In [3]:
# Loading the data to the project
df = pd.read_csv('./datasets/googleplaystore.csv')

In [4]:
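# Describing the data
# Note: without parentheses, `df.describe` is displayed as a bound method (its repr
# includes the DataFrame itself); call `df.describe()` to get summary statistics
df.describe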

<bound method NDFrame.describe of        Unnamed: 0                                                App  \
0               0     Photo Editor & Candy Camera & Grid & ScrapBook   
1               1                                Coloring book moana   
2               2  U Launcher Lite – FREE Live Cool Themes, Hide ...   
3               3                              Sketch - Draw & Paint   
4               4              Pixel Draw - Number Art Coloring Book   
...           ...                                                ...   
10836       10836                                   Sya9a Maroc - FR   
10837       10837                   Fr. Mike Schmitz Audio Teachings   
10838       10838                             Parkinson Exercices FR   
10839       10839                      The SCP Foundation DB fr nn5n   
10840       10840      iHoroscope - 2018 Daily Horoscope & Astrology   

                  Category  Rating Reviews                Size     Installs  \
0           ART_AND_DE

In [5]:
print(df.info())
df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      10841 non-null  int64  
 1   App             10841 non-null  object 
 2   Category        10841 non-null  object 
 3   Rating          9367 non-null   float64
 4   Reviews         10841 non-null  object 
 5   Size            10841 non-null  object 
 6   Installs        10841 non-null  object 
 7   Type            10840 non-null  object 
 8   Price           10841 non-null  object 
 9   Content Rating  10840 non-null  object 
 10  Genres          10841 non-null  object 
 11  Last Updated    10841 non-null  object 
 12  Current Ver     10833 non-null  object 
 13  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(12)
memory usage: 1.2+ MB
None


| | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up |
| 1 | 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | 4.0.3 and up |
| 2 | 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | 1-Aug-18 | 1.2.4 | 4.0.3 and up |
| 3 | 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | 8-Jun-18 | Varies with device | 4.2 and up |
| 4 | 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | 20-Jun-18 | 1.1 | 4.4 and up |


As we can see, there is missing data in the columns `Rating`, `Type`, `Content Rating`, `Current Ver` and `Android Ver`. Before reviewing the missing data, we are going to rename the columns to snake_case, with all column names in lowercase.

In [6]:
# Renaming the name of the columns 
df.columns = per_lib.rename_columns(df)

df.columns

Index(['unnamed:_0', 'app', 'category', 'rating', 'reviews', 'size',
       'installs', 'type', 'price', 'content_rating', 'genres', 'last_updated',
       'current_ver', 'android_ver'],
      dtype='object')
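
The helper `per_lib.rename_columns` comes from a personal library whose source is not shown here. A minimal sketch of a function that would produce the same names as the output above, assuming it only strips, lowercases and replaces spaces with underscores, could look like this (the implementation below is a hypothetical stand-in, not the actual library code):

```python
import pandas as pd


def rename_columns(df: pd.DataFrame) -> list[str]:
    """Hypothetical stand-in for per_lib.rename_columns.

    Assumption: names are stripped, lowercased, and spaces become underscores.
    """
    return [name.strip().lower().replace(" ", "_") for name in df.columns]


# Small example frame with a few of the original column names
example = pd.DataFrame(columns=["Unnamed: 0", "App", "Content Rating", "Last Updated"])
example.columns = rename_columns(example)
print(list(example.columns))
```

Under these assumptions `Unnamed: 0` becomes `unnamed:_0`, which matches the column name dropped in the next step.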

Now we are going to delete the column `unnamed:_0`, since it only repeats the row index and is not used in this analysis.

In [7]:
# Deleting the column "unnamed:_0"
df = df.drop('unnamed:_0', axis = 1)

df.head(5)

| | app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | 1-Aug-18 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | 8-Jun-18 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | 20-Jun-18 | 1.1 | 4.4 and up |


Next, let's check whether the data contains duplicates. We treat two rows as duplicates when they share the same values in the columns `category`, `installs` and `app`.

In [8]:
# Checking for rows duplicated on category, installs and app
df[df[['category', 'installs', 'app']].duplicated()]

| | app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 229 | Quick PDF Scanner + OCR FREE | BUSINESS | 4.2 | 80805 | Varies with device | 5,000,000+ | Free | 0 | Everyone | Business | 26-Feb-18 | Varies with device | 4.0.3 and up |
| 236 | Box | BUSINESS | 4.2 | 159872 | Varies with device | 10,000,000+ | Free | 0 | Everyone | Business | 31-Jul-18 | Varies with device | Varies with device |
| 239 | Google My Business | BUSINESS | 4.4 | 70991 | Varies with device | 5,000,000+ | Free | 0 | Everyone | Business | 24-Jul-18 | 2.19.0.204537701 | 4.4 and up |
| 256 | ZOOM Cloud Meetings | BUSINESS | 4.4 | 31614 | 37M | 10,000,000+ | Free | 0 | Everyone | Business | 20-Jul-18 | 4.1.28165.0716 | 4.0 and up |
| 261 | join.me - Simple Meetings | BUSINESS | 4.0 | 6989 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Business | 16-Jul-18 | 4.3.0.508 | 4.4 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10715 | FarmersOnly Dating | DATING | 3.0 | 1145 | 1.4M | 100,000+ | Free | 0 | Mature 17+ | Dating | 25-Feb-16 | 2.2 | 4.0 and up |
| 10720 | Firefox Focus: The privacy browser | COMMUNICATION | 4.4 | 36981 | 4.0M | 1,000,000+ | Free | 0 | Everyone | Communication | 6-Jul-18 | 5.2 | 5.0 and up |
| 10730 | FP Notebook | MEDICAL | 4.5 | 410 | 60M | 50,000+ | Free | 0 | Everyone | Medical | 24-Mar-18 | 2.1.0.372 | 4.4 and up |
| 10753 | Slickdeals: Coupons & Shopping | SHOPPING | 4.5 | 33599 | 12M | 1,000,000+ | Free | 0 | Everyone | Shopping | 30-Jul-18 | 3.9 | 4.4 and up |


As we can see, there are 1,088 duplicated rows. Let's remove them.

In [9]:
# Deleting duplicated information
df = df.drop_duplicates(subset = ['category', 'installs', 'app'])

df[df[['category', 'installs', 'app']].duplicated()]

| | app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
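
To make the behavior of `duplicated` and `drop_duplicates` with a `subset` concrete, here is a small sketch on invented toy data (the rows and review counts below are made up for illustration and do not come from the dataset). With the default `keep='first'`, only the later repeats are flagged and removed, even when other columns such as `reviews` differ:

```python
import pandas as pd

# Toy data: rows 0 and 1 share category, installs and app but differ in reviews
toy = pd.DataFrame({
    "app": ["Box", "Box", "Box", "Gmail"],
    "category": ["BUSINESS", "BUSINESS", "BUSINESS", "COMMUNICATION"],
    "installs": ["10,000,000+", "10,000,000+", "5,000,000+", "1,000,000,000+"],
    "reviews": [10, 12, 3, 7],
})

key = ["category", "installs", "app"]

# True only for repeats after the first occurrence of each key combination
dup_mask = toy.duplicated(subset=key)
n_dups = int(dup_mask.sum())

# Keeps the first row of every key combination (keep="first" is the default)
deduped = toy.drop_duplicates(subset=key)

print(n_dups)
print(deduped)
```

Note that the row kept for each app is simply the first one encountered, so which `reviews` value survives depends on row order.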


Now that there are no duplicated rows, let's review the null values once more.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9753 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             9753 non-null   object 
 1   category        9753 non-null   object 
 2   rating          8288 non-null   float64
 3   reviews         9753 non-null   object 
 4   size            9753 non-null   object 
 5   installs        9753 non-null   object 
 6   type            9752 non-null   object 
 7   price           9753 non-null   object 
 8   content_rating  9752 non-null   object 
 9   genres          9753 non-null   object 
 10  last_updated    9753 non-null   object 
 11  current_ver     9745 non-null   object 
 12  android_ver     9750 non-null   object 
dtypes: float64(1), object(12)
memory usage: 1.0+ MB


The columns `rating`, `type`, `content_rating`, `current_ver` and `android_ver` still contain null values. Let's review them.

In [11]:
# Reviewing the missing data in the column "rating"
df[df['rating'].isna()]

| | app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Mcqueen Coloring pages | ART_AND_DESIGN | NaN | 61 | 7.0M | 100,000+ | Free | 0 | Everyone | Art & Design;Action & Adventure | 7-Mar-18 | 1.0.0 | 4.1 and up |
| 113 | Wrinkles and rejuvenation | BEAUTY | NaN | 182 | 5.7M | 100,000+ | Free | 0 | Everyone 10+ | Beauty | 20-Sep-17 | 8 | 3.0 and up |
| 123 | Manicure - nail design | BEAUTY | NaN | 119 | 3.7M | 50,000+ | Free | 0 | Everyone | Beauty | 23-Jul-18 | 1.3 | 4.1 and up |
| 126 | Skin Care and Natural Beauty | BEAUTY | NaN | 654 | 7.4M | 100,000+ | Free | 0 | Teen | Beauty | 17-Jul-18 | 1.15 | 4.1 and up |
| 129 | Secrets of beauty, youth and health | BEAUTY | NaN | 77 | 2.9M | 10,000+ | Free | 0 | Mature 17+ | Beauty | 8-Aug-17 | 2 | 2.3 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10824 | Cardio-FR | MEDICAL | NaN | 67 | 82M | 10,000+ | Free | 0 | Everyone | Medical | 31-Jul-18 | 2.2.2 | 4.4 and up |
| 10825 | Naruto & Boruto FR | SOCIAL | NaN | 7 | 7.7M | 100+ | Free | 0 | Teen | Social | 2-Feb-18 | 1 | 4.0 and up |
| 10831 | payermonstationnement.fr | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | 13-Jun-18 | 2.0.148.0 | 4.0 and up |
| 10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | 29-Sep-16 | 1.1.5 | 4.0 and up |


In [12]:
# Counting the apps with a missing rating in each install segment
df[df['rating'].isna()].groupby(by='installs', as_index=False).agg({'app': 'count'}).sort_values(by='app', ascending=False)

| | installs | app |
|---|---|---|
| 7 | 100+ | 407 |
| 5 | 10+ | 316 |
| 3 | 1,000+ | 192 |
| 11 | 50+ | 148 |
| 13 | 500+ | 129 |
| 9 | 5+ | 73 |
| 2 | 1+ | 64 |
| 6 | 10,000+ | 44 |
| 10 | 5,000+ | 43 |
| 8 | 100,000+ | 19 |


As we can see, most of the apps without a rating have only a few installs (the largest groups are `100+` and `10+`), so dropping them should have little effect on an analysis of the most-installed apps. We are also going to drop the rows with null values in the columns `type`, `content_rating`, `current_ver` and `android_ver`.

In [13]:
# Drop the rows with null values in any of the listed columns
df = df.dropna(subset=['rating', 'type', 'content_rating', 'current_ver', 'android_ver'])

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8281 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             8281 non-null   object 
 1   category        8281 non-null   object 
 2   rating          8281 non-null   float64
 3   reviews         8281 non-null   object 
 4   size            8281 non-null   object 
 5   installs        8281 non-null   object 
 6   type            8281 non-null   object 
 7   price           8281 non-null   object 
 8   content_rating  8281 non-null   object 
 9   genres          8281 non-null   object 
 10  last_updated    8281 non-null   object 
 11  current_ver     8281 non-null   object 
 12  android_ver     8281 non-null   object 
dtypes: float64(1), object(12)
memory usage: 905.7+ KB
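
As a complement to reading `df.info()`, the null values can also be counted directly per column. The sketch below uses invented toy data (not rows from the dataset) to show `isna().sum()` for counts, `isna().mean()` for shares, and the effect of `dropna` with a `subset`:

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two columns
toy = pd.DataFrame({
    "app": ["A", "B", "C", "D"],
    "rating": [4.1, np.nan, 3.9, np.nan],
    "type": ["Free", "Free", None, "Paid"],
})

# Number of nulls per column
null_counts = toy.isna().sum()

# Share of nulls per column, rounded to two decimals
null_share = toy.isna().mean().round(2)

# A row is dropped if ANY of the listed columns is null
cleaned = toy.dropna(subset=["rating", "type"])

print(null_counts)
print(null_share)
print(len(cleaned))
```

Because `dropna` removes a row when any of the listed columns is null, the number of rows removed can be smaller than the sum of the per-column null counts.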


Now we can continue with the analysis.

# Data Analysis

## Top categories by number of apps

First, let's review the categories that we have.

In [14]:
# Reviewing the values of the category column
print(df.category.nunique())
df.category.unique()

33


array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)

As we can see, we have 33 categories. Now let's group by category and count the number of apps in each one.

In [15]:
# Grouping by category
categories = df.groupby(by='category', as_index=False).agg({'app': 'count'}).sort_values(by='app', ascending=False)
categories.columns = ['Category', 'App_num']
categories.head(10)

| | Category | App_num |
|---|---|---|
| 11 | FAMILY | 1683 |
| 14 | GAME | 915 |
| 29 | TOOLS | 719 |
| 25 | PRODUCTIVITY | 302 |
| 12 | FINANCE | 302 |
| 18 | LIFESTYLE | 301 |
| 23 | PERSONALIZATION | 296 |
| 20 | MEDICAL | 293 |
| 24 | PHOTOGRAPHY | 264 |
| 4 | BUSINESS | 263 |
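
The same kind of ranking can also be produced with `value_counts`, which counts and sorts in one call. The sketch below uses invented toy data and gives the same result as the `groupby` approach as long as the `app` column has no nulls, which is the case here according to `df.info()`:

```python
import pandas as pd

# Toy data: three FAMILY apps, two GAME apps, one TOOLS app
toy = pd.DataFrame({
    "app": ["a", "b", "c", "d", "e", "f"],
    "category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME"],
})

# value_counts returns counts sorted in descending order
counts = toy["category"].value_counts()

# Turn the Series into a two-column frame with the names used in this notebook
top = counts.head(2).rename_axis("Category").reset_index(name="App_num")

print(top)
```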


In [16]:
# Plotting the results
fig = px.bar(categories.head(10),
                x= 'Category',
                y= 'App_num',
                title='Top 10 Categories')

fig.show()

The top 10 categories are: 
1. Family
2. Game
3. Tools
4. Productivity
5. Finance
6. Lifestyle
7. Personalization
8. Medical
9. Photography
10. Business

## Classify install segments

Let's review the distinct values of the column `installs`.

In [17]:
# Reviewing the install segments
print(df.installs.nunique())
df.installs.unique()

19


array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+',
       '5+', '50+', '1+'], dtype=object)

There are 19 segments. Let's see how many applications are in each one.

In [18]:
# Grouping by install segment
downloads = df.groupby(by='installs', as_index=False).agg({'app': 'count'})

# Segments listed from smallest to largest, then reversed so the largest comes first
install_order = ['1+','5+','10+', '50+','100+', '500+', '1,000+', '5,000+', '10,000+', '50,000+', '100,000+', '500,000+', '1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+', '100,000,000+', '500,000,000+', '1,000,000,000+']
install_order = install_order[::-1]

# Ordered categorical so that sorting follows install_order instead of alphabetical order
downloads['installs'] = pd.Categorical(downloads['installs'], categories=install_order, ordered=True)

downloads = downloads.sort_values('installs')

downloads.columns = ['Installs', 'App_num']

downloads


| | Installs | App_num |
|---|---|---|
| 3 | 1,000,000,000+ | 21 |
| 18 | 500,000,000+ | 27 |
| 9 | 100,000,000+ | 204 |
| 15 | 50,000,000+ | 211 |
| 6 | 10,000,000+ | 962 |
| 12 | 5,000,000+ | 621 |
| 2 | 1,000,000+ | 1427 |
| 17 | 500,000+ | 508 |
| 8 | 100,000+ | 1097 |
| 14 | 50,000+ | 456 |
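
The segment order above is typed by hand. As an alternative sketch, the order could be derived from the labels themselves by converting each one to an integer. The helper `installs_to_int` is a hypothetical name introduced for this example, and it assumes every label has the form digits, optional commas, and a trailing `+`:

```python
import pandas as pd


def installs_to_int(label: str) -> int:
    """Hypothetical helper: '10,000+' -> 10000."""
    return int(label.rstrip("+").replace(",", ""))


# A few labels in arbitrary order
labels = pd.Series(["10,000+", "1+", "1,000,000,000+", "500+", "5,000,000+"])

# Sort the unique labels by their numeric value, largest first
install_order = sorted(labels.unique(), key=installs_to_int, reverse=True)

# Same ordered categorical as in the notebook, built from the derived order
ordered = pd.Categorical(labels, categories=install_order, ordered=True)

print(install_order)
```

This avoids typos in a long hand-written list, at the cost of failing on any label that does not match the assumed format.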


In [19]:
# Plotting the results
fig = px.bar(downloads,
                x= 'Installs',
                y= 'App_num',
                title='Installs Segments')

fig.show()

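These results show that 1,000,000+ is the largest of the segments listed, with 1,427 applications. Adding up the segments from 1,000,000+ upward gives 3,473 of the 8,281 applications (about 42%), so a large share of the applications in this dataset, though not a majority, were installed at least 1,000,000 times. Only 21 applications reached 1,000,000,000 or more installs.

The most common outcome is for an application to land in the 1,000,000+ install segment.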
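
A quick way to quantify statements like these is to add up the segment counts and divide by the total number of apps. The sketch below copies the counts for the segments from 1,000,000+ upward from the table above and uses the 8,281 rows reported by `df.info()` after cleaning; the variable names are only illustrative:

```python
# Counts copied from the install segment table above (1,000,000+ and higher)
segment_counts = {
    "1,000,000,000+": 21,
    "500,000,000+": 27,
    "100,000,000+": 204,
    "50,000,000+": 211,
    "10,000,000+": 962,
    "5,000,000+": 621,
    "1,000,000+": 1427,
}

# Rows remaining after cleaning, as reported by df.info()
total_apps = 8281

# Apps with at least 1,000,000 installs and their share of the dataset
at_least_1m = sum(segment_counts.values())
share = at_least_1m / total_apps

print(f"{at_least_1m} of {total_apps} apps ({share:.1%})")
```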

## Top categories according to the install segments

In [20]:
# Grouping by categories and install segments
categories_installs = df.groupby(['installs', 'category'], as_index=False).agg({'app':'count'})

categories_installs['installs'] = pd.Categorical(categories_installs['installs'], categories=install_order, ordered=True)
categories_installs = categories_installs.sort_values('installs')

categories_installs.head(20)

| | installs | category | app |
|---|---|---|---|
| 67 | 1,000,000,000+ | BOOKS_AND_REFERENCE | 1 |
| 69 | 1,000,000,000+ | ENTERTAINMENT | 1 |
| 70 | 1,000,000,000+ | FAMILY | 1 |
| 71 | 1,000,000,000+ | GAME | 1 |
| 72 | 1,000,000,000+ | NEWS_AND_MAGAZINES | 1 |
| 73 | 1,000,000,000+ | PHOTOGRAPHY | 1 |
| 74 | 1,000,000,000+ | PRODUCTIVITY | 1 |
| 75 | 1,000,000,000+ | SOCIAL | 3 |
| 76 | 1,000,000,000+ | TOOLS | 1 |
| 77 | 1,000,000,000+ | TRAVEL_AND_LOCAL | 2 |


The categories with at least one app in the most-installed segment (1,000,000,000+) are listed below; the numbering is not a ranking:
1. Books and Reference
2. Entertainment
3. Family
4. Game
5. News and Magazines
6. Photography 
7. Productivity
8. Social
9. Tools
10. Travel and Local
11. Video Players
12. Communication

Now let's see the apps in this install segment (1,000,000,000+).

In [21]:
# All applications that are in the "1,000,000,000+" segment
df[df['installs'] == '1,000,000,000+'][['app', 'category','rating','installs','type']]

| | app | category | rating | installs | type |
|---|---|---|---|---|---|
| 152 | Google Play Books | BOOKS_AND_REFERENCE | 3.9 | 1,000,000,000+ | Free |
| 335 | Messenger – Text and Video Chat for Free | COMMUNICATION | 4.0 | 1,000,000,000+ | Free |
| 336 | WhatsApp Messenger | COMMUNICATION | 4.4 | 1,000,000,000+ | Free |
| 338 | Google Chrome: Fast & Secure | COMMUNICATION | 4.3 | 1,000,000,000+ | Free |
| 340 | Gmail | COMMUNICATION | 4.3 | 1,000,000,000+ | Free |
| 341 | Hangouts | COMMUNICATION | 4.0 | 1,000,000,000+ | Free |
| 391 | Skype - free IM & video calls | COMMUNICATION | 4.1 | 1,000,000,000+ | Free |
| 865 | Google Play Games | ENTERTAINMENT | 4.3 | 1,000,000,000+ | Free |
| 1654 | Subway Surfers | GAME | 4.5 | 1,000,000,000+ | Free |
| 2544 | Facebook | SOCIAL | 4.1 | 1,000,000,000+ | Free |


As we can see, most of the top applications are made by Google and come preinstalled on Android devices. To find the apps that users actually choose to install, we should look at another segment. To explore this, we are going to build a dashboard to visualize the data.

# Conclusions 

From the analysis we can conclude the following:

* The top categories by number of apps are:
    1. Family
    2. Game
    3. Tools
    4. Productivity
    5. Finance
    6. Lifestyle
    7. Personalization
    8. Medical
    9. Photography
    10. Business

* The most common install segment is 1,000,000+.

* Most of the top applications are made by Google and come preinstalled on Android devices.

### [Link to the Dashboard](https://public.tableau.com/app/profile/jos.jim.nez6777/viz/Tableau-Dashboard_17229825302340/UsersInstallsBehabiour?publish=yes)

In [22]:
# Code used to save the cleaned dataset for the dashboard
#df.to_csv('./datasets/dashboard_data.csv', index = False)