# Profitable App Profiles for the Google Play Store

Our job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.
Our revenue for any given app is mostly influenced by the number of people who use it, so the goal of this project is to analyze the data and help our developers understand which kinds of apps are likely to attract more users.

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [87]:
Android_app_raw = pd.read_csv('./data/googleplaystore.csv')

In [88]:
# Google Play store
Android_app_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


The `Android_app_raw` DataFrame has **10841** rows (one per app listing) and 13 columns.

The columns that might be useful for our analysis are `App`, `Category`, `Reviews`, `Rating`, `Installs`, `Type`, `Price`, and `Genres`.

# Google Play Store

In [89]:
# Preview the first five rows
Android_app_raw.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [178]:
# Remove irrelevant columns
Android_app = Android_app_raw[['App', 'Category', 'Reviews', 'Rating', 'Installs', 'Type', 'Price', 'Genres']]
Android_app.head()

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,4.1,"10,000+",Free,0,Art & Design
1,Coloring book moana,ART_AND_DESIGN,967,3.9,"500,000+",Free,0,Art & Design;Pretend Play
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,4.7,"5,000,000+",Free,0,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,215644,4.5,"50,000,000+",Free,0,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,967,4.3,"100,000+",Free,0,Art & Design;Creativity


In [91]:
# Locate the malformed row whose Reviews value is '3.0M' (index 10472), then drop it
Android_app[Android_app['Reviews'].str.contains('M') == True]
Android_app.drop(10472, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
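
The warning above appears because `Android_app` was created by selecting columns from `Android_app_raw`, so pandas cannot tell whether later in-place changes are meant for the original frame or for the selection. A minimal sketch of one common way to avoid it is to call `.copy()` on the selection. The small `raw` frame and its values are made up for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the raw table (values are illustrative only)
raw = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Reviews': ['10', '3.0M', '25'],
    'Size': ['1M', '2M', '3M'],
})

# .copy() makes the selection an independent DataFrame
subset = raw[['App', 'Reviews']].copy()

# Drop the malformed row in place, then convert the column to integers
bad_rows = subset.index[subset['Reviews'].str.contains('M')]
subset.drop(bad_rows, inplace=True)
subset['Reviews'] = subset['Reviews'].astype(int)

print(subset)
```

Because `subset` is an explicit copy, the in-place `drop` and the column assignment act on it alone and `raw` is left untouched.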


In [92]:
# Convert the Reviews column to integers
Android_app['Reviews'] = Android_app['Reviews'].astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Android_app['Reviews'] = Android_app['Reviews'].astype('int')


In [93]:
Android_app.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   App       10840 non-null  object 
 1   Category  10840 non-null  object 
 2   Reviews   10840 non-null  int32  
 3   Rating    9366 non-null   float64
 4   Installs  10840 non-null  object 
 5   Type      10839 non-null  object 
 6   Price     10840 non-null  object 
 7   Genres    10840 non-null  object 
dtypes: float64(1), int32(1), object(6)
memory usage: 719.8+ KB


In [94]:
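# Check for the '$' sign in the Price column
Android_app['Price'].str.contains('$').count()

10840

Note that this result is simply the number of rows: `str.contains` treats `'$'` as a regular-expression end-of-string anchor, so it matches every value, and `.count()` counts non-null entries rather than `True` values. Passing `regex=False` and calling `.sum()` instead would count only the prices that actually contain a dollar sign.
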
In [95]:
#########################

#Android_app['Price'] = Android_app['Price'].str.replace('$', "")
#Android_app['Price'].astype('float').dtype
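
The commented-out cell above outlines the intended cleanup of `Price`. Here is a minimal, hedged version on a made-up Series, assuming prices are either `'0'` or strings such as `'$4.99'`, as in the table previews. The names `price`, `n_paid`, and `price_float` are illustrative only.

```python
import pandas as pd

# Illustrative values in the same format as the Price column
price = pd.Series(['0', '$4.99', '0', '$16.99'])

# regex=False treats '$' as a literal character, not an end-of-string anchor
has_dollar = price.str.contains('$', regex=False)
n_paid = int(has_dollar.sum())  # number of True values

# Strip the literal '$' and convert the strings to floats
price_float = price.str.replace('$', '', regex=False).astype(float)

print(n_paid, price_float.tolist())
```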

In [96]:
# missing data
Android_app.isnull().sum()

App            0
Category       0
Reviews        0
Rating      1474
Installs       0
Type           1
Price          0
Genres         0
dtype: int64

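`App`, `Category`, `Reviews`, `Installs`, `Price`, and `Genres` have no missing values. `Rating` is missing for 1474 apps and `Type` for one, but no rows are removed at this step.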

## Removing duplicated Apps

A quick scan of the app names shows that some apps appear more than once.

In [97]:
Android_app['App'].duplicated().sum()

1181

There are 1181 rows that repeat an app already listed. Examples of duplicated apps:

In [98]:
Android_app['Duplicated'] = Android_app['App'].duplicated()
# Android_app[Android_app['Duplicated'] == True]['App']
duplicate = Android_app.loc[Android_app['Duplicated'] == True]
duplicate.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Android_app['Duplicated'] = Android_app['App'].duplicated()


Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres,Duplicated
229,Quick PDF Scanner + OCR FREE,BUSINESS,80805,4.2,"5,000,000+",Free,0,Business,True
236,Box,BUSINESS,159872,4.2,"10,000,000+",Free,0,Business,True
239,Google My Business,BUSINESS,70991,4.4,"5,000,000+",Free,0,Business,True
256,ZOOM Cloud Meetings,BUSINESS,31614,4.4,"10,000,000+",Free,0,Business,True
261,join.me - Simple Meetings,BUSINESS,6989,4.0,"1,000,000+",Free,0,Business,True


In [99]:
Instagram = Android_app[Android_app['App'] == 'Instagram']

We remove the duplicated entries but keep the one with the highest number of reviews for each app.

In [100]:
Instagram['Reviews'].sort_values()

3909    66509917
2545    66577313
2611    66577313
2604    66577446
Name: Reviews, dtype: int32

In [101]:
duplicate.drop('Duplicated', axis =1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [102]:
duplicate.sort_values('App').head()

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres
1407,10 Best Foods for You,HEALTH_AND_FITNESS,2490,4.0,"500,000+",Free,0,Health & Fitness
2543,1800 Contacts - Lens Store,MEDICAL,23160,4.7,"1,000,000+",Free,0,Medical
2385,2017 EMRA Antibiotic Guide,MEDICAL,12,4.4,"1,000+",Paid,$16.99,Medical
1434,21-Day Meditation Experience,HEALTH_AND_FITNESS,11506,4.4,"100,000+",Free,0,Health & Fitness
5415,365Scores - Live Scores,SPORTS,666246,4.6,"10,000,000+",Free,0,Sports


In [103]:
# duplicated() flags every occurrence except the first, so add 1 to get the total count per app
duplicate['App'].value_counts()+1

ROBLOX                                               9
CBS Sports App - Scores, News, Stats & Watch Live    8
Candy Crush Saga                                     7
8 Ball Pool                                          7
Duolingo: Learn Languages Free                       7
                                                    ..
Learn English with Wlingua                           2
Coursera: Online courses                             2
Alto's Adventure                                     2
United Airlines                                      2
10 Best Foods for You                                2
Name: App, Length: 798, dtype: int64

### Apply the groupby.rank() method to keep the entry with the most reviews for each app

In [104]:
Android_app['Rank'] = Android_app.groupby("App")['Reviews'].rank(method = 'first', ascending = False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Android_app['Rank'] = Android_app.groupby("App")['Reviews'].rank(method = 'first', ascending = False)


In [105]:
Android_app[Android_app['App'].str.contains('Plants vs. Zombies FREE') == True]

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres,Duplicated,Rank
1666,Plants vs. Zombies FREE,GAME,4066989,4.4,"100,000,000+",Free,0,Strategy,False,3.0
1706,Plants vs. Zombies FREE,GAME,4066980,4.4,"100,000,000+",Free,0,Strategy,True,4.0
1881,Plants vs. Zombies FREE,GAME,4067651,4.4,"100,000,000+",Free,0,Strategy,True,1.0
1967,Plants vs. Zombies FREE,GAME,4067651,4.4,"100,000,000+",Free,0,Strategy,True,2.0
4055,Plants vs. Zombies FREE,FAMILY,4064868,4.4,"100,000,000+",Free,0,Strategy,True,5.0


In [106]:
Android_app = Android_app[Android_app['Rank'] == 1]

In [107]:
Android_app.shape

(9659, 10)

After removing duplicates and keeping the entry with the most reviews for each app, 9659 rows remain.

In [108]:
Android_app.drop(['Rank','Duplicated'], axis = 1, inplace= True)

In [109]:
Android_app.shape

(9659, 8)
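
An alternative to ranking, shown here only as a sketch, is to sort by `Reviews` and keep the first row per app with `drop_duplicates`. The Instagram review counts are taken from the output above; the second Box value is invented for illustration.

```python
import pandas as pd

# Small example frame; only the columns needed for de-duplication
apps = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Box', 'Box'],
    'Reviews': [66509917, 66577313, 66577446, 159872, 159870],
})

# Sort so the most-reviewed entry of each app comes first,
# keep that entry, then restore the original row order
deduped = (
    apps.sort_values('Reviews', ascending=False)
        .drop_duplicates(subset='App', keep='first')
        .sort_index()
)

print(deduped)
```

This avoids the helper `Rank` and `Duplicated` columns, at the cost of not exposing the rank of each duplicate for inspection.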

## Removing Non-English Apps

In [110]:
Android_app.head(3)

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,4.1,"10,000+",Free,0,Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,4.7,"5,000,000+",Free,0,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,215644,4.5,"50,000,000+",Free,0,Art & Design


In [111]:
Android_app.iloc[4412][0]
Android_app.iloc[7940][0]

'لعبة تقدر تربح DZ'

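Remove non-English apps

Explanation: the characters commonly used in English text are encoded in the ASCII standard, where each character corresponds to a number between 0 and 127. If a string contains a character whose code point exceeds 127, we treat the app name as non-English. Note that the function below returns after checking only the first character, which is why names such as 'Docs To Go™ Free Office Suite' and 'Instachat 😜' still return True.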

In [112]:
# Function that detects non-English app names

def is_english(string):

    for character in string:
        if ord(character) > 127:
            return False
        else:
            return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(ord('™'))
print(ord('😜'))

True
False
True
True
8482
128540
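
A stricter check would inspect every character rather than only the first. The sketch below is one possible variant, not the function used for the counts reported in this notebook: `is_mostly_ascii` and its `max_non_ascii` parameter are hypothetical names, and the default tolerance of 3 non-ASCII characters is an assumption chosen so that names with a trademark sign or an emoji still pass.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    # Count the characters whose code point falls outside the ASCII range
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('Instachat 😜'))
```

Setting `max_non_ascii=0` gives the strict rule described in the explanation above, where any non-ASCII character marks the name as non-English.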


In [113]:
english_app = []

for app in list(Android_app['App']):
    if is_english(app):
        english_app.append(app)
        
print(f'# rows of english only app = {np.array(english_app).shape[0]}')
print(f'# rows of original data = {Android_app.shape[0]}')

# rows of english only app = 9623
# rows of original data = 9659


In [114]:
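Android_app['App'] = Android_app[Android_app['App'].isin(english_app)]

Note that this assignment does not drop any rows. Judging by the later `info()` output (8872 non-null `App` values out of 8904 rows), it sets the `App` name of non-English apps to NaN and leaves their rows in place. Reassigning the filtered frame, `Android_app = Android_app[Android_app['App'].isin(english_app)]`, would remove those rows instead.
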
# Free Apps

In [115]:
Android_app = Android_app[Android_app['Type'] == "Free"]

In [116]:
print(f'Free Android App = {Android_app.shape[0]}')

Free Android App = 8904


# EDA 

In [117]:
Android_app.head()

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,4.1,"10,000+",Free,0,Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,4.7,"5,000,000+",Free,0,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,215644,4.5,"50,000,000+",Free,0,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,967,4.3,"100,000+",Free,0,Art & Design;Creativity
5,Paper flowers instructions,ART_AND_DESIGN,167,4.4,"50,000+",Free,0,Art & Design


### By Category and Genre in GOOGLE PLAY STORE

In [118]:
# Category
Android_app['Category'].value_counts(normalize= True) * 100

FAMILY                 18.969003
GAME                    9.703504
TOOLS                   8.434412
BUSINESS                4.582210
LIFESTYLE               3.930818
PRODUCTIVITY            3.885894
FINANCE                 3.683738
MEDICAL                 3.515274
SPORTS                  3.380503
PERSONALIZATION         3.313118
COMMUNICATION           3.234501
HEALTH_AND_FITNESS      3.066038
PHOTOGRAPHY             2.942498
NEWS_AND_MAGAZINES      2.830189
SOCIAL                  2.650494
TRAVEL_AND_LOCAL        2.324798
SHOPPING                2.246181
BOOKS_AND_REFERENCE     2.178796
DATING                  1.853100
VIDEO_PLAYERS           1.796945
MAPS_AND_NAVIGATION     1.415094
FOOD_AND_DRINK          1.235400
EDUCATION               1.168014
ENTERTAINMENT           0.954627
LIBRARIES_AND_DEMO      0.932165
AUTO_AND_VEHICLES       0.920934
HOUSE_AND_HOME          0.819856
WEATHER                 0.797394
EVENTS                  0.707547
PARENTING               0.651393
ART_AND_DE

In the Google Play Store, only 9.7% of the free apps are games, while the Family category holds roughly twice that share at 18.97%. Most of the ten largest categories serve practical purposes (family, tools, business, lifestyle, ...).


In [119]:
# Genre
Android_app['Genres'].value_counts(normalize= True)[:20] * 100

Tools                8.423181
Entertainment        6.087152
Education            5.390836
Business             4.582210
Lifestyle            3.919587
Productivity         3.885894
Finance              3.683738
Medical              3.515274
Sports               3.447889
Personalization      3.313118
Communication        3.234501
Action               3.088500
Health & Fitness     3.066038
Photography          2.942498
News & Magazines     2.830189
Social               2.650494
Travel & Local       2.313567
Shopping             2.246181
Books & Reference    2.178796
Simulation           2.066487
Name: Genres, dtype: float64

The difference between the `Genres` and `Category` columns is not crystal clear, but `Genres` is more granular. By share, practical-purpose genres (Tools, Business, Lifestyle, Productivity, Finance, ...) again outnumber genres aimed purely at entertainment.

Let's take a closer look at the Family category.

In [120]:
Family = Android_app[Android_app['Category'] == 'FAMILY']

In [121]:
Family.groupby(by= 'Genres').count().sort_values('App', ascending = False)['App'][:20]

Genres
Entertainment                  459
Education                      384
Simulation                     176
Casual                         134
Puzzle                          78
Role Playing                    72
Strategy                        65
Educational;Education           35
Educational                     33
Education;Education             25
Casual;Pretend Play             21
Racing;Action & Adventure       15
Puzzle;Brain Games              15
Casual;Action & Adventure       12
Entertainment;Music & Video     12
Arcade;Action & Adventure       11
Casual;Brain Games              11
Educational;Pretend Play         8
Action;Action & Adventure        8
Board;Brain Games                7
Name: App, dtype: int64

In [122]:
Family['Genres'].value_counts(normalize= True)[:10] * 100

Entertainment            27.353464
Education                22.912966
Simulation               10.479574
Casual                    7.933689
Puzzle                    4.618117
Role Playing              4.262877
Strategy                  3.907638
Educational;Education     2.072232
Educational               1.953819
Education;Education       1.480166
Name: Genres, dtype: float64

In the FAMILY category, about 27% of the apps belong to the Entertainment genre and about 23% to Education. The category therefore splits into two main groups: entertainment (including games such as Casual, Puzzle, or Role Playing) and education.

In [123]:
Android_app.head()

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,4.1,"10,000+",Free,0,Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,4.7,"5,000,000+",Free,0,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,215644,4.5,"50,000,000+",Free,0,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,967,4.3,"100,000+",Free,0,Art & Design;Creativity
5,Paper flowers instructions,ART_AND_DESIGN,167,4.4,"50,000+",Free,0,Art & Design


Which category has the highest average rating on the Google Play Store?

In [124]:
Android_app.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8904 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   App       8872 non-null   object 
 1   Category  8904 non-null   object 
 2   Reviews   8904 non-null   int32  
 3   Rating    7594 non-null   float64
 4   Installs  8904 non-null   object 
 5   Type      8904 non-null   object 
 6   Price     8904 non-null   object 
 7   Genres    8904 non-null   object 
dtypes: float64(1), int32(1), object(6)
memory usage: 591.3+ KB


In [125]:
x = Android_app[Android_app['Category']=='ART_AND_DESIGN']
x[['Rating','Reviews']] = x[['Rating', 'Reviews']].dropna()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [126]:
x['W_Review'] = x['Reviews'] / x['Reviews'].sum() 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x['W_Review'] = x['Reviews'] / x['Reviews'].sum()


In [127]:
(x['W_Review']*x['Rating']).sum()

4.473816601355378

In [128]:
cat_list = Android_app['Category'].unique()
rating_avg = []
for cat in cat_list:
    # Keep only rated apps in this category; .copy() avoids SettingWithCopyWarning
    x = Android_app[Android_app['Category'] == cat].dropna(subset=['Rating']).copy()

    # Weight each app's rating by its share of the category's reviews
    x['W_Review'] = x['Reviews'] / x['Reviews'].sum()
    avg = (x['W_Review'] * x['Rating']).sum()
    rating_avg.append(round(avg, 2))
print(rating_avg)
print('\n')
print(cat_list)

[4.47, 4.49, 4.34, 4.5, 4.32, 4.39, 4.28, 4.17, 4.56, 4.28, 4.19, 4.38, 4.38, 4.57, 4.42, 4.09, 4.32, 4.45, 4.4, 4.6, 4.27, 4.45, 4.43, 4.34, 4.36, 4.54, 4.51, 4.43, 4.62, 4.41, 4.4, 4.3, 4.4]


['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION']



In [129]:
dict_zip = zip(cat_list, rating_avg)
dict_cat_rating = dict(dict_zip)
df_cat_rating = pd.DataFrame(list(dict_cat_rating.items()), columns= ['Category', 'Average Ratings'])
df_cat_rating.sort_values(by= 'Average Ratings', ascending = False)

Unnamed: 0,Category,Average Ratings
28,PARENTING,4.62
19,MEDICAL,4.6
13,HEALTH_AND_FITNESS,4.57
8,EDUCATION,4.56
25,TOOLS,4.54
26,PERSONALIZATION,4.51
3,BOOKS_AND_REFERENCE,4.5
1,AUTO_AND_VEHICLES,4.49
0,ART_AND_DESIGN,4.47
17,GAME,4.45


The review-weighted ratings are high across the board. PARENTING leads (4.62), followed by MEDICAL (4.60) and HEALTH_AND_FITNESS (4.57), while LIBRARIES_AND_DEMO is the lowest (4.09).
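
The per-category figure above is a review-weighted mean: the sum of rating times reviews divided by the sum of reviews, taken over rated apps. A minimal sketch of the same idea with `groupby` instead of an explicit loop follows; the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: one unrated app in category A
apps = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B'],
    'Rating':   [4.0, 5.0, None, 3.0, 4.0],
    'Reviews':  [100, 300, 50, 10, 30],
})

# Only rated apps contribute to the weighted mean
rated = apps.dropna(subset=['Rating'])

# sum(rating * reviews) / sum(reviews), computed per category
weighted_sum = (rated['Rating'] * rated['Reviews']).groupby(rated['Category']).sum()
total_reviews = rated.groupby('Category')['Reviews'].sum()
weighted_rating = (weighted_sum / total_reviews).round(2)

print(weighted_rating)
```

For category A this gives (4.0 × 100 + 5.0 × 300) / 400 = 4.75, so the app with more reviews pulls the average towards its own rating.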

### Installations

Which Category was most installed?

In [130]:
Android_app['Installs'].value_counts()

1,000,000+        1397
100,000+          1031
10,000,000+        935
10,000+            913
1,000+             750
100+               616
5,000,000+         607
500,000+           493
50,000+            429
5,000+             403
10+                315
500+               288
50,000,000+        204
100,000,000+       189
50+                170
5+                  70
1+                  46
500,000,000+        24
1,000,000,000+      20
0+                   4
Name: Installs, dtype: int64

In [131]:
Android_app['Installs'].nunique()

20

In [132]:
# Top installed apps
Android_app[Android_app['Installs'].isin(['1,000,000,000+', '500,000,000+'])].sort_values(by= 'Installs')

Unnamed: 0,App,Category,Reviews,Rating,Installs,Type,Price,Genres
152,Google Play Books,BOOKS_AND_REFERENCE,1433233,3.9,"1,000,000,000+",Free,0,Books & Reference
5856,Google Play Games,FAMILY,7168735,4.3,"1,000,000,000+",Free,0,Entertainment
3687,Google Play Movies & TV,VIDEO_PLAYERS,906384,3.7,"1,000,000,000+",Free,0,Video Players & Editors
3665,YouTube,VIDEO_PLAYERS,25655305,4.3,"1,000,000,000+",Free,0,Video Players & Editors
3523,Google Drive,PRODUCTIVITY,2731211,4.4,"1,000,000,000+",Free,0,Productivity
3234,Google,TOOLS,8033493,4.4,"1,000,000,000+",Free,0,Tools
3232,Google Street View,TRAVEL_AND_LOCAL,2129707,4.2,"1,000,000,000+",Free,0,Travel & Local
3223,Maps - Navigate & Explore,TRAVEL_AND_LOCAL,9235373,4.3,"1,000,000,000+",Free,0,Travel & Local
2604,Instagram,SOCIAL,66577446,4.5,"1,000,000,000+",Free,0,Social
2554,Google+,SOCIAL,4831125,4.2,"1,000,000,000+",Free,0,Social


In [133]:
Android_app['Category'].value_counts()
Android_app['Category'].nunique()

33

In [134]:
pivot_cat_installs = pd.pivot_table(Android_app, index='Installs', columns='Category', aggfunc='count')
pivot_cat_installs

Values are counts per Installs bucket (rows) and Category (columns); the left block counts App, the right block counts Type
Installs,ART_AND_DESIGN,AUTO_AND_VEHICLES,BEAUTY,BOOKS_AND_REFERENCE,BUSINESS,COMICS,COMMUNICATION,DATING,EDUCATION,ENTERTAINMENT,...,PERSONALIZATION,PHOTOGRAPHY,PRODUCTIVITY,SHOPPING,SOCIAL,SPORTS,TOOLS,TRAVEL_AND_LOCAL,VIDEO_PLAYERS,WEATHER
0+,1.0,,,,1.0,,,,,,...,,,,,1.0,,,,,
1+,,1.0,,,5.0,,1.0,2.0,,,...,3.0,,4.0,,2.0,3.0,4.0,1.0,,
"1,000+",1.0,8.0,1.0,32.0,47.0,1.0,19.0,12.0,,,...,28.0,15.0,29.0,8.0,22.0,14.0,69.0,20.0,8.0,3.0
"1,000,000+",4.0,13.0,8.0,20.0,34.0,10.0,40.0,19.0,42.0,32.0,...,42.0,48.0,56.0,34.0,33.0,43.0,100.0,38.0,33.0,21.0
"1,000,000,000+",,,,1.0,,,6.0,,,,...,,1.0,1.0,,3.0,,1.0,2.0,2.0,
10+,,5.0,,4.0,40.0,,14.0,10.0,,,...,13.0,1.0,20.0,8.0,9.0,19.0,20.0,2.0,2.0,2.0
"10,000+",10.0,6.0,13.0,23.0,45.0,10.0,20.0,20.0,5.0,1.0,...,34.0,13.0,18.0,10.0,18.0,33.0,84.0,29.0,15.0,3.0
"10,000,000+",4.0,2.0,1.0,19.0,13.0,1.0,43.0,8.0,8.0,20.0,...,32.0,57.0,37.0,41.0,30.0,33.0,83.0,24.0,26.0,10.0
100+,1.0,5.0,5.0,6.0,74.0,1.0,28.0,14.0,,,...,32.0,13.0,31.0,11.0,16.0,21.0,42.0,15.0,7.0,
"100,000+",18.0,23.0,8.0,20.0,28.0,10.0,16.0,24.0,20.0,9.0,...,22.0,18.0,31.0,20.0,21.0,37.0,95.0,16.0,13.0,12.0


The `Installs` column holds strings containing `+` and `,` characters, so it gives open-ended ranges rather than the exact number of installs for each app. We're going to take the numbers as they are: an app with 100+ installs is treated as having 100 installs, an app with 100,000+ installs as having 100,000, an app with 1,000,000+ installs as having 1,000,000, and so on.


In [135]:
# convert from string to int values
Installs = []
for x in list(Android_app['Installs']):
    y = x.replace('+', '').replace(',','')
    Installs.append(int(y))

# print(Installs)

In [136]:
# Replace string data in installs column by the list of integer

Android_app['Installs'] = Installs
Android_app['Installs'].dtype

dtype('int64')
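
The same conversion can also be written without an explicit loop by using the pandas string methods. This is a sketch on a few sample values in the format shown above, with `installs` and `installs_int` as illustrative names.

```python
import pandas as pd

# Sample values in the same format as the Installs column
installs = pd.Series(['10,000+', '5,000,000+', '0+', '1,000,000,000+'])

# Remove the literal '+' and ',' characters, then convert to integers
installs_int = (
    installs.str.replace('+', '', regex=False)
            .str.replace(',', '', regex=False)
            .astype('int64')
)

print(installs_int.tolist())
```

Passing `regex=False` matters here because `+` is a special character in regular expressions.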

In [137]:
aaa = Android_app[['Category', 'Installs', 'App']]
aaa.head()

Unnamed: 0,Category,Installs,App
0,ART_AND_DESIGN,10000,Photo Editor & Candy Camera & Grid & ScrapBook
2,ART_AND_DESIGN,5000000,"U Launcher Lite – FREE Live Cool Themes, Hide ..."
3,ART_AND_DESIGN,50000000,Sketch - Draw & Paint
4,ART_AND_DESIGN,100000,Pixel Draw - Number Art Coloring Book
5,ART_AND_DESIGN,50000,Paper flowers instructions


In [138]:
pivot_cat_installs = pd.pivot_table(aaa, index='Category', columns='Installs', aggfunc='count')
pivot_cat_installs

Values are counts of App per Category (rows) and Installs (columns)
Category,0,1,5,10,50,100,500,1000,5000,10000,50000,100000,500000,1000000,5000000,10000000,50000000,100000000,500000000,1000000000
ART_AND_DESIGN,1.0,,,,,1.0,,1.0,4.0,10.0,6.0,18.0,4.0,4.0,3.0,4.0,1.0,,,
AUTO_AND_VEHICLES,,1.0,2.0,5.0,,5.0,3.0,8.0,2.0,6.0,4.0,23.0,5.0,13.0,3.0,2.0,,,,
BEAUTY,,,,,1.0,5.0,1.0,1.0,3.0,13.0,5.0,8.0,6.0,8.0,1.0,1.0,,,,
BOOKS_AND_REFERENCE,,,4.0,4.0,2.0,6.0,7.0,32.0,14.0,23.0,12.0,20.0,16.0,20.0,9.0,19.0,,4.0,,1.0
BUSINESS,1.0,5.0,17.0,40.0,24.0,74.0,22.0,47.0,17.0,45.0,11.0,28.0,8.0,34.0,15.0,13.0,5.0,2.0,,
COMICS,,,,,1.0,1.0,,1.0,4.0,10.0,6.0,10.0,5.0,10.0,4.0,1.0,,,,
COMMUNICATION,,1.0,2.0,14.0,5.0,28.0,8.0,19.0,16.0,20.0,10.0,16.0,9.0,40.0,22.0,43.0,7.0,16.0,5.0,6.0
DATING,,2.0,1.0,10.0,5.0,14.0,12.0,12.0,9.0,20.0,5.0,24.0,18.0,19.0,6.0,8.0,,,,
EDUCATION,,,,,,,,,,5.0,6.0,20.0,11.0,42.0,12.0,8.0,,,,
ENTERTAINMENT,,,,,,,,,,1.0,1.0,9.0,1.0,32.0,11.0,20.0,4.0,5.0,,


A quick look at the pivot table:
- Of the 20 apps with more than 1B installs, 6 are in COMMUNICATION (WhatsApp Messenger, Skype, Google Chrome, Messenger, Gmail, and Hangouts). The rest are spread across other categories, including Social, Travel and Local, Video Players, Books and Reference, Photography, Productivity, and Tools.
- In the second group, with more than 500M installs, Communication, Productivity, Game, and Tools dominate, accounting for roughly 20 of the 24 apps.

In [139]:
Installs_cat = pd.DataFrame(Android_app.groupby(by= 'Category')['Installs'].sum()).sort_values(by= 'Installs', ascending = False)

In [140]:
Installs_cat['Number_Apps'] = Android_app['Category'].value_counts()
Installs_cat['Install per App'] = Installs_cat['Installs'] / Installs_cat['Number_Apps']

In [141]:
Cat_df = Installs_cat.merge(df_cat_rating, how= 'left', on= 'Category')
Cat_df

Unnamed: 0,Category,Installs,Number_Apps,Install per App,Average Ratings
0,GAME,13436924450,864,15552000.0,4.45
1,COMMUNICATION,11036916201,288,38322630.0,4.28
2,TOOLS,8101044474,751,10787010.0,4.54
3,FAMILY,6200391691,1689,3671043.0,4.4
4,PRODUCTIVITY,5791679314,346,16738960.0,4.43
5,SOCIAL,5487861902,236,23253650.0,4.27
6,PHOTOGRAPHY,4656268915,262,17772020.0,4.43
7,VIDEO_PLAYERS,3931831720,160,24573950.0,4.4
8,TRAVEL_AND_LOCAL,2894704086,207,13984080.0,4.36
9,NEWS_AND_MAGAZINES,2369212260,252,9401636.0,4.3


In [152]:
# change the format from scientific notation to plain numbers
Cat_df['Install per App'] = Cat_df['Install per App'].apply(lambda x: '%.f' %x)

In [153]:
Cat_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33 entries, 0 to 32
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Category         33 non-null     object 
 1   Installs         33 non-null     int64  
 2   Number_Apps      33 non-null     int64  
 3   Install per App  33 non-null     object 
 4   Average Ratings  33 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.5+ KB


In [154]:
# the Install per App values are now strings, so convert them back to float
Cat_df['Install per App'] = Cat_df['Install per App'].astype('float')

In [155]:
# Inspect the average installs per app
Cat_df.sort_values(by= 'Install per App', ascending = False)

Unnamed: 0,Category,Installs,Number_Apps,Install per App,Average Ratings
1,COMMUNICATION,11036916201,288,38322626.0,4.28
7,VIDEO_PLAYERS,3931831720,160,24573948.0,4.4
5,SOCIAL,5487861902,236,23253652.0,4.27
6,PHOTOGRAPHY,4656268915,262,17772019.0,4.43
4,PRODUCTIVITY,5791679314,346,16738958.0,4.43
0,GAME,13436924450,864,15551996.0,4.45
8,TRAVEL_AND_LOCAL,2894704086,207,13984078.0,4.36
15,ENTERTAINMENT,989460000,85,11640706.0,4.28
2,TOOLS,8101044474,751,10787010.0,4.54
9,NEWS_AND_MAGAZINES,2369212260,252,9401636.0,4.3
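
The total installs, app count, and installs per app can also be derived in a single `groupby` step, which keeps the per-app column numeric throughout. This is a sketch on made-up data; the column names mirror those used above.

```python
import pandas as pd

# Toy data with installs already converted to integers
apps = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS', 'TOOLS'],
    'Installs': [1000000, 500000, 10000, 100000, 1000],
})

# Total installs and number of apps per category
summary = (
    apps.groupby('Category')['Installs']
        .agg(Installs='sum', Number_Apps='count')
)

# Average installs per app, sorted from highest to lowest
summary['Install per App'] = summary['Installs'] / summary['Number_Apps']
summary = summary.sort_values('Install per App', ascending=False)

print(summary)
```

If scientific notation in the display is the only concern, a display option such as `pd.options.display.float_format` can change how floats are printed without converting the column to strings and back.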


### Scanning some top Installs per App

In [176]:
# COMMUNICATION
Android_app[Android_app['Category'] == 'COMMUNICATION'][['App','Installs']].sort_values(by = 'Installs', ascending = False)[:10]

Unnamed: 0,App,Installs
336,WhatsApp Messenger,1000000000
391,Skype - free IM & video calls,1000000000
411,Google Chrome: Fast & Secure,1000000000
382,Messenger – Text and Video Chat for Free,1000000000
451,Gmail,1000000000
464,Hangouts,1000000000
403,LINE: Free Calls & Messages,500000000
383,imo free video calls and chat,500000000
4676,Viber Messenger,500000000
420,UC Browser - Fast Download Private & Secure,500000000


The top 10 communication apps are all ubiquitous messaging apps or browsers, so the competitive advantage belongs to the tech giants, which dominate installs on the Google Play Store.
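
To put a number on that dominance, the following back-of-the-envelope sketch uses only figures already shown: the ten install counts listed above (six apps at 1B and four at 500M) and the COMMUNICATION total of 11,036,916,201 installs from the category summary. The variable names are illustrative.

```python
# Install counts of the ten most-installed COMMUNICATION apps listed above
top10_installs = [1_000_000_000] * 6 + [500_000_000] * 4

# Total installs for the COMMUNICATION category (from the category summary)
category_total = 11_036_916_201

# Share of the category's installs held by the top 10 apps
share = sum(top10_installs) / category_total
print(f'{share:.1%}')
```

Under these figures, the top 10 of the 288 communication apps account for about 72% of the category's installs.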

In [165]:
# SOCIAL
Android_app[Android_app['Category'] == 'SOCIAL'][['App','Installs']].sort_values(by = 'Installs', ascending = False)[:10]

Unnamed: 0,App,Installs
2544,Facebook,1000000000
2554,Google+,1000000000
2604,Instagram,1000000000
2610,Snapchat,500000000
2546,Facebook Lite,500000000
3945,Tik Tok - including musical.ly,100000000
2592,Tango - Live Video Broadcast,100000000
6373,VK,100000000
2552,Pinterest,100000000
3951,BIGO LIVE - Live Stream,100000000


The same is true for social apps.

In [166]:
# VIDEO_PLAYERS
Android_app[Android_app['Category'] == 'VIDEO_PLAYERS'][['App','Installs']].sort_values(by = 'Installs', ascending = False)[:10]

Unnamed: 0,App,Installs
3665,YouTube,1000000000
3687,Google Play Movies & TV,1000000000
3711,MX Player,500000000
4696,"VideoShow-Video Editor, Video Maker, Beauty Ca...",100000000
4688,VivaVideo - Video Editor & Photo Movie,100000000
3672,Motorola Gallery,100000000
3675,VLC for Android,100000000
4032,Dubsmash,100000000
10647,Motorola FM Radio,100000000
5612,Ringdroid,50000000


These top categories appear to be a contest between giant developers: the markets are profitable and segmented enough for them to dominate, and competing against them is very difficult for other developers.

Categories such as Game, Family, and Tools dominate the Google Play Store by number of apps, which suggests they are fairly popular with developers as well. However, Family and Tools do not attract many users per app, while the Game category looks somewhat saturated: it is easy to enter, but turning a profit is far from simple.

In my opinion, what we should look for is a market with moderate competition that still attracts a good number of users.

BOOKS_AND_REFERENCE is worth exploring in more depth: its average number of installs per app is about 8.4M (close to the top 10 by average installs), and it has nearly 200 apps on the Google Play Store.

Let's take a look at BOOKS_AND_REFERENCE.

In [177]:
Book_Ref_df = Android_app[Android_app['Category'] == 'BOOKS_AND_REFERENCE'][['App', 'Installs']]
Book_Ref_df.sort_values(by= 'Installs', ascending = False)[:20]

Unnamed: 0,App,Installs
152,Google Play Books,1000000000
5651,Audiobooks from Audible,100000000
4715,Wattpad 📖 Free Books,100000000
3941,Bible,100000000
4083,Amazon Kindle,100000000
6497,NOOK: Read eBooks & Magazines,10000000
5324,Al'Quran Bahasa Indonesia,10000000
179,Moon+ Reader,10000000
8293,Dictionary,10000000
5345,Quran for Android,10000000


There are many kinds of book apps: dictionaries, PDF assistants, note-taking apps, e-book readers, e-libraries, religious texts, and so on. However, only a few apps are extremely popular, such as Google Play Books (1B+), Bible (100M+), Amazon Kindle (100M+), Wattpad (100M+), and Audible (100M+), alongside many apps with more than 10M installs. So this market still has potential for developers.

On the other hand, the apps here are quite similar in function: e-book readers and processors, libraries, e-bookstores, and dictionaries dominate. Standing out with a simple but convenient feature might therefore be a better route to profitability than building another app that duplicates what already exists.