# Introduction
Many of us have thought about building our own app at some point. In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play Store, so that we can gain insight into:

- How competitive different app categories (e.g., Games, Lifestyle, Weather) are

- Which app category offers compelling opportunities based on its popularity

- How many downloads you would give up by making your app paid vs. free

- How much you can reasonably charge for a paid app

- Which paid apps have had the highest revenue

- How many paid apps will recoup their development costs based on their sales revenue

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [None]:
import pandas as pd

# Notebook Presentation

In [None]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [None]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

In [None]:
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up


In [None]:
df_apps.shape

(10841, 12)

In [None]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
4077,¡Ay Metro!,GAME,3.8,489,36.0,10000,Free,0,Everyone 10+,Arcade,"March 17, 2015",4.0 and up
9304,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,6.9,10000000,Free,0,Everyone,Photography,"August 6, 2018",Varies with device
5148,Star Wars ™: DIRTY,GAME,4.5,38207,15.0,100000,Paid,$9.99,Teen,Role Playing,"October 19, 2015",4.1 and up
9847,Standoff 2,GAME,4.5,299046,47.0,10000000,Free,0,Mature 17+,Action,"July 12, 2018",4.1 and up
6006,Paint By Number,FAMILY,3.5,807,4.6,500000,Free,0,Everyone,Casual,"May 1, 2018",3.0 and up


In [None]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

### Drop Unused Columns

We drop the `Last_Updated` and `Android_Ver` columns from the DataFrame, since we will not use them in this analysis.

In [None]:
df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1, inplace=True)

In [None]:
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


### Remove NaN values

In [None]:
df_apps.isna()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,False,False,True,False,False,False,False,False,False,False
1,False,False,True,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False
3,False,False,True,False,False,False,True,False,False,False
4,False,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
10836,False,False,False,False,False,False,False,False,False,False
10837,False,False,False,False,False,False,False,False,False,False
10838,False,False,False,False,False,False,False,False,False,False
10839,False,False,False,False,False,False,False,False,False,False


In [None]:
df_apps.isna().any()

App               False
Category          False
Rating             True
Reviews           False
Size_MBs          False
Installs          False
Type               True
Price             False
Content_Rating    False
Genres            False
dtype: bool

In [None]:
df_apps.columns[df_apps.isna().any()].to_list()

['Rating', 'Type']

To find the rows with NaN values, we can create a subset of the DataFrame based on where `.isna()` evaluates to True. Calling `.dropna()` then removes them.

**Analysis**: NaN values in `Rating` are associated with apps that have no reviews (and no installs). That makes sense.

In [None]:
nan_rows = df_apps[df_apps['Rating'].isna()]
nan_rows.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


In [None]:
df_apps_clean = df_apps.dropna()
df_apps_clean.shape

(9367, 10)
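
The masking and dropping steps above can be sketched on a small toy DataFrame. The values below are hypothetical and are not taken from `apps.csv`; the point is only to show what `.isna()` and `.dropna()` do to rows with missing entries.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from apps.csv)
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, np.nan, 3.9, np.nan],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# Boolean mask: keep only the rows where Rating is missing
missing_rating = toy[toy['Rating'].isna()]

# Number of missing values in each column
nan_counts = toy.isna().sum()

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()

print(missing_rating)
print(nan_counts)
print(toy_clean.shape)
```

Note that `.dropna()` without arguments drops a row if any column is missing. Passing `subset=['Rating']` would restrict the check to that column.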

### Find and Remove Duplicates


In [None]:
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
duplicated_rows.shape

(476, 10)

In [None]:
duplicated_rows.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating


In [None]:
df_apps_clean[df_apps_clean.App == '420 BZ Budeze Delivery']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
945,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical


In [None]:
df_apps_clean[df_apps_clean['App']== 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


**Analysis**: The Instagram rows above show that dropping only exact duplicates would still leave 3 copies, which differ only in their `Reviews` value. To remove redundant rows like these, we can pass a subset of column names as below. **By default, `drop_duplicates()` uses all of the columns to identify duplicates; with `subset`, it considers only the listed columns.**

In [None]:
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])

In [None]:
df_apps_clean.shape

(8199, 10)

In [None]:
df_apps_clean[df_apps_clean['App']== 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
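
As a minimal sketch of the difference between the default behaviour and the `subset` argument, consider the toy rows below. The app names and numbers are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 'Gram' appears three times, once with a different review count
toy = pd.DataFrame({
    'App': ['Gram', 'Gram', 'Gram', 'Note'],
    'Reviews': [100, 105, 100, 7],
    'Type': ['Free', 'Free', 'Free', 'Paid'],
    'Price': ['0', '0', '0', '$1.99'],
})

# Default: a row is a duplicate only if every column matches an earlier row
exact = toy.drop_duplicates()

# With subset: rows count as duplicates when App, Type and Price all match
by_subset = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(len(exact), len(by_subset))
```

With `subset`, the first matching row is kept by default (`keep='first'`), so which review count survives depends on row order.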


# Find Highest Rated Apps

In [None]:
df_apps_clean.sort_values(['Rating'], ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1230,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1227,AJ Men's Grooming,LIFESTYLE,5.0,2,22.0,100,Free,0,Everyone,Lifestyle
1224,FK Dedinje BGD,SPORTS,5.0,36,2.6,100,Free,0,Everyone,Sports
1223,CB VIDEO VISION,PHOTOGRAPHY,5.0,13,2.6,100,Free,0,Everyone,Photography


Only apps with very few reviews (and a low number of installs) have perfect 5-star ratings, most likely given by friends and family. So we cannot rely on ratings alone to determine the quality of an app.
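
One possible way to act on this observation is to require a minimum number of reviews before ranking by rating. The sketch below uses made-up rows, and the `MIN_REVIEWS` threshold is an arbitrary assumption rather than a value derived from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'App': ['Tiny', 'Small', 'Big', 'Huge'],
    'Rating': [5.0, 4.9, 4.6, 4.4],
    'Reviews': [3, 40, 25000, 900000],
})

MIN_REVIEWS = 1000  # assumed cut-off; tune to taste

# Keep only apps with enough reviews for the rating to be meaningful
well_reviewed = toy[toy['Reviews'] >= MIN_REVIEWS]

# Rank the remaining apps by rating
top_rated = well_reviewed.sort_values('Rating', ascending=False)

print(top_rated)
```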

# Finding 5 Largest Apps in terms of Size (MBs)

In [None]:
df_apps_clean.sort_values(['Size_MBs'], ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness


There seems to be an upper bound of 100 MB on the size of an app in the Google Play Store.

# Find the 5 Apps with the Most Reviews

In [None]:
df_apps_clean.sort_values(['Reviews'], ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy


These are the most-reviewed apps in the Google Play Store, and they include the usual suspects: Facebook, WhatsApp, Instagram and so on. All five are free, which is hardly surprising.

# Visualise Categorical Data: Content Ratings

In [None]:
ratings = df_apps_clean['Content_Rating'].value_counts()

In [None]:
ratings

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [None]:
import plotly.express as px
import numpy as np

In [None]:
# `names` supplies the slice labels; the `labels` argument (a dict for renaming
# columns) is not needed here
fig = px.pie(values=ratings.values,
             names=ratings.index,
             title="Content Rating",
             hole=0.6)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')

fig.show()


# Examining the Number of Installs

In [None]:
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB


In [None]:
df_apps_clean.Installs.describe()

count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object

In [None]:
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
1000,698
1000000,1417
1000000000,20
10,69
10000,988
10000000,933
100,303
100000,1096
100000000,189


In [None]:
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457


# Finding the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.


In [None]:
df_apps_clean.Price.describe()

count     8199
unique      73
top          0
freq      7595
Name: Price, dtype: object

In [None]:
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "", regex=False)
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
df_apps_clean.sort_values('Price', ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


Apparently there are 15 "I am Rich" apps in the Google Play Store. They all cost 300 dollars or more, which is the whole point of the app. These listings look dubious, so to keep them from skewing our analysis we exclude every app priced at $250 or more.

### The most expensive apps sub $250

In [None]:
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical
2629,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical
2463,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical


Looking at the top 5 apps now, we see that 4 out of 5 are medical apps.
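
The price clean-up and the cap can be sketched together on a toy frame. The app names and prices are hypothetical, and `PRICE_CAP` simply mirrors the 250 threshold used above.

```python
import pandas as pd

# Toy data (hypothetical apps and prices)
toy = pd.DataFrame({
    'App': ['Free Thing', 'Cheap Thing', 'Rich Thing', 'Pro Manual'],
    'Price': ['0', '$2.99', '$399.99', '$79.99'],
})

# Remove the literal dollar sign, then convert the text to floats
toy['Price'] = pd.to_numeric(toy['Price'].str.replace('$', '', regex=False))

PRICE_CAP = 250  # same threshold as in the notebook

# .copy() gives an independent DataFrame, so later column assignments are safe
capped = toy[toy['Price'] < PRICE_CAP].copy()

print(capped.sort_values('Price', ascending=False))
```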

### Highest Grossing Paid Apps (ballpark estimate)

In [None]:
# Work on an explicit copy so the new column is not assigned to a slice of another DataFrame
df_apps_clean = df_apps_clean.copy()
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs * df_apps_clean.Price
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,2995000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,2990000.0
7633,Five Nights at Freddy's,GAME,4.6,100805,50.0,1000000,Paid,2.99,Teen,Action,2990000.0
6746,Card Wars - Adventure Time,FAMILY,4.3,129603,23.0,1000000,Paid,2.99,Everyone 10+,Card;Action & Adventure,2990000.0


The highest-grossing paid app, Minecraft, comes in at around $70 million by this estimate. Interestingly, Minecraft is listed in the Family category rather than the Game category.
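
The estimate is simply installs multiplied by the current price. As a worked sketch, the three rows below reuse the `Installs` and `Price` values shown in the table above.

```python
import pandas as pd

# Installs and prices copied from the output table above
paid = pd.DataFrame({
    'App': ['Minecraft', 'Hitman Sniper', 'Weather Live'],
    'Installs': [10_000_000, 10_000_000, 500_000],
    'Price': [6.99, 0.99, 5.99],
})

# Ballpark revenue: every install is assumed to have paid the listed price
paid['Revenue_Estimate'] = paid['Installs'] * paid['Price']

print(paid)
```

This is only a rough figure. The `Installs` column takes just 19 distinct values, so it looks like a rounded bucket and not an exact count, and the calculation assumes the price never changed.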

# Analysing App Categories

If we were to release an app, should we go after a competitive category with many other apps? Or should we target a popular category with a high number of downloads? Or perhaps we could target a category that is popular but where the downloads are spread out among many different apps. That way, even if our app is harder to discover among all the others, it still has a better chance of getting installed, right? Let's analyse this with bar charts and scatter plots and figure out which categories are dominating the market.

In [None]:
df_apps_clean.Category.nunique()

33

In [None]:
top10_category = df_apps_clean.Category.value_counts()[:10]
top10_category

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64

### Highest Competition (Number of Apps)

In [None]:
bar = px.bar(x = top10_category.index, # index = category name
             y = top10_category.values,
             title='Competition (Number of Apps per Category)')

bar.update_layout(xaxis_title='Category', yaxis_title='Number of Apps')
bar.show()

### Most Popular Categories (Highest Downloads)

In [None]:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

In [None]:
h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h',
               title='Category Popularity (Number of Downloads)')

h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

We see that Game, Communication and Tools are actually the most popular categories by downloads, even though Family has the most apps. If we plot the popularity of a category next to the number of apps in that category, we can get an idea of how concentrated a category is. Do few apps have most of the downloads, or are the downloads spread out over many apps?

### Category Concentration - Downloads vs. Competition

In [None]:
cat_number = df_apps_clean.groupby('Category').agg({'App': pd.Series.count})

In [None]:
cat_number

Unnamed: 0_level_0,App
Category,Unnamed: 1_level_1
ART_AND_DESIGN,61
AUTO_AND_VEHICLES,73
BEAUTY,42
BOOKS_AND_REFERENCE,169
BUSINESS,262
COMICS,54
COMMUNICATION,257
DATING,134
EDUCATION,118
ENTERTAINMENT,102


In [None]:
category_installs

Unnamed: 0_level_0,Installs
Category,Unnamed: 1_level_1
EVENTS,15949410
BEAUTY,26916200
PARENTING,31116110
MEDICAL,39162676
COMICS,44931100
LIBRARIES_AND_DEMO,52083000
AUTO_AND_VEHICLES,53129800
HOUSE_AND_HOME,97082000
ART_AND_DESIGN,114233100
DATING,140912410


In [None]:
cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)

The dimensions of the DataFrame are: (33, 2)


Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1606,4437554490
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650


In [None]:
scatter = px.scatter(cat_merged_df, # data
                    x='App', # column name
                    y='Installs',
                    title='Category Concentration (Downloads vs Number of Apps)',
                    size='App',
                    hover_name=cat_merged_df.index,
                    color='Installs')

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))

scatter.show()

We see that categories like Family, Tools, and Game have many different apps sharing a high number of downloads. But for categories like Video Players and Entertainment, the downloads are concentrated in far fewer apps.
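
One rough way to put a number on this is the average number of installs per app in each category. The sketch below reuses four rows from the merged table above. The `Installs_per_App` column is an assumed helper metric, not part of the original analysis.

```python
import pandas as pd

# App counts and total installs copied from the merged table above
cat = pd.DataFrame(
    {'App': [910, 257, 1606, 148],
     'Installs': [13858762717, 11039241530, 4437554490, 3916897200]},
    index=['GAME', 'COMMUNICATION', 'FAMILY', 'VIDEO_PLAYERS'],
)

# Average installs per app: a crude proxy for how concentrated a category is
cat['Installs_per_App'] = cat['Installs'] / cat['App']

ranked = cat.sort_values('Installs_per_App', ascending=False)
print(ranked)
```

A mean hides how downloads are distributed within a category, so this complements the scatter plot and does not replace it.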

# Drill down to genres in each category

Let’s turn our attention to the Genres column. This is quite similar to the categories column but more granular.


In [None]:
df_apps_clean.Genres.nunique()

114

In [None]:
df_apps_clean.Genres.value_counts().sort_values(ascending=True)

Lifestyle;Pretend Play        1
Strategy;Education            1
Adventure;Education           1
Role Playing;Brain Games      1
Tools;Education               1
                           ... 
Personalization             298
Productivity                301
Education                   429
Entertainment               467
Tools                       718
Name: Genres, Length: 114, dtype: int64

In [None]:
df_apps_clean.Genres.info()

<class 'pandas.core.series.Series'>
Int64Index: 8184 entries, 21 to 10835
Series name: Genres
Non-Null Count  Dtype 
--------------  ----- 
8184 non-null   object
dtypes: object(1)
memory usage: 127.9+ KB


In [None]:
df_apps_clean['Genres'].str.split(';', expand=True)

Unnamed: 0,0,1
21,Medical,
28,Arcade,
47,Arcade,
82,Arcade,
99,Medical,
...,...,...
10824,Productivity,
10828,Video Players & Editors,
10829,Video Players & Editors,
10831,News & Magazines,


In [None]:
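# Note: the [50:75] slice keeps only 25 rows for display, so the value counts
# and the bar chart below describe this 25-row sample, not the full column.
stack = df_apps_clean['Genres'].str.split(';', expand=True).stack()[50:75]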
stack

357  0           Education
373  0              Arcade
378  0         Photography
382  0            Shopping
392  0              Puzzle
400  0     Personalization
402  0        Productivity
411  0              Sports
421  0         Photography
425  0             Medical
436  0              Puzzle
439  0    Health & Fitness
453  0           Education
     1           Education
462  0    Health & Fitness
467  0            Business
469  0           Education
479  0               Tools
482  0            Business
485  0             Medical
486  0               Tools
496  0     Personalization
504  0           Education
509  0           Education
521  0              Social
dtype: object

In [None]:
num_genres = stack.value_counts()
num_genres

Education           6
Photography         2
Puzzle              2
Personalization     2
Medical             2
Health & Fitness    2
Business            2
Tools               2
Arcade              1
Shopping            1
Productivity        1
Sports              1
Social              1
dtype: int64
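
The counts above cover only the 25 sliced rows. To count every genre across a whole column, one option is to split on the semicolon and use `.explode()`, which is an alternative to the `expand=True` plus `.stack()` approach. A minimal sketch on made-up genre strings:

```python
import pandas as pd

# Toy genre strings (hypothetical), some holding two genres separated by ';'
genres = pd.Series([
    'Arcade',
    'Education',
    'Education;Pretend Play',
    'Arcade;Action & Adventure',
    'Tools',
])

# Split into lists, give each genre its own row, then count occurrences
genre_counts = genres.str.split(';').explode().value_counts()

print(genre_counts)
```

Applied to `df_apps_clean['Genres']` without a slice, the same chain would count nested genres for all apps.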

# Colour Scales in Plotly Charts - Competition in Genres


In [None]:
bar = px.bar(
        x = num_genres.index[:15], # index = genre name
        y = num_genres.values[:15], # count
        title='Top Genres (Sub-category)',
        hover_name=num_genres.index[:15],
        color=num_genres.values[:15],
        color_continuous_scale='Agsunset'
)

bar.update_layout(xaxis_title='Genre',
                  yaxis_title='Number of Apps',
                  coloraxis_showscale=False)

bar.show()



# Grouped Bar Charts: Free vs. Paid Apps per Category
Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.

In [None]:
df_free_vs_paid = df_apps_clean.groupby(['Category', "Type"]).agg({'App':pd.Series.count})
df_free_vs_paid

Unnamed: 0_level_0,Unnamed: 1_level_0,App
Category,Type,Unnamed: 2_level_1
ART_AND_DESIGN,Free,58
ART_AND_DESIGN,Paid,3
AUTO_AND_VEHICLES,Free,72
AUTO_AND_VEHICLES,Paid,1
BEAUTY,Free,42
...,...,...
TRAVEL_AND_LOCAL,Paid,8
VIDEO_PLAYERS,Free,144
VIDEO_PLAYERS,Paid,4
WEATHER,Free,65


In [None]:
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.head()

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42


In [None]:
g_bar = px.bar(df_free_vs_paid,
               x='Category',
               y='App',
               title='Free vs Paid Apps by Category',
               color='Type',
               barmode='group')

g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))

g_bar.show()

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid app.
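
To compare categories by their share of paid apps and not by raw counts, one option is a normalised cross-tabulation. The rows below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Toy data (hypothetical): one row per app
toy = pd.DataFrame({
    'Category': ['MEDICAL'] * 4 + ['TOOLS'] * 4,
    'Type': ['Paid', 'Paid', 'Free', 'Free', 'Paid', 'Free', 'Free', 'Free'],
})

# normalize='index' turns each row of counts into fractions that sum to 1
share = pd.crosstab(toy['Category'], toy['Type'], normalize='index')

# Fraction of paid apps per category, highest first
paid_share = share['Paid'].sort_values(ascending=False)

print(paid_share)
```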

# Further Analysis:
But this leads to many more questions:

How much should you charge? What are other apps charging in that category?

How much revenue could you make?

And how many downloads are you potentially giving up because your app is paid?


### Approach:
To answer these, let's start with a box plot that shows the number of installs for free versus paid apps. Then we can check the following points:

- How does the median number of installations compare?

- Is the difference large or small?

### Lost Downloads for Paid Apps

[Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html)



In [None]:
box = px.box(df_apps_clean,
       y='Installs',
       x='Type',
       color='Type',
       notched=True,
       points='all',
      title='How Many Downloads are Paid Apps Giving Up?')

box.update_layout(yaxis=dict(type='log'))
box.show()

In the chart, we see that the median number of downloads for free apps is 500,000, while the median number of downloads for paid apps is around 5,000!
But does this mean we should give up on selling a paid app? Let’s see how much revenue we would estimate per category.
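
The medians read off the chart can also be computed directly with a `groupby`. The toy install counts below are hypothetical and chosen so that the medians match the figures quoted above.

```python
import pandas as pd

# Toy data (hypothetical install counts)
toy = pd.DataFrame({
    'Type': ['Free'] * 5 + ['Paid'] * 5,
    'Installs': [1000, 100000, 500000, 1000000, 10000000,
                 10, 1000, 5000, 10000, 100000],
})

# Median installs for each app type
medians = toy.groupby('Type')['Installs'].median()

# How many times more installs the median free app gets
ratio = medians['Free'] / medians['Paid']

print(medians)
print(ratio)
```

The median is used here because install counts span several orders of magnitude, so a few very large apps would dominate the mean.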

### Revenue by App Category
How much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?


In [None]:
df_paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']

box = px.box(df_paid_apps,
             x='Category',
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))

box.add_hline(y=30000, line_width=1, line_color="purple")

box.show()

If an Android app costs `$30,000` to develop (the purple line), then the typical app in very few categories would cover that development cost. The median paid photography app earned about `$20,000`. Many more apps' revenues were even lower, meaning they would need other sources of revenue like advertising or in-app purchases to make up for their development costs. However, certain app categories seem to contain a large number of outliers that have much higher (estimated) revenue, for example Medical, Personalization, Tools, Game, and Family.
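
To count how many paid apps would recoup their costs, one option is to compare each revenue estimate with the assumed development cost. The revenue figures below are made up, and `DEV_COST` is the same ballpark assumption as the line in the chart.

```python
import pandas as pd

# Toy data (hypothetical revenue estimates)
toy = pd.DataFrame({
    'Category': ['PHOTOGRAPHY'] * 3 + ['MEDICAL'] * 3,
    'Revenue_Estimate': [5000.0, 20000.0, 60000.0, 1000.0, 40000.0, 80000.0],
})

DEV_COST = 30_000  # assumed development cost

# True where the estimated revenue covers the development cost
toy['Recoups_Cost'] = toy['Revenue_Estimate'] >= DEV_COST

# The mean of a boolean column is the share of True values
share_recouped = toy.groupby('Category')['Recoups_Cost'].mean()
overall_share = toy['Recoups_Cost'].mean()

print(share_recouped)
print(overall_share)
```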

# How Much Can we Charge? Paid App Pricing Strategies by Category.
So, if we were to list a paid app, how should we price it? To help us decide, we can look at how our competitors in the same category price their apps.

In [None]:
df_paid_apps.Price.median()

2.99

In [None]:
box = px.box(df_paid_apps,
             x='Category',
             y="Price",
             title='How Much Can we Charge?')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))

box.show()

Some categories have higher median prices than others. Medical has the most expensive apps as well as a median price of `$5.49`. In contrast, Personalization apps are quite cheap at `$1.49`. Other categories with higher median prices are Business (`$4.99`) and Dating (`$6.99`). It seems that customers who shop in these categories are not so concerned about paying a bit extra for their apps.
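
The per-category figures can be computed without reading them off the chart by aggregating price per category. The prices below are hypothetical and only chosen to resemble the pattern described above.

```python
import pandas as pd

# Toy data (hypothetical paid-app prices)
toy = pd.DataFrame({
    'Category': ['MEDICAL'] * 3 + ['PERSONALIZATION'] * 3,
    'Price': [2.99, 5.49, 79.99, 0.99, 1.49, 1.99],
})

# Median and maximum price per category, most expensive median first
price_stats = (toy.groupby('Category')['Price']
                  .agg(['median', 'max'])
                  .sort_values('median', ascending=False))

print(price_stats)
```

Looking at the median alongside the maximum separates the typical price in a category from its most expensive outlier.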