# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [46]:
import pandas as pd
import plotly.express as px

# Notebook Presentation

In [47]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format
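
To see what this display setting does, here is a minimal sketch that applies the same format string directly. The first sample value is the mean review count from the `describe()` output further down; the variable names are only for illustration.

```python
# '{:,.2f}'.format is a bound method; pandas calls it on each float it displays.
fmt = '{:,.2f}'.format

formatted_mean = fmt(444111.93)   # adds a thousands separator, keeps two decimals
formatted_small = fmt(2.154)      # rounds to two decimals

print(formatted_mean)
print(formatted_small)
```

The setting only changes how floats are printed; the underlying values in the DataFrame are untouched.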

# Read the Dataset

In [48]:
df_apps = pd.read_csv('apps.csv')

### Data Exploration

In [49]:
df_apps.describe()

Unnamed: 0,Rating,Reviews,Size_MBs
count,9367.0,10841.0,10841.0
mean,4.19,444111.93,19.77
std,0.52,2927628.66,21.4
min,1.0,0.0,0.01
25%,4.0,38.0,4.9
50%,4.3,2094.0,11.0
75%,4.5,54768.0,27.0
max,5.0,78158306.0,100.0


In [50]:
df_apps.shape

(10841, 12)

In [51]:
df_apps.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
1594,ACCDB MDB DB Manager Pro - Editor for MS Access,PRODUCTIVITY,3.1,19,4.5,500,Paid,$8.99,Everyone,Productivity,"November 3, 2016",4.0 and up
6933,Apartments & Rentals - Zillow,HOUSE_AND_HOME,4.2,27386,8.85,1000000,Free,0,Everyone,House & Home,"June 21, 2018",Varies with device
4571,E-Book Read - Read Book for free,BOOKS_AND_REFERENCE,4.5,1857,4.9,50000,Free,0,Everyone,Books & Reference,"August 3, 2018",4.4 and up
8941,Weaphones™ Gun Sim Free Vol 1,FAMILY,4.3,598975,39.0,10000000,Free,0,Everyone,Simulation,"January 17, 2018",4.0 and up
7908,AMC Theatres,ENTERTAINMENT,4.3,44550,72.0,1000000,Free,0,Everyone,Entertainment,"July 18, 2018",5.0 and up
7495,Dr. Chess,GAME,4.1,97209,6.7,1000000,Free,0,Everyone,Board,"January 14, 2018",4.0.3 and up
2755,Wide address pocket prep,MEDICAL,4.5,513,12.0,5000,Free,0,Everyone,Medical,"July 11, 2018",4.4 and up
10190,Vigo Video,VIDEO_PLAYERS,4.3,1615596,4.65,50000000,Free,0,Teen,Video Players & Editors,"August 3, 2018",4.0.3 and up
6052,Bitcoin Ticker Widget,FINANCE,4.6,25744,9.15,500000,Free,0,Everyone,Finance,"July 30, 2018",Varies with device
1802,Dialer theme G Black Gold,PERSONALIZATION,4.2,33,0.84,500,Paid,$1.49,Everyone,Personalization,"January 1, 2016",4.0 and up


### Drop Unused Columns

In [52]:
df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1, inplace=True)

In [53]:
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


### Find and Remove NaN values in Ratings

In [54]:
nan_rows = df_apps[df_apps.Rating.isna()]
print(nan_rows.shape)
nan_rows.head()

(1474, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


In [55]:
df_apps_cleaned = df_apps.dropna()
df_apps_cleaned.shape

(9367, 10)
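
Note that `dropna()` with no arguments removes a row if any column is missing, not just `Rating`. Here the row counts agree (10841 - 1474 = 9367), so only rows lacking a rating were removed. The sketch below, on a hypothetical toy frame, shows how the two variants can differ on other data.

```python
import pandas as pd

# Toy frame (hypothetical values): B lacks a rating, C lacks a type.
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.5, None, 3.9],
    'Type': ['Free', 'Paid', None],
})

dropped_any = toy.dropna()                      # drops B and C
dropped_rating = toy.dropna(subset=['Rating'])  # drops only B

print(len(dropped_any), len(dropped_rating))
```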

### Find and Remove Duplicates

In [56]:
df_apps_cleaned[df_apps_cleaned.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [57]:
duplicates = df_apps_cleaned[df_apps_cleaned.duplicated()]
print(duplicates.shape)
duplicates.head()

(476, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating


In [58]:
df_apps_cleaned = df_apps_cleaned.drop_duplicates() # Not strict enough...
print(df_apps_cleaned.shape)
df_apps_cleaned[df_apps_cleaned.App == 'Instagram']

(8891, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [59]:
df_apps_cleaned = df_apps_cleaned.drop_duplicates(subset=['App', 'Type', 'Price'])
print(df_apps_cleaned.shape)
df_apps_cleaned[df_apps_cleaned.App == 'Instagram'] # Successfully drops duplicates.

(8199, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
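
The difference between the two `drop_duplicates` calls can be reproduced on the four Instagram rows shown earlier, reduced here to a few columns. This is a small sketch; the variable names are illustrative.

```python
import pandas as pd

# The four Instagram rows from the output above; only Reviews differs between them.
rows = pd.DataFrame({
    'App': ['Instagram'] * 4,
    'Reviews': [66577313, 66577446, 66577313, 66509917],
    'Type': ['Free'] * 4,
    'Price': ['0'] * 4,
})

exact = rows.drop_duplicates()                                  # whole row must match
by_key = rows.drop_duplicates(subset=['App', 'Type', 'Price'])  # only key columns must match

print(len(exact), len(by_key))
```

By default the first occurrence is kept, which is why the row with 66,577,313 reviews survives.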


### Find Highest Rated Apps

Sorting by rating alone is not very useful, because the top-rated apps have very few reviews.

Especially amusing is KBA-EZ Health Guide, which has 4 reviews but only 1 install.

In [60]:
df_apps_cleaned.sort_values('Rating', ascending=False).head(100)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical
1230,Sway Medical,MEDICAL,5.00,3,22.00,100,Free,0,Everyone,Medical
1227,AJ Men's Grooming,LIFESTYLE,5.00,2,22.00,100,Free,0,Everyone,Lifestyle
1224,FK Dedinje BGD,SPORTS,5.00,36,2.60,100,Free,0,Everyone,Sports
1223,CB VIDEO VISION,PHOTOGRAPHY,5.00,13,2.60,100,Free,0,Everyone,Photography
...,...,...,...,...,...,...,...,...,...,...
1963,CT Cervical Spine,MEDICAL,5.00,5,17.00,1000,Free,0,Everyone,Medical
1964,CARDIAC CT TECHNIQUE,MEDICAL,5.00,6,17.00,1000,Free,0,Everyone,Medical
1968,CL REPL,TOOLS,5.00,47,17.00,1000,Free,0,Everyone,Tools
1523,EC Fairgrounds,FAMILY,5.00,3,5.50,500,Free,0,Everyone,Entertainment


### Lowest Rated Apps

Most of these have a single one-star review, so the ratings say little about the apps.

In [61]:
df_apps_cleaned.sort_values('Rating').head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
1271,MbH BM,MEDICAL,1.0,1,2.3,100,Free,0,Everyone,Medical
1166,CJ DVD Rentals,COMMUNICATION,1.0,5,13.0,100,Free,0,Everyone,Communication
240,House party - live chat,DATING,1.0,1,9.2,10,Free,0,Mature 17+,Dating
357,Speech Therapy: F,FAMILY,1.0,1,16.0,10,Paid,$2.99,Everyone,Education
576,Clarksburg AH,MEDICAL,1.0,1,28.0,50,Free,0,Everyone,Medical


### Find 5 Largest Apps in terms of Size (MBs)

In [62]:
df_apps_cleaned.sort_values('Size_MBs', ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness


### Find the Apps with Most Reviews

Note: No paid app in the top 100 reviewed apps.

In [63]:
df_apps_cleaned.sort_values('Reviews', ascending=False).head(100)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.10,78158306,5.30,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.40,69119316,3.50,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.50,66577313,5.30,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.00,56642847,3.50,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.60,44891723,98.00,100000000,Free,0,Everyone 10+,Strategy
...,...,...,...,...,...,...,...,...,...,...
10051,Castle Clash: Heroes of the Empire US,GAME,4.60,4578476,24.00,50000000,Free,0,Everyone 10+,Strategy
10385,Period Tracker - Period Calendar Ovulation Tra...,HEALTH_AND_FITNESS,4.80,4559407,8.80,100000000,Free,0,Everyone,Health & Fitness
10257,Pixel Gun 3D: Survival shooter & Battle Royale,GAME,4.50,4487182,55.00,50000000,Free,0,Teen,Action
10111,Township,FAMILY,4.60,4451317,19.00,50000000,Free,0,Everyone,Casual


### Find the top Installed Apps

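Only one paid app appears in the top 100, with over 500,000 installs at a price of $5.99... not bad. Note that `Installs` is still stored as text at this point (it is converted to a number later in the notebook), so this sort is lexicographic rather than numeric: the list starts at 500,000,000 and leaves out apps with 1,000,000,000 installs such as Facebook.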

In [64]:
df_apps_cleaned.sort_values('Installs', ascending=False).head(100)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10731,My Talking Tom,GAME,4.50,14891223,36.00,500000000,Free,0,Everyone,Casual
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.70,24900999,3.40,500000000,Free,0,Everyone,Tools
10711,SHAREit - Transfer & Share,TOOLS,4.60,7790693,17.00,500000000,Free,0,Everyone,Tools
10713,imo free video calls and chat,COMMUNICATION,4.30,4785892,11.00,500000000,Free,0,Everyone,Communication
10717,Pou,GAME,4.30,10485308,24.00,500000000,Free,0,Everyone,Casual
...,...,...,...,...,...,...,...,...,...,...
6416,Xtreme Racing 2018 - Jeep & 4x4 off road simul...,FAMILY,4.20,2178,99.00,500000,Free,0,Everyone,Racing;Action & Adventure
6415,What is my IP address,TOOLS,4.50,4228,0.64,500000,Free,0,Everyone,Tools
6414,Like A Boss,FAMILY,4.40,17069,94.00,500000,Free,0,Everyone 10+,Role Playing
6412,R. Physics Puzzle Game,FAMILY,4.20,5369,60.00,500000,Free,0,Everyone,Puzzle
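
The ordering above starts at 500,000,000 installs even though apps such as Facebook have 1,000,000,000. A likely explanation is that `Installs` is still an `object` (text) column at this point. The sketch below uses hypothetical raw values and assumes the CSV stores installs with thousands separators, as the later clean-up step suggests.

```python
import pandas as pd

# Hypothetical raw values in the assumed text format.
installs = pd.Series(['1,000,000,000', '500,000', '500,000,000'])

# Sorting text compares character by character, so '5...' beats '1...'.
as_text = installs.sort_values(ascending=False).tolist()

# Stripping the commas and converting gives the intended numeric order.
numeric = pd.to_numeric(installs.str.replace(',', '', regex=False))
as_numbers = numeric.sort_values(ascending=False).tolist()

print(as_text)
print(as_numbers)
```

Running the sort after the numeric conversion would therefore give a different top 100.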


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [65]:
ratings = df_apps_cleaned.Content_Rating.value_counts()
ratings

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [71]:
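fig = px.pie(names=ratings.index,   # slice labels (px.pie's `labels` expects a dict of display names, so it is dropped)
    values=ratings.values,
    title="Content Rating",
    hole=0.6,                       # a non-zero hole turns the pie into a donut
    )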
fig.update_traces(textposition='outside',
                  textfont_size=14,
                  textinfo='percent+label')

fig.show()


Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.



### Numeric Type Conversion: Examine the Number of Installs


In [73]:
df_apps_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB


In [74]:
df_apps_cleaned.Installs = df_apps_cleaned.Installs.astype(str).str.replace(',', "")
df_apps_cleaned.Installs = pd.to_numeric(df_apps_cleaned.Installs)
df_apps_cleaned[['App', 'Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457


### Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate
Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [75]:
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
df_apps_cleaned.Price = df_apps_cleaned.Price.astype(str).str.replace('$', "", regex=False)
df_apps_cleaned.Price = pd.to_numeric(df_apps_cleaned.Price)
df_apps_cleaned[['App', 'Price']].groupby('Price').count()


Unnamed: 0_level_0,App
Price,Unnamed: 1_level_1
0.00,7595
0.99,104
1.00,2
1.20,1
1.29,1
...,...
299.99,1
379.99,1
389.99,1
399.99,11
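
The price clean-up can be checked in isolation. This is a minimal sketch on sample values in the raw format seen earlier (`0`, `$0.99`, `$399.99`). Passing `regex=False` makes the `$` a literal character; as a regular expression, `$` would mean "end of string".

```python
import pandas as pd

# Sample values in the raw format: free apps are '0', paid apps carry a '$' prefix.
prices = pd.Series(['0', '$0.99', '$399.99'])

# Remove the literal dollar sign, then convert the text to floats.
cleaned = pd.to_numeric(prices.str.replace('$', '', regex=False))

print(cleaned.tolist())
```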


What is wrong with people?! I would be very disappointed if any of these were actual purchases rather than promos or prices that were adjusted later...

In [78]:
df_apps_cleaned[df_apps_cleaned['Price'] > 100]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
1331,most expensive app (H),FAMILY,4.3,6,1.5,100,Paid,399.99,Everyone,Entertainment
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
2193,I am extremely Rich,LIFESTYLE,2.9,41,2.9,1000,Paid,379.99,Everyone,Lifestyle
2394,I am Rich!,FINANCE,3.8,93,22.0,1000,Paid,399.99,Everyone,Finance
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle


After filtering, the expensive medical apps make sense, but one more of those novelty apps ("I am Rich Person" at $37.99) is still in the list. It is not worth extensive manipulation to find every example, and the prices of the remaining novelty apps shouldn't make much of a difference, since they are not an order of magnitude above anything else.

In [81]:
df_apps_cleaned = df_apps_cleaned[df_apps_cleaned['Price'] < 100]
df_apps_cleaned.sort_values('Price', ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical
2629,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical
2463,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical


### Highest Grossing Paid Apps (ballpark estimate)
This is based on price at time of data collection, and installs. Price adjustments / history are NOT taken into account.

Interestingly, 7 of the 10 are games, even though some of them (Minecraft, Bloons TD 5, Card Wars) are categorized as FAMILY rather than GAME.

Minecraft also is in a league of its own when it comes to revenue.

In [82]:
df_apps_cleaned['Revenue_Estimate'] = df_apps_cleaned.Installs.mul(df_apps_cleaned.Price)
df_apps_cleaned.sort_values('Revenue_Estimate', ascending=False)[:10]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,2995000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,2990000.0
7633,Five Nights at Freddy's,GAME,4.6,100805,50.0,1000000,Paid,2.99,Teen,Action,2990000.0
6746,Card Wars - Adventure Time,FAMILY,4.3,129603,23.0,1000000,Paid,2.99,Everyone 10+,Card;Action & Adventure,2990000.0
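
The estimate is simply price times installs. As a small sketch, here are three rows from the table above recomputed on their own (the frame `top` and the selection of rows are only for illustration):

```python
import pandas as pd

# Three rows copied from the table above.
top = pd.DataFrame({
    'App': ['Hitman Sniper', 'Minecraft', 'Facetune - For Free'],
    'Category': ['GAME', 'FAMILY', 'PHOTOGRAPHY'],
    'Installs': [10_000_000, 10_000_000, 1_000_000],
    'Price': [0.99, 6.99, 5.99],
})

# Ballpark revenue: current price multiplied by the install count.
top['Revenue_Estimate'] = top['Installs'] * top['Price']
ranked = top.sort_values('Revenue_Estimate', ascending=False)

print(ranked[['App', 'Revenue_Estimate']])
```

Because the price is a snapshot, any app that was cheaper or free in the past will be overestimated by this calculation.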


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [83]:
df_apps_cleaned.Category.nunique()

33

In [85]:
top_10_categories = df_apps_cleaned.Category.value_counts()[:10]
top_10_categories

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [87]:
bar = px.bar(x = top_10_categories.index, y = top_10_categories.values)

bar.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [88]:
category_installs = df_apps_cleaned.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

In [91]:
h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h',
               title='Category Popularity')

h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')

h_bar.show()

### Category Concentration - Downloads vs. Competition


In [92]:
cat_number = df_apps_cleaned.groupby('Category').agg({'App': pd.Series.count})

cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how="inner")

print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')

cat_merged_df.sort_values('Installs', ascending=False)

The dimensions of the DataFrame are: (33, 2)


Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1606,4437554490
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650
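
The same two-column summary can also be built in one step instead of merging two grouped frames. The sketch below uses pandas named aggregation on a hypothetical toy frame; it is an alternative, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['GAME', 'GAME', 'TOOLS', 'SOCIAL'],
    'Installs': [1_000, 5_000, 100, 10],
})

# One groupby producing both the app count and the install total per category.
summary = apps.groupby('Category').agg(
    App=('App', 'count'),
    Installs=('Installs', 'sum'),
)

print(summary.sort_values('Installs', ascending=False))
```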


In [93]:
scatter = px.scatter(cat_merged_df,
                        x='App',
                        y='Installs',
                        title='Category Concentration',
                        size='App',
                        hover_name=cat_merged_df.index,
                        color='Installs')

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                          yaxis_title="Installs",
                          yaxis=dict(type='log'))

scatter.show()

Splitting the chart into quadrants at roughly (600 apps, 5M installs):

Quadrant I contains a handful of extremely popular categories with both a high number of apps and a high number of installs. Competition here is very fierce.

Quadrant II, on the other hand, contains many categories with relatively few apps but a high number of downloads. There is a lot of opportunity here.

Quadrant III contains categories with few apps and few downloads. Succeeding here is much more difficult, but there is opportunity for niche content.
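
The quadrant split can be written down as a small helper. This is a sketch only: the function name, the parameter names, and the choice to treat values exactly on a threshold as "high" are assumptions, while the thresholds are the ones quoted above.

```python
def quadrant(n_apps, installs, app_cut=600, install_cut=5_000_000):
    """Return the quadrant label (I to IV) for one category."""
    many_apps = n_apps >= app_cut
    many_installs = installs >= install_cut
    if many_apps and many_installs:
        return 'I'      # crowded and popular
    if many_installs:
        return 'II'     # few apps, many installs
    if not many_apps:
        return 'III'    # few apps, few installs
    return 'IV'         # many apps, few installs

# Two categories from the merged table above.
game = quadrant(910, 13_858_762_717)
communication = quadrant(257, 11_039_241_530)

print(game, communication)
```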

### Examining Genres and Extracting Nested Column Data


In [96]:
df_apps_cleaned.Genres.value_counts().sort_values(ascending=True) # Not right. Multiple genres combined with ;

Lifestyle;Pretend Play        1
Strategy;Education            1
Adventure;Education           1
Role Playing;Brain Games      1
Tools;Education               1
                           ... 
Personalization             298
Productivity                301
Education                   429
Entertainment               467
Tools                       718
Name: Genres, Length: 114, dtype: int64

In [99]:
stack = df_apps_cleaned.Genres.str.split(';', expand=True).stack()
num_genres = stack.value_counts()
num_genres.sort_values(ascending=True) # All individual genres

Music & Audio                1
Music                       21
Word                        22
Trivia                      28
Creativity                  31
Music & Video               31
Casino                      37
Beauty                      42
Events                      45
Card                        46
Parenting                   50
Comics                      54
Board                       57
Art & Design                62
House & Home                62
Libraries & Demo            64
Brain Games                 65
Pretend Play                68
Weather                     72
Auto & Vehicles             73
Adventure                   78
Educational                 93
Food & Drink                94
Strategy                    95
Action & Adventure          96
Racing                     103
Role Playing               111
Maps & Navigation          118
Puzzle                     124
Dating                     134
Video Players & Editors    150
Books & Reference          171
Shopping
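
To make the split-and-stack step concrete, here is a minimal sketch on three `Genres` values taken from the revenue table earlier in the notebook. Each semicolon-separated entry is split into columns, stacked into one long Series, and then counted.

```python
import pandas as pd

# Three Genres values from the notebook; two of them hold nested genres.
genres = pd.Series(['Arcade;Action & Adventure', 'Action', 'Card;Action & Adventure'])

# Split into one column per genre, stack into a single Series, count occurrences.
counts = genres.str.split(';', expand=True).stack().value_counts()

print(counts)
```

`value_counts()` ignores missing values by default, so the empty second slot of single-genre rows does not affect the totals.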

### Colour Scales in Plotly Charts - Competition in Genres

In [100]:
bar = px.bar(x = num_genres.index[:15],
                 y = num_genres.values[:15],
                 title='Top Genres',
                 hover_name=num_genres.index[:15],
                 color=num_genres.values[:15],
                 color_continuous_scale='Agsunset')

bar.update_layout(xaxis_title='Genre',
                  yaxis_title='Number of Apps',
                  coloraxis_showscale=False)

bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [101]:
df_apps_cleaned.Type.value_counts()

Free    7595
Paid     589
Name: Type, dtype: int64

In [105]:
df_free_vs_paid = df_apps_cleaned.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42
...,...,...,...
56,TRAVEL_AND_LOCAL,Paid,8
57,VIDEO_PLAYERS,Free,144
58,VIDEO_PLAYERS,Paid,4
59,WEATHER,Free,65


In [106]:
g_bar = px.bar(df_free_vs_paid,
                   x='Category',
                   y='App',
                   title='Free vs Paid Apps by Category',
                   color='Type',
                   barmode='group')

g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))

g_bar.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart.

<img src=https://imgur.com/uVsECT3.png>

In [107]:
box = px.box(df_apps_cleaned,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')

box.update_layout(yaxis=dict(type='log'))

box.show()

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below:

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [109]:
df_paid_apps = df_apps_cleaned[df_apps_cleaned['Type'] == 'Paid']

box = px.box(df_paid_apps,
             x='Category',
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))

box.show()

### How Much Can You Charge? Examine Paid App Pricing Strategies by Category

In [111]:
box = px.box(df_paid_apps,
              x='Category',
              y="Price",
              title='Price per Category')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))

box.show()

### Conclusion

Paid apps give up a significant number of installs (roughly 1-2 orders of magnitude). Even for Games (the highest earning genre), the highest Revenue Estimate in the GAME category of the top-10 table is about $9.9M (Hitman Sniper), with Minecraft at $69.9M as the clear outlier.

If the free route is taken, a clear monetization strategy is required.
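
The gap in installs between free and paid apps can be expressed in orders of magnitude by comparing medians per `Type`. The sketch below uses hypothetical toy values chosen purely for illustration; they are not results from the dataset.

```python
import math
import pandas as pd

# Hypothetical install counts, three free apps and three paid apps.
apps = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [10_000, 100_000, 1_000_000, 100, 1_000, 10_000],
})

# Median installs per type, then the gap on a log10 scale.
medians = apps.groupby('Type')['Installs'].median()
gap = math.log10(medians['Free'] / medians['Paid'])

print(medians)
print(gap)
```

Applied to `df_apps_cleaned`, the same two lines would give the number behind the "orders of magnitude" statement.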