<a href="https://colab.research.google.com/github/KenkiCZ/google_notebooks/blob/main/C102_Apps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. Original files listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [262]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import matplotlib.pyplot as plt

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [263]:
df_apps = pd.read_csv("/content/drive/MyDrive/Programming/C102_Apps/apps.csv")

# Notebook Presentation

In [264]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [265]:
print(f"There are {df_apps.shape[0]} rows and {df_apps.shape[1]} columns")
print(f"The column names are as follows: {list(df_apps.columns)}")
print("\n")
df_apps.sample(5)

There are 10841 rows and 12 columns
The column names are as follows: ['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type', 'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver']




Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
6712,Dance On Mobile,FAMILY,4.1,23609,31.0,1000000,Free,0,Teen,Casual,"April 13, 2018",4.0 and up
10748,Google Calendar,PRODUCTIVITY,4.2,858208,4.0,500000000,Free,0,Everyone,Productivity,"August 6, 2018",Varies with device
7692,IHG®: Hotel Deals & Rewards,TRAVEL_AND_LOCAL,4.3,18622,44.0,1000000,Free,0,Everyone,Travel & Local,"August 3, 2018",4.4 and up
885,Pyaar Ek Dhoka hai Game - Anti Valentines,FAMILY,,8,17.0,100,Free,0,Everyone,Casual,"February 11, 2018",4.1 and up
7138,Free & Premium VPN - FinchVPN,TOOLS,4.2,19096,10.0,1000000,Free,0,Everyone,Tools,"July 5, 2018",4.1 and up


# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [266]:
# Preview only: drop() returns a new DataFrame, so df_apps itself is unchanged
df_apps.drop(["Last_Updated", "Android_Ver"], axis=1)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.70,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.00,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.50,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.00,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.00,0,Free,0,Everyone,Business
...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,4.50,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade
10837,Subway Surfers,GAME,4.50,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade
10838,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade
10839,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade


In [267]:
df_apps.isna().any(axis=1).value_counts()

False    9365
True     1476
Name: count, dtype: int64

In [268]:
df_apps.duplicated().value_counts()

False    10358
True       483
Name: count, dtype: int64

In [269]:
df_clean = df_apps.drop_duplicates(subset=["App", "Type", "Price"])

In [270]:
df_clean = df_clean.dropna()

In [271]:
df_clean

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,$1.49,Everyone,Arcade,"February 8, 2017",2.3 and up
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,$0.99,Everyone,Arcade,"March 3, 2017",2.3 and up
82,Brick Breaker BR,GAME,5.00,7,19.00,5,Free,0,Everyone,Arcade,"July 23, 2018",4.1 and up
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.00,1,4.60,5,Free,0,Everyone,Medical,"August 2, 2018",4.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
10824,Google Drive,PRODUCTIVITY,4.40,2731171,4.00,1000000000,Free,0,Everyone,Productivity,"August 6, 2018",Varies with device
10828,YouTube,VIDEO_PLAYERS,4.30,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors,"August 2, 2018",Varies with device
10829,Google Play Movies & TV,VIDEO_PLAYERS,3.70,906384,4.65,1000000000,Free,0,Teen,Video Players & Editors,"August 6, 2018",Varies with device
10831,Google News,NEWS_AND_MAGAZINES,3.90,877635,13.00,1000000000,Free,0,Teen,News & Magazines,"August 1, 2018",4.4 and up


### Find and Remove Duplicates

**Challenge**: Are there any duplicates in data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [272]:
# df_clean = df_apps.drop_duplicates(subset=["App", "Type", "Price"])
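
The difference between exact duplicates and duplicates on a subset of columns can be sketched on a small example. The toy DataFrame below is hypothetical (made-up review counts standing in for `df_apps`); it only illustrates why `.duplicated()` alone may miss repeated entries for an app such as "Instagram" when the review count differs between rows.

```python
import pandas as pd

# Hypothetical toy data standing in for df_apps (values are made up)
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Gmail"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
    "Reviews": [66577313, 66577446, 66577313, 4604324],
})

# Exact duplicates: only rows identical in every column are flagged
exact_dupes = toy.duplicated().sum()

# All entries for one app, including those that differ only in Reviews
instagram_rows = toy[toy["App"] == "Instagram"]

# Keep the first row for each (App, Type, Price) combination
deduped = toy.drop_duplicates(subset=["App", "Type", "Price"])

print(exact_dupes, len(instagram_rows), len(deduped))
```

Passing `subset` is a design choice: it treats rows as the same app even when a volatile column such as the review count changed between scrapes.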

# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?

In [273]:
df_clean.sort_values(by="Rating", ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up
1790,SUMMER SONIC app,EVENTS,5.0,4,61.0,500,Free,0,Everyone,Events,"July 24, 2018",4.4 and up
1769,Yazdani Cd Center EllahAbad Official App,FAMILY,5.0,8,3.8,500,Free,0,Everyone,Entertainment,"January 12, 2018",4.0 and up
985,DW Security,BUSINESS,5.0,6,15.0,100,Free,0,Everyone,Business,"July 25, 2018",4.1 and up
981,EU Exit poll,LIFESTYLE,5.0,10,9.4,100,Free,0,Everyone,Lifestyle,"July 15, 2016",4.1 and up
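
Every app in the table above has a perfect rating but only a handful of reviews, which is the problem with relying on ratings alone. One possible workaround, sketched here on hypothetical toy data with an arbitrary threshold (`MIN_REVIEWS` is an assumption, not a value from the dataset), is to require a minimum number of reviews before ranking.

```python
import pandas as pd

# Hypothetical toy data (app names and values are made up)
toy = pd.DataFrame({
    "App": ["Tiny App", "Popular App", "Mid App"],
    "Rating": [5.0, 4.7, 4.9],
    "Reviews": [4, 250000, 1200],
})

MIN_REVIEWS = 1000  # arbitrary cut-off chosen for illustration

# Drop apps whose rating rests on very few reviews, then rank by rating
well_reviewed = toy[toy["Reviews"] >= MIN_REVIEWS]
top_rated = well_reviewed.sort_values(by="Rating", ascending=False)

print(top_rated)
```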


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?

In [274]:
df_clean.sort_values(by="Size_MBs", ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play,"July 16, 2018",4.0 and up
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade,"July 25, 2018",4.1 and up
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action,"July 9, 2018",4.0 and up
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation,"May 31, 2018",4.0 and up
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness,"August 2, 2018",5.0 and up


# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [275]:
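# Check whether any of the 50 most-reviewed apps is paid
# (Price is still a string at this point; "0" means free)
top_50 = df_clean.sort_values(by="Reviews", ascending=False).head(50)
(top_50["Price"] != "0").any()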

False

# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [276]:
ratings = df_clean["Content_Rating"].value_counts()
values_, indexes_ = ratings.values, ratings.index

In [277]:
import plotly.express as px

In [278]:
fig = px.pie(values=values_, names=indexes_, title="Content Rating")
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

In [279]:
print(px.pie().update_traces())

Figure({
    'data': [{'domain': {'x': [0.0, 1.0], 'y': [0.0, 1.0]},
              'hovertemplate': '<extra></extra>',
              'legendgroup': '',
              'name': '',
              'showlegend': False,
              'type': 'pie'}],
    'layout': {'legend': {'tracegroupgap': 0}, 'margin': {'t': 60}, 'template': '...'}
})


In [280]:
fig = px.pie(values=values_, names=indexes_, title="Content Rating", hole=0.7)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [281]:
df_clean[df_clean["Installs"].str.replace(',', '').astype(int)>=1000000000]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10783,Google Play Books,BOOKS_AND_REFERENCE,3.9,1433233,5.7,1000000000,Free,0,Teen,Books & Reference,"August 3, 2018",Varies with device
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10786,Google Chrome: Fast & Secure,COMMUNICATION,4.3,9642995,3.5,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
10787,Gmail,COMMUNICATION,4.3,4604324,3.5,1000000000,Free,0,Everyone,Communication,"August 2, 2018",Varies with device
10788,Hangouts,COMMUNICATION,4.0,3419249,3.5,1000000000,Free,0,Everyone,Communication,"July 21, 2018",Varies with device
10792,Skype - free IM & video calls,COMMUNICATION,4.1,10484169,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10803,Google Play Games,ENTERTAINMENT,4.3,7165362,9.35,1000000000,Free,0,Teen,Entertainment,"July 16, 2018",Varies with device
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device


In [282]:
df_clean["Installs"] = df_clean["Installs"].str.replace(',', '').astype(int)
df_clean["Installs"]

21                1
28                1
47                1
82                5
99                5
            ...    
10824    1000000000
10828    1000000000
10829    1000000000
10831    1000000000
10835    1000000000
Name: Installs, Length: 8196, dtype: int64

In [283]:
df_clean[["App", "Installs"]].groupby("Installs").count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,697
5000,425
10000,987
50000,457


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [284]:
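# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
df_clean["Price"] = df_clean["Price"].str.replace("$", "", regex=False).astype(float)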

In [285]:
df_clean[["Price", "App"]].groupby("Price").count()

Unnamed: 0_level_0,App
Price,Unnamed: 1_level_1
0.00,7593
0.99,104
1.00,2
1.20,1
1.29,1
...,...
299.99,1
379.99,1
389.99,1
399.99,11


In [286]:
df_clean.sort_values(by="Price", ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
3946,I'm Rich - Trump Edition,LIFESTYLE,3.60,275,7.30,10000,Paid,400.00,Everyone,Lifestyle,"May 3, 2018",4.1 and up
3114,I am Rich,FINANCE,4.30,180,3.80,5000,Paid,399.99,Everyone,Finance,"March 22, 2018",4.2 and up
2461,I AM RICH PRO PLUS,FINANCE,4.00,36,41.00,1000,Paid,399.99,Everyone,Finance,"June 25, 2018",4.1 and up
3145,I am rich(premium),FINANCE,3.50,472,0.94,5000,Paid,399.99,Everyone,Finance,"May 1, 2017",4.4 and up
5765,I am rich,LIFESTYLE,3.80,3547,1.80,100000,Paid,399.99,Everyone,Lifestyle,"January 12, 2018",4.0.3 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
4508,myAir™ for Air10™ by ResMed,MEDICAL,3.70,236,18.00,50000,Free,0.00,Everyone,Medical,"July 25, 2018",5.0 and up
4507,AK Math Coach,FAMILY,3.60,283,18.00,50000,Free,0.00,Everyone,Education,"May 19, 2015",2.3.3 and up
4506,Forgotten Hill: Fall,GAME,4.40,1063,18.00,50000,Free,0.00,Teen,Adventure,"October 30, 2017",3.0 and up
4505,AE Video Poker,GAME,4.00,721,18.00,50000,Free,0.00,Teen,Casino,"December 31, 2015",2.3 and up


### The most expensive apps sub $250

In [287]:
# Keep only apps priced at $250 or less; copy() so new columns can be added safely
df_under_250 = df_clean[df_clean["Price"] <= 250].copy()

### Highest Grossing Paid Apps (ballpark estimate)

In [288]:
df_under_250["Revenue_Estimate"] = df_under_250["Price"] * df_under_250["Installs"]
df_under_250["Revenue_Estimate"].info()

<class 'pandas.core.series.Series'>
Index: 8181 entries, 21 to 10835
Series name: Revenue_Estimate
Non-Null Count  Dtype  
--------------  -----  
8181 non-null   float64
dtypes: float64(1)
memory usage: 127.8 KB


In [289]:
df_under_250[["Revenue_Estimate", "App", "Category"]].sort_values(by="Revenue_Estimate",ascending=False).head(10)

Unnamed: 0,Revenue_Estimate,App,Category
9220,69900000.0,Minecraft,FAMILY
8825,9900000.0,Hitman Sniper,GAME
7151,6990000.0,Grand Theft Auto: San Andreas,GAME
7977,5990000.0,Sleep as Android Unlock,LIFESTYLE
7477,5990000.0,Facetune - For Free,PHOTOGRAPHY
6594,4990000.0,DraStic DS Emulator,GAME
6082,2995000.0,Weather Live,WEATHER
7954,2990000.0,Bloons TD 5,FAMILY
6856,2990000.0,Threema,COMMUNICATION
7044,2990000.0,Tasker,TOOLS


In [290]:
df_under_250['Revenue_Estimate'] = df_under_250.Installs.mul(df_under_250.Price)
df_under_250.sort_values('Revenue_Estimate', ascending=False)[:15]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,"July 12, 2018",4.1 and up,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,"March 21, 2015",3.0 and up,6990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,"June 27, 2018",4.0 and up,5990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,"July 19, 2016",2.3 and up,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,"November 21, 2017",Varies with device,2995000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,"July 13, 2018",2.3.3 and up,2990000.0
6856,Threema,COMMUNICATION,4.5,51110,3.5,1000000,Paid,2.99,Everyone,Communication,"August 3, 2018",Varies with device,2990000.0
7044,Tasker,TOOLS,4.6,43045,3.4,1000000,Paid,2.99,Everyone,Tools,"June 25, 2018",Varies with device,2990000.0


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [291]:
top_categories = df_clean["Category"].value_counts()[:10]

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [292]:
bar = px.bar(x = top_categories.index, y = top_categories.values)
bar.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [293]:
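category_group = df_clean[["Category", "Installs"]].groupby("Category")
# Average installs per app in each category (equivalent to sum / count)
category_table = category_group.mean()
# Sort ascending so the largest bar ends up at the top of the horizontal chart,
# then keep the 15 categories with the highest averages
category_chart_data = category_table.sort_values(by="Installs", ascending=True)[-15:]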

In [294]:
bar = px.bar(y = category_chart_data.index, x = category_chart_data["Installs"], orientation="h")
bar.show()

### Category Concentration - Downloads vs. Competition

**Challenge**:
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [295]:
df_scatter = df_clean[["Category", "Installs", "App"]].groupby("Category").agg({"App":"count", "Installs":"sum"}).sort_values("Installs", ascending=False)
df_scatter

Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1610,4437579590
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650


In [296]:
scatter = px.scatter(df_scatter, x = "App",
                     y = "Installs",
                     color='Installs',
                     size="App",
                     labels={
                     "Installs": "Installs",
                     "App": "Number of Apps (Lower=More Concentrated)"},
                     title="Category Concentration",
                     hover_name = df_scatter.index)
scatter.update_layout(yaxis=dict(type="log"))
scatter.show()

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [384]:
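# Split nested genres such as "Arcade;Action & Adventure" on the semicolon only,
# so multi-word genres like "Action & Adventure" stay intact
split_genres = df_clean["Genres"].str.split(";", expand=True)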
genre_popularity = split_genres.stack().value_counts(ascending=False)
genre_popularity = genre_popularity[:20]
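
To see why the split-and-stack step matters, here is a minimal sketch on a hypothetical three-row Series (the values mimic the format of the `Genres` column but are chosen for illustration). Counting the raw strings treats each combination as its own genre, whereas splitting on `;` and stacking counts each genre separately.

```python
import pandas as pd

# Hypothetical sample in the same format as the Genres column
genres = pd.Series(["Arcade;Action & Adventure", "Arcade", "Casual;Pretend Play"])

# Counting the raw strings: every combination is a distinct value
nested_counts = genres.value_counts()

# One column per nested genre; shorter rows are padded with missing values
split = genres.str.split(";", expand=True)

# Stack the columns into one long Series; value_counts() ignores missing entries
genre_counts = split.stack().value_counts()

print(nested_counts)
print(genre_counts)
```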

In [412]:
bar_genre = px.bar(x = genre_popularity.index,
                  y = genre_popularity.values,
                  color = genre_popularity.values,
                  color_continuous_scale = "Agsunset",
                  hover_name = genre_popularity.index)

bar_genre.update_coloraxes(showscale=False)
bar_genre.update_xaxes(tickangle=45)
bar_genre.update_layout(title=dict(text="Top genres", font=dict(size=50),automargin=True, yref='paper'),
                        font=dict(family="Courier New, monospace", size=18),
                      xaxis_title='Genre',
                      yaxis_title='Number of Apps',
                      coloraxis_showscale=False)
bar_genre.show()

# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data?

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/).

* Find a way to set the colour scale using the color_continuous_scale parameter.
* Find a way to make the color axis disappear by using coloraxis_showscale.

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [441]:
df_type = df_clean.groupby(["Category", "Type"], as_index=False).agg({"App":pd.Series.count}).sort_values(by="App", ascending=False)
df_type

Unnamed: 0,Category,Type,App
19,FAMILY,Free,1456
25,GAME,Free,834
53,TOOLS,Free,656
21,FINANCE,Free,289
31,LIFESTYLE,Free,284
...,...,...,...
17,ENTERTAINMENT,Paid,2
24,FOOD_AND_DRINK,Paid,2
40,PARENTING,Paid,2
38,NEWS_AND_MAGAZINES,Paid,2


In [464]:
bar_type = px.bar(df_type,
                  x = "Category",
                  y = "App",
                  color = "Type",
                  hover_name = "Category",
                  barmode="group",
                  text_auto='.2s')


bar_type.update_xaxes(tickangle=90)
bar_type.update_layout(title=dict(text="Free vs Paid Apps by Category", font=dict(size=30)),
                      xaxis_title='Category',
                      yaxis_title='Number of Apps',
                      coloraxis_showscale=False,
                      yaxis=dict(type='log'))

bar_type.update_traces(textangle=0, textposition="outside", cliponaxis=False)
bar_type.show()

In [None]:
import plotly.graph_objects as go

# df_type (built above) has the columns "Category", "Type", and "App" (the number of apps)
# Create separate dataframes for free and paid apps
df_free = df_type[df_type["Type"] == "Free"]
df_paid = df_type[df_type["Type"] == "Paid"]

# Create bar trace for free apps
bar_free = px.bar(df_free,
                  x="Category",
                  y="App",
                  color="Type",
                  color_discrete_map={"Free": "blue"},
                  hover_name="Category",
                  text_auto='.2s')

# Create bar trace for paid apps
bar_paid = px.bar(df_paid,
                  x="Category",
                  y="App",
                  color="Type",
                  color_discrete_map={"Paid": "orange"},
                  hover_name="Category",
                  text_auto='.2s')

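# Extract the single bar trace from each chart
trace_free = bar_free.data[0]
trace_free.update(name="Free Apps")

trace_paid = bar_paid.data[0]
trace_paid.update(yaxis="y2", name="Paid Apps")  # plot paid apps against the second y-axis

# Create the final figure from the two traces
fig = go.Figure(data=[trace_free, trace_paid])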

# Update layout to include two y-axes
fig.update_layout(
    title=dict(text="Free vs Paid Apps by Category", font=dict(size=40), yref='paper'),
    font=dict(family="Courier New, monospace", size=18),
    xaxis=dict(title='Category', tickangle=90),
    yaxis=dict(title='Number of Free Apps'),
    yaxis2=dict(title='Number of Paid Apps', overlaying='y', side='right', showgrid=False),
    barmode="group",
    coloraxis_showscale=False
)

fig.update_traces(textangle=0, textposition="outside", cliponaxis=False)
fig.show()


**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart:

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart.

<img src=https://imgur.com/uVsECT3.png>


# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below:

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.