# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [1]:
import pandas as pd


# Notebook Presentation

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [3]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [None]:
df_apps.shape

(10841, 12)

In [None]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [None]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
8757,Darkness Rises,FAMILY,4.7,456474,86.0,5000000,Free,0,Teen,Role Playing,"July 24, 2018",4.4 and up
2197,DK 15 Minute Language Course,FAMILY,2.8,21,57.0,1000,Free,0,Everyone,Education,"April 6, 2018",4.0.3 and up
5607,CN Summer Challenge,FAMILY,3.7,985,49.0,100000,Free,0,Everyone,Entertainment;Brain Games,"May 24, 2018",4.4 and up
6450,Redfin Real Estate,HOUSE_AND_HOME,4.6,36857,19.0,1000000,Free,0,Everyone,House & Home,"July 25, 2018",5.0 and up
10514,Google Keep,PRODUCTIVITY,4.4,691474,4.0,100000000,Free,0,Everyone,Productivity,"August 6, 2018",Varies with device


### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [4]:
df_apps.drop(columns=['Last_Updated', 'Android_Ver'], inplace=True)

In [5]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
6831,My Space - Employment Center,BUSINESS,4.5,67000,7.2,1000000,Free,0,Everyone,Business
10723,Viber Messenger,COMMUNICATION,4.3,11334973,3.5,500000000,Free,0,Everyone,Communication
327,bacterial vaginosis,MEDICAL,,0,3.6,10,Free,0,Teen,Medical
8212,Free Dating & Flirt Chat - Choice of Love,DATING,4.0,42197,11.0,5000000,Free,0,Mature 17+,Dating
1843,Black Social,SOCIAL,4.6,22,8.7,1000,Free,0,Teen,Social


### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [6]:
# Counts the columns that contain at least one NaN (not the number of rows)
df_apps.isna().any().sum()

np.int64(2)

In [7]:
df_apps_clean = df_apps.dropna()

In [8]:
df_apps_clean.isna().any().sum()

np.int64(0)
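
The check above counts columns, so here is a minimal sketch of counting the rows that are missing a rating and dropping only those. The `toy` DataFrame is invented for illustration and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_apps
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D', 'E'],
    'Rating': [4.5, np.nan, 3.9, np.nan, np.nan],
    'Type': ['Free', 'Paid', None, 'Free', 'Free'],
})

cols_with_nan = toy.isna().any().sum()            # number of COLUMNS containing a NaN
rows_missing_rating = toy['Rating'].isna().sum()  # number of ROWS without a rating
toy_clean = toy.dropna(subset=['Rating'])         # drop only rows whose Rating is NaN

print(cols_with_nan, rows_missing_rating, len(toy_clean))
```

Note that `dropna()` without `subset` would also drop row `C`, because its `Type` is missing.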

### Find and Remove Duplicates

**Challenge**: Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [9]:
df_apps_clean.duplicated().sum()

np.int64(476)

In [None]:
print(df_apps_clean[df_apps_clean['App'] == 'Instagram'].shape[0])

4


In [20]:
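# Drop exact duplicate rows; .drop_duplicates() returns a new DataFrame, so keep the result
df_apps_clean = df_apps_clean.drop_duplicates()
# Preview the rows that remain when uniqueness is judged on App, Type and Price only
df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])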

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,$1.49,Everyone,Arcade
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,$0.99,Everyone,Arcade
82,Brick Breaker BR,GAME,5.00,7,19.00,5,Free,0,Everyone,Arcade
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.00,1,4.60,5,Free,0,Everyone,Medical
...,...,...,...,...,...,...,...,...,...,...
10824,Google Drive,PRODUCTIVITY,4.40,2731171,4.00,1000000000,Free,0,Everyone,Productivity
10828,YouTube,VIDEO_PLAYERS,4.30,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10829,Google Play Movies & TV,VIDEO_PLAYERS,3.70,906384,4.65,1000000000,Free,0,Teen,Video Players & Editors
10831,Google News,NEWS_AND_MAGAZINES,3.90,877635,13.00,1000000000,Free,0,Teen,News & Magazines


In [21]:
df_apps_clean.duplicated()

Unnamed: 0,0
21,False
28,False
47,False
82,False
99,False
...,...
10835,False
10836,False
10837,False
10838,False


In [22]:
df_apps_clean.duplicated().sum()

np.int64(0)
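
To see the difference between exact duplicates and duplicates judged on a subset of columns, here is a small sketch on invented data (the `toy` frame and its values are assumptions, not rows from the dataset). It also shows that `.drop_duplicates()` returns a new DataFrame and leaves the original untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical toy data: three "Instagram" rows, two of them fully identical
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Maps'],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
    'Reviews': [100, 100, 250, 40],
})

exact_dupes = toy.duplicated().sum()  # rows identical in every column
subset_dupes = toy.duplicated(subset=['App', 'Type', 'Price']).sum()

toy.drop_duplicates()                 # result discarded: toy is unchanged
deduped = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(exact_dupes, subset_dupes, len(toy), len(deduped))
```

Scraped rows for the same app can differ only in `Reviews`, which is why a full-row check can report zero duplicates while an app still appears several times.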

# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [23]:
print(df_apps_clean.loc[df_apps_clean['Rating'].idxmax()])

App               KBA-EZ Health Guide
Category                      MEDICAL
Rating                           5.00
Reviews                             4
Size_MBs                        25.00
Installs                            1
Type                             Free
Price                               0
Content_Rating               Everyone
Genres                        Medical
Name: 21, dtype: object


In [24]:
if (df_apps_clean['Rating'] == 5.0).any():  # True if at least one app has a perfect rating
  # Print the names of the apps rated 5.0
  print(df_apps_clean[df_apps_clean['Rating'] == 5.0]['App'])

21                                   KBA-EZ Health Guide
28                                              Ra Ga Ba
47                                               Mu.F.O.
82                                      Brick Breaker BR
99       Anatomy & Physiology Vocabulary Exam Review App
                              ...                       
2680                                 Florida Wildflowers
2750    Superheroes, Marvel, DC, Comics, TV, Movies News
3030             CL Keyboard - Myanmar Keyboard (No Ads)
3115                                          Oración CX
4058                            Ek Bander Ne Kholi Dukan
Name: App, Length: 271, dtype: object
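
One way to look at the problem with ratings alone is to require a minimum number of reviews before ranking. The sketch below uses invented apps and a hypothetical `MIN_REVIEWS` threshold; neither comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data: a perfect rating backed by only two reviews
toy = pd.DataFrame({
    'App': ['Tiny Quiz', 'Big Chat', 'Photo Pal'],
    'Rating': [5.0, 4.6, 4.8],
    'Reviews': [2, 1_500_000, 30_000],
})

MIN_REVIEWS = 1_000  # assumed cut-off, chosen only for this example

top_by_rating = toy.sort_values('Rating', ascending=False).iloc[0]['App']
trusted = toy[toy['Reviews'] >= MIN_REVIEWS]
top_trusted = trusted.sort_values('Rating', ascending=False).iloc[0]['App']

print(top_by_rating, '|', top_trusted)
```

A 5.0 average from a handful of reviews says little about quality, so sorting by `Rating` alone tends to surface apps with almost no users.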


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please?

In [25]:
print(df_apps_clean.loc[df_apps_clean['Size_MBs'].idxmax()])

App               Navi Radiography Pro
Category                       MEDICAL
Rating                            4.70
Reviews                             11
Size_MBs                        100.00
Installs                           500
Type                              Paid
Price                           $15.99
Content_Rating                Everyone
Genres                         Medical
Name: 1795, dtype: object
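
`.idxmax()` returns a single row. To list several of the largest apps and check whether many share the same maximum size, something like the following sketch could be used. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: several apps sit at the same maximum size
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Size_MBs': [100.0, 12.5, 100.0, 99.0, 3.2, 100.0],
})

largest = toy.nlargest(5, 'Size_MBs')  # five rows with the biggest Size_MBs
at_cap = (toy['Size_MBs'] == toy['Size_MBs'].max()).sum()  # apps at the maximum

print(largest)
print(at_cap)
```

If many apps share exactly the same maximum value, that hints at an upper limit rather than a natural spread of sizes.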


# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [26]:
print(df_apps_clean.loc[df_apps_clean['Reviews'].idxmax()])

App                    Facebook
Category                 SOCIAL
Rating                     4.10
Reviews                78158306
Size_MBs                   5.30
Installs          1,000,000,000
Type                       Free
Price                         0
Content_Rating             Teen
Genres                   Social
Name: 10805, dtype: object


In [27]:
top_50_reviews = df_apps_clean.nlargest(50, 'Reviews')
if (top_50_reviews['Type']=='Paid').any():
  print(top_50_reviews[top_50_reviews['Type']=='Paid']['App'])
else:
  print('No paid apps in the top 50')

No paid apps in the top 50


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [28]:
import plotly.express as px

In [33]:
ratings = df_apps_clean['Content_Rating'].value_counts()
print(ratings)

Content_Rating
Everyone           7094
Teen               1022
Mature 17+          411
Everyone 10+        360
Adults only 18+       3
Unrated               1
Name: count, dtype: int64


In [35]:
fig = px.pie(title='Content Rating of Apps in Google Play Store',
             values=ratings.values,
             names=ratings.index)  # names supplies the label for each sector
fig.show()

In [36]:
# Create Donut Chart
fig = px.pie(title='Content Rating of Apps in Google Play Store',
             values=ratings.values,
             names=ratings.index,
             hole=0.6)  # hole > 0 turns the pie into a donut
fig.show()

In [37]:
# Calculate rating counts for each category
rating_counts = {
    '5': len(df_apps_clean[df_apps_clean['Rating'] == 5]),
    '4.5 or over': len(df_apps_clean[(df_apps_clean['Rating'] >= 4.5) & (df_apps_clean['Rating'] < 5)]),
    '4 or over': len(df_apps_clean[(df_apps_clean['Rating'] >= 4) & (df_apps_clean['Rating'] <= 4.4)]),
    '3 or over': len(df_apps_clean[(df_apps_clean['Rating'] >= 3) & (df_apps_clean['Rating'] <= 3.9)]),
    '2 or over': len(df_apps_clean[(df_apps_clean['Rating'] >= 2) & (df_apps_clean['Rating'] <= 2.9)]),
    'less than 2': len(df_apps_clean[df_apps_clean['Rating'] <= 1.9])
}

# Create a list of rating categories in the desired order
ordered_categories = ['5', '4.5 or over', '4 or over', '3 or over', '2 or over', 'less than 2']

# Create a new DataFrame with the ordered categories
donut_df = pd.DataFrame({'Rating Category': ordered_categories,
                         'Count': [rating_counts[cat] for cat in ordered_categories]})


# Create the donut chart
fig = px.pie(donut_df,
             values='Count',
             names='Rating Category',
             title='App Ratings Distribution',
             hole=0.5)

fig.show()
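
The rating buckets above are built with one boolean filter per bucket. As an alternative sketch, `pd.cut` can assign every rating to a bucket in one pass. The bin edges below are my reading of the labels, and the sample ratings are invented.

```python
import pandas as pd

# Hypothetical sample of ratings
ratings = pd.Series([5.0, 4.7, 4.5, 4.4, 4.0, 3.2, 2.9, 1.0])

# Left-closed bins: [0, 2), [2, 3), [3, 4), [4, 4.5), [4.5, 5), [5, inf)
bins = [0, 2, 3, 4, 4.5, 5, float('inf')]
labels = ['less than 2', '2 or over', '3 or over', '4 or over', '4.5 or over', '5']

binned = pd.cut(ratings, bins=bins, labels=labels, right=False)

# Count per bucket, listed from best to worst
order = labels[::-1]
counts = binned.astype(str).value_counts().reindex(order, fill_value=0)
print(counts)
```

With `right=False` each bin includes its lower edge, so a rating of exactly 4.5 lands in '4.5 or over' and there are no gaps between buckets.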

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [38]:
df_apps_clean['Installs'].dtype

dtype('O')

In [39]:
# Strip "+" and "," so the values can be parsed as numbers; assign the result back
# instead of calling an inplace method on a column selection
df_apps_clean = df_apps_clean.assign(
    Installs=df_apps_clean['Installs'].replace(to_replace='[+,]', value='', regex=True))

In [None]:
df_apps_clean['Installs'].tail(5)

Unnamed: 0,Installs
10835,1000000000
10836,1000000000
10837,1000000000
10838,1000000000
10840,1000000000


In [None]:
# .assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
df_apps_clean = df_apps_clean.assign(
    Installs=pd.to_numeric(df_apps_clean['Installs'], errors='coerce'))

In [None]:
df_apps_clean['Installs'].dtype

dtype('int64')

In [None]:
print(df_apps_clean[df_apps_clean['Installs'] >= 1000000000]['App'].count())

49


In [None]:
print(df_apps_clean[df_apps_clean['Installs'] >= 1000000000]['App'])

10783                           Google Play Books
10784    Messenger – Text and Video Chat for Free
10785                          WhatsApp Messenger
10786                Google Chrome: Fast & Secure
10787                                       Gmail
10788                                    Hangouts
10790    Messenger – Text and Video Chat for Free
10791                                    Hangouts
10792               Skype - free IM & video calls
10793                Google Chrome: Fast & Secure
10794                                       Gmail
10795                                    Hangouts
10797                          WhatsApp Messenger
10798                Google Chrome: Fast & Secure
10801                                    Hangouts
10803                           Google Play Games
10804                           Google Play Games
10805                                    Facebook
10806                                   Instagram
10807                                     Google+


In [None]:
print(df_apps_clean[df_apps_clean['Installs'] == 1]['App'].count())

3


In [None]:
print(df_apps_clean[df_apps_clean['Installs'] == 1]['App'])

21    KBA-EZ Health Guide
28               Ra Ga Ba
47                Mu.F.O.
Name: App, dtype: object
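
The 2-step conversion described in the challenge can be sketched on a small Series. The install strings below are invented examples of the kind of formatting (commas, a trailing `+`) that has to be removed first.

```python
import pandas as pd

# Hypothetical install counts stored as text
installs = pd.Series(['1,000,000,000+', '500+', '1', '10,000'])

# Step 1: remove the non-numeric characters
cleaned = installs.str.replace('[+,]', '', regex=True)

# Step 2: convert the cleaned strings to numbers
numeric = pd.to_numeric(cleaned)

billion_plus = (numeric >= 1_000_000_000).sum()  # apps with a billion installs or more
single_install = (numeric == 1).sum()            # apps with exactly one install

print(numeric.dtype, billion_plus, single_install)
```

Assigning each step to a new variable (or back to the column) avoids the chained-assignment pitfalls of `inplace=True` on a column selection.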


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [60]:
# Strip "$" and "," from the Price column only, leaving other text columns untouched
df_apps_clean = df_apps_clean.assign(
    Price=df_apps_clean['Price'].replace(to_replace='[$,]', value='', regex=True))

In [61]:
# Convert the cleaned price strings to floats; unparseable values become NaN
df_apps_clean = df_apps_clean.assign(
    Price=pd.to_numeric(df_apps_clean['Price'], errors='coerce'))
df_apps_clean['Price'].dtype

dtype('float64')

In [62]:
print(df_apps_clean.loc[df_apps_clean['Price'].nlargest(20).index])

                                 App   Category  Rating  Reviews  Size_MBs  \
3946        I'm Rich - Trump Edition  LIFESTYLE    3.60      275      7.30   
1331          most expensive app (H)     FAMILY    4.30        6      1.50   
1946  I am rich (Most expensive app)    FINANCE    4.10      129      2.70   
2394                      I am Rich!    FINANCE    3.80       93     22.00   
2461              I AM RICH PRO PLUS    FINANCE    4.00       36     41.00   
2775                   I Am Rich Pro     FAMILY    4.40      201      2.70   
3114                       I am Rich    FINANCE    4.30      180      3.80   
3145              I am rich(premium)    FINANCE    3.50      472      0.94   
3221                  I am Rich Plus     FAMILY    4.00      856      8.70   
3554                      💎 I'm rich  LIFESTYLE    3.80      718     26.00   
4606               I Am Rich Premium    FINANCE    4.10     1867      4.70   
5765                       I am rich  LIFESTYLE    3.80     3547

In [63]:
df_apps_clean.loc[df_apps_clean['Price'] == 400]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle


In [64]:
# Keep apps priced under $250; .copy() makes the result independent of the original frame
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250].copy()

### The most expensive apps sub $250

In [None]:
print(df_apps_clean.loc[df_apps_clean['Price'].nlargest(5).index])

                            App   Category  Rating  Reviews  Size_MBs  \
2281  Vargo Anesthesia Mega App    MEDICAL    4.60       92     32.00   
1407               LTC AS Legal    MEDICAL    4.00        6      1.30   
2629           I am Rich Person  LIFESTYLE    4.20      134      1.80   
2481    A Manual of Acupuncture    MEDICAL    3.50      214     68.00   
504   AP Art History Flashcards     FAMILY    5.00        1     96.00   

      Installs  Type  Price Content_Rating     Genres     Last_Updated  \
2281      1000  Paid  79.99       Everyone    Medical     June 18 2018   
1407       100  Paid  39.99       Everyone    Medical     April 4 2018   
2629      1000  Paid  37.99       Everyone  Lifestyle     July 18 2017   
2481      1000  Paid  33.99       Everyone    Medical   October 2 2017   
504         10  Paid  29.99     Mature 17+  Education  January 19 2016   

       Android_Ver  
2281  4.0.3 and up  
1407    4.1 and up  
2629  4.0.3 and up  
2481    4.0 and up  
504     4.0

### Highest Grossing Paid Apps (ballpark estimate)

In [70]:
# Replace any potential non-numeric characters in 'Installs' column with an empty string
df_apps_clean['Installs'] = df_apps_clean['Installs'].astype(str).str.replace('[+,]', '', regex=True)

# Convert 'Installs' column to numeric, handling errors by coercing them to NaN
df_apps_clean['Installs'] = pd.to_numeric(df_apps_clean['Installs'], errors='coerce')

# Now calculate 'Revenue_Estimate'
df_apps_clean['Revenue_Estimate'] = df_apps_clean['Installs'] * df_apps_clean['Price']

In [71]:
print(df_apps_clean.loc[df_apps_clean['Revenue_Estimate'].nlargest(10).index])

                                App       Category  Rating  Reviews  Size_MBs  \
9220                      Minecraft         FAMILY    4.50  2376564     19.00   
9224                      Minecraft         FAMILY    4.50  2375336     19.00   
8825                  Hitman Sniper           GAME    4.60   408292     29.00   
7151  Grand Theft Auto: San Andreas           GAME    4.40   348962     26.00   
7477            Facetune - For Free    PHOTOGRAPHY    4.40    49553     48.00   
7977        Sleep as Android Unlock      LIFESTYLE    4.50    23966      0.85   
6594            DraStic DS Emulator           GAME    4.60    87766     12.00   
6082                   Weather Live        WEATHER    4.50    76593      4.75   
6746     Card Wars - Adventure Time         FAMILY    4.30   129603     23.00   
6856                        Threema  COMMUNICATION    4.50    51110      3.50   

      Installs  Type  Price Content_Rating                     Genres  \
9220  10000000  Paid   6.99   Every
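
As a compact illustration of the revenue estimate (price times installs) and of counting games among the top earners, here is a sketch on invented data. The app names, prices and install counts are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'App': ['Block Game', 'Free Chat', 'Sniper Game', 'Photo Tool'],
    'Category': ['FAMILY', 'COMMUNICATION', 'GAME', 'PHOTOGRAPHY'],
    'Type': ['Paid', 'Free', 'Paid', 'Paid'],
    'Price': ['$6.99', '0', '$0.99', '$5.99'],
    'Installs': [10_000_000, 500_000_000, 10_000_000, 1_000_000],
})

# Clean only the Price column, then convert it to a number
toy['Price'] = pd.to_numeric(toy['Price'].str.replace('$', '', regex=False))

# Ballpark revenue: every install is assumed to be a purchase at the listed price
toy['Revenue_Estimate'] = toy['Installs'] * toy['Price']

top = toy[toy['Type'] == 'Paid'].nlargest(2, 'Revenue_Estimate')
games_in_top = (top['Category'] == 'GAME').sum()

print(top[['App', 'Revenue_Estimate']])
print(games_in_top)
```

Keep in mind that the `Category` label can hide games filed under other categories such as FAMILY, so a count based on `Category == 'GAME'` is a lower bound.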

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [72]:
import plotly.express as px

In [73]:
bar = px.bar(df_apps_clean,
             x='Category',
             y='Installs')
bar.show()

In [74]:
fig = px.scatter(df_apps_clean, x='Category', y='Installs')
fig.show()

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [75]:
num_apps_per_category = df_apps_clean['Category'].value_counts()
ver_bar_chart = px.bar(x=num_apps_per_category.index, y=num_apps_per_category.values,
                       title='Number of Apps per Category',
                       labels={'x': 'Category', 'y': 'Number of Apps'})
ver_bar_chart.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [76]:
highest_downloads_per_category = df_apps_clean.groupby('Category')['Installs'].sum()

In [79]:
horizontal_bar_chart = px.bar(x=highest_downloads_per_category.values,
                              y=highest_downloads_per_category.index,
                              title='Total Installs per Category',
                              labels={'x': 'Total Installs', 'y': 'Category'},
                              orientation='h')
horizontal_bar_chart.update_layout(yaxis={'categoryorder': 'total ascending'})
horizontal_bar_chart.show()

### Category Concentration - Downloads vs. Competition

**Challenge**:
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [None]:
category_df = df_apps_clean.groupby('Category').agg({'App': 'count', 'Installs': 'sum'})
category_df.rename(columns={'App': 'Number of Apps', 'Installs': 'Total Installs'}, inplace=True)

In [None]:
fig = px.scatter(category_df,
                 x='Number of Apps',
                 y='Total Installs',
                 size='Number of Apps',
                 color='Total Installs',
                 hover_name=category_df.index)
fig.update_layout(yaxis=dict(type='log'))
fig.show()
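
The aggregation step can also be written with named aggregation, which sets the output column names in the same call. This is a sketch on invented data; the column names `Number_of_Apps`, `Total_Installs` and `Installs_per_App` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['GAME', 'GAME', 'TOOLS', 'GAME'],
    'Installs': [1_000, 5_000, 100, 10],
})

# One row per category: how many apps compete and how many installs they share
category_summary = toy.groupby('Category').agg(
    Number_of_Apps=('App', 'count'),
    Total_Installs=('Installs', 'sum'),
)

# A simple concentration measure: installs per competing app
category_summary['Installs_per_App'] = (
    category_summary['Total_Installs'] / category_summary['Number_of_Apps'])

print(category_summary)
```

A category with many apps but few installs per app is crowded, while a high value suggests downloads are concentrated in fewer apps.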

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [14]:
df_apps_clean.Genres.value_counts()

Unnamed: 0_level_0,count
Genres,Unnamed: 1_level_1
Tools,732
Entertainment,498
Education,446
Action,349
Productivity,334
...,...
Health & Fitness;Action & Adventure,1
Casual;Music & Video,1
Tools;Education,1
Role Playing;Brain Games,1


In [88]:
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
stack

Unnamed: 0,Unnamed: 1,0
21,0,Medical
28,0,Arcade
47,0,Arcade
82,0,Arcade
99,0,Medical
...,...,...
10835,0,Arcade
10836,0,Arcade
10837,0,Arcade
10838,0,Arcade


In [89]:
num_genres = stack.value_counts()
num_genres

Unnamed: 0,count
Tools,733
Education,626
Entertainment,534
Action,364
Productivity,334
Finance,311
Personalization,310
Communication,308
Photography,304
Sports,303


# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data?

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built-in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/).

* Find a way to set the colour scale using the color_continuous_scale parameter.
* Find a way to make the color axis disappear by using coloraxis_showscale.

In [92]:
bar = px.bar(x = num_genres.index[:15], # index = category name
             y = num_genres.values[:15], # count
             title='Top Genres',
             hover_name=num_genres.index[:15],
             color=num_genres.values[:15],
             color_continuous_scale='Agsunset')
bar.update_layout(
    xaxis_title='Genre',
    yaxis_title='Number of Apps',
    coloraxis_showscale=False)
bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [94]:
df_free_vs_paid = df_apps_clean.groupby(['Category', 'Type'])['App'].count().unstack()
df_free_vs_paid.fillna(0, inplace=True)
df_free_vs_paid

Type,Free,Paid
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
ART_AND_DESIGN,59.0,3.0
AUTO_AND_VEHICLES,72.0,1.0
BEAUTY,42.0,0.0
BOOKS_AND_REFERENCE,169.0,8.0
BUSINESS,261.0,9.0
COMICS,58.0,0.0
COMMUNICATION,285.0,22.0
DATING,155.0,4.0
EDUCATION,125.0,4.0
ENTERTAINMENT,109.0,2.0


**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart:

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

In [101]:
g_bar = px.bar(df_free_vs_paid.reset_index(), # Reset index to make 'Category' a column
               x='Category',
               y=['Free', 'Paid'], # Specify both 'Free' and 'Paid' columns for y-axis
               title='Free vs Paid Apps by Category',
               barmode='group')

g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))
g_bar.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart.

<img src=https://imgur.com/uVsECT3.png>


In [108]:
box = px.box(df_apps_clean,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')

box.update_layout(yaxis=dict(type='log'))

box.show()
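
The medians shown in the box plot's hover text can also be computed directly. This sketch uses invented install counts, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [1_000, 100_000, 10_000_000, 100, 5_000, 10_000],
})

medians = toy.groupby('Type')['Installs'].median()  # median installs per app type
ratio = medians['Free'] / medians['Paid']           # how many times larger the free median is

print(medians)
print(ratio)
```

Because installs span several orders of magnitude, the median is a more representative summary here than the mean.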

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below:

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [114]:
df_paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']
box = px.box(df_paid_apps,
             y='Revenue_Estimate',
             x='Category',
             title='How Much Can Paid Apps Earn?')

box.update_layout(xaxis={'categoryorder': 'min ascending'},  # sort categories as in the hint
                  yaxis=dict(type='log'))

box.show()
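
To answer the cost question without reading values off the chart, the median revenue per category can be compared with the development cost. The revenue figures below are invented; only the $30,000 figure comes from the challenge text.

```python
import pandas as pd

DEV_COST = 30_000  # development cost quoted in the challenge

# Hypothetical revenue estimates for paid apps
paid = pd.DataFrame({
    'Category': ['TOOLS', 'TOOLS', 'TOOLS',
                 'PHOTOGRAPHY', 'PHOTOGRAPHY', 'PHOTOGRAPHY'],
    'Revenue_Estimate': [2_990.0, 9_990.0, 49_900.0,
                         19_900.0, 39_900.0, 599_000.0],
})

median_revenue = paid.groupby('Category')['Revenue_Estimate'].median()
recoups_cost = median_revenue >= DEV_COST  # True where the median app covers the cost

print(median_revenue)
print(recoups_cost)
```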

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.

In [115]:
df_paid_apps.Price.median()

2.99

In [116]:
df_paid_apps.groupby('Category')['Price'].median().sort_values(ascending=False)

Unnamed: 0_level_0,Price
Category,Unnamed: 1_level_1
DATING,5.49
MEDICAL,5.49
BUSINESS,4.99
PARENTING,4.79
FOOD_AND_DRINK,4.24
EDUCATION,3.99
FINANCE,3.99
ENTERTAINMENT,3.99
WEATHER,3.99
PRODUCTIVITY,3.49


In [118]:
box2 = px.box(df_paid_apps,
              y='Price',
              x='Category',
              title='Price Range by Category')
# Sort categories by their highest price, as recommended in the challenge
box2.update_layout(xaxis_title='Category',
                   yaxis_title='Paid App Price',
                   xaxis={'categoryorder':'max descending'},
                   yaxis=dict(type='log'))
box2.show()