<a href="https://colab.research.google.com/github/NAndrei313/Google-Play-Store-Analysis/blob/main/Google_Play_Store_App_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, we analyze the Android app market by comparing thousands of apps in the Google Play Store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [49]:
# For this project, we will utilize Pandas for data analysis and Plotly Express for data visualization.
import pandas as pd
import plotly.express as px

# Notebook Presentation

In [50]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [51]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

In [52]:
#  How many rows and columns does df_apps have?
df_apps.shape

(10841, 12)

In [53]:
# What are the column names?
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [54]:
# Display a random sample of 5 rows from the DataFrame.
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
8012,OpenTable: Restaurants Near Me,FOOD_AND_DRINK,4.6,90242,19.0,5000000,Free,0,Everyone,Food & Drink,"July 31, 2018",5.0 and up
9118,Fandango Movies - Times + Tickets,ENTERTAINMENT,4.6,243747,9.35,10000000,Free,0,Teen,Entertainment,"August 2, 2018",Varies with device
1354,high cholesterol levels,MEDICAL,,0,3.3,100,Free,0,Everyone,Medical,"July 23, 2018",4.1 and up
9928,Baseball Boy!,GAME,3.4,148177,78.0,10000000,Free,0,Everyone,Arcade,"July 27, 2018",4.1 and up
3455,Transport DK,MAPS_AND_NAVIGATION,4.1,204,11.0,10000,Free,0,Everyone,Maps & Navigation,"May 29, 2017",4.0 and up


### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [55]:
# Remove the Unused Columns from the DataFrame.
df_apps.drop(columns=['Last_Updated', 'Android_Ver'], inplace=True)

In [56]:
# Quickly inspect the top 5 rows of the DataFrame.
df_apps.head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


### Find and Remove NaN values in Ratings


In [57]:
# How many rows have a NaN (not-a-number) value in each column?
df_apps.isna().sum()

Unnamed: 0,0
App,0
Category,0
Rating,1474
Reviews,0
Size_MBs,0
Installs,0
Type,1
Price,0
Content_Rating,0
Genres,0


In [58]:
# Create DataFrame called df_apps_clean that does not include these rows.
df_apps_clean = df_apps.dropna()
df_apps_clean.isna().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size_MBs,0
Installs,0
Type,0
Price,0
Content_Rating,0
Genres,0
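Called without arguments, `dropna()` removes a row when any column is missing, so the one row with a missing `Type` is dropped along with the rows that lack a `Rating`. The following is a minimal sketch on a toy DataFrame (the data is invented for illustration) that contrasts this with the `subset` parameter, which limits the check to the listed columns.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, None, 3.9, 4.1],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# dropna() removes a row if ANY column is missing.
dropped_any = toy.dropna()

# subset= restricts the check to the listed columns.
dropped_rating_only = toy.dropna(subset=['Rating'])

print(len(dropped_any))
print(len(dropped_rating_only))
```

Which variant is appropriate depends on whether the other columns with missing values matter for the analysis.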


In [59]:
# Check the newly created DataFrame.
print(f'New Df shape is:{df_apps_clean.shape}')
df_apps_clean.sample(5)

New Df shape is:(9367, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
4795,Rock N' Cash Casino Slots -Free Vegas Slot Games,GAME,4.7,6187,33.0,100000,Free,0,Teen,Casino
4182,Daily K-Talk,FAMILY,4.3,264,60.0,10000,Free,0,Everyone,Education
9381,Metro name iD,PRODUCTIVITY,4.0,27800,4.0,10000000,Free,0,Everyone,Productivity
9332,Airbnb,TRAVEL_AND_LOCAL,4.4,359560,12.0,10000000,Free,0,Everyone,Travel & Local
10141,"Video Editor Music,Cut,No Crop",PHOTOGRAPHY,4.7,2163282,6.9,50000000,Free,0,Everyone,Photography


### Find and Remove Duplicates

**Challenge**: Are there any duplicates in data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [60]:
# Check for fully duplicated rows, i.e. rows that match an earlier row in every column.
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(f'Shape of the duplicated rows: {duplicated_rows.shape}')

Shape of the duplicated rows: (476, 10)


In [61]:
# Count how many rows are duplicated.
df_apps_clean.duplicated().value_counts()

Unnamed: 0,count
False,8891
True,476


As we can see, `df_apps_clean` contains 476 fully duplicated rows, which must be removed for an accurate analysis. Some apps also appear several times in rows that are not exact copies of each other, so we deduplicate on the `App`, `Type`, and `Price` columns instead of on whole rows.

In [62]:
# Check and delete duplicated apps.
df_apps_clean = df_apps_clean.drop_duplicates(subset=["App", "Type", "Price"])
df_apps_clean.tail(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10824,Google Drive,PRODUCTIVITY,4.4,2731171,4.0,1000000000,Free,0,Everyone,Productivity
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10829,Google Play Movies & TV,VIDEO_PLAYERS,3.7,906384,4.65,1000000000,Free,0,Teen,Video Players & Editors
10831,Google News,NEWS_AND_MAGAZINES,3.9,877635,13.0,1000000000,Free,0,Teen,News & Magazines
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade


In [63]:
# Check again for duplicates.
print(f'How many rows are duplicated: {df_apps_clean.duplicated().sum()}')

How many rows are duplicated: 0


Cleaning reduced the DataFrame from 10,841 rows to 8,199.

In [64]:
df_apps_clean.shape

(8199, 10)
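The difference between whole-row duplicates and per-app duplicates can be seen on a small example. This is only a sketch on invented data; the column names mirror the notebook, but the values are made up.

```python
import pandas as pd

# Toy data, invented for illustration: the same app appears three times,
# once with a different review count.
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Notes'],
    'Reviews': [100, 100, 105, 7],
    'Type': ['Free', 'Free', 'Free', 'Paid'],
    'Price': ['0', '0', '0', '$1.99'],
})

# duplicated() only flags rows that match an earlier row in EVERY column.
full_row_dupes = toy.duplicated().sum()

# subset= compares only the listed columns, so the third Instagram row
# (different review count) is also treated as a duplicate.
deduped = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(full_row_dupes)
print(deduped['App'].tolist())
```

By default `drop_duplicates()` keeps the first occurrence of each group; the `keep` parameter changes that.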

# Find Highest Rated Apps


Let's look for the highest-quality apps. Sorting by rating surfaces apps with a perfect 5.00 rating but only a handful of reviews. A rating based on so few reviews is not reliable, so sorting by rating alone is not a good way to find the best apps.

In [65]:
# Identify which apps are the highest rated.
df_apps_clean.sort_values('Rating', ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1751,Spine- The dating app,DATING,5.0,5,9.3,500,Free,0,Teen,Dating
1761,Selfie With Champion AJ Style,PHOTOGRAPHY,5.0,2,7.5,500,Free,0,Everyone,Photography
1763,Ag Valley Cooperative,BUSINESS,5.0,6,74.0,500,Free,0,Everyone,Business
1769,Yazdani Cd Center EllahAbad Official App,FAMILY,5.0,8,3.8,500,Free,0,Everyone,Entertainment
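One possible way to make the rating ranking more meaningful is to require a minimum number of reviews before sorting. The sketch below uses invented data, and the cutoff `MIN_REVIEWS` is a hypothetical value chosen arbitrarily; it is not part of the original analysis.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'App': ['Tiny', 'Popular', 'Solid'],
    'Rating': [5.0, 4.6, 4.7],
    'Reviews': [4, 250_000, 12_000],
})

MIN_REVIEWS = 1_000  # hypothetical cutoff, chosen arbitrarily

# Keep only apps with enough reviews, then rank the rest by rating.
trusted = toy[toy['Reviews'] >= MIN_REVIEWS]
ranked = trusted.sort_values('Rating', ascending=False)

print(ranked['App'].tolist())
```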


# Find the App with Most Reviews

A more reliable indicator is the number of reviews, since it shows how many users an app has actually reached. Note that all of the 10 most-reviewed apps are free.

In [66]:
df_apps_clean.sort_values('Reviews', ascending=False).head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0,Everyone,Tools
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0,Everyone 10+,Strategy


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

The chart below shows that 'Everyone' is the most common content rating, followed by 'Teen' and 'Mature 17+'.

In [67]:
ratings = df_apps_clean['Content_Rating'].value_counts()

In [68]:
# Create a donut chart (a pie chart with a hole) of the Content_Rating categories with Plotly Express.
fig = px.pie(names=ratings.index,
             values=ratings.values,
             title='Ratings on Google Play',
             width=600, height=500,
             hole=0.4)

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [69]:
# Check how many apps had over 1 billion installations.
over_1_billion = df_apps_clean[df_apps_clean['Installs'] == "1,000,000,000"]
print(f'How many apps had over 1 billion installations: {len(over_1_billion)}')

How many apps had over 1 billion installations: 20


In [70]:
# Check how many apps just had a single install.
single_install = df_apps_clean[df_apps_clean['Installs'] == "1"]
print(f'How many apps just had a single install: {len(single_install)}')

How many apps just had a single install: 3


In [71]:
# Count the number of apps at each level of installations.
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
1000,698
1000000,1417
1000000000,20
10,69
10000,988
10000000,933
100,303
100000,1096
100000000,189


In [72]:
# Convert the number of installations to a numeric data type
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', '')
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457


As the table shows, the Google Play Store reports installs in buckets (1, 5, 10, 50, 100, and so on) rather than as exact counts. The highest bucket in this dataset is 1,000,000,000 installs, which 20 apps reach.
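The two-step conversion can be checked in isolation. Below is a minimal sketch on invented values; it shows why the text form is awkward (strings sort alphabetically) and how stripping the thousands separators makes the values parseable.

```python
import pandas as pd

# Toy values, invented for illustration: install counts stored as text.
installs = pd.Series(['1', '1,000', '50,000', '1,000,000,000'])

# As strings they sort alphabetically, not by magnitude.
print(sorted(installs))

# Step 1: strip the thousands separators. Step 2: parse as numbers.
numeric = pd.to_numeric(installs.str.replace(',', '', regex=False))

print(numeric.max())
```

Passing `regex=False` makes it explicit that the comma is matched as a literal character.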

# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [75]:
# Remove the $ sign and convert the Price column to numeric data.
# regex=False ensures '$' is treated as a literal character, not as a regex anchor.
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', '', regex=False)
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
df_apps_clean[["App", "Price"]].groupby("Price").count()

Unnamed: 0_level_0,App
Price,Unnamed: 1_level_1
0.00,7595
0.99,104
1.00,2
1.20,1
1.29,1
...,...
299.99,1
379.99,1
389.99,1
399.99,11
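A detail worth spelling out for the price conversion: in a regular expression, `$` means "end of string", so whether `str.replace('$', '')` removes the symbol can depend on how the pattern is interpreted. The sketch below, on invented values, passes `regex=False` so the character is always matched literally.

```python
import pandas as pd

# Toy values, invented for illustration.
prices = pd.Series(['0', '$0.99', '$399.99'])

# regex=False treats '$' as a literal character rather than as the
# regular-expression anchor for "end of string".
cleaned = pd.to_numeric(prices.str.replace('$', '', regex=False))

print(cleaned.tolist())
```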


Apparently, there are 15 'I am Rich' apps on the Google Play Store, each costing roughly $300 or more, which aligns with the app's core concept. The legend goes that in 2008, Armin Heinrich launched the first 'I am Rich' app on the iOS App Store for $999.99. The app did nothing but display a gemstone image, serving as a way to flaunt one’s wealth. Armin reportedly made seven sales before Apple quickly removed the app. Despite this, the idea spurred a wave of imitations on the Google Play Store; however, a current search shows that these apps have since disappeared.

In [76]:
# Top 20 most expensive apps in the dataset.
df_apps_clean.sort_values('Price', ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
2394,I am Rich!,FINANCE,3.8,93,22.0,1000,Paid,399.99,Everyone,Finance
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
1331,most expensive app (H),FAMILY,4.3,6,1.5,100,Paid,399.99,Everyone,Entertainment
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


In [102]:
# Remove this flawed data from the dataset to ensure our analysis of the most expensive 'real' apps is accurate.
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False).head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical,79990.0
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical,3999.0
2629,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle,37990.0
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical,33990.0
504,AP Art History Flashcards,FAMILY,5.0,1,96.0,10,Paid,29.99,Mature 17+,Education,299.9
2463,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical,29990.0
2207,EMT PASS,MEDICAL,3.4,51,2.4,1000,Paid,29.99,Everyone,Medical,29990.0
4264,Golfshot Plus: Golf GPS,SPORTS,4.1,3387,25.0,50000,Paid,29.99,Everyone,Sports,1499500.0
4772,Human Anatomy Atlas 2018: Complete 3D Human Body,MEDICAL,4.5,2921,25.0,100000,Paid,24.99,Everyone,Medical,2499000.0
3241,"Muscle Premium - Human Anatomy, Kinesiology, B...",MEDICAL,4.2,168,25.0,10000,Paid,24.99,Everyone,Medical,249900.0


Now we see that the most expensive application is from the Medical category. In the top 10 most expensive apps, 7 belong to the Medical category.

### Highest Grossing Paid Apps (ballpark estimate)
However, the most expensive apps do not generate the most revenue because, as we saw, they have few installs. Next, we find the highest-grossing paid apps.

In [78]:
# Estimate revenue as price times installs. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on a filtered DataFrame.
df_apps_clean = df_apps_clean.assign(Revenue_Estimate=df_apps_clean.Installs.mul(df_apps_clean.Price))
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,2995000.0
6746,Card Wars - Adventure Time,FAMILY,4.3,129603,23.0,1000000,Paid,2.99,Everyone 10+,Card;Action & Adventure,2990000.0
7633,Five Nights at Freddy's,GAME,4.6,100805,50.0,1000000,Paid,2.99,Teen,Action,2990000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,2990000.0


The top spot among the highest-grossing paid apps goes to Minecraft, at close to $70 million. Interestingly, Minecraft (along with Bloons TD 5 and Card Wars) is listed in the Family category rather than in the Game category. If we include these titles, 7 out of the top 10 highest-grossing apps are games. The Google Play Store seems to be quite flexible with its category labels. As of 2025, Minecraft has over 50 million installs and is priced at $8.49, which puts its estimated revenue at over $400 million. That's impressive!
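The estimate itself is simple arithmetic: price multiplied by installs. The sketch below reproduces it on invented data. It uses `assign()`, which is one way (not the only one) to add a column without modifying a possibly shared DataFrame in place. Because installs are reported in buckets and prices change over time, the result is only a ballpark figure.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'App': ['Pricey', 'Cheap Hit', 'Freebie'],
    'Installs': [1_000, 10_000_000, 500_000_000],
    'Price': [79.99, 0.99, 0.0],
})

# assign() returns a new DataFrame with the extra column.
toy = toy.assign(Revenue_Estimate=toy['Installs'] * toy['Price'])

# The most expensive app is not the highest grossing one.
top = toy.sort_values('Revenue_Estimate', ascending=False)
print(top[['App', 'Revenue_Estimate']])
```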

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In this analysis, I’ll explore whether it's better to release an app in a highly competitive category with many other apps or in a popular one with a high number of downloads. I'll also consider targeting a category that is popular but whose downloads are spread across many apps, so that even though it is harder to stand out, an app still has a good chance of being installed. I’ll use bar charts and scatter plots to identify which categories dominate the market and uncover potential opportunities.

In [103]:
print(f'How many unique categories: {df_apps_clean.Category.nunique()}')

How many unique categories: 33


In [104]:
# Let's create a top 10 list based on the number of apps.
top_10category = df_apps_clean.Category.value_counts()[:10]

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [121]:
ver_bar = px.bar(top_10category,
                 x=top_10category.index,
                 y=top_10category.values,
                 title="Top 10 Categories by Number of Apps",
                 color=top_10category.values,
                 color_continuous_scale='viridis')

ver_bar.update_layout(
    xaxis_title="Category name",
    yaxis_title="Number of Apps")

ver_bar.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [114]:
# Let's create a top list based on the number of installs.
category_installs = df_apps_clean.groupby('Category').agg({'Installs':pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

In [122]:
h_bar = px.bar(x=category_installs.Installs,
               y=category_installs.index,
               orientation='h',
               title="Category Popularity",
               color=category_installs.Installs,
               color_continuous_scale="agsunset")

h_bar.update_layout(xaxis_title='Number of installs', yaxis_title='Category')

h_bar.show()

Now we see that Games and Communication are actually the most popular categories by number of installs, followed by Tools.

### Category Concentration - Downloads vs. Competition

In [123]:
# Create a DataFrame that has the number of apps in one column and the number of installs in another.
cat_number = df_apps_clean.groupby('Category').agg({'App':pd.Series.count})
cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how='inner')
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False).head(5)

The dimensions of the DataFrame are: (33, 2)


Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475


In [124]:
# Then create scatter plot.
scatter = px.scatter(cat_merged_df,
                    x='App',
                    y='Installs',
                    title='Category Concentration',
                    size='App',
                    hover_name=cat_merged_df.index,
                    color='Installs')

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))

scatter.show()

We see that categories like Family, Tools, and Games combine a large number of apps with a very high number of installs, so their downloads are shared among many apps. In contrast, categories like Video Players and Entertainment have their downloads concentrated among only a few apps.
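Concentration can also be expressed as a single number per category, for example the average installs per app. The sketch below is an illustration on invented data; the `Installs_per_App` column is a hypothetical helper that does not appear in the original analysis. It also shows named aggregation, which builds the app count and install total in one `groupby` call instead of a merge.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'GAME', 'VIDEO', 'VIDEO'],
    'App': ['g1', 'g2', 'g3', 'v1', 'v2'],
    'Installs': [100, 200, 300, 5_000, 1_000],
})

# Named aggregation builds both columns in one groupby call.
summary = toy.groupby('Category').agg(
    App=('App', 'count'),
    Installs=('Installs', 'sum'),
)

# Hypothetical helper column: average installs per app in each category.
summary['Installs_per_App'] = summary['Installs'] / summary['App']

print(summary)
```

A high value means that relatively few apps account for the category's downloads.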

# Competition in Genres

Let's see how many different genres there are and whether an app can belong to more than one genre.


In [126]:
print(f'Number of genres: {df_apps_clean.Genres.nunique()}')

Number of genres: 114


In [128]:
# Many entries combine several genres separated by ";". Split the strings on the semicolon and then .stack() the result into a single column.
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')

We now have a single column with shape: (8564,)


In [127]:
# Count the apps per individual genre. Each entry now holds a single genre, so far fewer distinct values remain.
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')

Number of genres: 53
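To see why splitting reduces the number of distinct values, consider a small invented example. The sketch uses `str.split()` followed by `explode()`, which is an alternative to the `split(expand=True).stack()` approach used in this notebook and produces the same long, one-genre-per-row format.

```python
import pandas as pd

# Toy data, invented for illustration only.
genres = pd.Series([
    'Arcade;Action & Adventure',
    'Arcade',
    'Action & Adventure',
    'Tools;Arcade',
])

# split() turns each string into a list; explode() gives every list
# element its own row.
exploded = genres.str.split(';').explode()
counts = exploded.value_counts()

print(genres.nunique())  # distinct combined strings
print(len(counts))       # distinct individual genres
print(counts)
```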


In [92]:
# Keep only the 15 genres with the most apps.
num_genres = num_genres.sort_values(ascending=False)[:15]

In [93]:
genres_bar = px.bar(x=num_genres.index,
                    y=num_genres.values,
                    color=num_genres.values,
                    color_continuous_scale='sunsetdark')

genres_bar.update_layout(xaxis_title='Genre',
                         yaxis_title='Number of Apps',
                         title='Top Genres',
                         coloraxis_showscale=False)
genres_bar.show()

The Tools genre has the highest number of apps, which makes it a highly competitive genre. Education and Entertainment also have many apps, showing that developers are focusing on these areas. Genres like Casual and Health & Fitness have fewer apps, which might indicate a less competitive space for new apps.

# Free vs. Paid Apps per Category

In [130]:
# The majority of apps on the Google Play Store are free: 7,595 free versus 589 paid.
df_apps_clean.Type.value_counts()

Unnamed: 0_level_0,count
Type,Unnamed: 1_level_1
Free,7595
Paid,589


In [131]:
# Create a new DataFrame with three columns (Category, Type, and App) to analyze how free and paid apps are distributed across categories.
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.head()

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42


In [132]:
g_bar = px.bar(df_free_vs_paid,
               x='Category',
               y='App',
               title='Free vs Paid Apps by Category',
               color='Type',
               barmode='group')

g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))

g_bar.show()

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.
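The phrase "relatively more paid apps" can be made concrete by computing the share of paid apps per category. The following is a sketch on invented data; it uses `pd.crosstab` with `normalize='index'`, which is one of several ways to get row-wise fractions.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Category': ['MEDICAL', 'MEDICAL', 'MEDICAL', 'MEDICAL', 'SOCIAL', 'SOCIAL'],
    'Type': ['Free', 'Paid', 'Paid', 'Free', 'Free', 'Free'],
})

# crosstab counts apps per (Category, Type); normalize='index' turns each
# row into fractions that sum to 1.
share = pd.crosstab(toy['Category'], toy['Type'], normalize='index')

# Categories ordered by their share of paid apps.
print(share['Paid'].sort_values(ascending=False))
```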