
# Profitable App Analysis for the App Store and Google Play Markets


Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

# Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

A [data set](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play.

A [data set](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store.

In [234]:
import pandas as pd

In [235]:
apple_data = pd.read_csv("/content/AppleStore.csv")
apple_data.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [236]:
apple_data.describe()

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0
mean,863131000.0,199134500.0,1.726218,12892.91,460.373906,3.526956,3.253578,37.361817,3.7071,5.434903,0.993053
std,271236800.0,359206900.0,5.833006,75739.41,3920.455183,1.517948,1.809363,3.737715,1.986005,7.919593,0.083066
min,281656500.0,589824.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
25%,600093700.0,46922750.0,0.0,28.0,1.0,3.5,2.5,37.0,3.0,1.0,1.0
50%,978148200.0,97153020.0,0.0,300.0,23.0,4.0,4.0,37.0,5.0,1.0,1.0
75%,1082310000.0,181924900.0,1.99,2793.0,140.0,4.5,4.5,38.0,5.0,8.0,1.0
max,1188376000.0,4025970000.0,299.99,2974676.0,177050.0,5.0,5.0,47.0,5.0,75.0,1.0


In [237]:
apple_data.shape

(7197, 16)

In [238]:
android_data = pd.read_csv("/content/googleplaystore.csv")
android_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [239]:
android_data.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


As we can see, there are some irregularities in our data. For instance, the maximum value in the Rating column is 19, which is clearly incorrect since ratings should range from 0 to 5. Additionally, the output of the describe() function only shows the Rating column, indicating that it is the only numeric column in the dataset. To make our analysis more meaningful, we will attempt to convert other columns—such as Reviews, Size, Installs, and Price—into numeric formats where appropriate.
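
Size is not converted in the cells shown in this notebook. As a minimal sketch of how that could be done, assuming Size values take forms such as `19M`, `203k`, or `Varies with device`, the hypothetical helper `size_to_mb` below maps each string to megabytes and returns NaN for anything it cannot parse. The 1 MB = 1024 kB factor is an assumption, not something the data set documents.

```python
import pandas as pd

def size_to_mb(value):
    """Hypothetical helper: convert '19M' or '203k' to megabytes, else NaN."""
    text = str(value).strip()
    factors = {"M": 1.0, "k": 1 / 1024}  # assumption: 1 MB = 1024 kB
    unit = text[-1:]
    if unit in factors:
        try:
            return float(text[:-1]) * factors[unit]
        except ValueError:
            return float("nan")
    return float("nan")  # e.g. 'Varies with device'

# Toy values in the same style as the Size column.
sizes = pd.Series(["19M", "8.7M", "203k", "Varies with device"])
size_mb = sizes.apply(size_to_mb)
print(size_mb)
```

Returning NaN instead of raising keeps unparseable rows visible in `describe()` counts, in the same spirit as `errors="coerce"` in `pd.to_numeric`.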

In [240]:
android_data.dtypes

Unnamed: 0,0
App,object
Category,object
Rating,float64
Reviews,object
Size,object
Installs,object
Type,object
Price,object
Content Rating,object
Genres,object


In [241]:
android_data.shape

(10841, 13)

We will convert the Reviews column to numeric values and then verify the changes by using the describe() function again.

In [242]:
android_data["Reviews"] = pd.to_numeric(android_data["Reviews"], errors="coerce")
android_data.describe()


Unnamed: 0,Rating,Reviews
count,9367.0,10840.0
mean,4.193338,444152.9
std,0.537431,2927761.0
min,1.0,0.0
25%,4.0,38.0
50%,4.3,2094.0
75%,4.5,54775.5
max,19.0,78158310.0


If we attempt to convert the Installs column to numeric, a ValueError will occur, indicating that the string "Free" cannot be converted to a number. This shows that the column contains non-numeric values, which must be identified and removed from the dataset before conversion.

In [243]:
android_data["Installs"] = android_data["Installs"].str.replace(",", "", regex=True)

In [244]:
android_data["Installs"] = android_data["Installs"].str.replace(r"\+", "", regex=True).astype(int)

ValueError: invalid literal for int() with base 10: 'Free'

We identify rows in the 'Installs' column that still contain non-numeric values
even after removing commas and plus signs. We use str.isnumeric() to check
which entries are purely numeric. The tilde (~) negates the condition, so
we select rows that are NOT numeric. We then print the app name, installs,
and type columns to inspect the problematic entries.
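
To make the masking step concrete, here is a small sketch on toy values (not rows from the data set) showing how `str.isnumeric()` builds a boolean mask and how `~` inverts it.

```python
import pandas as pd

# Toy values mimicking the Installs column before clean-up.
installs = pd.Series(["10,000+", "500+", "Free", "0"])

# Strip commas and plus signs as literal characters.
stripped = installs.str.replace(",", "", regex=False).str.replace("+", "", regex=False)

is_numeric = stripped.str.isnumeric()  # True where only digits remain
bad_rows = installs[~is_numeric]       # ~ flips the mask to select the rest

print(bad_rows)
```

Only the entry that still contains letters after stripping is selected.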

In [245]:
non_numeric_installs = android_data[~android_data["Installs"].str.replace(",", "").str.replace(r"\+", "", regex=True).str.isnumeric()]
print(non_numeric_installs[["App", "Installs", "Type"]])


                                           App Installs Type
10472  Life Made WI-Fi Touchscreen Photo Frame     Free    0


Next, we inspect the problematic row.

In [246]:
android_data.loc[10472]

Unnamed: 0,10472
App,Life Made WI-Fi Touchscreen Photo Frame
Category,1.9
Rating,19.0
Reviews,
Size,"1,000+"
Installs,Free
Type,0
Price,Everyone
Content Rating,
Genres,"February 11, 2018"


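The values in this row appear to be shifted one column to the left: Category holds 1.9 (which looks like a rating), Rating holds 19.0, Installs holds "Free", and Price holds "Everyone". This explains both the impossible maximum rating of 19 and the failed conversion of the Installs column, so we remove the entire row and reset the dataset's index.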
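
A range check is one way to surface rows like this without waiting for a conversion error. The sketch below uses a toy frame and assumes, as stated earlier, that valid ratings lie between 0 and 5; missing ratings are deliberately not flagged.

```python
import pandas as pd

# Toy frame: one valid rating, one impossible rating, one missing rating.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 19.0, None],
})

# Flag ratings that are present but fall outside the expected 0-5 range.
out_of_range = toy[toy["Rating"].notna() & ~toy["Rating"].between(0, 5)]
print(out_of_range)
```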

In [247]:
android_data.drop(index=10472, inplace=True)

In [248]:
android_data.reset_index(drop=True, inplace=True)

In [249]:
android_data.shape

(10840, 13)

In [250]:
android_data["Installs"] = android_data["Installs"].str.replace(r"\+", "", regex=True).astype(int)

In [251]:
print(android_data["Installs"])

0           10000
1          500000
2         5000000
3        50000000
4          100000
           ...   
10835        5000
10836         100
10837        1000
10838        1000
10839    10000000
Name: Installs, Length: 10840, dtype: int64


In [252]:
android_data.describe()

Unnamed: 0,Rating,Reviews,Installs
count,9366.0,10840.0,10840.0
mean,4.191757,444152.9,15464340.0
std,0.515219,2927761.0,85029360.0
min,1.0,0.0,0.0
25%,4.0,38.0,1000.0
50%,4.3,2094.0,100000.0
75%,4.5,54775.5,5000000.0
max,5.0,78158310.0,1000000000.0


In [253]:
android_data["Price"] = android_data["Price"].str.replace(r"\$", "", regex=True).astype(float)

In [254]:
android_data.duplicated().sum()

np.int64(483)

In [255]:
android_data[android_data.duplicated()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
229,Quick PDF Scanner + OCR FREE,BUSINESS,4.2,80805.0,Varies with device,5000000,Free,0.0,Everyone,Business,"February 26, 2018",Varies with device,4.0.3 and up
236,Box,BUSINESS,4.2,159872.0,Varies with device,10000000,Free,0.0,Everyone,Business,"July 31, 2018",Varies with device,Varies with device
239,Google My Business,BUSINESS,4.4,70991.0,Varies with device,5000000,Free,0.0,Everyone,Business,"July 24, 2018",2.19.0.204537701,4.4 and up
256,ZOOM Cloud Meetings,BUSINESS,4.4,31614.0,37M,10000000,Free,0.0,Everyone,Business,"July 20, 2018",4.1.28165.0716,4.0 and up
261,join.me - Simple Meetings,BUSINESS,4.0,6989.0,Varies with device,1000000,Free,0.0,Everyone,Business,"July 16, 2018",4.3.0.508,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8643,Wunderlist: To-Do List & Tasks,PRODUCTIVITY,4.6,404610.0,Varies with device,10000000,Free,0.0,Everyone,Productivity,"April 6, 2018",Varies with device,Varies with device
8654,"TickTick: To Do List with Reminder, Day Planner",PRODUCTIVITY,4.6,25370.0,Varies with device,1000000,Free,0.0,Everyone,Productivity,"August 6, 2018",Varies with device,Varies with device
8658,ColorNote Notepad Notes,PRODUCTIVITY,4.6,2401017.0,Varies with device,100000000,Free,0.0,Everyone,Productivity,"June 27, 2018",Varies with device,Varies with device
10049,Airway Ex - Intubate. Anesthetize. Train.,MEDICAL,4.3,123.0,86M,10000,Free,0.0,Everyone,Medical,"June 1, 2018",0.6.88,5.0 and up


In [256]:
apple_data.duplicated().sum()

np.int64(0)

We observe that the Android dataset contains duplicate entries, whereas the Apple dataset appears to be clean. It is important to clearly define what we mean by duplicate values. Using the code above, we identify rows that are identical across all columns. However, there may also be apps that share the same name but have different values in other columns, and these are not captured by this definition of duplicates.
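
The difference between the two definitions can be seen on a toy frame (the values are invented for illustration): `duplicated()` with no arguments compares whole rows, while `subset=["App"]` compares names only.

```python
import pandas as pd

# Toy data: one exact repeat (App A) and one same-name row with a different count (App B).
toy = pd.DataFrame({
    "App": ["App A", "App A", "App B", "App B", "App C"],
    "Reviews": [100, 100, 200, 250, 300],
})

exact_dupes = toy.duplicated().sum()               # identical in every column
name_dupes = toy.duplicated(subset=["App"]).sum()  # same name, any other values

print(exact_dupes, name_dupes)
```

The name-based count is the larger of the two because it also catches rows that differ only in Reviews.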

In [257]:
duplicated_values = android_data[android_data.duplicated(subset=["App"], keep=False)]

In [258]:
print(duplicated_values)

                                                  App             Category  \
1                                 Coloring book moana       ART_AND_DESIGN   
23                             Mcqueen Coloring pages       ART_AND_DESIGN   
36     UNICORN - Color By Number & Pixel Art Coloring       ART_AND_DESIGN   
42                         Textgram - write on photos       ART_AND_DESIGN   
139                              Wattpad 📖 Free Books  BOOKS_AND_REFERENCE   
...                                               ...                  ...   
10714                              FarmersOnly Dating               DATING   
10719              Firefox Focus: The privacy browser        COMMUNICATION   
10729                                     FP Notebook              MEDICAL   
10752                  Slickdeals: Coupons & Shopping             SHOPPING   
10767                                            AAFP              MEDICAL   

       Rating    Reviews                Size   Installs  Type  

In [259]:
android_data[android_data['App']=="Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313.0,Varies with device,1000000000,Free,0.0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446.0,Varies with device,1000000000,Free,0.0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313.0,Varies with device,1000000000,Free,0.0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917.0,Varies with device,1000000000,Free,0.0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


In [260]:
duplicated_apps = []
unique_apps = []

for app in android_data["App"]:
  if app in unique_apps:
    duplicated_apps.append(app)
  else:
    unique_apps.append(app)
print(len(duplicated_apps))
print(duplicated_apps[:10])

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We should remove the duplicates, but not at random. As the "Instagram" example shows, the only difference between the rows is the number of reviews. This suggests the data was collected at different times, so we keep only the row with the most reviews for each app, on the assumption that it is the most recent entry.

The first step is to use the groupby() function to group rows by the "App" column. From each group, we select the "Reviews" column, and then we apply the .max() function to retain only the maximum number of "Reviews" for each app. As a result, we obtain the reviews_max Series, which contains one value per app corresponding to its highest number of reviews.
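
As a small illustration of the grouping logic, the sketch below applies it to a toy frame built from the Instagram review counts shown above plus one Box row. `idxmax()` is used alongside `max()` because it returns the row label of the maximum, which lets `.loc` recover the full row.

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Box"],
    "Reviews": [66577313, 66577446, 66509917, 159872],
})

# Highest review count per app (a Series indexed by app name).
reviews_max = toy.groupby("App")["Reviews"].max()

# Row label of that maximum per app, then the matching full rows.
keep_labels = toy.groupby("App")["Reviews"].idxmax()
deduped = toy.loc[keep_labels]

print(deduped)
```

Because `groupby` sorts its keys, the result is ordered by app name rather than by original row position.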

In [261]:
reviews_max = android_data.groupby("App")["Reviews"].max()
reviews_max.head()

Unnamed: 0_level_0,Reviews
App,Unnamed: 1_level_1
"""i DT"" Fútbol. Todos Somos Técnicos.",27.0
+Download 4 Instagram Twitter,40467.0
- Free Comics - Comic Apps,115.0
.R,259.0
/u/app,573.0


In [262]:
android_clean = android_data.loc[android_data.groupby("App")["Reviews"].idxmax()]

In a previous code cell, we found 1,181 cases where an app name repeats one that has already appeared, so the length of our cleaned data set (one row per unique app) should equal the length of the original data set minus 1,181.
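
The same sanity check can be sketched without a Python loop. On the toy frame below (invented values), `duplicated().sum()` counts the rows beyond the first occurrence of each name, which is what the earlier loop counted, and `nunique()` gives the number of distinct names directly.

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["Box", "Box", "Instagram", "Instagram", "Instagram", "Slack"],
    "Reviews": [10, 10, 30, 40, 20, 5],
})

repeat_rows = toy["App"].duplicated().sum()  # rows beyond the first per app
expected_len = len(toy) - repeat_rows
unique_names = toy["App"].nunique()

print(expected_len, unique_names)
```

Both numbers should match the length of the deduplicated frame, as long as the name column has no missing values.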

In [263]:
print("Expected length:", len(android_data) - 1181)
print("Cleaned data length:", len(android_clean))

Expected length: 9659
Cleaned data length: 9659


In [264]:
android_clean[android_clean['App']=="Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2604,Instagram,SOCIAL,4.5,66577446.0,Varies with device,1000000000,Free,0.0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


In [265]:
android_clean.reset_index(drop=True, inplace=True)

In [266]:
android_clean.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"""i DT"" Fútbol. Todos Somos Técnicos.",SPORTS,,27.0,3.6M,500,Free,0.0,Everyone,Sports,"October 7, 2017",0.22,4.1 and up
1,+Download 4 Instagram Twitter,SOCIAL,4.5,40467.0,22M,1000000,Free,0.0,Everyone,Social,"August 2, 2018",5.03,4.1 and up
2,- Free Comics - Comic Apps,COMICS,3.5,115.0,9.1M,10000,Free,0.0,Mature 17+,Comics,"July 13, 2018",5.0.12,5.0 and up
3,.R,TOOLS,4.5,259.0,203k,10000,Free,0.0,Everyone,Tools,"September 16, 2014",1.1.06,1.5 and up
4,/u/app,COMMUNICATION,4.7,573.0,53M,10000,Free,0.0,Mature 17+,Communication,"July 3, 2018",4.2.4,4.1 and up


In [267]:
android_clean.describe()

Unnamed: 0,Rating,Reviews,Installs,Price
count,8196.0,9659.0,9659.0,9659.0
mean,4.173267,216804.1,7798170.0,1.097231
std,0.536253,1831430.0,53769730.0,16.851618
min,1.0,0.0,0.0,0.0
25%,4.0,25.0,1000.0,0.0
50%,4.3,969.0,100000.0,0.0
75%,4.5,29453.5,1000000.0,0.0
max,5.0,78158310.0,1000000000.0,400.0


In [268]:
android_clean.shape

(9659, 13)

We apply the same approach to apple_data, using track_name to identify apps and rating_count_tot in place of Reviews.

In [269]:
duplicated_values = apple_data[apple_data.duplicated(subset=["track_name"], keep=False)]

In [270]:
print(duplicated_values)

              id           track_name  size_bytes currency  price  \
2948  1173990889  Mannequin Challenge   109705216      USD    0.0   
4442   952877179    VR Roller Coaster   169523200      USD    0.0   
4463  1178454060  Mannequin Challenge    59572224      USD    0.0   
4831  1089824278    VR Roller Coaster   240964608      USD    0.0   

      rating_count_tot  rating_count_ver  user_rating  user_rating_ver    ver  \
2948               668                87          3.0              3.0    1.4   
4442               107               102          3.5              3.5  2.0.0   
4463               105                58          4.0              4.5  1.0.1   
4831                67                44          3.5              4.0   0.81   

     cont_rating prime_genre  sup_devices.num  ipadSc_urls.num  lang.num  \
2948          9+       Games               37                4         1   
4442          4+       Games               37                5         1   
4463          4+    

In [271]:
duplicate_count = apple_data["track_name"].duplicated().sum()
print(duplicate_count)
duplicate_apps = apple_data[apple_data["track_name"].duplicated(keep=False)]["track_name"].unique()
print(duplicate_apps)

2
['Mannequin Challenge' 'VR Roller Coaster']


In [272]:
apple_data[apple_data["track_name"]=="Mannequin Challenge"]
apple_data[apple_data["track_name"]=="VR Roller Coaster"]

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
4442,952877179,VR Roller Coaster,169523200,USD,0.0,107,102,3.5,3.5,2.0.0,4+,Games,37,5,1,1
4831,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1


In [273]:
apple_reviews_max = apple_data.groupby("track_name")["rating_count_tot"].max()
apple_clean = apple_data.loc[apple_data.groupby("track_name")["rating_count_tot"].idxmax()]
print(len(apple_data)-2)
print(len(apple_clean))

7195
7195


# Removing Non-English Apps

If you explore the datasets, you'll notice that the names of some apps suggest they are not aimed at an English-speaking audience. We're not interested in keeping these kinds of apps, so we'll remove them.

We write a function that detects non-English characters in app names. Since some English app names contain emojis or a small number of other non-ASCII characters, we allow up to 3 non-ASCII characters before classifying a name as non-English. Although this approach is not perfectly accurate, it is adequate for this dataset and helps minimize unnecessary data loss.

In [274]:
def is_english(string):
  # Count characters outside the ASCII range (code points above 127).
  non_ascii = sum(1 for char in string if ord(char) > 127)
  # Tolerate up to 3 such characters (emojis, symbols, accented letters).
  return non_ascii <= 3
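
To see how the threshold behaves, the sketch below runs a hypothetical variant, `is_english_name`, with a configurable limit on a few names. Two of them appear in the outputs earlier in this notebook; the Japanese string is an invented example.

```python
def is_english_name(name, max_non_ascii=3):
    """Hypothetical variant of is_english with a configurable threshold."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

samples = [
    "Instagram",
    "Wattpad 📖 Free Books",                   # one emoji
    '"i DT" Fútbol. Todos Somos Técnicos.',    # two accented letters
    "日本語のアプリ",                            # invented, fully non-ASCII
]

results = {name: is_english_name(name) for name in samples}
for name, flag in results.items():
    print(flag, name)
```

The Spanish-language name passes the check because it has only two non-ASCII characters, which illustrates the kind of inaccuracy the heuristic accepts.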

In [275]:
android_english_apps = android_clean["App"].apply(is_english)

In [289]:
android_english = android_clean[android_english_apps]

In [290]:
android_english.shape

(9614, 13)

In [291]:
android_english.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"""i DT"" Fútbol. Todos Somos Técnicos.",SPORTS,,27.0,3.6M,500,Free,0.0,Everyone,Sports,"October 7, 2017",0.22,4.1 and up
1,+Download 4 Instagram Twitter,SOCIAL,4.5,40467.0,22M,1000000,Free,0.0,Everyone,Social,"August 2, 2018",5.03,4.1 and up
2,- Free Comics - Comic Apps,COMICS,3.5,115.0,9.1M,10000,Free,0.0,Mature 17+,Comics,"July 13, 2018",5.0.12,5.0 and up
3,.R,TOOLS,4.5,259.0,203k,10000,Free,0.0,Everyone,Tools,"September 16, 2014",1.1.06,1.5 and up
4,/u/app,COMMUNICATION,4.7,573.0,53M,10000,Free,0.0,Mature 17+,Communication,"July 3, 2018",4.2.4,4.1 and up


In [292]:
apple_english_apps = apple_clean["track_name"].apply(is_english)

In [293]:
apple_english = apple_clean[apple_english_apps]

In [294]:
apple_english.shape

(6181, 16)

In [295]:
apple_english.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
6595,883539642,! OH Fantastic Free Kick + Kick Wall Challenge,162557952,USD,0.0,0,0,0.0,0.0,4.0,4+,Games,40,5,2,1
3592,486692623,"""Burn your fat with me!!""",149757952,USD,1.99,302,14,4.5,4.0,5.2.4,17+,Health & Fitness,38,0,3,1
2636,956794130,"""HOOK""",76611584,USD,0.99,959,150,5.0,5.0,1.04,4+,Games,40,5,1,1
3039,1105390093,"""klocki""",97887232,USD,0.99,587,587,4.5,4.5,1.01,4+,Games,37,2,1,1
5499,974022309,( OFFTIME ) light – Track how much you use you...,28471296,USD,2.99,22,14,2.0,2.0,2.1.0,4+,Health & Fitness,37,0,5,1


# Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.
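
The Google Play data also carries a Type column, so one optional cross-check is to confirm that a zero price and a "Free" type agree. The sketch below does this on a toy frame with invented values; whether the two columns agree in the real data is not verified here.

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", "Paid", "Free"],
    "Price": [0.0, 4.99, 0.0],
})

free_by_price = toy["Price"] == 0
free_by_type = toy["Type"] == "Free"

# Rows where the two definitions of "free" disagree.
mismatches = toy[free_by_price != free_by_type]
free_apps = toy[free_by_price]

print(len(free_apps), len(mismatches))
```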

In [298]:
android_final = android_english[android_english["Price"]==0]
print(len(android_final))

8864


In [299]:
apple_final = apple_english[apple_english["price"]==0]
print(len(apple_final))

3220


# Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both the App Store and Google Play, we need to find app profiles that are successful in both markets. For instance, a profile that might work well in both markets is a productivity app that makes use of gamification.

In [300]:
android_final["Category"].value_counts()

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
FAMILY,1676
GAME,862
TOOLS,750
BUSINESS,407
LIFESTYLE,346
PRODUCTIVITY,345
FINANCE,328
MEDICAL,313
SPORTS,301
PERSONALIZATION,294


In [301]:
genre_percentage = android_final["Category"].value_counts(normalize=True)
genre_percentage = (genre_percentage * 100).round(2)
print(genre_percentage)

Category
FAMILY                 18.91
GAME                    9.72
TOOLS                   8.46
BUSINESS                4.59
LIFESTYLE               3.90
PRODUCTIVITY            3.89
FINANCE                 3.70
MEDICAL                 3.53
SPORTS                  3.40
PERSONALIZATION         3.32
COMMUNICATION           3.24
HEALTH_AND_FITNESS      3.08
PHOTOGRAPHY             2.94
NEWS_AND_MAGAZINES      2.80
SOCIAL                  2.66
TRAVEL_AND_LOCAL        2.34
SHOPPING                2.25
BOOKS_AND_REFERENCE     2.14
DATING                  1.86
VIDEO_PLAYERS           1.79
MAPS_AND_NAVIGATION     1.40
FOOD_AND_DRINK          1.24
EDUCATION               1.16
ENTERTAINMENT           0.96
LIBRARIES_AND_DEMO      0.94
AUTO_AND_VEHICLES       0.93
HOUSE_AND_HOME          0.82
WEATHER                 0.80
EVENTS                  0.71
PARENTING               0.65
ART_AND_DESIGN          0.64
COMICS                  0.62
BEAUTY                  0.60
Name: proportion, dtype: float64


We continue by examining the frequency table for the prime_genre column of the App Store data set.

In [302]:
apple_final["prime_genre"].value_counts()

Unnamed: 0_level_0,count
prime_genre,Unnamed: 1_level_1
Games,1872
Entertainment,254
Photo & Video,160
Education,118
Social Networking,106
Shopping,84
Utilities,81
Sports,69
Music,66
Health & Fitness,65


In [303]:
apple_genre = apple_final["prime_genre"].value_counts(normalize=True)
apple_genre = (apple_genre*100).round(2)
print(apple_genre)

prime_genre
Games                58.14
Entertainment         7.89
Photo & Video         4.97
Education             3.66
Social Networking     3.29
Shopping              2.61
Utilities             2.52
Sports                2.14
Music                 2.05
Health & Fitness      2.02
Productivity          1.74
Lifestyle             1.58
News                  1.34
Travel                1.24
Finance               1.12
Weather               0.87
Food & Drink          0.81
Reference             0.56
Business              0.53
Book                  0.43
Medical               0.19
Navigation            0.19
Catalogs              0.12
Name: proportion, dtype: float64
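
Since the same percentage table is built for both stores, the computation could be wrapped in a small function. The sketch below uses a hypothetical helper, `freq_table`, on a toy frame with invented rows; it mirrors the `value_counts(normalize=True)` pattern used above.

```python
import pandas as pd

def freq_table(df, column):
    """Hypothetical helper: percentage of rows per value, highest first."""
    return (df[column].value_counts(normalize=True) * 100).round(2)

# Toy frame standing in for apple_final.
toy = pd.DataFrame({"prime_genre": ["Games", "Games", "Games", "Music"]})
table = freq_table(toy, "prime_genre")
print(table)
```

With the real data, the same call could be made as `freq_table(android_final, "Category")` or `freq_table(apple_final, "prime_genre")`.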
