# Profitable App Profiles for the App Store and Google Play Markets

The purpose of this project is to find opportunities for application development that will be successful both on Google Play and in the App Store.

The company for which we are preparing this information only develops free applications where revenue is tied to in-app ads, so it is important to develop applications that people will continue to enjoy and play. We want to maximize both total number of downloads and time spent in the app in order to maximize profit. To do this, we will look at a sample of applications from both Google Play and the App Store to attempt to find factors we can replicate in order to increase profitability.

# Importing Data

As of the third quarter of 2022, [there were approximately 3.55 million applications available on Google Play and approximately 1.6 million apps available via the App Store](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/). Processing this amount of data is a significant investment of time and money, so we'll analyze a sample of the data.

* [A dataset](https://www.kaggle.com/lava18/google-play-store-apps) with about 10,000 applications available via Google Play. The data was collected in August 2018 and can be downloaded directly with [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* [A dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) with about 7,000 applications available via the Apple App Store. The data was collected in July 2017 and can be downloaded directly with [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


In [1]:
from csv import reader

# The dataset for Apple App Store apps
# (the encoding is set explicitly because app names contain non-ASCII characters)
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_dataset = list(reader(opened_file))

# We separate the first entry as it is a header and not an application
apple_header = apple_dataset[0]
apple_dataset = apple_dataset[1:]

# The dataset for Google Play apps
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google_dataset = list(reader(opened_file))

# Again, we separate the first entry as it is a header and not an application
google_header = google_dataset[0]
google_dataset = google_dataset[1:]

After importing the data, we first write a function to help us work with the data, enabling us to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Before continuing, we print a piece of each dataset in order to make sure that everything is working properly.

In [3]:
explore_data(apple_dataset, 0, 3, True)
print('\n')
explore_data(google_dataset, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1,

It will also be helpful to print the column names so that we can look through these for datapoints which can help us with our analysis.

In [4]:
print(google_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Looking through these columns, we find 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Price', and 'Genres' likely to contain the information most relevant to our purposes.

In [5]:
print(apple_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The columns for the App Store dataset are not as self-explanatory. More information about each column header can be found [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps), but the columns we are likely to find most useful will be 'track_name', 'price', 'rating_count_tot', 'user_rating', and 'prime_genre'.

# Cleaning the Data


## Step 1: Removing Errors

The Google Play dataset has a dedicated discussion section, and a search through that section reveals a discussion that describes an error in the row at index 10472. Let's print a few rows to verify that the error is in this row.

In [6]:
explore_data(google_dataset, 10471, 10474)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




Looking at the second row printed (index 10472), we see that the entry is missing its category: the rating '1.9' sits where the category should be, so every later value is shifted one column to the left and the row has only 12 values instead of 13. This confirms that the error is in the row at index 10472. We'll delete this row and then re-print the surrounding rows to verify that it has been deleted properly. We must be careful to run the delete command only once; otherwise the list shifts, a different, correct application takes the same place (index 10472), and that application is deleted as well.

In [7]:
del google_dataset[10472] #Be sure to only run this line once

explore_data(google_dataset, 10471, 10474)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']
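Rather than relying on the discussion forum, rows like this can also be found by comparing each row's length with the header's length. The following is a minimal sketch on a hypothetical three-column miniature of the data; `sample_header`, `sample_rows`, and `find_malformed_rows` are illustrative names, not part of the original notebook.

```python
# Hypothetical miniature of the Google Play data: a header plus three rows,
# the second of which is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

def find_malformed_rows(rows, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

bad_indexes = find_malformed_rows(sample_rows, sample_header)
print(bad_indexes)
```

On the real data, the same call would be `find_malformed_rows(google_dataset, google_header)`. It only catches rows with a missing or extra value, not rows whose values are wrong but complete.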




## Step 2: Removing Duplicates

If we explore the Google Play dataset enough, we will see that some apps have duplicate entries. For example, Instagram has four entries.

In [8]:
for app in google_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can search through the data to find out whether there are more duplicate entries. Below, we can see that there are a large number of duplicate entries in the Google Play dataset: 1,181 of the 10,840 remaining rows, or 10.89%. We also check the Apple Store dataset and find that there are no duplicates.

In [9]:
# Check the Google Play dataset for duplicates
unique_apps = []
duplicate_apps = []

for app in google_dataset:
    name = app[0]
    if name not in unique_apps:
        unique_apps.append(name)
    else:
        duplicate_apps.append(name)

duplicates = len(duplicate_apps)
percentage = round(len(duplicate_apps) / len(google_dataset) * 100, 2)
        
print("Number of duplicate Google Play apps: " + str(duplicates))
print("Number of non-duplicate Google Play apps: " + str(len(google_dataset) - duplicates))
print("Percent of duplicate Google Play apps: " + str(percentage) +"%")


# Re-initialize these lists and check the Apple Store dataset for duplicates.
unique_apps = []
duplicate_apps = []

for app in apple_dataset:
    app_id = app[0]
    if app_id not in unique_apps:
        unique_apps.append(app_id)
    else:
        duplicate_apps.append(app_id)

duplicates = len(duplicate_apps)
percentage = round(len(duplicate_apps) / len(apple_dataset) * 100, 2)

print("\n")
print("Number of duplicate Apple Store apps: " + str(duplicates))

Number of duplicate Google Play apps: 1181
Number of non-duplicate Google Play apps: 9659
Percent of duplicate Google Play apps: 10.89%


Number of duplicate Apple Store apps: 0
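To see which names are repeated and how often, the same count can be sketched with `collections.Counter` from the standard library. The names in `sample_names` are made up for illustration; on the real data the input would be the first column of `google_dataset`.

```python
from collections import Counter

# Hypothetical list of app names standing in for the 'App' column
sample_names = ['Instagram', 'Slack', 'Instagram', 'Box', 'Slack', 'Instagram']

name_counts = Counter(sample_names)

# Names that occur more than once, with their number of occurrences
repeated = {name: n for name, n in name_counts.items() if n > 1}

# Every occurrence beyond the first counts as a duplicate entry
n_duplicates = sum(n - 1 for n in repeated.values())

print(repeated)
print(n_duplicates)
```

Because `Counter` is a dictionary, each lookup takes constant time, whereas `name not in unique_apps` scans a list on every iteration.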


In order to analyze the data correctly, we need to remove all of the duplicate entries. However, as the "Instagram" example above shows, the duplicate entries are not identical: they differ in the number of reviews. We could remove duplicates at random, but a better criterion is to keep the entry with the highest number of reviews, since that entry was most likely collected last and should give us the most recent information.

In [10]:
reviews_max = {} 

for app in google_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    # If the application is not in the dictionary, add it
    if name not in reviews_max: 
        reviews_max[name] = n_reviews
        
    # If it is in the dictionary, make sure we have the entry with the most reviews    
    elif n_reviews > reviews_max[name]: 
        reviews_max[name] = n_reviews
 
print(len(reviews_max)) # This should match the number of non-duplicate apps

9659


Now we have a dictionary that contains the names of all of the apps we want in our list as well as the corresponding maximum number of reviews. With that information, we can choose the correct instance of that app in our list and compile a new, cleaned list without all of the duplicate entries.

To do that we:
1. Create a list `google_unique` that will hold the cleaned dataset and a list `already_added` to assist us in removing duplicate entries.
2. Loop through every app in the dataset, recording the name of the app and its number of reviews in `name` and `n_reviews` respectively.
3. If the number of reviews matches the maximum stored in `reviews_max` *and* the name is not already in `already_added`, we add the full row of app information to `google_unique` and the name of the app to `already_added`.

Note: We need to check that the name is not in `already_added` as some duplicate entries may have the same number of total reviews as can be seen from the duplicate "Instagram" entries listed above.

In [11]:
# The new list for Google Play with no duplicate apps
google_unique = []
already_added = []

for app in google_dataset:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added: # We need to check both of these conditions
        google_unique.append(app)
        already_added.append(name)
    
print(len(google_unique)) #Verify once again that we have the correct number of entries       
    

9659
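The two passes above can also be collapsed into one by storing the whole row, rather than only the review count, in a dictionary keyed by app name. This is a sketch of that alternative, not the method used in this notebook; `keep_most_reviewed` and `sample_rows` are hypothetical names, and the rows are shortened to four columns.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        # A strict '>' means that ties keep the row that was seen first
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical shortened rows: name, category, rating, reviews
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37'],
]

deduplicated = keep_most_reviewed(sample_rows)
print(len(deduplicated))
```

The result holds the same rows as the two-step approach, but ordered by the first appearance of each name rather than by the position of the row that was kept.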


## Step 3: Isolating English-Language Applications

As this data is being prepared for a company that makes English-language applications, we would like to look only at apps designed for an English-speaking audience. However, we can see that there are apps with names that suggest they are not designed for that audience.

In [12]:
print(apple_dataset[813][1])
print(apple_dataset[6731][1])
print("\n")
print(google_unique[4412][0])
print(google_unique[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


To remove these applications, we'll write a function that looks through each app name for characters not used in English text. This can be done using the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. In ASCII, all of the characters commonly used in English text are numbered in the range from 0 to 127. We can use the built-in `ord()` function to get the number of each character in an app name. If we find a character that falls outside of this range, we can assume that the app is not designed for English speakers.

In [13]:
def english_characters(name):
    for character in name:
        if ord(character) > 127:
            return False
    return True

print(english_characters('Instagram'))
print(english_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_characters('Docs To Go™ Free Office Suite'))
print(english_characters('Instachat 😜'))

True
False
False
False


However, there is a problem in that emojis and characters such as ™ are outside the ASCII range we have set. To avoid removing English-language apps that use these characters, we'll rewrite our function to remove an application only if its name has more than three characters that fall outside of the standard English-language range. This may not be perfect, but it should work fairly well to accomplish our purposes.

In [14]:
def english_characters(name):
    foreign = 0
    for character in name:
        if ord(character) > 127:
            foreign += 1
        if foreign > 3:
            return False
    return True

print(english_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_characters('Docs To Go™ Free Office Suite'))
print(english_characters('Instachat 😜'))

False
True
True
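The threshold is a heuristic, and it helps to see where it fails. The sketch below restates the same rule under a different, hypothetical name (`is_probably_english`) with the threshold exposed as a parameter, which is an assumption added for illustration. A short name written entirely in non-English characters still passes.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

emoji_name = is_probably_english('Instachat 😜')        # one non-ASCII character
short_foreign = is_probably_english('中国語')             # three non-ASCII characters
long_foreign = is_probably_english('中国語 AQリスニング')  # eight non-ASCII characters

print(emoji_name, short_foreign, long_foreign)
```

Names of three or fewer non-ASCII characters are therefore kept regardless of language, which is one reason the filter should be treated as approximate.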


This is better. Now let's go through both datasets, remove any apps with names we identify as non-English, and add the apps to a new list.

In [15]:
# A new dataset for each platform with non-English-language apps removed
google_english = []
apple_english = []

for app in google_unique:
    if english_characters(app[0]):
        google_english.append(app)
        
for app in apple_dataset:
    if english_characters(app[1]):
        apple_english.append(app)
        
print("Google Play English-language apps: " + str(len(google_english)))
print("Apple Store English-language apps: " + str(len(apple_english)))

Google Play English-language apps: 9614
Apple Store English-language apps: 6183


## Step 4: Isolating Free Applications

So far, we have:
* Removed inaccurate data
* Removed duplicate entries
* Removed non-English applications

Since the company for which we are preparing this data only makes applications which are free to download, for the best analysis we need to analyze only free apps. However, our current datasets still contain both free and paid apps. We need to search and remove all the paid apps from our datasets.

In [16]:
# New lists for each platform with paid apps removed
google_free = []
apple_free = []

for app in google_english:
    if app[7][0] == '$': # If the price includes a "$", remove it
        price = float(app[7][1:])
    else:
        price = float(app[7])
    if price == 0.0:
        google_free.append(app)
        
for app in apple_english:
    price = float(app[4])
    if price == 0.0:
        apple_free.append(app)
        
print("Free Google Play apps: " + str(len(google_free)))
print("Free Apple Store apps: " + str(len(apple_free)))

Free Google Play apps: 8864
Free Apple Store apps: 3222
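The price handling above can be factored into a small helper so that both stores share one code path. This is a sketch only; `parse_price` is a hypothetical name, and it assumes prices are written either as a plain number or with a leading `$`, as in the rows printed earlier.

```python
def parse_price(price_text):
    """Convert a price string such as '$4.99', '0' or '0.0' to a float."""
    return float(price_text.strip().lstrip('$'))

paid = parse_price('$4.99')
free_google = parse_price('0')
free_apple = parse_price('0.0')

print(paid, free_google, free_apple)
```

With this helper, the filter for either store reduces to `parse_price(value) == 0.0`.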


# Analyzing the Data

As we have already covered, we want apps that are successful on both platforms and are looking to determine what types of free apps would be best to develop. Since revenue will be ad-based, we want to find apps that will have many users and where users will be satisfied with their experience and want to return to the app over and over.

## Analyzing Applications by Genre / Category

One of the key factors in application development will be the genre or category of app that is developed. We are looking to find a space where there is room for more applications to be developed. For the App Store, this information is `prime_genre`, index 11 of our dataset. For Google Play, we can look both at `Category`, index 1 and `Genres`, index 9.

We start with a function `freq_table` which can group each dataset of applications into a frequency table by percentage for each column of our data. We also include a function `display_table` which can take our data and not only turn it into a frequency table but display it in descending order of frequency.

In [17]:
def freq_table(dataset, index):
    table = {}
    item_list = []
    for item in dataset:
        item_list.append(item[index])
    for item in item_list:
        if item not in table:
            table[item] = 1
        else:
            table[item] += 1
            
    # Convert to a percentage       
    for item in table:
        table[item] = round(table[item] / len(dataset) * 100, 2)
    return table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    
    # Convert the frequency table from a dictionary to a list of tuples
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    # Sort the table in descending order
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
    return

print('\033[1m' + 'Free English-Language Apple Store Apps by Genre' + '\033[0m')
display_table(apple_free, 11)
print("\n")
print('\033[1m' + 'Free English-Language Google Play Apps by Category' + '\033[0m')
display_table(google_free, 1)
print("\n")
print('\033[1m' + 'Free English-Language Google Play Apps by Genre' + '\033[0m')
display_table(google_free, 9)

Free English-Language Apple Store Apps by Genre
Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Free English-Language Google Play Apps by Category
FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES 
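For comparison, the same percentage table can be sketched with `collections.Counter`, whose `most_common()` method already returns entries in descending order of count. `freq_table_percent` and the sample rows are hypothetical and not part of the original notebook.

```python
from collections import Counter

def freq_table_percent(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count
    return {value: round(count / total * 100, 2)
            for value, count in counts.most_common()}

# Hypothetical two-column rows: app name, genre
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

genre_shares = freq_table_percent(sample_rows, 1)
print(genre_shares)
```

Since dictionaries preserve insertion order, iterating over the result prints the most common value first without a separate sorting step.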

## Finding the Best App Store Recommendation

We see that, by far, the most common application genre in the App Store is Games, as over half of the applications in our dataset are categorized as games. This tells us what is most common by number of apps, but it doesn't give us the important information of how popular each type of app is.

We don't have number of downloads available in this dataset, but we can use total ratings which should be a fairly good proxy for downloads. We create a frequency table for each type of genre and then find the average number of ratings for each of these genres.

In [18]:
apple_genres = freq_table(apple_free, 11)

print('\033[1m' + "Average Number of Ratings for App Store Applications by Genre" + '\033[0m')
for genre in apple_genres:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre + ": " + str(round(avg_ratings, 2)))    

Average Number of Ratings for App Store Applications by Genre
Social Networking: 71548.35
Photo & Video: 28441.54
Games: 22788.67
Music: 57326.53
Reference: 74942.11
Health & Fitness: 23298.02
Weather: 52279.89
Utilities: 18684.46
Travel: 28243.8
Shopping: 26919.69
News: 21248.02
Navigation: 86090.33
Lifestyle: 16485.76
Entertainment: 14029.83
Food & Drink: 33333.92
Sports: 23008.9
Book: 39758.5
Finance: 31467.94
Education: 7003.98
Productivity: 21028.41
Business: 7491.12
Catalogs: 4004.0
Medical: 612.0
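The loop above scans the whole dataset once per genre. As a sketch of an alternative, the averages can be accumulated in a single pass with two `defaultdict` objects. `average_by_group` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Average the numeric column `value_index` within each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical two-column rows: genre, number of ratings
sample_rows = [['Games', '100'], ['Games', '300'], ['Weather', '50']]

averages = average_by_group(sample_rows, 0, 1)
print(averages)
```

On the real data, `average_by_group(apple_free, 11, 5)` would be expected to reproduce the figures above, up to rounding.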


The top five genres by average number of ratings are Navigation, Reference, Social Networking, Music, and Weather. Let's look at each of these genres more closely to see whether it is an area of potential. We'll print the number of ratings for the first ten apps listed in each of the selected genres; the dataset appears to be ordered by rating count, so these should be the ten most-rated apps.

In [19]:
genre_checker = {'Navigation', 'Reference', 'Social Networking', 'Music', 'Weather'}

print('\033[1m' + "Number of Ratings for Top 10 App Store Apps by Genre" + '\033[0m')
print('\n')

for genre in genre_checker:
    reviews = 0
    total = 0
    print('\033[1m' + genre + '\033[0m')
    for app in apple_free:
        if app[11] == genre:
            total += 1
            if total <= 10:
                print(app[1] + ": " + app[5])
    print("Total Apps: " + str(total))
    print('\n')

Number of Ratings for Top 10 App Store Apps by Genre


Social Networking
Facebook: 2974676
Pinterest: 1061624
Skype for iPhone: 373519
Messenger: 351466
Tumblr: 334293
WhatsApp Messenger: 287589
Kik: 260965
ooVoo – Free Video Call, Text and Voice: 177501
TextNow - Unlimited Text + Calls: 164963
Viber Messenger – Text & Call: 164249
Total Apps: 106


Weather
The Weather Channel: Forecast, Radar & Alerts: 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking: 208648
WeatherBug - Local Weather, Radar, Maps, Alerts: 188583
MyRadar NOAA Weather Radar Forecast: 150158
AccuWeather - Weather for Life: 144214
Yahoo Weather: 112603
Weather Underground: Custom Forecast & Local Radar: 49192
NOAA Weather Radar - Weather Forecast & HD Radar: 45696
Weather Live Free - Weather Forecast & Alerts: 35702
Storm Radar: 22792
Total Apps: 28


Navigation
Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Tr

While each of these categories has a high average number of ratings, the total number of applications is very low and the average is heavily skewed by only a few apps. Removing only one or two apps from each category brings the average back into line with those of many other categories, suggesting that these genres may not actually be the most popular. Let's examine the Games category, since it is overwhelmingly the most common one.
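As a rough check of this skew, we can compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses only the ten Social Networking rating counts printed above, not all 106 apps in the genre, so the numbers illustrate the effect rather than reproduce the genre averages.

```python
from statistics import mean, median

# Rating counts of the ten Social Networking apps printed above
social_top10 = [2974676, 1061624, 373519, 351466, 334293,
                287589, 260965, 177501, 164963, 164249]

mean_all = mean(social_top10)
median_all = median(social_top10)

# Drop the two largest values (Facebook and Pinterest) and average the rest
mean_without_top_two = mean(sorted(social_top10)[:-2])

print(mean_all)              # 615084.5
print(median_all)            # 310941.0
print(mean_without_top_two)  # 264318.125
```

Dropping just two apps cuts the mean of this sample by more than half, which is the pattern described above.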

In [20]:
print('\033[1m' + 'Number of Ratings for App Store Apps with Genre "Games"' + '\033[0m')
print('\n')

genre = 'Games'
reviews = 0
total = 0
print('\033[1m' + genre + '\033[0m')
for app in apple_free:
    if app[11] == genre:
        total += 1
        if total < 20:
            print(app[1] + ": " + app[5])
print('\n')
print('\033[1m' + "Total Apps: " + '\033[0m' + str(total))


Number of Ratings for App Store Apps with Genre "Games"


Games
Clash of Clans: 2130805
Temple Run: 1724546
Candy Crush Saga: 961794
Angry Birds: 824451
Subway Surfers: 706110
Solitaire: 679055
CSR Racing: 677247
Crossy Road - Endless Arcade Hopper: 669079
Injustice: Gods Among Us: 612532
Hay Day: 567344
PAC-MAN: 508808
DragonVale: 503230
Head Soccer: 481564
Despicable Me: Minion Rush: 464312
The Sims™ FreePlay: 446880
Sonic Dash: 418033
8 Ball Pool™: 416736
Tiny Tower - Free City Building: 414803
Jetpack Joyride: 405647


Total Apps: 1874


While there are a couple of applications with a very large number of ratings, removing them doesn't bring the average number of ratings down tremendously. Let's also look at the most popular genres by average rating.

In [21]:
print('\033[1m' + "Average Ratings for App Store Applications by Genre" + '\033[0m')


for genre in apple_genres:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[7])
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre + ": " + str(round(avg_ratings, 2)))    

Average Ratings for App Store Applications by Genre
Social Networking: 3.59
Photo & Video: 3.9
Games: 4.04
Music: 3.95
Reference: 3.67
Health & Fitness: 3.77
Weather: 3.48
Utilities: 3.53
Travel: 3.49
Shopping: 3.97
News: 3.24
Navigation: 3.83
Lifestyle: 3.41
Entertainment: 3.54
Food & Drink: 3.63
Sports: 3.07
Book: 3.07
Finance: 3.38
Education: 3.64
Productivity: 4.0
Business: 3.97
Catalogs: 4.12
Medical: 3.0


We can see that the Games genre has the second-highest average rating, behind only Catalogs, a category with only four applications (0.12% of 3,222) and a very low average number of ratings. This fact, combined with the sheer number of gaming apps we saw earlier, suggests that applications in the gaming category would do very well. In addition, the large number of gaming apps means that multiple apps could likely be released without taking measurable market share from one another.

## Finding the Best Google Play Recommendation

Our Google Play dataset has information on downloads, so we can use this information instead of the number of ratings. Displaying a frequency table of the download column shows that the dataset does not give an exact figure for each app; instead, downloads are grouped into ranges.

In [22]:
print('\033[1m' + "Google Play Apps by Number of Downloads" + '\033[0m')

display_table(google_free, 5)

Google Play Apps by Number of Downloads
1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


Since we can't know the precise number of downloads for each application, we'll treat each listed value as the minimum of its range. For example, if an application is listed as having 100,000+ downloads, we will count it as having 100,000 downloads. If an app is listed as having 50,000,000+ downloads, we will count it as having 50,000,000 downloads.

In [23]:
google_categories = freq_table(google_free, 1)

google_category = []

print('\033[1m' + "Average Number of Downloads for Google Play Apps by Genre" + '\033[0m')

for category in google_categories:
    total = 0
    len_category = 0
    for app in google_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            
            # Remove ',' and '+' from download string
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
            
    # Find the average number of installs and round it to the nearest number
    avg_installs = total / len_category
    avg_installs = round(avg_installs)
    
    # Add back comma separators for readability
    print(category + ": ", end ="")
    print(f"{avg_installs:,}")

Average Number of Downloads for Google Play Apps by Genre
ART_AND_DESIGN: 1,986,335
AUTO_AND_VEHICLES: 647,318
BEAUTY: 513,152
BOOKS_AND_REFERENCE: 8,767,812
BUSINESS: 1,712,290
COMICS: 817,657
COMMUNICATION: 38,456,119
DATING: 854,029
EDUCATION: 1,833,495
ENTERTAINMENT: 11,640,706
EVENTS: 253,542
FINANCE: 1,387,692
FOOD_AND_DRINK: 1,924,898
HEALTH_AND_FITNESS: 4,188,822
HOUSE_AND_HOME: 1,331,541
LIBRARIES_AND_DEMO: 638,504
LIFESTYLE: 1,437,816
GAME: 15,588,016
FAMILY: 3,695,642
MEDICAL: 120,551
SOCIAL: 23,253,652
SHOPPING: 7,036,877
PHOTOGRAPHY: 17,840,110
SPORTS: 3,638,640
TRAVEL_AND_LOCAL: 13,984,078
TOOLS: 10,801,391
PERSONALIZATION: 5,201,483
PRODUCTIVITY: 16,787,331
PARENTING: 542,604
WEATHER: 5,074,486
VIDEO_PLAYERS: 24,727,872
NEWS_AND_MAGAZINES: 9,549,178
MAPS_AND_NAVIGATION: 4,056,942
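The string cleaning used in the cell above can be isolated in a helper, which makes the "lower bound of the range" convention easy to test on its own. `installs_to_int` is a hypothetical name introduced for this sketch.

```python
def installs_to_int(installs_text):
    """Convert an install range such as '100,000+' to its lower bound as an int."""
    return int(installs_text.replace(',', '').replace('+', ''))

hundred_thousand = installs_to_int('100,000+')
fifty_million = installs_to_int('50,000,000+')
zero = installs_to_int('0')

print(hundred_thousand, fifty_million, zero)
```

Because every app is counted at the bottom of its range, averages computed this way understate the true number of downloads.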


While Family is the most popular category in the Google Play store by number of apps, it is one of the less popular categories in terms of downloads per application. Games, the second most popular category in number of apps, is the sixth most popular category in downloads per application. While there are five other categories that have more downloads, the fact that it is very popular in the Google Play store, coupled with the Games category's incredible popularity in the Apple Store suggests that we should look at this category in more detail and see if it would be a good recommendation for both platforms.

In [24]:
import math

print('\033[1m' + "Average Rating and Total Ratings for Google Play Apps by Genre" + '\033[0m')

for category in google_categories:
    total = 0
    len_category = 0
    for app in google_free:
        category_app = app[1]
        # Remove NaN entries to properly calculate the average rating
        if category_app == category and not math.isnan(float(app[2])):
            total += float(app[2])
            len_category += 1
    avg_ratings = total / len_category
    print(category + ": " + str(round(avg_ratings, 2)) + ", " + str(len_category))    

Average Rating and Total Ratings for Google Play Apps by Genre
ART_AND_DESIGN: 4.34, 55
AUTO_AND_VEHICLES: 4.18, 72
BEAUTY: 4.28, 42
BOOKS_AND_REFERENCE: 4.35, 159
BUSINESS: 4.1, 253
COMICS: 4.18, 53
COMMUNICATION: 4.13, 234
DATING: 3.98, 131
EDUCATION: 4.34, 102
ENTERTAINMENT: 4.12, 85
EVENTS: 4.44, 45
FINANCE: 4.13, 289
FOOD_AND_DRINK: 4.17, 92
HEALTH_AND_FITNESS: 4.24, 233
HOUSE_AND_HOME: 4.14, 61
LIBRARIES_AND_DEMO: 4.18, 64
LIFESTYLE: 4.08, 279
GAME: 4.23, 821
FAMILY: 4.17, 1484
MEDICAL: 4.15, 228
SOCIAL: 4.25, 201
SHOPPING: 4.23, 178
PHOTOGRAPHY: 4.16, 248
SPORTS: 4.21, 238
TRAVEL_AND_LOCAL: 4.07, 179
TOOLS: 4.03, 657
PERSONALIZATION: 4.3, 233
PRODUCTIVITY: 4.18, 282
PARENTING: 4.34, 48
WEATHER: 4.23, 65
VIDEO_PLAYERS: 4.04, 145
NEWS_AND_MAGAZINES: 4.1, 198
MAPS_AND_NAVIGATION: 4.04, 112
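The `math.isnan` check matters because a single NaN turns the whole sum, and therefore the average, into NaN. A minimal sketch, assuming missing ratings are stored as the string 'NaN' (as the check in the cell above implies); the values in `sample_ratings` are made up.

```python
import math

# Hypothetical 'Rating' column with two missing values
sample_ratings = ['4.1', 'NaN', '3.9', 'NaN', '4.6']

# Without filtering, NaN propagates through the sum
naive_average = sum(float(r) for r in sample_ratings) / len(sample_ratings)

# With filtering, only the rated apps contribute to the average
rated = [float(r) for r in sample_ratings if not math.isnan(float(r))]
clean_average = sum(rated) / len(rated)

print(naive_average)
print(round(clean_average, 2), len(rated))
```

The second number printed is the count of apps that have a rating, not the number of ratings those apps received.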


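Unfortunately, unlike with the App Store, looking at ratings doesn't tell us very much for the Google Play store, because the average rating varies little across categories (from 3.98 for Dating to 4.44 for Events). Note that the second figure printed for each category is the number of apps that have a rating, not the number of ratings those apps received, so it cannot serve as a measure of popularity.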

# Conclusions

While the datasets for App Store and Google Play apps provide us with different information, applications categorized as games look to be the best to develop in order to have success with free, ad-based revenue applications on both Android and iOS.

1. Games are by far the most common genre in the App Store dataset. Even with so many games to choose from, the genre still has a mid-range average number of ratings per application, which is our proxy for downloads.
2. Games are among the best-liked apps on the App Store, with the second-highest average rating of any genre.
3. While the distribution of apps is very different in the Play Store dataset, games are still the second-most common app category.
4. In spite of the fact that there are so many games available in the Play Store, the category is still the sixth-most popular in terms of average downloads.

All of this data suggests two things: first, people like to play games and are actively looking to download these apps; second, people enjoy these apps and are likely to play them for longer periods of time than apps in other categories. While my recommendation is to develop game applications for release on both Android and iOS, I recommend first doing more research on this specific genre in order to discover what types of games might be most successful.