## Profitable App Profiles for the App Store and Google Play Markets

This project aims to help developers understand what types of apps are likely to attract more users. We work as data analysts for a company that builds iOS and Android apps, and our main goal is to help the company better understand what users need.



In [1]:
from csv import reader

appstore_data = list(reader(open("AppleStore.csv", encoding='utf8')))
playstore_data = list(reader(open("googleplaystore.csv", encoding = 'utf8')))
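
The cell above leaves both file handles open until the interpreter cleans them up. A minimal sketch of a loader that closes the file explicitly is shown here; `load_csv` is a hypothetical helper (not part of the original notebook), and the demo writes a small made-up CSV to a temporary directory instead of reading the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Read a CSV file into a list of rows, closing the file afterwards."""
    # newline='' is what the csv module recommends for files passed to csv.reader.
    with open(path, encoding=encoding, newline='') as f:
        return list(csv.reader(f))


# Demo on a tiny made-up file; the quoted field contains a comma on purpose.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        f.write('App,Category\n"Notes, Lite",TOOLS\n')
    rows = load_csv(demo_path)

print(rows)
```

With the real files, the calls would presumably look like `load_csv("AppleStore.csv")` and `load_csv("googleplaystore.csv")`.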

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(appstore_data[1:], 0, 3, rows_and_columns=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
appstore_cols = appstore_data[0]
print(appstore_cols)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [5]:
explore_data(playstore_data[1:], 0, 3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [6]:
playstore_cols = playstore_data[0]
print(playstore_cols)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Row 10473 of the Google Play data (counting the header row) has only 12 fields instead of 13: the 'Category' value is missing, so every later column is shifted one position to the left. We print the row to confirm this and then delete it.


In [7]:
playstore_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [8]:
del playstore_data[10473]
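
The deleted row has 12 fields where the header has 13. Rather than hard-coding the index, a check along the following lines could locate such rows. This is only a sketch on a made-up sample; `find_bad_rows` is a hypothetical helper that does not appear in the original notebook.

```python
def find_bad_rows(dataset):
    """Return (index, row_length) for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample: a header plus two rows, the second one missing a field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

bad_rows = find_bad_rows(sample)
print(bad_rows)
```

Deleting several rows by index should be done from the highest index down, so that earlier deletions do not shift the later ones.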

Looking at the Google Play dataset again, we can see that it contains duplicate entries:

In [9]:
for app in playstore_data:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
duplicates = []
uniques = []

for app in playstore_data[1:]:
    name = app[0]
    if name in uniques:
        duplicates.append(name)
    else:
        uniques.append(name)

print(len(duplicates))

1181
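
The loop above tests membership in a list, which rescans `uniques` for every row. As an alternative sketch (not the notebook's method), `collections.Counter` can tally the names in a single pass. The sample rows below are made up for illustration.

```python
from collections import Counter

# Made-up rows: only the app name at index 0 matters here.
sample_rows = [
    ['Instagram', 'SOCIAL'],
    ['Instagram', 'SOCIAL'],
    ['Example App', 'TOOLS'],
    ['Instagram', 'SOCIAL'],
]

name_counts = Counter(row[0] for row in sample_rows)

# Every occurrence beyond the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
print(n_duplicates)
```

This counts duplicates the same way as the `duplicates` list above: an app that appears three times contributes two duplicates.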


To remove the duplicates, we keep only one row per app: the one with the highest number of reviews. The review count is the only column that differs among the duplicate rows, and the highest count most likely reflects the most recent data for that app.

In [11]:
reviews_max = {}

for app in playstore_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [12]:
len(reviews_max)

9659

Having determined the maximum review count for each app, we use it to build a new dataset that keeps exactly one row per app. For each row, we check whether its review count equals the maximum recorded for that app and whether the app has not been added yet; if both conditions hold, we append the row to a new list.

In [13]:
android_clean = []
already_added = []

for app in playstore_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
    

In [14]:
len(android_clean)

9659
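
The two passes above (first find the maximum review count per app, then keep the matching row) can also be folded into one pass that stores the best row seen so far for each name. The sketch below assumes the same column positions as the Google Play data (name at index 0, reviews at index 3); `keep_most_reviewed` is a hypothetical helper, and the 'Example App' row is made up.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = keep_most_reviewed(sample_rows)
print(deduped)
```

One difference from the notebook's approach: the result is ordered by the first appearance of each name, not by the position of the retained row.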

In [15]:
def inspect_chars(a_string):
    n_non_eng_chrs = 0
    for char in a_string:
        if ord(char) > 127:
            n_non_eng_chrs += 1
    if n_non_eng_chrs > 3:
        return False
    else:
        return True

print(inspect_chars('Instagram'))
print(inspect_chars('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(inspect_chars('Docs To Go™ Free Office Suite'))
print(inspect_chars('Instachat 😜'))

True
False
True
True
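
`inspect_chars` treats a name as English when it contains at most three characters outside the ASCII range (code points above 127), which tolerates the occasional emoji or symbol such as ™. The same rule can be written more compactly, as in this sketch; `is_probably_english` and `max_non_ascii` are illustrative names that do not appear in the notebook.

```python
def is_probably_english(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters with code point > 127."""
    n_non_ascii = sum(1 for char in text if ord(char) > 127)
    return n_non_ascii <= max_non_ascii


results = [
    is_probably_english('Instagram'),
    is_probably_english('Docs To Go™ Free Office Suite'),
    is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'),
]
print(results)
```

The heuristic is deliberately loose: a non-English name written in Latin letters still passes, and an English name with four or more emoji is rejected.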


In [16]:
appstore_data_eng = []

for app in appstore_data:
    name = app[1]
    if inspect_chars(name):
        appstore_data_eng.append(app)

print(appstore_data_eng[1:5])
print(len(appstore_data_eng[1:]))

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']]
6183


In [17]:
playstore_data_eng = []

for app in android_clean:
    name = app[0]
    if inspect_chars(name):
        playstore_data_eng.append(app)

print(playstore_data_eng[:5])
print(len(playstore_data_eng))

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
9614


In [18]:
free_apple_apps = []

for app in appstore_data_eng[1:]:
    price = float(app[4])
    if price == 0:
        free_apple_apps.append(app)

len(free_apple_apps)

3222

In [19]:
free_google_apps = []

for app in playstore_data_eng:
    price = app[7]
    if price == '0':
        free_google_apps.append(app)
        
len(free_google_apps)

8864

To minimize risks and overhead, our validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [20]:
def freq_table(dataset, index):
    freqs = {}
    total = 0
    for row in dataset:
        val = row[index]
        if val in freqs:
            freqs[val] += 1
        else:
            freqs[val] = 1
        total += 1

    for cat in freqs:
        freqs[cat] = round(100 * freqs[cat] / total, 2) 
    return freqs

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
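
For comparison, the two helpers above can be approximated with `collections.Counter`, whose `most_common()` method already returns values sorted by frequency. This is a sketch rather than a drop-in replacement: `freq_table_pct` is a hypothetical name, the sample rows are made up, and ties are ordered by first appearance instead of by the reverse-sorted tuples used in `display_table`.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(100 * n / total, 2)) for value, n in counts.most_common()]


# Made-up rows with the genre at index 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_pct(sample, 1)
for value, pct in table:
    print(value, ':', pct)
```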

Looking at the frequency table below, we can see that the most common genre among the free English apps in the App Store is Games (58.16%), followed by Entertainment (7.88%).

The App Store is heavily skewed toward games compared to the other genres. This suggests that most free apps there are designed for entertainment rather than for practical purposes.

Despite this dominance, we can't yet say whether the large number of games also translates into a large number of users. Additional tables, such as the number of ratings per genre, are worth exploring before settling on an app profile.

If genre frequency were the only criterion, the recommended profile would be a gaming app, or any app that lets users play for fun or for competition.

In [22]:
display_table(free_apple_apps, 11)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Looking at the cleaned Google Play dataset, we can see that the most common categories are FAMILY (18.91%) and GAME (9.72%). This is broadly consistent with the dominance of entertainment apps in the App Store, except that the share of non-entertainment apps is much larger on Google Play.

Similar to the recommendation for the App Store dataset, the recommended profile here would be a gaming/family app (an app that lets users play and have fun with the family).

In [23]:
display_table(free_google_apps, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


The Genres column, on the other hand, tells a different story: practical genres such as Tools lead the list on Google Play. This is most likely because entertainment and other miscellaneous categories are broken down into much more specific sub-genres.

Based on the Genres column, the recommended profile would be a mix of productivity (tools or apps that help with tasks) and entertainment.

In [24]:
display_table(free_google_apps, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

In both cases, it would be better to use other metrics, such as the number of reviews or ratings per genre or category, to reveal which kinds of apps are likely to have the most users.

In [26]:
unique_genres = freq_table(free_apple_apps, 11)
unique_genres

{'Social Networking': 3.29,
 'Photo & Video': 4.97,
 'Games': 58.16,
 'Music': 2.05,
 'Reference': 0.56,
 'Health & Fitness': 2.02,
 'Weather': 0.87,
 'Utilities': 2.51,
 'Travel': 1.24,
 'Shopping': 2.61,
 'News': 1.33,
 'Navigation': 0.19,
 'Lifestyle': 1.58,
 'Entertainment': 7.88,
 'Food & Drink': 0.81,
 'Sports': 2.14,
 'Book': 0.43,
 'Finance': 1.12,
 'Education': 3.66,
 'Productivity': 1.74,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Medical': 0.19}

In [32]:
for genre in unique_genres:
    total = 0
    len_genre = 0
    for app in free_apple_apps:
        genre_app = app[11]
        if genre_app == genre:
            # Index 5 is rating_count_tot: the number of ratings, not the rating score.
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total/len_genre
    print(genre + ': ' + str(avg_n_ratings))

Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0


The figures above are the average number of user ratings per app in each genre (the `rating_count_tot` column), not average rating scores, and they give a clearer picture of user engagement with App Store apps. Social Networking apps average about 71,500 ratings, Music apps about 57,300, and Photo & Video apps about 28,400. Navigation and Reference rank even higher (about 86,100 and 74,900), but those genres account for only 0.19% and 0.56% of the free apps, so their averages rest on very few apps.

Given the results above, the genres with the highest potential for profitability would be social media (apps that connect people), photo/video (apps for taking or enhancing photos and videos), games, or a combination of these three.
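
The nested loop in the cell above rescans the whole dataset once per genre. As a sketch of an alternative, the per-genre averages can be accumulated in a single pass and returned already ranked. `mean_by_group` is a hypothetical helper and the sample rows are made up; for the App Store data the call would presumably be `mean_by_group(free_apple_apps, 11, 5)`.

```python
from collections import defaultdict


def mean_by_group(rows, group_idx, value_idx):
    """Return (group, mean value) pairs sorted from highest to lowest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    means = {group: totals[group] / counts[group] for group in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Made-up rows: genre at index 0, number of ratings at index 1.
sample = [
    ['Social Networking', '300'],
    ['Social Networking', '100'],
    ['Games', '50'],
]

ranking = mean_by_group(sample, 0, 1)
for genre, mean_value in ranking:
    print(genre + ': ' + str(mean_value))
```

Sorting the output makes the leading genres easier to spot than in a listing ordered by dictionary keys.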

In [31]:
unique_categories = freq_table(free_google_apps, 1)
unique_categories

{'ART_AND_DESIGN': 0.64,
 'AUTO_AND_VEHICLES': 0.93,
 'BEAUTY': 0.6,
 'BOOKS_AND_REFERENCE': 2.14,
 'BUSINESS': 4.59,
 'COMICS': 0.62,
 'COMMUNICATION': 3.24,
 'DATING': 1.86,
 'EDUCATION': 1.16,
 'ENTERTAINMENT': 0.96,
 'EVENTS': 0.71,
 'FINANCE': 3.7,
 'FOOD_AND_DRINK': 1.24,
 'HEALTH_AND_FITNESS': 3.08,
 'HOUSE_AND_HOME': 0.82,
 'LIBRARIES_AND_DEMO': 0.94,
 'LIFESTYLE': 3.9,
 'GAME': 9.72,
 'FAMILY': 18.91,
 'MEDICAL': 3.53,
 'SOCIAL': 2.66,
 'SHOPPING': 2.25,
 'PHOTOGRAPHY': 2.94,
 'SPORTS': 3.4,
 'TRAVEL_AND_LOCAL': 2.34,
 'TOOLS': 8.46,
 'PERSONALIZATION': 3.32,
 'PRODUCTIVITY': 3.89,
 'PARENTING': 0.65,
 'WEATHER': 0.8,
 'VIDEO_PLAYERS': 1.79,
 'NEWS_AND_MAGAZINES': 2.8,
 'MAPS_AND_NAVIGATION': 1.4}

In [36]:
for category in unique_categories:
    total = 0
    len_category = 0
    for app in free_google_apps:
        category_app = app[1]
        if category_app == category:
            string_installs = app[5]
            string_installs = string_installs.replace('+','')
            string_installs = string_installs.replace(',','')
            user_installs = float(string_installs)
            total += user_installs
            len_category += 1
    avg_installs = total/len_category
    print(category + ': ' + str(round(avg_installs, 2)))

ART_AND_DESIGN: 1986335.09
AUTO_AND_VEHICLES: 647317.82
BEAUTY: 513151.89
BOOKS_AND_REFERENCE: 8767811.89
BUSINESS: 1712290.15
COMICS: 817657.27
COMMUNICATION: 38456119.17
DATING: 854028.83
EDUCATION: 1833495.15
ENTERTAINMENT: 11640705.88
EVENTS: 253542.22
FINANCE: 1387692.48
FOOD_AND_DRINK: 1924897.74
HEALTH_AND_FITNESS: 4188821.99
HOUSE_AND_HOME: 1331540.56
LIBRARIES_AND_DEMO: 638503.73
LIFESTYLE: 1437816.27
GAME: 15588015.6
FAMILY: 3695641.82
MEDICAL: 120550.62
SOCIAL: 23253652.13
SHOPPING: 7036877.31
PHOTOGRAPHY: 17840110.4
SPORTS: 3638640.14
TRAVEL_AND_LOCAL: 13984077.71
TOOLS: 10801391.3
PERSONALIZATION: 5201482.61
PRODUCTIVITY: 16787331.34
PARENTING: 542603.62
WEATHER: 5074486.2
VIDEO_PLAYERS: 24727872.45
NEWS_AND_MAGAZINES: 9549178.47
MAPS_AND_NAVIGATION: 4056941.77


The average install counts above give a clearer picture of how widely Google Play apps are installed. Communication apps (calling, texting, messaging) have the highest average, about 38 million installs per app, followed by video players (about 25 million) and social apps (about 23 million).

Given these results, the categories with the highest potential for profitability would be communication apps (apps that help users communicate with one another), video players (apps for playing media), and social apps (apps that help users connect with one another).
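
The install figures behind these averages come from strings such as '1,000,000+', which the loop cleans with two `replace` calls before converting. A small helper makes that step reusable and easy to test. `parse_installs` is a hypothetical name, and the sketch assumes every value follows the digits, commas and trailing '+' pattern seen in the sample rows earlier.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its integer lower bound."""
    return int(text.replace('+', '').replace(',', ''))


values = [parse_installs(s) for s in ['10,000+', '500,000+', '1,000,000,000+']]
print(values)
```

Because each string is a bucket with a trailing '+', the parsed number is best read as a lower bound, so the per-category averages are approximations rather than exact install counts.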