# App Profile Recommendation

This analysis is done on behalf of a developer that makes free apps for Android and iOS. The developer gets the majority of their revenue from in-app ads. The result is that the revenue is primarily influenced by how many users use their apps; the more users who engage with the app, the better. The developer is primarily concerned with the English-speaking market.

The goal of the analysis is to use data from a sample of iOS and Android apps to determine what type of apps are most likely to be downloaded.

Below is an exploration of the dataset. A function is created to easily extract various information from the data sets.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    dataset: list of lists
    start, end: integers; start and end indices of the slice of the data set.
    rows_and_columns: boolean

    return: None

    Prints the rows of the dataset within the specified slice.
    Each row is followed by blank lines for readability.
    Also prints the number of rows and columns if rows_and_columns is True.
    '''
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')

    # Print the number of rows and columns if requested.
    # The dataset should not include a header row; if it does, the row count is the number of data rows + 1.
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        

## Relevant Columns

The output below shows the column names from the header row of each dataset. The aim is to identify the columns that are relevant to this analysis.

The documentation for the data sets can be found here:
- [Android dataset](https://www.kaggle.com/lava18/google-play-store-apps)
- [Apple dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [2]:
from csv import reader

def load_csv(path):
    #Read the whole file (header row included) into a list of rows; the file is closed on exit.
    with open(path, encoding="utf8") as opened_file:
        return list(reader(opened_file))

apple_data = load_csv('apple_store.csv')
android_data = load_csv('google_play_store.csv')

In [3]:
#explore_data prints the rows itself and returns None, so it is called without print().
print('Apple Data \n')
explore_data(apple_data, 0, 1)
print('Android Data \n')
explore_data(android_data, 0, 1)

Apple Data 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Android Data 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




## Looking for errors in the data

The exploration of the data below reveals two problems: one row is missing its 'Category' value, which shifts the remaining columns one position to the left, and there are numerous duplicate entries. A closer look at the duplicate entries shows that they describe the same app, collected at different times. The entry with the greatest number of reviews is assumed to be the most recent, so that entry is kept and the other duplicates are removed.

In [4]:
#remove the row with the missing category value (its columns are shifted).
print(android_data[10473])
del android_data[10473]
print(android_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [5]:
#Look for duplicate apps in android_data
duplicate_apps = []
unique_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [6]:
#Exploring duplicate entries
for app in android_data:
    name = app[0]
    if name == 'Slack':
        print (app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


In [7]:
#The entry with the highest number of reviews is kept for each app; the duplicates are dropped. There should be a total of 9659 unique apps.
reviews_max = {}

for app in android_data[1:]:
    name = app[0]
    reviews = float(app[3])

    #Store the count only for a new app or a new maximum, so a later, lower count cannot overwrite the maximum.
    if name not in reviews_max or reviews_max[name] < reviews:
        reviews_max[name] = reviews

print(len(reviews_max)) #checking for correct length
print(reviews_max['Slack']) #comparing against the Slack review counts shown above.

9659
51510.0


In [8]:
android_clean = []
already_added = []

for app in android_data[1:]:
     
    name = app[0]
    reviews = float(app[3])
    
    if (reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659
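
The two-step clean-up above can also be sketched as a single pass that stores the best row per app directly. This is only an illustrative alternative, not the code used in this notebook: `dedupe_by_max_reviews` is a hypothetical helper, the sample rows are trimmed to four columns, and the `Box` review count is made up for the example.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows (hypothetical, trimmed) modelled on the Slack entries above.
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # review count invented for illustration
]
deduped = dedupe_by_max_reviews(sample)
print(len(deduped))
print(deduped[0])
```

Note that this variant orders the result by the first appearance of each app name, which may differ from the row order produced by the cell above.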


## Non-English Apps
The company only creates apps for the English-speaking market, so we want to remove all non-English apps from the dataset. This makes any analysis done on the dataset more relevant for our company.

In [9]:
def is_english_test(string):
    '''
    string: any string
    return: boolean
    
    Checks the code point of each character in the string.
    Returns False if the string contains any non-ASCII character.
    '''
    
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english_test('Instagram'))
print(is_english_test('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_test('Docs To Go™ Free Office Suite'))
print(is_english_test('Instachat 😜'))

True
False
False
False


As seen above, English apps are mistakenly being detected as non-English. This is because emojis and certain characters, like the trademark character, have a Unicode code point above 127, which is outside the ASCII range.
To refine the filter, we will only remove apps whose names contain more than 3 characters with a code point above 127.

In [10]:
def is_english(string):
    '''
    string: any string
    return: boolean
    
    Counts the characters in the string with a code point above 127.
    Returns False if the string contains more than 3 such characters.
    '''
    count = 0
    
    for character in string:
        
        if ord(character) > 127:
            count += 1
            
    if count > 3:
        return False
    else:
        return True
    
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
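
To make the effect of the threshold easier to see, the sketch below counts the characters above code point 127 in each of the sample names. `non_ascii_count` is a hypothetical helper added only for illustration; it is not part of the filter itself.

```python
def non_ascii_count(string):
    """Number of characters whose Unicode code point is above 127."""
    return sum(1 for character in string if ord(character) > 127)

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for name in names:
    # Names with a count above 3 are the ones is_english() rejects.
    print(non_ascii_count(name), ':', name)
```

The trademark sign and the emoji each contribute a single character, so those names stay well below the cutoff, while the Chinese title exceeds it. The heuristic is still imperfect: a short non-English name with three or fewer such characters would pass.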


In [11]:
#Using the functions created above to remove non-english apps from the datasets
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in apple_data[1:]:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

In [12]:
#sorting out apps that are free
free_android = []
free_ios = []

for app in android_english:
    if app[6] == 'Free':
        free_android.append(app)

for app in ios_english:
    if app[4] == '0.0':
        free_ios.append(app)
        
print(len(free_android))
print(len(free_ios))

8863
3222


## Most Common Apps by Genre
To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Ideally, the app we create will be added to both the App Store and Google Play. Therefore we look for app profiles that are successful in both of these stores.

We begin by finding out what the most common genres in both stores are. The relevant data is stored in the `'prime_genre'` column for iOS and in the `'Genres'` and `'Category'` columns for Android.

In [13]:
def freq_table(dataset, index):
    '''
    Returns dictionary-form frequency table for a dataset and desired column(index)
    
    dataset: list of lists. The function expects the header row to be removed.
    index: integer. Column to generate frequency table for
    return: dictionary. Frequency table in the form of a dictionary
    '''
    genres_dict = {}
    count = 0
    
    for row in dataset:
        count += 1
        value = row[index]
        if value in genres_dict:
            genres_dict[value] += 1
        else:
            genres_dict[value] = 1
    
    genres_per = {}
    for key in genres_dict:
        percentage = (genres_dict[key] / count) * 100
        genres_per[key] = percentage 
    
    return genres_per


def display_table(dataset, index):
    '''
    Prints a list of the genre percentages in descending order
    '''
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
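
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. This is an alternative formulation under the same assumptions (a list of lists without a header row); `freq_table_counter` and the toy rows are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count.
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows (hypothetical): app name and category only.
toy_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
table = freq_table_counter(toy_rows, 1)
for value, percentage in table.items():
    print(value, ':', percentage)
```

Because `most_common()` already sorts by count, this version does not need the separate tuple-sorting step in `display_table`.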

In [14]:
display_table(free_android, 1)


FAMILY : 19.21471285117906
GAME : 9.511452104253639
TOOLS : 8.462146000225657
BUSINESS : 4.580841701455489
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.542818458761142
SPORTS : 3.4187069840911652
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2494640640866526
HEALTH_AND_FITNESS : 3.068938282748505
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7826920907142052
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.128286133363421
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
ENTERTAINMENT : 0.8800631840234684
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0.64

`FAMILY` and `GAME` are the two largest categories, together accounting for roughly 29% of the free apps. If we browse Google Play, we can see that most of the apps in the `Family` category are games intended for children. Despite this, practical, non-entertainment apps seem to be well represented.

In [15]:
display_table(free_ios, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Over half of the free English apps in the iOS App Store are in the `Games` category, with roughly another 8% in the `Entertainment` category. The general trend seems to be that apps intended for entertainment make up the majority, while apps for practical or educational purposes account for only a small percentage.
It is still unclear whether this reflects demand. It might be that all the games are competing in a market that doesn't have room for them all.

## Most Popular Apps according to Genre

Next, we will investigate which types of apps have the most users. We will do this by using the `Installs` column for Google Play. For the iOS App Store we will use the number of user ratings, `rating_count_tot`, as an approximation of the number of users.

### iOS App Store

We investigate the iOS App Store first.

In [16]:
#Looping through the apps and totaling the ratings for apps with the same genre
genres_ios = freq_table(free_ios, -5)

for genre in genres_ios:
    count = 0
    length = 0
    
    for app in free_ios:
        genre_app = app[-5]
        if genre_app == genre:
            ratings = float(app[5])
            count += ratings
            length += 1
    
    avg_ratings = count / length
    print(genre, ':', avg_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


In [17]:
for app in free_ios:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The `Navigation` genre has the highest average number of ratings per app, but Waze and Google Maps alone account for about 97% of the ratings in that genre. The same pattern is found in other genres: `Social Networking` has Facebook, `Music` has Spotify, `Reference` has the Bible, and so forth. We are trying to find the most popular genres, but a few very large apps might be skewing our ranking. We will look into this in a bit.
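
One way to see how strongly the largest apps pull the average is to compare the mean with the median. The sketch below uses the `Navigation` rating counts copied from the output above; choosing the median as a more robust summary is a suggestion for illustration, not a step taken in this analysis.

```python
from statistics import mean, median

# Rating counts for the free Navigation apps, copied from the output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# Share of all Navigation ratings that belong to Waze and Google Maps.
top_two_share = (345046 + 154911) / sum(navigation_ratings) * 100

print('Mean:', mean(navigation_ratings))
print('Median:', median(navigation_ratings))
print('Share of Waze + Google Maps (%):', round(top_two_share, 1))
```

The median sits far below the mean here, which suggests that a typical app in the genre has far fewer users than the genre average implies.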

### Google Play Store

Now we check the Google Play store.

In [18]:
print('Number of app installs:')
display_table(free_android, 5)

Number of app installs:
1,000,000+ : 15.750874421753355
100,000+ : 11.564932866975065
10,000,000+ : 10.50434390161345
10,000+ : 10.21098950693896
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


The numbers are not very precise. `100,000+` could mean anywhere from 100,000 up to the next bucket, `500,000+`. However, the data doesn't have to be all that precise for our purposes. We will use the current data as it is: `100,000+` will count as exactly 100,000 installs, and so on.
As can be seen from the list above the numbers are in string format. We will need to convert them to float and then calculate the averages.
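
The conversion can be sketched as a small helper. `parse_installs` is a hypothetical name introduced here for illustration; it performs the same two replacements and the float conversion described above.

```python
def parse_installs(installs):
    """Convert an install bucket such as '100,000+' to a float (100000.0)."""
    return float(installs.replace(',', '').replace('+', ''))

# Bucket labels taken from the frequency table above.
for raw in ['0+', '100,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```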

In [19]:
genres_android = freq_table(free_android, 1)

for genre in genres_android:
    count = 0
    length = 0
    
    for app in free_android:
        genre_app = app[1]
        if genre_app == genre:
            installs = app[5]
            installs = installs.replace(',', '')
            installs = installs.replace('+', '')
            count += float(installs)
            length += 1
    
    avg_installs = count / length
    print(genre, ':', avg_installs)
        

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1704192.3399014778
COMICS : 817657.2727272727
COMMUNICATION : 38326063.197916664
DATING : 854028.8303030303
EDUCATION : 1768500.0
ENTERTAINMENT : 9146923.076923076
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4167457.3602941176
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 12914435.883748516
FAMILY : 5183203.576042279
MEDICAL : 123064.7898089172
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 4274688.722772277
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16772838.591304347
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24790074.17721519
NEWS_AND_MAGAZINES : 

As with the App Store, these numbers can be misleading, as they are skewed in most genres by a few big apps with 100M to 1B installs (Facebook, YouTube, Chrome, MS Word, etc.).
As before, the problem is that the install numbers might not truly reflect the popularity of the genre as a whole. It also seems difficult to compete in most of the genres, since they are dominated by a few giants.
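
A rough way to reduce this skew is to leave the giants out before averaging. The sketch below is a minimal illustration under stated assumptions: `avg_installs_under_cap` is a hypothetical helper, the 100M cap is an arbitrary choice, and the rows are made-up toy data that follow the Android column layout (category at index 1, installs at index 5).

```python
def avg_installs_under_cap(rows, category, cap=100_000_000,
                           category_idx=1, installs_idx=5):
    """Average installs for one category, ignoring apps at or above the cap."""
    installs = [
        float(row[installs_idx].replace(',', '').replace('+', ''))
        for row in rows
        if row[category_idx] == category
    ]
    below_cap = [n for n in installs if n < cap]
    # Return None when no app in the category falls below the cap.
    return sum(below_cap) / len(below_cap) if below_cap else None

# Toy rows (hypothetical apps); unused columns are left empty.
toy_rows = [
    ['Giant Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '500,000+'],
    ['Some Game', 'GAME', '', '', '', '10,000,000+'],
]
print(avg_installs_under_cap(toy_rows, 'COMMUNICATION'))
```

With the giant excluded, the toy `COMMUNICATION` average reflects only the two smaller apps. The same idea could be applied to `free_android`, although the right cap is a judgment call.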

## Conclusion

On both the Google Play store and the iOS App Store, the various genres are either saturated by competitors (e.g. gaming) or dominated by a few giant apps.
The most plausible strategy is to find a genre, present in both markets, that is dominated by only a few giants and is not saturated in the segment below 100M installs. With proper resources, it should be possible to create a successful app in such a genre.

What genre is chosen would also depend on what kind of resources are available to the company:
- What is the marketing budget?
- What kinds of domain knowledge does the company have?
- Can they partner with various brands?

Since the company gets its money from in-app ads, the app would have to be one that is used frequently. Weather apps, travel apps, and other apps that the user is unlikely to interact with often will probably not be a good investment.