# Popular app genres on Android and iOS
## Identifying profitable genres to guide future development of free apps relying on ad revenue

We are data analysts at a company that builds free apps and generates revenue through ads. We have been asked to provide some guidance on the genre of apps that is most popular with users, and therefore most promising.  

The aim of this project is to determine which genres of free apps attract many users per app. For this analysis, a sample of roughly 10,000 Android apps and 7,000 iOS apps will be considered. Other operating systems were excluded, as Android and iOS cover the majority of the mobile app market.

In [1]:
# Importing the data as list of lists from CSV

def easy_import(file_name):
    from csv import reader
    # The with block closes the file once it has been read
    with open(file_name, encoding="utf8") as opened_file:
        return list(reader(opened_file))

ios_data = easy_import('data/AppleStore.csv')
android_data = easy_import('data/googleplaystore.csv')

In [2]:
# Examining structure of the lists. To make this easier, a function 
# that extracts and neatly prints specified rows is defined below.

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(android_data, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [3]:
explore_data(ios_data, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




To find out which genre of free apps attracts the most users, the <i>Category</i>, <i>Installs</i> and <i>Type</i> columns of the Google Play Store dataset will be used. For the iOS dataset, the <i>price</i> and <i>prime_genre</i> columns will be used. Since the iOS dataset does not have a direct measure of installs, the <i>rating_count_tot</i> column will be used as an approximation.
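
The cells further down refer to these columns by hard-coded position. As a minimal sketch, assuming the header rows printed above, a column's position can instead be looked up by name. `column_index` is a hypothetical helper and not part of the original notebook.

```python
# Hypothetical helper: find a column's position from the header row.
def column_index(header, column_name):
    return header.index(column_name)

# Header rows copied from the output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

print(column_index(android_header, 'Category'))      # 1
print(column_index(android_header, 'Installs'))      # 5
print(column_index(ios_header, 'rating_count_tot'))  # 5
print(column_index(ios_header, 'prime_genre'))       # 11
```

Looking up positions this way would keep the analysis working if the column order of the CSV files ever changed.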

## Data cleaning 

### Data cleaning - removing an incorrect entry documented in a discussion forum

I looked into the discussion forum ([link](https://www.kaggle.com/lava18/google-play-store-apps/discussion)) for the Android dataset and found a faulty entry: its <i>Category</i> value is missing, which shifts every later value one column to the left. This row has to be deleted.

In [4]:
print(android_data[10473])
del android_data[10473]  # run this cell only once, or a valid row is deleted

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
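
The faulty row above has 12 values, while the header has 13. As a rough sketch, rows like this could be found without consulting the forum by comparing each row's length with the header's. `find_malformed_rows` is a hypothetical helper, and the sample data is made up for illustration.

```python
# Hypothetical helper: report rows whose length differs from the header's.
def find_malformed_rows(dataset):
    expected = len(dataset[0])  # the header row defines the expected length
    return [(i, len(row)) for i, row in enumerate(dataset)
            if len(row) != expected]

# Toy data, not taken from the real CSV files.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],  # Category missing, so values are shifted left
]

print(find_malformed_rows(sample))  # [(2, 2)]
```

Each tuple holds the row index and the number of values found in that row.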


### Data cleaning - checking for duplicates

In [5]:
# Defining a function that does this for us by looping through the data 
# and comparing each app name to all previously found names. 

def duplicates(dataset, name_index):
    all_apps = []
    duplicate_apps = []
    for entry in dataset[1:]:
        if entry[name_index] in all_apps:
            duplicate_apps.append(entry[name_index])
        else: 
            all_apps.append(entry[name_index])
    return duplicate_apps

# Checking for duplicates in the Android dataset 

android_duplicates = duplicates(android_data, 0)
print(android_duplicates[:5])
print('\n')
print('Total number of rows in the Android dataset (including header): ' + str(len(android_data)))
print('Number of duplicate android apps: ' + str(len(android_duplicates)))

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Total number of rows in the Android dataset (including header): 10841
Number of duplicate android apps: 1181


In [6]:
# Checking for duplicates in the iOS dataset

ios_duplicates = duplicates(ios_data, 1)
print(ios_duplicates[:5])   
print('\n')
print('Number of duplicate iOS apps: ' + str(len(ios_duplicates)))

['Mannequin Challenge', 'VR Roller Coaster']


Number of duplicate iOS apps: 2


There are 1181 duplicate entries in the Android dataset and only 2 in the iOS dataset. Further research shows that the duplicate entries in the App Store data are different apps that share a name, so they are kept.
Duplicate entries for Android apps, however, are the same app with data taken at different time points. We will use the number of reviews as a proxy for the time point: for each app, we will retain only the entry with the highest number of reviews.

In [7]:
# Finding the highest number of reviews for all apps.
# This info will be stored in a dictionary.

reviews_max = {}
for entry in android_data[1:]:
    name = entry[0]
    n_reviews = float(entry[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max)

9659

In [8]:
# Comparing the number of reviews for each entry (i.e. each app) to the max recorded
# in the reviews_max dictionary. If it matches, the entry will be added to
# the android_clean list of lists. This way duplicates are removed by adding only
# the entry which has the highest number of reviews.

android_clean = []
added = []

for entry in android_data[1:]:
    name = entry[0]
    n_reviews = float(entry[3])
    if n_reviews == reviews_max[name] and name not in added:
        android_clean.append(entry)
        added.append(name)

len(android_clean)

9659

The 1181 duplicate Android entries were removed successfully.
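
The two cells above make two passes over the data. As an alternative sketch, the same rule (keep the entry with the most reviews per app name) can be applied in a single pass with a dictionary keyed by name. `keep_most_reviewed` is a hypothetical helper, the sample rows are made up, and the order of the returned rows may differ from `android_clean`.

```python
# Hypothetical helper: keep, for each app name, the row with the most reviews.
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (name, category, rating, reviews), not taken from the real data.
sample = [
    ['Box', 'BUSINESS', '4.2', '10'],
    ['Zoom', 'BUSINESS', '4.4', '5'],
    ['Box', 'BUSINESS', '4.2', '12'],
]

for row in keep_most_reviewed(sample):
    print(row)
```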

### Data cleaning - Removing non-English apps using the ASCII range as a filter

Examining the datasets shows that some apps are aimed at non-English speakers. As our company produces only apps in English, these apps should not form part of our analysis. Unfortunately, no language information is recorded in the datasets, but English app names rarely contain characters whose code point (as returned by `ord()`) is above 127, i.e. outside the ASCII range. Allowing for the occasional emoji or trademark symbol, apps whose names contain more than 3 such characters will be removed from the datasets.

In [9]:
# Defining a function that loops through all characters in a string
# and checks whether more than 3 have code points above 127.

def is_english(string):
    counter = 0
    for char in string:
        if ord(char) > 127:
            counter += 1
    if counter > 3:
        return False
    else:
        return True

In [10]:
# Testing is_english function on selected strings. 

print(is_english('Instagram'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
True
False


The function defined above appears to work well, so I will now use it on the datasets.

In [11]:
# Defining a helper function that applies is_english to each entry in the dataset.

def remove_non_english(dataset, name_index):
    dataset_filtered = []
    for entry in dataset:
        name = entry[name_index]
        if is_english(name):
            dataset_filtered.append(entry)
    return dataset_filtered

In [12]:
# Removing non-English Android apps. 

android_english = remove_non_english(android_clean, 0)
explore_data(android_english, 0, 3)
print(len(android_english))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


9614


In [13]:
# Removing non-English iOS apps.

ios_english = remove_non_english(ios_data, 1)
explore_data(ios_english, 0, 3)
print(len(ios_english))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


6184


### Data cleaning - Isolating free apps

The datasets contain both free and paid apps. We will use the <i>Type</i> column in the Android dataset and the <i>price</i> column in the iOS dataset to isolate free apps.

In [14]:
# Isolating free Android apps

android_free = []
for entry in android_english:
    if entry[6] == 'Free':
        android_free.append(entry)

In [15]:
explore_data(android_free, 0, 3)
print('Free Android apps: ' + str(len(android_free)))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Free Android apps: 8863


In [16]:
# Isolating free iOS apps

ios_free = []
for entry in ios_english[1:]:
    if float(entry[4]) == 0.0:
        ios_free.append(entry)

In [17]:
explore_data(ios_free, 0, 3)
print('Free iOS apps: ' + str(len(ios_free)))

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Free iOS apps: 3222


## Analysis

### What is the market share of each genre?

After removing duplicate, paid and non-English apps, the datasets can now be analysed. To gain an understanding of the most popular genres in general, I will look at the 'market share' of each genre.

In [18]:
# Defining a function to extract the frequencies of each unique genre as a dictionary.

def freq_table(dataset, index):
    table = {}            # named 'table' to avoid shadowing the function name
    total = 0             # counting total no. of apps
    for entry in dataset:
        total += 1
        genre = entry[index]
        if genre in table:
            table[genre] += 1   # counting occurrences of genres
        else:
            table[genre] = 1
    for key in table:
        table[key] /= total    # converting to proportion
        table[key] *= 100      # converting to %
        table[key] = round(table[key], 2)
    return table

# Defining a function that prints the entries of a table (dictionary)
# in descending order of their values.

def display_table(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [19]:
# Displaying the frequencies of unique android app genres in %.

android_freq = freq_table(android_free, 1)
display_table(android_freq)

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [20]:
# Displaying the frequencies of unique iOS app genres in %.

ios_freq = freq_table(ios_free, -5)
display_table(ios_freq)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Several notable trends can be observed:
- The App Store appears to be dominated by entertainment apps: the genre with the most apps is Games at 58.16 % of all apps, followed by Entertainment (7.88 %) and Photo & Video (4.97 %).
- The Google Play store shows a more balanced distribution of genres, including both apps for entertainment and apps for productivity. The largest category, Family, makes up 18.9 % of all apps, and Game only 9.73 %.

### What is the average number of users per app for each genre?

The frequencies with which genres are represented are not sufficient to determine what our product development should focus on. For example, a genre may make up a large portion of all available apps while each of its apps attracts only a few users. This would be less profitable than, for instance, a genre with few apps that each attract many users. Therefore, we will now look at users per app for the unique genres identified above.

In [21]:
# Obtaining average no. of users per genre for Android apps 

avg_users_android = {}

for category in android_freq:    # Looping through genres
    total = 0                    # Counting total users for apps in genre
    len_category = 0             # Counting no. of apps in genre
    for entry in android_free:   # Looping through Android apps to obtain both users and apps per genre
        category_app = entry[1]
        if category_app == category:
            installs_str = entry[5].replace('+', '')
            installs_str = installs_str.replace(',', '')
            installs_flt = float(installs_str)
            total += installs_flt
            len_category += 1
    avg_per_category = total / len_category
    avg_users_android[category] = round(avg_per_category, 0)
    
display_table(avg_users_android)

COMMUNICATION : 38456119.0
VIDEO_PLAYERS : 24727872.0
SOCIAL : 23253652.0
PHOTOGRAPHY : 17840110.0
PRODUCTIVITY : 16787331.0
GAME : 15588016.0
TRAVEL_AND_LOCAL : 13984078.0
ENTERTAINMENT : 11640706.0
TOOLS : 10801391.0
NEWS_AND_MAGAZINES : 9549178.0
BOOKS_AND_REFERENCE : 8767812.0
SHOPPING : 7036877.0
PERSONALIZATION : 5201483.0
WEATHER : 5074486.0
HEALTH_AND_FITNESS : 4188822.0
MAPS_AND_NAVIGATION : 4056942.0
FAMILY : 3697848.0
SPORTS : 3638640.0
ART_AND_DESIGN : 1986335.0
FOOD_AND_DRINK : 1924898.0
EDUCATION : 1833495.0
BUSINESS : 1712290.0
LIFESTYLE : 1437816.0
FINANCE : 1387692.0
HOUSE_AND_HOME : 1331541.0
DATING : 854029.0
COMICS : 817657.0
AUTO_AND_VEHICLES : 647318.0
LIBRARIES_AND_DEMO : 638504.0
PARENTING : 542604.0
BEAUTY : 513152.0
EVENTS : 253542.0
MEDICAL : 120551.0


This data shows that the five Android app genres with the highest average number of installs per app are:
- Communication (38,000,000+)
- Video Players (24,000,000+)
- Social (23,000,000+)
- Photography (17,000,000+)
- Productivity (16,000,000+)
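
The nested loops used above scan the whole dataset once per genre. As a sketch of an alternative, the same averages can be collected in a single pass by accumulating a running total and a count per genre. `average_by_group` is a hypothetical helper and the sample rows are made up. It assumes the values are formatted like the <i>Installs</i> column, e.g. '1,000+'.

```python
from collections import defaultdict

# Hypothetical helper: average a numeric column per group in one pass.
def average_by_group(rows, group_index, value_index):
    totals = defaultdict(float)  # sum of values per group
    counts = defaultdict(int)    # number of rows per group
    for row in rows:
        # Strip the '+' and ',' characters used in the Installs column
        value = float(row[value_index].replace('+', '').replace(',', ''))
        totals[row[group_index]] += value
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (name, category, installs), not taken from the real data.
sample = [
    ['App A', 'TOOLS', '1,000+'],
    ['App B', 'TOOLS', '3,000+'],
    ['App C', 'GAME', '500+'],
]

print(average_by_group(sample, 1, 2))  # {'TOOLS': 2000.0, 'GAME': 500.0}
```

Because plain numeric strings pass through the two `replace` calls unchanged, the same helper would also work on a column without '+' or ',' characters.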

In [22]:
# Obtaining average no. of users per genre for iOS apps 

avg_users_ios = {}

for genre in ios_freq:           # Looping through genres
    total = 0                    # Counting total users for apps in genre
    len_genre = 0                # Counting no. of apps in genre
    for entry in ios_free:       # Looping through iOS apps to obtain both users and apps per genre
        genre_app = entry[-5]
        if genre_app == genre:
            installs = float(entry[5])
            total += installs
            len_genre += 1
    avg_per_genre = total / len_genre
    avg_users_ios[genre] = round(avg_per_genre, 0)

display_table(avg_users_ios)

Navigation : 86090.0
Reference : 74942.0
Social Networking : 71548.0
Music : 57327.0
Weather : 52280.0
Book : 39758.0
Food & Drink : 33334.0
Finance : 31468.0
Photo & Video : 28442.0
Travel : 28244.0
Shopping : 26920.0
Health & Fitness : 23298.0
Sports : 23009.0
Games : 22789.0
News : 21248.0
Productivity : 21028.0
Utilities : 18684.0
Lifestyle : 16486.0
Entertainment : 14030.0
Business : 7491.0
Education : 7004.0
Catalogs : 4004.0
Medical : 612.0


This data shows that the five iOS app genres with the highest average number of ratings per app (our proxy for users) are:
- Navigation (86,000+)
- Reference (74,000+)
- Social Networking (71,000+)
- Music (57,000+)
- Weather (52,000+)

Interestingly, the genre with the highest share of all iOS apps (Games) ranks only 14th out of 23 genres in terms of users per app.

### Conclusions & recommendations

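Overall, the per-app averages look much higher for Android than for iOS, but the two are not directly comparable: the Android figures are install counts, whereas the iOS figures are rating counts used as a proxy. The top genres also differ considerably between the two stores. This means that we will have to compromise when deciding what kinds of apps to develop.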

We can see that the genres 'Social'/'Social Networking' attract a large number of users per app on both Android and iOS. This genre is, however, heavily dominated by a few 'big players' and therefore unlikely to be profitable for the development of free, ad-revenue based apps. The same is true for other popular Android genres like 'Communication' or 'Video players', and for popular iOS genres like 'Navigation' or 'Music'.  
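
The claim that a few apps dominate a genre is not checked in the cells above. As a rough sketch of how it could be checked, the function below computes the share of a genre's total installs held by its top `n` apps. `top_n_share` is a hypothetical helper and the install counts are made up for illustration, not taken from the datasets.

```python
# Hypothetical helper: % of total installs held by the n largest apps.
def top_n_share(install_counts, n=3):
    total = sum(install_counts)
    if total == 0:
        return 0.0
    top = sorted(install_counts, reverse=True)[:n]
    return round(sum(top) / total * 100, 2)

# Toy install counts for the apps of one genre.
sample_installs = [900, 50, 30, 10, 10]

print(top_n_share(sample_installs, n=1))  # 90.0
print(top_n_share(sample_installs, n=3))  # 98.0
```

A value close to 100 for a small `n` would suggest that the genre average is driven by a handful of very large apps.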

The genre 'Productivity' ranks among the top 5 Android genres in terms of users per app and still attracts a reasonable number of users on iOS (about 21,000 ratings per app, 16th of 23 genres). The opposite is true for the genre 'Weather', which is in the top 5 on iOS and in the middle of the field on Android (14th of 33 categories). Therefore, I would recommend that we focus our app development on these two genres.