# App Store and Google Play Store App Profitability

## Analysis of mobile apps for App Store and Google Play Store


<i>Kim Kirk <br> July 3, 2020</i>

## Synopsis

A descriptive multivariate data analysis was conducted on mobile app data from the Apple App Store and the Google Play Store. Two Kaggle data sets, covering roughly 10,000 Play Store apps and 7,200 App Store apps, were imported, cleaned, and analyzed. The profitability of App Store and Play Store apps was explored to identify which types of apps are likely to attract more users. This information is then used to inform app development within the company and reduce development risk.

### Data Processing

Importing the iOS and Android apps data sets. Performing some exploratory data analysis by printing out contents of a few rows of each data set including number of rows and columns. Identifying useful fields in the dataset.



In [1]:

import csv

opened_ios_file = open('AppleStore.csv')
read_ios_file = csv.reader(opened_ios_file)
ios_apps = list(read_ios_file)

opened_goog_file = open('googleplaystore.csv')
read_goog_file = csv.reader(opened_goog_file)
goog_apps = list(read_goog_file)

opened_ios_file.close()
opened_goog_file.close()


#this function provided by Dataquest.io developers
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data(ios_apps, 0, 4, True)
explore_data(goog_apps, 0, 4, True)

print('\n')
print(" iOS fields", ios_apps[0])
print('\n')
print(" Android fields", goog_apps[0])


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'J

Documentation links to additional field descriptions: [Android](https://www.kaggle.com/lava18/google-play-store-apps) and [iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps?select=appleStore_description.csv)
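The import step can also be written so that the header is separated from the data rows as soon as the file is read. The following is a minimal sketch, not the code used in this analysis: `read_rows` is a hypothetical helper, and the in-memory sample stands in for the real CSV files, with values copied from the output above and trimmed to three fields.

```python
import csv
import io

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    reader = csv.reader(file_obj)
    header = next(reader)          # first line holds the field names
    return header, list(reader)    # remaining lines are the data rows

# Stand-in for an opened data file (assumption: the real file is UTF-8 text).
sample = io.StringIO(
    "id,track_name,price\n"
    "284882215,Facebook,0.0\n"
    "389801252,Instagram,0.0\n"
)
header, rows = read_rows(sample)
print(header)     # ['id', 'track_name', 'price']
print(len(rows))  # 2
```

With the real files, this could be called inside `with open('AppleStore.csv', encoding='utf8', newline='') as f:`, which also closes the file automatically. The `utf8` encoding is an assumption about the Kaggle files.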

Cleaning the data. The Android apps data set has an error in one of its rows: the "Category" value is missing, which shifts the remaining fields one column to the left. Removing the row with the error. A check is performed to ensure the row was properly deleted.



In [2]:

print(goog_apps[10473])

del goog_apps[10473]

goog_apps[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']
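The faulty row printed above has 12 fields while the header has 13, so rows of this kind can be located by comparing lengths instead of relying on a known index. A minimal sketch on toy data, where `find_malformed_rows` is a hypothetical helper name:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [
        (i, row)
        for i, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]

toy = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],   # category missing, so the row is one field short
]
print(find_malformed_rows(toy))  # [(2, ['Shifted App', '1.9'])]
```

The returned index counts the header as row 0, matching how `goog_apps` is indexed in the cell above.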

Checking for duplicate Android app names by printing out duplicate entries in the data.

In [3]:

duplicate_apps = []
unique_apps = []

for app in goog_apps:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Sample of names of duplicate apps:', duplicate_apps[0:3])
print('\n')
print('Sample of duplicate entries: ')

for app in goog_apps:
    name = app[0]
    if name == 'Video Downloader':
        print(app)


Number of duplicate apps: 1181


Sample of names of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']


Sample of duplicate entries: 
['Video Downloader', 'VIDEO_PLAYERS', '4.2', '59089', '5.4M', '10,000,000+', 'Free', '0', 'Everyone', 'Video Players & Editors', 'August 3, 2018', '1.0.8', '4.4 and up']
['Video Downloader', 'VIDEO_PLAYERS', '4.2', '58981', '5.4M', '10,000,000+', 'Free', '0', 'Everyone', 'Video Players & Editors', 'August 3, 2018', '1.0.8', '4.4 and up']
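The duplicate check above tests membership in a list, which rescans the list for every app. As an alternative sketch, `collections.Counter` counts each name in one pass; `duplicate_names` is a hypothetical helper and the rows are toy data.

```python
from collections import Counter

def duplicate_names(rows, name_index=0):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

toy_rows = [
    ['Video Downloader', '59089'],
    ['Video Downloader', '58981'],
    ['Box', '159'],
]
dupes = duplicate_names(toy_rows)
print(dupes)  # {'Video Downloader': 2}

# Number of surplus entries, comparable to len(duplicate_apps) in the cell above.
surplus = sum(n - 1 for n in dupes.values())
print(surplus)  # 1
```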


For each duplicated app name, only the entry with the highest number of reviews will be kept; entries with fewer reviews will be removed.

Gathering the app name and highest review count from the Android apps data set. Creating a dictionary that maps each Android app name to its highest number of reviews. A check is performed at the end of the code block to confirm that the new data set has the expected length.


In [4]:

reviews_max = {}


for app in goog_apps[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print('Length of new data set same as expected length')
print(len(reviews_max) == 9659)

    

Length of new data set same as expected length
True


Removing duplicate entries from the Android apps data set. Creating a new data set that holds, for each app, the row with the highest number of reviews. A check is performed at the end of the code block to confirm that the new data set has the expected length and that entire rows were captured in the list.

In [5]:

android_clean = []
already_added = []

for app in goog_apps[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print('Length of new data set same as expected length')
print(len(android_clean) == 9659)
print('\n')
print('Sample rows for Android data set')
print(android_clean[0:2])
    

Length of new data set same as expected length
True


Sample rows for Android data set
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
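The two passes above (find the highest review count, then collect matching rows) can be folded into one by storing the best row per name directly. This is a sketch of that variant on toy rows, not the method used for the results in this notebook; `keep_most_reviewed` is a hypothetical name.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['A', 'X', '4.2', '59089'],
    ['A', 'X', '4.2', '58981'],
    ['B', 'Y', '4.0', '10'],
    ['B', 'Y', '4.0', '12'],
]
print(keep_most_reviewed(toy))
# [['A', 'X', '4.2', '59089'], ['B', 'Y', '4.0', '12']]
```

Because dictionaries keep insertion order, the apps come out in the order their names first appear in the input.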


Identifying iOS and Android apps directed to an English-language audience. Creating a function that treats an app name as English unless it contains more than three characters outside the ASCII range (code point above 127). A check is performed to ensure the function's logic is accurate.


In [6]:

def common_english_chars(a_string):
    total_nonASCII_char = 0
    for char in a_string:
        if ord(char) > 127:
            total_nonASCII_char += 1
        if total_nonASCII_char > 3:
            return False
    return True

print('Check for function\'s logical accuracy')
print(common_english_chars('Instagram') == True)
print(common_english_chars('爱奇艺PPS -《欢乐颂2》电视剧热播') == False)
print(common_english_chars('Docs To Go™ Free Office Suite') == True)
print(common_english_chars('Instachat 😜') == True)
print(common_english_chars('Turbo Dismount®') == True)

Check for function's logical accuracy
True
True
True
True
True
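The same rule can be stated more compactly by counting the non-ASCII characters and comparing against a threshold. A sketch under the same assumption as above (up to three non-ASCII characters are tolerated); `mostly_ascii` and `max_non_ascii` are hypothetical names.

```python
def mostly_ascii(text, max_non_ascii=3):
    """True when text has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(mostly_ascii('Instagram'))                      # True
print(mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))   # False
print(mostly_ascii('Docs To Go™ Free Office Suite'))  # True
```

Making the threshold a parameter keeps the choice of three visible and easy to adjust. The rule remains a heuristic: an English name with many emoji would still be rejected.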


Filtering out non-English apps from iOS and Android apps data sets. Creating a new data set that includes English-language apps only. A check is performed on the function to ensure the logic is accurate.

In [45]:
def filter_english_apps(a_data_set):
    updated_data_set = []
    for app in a_data_set[1:]:
        if a_data_set == ios_apps:
            app_name = app[1]
        else:
            app_name = app[0]
       
        if common_english_chars(app_name) == True:
            updated_data_set.append(app)
            
    return updated_data_set

ios_data = filter_english_apps(ios_apps)
android_data = filter_english_apps(android_clean)


print('\n')
print('Check for English language apps only in iOS and Android data sets')
for app in ios_data:
    if 'LEGO® NEXO KNIGHTS™ : MERLOK 2.0' in app[1:2]:
        print('True')
    elif '搜狗输入法-Sogou Keyboard' in app[1:2]:
        print('Non-English data in data set')

for app in android_data:
    if 'Wattpad 📖 Free Books' in app[0]:
        print('True')
    elif 'Flame - درب عقلك يوميا' in app[0]:
        print('Non-English data in data set')
    





Check for English language apps only in iOS and Android data sets
True
True
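One detail worth noting about the filter above: `android_clean` was built from `goog_apps[1:]` and so carries no header row, which means slicing it again with `[1:]` appears to skip its first app. A possible variant passes the name column and the presence of a header explicitly. This is a sketch with hypothetical names (`is_english_name`, `filter_by_name`, `has_header`) and toy rows.

```python
def is_english_name(name, max_non_ascii=3):
    """Heuristic: at most max_non_ascii characters above code point 127."""
    return sum(1 for ch in name if ord(ch) > 127) <= max_non_ascii

def filter_by_name(rows, name_index, has_header=False):
    """Keep rows whose name passes the heuristic; drop the header only if present."""
    data = rows[1:] if has_header else rows
    return [row for row in data if is_english_name(row[name_index])]

ios_like = [
    ['id', 'track_name'],
    ['1', 'Instagram'],
    ['2', '搜狗输入法-Sogou Keyboard'],
]
android_like = [
    ['Instagram', 'SOCIAL'],
    ['搜狗输入法-Sogou Keyboard', 'TOOLS'],
]

print(filter_by_name(ios_like, 1, has_header=True))  # [['1', 'Instagram']]
print(filter_by_name(android_like, 0))               # [['Instagram', 'SOCIAL']]
```

Passing the column index also avoids comparing whole data sets with `==` just to decide which column holds the name.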


Exploring the new data sets to see how many rows remain in each.

In [8]:
print('iOS data has', len(ios_data), 'rows remaining')
print('Android data has', len(android_data), 'rows remaining')

iOS data has 6183 rows remaining
Android data has 9613 rows remaining


Removing all paid iOS and Android apps. Creating a function that appends only free apps to final iOS and Android data sets. A check is performed to ensure paid apps have been removed.

In [9]:
def free_english_apps(data_set):
    free_apps_data_set = []
    for row in data_set:
        if data_set == ios_data:
            price = row[4]
        else:
            price = row[7]
        if price == '0' or price == '0.0':
            free_apps_data_set.append(row)
            
    return free_apps_data_set

ios_data_final = free_english_apps(ios_data)
android_data_final = free_english_apps(android_data)

print('iOS sample row')
print(ios_data_final[0:2])
print('\n')
print('Android sample row')
print(android_data_final[0:2])

def test_paid(a_final_data_set, index_position):
    # Flag any row whose price is neither '0' nor '0.0'.
    for app in a_final_data_set:
        if app[index_position] not in ('0', '0.0'):
            print('Paid app found')

print('\n')          
print('Check for paid apps in data set')
print(test_paid(ios_data_final, 4), 'in iOS data set')
print(test_paid(android_data_final, 7), 'in Android data set')


iOS sample row
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


Android sample row
[['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


Check for paid apps in data set
None in iOS data set
None in Android data set
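The free-app filter above compares price strings literally. If prices can appear in other forms, parsing them as numbers is more tolerant. The sketch below assumes that paid Android prices carry a leading dollar sign (for example `'$4.99'`), which is not shown in the output above; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """True when the price parses to zero; unparseable values count as not free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

Treating unparseable values as not free is a conservative choice: a malformed row is dropped from the free set instead of slipping into it.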


### Exploratory Data Analysis

Exploring the final data sets to see how many rows remain in each.

In [10]:
print(len(ios_data_final), 'apps in iOS data set')
print(len(android_data_final), 'apps in Android data set')

3222 apps in iOS data set
8863 apps in Android data set


Analyzing the data sets. Identifying apps with profiles that are successful on both the App Store and Google Play Store markets to mitigate risk and overhead for our company's app development.

Exploring fields in iOS and Android data sets to identify most common genres in each market. Creating sorted frequency tables for the genres to see what the most common genre is for each market. A check is performed to ensure function logic is working properly.

In [55]:
print('iOS field for genre data')
print(ios_data_final[0][11], ' "PRIME_GENRES" field')
print('\n')
print('Android fields for genre data')
print(android_data_final[0][1], ' "CATEGORY" field')
print(android_data_final[0][9], ' "GENRES" field')
print('\n')

def freq_table(data_set, index):
    genre_counting = {}
    total_apps_count = 0
    
    for app in data_set:
        genre = app[index]
        if genre in genre_counting:
            genre_counting[genre] += 1
        else:
            genre_counting[genre] = 1
   
    for count in genre_counting:
        total_apps_count += genre_counting[count]
      
    for count in genre_counting:
        genre_counting[count] = genre_counting[count]/total_apps_count  
        genre_counting[count] *= 100
        
    return genre_counting

print('Check for empty frequency tables:')
print(len(freq_table(ios_data_final, 11)) == 0)
print(len(freq_table(android_data_final, 1)) == 0)
print(len(freq_table(android_data_final, 9)) == 0)
print('\n')

#this function provided by dataquest.io developers
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('iOS most frequent genres for "PRIME_GENRES" field')
display_table(ios_data_final, 11)
print('\n')
print('Android most frequent genres for "CATEGORY" field')
display_table(android_data_final, 1)
print('\n')
print('Android most frequent genres "GENRES" field')    
display_table(android_data_final, 9)
print('\n')


    


iOS field for genre data
Social Networking  "PRIME_GENRES" field


Android fields for genre data
ART_AND_DESIGN  "CATEGORY" field
Art & Design  "GENRES" field


Check for empty frequency tables:
False
False
False


iOS most frequent genres for "PRIME_GENRES" field
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Android most frequent genres 
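The counting, normalising, and sorting steps of `freq_table` and `display_table` can also be combined with `collections.Counter`. A minimal sketch on toy rows, where `percent_table` is a hypothetical name:

```python
from collections import Counter

def percent_table(rows, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, 100 * n / total) for value, n in counts.most_common()]

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
for genre, pct in percent_table(toy, 1):
    print(genre, ':', pct)
# Games : 75.0
# Music : 25.0
```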

Continuing to analyze iOS and Android apps genre data. Finding the most popular iOS apps by genre.

In [36]:
freq_table_ios_popular = freq_table(ios_data_final, 11)

print('iOS apps:')
print('\n')
for genre in freq_table_ios_popular:
    total = 0
    len_genre = 0
    for app in ios_data_final:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])  # rating_count_tot: total number of user ratings
            total += n_ratings
            len_genre += 1
    print('For genre:', genre, 'the average number of user ratings is: ', round(total/len_genre, 1))
    

            
       


iOS apps:


For genre: Book the average number of user ratings is:  39758.5
For genre: Education the average number of user ratings is:  7004.0
For genre: Reference the average number of user ratings is:  74942.1
For genre: Social Networking the average number of user ratings is:  71548.3
For genre: News the average number of user ratings is:  21248.0
For genre: Photo & Video the average number of user ratings is:  28441.5
For genre: Catalogs the average number of user ratings is:  4004.0
For genre: Business the average number of user ratings is:  7491.1
For genre: Productivity the average number of user ratings is:  21028.4
For genre: Food & Drink the average number of user ratings is:  33333.9
For genre: Health & Fitness the average number of user ratings is:  23298.0
For genre: Finance the average number of user ratings is:  31467.9
For genre: Sports the average number of user ratings is:  23008.9
For genre: Lifestyle the average number of user ratings is:  16485.8
For genre: Travel
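The loop above rescans the whole data set once per genre. A single pass that accumulates a total and a count per genre gives the same averages. The following is a sketch on toy rows; `average_by_group` is a hypothetical helper.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average a numeric column per group in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 1) for group in totals}

toy = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(toy, 0, 1))  # {'Games': 200.0, 'Music': 50.0}
```

For the iOS data in this notebook, the call would presumably be `average_by_group(ios_data_final, 11, 5)`, using the genre and rating-count columns shown earlier.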

Continuing to analyze iOS and Android apps genre data. Finding the most popular Android apps by genre, measured by average number of installs.

In [44]:
freq_table_android_popular = freq_table(android_data_final, 1)

for genre in freq_table_android_popular:
    total = 0
    len_category = 0
    for app in android_data_final:
        category_app = app[1]
        if category_app == genre:
            num_installs = app[5]
            if '+' in num_installs:
                num_installs = num_installs.replace('+', '')
            if ',' in num_installs:
                num_installs = num_installs.replace(',', '')
            num_installs = float(num_installs)
            total += num_installs
            len_category += 1
    print('For genre:', genre, 'the average number of installs is: ', round(total/len_category, 1))
    
        
        

For genre: VIDEO_PLAYERS the average number of installs is:  24727872.5
For genre: LIFESTYLE the average number of installs is:  1437816.3
For genre: COMMUNICATION the average number of installs is:  38456119.2
For genre: HOUSE_AND_HOME the average number of installs is:  1331540.6
For genre: MEDICAL the average number of installs is:  120550.6
For genre: TOOLS the average number of installs is:  10801391.3
For genre: AUTO_AND_VEHICLES the average number of installs is:  647317.8
For genre: ENTERTAINMENT the average number of installs is:  11640705.9
For genre: TRAVEL_AND_LOCAL the average number of installs is:  13984077.7
For genre: ART_AND_DESIGN the average number of installs is:  2021626.8
For genre: EVENTS the average number of installs is:  253542.2
For genre: LIBRARIES_AND_DEMO the average number of installs is:  638503.7
For genre: HEALTH_AND_FITNESS the average number of installs is:  4188822.0
For genre: PARENTING the average number of installs is:  542603.6
For genre: SPORT
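The install counts are stored as text such as `'10,000+'`, so they have to be cleaned before averaging. The cleaning step can be isolated in a small function, sketched here with the hypothetical name `parse_installs`:

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))      # 10000
print(parse_installs('10,000,000+'))  # 10000000
```

Each value is the lower bound of an open-ended bucket, so averages built from these numbers are best read as rough comparisons between categories, not as exact install counts.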

## Conclusion

The iOS data show that the most common genre is "Games" (58.2% of the free English-language apps) and the second most common is "Entertainment" (7.9%), a drop of about 50 percentage points. Most of the apps are designed for entertainment purposes (Games, Entertainment, Photo & Video). The large number of apps in the "Games" genre does not necessarily mean that this genre has the most users; other factors could explain why it is the most frequent in the App Store, for example that games may be easier to develop.

The Android data show that "Family" and "Tools" are the most common genres. As in the iOS data, "Games" and "Entertainment" are also common genres. Again, the large number of "Family" and "Tools" apps does not necessarily mean that these genres have the most users; other factors, such as ease of development, could explain why they are the most frequent in the Google Play Store.

For the iOS apps data, the highest average number of user ratings is 86,090.3 for the Navigation genre, followed by 74,942.1 for Reference and 71,548.3 for Social Networking. This suggests that the three most popular genres among iOS app users are Navigation, Reference, and Social Networking. My recommendation is to focus on these three genres for iOS app development.
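The ranking behind this recommendation can be reproduced by sorting the per-genre averages. The sketch below uses only a subset of the iOS averages reported in this notebook; the dictionary name `ios_avg_ratings` is hypothetical.

```python
# Subset of the per-genre averages reported for the iOS data.
ios_avg_ratings = {
    'Book': 39758.5,
    'Reference': 74942.1,
    'Social Networking': 71548.3,
    'Food & Drink': 33333.9,
    'Navigation': 86090.3,
}

# Sort by average, highest first, and keep the first three entries.
top_three = sorted(ios_avg_ratings.items(), key=lambda item: item[1], reverse=True)[:3]
for genre, average in top_three:
    print(genre, average)
# Navigation 86090.3
# Reference 74942.1
# Social Networking 71548.3
```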

For the Android apps data, the highest average number of installs is 38,456,119.2 for the Communication category, followed by 24,727,872.5 for Video Players, 23,253,652.1 for Social, and 17,840,110.4 for Photography. This suggests that the most popular categories by installs among Android app users are Communication, Video Players, and Social. My recommendation is to focus on these categories, with Photography as a further candidate, for Android app development.