## Understanding what apps attract more users

### Synopsis
The goal of this study is to analyze app data to help developers understand what type of apps are likely to attract more users. Apps predominantly generate revenue from in-app ads, so the more users see and engage with the ads, the more revenue is likely to be generated. In this study I am interested in free apps.

In [1]:
from csv import reader

In [2]:
file  = open('AppleStore.csv',encoding='utf8')
apple = reader(file)
apple_data = list(apple)
file.close()

file  = open('googleplaystore.csv',encoding='utf8')
google = reader(file)
google_data = list(google)
file.close()
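
The loading cell above repeats the same open, read, close sequence for each file. A minimal sketch of one way to wrap it is shown here; the helper name `load_csv` is hypothetical and not part of the original notebook. It uses a `with` block so the file is closed even if reading fails, and passes `newline=''` as the `csv` module documentation recommends.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Read a CSV file into a list of rows (hypothetical helper)."""
    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(path, encoding=encoding, newline='') as f:
        return list(reader(f))


# Usage with the two files from this study (paths as in the cell above):
# apple_data = load_csv('AppleStore.csv')
# google_data = load_csv('googleplaystore.csv')
```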

Let's explore the dataset

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Print out first 10 rows of App store data
explore_data(apple_data,0,10)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061

In [5]:
# Print out first 10 rows of Google Play Store data
explore_data(google_data,0,10)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

In [6]:
len(apple_data[1:]), len(apple_data[0])

(7197, 16)

In [7]:
len(google_data[1:]), len(google_data[0])

(10841, 13)

Let's take a closer look at the headers from both datasets to figure out which categories will best suit our goals

In [8]:
print(apple_data[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [9]:
print(google_data[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In order to answer our question of what types of apps are likely to attract customers, our best option is to segment the apps based on their **genres**, as that category best describes the type of the app.

## Data Cleaning

In this section we inspect the data for erroneous values. We also restrict the analysis to free apps aimed at an English-speaking audience. One erroneous entry is shown below.

In [10]:
print(google_data[0]), print("\n"), explore_data(google_data,10473,10475)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




(None, None, None)

We see that there is an error in the Google dataset. At index 10473, the `Category` value is missing and the remaining cell values are shifted to the left, so every column after the app name holds an incorrect value. Let's delete the row.

In [11]:
del google_data[10473]
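
The shifted row has 12 fields while the header has 13, so rows of this kind can be found by comparing lengths rather than by inspecting indexes by hand. The sketch below assumes a hypothetical helper `find_malformed_rows` and uses a tiny made-up sample; it only catches rows with a missing or extra field, not rows with wrong values in the right number of columns.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up example: the third row is missing its category field
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],
]
print(find_malformed_rows(sample))  # [(2, 2)]
```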

### Duplicate entries

Next, we remove apps that have been recorded more than once. However, I will not just randomly remove duplicates. I will keep the apps that are most recent, and will achieve this by using the number of reviews as a filter. The most recent data will have the highest number of reviews.

Let's confirm there are duplicates first

In [12]:
app_name = []
duplicates = []

for names in google_data[1:]:
    name = names[0]
    
    if name in app_name:
        duplicates.append(name)
       
    elif name not in app_name:
        app_name.append(name)
        
print(duplicates[:10])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [13]:
len(duplicates)

1181

In order to remove duplicates, I will first search for them, and only keep the values with the **highest number of reviews** as this corresponds to most recent data

In [14]:
# create a dictionary to store unique app names
reviews_max = {}

for apps in google_data[1:]:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews        
        
print(len(reviews_max))

9659


Note: In the above step, we built a dictionary while looping through the data set. Each app name maps to its number of reviews, and the stored value is updated so that only the largest number is kept. This accounts for duplicates, since we use the number of reviews to identify the most recent entry for each app.

In the following step we loop through the data again, and for every app we append its row to a list if the app is not already present and its number of reviews matches the value in our dictionary.

In [15]:
# use dictionary above to remove duplicate rows
android_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
len(android_clean)

9659
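
The two passes above can also be sketched as a single pass that stores the whole row in the dictionary. This is an illustrative alternative, not the method used for the counts in this study; the function name `keep_most_reviewed` and the sample rows are made up. One difference to be aware of: the result is ordered by each app's first appearance, not by the position of its most-reviewed row.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up rows in the Google Play column order (name, category, rating, reviews)
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '160'],
]
deduplicated = keep_most_reviewed(sample_rows)
print(len(deduplicated))  # 2
```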

### Removing non-English apps

Here we filter out non-English apps by checking the Unicode code point of each character in the app name. Standard English characters all fall within the range 0-127, so a name containing characters outside that range is most likely not English.

In [16]:
def char_check(string):
    default = True
    for l in string:
        if ord(l) > 127:
            default = False
            break
    return default
  


In [17]:
char_check('Instagram'), char_check('爱奇艺PPS -《欢乐颂2》电视剧热播')

(True, False)

In [18]:
char_check('Docs To Go™ Free Office Suite'), char_check('Instachat 😜')

(False, False)
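
The second test shows that the strict check also rejects English names that contain a symbol such as ™ or an emoji. One possible relaxation, which is not used in the rest of this analysis, is to tolerate a small number of non-ASCII characters. In the sketch below, the function name `char_check_relaxed` and the threshold of 3 are arbitrary assumptions.

```python
def char_check_relaxed(string, max_non_ascii=3):
    """Return True when the string has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for ch in string if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(char_check_relaxed('Docs To Go™ Free Office Suite'))  # True
print(char_check_relaxed('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```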

In [19]:
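# filter out the non-English apps from both data sets
google_filter = []
apple_filter = []

# android_clean has no header row, so slicing from 1 skips its first app
for app in android_clean[1:]:
    name = app[0]
    flag = char_check(name)
    
    if flag:
        google_filter.append(app)
        

# for App Store rows, column 0 is the numeric id (track_name is column 1),
# so this check passes every row
for app in apple_data[1:]:
    name = app[0]
    flag = char_check(name)
    
    if flag:
        apple_filter.append(app)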
        

In [20]:
len(google_filter), len(apple_filter)

(9116, 7197)
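
All 7,197 App Store rows pass the filter because the loop reads column 0, which is the numeric `id` for App Store rows; the name is `track_name` at column 1. The loop over `android_clean[1:]` also skips the first app, since that list has no header row. The sketch below shows a variant that takes the name column as a parameter. The helper names and sample rows are made up, and the counts reported in this study come from the original cell, not from this variant.

```python
def is_ascii_name(name):
    """True when every character has a code point of 127 or below."""
    return all(ord(ch) <= 127 for ch in name)


def filter_english(rows, name_idx):
    """Keep rows whose name column passes the ASCII check (rows must not include a header)."""
    return [row for row in rows if is_ascii_name(row[name_idx])]


# Made-up header-less samples: Google Play keeps the name in column 0,
# the App Store keeps the id in column 0 and the name in column 1
android_sample = [['Instagram', 'SOCIAL'], ['爱奇艺PPS', 'VIDEO_PLAYERS']]
apple_sample = [['389801252', 'Instagram'], ['1', '爱奇艺PPS']]

google_english = filter_english(android_sample, name_idx=0)
apple_english = filter_english(apple_sample, name_idx=1)
print(len(google_english), len(apple_english))  # 1 1
```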

In [21]:
len(google_data), len(apple_data)

(10841, 7198)

### Isolating free apps

In [22]:
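google_free = []
apple_free = []

# neither filtered list has a header row, so slicing from 1 skips the first app in each
for app in google_filter[1:]:
    app_type = app[6]  # 'Type' column
    
    if app_type == "Free":
        google_free.append(app)

for app in apple_filter[1:]:
    app_type = app[4]  # 'price' column
    
    if app_type == "0.0":
        apple_free.append(app)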

In [23]:
len(google_free), len(apple_free)

(8405, 4055)
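
The cell above compares strings: `'Free'` in the `Type` column for Google Play and `'0.0'` in the `price` column for the App Store. A hedged alternative is to compare prices numerically, so that `'0'`, `'0.0'` and `'0.00'` are treated alike. The helper `is_free` is hypothetical, and it assumes that a paid price may carry a leading `$`.

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    # assumption: the only non-numeric character that can appear is a '$' sign
    return float(price.replace('$', '')) == 0.0


print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```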

## Validation strategy for an app idea

Building apps can be a time-consuming process, and we want an idea of how successful an app is likely to be before deciding whether to proceed. We therefore build a minimal Android version of the app, add it to Google Play and see how it performs. If it is doing well after 6 months, we add an iOS version.

In order for this strategy to work, we need to find app profiles that are successful on both markets. One way to do this might be to look at the most common genres for each market.

In [33]:
# Let's print out the headers for both data sets
print(google_data[0]), print('\n'), print(apple_data[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


(None, None, None)

We can get an idea of the most common genres in each market by tallying up the **Genres** and **Category** columns for the Google Play apps, and the **prime_genre** column for the iOS apps.

In [25]:
def freq_table(dataset,index):
    '''
    This function accepts a list and an index value, and provides a relative frequency of a given data column in the data 
    '''
    freq_tab = {}
    
    for app in dataset:
        value = app[index]
        
        if value in freq_tab:
            freq_tab[value] += 1
        else:
            freq_tab[value] = 1
    
    num_apps = len(dataset)
    for key in freq_tab:
        freq_tab[key] = (freq_tab[key]/num_apps) * 100
        
    return freq_tab

In [42]:
def display_table(dataset, index):
    '''
    This function creates a frequency table and prints out the relative frequency 
    '''
    
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        #print(entry[1], ' : ', entry[0])
        print("{} : {:.2f}".format(entry[1].capitalize(), entry[0]) )
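
The same relative-frequency table can be sketched with `collections.Counter` from the standard library, sorting on the percentage directly instead of building `(value, key)` tuples. The function names and the sample rows below are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage of rows taking each value in the given column."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count / total * 100 for key, count in counts.items()}


def sorted_table(dataset, index):
    """List of (value, percentage) pairs, largest percentage first."""
    table = freq_table_counter(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up rows: (name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in sorted_table(sample, 1):
    print('{} : {:.2f}'.format(genre, pct))
```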

In [43]:
# display the frequency table for Genres (Google)
display_table(google_free,9)

Tools : 8.57
Entertainment : 6.09
Education : 5.39
Business : 4.71
Productivity : 3.97
Lifestyle : 3.88
Finance : 3.74
Medical : 3.64
Sports : 3.33
Personalization : 3.31
Communication : 3.22
Health & fitness : 3.13
Action : 3.12
Photography : 3.01
News & magazines : 2.80
Social : 2.67
Travel & local : 2.31
Shopping : 2.25
Books & reference : 2.19
Simulation : 2.08
Dating : 1.83
Arcade : 1.83
Casual : 1.77
Video players & editors : 1.74
Maps & navigation : 1.36
Food & drink : 1.20
Puzzle : 1.13
Racing : 1.02
Role playing : 0.94
Auto & vehicles : 0.94
Libraries & demo : 0.90
Strategy : 0.89
House & home : 0.81
Weather : 0.80
Events : 0.71
Adventure : 0.65
Beauty : 0.63
Art & design : 0.59
Comics : 0.56
Parenting : 0.50
Trivia : 0.42
Educational;education : 0.42
Card : 0.40
Educational : 0.38
Casino : 0.38
Board : 0.37
Education;education : 0.35
Word : 0.25
Music : 0.20
Casual;pretend play : 0.19
Puzzle;brain games : 0.18
Racing;action & adventure : 0.14
Casual;brain games : 0.14
Enterta

In [44]:
# display the frequency table for Category (google)
display_table(google_free,1)

Family : 18.80
Game : 9.61
Tools : 8.58
Business : 4.71
Productivity : 3.97
Lifestyle : 3.89
Finance : 3.74
Medical : 3.64
Personalization : 3.31
Sports : 3.26
Communication : 3.22
Health_and_fitness : 3.13
Photography : 3.01
News_and_magazines : 2.80
Social : 2.67
Travel_and_local : 2.31
Shopping : 2.25
Books_and_reference : 2.19
Dating : 1.83
Video_players : 1.76
Maps_and_navigation : 1.36
Food_and_drink : 1.20
Education : 1.17
Entertainment : 0.94
Auto_and_vehicles : 0.94
Libraries_and_demo : 0.90
House_and_home : 0.81
Weather : 0.80
Events : 0.71
Parenting : 0.65
Art_and_design : 0.64
Beauty : 0.63
Comics : 0.57


In [39]:
# display the frequency table for prime_genre (apple)
display_table(apple_free,11)

Games : 55.66
Entertainment : 8.24
Photo & Video : 4.12
Social Networking : 3.50
Education : 3.26
Shopping : 2.98
Utilities : 2.69
Lifestyle : 2.32
Finance : 2.07
Sports : 1.95
Health & Fitness : 1.87
Music : 1.65
Book : 1.63
Productivity : 1.53
News : 1.43
Travel : 1.38
Food & Drink : 1.06
Weather : 0.76
Reference : 0.49
Navigation : 0.49
Business : 0.49
Catalogs : 0.22
Medical : 0.20


Analyzing the genres for the iOS apps, `Games` make up about **56%** of the free apps, with `Entertainment` a distant second at about **8%**. Interestingly, the top 4 genres are entertainment or social media related. It seems apps are predominantly designed for entertainment rather than practical purposes. Although gaming apps are the most numerous, this does not imply a larger number of users. Checking the user rating counts might give a sense of how much interaction occurs between users and the apps.

User preferences are not as clear-cut for the Google Play apps, which show a mix of practical and entertainment genres. However, the top genres lean towards practical apps. The most common genre was `Tools`, but its share (about 8.6%) did not differ much from the other top genres. Comparing the two stores, the App Store leans predominantly towards entertainment, while Google Play favours productivity apps. We would need to look at user counts to make a more reasonable assessment.

In [62]:
# get a list of the genres and their relative frequency from apple store
app_genres = freq_table(apple_free,11)
#app_genres
#app_genres = sorted(app_genres, reverse = True)

for genre in app_genres:
    total = 0 # store sum of user ratings
    len_genre = 0 # number of apps specific to each genre
    
    for app in apple_free:
        genre_app = app[11]
        
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
            
    #print(f'{genre} has an average no of user ratings of: {total/len_genre}')
    print('On Average, {:,.2f} App store users provided ratings for the {} genre.'.format(total/len_genre,genre))        
            

On Average, 27,249.89 App store users provided ratings for the Photo & Video genre.
On Average, 18,924.69 App store users provided ratings for the Games genre.
On Average, 56,482.03 App store users provided ratings for the Music genre.
On Average, 32,503.56 App store users provided ratings for the Social Networking genre.
On Average, 67,447.90 App store users provided ratings for the Reference genre.
On Average, 19,952.32 App store users provided ratings for the Health & Fitness genre.
On Average, 47,220.94 App store users provided ratings for the Weather genre.
On Average, 14,010.10 App store users provided ratings for the Utilities genre.
On Average, 20,216.02 App store users provided ratings for the Travel genre.
On Average, 18,746.68 App store users provided ratings for the Shopping genre.
On Average, 15,892.72 App store users provided ratings for the News genre.
On Average, 25,972.05 App store users provided ratings for the Navigation genre.
On Average, 8,978.31 App store users pr

Entertainment-oriented genres such as Social Networking and Photo & Video had more user ratings per app on average, which we use here as a proxy for engagement, than practical genres such as Utilities and News, although Reference and Weather are notable exceptions. The Medical genre had the lowest user engagement across all the genres. A recommended area for app development would be music, since it had one of the highest averages (about 56,000 ratings per app, second only to Reference among the genres shown above).
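
The per-genre averages can also be computed in a single pass over the rows and then sorted, which makes the ranking easier to read than the unsorted printout. This is a sketch only; `average_by_group` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Mean of a numeric column for each distinct value of a grouping column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: (genre, rating count)
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
averages = average_by_group(sample, 0, 1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print('{}: {:,.2f}'.format(genre, avg))
```

For the App Store data in this study, the equivalent call would group on column 11 (`prime_genre`) and average column 5 (`rating_count_tot`).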

Now, let's compute the average number of installs per app for each Google Play category

In [63]:
category_freq = freq_table(google_free,1)

for cat in category_freq:
    total = 0
    len_category = 0
    
    for app in google_free:
        category_app = app[1]
        
        if category_app == cat:
            no_installs = app[5]
            no_installs = no_installs.replace('+','')
            no_installs = no_installs.replace(',','')
            
            total += float(no_installs)
            len_category += 1
    #print(f'{cat} has an average number of installs of: {total/len_category}')
    print('On Average, {:,.2f} installs per app were recorded for the {} category.'.format(total/len_category,cat))        

On Average, 1,077,983.33 installs per app were recorded for the ART_AND_DESIGN category.
On Average, 645,317.23 installs per app were recorded for the AUTO_AND_VEHICLES category.
On Average, 513,151.89 installs per app were recorded for the BEAUTY category.
On Average, 8,504,745.98 installs per app were recorded for the BOOKS_AND_REFERENCE category.
On Average, 1,602,958.31 installs per app were recorded for the BUSINESS category.
On Average, 880,440.62 installs per app were recorded for the COMICS category.
On Average, 36,106,662.33 installs per app were recorded for the COMMUNICATION category.
On Average, 764,959.46 installs per app were recorded for the DATING category.
On Average, 1,844,897.96 installs per app were recorded for the EDUCATION category.
On Average, 12,346,329.11 installs per app were recorded for the ENTERTAINMENT category.
On Average, 232,885.83 installs per app were

The categories with the most installs per app were `Communication`, `Entertainment`, `Social` and `Game`. This is similar to the pattern we observed in the App Store: these categories and genres not only have a large audience, but also make up a significant portion of the apps in those stores.
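
The `Installs` values are open-ended buckets such as `10,000+`, so each parsed number is a lower bound and the per-category averages are approximations. The parsing step from the cell above can be isolated as a small function; the name `parse_installs` is hypothetical.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('10,000+'))     # 10000
print(parse_installs('5,000,000+'))  # 5000000
```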

## Conclusion

Using a dataset containing information on apps from both the App Store and Google Play Store, I have been able to paint a reasonable picture of the distribution of the different apps within those online repositories. My analysis showed that the apps users interact with most lie predominantly within the entertainment, music, social media and gaming industries. Therefore, if our focus is to maximize ad revenue, we are better served targeting those markets.

Of course, our choices are not limited to the most popular markets. During the course of the analysis, we saw reasonable user engagement in other categories that may not be as competitive. This might provide an opportunity to enter a niche market, depending on the product offered. My conclusion is to target the music industry, as well as to look into productivity apps, since they showed some promise. Of course, in order to be competitive, we also need some truly innovative features in our app. Nevertheless, we are better equipped to design a successful app with this analysis.