# Profitable App Profiles: Which ones contributed most to overall performance?


Our aim in this project is to analyze how our apps perform on the App Store and Google Play. Because our apps are free and our main source of revenue is in-app ads, the analysis focuses on the volume of traffic visiting and using our apps. Specifically, we will try to answer the following questions:

> 1. Which apps attracted the highest volume of visits?
> 2. Why are these apps more favorable?
> 3. What are the opportunities for our apps?

## About the dataset

Our analysis uses two datasets: one containing data about approximately ten thousand Android apps from Google Play, and another containing data about approximately seven thousand iOS apps from the App Store. You can download the datasets directly from:

Android apps: https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

iOS apps: https://dq-content.s3.amazonaws.com/350/AppleStore.csv

Using these publicly available data saves us a great deal of time and cost. While exploring them, we can also test our code before applying it to our real dataset. Let's start by opening the datasets and exploring them.

In [1]:
from csv import reader

## Open the Applestore data
opened_file = open(r"C:\Users\hoang\Desktop\New folder\AppleStore.csv", encoding="utf8")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

## Open the Google Play data
opened_file = open(r"C:\Users\hoang\Desktop\New folder\googleplaystore.csv", encoding = "utf8")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
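
The loading cell above leaves both files open and depends on an absolute Windows path. A minimal sketch of an alternative, assuming the CSV files sit next to the notebook; `load_csv` is a hypothetical helper, not part of the original notebook:

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, encoding="utf8", newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Assumed relative paths; adjust to wherever the files were downloaded.
# ios_header, ios = load_csv("AppleStore.csv")
# android_header, android = load_csv("googleplaystore.csv")
```

The `with` block closes each file as soon as its rows are read, so the second `open` no longer overwrites a handle that was never closed.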


In [2]:
## Explore function to help you have a quick look at specific slices of data

def explore_data(dataset, start, end, rows_and_columns = False):
    data_slice = dataset[start:end]
    
    for row in data_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print("The number of rows: ", len(dataset))
        print("The number of columns: ", len(dataset[0]))

In [3]:
## Let's explore the first 2 rows of our datasets
print('iOS Dataset: ')
print('\n')
print(ios_header)
print('\n')
explore_data(ios,0,2, True)

iOS Dataset: 


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


The number of rows:  7197
The number of columns:  16


In [4]:
print('Android Dataset: ')
print('\n')
print(android_header)
print('\n')
explore_data(android,0,2, True)

Android Dataset: 


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


The number of rows:  10841
The number of columns:  13


## Check duplicates & wrong inputs

For our purposes, the rating_count_tot column of the iOS dataset should help identify the most popular apps, and we can explore the reasons behind their popularity by analyzing other attributes (such as average rating, app size, and category). Similarly, in the Android dataset, the number of installs indicates the most attractive apps. But first, we need to clean the data to ensure accuracy.

In [5]:
# To catch all duplicates and malformed rows in our datasets, build functions to detect them:

def duplicate_check(dataset):
    unique_apps = []
    duplicate_list =[]

    for row in dataset:
        apps_nameid = row[0]
        if apps_nameid not in unique_apps:
            unique_apps.append(apps_nameid)
        else:
            duplicate_list.append(apps_nameid)
    return duplicate_list

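def row_length_check(dataset):
    row_length = len(dataset[0])
    row_diff = []
    # enumerate gives each row's own index; dataset.index(row) would return
    # the first matching row, which is wrong if a malformed row is repeated
    for row_index, row in enumerate(dataset):
        if len(row) != row_length:
            row_diff.append(row_index)

    return row_diff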

def data_check(dataset):
    print('Number of duplicates: ', len(duplicate_check(dataset)))
    print('\n')      
    print('Number of rows with different length: ', len(row_length_check(dataset)))
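
`duplicate_check` tests membership in a list, which gets slower as the dataset grows. A sketch of the same check with `collections.Counter`; `duplicate_counts` is a hypothetical name and the sample rows are made up for illustration:

```python
from collections import Counter

def duplicate_counts(dataset, key_index=0):
    """Map each repeated key to the number of times it appears."""
    counts = Counter(row[key_index] for row in dataset)
    return {key: n for key, n in counts.items() if n > 1}

# Made-up rows, not taken from the real dataset
sample = [["Instagram", "1"], ["Slack", "2"], ["Instagram", "3"], ["Instagram", "4"]]
dupes = duplicate_counts(sample)
print(dupes)                                # {'Instagram': 3}
# Extra rows beyond the first occurrence, the figure duplicate_check counts
print(sum(n - 1 for n in dupes.values()))   # 2
```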
    
            

In [6]:
data_check(ios)

Number of duplicates:  0


Number of rows with different length:  0


In [7]:
data_check(android)

Number of duplicates:  1181


Number of rows with different length:  1


In [8]:
row_diff = row_length_check(android)
print(row_diff)


[10472]


In [9]:
# The app in row 10472 is missing its category, which shifts the other columns to the left.
# Delete the row from the dataset (highest index first, so one deletion does not shift the next).
for i in sorted(row_diff, reverse=True):
    del android[i]

In [10]:
# We check the data again to make sure there are no rows with a different length after deleting row 10472
data_check(android)

Number of duplicates:  1181


Number of rows with different length:  0


In [11]:
# Now, we will explore more about duplicates in Android data
for app in android:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


For the duplicates in the Android data, we can see that the rows differ in the number of reviews. Because the number of reviews normally grows over time, it is better to keep the row with the highest review count than to delete duplicates at random. There are two ways to do that:

> 1. Iterate over the app names returned by duplicate_check(dataset), collect the review counts of each duplicated app in a list, and keep the row whose review count equals the maximum of that list.
> 2. Create a dictionary that maps every unique app to its maximum review count, then iterate over the dataset and keep only the rows that match the dictionary.

Let's try the dictionary approach.

In [12]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
    

In [13]:
print(len(android_clean))

9659
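
The two passes above can also be folded into one by storing the best row per app name directly. A sketch under the same assumption (the row with the most reviews is the one to keep); `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened or invented for illustration:

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Illustrative rows with only the first four columns
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Some Other App", "BUSINESS", "4.4", "500"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]
cleaned = keep_most_reviewed(sample)
print(len(cleaned))   # 2
```

Because the dictionary is keyed by name, no separate `already_added` list is needed. The result is ordered by the first appearance of each name, which can differ from the row order the two-pass version produces.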


### Clean app names

Suppose we want to focus on apps aimed at English-speaking users, so we are not interested in apps with non-English names. To remove these apps from our dataset, we need a function that compares the characters of an app name against the ASCII range. Characters commonly used in English text have code points from 0 to 127. The idea, therefore, is to count the characters of a name that fall outside this range and drop the app when there are too many of them.

In [14]:
# For English characters, the code point range is between 0 and 127

def detect_appname(string, a, b):
    # a and b are the lowest and highest code points treated as English.
    # To minimize data loss, we only reject names with more than 3 characters outside that range.
    n = 0
    for char in string:
        if not a <= ord(char) <= b:
            n += 1
    return n <= 3
    

In [15]:
detect_appname('爱奇艺PPS -《欢乐颂2》电视剧热播', 0, 127)

False
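
To see why a tolerance of three characters is useful, here is a small self-contained sketch. `is_english` is a hypothetical stand-in for `detect_appname` with the range fixed to ASCII, and the first two names are made up to represent English app names that contain an emoji or a trademark sign:

```python
def is_english(name, max_non_ascii=3):
    """True if at most `max_non_ascii` characters fall outside ASCII (0-127)."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_english("Instachat 😜"))                    # True: 1 non-ASCII character
print(is_english("Docs To Go™ Free Office Suite"))   # True: 1 non-ASCII character
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))    # False: 13 non-ASCII characters
```

A stricter rule that rejects any non-ASCII character would throw away the first two names as well.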

In [16]:
ios_englishapps = []
android_englishapps =[]

for row in ios:
    apps_name = row[1]
    if detect_appname(apps_name, 0, 127):
        ios_englishapps.append(row)
print("Number of iOS english apps: ", len(ios_englishapps))

for row in android_clean:
    apps_name = row[0]
    if detect_appname(apps_name, 0, 127):
        android_englishapps.append(row)
print("Number of Android english apps: ", len(android_englishapps))

Number of iOS english apps:  6183
Number of Android english apps:  9614


In [17]:
# As mentioned before, we focus only on free apps in this analysis, so we now isolate the free apps in separate lists

ios_freeapps = []
android_freeapps = []

for row in ios_englishapps:
    price = float(row[4])
    if price == 0:
        ios_freeapps.append(row)
print("Number of iOS free apps: ", len(ios_freeapps))

for row in android_englishapps:
    price = row[7]
    if price == "0":
        android_freeapps.append(row)
print("Number of Android free apps: ", len(android_freeapps))

Number of iOS free apps:  3222
Number of Android free apps:  8864


## Analyze Data

With clean data in hand, we can begin exploring it to answer our questions. Our strategy for developing an app has three steps: first, a sample version is built and published on Google Play; this version is then evaluated based on user response; if it receives good reviews on Google Play, a more complete version is developed and promoted on the App Store as well. Overall performance is then evaluated across both platforms.

With this in mind, we will look at how frequently specific features occur to see which kinds of apps are most common. For example, we can find the most common genres in each market and then develop our own apps based on this insight.

In [18]:
# review data
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


The number of rows:  10840
The number of columns:  13


In [19]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


The number of rows:  7197
The number of columns:  16


In [20]:
# We will use category of Android data and prime_genre of ios data for our analysis
def freq_table(dataset, index):
    frequency_dict = {}
    percent_dict = {}
    
    for row in dataset:
        app_genre = row[index]
        if app_genre in frequency_dict:
            frequency_dict[app_genre] += 1
        else:
            frequency_dict[app_genre] = 1
    for genre in frequency_dict:
        percent_dict[genre] = (frequency_dict[genre] / len(dataset)) * 100
    
    return percent_dict

# Build a function to sort our result in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    tup_list = []
    for key in table:
        key_val_tup = (table[key], key)
        tup_list.append(key_val_tup)
    sorted_list = sorted(tup_list, reverse = True)
    for entry in sorted_list:
        print(entry[1], ':', entry[0])
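
The same table can be built more compactly with `collections.Counter`. This is a sketch rather than a drop-in replacement; `freq_table_pct` is a hypothetical name and the sample rows are invented:

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Invented rows: app name, genre
sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
for genre, pct in freq_table_pct(sample, 1):
    print(genre, ":", pct)
# Games : 75.0
# Music : 25.0
```

One difference to be aware of: `most_common()` lists tied values in the order they were first seen, whereas `display_table` breaks ties by sorting the names in reverse alphabetical order.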

        
    

In [21]:
display_table(ios_freeapps, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that Games is the most common genre among free iOS apps, with 58% of apps belonging to this category, followed by Entertainment. Keep in mind that we only look at free apps, so this result does not necessarily apply to the whole platform. Apps designed for entertainment (games, photo and video, social networking, sports, music) are more common than apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle). Developers pay the least attention to Navigation, Medical, and Catalogs apps. However, a large number of apps in a genre may reflect a large number of users for the genre as a whole, but it does not guarantee a large number of users for each app on average.

In [22]:
display_table(android_freeapps, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [23]:
display_table(android_freeapps, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

For free Android apps, Family is the most common category and Tools is the most common genre. The runners-up are Game and Entertainment, respectively. Note that Tools ranks only third as a category despite being the most common genre. Android apps for practical purposes attracted more attention from developers than their iOS counterparts: education, business, and medical apps outrank Entertainment as categories and have fairly high frequencies as genres.

In [24]:
# As mentioned before, the frequency table does not tell us how many users each category or genre has.
# In this part we use other attributes to estimate that: the number of installs in the Android data
# and the number of ratings in the iOS data.
ios_freq = freq_table(ios_freeapps, -5)

for genre in ios_freq:
    total = 0
    len_genre = 0
    for app in ios_freeapps:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1

    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Among all genres, Navigation had the highest average number of ratings. Interestingly, Navigation ranked very low in the lists of common genres and categories. Let's explore further to understand why.

In [None]:
for app in ios_freeapps:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

We notice that the average number of ratings for Navigation is strongly influenced by Waze - GPS Navigation and Google Maps. The other apps have far fewer ratings.
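
One way to quantify how much a few large apps pull an average up is to compare the mean with the median, which a handful of very large values barely moves. The median is not used in the original analysis; this is a sketch with invented rating counts, not figures from the dataset:

```python
from statistics import mean, median

# Invented rating counts for one genre: two giants and four small apps
rating_counts = [300000, 150000, 12000, 3000, 200, 10]

print(mean(rating_counts))     # pulled up by the two largest apps
print(median(rating_counts))   # 7500.0, closer to a typical app in the genre
```

When the mean is many times larger than the median, as here, the genre average says more about its biggest apps than about a typical one.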

Similarly, Reference had a very high average number of ratings, which is strongly affected by the rating counts of Bible, Dictionary, and Google Translate.

In [None]:
for app in ios_freeapps:
    if app[-5] == "Reference":
        print(app[1], ': ', app[5])

Social Networking and Music are among the most common genres and also had a high average number of ratings. The top apps in these two genres are Facebook, Pinterest, and Skype for Social Networking, and Pandora, Spotify, and Shazam for Music.

In [None]:
for app in ios_freeapps:
    if app[-5] == "Social Networking":
        print(app[1], ':', app[5])

Given that entertainment apps seem to attract more attention, we recommend developing an app in this genre. For example, an app where people can create their own music from a library of melodies and rhythms and share the result on social networks.

In [None]:

categories_android = freq_table(android_freeapps, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_freeapps:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
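
The nested loop above rescans the whole dataset once per category. A sketch of a single-pass version; `parse_installs` and `avg_installs_by_category` are hypothetical helpers, and the sample rows are invented with only the columns the function reads filled in:

```python
from collections import defaultdict

def parse_installs(text):
    """Turn an install bucket such as '1,000,000+' into a float lower bound."""
    return float(text.replace(",", "").replace("+", ""))

def avg_installs_by_category(dataset, category_index=1, installs_index=5):
    """Average installs per category, computed in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# Invented rows
sample = [
    ["app a", "SOCIAL", "", "", "", "1,000+"],
    ["app b", "SOCIAL", "", "", "", "3,000+"],
    ["app c", "TOOLS", "", "", "", "500+"],
]
print(avg_installs_by_category(sample))   # {'SOCIAL': 2000.0, 'TOOLS': 500.0}
```

Note that values such as `1,000,000+` are open-ended buckets, so any average computed from them is a lower-bound estimate rather than an exact install count.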

COMMUNICATION had the highest average number of installs. As with the iOS apps, let's check whether this number is skewed by a few very large apps.

In [None]:
for app in android_freeapps:
    if app[1] == "COMMUNICATION" and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ': ', app[5])
        

In [None]:
# Calculate the COMMUNICATION average after removing the top apps as outliers
under_100_m = []

for app in android_freeapps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

Compared with the original average of 38,456,119 installs, the figure drops roughly tenfold once the outliers are removed. Let's try the same with SOCIAL.

In [None]:
for app in android_freeapps:
    if app[1] == "SOCIAL" and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ': ', app[5])

In [None]:
# Calculate the SOCIAL average after removing the top apps as outliers
under_100_m = []

for app in android_freeapps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'SOCIAL') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)
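
The COMMUNICATION and SOCIAL cells repeat the same logic with a different category name. A sketch of a reusable version; `avg_installs_below` is a hypothetical helper and the sample rows are invented:

```python
def avg_installs_below(dataset, category, cap, category_index=1, installs_index=5):
    """Average installs for one category, ignoring apps at or above `cap`."""
    values = []
    for row in dataset:
        if row[category_index] != category:
            continue
        n_installs = float(row[installs_index].replace(",", "").replace("+", ""))
        if n_installs < cap:
            values.append(n_installs)
    # Return None rather than dividing by zero when no app is left
    return sum(values) / len(values) if values else None

# Invented rows with only the columns the function reads filled in
sample = [
    ["app a", "SOCIAL", "", "", "", "1,000,000,000+"],
    ["app b", "SOCIAL", "", "", "", "1,000+"],
    ["app c", "SOCIAL", "", "", "", "3,000+"],
]
print(avg_installs_below(sample, "SOCIAL", 100_000_000))   # 2000.0
```

With the real data, a call such as `avg_installs_below(android_freeapps, 'SOCIAL', 100000000)` would replace the cell above, and any other category can be checked the same way.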

The average number of installs for SOCIAL also drops more than sevenfold once the top apps are excluded. The dominance of top apps in these categories makes it very competitive for a newcomer to develop an app in the same category. Instead, apps in other genres such as Book or Family could be a good investment. However, we need to investigate these genres further before making a final decision.

## Conclusions

In this project, we explored datasets of apps on the App Store and Google Play and practiced some basic data cleaning and processing. Analyzing the data, we noticed that the average number of ratings or installs was strongly skewed by a few very popular apps. Among the most common genres and categories, the dominance of top apps makes it very competitive to develop a new app. However, more opportunities can be investigated in other entertainment genres.