## Profitable App Profiles for the App Store and Google Play Markets

This project explores which kinds of mobile apps are competitive in the market. Our company builds free apps and earns its revenue from ads displayed in them. Our goal is to analyze data to help our developers understand what types of apps are likely to attract more users.

### Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. Collecting data on that many applications would take a lot of time and money.

Instead, we use two smaller, readily available samples that fit our analysis:

- A data set containing data about approximately ten thousand Android apps from Google Play.<br>  You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).


- A data set containing data about approximately seven thousand iOS apps from the App Store.<br>   You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
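
The loading steps above are the same for both files and never close them, so they can be wrapped in a small helper. The sketch below is one way to do it: `load_csv` is a hypothetical name that does not appear elsewhere in this notebook, and `encoding='utf8'` is an assumption about how the two files are encoded.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file.

    Hypothetical helper; the utf8 default is an assumption about the files.
    """
    # The with-block closes the file even if reading fails part-way through.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Possible usage, assuming the two files sit next to the notebook:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.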

To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 3, True)



['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play data set has 10841 apps and 13 columns.
At a quick glance, the columns that might be useful for our analysis are <pre><code>'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.</code></pre>

In [3]:
print(ios_header)
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We have 7197 iOS apps in this data set, and the columns that seem interesting are:<pre><code>'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.</code></pre> The column names in the iOS data set can be hard to interpret, so see the [data set documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) for a description of each column.


### Deleting Wrong Data

The Google Play data set has a dedicated discussion [section](https://www.kaggle.com/lava18/google-play-store-apps/discussion). There we learned that row 10472 does not have a <code>'genre'</code> value, so we decided to remove it.

In [4]:
del android[10472] # don't run this more than once
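
Before deleting a row by its index, it can help to confirm which rows are actually malformed. The sketch below assumes that a faulty row shows up as a row with a different number of fields than the header; that assumption is not verified here for row 10472. `find_malformed_rows` is a hypothetical helper, and the rows are toy values rather than real data.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only (not taken from the real file).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['A', 'TOOLS', '4.1'],
    ['B', '4.5'],            # one field is missing, so the row is too short
    ['C', 'GAME', '3.9'],
]
print(find_malformed_rows(toy_rows, toy_header))
```

Looking up the row this way, rather than hard-coding the index, also makes the cell safe to run more than once.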

## Removing Duplicate Entries

### Part one

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [5]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [6]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We don't want to count any app more than once, so we need to keep a single entry per app. Rather than removing duplicate rows at random, we'll keep the row with the highest number of reviews: the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app


- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)
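
The two steps above can also be folded into a single pass over the data. The sketch below is one possible variant, not the approach used in the rest of this notebook. `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened versions of the Instagram entries shown earlier plus one made-up app.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict comparison means the first row wins when counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '120'],      # made-up row
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

Rows come back in the order each app name first appears, which may differ from the order produced by the two-step approach.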


Google Play dataset has duplicate entries ![Screenshot](https://s3.amazonaws.com/dq-content/350/py1m8_duplicates_new.png)


![Here](https://s3.amazonaws.com/dq-content/350/py1m8_fourth_col.png)

### Part two

Building the dictionary:

In [7]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = app[3]
    if 'M' in n_reviews:
        n_reviews = float(n_reviews[:-1]) * 1000000  # Remove the letter "M" and convert to a number
    else:
        n_reviews = float(n_reviews)
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
        
print(len(reviews_max))    

9659


In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [8]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews.

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        

Now let's quickly explore the new data set, and confirm that the number of rows is 9,659.

In [10]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Removing Non-English Apps
### Part One

If we explore the data long enough, we'll come across app names written in languages other than English:

In [11]:
print(ios[813][1])
print(ios[6731][1])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


The characters commonly used in English text are all encoded using the ASCII standard, and each ASCII character has a corresponding number between 0 and 127. We can get that number for any character with the built-in `ord()` function.

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters.
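
To see why a threshold is needed at all, it helps to count the non-ASCII characters in a few names. The sketch below uses a hypothetical helper, `count_non_ascii`, on names that appear elsewhere in this notebook. English names with a single emoji or trademark symbol still contain a non-ASCII character, so rejecting every such name would discard legitimate apps.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for char in text if ord(char) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in names:
    print(count_non_ascii(name), ':', name)
```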

In [12]:
def check_english(word):
    counter_words = 0
    for char in word:
        if (ord(char) > 127):
            counter_words += 1
            
    if counter_words > 3: 
        return False
    else:
        return True

In [13]:
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
True
True


In [14]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if check_english(name):
        android_english.append(app)
    
for app in ios:
    name = app[1]
    if check_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install. We need to isolate free apps from paid ones.

In [15]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222


We're left with 8864 Android apps and 3222 iOS apps.
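
Comparing prices as strings works only while free apps are written exactly as `'0'` or `'0.0'`. A sketch of a more tolerant check is shown below. `is_free` is a hypothetical helper, and the handling of a leading `$` is an assumption about how paid prices may be written.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: paid prices may carry a leading dollar sign, e.g. '$4.99'.
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Non-numeric values (for example a misplaced text field) count as not free.
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), '->', is_free(price))
```

With a helper like this, the same test could be applied to both data sets instead of one literal per store.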

## Most Common Apps by Genre
### Part One

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets.

### Part Two

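We'll build two functions to analyze the frequency tables: `freq_table()`, which returns the share of apps (as a percentage) for each value in a column, and `display_table()`, which prints those percentages in descending order.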

In [16]:
def freq_table(dataset, index):
    genres = {}
    for app in dataset:
        genre = app[index]
        if genre in genres:
            genres[genre] += 1

        else:
            genres[genre] = 1
            
    total_apps = len(dataset)
    for genre in genres:
        genres[genre] = round((genres[genre] / total_apps) * 100, 1)
    return genres
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
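
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the code used for the results in this notebook. `freq_table_counter` is a hypothetical name and the rows are toy values.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by count, descending.
    return {value: round(count / total * 100, 1)
            for value, count in counts.most_common()}

# Toy rows for illustration only: (name, genre).
toy_apps = [
    ['a', 'Games'],
    ['b', 'Music'],
    ['c', 'Games'],
    ['d', 'Games'],
]
for genre, share in freq_table_counter(toy_apps, 1).items():
    print(genre, ':', share)
```

Because dictionaries keep insertion order, the result is already sorted and needs no separate display step.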

     

In [17]:
display_table(ios_final, -5)

Games : 58.2
Entertainment : 7.9
Photo & Video : 5.0
Education : 3.7
Social Networking : 3.3
Shopping : 2.6
Utilities : 2.5
Sports : 2.1
Music : 2.0
Health & Fitness : 2.0
Productivity : 1.7
Lifestyle : 1.6
News : 1.3
Travel : 1.2
Finance : 1.1
Weather : 0.9
Food & Drink : 0.8
Reference : 0.6
Business : 0.5
Book : 0.4
Navigation : 0.2
Medical : 0.2
Catalogs : 0.1


Games clearly dominate the App Store apps in our sample, with more than 58% of the total. Entertainment (7.9%) and Photo & Video (5.0%) follow. Overall, the iOS apps are geared toward entertainment and leisure.

In [18]:
display_table(android_final, 1)

FAMILY : 18.9
GAME : 9.7
TOOLS : 8.5
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.9
FINANCE : 3.7
MEDICAL : 3.5
SPORTS : 3.4
PERSONALIZATION : 3.3
COMMUNICATION : 3.2
HEALTH_AND_FITNESS : 3.1
PHOTOGRAPHY : 2.9
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.7
TRAVEL_AND_LOCAL : 2.3
SHOPPING : 2.2
BOOKS_AND_REFERENCE : 2.1
DATING : 1.9
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.2
EDUCATION : 1.2
ENTERTAINMENT : 1.0
LIBRARIES_AND_DEMO : 0.9
AUTO_AND_VEHICLES : 0.9
WEATHER : 0.8
HOUSE_AND_HOME : 0.8
PARENTING : 0.7
EVENTS : 0.7
COMICS : 0.6
BEAUTY : 0.6
ART_AND_DESIGN : 0.6


On Google Play, the results are quite different. Apps in the 'FAMILY' category lead the way (18.9%), followed by 'GAME' (9.7%) and 'TOOLS' (8.5%).

'FAMILY' is a very general term. After a thorough check on Google Play, I found that these are mostly games for children.

Unlike the App Store, Google Play has many more apps with a practical purpose, e.g. tools, business, and lifestyle.

## Most Popular Apps by Genre on the App Store

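One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. The App Store data set has no installs column, so the code below uses the total number of user ratings (`rating_count_tot`, index 5) as a proxy.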

In [19]:

table = freq_table(ios_final,11)

for genre in table:
    len_genre = 0
    total = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            users = float(app[5])
            total += users
            len_genre += 1
    x = total / len_genre
           
    print(x, ':', genre)
    

71548.34905660378 : Social Networking
28441.54375 : Photo & Video
22788.6696905016 : Games
57326.530303030304 : Music
74942.11111111111 : Reference
23298.015384615384 : Health & Fitness
52279.892857142855 : Weather
18684.456790123455 : Utilities
28243.8 : Travel
26919.690476190477 : Shopping
21248.023255813954 : News
86090.33333333333 : Navigation
16485.764705882353 : Lifestyle
14029.830708661417 : Entertainment
33333.92307692308 : Food & Drink
23008.898550724636 : Sports
39758.5 : Book
31467.944444444445 : Finance
7003.983050847458 : Education
21028.410714285714 : Productivity
7491.117647058823 : Business
4004.0 : Catalogs
612.0 : Medical
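
The loop above scans the whole data set once for every genre. A single-pass alternative is sketched below. `average_by_group` is a hypothetical helper name, and the rows are toy values rather than entries from the data set.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows for illustration only: (name, genre, number of ratings).
toy_apps = [
    ['a', 'Games', '10'],
    ['b', 'Games', '30'],
    ['c', 'Music', '5'],
]
for genre, average in average_by_group(toy_apps, 1, 2).items():
    print(average, ':', genre)
```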


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps.

In [20]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
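
The Navigation figures above show how a couple of very large apps pull the mean upward. As a rough check, the sketch below compares the mean with the median for those six rating counts, using the standard-library `statistics` module. The median is not part of the original analysis; it is shown only to illustrate the skew.

```python
from statistics import mean, median

# Rating counts copied from the Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:', mean(navigation_ratings))
print('Median:', median(navigation_ratings))
```

The median sits far below the mean, which suggests that a typical navigation app in this sample has far fewer ratings than the genre average implies.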


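The same pattern applies to social networking apps, where the average is heavily influenced by a few giants like Facebook, Pinterest, and Skype. The same applies to music apps.

Reference apps have 74,942 user ratings on average, but the Bible and Dictionary.com are what actually skew the average upward.

For Google Play, we can use the `Installs` column directly. Its values are open-ended buckets such as `1,000,000+`, so we strip the commas and plus signs before converting them to numbers: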

In [23]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
    if avg_n_installs > 10000000:
        print(category)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
COMMUNICATION
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
ENTERTAINMENT
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
GAME
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SOCIAL
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
PHOTOGRAPHY
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TRAVEL_AND_LOCAL
TOOLS : 10801391.298666667
TOOLS
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PRODUCTIVITY
PARENTING : 54

In the Books and Reference category, it looks like only a few apps are very popular, so this market still shows potential. Let's try to get some app ideas from the apps that sit in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [24]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
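
The filter above matches each install bucket by its exact string. A sketch of a numeric version is shown below. `parse_installs` and `in_install_range` are hypothetical helpers, and the default bounds simply mirror the range mentioned in the text (at least 1,000,000 and below 100,000,000).

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))

def in_install_range(installs, low=1_000_000, high=100_000_000):
    """Return True when low <= installs < high."""
    return low <= parse_installs(installs) < high

# Two (name, installs) pairs copied from the output above, plus two made-up buckets.
examples = [
    ('Wikipedia', '10,000,000+'),
    ('Book store', '1,000,000+'),
    ('Made-up small app', '500,000+'),
    ('Made-up giant app', '100,000,000+'),
]
for name, installs in examples:
    print(name, ':', installs, '->', in_install_range(installs))
```

A numeric comparison keeps working if the range changes, whereas the string version needs every bucket listed by hand.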

## Conclusions


In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book.