# <b>Profitable App Profiles for the App Store and Google Play Markets<b>

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store. 

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## <b>Opening and Exploring the Data<b>

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead.


In [158]:
from csv import reader

opened_file=open('AppleStore.csv')
read_file=reader(opened_file)
apps_data=list(read_file)
ios_header=apps_data[0]
ios=apps_data[1:]

opened_file=open('googleplaystore.csv')
read_file=reader(opened_file)
apps_data=list(read_file)
android_header=apps_data[0]
android=apps_data[1:]




In [159]:
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
         print('Number of rows:', len(dataset))
         print('Number of columns:', len(dataset[0]))
print(android_header)
print('\n')
explore_data(android,0,4,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [160]:
print(ios_header)
print('\n')
explore_data(ios,0,4,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


# <b>Deleting incorrect datas <b>

In [161]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [162]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840


# <b>Deleting Duplicate Elements from Data sets<b>

## <b>Confirmation of Duplicate datas<b>

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries.

In [163]:
for app in android:
    data=app[0]
    if data =='Instagram':
        print(data)

Instagram
Instagram
Instagram
Instagram



Below we looped through the android data set (the Google Play data set), and for each iteration:
<ul>
<li>We saved the app name to a variable named name.</li>
<li>If name was already in the unique_apps list, we appended name to the duplicate_apps list.</li>
<li>Else (if name wasn't already in the unique_apps list), we appended name to the unique_apps list.</li>
</ul>

In [164]:
duplicate_apps=[]
unique_apps=[]
for app in android:
    data=app[0]
    if data in unique_apps:
        duplicate_apps.append(data)
    else:
        unique_apps.append(data)
print('No. of Duplicate Apps',len(duplicate_apps))
print('\n')
print('No. of unique apps',len(unique_apps))

No. of Duplicate Apps 1181


No. of unique apps 9659


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed two cells above for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:
<ul>

<li>Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app<\li>

<li>Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)<\li>
<\ul>

In [165]:
review_max={}
for app in android:
    name=app[0]
    n_reviews=float(app[3])
    if name in review_max and review_max[name]<n_reviews:
        review_max[name]=n_reviews
    elif name not in review_max:
        review_max[name]=n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [166]:
print("Expected length",len(android)-1181)
print("Actual length",len(review_max))

Expected length 9659
Actual length 9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:
<ul>

   <li> We start by initializing two empty lists, android_clean and already_added.<\li>

   <li> We loop through the android data set, and for every iteration <\li> 
   
   <\ul>
    
            We isolate the name of the app and the number of reviews.
   
            We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
           
            
            The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; 
            
            
The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [167]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (review_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [168]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


# <b>Removing Non-English Apps<b>

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. 
We're not interested in keeping these kind of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We built this function below, and we use the built-in ord() function to find out the corresponding encoding number of each character.

In [169]:

def eng(s):
    for character in s:
        if ord(character)>127:
            return False
        else:
            return True
print(eng('Instagram'))
print(eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))


True
False


But the function couldn't correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

## Improved Function

In [170]:
android_english=[]
ios_english=[]

def eng(s):
    c=0
    for character in s:
        if ord(character)>127:
            c=c+1
        if c>3:
                return False
            
        else :
            return True
print(eng('Docs To Go™ Free Office Suite'))
print(eng('Instachat 😜'))
for index in android_clean:
    name=index[0]
    if eng(name):
        android_english.append(index)
for index in ios:
    name=index[1]
    if eng(name):
        ios_english.append(index)

True
True


# Isolating the free apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process. On the next screen, we're going to start analyzing the data.




In [171]:
free_apps_android=[]
free_apps_ios=[]
for key in android_english :
    price=key[7]
    if price =='0':
        free_apps_android.append(key)
for key in ios_english:
    price=key[4]
    if price=='0.0':
        free_apps_ios.append(key)
print(len(free_apps_android))
print(len(free_apps_ios))

8905
4056


# Goal for making a Profitable App

## Introduction

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
<ul>
<li>Build a minimal Android version of the app, and add it to Google Play. </li>
<li>If the app has a good response from users, we then develop it further. </li>
<li>If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store. </li>
</ul>

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

## Motive

We'll build two functions we can use to analyze the frequency tables:
<ol>
<li>One function to generate frequency tables that show percentages</li>
<li>Another function that we can use to display the percentages in a descending order</li>
</ol>


### Note
We already learned to generate frequency tables that show percentages, and we're going to build a function for that in the exercise below. However, dictionaries don't have order, and it will be very difficult to analyze the frequency tables. We'll need to build a second function which can help us display the entries in the frequency table in a descending order.

To do that, we'll need to make use of the built-in sorted() function. This function takes in an iterable data type (like a list, dictionary, tuple, etc.), and returns a list of the elements of that iterable sorted in ascending or descending order (the reverse parameter controls whether the order is ascending or descending).
The sorted() function doesn't work too well with dictionaries because it only considers and returns the dictionary keys.
However, the sorted() function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. To ensure the sorting works right, the dictionary value comes first, and the dictionary key comes second:

The display_table() function you see below:
<ol>

<ul>Takes in two parameters: dataset and index. dataset is expected to be a list of lists, and index is expected to be an integer.</ul>

<ul>Generates a frequency table using the freq_table() function (which you're going to write as an exercise).</ul>

<ul>Transforms the frequency table into a list of tuples, then sorts the list in a descending order.</ul>

<ul>Prints the entries of the frequency table in descending order.</ul>

</ol>





In [172]:
def freq_table(data_set,index):
    f={}
    total=0
    for key in data_set:
        total+=1
        i=key[index]
        if i in f:
            f[i]+=1
        else:
            f[i]=1
    perc={}
    for key in f:
        percent=(f[key]/total)*100
        perc[key]=percent
    return perc
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [173]:
#For IOS DATASET
display_table(free_apps_ios,-5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


In [174]:
#For ANDROID DATASET
display_table(free_apps_android,1)

FAMILY : 18.97810218978102
GAME : 9.70241437394722
TOOLS : 8.433464345873105
BUSINESS : 4.581695676586187
LIFESTYLE : 3.9303761931499155
PRODUCTIVITY : 3.885457608085345
FINANCE : 3.6833239752947784
MEDICAL : 3.5148792813026386
SPORTS : 3.3801235261089273
PERSONALIZATION : 3.312745648512072
COMMUNICATION : 3.2341381246490735
HEALTH_AND_FITNESS : 3.065693430656934
PHOTOGRAPHY : 2.9421673217293653
NEWS_AND_MAGAZINES : 2.829870859067939
SOCIAL : 2.6501965188096577
TRAVEL_AND_LOCAL : 2.3245367770915215
SHOPPING : 2.2459292532285233
BOOKS_AND_REFERENCE : 2.1785513756316677
DATING : 1.8528916339135317
VIDEO_PLAYERS : 1.7967434025828188
MAPS_AND_NAVIGATION : 1.4149354295339696
FOOD_AND_DRINK : 1.235261089275688
EDUCATION : 1.167883211678832
ENTERTAINMENT : 0.9545199326221224
LIBRARIES_AND_DEMO : 0.9320606400898372
AUTO_AND_VEHICLES : 0.9208309938236946
HOUSE_AND_HOME : 0.8197641774284109
WEATHER : 0.7973048848961257
EVENTS : 0.7074677147669848
PARENTING : 0.6513194834362718
ART_AND_DESIGN : 0

# Most Favourite Apps by Genre

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot app.

In [175]:
gnr_ios=freq_table(free_apps_ios,-5)
for i in gnr_ios:
    total=0
    len_gen=0
    for app in free_apps_ios:
        g=app[-5]
        if g==i:
            n_rating=float(app[5])
            total+=n_rating
            len_gen+=1
    avg_n_ratings = total / len_gen
    print(i, ':', avg_n_ratings)
    

Weather : 47220.93548387097
Photo & Video : 27249.892215568863
Finance : 13522.261904761905
Entertainment : 10822.961077844311
Catalogs : 1779.5555555555557
Shopping : 18746.677685950413
Navigation : 25972.05
Music : 56482.02985074627
Travel : 20216.01785714286
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Health & Fitness : 19952.315789473683
News : 15892.724137931034
Utilities : 14010.100917431193
Games : 18924.68896765618
Education : 6266.333333333333
Medical : 459.75
Productivity : 19053.887096774193
Reference : 67447.9
Lifestyle : 8978.308510638299
Book : 8498.333333333334
Business : 6367.8
Social Networking : 53078.195804195806


In [176]:
categories_android = freq_table(free_apps_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_apps_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

GAME : 15551995.891203703
MEDICAL : 120550.61980830671
LIBRARIES_AND_DEMO : 638503.734939759
COMMUNICATION : 38322625.697916664
BUSINESS : 1708215.906862745
ENTERTAINMENT : 11640705.88235294
SOCIAL : 23253652.127118643
SHOPPING : 7001693.425
ART_AND_DESIGN : 1952105.1724137932
BOOKS_AND_REFERENCE : 8587351.855670104
DATING : 854028.8303030303
AUTO_AND_VEHICLES : 647317.8170731707
HEALTH_AND_FITNESS : 4188821.9853479853
PERSONALIZATION : 5183850.806779661
NEWS_AND_MAGAZINES : 9401635.952380951
PHOTOGRAPHY : 17772018.759541985
LIFESTYLE : 1436126.94
TOOLS : 10787009.952063914
EVENTS : 253542.22222222222
FOOD_AND_DRINK : 1924897.7363636363
WEATHER : 5074486.197183099
BEAUTY : 513151.88679245283
COMICS : 803234.8214285715
PARENTING : 542603.6206896552
MAPS_AND_NAVIGATION : 3993339.603174603
TRAVEL_AND_LOCAL : 13984077.710144928
PRODUCTIVITY : 16738957.554913295
SPORTS : 3638640.1428571427
FINANCE : 1387692.475609756
FAMILY : 3668870.823076923
HOUSE_AND_HOME : 1331540.5616438356
EDUCATION :

In [177]:
for app in free_apps_android:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

# Conclusion

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.
We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.