# Most Profitable Apps
This project aims to provide insight into the profiles of the most attractive apps in the App Store and Google Play, in order to (hopefully!) help developers create more popular apps.

There are over 4 million apps in the Apple App Store and Google Play combined. Analyzing data from all of them would be costly and time-consuming, so I will analyze samples instead.

There are two relevant public sources of data which seem suitable for the analysis: a [data set](https://www.kaggle.com/lava18/google-play-store-apps/home) from Google Play with data of approximately 10,000 Android apps (collected in August 2018) and a [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home#AppleStore.csv) from the App Store with data of approximately 7,000 iOS apps (collected in July 2017).

For the purposes of this analysis, the roughly one-year gap between the two collection dates should not be a significant source of differences between the results, as we assume the reasons some apps are more popular than others are relatively stable over short and medium periods of time.

## Opening and Exploring

In [2]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The iOS data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
iOS = list(read_file)
iOS_header = iOS[0]
iOS = iOS[1:]
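
As a side note, the two loading blocks above follow the same steps, so they could be folded into one helper. The following is only a minimal sketch: `load_csv` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes each file has a single header row.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file even if reading fails part-way.
    # newline='' is what the csv module documentation recommends for reader().
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Possible usage, assuming the files sit next to the notebook:
# android_header, android = load_csv('googleplaystore.csv')
# iOS_header, iOS = load_csv('AppleStore.csv')
```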

I started by opening and exploring both data sets. To make it easier to visualize, I used the following explore_data() function:

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        #this is only useful if we don't have a header
        #else, it will give a wrong count of rows and columns

In [4]:
explore_data(android, 0, 5, True)
print('\n')
explore_data(iOS, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


['1', '281656475', 'PAC-MAN Premi

Now that we have looked at part of each data set, I will print both headers to see the column names. This will help us understand which information is relevant to the analysis.

In [5]:
print('Header for Android apps:')
print(android_header)
print('\n')
print('Header for iOS apps:')
print(iOS_header)

Header for Android apps:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Header for iOS apps:
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


## Data Cleaning

I am only interested in apps that are aimed at an English-speaking audience and are free to download. Consequently, data cleaning will remove the apps that do not meet these requirements.

Data cleaning also involves removing incorrect data so that the results are reliable.

### Android discussion forums: missing info in a row and data duplication

In the discussion section of the Google Play data set it was [discovered that there is missing information](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in the category column of row 10472 (row 10473 if the header is counted). The missing value shifts every later column of that row one position to the left. I therefore print row 10472 to check whether the error actually exists, and delete the row from the data set if it is confirmed.

In [6]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
#Indeed, the category is missing (the rating '1.9' sits in its place). Besides, genre is blank.
del android[10472] #run this cell only once: a second run would delete a valid row

#Deleted. Wrong row is not there anymore!
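
The same kind of error can be looked for across the whole data set by comparing each row's length with the header's. This is only a sketch: `find_malformed_rows` is a hypothetical helper, and the two sample rows are shortened for illustration rather than taken from the real file.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose number of fields differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Shortened, illustrative rows: the second one lacks its category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
print(find_malformed_rows(sample_rows, sample_header))
# [(1, ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'])]
```

If several rows were reported, deleting them from the highest index to the lowest would keep the remaining indices valid.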

Another issue raised in the discussion section of the Google Play data set is app duplication.

We will proceed to explore duplicated apps to find whether they have the same information for every column (in which case we would only delete the extra apps) or not (in which case we would have to look further into the differing information, and may eventually need to reconsider the data set reliability).

In [8]:
first_names = []
duplicated = {}
dup_name = []
for apps in android:
    name = apps[0]
    if name in first_names:
        if name in duplicated:
            duplicated[name] += 1
        else:
            duplicated[name] = 1
            dup_name.append(name)
    else:
        first_names.append(name)

amount = len(first_names)
print('There are '+ str(amount) + ' apps when adjusting for duplications.')
amount_dup = len(duplicated)
print(str(amount_dup) + ' apps are duplicated at least once.')

There are 9659 apps when adjusting for duplications.
798 apps are duplicated at least once.
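
For larger data sets, the membership tests on lists used above become slow, because each `in` check scans the whole list. A possible alternative, sketched here with the hypothetical helper `count_duplicates`, is to let `collections.Counter` do the counting in a single pass. The sample rows are illustrative only.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of unique names, {name: number of extra copies})."""
    counts = Counter(row[name_index] for row in dataset)
    extra = {name: n - 1 for name, n in counts.items() if n > 1}
    return len(counts), extra

# Illustrative rows holding only the name column.
sample = [['Instagram'], ['Instagram'], ['Chrome Beta'], ['Instagram']]
unique, extra = count_duplicates(sample)
print(unique)  # 2
print(extra)   # {'Instagram': 2}
```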


Although nothing of relevance arose in the discussion forums for the iOS data, I am going to check that there are no duplicated apps in that data set.

In [9]:
iOS_first_names = []
iOS_duplicated = {}
iOS_dup_name = []
for apps in iOS:
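    name = apps[0] #caution: index 0 is the unnamed row-number column (see iOS_header); the app name, 'track_name', is index 2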
    if name in iOS_first_names:
        if name in iOS_duplicated:
            iOS_duplicated[name] += 1
        else:
            iOS_duplicated[name] = 1
            iOS_dup_name.append(name)
    else:
        iOS_first_names.append(name)

iOS_amount = len(iOS_first_names)
print('There are '+ str(iOS_amount) + ' apps when adjusting for duplications.')
iOS_amount_dup = len(iOS_duplicated)
print(str(iOS_amount_dup) + ' apps are duplicated')

There are 7197 apps when adjusting for duplications.
0 apps are duplicated


Voilà! No duplications in iOS.

Back to Android:
I know, from the discussion forums, that Instagram is one of the duplicated Android apps. I am going to check exactly how the data differs in the columns after the app's name, to see whether I can find a good criterion for elimination.

In [10]:
for apps in android:
    if ('Instagram' == apps[0]):
        print(apps)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Judging from Instagram's case, the only column that varies between the duplicated rows is Reviews, which suggests the data was collected at different points in time.

To check that this holds, I am going to look at three other duplicated apps.

In [11]:
print(dup_name[0])

for apps in android:
    if (dup_name[0] == apps[0]):
        print(apps)
        print('\n')
        
print('\n')
print(dup_name[299])

for apps in android:
    if (dup_name[299] == apps[0]):
        print(apps)
        print('\n')
        
print('\n')
print(dup_name[700])

for apps in android:
    if (dup_name[700] == apps[0]):
        print(apps)
        print('\n')

Quick PDF Scanner + OCR FREE
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']




Block Puzzle Classic Legend !
['Block Puzzle Classic Legend !', 'GAME', '4.2', '17039', '4.9M', '5,000,000+', 'Free', '0', 'Everyone', 'Puzzle', 'April 13, 2018', '2.9', '2.3.3 and up']


['Block Puzzle Classic Legend !', 'GAME', '4.2', '17044', '4.9M', '5,000,000+', 'Free', '0', 'Everyone', 'Puzzle', 'April 13, 2018', '2.9', '2.3.3 and up']




Chrome Beta
['Chrome Beta', 'PRODUCT

The cases above support the theory that the data was collected at different points in time. In some cases the number of reviews varies slightly and in others it stays the same, as in the first two rows printed for 'Quick PDF Scanner + OCR FREE'. This may be because users interact with these apps and with the store less than they do with Instagram, whose review count varies noticeably between observations.

Assuming the review count is the column most affected, I will remove the duplicates by keeping the row with the most reviews (which should normally be the most recent one) and deleting the others. This prioritizes the most up-to-date data for the analysis.

In [12]:
reviews_max = {}
for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
#The dictionary reviews_max now maps each app name to its highest review count.

clean_android = []
already_added = []
for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        clean_android.append(apps)
        already_added.append(name)
#Now clean_android has each app only once and every app in it should have complete info
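
The two passes above could also be merged into one by storing the best row seen so far for each name. This is a sketch under a few assumptions: `keep_most_reviewed` is a hypothetical name, the rows are shortened to four columns, and the result is ordered by each name's first appearance (dictionaries keep insertion order in Python 3.7+), which may differ slightly from the order of `clean_android`.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        # Strict '>' means that, on a tie, the earlier row is kept.
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened rows (name, category, rating, reviews) based on the output above.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Block Puzzle Classic Legend !', 'GAME', '4.2', '17039'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Block Puzzle Classic Legend !', 'GAME', '4.2', '17044'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
for row in keep_most_reviewed(sample):
    print(row)
```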

### English Characters

The analysis is of interest for English-speaking developers, so we will try to remove all apps that are clearly not in English.

Our first filter targets app names containing characters not commonly used in English. The [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system assigns the characters commonly used in English text the numbers 0 to 127. If an app's name contains characters numbered above 127, it is likely that the app is not in English (for example, the character '爱' has the Unicode code point 29,233, far outside the ASCII range, and is not used in English). Because English app names sometimes include an emoji or a symbol such as '™', the function below only rejects names with more than three such characters. Some apps that are in English may still be lost in the process, but what matters is that the remaining ones are in English (as far as we can tell, and as long as we are not left with too few apps).

Each character's associated number can be obtained with Python's built-in function ord().
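
As a quick illustration of what `ord()` returns (a Unicode code point, of which ASCII covers the first 128), the sketch below checks a few characters that appear in the app names used later. `is_ascii_char` is a hypothetical helper written only for this example.

```python
def is_ascii_char(char):
    """True if the character's code point falls in the ASCII range 0-127."""
    return ord(char) <= 127

for char in ['a', '™', '爱', '😜']:
    print(repr(char), ord(char), is_ascii_char(char))
# 'a' 97 True
# '™' 8482 False
# '爱' 29233 False
# '😜' 128540 False
```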

In [13]:
def english_character(string):
    non_ascii = 0
    for chars in string:
        if ord(chars) > 127: #ord() never returns a negative number, so only the upper bound needs checking
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True
    
#Checking the function:
print(english_character('Instagram')) #True
print(english_character('爱奇艺PPS -《欢乐颂2》电视剧热播')) #False
print(english_character('Docs To Go™ Free Office Suite')) #True
print(english_character('Instachat 😜')) #True

True
False
True
True


In [14]:
echar_clean_android = []
for apps in clean_android:
    name = apps[0]
    if english_character(name) == True:
        echar_clean_android.append(apps)

echar_clean_iOS = []
for apps in iOS:
    name = apps[2]
    if english_character(name) == True:
        echar_clean_iOS.append(apps)
        
print(len(echar_clean_android))
print(len(echar_clean_iOS))

9614
6183


### Free Apps

A second prerequisite for the usefulness of the analysis is that the apps are free.

Android has two columns that can indicate whether an app is free: Type (index 6) and Price (index 7). iOS has only price (index 5). In both data sets the code below treats a price stored as the string '0' as free. For simplicity, I will separate free from non-free apps using the price column in both cases and store the free apps in new lists. Then I will print the final number of apps in each list.

In [15]:
final_android = []
for apps in echar_clean_android:
    price = apps[7]
    if price == '0':
        final_android.append(apps)
        
final_iOS = []
for apps in echar_clean_iOS:
    price = apps[5]
    if price == '0':
        final_iOS.append(apps)
        
print(len(final_android))
print(len(final_iOS))

8864
3222
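
The comparison `price == '0'` only matches that exact string. If a file stored a free price differently, a numeric comparison would be more tolerant. The sketch below is an illustration under assumptions: `is_free` is a hypothetical helper, and the `'$'` prefix on paid prices is assumed rather than shown in the outputs above.

```python
def is_free(price_text):
    """True if a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price_text.strip().lstrip('$')  # '$' prefix is an assumption
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example, a value shifted from another column).
        return False

print(is_free('0'))         # True
print(is_free('0.0'))       # True
print(is_free('$4.99'))     # False
print(is_free('Everyone'))  # False
```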


## Data Analysis

The idea behind the analysis is for it to be useful to developers who intend to launch a basic version of an app on Google Play, develop it further if it attracts enough users and, if it becomes profitable after 6 months, release it on the App Store. Therefore, we must find the app profile that optimizes popularity and profitability across both markets.

In [16]:
def freq_table(dataset, index):
    dic = {}
    number_apps = len(dataset)
    for apps in dataset:
        col_value = apps[index]
        if col_value in dic:
            dic[col_value] += ((1/number_apps)*100)
        else:
            dic[col_value] = ((1/number_apps)*100)
    return dic

#Builds a freq table with percentages of appearance of values for the category of the index chosen.

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
#Displays the freq table as tuple values in descending order.

print('Table for android "Genre"')
display_table(final_android, 9)
print('\n')
print('Table for android "Category"')
display_table(final_android, 1)
print('\n')
print('Table for iOS "prime_genre"')
display_table(final_iOS, 12)

Table for android "Genre"
Tools : 8.449909747292507
Entertainment : 6.069494584837599
Education : 5.34747292418777
Business : 4.591606498194979
Productivity : 3.8921480144404565
Lifestyle : 3.8921480144404565
Finance : 3.7003610108303455
Medical : 3.5311371841155417
Sports : 3.46344765342962
Personalization : 3.3167870036101235
Communication : 3.2378158844765483
Action : 3.1024368231047053
Health & Fitness : 3.079873646209398
Photography : 2.944494584837555
News & Magazines : 2.7978339350180583
Social : 2.6624548736462152
Travel & Local : 2.3240072202166075
Shopping : 2.2450361010830324
Books & Reference : 2.14350180505415
Simulation : 2.041967509025268
Dating : 1.861462093862813
Arcade : 1.8501805054151597
Video Players & Editors : 1.771209386281586
Casual : 1.7599277978339327
Maps & Navigation : 1.398916967509025
Food & Drink : 1.2409747292418778
Puzzle : 1.1281588447653441
Racing : 0.9927797833935037
Role Playing : 0.9363718411552363
Libraries & Demo : 0.9363718411552363
Auto & Vehi

The ranking for the Genres column of the Android data shows that the most common genres are tools (8.45%), entertainment (6.07%) and education (5.35%), followed by business and productivity below 5% each. The three largest Android categories are family (18.91%), game (9.72%) and tools (8.46%), again followed by business, lifestyle and productivity below the 5% threshold. The results suggest that apps likely to do well in the Android market include educational tools merged with gaming for a family audience, although these shares measure how many apps exist in each genre rather than how many users they attract.

In the App Store, the most common genres are games (58.16%) and entertainment (7.88%). This time, business and productivity don't make it into the top ten, and lifestyle ranks 8th.

Data from both stores point to gaming and entertainment. However, Google Play has a stronger presence of practical tools.
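
The long decimal tails in the tables above (for example 8.449909747292507) come partly from adding `(1/number_apps)*100` once per app, which accumulates floating-point rounding. A possible alternative, sketched with the hypothetical helper `freq_table_pct`, counts first and divides once per value. The sample rows are illustrative only.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Map each value of column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields values from the most to the least frequent.
    return {value: count / total * 100 for value, count in counts.most_common()}

# Illustrative rows with a single genre column.
sample = [['Games'], ['Games'], ['Book'], ['Games']]
print(freq_table_pct(sample, 0))
# {'Games': 75.0, 'Book': 25.0}
```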

To get an idea of which apps have the most users, we will analyze the Installs column of the Android data set (number of downloads). As that information is not available for iOS, we will use the total number of user ratings as a proxy.

### iOS dataset

In [17]:
gen_iOS = freq_table(final_iOS, 12)
iOS_review_genre = []

for genre in gen_iOS:
    total = 0
    len_genre = 0
    for apps in final_iOS:
        genre_app = apps[12]
        if genre == genre_app:
            u_rating = float(apps[6])
            total += u_rating
            len_genre += 1
    avg_rating = total/len_genre
    tupl = (avg_rating, genre)
    iOS_review_genre.append(tupl)

ordered = sorted(iOS_review_genre, reverse = True)
for items in ordered:
    print(str(items[0]) + ': ' + str(items[1]))

86090.33333333333: Navigation
74942.11111111111: Reference
71548.34905660378: Social Networking
57326.530303030304: Music
52279.892857142855: Weather
39758.5: Book
33333.92307692308: Food & Drink
31467.944444444445: Finance
28441.54375: Photo & Video
28243.8: Travel
26919.690476190477: Shopping
23298.015384615384: Health & Fitness
23008.898550724636: Sports
22788.6696905016: Games
21248.023255813954: News
21028.410714285714: Productivity
18684.456790123455: Utilities
16485.764705882353: Lifestyle
14029.830708661417: Entertainment
7491.117647058823: Business
7003.983050847458: Education
4004.0: Catalogs
612.0: Medical


On average, navigation apps have the most reviews. However, two navigation apps (Waze and Google Maps) have an enormous number of reviews, which heavily skews the average.

In [18]:
for apps in final_iOS:
    if apps[12] == 'Navigation':
        print(apps[2] + ': ' + apps[6])

Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Geocaching®: 12811
ImmobilienScout24: Real Estate Search in Germany: 187
Railway Route Search: 5
CoPilot GPS – Car Navigation & Offline Maps: 3582
Google Maps - Navigation & Transit: 154911


The same applies to the giants in social networking, reference, music, weather and finance.

For a more accurate analysis we could look at the distribution of apps within each genre and account for apps with an atypical number of reviews. With this basic analysis, we can keep the following most-reviewed genres as a recommendation:
- Books
- Food & drink
- Photo & video
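
One simple way to account for atypical apps is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six navigation review counts printed earlier; it is an illustration, not part of the original analysis.

```python
from statistics import mean, median

# Review counts of the six free navigation apps printed above.
navigation_reviews = [345046, 12811, 187, 5, 3582, 154911]

print(mean(navigation_reviews))    # about 86090.33, as in the genre table
print(median(navigation_reviews))  # 8196.5
```

The median is roughly ten times smaller than the mean, which shows how strongly Waze and Google Maps pull the average up.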

### Android dataset

Analyzing the install data is more complex, because the values are ranges such as '10,000+'. However, we don't need very precise data for our analysis: we only need to identify the apps that attract the most users, even if their exact order within the group is unknown.

I will continue by calculating the average number of installs per app category.

First we need to strip the '+' and ',' characters from the data and convert it from strings to floats.

In [19]:
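for apps in final_android:
    installs = apps[5]
    installs = installs.replace('+', '')
    installs = installs.replace(',','')
    apps[5] = installs #store the cleaned string back in the row; the conversion to float happens in the next cell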
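
The two `replace()` calls above can be wrapped in a small function so that the cleaning rule lives in one place. `parse_installs` is a hypothetical helper, shown only as a sketch. The number it returns is the lower bound of the range, since '10,000+' means at least 10,000 installs.

```python
def parse_installs(text):
    """Convert an installs string such as '10,000+' to a float (lower bound)."""
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))         # 10000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
```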

In [20]:
install_freq = freq_table(final_android, 1)
cat_and_install = []

for cats in install_freq:
    total = 0
    len_category = 0
    for apps in final_android:
        category_app = apps[1]
        if category_app == cats:
            installs = apps[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',','')
            total += float(installs)
            len_category += 1
    avg_of_cat = total / len_category
    cat_and_install.append((avg_of_cat, cats))
    
ordered_candi = sorted(cat_and_install, reverse=True)
for candies in ordered_candi:
    print(candies)

(38456119.167247385, 'COMMUNICATION')
(24727872.452830188, 'VIDEO_PLAYERS')
(23253652.127118643, 'SOCIAL')
(17840110.40229885, 'PHOTOGRAPHY')
(16787331.344927534, 'PRODUCTIVITY')
(15588015.603248259, 'GAME')
(13984077.710144928, 'TRAVEL_AND_LOCAL')
(11640705.88235294, 'ENTERTAINMENT')
(10801391.298666667, 'TOOLS')
(9549178.467741935, 'NEWS_AND_MAGAZINES')
(8767811.894736841, 'BOOKS_AND_REFERENCE')
(7036877.311557789, 'SHOPPING')
(5201482.6122448975, 'PERSONALIZATION')
(5074486.197183099, 'WEATHER')
(4188821.9853479853, 'HEALTH_AND_FITNESS')
(4056941.7741935486, 'MAPS_AND_NAVIGATION')
(3695641.8198090694, 'FAMILY')
(3638640.1428571427, 'SPORTS')
(1986335.0877192982, 'ART_AND_DESIGN')
(1924897.7363636363, 'FOOD_AND_DRINK')
(1833495.145631068, 'EDUCATION')
(1712290.1474201474, 'BUSINESS')
(1437816.2687861272, 'LIFESTYLE')
(1387692.475609756, 'FINANCE')
(1331540.5616438356, 'HOUSE_AND_HOME')
(854028.8303030303, 'DATING')
(817657.2727272727, 'COMICS')
(647317.8170731707, 'AUTO_AND_VEHICLES'

Because install counts are only available as ranges, this ranking is less precise than the iOS analysis based on review counts. We will therefore take the top 10 categories in the Installs ranking and compare them with the earlier frequency results. Two of the most common Android categories, 'Tools' and 'Games', appear in this top 10 as well. 'Productivity' also ranks consistently well in both analyses and, although 'Communication' is only 11th in the genre frequency table, it stands out when it comes to install numbers.

Android's best performers:
- Tools
- Games
- Productivity
- Communication



## Conclusion

The final recommendation for an app profile that works well in both markets is 'Photo & Video'. This category ranks consistently well in each analysis of the Google Play and App Store data sets.

Examples in this category include apps that help users get the best results from photographs taken for social media, apps that easily make short videos out of similar photographs using metadata stored on the device (or in other apps), and apps that allow easy creation of memes, stickers and other popular shareable content.

As a second good option, one category that still performs rather well in both markets is 'Books'. There are many library apps in both markets, so special features are a must. For example: daily book quotes, book forums, book-exchange tools, access to authors' extra information on their books, curious facts, etc.