# Project: Profitable App Profiles for the App Store and Google Play Markets

## Goal: Analyse data to help app developers understand which types of apps are likely to attract users on Google Play and the App Store

**Opening and Exploring the Data**

In [2]:
import csv
opened_file_1 = open('AppleStore.csv', encoding='utf8')
opened_file_2 = open('googleplaystore.csv', encoding='utf8')

file_1 = csv.reader(opened_file_1)
file_2 = csv.reader(opened_file_2)

#Save file AppleStore as list of lists
a_list = list(file_1)
a_header = a_list[0]
a_file = a_list[1:]

#Save file GooglePlayStore as list of lists
g_list = list(file_2)
g_header = g_list[0]
g_file = g_list[1:]

#Create explore_data() function to print a slice of rows and, optionally, the number of rows and columns of a dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))    
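
The loading step above leaves both files open. As a minimal sketch, the same list-of-lists structure can be built with a `with` block so that the file is closed automatically. The helper name `load_csv` and the small sample file are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the real datasets
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows(
            [['App', 'Price'], ['Demo App', '0'], ['Paid App', '$1.99']]
        )
    header, rows = load_csv(sample_path)

print(header)
print(len(rows))
```

With the real files, the call would look like `a_header, a_file = load_csv('AppleStore.csv')`.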

*Info on dataset: Applestore*

In [3]:
explore_data(a_file,0,4,rows_and_columns=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7197
Number of columns:  16


In [4]:
print(a_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


More info on column names can be accessed [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

*Info on dataset: GooglePlayStore*

In [5]:
explore_data(g_file,0,4,rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  10841
Number of columns:  13


In [6]:
print(g_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


More info on column names can be accessed [here](https://www.kaggle.com/lava18/google-play-store-apps)

**Deleting Wrong Data**

The [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play dataset outlines an error at row 10472. Let's explore this.

In [7]:
print(g_file[10472])
print('\n')
print(g_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The Category value is missing from this row, so every later value is shifted one column to the left. As a result, the Rating field holds 19, although the maximum rating for a Google Play app is 5, and the row has 12 columns instead of 13. We therefore delete this row.

In [8]:
#Number of rows before deleting the row
print(len(g_file))
del g_file[10472]
#Number of rows after deleting the row
print(len(g_file))

10841
10840
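
The bad row above was found through the dataset's discussion page. As a rough sketch, rows like it can also be detected by comparing each row's length with the header's. The function name `find_malformed_rows` and the sample data are hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Hypothetical three-column data; the second row is missing a field
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Demo App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
    ['Other App', 'GAME', '3.9'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)
```

When deleting more than one row by index, deleting from the highest index downwards keeps the remaining indices valid.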


**Removing Duplicate Entries: Part One**

The Google Play dataset contains duplicate entries. The cell below counts them.

In [9]:
duplicate_entries = []
unique_entries = []

for row in g_file:
    name = row[0]
    if name in unique_entries:
        duplicate_entries.append(name)
    else:
        unique_entries.append(name)
print('Number of duplicate apps: ', len(duplicate_entries))

Number of duplicate apps:  1181
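
The count above checks membership in a list, which scans the whole list for every row. A possible alternative, sketched here with hypothetical names and sample data, tracks the names already seen in a set, where membership checks take roughly constant time.

```python
def count_duplicates(names):
    """Return (number of unique names, list of repeated occurrences)."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return len(seen), duplicates


# Hypothetical app names; 'Facebook' appears three times, 'Instagram' twice
sample_names = ['Facebook', 'Instagram', 'Facebook', 'Slack',
                'Instagram', 'Facebook']
n_unique, dupes = count_duplicates(sample_names)
print(n_unique, len(dupes))
```

For the Google Play dataset, the input would be `[row[0] for row in g_file]`.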


In [10]:
for row in g_file:
    name = row[0]
    if name == 'Facebook':
        print(row)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


The Facebook example above shows two entries for the same app. They differ only in the fourth field of the row, the number of reviews. Rather than removing duplicates at random, we will use this field as the criterion and keep the entry with the highest number of reviews.

**Removing Duplicate Entries: Part Two**

In [11]:
print(len(unique_entries))

9659


To remove the duplicates, we take the following steps:

`Step 1`. Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app

In [12]:
#Create a dictionary
reviews_max = {}

#loop through the Google Play dataset
for row in g_file:
    n_reviews = float(row[3])
    name = row[0]
    
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print('Length of the dictionary is: ', len(reviews_max))

Length of the dictionary is:  9659


`Step 2`. Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews)

In [13]:
#Create 2 empty lists
#List of clean dataset without duplicate entries
android_clean = []
#List to keep track of app names that are already added to android_clean list
already_added = []

#Loop through the Google Play dataset
for row in g_file:
    name = row[0]
    #Convert the number of reviews from a string to a float for comparison
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))

9659


The purpose of the loop is to go through each row in the Google Play dataset and check:
1. If the number of reviews is the same as the maximum number of reviews for that app (by way of comparison to values in the reviews_max), and
2. If the app name is not in the already_added list.
If these two conditions are satisfied, then:
1. The row will be added to the clean dataset list (i.e. android_clean)
2. The name of the app will be added to the already_added list

Why do we need an already_added list to keep track of the apps that are already in the android_clean list?

Attempt: The number of reviews is the criterion we chose for removing duplicates, but duplicate entries of an app can share the same number of reviews and differ only in other columns. When we loop through the dataset, the already_added list ensures that only the first occurrence of such an app is added, not the ones after it.

Dataquest answer: Correct - 'We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry'
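
The two steps and the tie case can be sketched on a small, made-up dataset. The function name `dedupe_by_max_reviews` and the sample rows are assumptions for illustration; the logic follows the two steps described above, with a set in place of the list for `already_added`.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the most reviews."""
    # Step 1: highest review count seen for each app name
    reviews_max = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Step 2: keep the first row that matches the maximum for its name
    clean = []
    already_added = set()
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean


# Hypothetical rows: 'App A' has two entries tied at 250 reviews
sample = [
    ['App A', 'X', '4.0', '100'],
    ['App A', 'X', '4.0', '250'],
    ['App A', 'X', '4.0', '250'],
    ['App B', 'Y', '3.5', '10'],
]
result = dedupe_by_max_reviews(sample)
print(len(result))
```

Without the `already_added` check, both tied 'App A' rows would pass the review-count test and the app would appear twice in the result.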

**Removing Non-English apps: Part One**

In [14]:
# Return False if any character in the string falls outside the
# ASCII range (code point > 127), i.e. is not a common English character
def string_function(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

print(string_function('Docs To Go™ Free Office Suite'))
print(string_function('Instachat 😜'))

False
False


**Removing Non-English apps: Part Two**

The previous function rejects any app name that contains an emoji or a special character such as ™, even when the name is otherwise English. To avoid this, we edit the function so that an app name is removed only if it has more than three characters whose code points fall outside the ASCII range (greater than 127).

The new function is as below:

In [15]:
def is_english(string):
    out_of_range = []
    for char in string:
        if ord(char) > 127:
            out_of_range.append(char)
    if len(out_of_range) > 3:
        return False
    else:
        return True

# Pseudocode:
#   count the characters in the string whose code point is > 127
#   if there are more than three such characters: return False
#   otherwise: return True

# Check the new function
result1 = is_english('Docs To Go™ Free Office Suite')
result2 = is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')
print(result1)
print(result2)

True
False

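
As a compact sketch of the same rule, the count of non-ASCII characters can be computed with a generator expression. The names `count_non_ascii`, `is_english_sketch` and the `max_non_ascii` parameter are hypothetical; the threshold of three comes from the text above.

```python
def count_non_ascii(string):
    """Number of characters whose code point lies outside the ASCII range."""
    return sum(1 for char in string if ord(char) > 127)


def is_english_sketch(string, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` such characters."""
    return count_non_ascii(string) <= max_non_ascii


examples = [
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in examples:
    print(name, ':', count_non_ascii(name), is_english_sketch(name))
```

The first two names contain a single non-ASCII character each and are kept, while the third contains far more than three and is filtered out. The rule is a heuristic: a short non-English name with three or fewer such characters would still pass.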

Use the new function to filter out non-English apps from both datasets

In [16]:
# Google Play store dataset

new_android = []
non_english = []
for row in android_clean:
    name = row[0]
    if is_english(name):
        new_android.append(row)
    else:
        non_english.append(row)

print('Number of English apps from Google Play store dataset: ', len(new_android))

Number of English apps from Google Play store dataset:  9614


In [17]:
# Applestore dataset

new_apple = []
for row in a_file:
    name = row[1]
    if is_english(name):
        new_apple.append(row)

print('Number of English apps from Applestore dataset: ', len(new_apple))

Number of English apps from Applestore dataset:  6183


**Isolating the free apps**

In [18]:
# Isolate free apps in a separate list

# Google Play store dataset
free_android = []
for row in new_android:
    price = row[7]
    if price == '0':
        free_android.append(row)
print(len(free_android))

# Apple store dataset
free_apple = []
for row in new_apple:
    price = row[4]
    if price == '0.0':
        free_apple.append(row)
print(len(free_apple))

8864
3222
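
The cell above compares prices as strings, which works only if a free price is always written exactly as '0' or '0.0'. A slightly more defensive sketch converts the price to a number first. The helper name `is_free` is hypothetical, and the handling of a leading dollar sign assumes that paid prices may be written like '$4.99'.

```python
def is_free(price_text):
    """Return True if a price string represents zero."""
    # Assumption: a price may carry a leading '$', e.g. '$4.99'
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '0.00', '$4.99', '1.99']:
    print(price, is_free(price))
```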


**Most common apps by genre: Part One**

Our validation strategy for an app idea is:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For example, we can begin the analysis by finding the most common genres in each market, which we can do by building frequency tables for a few columns in our datasets.

**Most common apps by genre: Part Two**

In [19]:
# Build a frequency table of percentages for one column

def freq_table(dataset, index):
    """Return a dict mapping each value in column `index` to its percentage of rows."""
    table = {}
    total_num = len(dataset)
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    # Convert the raw counts to percentages
    for key in table:
        table[key] /= total_num
        table[key] *= 100
    return table

# Frequency table for the prime_genre column of the App Store dataset
res = freq_table(free_apple, 11)

# Frequency table for the Genres column of the Google Play dataset
res1 = freq_table(free_android, 9)

# Frequency table for the Category column of the Google Play dataset
res2 = freq_table(free_android, 1)


In [20]:
# Write the function display_table() to sort the percentages in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    # Transforms the frequency table into a list of tuples
    for key in table:
        val_as_tuple = (table[key], key)
        table_display.append(val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    # Print the tuples
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
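
For comparison, here is a minimal sketch of the same idea using `collections.Counter` from the standard library. The names `freq_table_pct` and `sorted_table` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value in column `index` to its percentage of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(dataset, index):
    """Return (value, percentage) pairs, highest percentage first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows: app name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in sorted_table(sample, 1):
    print(genre, ':', pct)
```

Sorting with a `key` avoids building the intermediate list of tuples. One difference to keep in mind is that entries with equal percentages stay in first-seen order here, whereas sorting the tuples orders ties by name.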


**Most common apps by genre: Part Three**

1. Analyse frequency table generated for the prime_genre column of the App Store dataset

In [21]:
display_table(free_apple, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


* What is the most common genre? What is the next most common?
    Games (about 58%), followed by Entertainment (about 8%).
* What other patterns do you see?
    Games account for more than half of the free English apps; no other genre reaches 8%.
* What is the general impression: are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
    Apps designed for entertainment dominate the list.
* Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
    Not from this table alone. A large number of apps in a genre doesn't necessarily mean that the genre has a large number of users: the table counts apps, not users, and we are using only a subset of the real dataset, i.e. free apps in English.


2. Analyze the frequency tables you generated for the Category and Genres column of the Google Play dataset.

In [22]:
# Frequency table for category column
display_table(free_android, 1)


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [23]:
# Frequency table for genres column
display_table(free_android, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

* What are the most common genres? - Family in the Category column and Tools in the Genres column
* What other patterns do you see? - The frequencies are more evenly distributed among the categories/genres
* Compare the patterns you see for the Google Play market with those you saw for the App Store market. - There is a wider range of genres/categories in the Google Play market than in the App Store market. They are more evenly distributed and lean towards practical purposes rather than fun.
* Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users? - Not yet: the frequency tables reveal only the most frequent app genres, not which genres have the most users.

**Most Popular Apps by Genre on the App Store**

In [24]:
# Calculate the average number of user ratings per app genre on the App Store


# Frequency table of column prime_genre in Apple store dataset
freq_table_genre = freq_table(free_apple, 11)

for genre in freq_table_genre:
    # Initialise the sum of rating counts for this genre
    total = 0
    # Initialise the number of apps in this genre
    len_genre = 0
    for row in free_apple:
        genre_app = row[11]
        if genre_app == genre:
            num_user_rating = float(row[5])
            total += num_user_rating
            len_genre += 1
    # Calculate average number of user ratings per app genre
    avg_num = total / len_genre
    print(genre, avg_num)

Social Networking 71548.34905660378
Photo & Video 28441.54375
Games 22788.6696905016
Music 57326.530303030304
Reference 74942.11111111111
Health & Fitness 23298.015384615384
Weather 52279.892857142855
Utilities 18684.456790123455
Travel 28243.8
Shopping 26919.690476190477
News 21248.023255813954
Navigation 86090.33333333333
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Food & Drink 33333.92307692308
Sports 23008.898550724636
Book 39758.5
Finance 31467.944444444445
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Catalogs 4004.0
Medical 612.0
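
The nested loop above scans the whole dataset once per genre. As a sketch, the same averages can be computed in a single pass by accumulating a total and a count for each genre. The function name `average_by_group` and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Average of the numeric column `value_idx` for each value of `group_idx`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows: genre, number of ratings
sample = [['Navigation', '100'], ['Navigation', '300'], ['Games', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)
```

For the App Store dataset, the call would be `average_by_group(free_apple, 11, 5)`.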


*Analyse the results above*
Navigation has the highest average number of user ratings per app, followed by Reference and Social Networking. Given the frequency table earlier, apps in the Navigation and Reference genres are not common, each accounting for less than 1% of the free English apps in the App Store. However, they attract the most attention, with the highest average number of user ratings. Thus, the app profile recommendation for the App Store is to build an app in one of these two genres, which serve practical purposes more than fun.

*Dataquest's solution (have another read):*
Look and investigate more closely. Ask questions, such as which apps account for the highest number of ratings. Get the big picture.

In [25]:
for app in free_apple:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


-> The average is heavily influenced by a few giants: Waze and Google Maps account for about 97% of all ratings in the Navigation genre.
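
The effect of the two largest apps can be checked directly on the rating counts printed above. This sketch compares the mean with the median, which is much less sensitive to a few very large values.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

avg = mean(navigation_ratings)
mid = median(navigation_ratings)

# Share of all Navigation ratings held by the two largest apps
top_two = sorted(navigation_ratings, reverse=True)[:2]
top_two_share = sum(top_two) / sum(navigation_ratings)

print(avg)            # about 86090.33, matching the genre average above
print(mid)            # 8196.5
print(top_two_share)
```

The mean is roughly ten times the median, which suggests that the average alone overstates how many ratings a typical Navigation app receives.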

**Most Popular Apps by Genre on Google Play**

In [26]:
# Frequency table for the Category column of the Google Play dataset
freq_tbl_category = freq_table(free_android, 1)

for category in freq_tbl_category:
    total = 0
    len_category = 0
    for row in free_android:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_category += 1
    avg = total / len_category
    print(category, ':', avg)

    

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

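In the output shown, the category with the highest average number of installs per app is COMMUNICATION (about 38.5 million), followed by VIDEO_PLAYERS (about 24.7 million) and SOCIAL (about 23.3 million). GAME also ranks highly, at about 15.6 million.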
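
The install counts are stored as strings such as '10,000+', so they have to be cleaned before any arithmetic. A minimal sketch of that conversion follows; the helper name `parse_installs` is hypothetical.

```python
def parse_installs(text):
    """Convert an install bracket such as '1,000,000+' to an integer."""
    return int(text.replace('+', '').replace(',', ''))


for bracket in ['10,000+', '5,000,000+', '1,000,000,000+']:
    print(bracket, '->', parse_installs(bracket))
```

Because each value is the lower bound of a bracket, averages built from these numbers are approximate.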

Let's explore the apps in category: Communication

In [27]:
for row in free_android:
    category = row[1]
    if category == 'COMMUNICATION' and row[5] == '1,000,000,000+':
        print(row[0])

WhatsApp Messenger
Messenger – Text and Video Chat for Free
Skype - free IM & video calls
Google Chrome: Fast & Secure
Gmail
Hangouts


Let's explore the apps in category: Social

In [28]:
for row in free_android:
    category = row[1]
    if category == 'SOCIAL' and row[5] == '1,000,000,000+':
        print(row[0])

Facebook
Google+
Instagram


Let's explore the apps in category: Game

for row in free_android:
    category = row[1]
    if category == 'GAME' and (row[5] == '1,000,000,000+'
                               or row[5] == '500,000,000+'):
        print(row[0],':',row[5])

From the above, we can see that the market is dominated by a few popular apps run by giant companies which are hard to compete against. 
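
One way to gauge how much these giants distort a category average is to recompute it without them. The sketch below uses made-up install values, and the cut-off of 100 million installs is an arbitrary assumption, not a figure from the notebook.

```python
def to_number(installs_text):
    """Convert an install bracket such as '1,000,000+' to a float."""
    return float(installs_text.replace('+', '').replace(',', ''))


def average_installs(install_texts, cap=None):
    """Average installs, optionally ignoring apps at or above `cap`."""
    values = [to_number(text) for text in install_texts]
    if cap is not None:
        values = [value for value in values if value < cap]
    return sum(values) / len(values)


# Hypothetical category with one giant app
sample_installs = ['1,000,000,000+', '1,000,000+', '500,000+', '100,000+']

avg_all = average_installs(sample_installs)
avg_capped = average_installs(sample_installs, cap=100_000_000)
print(avg_all)
print(avg_capped)
```

In this made-up example, a single giant lifts the average from about half a million to about 250 million. The function assumes that at least one app remains below the cap; otherwise the division would fail.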

Let's explore the apps in category: Education

In [29]:
for row in free_android:
    category = row[1]
    # Print education apps that fall in the chosen install brackets
    if category == 'EDUCATION' and row[5] in ('1,000,000+', '500,000,000+'):
        print(row[0], ':', row[5])

Learn Spanish - Español : 1,000,000+
English for beginners : 1,000,000+
Learn Japanese, Korean, Chinese Offline & Free : 1,000,000+
Cars Coloring Pages : 1,000,000+
English speaking texts : 1,000,000+
Thai Handwriting : 1,000,000+
THAI DICT 2018 : 1,000,000+
Kanji test · Han search Kanji training (free version) : 1,000,000+
Free intellectual training game application | : 1,000,000+
PINKFONG Baby Shark : 1,000,000+
Udemy - Online Courses : 1,000,000+
edX - Online Courses by Harvard, MIT & more : 1,000,000+
Memorado - Brain Games : 1,000,000+
Lynda - Online Training Videos : 1,000,000+
Brilliant : 1,000,000+
CppDroid - C/C++ IDE : 1,000,000+
C++ Programming : 1,000,000+
C Programming : 1,000,000+
Udacity - Lifelong Learning : 1,000,000+
Learn C++ : 1,000,000+
Learn programming : 1,000,000+
Learn JavaScript : 1,000,000+
Learn Java : 1,000,000+
Learn HTML : 1,000,000+
Programming Hub, Learn to code : 1,000,000+
Learn SQL : 1,000,000+
Socratic - Math Answers & Homework Help : 1,000,000+
Lin

We can see there are many apps for learning languages, learning programming, and taking online courses. Building an app in the education category looks promising, as knowledge is always in high demand. This type of app can serve our purpose of attracting users in both markets. For example, we could build an app that presents the latest research findings, how they agree or disagree with common perception, and how they can be applied in the real world. People a 

**Conclusion**

In this project, we analysed data for apps in the Google Play store and the App Store to find a type of app that can be profitable in both markets. The conclusion we reach is that we can build an educational ap
