# Profitable App Profiles for the App Store and Google Play Markets

The aim of this project is to identify the app profiles that are likely to generate the most revenue through ads.

We want to find out which types of apps are likely to attract the most users, because the more users an app has, the more people see and engage with its ads.

# Opening and Exploring the Data

We will use two datasets for our project:
- A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing approximately 10,000 Android apps on Google Play. The dataset can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing approximately 7,000 iOS apps on the App Store. The dataset can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


Let's start by opening the two datasets, and then explore them.

In [1]:
import csv

def open_file(file):
    # Read the whole CSV into a list of rows; `with` closes the file afterwards
    with open(file, encoding = 'utf-8') as opened_file:
        read_file = csv.reader(opened_file)
        return list(read_file)

In [2]:
##Google Play Dataset##
android = open_file('googleplaystore.csv')
android_header = android[0]
android = android[1:]

##AppStore Dataset##
ios = open_file('AppleStore.csv')
ios_header = ios[0]
ios = ios[1:]

We create a function `explore_dataset` that makes it easier to explore the rows in a dataset. It also has an option to show the number of rows and columns.

In [3]:
def explore_dataset(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start: end]
    for app in dataset_slice:
        print(app)
        print('\n') #Adds empty lines between rows
        
    if rows_and_columns:
        print('rows = ', len(dataset))
        print('columns = ', len(dataset[0]))

In [4]:
print(android_header)
print('\n')
explore_dataset(android, 0, 3, rows_and_columns = True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


rows =  10841
columns =  13


The Google Play dataset has 10841 apps and 13 columns. At a quick glance, the columns that could help us in our analysis are `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, `'Genres'`.

In [5]:
print(ios_header)
print('\n')
explore_dataset(ios, 0, 3, rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


rows =  7197
columns =  16


The iOS dataset has 7197 apps and 16 columns. The columns that will be helpful in our analysis are: `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, `'prime_genre'`. Not all column names are self-explanatory, but details about each column can be found in the dataset [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

# Deleting wrong data
The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion). One of the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error caused by a missing value in one of the rows. Let's find that row by looking for rows whose length differs from the header's.

In [6]:
for index, row in enumerate(android):
    if len(row) != len(android_header):
        # enumerate gives the row's own position, even if an identical row appears earlier
        print('Index position is: ', index, '\n')
        print('The row has', len(row), 'values', '\n') #The header has 13 columns

Index position is:  10472 

The row has 12 values 



Let's print it and check against the header row and another row that is correct.

In [7]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*. Compared with the header, its values are shifted one position to the left: the `'Category'` value is `'1.9'` and the `'Rating'` value is `'19'`. This is clearly off because the maximum rating for a Google Play app is 5. As mentioned in the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), the problem is caused by a missing value in the `'Category'` column.  
As a consequence, we'll delete this row.

In [8]:
#del android[10472] 
print(len(android))

# The del statement above is commented out after running it once, to avoid accidentally deleting another row.

10840
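
Deleting by index is only safe to run once, which is why the cell above is commented out after use. As an alternative, here is a minimal sketch of a filter that can be re-run without side effects. The function name `drop_malformed_rows` and the toy rows are assumptions made for illustration; they are not part of the original notebook.

```python
def drop_malformed_rows(rows, header):
    """Return only the rows that have exactly as many fields as the header."""
    return [row for row in rows if len(row) == len(header)]

# Toy data standing in for the real dataset (hypothetical values)
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # shifted row: one value is missing
    ['App C', 'GAME', '4.5'],
]

cleaned = drop_malformed_rows(rows, header)
cleaned_again = drop_malformed_rows(cleaned, header)  # repeating the call changes nothing
print(len(cleaned), len(cleaned_again))
```

Because the filter builds a new list instead of mutating the old one, running the cell twice cannot remove a second, valid row.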


# Removing Duplicate Entries


## Part 1
If we explore the datasets long enough we will notice that there are some duplicate entries. For example, Instagram has four entries:

In [9]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app, '\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 



Let's create a function that will count the number of duplicates for a specific column in a dataset. It will also show some examples.

In [10]:
def check_duplicate(dataset, name_index): # name_index is the index of the column to check for duplicates
    
    unique_apps = []
    duplicates = []
    for app in dataset:
        name = app[name_index]
        if name in unique_apps:
            duplicates.append(name)
        else:
            unique_apps.append(name)
        
    print('Number of duplicates : ', len(duplicates))
    print('\n')
    print('Example of duplicate Apps :', duplicates[:15])

In [11]:
check_duplicate(android, 0)

Number of duplicates :  1181


Example of duplicate Apps : ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app. We could remove the duplicate rows randomly, but there is a better way.

If you examine the rows we printed above for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this as a criterion for keeping rows: instead of removing rows randomly, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.

We will build a dictionary that maps each app name to its highest number of reviews, since the entry with the most reviews is probably the most recent one.
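
One detail worth keeping in mind: the CSV reader returns every value as a string, and strings compare character by character rather than by numeric value. The short sketch below illustrates the difference, using two of the Instagram review counts printed above. It is an illustration only, not part of the original notebook.

```python
# Review counts as they come out of the CSV reader: strings, not numbers
older, newer = '66509917', '66577446'   # two of the Instagram rows printed above

print('9' > '10')                # True: strings are compared character by character
print(float('9') > float('10'))  # False: numeric comparison

# Picking the highest count safely: compare by numeric value
highest = max([older, newer], key=float)
print(highest)
```

Converting the counts with `float` (or `int`) before comparing them keeps the "highest number of reviews" criterion meaningful when the counts have different numbers of digits.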

In [12]:
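unique_apps = {}  # maps app name -> highest number of reviews seen so far
for app in android:
    name = app[0]
    reviews = float(app[3])  # convert so counts are compared as numbers, not as strings

    if name not in unique_apps: #check membership
        unique_apps[name] = reviews
    elif reviews > unique_apps[name]:
        unique_apps[name] = reviews

print(len(unique_apps))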

9659


Here we create a new list, `android_clean`, that keeps only the entry with the highest number of reviews for each app, using the dictionary we built above.

In [13]:
android_clean = []
already_added = []

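for app in android:
    name = app[0]
    reviews = float(app[3])
    n_reviews = float(unique_apps[name])  # highest number of reviews recorded for this app

    # Keep the first row that carries the highest count; skip identical rows seen later
    if (n_reviews == reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)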

Since we have already counted the duplicates, it is easy to confirm that the new list has the length we expect.

In [14]:
print('Expected length: ', len(android) - 1181)
print('Actual length: ', len(android_clean))

Expected length:  9659
Actual length:  9659


Let's explore the new dataset.

In [15]:
explore_dataset(android_clean, 0, 3, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


rows =  9659
columns =  13


# Removing Non-English Apps

In [16]:
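def is_english(name):
    # Count the characters outside the ASCII range (code points above 127)
    non_ascii = 0

    for char in name:
        if ord(char) > 127:
            non_ascii += 1

    # Allow up to three such characters so names with an emoji or a symbol like ™ are kept
    return non_ascii <= 3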

In [17]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

In [18]:
explore_dataset(android_english, 0, 2, True)
print('\n')

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


rows =  9614
columns =  13




In [19]:
explore_dataset(ios_english, 0, 2, True)
print('\n')

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


rows =  6183
columns =  16




# Isolating Free Apps
Since we are only interested in free apps, we are going to isolate only the free apps for analysis.

In [20]:
android_free = []
ios_free = []

We will keep the apps whose price is `'0.0'` in the iOS dataset and `'0'` in the Android dataset.

In [21]:
for app in android_english:
    if app[7] == '0':
        android_free.append(app)

In [22]:
for app in ios_english:
    if app[4] == '0.0':
        ios_free.append(app)

In [23]:
print('Final android Apps: ',len(android_free))
print('Final ios Apps: ',len(ios_free))

Final android Apps:  8862
Final ios Apps:  3222


# Most Common Apps by Genre

## Part 1

We will find the genres that are the most common on both Google Play and the App Store. App profiles that work on both markets are better because they can reach a larger audience as the app scales.

In [24]:
# Column indexes used below: android 'Category' = 1 and 'Genres' = 9; ios 'prime_genre' = 11
# Popularity columns: android 'Reviews', 'Installs', 'Rating'; ios 'rating_count_tot', 'rating_count_ver'

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll build a frequency table for the `prime_genre` column of the App Store dataset, and for the `Genres` and `Category` columns of the Google Play dataset.

In [25]:
def freq_table(dataset, index):
    frequencies = {}
    for app in dataset:
        genre = app[index]
        if genre in frequencies:
            frequencies[genre] += 1
        else:
            frequencies[genre] = 1
        
    return frequencies

In [26]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
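
For comparison, the same kind of frequency table can be sketched with `collections.Counter` from the standard library. The helper name `freq_table_counter` and the toy rows are assumptions for illustration. One difference to be aware of: `most_common()` lists ties in first-seen order, whereas `display_table` above breaks ties by sorting the genre names in reverse.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Count how often each value appears in column `index`."""
    return Counter(row[index] for row in dataset)

# Toy rows (hypothetical), with the genre in column 1
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games'], ['e', 'Music']]

# most_common() returns (value, count) pairs, highest count first
for genre, count in freq_table_counter(rows, 1).most_common():
    print(genre, ':', count)
```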

## Part 2

In [27]:
display_table(android_free, 1)

FAMILY : 1678
GAME : 859
TOOLS : 749
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 312
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 104
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [28]:
display_table(android_free, 9)

Tools : 748
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 312
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 155
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 39
Casino : 38
Trivia : 37
Educational;Education : 35
Educational : 33
Board : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Puzzle;Brain Games : 16
Racing;Action & Adventure : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [29]:
display_table(ios_free, 11)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


## Part 3

From the frequency table of App Store apps, it is clear that apps designed for fun dominate the space. Four of the top five genres are for fun. The `Games` genre is the most dominant, with more than seven times as many apps as the second genre.  
On the App Store, entertainment genres dominate the top of the table, while more practical genres are concentrated at the bottom.

On Google Play, practical genres are dominant. Out of the top ten categories, only one is for entertainment. The `FAMILY` category has the most apps. This suggests that Google Play is dominated by family-oriented apps while the App Store is dominated by fun apps. The `GAME` category is the second largest on Google Play.  
Developing gaming apps looks like the best strategy, since games are well represented on Google Play and exceptionally well represented on the App Store. A game would let a developer target both platforms and succeed on each.

# Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre.  
For the Google Play dataset, we can find this information in the `Installs` column, but it is missing from the App Store dataset. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

We will start by calculating the average number of user ratings per genre on the App Store.  
To do that, we will:
- Isolate the apps of each genre.
- Sum the number of user ratings for the apps of that genre.
- Divide the sum by the number of apps in that genre.

In [30]:
genre_ios = freq_table(ios_free, 11)
for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        rate_count = float(app[5])
        genre_app = app[11]
        if genre == genre_app:
            total += rate_count
            len_genre += 1
    
    print(str(genre),': ', (round(total/len_genre)))

Social Networking :  71548
Photo & Video :  28442
Games :  22789
Music :  57327
Reference :  74942
Health & Fitness :  23298
Weather :  52280
Utilities :  18684
Travel :  28244
Shopping :  26920
News :  21248
Navigation :  86090
Lifestyle :  16486
Entertainment :  14030
Food & Drink :  33334
Sports :  23009
Book :  39758
Finance :  31468
Education :  7004
Productivity :  21028
Business :  7491
Catalogs :  4004
Medical :  612
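
The nested loop above scans the whole dataset once per genre. A minimal single-pass sketch is shown below, which also sorts the result so the highest averages come first. The helper name `average_by_group` and the toy rows are assumptions for illustration, not part of the original notebook.

```python
def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (hypothetical): [name, rating count, genre]
rows = [['a', '100', 'Games'], ['b', '300', 'Games'], ['c', '50', 'Weather']]

averages = average_by_group(rows, 2, 1)
for genre in sorted(averages, key=averages.get, reverse=True):
    print(genre, ':', round(averages[genre]))
```

With the notebook's data, the equivalent call would presumably be `average_by_group(ios_free, 11, 5)`, since `prime_genre` is at index 11 and `rating_count_tot` at index 5.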


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.
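
One way to check how much a few apps skew a genre's average is to list every app in that genre with its rating count. The sketch below assumes the App Store column layout used in this notebook (name at index 1, `rating_count_tot` at index 5, `prime_genre` at index 11). The helper names and the toy rows are made up for illustration and do not reproduce real figures.

```python
def apps_in_genre(apps, genre, name_index=1, count_index=5, genre_index=11):
    """Return (name, rating count) pairs for one genre, largest count first."""
    pairs = [(app[name_index], int(app[count_index]))
             for app in apps if app[genre_index] == genre]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def make_row(name, count, genre):
    """Build a 16-column toy row laid out like the App Store dataset."""
    row = [''] * 16
    row[1], row[5], row[11] = name, str(count), genre
    return row

# Toy data with made-up values
toy = [make_row('Nav A', 900, 'Navigation'),
       make_row('Nav B', 10, 'Navigation'),
       make_row('Game A', 500, 'Games')]

for name, count in apps_in_genre(toy, 'Navigation'):
    print(name, ':', count)
```

In the notebook, calling `apps_in_genre(ios_free, 'Navigation')` should show how the rating counts are distributed across the navigation apps.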

# Most Popular Apps by Genre on Google Play

In [31]:
categories = freq_table(android_free, 1)
for category in categories:
    total = 0
    len_category = 0
    for app in android_free:
        installs = ((app[5]).replace('+', ''))
        installs = int(installs.replace(',', ''))
        category_app = app[1]
        if category == category_app:
            total += installs
            len_category += 1
    
    print(str(category),': ', (round(total/len_category)))

ART_AND_DESIGN :  1986335
AUTO_AND_VEHICLES :  647318
BEAUTY :  513152
BOOKS_AND_REFERENCE :  8767812
BUSINESS :  1712290
COMICS :  817657
COMMUNICATION :  38456119
DATING :  854029
EDUCATION :  1820673
ENTERTAINMENT :  11640706
EVENTS :  253542
FINANCE :  1387692
FOOD_AND_DRINK :  1924898
HEALTH_AND_FITNESS :  4188822
HOUSE_AND_HOME :  1331541
LIBRARIES_AND_DEMO :  638504
LIFESTYLE :  1437816
GAME :  15560966
FAMILY :  3694276
MEDICAL :  120616
SOCIAL :  23253652
SHOPPING :  7036877
PHOTOGRAPHY :  17805628
SPORTS :  3638640
TRAVEL_AND_LOCAL :  13984078
TOOLS :  10682301
PERSONALIZATION :  5201483
PRODUCTIVITY :  16787331
PARENTING :  542604
WEATHER :  5074486
VIDEO_PLAYERS :  24727872
NEWS_AND_MAGAZINES :  9549178
MAPS_AND_NAVIGATION :  4056942
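
The `Installs` values are strings such as `'10,000+'`, so they have to be cleaned before they can be averaged. The conversion used in the loop above can be sketched as a small helper. The name `parse_installs` is an assumption for illustration.

```python
def parse_installs(value):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(value.replace('+', '').replace(',', ''))

# Values taken from the Installs column shown earlier in this notebook
for raw in ['10,000+', '500,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Because each value is an open-ended bucket (`'10,000+'` means at least 10,000 installs), the averages above are best read as rough lower-bound estimates rather than exact figures.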


**To be Continued...**