# Profitable App Profiles for the App Store and Google Play Markets

Our goal is to determine which types of apps would be most profitable for a hypothetical company that makes free Android and iOS apps and relies on in-app advertisements for revenue. Under this model, revenue is driven by the number of people who actively use an app: the more users an app has, the more engagement its advertisements receive.

To that end, we will examine the types of apps offered on the Google Play Store and the Apple App Store to find out which ones attract the most users.

There are over 2 million Android apps on the Google Play Store and over 2 million iOS apps on the App Store. Collecting data for all of these apps would require a significant investment, so we will work with a sample of the data instead.

We have two data sets: one for Android applications and another for iOS applications.

The Android data set, which contains roughly ten thousand apps, is documented [here](https://www.kaggle.com/lava18/google-play-store-apps), and the iOS data set, which contains roughly seven thousand apps, is documented [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).



In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))

In [3]:
from csv import reader

# Google Play Store
# encoding="utf8" assumes the CSV files are UTF-8 encoded (app names contain non-ASCII characters)
with open("googleplaystore.csv", encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android_apps = android[1:]

explore_data(android_apps, 0, 5, True)

# Apple App Store
with open("AppleStore.csv", encoding="utf8") as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios_apps = ios[1:]

explore_data(ios_apps, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13
['284882215', 'Facebook', '38987980

The column names of the Play Store data set are self-explanatory, but the names of the App Store columns are less evident.

The documentation for the App Store data set can be seen [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps). Below is a table showing what each column represents.


| Column Name      | Description |
| ----------- | ----------- |
| id      | App ID       |
| track_name   | App Name        |
| size_bytes      | Size in bytes       |
| currency   | Currency type        |
| price      | Price amount       |
| rating_count_tot   | User Rating counts (for all versions)        |
| rating_count_ver      | User Rating counts (for current version)       |
| user_rating   | Average User Rating value (for all versions)        |
| user_rating_ver      | Average User Rating value (for current version)         |
| ver   | Latest version code        |
| cont_rating      | Content Rating       |
| prime_genre   | Primary Genre        |
| sup_devices.num      | Number of supporting devices       |
| ipadSc_urls.num   | Number of screenshots shown for display        |
| lang.num      | Number of supported languages       |
| vpp_lic   | VPP device-based licensing enabled        |
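
Because each row is a plain list, columns have to be addressed by position. One way to avoid hard-coding those positions is to build a name-to-index lookup from the header row. The sketch below is illustrative only: the `column_index` helper, the shortened header, and the sample row are hypothetical and not part of the original notebook.

```python
def column_index(header):
    """Map each column name in a header row to its position in the list."""
    return {name: position for position, name in enumerate(header)}


# Hypothetical, shortened header using column names from the table above.
sample_header = ["id", "track_name", "size_bytes", "currency", "price"]
# Hypothetical row, made up purely for illustration.
sample_row = ["1", "Example App", "1000", "USD", "0"]

index_of = column_index(sample_header)
print(index_of["track_name"])               # position of the app name column
print(sample_row[index_of["track_name"]])   # value stored in that column
```

With the real data, the same lookup could be built from `ios_header` or `android_header`.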


In [7]:
# The row at android[10473] (index 10472 in android_apps) has only 12 fields instead of 13:
# its "Category" value is missing, so the remaining values are shifted one column to the left.
# This could create issues going forward, so we remove the offending row.

print(android[10473])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # android[0] is the header again; the first data row is android[1]

print(len(android))
del android[10473]  # don't run this more than once
print(len(android))

android_header=android[0]
android_apps=android[1:]


# A general check for malformed rows (left disabled). It only reports rows,
# because deleting from a list while iterating over it would skip elements.
# for app in android_apps:
#     if len(app) != len(android_header):
#         print(app)
# for app in ios_apps:
#     if len(app) != len(ios_header):
#         print(app)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
10842
10841


In [8]:
print(len(android_apps))


10840
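
The malformed row above was located by index, but the same problem can be detected generically by comparing each row's length to the header's. The following is a minimal sketch under that assumption. The helper names and the toy data are hypothetical, and it builds a new list rather than deleting rows while iterating.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


def drop_malformed_rows(rows, header):
    """Return a new list containing only rows with the expected field count."""
    expected = len(header)
    return [row for row in rows if len(row) == expected]


# Toy data for illustration: the second row is missing its "Category" value.
header = ["App", "Category", "Rating"]
rows = [
    ["Good App", "TOOLS", "4.1"],
    ["Broken App", "1.9"],
    ["Other App", "GAME", "4.5"],
]

bad = find_malformed_rows(rows, header)
print(bad)

clean = drop_malformed_rows(rows, header)
print(len(clean))
```

Building a filtered copy leaves the original list untouched, so the cell can be re-run safely, unlike a `del` statement.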


In [9]:
# We can check whether we have any duplicate entries in our data sets.

duplicate_apps=[]
unique_apps=[]

for app in ios_apps:
    name=app[0]  # note: column 0 of the iOS data is the app id; the name (track_name) is column 1
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print(len(duplicate_apps))

duplicate_apps=[]
unique_apps=[]

for app in android_apps:
    name=app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print(len(duplicate_apps))

0
1181


# Duplicate Apps

We can see that we have over one thousand duplicate Android apps, while the check found no duplicate iOS entries (the iOS check above compares the first column, which is the app ID). We will remove the duplicates based on the number of reviews: for each app, we keep the entry with the most reviews and delete the rest.
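
Before removing anything, it can help to see which names repeat and how often. The sketch below uses `collections.Counter` on toy rows. The `duplicated_names` helper and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def duplicated_names(rows, name_index=0):
    """Return {name: occurrences} for every name that appears more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Toy rows for illustration: "App A" appears twice with different review counts.
rows = [
    ["App A", "TOOLS", "4.0", "100"],
    ["App B", "GAME", "4.2", "50"],
    ["App A", "TOOLS", "4.1", "250"],
]

dupes = duplicated_names(rows)
print(dupes)

# Number of surplus entries, i.e. rows that would be removed.
extra_rows = sum(n - 1 for n in dupes.values())
print(extra_rows)
```

The surplus count corresponds to what `len(duplicate_apps)` measures above: every occurrence of a name after its first.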

In [20]:
expected_length=len(android_apps)-len(duplicate_apps)

reviews_max={}

for app in android_apps:
    name=app[0]
    n_reviews=float(app[3])
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
        
print(expected_length)
print(len(reviews_max))

android_clean=[]
already_added=[]

for app in android_apps:
    name=app[0]
    n_reviews=float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659
9659
9659


# Removing Duplicate Apps

Above, we removed all duplicate apps found in our original android_apps data set. We now have a new data set known as android_clean, which has been purged of all duplicate applications.
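
The same keep-the-most-reviewed rule can also be sketched as a single pass that stores the best row seen so far for each name. This is an alternative formulation on toy data, not the notebook's method. The `keep_most_reviewed` helper and its parameters are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per name: the one with the highest review count.

    On ties, the first row encountered is kept.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dicts preserve insertion order, so names stay in first-seen order.
    return list(best.values())


# Toy rows for illustration.
rows = [
    ["App A", "TOOLS", "4.0", "100"],
    ["App B", "GAME", "4.2", "50"],
    ["App A", "TOOLS", "4.1", "250"],
]

deduplicated = keep_most_reviewed(rows)
for row in deduplicated:
    print(row)
```

One difference from the two-loop version: here each app keeps the position of its first appearance, whereas the notebook's approach keeps the position of the row with the most reviews.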

In [21]:
explore_data(android_clean,0,7,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+

# Removing Non-English Apps

The company will be making apps in English, so apps in other languages are less relevant to our analysis. We will remove any apps whose names are not in English.

In [27]:
def is_english(input_string):
    for i in input_string:
        if ord(i)>127:
            return False
    return True
        
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


In [35]:
print(ord("😜"))
print(ord("™"))

128540
8482


As we saw above, some English app names contain characters outside the ASCII range (code points 0 to 127), such as emoji or the ™ symbol.

We will therefore use a function that returns False only if a name contains more than three characters outside the ASCII range.

In [64]:
def is_english(input_string):
    non_eng=0
    for i in input_string:
        if ord(i)>127:
            non_eng+=1
        if non_eng>3:
            return False
    return True
        
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
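
To make the threshold concrete, the sketch below counts non-ASCII characters directly and checks names on both sides of the limit. It is an equivalent formulation for illustration only: the `is_probably_english` name and the `max_non_ascii` parameter are hypothetical, and the short test strings are made up.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english("Instachat 😜"))   # one non-ASCII character
print(is_probably_english("abc™™™"))         # exactly three: still accepted
print(is_probably_english("abc™™™™"))        # four: rejected
print(is_probably_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
```

A name with exactly three non-ASCII characters is still accepted, and the fourth one triggers rejection. The heuristic remains imperfect: a short non-English name with three or fewer such characters would slip through.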


In [66]:
android_english=[]
ios_english=[]

def eng_dataset(dataset,new_dataset,index=0):
    for app in dataset:
        name=app[index]
        if is_english(name):
            new_dataset.append(app)
            
eng_dataset(dataset=android_clean,new_dataset=android_english,index=0)

eng_dataset(ios_apps,ios_english,index=1)

explore_data(android_english,0,5,True)
explore_data(ios_english,0,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0