# Finding which types of apps attract the most users

As a company developing free-to-download apps, our main source of revenue is in-app ads. To maximize our revenue, we need to attract as many users as possible so that they see our ads.

With this project, we intend to understand which types of apps currently attract the most users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
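from csv import reader

# Open each file in a `with` block so it is closed once it has been read.
# The encoding is set explicitly because some app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as file_apple:
    appstore = list(reader(file_apple))
with open('googleplaystore.csv', encoding='utf8') as file_google:
    googleplay = list(reader(file_google))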

In [3]:
appstore_header = appstore[0]
appstore = appstore[1:]

In [4]:
googleplay_header = googleplay[0]
googleplay = googleplay[1:]

In [5]:
appstore_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [6]:
googleplay_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [7]:
explore_data(appstore,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [8]:
explore_data(googleplay,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [9]:
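# Remove row 10472 (index counted after the header row was dropped).
# Run this cell only once: running it again would delete a different row.
del googleplay[10472]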
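
A row like the one deleted above can be located programmatically. The following is a minimal sketch, assuming the problem is a row whose number of values differs from the header's; the helper name `find_malformed_rows` and the toy rows are hypothetical and not part of the original analysis.

```python
def find_malformed_rows(rows, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing one value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad_indexes = find_malformed_rows(toy_rows, toy_header)
print(bad_indexes)  # [1]

# Delete from the end so earlier deletions do not shift later indexes.
for i in reversed(bad_indexes):
    del toy_rows[i]
```

Deleting by a computed index rather than a hardcoded one also makes the cell safe to re-run, since a second run finds nothing left to delete.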

## Information that can be relevant to our study:
### AppStore
(Check all the documentation [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps))
- Price `[price]`
- Number of user ratings `[rating_count_tot]`
- Average user rating `[user_rating]`
- Content rating `[cont_rating]`
- Primary genre `[prime_genre]`

### GooglePlay
(Check all the documentation [here](https://www.kaggle.com/lava18/google-play-store-apps))
- Rating `[Rating]`
- Reviews `[Reviews]`
- Installs `[Installs]`
- Type (paid or free) `[Type]`
- Content rating `[Content Rating]`
- Genres `[Genres]`


In [10]:
def get_dict(store,app_name_index):
    final_dict = {}
    for app in store:
        if app[app_name_index] in final_dict:
            final_dict[app[app_name_index]] += 1
        else:
            final_dict[app[app_name_index]] = 1
    return final_dict

In [11]:
def get_duplicates(dictionary):
    duplicates = []
    for app in dictionary:
        if dictionary[app] > 1:
            duplicates.append(app)
    return duplicates

In [12]:
apple_duplicates = get_duplicates(get_dict(appstore,1))
apple_duplicates

['VR Roller Coaster', 'Mannequin Challenge']

### Google Play duplicates
We found that the Google Play dataset contains duplicate entries, as shown below:

In [13]:
duplicate_apps = []
unique_apps = []

for app in googleplay:
    if app[0] in unique_apps:
        duplicate_apps.append(app[0])
    else:
        unique_apps.append(app[0])
        
print('Number of duplicate apps:',len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:',duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We will remove the duplicates, keeping for each app only the entry with the most reviews.

In [14]:
reviews_max = {}

for app in googleplay:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

In [15]:
len(reviews_max)

9659

We will loop through the dataset to get a new list with unique apps.

In [16]:
android_clean = []
already_added = []

for app in googleplay:
    name = app[0]
    n_reviews = float(app[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)       

In [17]:
len(android_clean)

9659
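
The two passes above can also be folded into one. This is only a sketch on toy rows, assuming the same column positions (name at index 0, reviews at index 3); the function name `keep_most_reviewed` and the sample values are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy data (hypothetical values): 'Box' appears twice.
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159880'],
]

deduped = keep_most_reviewed(toy_rows)
print(len(deduped))  # 2
```

A dictionary lookup replaces the `name not in already_added` list search, which matters on larger datasets. One difference to be aware of: the result is ordered by each name's first appearance, not by the position of the row that was kept.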

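We will check app names for non-English characters and remove the apps whose names contain more than three characters outside the ASCII range (code points above 127). Allowing up to three such characters keeps English names that include an emoji or a symbol like ™.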

In [18]:
def english(string):
    count = 0
    for letter in string:
        if ord(letter) > 127:
            count += 1
            if count > 3:
                return False
    return True

In [19]:
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
True
True


In [20]:
android_english = []
appstore_english = []

for app in android_clean:
    if english(app[0]):
        android_english.append(app)
        
for app in appstore:
    if english(app[1]):
        appstore_english.append(app)

In [21]:
print(len(android_english))
print(len(appstore_english))

9614
6183


We will isolate the free apps, as they are the focus of this study.

In [22]:
android_free = []
ios_free = []

for app in android_english:
    if app[6] == 'Free':
        android_free.append(app)
        
for app in appstore_english:
    if app[4] == '0.0':
        ios_free.append(app)
        
explore_data(android_free,0,2,True)
explore_data(ios_free,0,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8863
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 3222
Number of columns: 16


We found an app with Price '0' whose Type is not 'Free' (it is 'NaN'), so we corrected it.

In [23]:
count = 0
for app in android_english:
    if app[6] != 'Free' and app[7] == '0':
        mistake = app
        break
    count += 1
        
print(googleplay_header)
print(mistake)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


In [24]:
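# Set the Type of the row found above (index 7939 in android_english) to 'Free'.
# Caveat: android_free was built in cell 22, before this fix, so the filter
# must be run again for this app to be included.
android_english[7939][6] = 'Free'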
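
As far as the cells shown here go, `android_free` was filtered before this correction, so the corrected app only ends up in it if the filter is run again. A minimal sketch of that effect on toy rows (hypothetical data; only the position of `Type` at index 6 is taken from the dataset):

```python
TYPE_INDEX = 6  # position of 'Type' in the Google Play rows


def free_only(rows):
    """Return the rows whose Type is 'Free'."""
    return [row for row in rows if row[TYPE_INDEX] == 'Free']


# Toy data (hypothetical): the second row has a missing Type.
toy_rows = [
    ['App A', 'TOOLS', '4.1', '10', '1M', '100+', 'Free', '0'],
    ['App B', 'FAMILY', 'NaN', '0', '1M', '0', 'NaN', '0'],
]

free_before = free_only(toy_rows)   # filtered before the fix
toy_rows[1][TYPE_INDEX] = 'Free'    # the correction
free_after = free_only(toy_rows)    # filtered again after the fix

print(len(free_before), len(free_after))  # 1 2
```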

In [25]:
explore_data(ios_free,0,3,True)
print('\n\n\n')
explore_data(android_free,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16




['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Va

In [26]:
android_header = googleplay_header
ios_header = appstore_header

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's inspect the datasets' headers and find out what data can help us reach our goal:

In [27]:
print(android_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


For Android, the most useful columns are `'Category'` and `'Genres'`; for iOS, it is `'prime_genre'`. One example row from each store is shown below:

In [28]:
print(android_free[0])
print('\n')
print(ios_free[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


In [29]:
def freq_table(dataset,index):
    dictionary = {}
    for app in dataset:
        if app[index] in dictionary:
            dictionary[app[index]] += 100/len(dataset)
        else:
            dictionary[app[index]] = 100/len(dataset)
    return dictionary

In [30]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
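
The same percentage table can be built with `collections.Counter` from the standard library. This is a sketch, not a replacement for the functions above; the name `freq_table_pct` and the toy rows are hypothetical.

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Return {value: percentage of rows that have that value in column `index`}."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {value: count / total * 100 for value, count in counts.items()}


# Toy data (hypothetical): three 'Games' rows and one 'Music' row.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_pct(toy_rows, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

Counting first and dividing once per value avoids the small rounding drift that can build up when `100/len(dataset)` is added repeatedly.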

In [31]:
display_table(ios_free, 11)  # prints the table; it returns None, so there is nothing to assign

Games : 58.1626319056464
Entertainment : 7.883302296710134
Photo & Video : 4.965859714463075
Education : 3.6623215394165176
Social Networking : 3.2898820608317867
Shopping : 2.6070763500931133
Utilities : 2.5139664804469306
Sports : 2.1415270018621997
Music : 2.048417132216017
Health & Fitness : 2.0173805090006227
Productivity : 1.7380509000620747
Lifestyle : 1.5828677839851035
News : 1.3345747982619496
Travel : 1.2414649286157668
Finance : 1.1173184357541899
Weather : 0.8690254500310364
Food & Drink : 0.8069522036002481
Reference : 0.558659217877095
Business : 0.5276225946617009
Book : 0.4345127250155184
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among free English apps, the App Store is clearly dominated by apps made for fun: Games alone account for about 58% of them, and Entertainment comes second with about 8%.

In [32]:
display_table(android_free, 9)  # prints the table; it returns None, so there is nothing to assign

Tools : 8.450863138892059
Entertainment : 6.070179397495205
Education : 5.348076272142616
Business : 4.592124562789124
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.700778517432021
Medical : 3.5315355974275078
Sports : 3.4638384294257025
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.3242694347286474
Shopping : 2.245289405393208
Books & Reference : 2.1437436533905
Simulation : 2.0421979013877922
Dating : 1.8616721200496447
Arcade : 1.8503892587160105
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.2411147466997632
Puzzle : 1.128286133363421
Racing : 0.9928917973598105
Role Playing : 0.9364774906916394
Libraries & Demo : 0.9364774906916394
Auto & Vehicles : 0.9251946293580052
S

In [33]:
display_table(android_free, 1)  # prints the table; it returns None, so there is nothing to assign

FAMILY : 18.898792733837702
GAME : 9.725826469592825
TOOLS : 8.462146000225694
BUSINESS : 4.592124562789124
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.700778517432021
MEDICAL : 3.5315355974275078
SPORTS : 3.3961412614238973
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.3355522960622817
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533905
DATING : 1.8616721200496447
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.2411147466997632
EDUCATION : 1.1621347173643237
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916394
AUTO_AND_VEHICLES : 0.9251946293580052
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189553
PARENTING : 0.6544059573507842
ART_AND_DESIGN : 0.

The Google Play numbers are harder to read because the Genres table is very long, so we will rely on the Category frequency table instead. Here the distribution is more even, but Family and Game still take the top two places (about 19% and 10%), followed by practical categories such as Tools and Business.

So, for now:
- The App Store is geared more towards entertainment
- Google Play has a more diverse collection of apps

We will now look at the number of users per genre. For Google Play, the `Installs` column gives the number of downloads. The App Store dataset has no install counts, so we will use the total number of user ratings (`rating_count_tot`) as a proxy.

In [34]:
for genre in freq_table(ios_free,11):
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    print(genre + ':',total/len_genre)

News: 21248.023255813954
Education: 7003.983050847458
Music: 57326.530303030304
Book: 39758.5
Food & Drink: 33333.92307692308
Entertainment: 14029.830708661417
Navigation: 86090.33333333333
Weather: 52279.892857142855
Business: 7491.117647058823
Catalogs: 4004.0
Health & Fitness: 23298.015384615384
Sports: 23008.898550724636
Travel: 28243.8
Games: 22788.6696905016
Shopping: 26919.690476190477
Lifestyle: 16485.764705882353
Social Networking: 71548.34905660378
Utilities: 18684.456790123455
Finance: 31467.944444444445
Photo & Video: 28441.54375
Medical: 612.0
Productivity: 21028.410714285714
Reference: 74942.11111111111


Top 3:
- Navigation: 86090
- Reference: 74942
- Social Networking: 71548
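
The per-genre averages above scan the whole dataset once per genre. A single-pass variant is sketched below, assuming rows are lists of strings as in this notebook; the helper name `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column `value_index`} in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data (hypothetical): id, name, genre, rating count.
toy_rows = [
    ['1', 'App A', 'Games', '100'],
    ['2', 'App B', 'Games', '300'],
    ['3', 'App C', 'Weather', '50'],
]

averages = average_by_group(toy_rows, group_index=2, value_index=3)

# Sort by average, highest first, to read off the top genres.
top = sorted(averages.items(), key=lambda item: item[1], reverse=True)
print(top)
```

For the App Store data, the call would presumably be `average_by_group(ios_free, 11, 5)`, using the same column indexes as the loop above.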

For Google Play, the install counts are open-ended ranges (0+, 1+, 5+, ..., 1,000,000,000+) rather than exact figures. As we do not need exact precision, we will remove the plus sign (+) and the thousands separators (,) and treat the remaining number as the number of installs.
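
The conversion can be isolated in a small helper. This is a minimal sketch; the name `parse_installs` is hypothetical, and the result is a lower bound on the real install count.

```python
def parse_installs(text):
    """Convert an Installs string such as '10,000+' to a float (a lower bound)."""
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))  # 10000.0
print(parse_installs('0'))        # 0.0
```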

In [36]:
android_free[0]

['Photo Editor & Candy Camera & Grid & ScrapBook',
 'ART_AND_DESIGN',
 '4.1',
 '159',
 '19M',
 '10,000+',
 'Free',
 '0',
 'Everyone',
 'Art & Design',
 'January 7, 2018',
 '1.0.0',
 '4.0.3 and up']

In [35]:
android_category_table = freq_table(android_free,1)

In [40]:
for category in android_category_table:
    total = 0
    len_category = 0
    for app in android_free:
        if app[1] == category:
            num = app[5].replace('+','')
            num = num.replace(',','')
            total += float(num)
            len_category += 1
    print(category,':',round(total/len_category))

AUTO_AND_VEHICLES : 647318
EVENTS : 253542
FOOD_AND_DRINK : 1924898
HEALTH_AND_FITNESS : 4188822
SHOPPING : 7036877
BOOKS_AND_REFERENCE : 8767812
ENTERTAINMENT : 11640706
TRAVEL_AND_LOCAL : 13984078
BEAUTY : 513152
PHOTOGRAPHY : 17840110
VIDEO_PLAYERS : 24727872
PRODUCTIVITY : 16787331
TOOLS : 10801391
NEWS_AND_MAGAZINES : 9549178
PERSONALIZATION : 5201483
SOCIAL : 23253652
HOUSE_AND_HOME : 1331541
LIBRARIES_AND_DEMO : 638504
LIFESTYLE : 1437816
BUSINESS : 1712290
MEDICAL : 120551
SPORTS : 3638640
COMMUNICATION : 38456119
EDUCATION : 1833495
FINANCE : 1387692
DATING : 854029
PARENTING : 542604
FAMILY : 3697848
MAPS_AND_NAVIGATION : 4056942
WEATHER : 5074486
COMICS : 817657
GAME : 15588016
ART_AND_DESIGN : 1986335


Top 3:
- COMMUNICATION : 38456119
- VIDEO_PLAYERS : 24727872
- SOCIAL : 23253652