## Analyzing Mobile App Data

This project analyzes two data sets of mobile apps, one from Google Play and one from the App Store.

The goal is to help developers understand which types of apps attract more users on each store.

Descriptions of the two data sets and their columns can be found here:

[Google Play Apps](https://www.kaggle.com/lava18/google-play-store-apps)

[App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)


In [1]:
from csv import reader

# The App Store data set
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apple_data = list(read_file)
opened_file.close()
apple_data_header = apple_data[0]  # first row holds the column names
apple_data = apple_data[1:]

# The Google Play data set
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
google_data = list(read_file)
opened_file.close()
google_data_header = google_data[0]  # first row holds the column names
google_data = google_data[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(apple_data_header)
print('\n')
explore_data(apple_data, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [3]:
print(google_data_header)
print('\n')
explore_data(google_data, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

One row in the Google Play data (index 10472) is missing its Category value, so every later value is shifted one column to the left. We need to delete it to avoid misinterpreting the data.

In [4]:
print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [5]:
del google_data[10472]
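
Rather than spotting such a row by eye, a length check against the header can flag every malformed row at once. The sketch below assumes that a row is malformed whenever its column count differs from the header's. `find_malformed_rows` and the sample rows are illustrative and not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

for index, row in find_malformed_rows(sample_rows, sample_header):
    print(index, row)
```

Called as `find_malformed_rows(google_data, google_data_header)` before anything is deleted, it would be expected to flag row 10472, along with any other row that has the same problem. Note that deleting by index is only safe once: re-running the `del` cell would remove a valid row.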

The Google Play data set includes some duplicate entries. For example:

In [6]:
duplicate_apps = []
unique_apps = []

for row in google_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Some examples of duplicate apps:', duplicate_apps[:5])

Number of duplicate apps: 1181


Some examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Duplicate entries should be deleted, but not at random. We can use the Reviews column: the entry with the most reviews is most likely the most recent, so we keep that one and delete the others without losing any column information.

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [7]:
reviews_max = {}

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(google_data) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


In [8]:
android_clean = []
already_added = []

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(row[0])
        
print('Actual length:', len(android_clean))

Actual length: 9659
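
The two cells above can also be folded into a single pass. This is a minimal sketch, assuming the same rule (keep the first row with the highest review count for each name); `dedupe_by_max_reviews` and the sample rows are hypothetical and only illustrate the idea.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows in the Google Play column order (App, Category, Rating, Reviews).
sample = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Notes', 'TOOLS', '4.0', '120'],
    ['Box', 'BUSINESS', '4.2', '159900'],
]
print(dedupe_by_max_reviews(sample))
```

A dictionary lookup is faster than checking membership in a list such as `already_added`. The result is ordered by each app's first appearance, which may differ slightly from the row order produced by the two-cell version.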


The cleaned data set has 9,659 rows, as expected.

Next, we need to remove all non-English apps. A first attempt treats a name as English only if every character falls within the ASCII range (0 to 127).

In [9]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


Emojis and characters like ™ fall outside the ASCII range, so with this function some English apps would be removed. We need to improve the function, otherwise we will lose useful data. The version below tolerates up to three non-ASCII characters in a name.
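
To see why these names fail the check, the code points of a few characters can be inspected directly with `ord`. This is a small illustration; the `code_points` dictionary is a name introduced here.

```python
# Code points of characters taken from the app names tested above.
characters = ['a', '™', '😜', '爱']
code_points = {character: ord(character) for character in characters}

for character, code_point in code_points.items():
    # The last column shows whether the character is outside the ASCII range.
    print(repr(character), code_point, code_point > 127)
```

Only `'a'` stays within the 0 to 127 range. The trademark sign (8482) and the emoji (128540) are far beyond it, even though the names that contain them are English.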

In [10]:
def is_english(string):
    non_ASCII = 0
    
    for character in string:
        if ord(character) > 127:
            non_ASCII += 1
            
    if non_ASCII > 3:
        return False
    else:
        return True
    
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


Now we can use our new function to clean our two data sets.

In [11]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in apple_data:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
     
explore_data(android_english, 0, 5, True)
print('\n')
explore_data(ios_english, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'U

Now we need to isolate free apps. 

In [12]:
android_free = []
ios_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
explore_data(android_free, 0, 5, True)
print('\n')
explore_data(ios_free, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'U

We want to develop an application that is profitable on both Google Play and the App Store.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.

2. If the app has a good response from users, we develop it further.

3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because the app should eventually succeed in both stores, we must explore which kinds of apps are most common in each. For this purpose we can use the **prime_genre** column in the App Store data set and the **Genres** and **Category** columns in the Google Play data set.

In [13]:

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
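
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's own method; `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows; column 1 plays the role of a genre column.
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]
table = freq_table_counter(sample, 1)

# Sort by percentage, largest first, as display_table does.
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```

Sorting with a `key` function avoids building the intermediate list of `(percentage, key)` tuples.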

Now we will analyze the **prime_genre** column in the iOS data set.

In [14]:
display_table(ios_free, -5)


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


On Google Play, communication apps have the most installs on average: 38,456,119 (the COMMUNICATION value among the per-category averages computed in cell 18 below). This number is heavily skewed upward by a few apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and a few others with over 100 or 500 million installs.
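
The effect of a few giant apps on an average is easy to reproduce with invented numbers. In the sketch below the install counts are made up purely for illustration. Comparing the mean with the median shows how a single outlier dominates the mean.

```python
from statistics import mean, median

# Made-up install counts: most apps are small, one has a billion installs.
installs = [10_000, 50_000, 100_000, 500_000, 1_000_000_000]

print('Mean:  ', mean(installs))    # pulled far upward by the largest app
print('Median:', median(installs))  # the middle value, unaffected by it
```

Here the mean is roughly 200 million, although four of the five apps have at most 500,000 installs.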

Let's take a look at the Health & Fitness genre in the App Store data set.

In [16]:
for app in ios_free:
    if app[-5] == 'Health & Fitness':
        print(app[1], ':', app[5]) # print name and number of ratings

Calorie Counter & Diet Tracker by MyFitnessPal : 507706
Lose It! – Weight Loss Program and Calorie Counter : 373835
Weight Watchers : 136833
Sleep Cycle alarm clock : 104539
Fitbit : 90496
Period Tracker Lite : 53620
Nike+ Training Club - Workouts & Fitness Plans : 33969
Plant Nanny - Water Reminder with Cute Plants : 27421
Sworkit - Custom Workouts for Exercise & Fitness : 16819
Clue Period Tracker: Period & Ovulation Tracker : 13436
Headspace : 12819
Fooducate - Lose Weight, Eat Healthy,Get Motivated : 11875
Runtastic Running, Jogging and Walking Tracker : 10298
WebMD for iPad : 9142
8fit - Workouts, meal plans and personal trainer : 8730
Garmin Connect™ Mobile : 8341
Record by Under Armour, connects with UA HealthBox : 7754
Fitstar Personal Trainer : 7496
My Cycles Period and Ovulation Tracker : 7469
Seven - 7 Minute Workout Training Challenge : 6808
RUNNING for weight loss: workout & meal plans : 6407
Lifesum – Inspiring healthy lifestyle app : 5795
Waterlogged - Daily Hydration Tr

Next, we analyze the **Category** and **Genres** columns in the Android data set.

In [15]:
display_table(android_free, 1) # Category
print('\n')
display_table(android_free, 9) # Genres

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

So it turns out that Games is the most common genre on the App Store, with Entertainment the runner-up. Apps designed for practical purposes are less common.

On Google Play, the Family category consists mostly of games for kids. However, practical and for-fun apps are more evenly balanced there.


Now let's find out which kinds of apps have the most users. We can do this by calculating the average number of installs for each genre.

Google Play provides this information in the Installs column. The App Store data set has no such column, so we use rating_count_tot (the total number of user ratings) as a proxy.

In [17]:
genre_table_ios = freq_table(ios_free, -5)

for genre in genre_table_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Business : 7491.117647058823
Education : 7003.983050847458
Reference : 74942.11111111111
Food & Drink : 33333.92307692308
Health & Fitness : 23298.015384615384
Navigation : 86090.33333333333
Productivity : 21028.410714285714
Travel : 28243.8
Catalogs : 4004.0
Weather : 52279.892857142855
News : 21248.023255813954
Games : 22788.6696905016
Lifestyle : 16485.764705882353
Medical : 612.0
Finance : 31467.944444444445
Social Networking : 71548.34905660378
Utilities : 18684.456790123455
Music : 57326.530303030304
Photo & Video : 28441.54375
Shopping : 26919.690476190477
Entertainment : 14029.830708661417
Book : 39758.5
Sports : 23008.898550724636
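
The nested loop above rescans the whole data set once per genre. As a hedged alternative, the averages can be accumulated in a single pass and then sorted for easier reading. `average_by_group` and the sample rows are hypothetical names and data.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: (name, rating count, genre).
sample = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]
averages = average_by_group(sample, group_index=2, value_index=1)

# Print the genres from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

With the notebook's data, the equivalent call would be `average_by_group(ios_free, -5, 5)`.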


The App Store is dominated by for-fun apps, which suggests that this part of the market might be somewhat saturated. A practical app might therefore have a better chance of standing out among the huge number of apps on the App Store.

In [18]:
category_fr_table = freq_table(android_free, 1)

for category in category_fr_table:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
            
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

PHOTOGRAPHY : 17840110.40229885
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
BEAUTY : 513151.88679245283
COMICS : 817657.2727272727
SPORTS : 3638640.1428571427
DATING : 854028.8303030303
EVENTS : 253542.22222222222
HEALTH_AND_FITNESS : 4188821.9853479853
SOCIAL : 23253652.127118643
ENTERTAINMENT : 11640705.88235294
FAMILY : 3695641.8198090694
PERSONALIZATION : 5201482.6122448975
FOOD_AND_DRINK : 1924897.7363636363
COMMUNICATION : 38456119.167247385
HOUSE_AND_HOME : 1331540.5616438356
BUSINESS : 1712290.1474201474
SHOPPING : 7036877.311557789
TRAVEL_AND_LOCAL : 13984077.710144928
EDUCATION : 1833495.145631068
BOOKS_AND_REFERENCE : 8767811.894736841
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
LIBRARIES_AND_DEMO : 638503.734939759
MAPS_AND_NAVIGATION : 4056941.7741935486
FINANCE : 1387692.475609756
NEWS_AND_MAGAZINES : 9549178.467741935
VIDEO_PLAYERS : 24727872.452830188
TOOLS : 10801391.298666667
MEDICAL : 120550.61980830671
ART_AND_DESIGN : 1986335.087719298

We can take a look at the Health and Fitness category in the Google Play data set.

In [19]:
for app in android_free:
    if app[1] == 'HEALTH_AND_FITNESS':
        print(app[0], ':', app[5])

Step Counter - Calorie Counter : 500,000+
Lose Belly Fat in 30 Days - Flat Stomach : 5,000,000+
Pedometer - Step Counter Free & Calorie Burner : 1,000,000+
Six Pack in 30 Days - Abs Workout : 10,000,000+
Lose Weight in 30 Days : 10,000,000+
Pedometer : 10,000,000+
LG Health : 10,000,000+
Step Counter - Pedometer Free & Calorie Counter : 10,000,000+
Pedometer, Step Counter & Weight Loss Tracker App : 10,000,000+
Sportractive GPS Running Cycling Distance Tracker : 1,000,000+
30 Day Fitness Challenge - Workout at Home : 10,000,000+
Home Workout for Men - Bodybuilding : 1,000,000+
Fat Burning Workout - Home Weight lose : 100,000+
Buttocks and Abdomen : 500,000+
Walking for Weight Loss - Walk Tracker : 100,000+
Running & Jogging : 500,000+
Sleep Sounds : 1,000,000+
Fitbit : 10,000,000+
Lose Belly Fat-Home Abs Fitness Workout : 50,000+
Cycling - Bike Tracker : 500,000+
Abs Training-Burn belly fat : 100,000+
Calorie Counter - EasyFit free : 1,000,000+
Aunjai i lert u : 500,000+
Garmin Connect

First, let's list the most popular apps in this category, those with 100 million installs or more.

In [20]:
for app in android_free:
    if app[1] == 'HEALTH_AND_FITNESS' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Period Tracker - Period Calendar Ovulation Tracker : 100,000,000+
Samsung Health : 500,000,000+


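There are only a few extremely popular apps, so this market shows some potential. Next, we list the apps of middling popularity, those in the 1,000,000+ to 50,000,000+ install brackets.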

In [21]:
for app in android_free:
    if app[1] == 'HEALTH_AND_FITNESS' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Lose Belly Fat in 30 Days - Flat Stomach : 5,000,000+
Pedometer - Step Counter Free & Calorie Burner : 1,000,000+
Six Pack in 30 Days - Abs Workout : 10,000,000+
Lose Weight in 30 Days : 10,000,000+
Pedometer : 10,000,000+
LG Health : 10,000,000+
Step Counter - Pedometer Free & Calorie Counter : 10,000,000+
Pedometer, Step Counter & Weight Loss Tracker App : 10,000,000+
Sportractive GPS Running Cycling Distance Tracker : 1,000,000+
30 Day Fitness Challenge - Workout at Home : 10,000,000+
Home Workout for Men - Bodybuilding : 1,000,000+
Sleep Sounds : 1,000,000+
Fitbit : 10,000,000+
Calorie Counter - EasyFit free : 1,000,000+
Garmin Connect™ : 10,000,000+
BetterMe: Weight Loss Workouts : 5,000,000+
Bike Computer - GPS Cycling Tracker : 1,000,000+
Running Distance Tracker + : 1,000,000+
Runkeeper - GPS Track Run Walk : 10,000,000+
Walking: Pedometer diet : 1,000,000+
8fit Workouts & Meal Planner : 10,000,000+
Keep Trainer - Workout Trainer & Fitness Coach : 1,000,000+
PumpUp — Fitness Co
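
Comparing against each install bracket as a string works, but it gets long as the range grows. A possible alternative is to convert the Installs value to a number and filter on a numeric range. This is a sketch under that assumption; `parse_installs`, `apps_in_install_range`, and the sample rows are illustrative only.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))


def apps_in_install_range(dataset, category, low, high,
                          name_index=0, category_index=1, installs_index=5):
    """Return (name, installs) for apps in `category` with low <= installs < high."""
    matches = []
    for app in dataset:
        if app[category_index] != category:
            continue
        n_installs = parse_installs(app[installs_index])
        if low <= n_installs < high:
            matches.append((app[name_index], n_installs))
    return matches


# Made-up rows using the first six Google Play columns.
sample = [
    ['App A', 'HEALTH_AND_FITNESS', '4.1', '10', '5M', '1,000,000+'],
    ['App B', 'HEALTH_AND_FITNESS', '4.5', '20', '9M', '500,000,000+'],
    ['App C', 'GAME', '4.0', '30', '7M', '10,000,000+'],
    ['App D', 'HEALTH_AND_FITNESS', '4.3', '40', '3M', '100,000+'],
]
print(apps_in_install_range(sample, 'HEALTH_AND_FITNESS', 1_000_000, 100_000_000))
```

With the notebook's data, `apps_in_install_range(android_free, 'HEALTH_AND_FITNESS', 1_000_000, 100_000_000)` would select the same install brackets as the cell above.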

Both data sets contain some apps for better sleep in the Health and Fitness category, so it might be a good idea to bring something new to this area: for example, an app that tracks your habits, reminds you when to go to sleep, and gives you advice for sleeping better.

**CONCLUSION**

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

There are many fitness-technique guides and calorie counters, but not so many healthy-habit trackers or apps for improving sleep quality.