<a href="https://colab.research.google.com/github/Said-Akbar/Data-science/blob/master/project_00_Android_iOS_apps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## App store and Google Play trend analyses

The aim of this project is to analyze the app marketplace of Apple app Store and Android Google Play to find out what type of apps are more likely to attract more users.

In [0]:
from csv import reader
def load_dataset(filename):
    data_file = open(filename)
    app_lists = reader(data_file)
    apps_data = list(app_lists)
    return apps_data[0], apps_data[1:]

In [0]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [0]:
header_apl, apple = load_dataset('AppleStore.csv')
header_andr, andrd = load_dataset('googleplaystore.csv')

In [0]:
explore_data(apple, 0,3, rows_and_columns=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [0]:
explore_data(andrd, 0,3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [0]:
print('Android apps data header:\n', header_andr, '\n')
print('Apple apps data header:\n', header_apl)

Android apps data header:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Apple apps data header:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


For android apps, we are interested in Category, Rating, Installs, Genres, price.
For IOS apps, we are interested in currency, price, rating_count_tot, cont_rating and user_rating

Links to the datasets and their description:

[Android apps](https://www.kaggle.com/lava18/google-play-store-apps/home) 

[IOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

In [0]:
andrd[10472] # this row is missing 'category'

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [0]:
del andrd[10472]

There might be duplicate apps in the dataset. To check for duplicate apps, we search for similar named apps as follows:

In [0]:
duplicate_apps =[]
unique_apps = []
for app in andrd:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Duplicate apps:', len(duplicate_apps), '\n')
print('Some examples of duplicates:', duplicate_apps[:15]) 

Duplicate apps: 1181 

Some examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


As we can see above, Google play dataset has 1181 duplicate apps. For example, if we search for 'Google My Business' app, we encounter several copies of it:

In [0]:
c=0
for app in andrd:
    if app[0]=='Google My Business':
        print(app, c)
    c+=1

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up'] 193
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up'] 239
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up'] 268


Removing these duplicates will be done based on the number reviews. More reviews mean the data is recent. However, if reviews are the same, then rows are deleted randomly by leaving only one record in the dataset.


In [0]:
apps_dict = {}
for app in andrd:
    reviews = float(app[3])
    name = app[0]
    if (name in apps_dict) and (apps_dict[name] < reviews):
        apps_dict[name]=reviews # if the app is already in the list, then update with max reviews
    elif name not in apps_dict:
        apps_dict[name]=reviews # if app not in the list, create the app name

In [0]:
len(apps_dict)

9659

In [0]:
android_data = []
names= []
for app in andrd:
    name = app[0]
    reviews = float(app[3])
    if reviews == apps_dict[name] and name not in names:
        android_data.append(app)
        names.append(name)

Above, we first created a new dictionary containing  maximum number of reviews for each unique app name, so duplicates with less reviews are removed. In the next step, we created a list that contains apps lists with reviews.

In [0]:
duplicate_apps =[]
unique_apps = []
for app in apple:
    name=app[1]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(len(apple))
print('Duplicate apps:', len(duplicate_apps), '\n')
print('Some examples of duplicates:', duplicate_apps[:15]) 

7197
Duplicate apps: 2 

Some examples of duplicates: ['Mannequin Challenge', 'VR Roller Coaster']


Now we have to remove non-English apps

In [0]:
def is_eng(string):
    c=0
    for i in string:
        if ord(i)>127:
            c+=1
        if c>3: return False
    return True

In [0]:
# checking the function
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))

True
False
True
True


In [0]:
android_ds = []
ios_ds = []
for app in android_data:
    name=app[0]
    if is_eng(name):
        android_ds.append(app)
for app in apple:
    name=app[1]
    if is_eng(name):
        ios_ds.append(app)

In [0]:
print('Length of Android dataset:', len(android_ds))
print('Length of iOS dataset:', len(ios_ds))

Length of Android dataset: 9614
Length of iOS dataset: 6183


In [0]:
free_andr = []
free_ios = []
for app in android_ds:
    if '$' in app[7]:
        price == float(app[7].split('$')[1])
    else:
        price = float(app[7])
    if price == 0.0:
        free_andr.append(app)
for app in ios_ds:
    price = float(app[4])
    if price == 0.0:
        free_ios.append(app)

Our goal is to build an app that is free and generates revenue only from in-app ads. For this purpose, we need to investigate both Google Play and iOS app markets to find out which apps are thriving in both markets.

We will have to investigate genres of apps and create a frequency table for both markets. Specifically, in Google Play dataset, we are interested in 'Genres' and 'Category', and for iOS dataset, 'prime_genre'.

In [0]:
def display_table(dataset, index): # function to sort dictionary
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        if entry[0]>20:
            print(entry[1], ':', entry[0])

In [0]:
def freq_table(dataset, index): #function to create a frequency table
    freqn_table = {}
    for i in dataset:
        col_value = i[index]
        if col_value in freqn_table:
            freqn_table[col_value]+=1
        else:
            freqn_table[col_value]=1
    return freqn_table

Our functions to build a frequency table is ready. Let us check which categories are popular in both markets:

In [0]:
display_table(free_ios, 11)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26


The most common genre for iOS is 'Games' and the runner-up is 'Entertainment'. from the table above, we can say that most of the apps are designed for entertainment. However, having a large number of apps does not imply that the genre has a large audience.

In [0]:
display_table(free_andr, 1)

FAMILY : 1858
GAME : 944
TOOLS : 828
BUSINESS : 419
MEDICAL : 395
PERSONALIZATION : 375
PRODUCTIVITY : 373
LIFESTYLE : 364
FINANCE : 345
SPORTS : 325
COMMUNICATION : 314
HEALTH_AND_FITNESS : 288
PHOTOGRAPHY : 280
NEWS_AND_MAGAZINES : 250
SOCIAL : 239
TRAVEL_AND_LOCAL : 219
BOOKS_AND_REFERENCE : 218
SHOPPING : 201
DATING : 170
VIDEO_PLAYERS : 163
MAPS_AND_NAVIGATION : 129
FOOD_AND_DRINK : 112
EDUCATION : 106
ENTERTAINMENT : 87
LIBRARIES_AND_DEMO : 84
AUTO_AND_VEHICLES : 84
WEATHER : 79
HOUSE_AND_HOME : 73
EVENTS : 64
PARENTING : 60
ART_AND_DESIGN : 60
COMICS : 55
BEAUTY : 53


In [0]:
display_table(free_andr, 9)

Tools : 827
Entertainment : 557
Education : 503
Business : 419
Medical : 395
Personalization : 375
Productivity : 373
Lifestyle : 363
Finance : 345
Sports : 331
Communication : 314
Action : 299
Health & Fitness : 288
Photography : 280
News & Magazines : 250
Social : 239
Travel & Local : 218
Books & Reference : 218
Shopping : 201
Simulation : 190
Arcade : 184
Dating : 170
Casual : 165
Video Players & Editors : 161
Maps & Navigation : 129
Puzzle : 119
Food & Drink : 112
Role Playing : 104
Strategy : 94
Racing : 91
Libraries & Demo : 84
Auto & Vehicles : 84
Weather : 79
House & Home : 73
Adventure : 72
Events : 64
Art & Design : 56
Comics : 54
Beauty : 53
Card : 47
Parenting : 46
Board : 42
Casino : 39
Educational;Education : 38
Trivia : 37
Educational : 37
Education;Education : 35
Casual;Pretend Play : 25
Word : 23


Based on free android apps frequency table, we can see that tools and entertainment are most common genres for Google Play store as well.
Combining both markets tables, we can say that tools and entertainment genres should be in our mind when creating a new app. However, these tables do not take into account the number of users for each genre. For this purpose, we will need to add more insights based on the number of users for genres.

In [0]:
# sort_dictionary for our dictionary
def sort_dict(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        if entry[0]>20:
            print(entry[1], ':', entry[0])

In [0]:
def avg_users(table, index, installs): # table is dataset, index is the location of genres column, 
                                       # installs is the count of installations/users.
    os_genres = freq_table(table, index)
    os_avg_usr = {}
    for genre in os_genres:
        total = 0
        len_genre = 0
        for app in table:
            genre_app = app[index]
            if genre_app == genre:
                if '+' in app[installs]:
                    users = app[installs]
                    users = users.replace('+', '')
                    users = users.replace(',','')
                    users = float(users)
                else:
                    users = float(app[installs])
                total += users
                len_genre += 1
        avg_users = total/len_genre
        os_avg_usr[genre]=round(avg_users)
    sort_dict(os_avg_usr)

In [0]:
avg_users(free_ios, 11, 5)

Navigation : 86090
Reference : 74942
Social Networking : 71548
Music : 57327
Weather : 52280
Book : 39758
Food & Drink : 33334
Finance : 31468
Photo & Video : 28442
Travel : 28244
Shopping : 26920
Health & Fitness : 23298
Sports : 23009
Games : 22789
News : 21248
Productivity : 21028
Utilities : 18684
Lifestyle : 16486
Entertainment : 14030
Business : 7491
Education : 7004
Catalogs : 4004
Medical : 612


The average number of users for each genre for iOS apps shown above. Navigation genre has the highest number of users and the runner-up is reference genre.

In [0]:
avg_users(free_andr, 1, 5)

COMMUNICATION : 35153714
VIDEO_PLAYERS : 24121489
SOCIAL : 22961790
PHOTOGRAPHY : 16636241
PRODUCTIVITY : 15530942
GAME : 14256218
TRAVEL_AND_LOCAL : 13218663
ENTERTAINMENT : 11375402
TOOLS : 9785955
NEWS_AND_MAGAZINES : 9472807
BOOKS_AND_REFERENCE : 7641778
SHOPPING : 6966909
WEATHER : 4570893
PERSONALIZATION : 4086652
HEALTH_AND_FITNESS : 3972300
MAPS_AND_NAVIGATION : 3900635
SPORTS : 3373768
FAMILY : 3345019
FOOD_AND_DRINK : 1891060
ART_AND_DESIGN : 1887285
EDUCATION : 1782566
BUSINESS : 1663759
LIFESTYLE : 1369955
HOUSE_AND_HOME : 1331541
FINANCE : 1319851
DATING : 828971
COMICS : 817657
AUTO_AND_VEHICLES : 632501
LIBRARIES_AND_DEMO : 630904
PARENTING : 525352
BEAUTY : 513152
EVENTS : 249581
MEDICAL : 96944


The average number of users for Android app genres is shown above. Communication genre has the highest number of average users. This includes messengers and other texting/calling apps. We recommend that our company should also create an app in communication category.