# Project 1: Profitable Apps Profile

## Introduction

As a company that builds free apps, our revenue comes from in-app ads. Our goal is therefore to reach as many users as possible.

In this project we find out which kinds of apps are likely to attract more users, so that our developers know where to focus their efforts.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

# Read each CSV completely; the with-blocks close the files afterwards
with open("my_datasets/App_Store/AppleStore.csv", encoding='utf8') as open_AS_file:
    app_AS = list(reader(open_AS_file))
with open("my_datasets/Google_Play/googleplaystore.csv", encoding='utf8') as open_GP_file:
    app_GP = list(reader(open_GP_file))

# Keep the header rows separately from the data rows
app_AS_header = app_AS[0]
app_GP_header = app_GP[0]

app_AS = app_AS[1:]
app_GP = app_GP[1:]

In [3]:
explore_data(app_AS,0,3,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


In [4]:
explore_data(app_GP,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Column Names:

The next two cells show the column names of each of the two data sets:

In [5]:
print(app_AS_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [6]:
print(app_GP_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Interesting columns:

For the App Store:
 - track_name
 - rating_count_tot
 - rating_count_ver
 - price
 - currency
 - prime_genre

 More info: __[documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)__
 
For Google Play:
 - App
 - Category
 - Rating
 - Reviews
 - Price
 - Genres

## Checking Data in Google Play

Below is row 10472, where the "Category" field is missing, so the row has 12 elements instead of 13:

In [7]:
print(app_GP_header)
print("Elements in header: " + str(len(app_GP_header)))
print("\n")
print(app_GP[10472])
print("Elements in row 10472: " + str(len(app_GP[10472])))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Elements in header: 13


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Elements in row 10472: 12


More info about the issue: __[discussion post](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)__
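
A row like this can also be located programmatically by comparing the length of each row with the length of the header. The sketch below is only an illustration of that idea: find_malformed_rows is a hypothetical helper, and the small sample_header and sample_rows lists are made up to stand in for app_GP_header and app_GP.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical miniature data set, not taken from the real CSV
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is short
    ['App C', 'GAME', '3.9'],
]

for index, row in find_malformed_rows(sample_rows, sample_header):
    print(index, row)
```

Called as find_malformed_rows(app_GP, app_GP_header), the same helper would list every row whose number of fields does not match the header.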

## Deleting Data in Google Play

The cell below deletes row 10472. Run it only once: a second run would delete the valid row that takes its place.

In [8]:
del app_GP[10472]

The new row 10472 is:

In [9]:
print(app_GP[10472])
print("Elements in row 10472: ", len(app_GP[10472]))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Elements in row 10472:  13
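
Because the del statement removes whatever row currently sits at the given index, one possible safeguard is to delete only while the row still has the wrong number of fields. This is a sketch under that assumption; delete_if_malformed is a hypothetical helper and the sample rows are invented.

```python
def delete_if_malformed(dataset, index, expected_len):
    """Delete dataset[index] only if its length differs from expected_len.

    Returns True when a row was removed, False otherwise.
    """
    if len(dataset[index]) != expected_len:
        del dataset[index]
        return True
    return False


# Hypothetical sample rows (three fields expected)
rows = [['App A', 'TOOLS', '4.2'], ['App B', '1.9'], ['App C', 'GAME', '3.9']]

print(delete_if_malformed(rows, 1, 3))  # first run removes the short row
print(delete_if_malformed(rows, 1, 3))  # second run leaves 'App C' alone
print(len(rows))
```

With this guard, re-running the cell is harmless, since the valid row that moved into the index passes the length check.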


## Checking Data in App Store

The discussion section of the App Store data set:
__[discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion)__

No issues were found there.

## Removing duplicated rows

In the Google Play data set, some applications appear more than once.

The main reason is that the data was captured at different moments in time. Therefore, only the most recent entry should be kept.

For example:

In [10]:
duplicate_apps = []
unique_apps = []

for app in app_GP:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
# Print every row of the first duplicated app as an example
repeated_name = duplicate_apps[0]
for app in app_GP:
    if app[0] == repeated_name:
        print(app)
        print("\n")

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']




In [11]:
print("Total number of duplicated rows: ", len(duplicate_apps))

Total number of duplicated rows:  1181
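
The duplicate search checks membership against a list, which rescans the list for every row. A sketch of the same count using a set for the names already seen is shown here; the function name count_duplicates and the sample rows are illustrative assumptions, not part of the original notebook.

```python
def count_duplicates(dataset, name_index=0):
    """Return (duplicate_names, unique_count) for the column at name_index."""
    seen = set()
    duplicates = []
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)   # every repeat after the first occurrence
        else:
            seen.add(name)
    return duplicates, len(seen)


# Hypothetical rows: only the app name matters here
sample = [['A'], ['B'], ['A'], ['C'], ['A']]
duplicates, unique_count = count_duplicates(sample)
print(duplicates, unique_count)
```

The result has the same meaning as duplicate_apps and unique_apps: one entry per repeated row, plus the number of distinct names.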


To remove the rows of duplicated apps, a criterion must be set. One option is to use the number of reviews: the higher the number of reviews, the more recent the captured data.

For that purpose, a dictionary with the app names (keys) and the highest number of reviews (values) is created:

In [12]:
reviews_max = {}

for app in app_GP:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

The length of this dictionary must be 9659: 10840 initial rows minus 1181 duplicates.

In [13]:
print("Initial elements: ", len(app_GP))
print("Duplicated elements: ", len(duplicate_apps))
print("Elements in reviews_max dictionary: ", len(reviews_max))

Initial elements:  10840
Duplicated elements:  1181
Elements in reviews_max dictionary:  9659


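We now use the information from the reviews_max dictionary to create a new data set with the cleaned data. The already_added list ensures that an app is added only once when several of its rows share the same maximum number of reviews.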

In [14]:
android_clean = []
already_added = []

for app in app_GP:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

Length of android_clean must be 9659:

In [15]:
print(len(android_clean))

9659
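
The two passes used here (build reviews_max, then filter) can also be folded into one. This is only a sketch of an alternative, not the approach used in this notebook: keep_most_reviewed is a hypothetical name, and it assumes the app name is at index 0 and the review count at index 3, as in the Google Play rows. The sample rows are invented.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> row; dicts preserve insertion order in Python 3.7+
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical rows shaped like the Google Play data (name, category, rating, reviews)
sample = [
    ['Scanner', 'BUSINESS', '4.2', '80804'],
    ['Scanner', 'BUSINESS', '4.2', '80805'],
    ['Scanner', 'BUSINESS', '4.2', '80805'],
    ['Notes', 'TOOLS', '4.0', '12'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

The rows come out ordered by the first appearance of each name, which may differ from the order of android_clean, but the number of rows kept is the same.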


## Removing non-ASCII apps

To keep only the apps whose names are written in ASCII characters, a filter is applied.

If any character of the name is not ASCII, the app is removed from the data set.

In [16]:
def name_with_ASCII_char_old (name):
    for char in name:
        if ord(char) > 127:
            return False
    return True

#Examples:

print(name_with_ASCII_char_old('Instagram'))
print(name_with_ASCII_char_old('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(name_with_ASCII_char_old('Docs To Go™ Free Office Suite'))
print(name_with_ASCII_char_old('Instachat 😜'))

True
False
False
False


Since emojis and other non-ASCII characters such as "™" can appear in the names of English apps, the filter function is relaxed: an app is labeled as English if its name contains at most 3 non-ASCII characters.

In [17]:
def name_with_ASCII_char (name):
    count = 0
    for char in name:
        if ord(char) > 127:
            count += 1
        if count == 4:    
            return False
    return True

#Examples:

print(name_with_ASCII_char('Instagram'))
print(name_with_ASCII_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(name_with_ASCII_char('Docs To Go™ Free Office Suite'))
print(name_with_ASCII_char('Instachat 😜'))

True
False
True
True
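
The same rule can be written with the threshold as a parameter, which makes it easy to try other limits. A minimal sketch, assuming a hypothetical function name is_probably_english and a default of 3 allowed non-ASCII characters:

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

Like the original function, this is a heuristic: an English name with more than the allowed number of symbols is rejected, and a non-English name written entirely in ASCII letters is accepted.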


Using the function "name_with_ASCII_char", the App Store and Google Play data sets are filtered to include only applications with "English" names:

In [18]:
app_GP_English = []
app_AS_English = []

for app in android_clean:
    name = app[0]
    if name_with_ASCII_char(name):
        app_GP_English.append(app)
        
for app in app_AS:
    name = app[2]
    if name_with_ASCII_char(name):
        app_AS_English.append(app)
        
print("Global Google Play dataset has ",len(android_clean)," entries.")
print("English Google Play dataset has ",len(app_GP_English)," entries.")
print("\n")
print("Global App Store dataset has ",len(app_AS)," entries.")
print("English App Store dataset has ",len(app_AS_English)," entries.")

Global Google Play dataset has  9659  entries.
English Google Play dataset has  9614  entries.


Global App Store dataset has  7197  entries.
English App Store dataset has  6183  entries.


## Filtering free apps

After removing the non-English apps, the next step is to keep only the free apps.

New data sets are created from the previous ones, storing only the free apps:

In [19]:
app_GP_free = []
app_AS_free = []

for app in app_GP_English:
    price = app[7]
    if price == "0":
        app_GP_free.append(app)
        
for app in app_AS_English:
    price = app[5]
    if price == "0":
        app_AS_free.append(app)
        
print("The number of English and free apps in Google Play is ", len(app_GP_free), ".")
print("The number of English and free apps in App Store is ", len(app_AS_free), ".")

The number of English and free apps in Google Play is  8864 .
The number of English and free apps in App Store is  3222 .
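
The comparison price == "0" depends on the exact text stored in the file. If the price column used another notation for zero, a numeric comparison would be more tolerant. The sketch below assumes, purely for illustration, that prices may appear as '0', '0.0' or with a leading '$'; is_free is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable prices are treated as not free


for text in ['0', '0.0', '$0.00', '3.99', '$4.99', 'Everyone']:
    print(repr(text), is_free(text))
```

Treating unparseable values as not free is a design choice of this sketch: it keeps shifted or corrupted rows out of the free data sets instead of raising an error.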


## Choosing the most profitable genre

Now it's time to choose the genre of the app that will be developed. As the goal is to create a free English app, the latest data sets (app_GP_free and app_AS_free) will be used.

The market strategy is:

1. Develop a basic app for Google Play
2. If the app gets a good response, develop it further
3. If the app is profitable after 6 months, develop the iOS version

Given this strategy, both data sets must be considered when choosing the genre, in order to maximize profits.

To define the genre, a frequency table of genres is built. For the GP data set, the columns of interest are the 2nd ("Category") and the 10th ("Genres"). For the AS data set, the 13th column ("prime_genre") is analyzed.

The functions needed are defined below:

In [20]:
def freq_table(dataset, index):
    freq_tbl = {}
    total = 0
    for app in dataset:
        total += 1
        el = app[index]
        if el in freq_tbl:
            freq_tbl[el] += 1
        else:
            freq_tbl[el] = 1
    for element in freq_tbl:
        freq_tbl[element] = freq_tbl[element] * 100 / total
    return freq_tbl

#def sort_frequency_table_as_tuple(freq_tbl):
#    sorted_tuple = []
#    for element in freq_tbl:
#        tuple_input = (element, freq_tbl[element])
#        sorted_tuple.append(tuple_input)
#    sorted_tuple = sorted(sorted_tuple, key=lambda tup: tup[1], reverse = True)
#    return sorted_tuple

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
    

The function "display_table" will be used for this purpose. Its input parameters are the data set and the index of the column to be analyzed.
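
As a side note, a percentage frequency table can also be built with collections.Counter from the standard library. The sketch below is an alternative under that assumption, not the function used in this notebook; freq_table_counter is a hypothetical name and the sample rows are invented.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at index, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count * 100 / total for value, count in counts.most_common()}


# Hypothetical rows with the genre at index 1
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
```

Because most_common() already orders the values by count, the resulting dictionary can be printed directly without a separate sorting step.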

In [21]:
display_table(app_GP_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

In [22]:
display_table(app_GP_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.700361010830325
Medical : 3.5311371841155235
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.237815884476534
Action : 3.1024368231046933
Health & Fitness : 3.079873646209386
Photography : 2.9444945848375452
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.041967509025271
Dating : 1.861462093862816
Arcade : 1.8501805054151625
Video Players & Editors : 1.7712093862815885
Casual : 1.759927797833935
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418774
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075813

In [23]:
display_table(app_AS_free, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


For Google Play, the "Genres" column is too granular, so the "Category" column will be used instead.

"FAMILY" (18.9%) is the biggest category, followed by "GAME" (9.7%).
For the App Store, "Games" (58.2%) is by far the biggest group, with no close competitor.

However, these are only the shares of apps per genre. The number of users per genre still needs to be analyzed.

## Popular apps in App Store

To estimate the average number of users per genre, a different column is used in each data set:
1. For AS: "rating_count_tot" (column 7 - index 6), used as a proxy because this data set has no installs column
2. For GP: "Installs" (column 6 - index 5)

In [24]:
#Function to sort and display dictionaries
def display_table_from_dict(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [25]:
AS_reviews_table = freq_table(app_AS_free, 12)
for genre in AS_reviews_table:
    total = 0
    len_genre = 0
    for app in app_AS_free:
        genre_app = app[12]
        if genre == genre_app:
            total += float(app[6])
            len_genre += 1
    AS_reviews_table[genre] = total / len_genre

display_table_from_dict(AS_reviews_table)

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


This is the average number of user ratings for each genre. At first sight, Navigation seems to be a successful genre.

However, it's also important to look at the distribution. This genre has few apps and 2 of them are Google Maps and Waze, which heavily skew the average.

The same happens with Social Networking and Music.

Reference is also influenced by the rating counts of the Bible and Dictionary.com. However, digitizing a book might be a good option (offering tools that paper books don't have, such as a dictionary).
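
One way to see how much a few large apps drive a genre's average is to compare the mean with the median and to list the genre's apps sorted by rating count. The sketch below uses a hypothetical helper genre_summary and made-up rows. In the real App Store rows the name is at index 2, the rating count at index 6 and the genre at index 12, but the sample uses shorter rows for readability.

```python
from statistics import mean, median


def genre_summary(dataset, genre, name_index, count_index, genre_index):
    """Return (mean, median, apps sorted by count descending) for one genre."""
    apps = [(float(row[count_index]), row[name_index])
            for row in dataset if row[genre_index] == genre]
    counts = [count for count, _ in apps]
    return mean(counts), median(counts), sorted(apps, reverse=True)


# Hypothetical rows: (name, rating count, genre); the numbers are invented
sample = [
    ['Big Maps', '300000', 'Navigation'],
    ['Small Compass', '500', 'Navigation'],
    ['Tiny GPS', '100', 'Navigation'],
    ['Some Game', '20000', 'Games'],
]
avg, med, apps = genre_summary(sample, 'Navigation', 0, 1, 2)
print('mean:', avg, 'median:', med)
for count, name in apps:
    print(name, ':', count)
```

A mean far above the median, as in this invented sample, is a sign that one or two apps account for most of the ratings in the genre.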

## Popular apps in Google Play

For GP, the average number of installs per category is computed from the "Installs" column (column 6 - index 5).

The problem with "Installs" is that it is a discrete field that defines ranges (0, 5+, 10+, 50+...). This is not a big issue, but the values must be cleaned (removing "+" and ",") before they can be converted from string to float.
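
The conversion can be isolated in a small helper, which makes it easy to check on a few values. A minimal sketch (the name parse_installs is hypothetical):

```python
def parse_installs(text):
    """Convert an install range such as '10,000+' to a float (its lower bound)."""
    return float(text.replace('+', '').replace(',', ''))


for text in ['0', '5+', '1,000+', '5,000,000+']:
    print(text, '->', parse_installs(text))
```

Each converted value is the lower bound of its range, so averages computed from it tend to underestimate the real number of installs.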

In [26]:
GP_reviews_table = freq_table(app_GP_free, 1)
for category in GP_reviews_table:
    total = 0
    len_category = 0
    for app in app_GP_free:
        category_app = app[1]
        if category == category_app:
            n_installs = app[5]
            n_installs = n_installs.replace("+","")
            n_installs = n_installs.replace(",","")
            total += float(n_installs)
            len_category += 1
    GP_reviews_table[category] = total / len_category

display_table_from_dict(GP_reviews_table)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

As in the AS analysis, a few very heavily used apps skew the averages (in Communication, Social, ...).

Again, books and reference apps are in a good position. The category is made up of different types of apps, from the Bible to eBook readers. Taking a well-known book and digitizing it with extra features (such as a dictionary, quizzes, quotes, ...) might be a good idea for a free app.