# Data analytics for application ads
We want to estimate the number of users per app genre, to find out which genres attract the most users on both the Apple App Store and Google Play. Based on that information, we can decide where to spend our advertising budget.

Analyzing every app would be time-consuming (there are around 4 million apps across both stores), so we analyze a sample of the data and generalize from our findings. We start by opening two files that contain information about Google Play and App Store apps.

In [1]:
from csv import reader

### The Google Play data set ###
# encoding='utf8' is an assumption; app names include non-ASCII characters
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))  # the file is closed on leaving the block
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(android,0,3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [4]:
explore_data(ios,0,3)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




It is time to perform what is known as data cleaning. Data cleaning is the work done before the analysis: removing or correcting wrong data, removing duplicates, and modifying the data to fit our purposes. In our case, we also want to remove all non-free apps and all non-English apps.

First, we delete a record with missing or incorrect data.

In [5]:
del android[10472]
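The index 10472 is taken as given above. One way to locate such a row instead of hard-coding its index is to compare each row's length with the header's length. The sketch below assumes the faulty row shows up as a column count mismatch, uses a made-up three-column sample in place of the real file, and introduces a hypothetical helper, find_malformed_rows, that is not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the second row is missing one column.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]

bad_indices = find_malformed_rows(sample_rows, sample_header)
print(bad_indices)

# Delete from the end so that the earlier indices stay valid.
for i in reversed(bad_indices):
    del sample_rows[i]
print(len(sample_rows))
```

With the real data the call would be find_malformed_rows(android, android_header). Rows that have the right number of columns but wrong values would not be caught by this check.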

We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you print the duplicate rows for an app such as Instagram, the main difference appears in the fourth position of each row, which holds the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. Rather than removing rows randomly, we keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.
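A minimal sketch of how duplicates could be counted and inspected before removing them. The rows are made up for illustration and only keep the first four columns of the Google Play layout (name first, number of reviews fourth).

```python
# Made-up rows: 'App A' appears three times with different review counts.
sample_android = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App B', 'TOOLS', '4.0', '20'],
    ['App A', 'SOCIAL', '4.5', '150'],
    ['App A', 'SOCIAL', '4.5', '120'],
]

seen = set()
duplicate_names = []
for app in sample_android:
    name = app[0]
    if name in seen:
        duplicate_names.append(name)  # second or later occurrence
    else:
        seen.add(name)

print('Number of duplicate rows:', len(duplicate_names))

# Print every row for one app to see how the entries differ.
rows_for_app = [app for app in sample_android if app[0] == 'App A']
for row in rows_for_app:
    print(row)
```

Running the same loop over android and filtering on 'Instagram' would show the rows discussed above.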

To do that, we will:

* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
* Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [6]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [7]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
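The two loops above can also be folded into one pass. The sketch below is an alternative, not the method used in this notebook: it stores the best row per app name in a dictionary, which avoids scanning the already_added list for every row. The function name dedupe_by_max_reviews and the sample rows are made up for illustration.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best_rows = {}
    for app in dataset:
        name = app[name_index]
        n_reviews = float(app[reviews_index])
        # The strict '>' keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = app
    return list(best_rows.values())


# Made-up rows: 'App A' appears three times with different review counts.
sample_android = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App B', 'TOOLS', '4.0', '20'],
    ['App A', 'SOCIAL', '4.5', '150'],
    ['App A', 'SOCIAL', '4.5', '120'],
]

deduped = dedupe_by_max_reviews(sample_android)
for row in deduped:
    print(row)
```

The result contains the same rows as the two-loop version, but possibly in a different order: rows come out in the order in which each name first appears, not in the order of the rows that were kept.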

In [8]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


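Now it's time to remove all non-English apps. We do that by dropping apps whose names contain characters that are not common in English text, such as the Chinese character '爱'. Behind the scenes, each character has a corresponding number (its code point), which the built-in ord() function returns. The characters commonly used in English text (capital letters, small letters, digits, punctuation) all fall in the ASCII range, 0 to 127.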
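As a quick illustration of these numbers, the sketch below prints the code point of a few characters and whether it falls in the ASCII range. The selection of characters is arbitrary.

```python
# ord() returns the Unicode code point of a single character.
samples = ['a', 'A', '5', '+', '爱', '™', '😜']
code_points = {character: ord(character) for character in samples}

for character, code_point in code_points.items():
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(character), code_point, label)
```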

In [9]:
def check_if_english(abc):
    for character in abc:
        if ord(character) > 127:
            return False
    return True

In [10]:
check_if_english('Instagram')

True

In [11]:
check_if_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [12]:
check_if_english('Docs To Go™ Free Office Suite')

False

In [13]:
check_if_english('Instachat 😜')

False

We see that even some English apps are classified as non-English. This is because emojis and symbols such as ™ also have code points above 127, just like Chinese characters. To give English apps the benefit of the doubt, we keep an app if its name has no more than three characters outside the ASCII range.

In [14]:
def check_if_english_uptothree(abc):
    k = 0
    for character in abc:
        if ord(character) > 127:
            k +=1
            if k > 3:
                return False
    return True

In [15]:
check_if_english_uptothree('Docs To Go™ Free Office Suite')


True

In [16]:
check_if_english_uptothree('爱奇艺PPS -《欢乐颂2》电视剧热播')


False

In [17]:
check_if_english_uptothree('Instachat 😜')

True
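The threshold can be checked at its boundary. The sketch below is an equivalent formulation of the rule, with two hypothetical helpers (count_non_ascii and is_probably_english) that are not part of the original notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for character in text if ord(character) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Accept names with at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(is_probably_english('App ™™™'))    # three non-ASCII characters
print(is_probably_english('App ™™™™'))   # four non-ASCII characters
print(is_probably_english('爱奇艺'))      # three characters, none of them ASCII
```

The last call shows a limitation of the rule: a name made up entirely of three or fewer non-ASCII characters still passes the filter, so a few non-English apps may remain in the data.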

In [18]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if check_if_english_uptothree(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if check_if_english_uptothree(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Now we isolate the free apps into separate lists.

In [19]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222
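Comparing price strings works only if every free app is written exactly as '0' or '0.0'. A slightly more tolerant sketch converts the price to a number first. It assumes that paid prices may carry a leading '$', which is an assumption about the file format and not something shown above; is_free is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    cleaned = price_text.strip().replace('$', '')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), is_free(price))
```

A price that is not a number at all raises ValueError here, which would point to another row that needs cleaning.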


So far, we have removed inaccurate data, removed duplicates, removed non-English apps, and isolated the free apps. Our aim is to determine the kinds of apps that are likely to attract more users. Our validation strategy for an app idea consists of three steps:
* Build an Android version of the app and publish it on Google Play.
* If the app gets a good response from users, develop it further.
* If it is profitable after six months, build an iOS version and publish it on the App Store.

In [20]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
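For comparison, the same percentages can be computed with collections.Counter from the standard library. This is an alternative sketch, not the notebook's implementation; freq_table_counter and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return the share of each value in the given column, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows: name and category only.
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
percentages = freq_table_counter(sample, 1)

# Sort by share, largest first, as display_table does.
for value, share in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', share)
```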

In [21]:
display_table(android_final,1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [26]:
display_table(ios_final,11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The frequency tables show that Games dominate the App Store, with about 58% of the free English apps, followed at a distance by Entertainment (about 8%). On Google Play the picture is less clear: FAMILY (about 19%) and GAME (about 10%) lead, but no single category dominates. These tables only tell us how many apps each genre contains, not how many users those apps have. To estimate popularity, we calculate the average number of ratings per app in each genre.

In [28]:
freq_ios = freq_table(ios_final,11)

In [29]:
for genre in freq_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]  # same genre column as in freq_table(ios_final, 11)
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Entertainment : 14029.830708661417
Medical : 612.0
Photo & Video : 28441.54375
Utilities : 18684.456790123455
Business : 7491.117647058823
Productivity : 21028.410714285714
Games : 22788.6696905016
Health & Fitness : 23298.015384615384
Navigation : 86090.33333333333
News : 21248.023255813954
Catalogs : 4004.0
Shopping : 26919.690476190477
Social Networking : 71548.34905660378
Lifestyle : 16485.764705882353
Travel : 28243.8
Sports : 23008.898550724636
Book : 39758.5
Music : 57326.530303030304
Reference : 74942.11111111111
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Weather : 52279.892857142855
Education : 7003.983050847458
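The nested loop above walks through the whole data set once per genre. A single pass that accumulates totals and counts gives the same averages; the sketch below shows one way to do it. The helper average_by_key and the three-column sample rows are made up for illustration.

```python
from collections import defaultdict


def average_by_key(dataset, key_index, value_index):
    """Average the numeric column value_index, grouped by the column key_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up rows: name, genre, number of ratings.
sample_ios = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Reference', '50'],
]

averages = average_by_key(sample_ios, key_index=1, value_index=2)
for genre, avg_n_ratings in averages.items():
    print(genre, ':', avg_n_ratings)
```

With the real data the call would be average_by_key(ios_final, 11, 5).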


Social Networking and Navigation have high averages, but these largely reflect the dominance of giants like Facebook, Skype, or Google Maps. A small app such as ours has little chance of standing out among them. Another option is the Reference category (books, quotes), which also has a high average and will be our go-to option.
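One way to check whether a few very large apps drive a genre's average is to list its most-rated apps and compare the mean with the median. The sketch below does this on made-up rows; genre_summary is a hypothetical helper, and it assumes the genre has at least one app.

```python
from statistics import mean, median


def genre_summary(dataset, genre, name_index, genre_index, ratings_index, top_n=3):
    """Return the mean, the median and the most-rated apps of one genre."""
    rows = [row for row in dataset if row[genre_index] == genre]
    ratings = [float(row[ratings_index]) for row in rows]
    top = sorted(rows, key=lambda row: float(row[ratings_index]), reverse=True)[:top_n]
    return {
        'mean': mean(ratings),
        'median': median(ratings),
        'top': [(row[name_index], float(row[ratings_index])) for row in top],
    }


# Made-up rows: one very large app and three small ones in the same genre.
sample_ios = [
    ['Giant App', 'Navigation', '1000000'],
    ['Small App 1', 'Navigation', '200'],
    ['Small App 2', 'Navigation', '100'],
    ['Small App 3', 'Navigation', '300'],
    ['Other App', 'Games', '500'],
]

summary = genre_summary(sample_ios, 'Navigation', 0, 1, 2)
print('mean  :', summary['mean'])
print('median:', summary['median'])
for name, n_ratings in summary['top']:
    print(name, ':', n_ratings)
```

A mean far above the median, as in this sample, suggests that a handful of apps account for most of the ratings. With the real data the call would be genre_summary(ios_final, 'Navigation', 1, 11, 5).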

In [30]:
freq_android = freq_table(android_final,9)

In [None]:
for category in freq_android:
    total = 0
    len_category = 0
    for app in android_final:
        app_genre = app[9]
        