# Profitable App Profiles for the App Store and Google Play
For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue is in-app ads. The number of users therefore determines our revenue for any given app: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand which types of apps are likely to attract more users.

In [66]:
from csv import reader

### Google Play data set ###
# "with" closes the file once the rows have been read
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]



In [67]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Here we explore a few rows of the iOS data set (indices 1 through 3) and print its header.

In [68]:
explore_data(ios,1,4,True)
print(ios_header)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Potentially useful columns: `price`, `track_name`, `prime_genre`, `rating_count_ver`.
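Because each row is a plain list, these columns have to be addressed by position. The following is a minimal sketch of looking those positions up from the header instead of counting by hand. The helper name `column_index` is hypothetical, and the header is copied from the output above.

```python
# Header copied from the printed output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

def column_index(header, name):
    """Return the position of a column name in a header row (hypothetical helper)."""
    return header.index(name)

# Print the position of each column of interest.
for col in ['price', 'track_name', 'prime_genre', 'rating_count_ver']:
    print(col, column_index(ios_header, col))
```

Looking indices up by name keeps later cells readable and raises a `ValueError` if a column name is misspelled.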

Since our company focuses on an English-speaking audience and on free apps, we will ignore paid apps and apps that are not in English.

In [69]:
print(len(android[0]))

13


In [70]:
print(android[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


There seems to be a row with the wrong number of fields (every row should have 13). We locate it and then remove it.

In [71]:
for count, row in enumerate(android):
    if len(row) != 13:
        print(count, row)

10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


As we can see, this row has only 12 entries instead of 13, so we delete row 10472.

In [72]:
del android[10472]
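The check-then-delete step above can also be sketched as one reusable helper. This is only an illustration on made-up toy rows; `find_malformed_rows` is a hypothetical name, not part of the original notebook.

```python
def find_malformed_rows(rows, expected_len):
    """Return (index, row) pairs whose length differs from expected_len."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_len]

# Toy data standing in for the real data set (values are made up).
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],            # one field is missing
    ['App C', 'GAME', '4.5'],
]

bad = find_malformed_rows(toy_rows, expected_len=3)

# Delete from the end so that earlier indices stay valid.
for i, _ in reversed(bad):
    del toy_rows[i]

print(bad)
print(len(toy_rows))
```

Deleting by a hard-coded index is not idempotent: rerunning a cell like `del android[10472]` removes whichever row now sits at that position. Recomputing the malformed indices before deleting avoids that.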

Some apps may be listed more than once. To avoid counting the same app several times, we first check whether the app Instagram is repeated:

In [73]:
for app in android:
    name=app[0]
    if name=='Instagram':
        print(app)
    

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Instagram is listed four times, so duplicates do exist. We now split the app names into two lists: `unique_apps` receives each name the first time it appears, and `duplicate_apps` receives every repeat occurrence.

In [117]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(len(unique_apps))

9659


An app's review count generally grows over time, so among duplicate rows the one with the most reviews should be the most recent data point. For each app we keep that row and discard the older ones.

In [118]:
print(len(duplicate_apps))

1181


The data set contains 1181 duplicate entries.
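As a sanity check, the unique and duplicate counts should add up to the number of rows scanned; with the figures above, 9659 + 1181 = 10840. The sketch below shows the same split on a toy list of names. It tracks seen names in a `set`, which is an alternative to the list lookup used in the notebook, and `split_duplicates` is a hypothetical helper name.

```python
def split_duplicates(names):
    """Split names into first occurrences and repeat occurrences."""
    seen = set()              # set membership tests do not scan the whole collection
    unique, duplicates = [], []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
            unique.append(name)
    return unique, duplicates

# Toy names (made up for illustration).
toy_names = ['Instagram', 'Slack', 'Instagram', 'Instagram', 'Box']
unique, duplicates = split_duplicates(toy_names)

print(unique)
print(duplicates)
# Every name lands in exactly one of the two lists.
print(len(unique) + len(duplicates) == len(toy_names))
```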

In [119]:
apprev = {}

for app in android:
    name = app[0]
    rev = float(app[3])
    # keep the highest review count seen so far for each app name
    if name not in apprev or rev > apprev[name]:
        apprev[name] = rev

print(len(apprev))

9659


In [151]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (apprev[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # record the name so rows that tie on review count are added only once

In [152]:
print(len(android_clean))

9659
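The notebook removes duplicates in two passes: first it builds a dictionary of maximum review counts, then it filters the rows. A single-pass variant is sketched below on shortened copies of the Instagram rows shown earlier, plus one made-up row. `keep_most_reviewed` is a hypothetical helper, and the column positions are assumptions matching the Google Play layout (name at index 0, reviews at index 3).

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # On a tie the row seen first is kept, because only a strictly
        # larger count replaces the stored row.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Some App', 'TOOLS', '4.0', '10'],   # made-up row
]

clean = keep_most_reviewed(toy_rows)
for row in clean:
    print(row)
```

Storing the whole row in the dictionary removes the need for a separate `already_added` list.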


In [153]:
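def isenglish(string):
    # Treat a name as English unless it contains more than 3 non-ASCII
    # characters (code points above 127), so a name with an emoji or a
    # trademark sign still passes.
    count = 0
    for a in string:
        if ord(a) > 127:
            count += 1
        if count > 3:
            return False
    return True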
    

In [156]:
isenglish('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
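To make the threshold in `isenglish` concrete, here is a small sketch of the same rule with the limit exposed as a parameter. The name `is_english`, the `max_non_ascii` parameter, and the sample strings other than the one tested above are illustrative assumptions.

```python
def is_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))                       # no non-ASCII characters
print(is_english('Docs To Go™ Free Office Suite'))   # one non-ASCII character
print(is_english('Instachat 😜'))                    # one non-ASCII character
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # many non-ASCII characters
```

Allowing a few non-ASCII characters keeps English names that contain an emoji or a symbol, at the cost of letting through short non-English names with three or fewer such characters.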

In [165]:
android_eng = []

for app in android_clean:
    name = app[0]
    if isenglish(name):
        android_eng.append(app)
        


In [180]:
print(len(android_eng))
print(android_header)
print(android_eng[1][7])

9614
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
0


In [188]:
freeapps = []

for app in android_eng:
    # the price is stored as text and may start with '$'; strip it before converting
    price = float(app[7].lstrip('$'))
    if price == 0.0:
        freeapps.append(app)

print(len(freeapps))

8864


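We keep only the free apps because we only build apps that are free to download, and our revenue comes from in-app ads.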
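The free-app filter above depends on turning the price text into a number. A minimal sketch of that conversion is shown below. `parse_price` and `is_free` are hypothetical helpers, and the `'$4.99'` style of paid price is an assumption inferred from the `'$'` check in the notebook code.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(text.lstrip('$'))

def is_free(text):
    """Return True when the parsed price is exactly zero."""
    return parse_price(text) == 0.0

for sample in ['0', '0.0', '$4.99']:
    print(sample, parse_price(sample), is_free(sample))
```

The same helper works for both price formats seen in the outputs above: `'0'` in the Google Play rows and `'0.0'` in the App Store rows.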

In [190]:
print(freeapps[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [201]:
def freq_table(dataset, index):
    table = {}

    for app in dataset:
        cate = app[index]  # use the requested column instead of a fixed one
        if cate in table:
            table[cate] += 1
        else:
            table[cate] = 1

    # convert counts to percentages of the data set that was passed in
    result = {}
    for item in table:
        result[item] = table[item] / len(dataset) * 100
    print(result)
    return result
    

    

In [202]:
freq_table(freeapps,1)

    

{'ART_AND_DESIGN': 0.6430505415162455, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'BEAUTY': 0.5979241877256317, 'BOOKS_AND_REFERENCE': 2.1435018050541514, 'BUSINESS': 4.591606498194946, 'COMICS': 0.6204873646209386, 'COMMUNICATION': 3.2378158844765346, 'DATING': 1.861462093862816, 'EDUCATION': 1.1620036101083033, 'ENTERTAINMENT': 0.9589350180505415, 'EVENTS': 0.7107400722021661, 'FINANCE': 3.7003610108303246, 'FOOD_AND_DRINK': 1.2409747292418771, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'HOUSE_AND_HOME': 0.8235559566787004, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'LIFESTYLE': 3.9034296028880866, 'GAME': 9.724729241877256, 'FAMILY': 18.907942238267147, 'MEDICAL': 3.531137184115524, 'SOCIAL': 2.6624548736462095, 'SHOPPING': 2.2450361010830324, 'PHOTOGRAPHY': 2.944494584837545, 'SPORTS': 3.395758122743682, 'TRAVEL_AND_LOCAL': 2.33528880866426, 'TOOLS': 8.461191335740072, 'PERSONALIZATION': 3.3167870036101084, 'PRODUCTIVITY': 3.892148014440433, 'PARENTING': 0.6543321299638989, 'WEATHER': 

{'ART_AND_DESIGN': 0.6430505415162455,
 'AUTO_AND_VEHICLES': 0.9250902527075812,
 'BEAUTY': 0.5979241877256317,
 'BOOKS_AND_REFERENCE': 2.1435018050541514,
 'BUSINESS': 4.591606498194946,
 'COMICS': 0.6204873646209386,
 'COMMUNICATION': 3.2378158844765346,
 'DATING': 1.861462093862816,
 'EDUCATION': 1.1620036101083033,
 'ENTERTAINMENT': 0.9589350180505415,
 'EVENTS': 0.7107400722021661,
 'FINANCE': 3.7003610108303246,
 'FOOD_AND_DRINK': 1.2409747292418771,
 'HEALTH_AND_FITNESS': 3.0798736462093865,
 'HOUSE_AND_HOME': 0.8235559566787004,
 'LIBRARIES_AND_DEMO': 0.9363718411552346,
 'LIFESTYLE': 3.9034296028880866,
 'GAME': 9.724729241877256,
 'FAMILY': 18.907942238267147,
 'MEDICAL': 3.531137184115524,
 'SOCIAL': 2.6624548736462095,
 'SHOPPING': 2.2450361010830324,
 'PHOTOGRAPHY': 2.944494584837545,
 'SPORTS': 3.395758122743682,
 'TRAVEL_AND_LOCAL': 2.33528880866426,
 'TOOLS': 8.461191335740072,
 'PERSONALIZATION': 3.3167870036101084,
 'PRODUCTIVITY': 3.892148014440433,
 'PARENTING': 0.6

In [203]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
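A quick way to check a percentage table is to confirm that its values add up to 100. The sketch below does this on made-up toy rows. It also sorts with a `key` function rather than by swapping each pair into a `(value, key)` tuple. The names `freq_table_pct` and `sorted_table` are hypothetical and are not the notebook's functions.

```python
def freq_table_pct(dataset, index):
    """Return {value: percentage of rows} for the column at index."""
    counts = {}
    for row in dataset:
        key = row[index]
        counts[key] = counts.get(key, 0) + 1
    total = len(dataset)   # divide by the data set that was passed in
    return {key: count / total * 100 for key, count in counts.items()}

def sorted_table(table):
    """Return (key, percentage) pairs, largest percentage first."""
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)

# Toy rows (made up): app name, category.
toy = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]

table = freq_table_pct(toy, 1)
for category, pct in sorted_table(table):
    print(category, ':', pct)

# The percentages of a complete table should sum to 100.
print(abs(sum(table.values()) - 100) < 1e-9)
```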

In [204]:
display_table(freeapps,1)

{'ART_AND_DESIGN': 0.6430505415162455, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'BEAUTY': 0.5979241877256317, 'BOOKS_AND_REFERENCE': 2.1435018050541514, 'BUSINESS': 4.591606498194946, 'COMICS': 0.6204873646209386, 'COMMUNICATION': 3.2378158844765346, 'DATING': 1.861462093862816, 'EDUCATION': 1.1620036101083033, 'ENTERTAINMENT': 0.9589350180505415, 'EVENTS': 0.7107400722021661, 'FINANCE': 3.7003610108303246, 'FOOD_AND_DRINK': 1.2409747292418771, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'HOUSE_AND_HOME': 0.8235559566787004, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'LIFESTYLE': 3.9034296028880866, 'GAME': 9.724729241877256, 'FAMILY': 18.907942238267147, 'MEDICAL': 3.531137184115524, 'SOCIAL': 2.6624548736462095, 'SHOPPING': 2.2450361010830324, 'PHOTOGRAPHY': 2.944494584837545, 'SPORTS': 3.395758122743682, 'TRAVEL_AND_LOCAL': 2.33528880866426, 'TOOLS': 8.461191335740072, 'PERSONALIZATION': 3.3167870036101084, 'PRODUCTIVITY': 3.892148014440433, 'PARENTING': 0.6543321299638989, 'WEATHER': 