# App Project - Dataquest

Data Analyst :

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

**This project is part of Dataquest's Data Scientist Path.**

## Collecting the data

In [9]:
from csv import reader

# Dataset for the App Store
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple = list(read_file)
apple_header = apple[0]
apple = apple[1:]

# Dataset for The Google Play Store
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
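
The two loading cells above repeat the same steps, so they could be folded into one helper. The following is only a sketch: `load_csv` is a hypothetical name, and the `encoding="utf8"` default is an assumption about how the CSV files are encoded. The demo writes a tiny temporary file so that it runs without the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`."""
    # The `with` block closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a temporary file (made-up content, not the real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Instagram", "0"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would look like `apple_header, apple = load_csv('AppleStore.csv')`.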

## Exploring the data

In [10]:
# Pre-defined function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [12]:
print(apple_header)
print('\n')
explore_data(apple,0,5,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [13]:
print(android_header)
print('\n')
explore_data(android,0,5,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'E

We can use these columns in our analysis: Category, Price, Rating, Reviews, Genres, etc.
This is the [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) to the Apple Store Dataset documentation.

## Data cleaning

In [14]:
# Row with a missing 'Category' value: it has 12 fields instead of 13, so every later column is shifted left
android[10472] 

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [15]:
# Removing that row
del android[10472]
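
The index 10472 is hard-coded above, and running the `del` cell a second time would remove a different, valid row. One possible safeguard is to look for rows whose length differs from the header's. This is a minimal sketch on made-up sample rows; `find_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(dataset) if len(row) != len(header)]


# Made-up sample data for illustration only.
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "19"],  # 'Category' is missing, so the row is too short
    ["App C", "GAME", "4.5"],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)

# Deleting from the highest index down keeps the remaining indices valid.
for index, _ in sorted(malformed, reverse=True):
    del sample_rows[index]

print(len(sample_rows))
```

Because the rows are found by their shape rather than by position, re-running this cell leaves the cleaned data unchanged.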

There are some duplicate entries that we need to remove. Based on the example below, we will use the fourth column (Reviews) as the criterion and keep only the entry with the highest number of reviews for each app.

In [22]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [24]:
# Calculating the number of duplicates
duplicate=[] 
unique=[] 

for app in android:
    name=app[0]
    if name in unique:
        duplicate.append(name)
    else:
        unique.append(name)
            
print('Number of duplicate apps: ', len(duplicate))
print('Number of unique apps: ', len(unique))

Number of duplicate apps:  1181
Number of unique apps:  9659
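
The loop above checks `name in unique` against a list, which rescans the list for every app. As an alternative sketch, `collections.Counter` from the standard library can produce the same two counts in one pass. The sample rows below are shortened, and the "Example App" row is invented for illustration.

```python
from collections import Counter

# Shortened sample rows: name, category, rating, reviews.
sample_apps = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Example App", "TOOLS", "4.0", "100"],  # invented row
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]

name_counts = Counter(app[0] for app in sample_apps)

n_unique = len(name_counts)
# Every occurrence after the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())

print('Number of duplicate apps: ', n_duplicates)
print('Number of unique apps: ', n_unique)
```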


We will create a dictionary with unique app names as keys, and the corresponding value is the highest number of reviews.

In [26]:
reviews_max={}

for app in android:
    name=app[0]
    n_reviews=float(app[3]) #Reviews is the fourth column
    
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews


In [27]:
print(len(reviews_max)) #We expect 9,659 entries

9659


Now, we'll remove the duplicate rows.

In [28]:
android_clean=list() #Stores clean data
already_added=list() #Stores app names

for app in android:
    name=app[0]
    n_reviews=float(app[3])
    
    if name not in already_added and n_reviews == reviews_max[name]:
        android_clean.append(app)
        already_added.append(name)


In [29]:
print(len(android_clean)) #Expected 9,659 entries

9659
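
The two passes above (build `reviews_max`, then filter) could also be merged into a single pass that stores the best row per name. This is a sketch, not the notebook's method: `keep_most_reviewed` is a hypothetical helper, it relies on dictionaries preserving insertion order (Python 3.7+), and it orders the result by each name's first appearance, which may differ slightly from the order produced above.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows; the "Example App" row is invented.
sample_apps = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Example App", "TOOLS", "4.0", "100"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
]

deduplicated = keep_most_reviewed(sample_apps)
for row in deduplicated:
    print(row)
```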


We are interested in English apps only, so we will remove apps whose names are not in English.
We can detect non-English names using ASCII (American Standard Code for Information Interchange).
The characters commonly used in English text have ASCII codes between 0 and 127, so we treat any character whose code is greater than 127 as a non-English character.
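
To make the 0 to 127 range concrete, the built-in `ord()` returns the code point of a single character. The short sketch below prints it for a few sample characters; the samples are chosen for illustration and written as escape sequences so the snippet itself stays pure ASCII.

```python
# Last three samples: trademark sign, a Chinese character, an emoji.
samples = ["a", "Z", "7", "+", "\u2122", "\u7231", "\U0001F61C"]

for character in samples:
    code_point = ord(character)
    kind = "ASCII" if code_point <= 127 else "non-ASCII"
    # ascii() escapes non-ASCII characters, so printing works on any console.
    print(ascii(character), code_point, kind)

code_points = [ord(character) for character in samples]
```

Letters, digits, and common punctuation fall at or below 127, while the trademark sign, the Chinese character, and the emoji are all far above it.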

In [36]:
# Function to detect non-english names
def is_english_non_edited(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

In [37]:
# Testing the function
is_english_non_edited('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠')

False

In [38]:
is_english_non_edited('Docs To Go™ Free Office Suite')

False

In [39]:
is_english_non_edited('Instachat 😜')

False

With this function we would lose some data, because English app names can contain symbols such as ™ or emoji, and those names return 'False'.
That's why we'll only remove apps whose names contain more than three non-ASCII characters.

In [67]:
# Edited function
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

In [68]:
# Testing the edited version
is_english('Docs To Go™ Free Office Suite')

True

In [69]:
is_english('Instachat 😜')

True

In [70]:
is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠')

False
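
The threshold of three is a heuristic, so it can be useful to expose it as a parameter and try other values. The variant below is a sketch: `is_mostly_ascii` and `max_non_ascii` are hypothetical names, and the test strings use escape sequences in place of the non-ASCII characters.

```python
def is_mostly_ascii(string, max_non_ascii=3):
    """Return True if `string` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


trademark_name = "Docs To Go\u2122 Free Office Suite"  # one non-ASCII character
emoji_name = "Instachat \U0001F61C"                    # one non-ASCII character
chinese_name = "\u7231\u5947\u827aPPS -\u300a"         # four non-ASCII characters

print(is_mostly_ascii(trademark_name))
print(is_mostly_ascii(emoji_name))
print(is_mostly_ascii(chinese_name))

# A stricter threshold reproduces the first, unedited function.
print(is_mostly_ascii(trademark_name, max_non_ascii=0))
```

Note that this remains a rough filter: a non-English name written in Latin letters without accents would still pass.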

Now we will remove any non-English app from both datasets.

In [82]:
android_english=list()
apple_english=list()

for app in android_clean:
    name=app[0]
    if is_english(name):
        android_english.append(app)

for app in apple:
    name=app[1]
    if is_english(name):
        apple_english.append(app)

In [83]:
# Let's see how many apps we have left
print(len(android_english))
print(len(apple_english))

9614
6183


We only want free apps. Therefore, we'll remove any non-free app from our datasets. This is the last data cleaning step.

In [84]:
android_apps=list() 
apple_apps=list() 

for app in android_english:
    try:
        price=float(app[7])
    except ValueError: # the price starts with a non-numeric character (e.g. a currency symbol)
        price=float(app[7][1:])
    if price == 0.0:
        android_apps.append(app)

for app in apple_english:
    try:
        price=float(app[4])
    except ValueError: # the price starts with a non-numeric character (e.g. a currency symbol)
        price=float(app[4][1:])
    if price == 0.0:
        apple_apps.append(app)

In [85]:
print(len(android_apps))
print(len(apple_apps))

8864
3222
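
The `try`/`except` pattern above drops the first character when the conversion fails. A more explicit sketch is shown below. It assumes that paid prices carry a leading '$', which the `[1:]` slice suggests but which is not shown in the printed rows. `parse_price` and `keep_free` are hypothetical helpers, and the sample rows are invented.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is a leading '$'.
    return float(raw.strip().lstrip("$"))


def keep_free(dataset, price_index):
    """Return only the rows whose price parses to zero."""
    return [row for row in dataset if parse_price(row[price_index]) == 0.0]


# Invented sample rows: name, price.
sample_rows = [
    ["Free App", "0"],
    ["Paid App", "$4.99"],
    ["Another Free App", "0.0"],
]

free_rows = keep_free(sample_rows, price_index=1)
print(len(free_rows))
```

With the datasets above, the price index would be 7 for Google Play and 4 for the App Store.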


## Data Analysis

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In [86]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We can use 'Genres' and 'Category' to generate frequency tables for Android apps.

In [87]:
print(apple_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


For iOS apps, we can use 'prime_genre'.

In [88]:
# Helper function
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [89]:
# Function to generate frequency tables
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages
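
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. `freq_table_counter` is a hypothetical name, and the sample rows are invented. Unlike `freq_table`, this version returns the entries already ordered from most to least common.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for column `index`, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return {value: count / total * 100 for value, count in counts.most_common()}


# Invented sample rows: name, genre.
sample_rows = [
    ["App A", "Games"],
    ["App B", "Games"],
    ["App C", "Music"],
    ["App D", "Games"],
]

percentages = freq_table_counter(sample_rows, 1)
for genre, percentage in percentages.items():
    print(genre, ':', percentage)
```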
    

In [79]:
# Frequency table for prime_genre
display_table(apple_apps,11)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


The most common genre is Games, followed by Entertainment.
The most common apps are built for fun rather than for practical purposes.

Recommended app genre: Games, which account for more than 55% of the free English apps on the App Store.

In [80]:
# Frequency table for Genres
display_table(android_apps,9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The most common genres are Tools and Entertainment.
Practical apps are better represented here than on the App Store.

In [90]:
# Frequency table for Category
display_table(android_apps,1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

FAMILY and GAME are the most common categories.

Across both stores, Games and Entertainment are among the most common kinds of apps.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing from the App Store dataset. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
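
The Installs values are stored as strings such as '10,000+' (see the Google Play rows printed earlier), so they have to be converted to numbers before they can be averaged. The sketch below shows one way to do that; `parse_installs` is a hypothetical helper. Since each value appears to be an open-ended bucket, the parsed number is best read as a lower bound rather than an exact count.

```python
def parse_installs(raw):
    """Convert an install bucket such as '1,000,000+' to a float."""
    return float(raw.replace(",", "").replace("+", ""))


# Values taken from the Installs column shown in the sample rows.
examples = ["10,000+", "500,000+", "1,000,000,000+"]
parsed = [parse_installs(value) for value in examples]

for raw, number in zip(examples, parsed):
    print(raw, '->', number)
```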

In [96]:
# for App Store
ft_prime_genre = freq_table(apple_apps,11)

for genre in ft_prime_genre:
    total=0
    len_genre=0
    for app in apple_apps:
        genre_app = app[11]
        
        if genre_app == genre:
            n_ratings = float(app[5])
            total+= n_ratings
            len_genre+=1
    average = total/len_genre
    print(genre,' : ',str(average))

Social Networking  :  71548.34905660378
Health & Fitness  :  23298.015384615384
Catalogs  :  4004.0
Weather  :  52279.892857142855
Utilities  :  18684.456790123455
Travel  :  28243.8
Navigation  :  86090.33333333333
Entertainment  :  14029.830708661417
Productivity  :  21028.410714285714
News  :  21248.023255813954
Photo & Video  :  28441.54375
Business  :  7491.117647058823
Lifestyle  :  16485.764705882353
Music  :  57326.530303030304
Finance  :  31467.944444444445
Medical  :  612.0
Reference  :  74942.11111111111
Games  :  22788.6696905016
Education  :  7003.983050847458
Book  :  39758.5
Shopping  :  26919.690476190477
Food & Drink  :  33333.92307692308
Sports  :  23008.898550724636


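Recommended app genre: Social Networking, which has one of the highest average numbers of user ratings (only Navigation and Reference rank higher).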
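
The averages above are computed with a nested loop that rescans the whole dataset once per genre. A single-pass alternative is sketched below: `average_by_group` is a hypothetical helper, and the sample rows are invented. It also sorts the result, which makes the top genres easier to spot than in the unsorted output.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} in one pass over the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Invented sample rows: genre, number of ratings.
sample_rows = [["Games", "100"], ["Games", "300"], ["Music", "50"]]

averages = average_by_group(sample_rows, 0, 1)
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for genre, average in ranked:
    print(genre, ':', average)
```

For the App Store data, the call would be `average_by_group(apple_apps, 11, 5)`.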

In [103]:
# for Google Play
ft_category = freq_table(android_apps,1)

for category in ft_category:
    total=0
    len_category=0
    for app in android_apps:
        category_app = app[1]
        
        if category_app == category:
            n_install = float(app[5].replace('+','').replace(',',''))
            total+=n_install
            len_category+=1
    average = total/len_category
    print(category,' : ',str(average))

NEWS_AND_MAGAZINES  :  9549178.467741935
TRAVEL_AND_LOCAL  :  13984077.710144928
WEATHER  :  5074486.197183099
SHOPPING  :  7036877.311557789
COMMUNICATION  :  38456119.167247385
SOCIAL  :  23253652.127118643
BUSINESS  :  1712290.1474201474
EVENTS  :  253542.22222222222
PHOTOGRAPHY  :  17840110.40229885
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1331540.5616438356
PRODUCTIVITY  :  16787331.344927534
FAMILY  :  3695641.8198090694
COMICS  :  817657.2727272727
MAPS_AND_NAVIGATION  :  4056941.7741935486
VIDEO_PLAYERS  :  24727872.452830188
GAME  :  15588015.603248259
LIBRARIES_AND_DEMO  :  638503.734939759
SPORTS  :  3638640.1428571427
BOOKS_AND_REFERENCE  :  8767811.894736841
HEALTH_AND_FITNESS  :  4188821.9853479853
ART_AND_DESIGN  :  1986335.0877192982
PARENTING  :  542603.6206896552
ENTERTAINMENT  :  11640705.88235294
MEDICAL  :  120550.61980830671
AUTO_AND_VEHICLES  :  647317.8170731707
EDUCATION  :  1833495.145631068
LIFESTYLE  :  1437816.2687861272
FOOD_AND_DRINK  :  1924897.7