# 1. Dataquest Guided Project: Profitable App Profiles

This project analyses app data to understand which types of apps are likely to attract more users on Google Play and the App Store. 
The goal is to provide insights to a fictional company specialising in free mobile apps. 

For this purpose we use two open source data sets: 
1. [Data set](https://www.kaggle.com/lava18/google-play-store-apps) about Android Apps
2. [Data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) about iOS apps


In [1]:
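from csv import reader

# Read both CSV files into lists of rows; row 0 of each list is the header
with open('AppleStore.csv', encoding='utf8') as applestore:
    apple_data = list(reader(applestore))
with open('googleplaystore.csv', encoding='utf8') as googleplaystore:
    google_data = list(reader(googleplaystore))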


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(apple_data, 1,4, True)
explore_data(google_data,1,4, True)


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2

In [4]:
explore_data(apple_data, 0,1, True)
explore_data(google_data, 0,1, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13


In [5]:
print(google_data[10473])
print('\n')
print(google_data[2])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


# Cleaning the Data Set
## Removing faulty rows

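As mentioned in the [discussions forum](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101), row 10473 has faulty values: as the output above shows, its **Category** value is missing, so the remaining values are shifted one column to the left. The row therefore needs to be deleted. 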
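
A row like this can also be found without the forum hint, by comparing each row's length with the header's. The sketch below is only an illustration: `sample_rows` and `find_faulty_rows` are made-up names, and the data is a small hypothetical stand-in for the real file.

```python
# Hypothetical mini data set: a header followed by three rows,
# one of which is missing its Category value.
sample_rows = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # faulty: one column short
    ['App C', 'GAME', '3.8'],
]

def find_faulty_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

faulty = find_faulty_rows(sample_rows)
print(faulty)
```

If several rows were reported, deleting them from the highest index downwards would keep the remaining indices valid.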

In [6]:

del google_data[10473]

In [7]:
print(google_data[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Removing duplicates
The Google Play data includes several duplicates. 
The code below shows sample duplicates and then keeps, for each app, only the entry with the highest number of reviews. This is based on the assumption that the entry with the most reviews is the most recent, i.e. most up-to-date, one. 


First we identify the number of duplicates. For this we create two empty lists and loop through the data set. Whenever we come across a name that is already in the **unique_apps** list, we append the app to **duplicate_apps**. Otherwise, we append its name to **unique_apps**. 
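
Checking `name in some_list` scans the whole list each time, which gets slow on larger data sets. As a minimal sketch of the same idea, assuming only the names matter, a `set` can track what has been seen. The names in `sample_names` are hypothetical.

```python
# Hypothetical app names; 'Box' appears twice and 'Slack' three times.
sample_names = ['Box', 'Slack', 'Box', 'Zoom', 'Slack', 'Slack']

seen = set()       # names encountered so far
duplicates = []    # every repeated occurrence after the first

for name in sample_names:
    if name in seen:
        duplicates.append(name)
    else:
        seen.add(name)

print('unique:', len(seen))
print('duplicates:', len(duplicates))
```

The number of unique names plus the number of duplicates always equals the number of rows, which is a quick sanity check for the counts.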

In [8]:
duplicate_apps = []
unique_apps = []
for app in google_data[1:]: 
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(app)
    else: 
        unique_apps.append(name)

number_of_dups = len(duplicate_apps)
print(duplicate_apps[0:3])
print(number_of_dups)

[['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'], ['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device'], ['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']]
1181


There are **1181** duplicates, i.e. **9659** unique entries.

Next we find the maximum number of reviews for each app. This is stored in a dictionary where the **app name** is the key and the highest **number of reviews** is the value. To verify the result, we check that the resulting dictionary has a length of **9659**. 
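
The same bookkeeping can be written a little more compactly with `dict.get`, which returns a default when the key is missing. This is only a sketch on hypothetical rows (`sample_apps` uses four columns instead of the real thirteen), not the cell used in this notebook.

```python
# Hypothetical rows: name, category, rating, number of reviews
sample_apps = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159800'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
]

max_reviews = {}
for app in sample_apps:
    name = app[0]
    n_reviews = float(app[3])
    # -1.0 is below any real review count, so the first entry always wins
    if n_reviews > max_reviews.get(name, -1.0):
        max_reviews[name] = n_reviews

print(max_reviews)
```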

In [9]:
reviews_max = {}
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews   # keep the larger review count
    elif name not in reviews_max:
        reviews_max[name] = n_reviews   # first time we see this app

print(len(reviews_max))



9659


Next we create **android_clean**, a list that contains one entry per app: the entry with the highest number of reviews. For this we loop through the data set and append all apps that meet two criteria: 
1. Their number of reviews is equal to the number of reviews stored in **reviews_max** for the respective app
2. The name of the app is not in the list **already_added**. Whenever we append an app to **android_clean**, the app name is appended to **already_added**. We need this supplementary condition to account for cases where more than one entry of a duplicate app has the highest number of reviews
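
The effect of the second criterion is easiest to see on a tie. In the hypothetical data below, two `Box` entries share the highest review count, and only the first is kept. All names here are illustrative assumptions.

```python
# Hypothetical rows: name, number of reviews
sample_apps = [
    ['Box', '159872'],
    ['Box', '159872'],   # tie with the row above
    ['Box', '100'],
    ['Zoom', '31614'],
]
max_reviews = {'Box': 159872.0, 'Zoom': 31614.0}

clean = []
added = set()   # names that already have an entry in clean

for name, reviews in sample_apps:
    if float(reviews) == max_reviews[name] and name not in added:
        clean.append([name, reviews])
        added.add(name)

print(clean)
```

Without the `name not in added` check, both tied `Box` rows would pass the first criterion and the duplicate would survive.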

In [10]:
android_clean = []
already_added = []
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])

    if (reviews_max[name] == n_reviews) and (name not in already_added): 
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))


9659


## Removing non-English entries
We need to delete all apps with non-English names, as we'd like to analyze only the apps that are directed toward an English-speaking audience. 

The characters commonly used in English text all fall in the range 0 to 127 of the ASCII system. The function below detects whether a string is made up of common English characters by counting the characters whose code point, as returned by `ord()`, is greater than 127. 

To allow for English app names that contain special characters, names with up to three emoji or other non-ASCII characters are still labeled as English.
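
To make the threshold concrete, the sketch below counts non-ASCII characters in a few names. `count_non_ascii` is a hypothetical helper introduced only for this illustration.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for character in text if ord(character) > 127)

# A plain letter is inside the range, the trademark sign is far outside it
print(ord('a'), ord('™'))

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, count_non_ascii(name))
```

Each of these names has at most one non-ASCII character, so all of them stay below the limit of three.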

In [11]:
def check_string(string): 
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

#Testing the check_string function
print("Instagram", check_string('Instagram'))
print('爱奇艺PPS -《欢乐颂2》电视剧热播', check_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('Instachat 😜', check_string('Instachat 😜'))
print('Docs To Go™ Free Office Suite', check_string('Docs To Go™ Free Office Suite'))

Instagram True
爱奇艺PPS -《欢乐颂2》电视剧热播 False
Instachat 😜 True
Docs To Go™ Free Office Suite True


In [12]:
english_google_apps = []
english_apple_apps = []

for app in android_clean:
    if check_string(app[0]): 
        english_google_apps.append(app)

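for app in apple_data:
    # Caution: index 0 holds the numeric id in the App Store data (the app
    # name, track_name, is at index 1), so this check keeps every row
    if check_string(app[0]): 
        english_apple_apps.append(app)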

explore_data(english_apple_apps, 0, 3, True)
explore_data(english_google_apps, 0, 3, True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '21

## Isolating free apps

In [13]:
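# english_google_apps has no header row (android_clean was built from
# google_data[1:]), so the header is taken from google_data directly
free_android = [google_data[0]]
free_ios = [english_apple_apps[0]]   # header row of the App Store data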

for app in english_google_apps: 
    if app[6] == "Free": 
        free_android.append(app)

for app in english_apple_apps[1:]: 
    if float(app[4]) == 0.0 :
        free_ios.append(app)
        
print("android:", len(free_android))
print("iOS:", len(free_ios))

android: 8864
iOS: 4057


# Data Analysis 
## Identifying common genres
Our end goal is to publish our app on both Google Play and the App Store, beginning with the former. Therefore, we need to identify app profiles that work well for both markets. 

Firstly, we'll find the common genres for each market. For this, we'll build a frequency table for the **prime_genre** column of the App Store data set, and the **Genres** and **Category** columns of the Google Play data set.
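
As a compact illustration of what such a frequency table contains, here is a sketch using `collections.Counter` from the standard library. The genre values in `sample_genres` are hypothetical, and this is an alternative formulation rather than the function used in this notebook.

```python
from collections import Counter

# Hypothetical genre column with five apps
sample_genres = ['Games', 'Games', 'Games', 'Music', 'Education']

counts = Counter(sample_genres)
total = sum(counts.values())

# Convert absolute counts to percentages, rounded to two decimals
percentages = {genre: round(n / total * 100, 2) for genre, n in counts.items()}

# Print the genres from most to least common
for genre, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```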



In [14]:
def freq_table(dataset, index): 
    freq_table = {}
    freq_table_perc = {}
    all_apps = 0
    for row in dataset[1:]: 
        col_value = row[index]
        all_apps+= 1
        if col_value in freq_table: 
            freq_table[col_value]+=1
        else: 
            freq_table[col_value] = 1
    for val in freq_table: 
        freq_table_perc[val] = round(float(freq_table[val]/all_apps*100),2)
    return freq_table_perc

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print("___________CATEGORY ANDROID")        
display_table(free_android,1)
print("___________GENRE ANDROID")        
display_table(free_android,9)
print("___________IOS")
display_table(free_ios,11)


___________CATEGORY ANDROID
FAMILY : 19.21
GAME : 9.51
TOOLS : 8.46
BUSINESS : 4.58
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.54
SPORTS : 3.42
PERSONALIZATION : 3.32
COMMUNICATION : 3.25
HEALTH_AND_FITNESS : 3.07
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.78
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.13
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
ENTERTAINMENT : 0.88
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6
___________GENRE ANDROID
Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.58
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.54
Sports : 3.46
Personalization : 3.32
Communication : 3.25
Action : 3.1
Health & Fitness : 3.07
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books &

### Analysis of frequency tables 
**Conclusions for free English apps on the App Store**: 
- **Games** is by far the most common genre, with **55.65%** of apps falling within that category, followed by **Entertainment** 
- Most of the apps are designed for entertainment, while utility apps are rare 


**Conclusions for free English apps on Google Play Store**: 
- **Family** is the most common category, followed by **Game**
- **Tools** is the most common genre, followed by **Entertainment** 
- As opposed to the App Store, Google Play shows a more balanced landscape of both practical and for-fun apps
- The genre column is more granular than the category column; we will therefore focus on the former


Based on these frequency tables we cannot make a recommendation yet, as they only provide information about supply, not about demand. Therefore, in the next part, we will analyse user numbers. 

## Identifying popular apps by genres

### Most popular apps by genre on App Store
As the App Store data does not provide the total number of installs per app, we'll take the total number of user ratings as a proxy. 

Below, we calculate the average number of user ratings per app genre on the App Store:
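
The average per genre is the sum of the rating counts in that genre divided by the number of apps in it. As a sketch of that arithmetic, assuming a hypothetical list of `(genre, rating_count)` pairs, both sums can be accumulated in a single pass:

```python
# Hypothetical (genre, rating count) pairs; counts are strings as in the CSV
sample = [('Games', '100'), ('Games', '300'), ('Music', '50')]

totals = {}   # genre -> sum of rating counts
counts = {}   # genre -> number of apps

for genre, ratings in sample:
    totals[genre] = totals.get(genre, 0.0) + float(ratings)
    counts[genre] = counts.get(genre, 0) + 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}
print(averages)
```

A nested loop over genres and apps gives the same result; the single pass simply avoids rescanning the data set once per genre.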


In [25]:
def freq_table_genre(dataset, index): 
    prime_genre_freq = {}
    for row in dataset[1:]: 
        col_value = row[index]
        if col_value in prime_genre_freq: 
            prime_genre_freq[col_value]+=1
        else: 
            prime_genre_freq[col_value] = 1
    return prime_genre_freq

ios_genres = freq_table_genre(free_ios,11)
for genre in ios_genres: 
    total = 0 
    len_genre = 0 
    for app in free_ios: 
        genre_app = app[11]
        if genre == genre_app: 
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
     
    avg_number_of_user_ratings = total/len_genre
    print(genre, avg_number_of_user_ratings)


Business 6367.8
Medical 459.75
Social Networking 53078.195804195806
Catalogs 1779.5555555555557
Weather 47220.93548387097
Utilities 14010.100917431193
Music 56482.02985074627
Travel 20216.01785714286
Health & Fitness 19952.315789473683
Education 6266.333333333333
Productivity 19053.887096774193
News 15892.724137931034
Food & Drink 20179.093023255813
Photo & Video 27249.892215568863
Navigation 25972.05
Games 18924.68896765618
Shopping 18746.677685950413
Sports 20128.974683544304
Reference 67447.9
Book 8498.333333333334
Lifestyle 8978.308510638299
Finance 13522.261904761905
Entertainment 10822.961077844311


Based on these results, **Reference** apps have the highest average number of user ratings, followed by **Music** and **Social Networking**. However, we can assume that the numbers for the latter two genres are skewed by a few big players, such as **Spotify** and **Facebook**, which get a lot of ratings while smaller players get hardly any. Another popular genre is **Weather**, but weather apps don't lend themselves to our finance model. Therefore, we would recommend focusing on an app for **Reference** or **Photo & Video**. As the market for fun-driven apps on the App Store seems to be saturated, we would recommend a utility-focused app in the **Reference** genre.
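
The skew assumption could be checked by comparing the mean with the median of a genre: a mean far above the median suggests that a few very large apps dominate. The numbers below are hypothetical and only illustrate the effect.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: one giant and four small apps
sample_ratings = [2_000_000, 1200, 800, 500, 300]

genre_mean = mean(sample_ratings)
genre_median = median(sample_ratings)

print('mean:', genre_mean)
print('median:', genre_median)
```

Here the single large app pulls the mean far above what a typical app in the genre receives, while the median stays close to the smaller apps.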

### Most popular apps by genre on Google Play
The Google Play data set provides the total number of installs for each app. Before we analyse the values in this column, we need to convert them into floats by removing the plus signs and the thousands separators. 
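
The conversion can be isolated in a small helper. `parse_installs` is a hypothetical name used only for this sketch; the input strings follow the format seen in the **Installs** column.

```python
def parse_installs(value):
    """Turn a string such as '10,000+' into the float 10000.0."""
    return float(value.replace('+', '').replace(',', ''))

for raw in ['10,000+', '500,000+', '5,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Note that the values are open-ended brackets, so the resulting numbers are lower bounds rather than exact install counts.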



In [26]:
android_genres = freq_table_genre(free_android,1)
android_genres_and_installs = []
for category in android_genres: 
    total = 0
    len_category = 0
    for app in free_android: 
        category_app = app[1]
        if category == category_app: 
            installs = app[5]
            installs_new = installs.replace("+",'').replace(",","")
            installs_float = float(installs_new)
            total += installs_float
            len_category+=1
    avg_installs_in_category = total/len_category
    print(category, avg_installs_in_category)
    android_genres_and_installs.append((avg_installs_in_category, category))

print(sorted(android_genres_and_installs, reverse=True))

TOOLS 10801391.298666667
MAPS_AND_NAVIGATION 4056941.7741935486
PARENTING 542603.6206896552
GAME 12914435.883748516
HEALTH_AND_FITNESS 4167457.3602941176
EVENTS 253542.22222222222
PRODUCTIVITY 16772838.591304347
SPORTS 4274688.722772277
VIDEO_PLAYERS 24790074.17721519
NEWS_AND_MAGAZINES 9549178.467741935
WEATHER 5074486.197183099
BEAUTY 513151.88679245283
PHOTOGRAPHY 17840110.40229885
FINANCE 1387692.475609756
COMMUNICATION 38326063.197916664
AUTO_AND_VEHICLES 647317.8170731707
BOOKS_AND_REFERENCE 8767811.894736841
SOCIAL 23253652.127118643
BUSINESS 1704192.3399014778
TRAVEL_AND_LOCAL 13984077.710144928
DATING 854028.8303030303
ENTERTAINMENT 9146923.076923076
LIFESTYLE 1437816.2687861272
COMICS 817657.2727272727
ART_AND_DESIGN 1952260.3448275863
HOUSE_AND_HOME 1331540.5616438356
SHOPPING 7036877.311557789
LIBRARIES_AND_DEMO 638503.734939759
MEDICAL 123064.7898089172
FOOD_AND_DRINK 1924897.7363636363
FAMILY 5183203.576042279
PERSONALIZATION 5201482.6122448975
EDUCATION 1768500.0
[(38326

On average, **Communication** apps have the most installs. However, we can assume that this number is heavily skewed by a few big players, such as Facebook and Google. A similar pattern can be assumed for the next categories in the ranking: **Video Players**, **Social**, **Photography** and **Productivity**.
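
One way to test this assumption is to recompute a category's average while leaving out apps above some install threshold. The sketch below uses hypothetical apps and an arbitrary threshold of 100 million installs; both are assumptions for illustration only.

```python
# Hypothetical installs for four apps in one category
sample_installs = {
    'App A': 1_000_000_000,
    'App B': 500_000,
    'App C': 100_000,
    'App D': 1_000_000,
}
threshold = 100_000_000   # arbitrary cut-off for "big players"

all_values = list(sample_installs.values())
small_values = [n for n in all_values if n < threshold]

avg_all = sum(all_values) / len(all_values)
avg_without_giants = sum(small_values) / len(small_values)

print('average, all apps:', avg_all)
print('average, without giants:', round(avg_without_giants))
```

A large gap between the two averages would support the view that a handful of apps drives the category's figure.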

The **Game** category is popular, but as we've seen previously, the market seems oversaturated for this genre. 
 
Interestingly, **Books and Reference** r