## Mobile Apps Analysis.

### Introduction:

Let's pretend that we're working as a data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in App Store.

In this notebook we'll be unconvering apps that are free to download with the main source of revenue consists of in-app ads. Which means that the number of users determines our revenue for any given free app. The goal of this project is to analyze data to help the developers understand what type of apps are likely to attract more users.

Let's begin by importing important packages, then we'll opened the dataset and begin to explore.

In [1]:
from csv import reader

In [2]:
app_store_file = list(reader(open("./apple_store/AppleStore.csv", encoding="utf8")))
header_app_store = app_store_file[0]
app_store = app_store_file[1:]

google_play_file = list(reader(open('./android/googleplaystore.csv', encoding='utf8')))
header_android = google_play_file[0]
android_store = google_play_file[1:]

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print("Apple Store Data: ")
app_store_data = explore_data(app_store, 1, 5, True)
print("Column Names: ")
print(app_store_file[0])
print("=" * 20)

Apple Store Data: 
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


Number of rows: 7197
Number of columns: 17
Column Names: 
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Here are the first few rows of the App Store dataset. There are 7197 number of rows and 17 columns (7197 x 17), some columns are not self explanatory you can read the more detailed description here: [Apple Store Dataset: Kaggle]("https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps")

We could use track_name, price, and prime genre column for our analysis.

In [4]:
print("Google Play Store Data: ")
google_play_data = explore_data(android_store, 1, 5, True)
print("Column Names: ")
print(google_play_file[0])

Google Play Store Data: 
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13
Column Names: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


There are 10841 rows and 13 columns in the Google Playstore dataset. The column names are pretty straight forward unlike the Apple Store dataset. Here is the link to the description of each column: [Google Play Store Dataset: Kaggle]("https://www.kaggle.com/datasets/lava18/google-play-store-apps")

We could use App, Cateogry, Rating, Price 

# Data Cleaning

### 1. Check for Incorrect Entry of Data

In [5]:
print("google play column: ")
print(header_android[0:3])

print("=" * 20)
print("Incorrect data: ")
print(android_store[10472][0:3]) # incorrect data at category and rating. Rating scale should be 1 - 5 (float)
print("=" * 50)

google play column: 
['App', 'Category', 'Rating']
Incorrect data: 
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19']


There has been reported an incorrect data in the Google playstore dataset in the discussion section, specifically in the Rating column. The incorrect data is **Life Made WI-Fi Touchscreen Photo Frame** which has the number "19" in the Rating column. 

According to the Google Playstore rating, it should be from a scale of 1 - 5. This was caused by missing value in "Category" column that shifted the value.

We could delete this entry from our dataset.

In [6]:
del android_store[10472] # run it only once.

In [7]:
android_store[10472] # check to see wether the data has been deleted or not

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']

### 2. Check and Remove Duplicates

As we explore further noticed how there are duplicated entries in the datasets. We don't want duplicates in our analysis as it would result in redundancies. Here is an example of **Instagram**.

In [8]:
print("Header: ", header_android[:4])
print("=" * 50)
for x in android_store:
    if x[0] == "Instagram":
        print(x[:4])

Header:  ['App', 'Category', 'Rating', 'Reviews']
['Instagram', 'SOCIAL', '4.5', '66577313']
['Instagram', 'SOCIAL', '4.5', '66577446']
['Instagram', 'SOCIAL', '4.5', '66577313']
['Instagram', 'SOCIAL', '4.5', '66509917']


We see that Instagram has 4 entries, we could randomly remove them and keep only 1. However if you look closely in the Review column you see that each data has different Review counts. This means the higher the number of Review is, the recent the data are. 

We might loss some important information if we were to perform random deletion, instead we could keep the highest number of Reviews. Let's check how many duplicated data in the Google Playstore dataset.

In [9]:
duplicate_apps = []
unique_apps = []

for app in android_store:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print("Number of duplicated apps: ", len(duplicate_apps))
print("Example of duplicated apps: ")
print("\n")
print(duplicate_apps[:10])

print("=" * 20)
print("Number of unique apps: ", len(unique_apps))
print("Example of unique apps: ")
print("\n")
print(unique_apps[:10])
print("=" * 20)
print("\n")

print("Expected length: ", len(android_store) - len(duplicate_apps))

Number of duplicated apps:  1181
Example of duplicated apps: 


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
Number of unique apps:  9659
Example of unique apps: 


['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instructions', 'Smoke Effect Photo Maker - Smoke Editor', 'Infinite Painter', 'Garden Coloring Book', 'Kids Paint Free - Drawing Fun']


Expected length:  9659


As you can see there are a total of 1181 duplicated apps and 9659 apps that are unique. If we were to remove duplicates in our data then our expected length would be 9659 => (original length of the dataset - duplicates)

To remove them we could use Python's dictionary as it does not allow any duplicates. Dictionary consists of key and value. The "Key" here will be unique app names and the value will be their corresponding Review.

In [10]:
reviews_max = {}

for x in android_store: # loop through the google play dataset
    name = x[0] # app name
    n_reviews = float(x[3]) # reviews (string), convert to data type: float
    
    if name in reviews_max and reviews_max[name] < n_reviews: # if name already exist as key in reviews_max
        reviews_max[name] < n_reviews
    elif not name in reviews_max:
        reviews_max[name] = n_reviews

print("Expected length: ", len(reviews_max))

Expected length:  9659


Here's a breakdown from above code:
1. Create a variable named "name" which will store the app names
2. Create another variable named "n_review" which will store the corresponding Review and convert it into float
3. Create a conditional statement, this is where the filter process takes place:
    - if name already exists as key in the "review_max" dictionary and review_max[name] < n_reviews then update the number of reviews for that entry in the reviews_max dictionary.
    - if name is not in the review_max dictionary key, create a new entry in the dictionary where the key is the app name and the value if the number of reviews. 

In [11]:
android_clean = []
already_added = []

for x in android_store:
    name = x[0]
    n_reviews = float(x[3])
    if n_reviews == reviews_max[name] and not name in already_added:
        android_clean.append(x)
        already_added.append(name)
        
print(len(android_clean))

9659


Let's filter out duplicates and non duplicates by seperating them into two seperate lists, we'll be using reviews_max dictionary.

Start by creating two empty lists: 
 - android_clean: we will store the new cleaned dataset here.
 - already_added: we will store app names.

Then looped through the Google Playstore dataset. for each iteration we do the following:
   1. create a varible named "name" which will store the app names
   2. create another varible named "n_reviews" which will store the number of Reviews and convert it into float.
   3. Create a conditional if statement where:
        - if n_reviews is equal to the value in the reviews_max dictionary and not inside alread_added list:
           - append the whole row to android_clean list
           - append the app names only to the already_added list
        


### 3. Removing Non-English Apps

In [12]:
# print("From Apple Store dataset: ")
# print("\n")
print(app_store[814][1])
print(app_store[6731][1])
# print("=" * 20)
# print("From Google Play dataset: ")
# print("\n")
# print(android_clean[4412][0])

436957087
1144164707


Since we are only interested in english app we need to filter out non-english apps from our dataset. To do that we need to check each app name wether they contained symbols that are uncommonly used in English text. English text usually includes letters from the English alphabet like numbers (ex: 0-9), punctuation (ex: ., !, ?) and other symbols (ex: +, *, /)

Each string has a corresponding number associated with it. According to the ASCII (American Standard Code for Information Interchange) system the numbers that corresponds to the character that are commonly used in English text are all range from 0 to 127.

We could use the ord() function to check the length of these characters.

In [13]:
print(ord("a"))
print(ord("A"))
print(ord("爱"))
print(ord("5"))
print(ord("+"))

97
65
29233
53
43


Let's create a function that takes a string. Then we loop over the string and check it using the ord() function wether the string is over 127 or not. 

In [14]:
def check_string_test(text):
    for x in text:
        if ord(x) > 127:
            return False
    return True

print(check_string_test("Instagram"))
print(check_string_test("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(check_string_test('Docs To Go™ Free Office Suite'))
print(check_string_test('Instachat 😜')) # False because of the Emoji 

True
False
False
False


Unfortunately emojis and ™ symbol fall out of the ASCII range. To minimize the loss of our data we could remove apps if it has a name that is more than 3 characters with the corresponding numbers falling outside the ASCII range. This means English apps with emojis and other special characters will still be labeled English.

In [15]:
def check_string(text):
    count_char = 0
    for x in text:
        if ord(x) > 127:
            count_char += 1
    if count_char > 3:
        return False
    else:
        return True

print(check_string("Instagram"))
print(check_string("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(check_string('Docs To Go™ Free Office Suite'))
print(check_string('Instachat 😜')) # False because of the Emoji 

True
False
True
True


Now let's implement the previous function to our dataset. In this function I'm filtering out non-English apps from both dataset. Create two empty list which will contain the following: 

1. **non_english_apps**, will store all non-English apps
2. **english_apps**, will store English only apps (the whole row).

In [16]:
def check_english_apps(dataset, index):
    non_english_apps = []
    english_apps = []
    
    for name in dataset:
        if check_string(name[index]):
            english_apps.append(name)
        else:
            non_english_apps.append(name)
    return english_apps, non_english_apps
english_apps_android, non_english_apps_android = check_english_apps(android_clean, 0)
english_apps_apple_store, non_english_apps_apple_store = check_english_apps(app_store, 2)

print("Example of Google play dataset (English only):")
print(explore_data(english_apps_android, 0, 5, True))
print("Total of English only apps in Google Playstore: ", len(english_apps_android))
print("=" * 20)
print("\n")
print("Example of Apple Store dataset (English only):")
print(explore_data(english_apps_apple_store, 0, 5, True))
print("Total of English only apps in Apple Store: ", len(english_apps_apple_store))
print("Total of non-english apps in Apple Store: ", len(non_english_apps_apple_store))

Example of Google play dataset (English only):
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of c

### 4. Isolating Free Apps

As mentioned above in the introduction we're only interested in Free apps with ads as our source of revenue. Both datasets contains free and non-free apps. We need to isolate the free apps from the non-free apps.

To do that we need to create two empty list to seperate them:
   1. free: which will contain all apps that are free (with the price equal to "0").
   2. non_free_app: will contain all non-free apps.

the two list will be inside a function, since we'll be reusing this code. The function will take two parameters:
   1. dataset: the dataset that will pass through the function
   2. index: index number

looped through the given dataset that's passed through the function. Then for each iteration, let's assign the app name (by their index number) to a variable named "price". then create an conditional statement where:
   1. if variable "price" is equal to "0" or "price" is equal to "0.0" (data type: string not float or integer) then append the whole row to the "free" list
   2. else append the whole row to non_free list

In [17]:
def check_price(dataset, index):
    free = []
    non_free = []
    for x in dataset:
        price = x[index]
        if price == "0" or price == "0.0":
            free.append(x)
        else:
            non_free.append(x)
    return free, non_free

free_app_store, nonfree_app_store= check_price(english_apps_apple_store, 5)
free_android_store, nonfree_android_store = check_price(english_apps_android, 7)

print("Free Apps in Google Playstore: ", len(free_android_store)) # I'm missing 2 data here, it should be 8864. What did I do wrong?
print("Non-free Apps in Google Playstore: ", len(nonfree_android_store))
print("=" * 20)

print("Free Apps in Apple Store: ", len(free_app_store))
print("Non-free Apps in Apple Store: ", len(nonfree_app_store))

Free Apps in Google Playstore:  8862
Non-free Apps in Google Playstore:  752
Free Apps in Apple Store:  3222
Non-free Apps in Apple Store:  2961


## Analysis

### Most Common Genre

As mentioned in the introduction above, the goal of this project is to determine which type of apps that are more likely to attract users since the number of users affect our revenue.

Here is a strategy for an app idea that has three steps:
   1. Build a minimal Android version of the app, then add it to Google Playstore.
   2. If the app has a good response from the users we can then develop further.
   3. if the app turns out to be profitable in 6 months, we then build an iOS version of the app and deploy it to App Store.
   
Since our end goal is to add app on both Google and App Store, we need to find app profiles that are successful in both markets. For example a profile that works well for both markets might be a productivity app that makes use of gamification.

Lets create an analysis to determine which are most common genre. We could use "prime_genre" from the App Store dataset, and "Genres" and "Category" from Google Playstore.

In [18]:
def freq_table(dataset, index):
    freq = {}
    length = len(dataset)
    for data in dataset:
        genre = data[index]
        if not genre in freq:
            freq[genre] = 1
        else:
            freq[genre] += 1
    for keys in freq:
        avg_num = (freq[keys] / length) * 100
        freq[keys] = avg_num
    return freq

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print("App Store most common \'prime_genres\': ")
print("=" * 50)
print("\n")
display_table(free_app_store, 12)
print("=" * 50)

App Store most common 'prime_genres': 


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Based on the frequency table as shown above "Games" is the most common "prime_genre" in App Store, along with "Entertainment" and Photo & Video.
**Here are 5 most common "prime_genre" in App Store:**
   1. Games.
   2. Entertainment.
   3. Photo & Video.
   4. Education. 
   5. Social Networking.
   
We see that Games takes up to 58.16% which means most apps are gaming related, followed by Entertainment (in this case it is mostly streaming services such as Youtube, Disney, NFL Sunday, etc.) and Photo & Videos with 4.96%. In conclusion most apps in the App Store are for entertainment only purposes along with Education and Social Networking too.

While the least common is Navigation, Medical and Catalogs.

In [19]:
print("Example of Entertainment apps: ")
first_five = [x[2] for x in free_app_store if x[12] == "Entertainment"][:5]
print(first_five)

Example of Entertainment apps: 
['DIRECTV', 'niconico', 'Fandango Movies - Times + Tickets', 'SFR TV', 'The Moron Test']


In [20]:
print("Google Playstore most common \'Genres\': ")
print("=" * 40)
display_table(free_android_store, 9)
print("=" * 50)

Google Playstore most common 'Genres': 
Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.93658316407

Surprisingly in the Google Playstore dataset entertainment and practical purposes apps are most common.

Here are **5 most common "Genres" in google playstore:**

   1. Tools. 
   2. Entertainment. 
   3. Education. 
   4. Business. 
   5. Productivity.
   
Although unlike the App Store with Games that are more dominant than the rest, there's a balance between entertainment (most of the Entertainment apps falls between games and streaming services.) and practical purposes apps (Google translate, Google Assistant, Maps, etc.) 

As shown in the table above, the five first list of Genres has a very small percentages compared to the Apple Store dataset. Let's take a look at the Category column. 

In [21]:
print("Google Playstore most common \'Category\': ")
display_table(free_android_store, 1)
print("=" * 50)

Google Playstore most common 'Category': 
FAMILY : 18.449559918754233
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DE

**5 most common "Category" in google playstore:**
   1. FAMILY. 
   2. GAME. 
   3. TOOLS. 
   4. BUSINESS. 
   5. LIFESTYLE. 
   
Both Category and Genre column are almost similar that is entertainment and practical purposes apps are most common. Most of the apps in Google Playstore are family friendly as shown in the table above. Family and Games category are almost similar but Family contains mostly family friendly games that are usually ment for children, while Games category falls more to broader audience (all ages).

### Number of Users

Let's find out genres with the most users. To do that we need to calculate the average number of installs for each app genre. In the Google PLaystore dataset we could use 'Installs' column, however there is no such column in App Store. We could use "rating_count_tot" column instead.

To start off we will begin calculating the average number of user ratings per app genre on the App Store. Here are the following steps:
   1. Isolate the apps of each genre
   2. Add up the user ratings for the apps of that genre
   3. Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

In [22]:
freq_app_store = freq_table(free_app_store, 12)
print("Average number of users for each genre in App Store: ")
print("=" * 50)
genre_sort = []

for genre in freq_app_store:
    total = 0
    len_genre = 0
    for data in free_app_store:
        genre_app = data[12]
        if genre_app == genre:
            total += float(data[6])
            len_genre += 1
    avg_genre = (total / len_genre)
    genre_sort.append((genre, avg_genre))
    print(genre, ": ", avg_genre)
genre_sort.sort(key=lambda x: x[1], reverse=True)
print("\n")
print("="*20)
print(genre_sort[:5])

Average number of users for each genre in App Store: 
Productivity :  21028.410714285714
Weather :  52279.892857142855
Shopping :  26919.690476190477
Reference :  74942.11111111111
Finance :  31467.944444444445
Music :  57326.530303030304
Utilities :  18684.456790123455
Travel :  28243.8
Social Networking :  71548.34905660378
Sports :  23008.898550724636
Health & Fitness :  23298.015384615384
Games :  22788.6696905016
Food & Drink :  33333.92307692308
News :  21248.023255813954
Book :  39758.5
Photo & Video :  28441.54375
Entertainment :  14029.830708661417
Business :  7491.117647058823
Lifestyle :  16485.764705882353
Education :  7003.983050847458
Navigation :  86090.33333333333
Medical :  612.0
Catalogs :  4004.0


[('Navigation', 86090.33333333333), ('Reference', 74942.11111111111), ('Social Networking', 71548.34905660378), ('Music', 57326.530303030304), ('Weather', 52279.892857142855)]


On Average "Navigation" has the highest amount of users followed by Reference which refers to grammar apps such as Dictionary and even the Bible. There are almost 1 million users installed the Bible app and as for Navigation we have Google Maps and Waze with over than a million users hence why these two types of apps have the highest average users. The same goes to Social Networking, we have Facebook with an almost 3 million users.

In [23]:
for x in free_app_store:
    if x[-5] == "Navigation":
        print(x[2], ">>> ", x[6])

Waze - GPS Navigation, Maps & Real-time Traffic >>>  345046
Geocaching® >>>  12811
ImmobilienScout24: Real Estate Search in Germany >>>  187
Railway Route Search >>>  5
CoPilot GPS – Car Navigation & Offline Maps >>>  3582
Google Maps - Navigation & Transit >>>  154911


In [24]:
for x in free_app_store:
    if x[-5] == "Reference":
        print(x[2], ">>> ", x[6])

Bible >>>  985920
Dictionary.com Dictionary & Thesaurus >>>  200047
Dictionary.com Dictionary & Thesaurus for iPad >>>  54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran >>>  18418
Merriam-Webster Dictionary >>>  16849
Google Translate >>>  26786
Night Sky >>>  12122
WWDC >>>  762
Jishokun-Japanese English Dictionary & Translator >>>  0
教えて!goo >>>  0
VPN Express >>>  14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition >>>  17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools >>>  4693
Guides for Pokémon GO - Pokemon GO News and Cheats >>>  826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free >>>  718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) >>>  8535
GUNS MODS for Minecraft PC Edition - Mods Tools >>>  1497
Real Bike Traffic Rider Virtual Reality Glasses >>>  8


In [25]:
for x in free_app_store:
    if x[-5] == "Social Networking":
        print(x[2], ">>> ", x[6])

Facebook >>>  2974676
LinkedIn >>>  71856
Skype for iPhone >>>  373519
Tumblr >>>  334293
Match™ - #1 Dating App. >>>  60659
WhatsApp Messenger >>>  287589
TextNow - Unlimited Text + Calls >>>  164963
Grindr - Gay and same sex guys chat, meet and date >>>  23201
imo video calls and chat >>>  18841
Ameba >>>  269
Weibo >>>  7265
Badoo - Meet New People, Chat, Socialize. >>>  34428
Kik >>>  260965
Qzone >>>  1649
Fake-A-Location Free ™ >>>  354
Tango - Free Video Call, Voice and Chat >>>  75412
MeetMe - Chat and Meet New People >>>  97072
SimSimi >>>  23530
Viber Messenger – Text & Call >>>  164249
Find My Family, Friends & iPhone - Life360 Locator >>>  43877
Weibo HD >>>  16772
POF - Best Dating App for Conversations >>>  52642
GroupMe >>>  28260
Lobi >>>  36
WeChat >>>  34584
ooVoo – Free Video Call, Text and Voice >>>  177501
Pinterest >>>  1061624
知乎 >>>  397
Qzone HD >>>  458
Skype for iPad >>>  60163
LINE >>>  11437
QQ >>>  9109
LOVOO - Dating Chat >>>  1985
QQ HD >>>  5058
Messeng

In the Google Playstore dataset the number of installs is not precise, example: 10,000+, 100,000+, 1,000,000+ which shows we don't know how much is 10,000+ it could be 15,000 or 20,000. But that's alright, we don't need very precise data since we only want to find out which app attracts more users.

We are going to leave the numbers as they are which means 10,000+ will be 10,000 but we need to do a little bit of cleaning: stripping off the symbols from the string so it would be much more easier to process.

In [26]:
display_table(free_android_store, 5)

1,000,000+ : 15.730083502595352
100,000+ : 11.543669600541637
10,000,000+ : 10.550665763935905
10,000+ : 10.212141728729407
1,000+ : 8.395396073121193
100+ : 6.917174452719477
5,000,000+ : 6.82690137666441
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.279395170390431
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


In [27]:
freq_android = freq_table(free_android_store, 1)
app_and_user = []

print("Average number of users for each genre in Google Playstore: ")
print("=" * 50)
print("\n")

sort_list = []
for category in freq_android:
    total_android = 0
    len_category = 0
    for data in free_android_store:
        category_app = data[1]
        if category_app == category:
            num_installs = data[5] 
            num_installs = num_installs.replace("+", "").replace(",", "")
            total_android += float(num_installs)
            len_category += 1
    avg_android = total_android / len_category
    sort_list.append((category, avg_android))
    print(category, ": ", avg_android)
    
sort_list.sort(key=lambda x: x[1], reverse=True)
print("\n")
print("="*20)
print(sort_list[:5])

Average number of users for each genre in Google Playstore: 


ART_AND_DESIGN :  1905351.6666666667
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  3082017.543859649
ENTERTAINMENT :  21134600.0
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1313681.9054054054
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15837565.085714286
FAMILY :  2691618.159021407
MEDICAL :  120616.48717948717
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17805627.643678162
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10695245.286096256
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  54260

Communication and Video Players has high amount of average users followed by Social then Entertainment, this reflects previously mentioned there is a balance between entertainment and practical apps in the Playstore. There are roughly (since this is not a precise number.) almost 40 million users in the Communication category alone. Like in the Apple Store dataset this is because of popular apps such as Whatsapp, Messenger – Text and Video Chat for Free and Google Chrome that has large amount of users.

VIDEO_PLAYERS which contains streaming services (Youtube and Iqiyi), video editor and players on average has roughly 25 million users, then we have SOCIAL which contains social media apps such as Instagram, Facebook and Tumblr has 23 million users and ENTERTAINMENT that consists mostly of streaming services such as Netflix, Tubi TV and Pluto Tv on average has 21 million average users.

In [37]:
for x in free_android_store:
    if x[1] == "VIDEO_PLAYERS":
        print(x[0], ">>>", x[5])

YouTube >>> 1,000,000,000+
All Video Downloader 2018 >>> 1,000,000+
Video Downloader >>> 10,000,000+
HD Video Player >>> 1,000,000+
Iqiyi (for tablet) >>> 1,000,000+
Motorola FM Radio >>> 100,000,000+
Video Player All Format >>> 10,000,000+
Motorola Gallery >>> 100,000,000+
Free TV series >>> 100,000+
Video Player All Format for Android >>> 500,000+
VLC for Android >>> 100,000,000+
Code >>> 10,000,000+
Vote for >>> 50,000,000+
XX HD Video downloader-Free Video Downloader >>> 1,000,000+
OBJECTIVE >>> 1,000,000+
Music - Mp3 Player >>> 10,000,000+
HD Movie Video Player >>> 1,000,000+
YouCut - Video Editor & Video Maker, No Watermark >>> 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects >>> 1,000,000+
YouTube Studio >>> 10,000,000+
video player for android >>> 10,000,000+
Vigo Video >>> 50,000,000+
Google Play Movies & TV >>> 1,000,000,000+
HTC Service － DLNA >>> 10,000,000+
VPlayer >>> 1,000,000+
MiniMovie - Free Video and Slideshow Editor >>> 50,000,000+
Samsung Video Library 

## Conclusion

Although we see that on average Navigation, References and Social Network has a high amount of users which is the result of popular or well known apps from big tech companies that contributes large amount of users, we see that in overall view the App Store leans more to entertainment only apps as shown in the most common genre section. While in the Google Playstore there is a balance between practical and entertainment. 

We can conclude that we could try to develop gaming apps for the Apple Store, however since there are already plenty in the market we must look into further what type of games that usually user enjoy and would mostlikely catch their interest. As for the Google Playstore we could develop both practical and entertainment apps in one, which means this particular app can be served as both for entertainment and practical. For example learning game apps such as Duolingo.

In [29]:
# goal: what type of apps are likely to attract more users.

# What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or 
# more for entertainment (games, photo and video, social networking, sports, music)?
