# Apps Project

In this project, I play the role of a data analyst at a company that builds English-language Android and iOS mobile apps, which are distributed through the Google Play Store and the iOS App Store. The company only builds apps that are free to download and install, and its main source of revenue is in-app ads. This means the revenue for any given app is mostly influenced by the number of people who use it: the more users who see and engage with the ads, the better.

My goal for this project is to analyze data to help our developers understand what type of apps would most likely attract more users on both Google Play Store and the iOS App Store. 

In [None]:
from csv import reader

def load_csv(path):
    # Read a CSV file and return (header, rows); the file is closed automatically.
    with open(path, encoding='utf8') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# iOS dataset
applestore_header, applestore = load_csv('AppleStore.csv')

# Google Play dataset
googleplaystore_header, googleplaystore = load_csv('googleplaystore.csv')

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0])) 

In [None]:
#exploring the iOS dataset 
print(applestore_header)
print('\n')
explore_data(applestore, 0, 3, True)


There are 7197 apps in the iOS sample dataset. Each app is described by 16 features, one per column. Some features that matter most for this analysis are: `track_name`, `price`, `rating_count_tot`, `rating_count_ver`, `user_rating` and `prime_genre`. If the company were located in a country where internet access is scarce or expensive, a feature like `size_bytes` would also be very important. The feature names are not all self-explanatory. For more detail on what each feature means, go [here](https://github.com/dataquestio/solutions/blob/master/Mission350Solutions.ipynb).

In [None]:
#exploring the Google Play dataset 
print(googleplaystore_header)
print('\n')
explore_data(googleplaystore, 0, 3, True)

The sample Google Play dataset has 10841 apps, each characterized by 13 features (13 columns). The features that might come in handy for this data analysis are `App`, `Category`, `Rating`, `Reviews`, `Installs`, `Type`, `Price` and `Genres`. You can find out more about what each of these features means [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Data Cleansing

Kaggle has a discussion forum for both the [iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) and the [Google Play](https://www.kaggle.com/lava18/google-play-store-apps/discussion) datasets. One of the discussions in the [Google Play](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) forum shows that one entry of the dataset contains an error. Before moving forward, I'll show this error and fix it.

In [None]:
print(googleplaystore_header) #the Google Play dataset header
print("\n")
print(googleplaystore[104])   #an entry without an error
print("\n")
print(googleplaystore[10])    #an entry without an error
print("\n")
print(googleplaystore[10472]) #the entry with an error
print("\n")
print(googleplaystore[10472][1]) #category feature of the entry with an error
print("\n")
print(googleplaystore[10472][-4]) #genre feature of the entry with an error

The entry at index 10472 corresponds to the app `Life Made Wi-Fi Touchscreen Photo Frame`, the entry at index 104 to `Hairstyles step by step`, and the entry at index 10 to `Text on Photo - Fonteee`. The `Category` value for `Life Made Wi-Fi Touchscreen Photo Frame` is `1.9`, while that of `Hairstyles step by step` is `BEAUTY` and that of `Text on Photo - Fonteee` is `ART_AND_DESIGN`. `Category` is clearly a categorical variable, yet its value for entry 10472 is numerical, which indicates an error in how the data was collected. In addition, indexing this row at `[-4]`, where `Genres` normally sits, returns an empty string. Because of this, I'll delete the entire row.

In [None]:
del googleplaystore[10472] #do not run more than once
print(len(googleplaystore))
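
Rather than relying on a known row index, a faulty row like this one could also be found programmatically. The sketch below assumes that a malformed row has a different number of columns than the header, which may or may not hold for every kind of error. The function name `find_malformed_rows` and the toy data are hypothetical.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Hypothetical toy data: the second row is missing one value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [['A', 'BEAUTY', '4.1'], ['B', '1.9'], ['C', 'TOOLS', '3.8']]

for index, row in find_malformed_rows(toy_rows, toy_header):
    print(index, row)
```

A check like this only flags rows for inspection; whether to delete or repair them is still a judgment call.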

# Detecting and Removing duplicate entries

Exploring the Kaggle discussion forum for the Google Play dataset shows that some apps have multiple entries. For example, the app `Instagram` has several entries, as the cell below shows:

In [None]:
instagram_count = 0
for app in googleplaystore:
    name = app[0]
    if name == "Instagram":
        instagram_count += 1
        print(app)
print("\n")        
print("The app 'Instagram' has ", instagram_count, " entries")
        

In [None]:
unique_apps = []
duplicate_apps = []
for app in googleplaystore:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print("The number of duplicate entries is", len(duplicate_apps))
print("\n")
print("Some examples of duplicate apps are", duplicate_apps[:6])
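
As a possible alternative to the list-membership loop, the same counts can be obtained with `collections.Counter`. This is a minimal sketch on hypothetical toy rows; `count_duplicates` is an illustrative name, and it assumes the app name sits in column 0, as in the Google Play dataset.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (names seen more than once with their counts, number of extra entries)."""
    counts = Counter(row[name_index] for row in rows)
    duplicated = {name: n for name, n in counts.items() if n > 1}
    extra_entries = sum(n - 1 for n in duplicated.values())
    return duplicated, extra_entries

# Hypothetical toy rows: [name, reviews]
toy_rows = [['Instagram', '10'], ['Slack', '5'], ['Instagram', '12'], ['Instagram', '12']]
duplicated, extra = count_duplicates(toy_rows)
print(duplicated)
print(extra)
```

Here `extra` counts the surplus rows (every occurrence after the first), which is what `len(duplicate_apps)` measures in the cell above.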

We need only one data point per app for our analysis. Deleting duplicate entries at random would be bad practice, so I propose the following criteria for choosing which entry to keep.

1. If all the features are exactly the same across an app's duplicate entries, we can keep any one of them and delete the rest.
2. The feature `Reviews` represents the number of users who left a review (and possibly a rating). If this value differs between duplicate entries, we keep the entry with the highest number of reviews. We infer that the different numbers mean the data was collected at different times, so the highest count should correspond to the most recent collection. Refer to the `Instagram` entries above and compare their `Reviews` values.

In [None]:
print("The number of unique apps is", len(unique_apps))
print("\n")
reviews_max = {}
for app in googleplaystore:
    name = app[0]
    n_reviews = float(app[3])
    # keep the largest review count seen so far for each app name
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print(len(reviews_max))  # this should equal the number of unique apps
    

Now, we've succeeded in extracting the unique apps by the criterion written in `2` above. The `reviews_max` dictionary does not contain any other feature of the apps except for `Reviews`. We therefore have to populate a new list that contains these unique apps with their features. We call this list `googleplaystore_clean`.

1. First, we create an empty `googleplaystore_clean` list and another empty `already_added` list. We isolate the name of the app and the number of reviews for each app in the Google Play dataset.

2. We then iterate through the dataset. For every iteration, we add the current row (app) to the `googleplaystore_clean` list, and the app name (name) to the `already_added` list if: 
    a. The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; **and**
    b. The name of the app is not already in the `already_added` list. 
  
We need to add the `b` part of the condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry. If we only check the condition `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for certain apps.

In [None]:
googleplaystore_clean = []
already_added = []
for app in googleplaystore:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        googleplaystore_clean.append(app)
        already_added.append(name)

print(len(googleplaystore_clean))
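
The two passes above (first `reviews_max`, then `googleplaystore_clean`) could also be folded into a single pass that stores the best row per app name directly. This is only a sketch on hypothetical toy rows; `keep_most_reviewed` is an illustrative name. Note one difference from the cell above: the result is ordered by each name's first appearance (relying on dictionaries preserving insertion order, Python 3.7+), not by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means ties keep the earlier row, so no duplicates slip through.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical toy rows: [name, category, rating, reviews]
toy_rows = [
    ['A', 'x', '4', '10'],
    ['B', 'x', '4', '7'],
    ['A', 'x', '4', '12'],
    ['A', 'x', '4', '12'],
]
clean_rows = keep_most_reviewed(toy_rows)
print(clean_rows)
```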
            
            


We now explore the data using the previously defined `explore_data()` function. The number of entries is 9659, **as expected!**

In [None]:
explore_data(googleplaystore_clean, 0, 4, True)

# Removing Non-English Apps

The company we work for develops apps in English language only. However, a look through both the iOS and the Google Play datasets shows that they contain non-English apps - at least, Chinese apps. For example:

In [None]:
print(applestore[813][1])
print(applestore[6731][1])

print(googleplaystore_clean[4412][0])
print(googleplaystore_clean[7940][0])

We can remove these non-English apps for the sake of our analysis. One way to do this is to look for app names containing characters that fall outside the ASCII range. The characters commonly used in English text include the lowercase and uppercase letters A to Z, the digits 0 - 9, punctuation marks (!, ., :, ', ; etc.) and other symbols (@, &, ^, # etc.).

These characters are encoded by the numbers 0 - 127 in the ASCII system. For example, the character `a` is encoded by the number `97` and the character `1` is encoded by the number `49`. We can leverage this to build a function that checks an app name and tells us whether it contains non-ASCII characters. The built-in function `ord()` returns the Unicode code point of a character, which for these characters is the same as its ASCII number.
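
To make the encoding concrete, here is a small sketch that prints the code point of a few characters and whether each falls inside the ASCII range. The helper name `is_ascii_char` is hypothetical.

```python
def is_ascii_char(ch):
    """True if the character's code point is within the ASCII range 0-127."""
    return ord(ch) <= 127

for ch in ['a', '1', 'Z', '™', '爱']:
    print(ch, ord(ch), is_ascii_char(ch))
```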

In [None]:
def is_english(string):
    for i in string:
        if ord(i) > 127:
            return False

    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

The `is_english()` function seems to work as intended, judging by the output above. However, some English apps have emojis or other symbols whose code falls outside the ASCII range (above `127`). For example:

In [None]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord("™"))
print(ord("😜"))

If we use the `is_english()` function as written, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters whose code falls outside the ASCII range.

In [None]:
def is_english(string):
    num_char_over_ascii_range = 0
    for i in string:
        if ord(i) > 127:
            num_char_over_ascii_range += 1
        if num_char_over_ascii_range > 3:
            return False

    return True

In [None]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

This filter is clearly not perfect, but it should be more effective for our analysis than the earlier version of `is_english()`.
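
The cut-off of three non-ASCII characters is a judgment call, so it may help to expose it as a parameter and see how the result changes. The sketch below is a hypothetical variant (`is_probably_english` and `max_non_ascii` are illustrative names, not part of the analysis above) that counts non-ASCII characters and compares the count with the threshold.

```python
def is_probably_english(name, max_non_ascii=3):
    """True if the name has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Instachat 😜', max_non_ascii=0))  # strict ASCII-only check
```

With `max_non_ascii=0` this behaves like the first, stricter version of the filter.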

We then use the new `is_english()` function to extract the `"English"` apps from both datasets.

In [None]:
applestore_english = []
googleplaystore_english = []

for app in applestore:
    name = app[1]
    if is_english(name):
        applestore_english.append(app)

for app in googleplaystore_clean:
    name = app[0]
    if is_english(name):
        googleplaystore_english.append(app)
        
explore_data(applestore_english, 0, 2, True)
print("\n")
explore_data(googleplaystore_english, 0, 2, True)


We can see that we're left with 6183 iOS apps and 9614 Android apps.

# Removing Non-Free Apps

We stated earlier that the company only builds apps that are in English and are free to download and install. The datasets contain both free and non-free apps. Therefore, we'll need to remove the non-free apps so that we can use only the free apps for our analysis.

In [None]:
applestore_english_free = []
googleplaystore_english_free = []

for app in applestore_english:
    free_or_non_free = app[4]
    if free_or_non_free == "0.0":
        applestore_english_free.append(app)

for app in googleplaystore_english:
    free_or_non_free = app[7]
    if free_or_non_free == "0":
        googleplaystore_english_free.append(app)
        

print(len(applestore_english_free))        
print(len(googleplaystore_english_free))

We're left with 3222 iOS apps and 8864 Android apps, which should be enough for our analysis.
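
The filter above compares price strings exactly (`"0.0"` for iOS, `"0"` for Google Play). A slightly more tolerant check would parse the price as a number first. This is a sketch under the assumption that prices are plain numbers optionally prefixed with `$`; the helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """True if the price string parses to zero; unparsable values count as not free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```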

# Data Analysis

As stated earlier, my goal for this project is to analyze data to help our developers understand what type of apps would most likely attract more users on both Google Play Store and the iOS App Store. To minimize risks and costs, I come up with a validation strategy for an *ideal* app idea.

1. Build a minimal Android version of the app, and add it to the Google Play Store.
    - The rationale behind this is, *ceteris paribus*, reasons like "my phone runs on an old Android version that cannot support the app" will no longer be tenable for not downloading the app. Also, studies have shown that [Android phones are three times cheaper than iPhones](https://www.trustedreviews.com/news/android-phones-nearly-three-times-cheaper-than-iphone-2924886). This translates to more users having access to the app via the Google Play Store.

2. If the app has a good response from users, we develop it further.
    - The rationale behind this is, *ceteris paribus*, an app that has both a high number of reviews and high user ratings attracts a large number of users.

3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
    - The rationale behind this is, *ceteris paribus*, if an app on the Google Play Store is profitable after six months (by a predetermined benchmark), then we've been able to build and sustain trust and loyalty among its users. Deploying the app on the App Store would not do us any harm.
    
Because my end goal is to add the app on both Google Play Store and the App Store, I need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification. 

I begin this analysis by getting a sense of the most common genres for each market. For this, I'll build a frequency table for the `Genres` and `Category` columns of the Google Play Store dataset and the `prime_genre` column of the Apple Store dataset.

In [None]:
def freq_table(dataset, index):
    freq_table_dict = {}
    total = 0
    
    for app in dataset:
        total += 1
        i = app[index]
        if i in freq_table_dict:
            freq_table_dict[i] += 1
        else:
            freq_table_dict[i] = 1

    freq_table_per = {}
    
    for j in freq_table_dict:
        a =  (freq_table_dict[j] / total) * 100
        freq_table_per[j] = a
        
    return freq_table_per
            
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
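
To see what such a frequency table looks like on a small input, here is a compact sketch of the same idea built on `collections.Counter`. The name `percent_table` and the toy rows are hypothetical; the function returns percentages keyed by column value, as `freq_table()` does.

```python
from collections import Counter

def percent_table(rows, index):
    """Map each distinct value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Hypothetical toy rows: [name, genre]
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
table = percent_table(toy_rows, 1)

# Print in descending order of frequency.
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```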

We start by examining the frequency table for the `prime_genre` column of the Apple Store dataset. The most common genre among the free English apps is `Games`, with a frequency of `58.16%`, which is more than half of all the apps in this subset. The runner-up is `Entertainment` with a frequency of `7.88%`. Close behind `Entertainment` are `Photo & Video` and `Social Networking`, both with a frequency of less than `5%`. Next comes `Shopping` with a frequency of `2.6%`. In general, most of the free English apps on the Apple Store are designed for fun: games, social media, pictures and videos, entertainment etc. Apps with practical purposes (news, travel, books, medical, catalogs, reference, navigation etc.) are rare. The App Store, at least the part of it covered by this sample, is therefore dominated by apps designed for fun. However, I cannot recommend an app profile for the App Store based on this frequency table alone. The fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the offer.

In [None]:
display_table(applestore_english_free, 11)

Next, we examine the frequency table for the `Genres` column of the Google Play Store data set.

The most common genre is `Tools`, at about `8.5%` of the apps. Next is `Entertainment` with about `6%`. `Education` comes third with `5.34%`. The next 10 genres after `Education` are closely grouped, each with a frequency of about 3%. Here, the frequencies are not concentrated in one type of app. This dataset has a different landscape from the Apple Store dataset: the frequencies are spread across many genres, covering both practical apps and fun apps.

In [None]:
display_table(googleplaystore_english_free, 9)

Finally, we examine the frequency table for the `Category` column of the Google Play Store data set. It might be surprising to have two grouping columns in the same dataset. Under this grouping, the most common category is `FAMILY` with a frequency of `18.9%`. Next are `GAME` and `TOOLS` with frequencies of less than 10% each. One might wonder why the two tables differ, since both groupings, by `Category` and by `Genres`, describe the same apps in the same dataset.

In [None]:
display_table(googleplaystore_english_free, 1)

I hypothesize that these frequencies might not reveal which app genres are the most popular. One way to test this hypothesis is to calculate the average number of installs for each app genre. However, this information is missing from the Apple Store dataset. As a workaround, I'll use the total number of user ratings (`rating_count_tot`) as a proxy, and calculate the average number of user ratings per app genre on the App Store.
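
The per-genre average described above can be sketched as a single pass over the rows that accumulates a running total and a count for each genre. The function name `average_by_group` and the toy values are hypothetical; the column positions are passed in as parameters.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical toy rows: [name, genre, rating count]
toy_rows = [
    ['App one', 'Navigation', '300'],
    ['App two', 'Navigation', '100'],
    ['App three', 'Reference', '50'],
]
averages = average_by_group(toy_rows, 1, 2)
print(averages)
```

Compared with looping over the whole dataset once per genre, this visits each row only once, which matters little at this scale but keeps the logic in one place.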

In [None]:
#for the Apple Store

appstore_freq_by_genre = freq_table(applestore_english_free, -5) #generate a dictionary showing the unique genres in this dataset and their frequencies

for genre in appstore_freq_by_genre:
    tot = 0
    len_genre = 0
    for app in applestore_english_free:
        genre_app = app[-5]
        if genre_app == genre:
            user_rating = float(app[5])
            tot += user_rating
            len_genre += 1
    avg_rating = tot / len_genre
    appstore_freq_by_genre[genre] = avg_rating
    

#sorting the values in the new dictionary

def sort_dict(dictionary):
    sorted_appstore_by_genre = []
    for key in dictionary:
        key_val_as_tuple = (dictionary[key], key)
        sorted_appstore_by_genre.append(key_val_as_tuple)

    list_sorted = sorted(sorted_appstore_by_genre, reverse = True)
    for entry in list_sorted:
        print(entry[1], ':', entry[0])

sort_dict(appstore_freq_by_genre)  # print each genre with its average number of user ratings, in descending order

appstore_freq_by_genre["Navigation"]

From the table above, we see that `Navigation`, `Reference` and `Social Networking` are the genres with the highest average number of user ratings. I'll continue this analysis to see which apps in these genres have the most user ratings.

In [None]:
#for Navigation

apps_under_navigation = {}

for app in applestore_english_free:
    name = app[1]
    genre_app = app[-5]
    user_rating = float(app[5])
    if genre_app == "Navigation":
        apps_under_navigation.update([(name, user_rating)])
        
sort_dict(apps_under_navigation)  #Navigation apps and the number of ratings

Navigation apps have about 86000 ratings on average. However, this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.
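
One way to see how much a couple of very popular apps can pull an average up is to compare the mean with the median. The numbers below are hypothetical, chosen only to mimic a genre with two giants and a few small apps; they are not taken from the dataset.

```python
from statistics import mean, median

def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)

# Hypothetical rating counts: two very popular apps and four small ones.
hypothetical_counts = [300000, 150000, 10000, 3000, 1000, 5]
avg, mid = summarize(hypothetical_counts)
print('mean:', avg)
print('median:', mid)
```

When the mean is many times larger than the median, as it is here, the average says more about the few giants than about a typical app in the genre.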

In [None]:
#for Reference

apps_under_reference = {}

for app in applestore_english_free:
    name = app[1]
    genre_app = app[-5]
    user_rating = float(app[5])
    if genre_app == "Reference":
        apps_under_reference.update([(name, user_rating)])
        
sort_dict(apps_under_reference)  #Reference apps and the number of ratings

Reference apps have almost 75000 ratings on average. However, Bible and Dictionary.com Dictionary & Thesaurus dominate, with a combined total of over a million user ratings.

In [None]:
#for Social Networking

apps_under_social_networking = {}

for app in applestore_english_free:
    name = app[1]
    genre_app = app[-5]
    user_rating = float(app[5])
    if genre_app == "Social Networking":
        apps_under_social_networking.update([(name, user_rating)])
        
sort_dict(apps_under_social_networking)  #Social Networking apps and the number of ratings

Social Networking apps have almost 72000 ratings on average. However, the ratings are broadly spread across these apps. Facebook and Pinterest together have almost 4 million user ratings, and other Social Networking apps like Skype for iPhone, Messenger, Tumblr and WhatsApp Messenger also have many ratings.

# App Profile Recommendation for the Apple Store

We saw earlier that apps with practical purposes are relatively rare on the Apple Store. However, practical apps like Waze, Google Maps, Bible and Dictionary still have about 1.5 million user ratings between them. We cannot reliably estimate how long people spend on these practical apps. What we do know is that the longer a person spends in an app, the more chances they have of interacting with an in-app ad. People spend more time on fun apps. Therefore, I would focus on apps that people are likely to spend time interacting with. I recommend developing a practical app that incorporates a certain level of fun.

For example, an app that prepares people for job or grad school interviews with a humorous intelligent assistant that tries to mimic *Trevor Noah's* voice and satirically has a response for every question answered, correctly or otherwise. The app would also have other people studying for different categories of interviews interacting with you through Knowledge Sharing Sessions, discussion forums, daily sarcastic quotes reminding users to visit the app etc.

There are probably other categories of apps that might have traffic. However, there is a trade-off between a high number of ratings and the amount of time spent on each app. A weather app, for example, might have a high number of ratings, but how long could the average person possibly spend on a weather app?

## Google Play Store

For the Google Play Store, we have a clearer picture of how many people interact with the apps than we do for the Apple Store. This is captured by the `Installs` column. However, this column only gives open-ended ranges for the number of installs, i.e. 1+, 5+, 100+, 5,000+ etc. These values are not precise. For example, an app listed with 100+ installs might actually have 200 installs or considerably more. However, I don't need perfect precision for the purpose of this analysis; I only want to find out which app genres attract the most users.

In [None]:
display_table(googleplaystore_english_free, 5)  # the Installs column

I am going to leave the numbers as they are, which means that I'll treat an app with 100,000+ installs as having 100,000 installs, an app with 5,000+ installs as having 5,000 installs, and so on. First, I need to strip the trailing `+` from each value and remove the `,` separators from values of 1,000 or more.
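
The clean-up step described above can be isolated in a small helper, sketched here on a few sample strings. The name `parse_installs` is hypothetical, and the sketch assumes every value is made up of digits, commas and an optional trailing `+`.

```python
def parse_installs(text):
    """Convert an install label such as '100,000+' into the integer 100000."""
    return int(text.replace('+', '').replace(',', ''))

for label in ['100,000+', '5,000+', '1+', '0']:
    print(label, '->', parse_installs(label))
```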

In [None]:
googleplaystore_freq_by_category = freq_table(googleplaystore_english_free, 1) #generate a dictionary showing the unique categories in this dataset and their frequencies

for category in googleplaystore_freq_by_category:
    tot = 0
    len_category = 0
    for app in googleplaystore_english_free:
        category_app = app[1]
        installs = str(app[5])
        if category_app == category:
            tot += float(installs.replace('+', '').replace(',', ''))
            len_category += 1
    avg_installs = tot / len_category
    googleplaystore_freq_by_category[category] = avg_installs

sort_dict(googleplaystore_freq_by_category)  # print each category with its average number of installs, in descending order

We see that the top 3 app categories with the highest number of installs (based on our metric) are `COMMUNICATION`, `VIDEO_PLAYERS` and `SOCIAL`. I proceed with this analysis to find out the dominating apps in these categories.

In [None]:
#for Communication

apps_under_communication = {}

for app in googleplaystore_english_free:
    name = app[0]
    category_app = app[1]
    installs = float(str(app[5]).replace('+', '').replace(',', ''))
    if category_app == "COMMUNICATION":
        apps_under_communication.update([(name, installs)])
        
sort_dict(apps_under_communication)  # Communication apps and the number of installs (sort_dict prints the table itself)

We see that even though the `COMMUNICATION` category has over 38 million installs on average, this number is heavily influenced by a few apps. WhatsApp, Skype, Messenger, Hangouts, Google Chrome and Gmail each have at least 1 billion installs. Some other apps in this category have at least 100 million or 500 million installs. These apps have a dominating effect on the category's install average. The same pattern holds true for the `VIDEO_PLAYERS` and `SOCIAL` categories. Apps like YouTube, Google Play Movies & TV, MX Player and VLC dominate the `VIDEO_PLAYERS` category by installs, while Instagram, Google+, Facebook and Snapchat dominate `SOCIAL`.

In [None]:
#for Video Players

apps_under_video_players = {}

for app in googleplaystore_english_free:
    name = app[0]
    category_app = app[1]
    installs = float(str(app[5]).replace('+', '').replace(',', ''))
    if category_app == "VIDEO_PLAYERS":
        apps_under_video_players.update([(name, installs)])
        
sort_dict(apps_under_video_players)  # Video Player apps and the number of installs (sort_dict prints the table itself)

In [None]:
#for Social

apps_under_social = {}

for app in googleplaystore_english_free:
    name = app[0]
    category_app = app[1]
    installs = float(str(app[5]).replace('+', '').replace(',', ''))
    if category_app == "SOCIAL":
        apps_under_social.update([(name, installs)])
        
sort_dict(apps_under_social)  # Social apps and the number of installs (sort_dict prints the table itself)

It might be tough, if not impossible, to compete with dominating apps like Facebook, WhatsApp, YouTube etc. that have already carved out a niche for themselves on the Google Play Store. The Store has users whose preferences cut across all categories, from fun (Social, Communication, Entertainment, Games, Video Players etc.) to practical (Productivity, Travel, Lifestyle, Books and Reference, Tools etc.) apps.

Consequently, I suggest the same recommendation given for the Apple Store. However, the choice of the app(s) to build for the Play Store is more flexible due to the user preference balance across all genres. 

# Conclusion

In this project, I analyzed data about the Apple Store and Google Play Store apps with the goal of recommending an app profile that can be profitable for both markets.

I concluded that developing a practical app that incorporates a certain level of fun could potentially lead to more revenue for the company from both Stores. For example, an app that prepares people for job or grad school interviews with a very humorous intelligent assistant that tries to mimic Trevor Noah's voice and satirically has a response for every question answered, correctly or otherwise. The app would also have other people studying for different categories of interviews interacting with you through Knowledge Sharing Sessions, discussion forums, daily sarcastic quotes reminding users to visit the app etc.

While I acknowledge that there can be no absolute correct recommendation, I believe exploring my recommendation might be worthwhile for the hypothetical company.

-MO