# Most Recommended Apps from App Store and Google Play

Internet users increasingly expect high-quality apps. As technology advances, developers around the world release hundreds of apps every day, and these apps are available on the App Store (for iOS devices) and Google Play (for Android devices). However, not all of these apps offer high quality or useful content, which is one reason users may be wary of downloading apps to their smartphones and other devices.

To help address this issue, this project analyzes data from two app platforms to help developers understand what types of apps attract more users, so that they can offer apps with good quality and content. We analyze only apps that are free to download and install. To accomplish this, we use the App Store data set and the Google Play data set.

## Summary

This project is divided into three parts: __Data Exploration__, __Data Cleansing__, and __Data Analysis__ (including the final recommendation). The Data Cleansing part is split into Google Play data cleansing and App Store data cleansing.

## PART I: Data Exploration

In [1]:
from csv import reader

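# Apple data set
# explicit encoding, since some app names contain non-ASCII characters
opened_AppleStore_file = open('AppleStore.csv', encoding='utf8')
read_Apple_file = reader(opened_AppleStore_file)
data_Apple = list(read_Apple_file)
opened_AppleStore_file.close()

# Google data set
opened_google_file = open('googleplaystore.csv', encoding='utf8')
read_google_file = reader(opened_google_file)
data_google = list(read_google_file)
opened_google_file.close()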
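
The loading steps above could also be wrapped in a small helper. The sketch below is only an illustration: the name `load_dataset` and the choice to return the header separately from the data rows are assumptions, not part of the original notebook, and it assumes the files are UTF-8 encoded.

```python
from csv import reader


def load_dataset(path, encoding='utf8'):
    """Read a CSV file and return (header, rows) as lists of strings.

    Hypothetical helper: the rest of the notebook keeps the header as row 0
    of each data set instead of separating it like this.
    """
    # The `with` block closes the file automatically, even if reading fails.
    # newline='' is the setting the csv module recommends for files passed to reader().
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]


# Possible usage (requires the CSV file to be present):
# header_apple, apps_apple = load_dataset('AppleStore.csv')
```

Returning the header separately would remove the need for slices such as `data_google[1:]` in later cells, at the cost of changing the row indices used in this notebook.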


To make the data sets easier to inspect, we created an __explore function__, as shown below.

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))


In [3]:
explore_data(data_Apple, 0, 5, rows_and_columns = True)
print('\n')
explore_data(data_google, 0, 5, rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7198
Number of columns:  16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Phot

## PART II: Data Cleansing

Before continuing with the analysis, we need to clean the data sets by removing apps that are not free and apps that are not in English.

### Google Play data set


- __STEP 1__: removing apps with _missing_ data.

According to the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data set, row 10473, for the app "Life Made WI-Fi Touchscreen Photo Frame", is missing its __Category__ value, so every later value in that row is shifted one column to the left. We therefore need to remove this row. The full row is shown below.

In [4]:
data_google[10473]  # no Category

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [5]:
del data_google[10473]  # removing error row
len(data_google)

10841

- __STEP 2__: removing _duplicate_ data. 

As noted in the discussions section of the Google Play data set documentation, the data set contains some [duplicate entries](https://www.kaggle.com/lava18/google-play-store-apps/discussion?sortBy=top&group=all&page=1&pageSize=20&category=all), which we must remove to make the data more accurate. The code cell below shows one example: the duplicate entries for Instagram.

In [6]:
for app in data_google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
duplicate_data = []  # one entry per extra occurrence of an app name
unique_data = []     # each app name, recorded the first time it is seen

for app in data_google:
    name = app[0]
    if name in unique_data:
        duplicate_data.append(name)
    else:
        unique_data.append(name)
        
print('Number of duplicate apps: ', len(duplicate_data))
print('\n')
print('Some duplicate apps: ', duplicate_data[:15])


Number of duplicate apps:  1181


Some duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
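
The same count can be obtained with a `set`, which makes each membership check fast instead of scanning a list. The following is a minimal sketch rather than the notebook's own method; the function name `find_duplicates` and the sample rows are made up for illustration.

```python
def find_duplicates(rows, name_index=0):
    """Return (duplicates, seen): the repeated names and the set of unique names."""
    seen = set()       # names encountered so far
    duplicates = []    # one entry per extra occurrence of a name
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates, seen


# Made-up rows containing only an app name
sample_rows = [['Box'], ['Slack'], ['Box'], ['Box']]
duplicates, unique_names = find_duplicates(sample_rows)
print('Number of duplicate apps: ', len(duplicates))
print('Number of unique apps: ', len(unique_names))
```

An app that appears three times contributes two entries to the duplicates list, which matches how the cell above counts.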


Rather than removing duplicates at random, we will remove them based on their number of reviews: for any given app, we keep only the entry with the highest number of reviews (which suggests that it is the most recent) and remove the other entries.

In the step below, we create a dictionary that maps each unique app name to the highest number of reviews found among its entries. The length of the dictionary, which equals the number of unique apps, should be 9,659: the 10,840 app rows that remain after excluding the header and the faulty row, minus the 1,181 duplicates.

In [8]:
reviews_max ={}

for app in data_google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews 

print(len(reviews_max))

9659


After creating the dictionary above, we use it to build a new cleaned data set, named __android_clean__: for each app, we keep only the row whose number of reviews matches the maximum stored in the dictionary.

In [9]:
android_clean = []
already_added = []

for app in data_google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)    
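
The two passes above (finding the maximum, then selecting rows) can also be folded into a single pass. The sketch below is an alternative under stated assumptions, not the notebook's method: the function name `keep_most_reviewed` is hypothetical, and the 'Example App' row is invented. The Instagram review counts are taken from the duplicate rows shown earlier.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest number of reviews."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two rows tie on reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Example App', 'TOOLS', '4.0', '120'],       # invented row
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
cleaned = keep_most_reviewed(sample_rows)
print(len(cleaned))
```

The selected rows are the same as with the two-pass version, but they come out in order of each app's first appearance, so the row order may differ.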
    

In [10]:
# exploring the new cleaned data set

print('The cleaned data set has', len(android_clean), 'entries.')
print('\n')
explore_data(android_clean, 0, 3, True)



The cleaned data set has 9659 entries.


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


- __STEP 3__: removing _non-English_ apps.

In this step, we remove non-English apps from the cleaned data set created in the previous step. To do so, we create a function that loops through the characters of an app's name and checks whether more than 3 of them fall outside the ASCII range (code points above 127), which we treat as non-English characters. Allowing up to 3 such characters is intended to minimize data loss, since some English app names contain symbols such as dashes.

In [11]:
def is_english(name):
    non_ascii_count = 0  # characters outside the ASCII range (code point > 127)
    
    for character in name:
        if ord(character) > 127:
            non_ascii_count += 1
    
    if non_ascii_count > 3:
        return False  # more than 3 non-ASCII characters: treat as non-English
    
    return True
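
To see how the threshold behaves, the sketch below applies the same rule to a few names. The helper names `count_non_ascii` and `is_mostly_ascii` are hypothetical, and only the first two sample names come from the data sets shown in this notebook; the other two are invented.

```python
def count_non_ascii(name):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in name if ord(character) > 127)


def is_mostly_ascii(name, max_non_ascii=3):
    """Same rule as the function above, with the threshold as a parameter."""
    return count_non_ascii(name) <= max_non_ascii


sample_names = [
    'Instagram',
    'U Launcher Lite – FREE Live Cool Themes, Hide Apps',  # contains one dash above 127
    'Café Déjà Vu',      # invented name with exactly 3 accented letters
    'Погода сегодня',    # invented non-English name
]
for name in sample_names:
    print(name, '->', is_mostly_ascii(name))
```

This is only a heuristic: a non-English name written entirely in unaccented Latin letters would still pass the check.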

Now we use the function above to separate English apps from non-English apps in our data set.

In [12]:
android_english_apps = []
android_non_english_apps = []

for apps in android_clean:
    name = apps[0]
    
    if is_english(name):
        android_english_apps.append(apps)
    else:
        android_non_english_apps.append(apps)  
    

We can now check how many English apps there are in the Google Play data set.

In [13]:
print('Number of English apps in Google Play data set:', len(android_english_apps))

Number of English apps in Google Play data set: 9614


In [14]:
explore_data(android_english_apps, 0, 3, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


- __STEP 4__: removing _non-free_ apps.

As mentioned in the introduction, we are going to work only with free apps to download and install. However, the Google Play data set (and also App Store) contains both free and non-free apps. Therefore, we must remove the non-free apps from the data set.

In [15]:
android_final = []   # final and cleaned data set


for app in android_english_apps:
    price = app[7]
    if price == '0':
        android_final.append(app)
     

In [16]:
print('The total amount of free apps is '+ str(len(android_final)) + '.')

The total amount of free apps is 8864.


In [17]:
explore_data(android_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  8864
Number of columns:  13


As shown above, the final cleaned data set for Android apps (from the Google Play Store) consists of __8,864__ free English apps.

### App Store data set

After reviewing the App Store data set, we found no errors such as missing data or duplicate entries. Therefore, we can skip directly to removing non-English and non-free apps.

- __STEP 1__: removal of _non-English_ entries.

Just as we did for the Google Play data set, we must remove non-English entries from the App Store data set, since we are working with apps directed toward an English-speaking audience. To do so, we reuse the function created in STEP 3 of the Google Play data cleansing.

In [18]:
ios_english_apps = []

for app in data_Apple[1:]:
    name = app[1]
    
    if is_english(name):
        ios_english_apps.append(app)     


In [19]:
print('Number of English apps in App Store data set:', len(ios_english_apps))


Number of English apps in App Store data set: 6183


In [20]:
explore_data(ios_english_apps, 0, 3, rows_and_columns = True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  6183
Number of columns:  16


- __STEP 2__: removal of _non-free_ apps.

At this stage, as we did for the Google Play apps, we must disregard non-free apps. We therefore remove the paid apps from the data set, which completes the data cleansing process.

In [21]:
ios_final = []

for app in ios_english_apps:
    price = app[4]
    
    if price == '0.0':
        ios_final.append(app)
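
The two stores write a free price differently (`'0'` in Google Play, `'0.0'` in the App Store), so the two filters compare against different strings. A single numeric check could cover both. The sketch below is an assumption-laden illustration: the function name `is_free` is hypothetical, and the `'$4.99'` format for paid apps is assumed, since no paid row appears in the output above.

```python
def is_free(price_text):
    """Return True when the price string represents zero."""
    # Drop surrounding spaces and a leading dollar sign (assumed format for paid apps).
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Text that is not a number is treated as "not free" rather than raising.
        return False


for price in ['0', '0.0', '$4.99']:
    print(price, '->', is_free(price))
```

With such a helper, the same condition could be used for both `android_english_apps` and `ios_english_apps`.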

In [22]:
print('The total amount of free apps in App Store is '+ str(len(ios_final)) + '.')

The total amount of free apps in App Store is 3222.


In [23]:
explore_data(ios_final, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  3222
Number of columns:  16


The analysis above shows that the App Store data set consists of 3,222 free English apps.

## PART III: Data Analysis

At this stage, we analyze the previously cleaned data for both the iOS and Android operating systems. For the App Store, we use the __prime_genre__ column as reference; for Google Play, we use the __Genres__ and __Category__ columns.



In [24]:
def freq_table(dataset, index):
    # count how many rows share each value found in the given column
    genre_counts = {}
    for row in dataset:
        element = row[index]
        if element in genre_counts:
            genre_counts[element] += 1
        else:
            genre_counts[element] = 1
            
    return genre_counts


In [25]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
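
Because the two final data sets have different sizes, raw counts are hard to compare across stores. One option is to express each count as a percentage of the data set. The sketch below is a possible variant, not part of the original notebook; the name `freq_table_percent` and the sample rows are made up.

```python
def freq_table_percent(dataset, index):
    """Return the share of rows (in percent) for each value in the given column."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1

    total = len(dataset)
    # An empty data set yields an empty table, so no division by zero occurs.
    return {value: round(count / total * 100, 2) for value, count in counts.items()}


# Made-up rows: (app name, genre)
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_percent(sample_rows, 1))
```

Since it returns a dictionary of numbers, the result can be sorted and printed the same way as the count-based table.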

In [26]:
print('For IOS:')
display_table(ios_final, 11)
print('\n')
print('For Android:')
display_table(android_final, 1)
print('\n')
display_table(android_final, 9)

For IOS:
Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


For Android:
FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


Tools : 749
Entertainment : 538
Education : 474



According to the frequency tables above, the most common genre among the free English apps for the iOS operating system is __Games__, followed by __Entertainment__ and __Photo & Video__. Compared to games and entertainment apps, there are few apps designed for practical purposes, such as shopping and education.

However, these counts describe how many apps exist in each genre, not how many users they have, so they do not show for sure that users prefer games to Health & Fitness apps, for example.

On the other hand, the most frequent genres in Google Play include apps designed for practical use (such as __Tools__) as well as __Entertainment__, which shows a more balanced landscape.

### Checking types of apps most users download

### iOS operating system

To dig into the iOS data set, we start from a frequency table of genres and then compute the average number of user ratings per app in each genre, to find out which genres are most popular among users. To do so, we use two columns of the data set: __prime_genre__ and __rating_count_tot__.

In [27]:
freq_table_ios = freq_table(ios_final, 11)
print(freq_table_ios)


{'Utilities': 81, 'Food & Drink': 26, 'Medical': 6, 'Games': 1874, 'Music': 66, 'Entertainment': 254, 'Photo & Video': 160, 'Navigation': 6, 'Productivity': 56, 'Social Networking': 106, 'Weather': 28, 'Sports': 69, 'Reference': 18, 'Lifestyle': 51, 'Travel': 40, 'Catalogs': 4, 'Education': 118, 'Shopping': 84, 'Finance': 36, 'Business': 17, 'News': 43, 'Health & Fitness': 65, 'Book': 14}


In [28]:
def display_table_final(dictionary):
    table = dictionary
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [29]:
genre_dictionary = {}

for genre in freq_table_ios:
    total = 0    # sum of the number of user ratings in this genre
    len_genre = 0    # number of apps in this genre
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
    avg_rating = round((total / len_genre), 2)
    genre_dictionary[genre] = avg_rating
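
The nested loop above scans the whole data set once per genre. The same averages can be computed in a single pass by keeping a running total and a count per genre. The sketch below is an alternative, not the notebook's method; the name `average_by_group` and the sample values are made up.

```python
def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = {}  # group -> running sum
    counts = {}  # group -> number of rows
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# Made-up rows: (genre, number of ratings)
sample_rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample_rows, 0, 1))

# Possible usage with the data set above:
# genre_dictionary = average_by_group(ios_final, 11, 5)
```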
       

In [30]:
display_table_final(genre_dictionary)

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22788.67
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0


Although "Games" is the most common genre in the App Store, it does not have the highest average number of user ratings per app. As the analysis above shows, the highest average belongs to __Navigation__ apps, followed by __Reference__ and __Social Networking__.

### Android operating system

Just as we did for the App Store, we create a frequency table from the __Category__ column and then use the __Installs__ column of the final Android data set to compute the average number of installs per category. This will help us identify the most popular categories in Google Play.

In [31]:
freq_table_android = freq_table(android_final, 1)
print(freq_table_android)

{'ART_AND_DESIGN': 57, 'ENTERTAINMENT': 85, 'TOOLS': 750, 'HEALTH_AND_FITNESS': 273, 'SPORTS': 301, 'BEAUTY': 53, 'SHOPPING': 199, 'MEDICAL': 313, 'FAMILY': 1676, 'MAPS_AND_NAVIGATION': 124, 'EDUCATION': 103, 'LIFESTYLE': 346, 'EVENTS': 63, 'PARENTING': 58, 'DATING': 165, 'GAME': 862, 'HOUSE_AND_HOME': 73, 'PERSONALIZATION': 294, 'BUSINESS': 407, 'AUTO_AND_VEHICLES': 82, 'FOOD_AND_DRINK': 110, 'PHOTOGRAPHY': 261, 'SOCIAL': 236, 'LIBRARIES_AND_DEMO': 83, 'BOOKS_AND_REFERENCE': 190, 'PRODUCTIVITY': 345, 'VIDEO_PLAYERS': 159, 'FINANCE': 328, 'NEWS_AND_MAGAZINES': 248, 'COMICS': 55, 'WEATHER': 71, 'TRAVEL_AND_LOCAL': 207, 'COMMUNICATION': 287}


In [32]:
category_dictionary = {}
for category in freq_table_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            n_installs = installs.replace('+','').replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
            
    avg_installs = round((total / len_category), 2)
    category_dictionary[category] = avg_installs
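
The conversion of the __Installs__ text into a number can be isolated in a small function, which makes it easier to check on its own. The name `parse_installs` is hypothetical; the sample strings follow the format seen in the rows printed earlier.

```python
def parse_installs(installs_text):
    """Convert a string such as '1,000,000+' into the integer 1000000."""
    # Remove the trailing plus sign and the thousands separators before converting.
    return int(installs_text.replace('+', '').replace(',', ''))


for text in ['10,000+', '5,000,000+', '1,000,000,000+']:
    print(text, '->', parse_installs(text))
```

Values such as `'10,000+'` are open-ended buckets, so the parsed number is only a lower bound and the resulting averages are rough estimates.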


In [33]:
display_table_final(category_dictionary)

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3695641.82
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62


## Conclusions

According to our analysis at the very beginning of PART III, the Family, Game and Tools categories contain the largest numbers of apps in Google Play. However, the analysis above shows that they do not have the highest average number of installs: that position belongs to Communication apps, followed by Video Players and Social apps.

Therefore, since our main goal is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play, we recommend that developers focus on building more apps in the __Communication__ and __Social Networking__ categories, since they seem to be among the most popular, despite being few in number compared with some other genres.