# Profitable App Profiles for the App Store and Google Play Markets


Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users. 

As a data analyst, I work for MBOA, a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and we earn only from in-app ads.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first try to see if we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

* [ A data set ][1] containing data about approximately ten thousand Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [ this link ][2].
* [ A data set ][3] containing data about approximately seven thousand iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [ this link ][4].

[1]: https://www.kaggle.com/lava18/google-play-store-apps
[2]: https://app.dataquest.io/jupyter/edit/notebook/googleplaystore.csv
[3]: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps
[4]: https://app.dataquest.io/jupyter/edit/notebook/AppleStore.csv

Let's start by opening the two data sets and continue with exploring the data.

In [74]:
#Open AppleStore.csv and googleplaystore.csv and save both as lists of lists
from csv import reader

open_file_ios = open( "../my_datasets/AppleStore.csv", encoding = "UTF-8" )
open_file_google_play = open( "../my_datasets/googleplaystore.csv", encoding = "UTF-8" )

#Store the values of AppleStore.csv in a variable named ios
readed_file_ios = reader(open_file_ios)
ios = list(readed_file_ios)
ios_header = ios[0]
ios = ios[1:]
#Close the file once all rows have been read
open_file_ios.close()

#Store the values of googleplaystore.csv in a variable named google_play
readed_file_google_play = reader(open_file_google_play )
google_play = list(readed_file_google_play)
google_play_header = google_play[0]
google_play = google_play[1:]
#Close the file once all rows have been read
open_file_google_play.close()
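
The loading steps above are the same for both files, so they could also be wrapped in a small helper. The sketch below is one possible approach, not part of the original notebook: the name `load_csv` is hypothetical, and the `with` block is used so that the file is closed automatically.

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at path (hypothetical helper)."""
    # The with block closes the file even if reading fails
    with open(path, encoding="UTF-8", newline="") as csv_file:
        rows = list(reader(csv_file))
    # The first row is the header, the remaining rows are the data
    return rows[0], rows[1:]
```

With this helper, the iOS data could be loaded with `ios_header, ios = load_csv("../my_datasets/AppleStore.csv")`.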

To make it easier to explore the two data sets, we'll first write a function named **explore_data()** that we can use repeatedly to explore rows in a more readable way. We'll also add a parameter for our function to display the number of rows and columns for any data set.

In [75]:
#Explore a data set using the explore_data() function
def explore_data( dataset, start, end, rows_and_columns = False ) :
    dataset_slice = dataset[start:end]
    for row in dataset_slice :
        print(row)
        #adds a new empty line after each row
        print("\n")
        
    if rows_and_columns :
        print( "Number of rows: ", len( dataset ) )
        print( "Number of columns: ", len(dataset[0]) )


In [76]:
#Print iOS header
print( ios_header )
print( "\n" )
#Explore Apple Store datasets using the explore_data() function
explore_data( ios, start = 0, end = 2, rows_and_columns = True )

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  7197
Number of columns:  16


As shown above, the AppleStore data set has 7197 apps and 16 columns. At a glance, the columns that might be essential for our analysis are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver' and 'prime_genre'.

Let's perform the same exploration on the Google Play data set.

In [77]:
#Print Google Play header
print( google_play_header )
print( "\n" )
#Explore Google Play dataset using the explore_data() function
explore_data( google_play, start = 0, end = 2, rows_and_columns = True )

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


The Google Play data set contains 10841 apps and 13 columns. The main columns that we will keep for the analysis are: 'App', 'Category', 'Reviews', 'Installs', 'Rating', 'Price', 'Content Rating', and 'Genres'. A full description of each column name can be found in the data set [ documentation ][1].

[1]: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home 

## Deleting Wrong Data

So far, we have opened the two data sets and performed a brief exploration of the data. Before beginning our analysis, we must make sure the data is accurate; otherwise, the results of our analysis will be wrong. In simpler terms, we need to:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

Because our company only builds apps that are *free* to download and install, and that are directed toward an *English-speaking* audience, we also need to:

* Remove non-English apps.
* Remove apps that aren't free.

The Google Play data set has a dedicated [ discussion section ][1], and we can see that [ one of the discussions ][2] describes an error for a certain row.

[1]: https://www.kaggle.com/lava18/google-play-store-apps/discussion
[2]: https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015

Let's print that row and compare it against the header and another row that is correct.

In [78]:
#Print the incorrect row in the Google Play data set
print( "Google Play data set. \n" )
#Print a row with an error
print( "A row containing an error: \n" )
print( str( google_play[10472] ), "\n" )
#Print Google Play data set header
print( "The data set header: \n" )
print( str( google_play_header), "\n" )
#Print the first correct row in the data set
print( "A row without an error: \n" )
print( str( google_play[0] ), "\n" )


Google Play data set. 

A row containing an error: 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

The data set header: 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

A row without an error: 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 



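Row 10472 corresponds to the app 'Life Made WI-Fi Touchscreen Photo Frame'. The row has only 12 values while the header has 13: the 'Category' value is missing, so every later value is shifted one column to the left. As a result, the 'Rating' position holds 19, which is clearly an error because the maximum rating for a Google Play app is 5. Therefore, we will delete that row.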
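
A length check is one way to find such rows programmatically, since a row with a missing value has fewer entries than the header. The sketch below only illustrates the idea: the function name `find_malformed_rows` and the sample data are hypothetical and are not taken from the real files.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's length."""
    expected = len(header)
    malformed = []
    for index, row in enumerate(dataset):
        if len(row) != expected:
            malformed.append((index, row))
    return malformed

# Tiny illustrative data set (made up for this example)
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["Good app", "TOOLS", "4.1"],
    ["Shifted app", "19"],  # the category value is missing
]
print(find_malformed_rows(sample_rows, sample_header))
```

Calling `find_malformed_rows(google_play, google_play_header)` before the deletion would be one way to confirm which index to remove, instead of relying only on the discussion thread.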

In [79]:
#Number of rows in the Google Play data set before deleting the row at index 10472
print( "Number of rows in Google Play data set before deleting index row 10472:", len( google_play ) )
#Delete the row at index 10472 (run this cell only once, otherwise another row is deleted)
del google_play[ 10472 ]

#Number of rows in the Google Play data set after deleting the row at index 10472
print( "Number of rows in Google Play data set after deleting index row 10472:", len( google_play ) )

Number of rows in Google Play data set before deleting index row 10472: 10841
Number of rows in Google Play data set after deleting index row 10472: 10840


## Removing Duplicate Entries

### Part One

In the last step, we started the data cleaning process and deleted row 10472, which contained incorrect data, from the Google Play data set. If we explore the Google Play data set long enough, we will notice that some apps have duplicate entries.

For instance, Instagram has four entries:

In [80]:
for app in google_play :
    name = app[0]
    if name == "Instagram" :
        #Print all the values stored in app
        print( app )

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Let's compute the total number of cases where an app occurs more than once in the Google Play data set:

In [81]:
duplicate_apps = []
unique_apps = []

#Go through the data set to find each duplicate entry
for app in google_play :
    name = app[0]
    if name in unique_apps :
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
#Print the number of duplicate apps in the Google Play data set
print( "There are " + str( len(duplicate_apps ) ) + " duplicate apps in the Google Play dataset.\n" )


There are 1181 duplicate apps in the Google Play dataset.



Our computation shows that there are **1181** cases where an app occurs more than once in the Google Play data set.
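
The same count can be obtained with a set instead of a list, which avoids scanning a growing list for every row. This is a minimal sketch of that variant; the function name `count_duplicates` and its `name_index` parameter are hypothetical.

```python
def count_duplicates(dataset, name_index=0):
    """Count rows whose name has already appeared in an earlier row."""
    seen = set()
    duplicates = 0
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates

# Made-up rows: "Instagram" appears three times, so two rows are duplicates
sample = [["Instagram"], ["Box"], ["Instagram"], ["Instagram"]]
print(count_duplicates(sample))
```

Membership tests on a set take roughly constant time, so this version should scale better on larger data sets while counting duplicates the same way.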

Now, we need to delete all duplicate entries in the data set. If you examine the rows we printed for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers suggest that the data was collected at different times.

We are going to use this information to build our criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicate entries randomly, we'll keep only the row with the highest number of reviews and remove the other entries for any given app.

We'll apply this criterion in two steps:

* Create a dictionary where each dictionary key is an app name, and its corresponding dictionary value is the highest number of reviews of that app.
* Use the dictionary to create a new data set, which will have only one entry per app (the entry with the highest number of reviews).

### Part Two

Let's start by building a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [82]:
#Building the Google Play dictionary
reviews_max = {}

#Looping through the Google Play data set
for app in google_play :
    #Assigning the app name to a variable named name
    name = app[ 0 ]
    #Convert the app reviews to float and assigning it to a variable named n_reviews
    n_reviews = float( app[ 3 ] )
    #Update the number of the reviews for that app in the reviews max as long as name 
    # already exists as a key in the reviews_max dictionary and reviews_max[name] < n_reviews
    if ( name in reviews_max ) and ( reviews_max[ name ] < n_reviews ) :
        reviews_max[ name ] = n_reviews
    #If name is not in the reviews_max dictionary as a key, create a new entry
    elif ( name not in reviews_max ) :
        reviews_max[ name ] = n_reviews


        

In a previous code cell, we found that the Google Play data set has 1181 duplicate entries. Therefore, the length of our dictionary should be equal to the difference between the length of the Google Play data set and 1181.

In [83]:
#Print the length of the dictionary in the expectation that its value will be 9659

print( "Expected Length of the dictionary : " + str( len( google_play ) - 1181 ) )
print( "Actual Length of the dictionary : " + str( len( reviews_max ) ) )

Expected Length of the dictionary : 9659
Actual Length of the dictionary : 9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we only keep the entries with the highest number of reviews. In the code cell below:

* We start by initializing two empty lists, google_play_clean and already_added.
* We loop through the Google Play data set, and for every iteration:
    * We isolate the name of the app and the number of reviews.
    * We add the current row (app) to the google_play_clean list, and the app name (name) to the already_added list if:
        * The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary, and
        * The name of the app is not already in the already_added list. We need this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we only check for reviews_max[ name ] == n_reviews, we'll still end up with duplicate entries for some apps.

In [88]:
#Initializing two empty lists
google_play_clean = []
already_added = []

#Looping through the Google Play data set
for app in google_play :
    #isolating the app name and its number of reviews
    name = app[ 0 ]
    n_reviews = float( app[ 3 ] )
    #Add the current app to google_play_clean if its number of reviews is in the reviews_max dictionary and
    #the name of the app is not already in the already_added
    if ( reviews_max[ name ] == n_reviews ) and ( name not in already_added ) :
        google_play_clean.append( app )
        already_added.append( name )
        
#Exploring google_play_clean to ensure everything went as expected.
#The data set should have 9659 rows

print( "Expected Length of The Google play data set clean : " + str( len( reviews_max ) ) )
print( "Actual Length of The Google play data set clean : " + str( len( google_play_clean ) ) )
        

Expected Length of The Google play data set clean : 9659
Actual Length of The Google play data set clean : 9659


So far so good. We have removed all the duplicate entries from the Google Play data set.
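
The two steps used above (build reviews_max, then filter) can also be merged into a single pass that stores the best row seen so far for each app. The sketch below is an alternative under stated assumptions, not the method used in this notebook: the name `remove_duplicates` and its parameters are hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strictly greater, so the earliest row wins when review counts tie
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Made-up rows: name, category, rating, reviews
sample = [
    ["Instagram", "SOCIAL", "4.5", "100"],
    ["Box", "BUSINESS", "4.2", "50"],
    ["Instagram", "SOCIAL", "4.5", "250"],
    ["Box", "BUSINESS", "4.2", "50"],
]
clean = remove_duplicates(sample)
print(len(clean))
```

This keeps the same row per app as the two-step approach, but the rows come out in the order in which each app name first appears, which may differ from the order of google_play_clean.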

## Removing Non-English Apps

### Part One

If we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.
 

In [89]:
print(ios[813][1])
print("\n")
print(google_play_clean[4412][0])

爱奇艺PPS -《欢乐颂2》电视剧热播


中国語 AQリスニング


We are not interested in keeping these apps, so we will remove each app whose name contains a symbol that is not commonly used in English text. The symbols commonly used in English text include the letters of the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;) and other symbols (+, *, /).

All characters that are specific to English text are encoded using the ASCII (American Standard Code for Information Interchange) system. Each ASCII character has a corresponding number in the range 0 to 127. Hence, we can build a function that detects whether a character belongs to the set of common English characters or not.
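
To make the 0 to 127 range concrete, the short sketch below prints the number that Python's built-in `ord()` returns for a few characters and whether that number falls inside the ASCII range. The helper name `is_ascii_character` is hypothetical and only used for illustration.

```python
def is_ascii_character(character):
    """Return True if the character's code number is in the ASCII range 0 to 127."""
    return ord(character) <= 127

# A few ASCII characters followed by two characters outside the ASCII range
for character in ["a", "Z", "5", "+", "™", "爱"]:
    print(character, ord(character), is_ascii_character(character))
```

Letters, digits and common symbols all report True, while ™ and 爱 report False because their code numbers are far above 127.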

We build this function below and use the *ord()* built-in function to fetch the corresponding number of each character in any app name.

In [90]:
#Writing a function that takes a string as parameter and returns False
#if there's any character in the string that doesn't belong to the
#set of common English characters, otherwise it returns True

def is_english( string ) :
    for character in string :
        #fetch the code number of the character using the ord() built-in function
        encode_number = ord( character )
        #return False if encode_number is outside the ASCII range [0,127]
        if ( encode_number > 127 ) :
            return False
        
    return True

Let's try our new function above to check whether some app names are detected as English or non-English.

In [91]:
app_names = ["Instagram", 
              "爱奇艺PPS -《欢乐颂2》电视剧热播", 
              "Docs To Go™ Free Office Suite",
              "Instachat 😜"
            ]
#Looping through the app_names and print whether or not these app names are detected  as English
for name in app_names :
    print( str( name ) + " is detected as English: " + str( is_english( name) ))
    print("\n")

Instagram is detected as English: True


爱奇艺PPS -《欢乐颂2》电视剧热播 is detected as English: False


Docs To Go™ Free Office Suite is detected as English: False


Instachat 😜 is detected as English: False




When we run the function, we find that it cannot correctly identify English app names that contain emojis or characters like ™. This issue arises because those characters fall outside the ASCII range [0, 127].

To solve this issue, we will update the function **is_english()** and minimize data loss by only removing an app if its name has more than three characters with encoded numbers outside the ASCII range. This filter is not perfect, but it should be fairly effective.

In [92]:
#Updating  is_english()
def is_english( string ) :
    #Counter for the number of characters outside the ASCII range
    value = 0
    for character in string :
        #fetch the code number of the character using the ord() built-in function
        encode_number = ord( character )
        #count the character if encode_number > 127
        if ( encode_number > 127 ) :
            value +=1
    
    #More than three non-ASCII characters: the name is treated as non-English
    if ( value > 3) :
        return False
        
    else :
        return True

Let's try our updated function above to check whether some app names are detected as English or non-English.

In [93]:
app_names = ["Instagram", 
              "爱奇艺PPS -《欢乐颂2》电视剧热播", 
              "Docs To Go™ Free Office Suite",
              "Instachat 😜"
            ]
#Looping through the app_names and print whether or not these app names are detected  as English
for name in app_names :
    print( str( name ) + " is detected as English: " + str( is_english( name) ))
    print("\n")

Instagram is detected as English: True


爱奇艺PPS -《欢乐颂2》电视剧热播 is detected as English: False


Docs To Go™ Free Office Suite is detected as English: True


Instachat 😜 is detected as English: True




Below, we use the is_english() function to filter out non-English apps from both data sets.

In [94]:
#Separate the English apps from the others
google_play_english = []
ios_english = []

#Looping through the Google Play data set
for app in google_play_clean :
    #Save the app name in a variable named name
    name = app[0]
    if( is_english( name ) ) :
        google_play_english.append( app )

#Looping through the App Store data set
for app in ios :
    #Save the app name in a variable named name
    name = app[1]
    if( is_english( name ) ) :
        ios_english.append( app )

explore_data(google_play_english, start = 0, end = 2, rows_and_columns = True)
print("\n")
explore_data( ios_english, start = 0, end = 2, rows_and_columns = True )

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  9614
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  6183
Number of columns:  16


After filtering, we are left with **9614** Google Play apps and **6183** iOS apps.

## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, so we need to isolate only the free apps for our analysis. Below, we isolate the free apps for both data sets.

In [95]:
#Isolating the free apps for both Google play and Apple Store
google_play_free = []
ios_free = []

#Looping through Google Play  data set to isolate free english apps
for app in google_play_english :
    price = app[ 7 ]
    if price == "0" :
        google_play_free.append( app )
        
#Looping through AppleStore data set to isolate free english apps
for app in ios_english:
    price = app[ 4 ] 
    if price == "0.0" :
        ios_free.append( app )
        
#Print the length of free google play english apps and free ios english apps
print( "Number of Free Google Play english apps :\n ")
explore_data( google_play_free, start = 0, end= 2, rows_and_columns = True )
print( "\nNumber of Free Applestore english apps :\n ")
explore_data( ios_free, start = 0, end= 2, rows_and_columns = True )

Number of Free Google Play english apps :
 
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  8864
Number of columns:  13

Number of Free Applestore english apps :
 
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  3222
Number of columns:  16


As we can see, the Google Play data set contains **8864** free English apps, while the App Store data set holds **3222** free English apps.
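
The comparisons above rely on the exact strings "0" and "0.0". A slightly more tolerant check converts the price to a number first. The sketch below is only an illustration: the helper name `is_free` is hypothetical, and it assumes that paid prices may carry a leading "$" sign, which is not visible in the rows printed in this notebook.

```python
def is_free(price):
    """Return True if the price string represents zero (hypothetical helper)."""
    # Assumption: paid prices may look like "$4.99", so the sign is stripped first
    cleaned = price.strip().replace("$", "")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # A price that cannot be parsed is treated as not free
        return False

for price in ["0", "0.0", "$4.99", "Everyone"]:
    print(price, is_free(price))
```

With such a helper, the same condition could be used for both data sets, for example `if is_free( app[ 7 ] )` for Google Play and `if is_free( app[ 4 ] )` for the App Store.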

## Most Common Apps by Genre

### Part One

So far, we have explored both data sets, removed inaccurate data and duplicate entries, filtered out non-English apps, and isolated the free apps. Now it is time to begin our analysis.

Remember that we are looking for the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

To minimize risk and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful on both markets.

Let's start the analysis by getting a sense of the most common genres for each market. To do so, we'll build frequency tables for the *prime_genre* column in the App Store data set, and the *Genres* and *Category* columns in the Google Play data set.

Below, we build two functions to analyze the frequency tables:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [96]:
#Creating a function named freq_table() that takes in two inputs
#dataset : a list of lists
#index : an integer

def freq_table( dataset, index ) :
    
    #Creating an empty dictionary
    frequency_table = {}
    total = 0
    for row in dataset :
        total += 1
        value = row[ index ]
        if( value in frequency_table ) :
            frequency_table[ value ] += 1
        else :
            frequency_table[ value ] = 1
    
    frequency_table_percentage = {}
    
    for key in frequency_table :
        percentage = ( frequency_table[key] / total ) *100
        frequency_table_percentage [ key ] = percentage
        
            
    #return the frequency table
    return frequency_table_percentage

#Creating a function to display the percentage in a descending order

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        


Below, we analyze the frequency table of the prime_genre column of the App Store data set.

In [98]:
#Display the frequency  table of the column prime_genre in iOS free app data set
display_table(ios_free, index = 11 )

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


At a glance, Games has the highest percentage (58.16%) in the frequency table. Furthermore, Entertainment and Photo & Video are also common genres, with 7.88% and 4.97% respectively. The genre with the lowest percentage is Catalogs (0.12%), followed by Medical (0.19%) and Navigation (0.19%).

Based on the percentage of each genre in the frequency table, we can assume that most free English iOS apps are designed for entertainment and only a few for practical purposes.

On the other hand, we cannot recommend an app profile for the App Store market based on the frequency table alone, because a genre can contain a large share of the apps and still have a low number of users.

Let's continue by examining the Genres and Category columns of the Google Play data set.



In [99]:
#Display the frequency table of the column Category in The Google Play free app data set
display_table(google_play_free, index = 1 )

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

We can see that the landscape in the Google Play data set differs from the App Store data set. Many of the apps are designed for practical use (Family, Tools, Business). However, if we investigate further, we can observe that the Family category (which accounts for almost 19% of the apps) consists mostly of games for kids.

The frequency table for the Genres column confirms our hypothesis: practical apps are better represented on Google Play than on the App Store.

## Most Popular Apps by Genre on the App Store

In the last step, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, let's figure out the kind of apps with the most users.

To find out, we'll calculate the average number of user ratings per app genre on the App Store. To do that, we need to:

* Isolate the apps of each genre.
* Sum up the number of user ratings for the apps of that genre.
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps).
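
The three steps above can also be sketched as a single pass over the data that accumulates a running total and a count per genre. This is an alternative under stated assumptions, not the implementation used in this notebook; the names `average_by_group` and `print_sorted` are hypothetical.

```python
def average_by_group(dataset, group_index, value_index):
    """Return a dictionary mapping each group to the average of its values."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    # Divide each sum by the number of rows in that group only
    return {group: totals[group] / counts[group] for group in totals}

def print_sorted(averages):
    """Print the groups from the highest average to the lowest."""
    for group, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
        print(group, ":", average)

# Made-up rows: name, genre, number of ratings
sample = [["a", "Games", "10"], ["b", "Games", "30"], ["c", "Weather", "5"]]
print_sorted(average_by_group(sample, group_index=1, value_index=2))
```

For the App Store data, the call would be `average_by_group(ios_free, 11, 5)`, since prime_genre is at index 11 and rating_count_tot at index 5.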

Let's begin the first step by generating a frequency table for the prime_genre column to get the unique app genres.

In [100]:
# Store the frequency table for the prime_genre column in a variable 
# named unique_genres
unique_genres = freq_table( dataset = ios_free, index = 11)

# Looping through the unique genres of the App Store data set
for genre in unique_genres :
    #Initiating a variable named total to store the sum of user ratings (the number of ratings, not the actual ratings )
    total = 0
    #Initiating a variable named len_genre to store the number of apps specific to each genre
    len_genre = 0
    
    #Looping over ios_free
    for app in ios_free :
        genre_app = app[11]
        # Check whether genre_app and genre are equal
        if ( genre_app == genre) :
            # save the number of user ratings of the app as a float
            n_ratings = float( app[ 5 ] )
            total += n_ratings
            len_genre +=1
            
    #Computing the average number of user ratings
    avg_user_ratings = total / len_genre
    
    #Print the app genre and the average number of user ratings
    print(genre, ":", avg_user_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


We can observe that practical apps have a higher average number of ratings than fun apps. Navigation apps get the greatest number of user ratings with an average of 86090, followed by Reference (74942) and Social Networking (71548).

Among the three popular genres on the App Store (Social Networking, Photo & Video, and Games), there are hugely successful apps such as Facebook, YouTube and Pac-Man that capture most of the users' attention. If we decide to compete against them directly, we will very likely fail.

However, we can combine strong features from those apps into an app that improves the customer experience. For example, what about combining features of Google Maps, YouTube and Tinder to help businesses in the informal sector offer their services online?

We could work with handworkers who repair shoes on the street every day, and our app would offer the following features:

* Video adverts of the handworkers performing their jobs.
* Google Maps or Bing Maps to locate where the handworker offers the service.
* Geolocation to send a promotion or a discount to a user who is passing near a handworker.

People interested in our app would pay an extra fee to get their name at the top of the list.

Now, let's continue with the Google Play data set.

## Most Popular Apps by Genre on Google Play

For the Google Play market, we already have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise enough: they are stored as strings with special characters, and most values are open-ended (100+, 1,000+, 5,000+, etc.).

In [108]:
display_table(google_play_free, index = 5 )

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


To remove these characters from the strings, we will use the str.replace(old, new) method. We will leave the install numbers otherwise unchanged, which means that we'll consider an app with 100,000+ installs to have 100,000 installs, and so on.
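
As a quick illustration of that cleaning step, here is a minimal sketch. The helper name `installs_to_float` is hypothetical and not part of the notebook; it only shows how chaining two `str.replace` calls turns an install string into a number.

```python
def installs_to_float(installs):
    """Convert a Google Play install string such as '1,000,000+' to a float."""
    # Remove the trailing "+" and the thousands separators before converting
    cleaned = installs.replace("+", "").replace(",", "")
    return float(cleaned)

# A few sample values taken from the install column shown above
for raw_value in ["100,000+", "5,000+", "0"]:
    print(raw_value, "->", installs_to_float(raw_value))
```

Note that `str.replace` returns a new string rather than modifying the original, so the result has to be assigned or passed on, as done here.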

Now let's calculate the average number of installs per app genre for the Google Play data set.

In [109]:
#Generating a frequency table for the Category column of the Google Play data
unique_genres = freq_table( dataset = google_play_free, index = 1)

#Looping through the unique genres of the Google Play data set
for category in unique_genres :
    total = 0
    len_category = 0
    #Looping through the Google Play data set
    for app in google_play_free :
        #Saving the app category to a variable named category_app
        category_app = app[ 1 ]
        if( category_app == category ) :
            #Saving the number of installs and removing the "+" and "," characters
            n_installs = app[5]
            n_installs = n_installs.replace( "+", "")
            n_installs = n_installs.replace( ",", "")
            #Adding the number of installs to total
            total += float( n_installs )
            len_category += 1
            
    #Computing the average number of installs by dividing total by len_category
    avg_n_installs = total / len_category
    
    #Printing the app category and the average number of installs
    print(category, ": ", avg_n_installs)

ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 

We can observe that on Google Play, Communication apps are the most popular (38,456,119 installs on average), followed by the Video Players category (24,727,872 installs) and Social (23,253,652 installs).

Looking closer, the Communication category is driven by WhatsApp (1,000,000,000+ installs), Skype (1,000,000,000+ installs), and Google Duo (500,000,000+ installs). These apps are usually free of charge and belong to some of the biggest companies in the world (Facebook, Microsoft, and Google). Obviously, we can't compete against them directly.
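
One way to check how much a few giants skew a category average is to recompute it after leaving out the largest apps. The following is a minimal sketch under stated assumptions: `sample_apps` is a small hypothetical list built from a handful of rows of the Communication category, the helpers `to_number` and `average_installs` and the 100,000,000 cutoff are illustrative choices, and the row layout (name at index 0, installs at index 1) is simplified compared with the full data set.

```python
# Hypothetical sample: a few Communication apps as [name, installs] rows
sample_apps = [
    ["WhatsApp Messenger", "1,000,000,000+"],
    ["Google Duo - High Quality Video Calls", "500,000,000+"],
    ["Android Messages", "100,000,000+"],
    ["Messenger for SMS", "10,000,000+"],
    ["My Tele2", "5,000,000+"],
    ["TracFone My Account", "1,000,000+"],
]

def to_number(installs):
    """Strip '+' and ',' from an install string and convert it to a float."""
    return float(installs.replace("+", "").replace(",", ""))

def average_installs(apps, max_installs=None):
    """Average the installs, optionally keeping only apps below max_installs."""
    counts = [to_number(app[1]) for app in apps]
    if max_installs is not None:
        counts = [n for n in counts if n < max_installs]
    if not counts:
        # No app is left after filtering, so there is nothing to average
        return 0.0
    return sum(counts) / len(counts)

print("All sample apps:", average_installs(sample_apps))
print("Under 100 million installs:", average_installs(sample_apps, max_installs=100_000_000))
```

On this small sample the average drops sharply once the largest apps are excluded, which is the effect described above. The same comparison on the full data set would need the original column indexes.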

As a result, it seems that apps belonging to the Communication and Social categories are tremendously popular on both Google Play and the App Store. Therefore, an app of this kind could succeed on both platforms, and our revenue would grow faster.


In [111]:
for app in google_play_free :
    if ( app[ 1 ] == "COMMUNICATION" ) :
        print(app[0], ": ", app[5])

WhatsApp Messenger :  1,000,000,000+
Messenger for SMS :  10,000,000+
My Tele2 :  5,000,000+
imo beta free calls and text :  100,000,000+
Contacts :  50,000,000+
Call Free – Free Call :  5,000,000+
Web Browser & Explorer :  5,000,000+
Browser 4G :  10,000,000+
MegaFon Dashboard :  10,000,000+
ZenUI Dialer & Contacts :  10,000,000+
Cricket Visual Voicemail :  10,000,000+
TracFone My Account :  1,000,000+
Xperia Link™ :  10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard :  10,000,000+
Skype Lite - Free Video Call & Chat :  5,000,000+
My magenta :  1,000,000+
Android Messages :  100,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Seznam.cz :  1,000,000+
Antillean Gold Telegram (original version) :  100,000+
AT&T Visual Voicemail :  10,000,000+
GMX Mail :  10,000,000+
Omlet Chat :  10,000,000+
My Vodacom SA :  5,000,000+
Microsoft Edge :  5,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Calls & Tex

## Conclusion

Throughout this project, we worked with two data sets in order to understand the types of apps that attract users on both Google Play and the App Store. We started by exploring the data sets to become familiar with the data. Then, we performed a data-cleaning process to delete erroneous data and duplicate entries, and we separated the free English apps from the others. After that, we analyzed the free English apps in both the iOS and Google Play data sets.

At the end of our analysis, we recommended an app that combines the benefits of apps belonging to the most popular categories in both markets, including Social Networking, Video Players, and Communication. Our app will offer informal workers an opportunity to break into the internet and increase their revenue. To achieve this goal, we will provide paid adverts, geolocation, and mapping technology. As a consequence, our users will be able to reach out to them directly.
