# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)
Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources with collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, these are two data sets that seem suitable for our purpose:

* A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Let's start by opening the two data sets and then continue with exploring the data.

In [7]:
file1 = open('AppleStore.csv')
file2 = open('googleplaystore.csv')
from csv import reader
read_file1 = reader(file1)
read_file2 = reader(file2)
ios = list(read_file1)
ios_header=ios[0]
ios=ios[1:]
android = list(read_file2)
android_header=android[0]
android=android[1:]

To make it easier to explore the two data sets, we'll first write a function named `explore_data()` that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

In [10]:
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of Rows: ' , len(dataset))
        print('Number of Columns: ' ,len(dataset[0]))

In [48]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of Rows:  10841
Number of Columns:  13


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

Now let's take a look at the App Store data set.

In [11]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of Rows:  7197
Number of Columns:  16


We have 7197 iOS apps in this data set, and the columns that seem interesting are: `'track_name'`, `'currency'`, `'price'`, `rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Deleting Wrong Data
The Google Play data set has a dedicated [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and we can see that one of the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for `row 10472`. Let's print this row and compare it against the header and another row that is correct.

In [8]:
print(android[10472]) 
print('\n')
print(android_header) 
print('\n')
print(android[0])    

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and we can see that the rating is 19. This is clearly off because the maximum rating for a Google Play app is 5 (as mentioned in the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) section, this problem is caused by a missing value in the `'Category'` column). As a consequence, we'll delete this row.

In [13]:
print(len(android))
del android[10472]  
print(len(android))

10841
10840


## Removing Duplicate Entries: Part One
After reading the discussion we may find out that there are duplicate entries of many apps, For example `Intagram'` has 4 duplicate entries which we can see below

In [12]:
for app in android:
    if app[0]=='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can find out which apps have duplicates and even create a new list that has only unique apps in it.

In [14]:
duplicate_apps=[]
unique_apps=[]

for app in android:
    if app[0] in unique_apps:
        duplicate_apps.append(app[0])
    else:
        unique_apps.append(app[0])


With this code we now have `duplicate_apps` and ` unique_apps` list, which makes our data cleaning process a lot easier.

if we take a look at the duplicates of `Instagram`, We can see that the number of reviews are different for each 4, We can take it as a criteria of using the app which has the highest number of reviews, because it'll naturally be the one that is latest adn can provide us with more accurate data. First, Let's count the number of cases of duplication, which we can do easily by counting the lenth of `duplicate_apps` lsit. 


In [15]:
print('Number of cases where duplication occured:',len(duplicate_apps))

Number of cases where duplication occured: 1181


Now, Like we mentioned before, We'll use reviews as a basis while choosing an app out of the duplicates, To do that:
* we'll have to create a dictionary which has name of the duplicate app as the key and the numbe rof reviews as its value.
* Using that dictionary we'll create a new dataset that only uses the duplicate with maximum number of reviews.

## Removing Duplicate Entries: Part Two
Now We'll create the dictionary.

In [16]:
max_reviews={}

for app in android:
    n_reviews=float(app[3])
    if app[0] in max_reviews and max_reviews[app[0]]<=n_reviews:
        max_reviews[app[0]]=n_reviews
    elif app[0] not in max_reviews:
        max_reviews[app[0]]=n_reviews
print('Actual:',len(max_reviews))
print('Expected:',len(android)-len(duplicate_apps))

Actual: 9659
Expected: 9659


Testing out this dictionary by removing the number of cases of duplication from orignal dataset gives us equal value to the length of the dictionary `'max_reviews'`, that means we are on the right track!

Now We'll Use this dictionary to make a dataset that has been cleaned of duplicates. For this, We'll first make two empty lists: `'android_clean'` and `'already_added'`. 
* `'android_clean'` will be our new dataset that has no duplicates.
*  The reason we need this list `'already_added'` is because even when we check the condition of maximum reviews with the dictionary, There are still some apps that have same number of reviews for their duplicates. Our program will end up including them in our new dataset if we dont use the list `'already_added'` in our `if` conditions. 
* Now we'll iterate through the android dataset and add the apps with highest reviews out of the duplicates into the list `'android_clean'`, if it is not in the list `already_added`

In [17]:
android_clean=[]
already_added=[]
for app in android:
    n_reviews=float(app[3])
    if max_reviews[app[0]]==n_reviews and app[0] not in already_added:
        android_clean.append(app)
        already_added.append(app[0])
        
explore_data(android_clean,0,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of Rows:  9659
Number of Columns:  13


We can confirm the new dataset `android_cleaned` has been cleaned of duplicates by checking the number of entries in the dataset.

## Removing Non-English Apps: Part One
Our company is in a country where English is the first language and is most commonly used so We only want to focus on apps that are directed towards Englsih speaking audience. However, you might have noticed that there are some apps that have a non-English name and are not directed towards English speaking audience, In this part We'll remove them. 

Using the built in `'ord()'` function, We can find out the ASCII number of a character, We knw that characters commonly used in English Language are between 0-127 range in the ASCII system. We can use tis to our advantage and remove the apps that have a non-English name. To do this we can iterate through the name of the app(string) and check if it has any characer that is not in the ASCII range of 0-127.

In [18]:
def english_or_not(name):
    
    for letter in name:
        if ord(letter)>127:
            return False
    return True





Above, We have created a function named `'english_or_not'` that can take in a string and tell us if it has any character outside of ASCII 0-127 range. Let's try it out in some examples:

In [19]:
english_or_not('Instagram')

True

In [20]:
english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [21]:
english_or_not('Docs To Go™ Free Office Suite')

False

In [22]:
english_or_not('Instachat 😜')

False

So, in the first two examples, We see that our function is working fine and telling us which name is english and which one is not. However, When we go to the third example and the fourth exmaple, We see that our function gave us a false statement even when the name was in english, That is because `emojis` and the `trademark` symbol does not come under ASCII 0-127 range. But this is not good for us because we are loosing alot of valuable data which can affect our analysis.

The solution to that problem is, We can set a limit to the number of `out-of-range` characters in a name. For example: We will set the limit to 3, if there are more than 3 characters that are not in the range of 0-127, we will label that app as non-english. Yes, We might still loose some data this way, but this is still a pretty effective approach.

## Removing Non-English Apps: Part Two

In [24]:
def english_or_not(name):
    times=0
    for letter in name:
        if ord(letter)>127:
            times+=1
    if times>3:
        return False
    else:
        return True

Now let's do the examples again:

In [25]:
english_or_not('Instagram')

True

In [26]:
english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [27]:
english_or_not('Docs To Go™ Free Office Suite')

True

In [28]:
english_or_not('Instachat 😜')

True

This way, we can have a lot more data for our analysis, Now We'll iterate through android_cleaned and ios datasets and make seperate lists that does not have any non-English names. 

In [29]:
android_dataset=[]
ios_dataset=[]
for app in android_clean:
    if english_or_not(app[0]):
        android_dataset.append(app)
for app in ios:
    if english_or_not(app[1]):
        ios_dataset.append(app)
explore_data(android_dataset,0,3,True)
print('\n')
explore_data(ios_dataset,0,3,True)
        
        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of Rows:  9614
Number of Columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

We can see that we're left with 9614 Android apps and 6183 iOS apps.

##  Isolating the Free Apps
As we have already mentioned our company only builds free apps, adn our main source of revenue is in-game ads, So we have to remove all the paid apps from our datasets because they are not useful for our analysis.

In [50]:
free_android_dataset=[]
free_ios_dataset=[]
for app in android_dataset:
    if app[7] =='0':
        free_android_dataset.append(app)
for app in ios_dataset:
    if float(app[4].strip('$')) ==0:
        free_ios_dataset.append(app)
explore_data(free_android_dataset,0,3,True)
explore_data(free_ios_dataset,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of Rows:  8864
Number of Columns:  13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We can see that we're left with 8864 Android apps and 3222 iOS apps. This much is enough for our analysis.

## Most Common Apps by Genre: Part One
Now that we are done with the cleaning part, We'll get to the analysis part. Our company's main goal to generate revenue by attracting more customers, To achieve this, We are going to use the following validation Strategy:
* Make a minimal android version of the app, and add it to GooglePlay.
* If the app is attracting enough users, We develop it further.
* After 6 months, if the app is profitable, we build an IOS version of the app and add it to App Store.
---
With the information provided above, We can understand that our end goal is to make and publish an app on both app store and google play store, SO, We need to analyse data from both markets, and see which ones are doing goood in both.

We will start by analyzing the most common genres for each market, To do this we will make a frequency table for both datasets using the column `prime_genre` for app store dataset and ` Category` and `genre` for android datset.

## Most Common Apps by Genre: Part Two

In [78]:
def freq_table(dataset,index):
    dict_={}
    percent_dict={}
    for app in dataset:
        if app[index] in dict_:
            dict_[app[index]]+=1
        else:
            dict_[app[index]]=1
    for value in dict_:
        percentage_value= (dict_[value]/len(dataset))*100
        percent_dict[value]=percentage_value
        
    return percent_dict

freq_table(free_android_dataset,9)

# now we will write display_table function that takes in our dataset
# and sorts it in ascending or descending order
def display_table(dataset, index):
    dict_table=freq_table(dataset,index)
    table=[]
    a=0
    for key in dict_table:
        key_val_as_tuple=(dict_table[key],key)
        table.append(key_val_as_tuple)
    sorted_table=sorted(table,reverse=True)
    for entry in sorted_table:
        print(entry[1],':',entry[0])

So We made two functions above, `freq_table` and `display_table`, We have already talked about why we need the `freq_table`, Now, we will tlak about `display` table. We have to make sure that the frequency table we show based on our genre and category are sorted in a descending order, Otherwise it is really hard to understand which apps are msot common out of thousands in our dataset. So we will use the built in function `Sorted()` for that. However, in the case of dictionaries, when we use this function, It will use the keys of dictionary to sort, which is not right for us, So the `display_table`, uses the freq table we create and turns it into a tuple and then uses that to print a table in sorted order.

## Most Common Apps by Genre: Part Three

In [88]:
print('"prime_genres" Table for IOS dataset(percentage wise) \n')
display_table(free_ios_dataset,11)

"prime_genres" Table for IOS dataset(percentage wise) 

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Now Let's take a look at the data analysis of App Store, Here we can see that App Store is dominated by mostly `Games` that account to almost `58%` of the apps and the top 5 are also related to entertainment, photos-videos. We can tell by this that App-Store mostly has apps that are for `Fun`, as we do not see a lot of practical apps. However, That does not mean that apps having fun factor have the most users. Let's take a look at the google play store now 

In [89]:
print('"Category" Table for Android dataset(percentage wise) \n')
display_table(free_android_dataset,1)

"Category" Table for Android dataset(percentage wise) 

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.710740072

The data in google play store is alot different than App Stores, Although there is a good chunk of apps that are for Entertainment, They dont really dominate all over. The apps that are for practical uses such as education and communication also seem to have good collective percentage. The first result that is named `Family` is basically games for kids. We can check that if we search for `Family` category on Google Play. Let's Take a look at the `'Genres'` in google play data set now.

In [96]:
print('"Genres" Table for Android dataset(Percentage wise) \n')
##display_table(free_android_dataset,9)
print(ios_header)

"Genres" Table for Android dataset(Percentage wise) 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


This dataset confirms what we talked about earlier, Although entertainment has a big chunk, We can also see that it has more of the other categories too. This shows the main difference between IOS and android datasets. Ios has more fun-type material while android is more of an equal field between productive and entertainment apps. Since, We are looking at the bigger pciture in our analysis, We will only consider the `category` colum in google paly and leave `genre` behind.

##  Most Popular Apps by Genre on the App Store
We explore genres, now we want to find out which genres have the most users in both the datasets. We can find it easily in GooglePlay dataset because it has `Installs` column so we will take the average `Installs` per genre, But in appstore dataset, we'll have to figure that out using the number of reviews associated with each app of a particular genre, and then get an average number of reviews per genre, That will give us an overall evauluation of number of people using apps of each genre. We'll start with the Appstore dataset.

In [102]:
freq_reviews=freq_table(free_ios_dataset,11)

for genre in freq_reviews:
    reviews=0
    no_of_apps=0
    for app in free_ios_dataset:
        if app[11]==genre:
            reviews+=float(app[5])
            freq_reviews[genre]=reviews
            no_of_apps+=1
    freq_reviews[genre]= freq_reviews[genre]/no_of_apps
    print(genre,':',freq_reviews[genre])          

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


In the frequency table above, We can see that most number of review on average are from `genres`: `Social Networking` , `reference` , and `Navigation`. However, We have to think about something here now. Since we are talking about number of reviews here in these 3 genres, They are heavily influenced by some really popular apps, for example the highest count is in navigation and most of them are for `Waze` and `Google maps`. It is not feasible for us to make such apps, so we have to take that into consideration. Similarly, In `Reference` we have the `bible` app and `dictionary.com` that are influencing the reviews heavily. Personally, i think AppStore is saturated with For-Fun apps that one more of those might not be the best option, What we can do is, We can make an app that can fit into both `Entertainment, Education, Music and Reference` Categories. It means we'll benefit from all these categories.

A lot of people love to listen to music while they read books, We can make an app that is an e-reader, but based on the context of the book and it's genre, provides a list of songs that go well with the book, We can also include an in-built dictionary in the app so people dont have to leave the app to look for the word.

## Most Popular Apps by Genre on Google Play
For google play dataset, We'll use `Installs` to figure out the average installs for each genre. in our dataset, we have installs in figures like `'100,000+'` , Since we are looking at the bigger picture, We dont need the exact numbers, So we'll take `'100,000+'` as `100000`, but we need to convert the strings to float to calcualte averages, so we will have to use some special string fucntion, `str.replace()` which removes,adds or excahnges our desidered characters from the string depending on how we use it. Let's sort out our google play data now.

In [111]:
freq_installs=freq_table(free_android_dataset,1)

for category in freq_installs:
    no_of_installs=0
    no_of_apps=0
    for app in free_android_dataset:
        installs=app[5].replace(',','')
        installs=installs.replace('+','')
        if app[1]==category:
            no_of_installs+=float(installs)
            freq_installs[category]=no_of_installs
            no_of_apps+=1
    freq_installs[category]=no_of_installs/no_of_apps
    print(category,':',freq_installs[category])
            

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

In the data above, We see some similarities between google play dataset and app store data set, In this dataset categories like `Maps_and_navigation` and `Communication` and `books_and reference` have been heavily skewed by some really popular apps that take over most of the downlaods, It is really hard to come up with an app that can compete with these apps and make people want to change over. So we want to look into a market that still has potential. Also, We want something that can work for both google play and App Store, because in the long run we want an app successful in both markets. How about we use our idea of App Store here and see if that can be successful here.

In [113]:

for app in free_android_dataset:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

Above, We can see that e-readers and dictionaries are popular in googleplay store, SO we can tell there is still market for these. However, if we make a simple app that is similar to thousands of apps that are already in the market, We might not achieve anything with that. So our idea of doing something different by making an app that let's us read books and recommends songs based on the book,we can also add an in built player with a saved playlist for some really popular books, and let users add their own choices onto that list. Since We want to keep our app free, buiyng the license for the songs can be expensive and out of budget, so we can give them an option to connect our app to their Spotify, like Discord. We can also include some options like discussion forums and quizzes on books to make it more interactive.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that making an app that is an advanced verion of e-reader, with music recommendations of each book, and options to connect it to spotify for better user experience can be a good idea. We can include Discussion Forums and quizes on the books to make the app more interactive.