# Edis Grudic / 2019-05-19

# Looking into profitability of apps on App Store and Google Play markets

The aim of this project is to analyze apps available on the App Store and Google Play to determine what kinds of apps bring in the most profit. We are working as data analysts for a company.

At the company we work for, we only make apps that are free to download and install. Our main source of revenue comes from the in-app ads and therefore the more users we have, the bigger the profit.
The goal is to analyze the data and help our app developers understand what kind of apps are likely to attract more users.

# Opening and exploring data
There are over 4 million apps on the App Store and Google Play combined.
To save time and money, we will use existing data sets.

[Here](https://www.kaggle.com/lava18/google-play-store-apps/home) we see a data set containing approximately 10,000 Android apps taken from Google Play.
[Here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) we see a data set containing approximately 7,000 iOS apps from the App Store.

We will now open up the data sets and then start exploring the data.

In [1]:
from csv import reader

### Google play data set ###
open_file=open('googleplaystore.csv')
read_file=reader(open_file)
android=list(read_file)
android_header=android[0]
android=android[1:]


### App Store data set ###
open_file=open('AppleStore.csv')
read_file=reader(open_file)
IOS=list(read_file)
IOS_header=IOS[0]
IOS=IOS[1:]
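
The loading steps above leave both files open. As a minimal sketch, the same header/rows split can be wrapped in a helper; the name `read_dataset` and the in-memory sample are hypothetical, and the `utf8` encoding mentioned in the comment is an assumption about the files rather than something stated in the data set documentation.

```python
import io
from csv import reader


def read_dataset(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# Small in-memory sample standing in for a real file (hypothetical data).
sample = io.StringIO("App,Price\nDemo App,0\nOther App,$1.99\n")
header, rows = read_dataset(sample)
print(header)
print(len(rows))

# With the real files, a context manager closes the file automatically
# (the encoding is an assumption):
# with open('googleplaystore.csv', encoding='utf8', newline='') as f:
#     android_header, android = read_dataset(f)
```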

To make the data sets easier to analyze, we first define a function, explore_data(), which prints a slice of a data set.
The function can also print the number of rows and columns of the data set if we set the argument rows_and_columns to True. This argument is False by default.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now we will take a closer look at the data set for the android apps.

In [3]:
print(android_header)
print('\n')
explore_data(android,0,3,rows_and_columns=True) #printing the first 3 rows of the android data set using the explore function

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We can see that there are 10841 rows and 13 columns.
Each row is one app, so there are 10841 android apps in this data set.
Each column describes some information regarding the apps.
The columns that could be of interest are: App, Category, Reviews, Installs, Type, Price and Genres. Some other fields could be of interest, but we will focus on these columns for now.

Now we will take a look at the data set for the iOS apps.

In [4]:
print(IOS_header)
explore_data(IOS,0,3,rows_and_columns=True) # printing the first 3 rows of the IOS data set using the explore function

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We see that there are 7197 iOS apps in the App Store data set, with 16 columns.
The columns that could be of interest are: track_name, currency, price, rating_count_tot, cont_rating and prime_genre.
The column names for the iOS apps are a bit more difficult to understand and are better explained in this [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

# Deleting wrong data
Before we can analyze the data we need to clean it up.
At the company we only make apps that are free and in English.
Therefore, we need to clean our data sets and remove apps that are non-English and apps that are not free.

The cleaning involves the following steps:
- Remove or correct wrong data.
- Remove duplicate data, if there is any.
- Modify the data to fit our analysis.

We will begin by detecting and deleting wrong data.

According to the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) for the Google Play data set, one of the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in one row, namely row 10472. Let us print this row together with the header and a row we know to be correct for comparison.

In [5]:
print(android_header)
print('\n')
print(android[1])
print('\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We can see that the rating for the app is 19, although the maximum rating is 5. There are only 12 columns where there should be 13: the Category value is missing, so every later value has shifted one column to the left.
This row will therefore be deleted.
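
Instead of relying on a forum post, a shifted row like this one can also be found by comparing each row's length with the header's length. The following is a minimal sketch; the function name `find_malformed_rows` and the sample rows are hypothetical and only illustrate the check.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],            # one field is missing
    ['Another App', 'GAME', '3.8'],
]
bad_indexes = find_malformed_rows(sample_header, sample_rows)
print(bad_indexes)
```

If several rows had to be deleted, deleting them from the highest index to the lowest would keep the remaining indexes valid.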

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


We can see that the number of rows has decreased by 1, so we know that the deletion took place. Next we will check whether there are any duplicate apps, that is, apps that appear more than once in the data set.

# Removing duplicate entries - Android apps
If we look at the dataset for android apps, or take a look into the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), we see that some apps appear more than once. We can try to look into some of the most popular apps, for instance Instagram and Facebook.

In [7]:
for app in android:
    name=app[0]
    if name == "Instagram" or name=="Facebook":
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2

We see that Facebook appears 2 times, while Instagram appears 4 times. This should raise a suspicion that there might be more duplicate apps in our dataset and we should take a closer look at all of the apps and see if there are more duplicates that appear.

In [8]:
android_duplicates=[]
android_unique=[]
for app in android:
    name=app[0]
    if name in android_unique:
        android_duplicates.append(name)
    else:
        android_unique.append(name) 
        
a=len(android)                  #Number of android apps (the header was already removed)
b=len(android_duplicates)       #Number of duplicate android apps
print('Out of {} android apps, {} are duplicates'.format(a,b))    

Out of 10840 android apps, 1181 are duplicates


We find that there are 1181 duplicates in our dataset for the android apps.
Here are some of the apps that appear more than once:

In [9]:
print(android_duplicates[:10])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
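
The duplicate count can also be sketched with `collections.Counter`, which avoids the repeated list lookups of the loop above. The helper name `count_duplicates` and the sample names are hypothetical; every occurrence of a name beyond the first is counted as a duplicate, which matches the definition used in this section.

```python
from collections import Counter


def count_duplicates(names):
    """Return (number of unique names, number of duplicate occurrences)."""
    counts = Counter(names)
    # Every occurrence beyond the first one counts as a duplicate.
    n_duplicates = sum(count - 1 for count in counts.values())
    return len(counts), n_duplicates


sample_names = ['Box', 'Slack', 'Box', 'Instagram', 'Instagram', 'Instagram']
n_unique, n_duplicates = count_duplicates(sample_names)
print(n_unique, n_duplicates)
```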


We have to delete the duplicates, so the question is which row we want to keep for each app. If we look at the printed rows for Instagram and Facebook, we see something interesting: the only value that seems to differ is the number of reviews the app received. This value probably differs because the data was collected at different times. Keeping the row with the most reviews seems like the best choice, since it should be the most recent.

In [10]:
print('Expected number of apps after removing duplicates is: ',len(android)-len(android_duplicates))

Expected number of apps after removing duplicates is:  9659


In order to remove duplicates, we will create a dictionary where the keys will be the unique app names and the values will be the highest number of reviews for each app. With this information we will be able to create a new dataset without duplicates.

In [11]:
reviews_max={}                    # Creating empty dictionary
for app in android:
    name=app[0]
    n_reviews=float(app[3])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name]=n_reviews
print("The number of android apps that are unique is: ",len(reviews_max))  

The number of android apps that are unique is:  9659


The number of unique android apps is 9659 as expected.
We have successfully created a dictionary that holds all unique apps and their highest review counts. Now we will remove the duplicate rows and keep, for each app, only the row with the highest number of reviews.

To get this clean data set, we:
- Create the empty list **android_clean**
- Create the empty list **already_added**

We iterate through the whole data set and for each iteration we:
- Assign the app name to the variable **name**
- Assign the number of reviews to the variable **n_reviews**
- Add the app row to the clean data set if:
    - the number of reviews in **n_reviews** is the same as the maximum number of reviews for that app
    - the app name has not already been added to the list **already_added**

The reason that we also check whether the app has already been added is to make sure that we do not add the same app twice, which could happen if several rows with the same name share the same maximum number of reviews. This way we make sure that we do not end up with duplicates again.



In [12]:
android_clean=[]     # List that will store our clean data set
already_added=[]     # List that will store app names of the ones we added
for app in android:
    name=app[0]
    n_reviews=float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
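
As an alternative sketch, the two passes (building the dictionary, then filtering) can be combined into one pass that stores the best row per app name. The function name `remove_duplicates` and the shortened sample rows are hypothetical. Note one difference from the approach above: the result is ordered by the first appearance of each name, not by the position of the row that was kept.

```python
def remove_duplicates(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means a later row with an equal count does not replace
        # the earlier one.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows (name, category, rating, reviews) for illustration.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduplicated = remove_duplicates(sample_rows)
for row in deduplicated:
    print(row)
```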

The duplicates have been removed from the data set. We will now check whether we got the expected number of rows.

In [13]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We can see that the number of rows for the clean data set is 9659 as expected.
Now we will do the same for the IOS apps.

# Finding non-English apps
Since we develop free apps for English-speaking users, we have to remove the apps that are not intended for an English-speaking audience.
To find such apps, we will look for app names containing symbols or characters that are not commonly used in English text.
To do this, we can look up the ASCII (American Standard Code for Information Interchange) code for different characters. The characters commonly used in English have an ASCII code in the range 0 to 127.
Luckily for us there is a built-in function, ord(), that takes a single character as its parameter and returns the Unicode code point of that character. For the characters 0 to 127 the code point is the same as the ASCII code.
All we have to do is check the code point of each character to determine whether an app is targeted at an English-speaking audience or not.

To do this we create the function is_english(app_name), where:
- The parameter **app_name** is a string corresponding to the name of the app
- We iterate over each character in the string and check:
    - The code point of the character
    - If the code point of any character is above 127, we return **False**, which indicates that the app is probably intended for a non-English-speaking audience
    - If all the characters have a code point of 127 or below, we return **True**, which indicates that the app is probably intended for an English-speaking audience

In [14]:
def is_english(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
    return True

To test our function we will create a list with different app names and see whether the function works correctly:

In [15]:
some_android_apps=['Instagram',
        '爱奇艺PPS -《欢乐颂2》电视剧热播', 
        'Docs To Go™ Free Office Suite',
        'Instachat 😜']
for app_name in some_android_apps:
    print(is_english(app_name))

True
False
False
False


After looking at the results we can conclude that our function does not work as intended.
We can clearly see that 3 of the 4 app names are written in English, but only 1 was recognized.
This is because some of the app names also contain special characters, such as ( ™ ) and the emoji ( 😜 ), and these characters have code points outside the ASCII range.

In [16]:
print(ord("™"),ord("😜"))

8482 128540


If we use the function as it is to filter out apps we believe to be intended for a non-English-speaking audience, we will lose a lot of data. To make the function more lenient, we will change it so that an app still counts as intended for English-speaking users as long as its name contains fewer than 4 non-ASCII characters.

In [17]:
def is_english(app_name):
    count=0
    for character in app_name:
        if ord(character) > 127:
            count+=1
            if count>3:
                return False
    return True

We will use the function for the list **some_android_apps** again and look at the results.

In [18]:
for app_name in some_android_apps:
    print(is_english(app_name))

True
False
True
True


This time the function returned True for 3 of the 4 apps, which was the expected result.
**NOTE: This function is still not optimal since there could potentially be apps that have more than 3 special characters, however, it is more accurate than the previous function in sorting our data set.**
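
Because the cut-off of three characters is a judgment call, it can help to make it a parameter. The sketch below is a hypothetical variant, named `is_probably_english` to avoid clashing with the function above; the parameter `max_non_ascii` is an assumption introduced here for illustration.

```python
def is_probably_english(app_name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    # Each comparison yields True/False, which sum() counts as 1/0.
    non_ascii = sum(ord(character) > 127 for character in app_name)
    return non_ascii <= max_non_ascii


names = ['Instagram',
         '爱奇艺PPS -《欢乐颂2》电视剧热播',
         'Docs To Go™ Free Office Suite',
         'Instachat 😜']
results = [is_probably_english(name) for name in names]
print(results)
```

Setting `max_non_ascii=0` reproduces the strict behavior of the first version of the function.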


Now we will use our function to remove non-English apps from our 2 data sets (Android apps and iOS apps).

In [19]:
android_english=[]      # Will contain a list of English Android apps
for app in android_clean:
    if is_english(app[0])==True:
        android_english.append(app)
        
IOS_english=[]      # Will contain a list of English IOS apps
for app in IOS:
    if is_english(app[1])==True:
        IOS_english.append(app)

In [20]:
print('ANDROID APPS')
explore_data(android_clean,0,0,True)     #Android apps before removing non-English apps
print('\n')
explore_data(android_english,0,0,True)   #Android apps after removing non-English apps
print('\n')
print('IOS APPS')
explore_data(IOS,0,0,True)              #IOS apps before removing non-English apps
print('\n')
explore_data(IOS_english,0,0,True)      #IOS apps after removing non-English apps

ANDROID APPS
Number of rows: 9659
Number of columns: 13


Number of rows: 9614
Number of columns: 13


IOS APPS
Number of rows: 7197
Number of columns: 16


Number of rows: 6183
Number of columns: 16


The difference in the number of apps before and after removing non-English apps:
- Android: 9659 => 9614, a reduction of 45 apps
- iOS: 7197 => 6183, a reduction of 1014 apps

# Removing non-free apps
As previously mentioned this company only focuses on building apps that are free to download because the main source of revenue comes from the in-app ads. The data set right now consists of free and non-free apps, and we want to split the data sets. To do this we will iterate through all of the apps and for each iteration we will:
- check the price (Index 7) of each app:
    - If the price is 0 , add that app to a list **free_android**
    - If the price is not 0 , add that app to a list **nonfree_android**

In [21]:
free_android=[]
nonfree_android=[]
for app in android_english:
    name=app[0]
    price=app[7]
    if price=='0':
        free_android.append(app)
    else:
        nonfree_android.append(app)
print("Free android apps: {}, Nonfree android apps: {}, total: {}".format(len(free_android),len(nonfree_android),len(android_english)))

Free android apps: 8864, Nonfree android apps: 750, total: 9614


We see that the Android apps are divided into:
- 8864 free apps 
- 750 nonfree apps
- total 9614 apps

Now we will do the same for the IOS apps except here the name is at index 1, the price is at index 4 and the price is written as a string with decimals.

In [22]:
free_IOS=[]
nonfree_IOS=[]
for app in IOS_english:
    name=app[1]
    price=app[4]
    if price=='0.0':
        free_IOS.append(app)
    else:
        nonfree_IOS.append(app)
print("Free IOS apps: {}, Nonfree IOS apps: {}, total: {}".format(len(free_IOS),len(nonfree_IOS),len(IOS_english)))

Free IOS apps: 3222, Nonfree IOS apps: 2961, total: 6183


We see that the iOS apps are divided into:
- 3222 free apps 
- 2961 nonfree apps
- total 6183 apps
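
The two data sets write a zero price differently ('0' and '0.0'), which is why the two loops compare against different strings. As a hedged sketch, a single helper could convert the price to a number first. The name `is_free` is hypothetical, and the handling of a leading dollar sign is an assumption about how paid prices may be written in the data.

```python
def is_free(price_text):
    """Return True if the price string represents a price of zero."""
    # Assumption: paid prices may carry a leading '$', e.g. '$4.99'.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


sample_prices = ['0', '0.0', '$4.99', '2.99']
flags = [is_free(price) for price in sample_prices]
print(flags)
```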

# Creating frequency tables
We have reached the crucial part of this project, but before we begin analyzing the data it is important to know what to look for. In the introduction we explained that we want to make free apps in English, and that the company's main profit comes from in-app ads. The number of ads seen grows with the number of people who use our apps, so our aim is to determine which kinds of apps are most likely to attract users and thereby increase the company's profits.

The overall strategy for an app idea usually consists of three steps:

1. Build an app and release it on Google Play.
2. If the app gets a good response from the intended audience, develop it further.
3. If the app turns out to be profitable after half a year, build an iOS version and release it on the App Store.

An app can be considered successful if it attracts many users on both Google Play and the App Store. We will begin by analyzing the most common genres for each market. This will be done with a **frequency table** for a few columns in each data set. First we will examine the headers for each data set.

In [23]:
print(android_header,'\n')
print(IOS_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The interesting columns in the data set for Android apps:
Category: index 1, Genres: index 9

The interesting columns in the data set for iOS apps:
prime_genre: index 11

To analyze the data sets further we will first create two functions:
- The first function generates a frequency table expressed in percentages; we name it **freq_table()**
- The second function displays the percentages in descending order; we name it **display_table()**

We will begin with the **freq_table()** function, which will:
- Take 2 inputs: **dataset**, which is a list of lists, and **index**, which is an integer
- Iterate over the data set and for each app:
    - Read the value in the column with index **index** and check whether it already exists as a key in the dictionary **frequencies**:
        - If it does not exist, create the key and give it the value 1
        - If it does exist, increment its value by 1

After the counts have been collected, the function divides each count by the number of apps and multiplies by 100 to get a percentage.

In [24]:
def freq_table(dataset,index):
    frequencies={}
    frequency_table={}
    length=len(dataset)
    for app in dataset:
        key=app[index]
        if key not in frequencies:
            frequencies[key] = 1
        else:
            frequencies[key] += 1
    
    for key in frequencies:
        percentage= (frequencies[key]/length)*100
        frequency_table[key] = round(percentage,3)
    return(frequency_table)
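
A more compact variant of the same idea can be sketched with `collections.Counter`, which does the counting step for us. The name `freq_table_counter` and the sample rows are hypothetical; the rounding to 3 decimals mirrors the function above.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: round(count / total * 100, 3) for key, count in counts.items()}


sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
table = freq_table_counter(sample, 1)
print(table)
```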

The function is created and will now be tested on the Android data set: a frequency table will be created for the Category column at index 1.

In [25]:
freq_table(free_android,1)

{'ART_AND_DESIGN': 0.643,
 'AUTO_AND_VEHICLES': 0.925,
 'BEAUTY': 0.598,
 'BOOKS_AND_REFERENCE': 2.144,
 'BUSINESS': 4.592,
 'COMICS': 0.62,
 'COMMUNICATION': 3.238,
 'DATING': 1.861,
 'EDUCATION': 1.162,
 'ENTERTAINMENT': 0.959,
 'EVENTS': 0.711,
 'FAMILY': 18.908,
 'FINANCE': 3.7,
 'FOOD_AND_DRINK': 1.241,
 'GAME': 9.725,
 'HEALTH_AND_FITNESS': 3.08,
 'HOUSE_AND_HOME': 0.824,
 'LIBRARIES_AND_DEMO': 0.936,
 'LIFESTYLE': 3.903,
 'MAPS_AND_NAVIGATION': 1.399,
 'MEDICAL': 3.531,
 'NEWS_AND_MAGAZINES': 2.798,
 'PARENTING': 0.654,
 'PERSONALIZATION': 3.317,
 'PHOTOGRAPHY': 2.944,
 'PRODUCTIVITY': 3.892,
 'SHOPPING': 2.245,
 'SOCIAL': 2.662,
 'SPORTS': 3.396,
 'TOOLS': 8.461,
 'TRAVEL_AND_LOCAL': 2.335,
 'VIDEO_PLAYERS': 1.794,
 'WEATHER': 0.801}

The function seems to be working; now we just need to display the frequency table in a more readable way by sorting it. In the sorting function, named **display_table()**, we:
- Call the **freq_table()** function with the parameters **dataset** and **index**, which creates a frequency table as explained above, and store it in the variable **table**
- Create an empty list named **table_display**
- Iterate over the frequency dictionary and for each iteration we:
    - Store the value and the key as a tuple in the variable **key_val_as_tuple**. The value is stored first so that the tuples can easily be sorted by it later on
    - Add the created tuple to our list **table_display**
- Sort the list of tuples **table_display** in descending order with the **sorted()** function and store the result in the variable **table_sorted**
- Print each entry of **table_sorted** as key : value

In [26]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
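
Swapping key and value into a tuple is one way to sort by value. Another, sketched below, is to pass a `key` function to `sorted()` so that the (key, value) pairs can stay in their natural order. The helper name `sort_table` and the sample table are hypothetical. Ties are handled differently: here entries with equal values keep their original order, whereas the tuple approach above orders them by name in reverse.

```python
def sort_table(table):
    """Return (key, value) pairs sorted by value, highest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample_table = {'TOOLS': 8.461, 'FAMILY': 18.908, 'GAME': 9.725}
ordered = sort_table(sample_table)
for name, percentage in ordered:
    print(name, ':', percentage)
```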

Now that we have created our function, we will use it to display frequency tables for **prime_genre**, **Category** and **Genres**, in that order.

In [27]:
print('Frequency table for prime_genre \n')
display_table(free_IOS,11)

Frequency table for prime_genre 

Games : 58.163
Entertainment : 7.883
Photo & Video : 4.966
Education : 3.662
Social Networking : 3.29
Shopping : 2.607
Utilities : 2.514
Sports : 2.142
Music : 2.048
Health & Fitness : 2.017
Productivity : 1.738
Lifestyle : 1.583
News : 1.335
Travel : 1.241
Finance : 1.117
Weather : 0.869
Food & Drink : 0.807
Reference : 0.559
Business : 0.528
Book : 0.435
Navigation : 0.186
Medical : 0.186
Catalogs : 0.124


## Analyzing data: **prime_genre** column for IOS apps.
- The most common genre is **Games: 58.16 %** followed by **Entertainment: 7.88 %**
- Apps created for entertainment purposes (Games, Entertainment, Photo & Video, Social Networking, Sports, Music) make up **78.5 %**, while apps created for practical purposes (Education, Shopping, Utilities, Productivity, Lifestyle) make up **12.1 %**. This tells us that most free English apps are designed for entertainment purposes.
- Given that there are many more entertainment apps than the rest, one might think that these also are the most popular ones. This is something that could be looked into further.
--------------------------------------------------------------------------------
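
The two grouped shares quoted above can be reproduced by summing the relevant rows of the prime_genre frequency table. The sketch below copies the percentages from the printed table; the grouping into "entertainment" and "practical" genres is the one used in this analysis, and the variable names are hypothetical.

```python
# Percentages copied from the prime_genre frequency table above.
ios_percentages = {
    'Games': 58.163, 'Entertainment': 7.883, 'Photo & Video': 4.966,
    'Social Networking': 3.29, 'Sports': 2.142, 'Music': 2.048,
    'Education': 3.662, 'Shopping': 2.607, 'Utilities': 2.514,
    'Productivity': 1.738, 'Lifestyle': 1.583,
}
entertainment = ['Games', 'Entertainment', 'Photo & Video',
                 'Social Networking', 'Sports', 'Music']
practical = ['Education', 'Shopping', 'Utilities', 'Productivity', 'Lifestyle']

entertainment_share = sum(ios_percentages[genre] for genre in entertainment)
practical_share = sum(ios_percentages[genre] for genre in practical)
print(round(entertainment_share, 1), round(practical_share, 1))
```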

In [28]:
print('Frequency table for Category \n')
display_table(free_android,1)

Frequency table for Category 

FAMILY : 18.908
GAME : 9.725
TOOLS : 8.461
BUSINESS : 4.592
LIFESTYLE : 3.903
PRODUCTIVITY : 3.892
FINANCE : 3.7
MEDICAL : 3.531
SPORTS : 3.396
PERSONALIZATION : 3.317
COMMUNICATION : 3.238
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.944
NEWS_AND_MAGAZINES : 2.798
SOCIAL : 2.662
TRAVEL_AND_LOCAL : 2.335
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.144
DATING : 1.861
VIDEO_PLAYERS : 1.794
MAPS_AND_NAVIGATION : 1.399
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.959
LIBRARIES_AND_DEMO : 0.936
AUTO_AND_VEHICLES : 0.925
HOUSE_AND_HOME : 0.824
WEATHER : 0.801
EVENTS : 0.711
PARENTING : 0.654
ART_AND_DESIGN : 0.643
COMICS : 0.62
BEAUTY : 0.598


In [29]:
print('Frequency table for Genres \n')
display_table(free_android,9)

Frequency table for Genres 

Tools : 8.45
Entertainment : 6.069
Education : 5.347
Business : 4.592
Productivity : 3.892
Lifestyle : 3.892
Finance : 3.7
Medical : 3.531
Sports : 3.463
Personalization : 3.317
Communication : 3.238
Action : 3.102
Health & Fitness : 3.08
Photography : 2.944
News & Magazines : 2.798
Social : 2.662
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.144
Simulation : 2.042
Dating : 1.861
Arcade : 1.85
Video Players & Editors : 1.771
Casual : 1.76
Maps & Navigation : 1.399
Food & Drink : 1.241
Puzzle : 1.128
Racing : 0.993
Role Playing : 0.936
Libraries & Demo : 0.936
Auto & Vehicles : 0.925
Strategy : 0.914
House & Home : 0.824
Weather : 0.801
Events : 0.711
Adventure : 0.677
Comics : 0.609
Beauty : 0.598
Art & Design : 0.598
Parenting : 0.496
Card : 0.451
Casino : 0.429
Trivia : 0.417
Educational;Education : 0.395
Board : 0.384
Educational : 0.372
Education;Education : 0.338
Word : 0.259
Casual;Pretend Play : 0.237
Music : 0.203
Racing;Action & Adv

## Analyzing data: Category column for android apps.
- The frequency tables for the Category and Genres columns are very similar.
- The Android data set is not as dominated by games as the iOS data set.
- In the iOS data set, more than 75 % of the apps were entertainment apps (more than 50 % were games), while entertainment apps (Games, Entertainment, Photo & Video, Social Networking, Sports, Music) account for around 21 % of the Android data set (games make up about 10 of these 21 percentage points).
- The apps for practical purposes (Education, Shopping, Utilities, Productivity, Lifestyle) account for approximately 19.6 %
    - Practical apps are more common among Android apps and less so among iOS apps
    - Games are more common among iOS apps, which could mean that games are the most successful apps.
    
The genres themselves do not reveal the most popular apps, since we still have not looked into the number of users for these genres. It could simply be that some apps are easier to vary/make and therefore we have more of these. 

Below we look into the FAMILY category a bit more and create a frequency table of the genres for family apps.

In [30]:
FAMILY_genre={}
for app in free_android:
    if app[1]=="FAMILY":
        genre=app[9]
        if genre not in FAMILY_genre:
            FAMILY_genre[genre]=1
        else:
            FAMILY_genre[genre]+=1
print(FAMILY_genre)

{'Puzzle;Brain Games': 15, 'Entertainment;Action & Adventure': 3, 'Role Playing;Pretend Play': 4, 'Action;Action & Adventure': 8, 'Educational;Pretend Play': 8, 'Entertainment;Education': 1, 'Trivia;Education': 1, 'Card;Action & Adventure': 1, 'Education;Pretend Play': 4, 'Strategy;Education': 1, 'Video Players & Editors;Music & Video': 1, 'Puzzle': 78, 'Role Playing;Action & Adventure': 3, 'Educational': 33, 'Art & Design;Action & Adventure': 1, 'Educational;Creativity': 3, 'Puzzle;Action & Adventure': 3, 'Education;Education': 24, 'Entertainment;Pretend Play': 2, 'Entertainment;Music & Video': 12, 'Simulation;Pretend Play': 2, 'Casual;Pretend Play': 21, 'Role Playing;Brain Games': 1, 'Strategy;Creativity': 1, 'Casual;Action & Adventure': 12, 'Casual;Music & Video': 1, 'Role Playing': 72, 'Board;Action & Adventure': 2, 'Arcade;Pretend Play': 1, 'Health & Fitness;Action & Adventure': 1, 'Music & Audio;Music & Video': 1, 'Education;Brain Games': 1, 'Education;Music & Video': 3, 'Educati

We see that many apps within the FAMILY category are actually games.

The difference between the category column and genre column is not easy to follow. There are however more genres than there are categories. Therefore, we will instead focus on the genres column as we continue.

In [31]:
print(IOS_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


# Popular apps by genre
We will now look into the popularity of genres.

For the android apps, we can focus on the column **Installs ( index 5)** and get an average number of installs per genre.

For the iOS apps, there is no column that displays the number of installs. We will therefore focus on the column **rating_count_tot (index 5)**, which displays the number of ratings made.

To do this, we will need to :
- Isolate apps for each genre
- Sum up the user ratings for each app genre
- Divide the sum by the number of apps belonging to that genre


In [32]:
genres_IOS = freq_table(free_IOS, 11)
for genre in genres_IOS:
    total=0     # Storing the sum of the user ratings for each genre
    len_genre=0 # Storing the number of apps specific to each genre
    for app in free_IOS:
        genre_app=app[11] # Saving the genre for an app
        if genre_app==genre:
            n_ratings=float(app[5])
            total += n_ratings
            len_genre += 1
            
            
    avg_n_ratings=total/len_genre
    print(genre,':',round(avg_n_ratings))

Music : 57327
Finance : 31468
Travel : 28244
Utilities : 18684
Business : 7491
Sports : 23009
Entertainment : 14030
Games : 22789
Productivity : 21028
Reference : 74942
Shopping : 26920
News : 21248
Catalogs : 4004
Lifestyle : 16486
Education : 7004
Navigation : 86090
Photo & Video : 28442
Weather : 52280
Social Networking : 71548
Health & Fitness : 23298
Book : 39758
Food & Drink : 33334
Medical : 612
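
The averages above are printed in no particular order, which makes them hard to compare. As a minimal sketch, the same three steps could be done in a single pass over the data and the result sorted from highest to lowest. The helper name `average_by_column` and the sample rows are made up for illustration; with the real data one would presumably pass `free_IOS`, `11` and `5`.

```python
def average_by_column(rows, key_index, value_index):
    """Return {key: average of the value column}, computed in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical sample rows: [name, genre, number of ratings]
sample_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]

averages = average_by_column(sample_rows, 1, 2)

# Print the genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(avg))
```

Keeping the totals and counts in dictionaries avoids looping over every app once per genre, which is what the nested loop above does.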


Before drawing conclusions, we might consider narrowing our scope a bit. We know that Facebook and Instagram are widely used, but we do not know exactly how popular they are. It might be that the ratings for these apps (Instagram and Facebook) account for more than half of all ratings. The same might of course be true for the other genres. 
If this is the case, then these genres might not be the best market for our developers, since the hundreds, if not thousands, of other apps in these markets have not gotten much recognition. 
Think about it from a user's perspective: if you use a social app that most of your friends use, there is little need to use other social apps. If you use a navigation app that has worked really well for you so far, you probably would not spend a lot of time trying to find a better one. The same cannot be said about games: a game usually has an end point, it can be finished, and at some point we get tired of playing the same games and want to find new ones. A practical app, by contrast, is mostly used not because it is fun but because we need it to accomplish some task. We should therefore look a bit closer at the data set. 

We will begin by taking a closer look at the genre that had the highest average number of ratings, Navigation.

In [33]:
total=0
for app in free_IOS:
    if app[11]=='Navigation':
        total+=float(app[5])
        print(app[1],':',app[5])
print('\n')   
print('Total number of ratings in this genre is:',total)

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Total number of ratings in this genre is: 516542.0


We see that there are only 6 apps in total in the **Navigation** genre.
We also see that approximately half a million ratings (499 957) belong to the 2 biggest apps, **Waze** and **Google Maps**. 
The total number of ratings is 516 542. 
This means that around 97% of the ratings belong to the two biggest apps.
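
The share held by the biggest apps can be checked with a few lines of code. This is only a sketch: the helper name `top_share` is made up for illustration, and the list holds the six rating counts printed above.

```python
def top_share(counts, k):
    """Return the fraction of the total held by the k largest values."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Rating counts of the six free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

share = top_share(navigation_ratings, 2)
print('Share of the 2 biggest apps:', round(share * 100), '%')
```

The same helper could be applied to any other genre by passing that genre's rating counts. It assumes that the list is not empty and that the total is greater than zero.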

In [34]:
total=0
for app in free_IOS:
    if app[11]=='Social Networking':
        total+=float(app[5])
        print(app[1],':',app[5])
print('\n')   
print('Total number of ratings in this genre is:',total)

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Above, we examined the number of ratings for the apps in **Social Networking**.
As with the genre **Navigation**, a few apps totally dominate the data set here: **Facebook**, **Pinterest**, and **Skype for iPhone** together account for __58%__ of the ratings.

In [35]:
total=0
for app in free_IOS:
    if app[11]=='Reference':
        total+=float(app[5])
        print(app[1],':',app[5])
print('\n')   
print('Total number of ratings in this genre is:',total)

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Total number of ratings in this genre is: 1348958.0


Above, we examined the number of ratings for the apps in **Reference**. The **Bible** app accounts for **73 %** of the total ratings. If we also include the two **Dictionary.com** apps (iPhone and iPad), the share rises to about **92 %**.

As we have seen so far, some genres contain apps that make the genre seem more popular than it actually is. We could remove these big apps and compute the genre averages again, but we will leave this for now. 
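
For reference, here is a minimal sketch of what such a recalculation could look like for a single genre. The helper name `average_below` and the cut-off of 100 000 ratings are assumptions chosen for illustration, not values used elsewhere in this project. The median from the standard library is shown as an alternative that is less sensitive to a few very large apps.

```python
from statistics import median


def average_below(counts, cap):
    """Average of the values strictly below cap, or None if there are none."""
    kept = [c for c in counts if c < cap]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Rating counts of the six free Navigation apps listed earlier
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

mean_all = sum(navigation_ratings) / len(navigation_ratings)
mean_capped = average_below(navigation_ratings, 100000)  # assumed cut-off

print('Mean of all apps:', round(mean_all))
print('Mean below the cut-off:', round(mean_capped))
print('Median:', median(navigation_ratings))
```

With Waze and Google Maps left out, the Navigation average drops from roughly 86 000 to roughly 4 000 ratings per app, which illustrates how strongly the two largest apps drive the original figure.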

One thing to keep in mind is that, even though the rating counts might be misleading because a few apps push them far above the rest, these big apps might also point to a potential market. For instance, if one app heavily dominates the market, we could try to make a similar app that adds a nice feature users might appreciate. One idea might be to recreate a popular **Music** app, but with the option of using parts of songs as your ringtone. 

Another way to stand out would be to focus on a genre that is not dominated by a few apps, because there is a higher chance that our app would get noticed. Making a **game** might for instance not be the best choice, because we would have to compete with the other 2 million games on **App Store**.

We will now look into the Google play data set.

# Popular app genres on Google Play
In this data set we have information about the number of installs per app, so we do not have to rely on the ratings. However, the installs are not precise because they are open-ended, like **10,000+**, **100,000+** and **1,000,000+**.

This can be a bit confusing, because **50,000+** could mean 51,000 or 98,000 installs. 
Luckily, we only want to know which genres attract the most users, so we can still work with this kind of data. We will take the numbers as they are written, so **50,000+** will be treated as **50,000 installs**. 

Each install value is stored as a **string**, and we need it as a **float**. We therefore have to convert the string to a float, after first removing the **+** sign and the comma separators, which would otherwise raise an error during the conversion. 

To remove these characters we will use the string method str.replace(old, new). It takes 2 parameters, **old** and **new**, and returns a copy of the string in which all occurrences of **old** are replaced with **new**. This will be done directly in the loop where we compute the average number of installs per category, see below.
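
As a small illustration of the cleaning step on its own, the two replacements and the conversion could be wrapped in a helper. The name `parse_installs` is made up for this sketch and is not used elsewhere in the notebook.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    cleaned = text.replace(',', '').replace('+', '')
    return float(cleaned)


# A few install strings in the format used by the Google Play data set
for raw in ['10,000+', '1,000,000+', '500']:
    print(raw, '->', parse_installs(raw))
```

Because `str.replace` returns a new string rather than changing the original, the two calls can be chained. A string without commas or a plus sign passes through unchanged before the conversion.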


In [36]:
category_freq=freq_table(free_android,1)
for category in category_freq:
    total=0 # Storing the sum of installs for each category
    len_category=0 # Storing the number of apps specific to each category
    for app in free_android:
        category_app=app[1]
        if category_app==category:
            n_installs=app[5]
            n_installs=n_installs.replace(',','')
            n_installs=n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1
            
    avg_installs=total/len_category
    print(category,':',round(avg_installs))

BUSINESS : 1712290
LIFESTYLE : 1437816
BEAUTY : 513152
TRAVEL_AND_LOCAL : 13984078
COMICS : 817657
PHOTOGRAPHY : 17840110
HOUSE_AND_HOME : 1331541
BOOKS_AND_REFERENCE : 8767812
FINANCE : 1387692
VIDEO_PLAYERS : 24727872
ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647318
SPORTS : 3638640
EDUCATION : 1833495
MAPS_AND_NAVIGATION : 4056942
SHOPPING : 7036877
DATING : 854029
SOCIAL : 23253652
GAME : 15588016
LIBRARIES_AND_DEMO : 638504
TOOLS : 10801391
FOOD_AND_DRINK : 1924898
PERSONALIZATION : 5201483
FAMILY : 3695642
HEALTH_AND_FITNESS : 4188822
MEDICAL : 120551
COMMUNICATION : 38456119
WEATHER : 5074486
EVENTS : 253542
NEWS_AND_MAGAZINES : 9549178
PARENTING : 542604
ENTERTAINMENT : 11640706
PRODUCTIVITY : 16787331


The highest average number of installs occurred for the following categories:
1. COMMUNICATION    : 38 456 119
2. VIDEO_PLAYERS    : 24 727 872
3. SOCIAL           : 23 253 652
4. PHOTOGRAPHY      : 17 840 110
5. PRODUCTIVITY     : 16 787 331
6. GAME             : 15 588 016
7. TRAVEL_AND_LOCAL : 13 984 078
8. ENTERTAINMENT    : 11 640 706
9. TOOLS            : 10 801 391 

In [37]:
length = 0 
installs= 0
for app in free_android:
    if app[1]=="COMMUNICATION":
        length +=1
        n=app[5]
        n=n.replace(',','')
        n=n.replace('+','')
        installs +=float(n)
        print(app[0],':',app[5])
print("There are: ", length," apps in this category and",int(installs), "installs")

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

Above we looked at the apps in the **COMMUNICATION** category, which is the category with the highest average number of installs. As with the App Store data set, the Google Play data is boosted a lot by a few apps. For instance, **WhatsApp Messenger, Messenger, Skype, Google Chrome, Gmail and Hangouts** all have **1,000,000,000+** installs. 

There are around 11 billion installs in this category. These 6 apps alone account for at least 6 billion of them, which is more than half of the total.

We will now create a function that computes the average number of installs for apps with fewer than 100,000,000 installs. This might give us better insight into the data set.

In [43]:
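def removing_100mil(category):
    """Print the average installs of a category, ignoring apps with 100,000,000+ installs."""
    length = 0    # number of apps below the threshold
    installs = 0  # sum of installs for those apps
    for app in free_android:
        if app[1] == category:  # index 1 holds the category
            n = app[5]          # index 5 holds the installs string
            n = n.replace(',', '')
            n = n.replace('+', '')
            n = float(n)
            if n < 100000000:
                length += 1
                installs += n
    # Guard against a category where no app is below the threshold
    avg = round(installs / length) if length else 0
    print('\n')
    print(category, "| average:", avg, "| apps:", length)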

The function has been created. It takes a **category** as a string and prints the **average installs** for that category together with the number of apps that were counted. Since we have a category dictionary, we can loop over it to compute the average for all of the categories. 
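
A variant that returns its result instead of printing it makes it possible to sort the categories afterwards. The sketch below is an assumption about how that could look, not the function used in this notebook. The name `capped_average`, the `cap` parameter and the sample rows are made up, and the column positions (category at index 1, installs at index 5) follow the Google Play rows used above.

```python
def capped_average(apps, category, cap=100000000):
    """Return (average installs, number of apps) for apps below cap."""
    values = []
    for app in apps:
        if app[1] == category:
            n = float(app[5].replace(',', '').replace('+', ''))
            if n < cap:
                values.append(n)
    if not values:
        return None, 0  # no app of this category is below the cap
    return sum(values) / len(values), len(values)


# Hypothetical rows with only the category and installs columns filled in
sample_apps = [
    ['Chat A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Chat B', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['Chat C', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Cam A', 'PHOTOGRAPHY', '', '', '', '10,000,000+'],
]

results = {}
for category in ['COMMUNICATION', 'PHOTOGRAPHY']:
    results[category] = capped_average(sample_apps, category)

# Rank the categories from highest to lowest capped average
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)
for category, (avg, count) in ranked:
    print(category, '| average:', round(avg), '| apps:', count)
```

The sorting step assumes every category has at least one app below the cap. A category that returns `None` as its average would have to be filtered out first.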

In [44]:
for category in category_freq:
    removing_100mil(category)



BUSINESS | average: 1226919 | apps: 405


LIFESTYLE | average: 1152129 | apps: 345


BEAUTY | average: 513152 | apps: 53


TRAVEL_AND_LOCAL | average: 2944080 | apps: 202


COMICS | average: 817657 | apps: 55


PHOTOGRAPHY | average: 7670532 | apps: 242


HOUSE_AND_HOME | average: 1331541 | apps: 73


BOOKS_AND_REFERENCE | average: 1437212 | apps: 185


FINANCE | average: 1086126 | apps: 327


VIDEO_PLAYERS | average: 5544878 | apps: 150


ART_AND_DESIGN | average: 1986335 | apps: 57


AUTO_AND_VEHICLES | average: 647318 | apps: 82


SPORTS | average: 2994083 | apps: 299


EDUCATION | average: 1833495 | apps: 103


MAPS_AND_NAVIGATION | average: 2484105 | apps: 122


SHOPPING | average: 4640921 | apps: 194


DATING | average: 854029 | apps: 165


SOCIAL | average: 3084583 | apps: 223


GAME | average: 6272565 | apps: 803


LIBRARIES_AND_DEMO | average: 638504 | apps: 83


TOOLS | average: 3191461 | apps: 721


FOOD_AND_DRINK | average: 1924898 | apps: 110


PERSONALIZATION | average:

The code above gives a different ranking of successful categories. These are the 8 most successful categories (an **X** marks the categories that we rule out below):

- PHOTOGRAPHY   | average: 7 670 532 | apps: 242    
- GAME          | average: 6 272 565 | apps: 803    X
- ENTERTAINMENT | average: 6 118 250 | apps: 80     
- VIDEO_PLAYERS | average: 5 544 878 | apps: 150    
- WEATHER       | average: 5 074 486 | apps: 71     X
- SHOPPING      | average: 4 640 921 | apps: 194    
- COMMUNICATION | average: 3 603 485 | apps: 260    
- TOOLS         | average: 3 191 461 | apps: 721    X

If we analyze this data, we could say that we probably would not focus on **weather** apps, because this requires outside expertise and we are not prepared to pay for that.

When it comes to **Game** and **Tools**, these categories are already pretty flooded compared to the rest, so it would be harder to stand out on the app store. Also, we already know that games are really common at the App Store and we want to stand out on both Google Play and App Store.

Among the categories that are left, the top ones are **Photography** and **Entertainment**.

We earlier created a frequency table for genres and will therefore highlight the genres that can be compared to the three Google Play categories of interest (Photography, Entertainment and Video Players). We find 13 genres that compare well with these three categories. The number after each genre is its frequency in that table.

- Entertainment : 6.069
- PHOTOGRAPHY : 2.944
- Video Players & Editors : 1.771
- Entertainment;Music & Video : 0.169
- Parenting;Music & Video : 0.068
- Education;Music & Video : 0.034
- Video Players & Editors;Music & Video : 0.023
- Music;Music & Video : 0.023
- Entertainment;Pretend Play : 0.023
- Video Players & Editors;Creativity : 0.011
- Music & Audio;Music & Video : 0.011
- Entertainment;Education : 0.011
- Casual;Music & Video : 0.011

Among the highlighted genres, the most common one is **Entertainment**. A good idea could be to aim for this category/genre. If we want to succeed, we should make apps that stand out from the rest of the entertainment apps. We see in the genre list above that **Entertainment;Education** looks like a relatively new area, since it is among the least common of the genres presented here.

The runner-up category is **Photography**, which also seems pretty popular in the App Store. Just as with **Entertainment**, the **Photography** apps will need to stand out, even more so than the entertainment ones because there are more of them. A safe bet for now could therefore be to focus only on entertainment apps.

Mixing **entertainment** with **Photography** could be a way to create some interesting apps.