# Analysis of Successful Apps for Google Play and App Store

Since the inception of the smart phone, apps have become common place for users. Apps of all types exist from games to utilities such as alarm clocks and even many banking and retail apps are used daily. Understanding what makes an app attractive to users is key in order for companies to turn a profit from development of an app.

The goal of this project is to understand types of apps that are likely to be attractive to users. These findings will assist developers to create more profitable apps.

---

## Data Import

Attaining data on the millions of apps available on each marketplace is not feasible. Therefore we will use a sample set from each marketplace. Below is basic information about each dataset:

**Google Play dataset:**
Data for about 10,000 Android apps collected in August 2018.

**App Store dataset:**
Data for about 7,000 iOS apps collected July 2017

The data is available in CSV format. We can import the datasets using the functions from the csv library:

In [1]:
from csv import reader

# Google Play Data #
open_file = open('googleplaystore.csv', encoding = 'utf8')
read_file = reader(open_file)
android = list(read_file)

# App Store Data #
open_file = open('AppleStore.csv', encoding = 'utf8')
read_file = reader(open_file)
appStore = list(read_file)

To get a better understanding of the data, we'll create and use a function called `explore_data()`. This function will print a specified number of rows from a dataset. It will also show the total number of rows and columns if the `rows_and_columns` parameter is set to `True`.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now let's explore the first few lines of data. We'll also print the column headers to help identify any fields that will be useful to our analysis.

In [3]:
print(android[0],'\n')
explore_data(android, 1, 5, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [4]:
print(appStore[0],'\n')
explore_data(appStore, 1, 5, rows_and_columns=True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7198
Number of columns: 17


Some of the column names may not be helpful in fully understanding their content. For additional information on the datasets, refer to the source documentation available below:

[Google Play documentation](https://www.kaggle.com/lava18/google-play-store-apps/home)

[App Store documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

---

## Data Cleaning

Before beginning our analysis, we will need clean data to work with. There are a couple main steps we need to take to ensure our data is clean:

1. Detect and correct/remove inaccurate data
2. Detect and remove duplicate data

Additionally, our company only builds apps that are free to download and are targeted primarily at English speaking users. Therefore we'll need to add the following steps to our data cleaning process:

3. Remove non-English language apps
4. Remove paid apps (charge a fee to download / install)

### Detect and Correct / Remove Inaccurate Data

Starting with step 1 on the Google Play data, we immediately know that there is an error with one particular row. We owe our gratitude to previous users of this dataset who identified and discussed this issue. The original discussion on kaggle.com can be found [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).

The discussion specifies that row 10472, an app called "Life Made Wi-Fi Touchscreen Photo Frame" is the culprit. Let's check the row. Since data structure may vary based on how the data was imported, let's print the rows around it as well.

In [5]:
android[10471:10474]

[['Jazz Wi-Fi',
  'COMMUNICATION',
  '3.4',
  '49',
  '4.0M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Communication',
  'February 10, 2017',
  '0.1',
  '2.3 and up'],
 ['Xposed Wi-Fi-Pwd',
  'PERSONALIZATION',
  '3.5',
  '1042',
  '404k',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Personalization',
  'August 5, 2014',
  '3.0.0',
  '4.0.3 and up'],
 ['Life Made WI-Fi Touchscreen Photo Frame',
  '1.9',
  '19',
  '3.0M',
  '1,000+',
  'Free',
  '0',
  'Everyone',
  '',
  'February 11, 2018',
  '1.0.19',
  '4.0 and up']]

In our case, the app in question is row 10473. Comparing the "Life Made" entry with the others, it's clear that the app category is missing. Let's remove this row to be safe, using the `del` statement.

In [6]:
del android[10473]

# Check Removal #
android[10471:10474]

[['Jazz Wi-Fi',
  'COMMUNICATION',
  '3.4',
  '49',
  '4.0M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Communication',
  'February 10, 2017',
  '0.1',
  '2.3 and up'],
 ['Xposed Wi-Fi-Pwd',
  'PERSONALIZATION',
  '3.5',
  '1042',
  '404k',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Personalization',
  'August 5, 2014',
  '3.0.0',
  '4.0.3 and up'],
 ['osmino Wi-Fi: free WiFi',
  'TOOLS',
  '4.2',
  '134203',
  '4.1M',
  '10,000,000+',
  'Free',
  '0',
  'Everyone',
  'Tools',
  'August 7, 2018',
  '6.06.14',
  '4.4 and up']]

The above output confirms the error row was deleted. But, what if there are other rows with the same  type of error? One way to check is to compare the number of data points for each app to the number of expected columns, using the header names.

This is accomplished with a `for` loop that checks the length of an app's data row against the headers. If an app has more or less data points than the headers, it is removed from the dataset. The 'enumerate' function gives us the row number so we know which app to remove.

*Note: This only removes entries based on how much data is present vs. expected. There may be other data content errors that are not addressed in this step.*

In [7]:
headers = android[0]
removedApps = 0
for row, app in enumerate(android[1:]):
    if len(app) != len(headers):
        del android[i]
        removedApps += 1

print(removedApps,'apps were removed from the Google Play dataset.')

headers = appStore[0]
removedApps = 0
for row, app in enumerate(appStore[1:]):
    if len(app) != len(headers):
        del appStore[i]
        removedApps += 1

print(removedApps,'apps were removed from the App Store dataset.')

0 apps were removed from the Google Play dataset.
0 apps were removed from the App Store dataset.


### Detect and Remove Duplicate Entries

#### Part 1: How many duplicates are there?

Now that the data format is what we are expecting for each app, let's check if there are any 
duplicate entries.

Based on the dataset headers, we can use app name as a unique value to check. The App Store data does have an `Id` field, but we do not know for sure if that is a truly unique field. For example, do newer versions of the same app get a new Id value? We do know though that once cleaned, there should be only one "Facebook" or one "Instagram" entry.

Before we begin, we also need to determine how we identify which duplicate entry to keep. The Google Play data does have a `Last Updated` field. But what if an app had more than one update in the same day? Additionally, the App Store dataset has no date field.

What we do have is a field for number of reviews called `Reviews` for Google Play and `rating_count_tot` for the App Store. It is safe to assume which is the most recent entry for an app based on the entry that has the most reviews out of all the duplicates. 

First, let's figure out just how many duplicate apps are in each dataset. We can do this by looping through the dataset and appending app names to a list. In the loop, we'll also check to see if an app is already on our list. If it is, we'll add it to a second list to identify it as a duplicate entry. Once the lists are populated, we can figure out how many duplicates there are by examining the length of the lists.

In [8]:
unique_android = []
duplicate_android = []

for app in android[1:]:
   name = app[0]
   if name in unique_android:
      duplicate_android.append(name)
   else:
      unique_android.append(name)

print('There are',len(duplicate_android),'duplicate Android app entries')
print('Examples:', duplicate_android[:6])
        
unique_appstore = []
duplicate_appstore = []

for app in appStore[1:]:
   name = app[0]
   if name in unique_appstore:
      duplicate_appstore.append(name)
   else:
      unique_appstore.append(name)

print('\nThere are',len(duplicate_appstore),'duplicate Apple app entries')
print('Examples:', duplicate_appstore[:6])

There are 1181 duplicate Android app entries
Examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box']

There are 0 duplicate Apple app entries
Examples: []


Luckily we only have duplicate apps in the Google Play dataset, of which there are 1,181 duplicate entries.

Looking at the original length of the Google Play dataset, we can easily determine how many entries our clean set should contain (including the header):

In [9]:
print(len(android) - 1181)

9660


#### Part 2: Finding highest number of reviews.

The next step of this process is to identify the highest number of reviews for each app. This number will be used in step 3 to identify the correct entries.

We can accomplish this step by creating a dictionary with the app name as the key, and number of reviews as the value. Looping through the dataset, we check if a name is already in the dictionary. If it is not, add the name and the number of reviews. If it is in the dictionary, then a check will be done on the number of reviews. If the current entry is higher then the value will be replaced. Otherwise it will remain the same. The final product is a dictionary with every unique app name and the highest number of reviews for that app.

We should expect to see 9,659 entries in this dictionary since these are the unique apps.

In [10]:
name_reviews = {}
for app in android[1:]:
    name = app[0]
    reviews = float(app[3])
    if name not in name_reviews:
        name_reviews[name] = reviews
    elif reviews > name_reviews[name]:
            name_reviews[name] = reviews

print(len(name_reviews))


9659


#### Part 3: Selecting the correct entries.

Now that we can reference each app's max number of reviews, we can use that number to identify the correct entries. To do so, we'll construct 2 lists. One of which, `android_clean` will contain the full, cleaned dataset. The other, `already_added` will be used to identify which apps are added to the new dataset.

This second list is necessary because there are some apps who have more than one entry sharing the same max number of reviews. This list will prevent duplicate entries even if they match the # of reviews criterion.

The loop will look through the dataset, then check if the number of reviews matches the number we have in the `name_reviews` dictionary. If it does, then add the app's entry to the `android_clean` list and the app's name to the `already_added` list. If the app's name exist on the `already_added` list, or the reviews don't match what we have in `name_reviews`, the entry is skipped.

Let's also print the length of the final dataset. Counting the header entry, we should see 9,660 entries.

In [11]:
android_clean = []
already_added = []

#Add headers to new dataset:
android_clean.append(android[0])

for app in android[1:]:
    name = app[0]
    reviews = float(app[3])
    if reviews == name_reviews[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9660


### Removing Non-English Apps

The company we're working for only develops apps in English and therefore is not interested in the data regarding foreign language apps. This step in the cleaning process will remove those apps from our datasets.

String characters each have a corresponding value which can be found using the `ord()` function. Based on the American Standard Code for Information Interchange (ASCII), we know that common English charactes (letters, arabic numerals, special characters) are in the 0 to 127 range. It's likely that apps containing characters past 127 are not English language apps and can be discarded.

However, there are some characters such as "™" or emojis like "😜" that can appear in English apps. Therefore we can't throw out apps based on a single character criteria. It is safer to discard apps if they have a higher number of non-English characters. Let's use more than 3 as our criteria.

We can write a function that checks an app name for whether it is English or not. The function will first parse out the characters in the app and, using the `ord()` function, check each one to determine if it is English or not based on it's `ord()` value. It will then identify non-English apps by names that have more than 3 characters.

In [12]:
def is_english(app_name):
    non_eng = 0
    for letter in app_name:
        if ord(letter) > 127:
            non_eng += 1
    if non_eng > 3:
        return False
    else:
        return True

#Function Check:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))

True
False
True


Now with the `is_english()` function built, we can check each app's name to determine if it's English or not. Apps that are identified as English will be put in a separate list for a clean dataset.

In [13]:
android_eng = []

#Add header row:
android_eng.append(android_clean[0])

for app in android_clean[1:]:
    if is_english(app[0]):
        android_eng.append(app)

appstore_eng = []

#Add header row:
appstore_eng.append(appStore[0])

for app in appStore[1:]:
    if is_english(app[1]):
        appstore_eng.append(app)

print(len(android_eng))
print(len(appstore_eng))

9615
7198


### Removing Paid Apps

Our company only creates apps that are free to download and install. Therefore to maintain a clean comparison, we'll remove any apps that require payment to download and install.

Examining the headers, we see that the datasets have a field called 'price'. The condition of price being equal to 0 will isolate the apps that we are interested in. Let's set up a loop to check the price of each app and put it in a new dataset if it is a free app.

In [14]:
android_final = []
android_final.append(android_eng[0])
for app in android_eng[1:]:
    if app[7] == '0':
        android_final.append(app)

       
appstore_final = []
appstore_final.append(appstore_eng[0])
for app in appstore_eng[1:]:
    if app[5] == '0':
        appstore_final.append(app)
        
print(len(android_final))
print(len(appstore_final))

8865
4057


---

## Data Analysis

The end goal for our analysis is to determine properties that make an app attract more users. Many apps across both marketplaces are successful at accomplishing this. We can create a profile of a successful app by examining these existing apps. Using those properties in our own app will raise our chance of success in the market.

To minimize risk though, we'll use the following strategy to validate whether an app we created is successful or not:

1. Build a minimal Android version and publish on Google Play
2. If app has positive response from users, develop additional features
3. If app is profitable after 6 months, then build and publish iOS version

A good high-level data point to start with is the type of app, such as what category or genre is is classified as. Let's identify the appropriate data fields that contain this information.

In the `android_final` dataset, the fields `Genre` and `Category` look to contain this information. For the `appstore_final` dataset, there's only the one `prime_genre` field to reference.

### Part 1

Let's get a better understanding of successful app genres by taking a look at the frequency of each genre on the markets. We'll start by creating a function `freq_table` to output a frequency for any specified column using the dataset and index value of the desired column:

In [15]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = round((table[key] / total) * 100, 2)
        table_percentages[key] = percentage
    
    return table_percentages        

With the percentages calculated, we'll use a second function `display_table` to sort and display the results:

In [16]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Part 2

Since our strategy calls for publishing to Google Play first, let's examine the results from the `Genre` and `Category` columns of that dataset:

In [17]:
display_table(android_final[1:], 9) #Genres

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

In [18]:
display_table(android_final[1:], 1) #Categories

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


Looking at the Google Play data, we see that Family is clearly the most common app category at ~19%. Games follow a distance behind at 9.72%, then Tools at 8.46%.

The genres show a similar pattern with Tools being the most common at 8.45%, followed by Entertainment (6.07%), then Business, Productivity, and Lifestyle.

Some of the top categories and genres show possible overlap such as Tools, Business, Productivity, and Lifestyle. The Family category though is not equivalent to a single genre, and implies there's diversity within that category that we may need to examine.

Let's take a look at the App Store data to see how it compares:

In [19]:
display_table(appstore_final[1:], -5)

Games : 55.65
Entertainment : 8.23
Photo & Video : 4.12
Social Networking : 3.53
Education : 3.25
Shopping : 2.98
Utilities : 2.69
Lifestyle : 2.32
Finance : 2.07
Sports : 1.95
Health & Fitness : 1.87
Music : 1.65
Book : 1.63
Productivity : 1.53
News : 1.43
Travel : 1.38
Food & Drink : 1.06
Weather : 0.76
Reference : 0.49
Navigation : 0.49
Business : 0.49
Catalogs : 0.22
Medical : 0.2


The App Store has a clear popular category, with a majority of the apps being categorized as Games. Entertainment and Photo & Video follow.

Genre itself does not provide us enough information to determine what makes a successful app. But it does give narrow down the categories our apps should compete in. There is an interesting distinction of popular categories between the two markets. In the App Store, Games are clearly the top category. But in Google Play, we see that more practical genres like Tools, Business, Education, are comparable to Entertainment. Overall, apps on the Google Play store tend to be more balanced than solely focused on entertainment.

### Part 3

We can get a better understanding of app popularity by looking at number of installs within each genre or category. This will help give us a more direct estimate of profitability since we need to maximize the number of users on the app.

### App Store

Let's start by looking at the App Store data. The data does not have a metric on number of installs. However it does have total number of reviews. We can use this `rating_count_tot` field as a proxy.

Using the `freq_table` function created earlier, we'll loop through the table, then through the full data, to get an average number of ratings per genre.

In [20]:
genres_ios = freq_table(appstore_final[1:], -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in appstore_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = int(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = int(total / len_genre)
    print(genre, ':', avg_n_ratings)

Productivity : 0
Weather : 0
Shopping : 0
Reference : 0
Finance : 0
Music : 0
Utilities : 0
Travel : 0
Social Networking : 0
Sports : 0
Health & Fitness : 0
Games : 0
Food & Drink : 0
News : 0
Book : 0
Photo & Video : 0
Entertainment : 0
Business : 0
Lifestyle : 0
Education : 0
Navigation : 0
Medical : 0
Catalogs : 0


Navigation, Reference, and Social Networking apps lead the list with the highest number of ratings. These categories though have a few big names that are likely skewing the data such as Waze, Google Maps in Navigation, or Facebook, Instagram, Snapchat in Social Networking.

Let's take a look at these two categories more closely:

In [21]:
for app in appstore_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

323229106 : 0
329541503 : 0
344176018 : 0
377321278 : 0
413487517 : 0
447024088 : 0
452186370 : 0
461703208 : 0
463431091 : 0
504677517 : 0
528532387 : 0
553771681 : 0
562136065 : 0
585027354 : 0
820004378 : 0
982887800 : 0
1025396583 : 0
1074321709 : 0
1075817264 : 0
1130847808 : 0


In [22]:
for app in appstore_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

284882215 : 0
288429040 : 0
304878510 : 0
305343404 : 0
305939712 : 0
310633997 : 0
314716233 : 0
319881193 : 0
336435697 : 0
349442137 : 0
350962117 : 0
351331194 : 0
357218860 : 0
364183992 : 0
369970819 : 0
372513032 : 0
372648912 : 0
375239755 : 0
382617920 : 0
384830320 : 0
386098453 : 0
389638243 : 0
392796698 : 0
398166286 : 0
405548206 : 0
414478124 : 0
427941017 : 0
428845974 : 0
429047995 : 0
432274380 : 0
433156786 : 0
442012681 : 0
443904275 : 0
444934666 : 0
445338486 : 0
448165862 : 0
453718989 : 0
454638411 : 0
458272450 : 0
471347413 : 0
477091899 : 0
477927812 : 0
505311207 : 0
506141837 : 0
531761928 : 0
539124565 : 0
552208596 : 0
554064861 : 0
558512661 : 0
562162550 : 0
566223681 : 0
569077959 : 0
570315854 : 0
575147772 : 0
640360962 : 0
643496868 : 0
648228242 : 0
651309421 : 0
686449807 : 0
698054232 : 0
710380093 : 0
719829352 : 0
733144873 : 0
789870026 : 0
836071680 : 0
838848566 : 0
852801905 : 0
861891048 : 0
862550306 : 0
867887231 : 0
899538562 : 0
907002

With some well established names in these categories, it would be very difficult to compete and establish a successful app in Navigation or Social Networking. Let's take a look also at the Reference category:

In [23]:
for app in appstore_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

282935706 : 0
308750436 : 0
364740856 : 0
388389451 : 0
399452287 : 0
414706506 : 0
475772902 : 0
640199958 : 0
671889349 : 0
980134624 : 0
1003837100 : 0
1096464625 : 0
1130829481 : 0
1132715891 : 0
1133678984 : 0
1133706938 : 0
1135575003 : 0
1137683736 : 0
1156856246 : 0
1171021623 : 0


In the reference category, the Bible and Dictionary.com lead the list. However there are not many other apps in this category to compete with. It's possible this group may be our potential market for a successful app. We could take another popular book or reference material and create an app for it, adding features beyond the raw text such as built-in dictionary, source material, or trivia to keep users engaged and spending more time in the app.

### Google Play

For the Android dataset, we do have a specific `Installs` column. This column uses categorical data though, not a raw number of installations. Let's take a look at the breakdown of this field using the `display_table` function from earlier:

In [24]:
display_table(android_final[1:], 5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


The categories within the field give us an idea of popularity, but are not precise enough. For example, an app in the 100,000+ group may have 100,001 or 400,000 installs. We'll have to make an assumption here that apps within each grouping have the exact number of installs. An app in the 1,000+ group will be assumed to have exactly 1,000 installs.

To get an average as we did with the App Store, the grouping names need to be cleaned and converted to numeric values. We can loop through the data as we did before, but will have to add a cleaning portion that removes the `,` and `+` from the name and converts the value to a numeric:

In [25]:
categories_android = freq_table(android_final[1:], 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = int(total / len_category)
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647317
BEAUTY : 513151
BOOKS_AND_REFERENCE : 8767811
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854028
EDUCATION : 1833495
ENTERTAINMENT : 11640705
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924897
HEALTH_AND_FITNESS : 4188821
HOUSE_AND_HOME : 1331540
LIBRARIES_AND_DEMO : 638503
LIFESTYLE : 1437816
GAME : 15588015
FAMILY : 3695641
MEDICAL : 120550
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984077
TOOLS : 10801391
PERSONALIZATION : 5201482
PRODUCTIVITY : 16787331
PARENTING : 542603
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056941


The results show that the Communication category has the highest average at over 38M. This category is dominated by a few giants though such as Facebook Messenger, WhatsApp, and Skype.

In [26]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Looking at some other categories, we see Social and Video Players are also quite popular. Again though these are categories with established market dominators such as Facebook, Instagram, Youtube, and Google Play Movies. Games is another promising category, but that market is very saturated and it's unlikely our app will get much notice in the noise.

Since we looked at Reference in the App Store data, let's take a look at the Books and Reference category for Google Play. It has over 8M average installs, which is certainly not the highest, but is a notable amount. Let's see what the breakdown by app looks like:

In [27]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

There are a few major leaders in this category such as Google Play Books, Bible, and Kindle. If we take out these top apps, we can get a better picture of what's popular among this category:

In [28]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

There's quite a few moderately popular apps. Examining the names shows that most of them are some type of collection, processors, or readers rather than specific books. As with the App Store data, this category shows potential for a specific book app that contains features offered by some of the more popular apps in this category.

-------

## Conclusion

This project consisted of collecting and analyzing app data for Google Play and the Apple App Store to determine a potential app profile and market for our company.

Alot of steps were taken to clean the data and target our analysis to comparable apps. This included constraints to only free and English language apps. Once our data was clean, we determined popular genres by examining number of apps and average number of installs.

Our findings showed the most popular categories for each market which included Social Networking, Games, and Reference genres. Among these, we determined the potential categories we primarily dominated with well-established market leaders such as Facebook, Youtube, and Google's suite of apps.

The Reference category showed potential within both markets though. There are a few market leaders such as the Bible and Kindle. However most of the popular apps in this category are readers or processors, not actual reference materials.

The recommendation from these findings is that the company create an app for a specific book or popular piece of reference material. Beyond the text though, the app should include additional features such as audio files, trivia, built-in dictionary, or even forums. An example app could be one designed around research topics for college students.