# Analyzing Mobile App Data

Our aim is to help our developers understand which types of apps are likely to attract more users on Google Play and the App Store.

With millions of apps currently available through these two app stores, we will use a sample to represent the population. We have a sample of around 10,000 Android apps from Google Play (collected August 2018), and 7,000 iOS apps from the App Store (collected July 2017).

## Opening and Exploring the Data

In [1]:
from csv import reader

# Get iOS data
# (explicit encoding so the non-ASCII app names decode the same way on every platform)
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()

# Get Android data
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()

# Function to make it easier to explore the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
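
The two file reads above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that and is not part of the original analysis: `load_csv` is a hypothetical name, the `utf8` encoding is an assumption about the files, and the demo file is a throwaway created only so the example runs without the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file even if reading raises an error.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a tiny temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Category\nSketch - Draw & Paint,ART_AND_DESIGN\n")
    header, rows = load_csv(demo_path)

print(header)
print(rows)
```

With the real files, the call would look like `ios_header, ios_rows = load_csv('AppleStore.csv')`, which also keeps the header separate from the data rows.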

In [2]:
explore_data(ios, 1, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [3]:
explore_data(android, 1, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


Looking at the column names for each of the datasets:

In [4]:
for column in ios[0]:
    print(column)

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num
ipadSc_urls.num
lang.num
vpp_lic


In [5]:
for column in android[0]:
    print(column)

App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver
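
Later cells refer to columns by position (for example `app[11]` for `prime_genre`). As a possible convenience, the positions can be looked up from the header row instead of being counted by hand. This is only a sketch: `column_indexes` is a hypothetical helper, and the header lists are copied from the output above so the example runs on its own.

```python
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']


def column_indexes(header):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header)}


ios_col = column_indexes(ios_header)
android_col = column_indexes(android_header)

print(ios_col['prime_genre'])       # 11
print(ios_col['rating_count_tot'])  # 5
print(android_col['Category'])      # 1
print(android_col['Genres'])        # 9
```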


## Deleting Wrong Data

Before we can continue with an analysis of the data, we need to clean it. Some of these steps are general, such as removing duplicates. Other steps are more specific to our goals, e.g. removing paid apps (since our company is only interested in developing free-to-install apps). Our steps for cleaning the data are:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove duplicates.
- Remove non-English apps.
- Remove apps that aren't free.

We know from the [Kaggle discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/164101) for the Google Play dataset that there is an error on row 10472 (header not included), which corresponds to index 10473 in our list because the header occupies index 0.

In [6]:
android[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

We can see that `Category`, the second column, is missing and as such all the columns have shifted to the left leaving us with 12 data columns instead of 13. We will delete this row.

In [7]:
if android[10473][0] == 'Life Made WI-Fi Touchscreen Photo Frame':
    del android[10473]
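
The faulty row was found through the dataset discussion, but the same kind of error can also be searched for directly by comparing each row's length with the header's length. The following is a minimal sketch under that assumption; `find_malformed_rows` is a hypothetical helper, and the sample is a shortened stand-in for the real data.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Shortened illustration: the last row is missing its Category value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

print(find_malformed_rows(sample))  # [(2, 2)]
```

Running the same check on the full `android` and `ios` lists would list every row with a missing or extra value, not only the one reported in the discussion.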

According to the discussions around the Apple dataset there do not seem to be any inaccurate rows.

## Removing Duplicate Entries

We will focus only on the Google Play dataset, since the discussion around the iOS dataset indicates that it contains no duplicates.

In [8]:
duplicate_apps = []
unique_apps = []

for app in android:
    if app[0] in unique_apps:
        duplicate_apps.append(app[0])
    else:
        unique_apps.append(app[0])
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Example of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  1181


Example of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [9]:
for app in android:
    if app[0] == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We need a criterion to determine which of the duplicates to keep in the dataset. We will use the `Reviews` column, on the assumption that the most recent entry will have the largest number of reviews. Therefore, we will keep only the entry with the largest number of reviews and delete all the others.

In [10]:
reviews_max = {}

for app in android[1:]:
    n_reviews = float(app[3])
    name = app[0]
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print("Length of dictionary: ", len(reviews_max), "\n")
print("Max reviews for Instagram:", reviews_max['Instagram'])

Length of dictionary:  9659 

Max reviews for Instagram: 66577446.0


We now have a dictionary where each unique app is the key and the value is the highest number of user reviews. We can use this dictionary to remove the duplicate entries.

In [11]:
android_clean=[]
already_added=[]

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print("Length of clean dataset: ", len(android_clean))
android_clean[:10]

Length of clean dataset:  9659


[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up'],
 ['Sketch - Draw & Paint',
  'ART_AND_DESIGN',
  '4.5',
  '215644',
  '25M',
  '50,000,000+',
  'Free',
  '0',
  'Teen',
  'Art & Design',
  'June 8, 2018',
  'Varies with device',
  '4.2 and up'],
 ['Pixel Draw - Number Art Coloring Book',
  'ART_AND_DESIGN',
  '4.3',
  '967',
  '2.8M',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design;Creativity',
  'June 20, 2018',
  '1.1',
  '4.4 and up'],
 ['Paper flowers instructions',
  'ART_AND_DESIGN',
  '4.4',
  '167',
  '5.6M',
  '50,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'March 26, 2017
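
The two-step approach above (build `reviews_max`, then filter) can also be written as a single pass that stores the best row per app name. This is an alternative sketch, not the method used in this notebook: `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened versions of rows shown earlier.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

Because it relies on a dictionary lookup rather than searching a list, this version avoids the repeated `name not in already_added` scans. The rows come out in order of each app's first appearance, which may differ slightly from the order produced by the cells above.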

## Removing Non-English Apps

Since our company mainly has an English-speaking audience, we want to remove any apps with non-English names from our dataset. We can do this by removing any apps with names containing non-English characters. Each Unicode character has an associated integer (its code point), which we can find using the `ord()` function. According to the ASCII (American Standard Code for Information Interchange) system, the characters commonly used in English text have code points between 0 and 127. We can use this as a criterion for determining whether we should keep an app or not.
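
To make the 0 to 127 range concrete, the short sketch below prints the code point of a few characters taken from app names in this dataset and labels each one as ASCII or not. The selection of characters is ours and is only illustrative.

```python
# A few characters that occur in the app names used in the tests below.
samples = ['a', 'Z', '9', ' ', '™', '爱', '😜']
codes = {ch: ord(ch) for ch in samples}

for ch, code in codes.items():
    label = 'ASCII' if code <= 127 else 'non-ASCII'
    print(repr(ch), code, label)
```

Letters, digits, and the space all fall inside the range, while the trademark sign, the Chinese character, and the emoji fall outside it.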

In [12]:
def is_english(name):
    for letter in name:
        unicode = ord(letter)
        if unicode > 127:
            return False
    return True

# Testing
print("Is 'Instagram' english: ", is_english('Instagram'))
print("Is '爱奇艺PPS -《欢乐颂2》电视剧热播' english: ", is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print("Is 'Docs To Go™ Free Office Suite' english: ", is_english('Docs To Go™ Free Office Suite'))
print("Is 'Instachat 😜' english: ", is_english('Instachat 😜'))

Is 'Instagram' english:  True
Is '爱奇艺PPS -《欢乐颂2》电视剧热播' english:  False
Is 'Docs To Go™ Free Office Suite' english:  False
Is 'Instachat 😜' english:  False


This function would also remove apps that are English but whose names contain at least one non-ASCII character, such as `™` or an emoji. We need to modify the function before we can use it. We will change it to allow up to three non-ASCII characters before a name is classified as non-English.

In [13]:
def is_english(name):
    non_standard = 0
    for letter in name:
        unicode = ord(letter)
        if unicode > 127:
            non_standard += 1
            if non_standard > 3:
                return False
    return True

# Testing
print("Is 'Instagram' english: ", is_english('Instagram'))
print("Is '爱奇艺PPS -《欢乐颂2》电视剧热播' english: ", is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print("Is 'Docs To Go™ Free Office Suite' english: ", is_english('Docs To Go™ Free Office Suite'))
print("Is 'Instachat 😜' english: ", is_english('Instachat 😜'))

Is 'Instagram' english:  True
Is '爱奇艺PPS -《欢乐颂2》电视剧热播' english:  False
Is 'Docs To Go™ Free Office Suite' english:  True
Is 'Instachat 😜' english:  True
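
One limitation of a fixed allowance of three characters is that very short names written entirely in another script still pass, because they never exceed the limit. A possible refinement, shown here only as a sketch, is to look at the share of non-ASCII characters instead. `non_ascii_ratio` and `is_mostly_ascii` are hypothetical names, and the 20% threshold is an arbitrary assumption that would need tuning against the real data.

```python
def non_ascii_ratio(name):
    """Fraction of characters in `name` whose code point is above 127."""
    if not name:
        return 0.0
    return sum(1 for ch in name if ord(ch) > 127) / len(name)


def is_mostly_ascii(name, max_ratio=0.2):
    """True when at most `max_ratio` of the characters are non-ASCII (assumed threshold)."""
    return non_ascii_ratio(name) <= max_ratio


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜',
             '爱奇艺PPS -《欢乐颂2》电视剧热播', '中文']:
    print(name, '->', is_mostly_ascii(name))
```

The last name has only two characters, so a count-based rule with a limit of three would accept it, whereas the ratio-based rule rejects it.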


In [14]:
android_clean_en = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_clean_en.append(app)
print("Number of unique android apps with english text: ", len(android_clean_en))

Number of unique android apps with english text:  9614


In [15]:
ios_clean_en = []

for app in ios[1:]:
    name = app[1]
    if is_english(name):
        ios_clean_en.append(app)
print("Number of unique iOS apps with english text: ", len(ios_clean_en))

Number of unique iOS apps with english text:  6183


## Isolating the Free Apps

In [16]:
free_android_en = []

for app in android_clean_en:
    if app[6] == 'Free':
        free_android_en.append(app)
        
print("Number of unique free english text android apps: ", len(free_android_en))

Number of unique free english text android apps:  8863


In [17]:
free_ios_en = []
for app in ios_clean_en:
    price = float(app[4])
    if price == 0.0:
        free_ios_en.append(app)
        
print("Number of unique free english text iOS apps: ", len(free_ios_en))

Number of unique free english text iOS apps:  3222


## Most Common Apps by Genre

The end goal for the company is to have a new app that will be available on both the Google Play and App Store, so we need to find an app profile to fit both markets.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We will start by looking at what the most common genres are, for which we will build frequency tables.

- In the iOS table, `prime_genre` is at column index 11.
- In the Android table, `Genres` is at column index 9 and `Category` is at column index 1.

We will build a function to generate a frequency table that shows percentages and another to display the percentages in descending order (making it easy to identify the top genres).

In [18]:
# Where index is the column index of interest
def freq_table(dataset, index):
    
    f_table={}
    n = len(dataset)
    
    for app in dataset:
        key = app[index]
        if key in f_table:
            f_table[key] += 1
        else:
            f_table[key] = 1
    for key in f_table:
        f_table[key] /= n
        f_table[key] *= 100
        f_table[key] = round(f_table[key], 3)
        
    return f_table
    
# Test
test = freq_table(ios_clean_en, 11)

test

{'Social Networking': 2.038,
 'Photo & Video': 5.515,
 'Games': 54.86,
 'Music': 2.216,
 'Reference': 0.857,
 'Health & Fitness': 2.669,
 'Weather': 1.116,
 'Utilities': 3.445,
 'Travel': 0.97,
 'Shopping': 1.375,
 'News': 0.922,
 'Navigation': 0.453,
 'Lifestyle': 1.601,
 'Entertainment': 7.262,
 'Food & Drink': 0.712,
 'Sports': 1.682,
 'Book': 0.89,
 'Finance': 0.792,
 'Education': 6.631,
 'Productivity': 2.717,
 'Business': 0.857,
 'Catalogs': 0.081,
 'Medical': 0.34}

This dictionary is unordered and not easy to read at a glance, which is why we need a second function to display it in descending order.

In [19]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# Display ordered freq table for ios genres
display_table(ios_clean_en, 11)

Games : 54.86
Entertainment : 7.262
Education : 6.631
Photo & Video : 5.515
Utilities : 3.445
Productivity : 2.717
Health & Fitness : 2.669
Music : 2.216
Social Networking : 2.038
Sports : 1.682
Lifestyle : 1.601
Shopping : 1.375
Weather : 1.116
Travel : 0.97
News : 0.922
Book : 0.89
Reference : 0.857
Business : 0.857
Finance : 0.792
Food & Drink : 0.712
Navigation : 0.453
Medical : 0.34
Catalogs : 0.081
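
For comparison, the standard library's `collections.Counter` can produce the same kind of sorted percentage table in fewer lines. This is an alternative sketch rather than the approach used in this notebook; `freq_table_sorted` is a hypothetical name and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(key, round(count / total * 100, 3))
            for key, count in counts.most_common()]


# Made-up rows: [app name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for genre, pct in freq_table_sorted(sample, 1):
    print(genre, ':', pct)
```

One difference to be aware of: when two values have the same count, `most_common()` keeps them in first-seen order, whereas `display_table` orders ties by name in reverse alphabetical order.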


For the iOS `prime_genre` column (English apps, both free and paid, since the table is built from `ios_clean_en`):
- The most common genre is `Games` (54.9%), followed by `Entertainment` (7.3%) and `Education` (6.6%); `Games` is considerably larger than either of the other two.
- The top genres are mostly entertainment-oriented.
- Whilst there are many `Games` options, we don't yet know how many of these have been downloaded or how many users they have.
- If we recommend `Games`, there will be a lot of competition.

In [20]:
# Display ordered freq table for android genres
display_table(android_clean_en, 9)

Tools : 8.602
Entertainment : 5.794
Education : 5.232
Business : 4.358
Medical : 4.109
Personalization : 3.901
Productivity : 3.88
Lifestyle : 3.776
Finance : 3.589
Sports : 3.443
Communication : 3.266
Action : 3.11
Health & Fitness : 2.996
Photography : 2.912
News & Magazines : 2.6
Social : 2.486
Travel & Local : 2.268
Books & Reference : 2.268
Shopping : 2.091
Simulation : 1.976
Arcade : 1.914
Dating : 1.768
Casual : 1.716
Video Players & Editors : 1.675
Maps & Navigation : 1.342
Puzzle : 1.238
Food & Drink : 1.165
Role Playing : 1.082
Strategy : 0.978
Racing : 0.947
Libraries & Demo : 0.874
Auto & Vehicles : 0.874
Weather : 0.822
House & Home : 0.759
Adventure : 0.749
Events : 0.666
Art & Design : 0.582
Comics : 0.562
Beauty : 0.551
Card : 0.489
Parenting : 0.478
Board : 0.437
Casino : 0.406
Educational;Education : 0.395
Trivia : 0.385
Educational : 0.385
Education;Education : 0.364
Casual;Pretend Play : 0.26
Word : 0.239
Music : 0.198
Puzzle;Brain Games : 0.177
Education;Pretend Pl

In [21]:
# Display ordered frequency table for android categories
display_table(android_clean_en, 1)

FAMILY : 19.326
GAME : 9.819
TOOLS : 8.612
BUSINESS : 4.358
MEDICAL : 4.109
PERSONALIZATION : 3.901
PRODUCTIVITY : 3.88
LIFESTYLE : 3.786
FINANCE : 3.589
SPORTS : 3.38
COMMUNICATION : 3.266
HEALTH_AND_FITNESS : 2.996
PHOTOGRAPHY : 2.912
NEWS_AND_MAGAZINES : 2.6
SOCIAL : 2.486
TRAVEL_AND_LOCAL : 2.278
BOOKS_AND_REFERENCE : 2.268
SHOPPING : 2.091
DATING : 1.768
VIDEO_PLAYERS : 1.695
MAPS_AND_NAVIGATION : 1.342
FOOD_AND_DRINK : 1.165
EDUCATION : 1.103
ENTERTAINMENT : 0.905
LIBRARIES_AND_DEMO : 0.874
AUTO_AND_VEHICLES : 0.874
WEATHER : 0.822
HOUSE_AND_HOME : 0.759
EVENTS : 0.666
PARENTING : 0.624
ART_AND_DESIGN : 0.624
COMICS : 0.572
BEAUTY : 0.551


For the Android `Genres` column:
- It is highly granular, with no single genre dominating the list.
- Games are divided into different genres such as `Action`, `Simulation`, and `Arcade`.
- The top genre is `Tools` at 8.6%.
- There are sub-genres, marked with a `;`, such as `Simulation;Pretend play`.

For the Android `Category` column:
- It is easier to read (and more comparable to the iOS `prime_genre` column).
- The top category is `FAMILY` with 19.3%, followed by `GAME` with 9.8%.
- `FAMILY` comprises different types of apps, including games and other types of entertainment that are age-appropriate for children.
- `TOOLS` has the same share as in the `Genres` column, at 8.6%.

Again, these tables only tell us how many apps are available, not how many people have downloaded, installed, and used them.

## Most Popular Apps by Genre on the App Store

Our iOS dataset does not provide us with the number of installs for each app, so we will use the number of user ratings instead. We will extract this from the table using nested loops and look to see if we can make a recommendation based on this.

In [22]:
ios_genres = freq_table(ios_clean_en, 11)
print("Average number of ratings per iOS genre: \n")
for genre in ios_genres:
    # Initialise variables
    total=0
    len_genre=0
    for app in ios_clean_en:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_rating = round(total / len_genre)
    print(genre, ": ", avg_rating)


Average number of ratings per iOS genre: 

Social Networking :  60254
Photo & Video :  14689
Games :  15587
Music :  29047
Reference :  27037
Health & Fitness :  10802
Weather :  23145
Utilities :  7928
Travel :  19030
Shopping :  26635
News :  16980
Navigation :  19371
Lifestyle :  8930
Entertainment :  8862
Food & Drink :  19934
Sports :  15351
Book :  10359
Finance :  23354
Education :  2472
Productivity :  8508
Business :  5149
Catalogs :  3465
Medical :  649
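
The nested loops above scan the whole dataset once per genre. The same averages can be computed in a single pass by accumulating a running total and count per genre. The sketch below illustrates the idea; `average_by_group` is a hypothetical helper, and the sample rows are shortened versions of the four iOS rows printed earlier (name, rating count, genre).

```python
from collections import defaultdict


def average_by_group(dataset, group_idx, value_idx):
    """Average of a numeric column for each distinct value of a grouping column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


sample = [
    ['Facebook', '2974676', 'Social Networking'],
    ['Instagram', '2161558', 'Photo & Video'],
    ['Clash of Clans', '2130805', 'Games'],
    ['Temple Run', '1724546', 'Games'],
]

averages = average_by_group(sample, group_idx=2, value_idx=1)
for genre, avg in averages.items():
    print(genre, ':', round(avg))
```

On the full dataset the call would be `average_by_group(ios_clean_en, 11, 5)`, using the column indexes for `prime_genre` and `rating_count_tot`.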


`Social Networking`, `Music`, and `Reference` have the highest average number of user ratings on the App Store, which we are using as a proxy for popularity. Despite being the most populous genre, `Games` has only about a quarter of the average number of user ratings of `Social Networking`.

There are many different criteria we could use to select a genre, but one way to select would be:
- Popular genre (high number of user ratings)
- Gap in the market (low number of currently available apps)

In [23]:
print("Selecting only the most rated apps (with over 20000 ratings on average):")
for genre in ios_genres:
    # Initialise variables
    total=0
    len_genre=0
    for app in ios_clean_en:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_rating = round(total / len_genre)
    if avg_rating > 20000:
        print(genre, ": ", avg_rating)

Selecting only the most rated apps (with over 20000 ratings on average):
Social Networking :  60254
Music :  29047
Reference :  27037
Weather :  23145
Shopping :  26635
Finance :  23354


In [53]:
print("Least available genres for app market (less than 2% share of the store):")

for genre in ios_genres:
    if ios_genres[genre] < 2.0:
        print(genre, ": ", ios_genres[genre])

Least available genres for app market (less than 2% share of the store):
Reference :  0.857
Weather :  1.116
Travel :  0.97
Shopping :  1.375
News :  0.922
Navigation :  0.453
Lifestyle :  1.601
Food & Drink :  0.712
Sports :  1.682
Book :  0.89
Finance :  0.792
Business :  0.857
Catalogs :  0.081
Medical :  0.34


From our two selections we cross-match and recommend the following:
- `Reference`
- `Weather`
- `Shopping`
- `Finance`
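
The cross-match was done by eye, but it can also be expressed as a set intersection. The sketch below hard-codes the two tables printed above so that it runs on its own; the variable names are ours, and the thresholds (20000 ratings, 2% share) are the ones used in the cells above.

```python
# Average number of ratings per genre, copied from the output above.
avg_ratings = {
    'Social Networking': 60254, 'Photo & Video': 14689, 'Games': 15587,
    'Music': 29047, 'Reference': 27037, 'Health & Fitness': 10802,
    'Weather': 23145, 'Utilities': 7928, 'Travel': 19030, 'Shopping': 26635,
    'News': 16980, 'Navigation': 19371, 'Lifestyle': 8930,
    'Entertainment': 8862, 'Food & Drink': 19934, 'Sports': 15351,
    'Book': 10359, 'Finance': 23354, 'Education': 2472, 'Productivity': 8508,
    'Business': 5149, 'Catalogs': 3465, 'Medical': 649,
}

# Share of apps per genre (percent), copied from the frequency table above.
share = {
    'Social Networking': 2.038, 'Photo & Video': 5.515, 'Games': 54.86,
    'Music': 2.216, 'Reference': 0.857, 'Health & Fitness': 2.669,
    'Weather': 1.116, 'Utilities': 3.445, 'Travel': 0.97, 'Shopping': 1.375,
    'News': 0.922, 'Navigation': 0.453, 'Lifestyle': 1.601,
    'Entertainment': 7.262, 'Food & Drink': 0.712, 'Sports': 1.682,
    'Book': 0.89, 'Finance': 0.792, 'Education': 6.631, 'Productivity': 2.717,
    'Business': 0.857, 'Catalogs': 0.081, 'Medical': 0.34,
}

popular = {genre for genre, n in avg_ratings.items() if n > 20000}
scarce = {genre for genre, pct in share.items() if pct < 2.0}

# Genres that satisfy both criteria.
candidates = sorted(popular & scarce)
print(candidates)  # ['Finance', 'Reference', 'Shopping', 'Weather']
```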
  
## Most Popular Apps by Genre on Google Play

We will use the `Category` column rather than `Genres` for this analysis.

In [52]:
android_cats = freq_table(android_clean_en, 1)

for category in android_cats:
    # Initialise variables
    total=0
    len_category=0
    for app in android_clean_en:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
            
    avg_installs = round(total / len_category)
    if avg_installs > 4000000: # greater than 4M
        print(category, ": ", avg_installs)

BOOKS_AND_REFERENCE :  7641778
COMMUNICATION :  35153714
ENTERTAINMENT :  11375402
GAME :  14256218
SOCIAL :  22961790
SHOPPING :  6966909
PHOTOGRAPHY :  16636241
TRAVEL_AND_LOCAL :  13218663
TOOLS :  9785955
PERSONALIZATION :  4086652
PRODUCTIVITY :  15530942
WEATHER :  4570893
VIDEO_PLAYERS :  24121489
NEWS_AND_MAGAZINES :  9472807


The list above displays only the categories that have more than 4M installs on average. Using the same criteria as with the iOS dataset, we can compare it with the following table of categories that have a low share of the store:

In [48]:
print("Least available genres for app market (less than 2% share of the store):")

for cat in android_cats:
    if android_cats[cat] < 2.0:
        print(cat, ": ", android_cats[cat])

Least available genres for app market (less than 2% share of the store):
ART_AND_DESIGN :  0.624
AUTO_AND_VEHICLES :  0.874
BEAUTY :  0.551
COMICS :  0.572
DATING :  1.768
EDUCATION :  1.103
ENTERTAINMENT :  0.905
EVENTS :  0.666
FOOD_AND_DRINK :  1.165
HOUSE_AND_HOME :  0.759
LIBRARIES_AND_DEMO :  0.874
PARENTING :  0.624
WEATHER :  0.822
VIDEO_PLAYERS :  1.695
MAPS_AND_NAVIGATION :  1.342


From our two selections we cross-match and recommend the following:
- `ENTERTAINMENT`
- `VIDEO_PLAYERS`
- `WEATHER`
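
To see which recommendations the two stores have in common, the two shortlists can be compared after normalizing the labels, since the App Store uses names like `Weather` and Google Play uses names like `WEATHER`. This is a rough sketch: `normalize` is a hypothetical helper, and a simple text match like this would miss labels that differ in wording between the stores (for example `Book` and `BOOKS_AND_REFERENCE`), which would need a manual mapping.

```python
# Shortlists taken from the two cross-matches above.
ios_candidates = ['Reference', 'Weather', 'Shopping', 'Finance']
android_candidates = ['ENTERTAINMENT', 'VIDEO_PLAYERS', 'WEATHER']


def normalize(label):
    """Lower-case a genre or category label and turn underscores into spaces."""
    return label.replace('_', ' ').strip().lower()


common = sorted({normalize(g) for g in ios_candidates}
                & {normalize(c) for c in android_candidates})
print(common)  # ['weather']
```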

## Recommendation and Conclusions

A weather app could be recommended based on the criteria used (high average number of installs or ratings versus low availability). Weather appears in our recommended genres for both iOS and Android. A next step would be to see whether it is possible to generate revenue from in-app advertising or similar sources. This would require a deeper look into this genre and its profitability. We may also need to explore other genres to see whether some are more profitable, even if they fit our original criteria less well.
