# Apple and Android Mobile Apps Project

My objective for this project is to analyze available data from the App Store and Google Play to determine which types of apps are likely to bring our organization the most revenue.

For this project, I am working as a data analyst for a company that builds mobile apps for Google Play and Apple's App Store. The company specializes in free-to-play apps and makes money by showing ads inside them, so revenue depends largely on how many people use each app. The goal of this project is to analyze data to help the company's developers understand what types of apps are likely to attract more users.

In [1]:
import pandas as pd
# Importing the Google Play Dataset
gp_csv = "https://dq-content.s3.amazonaws.com/350/googleplaystore.csv"
gp_df = pd.read_csv(gp_csv, header=0)


# Importing the App Store Dataset
as_csv = "https://dq-content.s3.amazonaws.com/350/AppleStore.csv"
as_df = pd.read_csv(as_csv, header=0)

In [2]:
gp_df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [3]:
as_df.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [4]:
# Dimensions of the respective datasets
print(gp_df.shape)
print(as_df.shape)

(10841, 13)
(7197, 16)


In [5]:
# Identifying an incorrect entry
gp_df.iloc[10472]

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                               19.0
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object

In [6]:
# This row is missing its 'Category' value, so every later value is shifted one column to the left
# Deleting the incorrect entry
print(gp_df.shape)
gp_df.drop(index=10472, inplace=True)
print(gp_df.shape)

(10841, 13)
(10840, 13)
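
Row 10472 was found by inspection. A programmatic check, sketched here on a small made-up frame rather than the real dataset, is to flag rows whose `Rating` falls outside the 0 to 5 range a store rating should occupy. The sample rows and the `bad_rows` name are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the Google Play data; the values are invented for illustration.
sample = pd.DataFrame({
    "App": ["App A", "App B", "Shifted Row"],
    "Category": ["TOOLS", "GAME", "1.9"],
    "Rating": [4.1, None, 19.0],
})

# Ratings should lie in [0, 5]; anything above 5 hints at a shifted row.
# Missing ratings (NaN) compare as False, so they are not flagged.
bad_rows = sample[sample["Rating"] > 5]
print(bad_rows.index.tolist())
```

Any index this prints is a candidate for manual inspection before dropping.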


## Duplicate Entries

Further examination reveals that the Google Play dataset contains duplicate entries. Instagram, for example, appears more than once:

In [7]:
for index, row in gp_df.iterrows():
    name = row['App']  # label-based access; positional row[0] is deprecated in recent pandas
    if name == 'Instagram':
        print(row)
    

App                        Instagram
Category                      SOCIAL
Rating                           4.5
Reviews                     66577313
Size              Varies with device
Installs              1,000,000,000+
Type                            Free
Price                              0
Content Rating                  Teen
Genres                        Social
Last Updated           July 31, 2018
Current Ver       Varies with device
Android Ver       Varies with device
Name: 2545, dtype: object
App                        Instagram
Category                      SOCIAL
Rating                           4.5
Reviews                     66577446
Size              Varies with device
Installs              1,000,000,000+
Type                            Free
Price                              0
Content Rating                  Teen
Genres                        Social
Last Updated           July 31, 2018
Current Ver       Varies with device
Android Ver       Varies with device
Name: 2604, 

Here we see that the only difference between the duplicate entries is the number of reviews.

In [8]:
# Create a list of the duplicate apps in the Google Play store
duplicate_apps = []
unique_apps = []

for index, row in gp_df.iterrows():
    name = row['App']
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print(len(duplicate_apps))

1181
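
The same count can be sketched with pandas' built-in `Series.duplicated`, which marks every occurrence of a value after its first. The names below are made-up sample data, used only to show that the logic matches the loop.

```python
import pandas as pd

# Invented app names: Instagram appears 3 times, Slack twice, Maps once.
names = pd.Series(["Instagram", "Slack", "Instagram", "Slack", "Instagram", "Maps"])

# duplicated() is True for each repeat after the first occurrence,
# so summing it counts the rows the loop would append to duplicate_apps.
n_duplicates = int(names.duplicated().sum())
print(n_duplicates)
```

On the real data this would be applied to the `App` column of `gp_df`.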


To remove the duplicates, I will do the following:

Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, I'll only select the entry with the highest number of reviews).

In [10]:
reviews_max = {} 
for index, row in gp_df.iterrows():
    name = row['App']
    n_reviews = float(row['Reviews'])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

In [11]:
# Creating gp_clean list which is where I will store the Google Play dataset that is cleansed of duplicates
gp_clean = []
already_added = []

for index, row in gp_df.iterrows():
    name = row['App']
    n_reviews = float(row['Reviews'])
    if n_reviews == reviews_max[name] and name not in already_added:
        gp_clean.append(row)
        already_added.append(name)
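
As an alternative to the two loops, the "keep the row with the most reviews" rule can be sketched in vectorized pandas: sort by review count, then drop repeated app names. This is a sketch on invented rows, not the method used in this notebook, and when two rows tie on review count it may keep a different one than the loop does.

```python
import pandas as pd

# Invented sample; Reviews are strings here, as they would be when read from a CSV.
sample = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Maps"],
    "Reviews": ["66577313", "66577446", "120"],
})

# Convert to numbers first so sorting is numeric rather than alphabetical.
sample["Reviews"] = sample["Reviews"].astype(float)

# Put the most-reviewed row of each app first, keep only that row,
# then restore the original row order.
deduped = (
    sample.sort_values("Reviews", ascending=False)
          .drop_duplicates(subset="App", keep="first")
          .sort_index()
)
print(deduped)
```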
        

In [11]:
# There are no duplicate apps in the App Store dataset

## Non-English Apps

I am going to remove all non-English apps from both datasets. Each character in a string has a corresponding number (its code point). The characters commonly used in English text all fall in the range 0 to 127, as defined by ASCII (American Standard Code for Information Interchange). Based on this range, we can build a function that detects whether a character belongs to the set of common English characters: if its number is less than or equal to 127, it does.
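
To make the mapping concrete, Python's built-in `ord` returns the number for a single character. The characters below are chosen only for illustration.

```python
# ord() returns the Unicode code point of a single character.
chars = ["a", "Z", "~", "™", "爱", "😜"]

# True where the character falls inside the ASCII range (0 to 127).
ascii_flags = [ord(ch) <= 127 for ch in chars]

for ch, flag in zip(chars, ascii_flags):
    print(repr(ch), ord(ch), flag)
```

The letters and the tilde fall inside the range, while the trademark sign, the Chinese character, and the emoji do not.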

In [12]:
# Creating a function that checks if every character in a string is English
def check_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

In [13]:
# Testing the function
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
False
False


The function couldn't correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

To minimize the impact of data loss, I will only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [14]:
# Updating the function
def check_english(string):
    ne_char = 0
    for character in string:
        if ord(character) > 127:
            ne_char += 1
        if ne_char > 3:
            return False
    return True

In [15]:
# Checking the updated function
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

False
True
True


In [16]:
# Filtering non-English apps out of the Google Play dataset
gp_eng = []
for row in gp_clean:
    name = row['App']
    if check_english(name):
        gp_eng.append(row)
        
# Filtering non-English apps out of the App Store dataset
as_eng = []
for index, row in as_df.iterrows():
    name = row['track_name']
    if check_english(name):
        as_eng.append(row)
        
print(len(gp_eng)) # Checking how many Google Play apps are left in the dataset
print(len(as_eng)) # Checking how many App Store apps are left in the dataset

9614
6183
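
If the data were kept as a DataFrame instead of a list of rows, the same filter could be sketched as a boolean mask. The helper name `is_probably_english` and its `max_non_ascii` parameter are hypothetical; the threshold of three mirrors the rule described above.

```python
import pandas as pd

def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    return sum(ord(ch) > 127 for ch in name) <= max_non_ascii

# The same four names used to test check_english above.
sample = pd.DataFrame({
    "App": [
        "Instagram",
        "爱奇艺PPS -《欢乐颂2》电视剧热播",
        "Docs To Go™ Free Office Suite",
        "Instachat 😜",
    ],
})

# apply() evaluates the helper once per name and yields a True/False mask.
mask = sample["App"].apply(is_probably_english)
english_only = sample[mask]
print(english_only["App"].tolist())
```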


I am only interested in the free apps from the Google Play and App Store datasets. Because of this, I will filter out all of the paid apps from both datasets.

In [17]:
# Creating a Google Play dataset with only free apps
gp_free = []
for row in gp_eng:
    price = row['Price']
    if price == '0':
        gp_free.append(row)
        
# Creating an App Store dataset with only free apps
as_free = []
for row in as_eng:
    price = row['price']
    if price == 0:
        as_free.append(row)
        
print(len(gp_free)) # Checking how many Google Play apps are left in the dataset
print(len(as_free)) # Checking how many App Store apps are left in the dataset

8864
3222
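
Note that the two comparisons differ in type: the Google Play price is compared with the string `'0'`, the App Store price with the number `0`. One way to make the two filters uniform, sketched below on invented rows, is to convert the Google Play prices to numbers first. The sketch assumes paid prices are stored as strings such as `$4.99`; that format is an assumption, since no paid row is shown above.

```python
import pandas as pd

# Invented rows; the "$4.99" format for paid apps is an assumption.
gp_sample = pd.DataFrame({"App": ["A", "B", "C"], "Price": ["0", "$4.99", "0"]})
as_sample = pd.DataFrame({"track_name": ["X", "Y"], "price": [0.0, 2.99]})

# Strip the dollar sign and convert to a number so both stores compare the same way.
gp_price = pd.to_numeric(gp_sample["Price"].str.replace("$", "", regex=False))

gp_free_sample = gp_sample[gp_price == 0]
as_free_sample = as_sample[as_sample["price"] == 0]

print(len(gp_free_sample), len(as_free_sample))
```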


## Data Analysis

Because my end goal is to publish the app on both Google Play and the App Store, I need to find app profiles that are successful in both markets. For instance, a profile that works well in both markets might be a productivity app that makes use of gamification.

I will begin the analysis by determining the most common genres for each market. For this, I'll need to build frequency tables for a few columns in my datasets.

In [18]:
# Creating a function that returns a frequency table expressed in percentages
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

# Creating a function that generates a frequency table and prints the entries of the table in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [22]:
display_table(gp_free, 'Genres')

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [23]:
display_table(gp_free, 'Category')

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [24]:
display_table(as_free, 'prime_genre')

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


## Results

In the App Store, games make up the majority of the catalog: more than 58% of the free English apps are games. Entertainment is a distant second at about 8%.

As for the Google Play store, practical apps seem to reign supreme in the genre column. Tools, Entertainment, Education, Business, and Productivity make up the five most popular genres in the Google Play store. The category column is dominated by family-oriented apps with games coming in second.

Game apps might be less common in the Google Play store than in the App Store, but games are still one of the most common types of apps on Google Play.

## More Analysis

Next, I will estimate how popular each genre is in the App Store. The App Store dataset has no installation data, so I will use the total number of user ratings (the `rating_count_tot` column) as a proxy and compute its average for each genre.

In [31]:
# Calculating the average number of user ratings (a proxy for installs) for each app genre in the App Store
genres_ios = freq_table(as_free, 'prime_genre')

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in as_free:
        genre_app = app['prime_genre']
        if genre_app == genre:            
            n_ratings = float(app['rating_count_tot'])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
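
The nested loop above computes a per-genre mean. With a DataFrame, the same calculation can be sketched as a `groupby`, which also makes it easy to sort the result. The rows below are invented for illustration.

```python
import pandas as pd

# Invented sample: two genres with two apps each.
sample = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Reference", "Reference"],
    "rating_count_tot": [100, 300, 1000, 3000],
})

# Mean number of ratings per genre, highest first.
avg_ratings = (
    sample.groupby("prime_genre")["rating_count_tot"]
          .mean()
          .sort_values(ascending=False)
)
print(avg_ratings)
```

Sorting the output makes it easier to read off which genres lead than scanning an unsorted list.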


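Reference apps have the second-highest average number of ratings (about 74,942, behind only Navigation at about 86,090) despite accounting for less than 1% of the free English apps in the App Store. This suggests that a reference app has a better chance of standing out than a game app, which competes with more than half of the catalog.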

In [36]:
# Calculating the average number of installations for each app genre in the Google Play store
categories_gp = freq_table(gp_free, 'Category')
for category in categories_gp:
    total = 0
    len_category = 0
    for app in gp_free:
        category_app = app['Category']
        if category_app == category:            
            n_installs = app['Installs']
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
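
The install counts are stored as strings such as `"10,000+"`, which is why the loop strips commas and plus signs before converting. A vectorized sketch of the same cleaning and averaging, on invented rows, could look like this:

```python
import pandas as pd

# Invented sample using the same "1,000+" style seen in the Installs column.
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Installs": ["1,000+", "5,000,000+", "10,000+"],
})

# Remove the thousands separators and the trailing plus sign, then convert.
installs = (
    sample["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(float)
)

# Average installs per category, highest first.
avg_installs = installs.groupby(sample["Category"]).mean().sort_values(ascending=False)
print(avg_installs)
```

Because values like `"1,000+"` are lower bounds, the averages are approximate in either version.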

## Conclusion

Once again, reference apps rank relatively high: BOOKS_AND_REFERENCE averages roughly 8.8 million installs per app. Other categories, such as games or communication, may be more popular, but they also appear oversaturated, since their apps make up a large percentage of what is available on Google Play and the App Store. I would recommend that my company build a reference app based on some popular media.