# Profitable App Profiles for the App Store and Google Play Market

This project analyzes which types of apps are likely to attract the most users.
Why do we need it? Because we want to help developers make their apps popular and attractive to users on Google Play and the App Store :)

In [1]:
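from csv import reader

# Read each CSV file into a list of rows (row 0 is the header)
with open('AppleStore.csv', encoding='utf8') as f:
    app_store = list(reader(f))
with open('googleplaystore.csv', encoding='utf8') as f:
    google_store = list(reader(f))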

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(app_store, 1, 5, True)
explore_data(google_store, 1, 5, True)


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live 

One row in the Google Play dataset (index 10473) is missing its Category value, so every later field is shifted one column to the left. We print it and then delete it.

In [4]:
print(google_store[10473])
del google_store[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
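
A quick way to spot rows like this one is to compare every row's length with the header's length. Below is a minimal sketch on a tiny made-up sample; the `sample` list and the `find_bad_rows` helper are illustrative and not part of the dataset or the cells above.

```python
def find_bad_rows(dataset):
    """Return the indexes of rows whose length differs from the header row."""
    n_columns = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != n_columns]

# Made-up sample: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Instagram', 'SOCIAL', '4.5'],
    ['Photo Frame', '1.9'],
]
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

If we used something like this on the real data, it would be safer to collect all the indexes first and delete afterwards, because every `del` shifts the indexes of the rows that follow.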


The Google Play dataset has duplicate entries. For example, Instagram appears four times:

In [5]:
for row in google_store:
    name = row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [6]:
duplicate_apps = []
unique_apps = []

for row in google_store:
    name = row[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
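
The same bookkeeping can be sketched with a set, which makes the `in` check much cheaper than on a list when there are thousands of names. This is only an illustration on made-up names; `split_duplicates` and `sample_names` are hypothetical and not used elsewhere in this notebook.

```python
def split_duplicates(names):
    """Return (set of unique names, list of repeated occurrences)."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)  # second and later occurrences
        else:
            seen.add(name)
    return seen, duplicates

# Made-up sample of app names
sample_names = ['Instagram', 'Slack', 'Instagram', 'Instagram', 'Box']
unique_names, duplicate_names = split_duplicates(sample_names)
print('Unique apps:', len(unique_names))
print('Duplicate entries:', len(duplicate_names))
```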

Now it's time to remove duplicates from this dataset. For each app we keep only the row with the highest number of reviews, on the assumption that it is the most recent record.

In [7]:
reviews_max = {}

for row in google_store[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


android_clean is a list where we will store our cleaned dataset

In [8]:
android_clean = []
already_added = []

for row in google_store[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))

9659
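
The two passes above could also be folded into one by storing the best row per name in a dictionary. Here is a minimal sketch, assuming the same column layout (name at index 0, reviews at index 3). The `keep_most_reviewed` helper is hypothetical, the Instagram rows are shortened copies of the ones printed earlier, and 'Other App' is made up.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for every app name, the single row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' means that on a tie the first row seen is kept
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Other App', 'TOOLS', '4.0', '10'],
]
deduped = keep_most_reviewed(sample_rows)
print(len(deduped))
print(deduped[0])
```

A dictionary lookup also avoids the `name not in already_added` scan over a list, which gets slower as the list grows.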


Now it's time to remove all apps with non-English names, because our audience is English-speaking. IsEnglish is a function that will help us do it.

We count the characters that fall outside the ASCII range (code point above 127). If a name has more than three of them, we decide that the app should be removed. Allowing up to three such characters avoids removing English apps whose names contain emojis or special marks.

In [9]:
def IsEnglish(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

In [10]:
print(IsEnglish('Instagram'))
print(IsEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(IsEnglish('Docs To Go™ Free Office Suite'))
print(IsEnglish('Instachat 😜'))

True
False
True
True
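
To see why each name passes or fails, we can print how many non-ASCII characters it contains. This is just a sketch; the `count_non_ascii` helper is hypothetical and mirrors the counting loop inside IsEnglish.

```python
def count_non_ascii(string):
    """Count characters whose code point is above 127 (outside ASCII)."""
    return sum(1 for character in string if ord(character) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in names:
    print(name, ':', count_non_ascii(name))

# Limitation of the threshold: an English name with four emojis
# has four non-ASCII characters and would be dropped as well
four_emojis = count_non_ascii('Party 😜😜😜😜')
print(four_emojis)
```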


google_english and app_english are lists that contain only the apps with English names.

In [11]:
google_english = []
app_english = []

for row in android_clean:
    if IsEnglish(row[0]):
        google_english.append(row)

for row in app_store[1:]:
    if IsEnglish(row[1]):
        print(row[1])
        app_english.append(row)

Facebook
Instagram
Clash of Clans
Temple Run
Pandora - Music & Radio
Pinterest
Bible
Candy Crush Saga
Spotify Music
Angry Birds
Subway Surfers
Fruit Ninja Classic
Solitaire
CSR Racing
Crossy Road - Endless Arcade Hopper
Injustice: Gods Among Us
Hay Day
Clear Vision (17+)
Minecraft: Pocket Edition
PAC-MAN
Calorie Counter & Diet Tracker by MyFitnessPal
DragonVale
The Weather Channel: Forecast, Radar & Alerts
Head Soccer
Google – Search made just for mobile
Despicable Me: Minion Rush
The Sims™ FreePlay
Google Earth
Plants vs. Zombies
Sonic Dash
Groupon - Deals, Coupons & Discount Shopping App
8 Ball Pool™
Tiny Tower - Free City Building
Jetpack Joyride
Bike Race - Top Motorcycle Racing Games
Shazam - Discover music, artists, videos & lyrics
Kim Kardashian: Hollywood
Doodle Jump
Trivia Crack
WordBrain
Sniper 3D Assassin: Shoot to Kill Gun Game
Flow Free
Lose It! – Weight Loss Program and Calorie Counter
Skype for iPhone
Geometry Dash Lite
Draw Something
▻Sudoku
Twitter
Messenger
Waze - GPS

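Let's look at the data again :) Note that this cell prints the original app_store and google_store lists; to inspect the filtered data, pass app_english and google_english instead.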

In [12]:
explore_data(app_store, 1, 10, True)
explore_data(google_store, 1, 10, True)


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['5538347

Now we will isolate free apps in separate lists

In [13]:
google_free = []
app_free = []

for row in google_english:
    if row[6] == 'Free':
        google_free.append(row)
        
for row in app_english:
    if row[4] == '0.0':
        app_free.append(row)
        
print(len(google_free))
print(len(app_free))

8863
3222


Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

In [14]:
print(google_store[0])
print(app_store[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Now we will build and display frequency tables (as percentages) for genres and categories in both datasets, using the two functions defined below.

In [15]:
def freq_table(dataset, index):
    freq_dict = {}
    total = 0
    
    for row in dataset:
        total += 1
        param = row[index]
        
        if param in freq_dict:
            freq_dict[param] += 1
        else:
            freq_dict[param] = 1
    table_percentages = {}
    for key in freq_dict:
        percentage = (freq_dict[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

In [16]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
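
For comparison, the standard library's `collections.Counter` can do the counting and the sorting for us. Below is a minimal sketch on made-up rows; `freq_table_counter` is a hypothetical alternative, not the function used in the following cells.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.most_common()}

# Made-up rows: [name, category]
sample = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]
table = freq_table_counter(sample, 1)
for key, percentage in table.items():
    print(key, ':', percentage)
```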

In [17]:
display_table(google_free, 1)
print('-'*20)
display_table(google_free, 9)
print('-'*20)
display_table(app_free, 11)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Let's see the average number of user ratings per genre in the App Store (we use rating_count_tot as a proxy for popularity)

In [24]:
genres_ios = freq_table(app_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in app_free:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print("{} : {:,.1f}".format(genre, avg_n_ratings))

Book : 39,758.5
Food & Drink : 33,333.9
Utilities : 18,684.5
Lifestyle : 16,485.8
Sports : 23,008.9
Education : 7,004.0
News : 21,248.0
Shopping : 26,919.7
Catalogs : 4,004.0
Weather : 52,279.9
Medical : 612.0
Finance : 31,467.9
Entertainment : 14,029.8
Music : 57,326.5
Reference : 74,942.1
Social Networking : 71,548.3
Photo & Video : 28,441.5
Travel : 28,243.8
Business : 7,491.1
Games : 22,788.7
Navigation : 86,090.3
Health & Fitness : 23,298.0
Productivity : 21,028.4


Considering these results, Navigation, Reference and Social Networking have the highest average number of user ratings in the App Store, followed by Music

In [25]:
for app in app_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
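
The list above suggests that the Navigation average is carried by Waze and Google Maps. As a small worked check, we can compare the mean with the median using the six rating counts printed above (the variable names are only for illustration):

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

avg = mean(navigation_ratings)
mid = median(navigation_ratings)
# Share of all Navigation ratings that belongs to the two biggest apps
top_two_share = (345046 + 154911) / sum(navigation_ratings) * 100

print("mean   : {:,.1f}".format(avg))
print("median : {:,.1f}".format(mid))
print("share of top two apps: {:.1f}%".format(top_two_share))
```

The mean reproduces the 86,090.3 from the genre table, while the median is only about 8,196.5, so a typical Navigation app is far less popular than the average suggests.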


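Now let's see the average number of installs per category on Google Play. The Installs column holds strings such as '10,000+', so we strip the commas and the plus sign before converting to a number.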

In [27]:
genres_google = freq_table(google_free, 1)

for category in genres_google:
    total = 0
    len_category = 0
    
    for app in google_free:
        category_app = app[1]
        
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs.replace('+', ''))
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    
    print("{} : {:,.1f}".format(category, avg_n_installs))

VIDEO_PLAYERS : 24,727,872.5
SOCIAL : 23,253,652.1
FINANCE : 1,387,692.5
EDUCATION : 1,833,495.1
COMICS : 817,657.3
PERSONALIZATION : 5,201,482.6
HEALTH_AND_FITNESS : 4,188,822.0
TOOLS : 10,801,391.3
DATING : 854,028.8
GAME : 15,588,015.6
LIBRARIES_AND_DEMO : 638,503.7
HOUSE_AND_HOME : 1,331,540.6
LIFESTYLE : 1,437,816.3
FOOD_AND_DRINK : 1,924,897.7
EVENTS : 253,542.2
MEDICAL : 120,550.6
BOOKS_AND_REFERENCE : 8,767,811.9
NEWS_AND_MAGAZINES : 9,549,178.5
BUSINESS : 1,712,290.1
BEAUTY : 513,151.9
MAPS_AND_NAVIGATION : 4,056,941.8
AUTO_AND_VEHICLES : 647,317.8
PHOTOGRAPHY : 17,840,110.4
PRODUCTIVITY : 16,787,331.3
COMMUNICATION : 38,456,119.2
ART_AND_DESIGN : 1,986,335.1
SPORTS : 3,638,640.1
ENTERTAINMENT : 11,640,705.9
WEATHER : 5,074,486.2
FAMILY : 3,697,848.2
PARENTING : 542,603.6
TRAVEL_AND_LOCAL : 13,984,077.7
SHOPPING : 7,036,877.3


Considering these results, COMMUNICATION, VIDEO_PLAYERS and SOCIAL are the categories with the highest average number of installs on Google Play
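
The per-category averages can also be computed in a single pass over the data instead of one pass per category. Here is a minimal sketch, assuming the same column layout (category at index 1, installs at index 5). The `parse_installs` and `average_installs_by_category` helpers are hypothetical, and the sample rows are shortened copies of rows printed earlier in this notebook.

```python
def parse_installs(installs):
    """Turn a string such as '10,000+' into the float 10000.0."""
    return float(installs.replace(',', '').replace('+', ''))

def average_installs_by_category(dataset, category_idx=1, installs_idx=5):
    """Return {category: average installs} using one pass over the rows."""
    totals = {}
    counts = {}
    for app in dataset:
        category = app[category_idx]
        totals[category] = totals.get(category, 0.0) + parse_installs(app[installs_idx])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Rows shortened to the first six columns
sample = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+'],
    ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+'],
]
averages = average_installs_by_category(sample)
for category, avg_installs in averages.items():
    print("{} : {:,.1f}".format(category, avg_installs))
```

Keep in mind that values such as '1,000,000,000+' are lower bounds of a bucket rather than exact counts, so these averages are rough estimates.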