# Apps that Are More Likely to Attract Customers in the App Store and Google Play Markets

In this project, we analyze app data from the App Store and Google Play markets to decide which kind of app to build, based on which apps are likely to attract the most users.

## Opening and Exploring the Data

In [1]:
from csv import reader


# Apple Data
open_Apple = open('AppleStore.csv', encoding = 'Latin-1')
read_Data = reader(open_Apple)
apple_Data = list(read_Data)
apple_Header = apple_Data[0]
apple_Data = apple_Data[1:]


# Google Data
open_Google = open('googleplaystore.csv', encoding = 'Latin-1')
read_Data = reader(open_Google)
google_Data = list(read_Data)
google_Header = google_Data[0]
google_Data = google_Data[1:]
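
The cell above opens both files without closing them. A minimal sketch of a loader that closes the file automatically is shown below; `load_csv` is a hypothetical helper name, not part of the original notebook, and the default encoding is only an assumption (the notebook itself passes `'Latin-1'`).

```python
import csv

def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for a CSV file, closing the file when done."""
    # newline="" is the setting the csv module documentation recommends
    # for file objects passed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    # The first row is the header; everything after it is data.
    return rows[0], rows[1:]
```

With this helper, the Apple data could be loaded as `apple_Header, apple_Data = load_csv('AppleStore.csv', encoding='Latin-1')`.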





In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print(apple_Header)
print('\n')
explore_data(apple_Data, 0, 4, True)


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7197
Number of columns: 17


We are most interested in freemium apps (free to download, so a price of 0), together with the rating_count_tot column (how many times users rated the app) and the user_rating column (the average user rating). Finally, we want to know what type of app each one is, so we also need the prime_genre column.
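
Because this file has an unnamed first column, counting column positions by hand is error-prone. The sketch below builds a name-to-index lookup from the header printed above; `apple_col` is a hypothetical name introduced here for illustration.

```python
# Header copied from the App Store output above
apple_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# Map each column name to its position in a row
apple_col = {name: index for index, name in enumerate(apple_header)}

# First data row from the output above
row = ['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99',
       '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']

print(row[apple_col['track_name']])   # PAC-MAN Premium
print(row[apple_col['price']])        # 3.99
print(row[apple_col['prime_genre']])  # Games
```

Looking columns up by name keeps the later cells readable and makes an off-by-one index easier to spot.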

In [3]:
print(google_Header)
print('\n')
explore_data(google_Data,0,4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


For the Google Play data, we will want the Category, Rating, and Reviews columns, and perhaps the Installs and Content Rating columns as well.

## Deleting Wrong Data

**For Google**

In [4]:
print(google_Header)
print('\n')
print(google_Data[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

The row at index 10472 has only 12 fields instead of 13: the Category value is missing, so every later value is shifted one column to the left. We delete this row.


In [5]:
del google_Data[10472]
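
A row like the one at index 10472 can also be found programmatically by comparing each row's length with the header's. This is a minimal sketch; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

header = ['App', 'Category', 'Rating']
rows = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken app', '1.9'],                 # Category missing, so the row is too short
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
]

print(find_malformed_rows(header, rows))  # [1]
```

If several indices come back, deleting them from the highest index to the lowest avoids shifting the positions of rows that still need to be removed.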

**For Apple**

No wrong data was found in the Apple data.

## Duplicate Entries: Part One

In [6]:
for app in google_Data:
    name = app[0]
    if name == 'Box':
        print(app)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
duplicate_apps = []
unique_apps = []

for app in google_Data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    unique_apps.append(name)

print("number of duplicate entries:",len(duplicate_apps))
print(duplicate_apps[:5])

number of duplicate entries: 1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


As shown above, there are 1181 duplicate entries; 'Box', for example, appears 3 times.

To remove the duplicates, we keep the row with the highest number of reviews for each app and remove that app's other entries.

## Duplicate Entries: Part Two

In [8]:
reviews_max = {}

for row in google_Data:
    name = row[0]
    n_reviews = float(row[3])
    if (name in reviews_max and reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews

The code above fills a dictionary with the app name as the key and the number of reviews as the value. Whenever a duplicate is encountered, the stored value is replaced only if the new number of reviews is higher than the previous one.

In [9]:
print("Expected length:", len(google_Data) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


In [10]:
android_clean = []
already_added = []

for row in google_Data:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name] and name not in already_added):
        android_clean.append(row)
        already_added.append(name)
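
The two passes above can also be folded into one by storing the best row per name directly in a dictionary, which avoids the repeated list membership checks on `already_added`. This is a sketch under that assumption, not the notebook's method; `keep_most_reviewed` is a hypothetical helper and the sample rows are shortened for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Zoom', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

One difference to keep in mind: the result is ordered by each name's first appearance, so row order may differ from the two-pass version even though the same rows are kept.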

In [11]:
explore_data(android_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13


## Removing Non-English Apps: Part One

In [12]:
def is_English(string):
    for letter in string:
        if ord(letter) > 127:
            return False
    return True

In [13]:
print('Instagram:', is_English('Instagram'))
print('爱奇艺PPS -《欢乐颂2》电视剧热播:', is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('Docs To Go™ Free Office Suite', is_English('Docs To Go™ Free Office Suite'))
print('Instachat 😜', is_English('Instachat 😜'))

Instagram: True
爱奇艺PPS -《欢乐颂2》电视剧热播: False
Docs To Go™ Free Office Suite False
Instachat 😜 False


## Removing Non-English Apps: Part Two

The function above isn't good enough: it rejects English app names that contain emojis or other non-ASCII characters such as ™. In Part Two, we rewrite the function to allow up to three characters with code points above 127 (that is, outside the ASCII range). It's not perfect, but it is much better than what we had.

In [14]:
def is_English(string):
    non_English = 0
    for letter in string:
        if ord(letter) > 127:
            non_English += 1
    if non_English > 3:
        return False
    return True

In [15]:
print('Instagram:', is_English('Instagram'))
print('爱奇艺PPS -《欢乐颂2》电视剧热播:', is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('Docs To Go™ Free Office Suite', is_English('Docs To Go™ Free Office Suite'))
print('Instachat 😜', is_English('Instachat 😜'))

Instagram: True
爱奇艺PPS -《欢乐颂2》电视剧热播: False
Docs To Go™ Free Office Suite True
Instachat 😜 True
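
One thing to watch with this threshold is how the file was decoded. The `'â\x80\x93'` sequence in the Google Play output above is what a single en dash looks like when UTF-8 bytes are read as Latin-1. Assuming the file really is UTF-8 encoded (an assumption, not something the notebook confirms), the sketch below shows that one such character then counts three times toward the limit. `count_non_ascii` is a hypothetical helper.

```python
def count_non_ascii(text):
    """Count characters whose code point is above 127."""
    return sum(1 for ch in text if ord(ch) > 127)

original = 'U Launcher Lite – FREE Live Cool Themes, Hide Apps'

# Simulate reading UTF-8 bytes with the Latin-1 codec
misread = original.encode('utf-8').decode('latin-1')

print(count_non_ascii(original))  # 1
print(count_non_ascii(misread))   # 3
print(repr(misread[16:19]))       # 'â\x80\x93'
```

Under that assumption, a name with two such characters would already exceed the limit of three and be filtered out as non-English.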


In [16]:
engl_Google = []
engl_Ios = []

for row in android_clean:
    name = row[0]
    if is_English(name):
        engl_Google.append(row)

for row in apple_Data:
    name = row[2]  # track_name; row[1] holds the numeric id
    if is_English(name):
        engl_Ios.append(row)

        

In [17]:
explore_data(engl_Google, 0, 4, True)
explore_data(engl_Ios, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9500
Number of columns: 13
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'U

## Isolating the Free Apps

In [18]:
# For Google

google_Final = []

for row in engl_Google:
    if row[7] == '0':
        google_Final.append(row)

# For Apple

ios_Final = []

for row in engl_Ios:
    if float(row[5]) == 0.0:  # price column; row[4] is the currency ('USD')
        ios_Final.append(row)

    

In [19]:
print('length of Google set:', len(google_Final))
print('length of ios set:', len(ios_Final))

length of Google set: 8760


## Most Common Apps by Genre

**Part One**

Revenue is highly influenced by the number of people using the app, especially
with freemium games. We eventually want to add the app to both Google Play and the App Store.
Here is our validation strategy to minimize risk and overhead:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

We will begin by exploring the most common genres in each market. We will use the prime_genre column in the Apple data, and the Genres and Category columns in the Google Play data.

**Part Two & Part Three**

In [20]:
def freq_table(dataset, index):
    counts = {}  # value -> number of rows with that value
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1

    percent_table = {}
    for key in counts:
        percentage = (counts[key] / total) * 100
        percent_table[key] = percentage

    return percent_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [21]:
display_table(ios_Final, -5) # Prime Genres

Games is the most common genre by a landslide. However, many of these apps could have bad reviews or very few users. Most of the free apps in the App Store seem designed for fun, with Education the only exception in the top 5.

In [22]:
display_table(google_Final, 1) # Category

FAMILY : 18.938356164383563
GAME : 9.657534246575343
TOOLS : 8.481735159817351
BUSINESS : 4.646118721461187
PRODUCTIVITY : 3.9383561643835616
LIFESTYLE : 3.9155251141552516
FINANCE : 3.721461187214612
MEDICAL : 3.550228310502283
SPORTS : 3.3333333333333335
PERSONALIZATION : 3.287671232876712
COMMUNICATION : 3.2534246575342465
HEALTH_AND_FITNESS : 3.093607305936073
PHOTOGRAPHY : 2.9794520547945202
NEWS_AND_MAGAZINES : 2.808219178082192
SOCIAL : 2.6484018264840183
TRAVEL_AND_LOCAL : 2.3401826484018264
SHOPPING : 2.2488584474885847
BOOKS_AND_REFERENCE : 2.146118721461187
DATING : 1.860730593607306
VIDEO_PLAYERS : 1.8036529680365299
MAPS_AND_NAVIGATION : 1.3812785388127853
FOOD_AND_DRINK : 1.2328767123287672
EDUCATION : 1.1757990867579908
ENTERTAINMENT : 0.9589041095890412
AUTO_AND_VEHICLES : 0.9246575342465754
LIBRARIES_AND_DEMO : 0.9018264840182649
WEATHER : 0.7876712328767124
HOUSE_AND_HOME : 0.7876712328767124
EVENTS : 0.7191780821917808
ART_AND_DESIGN : 0.6506849315068494
PARENTING : 

There are significant differences in the Google Play market: there are fewer games and more productivity and tools apps. This information alone is insufficient for deciding which app to make, because it doesn't tell us how many users these apps have; supply could exceed demand.

In [23]:
display_table(google_Final, -4) # Genres

Tools : 8.470319634703197
Entertainment : 6.084474885844749
Education : 5.3881278538812785
Business : 4.646118721461187
Productivity : 3.9383561643835616
Lifestyle : 3.904109589041096
Finance : 3.721461187214612
Medical : 3.550228310502283
Sports : 3.4018264840182644
Personalization : 3.287671232876712
Communication : 3.2534246575342465
Action : 3.105022831050228
Health & Fitness : 3.093607305936073
Photography : 2.9794520547945202
News & Magazines : 2.808219178082192
Social : 2.6484018264840183
Travel & Local : 2.328767123287671
Shopping : 2.2488584474885847
Books & Reference : 2.146118721461187
Simulation : 2.054794520547945
Dating : 1.860730593607306
Arcade : 1.82648401826484
Video Players & Editors : 1.7808219178082192
Casual : 1.7351598173515983
Maps & Navigation : 1.3812785388127853
Food & Drink : 1.2328767123287672
Puzzle : 1.141552511415525
Racing : 1.004566210045662
Role Playing : 0.9474885844748858
Strategy : 0.9246575342465754
Auto & Vehicles : 0.9246575342465754
Libraries &

## Most Popular Apps by Genre on the App Store

In [24]:
freq_ios = freq_table(ios_Final, -5)

for genre in freq_ios:
    total = 0
    len_genre = 0
    for row in ios_Final:
        genre_app = row[-5]
        if (genre_app == genre):
            n_ratings = float(row[6])  # rating_count_tot; row[5] is the price
            total += n_ratings
            len_genre += 1
    avg_num_ratings = total / len_genre
    print(genre, ':', avg_num_ratings)
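
The nested loop above rescans the whole dataset once per genre. A single-pass alternative is sketched below; `average_by_group` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: name, genre, rating count
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Reference', '50'],
]

print(average_by_group(sample, 1, 2))  # {'Games': 200.0, 'Reference': 50.0}
```

For the App Store data, the call would be `average_by_group(ios_Final, 12, 6)`, assuming prime_genre is at index 12 and rating_count_tot at index 6, as in the header printed earlier.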


Navigation apps have the most user ratings on average, but this is mainly due to Google Maps. Social Networking ranks high as well, as do Reference apps. Social Networking is also skewed by apps such as Facebook and Instagram. Let's take a closer look at Reference.

In [25]:
for row in ios_Final:
    if row[-5] == 'Reference':
        print(row[2], ':', row[6])  # track_name and rating_count_tot

## Most Popular Apps by Genre on Google Play

In [26]:
freq_google = freq_table(google_Final, 1)

max_installs = 0
max_genre = ''
for category in freq_google:
    total = 0
    len_category = 0
    for row in google_Final:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total / len_category
    if (avg_installs > max_installs):
        max_installs = avg_installs
        max_genre = category
    print(category, ':', avg_installs)
print('\n')
print(max_genre, ':', max_installs)



ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 654074.8271604938
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8329168.936170213
BUSINESS : 1712290.1474201474
COMICS : 859042.1568627451
COMMUNICATION : 38550548.03859649
DATING : 861409.5521472392
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11767380.952380951
EVENTS : 253542.22222222222
FINANCE : 1365500.4049079753
FOOD_AND_DRINK : 1951283.8055555555
HEALTH_AND_FITNESS : 4219697.055350553
HOUSE_AND_HOME : 1385541.463768116
LIBRARIES_AND_DEMO : 649314.0506329114
LIFESTYLE : 1447458.976676385
GAME : 15571586.690307328
FAMILY : 3716053.755274262
MEDICAL : 121161.87781350482
SOCIAL : 23628689.23275862
SHOPPING : 7103190.78680203
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3750580.6438356163
TRAVEL_AND_LOCAL : 14120454.07804878
TOOLS : 10902378.834454913
PERSONALIZATION : 5240358.986111111
PRODUCTIVITY : 16787331.344927534
PARENTING : 552875.1785714285
WEATHER : 5212877.101449275
VIDEO_PLAYERS : 24878048.860759493
NEWS_AND_MAGAZI

Communication is the most installed category, but it is heavily influenced by Facebook, Instagram, etc. Books and Reference is fairly well represented. Let's take a closer look.
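
When a few very large apps dominate a category, the median can be a useful cross-check on the mean. The sketch below is illustrative only: `parse_installs` is a hypothetical helper, and the list holds a handful of Installs strings in the format used by this dataset.

```python
from statistics import median

def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to an int."""
    return int(text.replace('+', '').replace(',', ''))

# A small illustrative sample of Installs values
installs = ['50,000+', '100,000+', '10,000,000+', '1,000,000,000+', '500,000+']
values = [parse_installs(v) for v in installs]

print(sum(values) / len(values))  # 202130000.0, pulled up by the single largest app
print(median(values))             # 500000
```

In this sample the mean is several hundred times larger than the median, which is the kind of skew described above.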

In [27]:
for row in google_Final:
    if row[1] == "BOOKS_AND_REFERENCE":
        print(row[0], ':', row[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra â free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+

There are a lot of e-reader apps and holy texts with a high number of installs, so there would be significant competition in these niche markets. Let's look at the apps with a mid-range number of installs.

In [28]:
for row in google_Final:
    if (row[1] == "BOOKS_AND_REFERENCE" and (row[5] == '1,000,000+' 
                                             or row[5] == '5,000,000+' 
                                             or row[5] == '10,000,000+' 
                                             or row[5] == '50,000,000+')):
        print(row[0], ':', row[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra â free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
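
The chain of string comparisons in the cell above can also be written as a numeric range check, which is easier to adjust. This is a sketch only; `parse_installs` and `in_install_range` are hypothetical helpers, and the sample rows are shortened to name, category and installs.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to an int."""
    return int(text.replace('+', '').replace(',', ''))

def in_install_range(row, low, high, installs_index=5):
    """Return True if the row's install count lies between low and high, inclusive."""
    return low <= parse_installs(row[installs_index]) <= high

sample = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '1,000,000,000+'],
    ['Flybook', 'BOOKS_AND_REFERENCE', '500,000+'],
]

mid_range = [
    row[0]
    for row in sample
    if row[1] == 'BOOKS_AND_REFERENCE'
    and in_install_range(row, 1_000_000, 50_000_000, installs_index=2)
]
print(mid_range)  # ['Wikipedia']
```

With the full Google Play rows, the default `installs_index=5` matches the Installs column used in the cell above.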

These results suggest that an app based on a popular book could be quite profitable.

## Conclusions

In this project, we analyzed app data for the Google Play and App Store markets to see which type of freemium app could be profitable in both markets.
After the analysis, we conclude that an app about a popular book would likely be the most profitable in both markets.