# Profitable App Profiles for the App Store and Google Play Markets (just 4f)

 Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. Don't be so serious, chill! We will do it step by step.

---
# Explore the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

There are 2 dataset we're going to work with:
- Appstore dataset containing data about approximately seven thousand iOS apps from the App Store.
- CH play dataset containing data about approximately ten thousand Android apps from Google Play.

Let's start by opening the two data sets and then continue with exploring the data.

In [34]:
from csv import reader
# AppStore DataSet
ios_op = open('AppleStore.csv', encoding = 'utf8') #Unicode
ios_read = reader(ios_op)
ios = list(ios_read)

#GooglePlay Store DataSet
android_op = open('googleplaystore.csv', encoding = 'utf8')
android_read = reader(android_op)
android = list(android_read)

To explore the data, we need to build the function named `explore_data()` to see what things we will work with:

In [35]:
def explore_data(dataset, start, end, row_col = False):
    for i in dataset[start:end]:
        print(i) #i is a row in dataset which was converted to list of list
        print("\n") # to separete each line
    if (row_col == True):
        print("Number of rows: ",len(dataset))
        print("Number of columns: ",len(dataset[0]))

In [36]:
print(android[0]) #print out header :)) 
print("\n")
print(explore_data(android,1,3,True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10842
Number of columns:  13
None


In [37]:
explore_data(ios,0,3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




It's great, right ? Now we can observe how our data look like. Then, let's move to the next step to clear our data! 

---
# Deleting wrong data

Everything is not perfect, and Data either. There are many inaccurate data in our dataset. 

But there are some trouble, `the wrong data`. Like row 10473 in CHplay dataset:

In [38]:
print(android[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


It's having wrong rating for entry. Therefore, all we need to do is delete this row by using `del` statement. But not to abuse it much!

In [39]:
del(android[10473]) #just run one time!

In [40]:
print(len(android)) #correct row

10841


---
# Remove Duplicate Entries
## Part 1
Still not the end of cleaning data. We will see that "Instagram" has 4 entries like this

In [41]:
for app in android:
    if app[0] == "Instagram": #app[0] is the name of app
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We will find out all of duplicate app and how many is it like this:

In [42]:
duplicate = []
unique_app = []
for app in android:
    if app[0] in unique_app:
        duplicate.append(app[0])
    else:
        unique_app.append(app[0])
print("Number of duplicated app: ",len(duplicate))
print("\n")
print("Some name of duplicated app: ",duplicate[:20])

Number of duplicated app:  1181


Some name of duplicated app:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


As we see, there are many duplicated apps we need to remove. But we are not supposed to delete them randomly. `Name` of each app is unique and we can follow that criterion to clean our dataset.

## Part 2

We found that there are 1181 duplicates which mean we only have 9660 unique rows left

In [43]:
print("Unique rows: ",len(android) - 1181)

Unique rows:  9660


To remove duplicates, we will:
- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

Now start building our dictionary:

```
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
```

# Remove non-English app

In previous step, we know how to remove wrong and duplicated app. This step will show you how to remove non English app.

In [44]:
print(ios[814][1])
print(ios[6732][1])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


So what we're going to do ?

We are going to remove each app with a name containing a symbol what isn't commonly used in English text.

Each character corresponding to number. Like `'a'` is 97, character `'A'` is 65, and character `'爱'` is 29233. (use `ord()` to know what number is that character)

The number of English text are all in range 0 to 127, according to the ASCII system. So we need to bulid the function to check if the name is English or not:

In [45]:
def IsEnglish(name):
    for character in name:
        if ord(character) > 127:
            return False
    return True
print(IsEnglish("Facebook"))
print(IsEnglish("爱奇艺PPS -《欢乐颂2》电视剧热播"))

True
False


The function seem work fine. But hold on! Let's try these cases: Emotion icon :))

In [46]:
print(IsEnglish('Instachat 😜'))
print(IsEnglish('Docs To Go™ Free Office Suite'))

False
False


If we keep that function form, we may remove such usefull app using those emojis :V So move on to next part to see how will we optimize our function.

## Part 2
To minimize losing usefull data, we need to change something in our function.

In [47]:
def IsEnglish(name):
    non_ascii = 0
    for character in name:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3: #if this text have more than 3 non ascii char- > not english 
        return False
    else:
        return True
print(IsEnglish('Instachat 😜'))
print(IsEnglish('Docs To Go™ Free Office Suite'))
print(IsEnglish("爱奇艺PPS -《欢乐颂2》电视剧热播"))

True
True
False


Here we use `IsEnglsih()` to explore both dataset that we have:

In [48]:
android_eng = []
ios_eng = []
#English android apps
for app in android[1:]:
    if IsEnglish(app[1]):
        android_eng.append(app)
        
#English ios apps
for app in ios[1:]:
    if IsEnglish(app[1]):
        ios_eng.append(app)
explore_data(ios_eng,1,4)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']




# Isolating the Free Apps
We only build apps that are free to dowload and install. Our datasets contain both free and non-free apps. We need to isolate only free apps (cause I like apps which are free :)) )

In [49]:
android_final = []
ios_final = []

for app in android_eng:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_eng:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(ios_final))


3222


There are `3222` free IOS apps omg! (i'm using android though :v)

This step is also the last step of `cleaning data`! Congratulation!

---
# Most common Apps by Genre
## Part 1
Our aim is to determine the kinds of apps that are likely to attract more user cause our revenue is highly influenced bu the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

## Part 2
We'll build two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order

In [50]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

## Part 3
We start by examining the frequency table for the prime_genre column of the App Store data set.

In [51]:
display_table(ios_final,-5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that among the free English apps, more than a half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps which amount for 3.29% of the apps in our data set.

The general impression is that App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are more rare. However, the fact that fun apps are the most numerous doesn't also imply that they also have the greatest number of users — the demand might not be the same as the offer.

Let's continue exam `Categoty` and `Genres` comlumns of android dataset

In [52]:
display_table(android_final,1) #Category

FAMILY : 17.739043824701195
GAME : 10.56772908366534
TOOLS : 7.6195219123505975
BUSINESS : 4.442231075697211
PRODUCTIVITY : 3.944223107569721
LIFESTYLE : 3.6155378486055776
SPORTS : 3.5856573705179287
COMMUNICATION : 3.5856573705179287
MEDICAL : 3.5258964143426295
FINANCE : 3.4760956175298805
HEALTH_AND_FITNESS : 3.237051792828685
PHOTOGRAPHY : 3.117529880478088
PERSONALIZATION : 3.0776892430278884
SOCIAL : 2.908366533864542
NEWS_AND_MAGAZINES : 2.7988047808764938
SHOPPING : 2.5697211155378485
TRAVEL_AND_LOCAL : 2.450199203187251
DATING : 2.2609561752988045
BOOKS_AND_REFERENCE : 2.0219123505976095
VIDEO_PLAYERS : 1.7031872509960162
EDUCATION : 1.5139442231075697
ENTERTAINMENT : 1.4641434262948207
MAPS_AND_NAVIGATION : 1.3147410358565739
FOOD_AND_DRINK : 1.245019920318725
HOUSE_AND_HOME : 0.8764940239043826
LIBRARIES_AND_DEMO : 0.8366533864541833
AUTO_AND_VEHICLES : 0.8167330677290837
WEATHER : 0.7370517928286853
EVENTS : 0.6274900398406374
ART_AND_DESIGN : 0.6175298804780877
COMICS : 0

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 17% of the apps) means mostly games for kids.

In [53]:
display_table(android_final,-4) #Genres

Tools : 7.609561752988048
Entertainment : 6.01593625498008
Education : 5.169322709163347
Business : 4.442231075697211
Productivity : 3.944223107569721
Sports : 3.7250996015936253
Lifestyle : 3.6055776892430282
Communication : 3.5856573705179287
Medical : 3.5258964143426295
Finance : 3.4760956175298805
Action : 3.396414342629482
Health & Fitness : 3.237051792828685
Photography : 3.117529880478088
Personalization : 3.0776892430278884
Social : 2.908366533864542
News & Magazines : 2.7988047808764938
Shopping : 2.5697211155378485
Travel & Local : 2.4402390438247012
Dating : 2.2609561752988045
Books & Reference : 2.0219123505976095
Arcade : 1.9920318725099602
Simulation : 1.902390438247012
Casual : 1.8326693227091633
Video Players & Editors : 1.6832669322709164
Maps & Navigation : 1.3147410358565739
Food & Drink : 1.245019920318725
Puzzle : 1.205179282868526
Racing : 0.9462151394422311
Strategy : 0.9362549800796812
House & Home : 0.8764940239043826
Role Playing : 0.8665338645418327
Libraries

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.

---
# Most Popular Apps by Genre on the App Store
One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot app.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [54]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [55]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages, but we'll leave this level of detail for later.

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com which skew up the average rating:

In [56]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seem to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

- Weather apps — people generally don't spend too much time in-app, and the chances of making profit from in-app adds are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

- Food and drink — examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

- Finance apps — these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

I also want to analyze GooglePlay Store as well.... but i'm hungry so we may end here

---
# conclusion
Actually, there's no conclusion here cause i'm so freaking lazy :v If it's fun, read again :D 