# App Profiles in the App Store and Google Play Markets
Our company builds only apps that are free to download and install, and our main source of revenue is in-app ads. This means that revenue for any given app depends mostly on the number of people who use it: the more users who see and engage with the ads, the better.

Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.
We will analyze the data from two angles:
1. Supply: how many apps of each type already exist
2. Demand: how many users each type of app attracts

# Opening and Exploring the data
The first data set contains data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

The second data set contains data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

First, we open both data sets and read them into lists of rows, keeping the header row separate from the data.

In [2]:
from csv import reader

# App Store data: header row first, then one row per app
opened_file1=open('AppleStore.csv',encoding='utf8')
read_file1=reader(opened_file1)
apple=list(read_file1)
header_app=apple[0]
data_app=apple[1:]

# Google Play data, split the same way
opened_file2=open("googleplaystore.csv",encoding='utf8')
read_file2=reader(opened_file2)
google=list(read_file2)
header_goo=google[0]
data_goo=google[1:]
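
Both files are loaded with the same steps, so the pattern could be folded into one helper. The sketch below is one possible version, not part of the original notebook: the name `load_csv` and the default `encoding="utf8"` are assumptions. The `with` block closes the file once it has been read.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`."""
    # The with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage with the file names used in this notebook:
# header_app, data_app = load_csv("AppleStore.csv")
# header_goo, data_goo = load_csv("googleplaystore.csv")
```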

We create a function called explore_data() that prints a slice of rows from a data set and, optionally, the number of rows and columns in that data set.

In [3]:
def explore_data(dataset,start,end,rows_and_columns=False):
    new_dataset=dataset[start:end]
    for row in new_dataset:
        print(row)
        print('\n')
    if rows_and_columns:
        print("The number of rows is: ", len(dataset))
        print("The number of columns is: ", len(dataset[0]))
        
print(header_goo)
explore_data(data_goo,0,3,True)

print(header_app)
explore_data(data_app,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


The number of rows is:  10841
The number of columns is:  13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

To keep things readable, we note the columns that matter for our analysis. In the Google Play data they are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

In the App Store data they are 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

# Cleaning Wrong Data 
Before we begin our analysis, we need to remove data that is wrong or irrelevant, such as apps that are not free (our company builds only free apps) and non-English apps.

The Google Play data set has a [dedicated discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472. Let's print this row and compare its length against the header.

In [4]:
print(data_goo[10472])
print(header_goo)
print(len(data_goo[10472]))
print(len(header_goo))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
12
13


We can see that the row at index 10472 of the Google Play data has only 12 values, one fewer than the header. Comparing it with the header shows that the 'Category' value is missing, so every later value is shifted one column to the left. We delete this row.

In [5]:
del data_goo[10472]
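
Deleting by a hard-coded index works once, but running the cell a second time would remove a different, valid row. A more general check compares each row's length with the header's. This is only a sketch: `find_malformed_rows` is a hypothetical helper and the sample rows are made up.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Made-up sample: the second row is missing one column.
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],
    ["App C", "GAME", "3.9"],
]
bad = find_malformed_rows(sample_header, sample_rows)
print(bad)  # [1]

# Delete from the end so the remaining indexes stay valid.
for i in reversed(bad):
    del sample_rows[i]
```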


# Cleaning Duplicate Entries

## Part one
From the discussion section, we notice that some apps have duplicate entries. We now create a function to find the duplicate rows.

In [6]:
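def duplicate(data):
    # Build the lists inside the function so repeated calls start fresh
    unique_name=[]
    duplicate_name=[]
    for row in data:
        name=row[0]
        if name in unique_name:
            duplicate_name.append(name)
        else:
            unique_name.append(name)
    return duplicate_name

print(len(duplicate(data_goo)))
print(len(data_goo))
# Expected number of rows once the duplicates are removed
expected_length_data_goo=len(data_goo)-len(duplicate(data_goo))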
print(expected_length_data_goo)

1181
10840
9659


We can see that there are 1181 duplicate rows, so the expected length of data_goo after removing duplicates is 9659.
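
Because `name in unique_name` scans a list, the check above gets slower as the list grows. The same count can be sketched with `collections.Counter` from the standard library. The name `count_duplicates` is hypothetical and the sample rows are made up.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return how many rows repeat a name that already appeared earlier."""
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first counts as a duplicate.
    return sum(n - 1 for n in counts.values())

sample = [["Instagram"], ["Slack"], ["Instagram"], ["Instagram"]]
print(count_duplicates(sample))  # 2
```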

Next, we decide which entry to keep for each app. Our criterion is the highest number of reviews, so the function find_max records the highest review count seen for each app name.

In [7]:
reviews_max={}
def find_max(data):
    for row in data:
        name=row[0]
        review=float(row[3])
        if name in reviews_max and reviews_max[name]<review:
            reviews_max[name]=review
        elif name not in reviews_max:
            reviews_max[name]=review
find_max(data_goo)            
print(len(reviews_max))

9659


## Part Two
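We now create the function remove, which drops the duplicate entries and builds a new data set, clean_goo, containing only the entries we want to keep. The already_add list handles apps whose duplicate rows share the same highest review count, so that only one of them is kept.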

In [8]:
clean_goo=[]
already_add=[]
def remove(data):
    for row in data:
        name=row[0]
        review=float(row[3])
        if review==reviews_max[name] and name not in already_add:
            clean_goo.append(row)
            already_add.append(name)
remove(data_goo)        
print(len(clean_goo))
print(expected_length_data_goo)


9659
9659
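
The two passes above (find_max, then remove) could also be merged into one pass that stores the best row per app in a dictionary. This is a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical name, the sample rows are made up, and the result is ordered by each name's first appearance, which may differ from the order of clean_goo.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ["A", "x", "4.0", "10"],
    ["B", "x", "4.0", "5"],
    ["A", "x", "4.1", "25"],
    ["A", "x", "4.1", "25"],
]
for row in keep_most_reviewed(sample):
    print(row)
```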


# Removing non-English Entries


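Removing non-English entries from the Google Play data. We treat an app as non-English if its name contains more than three characters outside the ASCII range (code point above 127). Allowing up to three such characters keeps English names that include a few emoji or symbols such as ™.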

In [9]:
goo=[]
def english_goo(data):
    for row in data:
        name=row[0]
        counter=0
        for c in name:
            if ord(c)>127:
                counter+=1
        if counter<=3:
            goo.append(row)
            
english_goo(clean_goo)
print(len(goo))
                
                

9614
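
The rule used above can be isolated in a small predicate, which makes it easy to try on individual names. This is a sketch: `is_english` and its `max_non_ascii` parameter are hypothetical names, and the threshold of 3 mirrors the `counter<=3` test in the cell above.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for c in name if ord(c) > 127)
    return non_ascii <= max_non_ascii

print(is_english("Instagram"))                      # True
print(is_english("Docs To Go™ Free Office Suite"))  # True
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))  # False
```

The heuristic is imperfect: a non-English name written entirely in ASCII letters still passes, and an English name with more than three emoji is dropped.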


Removing non-English entries from the App Store data (here the app name is in column 1)

In [10]:
app=[]
def english_app(data):
    for row in data:
        name=row[1]
        counter=0
        for c in name:
            if ord(c)>127:
                counter+=1
        if counter<=3:
            app.append(row)
english_app(data_app)
print(len(app))    


6183


## Isolating the Free Apps

In [11]:
final_app=[]
def free_app(data):
    for row in data:
        price=row[4]
        if price=='0.0':
            final_app.append(row)

final_goo=[]
def free_goo(data):
    for row in data:
        price=row[7]
        if price=='0':
            final_goo.append(row)
free_app(app)
free_goo(goo)
print(len(final_app))
print(len(final_goo))

3222
8864
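
The two filters above compare the price against the exact strings '0.0' and '0', which depends on how each file happens to format a zero price. A numeric comparison is less brittle. The sketch below assumes that paid prices may carry a leading `$`; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$4.99' equals zero."""
    # Assumption: the only non-numeric character in a price is a '$' sign.
    return float(price_text.replace("$", "")) == 0.0

print(is_free("0"))      # True
print(is_free("0.0"))    # True
print(is_free("$4.99"))  # False
```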


## Analyzing Data by Genre
After cleaning, final_app holds the App Store data and final_goo holds the Google Play data. We now build frequency tables to find the most common genres in both markets.

In [12]:
def freq(data,index):
    genre_list={}
    total=len(data)
    for row in data:
        genre=row[index]
        if genre in genre_list:
            genre_list[genre]+=1
        else:
            genre_list[genre]=1
            
    freq_table=[]
    for key in genre_list:
        freq=(key, genre_list[key]/total*100)
        freq_table.append(freq)
        
    new_table=sorted(freq_table,key=lambda freq: freq[1])
    return new_table
    
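def display(table):
    # Print entries from highest to lowest without changing the input list
    for entry in reversed(table):
        print(entry[0],":",entry[1])


def common(dic):
    # Return the key with the largest value in a frequency dictionary
    largest=0
    most_common=""
    for entry in dic:
        if dic[entry]>largest:
            largest=dic[entry]
            most_common=entry
    return most_common
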
 
print("Category of google data")
display(freq(final_goo,1))
print("\n")
#print("Genre of google data")
#display(freq(final_goo,-4))
print("\n")
print("Prime Genre of Apple data")
display(freq(final_app,-5))

Category of google data
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299

From the data above, we can conclude that the most common category on Google Play is Family (games, etc.), but the market is fairly balanced, with a large share of more practical apps. Games, by contrast, dominate the App Store to a large extent.

In my opinion, our company should consider developing apps in categories that are less common, because the crowded categories have too many competitors. It is also easier to stand out in a category where few apps currently exist.
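
The frequency table built by freq() can also be sketched with `collections.Counter`, whose `most_common()` method returns the entries already sorted from most to least frequent. The name `freq_percent` is hypothetical and the sample rows are made up.

```python
from collections import Counter

def freq_percent(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return [(key, n / total * 100) for key, n in counts.most_common()]

sample = [["a", "GAME"], ["b", "GAME"], ["c", "TOOLS"], ["d", "FAMILY"]]
for genre, share in freq_percent(sample, 1):
    print(genre, ":", share)
```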

# Most Popular Genre
We now want to find out which genres have the most users, that is, which genres have a high volume of installs and therefore higher demand.

## Apple Store
Note that the App Store data set has no information on the number of installs, so we use the total number of user ratings (rating_count_tot) as a proxy.

In [13]:
def pop(data,index1,index2):
    genre_list={}
    for row in data:
        genre=row[index1]
        install=float(row[index2])
        if genre in genre_list:
            genre_list[genre]+=install
        else:
            genre_list[genre]=install
    popular=[]
    for entry in genre_list:
        total=0
        for row in data:
            genre=row[index1]
            if genre==entry:
                total+=1
        average=genre_list[entry]/total
        pop_tup=(entry,average)
        popular.append(pop_tup)
    new_popular=sorted(popular,key=lambda pop: pop[1])
    return new_popular

display(pop(final_app,-5,5))
      

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


Let's look in more detail at the most popular genres, starting with Navigation.

In [14]:
for entry in final_app:
    genre=entry[-5]
    install=float(entry[5])
    name=entry[1]
    if genre=="Navigation":
        print(name,":",install)

Waze - GPS Navigation, Maps & Real-time Traffic : 345046.0
Google Maps - Navigation & Transit : 154911.0
Geocaching® : 12811.0
CoPilot GPS – Car Navigation & Offline Maps : 3582.0
ImmobilienScout24: Real Estate Search in Germany : 187.0
Railway Route Search : 5.0


Combining this with our earlier analysis of the most common genres, we know that Navigation accounts for only a small share of apps in the App Store, yet it has the highest average number of user ratings, our proxy for installs. This is because apps like "Waze" and "Google Maps" dominate the category. Developing a navigation app also requires substantial technical and financial resources. We therefore do not recommend that our company develop an app in this category: we would be vulnerable in a small category that is already dominated by big companies.
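
To put a number on how concentrated the Navigation genre is, we can compute the share of all ratings held by its largest apps, using the counts printed above. The helper `top_share` is a hypothetical name, and looking at the top two apps is an arbitrary choice.

```python
def top_share(values, top_n=2):
    """Return the fraction of the total held by the `top_n` largest values."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(sorted(values, reverse=True)[:top_n]) / total

# Rating counts of the six Navigation apps listed in the output above.
navigation = [345046.0, 154911.0, 12811.0, 3582.0, 187.0, 5.0]
print(round(top_share(navigation) * 100, 1))
```

Waze and Google Maps together hold roughly 97% of the ratings, which is why the genre average says little about what a new app could expect.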


In [15]:
for entry in final_app:
    genre=entry[-5]
    install=float(entry[5])
    name=entry[1]
    if genre=="Reference":
        print(name,":",install)

Bible : 985920.0
Dictionary.com Dictionary & Thesaurus : 200047.0
Dictionary.com Dictionary & Thesaurus for iPad : 54175.0
Google Translate : 26786.0
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418.0
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588.0
Merriam-Webster Dictionary : 16849.0
Night Sky : 12122.0
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535.0
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693.0
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497.0
Guides for Pokémon GO - Pokemon GO News and Cheats : 826.0
WWDC : 762.0
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718.0
VPN Express : 14.0
Real Bike Traffic Rider Virtual Reality Glasses : 8.0
教えて!goo : 0.0
Jishokun-Japanese English Dictionary & Translator : 0.0


"Reference" is also a rather small category, as we saw earlier, yet it ranks second by average number of user ratings. Although it is dominated by apps like "Bible" and "Dictionary.com", several of the remaining apps still have a decent number of ratings. Developing a reference app is also less costly and less technically demanding. It therefore seems like a good category in which to enter the market.

## Google Play

In [16]:
def pop(data,index1,index2):
    genre_list={}
    for row in data:
        genre=row[index1]
        install=row[index2].replace(",","")
        install=install.replace("+",'')
        install=float(install)
        if genre in genre_list:
            genre_list[genre]+=install
        else:
            genre_list[genre]=install
    popular=[]
    for entry in genre_list:
        total=0
        for row in data:
            genre=row[index1]
            if genre==entry:
                total+=1
        average=genre_list[entry]/total
        pop_tup=(entry,average)
        popular.append(pop_tup)
    new_popular=sorted(popular,key=lambda pop: pop[1])
    return new_popular

display(pop(final_goo,1,5))

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
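
The 'Installs' column stores brackets such as '1,000,000+' rather than exact numbers, which is why pop() strips the commas and the plus sign before converting. That step can be isolated as below. The name `parse_installs` is hypothetical, and the result is only a lower bound on the true install count.

```python
def parse_installs(text):
    """Convert an install bracket such as '1,000,000+' to the number 1000000.0."""
    return float(text.replace(",", "").replace("+", ""))

print(parse_installs("1,000,000+"))  # 1000000.0
print(parse_installs("500+"))        # 500.0
```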

In [18]:
for entry in final_goo:
    genre=entry[1]
    install=entry[5]
    name=entry[0]
    if genre=="COMMUNICATION":
        print(name,":",install)

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

In [20]:
for entry in final_goo:
    genre=entry[1]
    install=entry[5]
    name=entry[0]
    if genre=="BOOKS_AND_REFERENCE":
        print(name,":",install)

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We can see that installs in the "BOOKS_AND_REFERENCE" category on Google Play are spread fairly evenly across many apps rather than concentrated in a few giants. Even though the category has a large number of competitors, it is still less competitive than other categories.
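
One way to check how spread out the installs are is to count how many apps fall into each install bracket. The sketch below uses five apps taken from the listing above; `installs_distribution` is a hypothetical name, and the empty strings stand in for columns the example does not need.

```python
from collections import Counter

def installs_distribution(rows, installs_index=5):
    """Return (bracket, app count) pairs, largest install bracket first."""
    counts = Counter(row[installs_index] for row in rows)

    def lower_bound(bracket):
        # '10,000,000+' -> 10000000.0
        return float(bracket.replace(",", "").replace("+", ""))

    return sorted(counts.items(), key=lambda item: lower_bound(item[0]), reverse=True)

# Only the name (column 0) and installs (column 5) are filled in.
sample = [
    ["Wikipedia", "", "", "", "", "10,000,000+"],
    ["Cool Reader", "", "", "", "", "10,000,000+"],
    ["Book store", "", "", "", "", "1,000,000+"],
    ["Flybook", "", "", "", "", "500,000+"],
    ["Google Play Books", "", "", "", "", "1,000,000,000+"],
]
for bracket, n_apps in installs_distribution(sample):
    print(bracket, ":", n_apps)
```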

In conclusion, taking into account both supply and demand for apps in different categories, we found that "Books and reference" has relatively little competition but many users. We therefore believe there is potential in developing apps for this genre. However, we need to be innovative, because dictionaries and e-book readers already dominate the category. We could try building an app that combines several features:
1. Let users read books and look up words at the same time
2. Let users highlight passages and take notes
3. Provide a platform for readers to discuss and comment on the books
4. Link to background on the story or information about the author
5. Let users rate the books