## Profitable App Profiles for the App Store and Google Play Markets

### 1. Introduction
This project analyzes mobile app data to help developers understand what kinds of apps are likely to attract more users. The goal is to gain an understanding of the data and provide accurate information.

### 2. Opening and Exploring the data
As stated in the introduction, our aim is to help our developers understand the types of apps that are more likely to attract users in both stores. To do this, we'll need to collect and analyze data about the mobile apps available.

In [1]:
from csv import reader

# Open each file with a context manager so it is closed once it has been read.
### The Apple Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]  # first row holds the column names
ios = ios[1:]        # remaining rows hold the apps

### The Google Play Store data set ###
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)        
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


From the Google Play Store header we can see that, of the 13 columns, the most useful for analyzing the data might be App, Category, Genres, Installs, Type, and Price.

In [4]:
print(ios_header)        
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The App Store data set has 16 columns. Of these, track_name, currency, price, rating_count_tot, rating_count_ver, and prime_genre may be the most useful. For more information on the columns, see the [column information](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

### 3. Deleting Wrong Data
Before beginning the analysis we need to ensure the accuracy of the data: this means detecting inaccurate or duplicate data and removing or correcting it as needed. Our audience is also an English-speaking one, so we'll need to remove non-English apps as well as apps that aren't free.

In [5]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [6]:
print(len(android))
del(android[10472])
print(len(android))

10841
10840
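
Row 10472 stands out in the printout above because it has 12 fields where the header has 13: the Category value is missing, so every later value sits one column to the left. Such rows could also be found automatically instead of by inspection. The following is a minimal sketch; `find_malformed_rows` and the shortened sample rows are illustrative and not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, field_count) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Shortened, illustrative sample modelled on the rows printed above.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # Category is missing
]

print(find_malformed_rows(sample_rows, sample_header))
```

Applied to the full data as `find_malformed_rows(android, android_header)`, the same check would list every row that needs a closer look before deletion.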


### 4. Removing Duplicate Entries

If we examine the data set long enough, we will notice that some apps have duplicate entries. For instance, Instagram has four entries.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once:

In [8]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
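
The loop above checks membership in a list, which rescans the list for every row. As a possible alternative, the same count can be obtained with `collections.Counter` from the standard library. This is only a sketch: `count_duplicates` and the sample rows are hypothetical names and data introduced for illustration.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Count rows that repeat an app name already seen earlier in the data set."""
    counts = Counter(row[name_index] for row in dataset)
    # A name that occurs n times contributes n - 1 redundant rows.
    return sum(n - 1 for n in counts.values())


# Illustrative sample: each row holds only the app name.
sample = [['Instagram'], ['Box'], ['Instagram'], ['Slack'], ['Box'], ['Instagram']]
print(count_duplicates(sample))
```

Here Instagram contributes two redundant rows and Box one, which is the same way the list-based loop counts them.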


We don't want to count apps more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app. There are a few ways we could do this, one being to remove rows at random, but that isn't necessarily the best way.

If we examine the rows printed earlier for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. We'll keep the entry with the highest number of reviews, since a higher count suggests the row was collected more recently.

In [9]:
reviews_max ={}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

In [10]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Since there are 1,181 cases where an app occurs more than once, the data set filtered by highest review count should contain 1,181 fewer rows than the original, and the length of the reviews_max dictionary confirms this. Now let's clean up the data.

In [11]:
android_clean=[] ##New Cleaned data set
already_added=[] ##Storing App Names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if(n_reviews == reviews_max[name] and name not in already_added):
        android_clean.append(app)
        already_added.append(app[0])

In [12]:
print('Cleaned data set length:', len(android_clean))

Cleaned data set length: 9659


Our cleaned data set matches the expected length.
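
The two passes above (first finding the maximum review count, then collecting the rows) could also be folded into a single pass that stores the best row per app directly. The sketch below assumes the same column layout as the Google Play data (name at index 0, reviews at index 3); `keep_most_reviewed` is a hypothetical helper, and the second sample row is made up for illustration.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows; 'Example App' is invented for illustration.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Example App', 'TOOLS', '4.0', '120'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

The resulting rows may come out in a different order than in android_clean, because each app keeps the position of its first occurrence, but the number of rows is the same.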

### 5. Removing Non-English Apps
Since we are not interested in non-English apps, we need a function that determines whether an app name contains non-ASCII characters. English text usually includes only letters from the English alphabet, numbers, punctuation marks, and other common symbols, all of which are encoded in the ASCII standard with code points from 0 to 127. Any character with a code point above 127 is a non-ASCII character.

In [13]:
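def detectNonEnglish(stringToTest):
    # Note: despite its name, this returns True when every character is ASCII
    # (treated here as English) and False as soon as a non-ASCII one is found.
    for char in stringToTest:
        if ord(char) > 127:
            return False
    return True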

In [14]:
print(detectNonEnglish('Instagram'))
print(detectNonEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(detectNonEnglish('Docs To Go™ Free Office Suite'))
print(detectNonEnglish('Instachat 😜'))

True
False
False
False


While this would be sufficient in many cases, some English apps use emojis or other symbols that fall outside the ASCII range. We would lose some useful apps if we used the function in its current form.

In [15]:
print(detectNonEnglish('InstaSnap 😜'))
print(detectNonEnglish('Food To Door™ Ordering App'))
print(ord('™'))
print(ord('😜'))

False
False
8482
128540


In [16]:
def isEnglish3(stringToTest):
    count = 0
    for char in stringToTest:
        if ord(char) > 127:
            count+=1
    if(count > 3):
        return False
    else:
        return True

While not perfect, allowing up to three non-ASCII characters gives us some leeway and lets us get on with our analysis.

In [17]:
print(isEnglish3('Instagram'))
print(isEnglish3('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish3('Docs To Go™ Free Office Suite'))
print(isEnglish3('Instachat 😜'))

True
False
True
True
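
The threshold idea can be written more compactly with `str.isascii()`, which is available from Python 3.7. The sketch below is an equivalent formulation rather than the notebook's own function; `is_probably_english` and its `max_non_ascii` parameter are hypothetical names.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if not ch.isascii())
    return non_ascii <= max_non_ascii


for name in ['Instagram',
             '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Docs To Go™ Free Office Suite',
             'Instachat 😜',
             '爱奇艺']:
    print(name, is_probably_english(name))
```

The last name shows the limit of the heuristic: a non-English name with three or fewer non-ASCII characters still passes the check.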


### 6. Isolating the Free Apps
Now that we have a tool to separate out the non-English apps, we need to isolate the free apps from the paid apps. First, we filter the non-English apps out of both data sets.

In [18]:
android_english=[] ##New Cleaned data set
ios_english=[]
for app in android_clean:
    name = app[0]
    if(isEnglish3(name)):
        android_english.append(app)
for app in ios:
    name = app[1]
    if(isEnglish3(name)):
        ios_english.append(app)


In [19]:
print('Data Set that is English length:', len(android_english))
print('Data Set that is English length:', len(ios_english))
print('\n')
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

Data Set that is English length: 9614
Data Set that is English length: 6183


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans'

After removing the non-English apps we are left with:<br>
Google Play Store: 9614 apps<br>
App Store:         6183 apps

Now that we have removed the non-English apps, we need to filter the data further and keep only the apps that are free, because our main source of revenue consists of in-app ads.

In [20]:
android_final=[]
ios_final=[]

for app in android_english:
    if(app[7]=='0'):
        android_final.append(app)
for app in ios_english:
    if(app[4]=='0.0'):
        ios_final.append(app)


In [21]:
explore_data(android_final, 0, 3, True)
print('\n')
explore_data(ios_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

In [22]:
print('Android Dataset Final length:', len(android_final))
print('Ios Dataset Final length:', len(ios_final))

Android Dataset Final length: 8864
Ios Dataset Final length: 3222


After cleaning and filtering the data we are left with:<br>
Google Play Store: 8864 apps<br>
App Store:         3222 apps
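
The free-app filter in this section compares price strings literally ('0' for Google Play, '0.0' for the App Store). A slightly more defensive variant would parse the price as a number first. The sketch below rests on assumptions: that a price might also be written as '0.00' or carry a leading '$', and that unparseable values should count as not free. `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True if the price string parses to zero; unparseable values count as not free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: paid prices may look like '$4.99'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


for price in ['0', '0.0', '0.00', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```

With this helper, both stores could share one condition, for example `is_free(app[7])` for Google Play rows and `is_free(app[4])` for App Store rows.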

### 7. Most Common Apps by Genre

Our aim, as mentioned, is to determine what kinds of apps are more likely to attract users, because our revenue is highly influenced by the number of people using the apps.
A validation strategy for an app comprises three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app is successful and has a good user base, expand it.
3. If the app is profitable and successful for six months, develop an iOS app and add it to the App Store.

Since our end goal is to have the app on both Google Play and the App Store, we should look for app profiles that are successful in both markets. Looking at the most common genres may provide a good starting point.

For this we'll build two functions to analyze frequency tables:
- One to generate frequency tables that show percentages.
- Another to display the percentages in descending order.

In [23]:
def freq_table(dataset, index):
    table={}
    total=0
    for row in dataset:
        total+=1
        value = row[index]
        if value in table:
            table[value]+=1
        else:
            table[value] = 1
            
    table_percentages={}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
    return table_percentages

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
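
The two functions above can also be combined using `collections.Counter`, whose `most_common()` method returns entries sorted by count in descending order. This is a sketch of an alternative, not a replacement for the functions above: `freq_table_sorted` and the sample rows are hypothetical. One difference to note is that ties come out in first-seen order, whereas display_table breaks ties by key in reverse alphabetical order.

```python
from collections import Counter


def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Illustrative sample rows: (name, genre).
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for value, percentage in freq_table_sorted(sample, 1):
    print(value, ':', percentage)
```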

In [25]:
display_table(ios_final, -5)  # prime_genre


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


On iOS, Games are by far the most common genre, making up about 58% of the free English apps, followed by Entertainment at nearly 8%. Adding Photo & Video (about 5%) brings the total to roughly 71%, so most apps are designed for entertainment rather than productivity.

In [26]:
##print(android_header)
display_table(android_final,1)#Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

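Interestingly, the Android data seems to lean more toward productivity: practical categories such as TOOLS, BUSINESS, PRODUCTIVITY, and FINANCE all rank near the top. The top category, FAMILY at about 19%, also has a much smaller share than the top iOS genre (Games at about 58%).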

In [27]:
display_table(android_final,9)#Genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075