# Profitable App Profiles for the App Store and Google Play Markets

My aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. I am working as a hypothetical data analyst for a company that builds Android and iOS mobile apps, and my job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.

At my company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data
As of September 2018, there were around 2 million iOS apps on the App Store and 2.1 million Android apps on Google Play.

I am using two datasets, one for each store:

* Google Play: https://www.kaggle.com/lava18/google-play-store-apps
* App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps


In [1]:
from csv import reader

# Opening the App Store dataset
open_apple = open('AppleStore.csv', encoding='utf8')
read_apple = reader(open_apple)
apple_data = list(read_apple)
apple_header = apple_data[0]
apple = apple_data[1:]

#Opening the Play Store dataset
open_android = open('googleplaystore.csv', encoding='utf8')
read_android = reader(open_android)
android_data = list(read_android)
android_header = android_data[0]
android = android_data[1:]
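
The cells above never close the two file handles. A minimal sketch of the same loading step wrapped in a helper, assuming a hypothetical function name load_csv (it is not part of the original notebook); the with block closes the file as soon as the rows have been read:

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at `path`. Hypothetical helper."""
    # newline='' is the setting the csv module documentation recommends
    # for file objects that are passed to reader()
    with open(path, encoding='utf8', newline='') as f:
        data = list(reader(f))
    return data[0], data[1:]
```

With this helper, the App Store data could be loaded as apple_header, apple = load_csv('AppleStore.csv'), and the Google Play data the same way.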

To make the datasets easier to read, I am going to create a function called explore_data().

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice: 
        print(row)
        print('\n')  # add new line between the rows 
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Print the App Store header for context, then the first three Google Play rows
print(apple_header)
print('\n')
explore_data(android,0,3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Cleaning the data
The company only builds free apps for an English-speaking audience, so I need to remove the non-English apps.

### Deleting the wrong data
According to the discussion section of the Google Play dataset, the rating in row 10472 is wrong.

In [3]:
print(android_header)
print('\n')
print(android[10472]) # This row is messed up
print('\n')
print(android[0]) # What a correct row should look like


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


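We can see that row 10472 has a rating of 19, which cannot be right because ratings only range from 0 to 5. The row has 12 values instead of 13: the Category value is missing, so every value after the app name is shifted one column to the left. This is why I'm going to delete it.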

In [4]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
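
Deleting by a hard-coded index works once, but running the cell a second time would delete a different, valid row. A possible alternative, sketched here on tiny stand-in data rather than the real dataset, is to look for rows whose length differs from the header's. The function name find_malformed_rows is made up for this example:

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Stand-in data (not the real dataset): the second row is missing 'Category'
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

bad = find_malformed_rows(header, rows)
print(bad)  # [1]

# Delete from the end so the remaining indexes stay valid
for i in reversed(bad):
    del rows[i]
```

This only catches rows with a missing or extra value; a row with the right number of columns but a wrong value would still need a separate check.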


### Removing duplicate values

The discussion section of the Play Store dataset also said that there were a few duplicate entries. One example is Instagram, which appears 4 times:

In [5]:
for app in android: 
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 duplicate entries, that is, rows whose app name has already appeared earlier in the dataset.

In [6]:
duplicate_apps = []
unique_apps = []

for app in android: 
    name = app[0]
    if name in unique_apps: 
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
        
print("Number of duplicate apps:", len(duplicate_apps))
print('\n')
print("Examples of duplicate apps:", duplicate_apps[0:9])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business']
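
Checking name in unique_apps against a list gets slower as the list grows. As a sketch of another way to get the same count, collections.Counter from the standard library tallies every name in one pass. The function name count_duplicates and the sample rows are made up for illustration:

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of extra entries, list of names that repeat)."""
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first counts as one duplicate entry
    extra_entries = sum(n - 1 for n in counts.values())
    repeated_names = [name for name, n in counts.items() if n > 1]
    return extra_entries, repeated_names

# Stand-in rows holding only the app name
rows = [['Box'], ['Instagram'], ['Box'], ['Box'], ['Zenefits']]
extra, repeated = count_duplicates(rows)
print(extra)     # 2
print(repeated)  # ['Box']
```

The first value counts duplicate entries the same way as len(duplicate_apps) above, while the second lists each repeated name only once.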


I am now going to remove the duplicate entries, and just keep one entry per app. 

I'm going to do this by keeping the entry with the highest number of reviews for each app. Looking at the Instagram duplicates, the fourth column holds the number of reviews, and the higher the number, the more recent and reliable the data should be.

To do this, I will:
* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app 
* Use the dictionary to create a new data set, which will only have one entry per app 

In [7]:
reviews_max = {} 

for row in android: 
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews: 
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max: 
        reviews_max[name] = n_reviews

I previously found 1,181 duplicate entries, so if the dictionary is accurate, its length should equal the number of rows in the dataset minus 1,181.

In [8]:
print("Expected length:", len(android) - 1181)
print("Actual length:", len(reviews_max))

Expected length: 9659
Actual length: 9659


Using the reviews_max dictionary, I am now going to remove the duplicate rows.

In [9]:
android_clean = []
already_added = []

for row in android:  
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added: 
        android_clean.append(row) 
        already_added.append(name)

Now to make sure that the android_clean dataset is properly cleaned (it should have 9659 rows):

In [10]:
explore_data(android_clean, 0, 3, True) 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Nice, we have 9659 rows as expected. :) 
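
The two passes above (build reviews_max, then filter) could also be folded into one. A minimal sketch, assuming a hypothetical helper name dedupe_by_max_reviews, that stores the whole row with the most reviews for each name; the Box row is made-up sample data:

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # made-up values
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
clean = dedupe_by_max_reviews(sample)
print(len(clean))   # 2
print(clean[0][3])  # 66577446
```

This should keep the same rows as the two-pass version, but they come out ordered by each name's first appearance, so the row order can differ from android_clean.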

### Removing non-English apps 

Unfortunately, a few app names aren't in English. Here are some examples:

In [11]:
print(apple[813][1])
print(apple[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


I will remove these apps. 

The characters commonly used in English text all belong to the ASCII standard, where each character has a corresponding number between 0 and 127. We can therefore build a function that checks whether an app name contains non-ASCII characters.

In [12]:
def is_english(string):
    for character in string: 
        if ord(character) > 127: 
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


Nice, the function seems to work fine. However, some English apps use emojis or other symbols like ™ that fall outside the ASCII range, so this version of the function could remove useful apps.

In [13]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


In order to minimise data loss, I am only going to remove an app if its name has more than three non-ASCII characters: 

In [14]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
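
For comparison, here is a compact sketch of the same rule with the threshold exposed as a parameter. The parameter name max_non_ascii is introduced here and is not part of the notebook:

```python
def is_english(string, max_non_ascii=3):
    """Return True if `string` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Docs To Go™ Free Office Suite'))    # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_english('爱奇艺'))                            # True
```

The last call shows a limit of the heuristic: a non-English name with three or fewer non-ASCII characters still passes the filter.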


Time to use the is_english() function on our datasets: 

In [15]:
android_english = []
apple_english = []

for app in android_clean: 
    name = app[0]
    if is_english(name) == True:
        android_english.append(app)

for app in apple: 
    name = app[0]
    if is_english(name) == True: 
        apple_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

That leaves me with 9614 Android apps and 7197 Apple apps.
(Hold on, the Apple count didn't change for some reason, so the filter removed nothing from that dataset.)

In [16]:
print(apple_english[813][1])

爱奇艺PPS -《欢乐颂2》电视剧热播


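This bug still needs fixing. The likely cause is the App Store loop in cell 15: it checks app[0], which is the id column, while the app name (track_name) is at index 1. Every id is a string of digits, so every row passes the check.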
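
A minimal sketch of a fix, using a few stand-in rows rather than the full dataset. In the App Store header, id sits at index 0 and track_name at index 1, so the filter needs to read index 1. The third id below is a made-up placeholder:

```python
def is_english(string):
    # Same rule as the notebook: allow at most three non-ASCII characters
    return sum(1 for character in string if ord(character) > 127) <= 3

# Stand-in rows in App Store column order: id first, then track_name
apple_sample = [
    ['284882215', 'Facebook'],
    ['389801252', 'Instagram'],
    ['0000000000', '爱奇艺PPS -《欢乐颂2》电视剧热播'],  # placeholder id
]

by_id = [app for app in apple_sample if is_english(app[0])]    # what cell 15 does
by_name = [app for app in apple_sample if is_english(app[1])]  # proposed fix

print(len(by_id))    # 3
print(len(by_name))  # 2
```

Applied to the notebook, this would mean changing name = app[0] to name = app[1] in the App Store loop and re-running cells 15 and 16.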