# Profitable App Profiles for the App Store and Google Play Markets

# What is the project about?
This project analyzes Android and iOS applications, their market share, and their profits.

# What is the primary goal?
This project applies the skills taught in the fundamental Python course for data science. *This notebook focuses on writing Python functions.*

# What is the end goal?
The aim is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store.

### Phase One: Data Cleaning

In [1]:
from csv import reader

def read_data(path):
    """
    Read a CSV file and return its rows
    as a nested list (one list per row).
    """
    # The context manager closes the file once reading is done.
    with open(path, encoding='utf8') as opened_file:
        return list(reader(opened_file))

In [2]:
ios_data = read_data('AppleStore.csv')
android_data =  read_data('googleplaystore.csv')

In [3]:
def explore_data(dataset, start, end, shape=True):
    """
    Print a few rows of the dataset and, optionally,
    the number of rows and columns.
    It takes 3 positional arguments and 1 default argument:
    * dataset: the nested list to explore
    * start: index of the first row to print (int)
    * end: index at which the slice stops, exclusive (int)
    * shape: if True, print the number of rows and columns
    """
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if shape:
        # The original format string had no {} placeholder, so nothing was filled in.
        print("Number of rows: {}".format(len(dataset)))
        print("Number of columns: {}".format(len(dataset[0])))

In [4]:
apple_store = explore_data(ios_data, 0, 5)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [5]:
android_store = explore_data(android_data, 0, 5, False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




In [6]:
# The malformed row reported in the discussion forum (the 'Category' value is missing)
print(android_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
del android_data[10473]
len(android_data)

10841

In [8]:
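# Keep only rows with the same number of columns as the header (16).
# `del i` inside a loop only unbinds the loop variable; it does not remove the row.
ios_data = [row for row in ios_data if len(row) == 16]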

In [9]:
len(ios_data) 

7198

In [10]:
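def check_duplicates(dataset):
    """
    Split the values in the first column of each row into
    duplicates (every repeat after the first occurrence)
    and unique values.
    """
    duplicate_list = []
    unique_list = []
    seen = set()  # set lookups are faster than searching a list
    for row in dataset:
        app_name = row[0]
        if app_name in seen:
            duplicate_list.append(app_name)
        else:
            seen.add(app_name)
            unique_list.append(app_name)
    return duplicate_list, unique_list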

In [11]:
ios_check_duplicates = check_duplicates(ios_data) 

In [12]:
android_check_duplicates = check_duplicates(android_data) 

In [13]:
print("No. of duplicate IOS apps: {}".format(len(ios_check_duplicates[0])))
print("No. of unique IOS apps: {}".format(len(ios_check_duplicates[1])))
print("No. of duplicate Android apps: {}".format(len(android_check_duplicates[0])))
print("No. of unique Android apps: {}".format(len(android_check_duplicates[1])))

No. of duplicate IOS apps: 0
No. of unique IOS apps: 7198
No. of duplicate Android apps: 1181
No. of unique Android apps: 9660


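The Android dataset has **1181** duplicate entries and **9660** unique entries (a count that still includes the header row), while the iOS dataset has no duplicates. Note that the check compares the first column of each row, which is the app name for Android and the app `id` for iOS. The duplicate apps in the Android dataset therefore need to be dealt with.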
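The duplicate count above can also be sketched with `collections.Counter` from the standard library. This is only an illustration: `count_duplicates` and the toy rows are hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of repeated rows, number of distinct names)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one counts as a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_duplicates, n_unique

# Made-up rows standing in for the Android data: [app name, reviews].
toy_rows = [
    ['App A', '100'],
    ['App A', '250'],
    ['App B', '50'],
]
print(count_duplicates(toy_rows))
```

On the real data this should agree with the lengths of the two lists returned by `check_duplicates`, since both count every repeat after the first occurrence.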

## Dealing with duplicate apps in the Android data
A closer look at the dataset led to the following decision: first, compare the "Reviews" column as integers instead of strings, then build a function that keeps only one row per app, the one with the most reviews.

In [14]:
def get_unique_android(dataset):
    """
    The function takes a nested list and does two things:
    1- sorts the rows by number of reviews (column 3, as an integer)
       in descending order, using the app name to break ties.
    2- keeps only the first row seen for each app name (column 0)
       and appends it to a new nested list.
    This guarantees that duplicate rows with fewer reviews
    are dropped and the row with the most reviews is kept.
    """
    sorted_rows = sorted(dataset, key=lambda x: (int(x[3]), x[0]), reverse=True)
    unique_android = []
    seen_names = set()  # names already kept; avoids rescanning the result list
    for row in sorted_rows:
        if row[0] not in seen_names:
            seen_names.add(row[0])
            unique_android.append(row)
    return unique_android

In [15]:
unique_android = get_unique_android(android_data[1:])

In [16]:
print("The Android dataset has {} unique apps".format(len(unique_android)))

The Android dataset has 9659 unique apps
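As a cross-check on the approach above, the same "keep the row with the most reviews" rule can be sketched with a dictionary instead of a sort. The helper `keep_most_reviewed` and the toy rows are assumptions made for illustration only.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Store the row if the name is new or this row has more reviews.
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows: [app name, category, rating, reviews].
toy_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
]
unique_rows = keep_most_reviewed(toy_rows)
print(unique_rows)
```

This variant makes a single pass over the data and leaves the input list untouched, but it returns rows in order of first appearance and not sorted by reviews.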


## Removing Non-English Apps
As discussed for this project, the company builds apps for an English-speaking audience only. The next part therefore produces new datasets that contain only English apps from the iOS data and the de-duplicated Android data.

In [17]:
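def english_string(string):
    """
    This function takes a string and checks whether an app name
    is written in English or not. Up to three non-ASCII characters
    (code point above 127) are tolerated, so names that contain
    symbols such as ™ or an emoji still count as English.
    """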
    non_english = 0
    for character in string:
        if ord(character) > 127:
            non_english += 1
    if non_english > 3:
        return False
    return True

In [18]:
print(english_string('Docs To Go™ Free Office Suite'))
print(english_string('Instagram'))
print(english_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_string('Gmail'))
print(english_string('Instachat 😜'))

True
True
False
True
True
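To see why a tolerance of three characters is useful, the sketch below counts the non-ASCII characters in some of the names tested above. `count_non_ascii` is a hypothetical helper written for this illustration; it mirrors the counting loop inside `english_string`.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in names:
    print(name, '->', count_non_ascii(name))
```

Names with a single trademark sign or emoji stay at or below the limit, while a name written mostly in Chinese characters goes well past it.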


In [19]:
def get_english_apps(dataset, n):
    """
    This function takes a dataset and the index of the app-name column.
    It returns two lists: one that holds the English apps
    in the dataset and one that holds the non-English apps.
    """
    english_apps = []
    non_english_apps = []
    for i in dataset:
        app_name = i[n]
        if english_string(app_name):
            english_apps.append(i)
        else:
            non_english_apps.append(i)
    return english_apps, non_english_apps

In [20]:
android_nested_apps = get_english_apps(unique_android, 0)
android_english = android_nested_apps[0]

In [21]:
ios_nested_apps = get_english_apps(ios_data, 1)
ios_english = ios_nested_apps[0]

In [22]:
print("No. of English Apps in the IOS df is: {}".format(len(ios_english)))
print("No. of Non-English Apps in the IOS df is: {}".format(len(ios_nested_apps[1])))
print("No. of English Apps in the Android df is: {}".format(len(android_english)))
print("No. of Non-English Apps in the Android df is: {}".format(len(android_nested_apps[1])))

No. of English Apps in the IOS df is: 6184
No. of Non-English Apps in the IOS df is: 1014
No. of English Apps in the Android df is: 9614
No. of Non-English Apps in the Android df is: 45


## Isolating Free Apps
As discussed for this project, the company builds apps that are free to download and install, with in-app ads as the main source of revenue. The next part therefore produces new datasets that contain only free apps from the iOS and Android data.

In [23]:
def check_price(dataset, n):
    """
    This function separates free and non-free applications.
    It takes a dataset and the index of the price column,
    and returns two lists: free apps and non-free apps.
    """
    free_apps = []
    non_free_apps = []
    for i in dataset:
        price = float(i[n].strip("$"))
        if price == 0.0:
            free_apps.append(i)
        else:
            non_free_apps.append(i)
    return free_apps, non_free_apps

In [24]:
check_android_price = check_price(android_english, 7)
android_clean = check_android_price[0]

In [25]:
print("Number of Free Android Apps: {}".format(len(check_android_price[0])))
print("Number of Non-Free Android Apps: {}".format(len(check_android_price[1])))

Number of Free Android Apps: 8864
Number of Non-Free Android Apps: 750
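The price check relies on turning the price text into a number. A minimal sketch of that step follows, assuming paid Android prices carry a leading dollar sign (which is what the `strip("$")` call in `check_price` suggests); `parse_price` and the sample value `'$4.99'` are illustrative and not taken from the dataset.

```python
def parse_price(price_text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # strip('$') removes dollar signs from both ends of the string only.
    return float(price_text.strip('$'))

for sample in ['0', '0.0', '$4.99']:
    price = parse_price(sample)
    print(sample, '->', price, '(free)' if price == 0.0 else '(paid)')
```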


In [26]:
# Prepend the header row to the new dataset
android_clean[0:0] = [['App','Category','Rating','Reviews','Size',
                       'Installs','Type','Price','Content Rating',
                       'Genres','Last Updated','Current Ver',
                       'Android Ver']]

In [27]:
check_ios_price = check_price(ios_english[1:], 4)
ios_clean = check_ios_price[0]

In [28]:
print("Number of Free IOS Apps: {}".format(len(check_ios_price[0])))
print("Number of Non-Free IOS Apps: {}".format(len(check_ios_price[1])))

Number of Free IOS Apps: 3222
Number of Non-Free IOS Apps: 2961


In [29]:
# Prepend the header row to the new dataset
ios_clean[0:0] = [['id','track_name','size_bytes','currency',
                  'price','rating_count_tot','rating_count_ver',
                  'user_rating','user_rating_ver','ver',
                  'cont_rating','prime_genre','sup_devices.num',
                  'ipadSc_urls.num','lang.num','vpp_lic']]

### Phase Two: Data Analysis in Python

Finding an app profile that fits both the App Store and Google Play is a primary goal for the company. It will enable the development team to build an application with the highest profit potential: the company's revenue comes mainly from ads, so the more users an app has, the better.

In [30]:
def check_genre_frequency(dataset, n):
    """
    This function takes the following:
    1- dataset: a nested list without the header row.
    2- n: (int) the index of the column to count.
    It is used on the genre column of the Android and iOS datasets.
    It counts how often each value occurs and returns a dictionary
    that maps each value to its share of all rows, in percent.
    """
    frequency_table = {}
    total = 0

    for i in dataset:
        total += 1
        column = i[n]
        if column in frequency_table:
            frequency_table[column] += 1
        else:
            frequency_table[column] = 1
    # Convert the raw counts to percentages of the total number of rows.
    for key in frequency_table:
        percentage = round((frequency_table[key] / total) * 100, 2)
        frequency_table[key] = percentage
    return frequency_table

In [31]:
def display_table(dataset, index):
    frequency_table = check_genre_frequency(dataset, index)
    table_display = []
    for key in frequency_table:
        frequency_tuple = (frequency_table[key], key)
        table_display.append(frequency_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ":", entry[0])
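The counting and sorting done by the two functions above can also be sketched in a few lines with `collections.Counter`. `genre_percentages` and the toy rows are hypothetical names and data used only for this illustration.

```python
from collections import Counter

def genre_percentages(rows, column_index):
    """Return (value, percentage) pairs, most frequent value first."""
    counts = Counter(row[column_index] for row in rows)
    total = sum(counts.values())
    return [
        (value, round(count / total * 100, 2))
        for value, count in counts.most_common()
    ]

# Made-up rows: [app name, category].
toy_rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'GAME'],
]
for genre, percentage in genre_percentages(toy_rows, 1):
    print(genre, ':', percentage)
```

`Counter.most_common()` already returns the values ordered by count, so no separate sorting step is needed.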

In [38]:
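def get_genre_average(dataset, genres_list, n, k):
    """
    For each genre in genres_list, print the average of column k
    (after removing '+' and ',') over the rows whose column n
    equals that genre.
    """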
    for genre in genres_list:
        total = 0
        genre_length = 0
        for app in dataset:
            app_genre = app[n]
            if app_genre == genre:
                n_ratings = float(app[k].replace('+','').replace(',',''))
                total += n_ratings
                genre_length += 1
        rating_average = round(total / genre_length, 2)
        print(genre, ':', rating_average)

In [33]:
android_category = check_genre_frequency(android_clean[1:], 1)
android_category

{'ART_AND_DESIGN': 0.64,
 'AUTO_AND_VEHICLES': 0.93,
 'BEAUTY': 0.6,
 'BOOKS_AND_REFERENCE': 2.14,
 'BUSINESS': 4.59,
 'COMICS': 0.62,
 'COMMUNICATION': 3.24,
 'DATING': 1.86,
 'EDUCATION': 1.16,
 'ENTERTAINMENT': 0.96,
 'EVENTS': 0.71,
 'FAMILY': 18.91,
 'FINANCE': 3.7,
 'FOOD_AND_DRINK': 1.24,
 'GAME': 9.72,
 'HEALTH_AND_FITNESS': 3.08,
 'HOUSE_AND_HOME': 0.82,
 'LIBRARIES_AND_DEMO': 0.94,
 'LIFESTYLE': 3.9,
 'MAPS_AND_NAVIGATION': 1.4,
 'MEDICAL': 3.53,
 'NEWS_AND_MAGAZINES': 2.8,
 'PARENTING': 0.65,
 'PERSONALIZATION': 3.32,
 'PHOTOGRAPHY': 2.94,
 'PRODUCTIVITY': 3.89,
 'SHOPPING': 2.25,
 'SOCIAL': 2.66,
 'SPORTS': 3.4,
 'TOOLS': 8.46,
 'TRAVEL_AND_LOCAL': 2.34,
 'VIDEO_PLAYERS': 1.79,
 'WEATHER': 0.8}

In [35]:
android_frequency_category = display_table(android_clean[1:], 1)
android_frequency_category 

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [40]:
# Column 5 is 'Installs', so the values below are average install counts per category.
android_average_rating = get_genre_average(android_clean[1:], android_category, 1, 5)

ENTERTAINMENT : 11640705.88
PRODUCTIVITY : 16787331.34
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HOUSE_AND_HOME : 1331540.56
BEAUTY : 513151.89
COMICS : 817657.27
FAMILY : 3695641.82
TOOLS : 10801391.3
HEALTH_AND_FITNESS : 4188821.99
SOCIAL : 23253652.13
DATING : 854028.83
VIDEO_PLAYERS : 24727872.45
ART_AND_DESIGN : 1986335.09
LIBRARIES_AND_DEMO : 638503.73
MAPS_AND_NAVIGATION : 4056941.77
FINANCE : 1387692.48
AUTO_AND_VEHICLES : 647317.82
PARENTING : 542603.62
NEWS_AND_MAGAZINES : 9549178.47
LIFESTYLE : 1437816.27
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
EVENTS : 253542.22
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
PHOTOGRAPHY : 17840110.4
COMMUNICATION : 38456119.17
EDUCATION : 1833495.15
FOOD_AND_DRINK : 1924897.74
BUSINESS : 1712290.15
MEDICAL : 120550.62
SPORTS : 3638640.14


In [36]:
ios_prime_genres = check_genre_frequency(ios_clean[1:], -5)
ios_prime_genres

{'Book': 0.43,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Education': 3.66,
 'Entertainment': 7.88,
 'Finance': 1.12,
 'Food & Drink': 0.81,
 'Games': 58.16,
 'Health & Fitness': 2.02,
 'Lifestyle': 1.58,
 'Medical': 0.19,
 'Music': 2.05,
 'Navigation': 0.19,
 'News': 1.33,
 'Photo & Video': 4.97,
 'Productivity': 1.74,
 'Reference': 0.56,
 'Shopping': 2.61,
 'Social Networking': 3.29,
 'Sports': 2.14,
 'Travel': 1.24,
 'Utilities': 2.51,
 'Weather': 0.87}

In [37]:
ios_frequency_genre = display_table(ios_clean[1:], -5)
ios_frequency_genre 

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


In [39]:
# Column 5 is 'rating_count_tot', so the values below are average rating counts per genre.
ios_average_rating = get_genre_average(ios_clean[1:], ios_prime_genres, -5, 5)

Games : 22788.67
Business : 7491.12
Reference : 74942.11
Catalogs : 4004.0
Entertainment : 14029.83
Utilities : 18684.46
Travel : 28243.8
Book : 39758.5
Social Networking : 71548.35
Health & Fitness : 23298.02
Navigation : 86090.33
Finance : 31467.94
Productivity : 21028.41
Music : 57326.53
Photo & Video : 28441.54
News : 21248.02
Sports : 23008.9
Weather : 52279.89
Shopping : 26919.69
Medical : 612.0
Lifestyle : 16485.76
Food & Drink : 33333.92
Education : 7003.98


#### Analyzing the frequency table of prime genre in iOS data
Games and Entertainment are, respectively, the first and second most common genres among the free English iOS apps. Games make up around *58%* of the dataset and Entertainment around *8%*. At the other end, genres such as Weather, Food & Drink, and Reference each account for less than 1% of the apps, which suggests less interest in apps that provide that kind of service. My impression is that most iOS users are more likely to install apps built for entertainment, since Games, Photo & Video, Social Networking, and Shopping take up a large part of the table.