<img src="assets/apple.png" width="200" style="margin-left:auto; margin-right:auto" /><img src="assets/play.svg" width="200" style="margin-left:auto; margin-right:auto" />

# Project: Profitable App Profiles

The primary aim of this project is to conduct a comprehensive analysis of mobile applications, seeking to identify app profiles that exhibit profitability within the App Store and Google Play markets.

This objective is pursued through the examination of two distinct datasets. One dataset pertains to the Google Play Store Apps, encompassing information on roughly ten thousand Android applications available on Google Play. The other dataset, Mobile App Store, encompasses data on approximately seven thousand iOS applications featured in the App Store.

The analysis undertaken unveils specific app categories that enjoy considerable popularity across both platforms. Consequently, investing in app development within these identified categories is deemed particularly lucrative and promising for potential profitability.

Let's start.

*This project was completed as part of the Data Science Career Path offered by dataquest.io.*

#### The Data

The number of apps available in the Google Play Store was around 2.65 million as of July 1, 2023 [[source]](https://de.statista.com/statistik/daten/studie/74368/umfrage/anzahl-der-verfuegbaren-apps-im-google-play-store/#:~:text=Die%20Anzahl%20der%20im%20Google,bei%20rund%202%2C67%20Millionen.), and the number of apps available in the App Store was around 2 million [[source]](https://www.apple.com/de/app-store/#:~:text=Wir%20bieten%20fast%20zwei%20Millionen,einem%20guten%20Gefühl%20nutzen%20kannst).

Gathering data for a vast quantity of over four million applications demands substantial resources in terms of both time and financial investment. Therefore, our approach focuses on analyzing a representative subset of the data. Fortunately, two datasets have emerged that align well with our intended objectives:

- A dataset comprising information on roughly ten thousand Android applications sourced from [Google Play](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

- A dataset encompassing data pertaining to approximately seven thousand iOS applications sourced from the [App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

First, we'll begin by accessing the two data sets, and then we'll proceed to delve into an in-depth examination of the data.

In [1]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf-8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0] #save only the header
android = android[1:] #save data without header

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf-8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0] 
ios = ios[1:]
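
The cell above leaves both file handles open after reading. As a minimal sketch, the same loading step could use a context manager so that each file is closed as soon as its rows have been read. The helper name `load_csv` is hypothetical and not part of the original project.

```python
from csv import reader


def load_csv(path, encoding='utf-8'):
    """Return (header, rows) for a CSV file; the file is closed on exit.

    `load_csv` is a hypothetical helper, not part of the original notebook.
    """
    # newline='' is what the csv module documentation recommends for
    # files that are passed to csv.reader.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, the two datasets could be loaded as `android_header, android = load_csv('googleplaystore.csv')` and `ios_header, ios = load_csv('AppleStore.csv')`.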

To make the two datasets easier to explore, we define a function named `explore_data()`. It prints a slice of rows in a readable way and can optionally report the number of rows and columns of any dataset passed to it.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Observing the Google Play dataset, we find it contains 10841 apps and 13 columns. At a brief glance, the columns most likely pertinent for our analysis are `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

Moving forward, let's shift our attention to the App Store dataset.

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


Within this dataset, there are 7197 iOS apps available, and the columns that seem to be interesting are: `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`.

Although not all column names are self-explanatory in their meanings, comprehensive information regarding each column can be found in the data set [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

#### Data Cleaning

##### Deleting Wrong Data

The Google Play data set has an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [4]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 represents the app "Life Made WI-Fi Touchscreen Photo Frame," and it appears to have a rating of 19. This value is clearly wrong, since the maximum rating for a Google Play app is 5. The cause is a missing value in the `'Category'` column: the row has 12 values instead of 13, so every subsequent value is shifted one column to the left. We will therefore remove this row from the dataset.
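
A column shift of this kind can also be found programmatically, because a row with a missing value has fewer fields than the header. The following is a minimal sketch; `find_malformed_rows` is a hypothetical helper, and the sample data is a shortened stand-in for the real rows.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Shortened stand-in data: the second row lacks its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
print(find_malformed_rows(sample_rows, sample_header))  # prints [1]
```

Called as `find_malformed_rows(android, android_header)`, a check like this should flag index 10472.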

In [5]:
print(len(android))

del android[10472] 
print(len(android))

10841
10840


#### Removing Duplicate Entries - Android

A closer look at the Google Play dataset shows that some applications have multiple entries. For instance, 'Instagram' appears four times. To confirm this, we loop over the dataset with a `for` loop and use an `if` statement to print every row whose name is 'Instagram'.

In [6]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Looking at the printed **rows** for the Instagram application, the main difference is in the **fourth position of each row**, which holds the number of reviews.

The differing numbers show that the data was collected at different times. We can use this to build a criterion for removing duplicates: the higher the number of reviews, the more recent the data should be.

Rather than removing duplicates at random, we keep only the entry with the highest number of reviews for each app and discard the others.

To find out how many duplicates the Android data contains, we use a `for` loop with two initially empty `lists`: a name is appended to `unique_apps` the first time it appears and to `duplicate_apps` on every later occurrence.

Finally, we print a few examples of duplicated apps.

In [7]:
unique_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of unique apps: ', len(unique_apps))
print('Number of duplicate apps: ', len(duplicate_apps))
print('-------------------------------------------------')
print('Example of duplicate apps: ', duplicate_apps[0:5])

Number of unique apps:  9659
Number of duplicate apps:  1181
-------------------------------------------------
Example of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


As you can see above, there are 1181 cases where an app occurs more than once.

Next we create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
      
print('Number of non-duplicate rows:', len(reviews_max))

Number of non-duplicate rows: 9659


In [9]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


As evident from the information provided above, we stored 9659 unique apps with the newest rating count (highest number of reviews) into the `reviews_max` dictionary.

Next, we will create two empty `lists` and store the unique rows with the newest rating count into one `list` and the names of the unique apps into the other `list`.

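We will use a `for` loop once again and `append` the values to the `lists`. The `already_added` check ensures that an app is added only once, even if several of its entries share the same highest review count.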

In [10]:
android_clean = []
already_added  = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if name not in already_added and n_reviews == reviews_max[name]:
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The length of the `android_clean` list is *9659*, which confirms that we added all unique rows from the Android dataset to our new `list`. The `already_added` `list` just holds the names of all the unique apps. It contains *9659* names, which matches the number of unique rows.
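
The two-pass approach above (first build `reviews_max`, then filter the rows) could also be condensed into a single pass. The sketch below is one possible variant, not the method used in this project: it keeps a dictionary that maps each app name to the best row seen so far. The helper name `keep_most_reviewed` and its parameters are hypothetical.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count.

    Hypothetical helper; the default indexes match the Google Play columns
    'App' (0) and 'Reviews' (3) shown in the header above.
    """
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # dicts preserve insertion order (Python 3.7+), so apps are returned
    # in the order in which their names first occur
    return list(best.values())
```

This selects the same row per app as the cells above, but the rows may come out in a different order, because each app sits at the position where its name first appears. Dictionary lookups also avoid the repeated `name not in already_added` scan over a growing list.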

#### Removing Duplicate Entries - iOS

Now, let's examine whether there are any duplicates in the iOS data as well. We can employ the same code but with a change in datasets. The iOS dataset incorporates an 'id' column, which we will utilize for this check. This serves as a prime illustration of code reuse for the same objective. Alternatively, we could create a function and employ it for this purpose.

In [11]:
unique_apps = []
duplicate_apps = []

for app in ios:
    name = app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of unique apps: ', len(unique_apps))
print('Number of duplicate apps: ', len(duplicate_apps))
print('-------------------------------------------------')
print('Example of duplicate apps: ', duplicate_apps[0:5])

Number of unique apps:  7197
Number of duplicate apps:  0
-------------------------------------------------
Example of duplicate apps:  []


As evident from the iOS dataset, there are no duplicates present, and therefore, there's no need to remove any duplicates at this stage. 
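
The duplicate check was run twice with nearly identical code. As a minimal sketch of the function-based alternative, the check could be wrapped as follows. The name `count_duplicates` is hypothetical, and the set is an addition for speed that the original cells do not use.

```python
def count_duplicates(dataset, index):
    """Return (unique_values, duplicate_values) for the column at `index`.

    Hypothetical helper; a set makes each membership test O(1) on average,
    whereas `in` on a list scans the whole list.
    """
    seen = set()
    unique_values = []
    duplicate_values = []
    for row in dataset:
        value = row[index]
        if value in seen:
            duplicate_values.append(value)
        else:
            seen.add(value)
            unique_values.append(value)
    return unique_values, duplicate_values
```

It could then be called as `count_duplicates(android, 0)` for the app names and `count_duplicates(ios, 0)` for the `'id'` column.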

Moving forward, our next step involves analyzing the language of the apps, with a specific focus on English apps only.

#### Removing Non-English Apps

Upon thorough exploration of the datasets, we observed that certain app names imply they are not targeted towards an English-speaking audience. Below, we present a few examples from both datasets for illustration.

In [12]:
print(ios[813])
print(ios[813][1])
print(android_clean[4412])
print(android_clean[4412][0])

['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']
爱奇艺PPS -《欢乐颂2》电视剧热播
['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up']
中国語 AQリスニング


In [13]:
print(ord('a')) # with ord we can look for the corresponding number of the letter
print(ord('K'))

97
75


The numbers corresponding to the characters we commonly use in an English text are all in the range *0 to 127*, according to the ASCII (American Standard Code for Information Interchange) system. 

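We will create a `function` that checks whether a string is English. It counts the characters whose code point is above 127 and returns `False` if there are more than three of them. Allowing up to three such characters prevents us from rejecting English apps whose names contain a few special characters, such as emoji or the ™ sign.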

In [14]:
def character_check(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

While the `function` is not without imperfections, its efficacy in filtering out non-English applications is notable. However, a residual subset of non-English apps might bypass the implemented filter.
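
Because the threshold of three is a judgment call, it can be useful to make it adjustable. The sketch below is a hypothetical variant of `character_check()`; the name `is_probably_english` and the `max_non_ascii` parameter are assumptions made for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters.

    Hypothetical variant of character_check(); the default of 3 mirrors the
    threshold used in the notebook.
    """
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii
```

A stricter filter, for example `is_probably_english(name, max_non_ascii=0)`, would reject any name containing an emoji or the ™ sign, which illustrates why some tolerance is helpful.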

Before filtering the datasets, we test `character_check()` on a few sample strings.

In [15]:
print(character_check('Instachat 😜'))
print(character_check('Docs To Go™ Free Office Suite'))
print(character_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(character_check('Instagram'))
print(character_check('Inst爱奇艺艺PPSagram'))
print(character_check('Inst爱奇艺PPSagram'))

True
True
False
True
False
True


Now we will use the `function` `character_check()` to filter our datasets.

In [16]:
android_english_clean = []
ios_english_clean = []

for row in android_clean:
    name = row[0]
        
    if character_check(name):
        android_english_clean.append(row)
        
for row in ios:
    name = row[1]
    
    if character_check(name):
        ios_english_clean.append(row)

In [17]:
explore_data(android_english_clean, 0, 3, True)
print('\n')
explore_data(ios_english_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


Our datasets now contain only English apps: 9614 rows in the Android dataset and 6183 in the iOS dataset. Before the language filter, they contained 9659 (Android) and 7197 (iOS) rows.

We only want to analyze apps that are free to download and install, so the next step is to exclude the paid apps.

#### Isolating the Free Apps

Our exclusive attention is directed towards applications that are offered for free download and installation. The datasets encompass a mix of both free and non-free applications, necessitating the isolation of solely the free applications for the purpose of our analysis. Thus, our subsequent step involves extracting and segregating the free apps from both datasets.

In [18]:
android_final = []
ios_final = []

for row in android_english_clean:
    price_type = row[7]
    
    if price_type == '0':
        android_final.append(row)

for row in ios_english_clean:
    price = row[4]
    
    if price == '0.0':
        ios_final.append(row)
        
print('Number of rows free + non-free (android): ', len(android_english_clean))
print('Number of rows free apps (android): ', len(android_final))
print('\n')
print('Number of rows free + non-free (ios): ', len(ios_english_clean))
print('Number of rows free apps (ios): ', len(ios_final))

Number of rows free + non-free (android):  9614
Number of rows free apps (android):  8864


Number of rows free + non-free (ios):  6183
Number of rows free apps (ios):  3222


Upon completing the data cleaning process concerning pricing, the android dataset now comprises *8864* rows, while the iOS dataset encompasses *3222* rows.
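
The filter above compares the price strings against the literals `'0'` and `'0.0'`, which works for the formats visible in the rows printed so far. A more format-tolerant check would compare numbers instead of strings. The sketch below is hypothetical: the name `is_free` is made up, and the leading `$` on paid prices is an assumption about the Google Play data that the excerpts above do not show.

```python
def is_free(price):
    """Return True if a price string represents zero.

    Hypothetical helper. Assumes prices look like '0', '0.0' or '$4.99';
    the leading '$' is an assumption about how paid apps are recorded.
    """
    return float(price.strip().lstrip('$')) == 0.0
```

With such a helper, both loops could use the same condition, for example `if is_free(row[7])` for Android and `if is_free(row[4])` for iOS.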

Thus far, the following tasks have been accomplished:

- Elimination of inaccurate data
- Removal of duplicated app entries
- Exclusion of non-English applications
- Isolation of applications offered at no cost

In line with the objectives stated in the introduction, our aim is to identify app profiles that can attract a large user base, since the number of users significantly affects revenue.

To ascertain the viability of an app idea, a three-step validation strategy is proposed:

1) Development and release of a minimal Android version on Google Play.
2) Further development contingent upon a favorable user response.
3) Consideration of an iOS version's development for the App Store if the app proves profitable within six months.

Given the ultimate objective of featuring the app on both Google Play and the App Store, the pursuit involves identifying app profiles thriving in both markets. For example, a profile exhibiting success in both markets could entail a productivity app integrating gamification elements.

Moving forward, a comprehensive analysis of the dataset remains imperative. Subsequent steps will involve scrutinizing the most prevalent genres. This will include the generation of a frequency table for the `prime_genre` column within the iOS dataset, alongside the `Genres` and `Category` columns within the android dataset.

#### Most Common Apps by Genre

In [19]:
print(android_header)
print('\n')
print('''The category column in the android header is at index 1.
The genre column in the android header is at index 9.''')
print('\n')
print(ios_header)
print('\n')
print('The genre column in the ios header is at index 11.')
print('\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The category column in the android header is at index 1.
The genre column in the android header is at index 9.


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The genre column in the ios header is at index 11.




We will construct two distinct `functions` intended for analyzing frequency tables:

Firstly, a `function` designed to produce frequency tables illustrating percentages. <br> Secondly, an additional function dedicated to presenting these percentages in a descending order for enhanced analysis and readability.

##### `freq_table` function

In [20]:
def freq_table(dataset, index):
    freq_dict = {}
    total = 0
    
    for row in dataset:
        total += 1
        column = row[index] 
        if column in freq_dict:
            freq_dict[column] +=1
        else:
            freq_dict[column] = 1
            
    freq_dict_percentages = {}
    for key in freq_dict:
        percentage = (freq_dict[key] / total) * 100
        freq_dict_percentages[key] = percentage
        
    return freq_dict_percentages
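
As a point of comparison, the standard library's `collections.Counter` can do the counting part of this function. The sketch below is a hypothetical alternative, not the function used in the rest of this project; the name `freq_table_counter` is made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`.

    Hypothetical alternative to freq_table() built on collections.Counter.
    """
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}
```

For an empty dataset the comprehension has nothing to iterate over, so the function returns an empty dictionary without dividing by zero.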

In [21]:
freq_table(android_final, 1) #freq_table for category in android dataset

{'ART_AND_DESIGN': 0.6430505415162455,
 'AUTO_AND_VEHICLES': 0.9250902527075812,
 'BEAUTY': 0.5979241877256317,
 'BOOKS_AND_REFERENCE': 2.1435018050541514,
 'BUSINESS': 4.591606498194946,
 'COMICS': 0.6204873646209386,
 'COMMUNICATION': 3.2378158844765346,
 'DATING': 1.861462093862816,
 'EDUCATION': 1.1620036101083033,
 'ENTERTAINMENT': 0.9589350180505415,
 'EVENTS': 0.7107400722021661,
 'FINANCE': 3.7003610108303246,
 'FOOD_AND_DRINK': 1.2409747292418771,
 'HEALTH_AND_FITNESS': 3.0798736462093865,
 'HOUSE_AND_HOME': 0.8235559566787004,
 'LIBRARIES_AND_DEMO': 0.9363718411552346,
 'LIFESTYLE': 3.9034296028880866,
 'GAME': 9.724729241877256,
 'FAMILY': 18.907942238267147,
 'MEDICAL': 3.531137184115524,
 'SOCIAL': 2.6624548736462095,
 'SHOPPING': 2.2450361010830324,
 'PHOTOGRAPHY': 2.944494584837545,
 'SPORTS': 3.395758122743682,
 'TRAVEL_AND_LOCAL': 2.33528880866426,
 'TOOLS': 8.461191335740072,
 'PERSONALIZATION': 3.3167870036101084,
 'PRODUCTIVITY': 3.892148014440433,
 'PARENTING': 0.6

##### `display_table()` function

In [22]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) #print key first and then percentage
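
`display_table()` swaps each key and value into a tuple so that `sorted()` orders by percentage, and it only prints its result. A possible variant, sketched below under the assumption that a returned list is more convenient for later steps, sorts with a `key` function and returns rounded pairs. The name `sorted_table` and the `decimals` parameter are hypothetical.

```python
def sorted_table(table, decimals=2):
    """Return (key, percentage) pairs sorted by percentage, largest first.

    Hypothetical helper that takes a {key: percentage} dict, such as the
    one returned by freq_table().
    """
    pairs = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return [(key, round(percentage, decimals)) for key, percentage in pairs]


# Toy input, not taken from the datasets.
for key, percentage in sorted_table({'A': 12.5, 'B': 62.5, 'C': 25.0}):
    print(key, ':', percentage)
```

Because the function returns a list, its result can be stored in a variable and reused, which is not possible with a function that only prints.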

In [23]:
display_table(android_final, 1) #freq_table for category in android dataset, sorted desc

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [24]:
display_table(android_final, 9) #freq_table for genre in android dataset, sorted desc

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [25]:
display_table(ios_final, 11)  #freq_table for genre in ios dataset, sorted desc

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


#### Analyzing the Frequency Tables
We'll start with the `prime_genre` column of the iOS dataset. After that, we'll look at the Android dataset columns.

In [26]:
display_table(ios_final, 11)  # display_table() only prints and returns None, so there is nothing to assign

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As the table above shows, `Games` is by far the most common genre among free English apps in the App Store, at *~58%*. `Entertainment` follows far behind in second place with *~8%*. Most of the other genres each account for about *3%* or less.

Looking at the frequency table, we can conclude that most free English apps belong to entertainment-oriented genres, such as `Games` (*~58%*), `Entertainment` (*~8%*), `Photo & Video` (*~5%*), or `Social Networking` (*~3%*). However, we can't say that `Games` is the genre with the most users. There could be thousands of free English games in the App Store without any users. The frequency table only shows the number of apps, not the user count.

With this information alone, we can't recommend an app profile for the App Store market. Next, let's look at the Android dataset.

In [27]:
display_table(android_final, 1)  # category; display_table() only prints and returns None

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [28]:
display_table(android_final, 9)  # genre; display_table() only prints and returns None

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Before we analyze the Android frequency tables for `Category` and `Genres`, we should look at the columns themselves. An app can belong to multiple `Genres` (apart from its main `Category`). For example, a musical family game will belong to the Music, Game, and Family genres. This means the `Genres` column is a more detailed representation of the `Category` column. For our purpose, looking at the `Category` column is sufficient, because we want to see the overall distribution within the store.
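
In the rows printed earlier, an app with several genres stores them in a single string separated by semicolons, for example `'Art & Design;Pretend Play'`. If a per-genre view were needed, each combined value could be split before counting. The sketch below is a hypothetical illustration; `genre_counts` is a made-up name, and the sample rows are shortened to two columns.

```python
from collections import Counter


def genre_counts(dataset, index, separator=';'):
    """Count every individual genre in a multi-valued genre column.

    Hypothetical helper; assumes genres are joined by `separator`, as in
    'Art & Design;Pretend Play'.
    """
    counts = Counter()
    for row in dataset:
        for genre in row[index].split(separator):
            counts[genre.strip()] += 1
    return counts


# Shortened stand-in rows: (app name, genres)
sample = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Art & Design'],
    ['Coloring book moana', 'Art & Design;Pretend Play'],
]
print(genre_counts(sample, 1))
```

Note that counts produced this way add up to more than the number of apps, since an app with two genres is counted twice.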

The most common `Category` among free English apps on Google Play is `FAMILY` with *~19%*, followed by `GAME` with *~10%*.

As in the App Store, many `Categories` on Google Play each account for about *3%* or less. As mentioned before, how strongly a `Category` is represented says nothing about its user count. In contrast to the App Store, the most common free English apps on Google Play belong to the `FAMILY` `Category`, with `GAME` in second place. Although `GAME` is not as strongly represented as `Games` is in the App Store, it is still a very common `Category` on Google Play. Moreover, if we investigate this further, we can see that the `FAMILY` `Category` consists mostly of games for kids.

While most apps in the App Store fall into genres that we can summarize as `Entertainment`, the free English apps on Google Play are spread much more evenly between `Entertainment` and apps with `practical purposes`, e.g. `TOOLS`, `BUSINESS`, and `PRODUCTIVITY`. `Practical purpose` apps are therefore more prominently represented on Google Play than in the App Store. The frequency table for the `'Genres'` column corroborates this observation.

#### Most Popular Apps by Genre

Until now, our analysis indicates that the App Store primarily consists of apps designed for entertainment, whereas Google Play displays a more balanced variety of both practical and recreational apps. However, our focus now shifts towards understanding the types of apps that have the highest number of users.

To determine the most popular genres with the highest number of users, one approach is to calculate the average number of `installs` for each app genre. In the `Google Play dataset`, this information is directly available in the `'Installs'` column. However, for the `App Store dataset`, the corresponding data is missing. To address this, we'll use the total number of user ratings as a proxy, which we can find in the `'rating_count_tot'` column.

Below, we compute the average number of user ratings per app for each genre on the App Store:

#### Most Common Genres by User Ratings (iOS)

In [29]:
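# Iterating over the frequency table yields each distinct genre (column 11)
freq_genre_ios = freq_table(ios_final, 11)

for genre in freq_genre_ios:
    total = 0      # sum of rating counts for this genre
    len_genre = 0  # number of apps in this genre

    for row in ios_final:
        app_genre = row[11]
        if app_genre == genre:
            ratings = float(row[5])  # total number of user ratings
            total += ratings
            len_genre += 1

    avg_rating_count = round(total / len_genre)
    print(genre, '--> Average number of user ratings:', avg_rating_count)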

Social Networking --> Average number of user ratings: 71548
Photo & Video --> Average number of user ratings: 28442
Games --> Average number of user ratings: 22789
Music --> Average number of user ratings: 57327
Reference --> Average number of user ratings: 74942
Health & Fitness --> Average number of user ratings: 23298
Weather --> Average number of user ratings: 52280
Utilities --> Average number of user ratings: 18684
Travel --> Average number of user ratings: 28244
Shopping --> Average number of user ratings: 26920
News --> Average number of user ratings: 21248
Navigation --> Average number of user ratings: 86090
Lifestyle --> Average number of user ratings: 16486
Entertainment --> Average number of user ratings: 14030
Food & Drink --> Average number of user ratings: 33334
Sports --> Average number of user ratings: 23009
Book --> Average number of user ratings: 39758
Finance --> Average number of user ratings: 31468
Education --> Average number of user ratings: 7004
Productivity --

Above, we calculated the average number of user ratings per `Genre`. The result confirms the assumption that the number of apps in a genre does not necessarily reflect its number of users. User ratings give a rough idea of how many people use the apps of a genre. Judged by the average number of user ratings, `Games` is not the most popular `Genre`. Instead, `Social Networking`, `Reference` and `Navigation` apps, which make up only *~3.29%*, *~0.56%* and *~0.19%* of the free English dataset, seem to have the most users on average.
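The per-genre averages can also be computed in a single pass over the data and printed in descending order, which makes the leading genres easier to spot. The following is only a sketch: `average_by_column` and the toy rows are hypothetical names and values introduced here for illustration, not part of the original notebook.

```python
from collections import defaultdict


def average_by_column(rows, key_index, value_index):
    """Return {key: mean of the value column}, computed in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical toy rows in the form [name, genre, rating_count]
toy_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]

averages = average_by_column(toy_rows, 1, 2)

# Print the genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, '--> Average number of user ratings:', round(avg))
```

With the notebook's data, the equivalent call would presumably be `average_by_column(ios_final, 11, 5)`, assuming the same column layout as in the cell above.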


#### Bias
Among the app `Genres`, `Navigation` apps have the highest average number of user ratings. However, this figure is heavily influenced by two dominant apps, `Waze` and `Google Maps`, which together account for about half a million user ratings:

In [30]:
for app in ios_final:
    if app[11] == 'Navigation':  # genre column, indexed as in the cells above
        print(app[1], ':', app[5])  # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


We observe a similar pattern with `Social Networking` and `Reference` apps. For `Social Networking`, the average number of user ratings is heavily influenced by a handful of giants such as `Facebook`, `Pinterest` and `Skype`.
`Reference` apps have an average of 74,942 user ratings. However, this figure is driven by apps like the `Bible`, `Dictionary.com` and `Google Translate` (see below).

Consequently, these genres may appear more popular than they really are. The averages are skewed by a few apps with hundreds of thousands of user ratings, while other apps struggle to get past the 10,000 mark.

To get a clearer picture, we could exclude these extremely popular apps from each genre and then recompute the averages.
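A related way to see the skew, which the original analysis does not use, is to compare the mean with the median, since the median is far less sensitive to a few very large values. The sketch below uses the six `Navigation` rating counts printed above.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed in the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# The mean is pulled up by Waze and Google Maps
print('mean  :', round(mean(navigation_ratings)))

# The median is the midpoint of the two middle values (3582 and 12811)
print('median:', median(navigation_ratings))
```

The mean reproduces the 86090 reported for `Navigation`, while the median is 8196.5, roughly a tenth of that value.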


In [31]:
for app in ios_final:
    if app[11] == 'Reference':  # genre column, indexed as in the cells above
        print(app[1], ':', app[5])  # print name and number of ratings

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


In [32]:
for app in ios_final:
    if app[11] == 'Lifestyle':
        print(app[1], ':', app[5])

Zillow Real Estate - Homes for Sale & for Rent : 342969
Tinder : 143040
Text Free: Free Texting + Calling + MMS : 100477
Countdown‼ (Event Reminders and Timer) : 60490
PINK Nation : 49816
happn — Dating app — Find and meet your crush : 20546
Perfect365 - Custom makeup designs and beauty tips : 19540
ipsy - Makeup, subscription and beauty tips : 17489
cute icon&wallpaper dressup - CocoPPa : 12508
Bumble – Find a Date, Meet Friends & Network : 10109
IKEA Catalog : 8939
Monogram - Wallpaper & Backgrounds Maker HD DIY with Glitter Themes : 7427
Nerve — Truth or Dare Dirty Houseparty Party Games : 6658
Tile - Find & track your lost phone, wallet, keys : 5684
Yellow - Make new friends : 3809
Home - Design & Decor Shopping : 3354
T-Mobile Tuesdays : 3213
Player for Acapella Triller illuminati Edition : 2476
Funny Face - Filters Swap Effects Pic for Snapchat : 2471
SafeTrek - Personal Safety : 2227
Philips Hue : 1999
Alipay - Makes Life Easy : 1926
Yoshirt - Design Your Own Custom Tshirt, Tote

The `Lifestyle` `Genre` is interesting because it brings together very different kinds of apps. There are dating apps like `Tinder` and `Bumble`, real estate apps like `Zillow Real Estate`, and also apps such as the `IKEA Catalog` or `ipsy - Makeup, subscription and beauty tips`.

One idea is to build an app that gathers useful `Lifestyle` information in one place, so that users can rely on a single app for all these topics. Users could select the topics that interest them and then get access to apps that might be relevant to them.

This idea seems to fit well with the fact that the App Store is dominated by `Entertainment` apps. This suggests the market might be a bit saturated with `Entertainment` apps, which means a practical `Lifestyle` app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include Weather, Book, Food & Drink, and Finance.

Now let's analyze the Google Play market a bit.

#### Most Popular Apps by Genre (Android)

For Google Play we have data on the number of installs, which gives us a clearer picture of genre popularity. However, the install figures are imprecise, because most of them are open-ended (e.g., 100+, 1,000+, 5,000+, etc.).

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

In order to carry out the computations successfully, we must convert each install number to a floating-point format. This requires removing the commas and plus characters, as their presence would result in a conversion failure and raise an error. We will accomplish this task within the loop below, where we will also compute the average number of installs for each genre (category).
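Taken on its own, the cleaning step can be sketched as a small helper. `parse_installs` is a hypothetical name introduced here for illustration; the original notebook performs the same replacements inline.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float.

    The open-ended figure is treated as its lower bound, as described above.
    """
    return float(raw.replace(',', '').replace('+', ''))


# Without cleaning, the conversion raises a ValueError
try:
    float('1,000,000+')
except ValueError as error:
    print('Direct conversion fails:', error)

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('100+'))        # 100.0
```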

In [33]:
# Iterating over the frequency table yields each distinct category (column 1)
freq_category_android = freq_table(android_final, 1)

for genre in freq_category_android:
    total = 0      # sum of installs for this category
    len_genre = 0  # number of apps in this category

    for row in android_final:
        app_genre = row[1]
        if app_genre == genre:
            installs = row[5]
            # Strip '+' and ',' so that e.g. '1,000,000+' can be parsed as a number
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_genre += 1

    avg_number_installs = round(total / len_genre)
    print(genre, '--> Average number of installs:', avg_number_installs)

ART_AND_DESIGN --> Average number of installs: 1986335
AUTO_AND_VEHICLES --> Average number of installs: 647318
BEAUTY --> Average number of installs: 513152
BOOKS_AND_REFERENCE --> Average number of installs: 8767812
BUSINESS --> Average number of installs: 1712290
COMICS --> Average number of installs: 817657
COMMUNICATION --> Average number of installs: 38456119
DATING --> Average number of installs: 854029
EDUCATION --> Average number of installs: 1833495
ENTERTAINMENT --> Average number of installs: 11640706
EVENTS --> Average number of installs: 253542
FINANCE --> Average number of installs: 1387692
FOOD_AND_DRINK --> Average number of installs: 1924898
HEALTH_AND_FITNESS --> Average number of installs: 4188822
HOUSE_AND_HOME --> Average number of installs: 1331541
LIBRARIES_AND_DEMO --> Average number of installs: 638504
LIFESTYLE --> Average number of installs: 1437816
GAME --> Average number of installs: 1

Looking at the average installs per `Category`, `Communication` apps lead with *38,456,119* average installs, followed by `Video Players` with *24,727,872*. In third place among free English apps on Google Play is `Social` with *23,253,652* average installs. `Games` ranks sixth with *15,588,016* average installs.

In [34]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

The average for `Communication` apps (38,456,119 installs) is heavily skewed upward by a few apps with over one billion installs (`WhatsApp`, `Facebook Messenger`, `Skype`, `Google Chrome`, `Gmail`, and `Hangouts`) and several others with over 100 million or 500 million installs (see above).

Once again, the main concern is that such genres appear more popular than they really are. In addition, these niches seem to be controlled by a few dominant giants, which makes it hard to compete against them.
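One way to quantify this distortion is to recompute a category's average while leaving out apps at or above a chosen install threshold. The sketch below is an assumption about how this could be done, not the notebook's own code: `average_installs`, the 100,000,000 cut-off and the toy rows are introduced here purely for illustration.

```python
def average_installs(rows, category, max_installs=None,
                     category_index=1, installs_index=5):
    """Average installs of a category, optionally ignoring apps with
    max_installs or more installs."""
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if max_installs is None or installs < max_installs:
            values.append(installs)
    # Guard against an empty selection to avoid dividing by zero
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy rows: name in column 0, category in column 1, installs in column 5
toy_rows = [
    ['Giant Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '100,000+'],
    ['Tiny SMS', 'COMMUNICATION', '', '', '', '500,000+'],
]

print(round(average_installs(toy_rows, 'COMMUNICATION')))
print(round(average_installs(toy_rows, 'COMMUNICATION', max_installs=100_000_000)))
```

In the toy data a single giant lifts the average from 300,000 to over 333 million, which mirrors the effect described for the real `Communication` category.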

While the game genre appears to be quite popular, our previous findings revealed market saturation in this area. Consequently, we aim to provide an alternative app recommendation if feasible.

The `Lifestyle` `Category` looks fairly popular as well, with an average of 1,437,816 installs. It is worth exploring in more depth, since we found that this genre has some potential on the App Store, and our aim is to recommend an app genre that could be profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [35]:
for app in android_final:
    if app[1] == 'LIFESTYLE':
        print(app[0], ':', app[5])

Dollhouse Decorating Games : 5,000,000+
metroZONE : 10,000,000+
Easy Hair Style Design : 100,000+
Talking Babsy Baby: Baby Games : 10,000,000+
Black Wallpaper, AMOLED, Dark Background: Darkify : 5,000,000+
Girly Wallpapers Backgrounds : 1,000,000+
Chart - Myanmar Keyboard : 5,000,000+
Easy Makeup Tutorials : 1,000,000+
Horoscopes – Daily Zodiac Horoscope and Astrology : 10,000,000+
Entel : 1,000,000+
ZenUI Safeguard : 1,000,000+
Live 4D Results ! (MY & SG) : 5,000,000+
Diary with lock : 10,000,000+
FOSSIL Q: DESIGN YOUR DIAL : 500,000+
Telstra : 5,000,000+
Family Locator - GPS Tracker : 10,000,000+
Van Nien 2018 - Lich Van su & Lich Am : 1,000,000+
Safeway : 1,000,000+
HTC Speak : 10,000,000+
Kawaii Easy Drawing : How to draw Step by Step : 5,000,000+
Tattoodo - Find your next tattoo : 1,000,000+
H&M : 10,000,000+
Samsung+ : 50,000,000+
Anime Avatar Creator: Make Your Own Avatar : 1,000,000+
Beautiful Design Birthday Cake : 500,000+
Pronunciation and know the name of the caller from hi

In [36]:
for app in android_final:
    if app[1] == 'LIFESTYLE' and app[5] in ('1,000,000,000+',
                                            '500,000,000+',
                                            '100,000,000+'):
        print(app[0], ':', app[5])

Tinder : 100,000,000+


There is only one very popular app (Tinder). Let's look at the apps in this `Category` with mid-range popularity (between 1,000,000 and 100,000,000 installs):

In [37]:
for app in android_final:
    if app[1] == 'LIFESTYLE' and app[5] in ('1,000,000+',
                                            '5,000,000+',
                                            '10,000,000+',
                                            '50,000,000+'):
        print(app[0], ':', app[5])

Dollhouse Decorating Games : 5,000,000+
metroZONE : 10,000,000+
Talking Babsy Baby: Baby Games : 10,000,000+
Black Wallpaper, AMOLED, Dark Background: Darkify : 5,000,000+
Girly Wallpapers Backgrounds : 1,000,000+
Chart - Myanmar Keyboard : 5,000,000+
Easy Makeup Tutorials : 1,000,000+
Horoscopes – Daily Zodiac Horoscope and Astrology : 10,000,000+
Entel : 1,000,000+
ZenUI Safeguard : 1,000,000+
Live 4D Results ! (MY & SG) : 5,000,000+
Diary with lock : 10,000,000+
Telstra : 5,000,000+
Family Locator - GPS Tracker : 10,000,000+
Van Nien 2018 - Lich Van su & Lich Am : 1,000,000+
Safeway : 1,000,000+
HTC Speak : 10,000,000+
Kawaii Easy Drawing : How to draw Step by Step : 5,000,000+
Tattoodo - Find your next tattoo : 1,000,000+
H&M : 10,000,000+
Samsung+ : 50,000,000+
Anime Avatar Creator: Make Your Own Avatar : 1,000,000+
Super Slime Simulator - Satisfying Slime App : 1,000,000+
Caf - My Account : 5,000,000+
H Pack : 1,000,000+
Family convenience store FamilyMart : 1,000,000+
w UN map :

This niche is made up of many different kinds of apps: `Telecommunication Provider` apps, tools such as `Family Locator - GPS Tracker` or `Timely Alarm Clock`, some apps about the Quran, and apps for restaurants. A multifunctional app that covers relevant lifestyle topics in a personalized way could be helpful for users who have so far relied on many separate apps from the `Lifestyle` `Category`.
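The install-range filters used for the `Lifestyle` category compare against literal strings such as `'1,000,000+'`. A more flexible variant parses the value first and filters on a numeric range. This is only a sketch: `apps_in_install_range` is a hypothetical helper, and the rows below reuse four app names and install figures from the outputs above, padded with empty columns to mimic the dataset layout.

```python
def apps_in_install_range(rows, category, low, high,
                          name_index=0, category_index=1, installs_index=5):
    """Return (name, installs) pairs of a category with low <= installs < high."""
    selected = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = int(row[installs_index].replace(',', '').replace('+', ''))
        if low <= installs < high:
            selected.append((row[name_index], installs))
    return selected


# Names and install figures taken from the outputs above; other columns left empty
sample_rows = [
    ['Tinder', 'LIFESTYLE', '', '', '', '100,000,000+'],
    ['H&M', 'LIFESTYLE', '', '', '', '10,000,000+'],
    ['Easy Hair Style Design', 'LIFESTYLE', '', '', '', '100,000+'],
    ['Samsung+', 'LIFESTYLE', '', '', '', '50,000,000+'],
]

mid_range = apps_in_install_range(sample_rows, 'LIFESTYLE', 1_000_000, 100_000_000)
print(mid_range)  # [('H&M', 10000000), ('Samsung+', 50000000)]
```

Changing the range then only requires different `low` and `high` arguments instead of a new list of strings.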

| Category | Popularity |
|:----------:|:----------:|
| Lifestyle (Android) | 1,437,816 avg. installs |
| Lifestyle (iOS) | 16,486 avg. user ratings |

#### Conclusion

In this project, we analyzed data from the `App Store` and `Google Play` in order to identify an app profile that could be profitable in both markets.

Our findings suggest that an app covering diverse `Lifestyle` content has promising potential in both `Google Play` and the `App Store`. Notably, the `Lifestyle` category is less popular than the leading categories and has few dominant apps, which presents an opportunity to introduce a significant app with little direct competition in this segment.