# Project Types of Mobile Applications

## Table of Contents
### &ensp;&ensp; 1. [Opening and Exploring the Dataset](#1.)
### &ensp;&ensp; 2. [Cleaning the Dataset](#2.)
### &ensp;&ensp; 3. [Analysing the Dataset](#3.)
### &ensp;&ensp; 4. [Conclusion](#4.)

## Abstract

- This project analyses two datasets of mobile apps, [an iOS dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) and [an Android dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps), with the goal of providing `the app development team` with a report on `what types of apps are likely to attract more users` on Google Play and the App Store, so that they can build `an English-language app that is free to download and install`.

- This project also gives me the opportunity to practise basic skills such as:
    1. Writing functions and calling them with positional and keyword arguments.
    2. Using for loops and nested for loops with lists.
    3. Using the `sorted(iterable, key=None, reverse=False)` function to sort a list.
    4. Using dictionaries to build frequency tables.
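
As a quick illustration of skills 3 and 4, the sketch below builds a frequency table with a dictionary and then sorts it with `sorted()` and a `key` function. The genre list is made-up sample data, not taken from either dataset.

```python
# Hypothetical sample data, for illustration only
genres = ['Games', 'Music', 'Games', 'Weather', 'Games', 'Music']

# Build a frequency table with a dictionary
freq = {}
for genre in genres:
    if genre in freq:
        freq[genre] += 1
    else:
        freq[genre] = 1

# Sort the (genre, count) pairs by count, highest first
sorted_freq = sorted(freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_freq)
```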

##  1. Opening and Exploring the Dataset

In this section, both datasets will be opened and read into a list of lists for preliminary exploration.
- [The iOS dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) contains data about approximately 7,000 iOS apps from the App Store. The dataset can be downloaded from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
- [The Android dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) contains data about approximately 10,000 Android apps from Google Play. The dataset can be downloaded from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

In [1]:
# Create a function to read a .csv file into a list of lists
from csv import reader

def read_dataset(csv_file, header=True):
    # The with statement closes the file once it has been read;
    # newline='' is the setting the csv module recommends for file objects
    with open(csv_file, encoding='utf8', newline='') as opened_file:
        dataset = list(reader(opened_file))
    if header:
        dataset_header = dataset[0]
        dataset_rows = dataset[1:]
        return dataset_header, dataset_rows
    return dataset

In [2]:
# Create a reusable function that prints rows in a readable way
def explore_data(dataset, start_row, end_row, rows_and_columns=False):
    dataset_sliced = dataset[start_row:end_row]
    for row in dataset_sliced:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Part One: Open and Explore the iOS dataset

In [3]:
ios_header, ios_dataset = read_dataset('AppleStore.csv')
print('First 5 rows of iOS dataset:\n')
explore_data(ios_dataset, 0, 5, rows_and_columns=True)
print('Column names:\n', ios_header)

First 5 rows of iOS dataset:

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16
Column names:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.n

Based on the above information, the dataset contains 7,197 iOS apps and 16 columns. The following 6 columns will be used in the analysis:

|No.|Column Name|Description|
|:-----|:-----|:-----|
|1|track_name|Application Name|
|2|currency|Currency Type|
|3|price|Price amount|
|4|rating_count_tot|User rating counts (for all versions)|
|5|rating_count_ver|User rating counts (for the current version)|
|6|prime_genre|Primary Genre|


### Part Two: Open and Explore the Android dataset

In [4]:
android_header, android_dataset = read_dataset('googleplaystore.csv')
print('First 5 rows of Android dataset:\n')
explore_data(android_dataset, 0, 5, rows_and_columns=True)
print('Column names:\n', android_header)

First 5 rows of Android dataset:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13
C

Based on the above information, the dataset contains 10,841 Android apps and 13 columns. The following 7 columns will be used in the analysis:

|No.|Column Name|Description|
|:-----|:-----|:-----|
|1|App|Application Name|
|2|Category|Application Category|
|3|Reviews|The Number of Reviews|
|4|Installs|The Number of Times the App Has Been Installed|
|5|Type|Whether the App is Free or Paid|
|6|Price|Price amount|
|7|Genres|Genres of the App|


## 2. Cleaning the Dataset
In this section, the datasets will be cleaned prior to analysis to ensure that they are accurate.

### The Flow of Data Cleaning includes:
1. Remove Rows with Missing Columns.
2. Remove Duplicate Rows.
3. Remove Non-English Apps.
4. Identify Free Apps.

### 2.1. Remove Rows with Missing Columns
A function `missing_column()` will be created to check whether a dataset contains rows with missing columns.  
- For each row with a missing column, the function reports:
    - The length of the row.
    - The index of the row.
- It also reports how many rows with missing columns were found.

In [5]:
# Create a function that checks whether there are rows with missing columns.
def missing_column(dataset, header):
    row_count = 0
    # enumerate() yields each row's own index, even when identical rows exist
    for index, row in enumerate(dataset):
        if len(row) != len(header):
            row_count += 1
            print(row)
            print('\n')
            print('The length of the row is:', len(row))
            print('The index of the row is:', index)
    print(row_count, 'row(s) containing missing column has(have) been found.' )

### 2.1.1. Part One: Examine the iOS dataset

In [6]:
missing_column(ios_dataset, ios_header)

0 row(s) containing missing column has(have) been found.


The result shows that there is no missing column for the iOS dataset.  
  
### 2.1.2. Part Two: Examine the Android dataset

In [7]:
missing_column(android_dataset, android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The length of the row is: 12
The index of the row is: 10472
1 row(s) containing missing column has(have) been found.


In [8]:
# Remove the row with index 10472
del android_dataset[10472]
print('The number of rows in the Android dataset is:', len(android_dataset))

The number of rows in the Android dataset is: 10840


In [9]:
# Double-check that the row with the missing column has been removed.
missing_column(android_dataset, android_header)

0 row(s) containing missing column has(have) been found.


### 2.1.3. Summary
- For the iOS dataset, no rows with missing columns were found. The dataset still consists of 7197 rows.
- For the Android dataset, 1 row with a missing column was found and removed. The dataset now contains 10840 rows.

### 2.2. Remove Duplicate Rows
A function `duplicate_row()` will be created to check for duplicate rows in a dataset. It reports the following:
- Number of unique apps.
- Number of duplicate apps.
- Number of duplicate rows.
- Up to ten key-value examples of duplicate apps, pairing each app's name with its duplicate count.

In [10]:
def duplicate_row(dataset, index_to_check):
    duplicate_apps_list = []
    unique_apps_list = []
    duplicate_apps_dict = {}
    for row in dataset:
        app_name = row[index_to_check]
        if app_name in duplicate_apps_dict:
            duplicate_apps_dict[app_name] += 1
            duplicate_apps_list.append(app_name)
        else:
            duplicate_apps_dict[app_name] = 0
            unique_apps_list.append(app_name)
    
    # Sort the (name, count) pairs from dict.items() in descending order of count, keeping only apps that have at least one duplicate.
    new_dict = {}
    for item in sorted(duplicate_apps_dict.items(), key=lambda x: x[1], reverse=True):
        if item[1] > 0:
            new_dict[item[0]] = item[1]
    
    print('Number of unique apps:', len(unique_apps_list))
    print('Number of duplicate apps:', len(new_dict))
    print('Number of duplicate rows:', len(duplicate_apps_list))
    print('Examples of duplicated apps with Name and its Duplicate Count:\n', dict(list(new_dict.items())[:10]))

### 2.2.1. Part One: Investigate the iOS dataset

In [11]:
print('For the iOS dataset:')
duplicate_row(ios_dataset, 1)

For the iOS dataset:
Number of unique apps: 7195
Number of duplicate apps: 2
Number of duplicate rows: 2
Examples of duplicated apps with Name and its Duplicate Count:
 {'Mannequin Challenge': 1, 'VR Roller Coaster': 1}


In [12]:
print(ios_header)
print('\n')
for row in ios_dataset:
    if row[1] == 'Mannequin Challenge' or row[1] == 'VR Roller Coaster':
        print(row)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


### Based on the above information:
- The iOS dataset has `2 duplicate rows with 2 duplicate apps`.
- Looking at the duplicate rows shows that they carry different versions and different rating counts.
- An additional rule is therefore applied when dealing with duplicate apps:
    - `Keep the row with the highest rating count`
- A function `duplicate_removal()` will be created to remove the duplicate rows from the dataset.
    - The function returns a clean dataset:
        - For duplicate apps, only the row with the highest rating count is kept.
In [13]:
def duplicate_removal(dataset, index_name, index_rating):
    # Create a dict that holds only the highest numeric value for each app
    apps_highest_rating_dict = {}
    for row in dataset:
        app_name = row[index_name]
        app_rating = float(row[index_rating])
        if app_name in apps_highest_rating_dict:
            if app_rating > apps_highest_rating_dict[app_name]:
                apps_highest_rating_dict[app_name] = app_rating
        else:
            apps_highest_rating_dict[app_name] = app_rating
    # Create a list to hold a unique row for each app
    new_dataset = []
    app_already_added = [] # Prevents rows that tie on the highest value from being added more than once.
    for row in dataset:
        app_name = row[index_name]
        app_rating = float(row[index_rating])
        # Check whether the value of app_rating is the highest as the one held in the dict
        if (app_rating == apps_highest_rating_dict[app_name]) and (app_name not in app_already_added):
            new_dataset.append(row)
            app_already_added.append(app_name)
    return new_dataset

In [14]:
ios_dataset_clean = duplicate_removal(ios_dataset, index_name=1, index_rating=5)
print('Number of rows in the iOS dataset without duplicate rows:', len(ios_dataset_clean))
print('\n')
# Double check whether the duplicate rows have been removed.
duplicate_row(ios_dataset_clean, 1)

Number of rows in the iOS dataset without duplicate rows: 7195


Number of unique apps: 7195
Number of duplicate apps: 0
Number of duplicate rows: 0
Examples of duplicated apps with Name and its Duplicate Count:
 {}


### 2.2.2. Part Two: Investigate the Android dataset

In [15]:
print('For the Android dataset:')
duplicate_row(android_dataset, 0)

For the Android dataset:
Number of unique apps: 9659
Number of duplicate apps: 798
Number of duplicate rows: 1181
Examples of duplicated apps with Name and its Duplicate Count:
 {'ROBLOX': 8, 'CBS Sports App - Scores, News, Stats & Watch Live': 7, 'Duolingo: Learn Languages Free': 6, 'Candy Crush Saga': 6, '8 Ball Pool': 6, 'ESPN': 6, 'Nick': 5, 'Subway Surfers': 5, 'Bubble Shooter': 5, 'slither.io': 5}


In [16]:
print(android_header)
print('\n')
for row in android_dataset:
    if row[0] == 'slither.io':
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['slither.io', 'GAME', '4.4', '5234162', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Action', 'November 14, 2017', 'Varies with device', '2.3 and up']
['slither.io', 'GAME', '4.4', '5234825', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Action', 'November 14, 2017', 'Varies with device', '2.3 and up']
['slither.io', 'GAME', '4.4', '5234810', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Action', 'November 14, 2017', 'Varies with device', '2.3 and up']
['slither.io', 'GAME', '4.4', '5235294', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Action', 'November 14, 2017', 'Varies with device', '2.3 and up']
['slither.io', 'GAME', '4.4', '5235294', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Action', 'November 14, 2017', 'Varies with device', '2.3 

### Based on the above information:
- The Android dataset has `1181 duplicate rows with 798 duplicate apps`.
- In the `slither.io` example, the duplicate rows differ only in their review counts.
- The same logic will be applied for dealing with duplicate apps:
    - Use the function `duplicate_removal()` to obtain a dataset without duplicate rows.
    - `Keep the row with the highest review count`.

In [17]:
android_dataset_clean = duplicate_removal(android_dataset, index_name=0, index_rating=3)
print('Number of rows in the Android dataset without duplicate rows:', len(android_dataset_clean))
print('\n')
# Double check whether the duplicate rows have been removed.
duplicate_row(android_dataset_clean, 0)

Number of rows in the Android dataset without duplicate rows: 9659


Number of unique apps: 9659
Number of duplicate apps: 0
Number of duplicate rows: 0
Examples of duplicated apps with Name and its Duplicate Count:
 {}


### 2.2. Summary
- For the iOS dataset, 2 duplicate rows have been removed and only rows with the highest rating count are kept.
  - There are now 7195 rows in the iOS dataset.
- For the Android dataset, 1181 duplicate rows have been deleted and only rows with the highest review count are kept.
  - There are now 9659 rows in the Android dataset.
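
The "keep the row with the highest count" rule can also be sketched in a single pass over the data. The helper name `keep_highest_count` is hypothetical, and the sample rows are shortened to four columns (name, category, rating, count) using values from the outputs above. Unlike `duplicate_removal()`, this version orders the result by each name's first appearance.

```python
def keep_highest_count(rows, index_name, index_count):
    """Return one row per app name, keeping the row with the highest count."""
    best_rows = {}
    for row in rows:
        name = row[index_name]
        count = float(row[index_count])
        # The strict '>' keeps the first row seen when counts are tied
        if name not in best_rows or count > float(best_rows[name][index_count]):
            best_rows[name] = row
    return list(best_rows.values())

# Shortened sample rows: name, category, rating, review count
sample_rows = [
    ['slither.io', 'GAME', '4.4', '5234162'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
    ['slither.io', 'GAME', '4.4', '5235294'],
]

cleaned = keep_highest_count(sample_rows, index_name=0, index_count=3)
for row in cleaned:
    print(row)
```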

### 2.3. Remove Non-English Apps
Since the app the development team aims to design is for English speakers, in this step we will remove rows where `the app name contains more than 3 non-ASCII characters`. Note that this method will eliminate most, but not all, of the non-English apps.

Each character that is commonly used in English text corresponds to a number in the `range from 0 to 127` according to [ASCII](https://www.techonthenet.com/ascii/chart.php).

Based on this number range:
- A function `non_english()` will be created to report the apps whose names are treated as non-English.
- A function `is_english()` will be built to return a dataset containing only the English apps.
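
To make the ASCII check concrete, here is a minimal sketch that counts the characters of a name whose code point lies above 127. The helper name `count_non_ascii` is hypothetical; the app names are taken from the outputs in this notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for character in text if ord(character) > 127)

print(ord('a'))  # an ordinary English letter falls inside 0-127

names = [
    'Instagram',
    'U Launcher Lite – FREE Live Cool Themes, Hide Apps',
    '优酷视频',
]
for name in names:
    count = count_non_ascii(name)
    # Names with more than 3 non-ASCII characters are treated as non-English
    print(name, '->', count, 'non-ASCII character(s), kept:', count <= 3)
```

The second name shows why a small tolerance is useful: a single typographic dash should not disqualify an otherwise English name.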

In [18]:
def non_english(dataset, index_name):
    non_english_app = []
    for row in dataset:
        apps_name = row[index_name]
        non_ASCII_count = 0 # Counts the non-ASCII characters in the app name
        for character in apps_name:
            number = ord(character)
            if number > 127:
                non_ASCII_count += 1
        # This allows English apps to contain up to 3 non-ASCII characters, such as emoji or '–'
        if non_ASCII_count > 3:
            non_english_app.append(apps_name)
    print(len(non_english_app),'non-English apps have been found.')
    print('Examples of non-English Apps:\n', non_english_app[:15])

In [19]:
def is_english(dataset, index_name):
    is_english_app = []
    for row in dataset:
        apps_name = row[index_name]
        non_ASCII_count = 0 # Counts the non-ASCII characters in the app name
        for character in apps_name:
            number = ord(character)
            if number > 127:
                non_ASCII_count += 1
        # This allows English apps to contain up to 3 non-ASCII characters
        if non_ASCII_count <= 3:
            is_english_app.append(row)
    print('Number of English apps:', len(is_english_app))
    return is_english_app

### 2.3.1. Part One: Check the iOS dataset

In [20]:
non_english(ios_dataset_clean, 1)

1014 non-English apps have been found.
Examples of non-English Apps:
 ['爱奇艺PPS -《欢乐颂2》电视剧热播', '聚力视频HD-人民的名义,跨界歌王全网热播', '优酷视频', '网易新闻 - 精选好内容，算出你的兴趣', '淘宝 - 随时随地，想淘就淘', '搜狐视频HD-欢乐颂2 全网首播', '阴阳师-全区互通现世集结', '百度贴吧-全球最大兴趣交友社区', '百度网盘', '爱奇艺HD -《欢乐颂2》电视剧热播', '乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播', '万年历-值得信赖的日历黄历查询工具', '新浪新闻-阅读最新时事热门头条资讯视频', '喜马拉雅FM（听书社区）电台有声小说相声英语', '央视影音-海量央视内容高清直播']


In [21]:
ios_dataset_Eng = is_english(ios_dataset_clean, 1)

Number of English apps: 6181


### 2.3.2. Part Two: Check the Android dataset

In [22]:
non_english(android_dataset_clean, 0)

45 non-English apps have been found.
Examples of non-English Apps:
 ['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости', 'صور حرف H', 'L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'AJ렌터카 법인 카셰어링', 'Al Quran Free - القرآن (Islam)', '中国語 AQリスニング', '日本AV历史', 'Ay Yıldız Duvar Kağıtları', 'বাংলা টিভি প্রো BD Bangla TV', 'Cъновник BG', 'CSCS BG (в български)', '뽕티비 - 개인방송, 인터넷방송, BJ방송']


In [23]:
android_dataset_Eng = is_english(android_dataset_clean, 0)

Number of English apps: 9614


### 2.3. Summary
- The iOS dataset now contains 6181 rows after removing 1014 non-English apps from the dataset.
- The Android dataset now contains 9614 rows after removing 45 non-English apps from the dataset.

### 2.4. Identify Free Apps
The last step of data cleaning is to keep only the free apps in each dataset.
- The paid apps will be removed from each dataset.

A function `free_or_paid()` will be created, which reports the number of free and paid apps and returns a dataset with only the free apps.
 

In [24]:
def free_or_paid(dataset, index_price, string_type=False):
    free_app = []
    paid_app = []
    if string_type:
        for row in dataset:
            price = row[index_price]
            # Compare in uppercase so that 'Free', 'free' and 'FREE' all match
            if price.upper() == 'FREE':
                free_app.append(row)
            else:
                paid_app.append(row)
    else:
        for row in dataset:
            price = float(row[index_price])
            if price == 0:
                free_app.append(row)
            else:
                paid_app.append(row)
    print('Total number of apps:', len(dataset))
    print('Number of Free apps:', len(free_app))
    print('Number of Paid apps', len(paid_app))
    return free_app

In [25]:
ios_dataset_free = free_or_paid(ios_dataset_Eng, index_price=4)

Total number of apps: 6181
Number of Free apps: 3220
Number of Paid apps 2961


In [26]:
android_dataset_free = free_or_paid(android_dataset_Eng, index_price=6, string_type=True)

Total number of apps: 9614
Number of Free apps: 8863
Number of Paid apps 751


### 2.4. Summary
This is the end of data cleaning. Both the iOS and Android datasets are ready for the analysis.
- The iOS dataset now contains 3220 rows.
- The Android dataset now contains 8863 rows.

## 3. Analysing the Dataset

The aim of the current project is to determine the types of apps that are more likely to attract users, because the company's revenue depends on in-app advertising: the more users an app has, the higher the revenue.

Considering the budget and risk, the company has a `validation strategy` for an app idea, which consists of the following 3 steps:
1. Build an Android version of the app and publish it on Google Play.
2. Collect user feedback to see whether the app has potential. If the app receives a good response, we continue to develop it.
3. If the app is profitable after 6 months, we build an iOS version of the app and publish it on the App Store.

In this section, we aim to find out, for both the App Store and Google Play:
1. Which app categories are the most common.
2. Which app categories have the highest number of users.

### 3.1. Most Common App Categories
In this section, we find out which app categories are the most common on both the App Store and Google Play.

A function `genre_percentage()` will be created to compute the `frequency` of each app category. It returns a list of (percentage, category) tuples sorted in descending order of percentage, where each percentage is relative to the whole dataset.

In [27]:
def genre_percentage(dataset, index_to_check):
    # Obtain the frequency of each category
    genre_freq_dict = {}
    for row in dataset:
        genre = row[index_to_check]
        if genre in genre_freq_dict:
            genre_freq_dict[genre] += 1
        else:
            genre_freq_dict[genre] = 1
    
    # Convert the frequency of each category into percentages
    genre_percentage_dict = {}
    total = len(dataset)
    for key in genre_freq_dict:
        genre_percentage_dict[key] = round((genre_freq_dict[key] / total) * 100, 2)
    
    # Append the data from the dict to a list and sort the data in descending order
    genre_list = []
    for key in genre_percentage_dict:
        # The value comes before the key in each tuple so that sorted() orders the list by percentage
        value_key = (genre_percentage_dict[key], key)
        genre_list.append(value_key)
    
    sorted_list = sorted(genre_list, reverse=True)

    return sorted_list
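
As an alternative sketch, the same frequency table can be built with `collections.Counter` from the standard library. The function name `genre_percentage_counter` and the sample rows are hypothetical; the return format mirrors `genre_percentage()`.

```python
from collections import Counter

def genre_percentage_counter(dataset, index_to_check):
    """Return (percentage, genre) tuples sorted from most to least common."""
    counts = Counter(row[index_to_check] for row in dataset)
    total = len(dataset)
    percentages = [
        (round(count / total * 100, 2), genre)
        for genre, count in counts.items()
    ]
    return sorted(percentages, reverse=True)

# Hypothetical sample rows: app name, genre
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
result = genre_percentage_counter(sample_rows, 1)
print(result)
```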

### 3.1.1. Part One: Look into the iOS dataset
In the iOS dataset, only one column, `prime_genre`, indicates the category each app belongs to.

In [28]:
ios_category = genre_percentage(ios_dataset_free, 11)
for item in ios_category:
    print(item[1], ':', item[0], '%')

Games : 58.14 %
Entertainment : 7.89 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.52 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.34 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


### Based on the above information:
- The top 5 common categories are  
`Games(58.14%)`, `Entertainment(7.89%)`, `Photo & Video(4.97%)`, `Education(3.66%)` and `Social Networking(3.29%)`.
- This suggests that among the free English apps in the App Store, around 70% are designed for fun (i.e., games, entertainment, photo & video), while the others are designed for practical uses (e.g., education, shopping, utilities, productivity, lifestyle, etc.).
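
The "around 70%" figure can be checked by adding up the three percentages printed above. This is only a small arithmetic sketch; the dictionary name `fun_categories` is made up for the example.

```python
# Percentages copied from the output of the iOS frequency table above
fun_categories = {
    'Games': 58.14,
    'Entertainment': 7.89,
    'Photo & Video': 4.97,
}

fun_share = round(sum(fun_categories.values()), 2)
print('Share of apps designed for fun:', fun_share, '%')
```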

### 3.1.2. Part Two: Look into the Android dataset
In the Android dataset, two columns, `Category` and `Genres`, indicate the category each app belongs to. Both columns will be examined to see whether they differ.

In [29]:
android_category = genre_percentage(android_dataset_free, 1)
print('Number of categories:', len(android_category))
print('\n')
for item in android_category:
    print(item[1], ':', item[0], '%')

Number of categories: 33


FAMILY : 18.9 %
GAME : 9.73 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.9 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.53 %
SPORTS : 3.4 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %


In [30]:
android_genre = genre_percentage(android_dataset_free, 9)
print('Number of categories:', len(android_genre))
print('\n')
for item in android_genre:
    print(item[1], ':', item[0], '%')

Number of categories: 114


Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Productivity : 3.89 %
Lifestyle : 3.89 %
Finance : 3.7 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.25 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.93 %
Strategy : 0.9 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Pl

### Based on the above information:
- For the `Category` column, the top 5 common categories are `Family(18.90%)`, `Game(9.73%)`, `Tools(8.46%)`, `Business(4.59%)` and `Lifestyle(3.90%)`.
    - By searching [Family](https://play.google.com/store/apps/category/FAMILY?hl=en) under the app category on Google Play, it turned out that `Family` means games for kids from 5 to 12 years old.


- For the `Genres` column, the top 5 common categories are `Tools(8.45%)`, `Entertainment(6.07%)`, `Education(5.35%)`, `Business(4.59%)` and, tied for fifth place, `Productivity(3.89%)` and `Lifestyle(3.89%)`.


- The `Genres` column has about 3.5 times as many categories (114) as the `Category` column (33). This seems to be a result of adding sub-categories under the main categories in the `Genres` column, which gives more detail about each app. However, because we are looking at the bigger picture, we will work with the `Category` column rather than the `Genres` column.


- This suggests that among the free English apps on Google Play, just under 30% are designed for fun (i.e., family and game), while the others are for practical purposes (e.g., business, finance and medical).
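
The effect of sub-categories on the number of distinct genre labels can be sketched as follows. The labels are taken from the outputs above; treating the part before `;` as the main genre is an assumption made for illustration, not something the dataset documents.

```python
# Genre labels as they appear in the Android outputs above
genre_labels = [
    'Art & Design',
    'Art & Design;Pretend Play',
    'Art & Design;Creativity',
    'Educational;Education',
    'Education;Education',
]

# Assumption: the text before ';' is the main genre, the rest is a sub-category
main_genres = set()
for label in genre_labels:
    main_genre = label.split(';')[0]
    main_genres.add(main_genre)

print(len(genre_labels), 'labels ->', len(main_genres), 'main genres')
print(sorted(main_genres))
```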

### 3.1. Summary
The result of the analysis suggests that:
- For the iOS dataset, the majority of the apps are designed for fun.
- For the Android dataset, the mix of app categories is relatively balanced between apps designed for fun and those designed for practical purposes.

### 3.2. Most Popular App Categories
In this section, we will look for the app categories that have `the highest average number of users` on both the App Store and Google Play.

A function `genre_user_counts()` will be created to check the popularity of each app category. It returns a list of (average user count, category) tuples sorted in descending order of the average.

### 3.2.1. Part One: Examine the iOS dataset
In the iOS dataset, there are two columns that indicate the number of users for each app:
1. `rating_count_tot (user rating counts for all versions)`
2. `rating_count_ver (user rating counts for the current version)`

Here we will only work with the `rating_count_tot` column, since it covers ratings for all versions.

In [31]:
def genre_user_counts(genre_list, dataset, index_genre, index_count):
    # The idea here is to use a for loop to iterate through the category list from the genre_percentage() function
    avg_user_count_list = []
    for genre in genre_list:
        list_category = genre[1]
        user_count = 0
        total = 0
        for row in dataset:
            row_category = row[index_genre]
            rating_count = float(row[index_count])
            if row_category == list_category:
                user_count += rating_count
                total += 1
        avg_user_count_list.append((round(user_count / total, 2), list_category))
    sorted_list = sorted(avg_user_count_list, reverse=True)
    return sorted_list        
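
The function above rescans the whole dataset once per genre. As an alternative sketch, the averages can be accumulated in a single pass with two dictionaries. The name `genre_user_counts_single_pass` is hypothetical, and the sample rows are shortened to three columns (name, rating count, genre) using values from the iOS output above.

```python
def genre_user_counts_single_pass(dataset, index_genre, index_count):
    """Return (average count, genre) tuples sorted from highest to lowest average."""
    totals = {}
    counts = {}
    for row in dataset:
        genre = row[index_genre]
        totals[genre] = totals.get(genre, 0) + float(row[index_count])
        counts[genre] = counts.get(genre, 0) + 1
    averages = [(round(totals[genre] / counts[genre], 2), genre) for genre in totals]
    return sorted(averages, reverse=True)

# Shortened sample rows: app name, total rating count, genre
sample_rows = [
    ['Facebook', '2974676', 'Social Networking'],
    ['Clash of Clans', '2130805', 'Games'],
    ['Temple Run', '1724546', 'Games'],
]
averages = genre_user_counts_single_pass(sample_rows, index_genre=2, index_count=1)
print(averages)
```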

In [33]:
ios_user_counts = genre_user_counts(ios_category, ios_dataset_free, index_genre=11, index_count=5)
for row in ios_user_counts:
    print(row[1], ':', row[0])

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22812.92
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0


### Based on the above information:
- The top 5 popular categories are  
`Navigation(86,090)`, `Reference(74,942)`, `Social Networking(71,548)`, `Music(57,327)` and `Weather(52,280)`.

- This result differs from the analysis of the most common apps, where the majority of free English apps in the iOS dataset are designed for fun (e.g., games, entertainment).

In the cell below, we use a nested for loop to look into the `rating_count_tot` column for the apps within each of the top 6 popular categories, to see whether the average number of users is skewed by a few apps.

In [48]:
# Iterate through each genre
for genre in ios_user_counts[:6]:
    genre_name = genre[1]
    print(genre[1], ':', genre[0])
    app_list = []
    total = 0
    # Iterate through each row of the dataset and see if the category matches the genre
    for row in ios_dataset_free:
        category = row[11]
        rating_counts = int(row[5])
        app_name = row[1]
        if genre_name == category:
            total += 1
            app_list.append((rating_counts, app_name))
    sorted_list = sorted(app_list, reverse=True)
    print('Number of apps:', total)
    for item in sorted_list[:5]:
        print(item[1], ':', item[0])
    print('\n')

Navigation : 86090.33
Number of apps: 6
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187


Reference : 74942.11
Number of apps: 18
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418


Social Networking : 71548.35
Number of apps: 106
Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293


Music : 57326.53
Number of apps: 66
Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744


Weather : 52279.89
Number of apps: 28
The Weather Channel: Forecast, Radar & Alerts : 495626
The

### Based on the above information:
- For the category `Navigation(86,090)`, the result suggests that the average was heavily affected by the following two apps:
    1. Waze(345,046)
    2. Google Maps(154,911)


- For the category `Reference(74,942)`, the result shows that the average was skewed upward by the following two apps:
    1. Bible(985,920)
    2. Dictionary.com(200,047)


- The same pattern applies to the categories `Social Networking`, `Music`, `Weather` and `Book`.
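
One way to gauge how strongly a few apps pull an average upward is to compare the mean with the median, and with a mean that leaves out the largest apps. The sketch below is only an illustration: it uses the five `Navigation` rating counts printed above (the sixth app in the category is not shown in the output, so it is left out), and the names `navigation_counts` and `mean` are hypothetical helpers that do not appear elsewhere in this notebook.

```python
from statistics import median

# Rating counts of the five Navigation apps printed above
# (the category's sixth app is not shown in the output, so it is omitted here)
navigation_counts = [345046, 154911, 12811, 3582, 187]

def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

mean_all = mean(navigation_counts)
median_all = median(navigation_counts)
# Drop the two largest apps (Waze and Google Maps) and average the rest
mean_without_top_two = mean(sorted(navigation_counts, reverse=True)[2:])

print('Mean of the five apps:', round(mean_all, 2))
print('Median of the five apps:', median_all)
print('Mean without the top two:', round(mean_without_top_two, 2))
```

For these five apps the median (12,811) is far below the mean (103,307.4), which is what we would expect when one or two apps dominate a category.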


For now, it could be considered that
1. The `Navigation` category might require a higher budget for developing an app.
2. The market in both the `Social Networking` and `Music` categories could be saturated.
3. The `Weather` category might not produce much profit from in-app advertising, considering that users would likely close the app soon after opening it.

The category `Book` could be a potential choice. However, further information regarding the company's preferences is needed to leverage the result of the analysis and make a recommendation.

### 3.2.2. Part Two: Examine the Android dataset
In the Android dataset, there are two columns that indicate the number of users for each app:
1. `Reviews (the number of reviews)`
2. `Installs (the number of times the app has been installed)`

### 3.2.2.1. Check the average number of users by the column `Reviews`
This is similar to the iOS dataset, where the `users' rating counts` were used for the analysis.

In [50]:
android_user_counts = genre_user_counts(android_category, android_dataset_free, index_genre=1, index_count=3)
for row in android_user_counts:
    print(row[1], ':', row[0])

COMMUNICATION : 995608.46
SOCIAL : 965830.99
GAME : 683523.84
VIDEO_PLAYERS : 425350.08
PHOTOGRAPHY : 404081.38
TOOLS : 305732.9
ENTERTAINMENT : 301752.25
SHOPPING : 223887.35
PERSONALIZATION : 181122.32
WEATHER : 171250.77
PRODUCTIVITY : 160634.54
MAPS_AND_NAVIGATION : 142860.05
TRAVEL_AND_LOCAL : 129484.43
SPORTS : 116938.61
FAMILY : 113210.55
NEWS_AND_MAGAZINES : 93088.03
BOOKS_AND_REFERENCE : 87995.07
HEALTH_AND_FITNESS : 78094.97
FOOD_AND_DRINK : 57478.79
EDUCATION : 56293.1
COMICS : 42585.62
FINANCE : 38535.9
LIFESTYLE : 33921.82
HOUSE_AND_HOME : 26435.47
ART_AND_DESIGN : 24699.42
BUSINESS : 24239.73
DATING : 21953.27
PARENTING : 16378.71
AUTO_AND_VEHICLES : 14140.28
LIBRARIES_AND_DEMO : 10925.81
BEAUTY : 7476.23
MEDICAL : 3730.15
EVENTS : 2555.84


In [52]:
for genre in android_user_counts[:6]:
    genre_name = genre[1]
    print(genre[1], ':', genre[0])
    app_list = []
    total = 0
    # Iterate through each row of the dataset and see if the category matches the genre
    for row in android_dataset_free:
        category = row[1]
        rating_counts = int(row[3])
        app_name = row[0]
        if genre_name == category:
            total += 1
            app_list.append((rating_counts, app_name))
    sorted_list = sorted(app_list, reverse=True)
    print('Number of apps:', total)
    for item in sorted_list[:5]:
        print(item[1], ':', item[0])
    print('\n')

COMMUNICATION : 995608.46
Number of apps: 287
WhatsApp Messenger : 69119316
Messenger – Text and Video Chat for Free : 56646578
UC Browser - Fast Download Private & Secure : 17714850
BBM - Free Calls & Messages : 12843436
Viber Messenger : 11335481


SOCIAL : 965830.99
Number of apps: 236
Facebook : 78158306
Instagram : 66577446
Snapchat : 17015352
Facebook Lite : 8606259
VK : 5793284


GAME : 683523.84
Number of apps: 862
Clash of Clans : 44893888
Subway Surfers : 27725352
Clash Royale : 23136735
Candy Crush Saga : 22430188
My Talking Tom : 14892469


VIDEO_PLAYERS : 425350.08
Number of apps: 159
YouTube : 25655305
VivaVideo - Video Editor & Photo Movie : 9879473
MX Player : 6474672
VideoShow-Video Editor, Video Maker, Beauty Camera : 4016834
DU Recorder – Screen Recorder, Video Editor, Live : 2588730


PHOTOGRAPHY : 404081.38
Number of apps: 261
Google Photos : 10859051
PicsArt Photo Studio: Collage Maker & Pic Editor : 7594559
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 752

### Based on the above information:
- As in the iOS dataset, the average number of users in each category is skewed mainly by a few very popular apps.
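
To put a rough number on this, we can estimate how much of a category's total review count belongs to its five largest apps. The sketch below is a back-of-the-envelope check rather than part of the original analysis: it copies the `COMMUNICATION` figures from the output above and recovers the category total from the rounded average, so the result is approximate. The variable names are illustrative.

```python
# Figures copied from the COMMUNICATION output above
average_reviews = 995608.46
number_of_apps = 287
top_five_reviews = [69119316, 56646578, 17714850, 12843436, 11335481]

# The category total is recovered (approximately) from the rounded average
estimated_total = average_reviews * number_of_apps
top_five_share = sum(top_five_reviews) / estimated_total

print('Share of reviews held by the top five apps:',
      round(top_five_share * 100, 1), '%')
```

Under these assumptions, five of the 287 apps account for well over half of all reviews in the category.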

### 3.2.2.2. Check the average number of users by the column `Installs`
Note that
- The data in the column `Installs` are strings that contain symbols such as ',' and '+'. These symbols need to be removed before the analysis.
- Values such as '10,000+' and '50,000,000+' are rough indicators of the number of users. Here we will first treat '10,000+' as 10,000, '50,000,000+' as 50,000,000 and so on to get a picture of what the dataset looks like.

In [59]:
# The symbols to remove from the data in the Installs column
replace_list = [',', '+']
android_dataset_install = []
for row in android_dataset_free:
    # Copy the row so that android_dataset_free itself is left unchanged
    cleaned_row = row[:]
    for symbol in replace_list:
        cleaned_row[5] = cleaned_row[5].replace(symbol, '')
    android_dataset_install.append(cleaned_row)
print(android_dataset_install[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10000', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
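
The same cleaning step could also be wrapped in a small reusable function. The sketch below is one possible version; the name `parse_installs` and its `symbols` parameter are hypothetical and are not used elsewhere in this notebook. It returns the number as an integer, which should be read as a lower bound because of the trailing '+'.

```python
def parse_installs(installs_text, symbols=(',', '+')):
    """Convert an Installs string such as '10,000+' to an integer lower bound."""
    for symbol in symbols:
        installs_text = installs_text.replace(symbol, '')
    return int(installs_text)

print(parse_installs('10,000+'))
print(parse_installs('50,000,000+'))
```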


In [61]:
android_install_counts = genre_user_counts(android_category, android_dataset_install, index_genre=1, index_count=5)
for row in android_install_counts:
    print(row[1], ':', row[0])

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3697848.17
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62


In [73]:
for genre in android_install_counts[:6]:
    genre_name = genre[1]
    # Print the counts in units of 10,000 installs to keep the output compact
    print(genre_name, ':', round(genre[0] / 10000))
    app_list = []
    total = 0
    # Iterate through each row of the dataset and see if the category matches the genre
    for row in android_dataset_install:
        category = row[1]
        install_counts = int(row[5])
        app_name = row[0]
        if genre_name == category:
            total += 1
            app_list.append((install_counts, app_name))
    # Sort by install count (ties are broken by app name), largest first
    sorted_list = sorted(app_list, reverse=True)
    print('Number of apps:', total)
    # Show the 15 most-installed apps in the category
    for item in sorted_list[:15]:
        print(item[1], ':', round(item[0] / 10000))
    print('\n')

COMMUNICATION : 3846
Number of apps: 287
WhatsApp Messenger : 100000
Skype - free IM & video calls : 100000
Messenger – Text and Video Chat for Free : 100000
Hangouts : 100000
Google Chrome: Fast & Secure : 100000
Gmail : 100000
imo free video calls and chat : 50000
Viber Messenger : 50000
UC Browser - Fast Download Private & Secure : 50000
LINE: Free Calls & Messages : 50000
Google Duo - High Quality Video Calls : 50000
imo beta free calls and text : 10000
Yahoo Mail – Stay Organized : 10000
Who : 10000
WeChat : 10000


VIDEO_PLAYERS : 2473
Number of apps: 159
YouTube : 100000
Google Play Movies & TV : 100000
MX Player : 50000
VivaVideo - Video Editor & Photo Movie : 10000
VideoShow-Video Editor, Video Maker, Beauty Camera : 10000
VLC for Android : 10000
Motorola Gallery : 10000
Motorola FM Radio : 10000
Dubsmash : 10000
Vote for : 5000
Vigo Video : 5000
VMate : 5000
Samsung Video Library : 5000
Ringdroid : 5000
MiniMovie - Free Video and Slideshow Editor : 5000


SOCIAL : 2325
Number

### Based on the above information:
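- The data in the `Installs` column only provide a rough picture of the popular apps under each common category, because the install counts are recorded as broad ranges (e.g., '10,000+') rather than exact numbers.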
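
A frequency table makes this coarseness visible. The sketch below is illustrative only: it takes the install counts of the 15 `COMMUNICATION` apps exactly as printed above (in units of 10,000 installs) and counts how often each value occurs. The names `communication_installs` and `install_freq` are assumptions made for this example.

```python
# Install counts (in units of 10,000) of the 15 COMMUNICATION apps printed above
communication_installs = [100000] * 6 + [50000] * 5 + [10000] * 4

# Build a frequency table: install value -> number of apps with that value
install_freq = {}
for value in communication_installs:
    if value in install_freq:
        install_freq[value] += 1
    else:
        install_freq[value] = 1

# Show the largest install values first
for value, count in sorted(install_freq.items(), reverse=True):
    print(value, ':', count)
```

Only three distinct values appear among the 15 apps, so apps within the same range cannot be ranked against each other using this column alone.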

Again, additional information regarding the company's preferences is required to leverage the result of the analysis and make a recommendation.

## 4. Conclusion

In this project, we cleaned and analysed the iOS dataset and the Android dataset with the goal of suggesting what type of app the development team could build to be profitable on both the App Store and Google Play.

We found that, in terms of the most common categories, apps for fun such as `game, entertainment` would be the first choice on both markets. In terms of the number of users, and considering that the company's budget, preferences and direction are still undecided, `book` could be a choice with potential.
