# What type of apps are likely to attract more users

### Which has the most profitable apps: Android or IOS?

### What are the most popular genres?

Requirements of the project:
- Working as a Data Analyst for a company that builds Android and IOS mobile apps.
- Objective: to define what type of apps are likely to attract more users.
- Analysed data: from Statist (German Statistics office). 
[August2018](https://www.kaggle.com/datasets/lava18/google-play-store-apps)
[July2018](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)

## The datasets

### Mobile App Statistics (Apple iOS app store)
This data set contains more than 7000 Apple iOS mobile application details. The data was extracted from the iTunes Search API at the Apple Inc website. R and linux web scraping tools were used for this study. The variables are poorly described in the data source.

Data collection date (from API);
July 2017

[Aplestore](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)

### Google Play Store Apps
This information is scraped from the Google Play Store. The date of scraping was not described by the author, and this is a limitation of this dataset. This dataset has most of the qualitative variables .

[Googleplaystore](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

## Open datasets

In [156]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [157]:
# open dataset AppleStore.csv and transform in a list of lists
opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios_data = ios[1:]

In [158]:
print(explore_data(ios, 0, 3, True))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
None


In [159]:
# count rows
r=0
for row in ios:
    r += 1
print(r)    

7198


In [160]:
# open dataset googleplaystore.csv and transform in a list of lists
opened_file = open('googleplaystore.csv')
from csv import reader
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android_data = android[1:]

In [161]:
# count rows
r=0
for row in android:
    r += 1
print(r) 

10842


In [162]:
print(explore_data(android, 0, 3, True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13
None


## Exploring datasets and Cleaning data

The datasets are cleaned to improve accuracy, check user engagement, and allow us to evaluate the kinds of apps that are likely to attract more users and, consequently, increase our revenue.

1-  Detect duplicate data, and remove the duplicates: to reduce errors and risks

2- Detect inaccurate data, and correct or remove it:
- Remove non-English apps
    > Most of the users use apps in English and keeping only with this language, we reach most users and reduce machine costs in evaluating. <br><br>
    > To classify the apps as English or not, ASCII is used.
    > English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, so we can build a function that checks an app name and tells us whether it contains non-ASCII characters.

- Remove apps that aren't free
    > I keep with the free apps. Most of the apps make money inside the app, having free installations. This business type makes sense for new apps like ours new apps.
    
3- Examining genres of apps.
- Most reviewed genres
    > Reviews are used to find engagement index of user with the app. Reviews are a benchmark for measuring user acceptance of an app.<br>
    
- Most installed genres
    > By the number of apps installed in certain genres, we can identify demand by genre, considering different platforms.
    

### Explore applestore_data

#### Check the number of columns in each row

In [163]:
# create a function to check and delete the rows with missing data/columns in the datasets
def check_missing_values(data):
    is_all_valid = True
    for row in data:
        n_columns = len(data[0])  # n_of_headers:
        if len(row) != n_columns:
            print(data.index(row))
            print('The row index ', data.index(row), ' is missing ', n_columns - len(row), ' column(s). This row is being deleted.')
            index_to_delete = data.index(row)
            del data[index_to_delete]
            is_all_valid = False

    if is_all_valid:
        print('There is no missing columns in the dataset.')

In [164]:
check_missing_values(ios)

There is no missing columns in the dataset.


In [165]:
# count rows
r=0
for row in ios:
    r += 1
print(r) 

7198


In [166]:
check_missing_values(android)

10473
The row index  10473  is missing  1  column(s). This row is being deleted.


In [167]:
# count rows
r=0
for row in android:
    r += 1
print(r) 

10841


### Check duplicate entries

In [168]:
# ios
for app in ios:
    name = app[0]
    if name == 'Instagram':
        print(app)

In [169]:
print(android[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [170]:
# googleplaystore_data
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [171]:
# Finding the number of duplicate app:  # IOS
unique_apps = []
duplicate_apps = []
for app in ios:
    app_name =  app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:     
        unique_apps.append(app_name)
        
print('Number of duplicate IOS apps: ', len(duplicate_apps))
if duplicate_apps:
    print('Examples of duplicate IOS apps: ', duplicate_apps[:3])
else:
    print('There is no duplicate IOS apps.')

Number of duplicate IOS apps:  0
There is no duplicate IOS apps.


In [172]:
# Finding the number of repeated app:  # Android
unique_apps = []
duplicate_apps = []
for app in android:
    app_name =  app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:     
        unique_apps.append(app_name)
        
print('Number of duplicate Android apps: ', len(duplicate_apps))
if duplicate_apps:
    print('Examples of duplicate Android apps: ', duplicate_apps[:15])
else:
    print('There is no duplicate Android apps.')

Number of duplicate Android apps:  1181
Examples of duplicate Android apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


The **IOS** dataset, from googleplaystore, has **1181** multiple entries to the app name. **'Quick PDF Scanner + OCR FREE'**, **'Box'**, **'Google My Business'** , **'Slack'**, **'ZOOM'** are some examples of repeated apps.

### Remove duplicates in apps, keeping the ones with more reviews.

In [173]:
print(explore_data(android, 0, 3, True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
None


In [174]:
# Android

reviews_max = {}

for app in android[1:]:   # excluding the header
    name = app[0]   
    n_reviews = len(app[3])  
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

In [175]:
print('Expected length:', len(android[1:]) - 1181)
print('Actual length:', len(reviews_max))   # considering unique values

Expected length: 9659
Actual length: 9659


In [176]:
android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = len(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)   # rows  
        already_added.append(name)   # values in apps

In [177]:
print(explore_data(android_clean, 0, 3, True))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13
None


### Remove Non-English Apps

In [178]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(ord('™'))
print(ord('😜'))

True
False
False
False
8482
128540


In [179]:
def is_english_0(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
        if non_ascii > 3:
            return False
        else:
            return True
# print(is_english('Instagram'))
# print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# print(is_english('Docs To Go™ Free Office Suite'))
# print(is_english(is_english(txt)))
print(is_english_0('欢乐颂2'))

True


Trying different because the code above didn't work well.

### Using Python String isascii() Method

In [180]:
def is_english(txt):
    n_non_englisg_letter = 0

    for letter in txt:
        if letter.isascii() == False:
            n_non_englisg_letter += 1

    if n_non_englisg_letter > 3:
        return False

    else:    
        return True

# txt = "Company123欢乐欢乐欢乐欢乐"
txt = '最長１週間の献立が簡単に作れるme:new（ミーニュー）'
# txt = 'Instachat 😜'
# txt = 'Instagram'
# txt = '爱奇艺PPS -《欢乐颂2》电视剧热播'
# txt = 'Docs To Go™ Free Office Suite'
is_english(txt)

False

This second option worked better than using the 1st method.
I considered words with more than 3 characters, not ascii as non-English.

In [181]:
ios_english = []
android_english = []

for app in ios[1:]:
    name = app[0]
    if is_english(name) == True:
        ios_english.append(app)
        
explore_data(ios_english, 0, 3, True)
print('\n')

for app in android_clean:
    name = app[0]
    if is_english(name) == True:
        android_english.append(app)
        
explore_data(android_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1,

We're left with 7197 IOS apps and 9614 Android apps.

### Keeping only with free apps

In [182]:
ios_final = [ios_english[0]]
android_final = [android_english[0]]

for app in ios_english:
    type = app[4]
    if type == '0.0':
        ios_final.append(app)
    
explore_data(ios_final, 0, 3, True)
print('\n')

for app in android_english:
    type = app[7]
    if type == '0':
        android_final.append(app)
    
explore_data(android_final, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 4057
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018'

We found 4057 free IOS apps and 8865 free android apps, with titles in English. 

### Transforming the numerical variables into integer and float
In order to check how spread the observations are.

In [189]:
# transform objects (strings) in float and integer  # IOS apps

# ios_cols_category = ios_free[['track_name','currency', 'user_rating', 'ver','cont_rating', 'prime_genre']]
# ios_cols_numeric = ios_free['size_bytes', 'price', 'rating_count_tot', 'rating_count_ver', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
import pandas as pd
cols = ios[0]
df_ios = pd.DataFrame(ios_final[1:], columns = cols)
df_ios

to_transform_to_int = ['size_bytes','rating_count_tot', 'rating_count_ver', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
for var in to_transform_to_int:
    df_ios[var] = df_ios[var].astype('int')

df_ios['price'] = df_ios['price'].astype('float')

# df_ios.info()
df_ios.describe()


Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,4056.0,4056.0,4056.0,4056.0,4056.0,4056.0,4056.0,4056.0
mean,147935700.0,0.0,19749.8,569.400888,37.428254,3.585552,5.732495,0.994822
std,208901400.0,0.0,97744.28,4134.301293,2.954281,2.041633,8.505148,0.071777
min,767126.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
25%,54041340.0,0.0,22.0,1.0,37.0,2.0,1.0,1.0
50%,99600380.0,0.0,466.0,22.0,37.0,5.0,1.0,1.0
75%,161198600.0,0.0,5450.75,162.25,38.0,5.0,9.0,1.0
max,3148421000.0,0.0,2974676.0,117470.0,47.0,5.0,75.0,1.0


Among the numeric variables for IOS apps, `size_bytes`, `sup_devices.num`, `ipadSc_urls.num`,	and	`vpp_lic` are the most reliable for having lower std. However, we don't have description to  `sup_devices.num`, `ipadSc_urls.num`, and	`vpp_lic`. So I will not analyze these in detail.

In [191]:
# transform objects (strings) in float and integer  # Android apps

import pandas as pd
cols = android[0]
df_android = pd.DataFrame(android_free[1:], columns = cols)
df_android

to_transform_to_int_android = ['Reviews', 'Price']
for var in to_transform_to_int_android:
    df_android[var] = df_android[var].astype('int')

# df_ios.info()
df_android.describe()

Unnamed: 0,Reviews,Price
count,8864.0,8864.0
mean,235433.7,0.0
std,1910437.0,0.0
min,0.0,0.0
25%,30.0,0.0
50%,1403.0,0.0
75%,35528.5,0.0
max,78158310.0,0.0


Reviews have an OK to bad score for standard deviation which makes it a reliable variable for deep analysis.

### Defining percentages to evaluate genres
- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order

In [None]:
def freq_table(dataset, index):
    table = {}   # create an empty dictionary
    total = 0    # counter
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1  # since it will happen many times, this came first. 
                    # Thus, we avoid to make the machine iterate by code for nothing.
        else:
            table[value] = 1   # this will hapend only in the first iteration.
            
    table_percentages = {}
    
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []    # create an empty list
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Examining the apps genre

In [None]:
# for ios app, removing the header
display_table(ios_free[1:], -5)   # prime_genre

The percentages show that the most popular IOS apps are focused on entertainment. [Publications](https://www.macrumors.com/2021/12/02/apple-most-downloaded-apps-2021/) ratify the information.

In [None]:
# for android app, removing the header
display_table(android_free[1:],1)   # category

In [None]:
# for android app, removing the header
display_table(android_free[1:],-4)   # genres

According to the [description of the variables of the dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) and observation from the results for frequencies, `Genres` are more granulated than `Categories`.

At first glance, I can say that `Games` apps is the most popular genre  in IOS with 55,4%, presenting only 10,59% in Android. `Education` has 3.25% of the apps in IOS and 1,51% in Android. As well as, `Entertainment` is more significant in IOS (8,23%) than in Android (1,47%). 

Android stands out for `Business`, rated in Android with 4,5% and in IOS with 0,49%, for `Productivity` with 3,95% in Android while IOS presents 1,52%, for `Medical` apps (3,54%)  what does not happen with IOS (0,19%), and for `Travel` 2,46% while IOS presents 1,38%. 

`Food and drink` was not a significant genre in apps from IOS (1,06%) or Android(1,25%). 

Apps of Games and entertainment are expected to have more frequency and engagement than functional apps such as food, finance and travel.

Note: Since the categories are different between IOS and Android apps, a manual normalization or a data visualization could give better insights. 

### Examining reviews in Games

In [None]:
for app in android_free:
    if app[1] == 'GAME':
        print(app[0], ':', app[3], ':', app[2])  # print name and number of ratings

As examples of Android games, I cite:
`Roblox` had 4.447.388 reviews and was rated 4.5,
`Candy Crush Saga` had 22.426.677 reviews and was rated 4.4,
`Clash of Clans` had  44.891.723 reviews and was rated 4.6, and
`Temple Run 2` had 8.118.609 reviews and was rated 4.3.

In [None]:
reviews_android_games = []
for app in android_free:
    if app[1] == 'GAME':
        reviews_android_games.append(int(app[3]))
sorted_reviews_android_games = sorted(reviews_android_games)
# print(sorted_reviews_android_games[-1:])
for app in android_free:
    if app[3] == '44893888':
        print('The most reviewed Android game was ', app[0],' with 44,893,888 reviews.')
    

In [None]:
for app in ios_free:
    if app[-5] == 'Games':
        print(app[1], ':', app[5], ':', app[7])  # print name, number of ratings and user rating

As examples of IOS games, I cite:
`Roblox` had 183,621 reviews,
`Clash of Clans` had 2.130.805 reviews,
`Temple Run` had 1.724.546 reviews, and `Candy Crush Saga` had 961.794 reviews. 
Both were rated 4.5 by users.

In [None]:
reviews_ios_games = []
for app in ios_free:
    if app[-5] == 'Games':
        reviews_ios_games.append(int(app[5]))
sorted_reviews_ios_games = sorted(reviews_ios_games)
print(sorted_reviews_ios_games[-1:])
for app in ios_free:
    if app[5] == '2130805':
        print('The most reviewed IOS game was ', app[1],' with 2,130,805 reviews.')

In [None]:
reviews_android_games = []
for app in android_free:
    if app[1] == 'GAME':
        reviews_android_games.append(int(app[3]))
sorted_reviews_android_games = sorted(reviews_android_games)
# print(sorted_reviews_android_games[-1:])
for app in android_free:
    if app[3] == '44893888':
        print('The most reviewed Android game was ', app[0],' with 44,893,888 reviews.')

`Clash of Clans` was the most reviewed game on both platforms. However, the number of reviews on Android is 22 times higher than on IOS. 

User engagement is higher on Android. But this could only be proven with the information on the number of installed apps, or the number of users, which are information that we don't have for the two datasets.

### Number of users and popularity
Data of installs of Android apps are used to give us an overview about the genre popularity. 



In [None]:
# check the variables
android[0]

In [None]:
unique_installs = []  # list unique values for number of installs 

for value in android_free[1:]:
    install = value[5] 
    if install not in unique_installs:
        unique_installs.append(install)   
unique_installs        


In [None]:
# To find the higher install number, remove characters and sort in descending order
unique_installs = []

for value in android_free[1:]:
    install = value[5] 
    install = install.replace(',', '') # remove , character
    
    substring = '+'
    if install.find(substring): 
        install = install.replace('+', '') # remove + character
        install = int(install)    # transform in integer
        
    if install not in unique_installs:
        unique_installs.append(install)
        
sorted(unique_installs, reverse = True)  

In [None]:
print(explore_data(android_free, 0, 2, True))

#### The categories that have the most installed apps
List unique categories that have apps with 1,000,000,000+ installs

In [None]:
categories_installs = []
for row in android_free:
    category = row[1]
    install = row[5] 
    if install == '1,000,000,000+' and category not in categories_installs:
        categories_installs.append(category)
print(categories_installs)
# android_free

In [None]:
# remove characters and sort in descending order in the dataset
for value in android_free[1:]:
    install = value[5] 
    install = install.replace(',', '')
    
    substring = '+'
    if install.find(substring): 
        install = install.replace('+', '')
    install = int(install)
    value[5] = install

print(explore_data(android_free, 0, 2, True))       

In [None]:
# find the average of pps installed that are in COMMUNICATION category and  has less than 100000000 installs
under_100_m = []

for app in android_free:
    installs = app[5]
    if (app[1] == 'COMMUNICATION') and installs < 100000000:
        under_100_m.append(installs)
        
print(sum(under_100_m) / len(under_100_m))


In [None]:
# find the average of pps installed that are in BOOKS_AND_REFERENCE category and  has less than 100000000 installs
under_100_m = []

for app in android_free:
    installs = app[5]
    if (app[1] == 'BOOKS_AND_REFERENCE') and installs < 100000000:
        under_100_m.append(installs)
        
print(sum(under_100_m) / len(under_100_m))

In [None]:
# find the average of pps installed that are in ENTERTAINMENT category and  has less than 100000000 installs
under_100_m = []

for app in android_free:
    installs = app[5]
    if (app[1] == 'ENTERTAINMENT') and installs < 100000000:
        under_100_m.append(installs)
        
sum(under_100_m) / len(under_100_m)

In [None]:
# find the average of pps installed that are in GAME category and  has less than 100000000 installs
under_100_m = []

for app in android_free:
    installs = app[5]
    if (app[1] == 'GAME') and installs < 100000000:
        under_100_m.append(installs)
        
sum(under_100_m) / len(under_100_m)

In [None]:
list_avg = [4386993.665492957,1673876.3541666667, 6389411.764705882, 7943314.246119734]
sorted(list_avg)
# book, communication, entertainment, game

In [None]:
counter = 0
for app in android_free:
    if app[1] == 'COMMUNICATION' and app[5] == 100000000:
        counter += 1
        print(app[0], ':', app[5])
        
print(counter)

In [None]:
counter = 0
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] == 100000000:
        counter += 1
        print(app[0], ':', app[5])

print(counter)

In [None]:
counter = 0
for app in android_free:
    if app[1] == 'ENTERTAINMENT' and app[5] == 100000000:
        counter += 1
        print(app[0], ':', app[5])
print(counter)

In [None]:
counter = 0
for app in android_free:
    if app[1] == 'GAME' and app[5] == 100000000:
        counter += 1
        print(app[0], ':', app[5])
        
print(counter)

### Conclusion

**Which has the most profitable apps: Android or IOS?** We know that the percentage of mobile over desktop is only increasing. Android holds about 53.2% of the smartphone market, while iOS is 43%. Based on these datasets, I can't answer this question. But I can highlight points and make some observations.

Free apps from IOS (2017) and Android(unknown data) were analyzed. A Cleaning was made to detect and remove missing and duplicate data, and detect and fix or remove inaccurate data. Duplicate entries were removed, keeping with entries with the higher number of reviews, that can represent the more recent records. Non-English apps and non-free apps were excluded from this work. Percentages for genres in both platforms were found. And the game with more reviews was shown.


- **Games are the strong point of IOS apps.**
Games and Entertainment are strong points of IOS apps. These types of apps are quite profitable, even though they are initially free and have different business models. Unlike functional apps, users spend a lot of time using gaming and entertainment apps, which makes room for selling space for marketing and also selling features. <br><br>

- **Business and Productivity are the most important genres in Android apps.**
Only requires time and focus as input. We can say that business apps are profitable, even if they don't have many users, and that they run profitably with a small mumber of users. Because these only require time and focus as input, they do not require upfront capital, nor do they require an extended business network [(source)](https://www.appypie.com/app-business-model). <br><br>

- **Clash of Clans** is the game with more reviews on both platforms, but the engagement on Android is 22 times higher for this game.

**What are the most popular genres?**

- `book and reference`, `communication`, `entertainment`, and `game` are the most installed categories with more.<br><br>

- `Communication` category was shown to be a higher proportion of apps with 100,000,000+ more disproportionate than the average of those with less installs. This makes me believe that this category is dominated by big and famous apps, while the other three categories are more competitive.

With these general insights, I can prioritize the platform I choose to launch my company's app on, according with the company's niche.

