<img src="http://www.internetsegura.pt/sites/default/files/1065.jpg" alt="Drawing" style="width:550px;"/>

# Profitability Analysis of Free Mobile Apps

 This study is meant to help companies that develop free apps for the GooglePlayStore and the AppleAppStore.
 
 Since the revenue of those companies depends heavily on the number of people using their apps, the aim is to determine which kinds of    **English - Free - Apps**    are likely to attract more users.

#### DataSets:  
 * [GooglePlayStore](https://www.kaggle.com/lava18/google-play-store-apps/home)
 * [AppStore](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

## <img src="http://pngimages.net/sites/default/files/file-png-image-11230.png" alt="Drawing" style="width:40px; float:left"/>   &ensp; Opening Files

In [1]:
from csv import reader
import matplotlib.pyplot as plt
import numpy as np

In [2]:
#____Basic CSV Open Function____
def csv_to_list(file_name, encode=None):
    # encode is passed straight to open(); None means the platform default.
    # The with-block closes the file once the rows have been read.
    with open(file_name, encoding=encode, newline="") as csv_file:
        new_list = list(reader(csv_file))
    # The first row is the header, the remaining rows are the data
    list_header = new_list[0]
    list_body = new_list[1:]
    return list_header, list_body
    

In [3]:
#____Opening AppleStore Dataset____
appstore = 'AppleStore.csv'
encode = "utf8"
apple_header, apple_dataset = csv_to_list(appstore, encode)
print("AppStore dataset size: ", len(apple_dataset))

#____Opening GooglePlayStore Dataset____
# Note: the ios_ prefix used from here on refers to the GooglePlay data, not to Apple's iOS
iosStore = 'googleplaystore.csv'
encode = "utf8"
ios_header, ios_dataset = csv_to_list(iosStore, encode)
print("GooglePlay dataset size: ", len(ios_dataset))

AppStore dataset size:  7197
GooglePlay dataset size:  10841
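
The cells in this notebook address columns by position (for example `row[12]` for the AppStore genre). As a minimal sketch, a name-to-index map could be built from the header returned by `csv_to_list`. The helper name `column_index` and the shortened sample header are assumptions made for illustration, not the real file header.

```python
def column_index(header):
    """Map each column name to its position in a row."""
    return {name: position for position, name in enumerate(header)}

# Illustrative header and row (assumed and shortened for the example)
sample_header = ['App', 'Category', 'Rating']
sample_row = ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2']

col = column_index(sample_header)
print(sample_row[col['Category']])  # TOOLS
```

With such a map, `row[col['Category']]` documents itself better than a bare number, at the cost of one extra lookup.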


---

## <img src="https://www.invensis.net/blog/wp-content/uploads/2016/04/5-Best-Practices-in-Accounts-Payable-invensis1.png" alt="Drawing" style="width:50px; float:left"/> Data Cleaning 

### One-time error


This error was pointed out in a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the dataset's Kaggle page.

In [4]:
print(ios_dataset[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [5]:
# this row is missing its 'Category' value, so every later column is shifted one position to the left
del ios_dataset[10472]
print('Test:\n', ios_dataset[10472])

Test:
 ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
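
The faulty row above has 12 fields while a regular row has 13, so rows of this kind could also be found by comparing each row's length with the header's. The following is only a sketch: `find_malformed_rows` is a hypothetical helper and the tiny dataset is made up.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Tiny made-up dataset: the second row lacks one field
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
print(find_malformed_rows(demo_header, demo_rows))  # [1]
```

In the notebook this would presumably be called as `find_malformed_rows(ios_header, ios_dataset)` before any row is deleted.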


### Duplicated data 
Duplicates were found only in the GooglePlayStore dataset.

**Locating by duplicated name**

In [6]:
name_list = []
cnt = 0
repeated_index_list = []

for name in ios_dataset:
    if name[0] in name_list:
        repeated_index_list.append(cnt)
    else:
        name_list.append(name[0])
    cnt+=1

print('There are', len(repeated_index_list), 'duplicated apps.')

There are 1181 duplicated apps.


**Deleting**

In [7]:
#___Deleting from the highest index down, so the remaining indexes stay valid
repeated_index_list.sort(key=int, reverse=True)

for n in repeated_index_list:
    del ios_dataset[n]
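
The two steps above (locating repeated names, then deleting them by index) keep the first row seen for each app name. A sketch of the same rule in a single pass, using a set for the membership test and returning a new list instead of deleting in place, might look like this. `drop_duplicates_by_name` and the sample rows are hypothetical.

```python
def drop_duplicates_by_name(rows, name_index=0):
    """Keep the first row seen for each app name and return a new list."""
    seen = set()
    unique_rows = []
    for row in rows:
        name = row[name_index]
        if name not in seen:
            seen.add(name)
            unique_rows.append(row)
    return unique_rows

# Made-up rows: 'Slack' and 'Box' each appear twice
demo = [['Slack', '4.4'], ['Box', '4.2'], ['Slack', '4.4'], ['Box', '4.1']]
print(drop_duplicates_by_name(demo))  # [['Slack', '4.4'], ['Box', '4.2']]
```

Looking a name up in a set takes roughly constant time, whereas `name in name_list` scans the whole list, which matters little at this size but adds up on larger files.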

### Non English Apps

Since the analysis covers only English apps, non-English apps must be removed.
   
Criteria: apps whose names contain non-English characters (Unicode code point > 127) are removed.
The exceptions are '®', '™', '—' and '–', which fall outside that range but are frequently used in English app names.
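
The criteria can be sketched as one small predicate so that the locating and confirming steps share the same test. The names `ALLOWED_NON_ASCII` and `is_english` are assumptions introduced for this example; the four code points are the ones used in the cells below.

```python
# Code points above 127 that are still accepted:
# registered sign (174), trade mark sign (8482), en dash (8211), em dash (8212)
ALLOWED_NON_ASCII = {174, 8482, 8211, 8212}

def is_english(name):
    """Return False as soon as a character outside the accepted set is found."""
    for character in name:
        code = ord(character)
        if code > 127 and code not in ALLOWED_NON_ASCII:
            return False
    return True

print(is_english('Geocaching®'))    # True
print(is_english('大辞林'))           # False
print(is_english('Chase Mobile℠'))  # False
```

The last call shows a limitation of the rule: an English name is rejected because '℠' is not among the exceptions.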

* **GooglePlay DataSet**

In [8]:
#___Locating___
cnt = 0
non_english_index = []

for name in ios_dataset:
    for character in name[0]:
        if ord(character) > 127 and ord(character) != 8211 and ord(character) != 8482 and ord(character) != 174 and ord(character) != 8212:
            non_english_index.append(cnt)
            break
    cnt+=1

cnt = 0
print('Examples of apps tracedown:')
for n in non_english_index:
    print(ios_dataset[n][0])
    cnt+=1
    if cnt>5:
        break

#___Deleting___
non_english_index.sort(key=int,reverse=True)
for index in non_english_index:
    del ios_dataset[index]
    
#___Confirming___
cnt = 0
test_non_english_index = []

for name in ios_dataset:
    for character in name[0]:
        if ord(character) > 127 and ord(character) != 8211 and ord(character) != 8482 and ord(character) != 174 and ord(character) != 8212:
            test_non_english_index.append(cnt)
            break
    cnt+=1

print('\nTest: Applications not deleted:', test_non_english_index)


Examples of apps tracedown:
Zona Azul Digital Fácil SP CET - OFFICIAL São Paulo
Wattpad 📖 Free Books
Röhrich Werner Soundboard
Truyện Vui Tý Quậy
Comic Es - Shojo manga / love comics free of charge ♪ ♪
漫咖 Comics - Manga,Novel and Stories

Test: Applications not deleted: []


* **AppStore DataSet**

In [9]:
#___Locating___
cnt = 0
non_english_index = []

for name in apple_dataset:
    for character in name[2]:
        if ord(character) > 127 and ord(character) != 8211 and ord(character) != 8482 and ord(character) != 174 and ord(character) != 8212:
            non_english_index.append(cnt)
            break
    cnt+=1

print('Examples of apps tracedown:')
cnt = 0
for n in non_english_index:
    print(apple_dataset[n][2])
    cnt+=1
    if cnt>5:
        break

#___Deleting___
non_english_index.sort(key=int,reverse=True)
for index in non_english_index:
    del apple_dataset[index]
    
#___Confirming___
cnt = 0
test_non_english_index = []

for name in apple_dataset:
    for character in name[2]:
        if ord(character) > 127 and ord(character) != 8211 and ord(character) != 8482 and ord(character) != 174 and ord(character) != 8212:
            test_non_english_index.append(cnt)
            break
    cnt+=1

print('\nTest: Applications not deleted:', test_non_english_index)




Examples of apps tracedown:
Chase Mobile℠
大辞林
新浪新闻-阅读最新时事热门头条资讯视频
同花顺-炒股、股票
20 Minutes.fr - l'actualité en continu
Guess My Age  Math Magic

Test: Applications not deleted: []


### Isolating free apps

In [10]:
apple_dataset_free = []
for row in apple_dataset:
    if row[5] == '0':
        apple_dataset_free.append(row)

ios_dataset_free = []
for row in ios_dataset:
    if row[6] == 'Free':
        ios_dataset_free.append(row)
        

**DataSets sizes after cleaning**

In [11]:
print("AppleAppStore dataset size: ", len(apple_dataset_free))
print("GooglePlay dataset size: ", len(ios_dataset_free))

AppleAppStore dataset size:  3121
GooglePlay dataset size:  8647


---

## <img src="https://cdn2.iconfinder.com/data/icons/business-colored/48/23-512.png" alt="Drawing" style="width:30px; float:left"/>   &ensp;    Data Analysis

### Frequency by genre in percentage

**AppStore**

In [12]:
apple_genre_dictionarie = {}

for genre in apple_dataset_free:
    if genre[12] in apple_genre_dictionarie:
        apple_genre_dictionarie[genre[12]] +=1
    else:
        apple_genre_dictionarie[genre[12]] = 1

apple_dataset_free_size = len(apple_dataset_free)
for key in apple_genre_dictionarie:
    apple_genre_dictionarie[key] = round((apple_genre_dictionarie[key]/apple_dataset_free_size)*100, 2)

print("Percentage of apps in the store per genre in AppStore:\n")
sorted_by_value = sorted(apple_genre_dictionarie.items(), key=lambda kv: kv[1], reverse = True)

for item in sorted_by_value:
    print(item)


Percentage of apps in the store per genre in AppStore:

('Games', 58.6)
('Entertainment', 7.91)
('Photo & Video', 5.06)
('Education', 3.72)
('Social Networking', 3.33)
('Shopping', 2.5)
('Utilities', 2.31)
('Sports', 2.11)
('Music', 2.05)
('Health & Fitness', 2.02)
('Productivity', 1.7)
('Lifestyle', 1.54)
('News', 1.28)
('Travel', 1.19)
('Finance', 1.09)
('Weather', 0.87)
('Food & Drink', 0.83)
('Reference', 0.51)
('Business', 0.48)
('Book', 0.38)
('Navigation', 0.19)
('Medical', 0.19)
('Catalogs', 0.13)


**GooglePlayStore**

In [13]:
ios_genre_dictionarie = {}

for genre in ios_dataset_free:
    if genre[1] in ios_genre_dictionarie:
        ios_genre_dictionarie[genre[1]] += 1
    else:
        ios_genre_dictionarie[genre[1]] = 1

ios_dataset_free_size = len(ios_dataset_free)
for key in ios_genre_dictionarie:
    ios_genre_dictionarie[key] = round((ios_genre_dictionarie[key]/ios_dataset_free_size)*100, 2) 

print("Percentage of apps in the store per genre in PlayStore:\n")
sorted_by_value = sorted(ios_genre_dictionarie.items(), key=lambda kv: kv[1], reverse = True)

for item in sorted_by_value:
    print(item)

Percentage of apps in the store per genre in PlayStore:

('FAMILY', 18.42)
('GAME', 9.88)
('TOOLS', 8.5)
('BUSINESS', 4.67)
('PRODUCTIVITY', 3.94)
('LIFESTYLE', 3.9)
('FINANCE', 3.7)
('MEDICAL', 3.6)
('PERSONALIZATION', 3.33)
('COMMUNICATION', 3.25)
('SPORTS', 3.21)
('HEALTH_AND_FITNESS', 3.1)
('PHOTOGRAPHY', 3.0)
('NEWS_AND_MAGAZINES', 2.75)
('SOCIAL', 2.64)
('TRAVEL_AND_LOCAL', 2.34)
('SHOPPING', 2.26)
('BOOKS_AND_REFERENCE', 2.15)
('DATING', 1.89)
('VIDEO_PLAYERS', 1.8)
('MAPS_AND_NAVIGATION', 1.35)
('EDUCATION', 1.3)
('FOOD_AND_DRINK', 1.21)
('ENTERTAINMENT', 1.13)
('AUTO_AND_VEHICLES', 0.94)
('LIBRARIES_AND_DEMO', 0.88)
('HOUSE_AND_HOME', 0.81)
('WEATHER', 0.79)
('EVENTS', 0.73)
('ART_AND_DESIGN', 0.69)
('PARENTING', 0.66)
('BEAUTY', 0.61)
('COMICS', 0.58)


It's clear that on the Apple AppStore the apps designed for fun are the majority, with 'Games' alone representing more than 58% of the apps. Education, in fourth position, is the highest-ranked category that isn't aimed at recreation, and it represents only 3.72% of the apps in the store.

In GooglePlayStore the situation is different: the store shows a more balanced distribution of apps per genre, and a good number of apps seem to be designed for practical purposes (business, productivity, lifestyle, etc.).
It is also noticeable that content for kids (the 'FAMILY' category) has a very relevant share, but it should be kept in mind that a good part of this content consists of games.
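
The two frequency tables above follow the same steps: count rows per genre, convert the counts to percentages, then sort in descending order. A compact sketch with `collections.Counter` is shown here; `genre_percentages` and the sample rows are hypothetical.

```python
from collections import Counter

def genre_percentages(rows, genre_index):
    """Return (genre, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[genre_index] for row in rows)
    total = len(rows)
    table = {genre: round(count / total * 100, 2) for genre, count in counts.items()}
    return sorted(table.items(), key=lambda kv: kv[1], reverse=True)

# Made-up rows: three games and one weather app
demo_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Weather']]
print(genre_percentages(demo_rows, 1))  # [('Games', 75.0), ('Weather', 25.0)]
```

The same function would serve both stores, with `genre_index` set to 12 for the AppStore rows and 1 for the GooglePlay rows.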

 ### Most Popular Apps by Genre

Criteria: popularity is measured by how many times an app has been downloaded/installed.

**AppStore**

The AppStore dataset doesn't include the number of downloads per app, so the number of ratings each app has received is used as a proxy.

In [14]:
for row in apple_dataset_free:
    if row[12] == 'Navigation':
        print(row[2], ':', row[6]) 

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


In [15]:
cnt = 0
for row in apple_dataset_free:
    cnt +=1
    if row[12] == 'Social Networking':
        print(cnt, " ", row[2], ':', row[6])
    if cnt>300 :
        break

10   Facebook : 2974676
17   LinkedIn : 71856
40   Skype for iPhone : 373519
41   Tumblr : 334293
42   Match™ - #1 Dating App. : 60659
56   WhatsApp Messenger : 287589
64   TextNow - Unlimited Text + Calls : 164963
71   Grindr - Gay and same sex guys chat, meet and date : 23201
106   imo video calls and chat : 18841
129   Ameba : 269
130   Weibo : 7265
131   Badoo - Meet New People, Chat, Socialize. : 34428
137   Kik : 260965
143   Qzone : 1649
160   Fake-A-Location Free ™ : 354
163   Tango - Free Video Call, Voice and Chat : 75412
164   MeetMe - Chat and Meet New People : 97072
167   SimSimi : 23530
184   Viber Messenger – Text & Call : 164249
186   Find My Family, Friends & iPhone - Life360 Locator : 43877
188   Weibo HD : 16772
198   POF - Best Dating App for Conversations : 52642
206   GroupMe : 28260
229   Lobi : 36
246   WeChat : 34584
271   ooVoo – Free Video Call, Text and Voice : 177501
273   Pinterest : 1061624
278   Qzone HD : 458
298   Skype for iPad : 60163


As we can see above, in genres like Navigation and Social Networking, as in others, a few giant apps such as GoogleMaps, Facebook and Pinterest can heavily skew the popularity analysis.
Most companies that develop free English apps have no interest in competing with these giants; the idea is to find which genre might be a better field to invest in, independent of these massive apps. Therefore, the four apps with the most ratings in each genre are excluded from the analysis. Some genres, however, have too few apps, and deleting four of them would substantially affect the average and produce inaccurate results. For that reason, genres that represent less than 1% of the store only have their single most-rated app removed.

In [16]:
#LOCATING MOST RATED APPS

#___Creating a list of Genre___
apple_genre_list = []
for genre in apple_dataset_free:
    if genre[12] not in apple_genre_list:
        apple_genre_list.append(genre[12])
        
        
#___Initializing Variables___
apple_onepercent_list = ['Weather', 'Food & Drink', 'Reference','Business','Book', 'Navigation','Medical','Catalogs']
cnt_index = 0
cnt_first_apps = 0
list_position = 0
list_of_index = []
list_of_ratings = []
list_position = 0



#Looping through genres
for genre in apple_genre_list:
    cnt_first_apps = 0
    cnt_index = 0
    
    #Looping through dataset for each genre
    for rating in apple_dataset_free:
               
        if rating[12] == genre:
            cnt_first_apps += 1
            
            #Treating items with one percent
            if rating[12] in apple_onepercent_list:
                if cnt_first_apps == 1:
                    list_of_index.append(cnt_index)
                    list_of_ratings.append(int(rating[6]))
                else:
                    if int(rating[6]) > list_of_ratings[-1]:
                        list_of_ratings[-1] = int(rating[6])
                        list_of_index[-1] = cnt_index
            #Treating items with more than one percent   
            else:
                if cnt_first_apps < 5:
                    list_of_index.append(cnt_index)
                    list_of_ratings.append(int(rating[6]))
                    
                else:
                    for i in range(4):
                        if list_of_ratings[list_position + i] < int(rating[6]):
                            list_of_ratings[list_position + i] = int(rating[6])
                            list_of_index[list_position + i] = cnt_index
                            break
        cnt_index += 1
    if genre in apple_onepercent_list:
        list_position += 1
        
    else:
        list_position += 4
            

In [17]:
#DELETING THEM

apple_dataset_free_nobigapps = apple_dataset_free.copy()

list_of_index.sort(key=int, reverse=True)

for n in list_of_index:
    del apple_dataset_free_nobigapps[n]
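
The selection rule used in the two cells above (the four most-rated apps per genre, or only the top one for genres under 1% of the store) could also be expressed with `heapq.nlargest`. Note that the slot loop in cell 16 overwrites the first stored value that is smaller than the new one, not the smallest stored value, so the two approaches may not always pick the same rows. This is a sketch under those assumptions, with a hypothetical function name and made-up data, not the notebook's method.

```python
import heapq
from collections import defaultdict

def top_app_indices(rows, genre_index, count_index, small_genres, k=4):
    """Return the sorted row indices of the k largest counts per genre
    (only the single largest for genres listed in small_genres)."""
    by_genre = defaultdict(list)
    for index, row in enumerate(rows):
        by_genre[row[genre_index]].append((int(row[count_index]), index))

    selected = []
    for genre, pairs in by_genre.items():
        limit = 1 if genre in small_genres else k
        selected.extend(index for _, index in heapq.nlargest(limit, pairs))
    return sorted(selected)

# Made-up rows: name, genre, number of ratings
demo = [
    ['A', 'Games', '100'], ['B', 'Games', '5'], ['C', 'Games', '50'],
    ['D', 'Games', '60'], ['E', 'Games', '200'],
    ['F', 'Weather', '10'], ['G', 'Weather', '30'],
]
to_drop = set(top_app_indices(demo, 1, 2, small_genres={'Weather'}))
remaining = [row for i, row in enumerate(demo) if i not in to_drop]
print(sorted(to_drop))             # [0, 2, 3, 4, 6]
print([row[0] for row in remaining])  # ['B', 'F']
```

Building the filtered list with a comprehension avoids having to delete indices in reverse order.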


In [18]:
#MAKING THE ANALYSIS:
apple_genre_rating_dictionarie = {}
sum_per_genre = 0
cnt = 0

for genre in apple_genre_list:
    sum_per_genre = 0 
    cnt = 0
    
    for rating in apple_dataset_free_nobigapps:
        
        if genre == rating[12]:
            sum_per_genre += int(rating[6])
            cnt +=1 
    
    apple_genre_rating_dictionarie[genre] = round(sum_per_genre /cnt, 2)
        

print("Average number of ratings per app in different genres in AppStore:\n")
sorted_by_value = sorted(apple_genre_rating_dictionarie.items(), key=lambda kv: kv[1], reverse = True)

for item in sorted_by_value:
    print(item)
    

Average number of ratings per app in different genres in AppStore:

('Weather', 37237.96)
('Navigation', 34299.2)
('Social Networking', 28220.36)
('Book', 27685.73)
('Reference', 24147.47)
('Food & Drink', 22513.04)
('Music', 22479.38)
('Games', 20422.55)
('Shopping', 19315.86)
('Productivity', 14083.61)
('Sports', 12596.6)
('Finance', 10856.57)
('Photo & Video', 10802.45)
('Entertainment', 10745.11)
('News', 9025.61)
('Travel', 8009.52)
('Utilities', 7918.38)
('Health & Fitness', 7420.25)
('Lifestyle', 5084.52)
('Business', 4565.21)
('Education', 3261.76)
('Catalogs', 890.33)
('Medical', 466.2)
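
The averaging cell above scans the dataset once per genre. As a sketch, the same per-genre mean can be accumulated in a single pass with two dictionaries. `average_by_genre` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by_genre(rows, genre_index, value_index):
    """Return {genre: mean of the integer values in value_index}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        totals[row[genre_index]] += int(row[value_index])
        counts[row[genre_index]] += 1
    # Every genre in totals has at least one row, so the division is safe
    return {genre: round(totals[genre] / counts[genre], 2) for genre in totals}

# Made-up rows: name, genre, number of ratings
demo = [['a', 'Weather', '10'], ['b', 'Weather', '20'], ['c', 'Games', '7']]
print(average_by_genre(demo, 1, 2))  # {'Weather': 15.0, 'Games': 7.0}
```

Because only genres that actually occur end up in the result, a genre left without rows after the deletion step cannot trigger a division by zero.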


**GooglePlay**

Unlike the AppleAppStore dataset, the GooglePlay dataset has a column called 'Installs' that shows how many times each app has been installed, so there is no need to work with the number of ratings. However, the same observation applies to the PlayStore: giant apps like Facebook heavily skew the analysis. Therefore, if a genre holds more than 1% of the store, its four most-installed apps are deleted; otherwise only the most-installed app is.

In [19]:
#The Installs column must be cleaned: drop the trailing '+' and the thousands separators
a ='10,000+'
print(a[:-1].replace(",",""))

10000
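
The conversion above is repeated inline many times in the following cells, so it could be wrapped in a small helper. This is only a sketch: `parse_installs` is a hypothetical name. It uses `rstrip('+')` instead of the `[:-1]` slice, so a value without a trailing '+' keeps its last digit.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to an integer."""
    return int(text.replace(',', '').rstrip('+'))

print(parse_installs('10,000+'))      # 10000
print(parse_installs('10,000,000+'))  # 10000000
print(parse_installs('500'))          # 500
```

Since the values are buckets ending in '+', the integers are best read as lower bounds on the real install counts.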


In [20]:
#LOCATING MOST INSTALLED APPS

#___Creating a list of Genre___
ios_genre_list = []
for genre in ios_dataset_free:
    if genre[1] not in ios_genre_list:
        ios_genre_list.append(genre[1])
        
        
#___Initializing Variables___
ios_onepercent_list = ['AUTO_AND_VEHICLES', 'LIBRARIES_AND_DEMO','HOUSE_AND_HOME','WEATHER','EVENTS','ART_AND_DESIGN','PARENTING','BEAUTY','COMICS']
cnt_index = 0
cnt_first_apps = 0
list_position = 0
list_of_index = []
list_of_ratings = []
list_position = 0



#Looping through genres
for genre in ios_genre_list:
    cnt_first_apps = 0
    cnt_index = 0
    
    #Looping through dataset for each genre
    for rating in ios_dataset_free:
               
        if rating[1] == genre:
            cnt_first_apps += 1
            
            #Treating items with one percent
            if rating[1] in ios_onepercent_list:
                if cnt_first_apps == 1:
                    list_of_index.append(cnt_index)
                    list_of_ratings.append(int(rating[5][:-1].replace(",","")))
                else:
                    if int(rating[5][:-1].replace(",","")) > list_of_ratings[-1]:
                        list_of_ratings[-1] = int(rating[5][:-1].replace(",",""))
                        list_of_index[-1] = cnt_index
            #Treating items with more than one percent   
            else:
                if cnt_first_apps < 5:
                    list_of_index.append(cnt_index)
                    list_of_ratings.append(int(rating[5][:-1].replace(",","")))
                    
                else:
                    for i in range(4):
                        if list_of_ratings[list_position + i] < int(rating[5][:-1].replace(",","")):
                            list_of_ratings[list_position + i] = int(rating[5][:-1].replace(",",""))
                            list_of_index[list_position + i] = cnt_index
                            break
        cnt_index += 1
    if genre in ios_onepercent_list:
        list_position += 1
        
    else:
        list_position += 4
            
            

In [21]:
#DELETING THEM

ios_dataset_free_nobigapps = ios_dataset_free.copy()

list_of_index.sort(key=int, reverse=True)

for n in list_of_index:
    del ios_dataset_free_nobigapps[n]


In [22]:
#MAKING THE ANALYSIS:
ios_genre_rating_dictionarie = {}
sum_per_genre = 0
cnt = 0

for genre in ios_genre_list:
    sum_per_genre = 0 
    cnt = 0
    
    for rating in ios_dataset_free_nobigapps:
        
        if genre == rating[1]:
            sum_per_genre += int(rating[5][:-1].replace(",",""))
            cnt +=1 
    
    ios_genre_rating_dictionarie[genre] = round(sum_per_genre /cnt, 2)
        

print("Download average per app in different genres in PlayStore:\n")
sorted_by_value = sorted(ios_genre_rating_dictionarie.items(), key=lambda kv: kv[1], reverse = True)

for item in sorted_by_value:
    print(item)
    

Download average per app in different genres in PlayStore:

('COMMUNICATION', 25219875.42)
('PHOTOGRAPHY', 13086936.53)
('GAME', 12932885.11)
('PRODUCTIVITY', 9760917.85)
('SOCIAL', 8847342.64)
('VIDEO_PLAYERS', 8656129.74)
('ENTERTAINMENT', 8632553.19)
('TOOLS', 7508217.93)
('SHOPPING', 5177113.01)
('WEATHER', 4622201.79)
('PERSONALIZATION', 3905716.15)
('TRAVEL_AND_LOCAL', 3710015.59)
('SPORTS', 2808260.65)
('FAMILY', 2490505.61)
('HEALTH_AND_FITNESS', 2319840.5)
('EDUCATION', 2031018.52)
('MAPS_AND_NAVIGATION', 1741639.65)
('FOOD_AND_DRINK', 1591273.77)
('NEWS_AND_MAGAZINES', 1506380.81)
('BOOKS_AND_REFERENCE', 1460869.56)
('HOUSE_AND_HOME', 1240758.86)
('ART_AND_DESIGN', 1090188.14)
('LIFESTYLE', 1055472.43)
('FINANCE', 993489.34)
('BUSINESS', 967241.47)
('COMICS', 679819.39)
('DATING', 631507.91)
('LIBRARIES_AND_DEMO', 550582.8)
('AUTO_AND_VEHICLES', 537250.76)
('PARENTING', 374482.32)
('BEAUTY', 330712.5)
('EVENTS', 176986.45)
('MEDICAL', 83489.72)


### <img src="https://image.flaticon.com/icons/svg/2/2291.svg" alt="Drawing" style="width:30px; float:left"/>             &ensp; Crossing Data

This study is looking for an app genre that has low market competition and a high number of downloads. It must therefore point to a genre that meets both conditions in both stores.

**Adopted Criteria**:
* Every genre with fewer than 15,000 ratings on average per app in the Apple App Store dataset will be considered not popular.
* Every genre with fewer than 2,000,000 downloads on average per app in the Play Store dataset will be considered not popular.
* Every genre that holds more than 3% of the apps in a store will be considered highly competitive in that store.

**Isolating genres with stipulated criteria:**

In [23]:
#PlayStore
ios_genre_popular_and_low_competition = []

for genre in ios_genre_list:
    if ios_genre_rating_dictionarie[genre] > 2000000 and ios_genre_dictionarie[genre] < 3:
        ios_genre_popular_and_low_competition.append(genre)

#AppStore
apple_genre_popular_and_low_competition = []

for genre in apple_genre_list:
    if apple_genre_rating_dictionarie[genre] > 15000 and apple_genre_dictionarie[genre] < 3:
        apple_genre_popular_and_low_competition.append(genre)
        
print(ios_genre_popular_and_low_competition)
print(apple_genre_popular_and_low_competition)

['EDUCATION', 'ENTERTAINMENT', 'SOCIAL', 'SHOPPING', 'TRAVEL_AND_LOCAL', 'WEATHER', 'VIDEO_PLAYERS']
['Weather', 'Shopping', 'Reference', 'Music', 'Food & Drink', 'Book', 'Navigation']


The genres that meet all the criteria in both stores are: **Weather** and **Shopping**.
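
The two printed lists use different naming styles ('WEATHER' versus 'Weather'), so the overlap has to be read by eye. As a sketch, the names can be normalized and intersected with sets. The `normalize` helper is an assumption for this example, and it only matches genres whose names coincide after normalization; it does not map differently named but related genres onto each other.

```python
def normalize(genre):
    """Lowercase and unify separators so 'FOOD_AND_DRINK' matches 'Food & Drink'."""
    return genre.lower().replace('_and_', ' & ').replace('_', ' ')

# The two lists printed by the cell above
play = ['EDUCATION', 'ENTERTAINMENT', 'SOCIAL', 'SHOPPING',
        'TRAVEL_AND_LOCAL', 'WEATHER', 'VIDEO_PLAYERS']
apple = ['Weather', 'Shopping', 'Reference', 'Music',
         'Food & Drink', 'Book', 'Navigation']

common = sorted({normalize(g) for g in play} & {normalize(g) for g in apple})
print(common)  # ['shopping', 'weather']
```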

-----

## <img src="https://png.pngtree.com/svg/20160526/conclusion_717735.png" alt="Drawing" style="width:30px; float:left"/>  &ensp;     Conclusion

A company that wants to develop a free English app for both the AppleAppStore and the GooglePlayStore, and that chooses not to compete with large enterprise applications, will probably have more success developing either a Weather or a Shopping application, since these are the two genres that combine low competition with high download interest in both stores.