## An√°lise - Dados de apps para celulares

** Este √© um projeto guiado do Data Quest. **

Ol√°, esse √© o meu primeiro projeto de an√°lise de dados utilizando Python e Colab Notebooks. 

O objetivo √© fazer uma an√°lise de dois data sets. O primeiro cont√©m aproximadamente 10 mil apps para Android do Google Play, os dados foram coletados em Agosto de 2018. O segundo data set contem aproximadamente 10 mil apps de iOS da App Store, os datos foram coletados em Julho 2017.

A proposta √© determinar que tipo de aplicativos podem atrais mais usu√°rios j√° que a nossa receita seria altamente influenciada pelo n√∫mero de pessoas utilizando os apps. 

Para minimizar riscos, a estrat√©gia √© composta por tr√™s passos: 
1. Construir uma vers√£o Android e adicionar no Google Play. 
2. Se o app tiver uma boa recep√ß√£o, melhorar a primeira vers√£o. 
3. Se o app for lucrativo em seis meeses, construir uma vers√£o iOS. 

### Parte um: abrindo e explorando dados

In [1]:
from csv import reader

### iOS / Apple ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple_data = list(read_file)
apple_data_header = apple_data[0]
apple_data = apple_data[1:]

### Android / Google ###

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
google_data = list(read_file)
google_data_header = google_data[0]
google_data = google_data[1:]



In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_data, 0, 3, True)
print('\n')
explore_data(google_data, 0, 3, True)
        

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite ‚Äì FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 

<blockquote>Para facilitar, a fun√ß√£o explore_data() √© utilizada para vizualizar as linhas do data set de forma mais clara.

In [4]:
print(apple_data_header)
print('\n')
print(google_data_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


<blockquote> O cabe√ßalho est√° separado aqui para identificar quais colunas poderiam ajudar na an√°lise.

### Parte dois: Cleaning

In [5]:
for row in google_data: #identying the row with wrong data
    header_length = len(google_data_header)
    row_length = len(row)
    if row_length != header_length:
        print(row)
        print(google_data.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [6]:
del google_data[10472] #using the 'del' argument to delete wrong data
#run this just once

<blockquote> No f√≥rum de discuss√£o sobre os dados do Android, √© apontado que a linha 10472 cont√©m dados errados. Essa linha √© identificada e deletada.

In [7]:
### There are duplicate apps in the Android/Google data ###

duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:16])



Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation']


<blockquote> H√° aplicativos duplicados na base de dados da Google/Android. Primeiro, identificamos qual o n√∫mero de casos em que isso acontece (1181) e imprimimos exemplos de aplicativos.

In [8]:
### Example of an app that is duplicated ###
for app in google_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

### The apps will be removed based on the highest number of reviews. ###

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [9]:
reviews_max = {}
for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


In [10]:
android_clean = [] #store our new cleaned data set
already_added = [] #store app names

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))
    
    

9659


<blockquote> Para a an√°lise, √© importante n√£o contar os apps duplicados. Esses casos s√£o removidos da base de dados utilizando o n√∫mero de avalia√ß√µes como par√¢metro. Quanto mais avalia√ß√µes, mais recente √© a entrada do aplicativo. Logo, o aplicativo com maior n√∫mero de avalia√ß√µes √© mantido.

In [25]:
### Checking if the apps are in English or not ###
### Here using ASCII ### 

def is_english(string):
    for char in string: 
        if ord(char) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢ÇÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))


True
False
False
False


<blockquote> Outro a√ß√£o importante √© checar se os aplicativos s√£o em ingl√™s ou n√£o.

In [11]:
### to minimize data loss, changing the function ### 

def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢ÇÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
True
True


<blockquote> Para o exerc√≠cio - considerando que se trata de uma empresa avaliando quais tipos de aplicativos ter√£o melhor recep√ß√£o - apenas os aplicativos em ingl√™s ser√£o mantidos.
A classifica√ß√£o 'ASCII' √© utilizada para determinar quais aplicativos s√£o em ingl√™s, mas para diminuir uma perda de dados, a fun√ß√£o √© modificada. 

In [12]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in apple_data:
    name = app[0]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite ‚Äì FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

In [13]:
### Isolating free apps ###

free_google = []
free_apple = []

for app in android_english:
    price = app[7]
    if price == '0':
        free_google.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        free_apple.append(app)
        
print(len(free_google))
print(len(free_apple))


        

8864
4056


<blockquote> Para a an√°lise, apenas os aplicativos n√£o pagos ser√£o contados. Por isso, isolamos esses dados para utilizar na pr√≥xima etapa. 

### Parte tr√™s: An√°lise

Como a proposta √© construir um aplicativo para iOS e Android, √© preciso identificar exemplos de sucesso nos dois mercados.

Vamos come√ßar buscando quais s√£o os g√™neros mais comuns no Google Play e App Store. Para isso, uma tabela de frequ√™ncia ser√° feita utilizando como base a coluna 'prime_genre' da App Store e 'Genres' e 'Category' do Google Play. 



In [14]:
### Most common genres ###

### we want to build an app that works both on Google Play and App Store ###

genres_google = {}
genres_apple = {}

for app in free_apple:
    genre = app[11]
    if genre in genres_apple:
        genres_apple[genre] += 1
    else:
        genres_apple[genre] = 1

print(genres_apple)

for app in free_google:
    genre= app[9]
    if genre in genres_google:
        genres_google[genre] += 1
    else:
        genres_google[genre] = 1
        
#print(genres_google)




{'Social Networking': 143, 'Photo & Video': 167, 'Games': 2257, 'Music': 67, 'Reference': 20, 'Health & Fitness': 76, 'Weather': 31, 'Utilities': 109, 'Travel': 56, 'Shopping': 121, 'News': 58, 'Navigation': 20, 'Lifestyle': 94, 'Entertainment': 334, 'Food & Drink': 43, 'Sports': 79, 'Book': 66, 'Finance': 84, 'Education': 132, 'Productivity': 62, 'Business': 20, 'Catalogs': 9, 'Medical': 8}


In [15]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total)*100
        table_percentages[key] = percentage
    
    return table_percentages 

#display_table function: tranaforms the frequency table into a list of 
#tuples, then sorts the list in a descending order
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

apple_genre = display_table(free_apple, 11)
print('\n')
google_genre = display_table(free_google, 9)
print('\n')
google_category = display_table(free_google, 1)


Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.316787


Entre os aplicativos da App Store (aqueles em ingl√™s e sem custo), mais da metade s√£o jogos (55%) e s√£o seguidos por outros aplicativos dedicados a entretenimento (8%), foto e v√≠deo (4%) e redes sociais (3%). Aplicativos pr√°ticos e educativos t√™m menos sucesso na App Store. 

Quando olhamos para as Categorias do Google Play, n√£o existem muitos aplicativos feitos para divers√£o, mas um grande n√∫mero voltado √† objetivos pr√°ticos (fam√≠lia, ferramentas, neg√≥gios, estilo de vida, produtividade). √â preciso observar, no entanto, que dentro da categoria 'fam√≠lia', h√° um grande n√∫mero de jogos para crian√ßas. 

Mesmo assim, ao comparar Categorias com G√™neros, se nota que ferramentas, educa√ß√£o e outros aplicativos mais 'pr√°ticos', ainda tem mais sucesso no Google Play do que na App Store.


Outra forma de descobrir quais g√™neros s√£o os mais populares √© calcular a m√©dia de instala√ß√£o para cada g√™nero. Para o Google Play, essa informa√ß√£o est√° na coluna 'Installs'. No entanto, esse dado n√£o existe para iOS. Como alternativa, 'rating_count_total' ser√° utilizado. 

In [16]:
genres_apple = freq_table(free_apple, 11)

for genre in genres_apple:
    total = 0 #sum of user ratings
    len_genre = 0 #the number of apps specific to each genre
    for app in free_apple:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre +=1
    avg_n_ratings = total/len_genre
    print(genre, ":", avg_n_ratings)



Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75


In [20]:
categories_google = freq_table(free_google, 1)

for category in categories_google:
    total = 0
    len_category = 0
    for app in free_google:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
    

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

### Conclus√£o

√â poss√≠vel notar que na App Store os g√™neros com mais avalia√ß√µes est√£o diretamente relacionados √† apps extremamente populares (Navega√ß√£o: Waze, Google Maps; M√≠dias Sociais: Facebook, Instagram; M√∫sica: Spotify). A categoria 'Refer√™ncia' tem potencial para nossa meta, assim como aplicativos de clima, comida e drinks, e financeiros.

O mesmo padr√£o se repete com Google/Android, categorias com apps muito populares, por exemplo, Comunica√ß√£o (WhatsApp, Telegram), tem muitas instala√ß√µes. Por isso, mudando um pouco de foco, o que se mostra com potencial para a proposta √© Livros e Refer√™ncias, Ferramentas e Produtividade. 