# App store 

- Finding out which kinds of in-demand applications to develop for Android and, later on, for iOS
- Dataset sources for [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and [Google Play](https://www.kaggle.com/lava18/google-play-store-apps)
- The goal is to clean the data, separate out the apps we are interested in (free and English), and then analyze them to gain insights.

## Cleaning 

1. Cleaning and separating the data for free English applications in Google Play and the App Store

In [1]:
def explore_data(dataset, start, end, row_column_count = False):

    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if row_column_count:
        print('Rows:    ',len(dataset))
        print('Columns: ',len(dataset[0]))

In [2]:
from csv import reader

file = open('../my_datasets/AppleStore.csv', encoding='utf8')
file2 = open('../my_datasets/googleplaystore.csv', encoding='utf8')

read_file = reader(file)
read_file2 = reader(file2)

ios_data = list(read_file)
google_data = list(read_file2)
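
The cell above leaves both file handles open. As a minimal sketch, assuming the same CSV layout with a header in the first row, the loading step could be wrapped in a small helper that closes the file automatically. The name `load_csv` is hypothetical and not part of the original notebook.

```python
import csv


def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows) as lists of strings.

    Hypothetical helper, not part of the original notebook.
    """
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.reader; the with-block closes the file when done.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With the paths used above, this would be called as `ios_header, ios_rows = load_csv('../my_datasets/AppleStore.csv')`. Note that it returns the header separately, whereas `ios_data` and `google_data` keep the header at index 0.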

In [3]:
explore_data(google_data, 10472, 10475, row_column_count=True)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Rows:     10842
Columns:  13


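- In the cell below, the data point that we found to be faulty (reported in the discussion section of the dataset) is removed from the Google Play dataset. As the output above shows, this row is missing its category value, so it has 12 fields instead of 13.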

In [4]:
del google_data[10473]    # Run this cell only once; re-running it deletes another row

- There are also plenty of duplicate entries in the Google Play dataset that need to be removed (example below).
- I am going to keep the most recent entry of each app, identified by the highest number of reviews, and remove the rest.

In [5]:
unique_apps = []
duplicate_apps = []

for app in google_data[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
            
print('Duplicate Apps: ', len(duplicate_apps))
index = 0
for app in google_data[1:]:
    name = app[0]
    index += 1
    if name == 'Instagram':
        print(index,'=', app)
        print('\n')

Duplicate Apps:  1181
2546 = ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


2605 = ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


2612 = ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


3910 = ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Below, I store the highest number of reviews recorded for every app in the `reviews_max` dictionary.
- For duplicated apps, the entry with the most reviews is taken to be the most recent one. Apps without duplicates are recorded as is.

In [6]:
reviews_max = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print("Length of reviews_max: ", len(reviews_max))


Length of reviews_max:  9659


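- To remove the duplicates, the code below checks whether each app's number of reviews equals the value recorded in the `reviews_max` dictionary from the previous cell and, if so, appends the app to a new list. The `already_added` check ensures that an app is appended only once, even when several of its duplicates share the maximum review count. The new list is duplicate-free.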

In [7]:
google_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name not in already_added and n_reviews == reviews_max[name]:
        google_clean.append(app)
        already_added.append(name)
        

print("Google_clean: ", len(google_clean))
print("already_added: ", len(already_added))

Google_clean:  9659
already_added:  9659
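
The two cells above can also be combined into a single pass. The following is a minimal sketch of that idea, not the notebook's method: `dedupe_by_max_reviews` is a hypothetical name, and the sample rows are shortened to four columns for illustration. It keeps the same records as the two-pass version, although the order of the resulting list may differ.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row on ties, like the two-pass version
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Shortened sample rows (name, category, rating, reviews), for illustration only
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203'],
]
deduped = dedupe_by_max_reviews(sample)
print(len(deduped))   # 2
print(deduped[0][3])  # 66577446
```

A dictionary lookup is also faster than the `name not in already_added` test on a list, which scans the whole list for every app.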


- Now that the duplicates are removed, we want only the apps that are English and free. That is the data we are interested in.
- To check whether an app is English, we iterate over the characters of its name and accept the name if at most three characters have a code point above 127 (that is, outside the ASCII range). This drops English apps whose names contain four or more non-ASCII characters, but we still keep most English app records.

In [8]:
def english_check(name):
    counter = 0
    for char in name:
        if ord(char)>127:
            counter +=1
        if counter>3:
            return False
        
    return True

In [9]:
ios_eng = []
google_eng = []

for app in google_clean:
    name = app[0]
    if english_check(name):
        google_eng.append(app)
        
for app in ios_data[1:]:
    name = app[0]
    if english_check(name):
        ios_eng.append(app)
        
print('Google_eng: ', len(google_eng))
print('ios_eng: ', len(ios_eng))

Google_eng:  9614
ios_eng:  7197


In [10]:
# from csv import writer

# with open('ios_clean.csv', 'w',encoding='utf8') as f: 
#     write = writer(f) 
#     write.writerow(ios_data[0])
#     for app in ios_eng:
#         write.writerow(app)

# with open('google_clean_eng.csv', 'w',encoding='utf8') as f: 
#     write = writer(f) 
#     write.writerow(google_data[0])
#     for app in google_eng:
#         write.writerow(app)

- In the cell below, we keep only the free apps from the English-only data.

In [17]:
ios_eng_free = []
google_eng_free = []

for app in google_eng:
    if app[6] == 'Free':
        google_eng_free.append(app)
        
for app in ios_eng:
    if float(app[5]) == 0:
        ios_eng_free.append(app)
        
print('ios_eng: ', len(ios_eng),'Free: ', len(ios_eng_free))
print('google_eng: ', len(google_eng), 'Free: ', len(google_eng_free))

ios_eng:  7197 Free:  4056
google_eng:  9614 Free:  8863


# Genre Frequency Tables 

## 1. App Store

In [24]:
def freq_table(dataset, index):
    """Return a frequency table (dict of percentages) for the column at `index`."""
    
    dataset_ft = {}
    for data in dataset:
        column = data[index]
        if column in dataset_ft:
            dataset_ft[column] += 1
        else:
            dataset_ft[column] = 1
    
    sum_column = 0
    for key in dataset_ft:
        sum_column += dataset_ft[key]
        
    for key in dataset_ft:
        dataset_ft[key] = (dataset_ft[key]/sum_column)*100
        dataset_ft[key] = round(dataset_ft[key],2)
        
    return dataset_ft
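
The same percentage table can be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not a replacement for `freq_table`; the name `freq_table_counter` and the two-column sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, built with Counter."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: round(count / total * 100, 2) for value, count in counts.items()}


# Two-column sample rows (name, genre), for illustration only
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
# Games : 75.0
# Weather : 25.0
```

Sorting `table.items()` by value gives the descending order directly, without first building a list of `(value, key)` tuples.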

In [33]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [47]:
print('IOS_Eng_Free - Prime Genre: \n')
display_table(ios_eng_free, 12)

# print('Google_Eng_Free - Genres: \n')
# display_table(google_eng_free, 9)

IOS_Eng_Free - Prime Genre: 

Games : 55.65
Entertainment : 8.23
Photo & Video : 4.12
Social Networking : 3.53
Education : 3.25
Shopping : 2.98
Utilities : 2.69
Lifestyle : 2.32
Finance : 2.07
Sports : 1.95
Health & Fitness : 1.87
Music : 1.65
Book : 1.63
Productivity : 1.53
News : 1.43
Travel : 1.38
Food & Drink : 1.06
Weather : 0.76
Reference : 0.49
Navigation : 0.49
Business : 0.49
Catalogs : 0.22
Medical : 0.2


In [42]:
Freq_Table = freq_table(ios_eng_free, 12)
genre_avg_num_rate = []

for genre in Freq_Table:
    total = 0
    len_genre = 0
    
    for app in ios_eng_free:
        genre_app = app[12]
        if genre_app == genre:
            rate_num = float(app[6])
            total += rate_num
            len_genre += 1
    avg_num_rate = round(total/len_genre,2)
    genre_avg_num_rate.append((avg_num_rate, genre))

sorted(genre_avg_num_rate,reverse=True)

[(67447.9, 'Reference'),
 (56482.03, 'Music'),
 (53078.2, 'Social Networking'),
 (47220.94, 'Weather'),
 (27249.89, 'Photo & Video'),
 (25972.05, 'Navigation'),
 (20216.02, 'Travel'),
 (20179.09, 'Food & Drink'),
 (20128.97, 'Sports'),
 (19952.32, 'Health & Fitness'),
 (19053.89, 'Productivity'),
 (18924.69, 'Games'),
 (18746.68, 'Shopping'),
 (15892.72, 'News'),
 (14010.1, 'Utilities'),
 (13522.26, 'Finance'),
 (10822.96, 'Entertainment'),
 (8978.31, 'Lifestyle'),
 (8498.33, 'Book'),
 (6367.8, 'Business'),
 (6266.33, 'Education'),
 (1779.56, 'Catalogs'),
 (459.75, 'Medical')]
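
The cell above loops over the whole dataset once per genre. As a sketch of a single-pass alternative, the totals and counts can be accumulated together. The name `average_by_group` and the two-column sample rows are hypothetical; with the notebook's data the call would be `average_by_group(ios_eng_free, 12, 6)`.

```python
from collections import defaultdict


def average_by_group(dataset, group_idx, value_idx):
    """Average of a numeric column per group, sorted from highest to lowest."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    averages = [(round(totals[g] / counts[g], 2), g) for g in totals]
    return sorted(averages, reverse=True)


# Two-column sample rows (genre, rating count), for illustration only
sample = [['Weather', '100'], ['Weather', '300'], ['Games', '50']]
print(average_by_group(sample, 0, 1))
# [(200.0, 'Weather'), (50.0, 'Games')]
```

Keep in mind that a plain average can be dominated by a few very popular apps in a genre, so it is worth looking at the individual apps behind a high average before drawing conclusions.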

### Recommendation for App Store 
- Based on the number of apps in each genre (English, free) and the average number of ratings (a proxy for the number of installs):

1. **Weather app** (There are few of them, yet many people use them, so an appealing and accurate weather app could be worth considering.)

2. **Hybrid reference/photo app**, for example one that names the objects in a photo and translates the names. This would let you find the word for an object without knowing it in your own language or typing it into a translator.

## 2. Google Play

In [48]:
print('Google_Eng_Free - Category: \n')
display_table(google_eng_free, 1)

Google_Eng_Free - Category: 

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [46]:
Freq_category = freq_table(google_eng_free, 1)
category_avg_n_install = []

for category in Freq_category:
    total = 0
    len_category = 0
    
    for app in google_eng_free:
        category_app = app[1]
        if category_app == category:
            install_n = app[5]
            install_n = install_n.replace('+','')
            install_n = float(install_n.replace(',',''))
            total += install_n
            len_category += 1
    avg_num_install = round(total/len_category,2)
    category_avg_n_install.append((avg_num_install, category))

sorted(category_avg_n_install,reverse=True)

[(38456119.17, 'COMMUNICATION'),
 (24727872.45, 'VIDEO_PLAYERS'),
 (23253652.13, 'SOCIAL'),
 (17840110.4, 'PHOTOGRAPHY'),
 (16787331.34, 'PRODUCTIVITY'),
 (15588015.6, 'GAME'),
 (13984077.71, 'TRAVEL_AND_LOCAL'),
 (11640705.88, 'ENTERTAINMENT'),
 (10801391.3, 'TOOLS'),
 (9549178.47, 'NEWS_AND_MAGAZINES'),
 (8767811.89, 'BOOKS_AND_REFERENCE'),
 (7036877.31, 'SHOPPING'),
 (5201482.61, 'PERSONALIZATION'),
 (5074486.2, 'WEATHER'),
 (4188821.99, 'HEALTH_AND_FITNESS'),
 (4056941.77, 'MAPS_AND_NAVIGATION'),
 (3697848.17, 'FAMILY'),
 (3638640.14, 'SPORTS'),
 (1986335.09, 'ART_AND_DESIGN'),
 (1924897.74, 'FOOD_AND_DRINK'),
 (1833495.15, 'EDUCATION'),
 (1712290.15, 'BUSINESS'),
 (1437816.27, 'LIFESTYLE'),
 (1387692.48, 'FINANCE'),
 (1331540.56, 'HOUSE_AND_HOME'),
 (854028.83, 'DATING'),
 (817657.27, 'COMICS'),
 (647317.82, 'AUTO_AND_VEHICLES'),
 (638503.73, 'LIBRARIES_AND_DEMO'),
 (542603.62, 'PARENTING'),
 (513151.89, 'BEAUTY'),
 (253542.22, 'EVENTS'),
 (120550.62, 'MEDICAL')]
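
The install counts in this dataset are strings such as `'100,000+'`, so they have to be cleaned before averaging. The helper below is a small sketch of that conversion; `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('100,000+'))        # 100000
print(parse_installs('1,000,000,000+'))  # 1000000000
```

Because each value is only the lower bound of a bucket ("at least this many installs"), the averages above are rough estimates rather than exact figures.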

### Recommendation for potential genres for Google Play 
- To be further developed for iOS
- Based on the number of apps in each category (English, free) and the average number of installs:

1. Productivity
2. Photography