# Free apps analysis from App Store and Google Store

In this project, we analyse two data sets, one from the App Store and one from the Google Store, that contain information about mobile apps. We focus on the free ones.

Our goal is to help our app developers understand which types of apps consumers like to download and use from those stores.

In [1]:
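from csv import reader

# Read each CSV fully into a list of rows (assuming UTF-8 encoded files);
# the with blocks close the files once they have been read
with open('AppleStore.csv', encoding='utf8') as file_1:
    apple_data = list(reader(file_1))
with open('googleplaystore.csv', encoding='utf8') as file_2:
    google_data = list(reader(file_2))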

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print("Apple Store first 4 rows\n")
explore_data(apple_data[1:], 0, 4, True)
print("\n----------------------------------------------------")
print("\nGoogle Store first 4 rows\n")
explore_data(google_data[1:], 0, 4, True)

Apple Store first 4 rows

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16

----------------------------------------------------

Google Store first 4 rows

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone'

In [4]:
print("\nColumn names\n")
print("Apple Store\n")
print(apple_data[0])
print("\n----------------------------------------------------")
print("\nGoogle Store\n")
print(google_data[0])


Column names

Apple Store

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

----------------------------------------------------

Google Store

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


**Data cleaning**

Google Store

In [5]:
if len(google_data) == 10842: # if header not removed
    print(google_data[10473])
elif len(google_data) == 10841:
    print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The row printed above is missing its Category value: it has 12 fields instead of 13, so every later value is shifted one column to the left. This would cause errors later on, so we delete that row.

In [6]:
if len(google_data) == 10842: # if header not removed
    del google_data[10473]
elif len(google_data) == 10841:
    del google_data[10472]

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


Apple Store

In [7]:
for app in apple_data[1:]:
    if len(app) != len(apple_data[0]):
        print(app)
        print(apple_data.index(app))

The code above compares the length of every row of the Apple Store data set with the length of the header row and prints any faulty row, like the one found in the Google Store data, together with its index.

Nothing was printed, so we conclude that there is no faulty row in the Apple Store data.
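
The length check above can be wrapped in a small helper, which avoids hard-coding a row index that shifts depending on whether the header row is still present. This is a minimal sketch: `find_malformed_rows` and the sample rows are hypothetical and not part of the original notebook, and the helper assumes the first row of the data set is the header.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row.

    Assumes dataset[0] is the header row.
    """
    expected = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != expected
    ]


# Hypothetical sample: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Because the helper reports indices instead of deleting anything, it can be run repeatedly without removing a valid row by accident.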

Duplicated entries

Our datasets might have some duplicated entries. The code below helps us detect and count them. 

In [10]:
google_duplicated_apps = []
google_unique_apps = []

apple_duplicated_apps = []
apple_unique_apps = []

for app in google_data[1:]:
    name = app[0]
    if name in google_unique_apps:
        google_duplicated_apps.append(name)
    else:
        google_unique_apps.append(name)
        
for app in apple_data[1:]:
    app_id = app[0] # in the Apple data, column 0 is the id (track_name is column 1)
    if app_id in apple_unique_apps:
        apple_duplicated_apps.append(app_id)
    else:
        apple_unique_apps.append(app_id)
        
print("Number of duplicate apps in the google store apps:", len(google_duplicated_apps), '\n')
print("Number of duplicate apps in the apple store apps:", len(apple_duplicated_apps), '\n')


Number of duplicate apps in the google store apps: 1181 

Number of duplicate apps in the apple store apps: 0 



As the output above shows, only the Google Store data set contains duplicated entries. We need to remove those rows before starting our analysis, but we should not remove them at random.

A quick look shows that the duplicated rows differ in the Reviews column. We therefore keep the row with the highest number of reviews, since it is most likely the most recent one.



In [13]:
reviews_max = {}
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


In [15]:
google_clean = []
google_clean.append(google_data[0])# add the header row
google_already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in google_already_added:
        google_clean.append(app)
        google_already_added.append(name)

print(len(google_clean))
        

9660
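
The two passes above (first the maximum per name, then the matching rows) can also be collapsed into a single pass that stores the best row seen so far for each name. This is a minimal sketch on hypothetical sample rows; `keep_most_reviewed` is not part of the original notebook, and the column positions are assumed to match the Google data (name at index 0, reviews at index 3).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count.

    rows must not include the header row. On ties, the first row wins.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical sample: 'App A' appears twice with different review counts
sample = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
]
deduplicated = keep_most_reviewed(sample)
print(deduplicated)
```

One difference from the cells above: the result is ordered by the first appearance of each name rather than by the position of the kept row, which does not matter for frequency counts.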


In this section, we remove apps with non-English names, as we are targeting English-speaking customers.

In [20]:
def is_english_app(app_name):
    foreign_letter_count = 0
    for letter in app_name:
        if ord(letter) > 127:
            foreign_letter_count += 1
            if foreign_letter_count > 3:
                return False
    return True

In the cell below, we test the function on a few app names. Names with up to three non-ASCII characters (such as ™ or an emoji) still count as English.

In [21]:
print(is_english_app('Instagram'))
print(is_english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_app('Docs To Go™ Free Office Suite'))
print(is_english_app('Instachat 😜'))

True
False
True
True
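
To see why the last two names pass, it helps to look at the code points that `ord` returns. The sketch below is illustrative only; `non_ascii_count` is a hypothetical helper that counts the characters the function above would treat as foreign.

```python
def non_ascii_count(text):
    """Count characters whose code point is above 127 (outside ASCII)."""
    return sum(1 for character in text if ord(character) > 127)


print(ord('a'))   # 97, inside the ASCII range
print(ord('™'))   # 8482, outside the ASCII range
print(ord('😜'))  # 128540, outside the ASCII range

for name in ['Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, non_ascii_count(name))
```

Both names contain a single non-ASCII character, which is below the threshold of three, so they are kept.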


In the following cell, we use the function to filter both data sets.

In [24]:
google_english_data = []
google_english_data.append(google_clean[0]) # add the header row
apple_english_data = []
apple_english_data.append(apple_data[0]) # add the header row
for app in google_clean[1:]:
    if is_english_app(app[0]):
        google_english_data.append(app)

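# Note: in the Apple data, app[0] is the numeric id column (track_name is app[1]),
# so this check passes for every Apple row, as the count printed below shows.
for app in apple_data[1:]:
    if is_english_app(app[0]):
        apple_english_data.append(app)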
        
print("Number of remaining rows for apple data:", len(apple_english_data[1:]))
print("Number of remaining rows for google data:", len(google_english_data[1:]))
    

Number of remaining rows for apple data: 7197
Number of remaining rows for google data: 9614
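
The two data sets store the app name in different columns: `App` is at index 0 in the Google data, while in the Apple data index 0 is `id` and the name is in `track_name` at index 1, as the headers printed earlier show. A minimal sketch of a filter that takes the name column explicitly is given below. `is_mostly_ascii`, `filter_english` and the sample rows are hypothetical and only illustrate the effect of the column choice.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """True if text has at most max_non_ascii characters outside ASCII."""
    return sum(ord(character) > 127 for character in text) <= max_non_ascii


def filter_english(dataset, name_index):
    """Return the header plus the rows whose name column passes the check."""
    header, rows = dataset[0], dataset[1:]
    return [header] + [row for row in rows if is_mostly_ascii(row[name_index])]


# Hypothetical sample shaped like the Apple data: id first, then track_name
sample = [
    ['id', 'track_name'],
    ['1', 'Instagram'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
]
print(len(filter_english(sample, name_index=0)) - 1)  # checks the id column
print(len(filter_english(sample, name_index=1)) - 1)  # checks the app name
```

On this sample, filtering on the id column keeps both rows, while filtering on the name column keeps only the first one.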


In this final step of data cleaning, we remove apps that are not free.

In [27]:
apple_free_eng_apps = []
apple_free_eng_apps.append(apple_english_data[0]) # add header row
google_free_eng_apps = []
google_free_eng_apps.append(google_english_data[0]) # add header row

for app in apple_english_data[1:]:
    if float(app[4]) == 0.0:
        apple_free_eng_apps.append(app)
    
for app in google_english_data[1:]:
    if app[6] == "Free":
        google_free_eng_apps.append(app)
        
print("Number of remaining rows for apple data:", len(apple_free_eng_apps[1:]))
print("Number of remaining rows for google data:", len(google_free_eng_apps[1:]))
    

Number of remaining rows for apple data: 4056
Number of remaining rows for google data: 8863


**Data Analysis**

For the analysis, we need to find an app profile that fits both the App Store and the Google Store, because our goal is to add our future apps to both stores.

After inspection, we conclude that the prime_genre column in the Apple data set and the Genres and Category columns of the Google data set are the ones of interest.

We build frequency tables for each of them.
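
To make the idea concrete, here is a minimal sketch of a percentage frequency table built on a hypothetical list of genres. It uses `collections.Counter` from the standard library rather than the hand-written loop used in this notebook; `percentage_table` and the sample values are illustrative only.

```python
from collections import Counter


def percentage_table(values):
    """Map each distinct value to its share of the total, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical genre column
genres = ['Games', 'Games', 'Tools', 'Games']
table = percentage_table(genres)

# Print the most frequent genre first
for genre, share in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', share)
```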

In [32]:
def freq_table(dataset, index):
    freq_dict = {}
    freq_dict_percentage = {}
    total = len(dataset[1:]) # number of apps, excluding the header row
    for app in dataset[1:]:
        if app[index] in freq_dict:
            freq_dict[app[index]] += 1
        else:
            freq_dict[app[index]] = 1
    
    for key in freq_dict:
        freq_dict_percentage[key] = (freq_dict[key] / total) * 100
        
    return freq_dict_percentage

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print("prime_genre column of apple data\n")
display_table(apple_free_eng_apps, 11)
print("\nGenres column of google data\n")
display_table(google_free_eng_apps, 9)
print("\nCategory column of google data\n")
display_table(google_free_eng_apps, 1)
    

prime_genre column of apple data

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032

Genres column of google data

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315