# Popular Apps to Investigate - Analyze Data from Google Play and App Store
(This is a learning project based on Dataquest: [Link](https://www.dataquest.io/))

The aim of the project is to find out what kind of apps are likely to attract more users on Google Play and the App Store. The result can be used to help decide what kind of apps to investigate.

Two data sets are used here:
1. Data about ~10,000 Android apps from Google Play that was collected in Aug. 2018. [Link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
2. Data about ~7,000 iOS apps from the App Store that was collected in July 2017. [Link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

### Define a function to explore the data
This function takes 4 parameters:
- `dataset`: the data set (a list of lists) to explore
- `start` and `end`: the index of the first row to show and the index to stop before
- `rows_and_columns`: a Boolean indicating whether to print the size (number of rows and number of columns) of the data set

In [1]:
from csv import reader

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end] # slice the dataset
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
        if rows_and_columns:
            print('Number of rows:', len(dataset))
            print('Number of columns:', len(dataset[0]))
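
As a variant, the size can be reported once after the slice has been printed instead of after every row. The sketch below assumes that behavior is wanted; `preview_rows` is a hypothetical name that does not appear in the notebook, and the sample rows are toy data.

```python
def preview_rows(dataset, start, end, rows_and_columns=False):
    """Print rows dataset[start:end] and optionally report the size once.

    Hypothetical helper for illustration; returns (n_rows, n_columns).
    """
    for row in dataset[start:end]:
        print(row)
        print()  # blank line between rows

    if not dataset:
        return (0, 0)

    shape = (len(dataset), len(dataset[0]))
    if rows_and_columns:
        # Outside the loop, so the size is printed a single time
        print('Number of rows:', shape[0])
        print('Number of columns:', shape[1])
    return shape


sample = [['a', '1'], ['b', '2'], ['c', '3']]  # toy data
shape = preview_rows(sample, 0, 2, rows_and_columns=True)
```

Returning the shape as a tuple also makes the helper easy to check without reading printed output.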

### Read in the data

In [4]:
# Data from the App Store
opened_file_apple = open('AppleStore.csv', encoding='utf8')
apps_data_apple = list(reader(opened_file_apple))
apple_header = apps_data_apple[0] # save the header
apps_data_apple = apps_data_apple[1:] # save as list without header

# Data from Google Play
opened_file_google = open('googleplaystore.csv', encoding = 'utf8')
apps_data_google = list(reader(opened_file_google))
google_header = apps_data_google[0]
apps_data_google = apps_data_google[1:]
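
The two loading steps follow the same pattern, so they could be wrapped in one helper that also closes the file when it is done. This is only a sketch: `load_csv` is a hypothetical name, and the demo file is created on the fly so the example runs without the real CSV files.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The csv module documentation recommends opening files with newline=''
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Toy file for illustration only
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Price'], ['Demo, Inc.', '0']])
    header, rows = load_csv(demo_path)

print(header)
print(rows)
```

With the real files, the call would look like `apple_header, apps_data_apple = load_csv('AppleStore.csv')`.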

### Explore data

In [5]:
# Explore both data sets
# Show number of rows and number of columns
# View first 2 lines of each dataset
print('Columns of data from the App Store:\n' + str(apple_header) + '\n')
explore_data(apps_data_apple, 0, 2, True)
print('---------------')
print('Columns of data from Google Play:\n' + str(google_header) + '\n')
explore_data(apps_data_google, 0, 2, True)

Columns of data from the App Store:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16
---------------
Columns of data from Google Play:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018',

### Data cleaning - Remove inaccurate data

In [15]:
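def missing_data(dataset):
    n_columns = len(dataset[0]) # expected number of columns, taken from the first row
    # enumerate gives the row position directly (no second search through the list)
    for index, row in enumerate(dataset):
        if len(row) != n_columns:
            print('This row has missing data: ' + str(index) + ':')
            print(row)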


In [16]:
# Check if the data set from the App Store has missing data.
print('Check data set from the App Store.')
missing_data(apps_data_apple)
# Check if the data set from Google Play has missing data.
print('Check data set from Google Play')
missing_data(apps_data_google)

Check data set from the App Store.
Check data set from Google Play
This row has missing data: 10472:
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [17]:
print(apps_data_google[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [18]:
# The 'Category' value of this app is missing, so its columns are shifted.
# Delete this row, but run the line below only once;
# otherwise valid rows will be deleted.
# del apps_data_google[10472]
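
Deleting by index is risky because running the cell a second time removes a different, valid row. One alternative, sketched here under the assumption that every valid row has the same number of fields as the header, is to filter by row length. The function name `drop_malformed_rows` and the sample rows are hypothetical.

```python
def drop_malformed_rows(rows, n_columns):
    """Return only the rows that have exactly `n_columns` fields."""
    return [row for row in rows if len(row) == n_columns]


header = ['App', 'Category', 'Rating']          # toy header
rows = [
    ['A', 'TOOLS', '4.1'],
    ['B', '1.9'],                               # 'Category' is missing
    ['C', 'GAME', '3.8'],
]

cleaned = drop_malformed_rows(rows, len(header))
# Running the filter again changes nothing, so the step is safe to repeat
cleaned_again = drop_malformed_rows(cleaned, len(header))
print(cleaned)
```

Because the filter builds a new list, the original data stays untouched and the cell can be re-run freely.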

### Data cleaning - Remove duplicate data
1. Check for duplicates in the datasets

In [24]:
def check_duplicates(dataset):
    duplicate_apps = [] # list for duplicated app names
    unique_apps = [] # list for unique app names
    # find duplicate app name
    for app in dataset:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    # print an example of duplicate app entry in the data
    if len(duplicate_apps) == 0: # check if list of duplicates is empty
        print("There is no duplicated apps.")
    else: # if there are some duplicates, print the first one as example
        print("There are "+str(len(duplicate_apps)) + " duplicated apps.")
        print("Here is one example:")
        for app in dataset:
            name = app[0]
            if name == duplicate_apps[0]:
                print(app)

In [27]:
# check the android apps
print('Google Play data:')
check_duplicates(apps_data_google)
# check the App Store apps (note: column 0 of this data set is the app id, not the name)
print('\nApp Store data:')
check_duplicates(apps_data_apple)

Google Play data:
There are 1181 duplicated apps.
Here is one example:
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']

App Store data:
There is no duplicated apps.


In [28]:
print(len(apps_data_google)-1181) # print the number of entries after removing duplicates

9659
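
The same count can be cross-checked with `collections.Counter` from the standard library. This is a minimal sketch on toy rows; `count_duplicates` is a hypothetical helper, and the name column is assumed to be at index 0 as in the Google Play data.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


rows = [['A', '10'], ['B', '5'], ['A', '12'], ['A', '12'], ['C', '1']]  # toy data
dupes = count_duplicates(rows)

# Rows that would be removed: every occurrence beyond the first
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes, extra_rows)
```

Here `extra_rows` corresponds to the number of entries to remove, and `len(rows) - extra_rows` to the number of rows expected after cleaning.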


### Data cleaning - Remove duplicates from the Google Play data set
For each duplicated app, keep the most recent entry (assumed to be the one with the most reviews) and remove the others.

First, create a dictionary that maps each app name to its maximum number of reviews.

In [29]:
reviews_max = {} # dictionary: app name -> max number of reviews
for app in apps_data_google:
    name = app[0]
    n_reviews = float(app[3])
    # store the count for a new name, or replace a smaller stored count
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


Use the dictionary created above and two lists to remove the duplicate entries.

In [34]:
android_clean = []
already_added = []
# loop through the android data 
for app in apps_data_google:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))

9659
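
The two passes can also be merged into one by storing the best row per name in a dictionary. This is a sketch rather than the notebook's method: `keep_most_reviewed` is a hypothetical name, the column positions are parameters, and the rows are toy data. It relies on dictionaries preserving insertion order (Python 3.7+), so the result is ordered by the first appearance of each name, which may differ from the order produced by the two-list approach.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


rows = [
    ['A', 'TOOLS', '4.0', '10'],
    ['B', 'GAME', '4.2', '5'],
    ['A', 'TOOLS', '4.0', '12'],
    ['A', 'TOOLS', '4.0', '12'],   # exact duplicate of the previous row
]
deduped = keep_most_reviewed(rows)
print(deduped)
```

Dictionary lookups are constant time on average, whereas `name not in already_added` scans a list on every iteration.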


### Remove non-English apps
1. Define a function that counts the non-ASCII characters (code points above 127) in an app name. If there are more than 3, the function returns False, which means the app is treated as a non-English app.

In [30]:
def englishCharacter(string):
    non_english = 0
    for character in string:
        if ord(character) > 127:
            non_english +=1
            if non_english > 3:
                return False
    return True

Check the englishCharacter function.

In [33]:
print(englishCharacter('Instagram'))
print(englishCharacter('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(englishCharacter('Docs To Go™ Free Office Suite'))
print(englishCharacter('Instachat 😜'))

True
False
True
True
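
The threshold of 3 is a heuristic, so it can help to make it a parameter and compare results for different values. The sketch below assumes the same rule (code points above 127 count as non-English); `is_probably_english` and `max_non_ascii` are hypothetical names.

```python
def is_probably_english(text, max_non_ascii=3):
    """Return True if `text` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
results = [is_probably_english(name) for name in names]
strict = [is_probably_english(name, max_non_ascii=0) for name in names]
print(results)
print(strict)
```

With a threshold of 0, names that contain a single symbol such as ™ or an emoji are rejected as well, which is why a small tolerance is used.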


2. Use the function to loop through the lists and keep only the English apps.

In [31]:
apple_English = []
for app in apps_data_apple:
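    name = app[0] # note: index 0 is 'id' in the App Store data ('track_name' is index 1), so no rows are filtered out here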
    isEnglishApp = englishCharacter(name)
    if isEnglishApp:
        apple_English.append(app)
print(len(apple_English))

7197


In [35]:
android_English = []
for app in android_clean:
    name = app[0]
    isEnglishApp = englishCharacter(name)
    if isEnglishApp:
        android_English.append(app)
print(len(android_English))

9614


### Isolate and keep free apps only

In [39]:
apple_Free_apps = []
for app in apple_English:
    price = float(app[4])
    if price == 0.0: # price of 0 indicating free apps
        apple_Free_apps.append(app)
        
print(len(apple_Free_apps))

4056


In [40]:
android_free_apps = []
for app in android_English:
    app_type = app[6] # avoid shadowing the built-in name `type`
    if app_type == 'Free': # apps labeled as free
        android_free_apps.append(app)
        
print(len(android_free_apps))

8863
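
Both data sets also have a price column (`price` at index 4 in the App Store data, `Price` at index 7 in the Google Play data), so a single price-based filter could serve both. This sketch assumes that paid Google Play prices may carry a leading `$` (for example `'$4.99'`), which is not shown in the output above. `parse_price` and `free_apps` are hypothetical names.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is an optional leading '$'
    return float(text.strip().lstrip('$'))


def free_apps(rows, price_index):
    """Return the rows whose price parses to zero."""
    return [row for row in rows if parse_price(row[price_index]) == 0.0]


rows = [['A', '0'], ['B', '$4.99'], ['C', '0.0']]  # toy data
free = free_apps(rows, price_index=1)
print(free)
```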


### Analyze data - which genre has the most apps
We need to build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set. 
We'll build two functions to analyze the frequency tables:
* One function to generate a frequency table that counts how many apps have each value in a column
* Another function to display the counts in descending order
In [41]:
def freq_table(dataset, index):
    frequency_table = {}
    for app in dataset:
        value = app[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    return frequency_table

In [42]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
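
Raw counts are hard to compare between two stores of different sizes, so the counts can be converted to percentages of the total. This is a minimal sketch on toy rows; `freq_table_percent` is a hypothetical name.

```python
def freq_table_percent(rows, index):
    """Return {value: percentage of rows that have this value in column `index`}."""
    counts = {}
    for row in rows:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(rows)
    return {value: 100 * n / total for value, n in counts.items()}


rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]  # toy data
percentages = freq_table_percent(rows, 1)

# Display in descending order of percentage
for value, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', round(pct, 2))
```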

In [43]:
display_table(apple_Free_apps, 11)
# Games are by far the most common genre among free App Store apps; Entertainment is second

Games : 2257
Entertainment : 334
Photo & Video : 167
Social Networking : 143
Education : 132
Shopping : 121
Utilities : 109
Lifestyle : 94
Finance : 84
Sports : 79
Health & Fitness : 76
Music : 67
Book : 66
Productivity : 62
News : 58
Travel : 56
Food & Drink : 43
Weather : 31
Reference : 20
Navigation : 20
Business : 20
Catalogs : 9
Medical : 8


In [44]:
display_table(android_free_apps, 9)
# The data below show that Tools is the most common genre and Entertainment is the second

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [45]:
display_table(android_free_apps, 1)

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [46]:
apple_genre_ft = freq_table(apple_Free_apps,11)


### Which types of apps attract the most users?

In [47]:
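# The App Store data has no install counts, so the total number of
# ratings (rating_count_tot, index 5) is used as a proxy for the number of users.
for genre in apple_genre_ft:
    total = 0
    genre_len = 0
    for app in apple_Free_apps:
        if app[11] == genre:
            rating_count_tot = int(app[5])
            total = total + rating_count_tot
            genre_len += 1
    print(genre, ":", round(total / genre_len))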

Social Networking : 53078
Photo & Video : 27250
Games : 18925
Music : 56482
Reference : 67448
Health & Fitness : 19952
Weather : 47221
Utilities : 14010
Travel : 20216
Shopping : 18747
News : 15893
Navigation : 25972
Lifestyle : 8978
Entertainment : 10823
Food & Drink : 20179
Sports : 20129
Book : 8498
Finance : 13522
Education : 6266
Productivity : 19054
Business : 6368
Catalogs : 1780
Medical : 460
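
The nested loop scans the full app list once per genre. The same averages can be computed in a single pass by accumulating a total and a count per group, as sketched here on toy rows. `average_by_group` is a hypothetical name and the column positions are parameters.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over `rows`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]  # toy data
averages = average_by_group(rows, group_index=0, value_index=1)
print(averages)
```

For the App Store data the call would use `group_index=11` and `value_index=5`, matching the columns used above.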


In [48]:
# Category names are taken from android_English; the averages use free apps only
android_category_ft = freq_table(android_English, 1)
for genre in android_category_ft:
    total = 0
    genre_len = 0
    for app in android_free_apps:
        if app[1] == genre:
            installation = app[5]
            # 'Installs' is text such as '10,000+': strip '+' and ',' before converting
            installation = float(installation.replace('+', '').replace(',', ''))
            total = total + installation
            genre_len += 1
    print(genre, ":", round(total / genre_len))

ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647318
BEAUTY : 513152
BOOKS_AND_REFERENCE : 8767812
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854029
EDUCATION : 1833495
ENTERTAINMENT : 11640706
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924898
HEALTH_AND_FITNESS : 4188822
HOUSE_AND_HOME : 1331541
LIBRARIES_AND_DEMO : 638504
LIFESTYLE : 1437816
GAME : 15588016
FAMILY : 3697848
MEDICAL : 120551
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984078
TOOLS : 10801391
PERSONALIZATION : 5201483
PRODUCTIVITY : 16787331
PARENTING : 542604
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056942
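
The averages are printed in dictionary order, which makes the leaders hard to spot. Sorting them helps, as in the sketch below, which reuses four of the values printed above. `parse_installs` and `top_groups` are hypothetical names. The install figures are open-ended buckets (`'10,000+'`), so the parsed numbers should be read as lower bounds.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to an int (a lower bound)."""
    return int(text.replace(',', '').replace('+', ''))


def top_groups(averages, n=3):
    """Return the `n` (group, average) pairs with the highest averages."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# A few of the category averages from the output above
averages = {
    'COMMUNICATION': 38456119,
    'VIDEO_PLAYERS': 24727872,
    'SOCIAL': 23253652,
    'MEDICAL': 120551,
}
top_two = top_groups(averages, n=2)
print(parse_installs('10,000+'))
print(top_two)
```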
