# Analyzing Mobile App Data
The mobile app market is one of the most profitable sources of revenue for both Google and Apple. More and more consumers use their phones as their main way of navigating the internet, so more and more businesses are discovering that their best chance of retaining consumer interaction is through the app stores. Our company only builds free apps, and our largest source of revenue comes from in-app ads.

The goal of this analysis is to study data from Google Play and the App Store to determine what types of apps will attract more users to our company's products and, in doing so, increase the company's overall yearly profits.

## Explore Apple and Google Datasets

In [1]:
import csv

Open and save the two datasets

In [2]:
apple_data = []
google_data = []

with open('AppleStore.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        apple_data.append(row)
        
with open('googleplaystore.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        google_data.append(row)
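The two loading loops above share the same shape, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `read_rows` is a hypothetical name, and the in-memory sample stands in for a real file so the snippet runs on its own.

```python
import csv
import io

def read_rows(csvfile):
    """Return (header, rows) from an open CSV file object."""
    reader = csv.reader(csvfile)
    header = next(reader)   # first line holds the column names
    rows = list(reader)     # remaining lines are the app entries
    return header, rows

# Made-up in-memory sample standing in for a real CSV file.
sample = io.StringIO('App,Category\nExample App,TOOLS\nOther App,GAME\n')
header, rows = read_rows(sample)
print(header)
print(len(rows))
```

With a real file, the same helper would be called inside `with open('AppleStore.csv', newline='') as csvfile:`. Separating the header from the rows would also remove the need for the `[1:]` slices used later, at the cost of changing the row indexes used in this notebook.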

Make a function that takes a dataset and prints its rows in a readable way

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Explore Apple Data

In [4]:
print('Apple Data:\n')
explore_data(apple_data, 0, 5, True)
print('\nColumns: \n')
for row in apple_data[:1]:
    for column in row:
        print(f'    {column}')

Apple Data:

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16

Columns: 

    id
    track_name
    size_bytes
    currency
    price
    rating_count_tot
    rating_count_ver
    user_rating
    user_rating_ver
 

Explore Google Data

In [5]:
print('Google Data:\n')
explore_data(google_data, 0, 5, True)
print('\nColumns: \n')
for row in google_data[:1]:
    for column in row:
        print(f'    {column}')

Google Data:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13

Columns: 

    App
    Category
    Rating
    Revie

## Clean Datasets

Clean Google Data

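* A Kaggle user found an [error](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/164101) at index 10473 of `google_data` (header row included). The Category value is missing, so the row has 12 values instead of 13 and every later value sits one column to the left.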

In [6]:
print(google_data[10473])
del google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
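Rows like the one above can also be found without relying on a reported index, by comparing each row's length with the header's. This is a small sketch under that assumption; `find_malformed_rows` is a hypothetical helper and the sample data is made up.

```python
def find_malformed_rows(dataset):
    """Return indexes of rows whose length differs from the header row."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Made-up miniature dataset: the third row is missing one value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Example App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]
bad = find_malformed_rows(sample)
print(bad)  # [2]
```

Checking the length before deleting also guards against running the `del` cell twice, which would otherwise remove a valid row.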


* Get rid of duplicate entries

Create a dictionary that holds the app name as the key and the duplicate app entries stored in a list as the value. Print the total number of duplicates and some examples.

In [7]:
duplicate_apps = {}

for app in google_data[1:]:
    app_name = app[0]
    if app_name in duplicate_apps:
        # Repeat occurrence: store this entry as a duplicate
        duplicate_apps[app_name].append(app)
    else:
        # First occurrence: start with an empty list of duplicates
        duplicate_apps[app_name] = []
        
total_duplicates = 0

for app_name, apps in duplicate_apps.items():
    total_duplicates += len(apps)

print(f'total duplicates: {total_duplicates}\n')

examples = 0
for key, value in duplicate_apps.items():
    if len(value) == 2 and examples < 3:
        print(f'{key}:\n')
        for x in value:
            print(f'{x}\n')
        examples += 1
    elif examples == 3:
        break

total duplicates: 1181

Google My Business:

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']

Box:

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']

Quick PDF Scanner + OCR FREE:

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']

['Quick PDF Scanner + OCR 

Rather than removing duplicates at random, we keep only the most recent entry for each app. Review counts tend to grow over time, so the entry with the highest number in the "Reviews" column is taken to be the most recent one.

Create a dictionary that holds the app name as the key and the highest number of reviews in one entry as the value. Make a new list that holds no duplicates and print the length.

In [8]:
reviews_max = {}

# First pass: record the highest review count seen for each app name
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])  # 'Reviews' column
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

android_clean = []
already_added = []

# Second pass: keep one entry per app, the one matching the highest count
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print(f'Number of rows: {len(android_clean)}')

Number of rows: 9659
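The same idea (one entry per app, keeping the one with the most reviews) can be sketched in a single pass with a dictionary keyed by app name. This is an alternative under stated assumptions, not the notebook's method: `keep_most_reviewed` and its parameters are hypothetical names, and the sample rows are made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample: 'Example App' appears twice with different review counts.
sample = [
    ['Example App', 'TOOLS', '4.0', '100'],
    ['Other App', 'GAME', '4.5', '50'],
    ['Example App', 'TOOLS', '4.1', '250'],
]
unique = keep_most_reviewed(sample)
print(len(unique))  # 2
```

A dictionary lookup avoids scanning a growing list for every row. The result is ordered by each name's first appearance, which may differ from the order produced by the two-pass version.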


Luckily, there aren't any suspected errors in the Apple dataset, so no further data cleaning is needed there.

## Filter Datasets

Filter out non-English Android apps

We only develop English apps at our company, but some apps have non-English names, suggesting that they haven't been developed for English-speaking audiences.

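Create a function that tests whether the characters in a string are commonly used in English text, meaning their code point is at most 127 (the ASCII range). As written, the function returns False as soon as it meets a fifth character outside that range, so names with up to four such characters still pass. Otherwise, it returns True. Then, make a new filtered list with only English apps and print its length.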

In [9]:
def is_english(string):
    not_english = 0
    for char in string:
        if ord(char) > 127 and not_english > 3:
            return False
        elif ord(char) > 127:
            not_english += 1
    
    return True

android_filtered = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_filtered.append(app)
        
print(f'Number of rows: {len(android_filtered)}')

Number of rows: 9619


Because our company only develops free apps, we'll have to isolate the free apps in both the Android and Apple datasets.

In [10]:
android_free = []
apple_free = []

for app in android_filtered:
    app_type = app[6]  # 'Type' column, e.g. 'Free'
    if app_type == 'Free':
        android_free.append(app)
        
for app in apple_data[1:]:
    price = app[4]
    if price == '0.0':
        apple_free.append(app)
        
print(f'Free android apps: {len(android_free)}\n')
print(f'Free apple apps: {len(apple_free)}')

Free android apps: 8866

Free apple apps: 4056
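Comparing a price column against the exact string `'0.0'` depends on how the file formats zero. A more tolerant check parses the value as a number first. This is a sketch only: `is_free` is a hypothetical helper meant for price columns (not the `Type` column), and the handling of a leading `$` is an assumption about how prices might be written.

```python
def is_free(price_text):
    """Treat a price as free when it parses to zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # non-numeric text is treated as not free

for price in ['0.0', '0', '$4.99']:
    print(price, is_free(price))
```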


## Analyze Datasets

Now that the data has been cleaned and filtered, it's time to start analyzing. As mentioned before, our primary goal is to attract and retain users in our apps. Because we build apps for both Google Play and the App Store, we need to find app profiles that are successful in both markets.

Let's see what the most common genres are for each market by first building a frequency table.

In [11]:
def freq_table(dataset, index):
    table = {}
    for app in dataset:
        column_name = app[index]
        if column_name in table:
            table[column_name] += 1
        else:
            table[column_name] = 1
    
    return table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
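Raw counts are hard to compare across two stores of different sizes, so a frequency table expressed in percentages can be useful. The following sketch uses `collections.Counter` from the standard library. `freq_percentages` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def freq_percentages(dataset, index):
    """Return (value, count, percent) tuples sorted by count, highest first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Made-up sample rows: column 1 holds the genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_percentages(sample, 1)
for value, count, percent in table:
    print(f'{value} : {count} ({percent}%)')
```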

Display frequency for Apple genre.

In [12]:
display_table(apple_free, 11)

Games : 2257
Entertainment : 334
Photo & Video : 167
Social Networking : 143
Education : 132
Shopping : 121
Utilities : 109
Lifestyle : 94
Finance : 84
Sports : 79
Health & Fitness : 76
Music : 67
Book : 66
Productivity : 62
News : 58
Travel : 56
Food & Drink : 43
Weather : 31
Reference : 20
Navigation : 20
Business : 20
Catalogs : 9
Medical : 8


Display frequency for Google genre.

In [13]:
display_table(android_free, 9)

Tools : 747
Entertainment : 538
Education : 476
Business : 407
Lifestyle : 346
Productivity : 345
Finance : 328
Medical : 312
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 191
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 125
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 74
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Educational : 33
Board : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

Display frequency for Google category.

In [14]:
display_table(android_free, 1)

FAMILY : 1636
GAME : 875
TOOLS : 748
BUSINESS : 407
LIFESTYLE : 347
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 312
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 191
DATING : 165
VIDEO_PLAYERS : 158
MAPS_AND_NAVIGATION : 125
EDUCATION : 114
FOOD_AND_DRINK : 110
ENTERTAINMENT : 100
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 74
WEATHER : 71
EVENTS : 63
ART_AND_DESIGN : 60
PARENTING : 58
COMICS : 55
BEAUTY : 53


Calculate the average number of user ratings per genre in the Apple app store.

In [15]:
apple_genre = freq_table(apple_free, 11)

for genre in apple_genre:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[11]
        if genre_app == genre:
            total_user_rating = float(app[5])
            total += total_user_rating
            len_genre += 1
    avg_user_rating = total / len_genre
    print(f'{genre}: {avg_user_rating}\n')

Social Networking: 53078.195804195806

Photo & Video: 27249.892215568863

Games: 18924.68896765618

Music: 56482.02985074627

Reference: 67447.9

Health & Fitness: 19952.315789473683

Weather: 47220.93548387097

Utilities: 14010.100917431193

Travel: 20216.01785714286

Shopping: 18746.677685950413

News: 15892.724137931034

Navigation: 25972.05

Lifestyle: 8978.308510638299

Entertainment: 10822.961077844311

Food & Drink: 20179.093023255813

Sports: 20128.974683544304

Book: 8498.333333333334

Finance: 13522.261904761905

Education: 6266.333333333333

Productivity: 19053.887096774193

Business: 6367.8

Catalogs: 1779.5555555555557

Medical: 459.75



While there are fewer social networking apps than gaming and entertainment apps on the App Store, social networking apps still have one of the highest average numbers of user ratings (about 53,000, behind only Reference and Music). This suggests that although there are fewer of these apps, each tends to reach a large audience, so a social networking app of our own would have a good chance of attracting many users.
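The per-genre averages above loop over the whole dataset once per genre. The same numbers can be obtained in a single pass by accumulating totals and counts per group. This is a sketch with a hypothetical helper, `average_by_group`, and made-up sample rows.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: name, genre, number of ratings.
sample = [['A', 'Games', '100'], ['B', 'Games', '300'], ['C', 'Music', '50']]
averages = average_by_group(sample, 1, 2)
# Print groups from highest to lowest average.
for group, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f'{group}: {avg}')
```

Sorting the result before printing makes the ranking easier to read than the unordered output above.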

Calculate the average number of installs per category in the Google Play store.

In [22]:
android_category = freq_table(android_free, 1)

for category in android_category:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1
    avg_installs = total / len_category
    print(f'{category}: {avg_installs}')

ART_AND_DESIGN: 1905351.6666666667
AUTO_AND_VEHICLES: 647317.8170731707
BEAUTY: 513151.88679245283
BOOKS_AND_REFERENCE: 8721959.47643979
BUSINESS: 1712290.1474201474
COMICS: 817657.2727272727
COMMUNICATION: 38456119.167247385
DATING: 854028.8303030303
EDUCATION: 3082017.543859649
ENTERTAINMENT: 21134600.0
EVENTS: 253542.22222222222
FINANCE: 1387692.475609756
FOOD_AND_DRINK: 1924897.7363636363
HEALTH_AND_FITNESS: 4188821.9853479853
HOUSE_AND_HOME: 1313681.9054054054
LIBRARIES_AND_DEMO: 638503.734939759
LIFESTYLE: 1433701.5244956773
GAME: 15837565.085714286
FAMILY: 2690584.773838631
MEDICAL: 120616.48717948717
SOCIAL: 23253652.127118643
SHOPPING: 7036877.311557789
PHOTOGRAPHY: 17805627.643678162
SPORTS: 3638640.1428571427
TRAVEL_AND_LOCAL: 13984077.710144928
TOOLS: 10695245.286096256
PERSONALIZATION: 5201482.6122448975
PRODUCTIVITY: 16787331.344927534
PARENTING: 542603.6206896552
WEATHER: 5074486.197183099
VIDEO_PLAYERS: 24852732.40506329
NEWS_AND_MAGAZINES: 9549178.467741935
MAPS_AND_NA
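The install figures above come from strings such as `'10,000+'`, which are buckets rather than exact counts. A small helper makes the conversion explicit. `parse_installs` is a hypothetical name, and the sketch assumes every value has the digits, commas and optional `+` form seen in the sample rows.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))

for bucket in ['10,000+', '5,000,000+', '0']:
    print(bucket, parse_installs(bucket))
```

Because each value is only a lower bound ("at least this many"), the category averages are best read as rough, conservative estimates.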

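Social apps average over 23 million installs on the Google Play store, behind only Communication and Video Players among the categories listed above, and Social Networking ranks third by average number of user ratings on the App Store. The genre is therefore among the most popular on both platforms. Neither dataset measures retention directly, so installs and rating counts serve here as proxies for user engagement. On that basis, the company should try building at least one social networking app, as this genre offers some of the greatest reward potential.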