## Analyzing Mobile App Data

This project uses the basic Python functionality covered up to this point to perform a practical data analysis. We'll imagine we're working as data analysts for a company that builds Android and iOS mobile apps, which are made available on Google Play and in the App Store.

We build apps that are free to download and install, and our main source of revenue is in-app ads. This means that the number of app users, especially those who engage with our ads, has a direct impact on our revenue. Our goal is to analyze the data and guide our team of developers toward building the most attractive, engaging app.

In [1]:
from csv import reader

# Define function to open CSV file and return header and data
def header_and_data(csv_path):
    with open(csv_path, encoding='utf8') as opened_file: # open CSV and close it when done
        read_file = reader(opened_file) # iterator over rows, each a list of strings
        list_of_lists = list(read_file) # generate list of lists
    
    return list_of_lists[0], list_of_lists[1:] # return header and data

In [2]:
# Define function to explore data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds blank lines after each row as a separator

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# read in and explore Apple data
apple_header, apple_data = header_and_data('AppleStore.csv')
explore_data(apple_data,1,4,True)

# explore header
print(apple_header)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7197
Number of columns: 17
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The [Apple store data](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) has 7197 apps (rows) and 17 app descriptors (columns) including but not limited to the app's name, price, number of ratings, average user ratings, and genre.

In [4]:
#read in and explore Android data
android_header, android_data = header_and_data('googleplaystore.csv')
explore_data(android_data,1,4,True)

#explore header
print(android_header)

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The [Google Play Store data](https://www.kaggle.com/datasets/lava18/google-play-store-apps) has 10841 apps (rows) and 13 app descriptors (columns) including but not limited to the app's name, price, number of ratings, average user ratings, and genre.

The two files have different column names but hold nearly identical data. For our purposes the name, rating, number of users, and price will be of particular interest.

Now that we have a basic familiarity with the data, the next step is making sure we analyze the right subset of it. We're concerned with free apps for an English-speaking audience, so we'll filter out apps that do not meet these criteria.

To start, we remove duplicate entries from the Apple data:

In [5]:
# remove duplicates from Apple data
apple_unique_app_names = []
apple_duplicate_app_names = []
apple_clean = []

for app in apple_data: 
    app_name = app[2] 

    if app_name not in apple_unique_app_names:
        apple_unique_app_names.append(app_name)
        apple_clean.append(app)
    else:
        apple_duplicate_app_names.append(app_name)

print('The Apple data has ' + str(len(apple_duplicate_app_names)) + ' duplicates.') # ID duplicates

The Apple data has 2 duplicates.


We remove an erroneous entry and explore the presence of duplicates in our Android dataset:

In [6]:
# Clean the data, remove duplicates, improper entries, apps that cost $, non-English apps, etc.

# remove erroneous index 10472 from Google play data
del android_data[10472] # don't run more than 1x

# remove duplicates from Android data
android_unique_app_names = []
android_duplicate_app_names = []
android_unique_apps = []

for app in android_data: 
    app_name = app[0] 

    if app_name not in android_unique_app_names:
        android_unique_app_names.append(app_name)
        android_unique_apps.append(app)
    else:
        android_duplicate_app_names.append(app_name)

print('The android data has ' + str(len(android_duplicate_app_names)) + ' duplicates.') # ID duplicates

The android data has 1181 duplicates.


To handle duplicates in the Android data, we'll keep only the entry with the most user reviews for each app.

In [7]:
# Generate a dictionary to store unique apps based on those with the most ratings (most recent)
reviews_max = {}

for app in android_data:
    name = app[0] 
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max)) 

9659


In [8]:
# Utilize the dictionary to remove duplicate rows

android_clean = []
already_added = []

for app in android_data:
    name = app[0] 
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659
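
The two cells above build a lookup dictionary and then filter the rows in a second loop. As a rough alternative sketch, the same idea can be done in a single pass by storing the best row per app name directly. The function name `keep_most_reviewed` and the sample rows are hypothetical and only illustrate the technique; they are not part of the notebook's data.

```python
# Hypothetical rows in the same column order as the Android data
# (app name at index 0, review count at index 3).
rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'GAME', '4.5', '2000'],
    ['App A', 'TOOLS', '4.2', '150'],
    ['App B', 'GAME', '4.5', '2000'],
]

def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews,
        # so exact ties keep the first occurrence.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

deduped = keep_most_reviewed(rows)
print(len(deduped))
```

Because the dictionary is keyed by name, this avoids the `name not in already_added` list lookup, which gets slower as the list grows.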


With duplicates and erroneous entries removed from both datasets, next we filter for English apps:

In [9]:
# Define a function to check whether a string is likely English
def english_string(input_string):
    
    non_eng_chars = 0
    
    for char in input_string:
        if ord(char) > 127: # ASCII characters have code points 0-127
            non_eng_chars += 1
            
    return non_eng_chars < 3 # We'll only consider apps with fewer than 3 non-ASCII characters

# Test functionality
#english_string('Instagram')
#english_string('爱奇艺PPS -《欢乐颂2》电视剧热播')
#english_string('Instachat 😜')

In [10]:
# Filter for English Android apps
android_english = []

for app in android_clean:
    name = app[0] 
    
    if english_string(name):
        android_english.append(app)

print('# of English Android apps: ' + str(len(android_english)))


# of English Android apps: 9597


In [11]:
# Filter for English Apple apps

apple_english = []

for app in apple_clean:
    name = app[2]
    
    if english_string(name):
        apple_english.append(app)

print('# of English Apple apps: ' + str(len(apple_english)))

# of English Apple apps: 6153
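
To make the English check concrete, here is a small standalone sketch of the same rule (fewer than 3 characters outside the ASCII range), run on the three test names from the cell that defines the check. The name `looks_english` and its `max_non_ascii` parameter are hypothetical; the sketch assumes, as the notebook does, that non-ASCII characters are a reasonable proxy for non-English names.

```python
def looks_english(name, max_non_ascii=2):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

for name in ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Instachat 😜']:
    print(name, '->', looks_english(name))
```

Allowing a couple of non-ASCII characters keeps English names that contain an emoji or a trademark symbol, at the cost of occasionally letting a short non-English name through.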


With non-English apps* filtered out, the last step is to filter out paid apps:

*Note: we treat an app as non-English if its name contains 3 or more non-ASCII characters.*

In [12]:
# Filter for free Android apps

android_free = []

for app in android_english:
    app_type = app[6] # index 6 is the 'Type' column, e.g. 'Free'
    
    if app_type == 'Free':
        android_free.append(app)
        
print('# of free Android apps: ' + str(len(android_free)))        

# of free Android apps: 8847


In [13]:
# Filter for free Apple apps

apple_free = []

for app in apple_english:
    price = float(app[5])
    
    if price == 0:
        apple_free.append(app)
        
print('# of free Apple apps: ' + str(len(apple_free)))        

# of free Apple apps: 3201


**At this point our data has been cleaned.** 

We've removed inaccurate data, removed duplicate app entries, removed non-English apps, and isolated the free apps.

Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue. To minimize risks and overhead, our validation strategy has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. 

**We now begin our analysis.** We build a frequency table to determine the most common genres for each market:

In [14]:
# Generate a frequency table of counts for the specified column of interest

def freq_table(dataset, index):
    
    frequency_table = {}
    
    for row in dataset:
        a_data_point = row[index]
        
        if a_data_point in frequency_table:
            frequency_table[a_data_point] += 1
        else:
            frequency_table[a_data_point] = 1
    
    return frequency_table

In [15]:
# Generate a display table

def display_table(dataset, index):
    
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [16]:
# Utilize display table --> frequency table functions on the following columns
## prime_genre: Apple index 12

display_table(apple_free, 12)

Games : 1864
Entertainment : 251
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 83
Utilities : 79
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 50
News : 43
Travel : 40
Finance : 35
Weather : 28
Food & Drink : 26
Reference : 17
Business : 17
Book : 12
Navigation : 6
Medical : 6
Catalogs : 4


For Apple's App Store genres we observe that:
1. Games are clearly the most common genre.
2. Entertainment, Photo & Video, Education, and Social Networking form the next tier.
3. The most common apps are entertainment-oriented, whereas practical apps (e.g. Productivity) are far less common.
4. Based on the above display table, I would recommend Entertainment, and Games in particular, as the app focus.
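
The display table above shows raw counts. When comparing two stores of different sizes, shares can be easier to read. The sketch below is one way to turn counts into percentages using `collections.Counter`. The function name `freq_table_percent` and the sample rows are hypothetical and stand in for the real datasets.

```python
from collections import Counter

def freq_table_percent(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.most_common()}

# Hypothetical sample rows: [app name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Weather']]
shares = freq_table_percent(sample, 1)

for genre, share in shares.items():
    print(f'{genre}: {share:.1f}%')
```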

In [17]:
## Genres: Android index 9
display_table(android_free, 9)

Tools : 747
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 343
Finance : 328
Medical : 313
Sports : 306
Personalization : 294
Communication : 286
Action : 274
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 189
Simulation : 181
Dating : 165
Arcade : 163
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 123
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 71
Weather : 70
Events : 63
Adventure : 59
Comics : 53
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Trivia : 37
Casino : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [18]:
## Category: Android index 1
display_table(android_free, 1)

FAMILY : 1675
GAME : 858
TOOLS : 748
BUSINESS : 407
PRODUCTIVITY : 345
LIFESTYLE : 344
FINANCE : 328
MEDICAL : 313
SPORTS : 300
PERSONALIZATION : 294
COMMUNICATION : 286
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 189
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 123
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 71
WEATHER : 70
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 54
BEAUTY : 53


For Android's Play Store genres and categories we observe that:
1. Productivity and Entertainment form the top tier of most common apps.
2. The most common Android apps are productivity-oriented, in contrast to the concentration of entertainment-centric apps on Apple's App Store. That said, Entertainment is the 2nd most common genre; it's just surrounded by numerous productivity-centric app types.
3. Based on the above display table, I would recommend a Productivity-focused, family-centric tool first, followed by Entertainment.

With these findings in hand, we move on to exploring user numbers:

In [28]:
# Generate a frequency table for the prime_genre column to get the unique app genres 
apple_genre_dict = freq_table(apple_free, 12)
apple_genre_list = list(apple_genre_dict.keys())

for genre in apple_genre_list:
    total = 0
    len_genre = 0

    for app in apple_free:
        app_genre = str(app[12])
        
        if app_genre == genre:
            user_rating_count = float(app[6])
            total += user_rating_count
            len_genre += 1
        
    user_rating_avg_count = total / len_genre
    print(str(genre) + ' :' + str(user_rating_avg_count))


Productivity :21028.410714285714
Weather :52279.892857142855
Shopping :27230.734939759037
Reference :79350.4705882353
Finance :32367.02857142857
Music :57326.530303030304
Utilities :19156.493670886077
Travel :28243.8
Social Networking :71548.34905660378
Sports :23008.898550724636
Health & Fitness :23298.015384615384
Games :22910.83100858369
Food & Drink :33333.92307692308
News :21248.023255813954
Book :46384.916666666664
Photo & Video :28441.54375
Entertainment :14195.358565737051
Business :7491.117647058823
Lifestyle :16815.48
Education :7003.983050847458
Navigation :86090.33333333333
Medical :612.0
Catalogs :4004.0


For Apple's App Store average user rating count we observe that:
1. Navigation apps have the highest average number of user ratings.
2. Reference, Social Networking, Music, and Weather follow.
3. The number of apps in a genre and the average number of user ratings per app give different reads on popularity, so we should take each with a grain of salt.

**For Apple's App Store I'd still recommend we build an Entertainment app, but I would narrow this recommendation to Social Networking or Music.**
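
The per-genre averages above are computed by looping over the whole dataset once per genre. As a sketch of an alternative, the same averages can be accumulated in a single pass with two dictionaries. The function name `average_by_group` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Average the numeric column `value_index` within each value of `group_index`."""
    totals = defaultdict(float)  # group -> running sum
    counts = defaultdict(int)    # group -> number of rows
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical sample rows: [genre, rating count]
sample = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '50'],
]
averages = average_by_group(sample, 0, 1)

# Print genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

Sorting the result makes the top genres easier to spot than the unsorted printout.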

In [33]:
# Generate a frequency table for the Category column to get the unique app categories
android_category_dict = freq_table(android_free, 1)
android_category_list = list(android_category_dict.keys())


for category in android_category_list:
    total = 0
    len_category = 0

    for app in android_free:
        app_category = str(app[1])
        
        if app_category == category:
            install_count = str(app[5])
            install_count = install_count.replace('+','')
            install_count = install_count.replace(',','')
            total += float(install_count)
            len_category += 1
        
    user_install_count = total / len_category
    print(str(category) + ' :' + str(user_install_count))


ART_AND_DESIGN :1986335.0877192982
AUTO_AND_VEHICLES :647317.8170731707
BEAUTY :513151.88679245283
BOOKS_AND_REFERENCE :8814199.78835979
BUSINESS :1712290.1474201474
COMICS :832613.8888888889
COMMUNICATION :38590581.08741259
DATING :854028.8303030303
EDUCATION :1833495.145631068
ENTERTAINMENT :11640705.88235294
EVENTS :253542.22222222222
FINANCE :1387692.475609756
FOOD_AND_DRINK :1924897.7363636363
HEALTH_AND_FITNESS :4188821.9853479853
HOUSE_AND_HOME :1360598.042253521
LIBRARIES_AND_DEMO :638503.734939759
LIFESTYLE :1446158.2238372094
GAME :15544014.51048951
FAMILY :3697848.1731343283
MEDICAL :120550.61980830671
SOCIAL :23253652.127118643
SHOPPING :7036877.311557789
PHOTOGRAPHY :17840110.40229885
SPORTS :3650602.276666667
TRAVEL_AND_LOCAL :13984077.710144928
TOOLS :10830251.970588235
PERSONALIZATION :5201482.6122448975
PRODUCTIVITY :16787331.344927534
PARENTING :542603.6206896552
WEATHER :5145550.285714285
VIDEO_PLAYERS :24727872.452830188
NEWS_AND_MAGAZINES :9549178.467741935
MAPS_AN

For Google's Play Store average install count we observe that:
1. Communication apps have the highest average install count.
2. Video Players and Social follow, with Photography and Productivity next among the categories shown.
3. The number of apps in a category and the average number of installs per app give different reads on popularity, so we should take each with a grain of salt.

**For Google's Play Store I'd adapt the earlier Productivity>Family-oriented app recommendation, and instead recommend a Social, video-based app (e.g. TikTok).**
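
The install counts in the Google Play data are strings such as `'5,000,000+'`, so they have to be cleaned before averaging. The sketch below isolates that step in a small helper. The name `parse_installs` and the sample values are hypothetical. Since each value is an open-ended bucket, the parsed number is only a lower bound, and any average built from it is approximate.

```python
def parse_installs(text):
    """Convert an install bucket such as '5,000,000+' to a float lower bound."""
    return float(text.replace(',', '').replace('+', ''))

# Hypothetical sample of install strings in the dataset's format
sample_installs = ['500,000+', '5,000,000+', '50,000,000+']
parsed = [parse_installs(value) for value in sample_installs]
average_installs = sum(parsed) / len(parsed)

print(average_installs)
```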

**Conclusion**

For Apple we found that an Entertainment app focused on Social and Music could be most popular, whereas for Android we updated our recommendation to focus on a Social and Video-based app.

To conclude, I would recommend adapting an existing app (like TikTok). TikTok took off for exactly the reasons highlighted in this analysis: it's social and it incorporates music and video.

*How might we adapt / improve upon TikTok?*

Build an app where the user can take a picture or video of their face and then choose from a set of animations to embed it in (think cartoons with cut-out heads where the user can place their face), paired with short, bite-sized clips of the most popular songs of the moment.

This is just one possible route to take. There are countless other possibilities here.