# App Profiles for the App Store and Google Play Markets
---
This project is for the purposes of displaying my ability to code in Python, use the Jupyter Notebook web app, and deepen my knowledge on the work of a data analyst.

This project will detail information regarding the making of mobile apps built for iOS and Android.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader
#Open Android app data#
open_android_file=open('googleplaystore.csv')
read_android_file=reader(open_android_file)
android_data=list(read_android_file)
android_header=android_data[0]

#Open iOS app data#
open_ios_file=open('AppleStore.csv')
read_ios_file=reader(open_ios_file)
ios_data=list(read_ios_file)
ios_header=ios_data[0]

In [3]:
print("Displaying iOS app data: ")
print("\n")
explore_data(ios_data, 0, 1)

print("Displaying Android app data: ")
print("\n")
explore_data(android_data, 0, 1)

Displaying iOS app data: 


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Displaying Android app data: 


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




## Data Cleaning: Deleting Wrong Data

In the next section, in order. to make sure the data that is analyzed is accurate, data that is inaccurate or duplicate must be either corrected or removed from the data pool. 

For the purposes of this project, non-English apps (such as 爱奇艺PPS -《欢乐颂2》电视剧热播) and non-free apps will be removed, too.

In [4]:
#Here, the for loop finds any data that has a Type of 'NaN' or incorrect
#data input, which must be removed through the use of the del statement
i=0
for alist in android_data[1:]:
    if alist[3] == '3.0M':
        print(alist, i)
        del android_data[i]
    if alist[6] == 'NaN':
        print(alist, i)
        del android_data[i]
    i+=1

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device'] 9148
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 10472


In [5]:
#In this section, duplicated app information will be separate into
#two lists. One with unique apps, and one with duplicate entries
android_dupes = []
android_uniq = []
for alist in android_data[1:]:
    name=alist[0]
    if name in android_uniq:
        android_dupes.append(name)
    else:
        android_uniq.append(name)
#For example, here is a list of 15 apps that were found to have 
#duplicated entries:
print(android_dupes[:15])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Now we have found the exact amount of duplicated data in the Google Play data set. The length of this duplicate data subset can now be deducted from the total set in order to find the total number of unique apps.

In [6]:
print('No. of duplicate entries: ', len(android_dupes))
print('No. of unique apps: ', len(android_data)-len(android_dupes)-1)

No. of duplicate entries:  1181
No. of unique apps:  9658


For data analysis, the most recent data of duplicate entries will be preferred. With this in mind, the data with the highest number of reviews for duplicates will be selected and the other entries will be discarded.

In [7]:
android_reviews_max = {}
for alist in android_data[1:]:
    name = alist[0]
    n_reviews = float(alist[3])
    if name in android_reviews_max and n_reviews > android_reviews_max[name]:
        android_reviews_max[name] = n_reviews
    elif name not in android_reviews_max:
        android_reviews_max[name] = n_reviews

android_clean = []
android_already_added = []
for alist in android_data[1:]:
    name = alist[0]
    n_reviews = float(alist[3])
    if n_reviews == android_reviews_max[name] and name not in android_already_added:
        android_clean.append(alist)
        android_already_added.append(name)

print(len(android_clean))

9658


## Removing Non-English Apps

English text is in the ASCII range of 0 to 127, so if a character's ASCII is out of this range, then the app data will be removed from the dataset. 

In [8]:
def isEnglish(app_name):
    n_ascii =0
    for char in app_name:
        if ord(char) > 127:
            n_ascii+=1
        if n_ascii > 3:
            return False
    return True

In [9]:
android_english = []
for alist in android_clean:
    name = alist[0]
    if isEnglish(name):
        android_english.append(alist)
        
ios_english = []
for alist in ios_data[1:]:
    name = alist[1]
    if isEnglish(name):
        ios_english.append(alist)

print(len(ios_data))
print(len(ios_english))
print(len(android_data))
print(len(android_english))

7198
6183
10840
9613


## Isolating Free Apps

So far, duplicate data, non-English apps, and a couple of erroneous entries have been removed. Now, the purposes of this project is to analyze free applications, so those will be isolated now.

In [10]:
android_free = []
for alist in android_english[1:]:
    price = alist[6]
    if price == 'Free':
        android_free.append(alist)

ios_free = []
for alist in ios_english[1:]:
    price = float(alist[4])
    if price == 0.0:
        ios_free.append(alist)

print("Before android isolation: ", len(android_english))
print("After android isolation: ", len(android_free))
print("Before iOS isolation: ", len(ios_english))
print("After iOS isolation: ", len(ios_free))

Before android isolation:  9613
After android isolation:  8861
Before iOS isolation:  6183
After iOS isolation:  3221


The point of this data-cleaning and eventual analysis is to understand what kind of apps are recieved well and often in the AppStore or GooglePlay Store. 

A sensible plan of action is to develop an app that is likely to attract more users based on the data analysis.
* Build an Android version of the app
* If app is successful/recieves good responses, develop further
* Build an iOS version

## Most Common Apps by Genre

In this step, to validate our strategy to create a potential app, we will inspect the data to create frequency tables based on the genre of the dataset. Based on this, the percentages of genres can be calculated and sorted in descending order, to easily identify the most developed genre of app.

In [11]:
def freq_table(dataset, index):
    fr_table = {}
    for alist in dataset:
        content = alist[index]
        if content in fr_table:
            fr_table[content] += 1
        else:
            fr_table[content] = 1
    for key in fr_table:
        fr_table[key] /= len(fr_table)
    return fr_table

In [12]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

**Creating Frequency Tables**

The two functions *display_table* and *freq_table* are used to create frequency tables based off of two parameters: the dataset used(which should be a list of lists), and an index(an int for the desired column to be inspected). 


In [13]:
print("Google Play Category data: ")
display_table(android_free, 1)
print("\n")
print("Google Play Genres data: ")
display_table(android_free, 9)
print("\n")
print("iOS prime_genre data: ")
display_table(ios_free, 11)
print("\n")

Google Play Category data: 
FAMILY : 50.72727272727273
GAME : 26.12121212121212
TOOLS : 22.727272727272727
BUSINESS : 12.333333333333334
LIFESTYLE : 10.484848484848484
PRODUCTIVITY : 10.454545454545455
FINANCE : 9.93939393939394
MEDICAL : 9.484848484848484
SPORTS : 9.121212121212121
PERSONALIZATION : 8.909090909090908
COMMUNICATION : 8.696969696969697
HEALTH_AND_FITNESS : 8.272727272727273
PHOTOGRAPHY : 7.909090909090909
NEWS_AND_MAGAZINES : 7.515151515151516
SOCIAL : 7.151515151515151
TRAVEL_AND_LOCAL : 6.2727272727272725
SHOPPING : 6.03030303030303
BOOKS_AND_REFERENCE : 5.757575757575758
DATING : 5.0
VIDEO_PLAYERS : 4.818181818181818
MAPS_AND_NAVIGATION : 3.757575757575758
FOOD_AND_DRINK : 3.3333333333333335
EDUCATION : 3.121212121212121
ENTERTAINMENT : 2.5757575757575757
LIBRARIES_AND_DEMO : 2.515151515151515
AUTO_AND_VEHICLES : 2.484848484848485
HOUSE_AND_HOME : 2.212121212121212
WEATHER : 2.1515151515151514
EVENTS : 1.9090909090909092
PARENTING : 1.7575757575757576
ART_AND_DESIGN 

From the Cagetory data, the development of apps across all fields is fairly distributed. The most developed app Category was the Tools category at 6.57%, with the second-most being Entertainment at 4.72%.
However, under the Genres data, the Google Play store displays a more drastic distribution of app genres, with 50.72% of apps being under the Family genre. The second-highest developed genre is Game with 26.12%. This data shows that despite the fair distribution of app categories, the genres show that the market is more partial towards recreational and entertainment apps, such as ones that are tailored towards children.

From inspection of the prime_genre data, it can be seen that the Games genre is by far the most popular genre to develop apps, with a percentage of 81.49%. The 2nd highest genre developed is Entertainment, with a contrastingly low percent of 11.04%. From this data, it can be concluded that the vast majority of apps are developed with recreational purposes in mind, rather than utility and practicality.

In [14]:
category_data = freq_table(android_free, 1)
genres_data = freq_table(android_free, 9)
prime_genre_data= freq_table(ios_free, 11)

In [15]:
def calc_avg_reviews(dataset):
    ios_avg_reviews = {}
    for genre in prime_genre_data:
        total = 0
        len_genre = 0
        for alist in dataset:
            genre_app = alist[11]
            if genre == genre_app:
                n_ratings = float(alist[5])
                total += n_ratings
                len_genre += 1
        avg_reviews = total / len_genre
        ios_avg_reviews[genre] = avg_reviews
        
    return ios_avg_reviews

In [16]:
def display_avg_reviews(dataset):
    table = calc_avg_reviews(dataset)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [23]:
def calc_avg_install(dataset, genre_set, genre_index, install_index):
    avg_install = {}
    for genre in genre_set:
        total = 0
        len_genre = 0
        for alist in dataset:
            genre_app = alist[genre_index]
            if genre == genre_app:
                n_ratings = alist[install_index].replace('+', '')
                n_ratings = float(n_ratings.replace(',', ''))
                total += n_ratings
                len_genre += 1
        avg_reviews = total / len_genre
        print(total, len_genre, avg_reviews)
        avg_install[genre] = avg_reviews
    return avg_install

In [24]:
def display_avg_install(dataset, genre_set, genre_index, install_index):
    table = calc_avg_install(dataset, genre_set, genre_index, install_index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [25]:
display_avg_install(ios_free, prime_genre_data, 11, 5)
display_avg_install(android_free, category_data, 1, 5)
display_avg_install(android_free, genres_data, 9, 5)

516542.0 6 86090.33333333333
1463837.0 28 52279.892857142855
556619.0 14 39758.5
42705967.0 1874 22788.6696905016
3672.0 6 612.0
4550647.0 160 28441.54375
3563577.0 254 14029.830708661417
127349.0 17 7491.117647058823
2261254.0 84 26919.690476190477
1132846.0 36 31467.944444444445
826470.0 118 7003.983050847458
913665.0 43 21248.023255813954
3783551.0 66 57326.530303030304
1513441.0 81 18684.456790123455
4609449.0 105 43899.514285714286
866682.0 26 33333.92307692308
1129752.0 40 28243.8
1587614.0 69 23008.898550724636
16016.0 4 4004.0
840774.0 51 16485.764705882353
1348958.0 18 74942.11111111111
1177591.0 56 21028.410714285714
1514371.0 65 23298.015384615384
Navigation : 86090.33333333333
Reference : 74942.11111111111
Music : 57326.530303030304
Weather : 52279.892857142855
Social Networking : 43899.514285714286
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.