 # Profitable App Profiles
 
This is a guided project. The aims are to:

1. Review basic, raw Python skills.
2. Get an idea of what kinds of apps are likely to attract more users. The premise is that we work at a company that offers apps which are free to download and install, and which are aimed at English-speaking audiences.
 
 For this, we'll be using two datasets:
 
1. Google Play data: collected in August 2018, containing data on approximately 10,000 Android apps.
2. Apple data: collected in July 2017, containing data on approximately 7,000 iOS apps.

Our end goal is to add an app to both Google Play and the App Store. We therefore need to find app profiles that are successful in both markets.

## Data Exploration

In [1]:
# Imports
from csv import reader

In [2]:
# Open Google (explicit encoding, since app names contain non-ASCII characters)
with open("googleplaystore.csv", encoding="utf8") as file:
    google = list(reader(file))

# Open Apple
with open("AppleStore.csv", encoding="utf8") as file:
    apple = list(reader(file))

In [3]:
# Explore data
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Google Data Exploration
explore_data(google, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
# Apple Data Exploration
explore_data(apple, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


---

We can see that for our **Google data**:

* There are 10842 rows (including header row)
* There are 13 columns

We can see that for our **Apple data**:

* There are 7198 rows (including header row)
* There are 16 columns

---

For our analysis, we want to identify the apps which are most profitable. We may therefore want to focus on columns that relate to app reviews and price.

---

Next up, we note from a Kaggle discussion that one of the rows is missing some data. Let's loop through all rows and find those which don't have the same number of entries as the header row.

## Data Cleaning

### Duplicates

In [6]:
for row in google:
    if len(row) != len(google[0]):
        print(row)
        print(google.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


We've found the row. It has 12 entries rather than 13: the `Category` value appears to be missing, so every later value is shifted one column to the left. Let's delete it.

In [7]:
del google[10473]
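As a side note, `list.index(row)` returns the position of the first row equal to `row`, so it would report the wrong index if two identical malformed rows existed. A minimal sketch of the same check using `enumerate` is shown here; the function name `find_malformed_rows` and the `sample` rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical three-column sample; the last row is one value short
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]
print(find_malformed_rows(sample))
```

Note that a hard-coded `del` like the one above should only be run once, otherwise a valid row gets deleted on the second run.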

From reading some more Kaggle discussions, it looks like the Google data may contain duplicate applications. Let's see if we can find these.

In [23]:
total = 0
count = 1
for row1 in google[1:]:
    count += 1
    for row2 in google[count:]:
        if row2[0] == row1[0]:
            total += 1
            break
print(total)

1181


The count above shows that there are 1181 duplicate entries. We can also find the duplicates with something like the below, which additionally records their names.

In [24]:
duplicate_list = list()
unique_list = list()

for row in google[1:]:
    app = row[0]
    if app in unique_list:
        duplicate_list.append(app)
    else:
        unique_list.append(app)
        
print(len(duplicate_list))   

1181
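Both approaches above scan a list for every row, which is quadratic in the number of rows. A possible alternative, sketched here with `collections.Counter`, counts names in a single pass. The helper name `count_duplicates` is an assumption, and the sample uses a few app names from the duplicate list for illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0, has_header=True):
    """Count rows whose name has already appeared earlier in the dataset."""
    rows = dataset[1:] if has_header else dataset
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first counts as one duplicate
    return sum(n - 1 for n in counts.values())

sample = [['App'], ['Box'], ['Google My Business'], ['Box'],
          ['Slack'], ['Box'], ['Google My Business']]
print(count_duplicates(sample))
```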


Let's explore the duplicates and see if we can remove them.

In [26]:
duplicate_list[:10]

['Quick PDF Scanner + OCR FREE',
 'Box',
 'Google My Business',
 'ZOOM Cloud Meetings',
 'join.me - Simple Meetings',
 'Box',
 'Zenefits',
 'Google Ads',
 'Google My Business',
 'Slack']

In [43]:
for row in google:
    if row[0] == "Instagram":
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can see a slight difference at index 3, which corresponds to the number of reviews. A criterion for removing the duplicates could be to keep the record with the highest number of reviews (likely the most recent entry) and remove the rest.

In [46]:
# Map each app name to its highest review count (stored as the original string)
dictionary = dict()

for row in google[1:]:
    app = row[0]
    review = row[3]
    if app not in dictionary:
        dictionary[app] = review
    # Compare numerically: as strings, '9' > '10' would be True
    elif float(review) > float(dictionary[app]):
        dictionary[app] = review

Now that we have a dictionary mapping each app name to its highest review count, we'll use it to build a de-duplicated Google dataset.

In [58]:
google_clean = list()
already_added = list()

for row in google[1:]:
    app = row[0]
    review = row[3]
    if (review == dictionary[app]) and (app not in already_added):
        google_clean.append(row)
        already_added.append(app)

len(google_clean)

9659
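The two steps above can also be folded into one pass by storing the best row itself rather than only its review count. This is a sketch under that assumption, not the approach used in this notebook; `dedupe_by_max_reviews` is a hypothetical name, and the sample rows are the Instagram entries shown earlier, cut down to their first four columns.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dicts preserve insertion order, so apps stay in first-seen order
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
print(dedupe_by_max_reviews(sample))
```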

### Non-English Apps

We have now successfully removed the duplicate rows.

Now let's move on. We want to remove any non-English apps. The first step here will be to scan all the app names and remove any records whose names contain uncommon characters (i.e. outside the ASCII range of 0 - 127). We'll first create a function to detect whether a string's characters are within this range. As some of our English app names may contain non-ASCII characters (e.g. emojis), we'll only class an app as non-English if its name contains more than 3 non-ASCII characters.

In [79]:
def ascii_true(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
    if count > 3:
        return False
    return True

Let's now apply this function to our datasets.

In [88]:
google_english = list()
apple_english = list()

for row in google_clean:
    app = row[0]
    if ascii_true(app):
        google_english.append(row)

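# Note: apple still includes its header row here; it passes this check and is
# only dropped later by the price filter
for row in apple: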
    app = row[1]
    if ascii_true(app):
        apple_english.append(row)

In [89]:
len(google_english)

9614

In [90]:
len(apple_english)

6184

### Free Apps

Now let's only keep the apps which are free (i.e. price is equal to 0).

In [102]:
google_final = list()
apple_final = list()

for row in google_english:
    price = row[7]
    if price == '0':
        google_final.append(row)
    
for row in apple_english:
    price = row[4]
    if price == '0.0':
        apple_final.append(row)
        
print(f"This leaves us with\n\n- {len(google_final)} google apps\n- {len(apple_final)} apple apps")

This leaves us with

- 8862 google apps
- 3222 apple apps
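The filter above relies on exact string matches (`'0'` for Google, `'0.0'` for Apple). A slightly more defensive variant is to compare prices as numbers. In this sketch the helper name `is_free` is hypothetical, and the `'$'` prefix on paid prices is an assumption, since no paid row is shown in the excerpts above.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$4.99' is zero."""
    # Assumption: paid prices may carry a leading '$' that must be stripped
    return float(price_text.replace('$', '')) == 0.0

for price in ['0', '0.0', '$4.99']:
    print(price, is_free(price))
```

Note that a header value such as `'price'` would raise a `ValueError` here, so header rows need to be skipped before calling it.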


## Common Apps by Genre

Now we want to look at some specific columns: `Genres` and `Category` in the Google data, and `prime_genre` in the Apple data. Essentially, we want to build an app that does well on both Apple and Google. One place to start is identifying the most common genres.

In [169]:
# Will produce a frequency table for any column in list-of-lists
def freq_table(dataset, index):
    freq = dict()
    total = 0
    for row in dataset:
        total += 1
        col = row[index]
        if col in freq:
            freq[col] += 1
        else:
            freq[col] = 1
            
    table_percentages = {}
    for key in freq:
        percentage = (freq[key] / total) * 100
        table_percentages[key] = round(percentage, 2)
        
    return table_percentages

In [170]:
# Will create the frequency table and also format it
def display_table(dataset, index):
    output = list()
    table = freq_table(dataset, index)
    for key, value in table.items():
        tup = (value, key)
        output.append(tup)
    sorted_output = sorted(output, reverse = True)
    for tup in sorted_output:
        print(f"{tup[1]:40s} : {tup[0]}")
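For comparison, the frequency table can also be sketched with `collections.Counter`, which handles the counting and the ordering by frequency. The function name `freq_table_counter` and the toy rows are assumptions for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(n / total * 100, 2) for key, n in counts.most_common()}

# Toy rows with a single column, for illustration only
sample = [['Games'], ['Games'], ['Education'], ['Games']]
print(freq_table_counter(sample, 0))
```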

In [171]:
print("Genre frequencies for Google\n")
display_table(google_final, 9)

Genre frequencies for Google

Tools                                    : 8.44
Entertainment                            : 6.07
Education                                : 5.35
Business                                 : 4.59
Productivity                             : 3.89
Lifestyle                                : 3.89
Finance                                  : 3.7
Medical                                  : 3.52
Sports                                   : 3.46
Personalization                          : 3.32
Communication                            : 3.24
Action                                   : 3.1
Health & Fitness                         : 3.08
Photography                              : 2.95
News & Magazines                         : 2.8
Social                                   : 2.66
Travel & Local                           : 2.32
Shopping                                 : 2.25
Books & Reference                        : 2.14
Simulation                               : 2.04
Dating        

Let's make some observations regarding Genre for Google:

* The most common genres appear to be Tools (8.44%) and Entertainment (6.07%), followed by genres like Education, Business and Productivity.
* This may be a bit misleading, as there are many, many genres. For example, a genre like Games appears to have been split into many subcategories (Action and Simulation, for instance). If we were to aggregate these, we might find Games to be the most common.
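One way to check which genres sit under a broader category is to group the `Genres` values by `Category`. The sketch below assumes the Google column positions used in this notebook (`Category` at index 1, `Genres` at index 9); `genres_by_category` is a hypothetical helper, and the sample uses two-column rows built from values seen in the output above.

```python
def genres_by_category(dataset, category_index=1, genre_index=9):
    """Map each Category value to the set of Genres values found under it."""
    mapping = {}
    for row in dataset:
        mapping.setdefault(row[category_index], set()).add(row[genre_index])
    return mapping

# Two-column sample: [Category, Genres]
sample = [
    ['ART_AND_DESIGN', 'Art & Design'],
    ['ART_AND_DESIGN', 'Art & Design;Pretend Play'],
    ['SOCIAL', 'Social'],
]
print(genres_by_category(sample, category_index=0, genre_index=1))
```

Applied to `google_final`, a lookup such as `genres_by_category(google_final)['GAME']` should list the genres that belong to the game category.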

In [172]:
print("Category frequencies for Google\n")
display_table(google_final, 1)

Category frequencies for Google

FAMILY                                   : 18.93
GAME                                     : 9.69
TOOLS                                    : 8.45
BUSINESS                                 : 4.59
LIFESTYLE                                : 3.9
PRODUCTIVITY                             : 3.89
FINANCE                                  : 3.7
MEDICAL                                  : 3.52
SPORTS                                   : 3.4
PERSONALIZATION                          : 3.32
COMMUNICATION                            : 3.24
HEALTH_AND_FITNESS                       : 3.08
PHOTOGRAPHY                              : 2.95
NEWS_AND_MAGAZINES                       : 2.8
SOCIAL                                   : 2.66
TRAVEL_AND_LOCAL                         : 2.34
SHOPPING                                 : 2.25
BOOKS_AND_REFERENCE                      : 2.14
DATING                                   : 1.86
VIDEO_PLAYERS                            : 1.79
MAPS_AND_N

Let's make some observations about categories for Google:

* Family ranks highest here (18.93%). Additional research will show that Family mostly means "Games for Kids".
* Games is the second highest (9.69%). Together with Family, that is roughly 28.6% of the apps, so we may have been right in speculating before that games are the most common genre/category once aggregated.
* We can also see productivity-oriented categories such as Tools, Business and Productivity ranking quite high.

In [173]:
print("Genre frequencies for Apple\n")
display_table(apple_final, 11)

Genre frequencies for Apple

Games                                    : 58.16
Entertainment                            : 7.88
Photo & Video                            : 4.97
Education                                : 3.66
Social Networking                        : 3.29
Shopping                                 : 2.61
Utilities                                : 2.51
Sports                                   : 2.14
Music                                    : 2.05
Health & Fitness                         : 2.02
Productivity                             : 1.74
Lifestyle                                : 1.58
News                                     : 1.33
Travel                                   : 1.24
Finance                                  : 1.12
Weather                                  : 0.87
Food & Drink                             : 0.81
Reference                                : 0.56
Business                                 : 0.53
Book                                     : 0.43
Navigation

Let's make some observations regarding the genre column for Apple:

* Games is the most common genre by a wide margin (58.16%), followed by Entertainment (7.88%).
* Most apps appear to be for general entertainment more than anything.
* This doesn't tell us how popular the apps are, only how many of them exist.

## Popular Apps by Genre

Now let's look at popular apps. The columns we are interested in here are `rating_count_tot` in the Apple dataset and `Installs` in the Google dataset. Let's calculate the average number of ratings per app genre for Apple.

In [174]:
genres_apple_table = freq_table(apple_final, 11)
genres_apple_table

{'Social Networking': 3.29,
 'Photo & Video': 4.97,
 'Games': 58.16,
 'Music': 2.05,
 'Reference': 0.56,
 'Health & Fitness': 2.02,
 'Weather': 0.87,
 'Utilities': 2.51,
 'Travel': 1.24,
 'Shopping': 2.61,
 'News': 1.33,
 'Navigation': 0.19,
 'Lifestyle': 1.58,
 'Entertainment': 7.88,
 'Food & Drink': 0.81,
 'Sports': 2.14,
 'Book': 0.43,
 'Finance': 1.12,
 'Education': 3.66,
 'Productivity': 1.74,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Medical': 0.19}

In [175]:
print("Average number of ratings per app genre for Apple\n")
for genre in genres_apple_table:
    total = 0
    len_genre = 0
    for row in apple_final:
        genre_app = row[11]
        if genre == genre_app:
            user_ratings = float(row[5])
            total += user_ratings
            len_genre += 1
    print(f"{genre:20s} : {total / len_genre:.2f}")

Average number of ratings per app genre for Apple

Social Networking    : 71548.35
Photo & Video        : 28441.54
Games                : 22788.67
Music                : 57326.53
Reference            : 74942.11
Health & Fitness     : 23298.02
Weather              : 52279.89
Utilities            : 18684.46
Travel               : 28243.80
Shopping             : 26919.69
News                 : 21248.02
Navigation           : 86090.33
Lifestyle            : 16485.76
Entertainment        : 14029.83
Food & Drink         : 33333.92
Sports               : 23008.90
Book                 : 39758.50
Finance              : 31467.94
Education            : 7003.98
Productivity         : 21028.41
Business             : 7491.12
Catalogs             : 4004.00
Medical              : 612.00
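The loop above rescans the whole dataset once per genre. A single-pass version is sketched below; `average_by_group` is a hypothetical helper, and the sample holds `[prime_genre, rating_count_tot]` pairs taken from the first four Apple rows printed earlier.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [
    ['Social Networking', '2974676'],   # Facebook
    ['Photo & Video', '2161558'],       # Instagram
    ['Games', '2130805'],               # Clash of Clans
    ['Games', '1724546'],               # Temple Run
]
print(average_by_group(sample, 0, 1))
```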


We can see that the highest average here is for Navigation (86090.33 ratings per app). Let's look at the apps in that genre.

In [176]:
for row in apple_final:
    if row[11] == "Navigation":
        print(row[1], row[5])

Waze - GPS Navigation, Maps & Real-time Traffic 345046
Google Maps - Navigation & Transit 154911
Geocaching® 12811
CoPilot GPS – Car Navigation & Offline Maps 3582
ImmobilienScout24: Real Estate Search in Germany 187
Railway Route Search 5


We can see that there are only a handful of apps in this genre, with Waze and Google Maps heavily skewing the average. This is likely to be a problem: a few big apps can make a genre look popular when, in reality, many smaller apps in that genre may not be doing well at all.
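A quick way to make the skew visible is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six Navigation rating counts printed above; comparing against the median is a suggestion, not part of the original analysis.

```python
from statistics import mean, median

# rating_count_tot for the six Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(f"mean   : {mean(navigation_ratings):.2f}")
print(f"median : {median(navigation_ratings):.1f}")
```

The mean is 86090.33, matching the table above, while the median is only 8196.5, roughly a tenth of the mean.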

Let's now look at Google.

In [183]:
google_ratings = freq_table(google_final, 5)
google_ratings

{'10,000+': 10.2,
 '5,000,000+': 6.84,
 '50,000,000+': 2.29,
 '100,000+': 11.55,
 '50,000+': 4.77,
 '1,000,000+': 15.74,
 '10,000,000+': 10.52,
 '5,000+': 4.51,
 '500,000+': 5.57,
 '1,000,000,000+': 0.23,
 '100,000,000+': 2.12,
 '1,000+': 8.4,
 '500,000,000+': 0.27,
 '500+': 3.25,
 '100+': 6.92,
 '50+': 1.92,
 '10+': 3.54,
 '1+': 0.51,
 '5+': 0.79,
 '0+': 0.05,
 '0': 0.01}

In the above, we can see some "+" characters. For the purposes of this project, we'll just assume that "100,000+" is equivalent to "100000".
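The conversion can be sketched as a small helper. The name `parse_installs` is hypothetical, and the buckets in the loop are taken from the table above.

```python
def parse_installs(text):
    """Convert an install bucket such as '100,000+' to the integer 100000."""
    return int(text.replace(',', '').replace('+', ''))

for bucket in ['0', '0+', '1,000+', '1,000,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```

Since each bucket is a lower bound on the true install count, any averages built on these values should be read as rough estimates.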

In [198]:
freq_cat = freq_table(google_final, 1)

for cat in freq_cat:
    len_cat = 0
    total = 0
    for row in google_final:
        if cat == row[1]:
            installs = row[5]
            installs = installs.replace(",", "").replace("+", "")
            installs = float(installs)
            total += installs
            len_cat += 1
            
    avg_installs = total / len_cat
    print(f"{cat:20s} : {avg_installs:.0f}")

ART_AND_DESIGN       : 1986335
AUTO_AND_VEHICLES    : 647318
BEAUTY               : 513152
BOOKS_AND_REFERENCE  : 8767812
BUSINESS             : 1712290
COMICS               : 817657
COMMUNICATION        : 38456119
DATING               : 854029
EDUCATION            : 1820673
ENTERTAINMENT        : 11640706
EVENTS               : 253542
FINANCE              : 1387692
FOOD_AND_DRINK       : 1924898
HEALTH_AND_FITNESS   : 4188822
HOUSE_AND_HOME       : 1331541
LIBRARIES_AND_DEMO   : 638504
LIFESTYLE            : 1437816
GAME                 : 15560966
FAMILY               : 3694276
MEDICAL              : 120616
SOCIAL               : 23253652
SHOPPING             : 7036877
PHOTOGRAPHY          : 17805628
SPORTS               : 3638640
TRAVEL_AND_LOCAL     : 13984078
TOOLS                : 10682301
PERSONALIZATION      : 5201483
PRODUCTIVITY         : 16787331
PARENTING            : 542604
WEATHER              : 5074486
VIDEO_PLAYERS        : 24727872
NEWS_AND_MAGAZINES   : 9549178
MAPS_AN

Now we can get an idea of the most popular categories. Although some of these stand out, such as Communication, additional research should be conducted to assess whether a category like Communication is being heavily skewed by a few big apps. Ultimately, we don't want to develop a Communication app thinking it will be popular when, in reality, the category's popularity is explained only by big names like WhatsApp.
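One possible follow-up is to recompute a category's average after excluding apps above some install threshold. The sketch below is only an illustration: the helper names, the threshold of 100 million and the toy rows are all assumptions, not data from the Google dataset.

```python
def parse_installs(text):
    """Convert an install bucket such as '100,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))

def average_installs(rows, installs_index=1, below=None):
    """Average installs, optionally ignoring apps at or above `below`."""
    values = [parse_installs(row[installs_index]) for row in rows]
    if below is not None:
        values = [v for v in values if v < below]
    return sum(values) / len(values) if values else 0.0

# Hypothetical apps: one giant and two small ones
toy_rows = [
    ['Big Messenger', '1,000,000,000+'],
    ['Small Chat A', '50,000+'],
    ['Small Chat B', '10,000+'],
]
print(average_installs(toy_rows))                     # dominated by the giant
print(average_installs(toy_rows, below=100_000_000))  # small apps only
```

In this toy case the average drops to 30000.0 once the giant is excluded, which is the kind of gap that would indicate a skewed category.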

To be continued...