# Free App Profiles Likely to Generate Profit in the App Store and Google Play 

This project documents an attempt to use Python and data analysis to determine what makes an app "attractive" to a consumer. Our target audience is English speakers who do not pay for phone apps, so we first remove all non-English and non-free apps. Once the data has been purged of defects and undesirable app types, we extract the number of downloads, the number of ratings, and the average user rating of the remaining apps. Examining how these figures differ across genres gives us a guideline for deciding which categories of apps have the highest potential for ad revenue. We use data on ~10,000 Google Play apps and ~7,000 iOS apps, collected in 2018 and 2017 respectively.
    

### Data Types and Exploration
As of August 2020, the Google Play store had roughly 2.7 million apps on it, and the iOS App Store had roughly 1.8 million. Working through that much data would take significantly longer, and acquiring all of it might cost more than it is worth, so we will work with a smaller sample: the aforementioned datasets collected in 2018 and 2017.

- [Google Play Store data](https://www.kaggle.com/lava18/google-play-store-apps): Roughly 10,000 **Android** apps compiled in 2018. [Download here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- [App Store data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps): Roughly 7,000 **iOS** apps compiled in 2017. [Download here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

The first step is opening and exploring the datasets. The `explore_data()` function below prints a slice of a dataset so that we can check it was read correctly. It takes four parameters:

- `dataset`: This takes in the dataset that we want to explore.
- `start`: This designates the beginning of the slice.
- `end`: This designates the index **after** the last visible index in the slice. 
- `rows_and_columns=False`: This parameter defaults to **False** as shown, but if assigned **True** it will output the number of rows and columns in the dataset.


In [65]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
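
As a quick check of the slicing behaviour described above, here is a minimal sketch that runs the same logic on a tiny hand-made list of lists. The `sample` rows are invented for illustration and are not taken from either dataset; the function is restated only so the snippet runs on its own.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    for row in dataset[start:end]:
        print(row)
        print('\n')  # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Hypothetical toy dataset: a header row plus two data rows
sample = [
    ['name', 'genre', 'price'],
    ['App A', 'Games', '0.0'],
    ['App B', 'Music', '1.99'],
]

# Prints the header and the first data row (index 2 is excluded),
# then the overall shape of the whole list
explore_data(sample, 0, 2, True)
shown = sample[0:2]
```

Note that the row count is taken from the whole list, so it includes the header row whenever the header has not been stripped.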

Next we'll open each dataset, store it as a list of rows, and check that both were read intact.

In [105]:
from csv import reader

# Opening the Google Play data and storing it as a list of rows.
# The with-block closes the file once it has been read.
with open('googleplaystore.csv', encoding='utf8') as open_file_2:
    google_data = list(reader(open_file_2))

# Opening the Apple Store data and storing it as a list of rows
with open('AppleStore.csv', encoding='utf8') as open_file_1:
    apple_data = list(reader(open_file_1))

# Check that each dataset came out by checking their headers against a line of specific data.
explore_data(google_data, 0,2, True)
explore_data(apple_data, 0,2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16
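
The rest of this notebook refers to columns by number. As a hedged aside, the numbers can be looked up from the header rows printed above rather than counted by hand. The helper name `column_index` is hypothetical and is not part of the original notebook.

```python
def column_index(header, name):
    """Return the position of `name` in a header row (hypothetical helper)."""
    return header.index(name)

# Header rows copied from the output above
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

print(column_index(google_header, 'Reviews'))          # 3
print(column_index(google_header, 'Installs'))         # 5
print(column_index(apple_header, 'rating_count_tot'))  # 5
print(column_index(apple_header, 'prime_genre'))       # 11
```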


## Cleaning the Data

Before we can use the datasets, they must be 'cleaned': incorrect or extraneous data is purged, and whatever can be corrected is corrected. In addition, because we are only looking at free apps for English speakers, we must remove any rows that don't match those conditions.

### I. Format Checking

First we'll take the length of each file's header row by calling `len()` on the first row of each dataset and storing the integer in a variable named `name_head_len`, where `name` is `goog` or `ios` as appropriate. Using that number, we'll then remove any rows that do not have the correct length (and therefore do not have all of the relevant information and need not be considered).

In [67]:
goog_head_len = len(google_data[0])
ios_head_len = len(apple_data[0])

Next we define a function that checks each row's length against a desired length. `len_check()` takes a `dataset` and the desired `length` as its parameters and runs each `row` in the `dataset` through an `if` statement; if a row's length doesn't match, the function prints the row and its index.

In [68]:
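def len_check(dataset,length):
    # enumerate() supplies each row's true position; dataset.index(row)
    # would report the first identical row, which is wrong for duplicates
    for index, row in enumerate(dataset):
        if len(row) != length:
            print(row)
            print(index)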

In [69]:
len_check(google_data,goog_head_len)
len_check(apple_data,ios_head_len)

Here we find one row from the **Google Play Store** (we can tell where it's from by checking it against the headers we printed above) that doesn't match. To prevent this incorrect row from corrupting our analysis, we delete it with `del google_data[10473]`. That statement is run once and then commented out, so that re-running the cell doesn't delete a different, valid row.


In [70]:
#del google_data[10473]
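
An alternative to a one-off `del` is to build a new list that keeps only rows of the expected length. The sketch below is not what the notebook does; it uses a hypothetical helper `drop_bad_length_rows` and invented toy rows, and is meant only to show that this approach can be re-run safely.

```python
def drop_bad_length_rows(dataset, length):
    """Return a new list holding only the rows with exactly `length` fields."""
    return [row for row in dataset if len(row) == length]

# Hypothetical toy data: the last row is missing a field
sample = [
    ['name', 'genre', 'price'],
    ['App A', 'Games', '0.0'],
    ['App B', '1.99'],
]

cleaned = drop_bad_length_rows(sample, len(sample[0]))
# Running the filter a second time leaves the result unchanged
cleaned_again = drop_bad_length_rows(cleaned, len(sample[0]))
print(cleaned)
```

Because the filter is driven by the row's length rather than by a fixed index, running it again cannot remove a valid row.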

### II. Checking for and Eliminating Duplicates Chronologically

Using the number of reviews for each app, we will check for duplicate names and retain only the `row` with the highest count, on the assumption that it holds the most recent data. The **Google Play Store** data stores this number in column `3` (`Reviews`); the **App Store** data stores `rating_count_tot` in column `5` and `rating_count_ver` in column `6` (the App Store calls below pass column `6`).

#### Step 1: Identifying Unique App Names
The first step is creating a dictionary in `name: review count` format that holds, for each unique app name, its highest review count. The function below, `name_maxreview_dictionary()`, takes the following four parameters:

- `dataset` : The dataset to be trimmed.
- `namecol` : The column holding the app names
- `revnumcol` : The column holding the review counts
- `header = True` : Whether the dataset has a header (defaults to **True**)

The function reads each row and, if the name (`key`) already exists in the dictionary, updates the `definition` only when the row's `revnumcol` value is higher than the stored one; otherwise it adds the name with that value. It returns a dictionary of the unique names and their highest review counts.

In [71]:
def name_maxreview_dictionary(dataset, namecol, revnumcol, header = True):
    nr_Dict = {}
    # Skip the first row when the dataset still carries its header
    rows = dataset[1:] if header else dataset
    for row in rows:
        name = row[namecol]
        n_reviews = float(row[revnumcol])
        # Keep the largest review count seen so far for each app name
        if name not in nr_Dict or n_reviews > nr_Dict[name]:
            nr_Dict[name] = n_reviews
    return nr_Dict
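
To make the "keep the highest count per name" idea concrete, here is a small worked example on invented rows. The function name `max_reviews_by_name` and the toy data are assumptions for illustration; the logic mirrors the description above.

```python
def max_reviews_by_name(rows, namecol, revnumcol):
    """Map each name to the largest review count found for it."""
    best = {}
    for row in rows:
        name = row[namecol]
        n_reviews = float(row[revnumcol])
        if name not in best or n_reviews > best[name]:
            best[name] = n_reviews
    return best

# Hypothetical rows: 'App A' appears twice with different review counts
rows = [
    ['App A', '120'],
    ['App B', '45'],
    ['App A', '150'],
]

best = max_reviews_by_name(rows, 0, 1)
print(best)  # {'App A': 150.0, 'App B': 45.0}
```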

We then build two dictionaries, one for each dataset, by running each dataset through the function above.

In [72]:
# Google Play data: names in column 0, review counts in column 3
goog_reviews_max = name_maxreview_dictionary(google_data,0,3)

# Apple Store data: names in column 1, rating counts in column 6
apple_reviews_max = name_maxreview_dictionary(apple_data,1,6)

#### Step 2: Trimming the Data of Duplicates
The next step will be to create a function that will trim the data using a dictionary key and output it to a new list. The function `dictionary_trim()` takes the following four parameters:
- `dataset` : Dataset to be trimmed
- `dictionary` : Dictionary to use as a 'stencil' for trimming.
- `namecol` : Column to match against dictionary key
- `totrevcol` : Column to match against the dictionary key's corresponding definition.

For each row, the function looks up `row[namecol]` among the dictionary's `key`s. When it finds the `key`, it checks whether the `definition` matches `row[totrevcol]`; if it does, and the name hasn't already been added, the row is appended to the trimmed list.

In [73]:
def dictionary_trim(dataset, dictionary, namecol, totrevcol):
    trimmed_data = []
    added_names = set()  # names already kept; set lookups stay fast on large data
    for row in dataset:
        name = row[namecol]
        n_reviews = float(row[totrevcol])
        if name in dictionary and dictionary[name] == n_reviews and name not in added_names:
            trimmed_data.append(row)
            added_names.add(name)
    return trimmed_data
        

We then run this function to build two new lists: `android_clean` and `apple_clean`.

In [74]:
# The header row is sliced off before trimming
android_clean = dictionary_trim(google_data[1:],goog_reviews_max,0,3)
apple_clean = dictionary_trim(apple_data[1:],apple_reviews_max,1,6)
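
The two steps above can also be folded into a single pass that stores the best row per name directly. This is a sketch of an alternative, not the notebook's method; `dedupe_keep_max` and the toy rows are hypothetical.

```python
def dedupe_keep_max(rows, namecol, revnumcol):
    """Keep, for each name, the row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[namecol]
        # Replace the stored row only when this one has strictly more reviews,
        # so the first row wins when counts are tied
        if (name not in best_rows
                or float(row[revnumcol]) > float(best_rows[name][revnumcol])):
            best_rows[name] = row
    return list(best_rows.values())

# Hypothetical rows: name, review count, version
rows = [
    ['App A', '120', 'v1'],
    ['App B', '45', 'v1'],
    ['App A', '150', 'v2'],
]

deduped = dedupe_keep_max(rows, 0, 1)
print(deduped)  # [['App A', '150', 'v2'], ['App B', '45', 'v1']]
```

One difference to be aware of: the output here follows the order in which each name first appears, whereas the two-step version follows the position of the retained row.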

### III. Purging non-English Apps 

Our project is only focused on free, English-language apps, so we can remove any apps whose titles are not in English.

#### Step 1: Creating a True/False Dictionary
We start by defining a function `char_check()` to determine how many non-standard characters are present in the title. Standard English characters in ASCII are numbered from 0 to 127, and a character's number can be found with the built-in `ord()` function.

The function `char_check` takes up to three parameters:

- `string` : The string to be checked, for example the app title name.
- `minrange`: The minimum character number.
- `maxrange`: The maximum character number.

As we can see, this function accepts any character range, but it defaults to the ASCII range (0-127). `char_check()` takes the `string` and checks the `ord()` number of each `char` (character); whenever that number falls outside the specified range, it adds one to the `outofbounds_char_tally` variable. If the tally goes over `3`, we consider the title to have too many foreign characters and the function returns `False`; otherwise it returns `True`.

In [75]:
def char_check(string, minrange=0, maxrange=127):
    outofbounds_char_tally = 0
    for char in string:
        if ord(char) > maxrange or ord(char)<minrange:
            outofbounds_char_tally += 1
            if outofbounds_char_tally >3:
                return False
    return True

The function `eng_clean` will take two parameters:
- `dataset` : The dataset we're cleaning 
- `colnum` : The number of the column we're cleaning

It then runs each row's `colnum` value through the `char_check` function and returns a dictionary that maps each name to the result.

In [85]:
def eng_clean(dataset,colnum):
    tf_Dict = {}
    for row in dataset:
        name = row[colnum]
        tf_Dict[name] = char_check(name)
    return tf_Dict

We define our 'True/False Dictionaries' with these functions, resulting in two dictionaries whose keys represent the app names and whose definitions are all `True` or `False` based on whether they have too many foreign characters. 

In [87]:
apple_tfDict = eng_clean(apple_clean,1)
goog_tfDict=eng_clean(android_clean,0)


#### Step 2: Using our Dictionary to Trim our Dataset
The next function is very similar to the `dictionary_trim()` function above, except that its fourth parameter is a value to compare against directly rather than a column index. The function `dictionary_slice()` takes the following parameters:

- `dataset` : Dataset to be trimmed
- `dictionary` : Dictionary to use as 'stencil'
- `condition1` : Column to check against keys
- `condition2` : Value to check against definitions

The major difference between `dictionary_slice()` and `dictionary_trim()` is that `condition2` isn't a column whose contents are converted into a `float`; it is compared with each definition as given. The function then outputs the trimmed data to a new list.

In [88]:
def dictionary_slice(dataset, dictionary, condition1, condition2):
    trimmed_data = []
    added_list = []
    for row in dataset:
        name = row[condition1]
        if name in dictionary and dictionary[name] == condition2 and name not in added_list:
            trimmed_data.append(row)
            added_list.append(name)
    return trimmed_data

We then use this function on the `android_clean` and `apple_clean` lists, producing lists that contain only apps with English titles and rows of the correct length.

In [90]:
# Keep only the rows whose title was marked True in the True/False dictionary
android_cleaner = dictionary_slice(android_clean, goog_tfDict, 0, True)
apple_cleaner = dictionary_slice(apple_clean, apple_tfDict, 1, True)
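
For comparison, the same filtering can be sketched without the intermediate True/False dictionary by testing each title as the rows are read. The names `is_mostly_ascii` and `keep_english` and the toy rows are hypothetical; the threshold of three matches the rule described above.

```python
def is_mostly_ascii(text, max_outside=3):
    """True when at most `max_outside` characters fall outside 0-127."""
    outside = sum(1 for ch in text if ord(ch) > 127)
    return outside <= max_outside

def keep_english(rows, namecol):
    """Return only the rows whose title passes the ASCII test."""
    return [row for row in rows if is_mostly_ascii(row[namecol])]

# Hypothetical rows: id, title
rows = [
    ['1', 'Weather Now'],
    ['2', '天気予報アプリ'],
    ['3', 'Café Finder ☕'],
]

english_rows = keep_english(rows, 1)
print([row[1] for row in english_rows])  # ['Weather Now', 'Café Finder ☕']
```

Unlike `dictionary_slice()`, this sketch does not drop repeated names, so it relies on duplicates having been removed in the previous section.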

### IV. Purging Non-Free Apps
Since we are only interested in free apps, we keep only those rows from the previous lists. To see how each dataset designates an app as free, we'll print a few rows of both.

In [100]:
print(android_cleaner[3:6])
print(apple_cleaner[3:6])

[['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']]
[['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']]


From that we can see that the **Google Play Store** data marks free apps with the literal word '`Free`' in its `Type` column (alongside a `Price` of `'0'`). The **App Store** data is a little more traditional, using a price of `0.0`, which is stored as the string `'0.0'` and converted to a `float` for comparison.

Because the two datasets mark free apps differently, we will make two different functions, one that checks for a `string` value and one that checks for a `float` price.

The two functions work as follows.
The `float_check()` function takes three parameters:

- `dataset` : The dataset we're checking
- `pricecol` : The column where the price is stored, convertible to a `float` value
- `tarprice` : The target price as a `float` value

It returns a list of rows with the `tarprice` value in the `pricecol` column.

In [101]:
def float_check(dataset, pricecol, tarprice):
    tarprice_apps = []
    for row in dataset:
        price = float(row[pricecol])  # prices are read from the CSV as strings
        if price == tarprice:
            tarprice_apps.append(row)
    return tarprice_apps

The `string_check()` function takes three parameters:

- `dataset` : The dataset we're checking
- `pricecol` : The column holding the string to check (here, the `Type` column)
- `tarstring` : The string we're checking against

It returns a list of rows whose `pricecol` value matches the target string.

In [102]:
def string_check(dataset,pricecol,tarstring):
    tarprice_apps = []
    for row in dataset:
        if row[pricecol] == tarstring:
            tarprice_apps.append(row)
    return tarprice_apps

In [106]:
apple_cleanest = float_check(apple_cleaner,4,0.0)
android_cleanest = string_check(android_cleaner,6,'Free')

print(len(apple_cleanest))
print(len(android_cleanest))

3220
8863
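
Since both datasets also carry a price column, a single check on the price text is another option. The sketch below is an assumption-laden alternative: `is_free` is a hypothetical helper, and the `'$4.99'` format for paid Google Play prices is assumed rather than shown in the rows printed above.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: paid prices may carry a leading dollar sign
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free
        return False

print(is_free('0'))      # True  (Google Play style)
print(is_free('0.0'))    # True  (App Store style)
print(is_free('$4.99'))  # False
```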


## V. Analyzing the Clean Data

Now that the data is stored in its cleanest form, it's time to take a look at the different available data points and determine what makes an app successful, and what kind of app could be theoretically successful.

### Step 1: Determine the Relevant Data

The first and most important thing we need to do is determine which columns of information are going to help us make a determination and which columns are extraneous. By checking the headers we printed out earlier, we can determine there are the following columns for each store: 

| Google Play Store | Apple Store      |
|-------------------|------------------|
| App               | id               |
| Category          | track_name       |
| Rating            | size_bytes       |
| Reviews           | currency         |
| Size              | price            |
| Installs          | rating_count_tot |
| Type              | rating_count_ver |
| Price             | user_rating      |
| Content Rating    | user_rating_ver  |
| Genres            | ver              |
| Last Updated      | cont_rating      |
| Current Ver       | prime_genre      |
| Android Ver       | sup_devices.num  |
|                   | ipadSc_urls.num  |
|                   | lang.num         |
|                   | vpp_lic          |


The **Google Play Store** columns we're concerned about are going to be:
- `Category`
- `Genres`
- `Rating`
- `Reviews`
- `Installs`

The **App Store** columns we're concerned about are going to be:
- `rating_count_tot`
- `user_rating`
- `prime_genre`



### Step 2: Generating and Exploring Frequency Tables
The following function `freq_table()` creates a frequency table: a table showing how often each value of a particular column appears in a dataset, as a percentage of the dataset.

The function `freq_table()` takes the following parameters: 

- `dataset` : The dataset from which we're generating a table
- `index` : The column containing the types of data we'll generate frequencies of (e.g., genres) 

The function creates a dictionary `table` that pairs each unique `row[index]` value in the `dataset` with an integer count, which increases by one each time that value appears again. It then converts each count into a percentage of the total number of rows.

In [108]:
def freq_table(dataset, index):
    table = {}
    for row in dataset:
        column = row[index]
        if column in table:
            table[column]+=1
        else:
            table[column]=1
    for row in table:
        table[row] /= len(dataset)
        table[row]*=100
    return table
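
The counting step can also be sketched with `collections.Counter` from the standard library. This is an alternative shown for comparison, not the notebook's implementation; `freq_table_counter` and the toy rows are hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage of rows taken up by each value found in column `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical rows: name, genre
rows = [
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Music'],
    ['d', 'Games'],
]

table = freq_table_counter(rows, 1)
print(table)  # {'Games': 75.0, 'Music': 25.0}
```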
    

The following `display_table()` function takes the frequency table generated and displays it in descending order by transforming it into a list of tuples and then sorting them with the built-in function `sorted()`.

In [109]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Step 3: Analyzing Categories

Now that we have viable functions, we can start analyzing various categories in which we're interested. To start, let's take a look at the `Genres` and `Category` columns of the **Google Play Store** dataset. 

In [112]:
display_table(android_cleanest,9)


Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

In [113]:
display_table(android_cleanest,1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

It's clear that the values in the `Category` column have broader scopes than those in the `Genres` column, so we're going to stick with `Category` for the rest of our analyses.

From the `Category` frequency table we can see that, among the ten most common categories, only three are *Entertainment*-related, the others being more utilitarian in nature (e.g., `TOOLS`, `BUSINESS`, `PRODUCTIVITY`). The **diversity** of apps on the **Google Play Store** therefore skews more heavily towards practicality than entertainment. Keep in mind, though, that we haven't yet looked at the number of installs or ratings, so these categories could be populated by low-download, low-rated apps: a large number of apps in a category doesn't necessarily mean that they're popular. We have to look further for that, but this gives us a nice baseline. Next let's take a look at the corresponding column in the **App Store**, `prime_genre`:

In [114]:
display_table(apple_cleanest,11)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


In stark contrast to the **Google Play Store**, we see here that ~58% of the free English apps in the **App Store** are *Games*, followed by ~7.9% *Entertainment* and ~5.0% *Photo & Video*. Together, about 71% of all free, English-titled apps in the **App Store** are dedicated to some form of entertainment and leisure. Again, without knowing how many people are using these apps, we can't know for sure how popular they are, but this certainly gives us something useful to work with.

These large shares also give us an indication of which markets might be *oversaturated*.
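
The combined share of the three largest genres can be checked with a short calculation. The percentages are copied from the `prime_genre` table above; the variable names are only for this sketch.

```python
# Shares copied from the prime_genre frequency table above
shares = {
    'Games': 58.13664596273293,
    'Entertainment': 7.888198757763975,
    'Photo & Video': 4.968944099378882,
}

combined = sum(shares.values())
print(round(combined, 1))  # 71.0
```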

### Step 4: Analyzing Popularity on the App Store

The next thing to find out is which genre holds the most popular apps. For the App Store this is most readily estimated from the `rating_count_tot` values, which give the total number of user ratings and thus a rough indication of how many people use the app (anyone who has rated an app has probably generated ad revenue by that point, which is the goal).

The following function `ratings_table()` will be used to generate a dictionary where each `key` is the `genre` and each `definition` is the average number of ratings, giving us a general idea of how popular each genre is. It takes the following parameters:

- `genreset` : A list of categories we're working with
- `dataset` : The cleaned dataset
- `typecol` : The column of the dataset whence the genre was extracted
- `totratcol` : The column of the dataset where the total ratings value is


In [116]:
def ratings_table(genreset,dataset,typecol,totratcol):
    rating_dict = {}
    # set() visits each genre once even when genreset repeats it
    for genre in set(genreset):
        total = 0
        len_genre = 0
        for row in dataset:
            if row[typecol] == genre:
                total += float(row[totratcol])
                len_genre += 1
        if len_genre > 0:  # skip genres with no rows to avoid dividing by zero
            rating_dict[genre] = total/len_genre
    return rating_dict
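
The same averages can be sketched in a single pass over the data, accumulating a running total and a count per genre. This is an alternative for illustration only; `average_by_genre` and the toy rows are hypothetical.

```python
def average_by_genre(dataset, typecol, totratcol):
    """Average of column `totratcol` for each value in column `typecol`."""
    totals = {}
    counts = {}
    for row in dataset:
        genre = row[typecol]
        totals[genre] = totals.get(genre, 0.0) + float(row[totratcol])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

# Hypothetical rows: genre, number of ratings
rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

averages = average_by_genre(rows, 0, 1)
print(averages)  # {'Games': 200.0, 'Music': 50.0}
```

Because the genres are discovered while reading the rows, this version needs no separate list of genres and never divides by zero.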

The function `extract_col()` below is how we're going to extract a quick genre table. It takes the following parameters: 

- `dataset` : The dataset we're extracting from
- `colnum` : The column we're extracting
- `header` : Whether or not to ignore the header (**False** indicates **NOT** ignoring)

In [119]:
def extract_col(dataset, colnum, header=False):
    extracted_col = []
    if header == False:
        for row in dataset:
            extracted_col.append(row[colnum])
        return extracted_col
    else:
        for row in dataset[1:]:
            extracted_col.append(row[colnum])
        return  extracted_col

Below we use the built-in `sorted()` function in conjunction with a `lambda` function that lets us sort the dictionary by value rather than key, and a small `for` loop to print out a legible list of genres and their average number of ratings. 

In [132]:
genre_table_apple = extract_col(apple_cleanest,11)
apple_ratings = ratings_table(genre_table_apple,apple_cleanest, 11,5)
apple_sorted_rats = sorted(apple_ratings.items(), key=lambda x: x[1], reverse=True)

for key in apple_sorted_rats:
    print(key[0], key[1])

Navigation 86090.33333333333
Reference 74942.11111111111
Social Networking 71548.34905660378
Music 57326.530303030304
Weather 52279.892857142855
Book 39758.5
Food & Drink 33333.92307692308
Finance 31467.944444444445
Photo & Video 28441.54375
Travel 28243.8
Shopping 26919.690476190477
Health & Fitness 23298.015384615384
Sports 23008.898550724636
Games 22812.92467948718
News 21248.023255813954
Productivity 21028.410714285714
Utilities 18684.456790123455
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Business 7491.117647058823
Education 7003.983050847458
Catalogs 4004.0
Medical 612.0


# STOPPED HERE #

In [189]:
def stringbased_table(genreset, dataset, typecol, installcol):
    for category in genreset:
        total = 0
        len_category = 0
        for row in dataset:
            category_app = row[typecol]
            if category_app == category:
                installs = row[installcol]
                installs = installs.replace(',','')
                installs = installs.replace('+','')
                installs = float(installs)
                total += installs
                len_category +=1
        avg_installs = total/len_category
        print(category + ': ')
        print(avg_installs)
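
The string handling inside `stringbased_table()` can be tried on its own. The sketch below isolates it in a hypothetical helper, `parse_installs`, using the `'10,000+'` format visible in the Google Play rows printed earlier.

```python
def parse_installs(text):
    """Turn an install label such as '10,000+' into a float."""
    return float(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))   # 10000.0
print(parse_installs('100,000+'))  # 100000.0
```

The trailing `+` suggests these labels are lower bounds rather than exact counts, so averages built from them are probably best read as rough estimates.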
            

In [190]:
ratings_table(genre_table_apple, apple_cleanest, 11, 5)

Photo & Video:
28441.54375
Games:
22812.92467948718
Music:
57326.530303030304
Social Networking:
43899.514285714286
Reference:
74942.11111111111
Health & Fitness:
23298.015384615384
Weather:
52279.892857142855
Utilities:
18684.456790123455
Travel:
28243.8
Shopping:
26919.690476190477
News:
21248.023255813954
Navigation:
86090.33333333333
Lifestyle:
16485.764705882353
Entertainment:
14029.830708661417
Food & Drink:
33333.92307692308
Sports:
23008.898550724636
Book:
39758.5
Finance:
31467.944444444445
Education:
7003.983050847458
Productivity:
21028.410714285714
Business:
7491.117647058823
Catalogs:
4004.0
Medical:
612.0


In [191]:
stringbased_table(genre_table_google, android_cleanest,1,5)

ART_AND_DESIGN: 
2021626.7857142857
AUTO_AND_VEHICLES: 
647317.8170731707
BEAUTY: 
513151.88679245283
BOOKS_AND_REFERENCE: 
8767811.894736841
BUSINESS: 
1712290.1474201474
COMICS: 
817657.2727272727
COMMUNICATION: 
38456119.167247385
DATING: 
854028.8303030303
EDUCATION: 
1833495.145631068
ENTERTAINMENT: 
11640705.88235294
EVENTS: 
253542.22222222222
FINANCE: 
1387692.475609756
FOOD_AND_DRINK: 
1924897.7363636363
HEALTH_AND_FITNESS: 
4188821.9853479853
HOUSE_AND_HOME: 
1331540.5616438356
LIBRARIES_AND_DEMO: 
638503.734939759
LIFESTYLE: 
1437816.2687861272
GAME: 
15588015.603248259
FAMILY: 
3697848.1731343283
MEDICAL: 
120550.61980830671
SOCIAL: 
23253652.127118643
SHOPPING: 
7036877.311557789
PHOTOGRAPHY: 
17840110.40229885
SPORTS: 
3638640.1428571427
TRAVEL_AND_LOCAL: 
13984077.710144928
TOOLS: 
10801391.298666667
PERSONALIZATION: 
5201482.6122448975
PRODUCTIVITY: 
16787331.344927534
PARENTING: 
542603.6206896552
WEATHER: 
5074486.197183099
VIDEO_PLAYERS: 
24727872.452830188
NEWS_AND_