# Profitable App Profiles for the App Store and Google Play Markets

Our aim is to help our developers understand what types of apps are likely
to attract more users on Google Play and the App Store. To do this,
we'll need to collect and analyze data about mobile apps available on
Google Play and the App Store.


As of September 2018, there were approximately 2 million iOS apps and 2.1 million Android apps available for download.

![Image](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)

***

First, we'll open the two data sets below.

In [2]:

from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store Data Set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple = list(read_file)
apple_header = apple[0]
apple = apple[1:]
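
The loading cell above leaves both files open and relies on the platform's default text encoding. As a minimal sketch, assuming the CSV files are UTF-8 encoded (an assumption, not something the data sets state), a small helper can close the file automatically and make the encoding explicit. The name `load_csv` and the tiny sample file are hypothetical and exist only so the example runs on its own.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for the CSV file at `path`."""
    # The `with` block closes the file even if reading fails.
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the real data sets
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Docs To Go™", "0"]])
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)
print(sample_rows)
```

With the real files, the same helper would be called as `load_csv('googleplaystore.csv')` and `load_csv('AppleStore.csv')`.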



To make the data sets easier to explore, we define a function named `explore_data()` that we can call repeatedly to print rows in a readable way.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Android Data
Calling `explore_data()` prints the first few rows of the data set, along with the total number of rows and columns.

In [4]:
print(android_header)
print()
explore_data(android, 0, 3, True) # the function prints its own output and returns None, so it is not wrapped in print()

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We can see from this output that the Google Play data set has 10,841 rows (one per app) and 13 columns of information about each app.
***

### Apple Data
Let's now take a look at the data for the App Store.

In [5]:
print(apple_header)
print()
explore_data(apple, 0, 3, True) # the function prints its own output and returns None, so it is not wrapped in print()

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We can see that the App Store data set has 7,197 rows and 16 columns. It has roughly 3,600 fewer apps than the Google Play data set, but it provides 3 more columns of information per app.

***
## Data Cleaning
### Checking for incorrect data
The purpose of data cleaning is to remove incorrect entries from the data so that our analysis isn't built on inaccurate or misleading values. The cells below check an invalid entry and then remove it.

In [6]:
print(android[10472])
# The entry above is missing a value (the category), so it is one column short; we can confirm this by checking its length
print(len(android[10472]))
print()
# A well-formed row for comparison (the header was already split off); it shows how many columns every row NEEDS to have
print(android[0])
print(len(android[0]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
13


In [7]:
#DO NOT RUN THIS CODE MORE THAN ONCE
print(len(android))
del android[10472]
print(len(android))

10841
10840
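
The `del` statement above changes the list in place, which is why running the cell twice would delete a second, valid row. As a hedged alternative sketch, the malformed rows can be located by comparing each row's length with the header's, and a new list can be built without them, so re-running the cell is harmless. The helper names `find_malformed_rows` and `drop_rows` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]


def drop_rows(dataset, indexes):
    """Return a new list without the rows at `indexes`; the input is left untouched."""
    to_drop = set(indexes)
    return [row for i, row in enumerate(dataset) if i not in to_drop]


# Toy data standing in for the Google Play rows
toy_header = ["App", "Category", "Rating"]
toy_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],  # the category value is missing, so the row is one column short
    ["App C", "GAME", "4.5"],
]

bad = find_malformed_rows(toy_rows, toy_header)
cleaned = drop_rows(toy_rows, bad)
print(bad)           # [1]
print(len(cleaned))  # 2
```

Because `drop_rows` returns a new list, the original data stays available if the check needs to be repeated.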


### Finding duplicate data

After reading through the discussion section for the data set, which can be found [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion), we can see that there are some duplicate entries, for example for 'Instagram'.

In [8]:
for app in android:
    name = app[0] # The app name is in column 0, as we can see by printing the header
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once. This is where data cleaning gets messy!

In [9]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps), '\nExamples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps: 1181 
Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We could just delete duplicates at random until only one entry per app is left, but how accurate would that be? We wouldn't know whether we kept data from yesterday or from three years ago. Instead, we will keep the entry with the highest number of reviews, since a higher count suggests more recent data, and delete the entries with fewer reviews.

### Removing the Duplicates
To remove the duplicates, we will do the following:

* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary to create a new data set, which will have only one entry per app (for each app, we'll select only the entry with the highest number of reviews).

In [10]:
reviews_max = {}
for app in android:
    name = app[0] # Get the name
    n_reviews = float(app[3]) # Get the number of reviews
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews # Replace the stored value with the higher review count
    elif name not in reviews_max:
        reviews_max[name] = n_reviews # First time we see this app
        

The code above handles two cases. If the app's name is already in the `reviews_max` dictionary and the stored review count is lower than the current row's, we replace the stored value with the higher count. If the name isn't in the dictionary yet, we add it with the current row's review count.

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [11]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Now we use our dictionary, `reviews_max`, to weed out the duplicate data.

In [12]:
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


In the code block above we create two lists, `android_clean` and `already_added`, and append our results to them inside the if statement. The if statement has a lot going on, so let's break it down.

How a row passes the if statement:
* `n_reviews` must equal the maximum review count we stored for that app in `reviews_max`.
* `name` must not already be in the `already_added` list, which guards against duplicate rows that share the same highest review count.
  * When both conditions hold, we append the whole row to `android_clean` and the app's name to `already_added`.
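
The two-pass approach above (first `reviews_max`, then `android_clean`) can also be collapsed into a single pass. The following is only a sketch of that alternative, not the method used in this notebook: it keeps one dictionary that maps each app name to the best row seen so far. The function name `keep_most_reviewed` and the toy rows are assumptions made for illustration.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Keep the row if the name is new, or if it beats the stored row.
        # Ties keep the first row seen, because the comparison is strict.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows: name, category, rating, reviews
toy_apps = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Some Other App", "TOOLS", "4.0", "120"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]

deduplicated = keep_most_reviewed(toy_apps)
for row in deduplicated:
    print(row)
```

One difference to keep in mind: the result is ordered by where each app name first appears, whereas the notebook's version keeps each app at the position of its highest-review row.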

### Removing Non-English Apps

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience.
    

In [13]:
print(apple[813][1])
print(apple[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠
„ÄêËÑ±Âá∫„Ç≤„Éº„É†„ÄëÁµ∂ÂØæ„Å´ÊúÄÂæå„Åæ„Åß„Éó„É¨„Ç§„Åó„Å™„ÅÑ„Åß „ÄúË¨éËß£„ÅçÔºÜ„Éñ„É≠„ÉÉ„ÇØ„Éë„Ç∫„É´„Äú


‰∏≠ÂõΩË™û AQ„É™„Çπ„Éã„É≥„Ç∞
ŸÑÿπÿ®ÿ© ÿ™ŸÇÿØÿ± ÿ™ÿ±ÿ®ÿ≠ DZ


We don't want to keep apps that aren't in English. To detect them we can use the built-in `ord()` function, which returns the number (the Unicode code point) that corresponds to a character. The characters commonly used in English text (letters, digits, and common symbols) are all in the ASCII range, 0 to 127, so we accept characters in that range and treat anything above it as non-English.

In [14]:
print(ord('a'))
print(ord('A'))
print(ord('爱'))
print(ord('5'))
print(ord('+'))

97
65
29233
53
43


In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate over the string using a for loop. This will come in handy when we start sifting through our data.

In [15]:
string = 'abc'
print(string[0])
print(string[1])
print(string[2])

for character in string:
    print(character)

a
b
c
a
b
c


>def validate(string):
>    for x in range(len(string)):
>        if ord(string[x]) > 127:
>            return False

Above we have created a function called `validate()`, which takes in a string. We loop through each character of the string and check its `ord()` value; if it's above 127, we know the character is outside the ASCII range. With this function and a for loop we can remove a LARGE amount of data. Below is an example of how we can use this function on a smaller scale.

>print(validate('Instagram'))
>print(validate("PPS -《欢乐颂2"))
>print(validate('Docs To Go™ Free Office Suite'))
>print(validate('Instachat 😜'))


The function doesn't return `True` in the first case because it only checks for characters that are NOT ASCII; when it finds none, it falls off the end and returns `None`. It also rejects names that are English but contain a single symbol outside the ASCII range, such as the ™ in `'Docs To Go™ Free Office Suite'` or the emoji in `'Instachat 😜'`. We will now rewrite `validate()` so that it returns `True` for English names and tolerates up to three such characters.

In [16]:
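def validate(string):
    non_ascii = 0 # Number of characters outside the ASCII range seen so far
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        if non_ascii > 3:
            return False # More than three non-ASCII characters: treat the name as non-English
    return True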

In [17]:
print(validate('Docs To Go‚Ñ¢ Free Office Suite'))
print(validate('Instachat üòú'))
print(validate('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))

True
True
False


We've confirmed that our new `validate()` function works: it returns `False` only when a name has more than 3 non-ASCII characters, and `True` otherwise. Now it's time to use it on the data sets.

In [18]:
English_Clean_Android = []
English_Clean_Apple = []
for app in android_clean:
    name = app[0]
    if validate(name):
        English_Clean_Android.append(app)
for app in apple:
    name = app[1]
    if validate(name):
        English_Clean_Apple.append(app)
explore_data(English_Clean_Android, 0, 3, True)
explore_data(English_Clean_Apple, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

So far in the data cleaning process, we've done the following:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

Now we will separate the apps based on whether they are free or not.
### Isolate Free Apps

#### Part One

In [19]:
free_apple = []
free_android = []

for app in English_Clean_Android:
    price = app[7]
    if price == '0':
        free_android.append(app)
        
for app in English_Clean_Apple:
    price = app[4]
    if price == '0.0':
        free_apple.append(app)

print(len(free_android))
print(len(free_apple))


8864
3222


Something important to keep in mind, which I realized when my results didn't look correct the first time, is that the two tables are not formatted the same way. The price is in a different column in each table (index 7 for Google Play, index 4 for the App Store), and a free app is recorded as `'0'` in one and `'0.0'` in the other. With that handled, our data is now clean enough for a proper analysis.
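
Comparing price strings directly works here, but it depends on the exact text each store uses for a free app. A slightly more forgiving sketch converts the price to a number first. The helper name `is_free` is hypothetical, and the handling of a leading dollar sign assumes that paid Google Play prices are written like `'$4.99'`.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip("$")  # assumption: '$' is the only currency symbol used
    return float(cleaned) == 0.0


print(is_free('0'))      # True  (Google Play style)
print(is_free('0.0'))    # True  (App Store style)
print(is_free('$4.99'))  # False
```

With this helper, both loops could share the same test, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.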


As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [20]:
print(apple_header)
print()
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


#### Part Two
The columns that would be most useful for Google Play are `Category`, `Genres`, `Reviews`, and `Content Rating`. For iOS, I feel we would benefit from looking into `user_rating`, `prime_genre`, and maybe `rating_count_tot`.

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

        
def freq_table (dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
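
The counting loop in `freq_table()` can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative, not the notebook's own function: `freq_table_counter` is a hypothetical name, and the toy rows stand in for the real data. It returns percentages already ordered from most to least common.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for column `index`, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by count, highest first
    return {value: count / total * 100 for value, count in counts.most_common()}


# Toy rows: app name, category
toy_rows = [
    ["App A", "GAME"],
    ["App B", "GAME"],
    ["App C", "TOOLS"],
    ["App D", "GAME"],
]

toy_table = freq_table_counter(toy_rows, 1)
for value, percentage in toy_table.items():
    print(value, ':', percentage)
```

Because dictionaries keep insertion order, iterating over the result prints the values in descending order of frequency without a separate sorting step.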

#### Part Three
We defined `freq_table()` and `display_table()` above so that we can generate frequency tables for the columns `prime_genre`, `Genres`, and `Category`. We'll now generate and analyze these frequency tables.

Remember our dataset only contains free English apps, so you should be careful not to extend your conclusions beyond that scope. If you find that gaming apps are the most numerous among the free English apps on Google Play, it doesn't mean we'll see the same pattern on Google Play as a whole.

`prime_genre` in the App Store (column index 11, which is the same as -5 when counting from the end of the row)

In [22]:
display_table(free_apple, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


For free apps, the most common genre by a landslide is **Games**. Games make up more than half of the free English apps, at `58.16%`. The next most common genre is **Entertainment**, at about 7.9%. Moving up the list, the share per genre grows gradually, until **Games** at the top jumps about 50 percentage points above Entertainment.

Based on the number of apps alone, I don't think this is enough to pitch a game idea to the company. If anything, it seems we may want to stay away unless we can provide a new and unique experience that makes the app stand out; without that, the app will just get lost among the thousands of others. But let's keep looking.


Now we'll look at the Google Play store's categories.

In [23]:
display_table(free_android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Looking at the Google Play store, we see a whole different environment, with **Family** in first place at `18.9%` and **Games** in second at `9.7%`. Since there is a Family category in the Google Play store but not in the App Store, I suspect that many **Family** apps are really family-friendly **Games**, although the data shown here doesn't confirm that.


Now we'll look at the Google Play store's genres.

In [24]:
display_table(free_android, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Looking at the genres, we can tell pretty easily that many more of them are built around **Games** (Action, Simulation, Arcade, Casual, Puzzle, Racing, Role Playing); however, **Tools** holds the largest share with 8.45%. I don't think there is enough data yet to comfortably pitch an app profile, but we are beginning to narrow down our search.

Now we begin sifting through our cleaned data. We want to find the genres that attract the most user ratings. To do this, we first generate a frequency table for the App Store to get the list of genres. Then we loop through the genres and, for each one, loop through the apps and add up the rating counts of the apps in that genre.

In [31]:
ios_genre = freq_table(free_apple, 11)

for genre in ios_genre:
    total = 0
    len_genre = 0
    for app in free_apple:
        genre_app = app[11]
        if genre_app == genre:
            rating = float(app[5])
            total += rating
            len_genre += 1
    avg_rating = total / len_genre
    print(f"The genre {genre}, has {total} total ratings")

The genre Social Networking, has 7584125.0 total ratings
The genre Photo & Video, has 4550647.0 total ratings
The genre Games, has 42705967.0 total ratings
The genre Music, has 3783551.0 total ratings
The genre Reference, has 1348958.0 total ratings
The genre Health & Fitness, has 1514371.0 total ratings
The genre Weather, has 1463837.0 total ratings
The genre Utilities, has 1513441.0 total ratings
The genre Travel, has 1129752.0 total ratings
The genre Shopping, has 2261254.0 total ratings
The genre News, has 913665.0 total ratings
The genre Navigation, has 516542.0 total ratings
The genre Lifestyle, has 840774.0 total ratings
The genre Entertainment, has 3563577.0 total ratings
The genre Food & Drink, has 866682.0 total ratings
The genre Sports, has 1587614.0 total ratings
The genre Book, has 556619.0 total ratings
The genre Finance, has 1132846.0 total ratings
The genre Education, has 826470.0 total ratings
The genre Productivity, has 1177591.0 total ratings
The genre Business, has 

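Among the genres shown, Games has by far the largest total, with about 42.7 million ratings, which is not surprising given how many games there are. Social Networking follows with almost 7.6 million ratings. Let's take a look at which apps these numbers come from, using the Entertainment genre as an example.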
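
Totals favor genres that simply contain many apps, so an average per app is often easier to compare across genres. The loop above already computes `avg_rating` but prints only the total. As a sketch, the averages for every genre can be collected in one pass with two `defaultdict` objects. The helper name `average_by_group` and the toy rows are assumptions made for illustration, and the numbers are not taken from the real data.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column `value_index`}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: genre, number of ratings
toy_rows = [
    ["Games", "100"],
    ["Games", "300"],
    ["Social Networking", "1000"],
]

toy_averages = average_by_group(toy_rows, 0, 1)
for genre, average in toy_averages.items():
    print(f"The genre {genre} has {average} ratings per app on average")
```

For the App Store data, the call would be `average_by_group(free_apple, 11, 5)`, which avoids re-scanning the whole data set once per genre.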

In [36]:
for app in free_apple:
    if app[11] == 'Entertainment':
        print(app[1], ":", app[5])

Netflix : 308844
Fandango Movies - Times + Tickets : 291787
Colorfy: Coloring Book for Adults : 247809
IMDb Movies & TV - Trailers and Showtimes : 183425
TRUTH or DARE!!! - FREE : 171055
Mad Libs : 117889
Twitch : 109549
Action Movie FX : 101222
Voice Changer Plus : 98777
iFunny :) : 98344
The CW : 97368
The Moron Test : 88613
DIRECTV : 81006
ABC – Watch Live TV & Stream Full Episodes : 78890
Xbox : 72187
Redbox : 60236
Talking Tom Cat 2 for iPad : 56399
Hulu: Watch TV Shows & Stream the Latest Movies : 56170
NBC – Watch Now and Stream Full TV Episodes : 55950
Emoji> : 55338
DIRECTV App for iPad : 47506
Amazon Prime Video : 43667
CBS Full Episodes and Live TV : 39436
FOX NOW - Watch Full Episodes and Stream Live TV : 39391
Talking Angela for iPad : 32763
Recolor - Coloring Book : 31180
Talking Ben the Dog for iPad : 31116
Talking Tom Cat for iPad : 29492
YouTube Kids : 28560
Tom's Love Letters : 27711
HBO GO : 26278
NFL Sunday Ticket : 24258
Pigment - Coloring Book for Adults : 239

We can see that there are a lot of apps in the Entertainment genre, but the top 3 apps alone account for almost a quarter of the genre's total number of ratings.

Now we'll do the same for the Play Store. First, though, we have to remove some of the formatting in the `Installs` column (the commas and the trailing `+`) so that we can convert the values to numbers.

In [44]:
android_category = freq_table(free_android, 1)

for category in android_category:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:
            downloads = app[5]
            downloads = downloads.replace(',','')
            downloads = downloads.replace('+','')
            total += float(downloads)
            len_category += 1
    avg_installs = total / len_category
    print(f"The genre {category}, has {avg_installs} average downloads")
    


The genre ART_AND_DESIGN, has 1986335.0877192982 average downloads
The genre AUTO_AND_VEHICLES, has 647317.8170731707 average downloads
The genre BEAUTY, has 513151.88679245283 average downloads
The genre BOOKS_AND_REFERENCE, has 8767811.894736841 average downloads
The genre BUSINESS, has 1712290.1474201474 average downloads
The genre COMICS, has 817657.2727272727 average downloads
The genre COMMUNICATION, has 38456119.167247385 average downloads
The genre DATING, has 854028.8303030303 average downloads
The genre EDUCATION, has 1833495.145631068 average downloads
The genre ENTERTAINMENT, has 11640705.88235294 average downloads
The genre EVENTS, has 253542.22222222222 average downloads
The genre FINANCE, has 1387692.475609756 average downloads
The genre FOOD_AND_DRINK, has 1924897.7363636363 average downloads
The genre HEALTH_AND_FITNESS, has 4188821.9853479853 average downloads
The genre HOUSE_AND_HOME, has 1331540.5616438356 average downloads
The genre LIBRARIES_AND_DEMO, has 638503.734939759 average downloads
The g

We can see that communication apps have the highest average number of installs, at roughly 38 million per app. But where do all these installs come from?

In [45]:
for app in free_android:
    if app[1] == 'COMMUNICATION':
        print(app[0], ' : ', app[5])

WhatsApp Messenger  :  1,000,000,000+
Messenger for SMS  :  10,000,000+
My Tele2  :  5,000,000+
imo beta free calls and text  :  100,000,000+
Contacts  :  50,000,000+
Call Free – Free Call  :  5,000,000+
Web Browser & Explorer  :  5,000,000+
Browser 4G  :  10,000,000+
MegaFon Dashboard  :  10,000,000+
ZenUI Dialer & Contacts  :  10,000,000+
Cricket Visual Voicemail  :  10,000,000+
TracFone My Account  :  1,000,000+
Xperia Link™  :  10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard  :  10,000,000+
Skype Lite - Free Video Call & Chat  :  5,000,000+
My magenta  :  1,000,000+
Android Messages  :  100,000,000+
Google Duo - High Quality Video Calls  :  500,000,000+
Seznam.cz  :  1,000,000+
Antillean Gold Telegram (original version)  :  100,000+
AT&T Visual Voicemail  :  10,000,000+
GMX Mail  :  10,000,000+
Omlet Chat  :  10,000,000+
My Vodacom SA  :  5,000,000+
Microsoft Edge  :  5,000,000+
Messenger – Text and Video Chat for Free  :  1,000,000,000+
imo free video calls and 

If we removed all the communication apps that have over 100 million installs, the average would be roughly ten times smaller.
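
One way to check a claim like this is to recompute the average with a cap on the install count. The following is a minimal sketch on toy rows rather than the real data, so the numbers it prints say nothing about the actual category. The helper name `average_installs` and its `cap` parameter are hypothetical, and the sketch reads "over 100 million" as excluding every app in the `100,000,000+` bucket and above.

```python
def average_installs(rows, installs_index=5, cap=None):
    """Average the installs column, optionally ignoring apps at or above `cap`."""
    values = []
    for row in rows:
        # '1,000,000+' -> 1000000.0
        installs = float(row[installs_index].replace(",", "").replace("+", ""))
        if cap is None or installs < cap:
            values.append(installs)
    return sum(values) / len(values) if values else 0.0


# Toy rows: app name, installs
toy_rows = [
    ["Giant App", "1,000,000,000+"],
    ["Mid App", "10,000,000+"],
    ["Small App", "5,000,000+"],
]

full_average = average_installs(toy_rows, installs_index=1)
capped_average = average_installs(toy_rows, installs_index=1, cap=100_000_000)
print(full_average)
print(capped_average)  # 7500000.0
```

For the real data, the rows would first be limited to the category of interest, for example `[app for app in free_android if app[1] == 'COMMUNICATION']`, and the default `installs_index=5` would apply.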

### Conclusion

In this project we learned many different things about data science: how to clean and analyze data, how to remove data as needed, and how to tell good data from bad. We analyzed two different data sets, one from the App Store and one from the Google Play store, in order to reach a single recommendation. On the App Store, `Social Networking` is among the most reviewed genres, and on the Google Play store `Communication` apps have the highest average number of installs. In both cases, entertainment categories and genres also rank highly.

#### Suggested app profile
Based on this data, I would suggest some sort of content service similar to YouTube, but with the ability to connect and communicate with one another, such as twit