### Profitable Apps And The Drivers For Growing Our User Base

This project delivers an analysis of the key performance indicators for apps sold through the [Apple](https://www.apple.com/uk/ios/app-store/) and [Google](https://play.google.com/store?hl=en) stores. Understanding the main drivers of our user base, and how they relate to app profitability in each store, should give us valuable insight to guide future development work.

Our company builds apps that are free to download and install, and our main source of revenue is in-app ads. This means the revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.


### The Summary Overview

According to [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/), Google Play was the market leader in September 2018 with 2.1 million apps available for download, with Apple in a comfortable second position with 2 million iOS apps available in the App Store.

![image_apple_vs_google](https://github.com/Rafal161812/Python-Solutions/blob/master/images/statista_google_apps_vs_apple_apps_2018.PNG?raw=true "Apps available for download in September 2018")
<p style="text-align: center;"> Image Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)</p>

To start our analysis we will use two small data sets, which should provide a representative view of the market.

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately 7,000 apps from the App Store; the data was collected in July 2017

As a first port of call we should open both data sets and explore them. To do this, we will read each file into a list of rows and then create a reusable function called `explore_data()`.





In [1]:
from csv import reader

# Read the App Store data set into a list of rows (the first row is the header)
with open('AppleStore.csv', encoding='utf8') as file_open:
    apple_dataset = list(reader(file_open))
apple_dataset_header = apple_dataset[0]

# Read the Google Play data set the same way
with open('googleplaystore.csv', encoding='utf8') as file_open:
    google_dataset = list(reader(file_open))
google_dataset_header = google_dataset[0]


The `explore_data()` function below takes `start` and `end` parameters, which let us examine a slice of each data set. The last parameter, `rows_and_columns`, can be used to display the total number of rows and columns in the data set.

In [2]:
def explore_data(dataset,start,end,rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row,'\n') # \n adds a new (empty) line after each row
        
    if rows_and_columns is True:
        print('Number of rows ', len(dataset))
        print('Number of columns:', len(dataset[0]))

We are going to print the header and a few sample rows from each file.

In [3]:
# Apple dataset
print('\n',apple_dataset_header,'\n')
explore_data(apple_dataset,1,4,True)



 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

Number of rows  7198
Number of columns: 16


In [4]:
# Google dataset
print('\n',google_dataset_header,'\n')
explore_data(google_dataset,1,4,True)


 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

Number of rows  10842
Number of columns: 13


Based on the sample data we can see that both data sets share columns providing the same information, even if the column names are not exactly the same.
The most useful columns for our analysis will be the following ones:

* App name (Apple: `track_name`, Google: `App`)
* Price (Apple: `price`, Google: `Price`)
* User Rating (Apple: `user_rating`, Google: `Rating`)
* Count of Ratings (Apple: `rating_count_tot`, Google: `Reviews`)
* Genre (Apple: `prime_genre`, Google: `Genres`)


## Data Cleaning

Our analysis would not be worth much if it were based on incorrect data, so the next step is data cleaning. We have to make sure that all inaccurate data is detected and either removed or corrected. This includes finding and removing duplicates.

In [5]:
# Checking the Apple data set: every row should have as many columns as the header
number_of_columns = len(apple_dataset[0])
list_incorrect_indexes = list()
for index, row in enumerate(apple_dataset):
    if len(row) != number_of_columns:
        list_incorrect_indexes.append(index)  # position of the malformed row
        print(row)
print('\n')
print('Count of incorrect entries:', len(list_incorrect_indexes))
print('Incorrect indexes:',list_incorrect_indexes)



Count of incorrect entries: 0
Incorrect indexes: []


In [6]:
# Checking the Google data set: every row should have as many columns as the header
number_of_columns = len(google_dataset[0])
list_incorrect_indexes = list()
for index, row in enumerate(google_dataset):
    if len(row) != number_of_columns:
        list_incorrect_indexes.append(index)  # position of the malformed row
        print(row)
print('\n')
print('Count of incorrect entries:', len(list_incorrect_indexes))
print('Incorrect indexes:',list_incorrect_indexes)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Count of incorrect entries: 1
Incorrect indexes: [10473]


Checking both data sets revealed that the Google Play data set has one entry with a missing column: the printed row has 12 values instead of 13, and the category is absent. Because we cannot trust the remaining values of a misaligned row, it is best to simply remove this record. On the other hand, the fact that only one record in the entire data set failed this check gives us some confidence in the integrity of the rest of the data.

In [7]:
#removal of the faulty record
del google_dataset[10473]
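
One caveat with deleting by index is that running the cell a second time removes another, valid row. A minimal sketch of an alternative, assuming the same list-of-rows layout, is to filter on row length instead. The function name `drop_malformed_rows` and the miniature data set are hypothetical and only serve as illustration.

```python
# Hypothetical miniature data set: a header row plus three data rows,
# one of which is missing a column.
sample_dataset = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],            # malformed: the category is missing
    ['App C', 'GAME', '4.7'],
]


def drop_malformed_rows(dataset):
    """Return a new list holding only the rows that are as long as the header."""
    expected_length = len(dataset[0])
    return [row for row in dataset if len(row) == expected_length]


cleaned = drop_malformed_rows(sample_dataset)
print(len(sample_dataset), 'rows before,', len(cleaned), 'rows after')
```

Because the filter builds a new list from a condition rather than a position, it can be re-run safely and gives the same result each time.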

Having removed the record that failed the column-count check, we turn to another issue: other users of this data set report duplicate entries in the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section.

In order to measure the size of this problem, we are going to run a loop using the `App` column of the data set and measure the following:
* number of unique apps in the data set
* number of duplicated records
* number of unique apps with more than 1 record

In [8]:
duplicate_apps = dict()
unique_apps = list()

for app in google_dataset[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps[name] = duplicate_apps.get(name,0) + 1
    else:
        unique_apps.append(name)

        
print('Number of unique apps',len(unique_apps))
print('Number of duplicate records:',sum(duplicate_apps.values()))
print('Number of unique apps with duplicate records:',len(duplicate_apps))

Number of unique apps 9659
Number of duplicate records: 1181
Number of unique apps with duplicate records: 798


Based on the results above we can conclude that the Google Play data set contains a number of apps with more than one record: 798 apps account for 1,181 duplicate records.
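
The same three figures can also be derived with `collections.Counter` from the standard library. The sketch below uses a short made-up list of names (`sample_names`) rather than the real `App` column, so the numbers it prints are illustrative only.

```python
from collections import Counter

# Hypothetical app names standing in for the `App` column
sample_names = ['Maps', 'Chat', 'Maps', 'Notes', 'Chat', 'Maps']

name_counts = Counter(sample_names)

# Number of distinct app names
unique_count = len(name_counts)
# Every occurrence beyond the first counts as a duplicate record
duplicate_records = sum(count - 1 for count in name_counts.values())
# Number of distinct names that occur more than once
apps_with_duplicates = sum(1 for count in name_counts.values() if count > 1)

print('Number of unique apps:', unique_count)
print('Number of duplicate records:', duplicate_records)
print('Number of unique apps with duplicate records:', apps_with_duplicates)
```

A `Counter` looks names up in a dictionary, so it avoids the repeated list searches that `name in unique_apps` performs on a list.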

In [9]:
# Pick one app name from the dictionary of duplicates
# (the last key in the dictionary's iteration order)
duplicate_app_example = list(duplicate_apps)[-1]
print('Example of an app with duplicate records: ',duplicate_app_example,'\n')
for app in google_dataset[1:]:
    name = app[0]
    if name == duplicate_app_example:
        print(app)

Example of an app with duplicate records:  Free Blood Pressure 

['Free Blood Pressure', 'MEDICAL', 'NaN', '7', '5.7M', '5,000+', 'Free', '0', 'Everyone', 'Medical', 'October 13, 2016', '3.0.0', '4.0.3 and up']
['Free Blood Pressure', 'MEDICAL', 'NaN', '7', '5.7M', '5,000+', 'Free', '0', 'Everyone', 'Medical', 'October 13, 2016', '3.0.0', '4.0.3 and up']
['Free Blood Pressure', 'MEDICAL', 'NaN', '7', '5.7M', '5,000+', 'Free', '0', 'Everyone', 'Medical', 'October 13, 2016', '3.0.0', '4.0.3 and up']


In the code above we have accomplished two things:
* we picked the name of a duplicated app from the dictionary containing names and counts of duplicate apps
* we sampled our Google Play data set using that name to allow for examination of the problem

The three records printed above for *Free Blood Pressure* are identical in every column. For other duplicated apps, such as *FastMeet*, all columns but one contain the same information, and the only column that differs is the number of reviews. Assuming that the number of reviews only grows over time, the record with the highest number of reviews is the most recent one for an app, and we can use this criterion to clean the entire data set of duplicates.

In [10]:
d_reviews_max = dict()
for row_list in google_dataset[1:]:
    name = row_list[0]
    reviews = float(row_list[3])
    if not name in d_reviews_max:
        d_reviews_max[name] = reviews
    else:
        current_reviews_value = d_reviews_max[name]
        if reviews > current_reviews_value:
            d_reviews_max[name] = reviews
print('Number of apps in the dictionary:',len(d_reviews_max))

Number of apps in the dictionary: 9659


The code above is the first of two steps aimed at removing duplicates from the data set. It does the following:
1. It loops through all records of the Google Play data set and reads, for each record:
    * the name of the app
    * the count of reviews, converted to float
2. If the app name is not yet in the new dictionary, it adds the name as a key with the number of reviews as its value. Otherwise, it compares the number of reviews in the current row with the value already stored for that app and keeps the larger of the two.

In [11]:
temp_list = list()
for key,val in d_reviews_max.items():
    dictionary_tup = (key,val)
    temp_list.append(dictionary_tup)
print(temp_list[:5])

[('Compass', 286454.0), ('Cy-Fair Christian Church', 2.0), ('Antillean Gold Telegram (original version)', 2939.0), ('EO SA Benefits', 0.0), ('Masha and the Bear: Good Night!', 29155.0)]


As seen in the output above, the newly created dictionary stores one key-value pair for each app in our data set, where the key is the app name and the value is the highest number of reviews found for that app.

In [12]:
# Part 2
android_clean = list() # this will store our new cleaned data set
android_added = list() # this will just store app names
for row_list in google_dataset[1:]:
    name = row_list[0]
    n_reviews = float(row_list[3])
    if n_reviews == d_reviews_max[name] and not name in android_added:
        android_clean.append(row_list)
        android_added.append(name)
print('Total number of records in the cleansed data set:',len(android_clean))

Total number of records in the cleansed data set: 9659


Part two of the cleanup creates a new, cleansed data set by looping through the original data set and matching app names and their review counts against the dictionary populated in the previous step. As a precaution, an additional list called `android_added` ensures that we do not copy duplicate records in those cases where the record with the highest number of reviews is itself duplicated.
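
The two steps can also be folded into a single pass that stores the whole row, not just the review count, for each app name. This is a sketch under the same assumptions as the cells above (name in column 0, reviews in column 3). The function `keep_most_reviewed` and the sample rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the row seen first is kept
        if name not in best_rows or reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Hypothetical rows: name, category, rating, reviews
sample_rows = [
    ['Chat', 'SOCIAL', '4.0', '100'],
    ['Notes', 'TOOLS', '4.2', '20'],
    ['Chat', 'SOCIAL', '4.0', '150'],
    ['Chat', 'SOCIAL', '4.0', '150'],
]

deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```

Keeping the row itself in the dictionary removes the need for a second loop and for a helper list such as `android_added`.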

## Removal of non-English apps

Removing duplicate entries does not complete our data cleaning task. On close examination of the app names we find that both data sets contain apps directed at a non-English-speaking audience. Since English is the only language used in our organization, it makes sense to detect and remove all entries for apps not directed towards an English-speaking audience.

In [13]:
print(apple_dataset[814][1])
print('\n')
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播


لعبة تقدر تربح DZ


The best approach in this case is to create a function that identifies all app names containing letters or symbols not used in English. The Python function **`ord()`** returns the Unicode code point of a character, and the characters commonly used in English text all belong to the widely used [ASCII](https://en.wikipedia.org/wiki/ASCII) standard, which covers the code points from 0 to 127. With this knowledge we can build a function that finds all characters with a code point greater than 127 and assume that app names containing those characters were built with a non-English audience in mind.

In [14]:
def string_test(astring):
    for char in astring:
        if ord(char) > 127: return False
    return True

print(string_test('Instagram'))
print(string_test('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(string_test('Docs To Go™ Free Office Suite'))
print(string_test('Instachat 😜'))
    

True
False
False
False


We can loop through all characters of a string, and applying the **`ord()`** function to each one lets us pick up any character not used in English.

The result of the test above reveals that this logic is not flawless and could potentially exclude valid entries. The emoji and the trademark symbol (™) are valid characters in English app names, and yet, because they fall outside the ASCII range and their code points are greater than 127, our new function would exclude all records that contain them.
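
To make the cause of the false rejections visible, the following sketch prints the value returned by `ord()` for a few characters. The list `sample_characters` is chosen for illustration: three ASCII characters, followed by the trademark sign, a Chinese character and an emoji taken from the app names tested above.

```python
# Three ASCII characters, then three characters from the tested app names
sample_characters = ['a', 'Z', '~', '™', '爱', '😜']

# Map each character to the number returned by ord()
code_points = {char: ord(char) for char in sample_characters}

for char, code_point in code_points.items():
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(char), code_point, label)
```

The first three characters fall within the range 0 to 127, while the trademark sign (8482), the Chinese character and the emoji all lie above it. This is why a single such symbol is enough for `string_test()` to return `False`.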

To improve the accuracy of this logic, we can allow a small number of special characters in the string. Most app names are longer than 2 or 3 characters, so a name written in a language other than English will usually contain more than 2 or 3 non-ASCII characters. Allowing up to 3 special characters in the string should reduce the possibility of rejecting valid entries.

In [15]:
def uf_is_english(string):
    nonenglish_characters_count = 0
    for char in string:
        if ord(char) > 127: nonenglish_characters_count +=1
    if nonenglish_characters_count > 3: return False
    return True

# Apple Data Set
apple_english = list()    
for apple_app in apple_dataset[1:]:
    name = apple_app[1]
    if not uf_is_english(name) is True: continue
    apple_english.append(apple_app)
    
explore_data(apple_english,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

Number of rows  6183
Number of columns: 16


In [16]:
# Google Data Set
android_english = list()
for android_app in android_clean[:]:
    name = android_app[0]
    if not uf_is_english(name) is True: continue
    android_english.append(android_app)
    
explore_data(android_english,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows  9614
Number of columns: 13


### Data Cleaning Finale - Isolation of Free Apps

As noted at the start of the project, our organization's main source of income is in-app ads, so the last data cleaning step is to exclude all apps that are not free to download.

In [17]:
# Apple Data Set
apple_free = list()    
for apple_app in apple_english[:]:
    price = float(apple_app[4])
    if not price == 0: continue
    apple_free.append(apple_app)

print('Number of free apps:',len(apple_free))

Number of free apps: 3222


In [18]:
# Google Data Set
android_free = list()    
for android_app in android_english[:]:
    price = android_app[7]
    if not price == '0': continue
    android_free.append(android_app)

print('Number of free apps:',len(android_free))

Number of free apps: 8864
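
The two cells above test the price differently: the Apple cell compares a float with `0`, while the Google cell compares the raw string with `'0'`. A shared helper would let both stores use the same numeric test. The sketch below is an assumption-laden illustration: the names `parse_price` and `is_free` are hypothetical, and it assumes that paid Google Play prices carry a leading dollar sign (for example `'$4.99'`), which is not shown in the sample rows printed earlier.

```python
def parse_price(price_text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    Assumes the only non-numeric decoration is an optional leading '$'.
    """
    return float(price_text.strip().lstrip('$'))


def is_free(price_text):
    """Return True when the parsed price is zero."""
    return parse_price(price_text) == 0.0


for text in ['0', '0.0', '$4.99']:
    print(repr(text), '->', 'free' if is_free(text) else 'paid')
```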


### Start of Analysis

At the start of the project it was mentioned that the number of people using our apps is the main factor behind revenue generation. Since developing and deploying every app to both online stores at once could prove inefficient, a strategy has been adopted to ensure that resources are spent in the most efficient way:
* Each app is built in its basic form and deployed to Google Play
* If the user response is good, the app is developed further
* If the app is profitable after 6 months, an iOS version is built and deployed to the App Store

This model ensures that we spend a minimal amount of resources. Our first port of call will therefore be to find the kinds of apps that are profitable in both online stores. One of the attributes available in both data sets is the app genre, and this attribute will be used in the first step of the analysis.

Frequency tables will be the right tool at this stage of the analysis.
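
As a small illustration of the idea, the sketch below builds a percentage frequency table for a made-up list of genres using `collections.Counter`. The list `sample_genres` is hypothetical and unrelated to the real data sets.

```python
from collections import Counter

# Hypothetical genre column
sample_genres = ['Games', 'Games', 'Education', 'Games', 'Music']

counts = Counter(sample_genres)
total = sum(counts.values())

# Replace absolute counts with percentages of the total
percentages = {genre: round(count / total * 100, 4)
               for genre, count in counts.items()}

# Display the table from the most to the least frequent genre
for genre, percentage in sorted(percentages.items(),
                                key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```

Here Games accounts for 60% of the entries, while Education and Music account for 20% each.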

In [19]:
#Step1 - create a frequency table
def genre_freq_table(dataset,index):
    count_of_records = 0
    freq_table = dict()
    for list_row in dataset:
        genre = list_row[index]
        freq_table[genre] = freq_table.get(genre,0)+1
        count_of_records += 1

        
#Step2 - replaces totals with percentage values
    for key in freq_table:
        freq_table[key] = round((freq_table[key] / count_of_records) * 100,4)
        
    return freq_table, count_of_records


#Step3 - use key-value pairs as tuples to populate new list and use it 
#        for displaying data
def display_table(dataset,index):
    freq_table, count_of_records = genre_freq_table(dataset,index)
    tbl_display = list()
    for key,value in freq_table.items():
        tbl_display.append((value,key)) #appending tuples taken from dictionary
    tbl_sorted = sorted(tbl_display,reverse = True)

    print('Total number of free apps:',count_of_records,'\n')
    print('Apps by Genre (%):')
    for tpl in tbl_sorted:
        print(tpl[1],':',tpl[0])
       
display_table(apple_free,11)

Total number of free apps: 3222 

Apps by Genre (%):
Games : 58.1626
Entertainment : 7.8833
Photo & Video : 4.9659
Education : 3.6623
Social Networking : 3.2899
Shopping : 2.6071
Utilities : 2.514
Sports : 2.1415
Music : 2.0484
Health & Fitness : 2.0174
Productivity : 1.7381
Lifestyle : 1.5829
News : 1.3346
Travel : 1.2415
Finance : 1.1173
Weather : 0.869
Food & Drink : 0.807
Reference : 0.5587
Business : 0.5276
Book : 0.4345
Navigation : 0.1862
Medical : 0.1862
Catalogs : 0.1241


The output above, showing the percentage of free apps in each genre of the App Store data set, gives us a clear indication of two things:
* First, the free English apps in the data set are spread across many different genres
* Second, the majority of apps are built for entertainment, mainly games.

Apps in the Games genre make up 58% of all free English apps in the data set, with the second genre, Entertainment, falling way behind at almost 8%. While the majority of apps in the App Store data set aim to satisfy the needs of fun seekers, the remainder are spread across various utility-type genres, including productivity, travel, finance, business and many others. Those apps are less numerous; however, one needs to remember that the number of apps in a genre does not indicate the size of the user base behind it.

Will this distribution of apps across genres repeat in the Google Play data set? Let's find out.

In [20]:
display_table(android_free,1)

Total number of free apps: 8864 

Apps by Genre (%):
FAMILY : 18.9079
GAME : 9.7247
TOOLS : 8.4612
BUSINESS : 4.5916
LIFESTYLE : 3.9034
PRODUCTIVITY : 3.8921
FINANCE : 3.7004
MEDICAL : 3.5311
SPORTS : 3.3958
PERSONALIZATION : 3.3168
COMMUNICATION : 3.2378
HEALTH_AND_FITNESS : 3.0799
PHOTOGRAPHY : 2.9445
NEWS_AND_MAGAZINES : 2.7978
SOCIAL : 2.6625
TRAVEL_AND_LOCAL : 2.3353
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.1435
DATING : 1.8615
VIDEO_PLAYERS : 1.7938
MAPS_AND_NAVIGATION : 1.3989
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.9589
LIBRARIES_AND_DEMO : 0.9364
AUTO_AND_VEHICLES : 0.9251
HOUSE_AND_HOME : 0.8236
WEATHER : 0.801
EVENTS : 0.7107
PARENTING : 0.6543
ART_AND_DESIGN : 0.6431
COMICS : 0.6205
BEAUTY : 0.5979


The Google Play data set paints a different picture. First of all, the variety of categories is greater than in the App Store data set. Another point is that apps are distributed more evenly across categories. There are still clear winners, with FAMILY and GAME taking the top two spots; however, TOOLS in third position (8.5%) comes close to GAME (9.7%).
In fact, the categories other than FAMILY and GAME, many of them utility-type, together make up the majority (over 70%) of all free apps in the Google Play data set.

It is worth noting here that the column called **`Category`** has been used in the analysis of the Google Play data. A **`Genres`** column is also available in the same data set; however, it seems to contain more granular information, which would not help at this stage of the analysis.

### Most Popular apps by Genre

#### Apple Store

What we know so far is the distribution of apps among different genres in both online stores. What is still missing is which genres attract the largest number of users. Since the number of users is the main driver of ad revenue, this is the next step in the analysis.

The Google data set offers the number of installs for each app in a column called **`Installs`**; however, the App Store data set does not have this information. Not all is lost, since we can use another column in the Apple data set as a proxy: **`rating_count_tot`**, the total number of ratings. This column should be a reasonable substitute for the **`Installs`** column from the Google data set.

In [21]:
from statistics import median
freq_table, count_of_records = genre_freq_table(apple_free,11)

for key in freq_table:
    all_ratings = 0
    genre = key
    genre_app_count = 0
    list_of_app_ratings_in_genre = list()
    for i in apple_free:
        total_ratings = float(i[5]) # rating_count_tot
        total_genre = i[11]  # prime_genre
        if total_genre == genre:
            all_ratings += total_ratings # sum of all ratings for that genre
            genre_app_count += 1
            list_of_app_ratings_in_genre.append(total_ratings)
    print(key,' Average:',int(round(int(all_ratings)/genre_app_count,0)),' Median:', int(median(sorted(list_of_app_ratings_in_genre))))
    

Travel  Average: 28244  Median: 798
Reference  Average: 74942  Median: 6614
Medical  Average: 612  Median: 566
Utilities  Average: 18684  Median: 1110
Photo & Video  Average: 28442  Median: 2206
Education  Average: 7004  Median: 606
Book  Average: 39758  Median: 421
News  Average: 21248  Median: 373
Productivity  Average: 21028  Median: 8737
Navigation  Average: 86090  Median: 8196
Health & Fitness  Average: 23298  Median: 2459
Social Networking  Average: 71548  Median: 4199
Games  Average: 22789  Median: 901
Entertainment  Average: 14030  Median: 1197
Finance  Average: 31468  Median: 1931
Business  Average: 7491  Median: 1150
Weather  Average: 52280  Median: 289
Sports  Average: 23009  Median: 1628
Shopping  Average: 26920  Median: 5936
Lifestyle  Average: 16486  Median: 1111
Food & Drink  Average: 33334  Median: 1490
Music  Average: 57327  Median: 3850
Catalogs  Average: 4004  Median: 1229


For each genre in the Apple data set we have made two calculations: first, the average number of ratings per app, and second, the median number of ratings per app. We look at the median because averages are sensitive to outliers. An app or two with a very high (or very low) number of ratings can shift a genre's average and lead us to an incorrect conclusion.
Based on the output above we can draw the following conclusions:
* The genres with the highest average number of ratings per app are Navigation (86090), Reference (74942) and Social Networking (71548). The Navigation average is based on only 6 apps, so it is likely driven by one or two apps with a very high number of ratings. A sample of 6 is too small to draw conclusions from, so we omit this genre from further consideration. Social Networking contains 106 apps and its average is high; however, the median (4199) is low by comparison, which suggests that while some apps in that genre are very successful, many others are not doing that well. Reference contains only 18 apps, but both its average and its median (6614) are high compared to other genres, so it should be on our list for further analysis.


* Genres worth our attention are Productivity and Shopping. Their averages are not the highest, but their medians (8737 and 5936) are among the highest of all genres, which suggests that a typical app in these genres attracts a substantial audience and that they could present good opportunities for future work. Compare this with Games: not surprisingly, it contains the largest number of apps (1874), and the average number of ratings per app is 22789. However, the median for Games is only 901, which means that half of all Games apps (937) have 901 ratings or fewer. The games market is saturated, and it is much harder for a developer to release a game that stands out from the others. The **Shopping** genre, on the other hand, contains 84 apps with an average of 26920 ratings, somewhat higher than the Games average, but its median of 5936 suggests that the chances of releasing a successful app in that genre are much higher. With traditional brick-and-mortar stores being replaced by online shops, it is no wonder that such apps are gaining popularity.


* Genres we would probably want to avoid are Weather, Book and News. While Book and Weather have a high average number of ratings per app, all three have very low medians (289, 421 and 373 for Weather, Book and News respectively). This suggests that each genre contains at least one outlier, a very successful app with a large number of ratings, whereas the majority of apps in those genres attract little attention.
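
The sensitivity of the average to outliers, which motivated the use of the median here, can be shown on a toy example. The rating counts below are made up for illustration and do not come from either data set.

```python
from statistics import mean, median

# Hypothetical rating counts for five ordinary apps
typical_counts = [300, 400, 500, 600, 700]
# The same apps plus one extremely popular app
with_outlier = typical_counts + [1000000]

mean_before = mean(typical_counts)
median_before = median(typical_counts)
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print('Without outlier: average', mean_before, 'median', median_before)
print('With outlier:    average', round(mean_after), 'median', median_after)
```

Adding a single app with a million ratings lifts the average from 500 to more than 160,000, while the median only moves from 500 to 550. A large gap between the two figures is therefore a hint that a few apps dominate a genre.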


Let's investigate the Google data set.

#### Google Play Store

As mentioned before, the Google data set offers a column called **`Installs`**, which holds the total number of downloads for each app. We can use this column along with **`Category`** in our analysis.

Unfortunately, our excitement ends right there. The number of installs in the Google data is not precise. Instead of an exact number of downloads, each app is marked with an open-ended band (for example 10,000+ or 50,000+), which makes our comparison harder. However, we can simplify our approach by using the bottom value of each band for each app. This requires converting the band labels into numbers, which can only be done after removing the non-numeric characters (commas and plus signs).
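
The conversion described above can be wrapped in a small helper so that it is written only once. This is a minimal sketch: the name `parse_installs` is hypothetical, and it returns an `int` where the notebook cells use `float`.

```python
def parse_installs(installs_text):
    """Turn a band label such as '10,000+' into its lower bound as an int."""
    return int(installs_text.replace(',', '').replace('+', ''))


for label in ['10,000+', '50,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```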


In [22]:
from statistics import median
freq_table, count_of_records = genre_freq_table(android_free,1)

for key in freq_table:
    all_installs = 0
    genre = key
    genre_app_count = 0
    list_of_app_installs_in_genre = list()
    for i in android_free:
        total_installs = i[5].replace(',', '')
        total_installs = total_installs.replace('+', '')
        total_installs = float(total_installs)
        total_genre = i[1]  # category
        if total_genre == genre:
            all_installs += total_installs # sum of all installs for that category
            genre_app_count += 1
            list_of_app_installs_in_genre.append(total_installs)
    print(key,' Average:',int(round(int(all_installs)/genre_app_count,0)),' Median:', int(median(sorted(list_of_app_installs_in_genre))))
    

SPORTS  Average: 3638640  Median: 100000
NEWS_AND_MAGAZINES  Average: 9549178  Median: 50000
BUSINESS  Average: 1712290  Median: 1000
FAMILY  Average: 3695642  Median: 100000
TRAVEL_AND_LOCAL  Average: 13984078  Median: 100000
MEDICAL  Average: 120551  Median: 1000
PARENTING  Average: 542604  Median: 100000
PHOTOGRAPHY  Average: 17840110  Median: 1000000
COMMUNICATION  Average: 38456119  Median: 500000
VIDEO_PLAYERS  Average: 24727872  Median: 1000000
EVENTS  Average: 253542  Median: 1000
SOCIAL  Average: 23253652  Median: 100000
EDUCATION  Average: 1833495  Median: 1000000
FINANCE  Average: 1387692  Median: 10000
WEATHER  Average: 5074486  Median: 1000000
PRODUCTIVITY  Average: 16787331  Median: 100000
HEALTH_AND_FITNESS  Average: 4188822  Median: 500000
AUTO_AND_VEHICLES  Average: 647318  Median: 100000
MAPS_AND_NAVIGATION  Average: 4056942  Median: 100000
LIBRARIES_AND_DEMO  Average: 638504  Median: 10000
LIFESTYLE  Average: 1437816  Median: 10000
GAME  Average: 15588016  Median: 10

Even though we are using the lower bound of the range for each app, we can still use median values to gauge the distribution of apps within each genre.

Analysis of the averages and medians reveals the following.
The top three categories by average number of installs are COMMUNICATION (38 million+), VIDEO_PLAYERS (24 million+) and SOCIAL (23 million+). Their medians, however, are far lower than their averages: 500,000 for COMMUNICATION, 1,000,000 for VIDEO_PLAYERS and 100,000 for SOCIAL. We can deduce that a few leading apps in each of those categories inflate the averages. Let's have a look.

In [23]:
temp_list = list()
for app in android_free:
    if app[1] == 'COMMUNICATION': #category
        app_name = app[0]
        installs = app[5].replace(',', '')
        installs = installs.replace('+', '')
        installs = float(installs)
        temp_list.append((installs,app_name))
temp_list = sorted(temp_list, reverse = True)
i = 0
while i < 10:
    print(temp_list[i])
    i +=1
    

(1000000000.0, 'WhatsApp Messenger')
(1000000000.0, 'Skype - free IM & video calls')
(1000000000.0, 'Messenger – Text and Video Chat for Free')
(1000000000.0, 'Hangouts')
(1000000000.0, 'Google Chrome: Fast & Secure')
(1000000000.0, 'Gmail')
(500000000.0, 'imo free video calls and chat')
(500000000.0, 'Viber Messenger')
(500000000.0, 'UC Browser - Fast Download Private & Secure')
(500000000.0, 'LINE: Free Calls & Messages')


As expected, the COMMUNICATION category contains six apps in the top band of 1,000,000,000+ installs. Those outliers do affect the average, so let's rerun the previous code excluding every app in the 1,000,000,000+ and 500,000,000+ bands.
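
The top-ten listing can also be written with slicing, which avoids an `IndexError` when a genre has fewer than ten apps. The sketch below is only an illustration: `top_apps` is a hypothetical helper, and `sample` is made-up data that assumes the same column layout as `android_free` (name at index 0, category at index 1, installs at index 5).

```python
def top_apps(apps, genre, n=10, name_col=0, genre_col=1, installs_col=5):
    """Return up to n (installs, name) tuples for a genre, largest first."""
    rows = []
    for app in apps:
        if app[genre_col] == genre:
            installs = float(app[installs_col].replace(',', '').replace('+', ''))
            rows.append((installs, app[name_col]))
    # Slicing is safe even when fewer than n apps match.
    return sorted(rows, reverse=True)[:n]

# Made-up rows, for illustration only.
sample = [
    ['Chat One', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Chat Two', 'COMMUNICATION', '', '', '', '500,000,000+'],
    ['Mail Three', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Clip Four', 'VIDEO_PLAYERS', '', '', '', '100,000,000+'],
]

top = top_apps(sample, 'COMMUNICATION')
for row in top:
    print(row)
```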

In [24]:
from statistics import median

all_installs = 0
genre_app_count = 0
list_of_app_installs_in_genre = list()

for app in android_free:
    total_installs = app[5].replace(',', '')
    total_installs = total_installs.replace('+', '')
    total_installs = float(total_installs)
    total_genre = app[1]  # category
    total_installs_string =  app[5]
    if total_genre == 'COMMUNICATION':
        if total_installs_string not in ('1,000,000,000+','500,000,000+'):
            genre_app_count += 1
            all_installs += total_installs  # running total of installs for the genre
            list_of_app_installs_in_genre.append(total_installs)

print(' Average:',int(round(int(all_installs)/genre_app_count,0)),' Median:', int(median(sorted(list_of_app_installs_in_genre))))
    

 Average: 9191689  Median: 100000


Removing the most successful apps, those in the 500,000,000+ band and above, reduced the average by roughly 29 million. If we left all other genres unchanged, this would still place the genre in the top 11 by average downloads per app. The median went down to 100,000, which means that at least half of the remaining apps in this genre sit in the 100,000+ band or below.
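
The size of the drop can be checked directly from the two averages printed above. The snippet below is a small sketch: `parse_installs` is a hypothetical helper that mirrors the string cleaning used in the cells, and the two averages are copied from the outputs in this section.

```python
def parse_installs(raw):
    """Convert an install band such as '1,000,000+' to its lower bound as a float."""
    return float(raw.replace(',', '').replace('+', ''))

# COMMUNICATION averages copied from the outputs above.
avg_before = 38456119   # all free apps in the genre
avg_after = 9191689     # without the 1,000,000,000+ and 500,000,000+ bands

drop = avg_before - avg_after
print('Drop in average:', drop, f'({drop / avg_before:.0%} of the original average)')
print('Lower bound of the top band:', parse_installs('1,000,000,000+'))
```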

Let's investigate VIDEO_PLAYERS, the second most successful genre.

In [25]:
temp_list = list()
for app in android_free:
    if app[1] == 'VIDEO_PLAYERS': #category
        app_name = app[0]
        installs = app[5].replace(',', '')
        installs = installs.replace('+', '')
        installs = float(installs)
        temp_list.append((installs,app_name))
temp_list = sorted(temp_list, reverse = True)
i = 0
while i < 10:
    print(temp_list[i])
    i +=1
    

(1000000000.0, 'YouTube')
(1000000000.0, 'Google Play Movies & TV')
(500000000.0, 'MX Player')
(100000000.0, 'VivaVideo - Video Editor & Photo Movie')
(100000000.0, 'VideoShow-Video Editor, Video Maker, Beauty Camera')
(100000000.0, 'VLC for Android')
(100000000.0, 'Motorola Gallery')
(100000000.0, 'Motorola FM Radio')
(100000000.0, 'Dubsmash')
(50000000.0, 'Vote for')


Unsurprisingly, two very famous apps, YouTube and Google Play Movies & TV, are the reason behind the inflated average. Let's test again excluding those two apps and also the third one, MX Player, which sits in the 500,000,000+ band.

In [26]:
from statistics import median

all_installs = 0
genre_app_count = 0
list_of_app_installs_in_genre = list()

for app in android_free:
    total_installs = app[5].replace(',', '')
    total_installs = total_installs.replace('+', '')
    total_installs = float(total_installs)
    total_genre = app[1]  # category
    total_installs_string =  app[5]
    if total_genre == 'VIDEO_PLAYERS':
        if total_installs_string not in ('1,000,000,000+','500,000,000+'):
            genre_app_count += 1
            all_installs += total_installs  # running total of installs for the genre
            list_of_app_installs_in_genre.append(total_installs)

print(' Average:',int(round(int(all_installs)/genre_app_count,0)),' Median:', int(median(sorted(list_of_app_installs_in_genre))))


The genre filter in this cell must read 'VIDEO_PLAYERS': filtering on 'COMMUNICATION' here merely reproduces the result of cell 24 (average 9,191,689, median 100,000), which says nothing about video players. What the per-genre table does tell us is that the median across all VIDEO_PLAYERS apps is 1,000,000, so about half of the apps in that genre (almost 80 of 159) reach at least the 1,000,000+ band. Removing the three largest apps can only lower the 24.7 million average, because each of them is far above it.
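
Because the same loop is repeated for each genre with only the genre name changed, it is easy to leave a stale name behind. One way to reduce that risk is a small helper that takes the genre as a parameter. This is a sketch under assumptions rather than the notebook's own code: `genre_stats` and the `sample` rows are hypothetical, and the column positions are assumed to match `android_free`.

```python
from statistics import median

def genre_stats(apps, genre, excluded_bands=(), genre_col=1, installs_col=5):
    """Return (app_count, average, median) of installs for one genre.

    Rows whose raw install string is in excluded_bands are skipped.
    """
    installs = []
    for app in apps:
        raw = app[installs_col]
        if app[genre_col] == genre and raw not in excluded_bands:
            installs.append(float(raw.replace(',', '').replace('+', '')))
    if not installs:
        return 0, None, None
    return len(installs), sum(installs) / len(installs), median(installs)

# Made-up rows, for illustration only.
sample = [
    ['App A', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['App B', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['App C', 'VIDEO_PLAYERS', '', '', '', '100,000+'],
    ['App D', 'COMMUNICATION', '', '', '', '500,000,000+'],
]

count, average, med = genre_stats(
    sample, 'VIDEO_PLAYERS', excluded_bands=('1,000,000,000+', '500,000,000+'))
print('Count:', count, ' Average:', int(average), ' Median:', int(med))
```

With the genre passed in as an argument, each check becomes a single call, and the genre name appears in exactly one place per call.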

The genres we would like to look at closely now are PHOTOGRAPHY and SHOPPING.

In [27]:
temp_list = list()
for app in android_free:
    if app[1] == 'PHOTOGRAPHY': #category
        app_name = app[0]
        installs = app[5].replace(',', '')
        installs = installs.replace('+', '')
        installs = float(installs)
        temp_list.append((installs,app_name))
temp_list = sorted(temp_list, reverse = True)
i = 0
while i < 10:
    print(temp_list[i])
    i +=1

(1000000000.0, 'Google Photos')
(100000000.0, 'Z Camera - Photo Editor, Beauty Selfie, Collage')
(100000000.0, 'YouCam Perfect - Selfie Photo Editor')
(100000000.0, 'YouCam Makeup - Magic Selfie Makeovers')
(100000000.0, 'Sweet Selfie - selfie camera, beauty cam, photo edit')
(100000000.0, 'S Photo Editor - Collage Maker , Photo Collage')
(100000000.0, 'Retrica')
(100000000.0, 'PicsArt Photo Studio: Collage Maker & Pic Editor')
(100000000.0, 'PhotoGrid: Video & Pic Collage Maker, Photo Editor')
(100000000.0, 'Photo Editor Pro')


Google Photos is the only app in the PHOTOGRAPHY genre in the 1,000,000,000+ band, ten times the band of the next best apps. Let's remove it and check the average and median.

In [28]:
from statistics import median

all_installs = 0
genre_app_count = 0
list_of_app_installs_in_genre = list()

for app in android_free:
    total_installs = app[5].replace(',', '')
    total_installs = total_installs.replace('+', '')
    total_installs = float(total_installs)
    total_genre = app[1]  # category
    total_installs_string =  app[5]
    if total_genre == 'PHOTOGRAPHY':
        if total_installs_string not in ('1,000,000,000+','500,000,000+'):
            genre_app_count += 1
            all_installs += total_installs  # running total of installs for the genre
            list_of_app_installs_in_genre.append(total_installs)

print(' Average:',int(round(int(all_installs)/genre_app_count,0)),' Median:', int(median(sorted(list_of_app_installs_in_genre))))
    

 Average: 14062572  Median: 1000000


Removing the top-scoring app reduced the average number of downloads from almost 18 million to about 14 million. The median stays at 1 million, which means that half of all apps in that genre (130 out of 260) attracted at least 1 million downloads.
We can conclude that this genre is likely to keep growing in popularity. The ability to capture beautiful photos has long been one of the main selling points of mobile phones, and the demand for apps that let people edit and improve those photos is a natural result of the boom in mobile phone technology.

The **SHOPPING** genre is the second one we want to investigate. We already liked how it performs in the Apple store. According to the analysis of the Google Play data, SHOPPING placed 13th, with 199 apps averaging 7 million downloads and a very strong median of 1 million. The comparatively small gap between the average and the median (compared with other top-scoring genres), along with the high median value, is what prompted us to investigate this genre in more detail.
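
One simple way to express the gap between average and median is their ratio: the closer it is to 1, the less the average depends on a handful of giant apps. The sketch below uses the figures reported in this section. The SHOPPING pair uses the rounded values from the paragraph above, so its ratio is approximate.

```python
# (average, median) pairs copied from the per-genre output in this section.
genre_figures = {
    'COMMUNICATION': (38456119, 500000),
    'VIDEO_PLAYERS': (24727872, 1000000),
    'SOCIAL': (23253652, 100000),
    'PHOTOGRAPHY': (17840110, 1000000),
    'SHOPPING': (7000000, 1000000),  # rounded figures from the text, not the exact output
}

# Average-to-median ratio per genre.
ratios = {genre: avg / med for genre, (avg, med) in genre_figures.items()}

# Print from the smallest gap to the largest.
for genre, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(genre, ' Average/Median:', round(ratio, 1))
```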

In [29]:
temp_list = list()
for app in android_free:
    if app[1] == 'SHOPPING': #category
        app_name = app[0]
        installs = app[5].replace(',', '')
        installs = installs.replace('+', '')
        installs = float(installs)
        temp_list.append((installs,app_name))
temp_list = sorted(temp_list, reverse = True)
i = 0
while i < 10:
    print(temp_list[i])
    i +=1

(100000000.0, 'eBay: Buy & Sell this Summer - Discover Deals Now!')
(100000000.0, 'Wish - Shopping Made Fun')
(100000000.0, 'Flipkart Online Shopping App')
(100000000.0, 'Amazon Shopping')
(100000000.0, 'AliExpress - Smarter Shopping, Better Living')
(50000000.0, 'letgo: Buy & Sell Used Stuff, Cars & Real Estate')
(50000000.0, 'The birth')
(50000000.0, 'OLX - Buy and Sell')
(50000000.0, 'Myntra Online Shopping App')
(50000000.0, 'Mercado Libre: Find your favorite brands')


The top 5 apps in this category are all famous apps in the 100,000,000+ band. Let's remove them from the calculation and re-analyse.

In [30]:
from statistics import median

all_installs = 0
genre_app_count = 0
list_of_app_installs_in_genre = list()

for app in android_free:
    total_installs = app[5].replace(',', '')
    total_installs = total_installs.replace('+', '')
    total_installs = float(total_installs)
    total_genre = app[1]  # category
    total_installs_string =  app[5]
    if total_genre == 'SHOPPING':
        if total_installs_string not in ('100,000,000+',):  # trailing comma makes a tuple; a bare string would be a substring test
            genre_app_count += 1
            all_installs += total_installs  # running total of installs for the genre
            list_of_app_installs_in_genre.append(total_installs)

print(' Average:',int(round(int(all_installs)/genre_app_count,0)),' Median:', int(median(sorted(list_of_app_installs_in_genre))))
    

 Average: 4640921  Median: 1000000


The result is encouraging. The remaining apps in the SHOPPING genre still average 4.6 million downloads per app, and the median is unchanged at 1 million.
With more and more people doing their daily shopping online and looking for discounts and deals, the online shopping market looks set to keep growing quickly, which opens great opportunities for app developers targeting it.
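
One Python detail is worth keeping in mind when excluding a single install band: parentheses around one string do not create a tuple, so `in` performs a substring test instead of a membership test. The short sketch below illustrates the difference. The value `'0+'` is used purely as an illustrative install band.

```python
band = '100,000,000+'

as_string = ('100,000,000+')    # parentheses only: this is still a str
as_tuple = ('100,000,000+',)    # trailing comma: a one-element tuple

print(type(as_string).__name__, type(as_tuple).__name__)

# Against the string, `in` checks for a substring, so short values can match by accident.
print('0+' in as_string)

# Against the tuple, `in` checks for an equal element.
print('0+' in as_tuple)
print(band in as_tuple)
```

Using a one-element tuple (or a plain `!=` comparison) keeps the exclusion limited to the band that was actually intended.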


### Conclusion

In this project, we went through a complete data science workflow:

* We started by clarifying the goal of our project.
* We collected relevant data.
* We cleaned the data to prepare it for analysis.
* We analyzed the cleaned data.

All of the above led us to the conclusion that developing apps in the **shopping** genre may yield a very successful product. Apps in the shopping genre performed very well in both online stores: the average number of downloads per app compares well with other genres, and the median values are very strong and held up after the highest-scoring apps were removed.  
In addition, online sales have been growing rapidly for a long time, a trend we expect to continue as technology keeps changing and shoppers keep moving away from brick-and-mortar stores.