# Profitable applications profiles for the App Store and Google Play markets
This project is about analyzing information from two datasets: `Apple Store` and `Google Play Store` markets. The information will help us to conclude what types of application are more profitable to develop and give this infomation to the developing team. Our company develops free applications aimed at English-speaking users and the main source of revenue consists of in-applications ads. This means that our revenue for any given application is mostly influenced by the number of users that use our application. Our goal for this project is to analyze data to help our developers understand what kinds of applications are likely to attract more users.

### Opening and exploring the data
Links on the documentation of the datasets using in this project:

1) [AppleStore dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

Content:
* "id" : Application ID
* "track_name": Application Name
* "size_bytes": Size (in Bytes)
* "currency": Currency Type
* "price": Price amount
* "ratingcounttot": User Rating counts (for all version)
* "ratingcountver": User Rating counts (for current version)
* "user_rating" : Average User Rating value (for all version)
* "userratingver": Average User Rating value (for current version)
* "ver" : Latest version code
* "cont_rating": Content Rating
* "prime_genre": Primary Genre
* "sup_devices.num": Number of supporting devices
* "ipadSc_urls.num": Number of screenshots showed for display
* "lang.num": Number of supported languages
* "vpp_lic": Vpp Device Based Licensing Enabled

2) [GooglePlayStore dataset](https://www.kaggle.com/lava18/google-play-store-apps).

Content:

* "App" : Application Name
* "Category" : Application category
* "Rating" : User Rating counts
* "Reviews" : User Reviews counts
* "Size" : Size (in Megabytes)
* "Installs" : Number of installs
* "Type" : Paid/Unpaid
* "Price" : Price amount
* "Content Rating" : Minimum acceptable age
* "Genres" : Primary Genre
* "Last Updated" : Date of the last update
* "Current Ver" : Application current version
* "Android Ver" : Required android version

First of all, to start working with the information stored in the datasets we will extract it from a `CSV` files and assign it to variables:
1. `DataAppleStore` for the information from the `AppleStore` dataset.
2. `DataGooglePlayStore` for the information from the `GooglePlayStore` dataset.

To do it, we will use a function `extract_data` that takes one argument `directory` and returns information from the dataset in the "list of lists" format.

In [1]:
from csv import reader
def extract_data(directory):
    OpenDataset = open(directory, encoding = "utf8")
    ReadData = reader(OpenDataset)
    return list(ReadData)
DataAppleStore = extract_data('..\..\..\Datasets\First\AppleStore.csv')
DataGooglePlayStore = extract_data('..\..\..\Datasets\First\googleplaystore.csv')

To have a first look at the data we will write the `explore_data` function that takes 4 arguments:
1. `dataset` - a title of a dataset, in our case it `DataAppleStore` and `DataGooglePlayStore` titles.
2. `start` and `end` - the start and the end indexes of a given dataset to display a certain number of rows that we want to analyze.
3. `rows_and_columns` - this argument is used to indicate if we need to display the aggregated information about the number of rows and the number of columns on the interval of rows chosen in the previous step. The argument is "False" by default.

After execution, the function will print rows on the chosen interval and aggregated information about the number of rows and columns (only if the information is needed).

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

The next block of code will use the `explore_data` function to display several rows to have a first look at the data.

In [3]:
print('Exaples of the "Apple Store" data:')
explore_data(DataAppleStore[0:], 0, 3, True)
print('Exaples of the "Google Play Store" data:')
explore_data(DataGooglePlayStore[0:], 0, 3, True)

Exaples of the "Apple Store" data:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


Exaples of the "Google Play Store" data:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moa

From the output we can see that the `Apple Store` data set has 7197 applications. At a quick glance, the columns that might be useful for the purpose of our analysis are `track_name`, `price`, `rating_count_tot`, `user_rating`, and `prime_genre`.

Google Play data set has 10841 applications. Columns that might be useful for the purpose of our analysis are `App`, `Rating`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

### Deleting wrong data
The `Google Play` dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101) outlines an error for row 10472 if header is not encluded. Let's print this row and compare it against the header and another row that is correct.

In [4]:
print(DataGooglePlayStore[0])
print('\n')
print(DataGooglePlayStore[1])
print('\n')
print(DataGooglePlayStore[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The row 10472 corresponds to the application `Life Made WI-Fi Touchscreen Photo Frame`, and we can see that the rating is 19. This is clearly off because the maximum rating for a Google Play app is 5 (as mentioned in the [discussions section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), this problem is caused by a missing value in the 'Category' column).

The next block of code is used for deleting the row with wrong data from the data set assigned to a variable `DataGooglePlayStore`.

In [5]:
print(len(DataGooglePlayStore))
del DataGooglePlayStore[10473]
print(len(DataGooglePlayStore))

10842
10841


### Removing duplicate entries

In the next step, we will check if the datasets have duplicate entries. To do it, we will write the `check_duplicates` function which takes two arguments (a dataset and an index of a column that contains information about a name of an application), checks if there are rows with the same name, and return two lists: a list of unique values and a list of duplicate values.

In [6]:
def check_duplicates(dataset, name_index):
    duplicate_vals = []
    unique_vals = []
    for row in dataset:
        if row[name_index] in unique_vals:
            duplicate_vals.append(row[name_index])
        else:
            unique_vals.append(row[name_index])
    return unique_vals, duplicate_vals
AppleStore_Vals = check_duplicates(DataAppleStore[1:], 1) #We specify the range [1:] because we do not want to include header. An index of a column that contains information about a name of an application for the AppleStore dataset is 1.
Unique_AppleStore_Vals = list(AppleStore_Vals[0])
Duplicate_AppleStore_Vals = list(AppleStore_Vals[1])
print('Number of unique applications in the AppleStore dataset is: ', len(Unique_AppleStore_Vals))
print('Number of duplicate applications in the AppleStore dataset is: ', len(Duplicate_AppleStore_Vals))
print('\n')
GooglePlayStore_Vals = check_duplicates(DataGooglePlayStore[1:], 0) #We specify the range [1:] because we do not want to include header. An index of a column that contains information about a name of an application for the AppleStore dataset is 0.
Unique_GooglePlayStore_Vals = list(GooglePlayStore_Vals[0])
Duplicate_GooglePlayStore_Vals = list(GooglePlayStore_Vals[1])
print('Number of unique applications in the GooglePlayStore dataset is: ', len(Unique_GooglePlayStore_Vals))
print('Number of duplicate applications in the GooglePlayStore dataset is: ', len(Duplicate_GooglePlayStore_Vals))

Number of unique applications in the AppleStore dataset is:  7195
Number of duplicate applications in the AppleStore dataset is:  2


Number of unique applications in the GooglePlayStore dataset is:  9659
Number of duplicate applications in the GooglePlayStore dataset is:  1181


As you can see, the number of duplicate applications in the `Apple Store` dataset is 2 and the number of duplicate applications in the `Google Play Store` dataset is 1181, also the number of unique applications in the `Apple Store` dataset is 7195 and the number of unique applications in the `Google Play Store` dataset is 9659.

We do not want to count certain applications more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per application.
Let's explore duplicate values of the datasets to see if we can build a criterion that will allow us to remove duplicate values from the data sets in a more specific way. To do it, we will use the `explore_duplicates` function which takes a dataset, a list of duplicates and an index of a column which contains names of applications and after execution returns a list of lists contains rows of repeated applications.

In [7]:
def explore_duplicates(dataset, duplicates_list, name_index):
    list_of_duplicates = []
    for value in duplicates_list:
        for row in dataset:
            if value == row[name_index]:
                list_of_duplicates.append(row)
    return list_of_duplicates
Duplicates_DataAppleStore = explore_duplicates(DataAppleStore, Duplicate_AppleStore_Vals, 1)
print('Duplicates of the "Apple Store" dataset:')
explore_data(Duplicates_DataAppleStore, 0, 4)
Duplicates_GooglePlayStore = explore_duplicates(DataGooglePlayStore, Duplicate_GooglePlayStore_Vals, 0)
print('Duplicates of the "Google Play Store" dataset:')
explore_data(Duplicates_GooglePlayStore, 0, 3)

Duplicates of the "Apple Store" dataset:
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']


['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']


['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


Duplicates of the "Google Play Store" dataset:
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with de

After exploring the output of the `explore_duplicates` function we see that the main difference happens on a columns which correspond to the number of ratings/reviews (column five for the `Apple Store` market and column three for the `Google Play Store` market). Rather than removing duplicates randomly, we can use values that represent a number of reviews to build a criterion which in turn will allow us to sort duplicate values by the "number of reviews/ratings" column and keep the rows that have the highest number of reviews/ratings (the more reviews/ratings the row has, the more recent and reliable the data is).

To do it we will use the `find_max_reviews` function that takes three arguments:
1. `dataset` - a title of the passed dataset.
2. `name_index` - an index of a column contains names of applications.
3. `reviews_index` - an index of a column contains a number of reviews of applications.

The `find_max_reviews` function returns a dictionary where each key is a unique application name and the corresponding value is the highest number of reviews of that application.

In [8]:
def find_max_reviews(dataset, name_index, reviews_index):
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
    return reviews_max
Max_Reviews_AppleStore = find_max_reviews(DataAppleStore[1:], 1, 5) #We specify the range [1:] because we do not want to include header
Max_Reviews_DataGooglePlayStore = find_max_reviews(DataGooglePlayStore[1:], 0, 3) #We specify the range [1:] because we do not want to include header

To make sure everything went as expected, let's print a length of the `Max_Reviews_AppleStore` and `Max_Reviews_DataGooglePlayStore` dictionaries. As we remember, the number of unique applications in the `Apple Store` data set is 7195 and the number of unique applications in the `Google Play Store` data set is 9659, so the lengths of the `Max_Reviews_AppleStore` and `Max_Reviews_DataGooglePlayStore` dictionaries must be 7195 and 9659 accordingly.

In [9]:
print('The lenght of the \"Max_Reviews_AppleStore\" data set is:', len(Max_Reviews_AppleStore))
print('The lenght of the \"Max_Reviews_DataGooglePlayStore\" data set is:', len(Max_Reviews_DataGooglePlayStore))

The lenght of the "Max_Reviews_AppleStore" data set is: 7195
The lenght of the "Max_Reviews_DataGooglePlayStore" data set is: 9659


As we can see, everything work as expected.

Now, let's use the `Max_Reviews_AppleStore` and `Max_Reviews_DataGooglePlayStore` datasets to remove the duplicates.
In the next block of code we reassemble data from the `Apple Store` and `Google Play Store` datasets in the next way: based on the "highest number of reviews/ratings" criterion we will loop through a list of unique applications from the datasets and leave the only ones that have the highest number of reviews.

To do it we will use the `cleaning_data` function which does the following:
1. Takes four arguments: `dataset` - a name of a given dataset, `name_index` - an index of a column contains names of applications, `reviews_index` - index of a column contains a number of reviews/ratings of the applications, `max_reviews_dict` - a title of the dictionaries from `step seven` where keys are unique applications names and values are the highest numbers of reviews of those applications.
2. Loops through a given data set and for each row checks if the current number of reviews equal to a value from the `max_reviews_dict` for that application (which means that the current value of reviews is the maximum value) and a name of the application occurs in the loop for the first time.
3. Returns the `data_cleaned` list.

In [10]:
def cleaning_data(dataset, name_index, reviews_index, max_reviews_dict):
    data_cleaned = []
    already_added = []
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == max_reviews_dict[name] and name not in already_added:
            data_cleaned.append(row)
            already_added.append(name)
    return data_cleaned
Cleaned_DataAppleStore = cleaning_data(DataAppleStore[1:], 1, 5, Max_Reviews_AppleStore)
Cleaned_DataGooglePlayStore = cleaning_data(DataGooglePlayStore[1:], 0, 3, Max_Reviews_DataGooglePlayStore)

Let's explore the `Cleaned_DataAppleStore` and `Cleaned_DataGooglePlayStore` datasets to ensure everything went as expected.
To do it we will compare lengths of the `Cleaned_DataAppleStore`, `Cleaned_DataGooglePlayStore` and `Unique_AppleStore_vals`, `Unique_GooglePlayStore_vals` accordingly and if the values will be the same then in the first approximation the `cleaning_data` function works correctly.

In [11]:
if len(Cleaned_DataAppleStore) == len(Unique_AppleStore_Vals):
    print('Lenght of "Cleaned_DataAppleStore" is equal to ' + str(len(Cleaned_DataAppleStore)) + ', lenght of "Unique_AppleStore_Vals" is equal to ' + str(len(Unique_AppleStore_Vals)) + ' so lenghts are identical.')
else:
    print('Lenght of "Cleaned_DataAppleStore" is equal to ' + str(len(Cleaned_DataAppleStore)) + ', lenght of "Unique_AppleStore_vals" is equal to ' + str(len(Unique_AppleStore_Vals)) + ' so lenghts are not identical.')
if len(Cleaned_DataGooglePlayStore) == len(Unique_GooglePlayStore_Vals):
    print('Lenght of "Cleaned_DataGooglePlayStore" is equal to ' + str(len(Cleaned_DataGooglePlayStore)) + ', lenght of "Unique_GooglePlayStore_vals" is equal to ' + str(len(Unique_GooglePlayStore_Vals)) + ' so lenghts are identical.')
else:
    print('Lenght of "Cleaned_DataGooglePlayStore" is equal to ' + str(len(Cleaned_DataGooglePlayStore)) + ', lenght of "Unique_GooglePlayStore_vals" is equal to ' + str(len(Unique_GooglePlayStore_Vals)) + ' so lenghts are not identical.')

Lenght of "Cleaned_DataAppleStore" is equal to 7195, lenght of "Unique_AppleStore_Vals" is equal to 7195 so lenghts are identical.
Lenght of "Cleaned_DataGooglePlayStore" is equal to 9659, lenght of "Unique_GooglePlayStore_vals" is equal to 9659 so lenghts are identical.


Let's explore the `Cleaned_DataAppleStore` and `Cleaned_DataGooglePlayStore` lists more in-depth to finally make sure that the `cleaning_data` function worked correctly. 

To do it we will use the `verify_func` that takes three arguments: `raw_dataset` -  a title of a dataset that has not been reassembled based on the "highest number of reviews/ratings" criterion, `cleaned_dataset` - a title of a dataset that has been reassembled on the `ninth step`, `title` - a string what represents a title of an application. The function will loop through two given datasets and return only the rows in which the `Application` column equal to a string assigned to `title`.

In [12]:
def verify_func(raw_dataset, cleaned_dataset, title):
    raw_data_dict = []
    cleaned_data_dict = []
    for row in raw_dataset:
        name = row[0]
        if name == title:
            raw_data_dict.append(row)
    for row in cleaned_dataset:
        name = row[0]
        if name == title:
            cleaned_data_dict.append(row)
    return raw_data_dict, cleaned_data_dict

Verified_Result = verify_func(DataGooglePlayStore[1:], Cleaned_DataGooglePlayStore, 'Instagram')
Verified_DataGooglePlayStore = Verified_Result[0]
Verified_Cleaned_DataGooglePlayStore = Verified_Result[1]

explore_data(Verified_DataGooglePlayStore, 0, 5, True)
explore_data(Verified_Cleaned_DataGooglePlayStore, 0, 1, True)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Number of rows: 4
Number of columns: 13


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Number of rows: 1
Number of columns: 13




As we can see in the block of code above, the `Google Play Store` dataset which contains "raw" data has four rows in which the number of reviews varies from 66509917 to 66577446. After using the `cleaning_data` function we have the `Cleaned_DataGooglePlayStore` list which contains only one row in which a number of reviews/ratings is a maximal value from the ones we have in the `Google Play Store` dataset. Based on these results, we can make a conclusion that the first stage of cleaning data executed correctly.

### Removing non-English applications
If we explore the datasets further, we will notice what the names of some of the applications consist of non-English characters.
Our next step of cleaning the data will be removing such applications.
In the next step, we will write a function which takes a string and return `False` if the string has more then three non-English characters. To check if a string has non-English characters we will use a build-in function `ord` which returns corresponding encoding ASCII number of each character and we know what characters that are specific to English texts lie between 0 and 127.

In [13]:
def check_characters(string):
    non_Eng_counter = 0
    for character in string:
        if ord(character) > 127:
            non_Eng_counter += 1
    if non_Eng_counter > 3:
        return False
    return True

To check whether the function is working correctly let's pass as an argument strings which have non-English characters.

In [14]:
string_1 = 'Instagram'
string_2 = '爱奇艺PPS -《欢乐颂2》电视剧热播'
string_3 = 'Docs To Go™ Free Office Suite'
string_4 = 'Instachat 😜'
print(check_characters(string_1))
print(check_characters(string_2))
print(check_characters(string_3))
print(check_characters(string_4))

True
False
True
True


At the first glance, it seems what the function works correctly, but if we explore the data sets a little longer we will notice what titles of some applications contain emoji or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Let's solve this problem.
In the next block of code, we write the `non_Eng_cleaning_data` function which takes a dataset and leaves in it only those rows in which a name of an application does not have more than three non-English characters. To ensure everything works as expected, let's pass to the `non_Eng_cleaning_data` a test case dataset that looks like this `[[1, 'Instagram'], [2, '爱奇艺PPS -《欢乐颂2》电视剧热播'], [3, 'Docs To Go™ Free Office Suite'], [4, 'Instachat 😜']]`. We expect that the `non_Eng_cleaning_data` function will return the data set like this `[[1, 'Instagram'], [3, 'Docs To Go™ Free Office Suite'], [4, 'Instachat 😜']]` (without the second list).

In [15]:
def non_Eng_cleaning_data(dataset, name_index, check_func):
    cleaned_list = []
    for row in dataset:
        if check_func(row[name_index]):
            cleaned_list.append(row)
    return cleaned_list
test_dataset = [[1, 'Instagram'], [2, '爱奇艺PPS -《欢乐颂2》电视剧热播'], [3, 'Docs To Go™ Free Office Suite'], [4, 'Instachat 😜']]
the_result = non_Eng_cleaning_data(test_dataset, 1, check_characters)
print(the_result)

[[1, 'Instagram'], [3, 'Docs To Go™ Free Office Suite'], [4, 'Instachat 😜']]


As we can see above, everything went as expected.

Let's apply the `non_Eng_cleaning_data` function to the `Cleaned_DataAppleStore` and `Cleaned_DataGooglePlayStore` lists after that compare lenghts of the `Cleaned_DataAppleStore` and `Cleaned_DataGooglePlayStore` lists before and after usage the `non_Eng_cleaning_data` function.

In [16]:
Cleaned_nonEng_DataAppleStore = non_Eng_cleaning_data(Cleaned_DataAppleStore, 1, check_characters)
print('A lenght of the "Cleaned_DataAppleStore" is ' + str(len(Cleaned_DataAppleStore)))
print('A lenght of the "Cleaned_nonEng_DataAppleStore" is ' + str(len(Cleaned_nonEng_DataAppleStore)))
NonEngapp_DataAppleStore_lenght = len(Cleaned_DataAppleStore) - len(Cleaned_nonEng_DataAppleStore)
print('A number of applications contain non-English characters in the names for "Cleaned_DataAppleStore" is: ' + str(NonEngapp_DataAppleStore_lenght))
print('\n')
Cleaned_nonEng_DataGooglePlayStore = non_Eng_cleaning_data(Cleaned_DataGooglePlayStore, 0, check_characters)
print('A lenght of the "Cleaned_DataGooglePlayStore" is ' + str(len(Cleaned_DataGooglePlayStore)))
print('A lenght of the "Cleaned_nonEng_DataGooglePlayStore" is ' + str(len(Cleaned_nonEng_DataGooglePlayStore)))
NonEngapp_DataGooglePlayStore_lenght = len(Cleaned_DataGooglePlayStore) - len(Cleaned_nonEng_DataGooglePlayStore)
print('A number of applications contain non-English characters in the names for "Cleaned_DataGooglePlayStore" is: ' + str(NonEngapp_DataGooglePlayStore_lenght))

A lenght of the "Cleaned_DataAppleStore" is 7195
A lenght of the "Cleaned_nonEng_DataAppleStore" is 6181
A number of applications contain non-English characters in the names for "Cleaned_DataAppleStore" is: 1014


A lenght of the "Cleaned_DataGooglePlayStore" is 9659
A lenght of the "Cleaned_nonEng_DataGooglePlayStore" is 9614
A number of applications contain non-English characters in the names for "Cleaned_DataGooglePlayStore" is: 45


### Isolating the free applications
As we mentioned in the introduction, our company only builds applications that are free to download and install. At this point, our datasets contain both free and non-free applications. Our next step will be isolating the free applications. If we will exlore `price` column of the `Google Play Store` we will see, what some values of the column contain `$` character. To isolate  free applications we will need to make a comparison with zero by converting the string value of the price to float using the built-in function `float`. The problem is the `float` function throws an error if we pass it a character that can not be converted in float format such as `$`. To solve the problem, we will use slicing to remove `$` characters from prices.

In [17]:
def is_free_func(a_dataset, price_index):
    free_apps = []
    for row in a_dataset:
        app_price = row[price_index][-4:]
        if float(app_price) == 0.0:
            free_apps.append(row)
    return free_apps
Free_Cleaned_DataAppleStore = is_free_func(Cleaned_nonEng_DataAppleStore, 4)
print('A number of free applications in the "Apple Store" dataset is: ' + str(len(Free_Cleaned_DataAppleStore)))
Free_Cleaned_DataGooglePlayStore = is_free_func(Cleaned_nonEng_DataGooglePlayStore, 7)
print('A number of free applications in the "Google Play Store" dataset is: ' + str(len(Free_Cleaned_DataGooglePlayStore)))

A number of free applications in the "Apple Store" dataset is: 3220
A number of free applications in the "Google Play Store" dataset is: 8868


### Most common applications by genre
As we mentioned earlier, our aim is to determine the kinds of applications that are likely to attract more users because our revenue is highly influenced by the number of people using our application.

To minimize risks, our validation strategy for an application idea is comprised of three steps:
1. Build a minimal Android version of the application, and add it to the `Google Play`.
2. If the application has a good response from users, we develop it further.
3. If the application is profitable after six months, we build an iOS version of the application and add it to the `Apple Store`.

Because our end goal is to add the application on both `Google Play` and `Apple Store` merkets, we need to find application profiles that are successful on both markets. To do it let's inspect both data sets and identify columns we can use to make the conclusion. Let's start by determining the most popular genres of the applications for both markets.

Let's begin the analysis by getting a sense of the most common genres for each market.
First of all, we have to identify the columns having information about genre for both data sets.

In [18]:
explore_data(DataAppleStore[0:], 0, 1)
explore_data(DataGooglePlayStore[0:], 0, 1)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




As we can see above, for the `Apple Store` dataset, a column having information about the genre is the `prime_genre` which has an index equal to `11` and for the `Google Play Store` it is the `Category` and `Genres` columns which have indexes equal to `1` and `9`, respectively.

Our next step will be creating frequency tables for the genres columns. To do it we will write two functions:
1. The `freq_table` function generates frequency tables that show percentages of the most popular genres. The function takes two arguments:
   * `dataset` -  a title of the passed dataset.
   * `genre_index` - an index of the dataset containing information about of an application depending on the criterion of interest, in our case it will be `genre`.

In [19]:
def freq_table(dataset, genre_index):
    total_numb_apps = len(dataset)
    freq_table = {}
    for row in dataset:
        genre = row[genre_index]
        if genre in freq_table:
            freq_table[genre] += 1
        else:
            freq_table[genre] = 1
    for key in freq_table:
        freq_table[key] = round((freq_table[key] / total_numb_apps) * 100, 2)
    return freq_table

2. The `display_table` function displays the percentages in descending order. As we know, dictionaries do not have order, and it will be very difficult to analyze the frequency tables. For this reason, we have to write a function that can help us display the entries in the frequency table in descending order. To do that, we'll use the built-in `sorted` function, however, function `sorted` does not work too well with dictionaries because it only considers and returns the dictionary keys. To solve the problem, we have to transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. To ensure the sorting works right, the dictionary value comes first, and the dictionary key comes second.

In [20]:
def display_table(freq_table):
    table = freq_table
    table_to_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_to_display.append(key_val_as_tuple)
    table_sorted = sorted(table_to_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0], '%')


Now, let's examine the frequency table for the `prime_genre` column of the `App Store` dataset.

In [21]:
Common_Genres_DataAppleStore = freq_table(Free_Cleaned_DataAppleStore, 11)
display_table(Common_Genres_DataAppleStore)

Games : 58.14 %
Entertainment : 7.89 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.52 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.34 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


As we can see above, the most common genre among free English-speaking applications is games - 58.14% of the total number. Entertainment and Photo & Video applications in the second and third place - 7.89% and 4.97%, respectively, Education is on the fourth-place - 3.66%. Based on this information, we can draw a conclusion what `Apple Store` is dominated by applications that are designed for fun, while applications with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun applications are the most numerous does not also imply that they also have the greatest number of users.

Let's continue by exploring the `Category` and `Genres` columns of the `Google Play Store` data set.

In [22]:
Common_Categories_DataGooglePlayStore = freq_table(Free_Cleaned_DataGooglePlayStore, 1)
display_table(Common_Categories_DataGooglePlayStore) #Category

FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.91 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.55 %
SPORTS : 3.39 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.33 %
SHOPPING : 2.24 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.92 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %


As we can see, the most popular categories of the `Google Play Store` market is different from the one we see in the `Apple Store`. On the first positions applications, designed for practical purposes: family, tools, business, etc.

The next step will be exploring `Genres` columns of the `Google Play Store` data set.

In [23]:
Common_Genres_DataGooglePlayStore = freq_table(Free_Cleaned_DataGooglePlayStore, 9)
display_table(Common_Genres_DataGooglePlayStore) #Genre

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.36 %
Business : 4.59 %
Lifestyle : 3.9 %
Productivity : 3.89 %
Finance : 3.7 %
Medical : 3.55 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.24 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.92 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %
Ra

We can tell, what the most common applications of the markets differ from each other: on `Apple Store` market, it is applications created for entertainment such as games, social media, etc. while on the `Google Play Store` it is applications designed for practical purposes: tools, education, business, etc.

### Most popular applications by genre
Based on the received information, we still can not recommend a profitable application for the markets. Now we would like to get an idea about the kind of applications that have the most users. The criterion that characterizes a number of users can be amount of installs. In the `Google Play Store` data set there is the `Installs` column, but in the `Apple Store` data set there are no columns that represent the number of installs, instead, we can use the `rating_count_tot` column (the more number of ratings the more times the applications were installed).

### Apple Store

To get a number of reviews of the `Apple Store` applications by genre we will use the `vals_per_criterion` function. The function takes a frequency table, a dataset, a genre index and an installs index, after that for each key (which is a main genre of the application) in the frequency table compute an average number of ratings/installs for the applications whose genre matches with the key.

In [24]:
def reviews_by_genre(freq_table, dataset, genre_index, installs_index, remove_characters = False, is_applestore = True):
    reviews_by_genre_vals = {}
    for key in freq_table:
        total = 0
        len_genre = 0
        for app in dataset:
            genre_app = app[genre_index]
            if genre_app == key:
                installs = app[installs_index]
                if remove_characters:
                    installs = installs.replace('+','')
                    installs = installs.replace(',','')
                installs = float(installs)
                total += installs
                len_genre += 1
        avg_rating = round(total/len_genre, 2)
        reviews_by_genre_vals[key] = avg_rating
    if is_applestore:
        for key in reviews_by_genre_vals:
            print(key, ':', reviews_by_genre_vals[key],'ratings')
    else:
        for key in reviews_by_genre_vals:
            print(key, ':', reviews_by_genre_vals[key],'installs')
    return reviews_by_genre_vals
Ratings_Per_Genre_DataAppleStore = reviews_by_genre(Common_Genres_DataAppleStore, Free_Cleaned_DataAppleStore, 11, 5)

Social Networking : 71548.35 ratings
Photo & Video : 28441.54 ratings
Games : 22812.92 ratings
Music : 57326.53 ratings
Reference : 74942.11 ratings
Health & Fitness : 23298.02 ratings
Weather : 52279.89 ratings
Utilities : 18684.46 ratings
Travel : 28243.8 ratings
Shopping : 26919.69 ratings
News : 21248.02 ratings
Navigation : 86090.33 ratings
Lifestyle : 16485.76 ratings
Entertainment : 14029.83 ratings
Food & Drink : 33333.92 ratings
Sports : 23008.9 ratings
Book : 39758.5 ratings
Finance : 31467.94 ratings
Education : 7003.98 ratings
Productivity : 21028.41 ratings
Business : 7491.12 ratings
Catalogs : 4004.0 ratings
Medical : 612.0 ratings


To simplify the analysis, we will arrange the output in descending order using the `descending_order` function.

In [25]:
def descending_order(freq_table, applestore=False):
    table_to_display = []
    for key in freq_table:
        key_val_as_tuple = (freq_table[key], key)
        table_to_display.append(key_val_as_tuple)
    table_sorted = sorted(table_to_display, reverse = True)
    if applestore:
        for entry in table_sorted:
            print(entry[1], ':', entry[0], 'ratings')
    else: 
        for entry in table_sorted:
            print(entry[1], ':', entry[0], 'installs')
    return table_sorted
Ratings_Per_Genre_DataAppleStore_Desc = descending_order(Ratings_Per_Genre_DataAppleStore, True)

Navigation : 86090.33 ratings
Reference : 74942.11 ratings
Social Networking : 71548.35 ratings
Music : 57326.53 ratings
Weather : 52279.89 ratings
Book : 39758.5 ratings
Food & Drink : 33333.92 ratings
Finance : 31467.94 ratings
Photo & Video : 28441.54 ratings
Travel : 28243.8 ratings
Shopping : 26919.69 ratings
Health & Fitness : 23298.02 ratings
Sports : 23008.9 ratings
Games : 22812.92 ratings
News : 21248.02 ratings
Productivity : 21028.41 ratings
Utilities : 18684.46 ratings
Lifestyle : 16485.76 ratings
Entertainment : 14029.83 ratings
Business : 7491.12 ratings
Education : 7003.98 ratings
Catalogs : 4004.0 ratings
Medical : 612.0 ratings


As we can see, for the `Apple Store` market the highest number of ratings have applications whose main genre is navigation, reference and social networking. Let's explore which applications are related to these genres. To do it, we will use the `print_names_by_genre` function, which takes a dataset and print a title and a number of ratings of the application whose genre matches with a genre that passed to the function as an argument.

In [26]:
def print_installs_by_genre(dataset, prime_genre_index, genre, name_index, installs_index, applestore=False):
    for app in dataset:
        if app[prime_genre_index] == genre:
            if applestore:
                print(app[name_index], ':', app[installs_index], ' ratings')
            else: 
                print(app[name_index], ':', app[installs_index], ' installs')

We start with analyzing applications which genre is navigation.

In [27]:
print_installs_by_genre(Free_Cleaned_DataAppleStore, 11, 'Navigation', 1, 5, True)

Waze - GPS Navigation, Maps & Real-time Traffic : 345046  ratings
Google Maps - Navigation & Transit : 154911  ratings
Geocaching® : 12811  ratings
CoPilot GPS – Car Navigation & Offline Maps : 3582  ratings
ImmobilienScout24: Real Estate Search in Germany : 187  ratings
Railway Route Search : 5  ratings


As we can see, the largest number of reviews for applications of this genre relate to two applications: `Waze - GPS Navigation, Maps & Real-time Traffic` and `Google Maps` while the rest of the applications have few ratings relatively. This distribution of ratings may testify that the demand for this kind of applications is probably satisfied. Based on this, there is no point in making an application of this type due to the high competition, complexity and cost of development.

Next, we will be analyzing applications which genre is Reference.

In [28]:
print_installs_by_genre(Free_Cleaned_DataAppleStore, 11, 'Reference', 1, 5)

Bible : 985920  installs
Dictionary.com Dictionary & Thesaurus : 200047  installs
Dictionary.com Dictionary & Thesaurus for iPad : 54175  installs
Google Translate : 26786  installs
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418  installs
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588  installs
Merriam-Webster Dictionary : 16849  installs
Night Sky : 12122  installs
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535  installs
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693  installs
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497  installs
Guides for Pokémon GO - Pokemon GO News and Cheats : 826  installs
WWDC : 762  installs
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718  installs
VPN Express : 14  installs
Real Bike Traffic Rider Virtual Reality Glasses : 8  installs
教えて!goo : 0  installs
Jishok

Distribution of ratings differs from the one we saw in the previous step. Here, we can see that the difference between the number of ratings is not so big, and the ratings are distributed more evenly. We can assign applications of the category to three main groups: dictionaries and translators, games manuals, and spiritual literature. Let's analyze the applications of these three groups.
* **Dictionaries and translators**. Development of the next dictionary/translator will require a large number of costs and time, in addition, existing applications seems already have all the features that required by users. In addition to this, as we know for the `Apple Store` market most popular applications are the ones that were built for fun, and a dictionary/translator applications do not belong to this category, hence, developing that kind of applications will be probably unprofitable.
* **Games manuals**. This group seems to show some potential, let's take a closer look. First of all, applications from this group correlate with the concept of entertainment application. People playing games usually spend recent amount of time in games and in application related to the games such as forums, markets, etc. If the game has complex game mechanics, a lot of in-game quests, character leveling systems, etc. then users with high probability will need a some kind of application where they can discuss all these aspects of the game. One of the most influential factor affecting the commercial success of the application will be the correct choice of the game.
* **Spiritual literature**. Applications from this group are unlikely to be profitable because a number of users for such applications depends on religious orientation of the users. Applications for Christians and Muslims, the two most common religions in the world, are already exist, so development another one probably will not be successful, only if we add some unique and useful features in the application.

Next, we will be analyzing applications which genre is Social Networking.

In [29]:
print_installs_by_genre(Free_Cleaned_DataAppleStore, 11, 'Social Networking', 1, 5)

Facebook : 2974676  installs
Pinterest : 1061624  installs
Skype for iPhone : 373519  installs
Messenger : 351466  installs
Tumblr : 334293  installs
WhatsApp Messenger : 287589  installs
Kik : 260965  installs
ooVoo – Free Video Call, Text and Voice : 177501  installs
TextNow - Unlimited Text + Calls : 164963  installs
Viber Messenger – Text & Call : 164249  installs
Followers - Social Analytics For Instagram : 112778  installs
MeetMe - Chat and Meet New People : 97072  installs
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414  installs
InsTrack for Instagram - Analytics Plus More : 85535  installs
Tango - Free Video Call, Voice and Chat : 75412  installs
LinkedIn : 71856  installs
Match™ - #1 Dating App. : 60659  installs
Skype for iPad : 60163  installs
POF - Best Dating App for Conversations : 52642  installs
Timehop : 49510  installs
Find My Family, Friends & iPhone - Life360 Locator : 43877  installs
Whisper - Share, Express, Meet : 39819  installs
Hangouts : 36404  ins

Development of an application in the genre of social networking can be very profitable. As we can see from the data above, in this genre, there is one application that is distinguished by the number of ratings: Facebook. But if we take a look at the rest of the applications, we will see that the distribution of ratings between applications is more evenly. This aspect allows us to conclude that the number of users of one or another social network strongly depends on design and features that the social network has and also for what group of people the social network was made.

**Conclusion**: based on the results of the analysis of applications with the most popular genres, we can say that the most profitable applications for the `Apple Store` market could be a game manual of a popular game or a social network having good design, unique features and created for a large group of people united by a certain characteristic like social network for designers where users can send to each other layouts in the original, conduct presentations with the customers, etc.

### Google Play market.
For the `Google Play` market, we actually have data about the number of installs (column 5), so we should be able to get a clearer picture about genre popularity.

In [30]:
Installs_DataGooglePlayStore = freq_table(Free_Cleaned_DataGooglePlayStore, 5)
display_table(Installs_DataGooglePlayStore)

1,000,000+ : 15.72 %
100,000+ : 11.55 %
10,000,000+ : 10.54 %
10,000+ : 10.22 %
1,000+ : 8.4 %
100+ : 6.91 %
5,000,000+ : 6.82 %
500,000+ : 5.56 %
50,000+ : 4.77 %
5,000+ : 4.51 %
10+ : 3.54 %
500+ : 3.25 %
50,000,000+ : 2.3 %
100,000,000+ : 2.13 %
50+ : 1.92 %
5+ : 0.79 %
1+ : 0.51 %
500,000,000+ : 0.27 %
1,000,000,000+ : 0.23 %
0+ : 0.06 %
0 : 0.01 %


The install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). For instance, we do not know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we do not need very precise data for our purposes — we only want to get an idea which application genres attract the most users, and we do not need perfect precision with respect to the number of users.

We are going to leave the numbers as they are, which means that we will consider that an application with 100,000+ installs has 100,000 installs, and an application with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, we will need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.

In [31]:
Installs_Per_Genre_DataGooglePlayStore = reviews_by_genre(Common_Categories_DataGooglePlayStore, Free_Cleaned_DataGooglePlayStore, 1, 5, True, False)

ART_AND_DESIGN : 1986335.09 installs
AUTO_AND_VEHICLES : 647317.82 installs
BEAUTY : 513151.89 installs
BOOKS_AND_REFERENCE : 8767811.89 installs
BUSINESS : 1712290.15 installs
COMICS : 817657.27 installs
COMMUNICATION : 38456119.17 installs
DATING : 854028.83 installs
EDUCATION : 1833495.15 installs
ENTERTAINMENT : 11640705.88 installs
EVENTS : 253542.22 installs
FINANCE : 1387692.48 installs
FOOD_AND_DRINK : 1924897.74 installs
HEALTH_AND_FITNESS : 4188821.99 installs
HOUSE_AND_HOME : 1331540.56 installs
LIBRARIES_AND_DEMO : 638503.73 installs
LIFESTYLE : 1433701.52 installs
GAME : 15588015.6 installs
FAMILY : 3693438.69 installs
MEDICAL : 119816.97 installs
SOCIAL : 23253652.13 installs
SHOPPING : 7036877.31 installs
PHOTOGRAPHY : 17840110.4 installs
SPORTS : 3638640.14 installs
TRAVEL_AND_LOCAL : 13984077.71 installs
TOOLS : 10801391.3 installs
PERSONALIZATION : 5201482.61 installs
PRODUCTIVITY : 16787331.34 installs
PARENTING : 542603.62 installs
WEATHER : 5074486.2 installs
VIDEO

To simplify the analysis, we will arrange the output in descending order.

In [32]:
Installs_Per_Genre_DataGooglePlayStore_Desc = descending_order(Installs_Per_Genre_DataGooglePlayStore)

COMMUNICATION : 38456119.17 installs
VIDEO_PLAYERS : 24727872.45 installs
SOCIAL : 23253652.13 installs
PHOTOGRAPHY : 17840110.4 installs
PRODUCTIVITY : 16787331.34 installs
GAME : 15588015.6 installs
TRAVEL_AND_LOCAL : 13984077.71 installs
ENTERTAINMENT : 11640705.88 installs
TOOLS : 10801391.3 installs
NEWS_AND_MAGAZINES : 9549178.47 installs
BOOKS_AND_REFERENCE : 8767811.89 installs
SHOPPING : 7036877.31 installs
PERSONALIZATION : 5201482.61 installs
WEATHER : 5074486.2 installs
HEALTH_AND_FITNESS : 4188821.99 installs
MAPS_AND_NAVIGATION : 4056941.77 installs
FAMILY : 3693438.69 installs
SPORTS : 3638640.14 installs
ART_AND_DESIGN : 1986335.09 installs
FOOD_AND_DRINK : 1924897.74 installs
EDUCATION : 1833495.15 installs
BUSINESS : 1712290.15 installs
LIFESTYLE : 1433701.52 installs
FINANCE : 1387692.48 installs
HOUSE_AND_HOME : 1331540.56 installs
DATING : 854028.83 installs
COMICS : 817657.27 installs
AUTO_AND_VEHICLES : 647317.82 installs
LIBRARIES_AND_DEMO : 638503.73 installs
P

As we can see, for the `Google Play Store` market the highest number of installs have applications related to the categories: communication, video players, social. Let's explore which applications are related to these categories. To do it, we will use the `print_names_by_genre` function.

In [33]:
print_installs_by_genre(Free_Cleaned_DataGooglePlayStore, 1, 'COMMUNICATION', 0, 5)

WhatsApp Messenger : 1,000,000,000+  installs
Messenger for SMS : 10,000,000+  installs
My Tele2 : 5,000,000+  installs
imo beta free calls and text : 100,000,000+  installs
Contacts : 50,000,000+  installs
Call Free – Free Call : 5,000,000+  installs
Web Browser & Explorer : 5,000,000+  installs
Browser 4G : 10,000,000+  installs
MegaFon Dashboard : 10,000,000+  installs
ZenUI Dialer & Contacts : 10,000,000+  installs
Cricket Visual Voicemail : 10,000,000+  installs
TracFone My Account : 1,000,000+  installs
Xperia Link™ : 10,000,000+  installs
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+  installs
Skype Lite - Free Video Call & Chat : 5,000,000+  installs
My magenta : 1,000,000+  installs
Android Messages : 100,000,000+  installs
Google Duo - High Quality Video Calls : 500,000,000+  installs
Seznam.cz : 1,000,000+  installs
Antillean Gold Telegram (original version) : 100,000+  installs
AT&T Visual Voicemail : 10,000,000+  installs
GMX Mail : 10,000,000+  installs
O

CK Call NEW : 10+  installs
CM Transfer - Share any files with friends nearby : 5,000,000+  installs
mail.co.uk Mail : 5,000+  installs
ClanPlay: Community and Tools for Gamers : 1,000,000+  installs
CQ-Mobile : 1,000+  installs
CQ-Alert : 500+  installs
QRZ Assistant : 100,000+  installs
Pocket Prefix Plus : 10,000+  installs
Ham Radio Prefixes : 10,000+  installs
CS Customizer : 1,000+  installs
CS Browser | #1 & BEST BROWSER : 1,000+  installs
CS Browser Beta : 5,000+  installs
My Vodafone (GR) : 1,000,000+  installs
IZ2UUF Morse Koch CW : 50,000+  installs
C W Browser : 100+  installs
CW Bluetooth SPP : 100+  installs
CW BLE Peripheral Simulator : 500+  installs
Morse Code Reader : 100,000+  installs
Learn Morse Code - G0HYN Learn Morse : 5,000+  installs
Ring : 10,000+  installs
Hyundai CX Conference : 50+  installs
Cy Messenger : 100+  installs
Amadeus GR & CY : 100+  installs
Hlášenírozhlasu.cz : 10+  installs
SMS Sender - sluzba.cz : 1,000+  installs
WEB.DE Mail : 10,000,000+  

We can see, that there are a lot of applications, what have a number of installing higher then 1000000. In addition, the distribution of installs looks pretty uniform. Such distribution serves as an indicator that this niche of applications can be quite profitable if our application will be good enough, well designed and have features. We got the same result for the `Apple Store` market, so for the first iteration we already can say what development an application in this category be quite a promising.

The video players category unlikely to be profitable for the `Apple Store` market, so we will not axplore applications from the category.

The next category for the analysis will be `SOCIAL`.

In [34]:
print_installs_by_genre(Free_Cleaned_DataGooglePlayStore, 1, 'SOCIAL', 0, 5)

Facebook : 1,000,000,000+  installs
Facebook Lite : 500,000,000+  installs
Tumblr : 100,000,000+  installs
Social network all in one 2018 : 100,000+  installs
Pinterest : 100,000,000+  installs
TextNow - free text + calls : 10,000,000+  installs
Google+ : 1,000,000,000+  installs
The Messenger App : 1,000,000+  installs
Messenger Pro : 1,000,000+  installs
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+  installs
Telegram X : 5,000,000+  installs
The Video Messenger App : 100,000+  installs
Jodel - The Hyperlocal App : 1,000,000+  installs
Hide Something - Photo, Video : 5,000,000+  installs
Love Sticker : 1,000,000+  installs
Web Browser & Fast Explorer : 5,000,000+  installs
LiveMe - Video chat, new friends, and make money : 10,000,000+  installs
VidStatus app - Status Videos & Status Downloader : 5,000,000+  installs
Love Images : 1,000,000+  installs
Web Browser ( Fast & Secure Web Explorer) : 500,000+  installs
SPARK - Live random video chat & meet new people : 5,0

Eddsworld Amino : 10,000+  installs
Rejoin Your Ex : 100+  installs
Amleen Ey : 1+  installs
Coupe Adhémar EY 2017 : 50+  installs
EZ Video Download for Facebook : 1,000,000+  installs
Messages, Text and Video Chat for Messenger : 10,000,000+  installs
All Social Networks : 1,000,000+  installs
Messenger Messenger : 10,000,000+  installs
Facebook Creator : 1,000,000+  installs
Swift for Facebook Lite : 500,000+  installs
Friendly for Facebook : 1,000,000+  installs
Faster for Facebook Lite : 1,000,000+  installs
Puffin for Facebook : 500,000+  installs
Profile Tracker - Who Viewed My Facebook Profile : 500,000+  installs
Pink Color for Facebook : 500,000+  installs
Messenger : 10,000,000+  installs
Stickers for Imo, fb, whatsapp : 10,000+  installs
Who Viewed My Facebook Profile - Stalkers Visitors : 5,000,000+  installs
Downloader plus for FB : 500+  installs
MB Notifications for FB (Free) : 100,000+  installs
Phoenix - Facebook & Messenger : 100,000+  installs
Faster Social for Faceb

The distribution of the applications installs looks quite similar to the one we saw in the `Apple Store` dataset. Again, in the first place by installs is Facebook, but there are a lot of other applications which have a decent number of installs. This distribution leads us to the same conclusion: developing another application in the social media category can be quite profitable. The only we have to do is determinate the target audience for our application and features that users need.

**Conclusion**: based on the results of the analysis of applications with the most popular genres, we can say that the most profitable applications for the `Google Play Store` market could be a social network application having good design, unique features.

# Conclusions
In this project, we analyzed data about the `App Store` and `Google Play` mobile applications with the goal of recommending an application profile that can be profitable for both markets.

We concluded that development a `social network` application could be very profitable for both the `Google Play` and the `App Store` markets. The markets are already full of social networks, so we need to correctly determine the group of people who will use the application and add some special features. We also know that for the `App Store` market the second most popular genre is reference where we saw that games manual also look promising. Combinimg this information, we can assume that developing a social network application, for example, for gamers, could be quite profitable for both markets.

