# Most attractive apps

Because the main source of revenue is in-app ads, revenue depends largely on the number of users. This project analyzes mobile app data to help developers understand which types of apps are likely to attract more users on Google Play and the App Store.

The goal of the project is to identify the kind of app that will increase revenue by attracting as many users as possible.
Once the most attractive app profile is found, it can guide the following validation strategy:
  - Build a minimal Android version of the app and add it to Google Play
  - If the app has a good response from users, develop it further
  - If the app is profitable after six months, build an iOS version and add it to the App Store

In [None]:
from csv import reader

In [None]:
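# Both files contain non-ASCII app names, so the encoding is set explicitly
# (UTF-8 is assumed here) instead of relying on the platform default.
with open('AppleStore.csv', 'r', encoding='utf8') as file_opened:
    read_lines = reader(file_opened)
    apple_data = list(read_lines)

with open('googleplaystore.csv', 'r', encoding='utf8') as file_opened:
    read_lines = reader(file_opened)
    google_data = list(read_lines)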

The function `explore_data` prints the rows of a dataset between the `start` and `end` indices. When the argument `rows_and_columns` is set to `True`, it also prints the number of rows and columns in the dataset.

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
explore_data(apple_data, 0, 5, True)

In [None]:
explore_data(google_data, 0, 5, True)

The documentation for the `apple_data` can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

The documentation for the `google_data` can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [None]:
copy_apple = apple_data.copy()
copy_google = google_data.copy()

In [None]:
header_apple_data = list(enumerate(copy_apple[0]))
print('Apple data header: ', header_apple_data)
print('\n')
header_google_data = list(enumerate(copy_google[0]))
print('Google data header: ', header_google_data)

## Data cleaning
   - Remove incomplete rows
   - Remove duplicate apps
   - Remove non-English apps
   - Remove apps that aren't free

### Find and Remove Incomplete rows
The function `incomplete_row` finds the rows whose length differs from that of the dataset's header. These rows will be removed.

In [None]:
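def incomplete_row(dataset):
    # Compare every row's length with the header's length.
    found = False
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) != len(dataset[0]):
            print(row)
            print('Incomplete row index: ', index)
            found = True
    # Report success only when no incomplete row was found.
    if not found:
        print('All rows are complete!')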
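
When the incomplete rows are going to be deleted afterwards, it can be handy to collect their indices instead of printing them. The following is a minimal sketch of that idea; the helper name `find_incomplete_rows` and the sample rows are assumptions made for illustration and are not part of the datasets.

```python
def find_incomplete_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])
    # Start at 1 so the header itself is never reported.
    return [i for i, row in enumerate(dataset) if i > 0 and len(row) != expected]


# Hypothetical sample: the row at index 2 is missing one field.
sample = [
    ['App', 'Category', 'Rating'],
    ['A', 'GAME', '4.5'],
    ['B', '4.1'],
    ['C', 'TOOLS', '3.9'],
]
incomplete = find_incomplete_rows(sample)
print(incomplete)
```

The returned indices refer to positions in the full dataset, header included, so they can be passed to `del` directly.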

#### Apple data

In [None]:
incomplete_row(apple_data)

#### Google data

In [None]:
incomplete_row(google_data)

The row at index `10473` is missing its `Category` value, so all of its other values are shifted one position to the left. The row must be removed.

In [None]:
del google_data[10473]

In [None]:
print('New length of google_data: ', len(google_data))

### Find and Remove Duplicate rows
The function `find_duplicate` searches a dataset by app name for duplicated apps.

In [None]:
def find_duplicate(dataset, index):
    duplicate_apps = []
    unique_apps = []
    for row in dataset[1:]:
        name = row[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    print(len(duplicate_apps))
    return duplicate_apps
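
Checking `name in unique_apps` scans a list on every iteration, which gets slow for roughly ten thousand rows. A possible variant, sketched below under the assumption that only the names matter, keeps the names already seen in a `set`. The function name `find_duplicate_names` and the sample data are hypothetical.

```python
def find_duplicate_names(dataset, index):
    """Return every repeated occurrence of a name, skipping the header row."""
    seen = set()
    duplicates = []
    for row in dataset[1:]:
        name = row[index]
        if name in seen:
            # Second or later occurrence of this name.
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Hypothetical sample with one name appearing three times.
sample = [['App'], ['Viber'], ['Maps'], ['Viber'], ['Viber']]
dupes = find_duplicate_names(sample, 0)
print(len(dupes), dupes)
```

As in `find_duplicate`, a name that appears three times contributes two entries to the result.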

#### Apple data

In [None]:
find_duplicate(apple_data, 1)

The dataset `apple_data` has two duplicated apps. Inspecting these rows suggests that the data were collected at different times for different versions of the same app, so the rows for the older versions will be removed. Since the newest versions have more total ratings, the `rating_count_tot` column will be used as the guide for the removal.
For this dataset, the duplicates can be found and removed manually.

In [None]:
for row in apple_data[1:]:
    name = row[1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print(row)

In [None]:
for app in apple_data[1:]:
    id_app = app[0]
    if id_app == '1178454060' or id_app == '1089824278':
        print(apple_data.index(app))

In [None]:
print(len(apple_data))
del apple_data[4464]
del apple_data[4832]
print(len(apple_data))
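
Note that each `del` shifts all later rows down by one position, so when several rows are removed by index, the later indices have to account for the earlier deletions. One way to avoid that bookkeeping is to delete from the highest index downwards. The sketch below shows the idea on a toy list; `delete_rows` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def delete_rows(dataset, indices):
    """Delete the rows at the given original indices, in place."""
    # Going from the highest index down keeps the remaining indices valid.
    for i in sorted(set(indices), reverse=True):
        del dataset[i]


# Toy list standing in for a dataset with a header row.
rows = ['header', 'a', 'b', 'c', 'd']
delete_rows(rows, [1, 3])
print(rows)  # ['header', 'b', 'd']
```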

#### Google data

The dataset `google_data` has 1181 duplicated rows. Inspecting these rows suggests that the data for some apps were collected more than once, at the same or at different times. For each app, only the row with the highest number of `Reviews` will be kept and the other rows will be removed.

In [None]:
find_duplicate(google_data, 0)

The dictionary `duplicates` shows that some apps are duplicated more than once. For example, the app `Viber Messenger` has 5 entries: four in the `duplicate_apps` list and one in the `unique_apps` list.

In [None]:
duplicates = {}
for name in find_duplicate(google_data, 0):
    if name in duplicates:
        duplicates[name] += 1
    else:
        duplicates[name] = 1
print(duplicates)
print(len(duplicates))

In [None]:
for app in google_data[1:]:
    name = app[0]
    if name == 'Viber Messenger':
        print(app)

In [None]:
review_max = {}

for row in google_data[1:]:
    name = row[0]
    n_review = int(row[3])
    if name in review_max and review_max[name] < n_review:
        review_max[name] = n_review
    if name not in review_max:
        review_max[name] = n_review
print(len(review_max))
print(len(google_data[1:])-1181)
print(review_max['Viber Messenger'])

In [None]:
google_data_clean = []
google_name_added = []

for app in google_data[1:]:
    name = app[0]
    n_review = int(app[3])
    if name not in google_name_added and n_review == review_max[name]:
        google_data_clean.append(app)
        google_name_added.append(name)
print(len(google_data_clean))
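
The two steps above (finding the maximum number of reviews per app, then keeping one row with that maximum) can also be combined into a single pass that stores the best row seen so far for each name. This is only a sketch of an alternative, not the approach used in this notebook; `keep_most_reviewed` and the sample rows are assumptions for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # A strict comparison keeps the first row when counts are tied.
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical rows (header already excluded): name, category, rating, reviews.
sample = [
    ['Viber', 'COMMUNICATION', '4.3', '100'],
    ['Maps', 'TRAVEL', '4.0', '50'],
    ['Viber', 'COMMUNICATION', '4.3', '120'],
    ['Viber', 'COMMUNICATION', '4.3', '120'],
]
cleaned = keep_most_reviewed(sample)
print(cleaned)
```

The result is ordered by the first appearance of each name, which may differ from the row order produced by the two-step version.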

In [None]:
for app in google_data_clean:
    name = app[0]
    if name == 'Viber Messenger':
        print(app)

### Find and Remove non-English apps

The function `english_app` is used to separate out the non-English apps. If an app has more than 3 characters outside the ASCII range (code point above 127) in its name, it is treated as non-English and can be removed from the dataset.

In [None]:
def english_app(string):
    non_eng = 0
    for letter in string:
        if ord(letter) > 127:
            non_eng += 1
    if non_eng > 3:
        return False
    return True

#### Apple data

In [None]:
apple_data_eng = []

for row in apple_data[1:]:
    name = row[1]
    if english_app(name):
        apple_data_eng.append(row)
print(len(apple_data_eng))    

#### Google data

In [None]:
google_data_eng = []

for row in google_data_clean:
    name = row[0]
    if english_app(name):
        google_data_eng.append(row)
print(len(google_data_eng))

### Find and Remove non-Free apps

#### Apple data

In [None]:
apple_data_free = []

for row in apple_data_eng:
    price = float(row[4])
    if price == 0.0:
        apple_data_free.append(row)
print(len(apple_data_free))

#### Google data

In [None]:
google_data_free = []

for row in google_data_eng:
    app_type = row[6]  # renamed to avoid shadowing the built-in `type`
    if app_type == 'Free':
        google_data_free.append(row)
print(len(google_data_free))

## Find the most popular genres
An effective strategy for deciding what kind of app to build is to explore the most common genres in both stores. It can be assumed that these kinds of apps are in higher demand.

The function `freq_table` creates a dictionary that maps each genre in the dataset to the percentage of apps that belong to it. The function `display_table` prints these percentages in descending order.

In [None]:
def freq_table(dataset, index):
    freq_dict = {}
    total = 0
    
    for app in dataset:
        total += 1
        genre = app[index]
        if genre in freq_dict:
            freq_dict[genre] += 1
        else:
            freq_dict[genre] = 1
            
    percent_freq = {}
    for key in freq_dict:
        percent_freq[key] = (freq_dict[key]/total) * 100
    return percent_freq
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    sort_table = []
    for key in table:
        to_tuple = (table[key], key)
        sort_table.append(to_tuple)
        
    sort_freq_perc = sorted(sort_table, reverse = True)
    for item in sort_freq_perc:
        print(item[1], ':', item[0])
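
The same percentages can be computed with `collections.Counter` from the standard library. The sketch below is an alternative formulation, assuming the rows are plain lists as above; `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Map each value in the given column to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Hypothetical sample: three Games apps and one Books app.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Books'], ['d', 'Games']]
table = freq_table_counter(sample, 1)

# Print in descending order of percentage, as display_table does.
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```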

#### Apple data
Exploring the results for the `apple_data_free` dataset, we see that the genre `Games` comes first by a wide margin over the other genres.

In [None]:
display_table(apple_data_free, 11)

#### Google data
The landscape is different for the `google_data_free` dataset. Google Play seems to offer more family-friendly apps. Since the corresponding genre doesn't exist in the `apple_data_free` dataset, we cannot really draw a definitive conclusion. It seems, though, that in both stores entertainment apps have the highest demand.

In [None]:
display_table(google_data_free, 1)

### Find the number of downloads for each genre

Another important factor is the number of downloads for each genre. This factor may give us more information about the kind of apps we should build to increase revenue. Therefore, we will use the data in the `Installs` column for the `google_data_free` dataset. Since the `apple_data_free` dataset has no equivalent column, we will use `rating_count_tot` as a proxy and try to reach a decision from there.

#### Apple data
Exploring the results, we conclude that the most popular apps by average number of ratings are the `Navigation` apps, with 86090.33 ratings on average. The `Reference` apps come second and the `Social Networking` apps third.

In [None]:
genre_apple = freq_table(apple_data_free, 11)

for genre in genre_apple:
    total = 0
    len_genre = 0
    for app in apple_data_free:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', round(avg_n_ratings, 2))
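
The nested loop above scans the whole dataset once per genre. For larger datasets, the averages can be accumulated in a single pass instead. The following is a sketch under the assumption that the value column always holds a numeric string; `average_by_group` and the sample rows are made up for illustration.

```python
def average_by_group(rows, group_index, value_index):
    """Return the mean of the value column for each distinct group."""
    totals = {}
    counts = {}
    for row in rows:
        key = row[group_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical rows: genre, number of ratings.
sample = [['Games', '100'], ['Games', '300'], ['Navigation', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)
```

With the notebook's data, a call such as `average_by_group(apple_data_free, 11, 5)` would be expected to give the same averages as the loop above.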

In [None]:
for app in apple_data_free:
    genre = app[11]
    if genre == 'Navigation':
        print(app[1], ':', app[5])

In [None]:
for app in apple_data_free:
    genre = app[11]
    if genre == 'Reference':
        print(app[1], ':', app[5])

In [None]:
for app in apple_data_free:
    genre = app[11]
    if genre == 'Social Networking':
        print(app[1], ':', app[5])

#### Google data
For the Google Play market, we actually have data about the number of `Installs`, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough: we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.).

In [None]:
display_table(google_data_free,5)

One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.
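
To compute averages, each install string is treated as its lower bound: the commas and the plus sign are removed and the rest is converted to a number. A small sketch of that conversion follows; `parse_installs` is a hypothetical helper name.

```python
def parse_installs(text):
    """Convert a string such as '100,000+' to its lower bound as a float."""
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('100,000+'))        # 100000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
print(parse_installs('0'))               # 0.0
```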

In [None]:
categories_android = freq_table(google_data_free, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in google_data_free:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

On average, `COMMUNICATION` apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs.
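
The effect of a few very large values on an average can be illustrated by comparing the mean with the median. The install counts below are invented for the illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Invented install counts: two giants and three smaller apps.
toy_installs = [1_000_000_000, 500_000_000, 1_000_000, 100_000, 10_000]

toy_mean = mean(toy_installs)
toy_median = median(toy_installs)

# The mean is pulled toward the giants, while the median stays at the middle value.
print('mean  :', toy_mean)
print('median:', toy_median)
```

In a case like this, the median can be a more representative summary of a typical app in the category than the mean.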

In [None]:
for app in google_data_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

In [None]:
under_100_m = []

for app in google_data_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

The `BOOKS_AND_REFERENCE` genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [None]:
for app in google_data_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.