# Mobile App Data Analysis Project

## Project Overview

This project is an opportunity to integrate various skills to solve a real-world problem. You'll step into the role of a data analyst for a company specializing in Android and iOS mobile apps. These apps are distributed through Google Play and the App Store, with revenue primarily generated from in-app advertisements. Understanding the factors influencing user attraction is vital for the company's success.


## Dataset

As of September 2018, the App Store boasted approximately 2 million iOS apps, while Google Play housed around 2.1 million Android apps.

![Number of Apps in Leading App Stores](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)

[Source](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

You'll work with two datasets:

- **Google Play dataset**: Contains data on about 10,000 Android apps from August 2018. Download it [here](googleplaystore.csv).

- **App Store dataset**: Includes data on around 7,000 iOS apps from July 2017. Download it [here](AppleStore.csv).




We'll start by opening and exploring these two datasets. To make them easier to explore, we created a function named `explore_data()` that you can use repeatedly to print rows in a readable way.

In [1606]:
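def explore_data(dataset, start, end, rows_and_columns=False):
    # Print the rows in dataset[start:end], with blank lines between them
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    # Optionally report the dimensions (a header row, if present, is counted too)
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))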

### `explore_data()` Function

The `explore_data()` function is designed to explore a dataset by printing rows within a specified slice and optionally showing the number of rows and columns.

#### Parameters:
- `dataset`: A list of lists representing the dataset.
- `start`: An integer representing the starting index of the slice.
- `end`: An integer representing the ending index of the slice.
- `rows_and_columns`: A Boolean indicating whether to print the number of rows and columns. Defaults to `False`.

#### Behavior:
1. Slices the dataset using `dataset[start:end]`.
2. Loops through the slice, printing each row followed by a new line character (`\n`) for spacing.
3. If `rows_and_columns` is `True`, prints the number of rows and columns.

**Note**: The row count is `len(dataset)`, so if the dataset still includes its header row, the printed number is one more than the number of apps.
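
To illustrate the note above, here is a minimal sketch on a made-up two-app dataset. The helper `count_data_rows()` is hypothetical and not part of the original notebook; it only shows how the header row could be left out of the count.

```python
def count_data_rows(dataset, has_header=True):
    """Return the number of rows, excluding the header row if there is one."""
    if has_header and dataset:
        return len(dataset) - 1
    return len(dataset)


# Toy dataset (made up for illustration): one header row plus two apps.
sample = [
    ['name', 'genre'],
    ['App A', 'Games'],
    ['App B', 'Music'],
]

print('Rows including header:', len(sample))
print('Apps only:', count_data_rows(sample))
```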


## Opening the Datasets

Next, we'll open the App Store and Google Play datasets and explore the first rows of each with the `explore_data()` function defined above.

To read the files, we'll use the `csv` module and a small helper function:


In [1607]:
import csv

def open_csv_as_list_of_lists(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        csv_reader = csv.reader(file)
        for row in csv_reader:
            data.append(row)
    return data

app_store_data = open_csv_as_list_of_lists('AppleStore.csv')

google_play_data = open_csv_as_list_of_lists('googleplaystore.csv')
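
Since the first row of each file is a header, it can be convenient to keep it apart from the data rows. The sketch below shows one possible variant, not what the notebook does: `split_header()` is a hypothetical helper, and an in-memory CSV with made-up content stands in for the real files so that the example runs on its own.

```python
import csv
import io


def split_header(rows):
    """Split a list of rows into (header, data_rows)."""
    return rows[0], rows[1:]


# In-memory stand-in for a CSV file (made-up content).
fake_file = io.StringIO('name,price\n"App A, Pro",0\nApp B,1.99\n')
rows = list(csv.reader(fake_file))

header, apps = split_header(rows)
print(header)     # column names
print(len(apps))  # number of apps, header excluded
```

Note that `csv.reader` handles the quoted comma in `"App A, Pro"`, which a plain `line.split(',')` would not.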

In [1608]:
explore_data(app_store_data,0,5,True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


The columns that look most useful for our analysis are `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, `user_rating` and `prime_genre`.
The table below describes each column:


| Column Name       | Description                                     |
|-------------------|-------------------------------------------------|
| "id"              | App ID                                          |
| "track_name"      | App Name                                        |
| "size_bytes"      | Size (in Bytes)                                 |
| "currency"        | Currency Type                                   |
| "price"           | Price amount                                    |
| "rating_count_tot"| User Rating counts (for all versions)           |
| "rating_count_ver"| User Rating counts (for current version)        |
| "user_rating"     | Average User Rating value (for all versions)    |
| "user_rating_ver" | Average User Rating value (for current version) |
| "ver"             | Latest version code                             |
| "cont_rating"     | Content Rating                                  |
| "prime_genre"     | Primary Genre                                   |
| "sup_devices.num" | Number of supported devices                     |
| "ipadSc_urls.num" | Number of screenshots shown for display         |
| "lang.num"        | Number of supported languages                   |
| "vpp_lic"         | VPP Device Based Licensing Enabled              |



In [1609]:
explore_data(google_play_data,0,5,True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


The columns that look most useful for our analysis are `Category`, `Reviews`, `Price`, `Current Ver` and `Rating`.
The table below describes each column:

| Column Name    | Description                                                        |
|----------------|--------------------------------------------------------------------|
| App            | Application name                                                   |
| Category       | Category the app belongs to                                        |
| Rating         | Overall user rating of the app (when scraped)                      |
| Reviews        | Number of user reviews for the app (when scraped)                  |
| Size           | Size of the app (when scraped)                                     |
| Installs       | Number of user downloads/installs for the app (when scraped)       |
| Type           | Paid or Free                                                       |
| Price          | Price of the app (when scraped)                                    |
| Content Rating | Age group the app is targeted at: Children / Mature 21+ / Adult    |
| Genres         | An app can belong to multiple genres (apart from its main category) |
| Last Updated   | Date when the app was last updated on Play Store (when scraped)    |
| Current Ver    | Current version of the app available on Play Store (when scraped)  |
| Android Ver    | Minimum required Android version (when scraped)                    |


## Deleting Wrong Data

In the previous step, we opened the two datasets and explored the data. Before beginning our analysis, we need to make sure the data we analyze is accurate, or the results of our analysis will be wrong. This means that we need to do the following:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

Recall that at our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.

Instructions:

* The Google Play dataset has a dedicated discussion section, and one of the [discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error in a certain row.

* Read the discussion, and determine the index of the row.

* Print the row at that index to check whether it's incorrect. The user reporting the error may or may not have removed the header row, so the index may be off by one.

* If the row has an error, remove it using the `del` statement. For instance, to remove the row with index 149 from a dataset `data` that is stored as a list of lists, use `del data[149]`.

* Make sure you don't run the `del` statement more than once; otherwise you'll delete more than one row.

* Read the discussion section for the App Store dataset, and see if you can find any reports of wrong data.

In [1610]:
# The header row is still part of the list, which is why the faulty row sits at index 10473

google_play_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [1611]:
# Run this cell only once: each run deletes whatever row is at index 10473
del google_play_data[10473]
google_play_data[10473]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']
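
The deleted row was faulty because it had 12 values instead of 13: the `Category` value was missing, so every later value sat one column to the left. A length check against the header can find such rows without relying on the discussion forum. The following is a sketch on made-up rows, and `find_malformed_rows()` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Toy dataset (made up): the third row is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

bad_indexes = find_malformed_rows(sample)
print(bad_indexes)

# Delete from the end so that the remaining indexes stay valid.
for i in reversed(bad_indexes):
    del sample[i]
print(len(sample))
```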

## Removing Duplicate Entries

In the previous step, we started the data cleaning process and deleted a row with incorrect data from the Google Play dataset. If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:


In [1612]:
for app in google_play_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Next, we'll count the duplicate entries:


* Create two pairs of lists: `duplicate_names` and `duplicate_apps` for the duplicate entries, and `unique_names` and `unique_apps` for the first occurrence of each app.
* Loop through the Google Play dataset, and for each row:
  * Save the app name to a variable named `name`.
  * If `name` is already in `unique_names`, append the name to `duplicate_names` and the row to `duplicate_apps`.
  * Otherwise, append the name to `unique_names` and the row to `unique_apps`.

In [1613]:
duplicate_names = []
duplicate_apps = []
unique_names = []
unique_apps = []

for app in google_play_data:
    name = app[0]
    if name in unique_names:
        duplicate_names.append(name)
        duplicate_apps.append(app)
    else:
        unique_names.append(name)
        unique_apps.append(app)
print('Number of duplicate apps:', len(duplicate_names))
print('\n')
print('Example of duplicate apps', duplicate_names[:15])

Number of duplicate apps: 1181


Example of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
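
Checking `name in unique_names` scans a list on every iteration, which becomes slow on larger datasets. As an alternative sketch (not what the notebook ran), `collections.Counter` can count every name in a single pass. The names below are made up for illustration.

```python
from collections import Counter

# Made-up app names; 'Box' appears three times and 'Slack' twice.
names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']

counts = Counter(names)

# Every occurrence beyond the first counts as one duplicate entry,
# which matches how duplicate_names is filled in the cell above.
n_duplicates = sum(c - 1 for c in counts.values())
duplicated_names = sorted(name for name, c in counts.items() if c > 1)

print('Number of duplicate entries:', n_duplicates)
print('Names with duplicates:', duplicated_names)
```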


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

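If you examine the rows we printed for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this as a criterion: rather than removing duplicates at random, we'll keep only the row with the highest number of reviews, since a higher count suggests more recent data.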

In [1614]:
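# app[3] holds the number of reviews, so the dictionary maps each app name
# to the highest review count recorded for it.
reviews_max = {}

for app in google_play_data:
    name = app[0]
    n_reviews = app[3]
    # Caution: n_reviews is a string, so '>' compares text, not numbers
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

# 1181 is the number of duplicate entries counted above
print('Expected length:', len(google_play_data) - 1181)
print('Actual length:', len(reviews_max))

android_clean = []
already_added = []

for app in google_play_data:
    name = app[0]
    n_reviews = app[3]

    # Keep the row that holds the recorded maximum, and only its first occurrence
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)  # must stay inside the if block

explore_data(android_clean, 0, 3, True)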


Expected length: 9660
Actual length: 9660
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9660
Number of columns: 13
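
One caveat about the cell above: the review counts are strings, so `>` compares them character by character, and `'967' > '1000'` evaluates to `True`. A numeric variant would convert the counts before comparing. The sketch below uses made-up rows without a header and a hypothetical `keep_most_reviewed()` helper; on the real data it may select different rows than the cell above did.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])  # compare as numbers, not text
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up rows (no header): name, category, rating, reviews.
rows = [
    ['App A', 'TOOLS', '4.0', '967'],
    ['App A', 'TOOLS', '4.0', '1000'],
    ['App B', 'SOCIAL', '4.5', '20'],
]

clean = keep_most_reviewed(rows)
print(clean)
print('967' > '1000')  # the string comparison gives the opposite answer
```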


## Removing Non-English Apps

In the previous step, we managed to remove the duplicate app entries in the Google Play dataset. We don't need to do the same for the App Store data because there are no duplicates — you can check that for yourself using the id column (not the track_name column).

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience.

We're not interested in keeping these apps, so we'll remove them. One way to do this is to remove each app with a name containing a symbol that isn't commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the ord() built-in function.
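
A quick check of those values is sketched below. It also tests each character against 127, the upper end of the ASCII range, which is the limit the filter in the next cell relies on.

```python
# Print the code point of each example character from the text above.
for ch in ['a', 'A', '爱']:
    print(ch, ord(ch))

# ASCII covers code points 0 to 127; anything above is outside that range.
is_ascii = [ord(ch) <= 127 for ch in 'A爱']
print(is_ascii)
```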

So, we'll write a function that checks whether a name contains any character that does not belong to the set of common English characters. It returns `False` as soon as it finds one, and `True` otherwise:

In [1615]:
def check_if_is_english(word):
    # Any character with a code point above 127 is outside the ASCII range
    for letter in word:
        if ord(letter) > 127:
            return False
    return True

print(check_if_is_english('Instagram'))
print(check_if_is_english('Instachat 😜'))
print(check_if_is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))


True
False
False


If we use this function as is, we'll lose useful data, since English apps whose names contain an emoji or a symbol such as ™ will be incorrectly labeled as non-English (see 'Instachat 😜' above). To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [1616]:
def new_check_if_is_english(word):
    # Count the non-ASCII characters and reject the name once there are more than three
    count = 0
    for letter in word:
        if ord(letter) > 127:
            count += 1
            if count > 3:
                return False
    return True

print(new_check_if_is_english('Instagram'))
print(new_check_if_is_english('Instachat 爱奇艺'))
print(new_check_if_is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
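
The same rule can be written more compactly by counting the non-ASCII characters with `sum()`. This is an equivalent sketch under a hypothetical name, `is_probably_english()`, with the threshold exposed as a parameter so that it is easy to experiment with.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))
print(is_probably_english('Instachat 爱奇艺'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

Unlike the loop above, this version always scans the whole name, which makes no practical difference for short app names.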


In [1617]:
new_google_data = []
new_apple_data = []

for app in android_clean:
    name = app[0]
    if new_check_if_is_english(name):
        new_google_data.append(app)

for app in app_store_data:
    name = app[1]
    if new_check_if_is_english(name):
        new_apple_data.append(app)

explore_data(new_apple_data,0,3,True)
explore_data(new_google_data,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6184
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', 

## Isolating the Free Apps

So far in the data cleaning process, we've done the following:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process. In the next section, we're going to start analyzing the data.

In [1618]:
# Google Play stores the price of a free app as '0'
free_google_apps = []
for app in new_google_data:
    price = app[7]
    if price == '0':
        free_google_apps.append(app)

print(len(free_google_apps))

# The App Store stores the price of a free app as '0.0'
free_apple_apps = []
for app in new_apple_data:
    price = app[4]
    if price == '0.0':
        free_apple_apps.append(app)
print(len(free_apple_apps))

8862
3222
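
The two stores encode a free price differently (`'0'` versus `'0.0'`), which is why the cell above needs two different string comparisons. A numeric comparison would handle both. The sketch below is one possible variant: `is_free()` is a hypothetical helper, and it assumes that paid Google Play prices look like `'$4.99'`.

```python
def is_free(price_text):
    """Return True if the price text represents zero."""
    cleaned = price_text.strip().lstrip('$')  # assumes a leading '$' on paid apps
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all, for example the header text 'Price'
        return False


for text in ['0', '0.0', '$4.99', 'Price']:
    print(repr(text), is_free(text))
```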


## Most Common Apps by Genre

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.


In [1619]:
# App Store: frequency table for the prime_genre column (index 11)
genres_apple = {}
for app in free_apple_apps:
    genre = app[11]
    if genre in genres_apple:
        genres_apple[genre] += 1
    else:
        genres_apple[genre] = 1

# Google Play: frequency table for the Genres column (index 9)
genres_google_genre = {}
for app in free_google_apps:
    genre = app[9]
    if genre in genres_google_genre:
        genres_google_genre[genre] += 1
    else:
        genres_google_genre[genre] = 1

# Google Play: frequency table for the Category column (index 1)
google_category = {}
for app in free_google_apps:
    category = app[1]
    if category in google_category:
        google_category[category] += 1
    else:
        google_category[category] = 1

print(genres_google_genre, '\n')
print(genres_apple, '\n')
print(google_category, '\n')

{'Art & Design': 53, 'Art & Design;Creativity': 6, 'Auto & Vehicles': 82, 'Beauty': 53, 'Books & Reference': 190, 'Business': 407, 'Comics': 54, 'Comics;Creativity': 1, 'Communication': 287, 'Dating': 165, 'Education': 474, 'Education;Creativity': 4, 'Education;Education': 30, 'Education;Pretend Play': 5, 'Education;Brain Games': 3, 'Entertainment': 538, 'Entertainment;Brain Games': 7, 'Entertainment;Creativity': 3, 'Entertainment;Music & Video': 15, 'Events': 63, 'Finance': 328, 'Food & Drink': 110, 'Health & Fitness': 273, 'House & Home': 73, 'Libraries & Demo': 83, 'Lifestyle': 345, 'Lifestyle;Pretend Play': 1, 'Arcade': 164, 'Puzzle': 100, 'Racing': 88, 'Sports': 307, 'Casual': 155, 'Simulation': 181, 'Adventure': 60, 'Trivia': 37, 'Action': 275, 'Word': 23, 'Role Playing': 83, 'Strategy': 81, 'Board': 33, 'Card': 39, 'Music': 18, 'Action;Action & Adventure': 9, 'Casual;Brain Games': 12, 'Educational;Creativity': 3, 'Puzzle;Brain Games': 16, 'Educational;Education': 35, 'Card;Brain

Now we'll sort the genres by number of occurrences:

In [1620]:
# Sort genres_apple by number of occurrences
sorted_genres_apple = dict(sorted(genres_apple.items(), key=lambda x: x[1], reverse=True))

# Sort genres_google_genre by number of occurrences
sorted_genres_google_genre = dict(sorted(genres_google_genre.items(), key=lambda x: x[1], reverse=True))

# Sort google_category by number of occurrences
sorted_google_category = dict(sorted(google_category.items(), key=lambda x: x[1], reverse=True))

print(sorted_genres_apple,'\n')
print(sorted_genres_google_genre,'\n')
print(sorted_google_category)

{'Games': 1874, 'Entertainment': 254, 'Photo & Video': 160, 'Education': 118, 'Social Networking': 106, 'Shopping': 84, 'Utilities': 81, 'Sports': 69, 'Music': 66, 'Health & Fitness': 65, 'Productivity': 56, 'Lifestyle': 51, 'News': 43, 'Travel': 40, 'Finance': 36, 'Weather': 28, 'Food & Drink': 26, 'Reference': 18, 'Business': 17, 'Book': 14, 'Navigation': 6, 'Medical': 6, 'Catalogs': 4} 

{'Tools': 748, 'Entertainment': 538, 'Education': 474, 'Business': 407, 'Lifestyle': 345, 'Productivity': 345, 'Finance': 328, 'Medical': 312, 'Sports': 307, 'Personalization': 294, 'Communication': 287, 'Action': 275, 'Health & Fitness': 273, 'Photography': 261, 'News & Magazines': 248, 'Social': 236, 'Travel & Local': 206, 'Shopping': 199, 'Books & Reference': 190, 'Simulation': 181, 'Dating': 165, 'Arcade': 164, 'Video Players & Editors': 157, 'Casual': 155, 'Maps & Navigation': 124, 'Food & Drink': 110, 'Puzzle': 100, 'Racing': 88, 'Libraries & Demo': 83, 'Role Playing': 83, 'Auto & Vehicles': 8
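
The three counting loops above follow the same pattern, and raw counts are hard to compare between stores of different sizes. A reusable helper that reports sorted percentages could look like the sketch below. `freq_table()` and the sample rows are illustrative and not part of the notebook.

```python
def freq_table(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1

    total = len(dataset)
    percentages = {value: count / total * 100 for value, count in counts.items()}
    return sorted(percentages.items(), key=lambda pair: pair[1], reverse=True)


# Made-up rows: app name and genre.
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = freq_table(sample, 1)
for genre, share in table:
    print(genre, ':', share)
```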

### Apple App Store
#### Top Genres:
- Games: 1874 occurrences
- Entertainment: 254 occurrences
- Photo & Video: 160 occurrences
- Education: 118 occurrences
- Social Networking: 106 occurrences

#### Observations:
- "Games" is by far the most common genre: 1,874 of the 3,222 free English apps, or about 58%.
- Entertainment (254) and Photo & Video (160) follow at a distance.
- Education (118) and Social Networking (106) complete the top five, each with fewer than 4% of the apps.

### Google Play Store (Genres)
#### Top Genres:
- Tools: 748 occurrences
- Entertainment: 538 occurrences
- Education: 474 occurrences
- Business: 407 occurrences
- Lifestyle: 345 occurrences

#### Observations:
- The most common genre is "Tools". A high count shows that many such apps exist; by itself it does not show how much users demand them.
- Entertainment and Education apps also have a significant presence.
- Unlike the Apple App Store, where Games dominate, no single genre dominates here. Note, however, that the Genres column splits games into sub-genres such as Action, Arcade and Puzzle, which lowers each individual count.

### Google Play Store (Categories)
#### Top Categories:
- Family: 1678 occurrences
- Game: 859 occurrences
- Tools: 749 occurrences
- Business: 407 occurrences
- Lifestyle: 346 occurrences

#### Observations:
- The most common category is "Family" (about 19% of the 8,862 free English apps), which appears to cover a wide range of apps.
- Game and Tools follow, each with roughly half as many apps as Family.
- The two stores categorize apps differently: Entertainment and Photo & Video are prominent on the Apple App Store, while Family and Game are the most prevalent categories on Google Play.

These counts describe how many apps each store offers per genre or category. They do not yet tell us how many users those apps have.


One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing for the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to do the following:

1. Isolate the apps of each genre
2. Add up the user ratings for the apps of that genre
3. Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.
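
For Google Play, the `Installs` values are strings such as `'10,000+'`, so they need cleaning before they can be averaged per category. Below is a minimal sketch, assuming every value follows that pattern. `parse_installs()`, `avg_installs_by_category()` and the sample rows are hypothetical. Since a value like `'10,000+'` is a lower bound, the resulting averages are approximate.

```python
def parse_installs(text):
    """Convert a string such as '10,000+' to the float 10000.0."""
    return float(text.replace(',', '').replace('+', ''))


def avg_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return a dict mapping each category to its average number of installs."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_idx]
        totals[category] = totals.get(category, 0.0) + parse_installs(row[installs_idx])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows that mimic the first six Google Play columns (no header).
sample = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+'],
    ['App B', 'TOOLS', '3.9', '967', '14M', '500,000+'],
    ['App C', 'SOCIAL', '4.5', '2000', '25M', '1,000,000+'],
]

averages = avg_installs_by_category(sample)
for category, average in averages.items():
    print(category, ':', average)
```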


In [1621]:

for genre in genres_apple:
    sum_user_ratings = 0
    len_genre = 0

    # Add up the rating counts of every free app in this genre
    for app in free_apple_apps:
        genre_app = app[11]
        if genre_app == genre:
            ratings = float(app[5])
            sum_user_ratings += ratings
            len_genre += 1
    # Divide by the number of apps in the genre, not by the total number of apps
    avg_n_ratings = sum_user_ratings / len_genre
    print(genre, ':', avg_n_ratings)
    

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


As we can see, the Navigation, Reference, Social Networking and Music genres have some of the highest averages. Let's check whether these numbers are inflated by a handful of very popular apps.

In [1622]:
for app in free_apple_apps:
    if app[11] == 'Navigation':  # prime_genre column
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [1623]:
for app in free_apple_apps:
    if app[11] == 'Music':  # prime_genre column
        print(app[1], ':', app[5])

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

In [1624]:
for app in free_apple_apps:
    if app[11] == 'Social Networking':  # prime_genre column
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a few apps with hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 mark. We could get a better picture by removing these extremely popular apps from each genre and recalculating the averages, but we'll leave this level of detail for later.
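
To make the skew concrete, the sketch below takes the six Navigation rating counts printed above and compares the mean with the median, and with the mean after dropping the apps above a cutoff. The 100,000 cutoff is an arbitrary assumption chosen only for illustration.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps, copied from the output above
# (Waze, Google Maps, Geocaching, CoPilot GPS, ImmobilienScout24, Railway Route Search).
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

CUTOFF = 100_000  # arbitrary threshold, chosen only for illustration

top_two = sorted(navigation_ratings, reverse=True)[:2]
top_two_share = sum(top_two) / sum(navigation_ratings)
trimmed = [n for n in navigation_ratings if n < CUTOFF]

print('Mean:', mean(navigation_ratings))
print('Median:', median(navigation_ratings))
print('Share of ratings held by the top two apps:', round(top_two_share, 3))
print('Mean without apps above the cutoff:', mean(trimmed))
```

Here the mean is about 86,090 while the median is 8,196.5, and the two largest apps hold roughly 97% of all ratings in the genre.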

In [1625]:
for app in free_apple_apps:
    if app[11] == 'Reference':  # prime_genre column
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


In the App Store data, the Reference genre's rating counts are dominated by the Bible app and Dictionary.com. However, there's an opportunity to create an app that not only attracts users but also keeps them engaged for longer periods. By integrating elements of music, literature, and religion, we can try to meet the demand for multimedia content. One idea is to develop an app that offers a diverse range of religious audiobooks and videos.

Additionally, features such as personalized recommendations and interactive forums could further improve user engagement. An app at the intersection of music, literature, and religion may therefore be a profitable opportunity.

Building on this, here are some more specific ideas for an app that integrates music, books, and religion:

1. Integrating Different Media Formats:
   * **Audiobooks and Videos**: Develop an app that offers a diverse range of religious content, including audiobooks and videos of religious teachings, sermons, and discussions. This caters to users who prefer audio or visual formats over traditional text-based content.
2. Enhancing User Engagement:
   * **Interactive Features**: Implement interactive features such as quizzes, polls, and discussion forums to encourage user engagement and foster community interaction among users with similar religious interests.
   * **Personalized Recommendations**: Utilize algorithms to provide personalized recommendations based on users' religious preferences, reading habits, and listening history, enhancing user satisfaction and retention.
3. Music Integration:
   * **Religious Music Streaming**: Integrate a curated library of religious music, hymns, chants, and devotional songs into the app, allowing users to listen to their favorite religious music seamlessly.
   * **Music for Meditation**: Offer playlists specifically curated for meditation, prayer, or spiritual reflection, providing users with a serene and calming audio environment conducive to religious practices.
4. Social Features:
   * **Community Building**: Facilitate connections among users through social features such as user profiles, messaging, and group discussions, fostering a supportive and inclusive religious community within the app.
   * **User-Generated Content**: Allow users to contribute their own content, such as testimonials, personal reflections, and religious artwork, fostering a sense of ownership and belonging within the app community.

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not exact: most values are open-ended (100+, 1,000+, 5,000+, etc.).

This means we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [1626]:
for category in google_category:
    total = 0
    len_category = 0
    for app in free_google_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1820673.076923077
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15560965.599534342
FAMILY : 3694276.334922527
MEDICAL : 120616.48717948717
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10682301.033377837
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Based on the list above, the top 5 categories by average number of installs are:

* COMMUNICATION: 38,456,119.17
* VIDEO_PLAYERS: 24,727,872.45
* SOCIAL: 23,253,652.13
* PHOTOGRAPHY: 17,805,627.64
* PRODUCTIVITY: 16,787,331.34
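A ranking like this can also be produced in code rather than read off by eye. The sketch below is one possible way to do it, assuming rows shaped like those in `free_google_apps` (category at index 1, installs at index 5). The names `parse_installs` and `average_installs_by_category` are hypothetical, and the sample rows are a small illustrative subset built from install tiers that appear in this notebook's output.

```python
def parse_installs(raw):
    """Turn a string such as '1,000,000+' into the float 1000000.0."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs_by_category(apps):
    """Return {category: average installs} for rows with category at 1, installs at 5."""
    totals = {}
    counts = {}
    for app in apps:
        category = app[1]
        totals[category] = totals.get(category, 0.0) + parse_installs(app[5])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Illustrative rows only; columns not used here are left blank.
sample_apps = [
    ['WhatsApp Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Gmail', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['YouTube', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['VLC for Android', 'VIDEO_PLAYERS', '', '', '', '100,000,000+'],
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
]

averages = average_installs_by_category(sample_apps)

# Sort categories from highest to lowest average and keep the first five.
top_categories = sorted(averages.items(), key=lambda item: item[1], reverse=True)[:5]
for category, avg in top_categories:
    print(f'{category}: {avg:,.2f}')
```

Collecting the averages in a dictionary first makes it easy to sort, slice, or reuse them later instead of only printing them.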

Let's check whether these numbers are inflated, for example by a few apps that account for most of the installs in a category:

In [1627]:
for app in free_google_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [1628]:
for app in free_google_apps:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


In [1629]:
for app in free_google_apps:
    if app[1] == 'SOCIAL' and (app[5] == '1,000,000,000+'
                               or app[5] == '500,000,000+'
                               or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


In [1630]:
for app in free_google_apps:
    if app[1] == 'PHOTOGRAPHY' and (app[5] == '1,000,000,000+'
                                    or app[5] == '500,000,000+'
                                    or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

B612 - Beauty & Filter Camera : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Sweet Selfie - selfie camera, beauty cam, photo edit : 100,000,000+
Google Photos : 1,000,000,000+
Retrica : 100,000,000+
Photo Editor Pro : 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
Photo Collage Editor : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
Candy Camera - selfie, beauty camera, photo editor : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
AR effect : 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
LINE Camera - Photo editor : 100,000,000+
Photo Editor Collage Maker Pro : 100,000,000+


In [1631]:
for app in free_google_apps:
    if app[1] == 'PRODUCTIVITY' and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Microsoft Word : 500,000,000+
Microsoft Outlook : 100,000,000+
Microsoft OneDrive : 100,000,000+
Microsoft OneNote : 100,000,000+
Google Keep : 100,000,000+
ES File Explorer File Manager : 100,000,000+
Dropbox : 500,000,000+
Google Docs : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Google Drive : 1,000,000,000+
Adobe Acrobat Reader : 100,000,000+
Google Sheets : 100,000,000+
Microsoft Excel : 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
Google Calendar : 500,000,000+
Cloud Print : 500,000,000+
CamScanner - Phone PDF Creator : 100,000,000+


We see the same pattern in each of these categories. The communication category is led by apps like WhatsApp, Messenger, Skype, and Gmail, and the video players category is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern repeats for social apps (where we have giants like Facebook, Instagram, and Google+), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.
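One way to gauge how much the giants inflate an average is to recompute it without them. The following sketch assumes the same row layout as `free_google_apps` (category at index 1, installs at index 5). The helper names `parse_installs` and `average_installs`, the 100,000,000 cutoff, and the two small sample apps are illustrative assumptions rather than part of the original analysis.

```python
def parse_installs(raw):
    """Turn a string such as '500,000+' into the float 500000.0."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs(apps, category, below=None):
    """Average installs in a category.

    If `below` is set, only apps with fewer installs than `below` are kept.
    """
    values = [parse_installs(app[5]) for app in apps if app[1] == category]
    if below is not None:
        values = [value for value in values if value < below]
    # Guard against an empty selection to avoid dividing by zero.
    return sum(values) / len(values) if values else 0.0


# Two giants from the output above plus two made-up smaller apps.
sample_apps = [
    ['YouTube', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['MX Player', 'VIDEO_PLAYERS', '', '', '', '500,000,000+'],
    ['Hypothetical small player A', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Hypothetical small player B', 'VIDEO_PLAYERS', '', '', '', '500,000+'],
]

with_giants = average_installs(sample_apps, 'VIDEO_PLAYERS')
without_giants = average_installs(sample_apps, 'VIDEO_PLAYERS', below=100_000_000)
print('With giants:', with_giants)
print('Without giants:', without_giants)
```

Comparing parsed numbers against a cutoff also avoids listing every install tier as a separate string, so one function call can stand in for the repeated per-category cells.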

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [1632]:
for app in free_google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The most popular apps in this category are e-book readers, dictionaries, and religious texts. An app aimed at readers that adds some twist may be profitable. Here are some ideas:

* Collaborative Reading Communities: Create a platform where readers can connect with each other to discuss books, share annotations, and even collaborate on writing new chapters or alternative endings to existing stories. This fosters a sense of community and co-creation among readers.

* Instead of traditional linear storytelling, introduce interactive elements where readers can make choices that affect the plot and outcome of the story. This could create a more immersive and personalized reading experience.

* Use machine learning algorithms to analyze readers' preferences, reading habits, and feedback to provide personalized book recommendations tailored to their interests and tastes. This helps users discover new content they're likely to enjoy.

* Integrate the reading experience with real-world activities or events, such as geo-tagged story chapters that unlock when users visit specific locations or themed reading challenges tied to holidays or cultural events.