# Profitable Distributions of Apple and Google App Stores

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

To do this, we'll analyze data about mobile apps available on Google Play and the App Store. We'll use two datasets published on Kaggle:

- Google Play: https://www.kaggle.com/lava18/google-play-store-apps
- App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

In [1]:
from csv import reader

# Google Play dataset
# encoding='utf8' is an assumption; adjust it if the file uses another encoding
with open('datasets/googleplaystore.csv', encoding='utf8') as opened_file:
    google = list(reader(opened_file))
google_header = google[0]  # first row holds the column names
google = google[1:]

# App Store dataset
with open('datasets/AppleStore.csv', encoding='utf8') as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]  # first row holds the column names
apple = apple[1:]

We define a helper function, `explore_data`, that prints a slice of a dataset's rows and, optionally, the total number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print('Google Dataset')
print(google_header, '\n')
explore_data(google, 0, 5, True)
print('\n')
print('Apple Dataset')
print(apple_header, '\n')
explore_data(apple, 0, 5, True)

Google Dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free'

The column headers of the Google dataset are self-explanatory.

The Apple dataset includes some less obvious column headers. The headers `rating_count_ver` and `user_rating_ver` hold the number of ratings and the average rating, respectively, for the current version of the app only. See the dataset sources for more information on the column headers.

### Data Cleaning
Before we can analyze these datasets, we must clean them by removing erroneous entries and duplicates. Because we are interested in the market for free apps aimed at English speakers, we will also remove non-free and non-English apps.

As pointed out in [a discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the dataset's page, one entry in the Google dataset is malformed. It sits at index 10472 (not counting the header row), so we print it and then remove it.

In [4]:
explore_data(google, 10472, 10473)
del google[10472]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
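
The row printed above has 12 fields while the header has 13: the `Category` value is missing, so every later value is shifted one column to the left. A minimal sketch of a general check follows, assuming a malformed row can be recognized by its field count alone. The function name `find_malformed_rows` and the sample rows are hypothetical and used only for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical miniature dataset: the second row is missing its category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
for index, row in malformed:
    print(index, row)
```

Calling `find_malformed_rows(google, google_header)` before the `del` statement would be one way to confirm the index reported in the discussion rather than relying on it.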



The Google dataset also has several duplicate entries.

In [12]:
duplicate_apps = []
unique_apps = []

for app in google:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print(f'Number of duplicate apps: {len(duplicate_apps)}')
print(f'Examples: {duplicate_apps[:10]}')

Number of duplicate apps: 1181
Examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
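
Some names (`Box`, `Google My Business`) appear more than once in the examples because the loop records every extra occurrence separately. To see how many times each name occurs, one option is `collections.Counter` from the standard library. The sketch below uses a hypothetical list of names in place of the first column of `google`.

```python
from collections import Counter

# Hypothetical sample of app names standing in for the first column of `google`.
sample_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']

name_counts = Counter(sample_names)

# Keep only the names that occur more than once.
repeated = {name: count for name, count in name_counts.items() if count > 1}
print(repeated)

# Each extra occurrence counts as one duplicate, matching the loop above.
number_of_duplicates = sum(count - 1 for count in repeated.values())
print(number_of_duplicates)
```

A dictionary lookup is also faster than the `in unique_apps` list search, which matters little at this dataset size but would on a larger one.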


Two examples of duplicate apps are Box and Instagram.

In [16]:
print(google_header)
print('')
for app in google:
    app_name = app[0]
    if app_name == 'Box':
        print(app)
print('')
for app in google:
    app_name = app[0]
    if app_name == 'Instagram':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'V

Some duplicates are identical, while others (like Instagram) differ in their review counts (the `Reviews` column). This suggests that the entries were scraped at different times. Because we want the most recent data, we will keep only the entry with the highest review count for each app.
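
One way to apply this rule is a two-pass approach: first record the highest review count per app name, then keep a single row that matches it. This is a minimal sketch, not necessarily the approach used later in the project. It assumes the app name is at index 0, the `Reviews` value is at index 3, and every review count parses as an integer. The function name `remove_duplicates` is hypothetical, and the sample rows are trimmed to the first four columns of the rows shown above.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    # First pass: record the highest review count seen for each name.
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches each name's maximum.
    # The `already_added` set handles identical duplicates such as Box.
    cleaned = []
    already_added = set()
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned


# Sample rows trimmed to four columns: App, Category, Rating, Reviews.
sample = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]

cleaned_sample = remove_duplicates(sample)
for row in cleaned_sample:
    print(row)
```

Building a new list instead of deleting rows in place leaves the original `google` list untouched, so the cell can be re-run safely.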