# Profitable App Profiles for the App Store and Google Play Markets

This project analyzes data on apps available on Google Play and the App Store that meet two criteria:
- They are free to download and install.
- Their revenue depends mostly on the number of users who use them.

The goal of this project is to help developers understand what types of apps are likely to attract more users.

Documentation: 

- Data set containing approximately seven thousand apps from the App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#appleStore_description.csv
- Data set containing approximately ten thousand apps from the Google Play Store: https://www.kaggle.com/lava18/google-play-store-apps

## Open and explore the data

In [1]:
from csv import reader

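# App Store data: the encoding is stated explicitly so the read does not
# depend on the platform default
opened_ios = open('AppleStore.csv', encoding='utf8')
read_file  = reader(opened_ios)
ios_apps   = list(read_file)
ios_header = ios_apps[0]   # first row holds the column names
ios_apps   = ios_apps[1:]

# Google Play data, loaded the same way
opened_andr = open('googleplaystore.csv', encoding='utf8')
read_file   = reader(opened_andr)
andr_apps   = list(read_file)
andr_header = andr_apps[0]
andr_apps   = andr_apps[1:]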

def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows start..end-1, then optionally the data set's dimensions."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # blank lines between rows
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
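
The two loading steps follow the same pattern, so they could be wrapped in one helper that also closes the file automatically. The following is a minimal sketch, not part of the original notebook: _open_dataset_ is a hypothetical name, and the demonstration uses a small temporary file with made-up contents instead of the real CSV files.

```python
import csv
import os
import tempfile


def open_dataset(path, encoding='utf8'):
    """Read a CSV file and return (header, rows); the file is closed on exit."""
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demonstration on a small temporary file with hypothetical contents.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Rating\nExample App,4.1\n')
    tmp_path = tmp.name

demo_header, demo_rows = open_dataset(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
print(demo_header, demo_rows)
```

With the real files, the call would look like _open_dataset('AppleStore.csv')_, assuming the file sits in the working directory.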
    

In [2]:
print('Information from IOS apps: \n')
print(ios_header)
print('\n')
explore_data(ios_apps, 0, 5, True)
print('\n\n')
print('Information from Android apps: \n')
print(andr_header)
print('\n')
explore_data(andr_apps, 0, 5, True)

Information from IOS apps: 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows:  7197
Number 

## Error Detection

- The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.

In [3]:
print(andr_apps[10472]) #wrong entry
print('\n')
print(andr_header)
print('\n')
print(andr_apps[0])   #correct entry example

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


- As we can see, '1.9' is actually the app's rating: the row is missing its 'Category' value, which shifted every subsequent value one column to the left.
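
A shifted row like this one can also be found programmatically, because it has fewer values than the header has columns. The sketch below illustrates the idea on a small hypothetical sample; _find_malformed_rows_ is an illustrative helper, not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical three-column sample; the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Photo Frame', '1.9'],
]

for index, row in find_malformed_rows(sample_rows, sample_header):
    print(index, row)
```

On the real data, _find_malformed_rows(andr_apps, andr_header)_ would be expected to flag row 10472, since that row holds 12 values against 13 header columns.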

## Data Cleaning

In [4]:
print(len(andr_apps))
del andr_apps[10472] #be careful not to run this more than once
print(len(andr_apps))

10841
10840
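
Because _del_ removes whatever row currently sits at the given index, running the cell a second time would delete a valid app. One way to make the step safe to rerun is to check the row's length before deleting it. This is a sketch on a hypothetical sample; _drop_if_malformed_ is an illustrative name, not something the original notebook defines.

```python
def drop_if_malformed(dataset, index, header):
    """Delete dataset[index] only if its column count does not match the header."""
    if index < len(dataset) and len(dataset[index]) != len(header):
        del dataset[index]
        return True
    return False


# Hypothetical sample: the row at index 1 is missing a value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.0'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

first = drop_if_malformed(sample_rows, 1, sample_header)   # removes the malformed row
second = drop_if_malformed(sample_rows, 1, sample_header)  # no-op: row 1 is now valid
print(first, second, len(sample_rows))
```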


## Duplicate entries

- We started the data cleaning above by deleting the row with incorrect data from the Google Play data set. However, the Google Play data set also contains duplicate entries that would distort the analysis. The code below identifies them.

In [5]:
duplicate_apps = []
unique_apps    = []

for app in andr_apps:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


- We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app.
- If we examine the rows for the 'Google Ads' app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The differing numbers show that the data was collected at different times.


In [6]:
for app in andr_apps:
    name = app[0]
    if name == 'Google Ads':
        print(app)
        print('\n')

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']




- 'Google Ads' has three entries whose review counts are not all the same (29313, 29313 and 29331), while the rating is 4.3 in every row. We can use this to build a criterion for removing the duplicates.
- The higher the number of reviews, the more recent the data should be. We will keep the row with the highest number of reviews and remove the others.
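
For a single app, this criterion amounts to picking the row with the largest review count. The sketch below applies it to the three 'Google Ads' rows, trimmed to their first four columns for brevity; the variable names are illustrative.

```python
# The three 'Google Ads' rows, trimmed to App, Category, Rating, Reviews.
google_ads_rows = [
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29331'],
]

# Reviews are stored as strings, so convert before comparing.
most_recent = max(google_ads_rows, key=lambda row: int(row[3]))
print(most_recent)
```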

## Remove the duplicates

We looped through the Google Play data set and found 1181 duplicates. After removing them, we should be left with 9659 rows.

In [9]:
print('Expected rows(length of the list): ', len(andr_apps) - 1181)

Expected rows(length of the list):  9659


In [13]:
reviews_max = {}

for app in andr_apps:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews  
        
print(len(reviews_max))


9659


In [15]:
andr_clean    = []
already_added = []

for app in andr_apps:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        andr_clean.append(app)
        already_added.append(name)
        
        
print(len(andr_clean))


9659


The _already_added_ list handles apps whose duplicate entries share the same highest number of reviews. Without it, every such entry would match the maximum stored in _reviews_max_, and all of them would be kept.

Once an app's highest-review row has been appended to _andr_clean_, its name is recorded in _already_added_, so any later row for the same app is skipped.
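
The tie case can be checked on a small example. The sketch below wraps the two loops in a hypothetical helper, _remove_duplicates_, and runs it on made-up rows in which 'Box' appears twice with the same review count. It uses a set instead of a list for _already_added_, which is an assumption of this sketch rather than what the notebook does; the result is the same, but membership tests are faster.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep one row per app: the first row that has the highest review count."""
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    clean = []
    already_added = set()  # names whose best row is already in `clean`
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean


# Hypothetical rows: 'Box' ties on its maximum, 'Slack' has a clear maximum.
sample = [
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Slack', 'BUSINESS', '4.4', '500'],
    ['Slack', 'BUSINESS', '4.4', '510'],
]

cleaned = remove_duplicates(sample)
print(cleaned)
```

Dropping the _name not in already_added_ condition would keep both 'Box' rows, which is the situation the extra bookkeeping prevents.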