# Profitable App Profiles for the App Sotre and Google Play Market
The main goal for this project is to analize data from App Store and Google Play Market in order to help a company that builds free apps to make data-driven decision. The company has revenue for in-app adds, so the amount of revenue depends on how many users has our app.

The next cell will open both data sets and convert them into a list of list:

In [1]:
App_Store_set = open('AppleStore.csv')
Google_Play_set = open('googleplaystore.csv')

from csv import reader

App_Store_read = reader(App_Store_set)
Google_Play_read = reader(Google_Play_set)

App_Store_data = list(App_Store_read)
Google_Play_data = list(Google_Play_read)

Both datasets have headers, so we will remove and save them for future reference in next cell:

In [2]:
google_header = Google_Play_data[0]
apple_header = App_Store_data[0]

In [3]:
del Google_Play_data[0]
del App_Store_data[0]

Also, while reading discussion about the Google Play Market data set (you can read about it [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion)) we can read that there is an error in an specific entry. We will proceed to delete that entry too, to avoid errors:

In [4]:
del Google_Play_data[10472]

We will now define a new function, that will help us to analize data by slicing a part of the list:

In [5]:
def explore_data(dataset, first, last, rows_and_columns=False):
    slicing = dataset[first:last]
    for app in slicing:
        print(app)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [6]:
explore_data(App_Store_data, 10, 12, rows_and_columns=True)

['512939461', 'Subway Surfers', '156038144', 'USD', '0.0', '706110', '97', '4.5', '4.0', '1.72.1', '9+', 'Games', '38', '5', '1', '1']


['362949845', 'Fruit Ninja Classic', '104590336', 'USD', '1.99', '698516', '132', '4.5', '4.0', '2.3.9', '4+', 'Games', '38', '5', '13', '1']


Number of rows:  7197
Number of columns:  16


If we keep reading about our Google Play dataset, we will find out that it has duplicate entries, that is, there are apps that appear two times in our dataset. We can confirm that by running next cell:

In [7]:
unique_names = []
duplicated_names = []

for app in Google_Play_data[1:]:
    name = app[0]
    if name in unique_names:
        duplicated_names.append(name)
    else:
        unique_names.append(name)

We need to remove duplicate entries but not any entry will work for us, so we will keep those with the highest number of reviews. We will find the entry with most reviews in next cell:

In [8]:
reviews_max = {}

for app in Google_Play_data:
    name = app[0]
    n_reviews = float(app[2])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

We will then remove duplicate rows:

In [9]:
android_clean = []
already_added = []

for app in Google_Play_data:
    name = app[0]
    n_reviews = float(app[2])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

We are also aiming only to apps for a english-speakers audience. We can check whether an app´s name is inenglish using next cell:

In [10]:
def english_app(name):
    ascii_char = 0
    for i in name:
        if ord(i) > 127:
            ascii_char += 1
    if ascii_char >3:
        return False
    else:
        return True

In [11]:
english_app('Docs To Go™ Free Office Suite')

True

Using english_app function we can now separate apps for english-speakers in a different list. We do that in next cell:

In [12]:
english_android = []
english_apple = []

for app in android_clean:
    name = app[0]
    if english_app(name) == True:
        english_android.append(app)

for app in App_Store_data:
    name = app[1]
    if english_app(name) == True:
        english_apple.append(app)

Since we are aiming for free apps, we need to separate free apps from paid apps, as shown in next cell:

In [13]:
free_apps_Android = []
paid_apps_Android = []

for app in english_android:
    price = app[6]
    if price == 'Free':
        free_apps_Android.append(app)
    else:
        paid_apps_Android.append(app)

free_apps_Apple = []
paid_apps_Apple = []

for app in english_apple:
    price = float(app[4])
    if price == 0:
        free_apps_Apple.append(app)
    else:
        paid_apps_Apple.append(app)

After cleaning data, we can now analyze both datasets in order to find a app profile that fits in both App Store and Google Play. Here is our plan:
* Build a minimal Android version of the app, and add it ti Google Play
* If the app has a good response from users, we develop it further
* If the app is profitable after six months, we build iOS version of the app and add it fo the App Store

Let's check both datasets' headers

In [14]:
google_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [15]:
apple_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we then develop it further.
- If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:

1. One function to generate frequency tables that show percentages
2. Another function that we can use to display the percentages in a descending order

In [16]:
def freq_table(dataset, index):
    total = 0
    freq_dict = {}
    for app in dataset:
        total+= 1
        genre = app[index]
        if genre in freq_dict:
            freq_dict[genre] += 1
        else:
            freq_dict[genre] = 1
            
    percentage_dict = {}
    for value in freq_dict:
        percentage = (freq_dict[value]/total)*100
        percentage_dict[value] = percentage
        
    return percentage_dict

In [17]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot app.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [18]:
genre_table_apple = freq_table(free_apps_Apple, 11)

In [19]:
for genre in genre_table_apple:
    total = 0
    len_genre = 0
    for app in free_apps_Apple:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, avg_ratings)

Food & Drink 33333.92307692308
Games 22788.6696905016
Productivity 21028.410714285714
Book 39758.5
Catalogs 4004.0
Photo & Video 28441.54375
Health & Fitness 23298.015384615384
Education 7003.983050847458
Navigation 86090.33333333333
Weather 52279.892857142855
Business 7491.117647058823
Sports 23008.898550724636
Lifestyle 16485.764705882353
Utilities 18684.456790123455
Music 57326.530303030304
Entertainment 14029.830708661417
Finance 31467.944444444445
Travel 28243.8
Shopping 26919.690476190477
Social Networking 71548.34905660378
Medical 612.0
News 21248.023255813954
Reference 74942.11111111111


For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.).

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [24]:
genre_table_google = freq_table(free_apps_Android, 1)
genre_table_google

{'ART_AND_DESIGN': 0.7534699272967614,
 'AUTO_AND_VEHICLES': 0.9517514871116985,
 'BEAUTY': 0.5551883674818242,
 'BOOKS_AND_REFERENCE': 2.1017845340383343,
 'BUSINESS': 3.344348975545274,
 'COMICS': 0.7005948446794449,
 'COMMUNICATION': 3.0931923331130204,
 'DATING': 1.7316589557171185,
 'EDUCATION': 1.4937210839391937,
 'ENTERTAINMENT': 1.3218770654329148,
 'EVENTS': 0.5948446794448117,
 'FAMILY': 19.07468605419696,
 'FINANCE': 3.8202247191011236,
 'FOOD_AND_DRINK': 1.2161269001982815,
 'GAME': 11.037673496364839,
 'HEALTH_AND_FITNESS': 3.0667547918043625,
 'HOUSE_AND_HOME': 0.8195637805684072,
 'LIBRARIES_AND_DEMO': 0.8460013218770654,
 'LIFESTYLE': 3.688037012557832,
 'MAPS_AND_NAVIGATION': 1.4805023132848645,
 'MEDICAL': 3.0138797091870457,
 'NEWS_AND_MAGAZINES': 2.6173165895571713,
 'PARENTING': 0.634500991407799,
 'PERSONALIZATION': 3.0799735624586915,
 'PHOTOGRAPHY': 3.2782551222736287,
 'PRODUCTIVITY': 3.727693324520819,
 'SHOPPING': 2.3529411764705883,
 'SOCIAL': 2.65697290152

In [25]:
for category in genre_table_google:
    total = 0
    len_category = 0
    for app in free_apps_Android:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, avg_installs)

EDUCATION 3108407.079646018
BUSINESS 2753974.1501976284
PARENTING 647208.5416666666
BOOKS_AND_REFERENCE 10476157.264150944
ART_AND_DESIGN 2003791.2280701755
WEATHER 5542846.153846154
HOUSE_AND_HOME 1565838.7096774194
PRODUCTIVITY 20537621.879432622
BEAUTY 640861.9047619047
COMMUNICATION 47166160.384615384
VIDEO_PLAYERS 27268931.944444444
PHOTOGRAPHY 18738970.201612905
ENTERTAINMENT 21134600.0
FOOD_AND_DRINK 2300192.934782609
HEALTH_AND_FITNESS 4885919.051724138
AUTO_AND_VEHICLES 737219.4444444445
SOCIAL 27302664.05472637
GAME 16655938.269461079
DATING 1075582.5190839695
LIFESTYLE 1782802.9032258065
SHOPPING 7866974.382022472
PERSONALIZATION 6562636.9527897
MEDICAL 168882.35087719298
FINANCE 1574833.2179930797
NEWS_AND_MAGAZINES 11960046.212121213
SPORTS 4601628.844537815
TRAVEL_AND_LOCAL 16171381.56424581
TOOLS 12344508.658536585
LIBRARIES_AND_DEMO 813796.875
FAMILY 3045982.508662509
COMICS 847567.9245283019
EVENTS 354431.3333333333
MAPS_AND_NAVIGATION 4491486.25
