# Profitable App Profiles for the App Store and Google Play Markets

Suppose we work at a company that only builds apps that are free to download and install, and our main source of revenue consists of in-app ads.

Because ad revenue grows with the number of people who use an app, our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [1]:
# import packages
import numpy as np
import pandas as pd

In [2]:
# read datasets into dataframe
android = pd.read_csv("googleplaystore.csv")
ios = pd.read_csv("AppleStore.csv")

In [3]:
# print the number of rows and columns in Google Play dataset
print("Number of rows in Google Play dataset:", android.shape[0])
print("Number of columns in Google Play dataset:", android.shape[1])

Number of rows in Google Play dataset: 10841
Number of columns in Google Play dataset: 13


In [4]:
# print the number of rows and columns in Apple Store dataset
print("Number of rows in Apple Store dataset:", ios.shape[0])
print("Number of columns in Apple Store dataset:", ios.shape[1])

Number of rows in Apple Store dataset: 7197
Number of columns in Apple Store dataset: 16


#### The Google Play dataset has a dedicated discussion section, and one of the discussions reports an error in row 10472.

In [5]:
# print the incorrect row
print(android.iloc[10472])

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object


#### The maximum rating for a Google Play app is 5, but this row shows a rating of 19. The values appear shifted one column to the left (Category holds 1.9 and Installs holds Free), so we'll delete this row.

In [6]:
# drop incorrect row
android.drop(10472, inplace=True)
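
Instead of hard-coding the index, rows with an out-of-range rating could also be located with a boolean mask. The following is a minimal sketch on a made-up frame: the `toy` data is hypothetical, and only the `Rating` column name and the maximum of 5 come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Google Play dataset.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 19.0, None],
})

# Google Play ratings cannot exceed 5, so anything above that is suspect.
# Missing ratings (NaN) compare as False and are therefore kept.
bad_rows = toy[toy["Rating"] > 5]
print(bad_rows.index.tolist())

# Drop the suspect rows by index label.
cleaned = toy.drop(bad_rows.index)
print(cleaned.shape[0])
```

On the real data, the same mask would be applied to `android` before calling `drop`.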

#### Another discussion points out that the Google Play dataset contains duplicate entries.

In [7]:
# print number of duplicate entries in Google Play dataset
number_of_duplicate_apps = android.shape[0] - android['App'].nunique()
print("Number of duplicate apps:", number_of_duplicate_apps)

Number of duplicate apps: 1181


In [8]:
# removing duplicate entries
android.drop_duplicates(subset=['App'], inplace=True)

In [9]:
# print number of rows and columns in Google Play dataset after removing duplicate entries
print("Number of rows in Google Play dataset:", android.shape[0])
print("Number of columns in Google Play dataset:", android.shape[1])

Number of rows in Google Play dataset: 9659
Number of columns in Google Play dataset: 13
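
Note that `drop_duplicates` keeps the first occurrence of each app by default. If the duplicate rows differ in their review counts, one alternative (not what this notebook does) is to keep the copy with the most reviews, on the assumption that it is the most recent snapshot. A rough sketch on hypothetical toy data, assuming `Reviews` is stored as text:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Google Play dataset,
# where Reviews is assumed to be stored as text.
toy = pd.DataFrame({
    "App": ["Chat", "Chat", "Notes"],
    "Reviews": ["100", "250", "40"],
})

# Convert review counts to integers so they sort numerically.
toy["Reviews"] = toy["Reviews"].astype(int)

# Sort so the most-reviewed copy of each app comes first, keep that copy,
# then restore the original row order.
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset=["App"], keep="first")
       .sort_index()
)
print(deduped)
```

The row count after this step matches the plain `drop_duplicates` call; only the choice of which duplicate survives changes.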


#### Selecting the free apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [10]:
# keep only free apps; .copy() gives independent frames that can be modified later
free_android = android[(android['Price'] == '0') | (android['Price'] == '0.0')].copy()
free_ios = ios[ios['price'] == 0].copy()

print("Number of free android apps:", free_android.shape[0])
print("Number of free ios apps:", free_ios.shape[0])

Number of free android apps: 8903
Number of free ios apps: 4056
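
The Google Play filter compares strings because `Price` is stored as text. A possibly more robust variant converts the column to numbers first. The sketch below uses hypothetical toy data and assumes paid prices carry a leading dollar sign (for example `$4.99`).

```python
import pandas as pd

# Hypothetical toy data; the "$4.99" format for paid apps is an assumption.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Price": ["0", "$4.99", "0.0"],
})

# Strip the dollar sign (regex=False treats "$" as a literal character),
# then parse the remaining text as numbers.
price = pd.to_numeric(toy["Price"].str.replace("$", "", regex=False))

# A numeric comparison covers both "0" and "0.0" in one condition.
free = toy[price == 0].copy()
print(free["App"].tolist())
```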


#### Find Most Common Apps by Genre

In [11]:
# share of free apps per genre in the Google Play dataset
free_android['Genres'].value_counts(normalize=True).sort_values(ascending=False)

Tools                                    0.084017
Entertainment                            0.060878
Education                                0.053914
Business                                 0.045827
Lifestyle                                0.039200
Productivity                             0.038863
Finance                                  0.036842
Medical                                  0.035044
Sports                                   0.034483
Personalization                          0.033135
Communication                            0.032349
Action                                   0.030888
Health & Fitness                         0.030664
Photography                              0.029428
News & Magazines                         0.028305
Social                                   0.026508
Travel & Local                           0.023138
Shopping                                 0.022464
Books & Reference                        0.021790
Simulation                               0.020667
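
Some Google Play `Genres` values combine several tags separated by `;` (for example `Tools;Education`), and `value_counts` treats each combination as its own genre. If counting individual tags is preferred, one option is to split and explode the column. This is a sketch on hypothetical toy data, not a step taken in this notebook:

```python
import pandas as pd

# Hypothetical toy data; the ";" separator mirrors the Genres column.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Genres": ["Tools", "Puzzle;Action & Adventure", "Tools;Education"],
})

# Split each value on ";" and give every tag its own row.
tags = toy["Genres"].str.split(";").explode()

# Share of each tag among all tags (not among apps).
share = tags.value_counts(normalize=True)
print(share)
```

Because an app can contribute more than one tag, these shares are relative to the number of tags rather than the number of apps.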


In [12]:
# share of free apps per genre in the Apple Store dataset
free_ios['prime_genre'].value_counts(normalize=True).sort_values(ascending=False)

Games                0.556460
Entertainment        0.082347
Photo & Video        0.041174
Social Networking    0.035256
Education            0.032544
Shopping             0.029832
Utilities            0.026874
Lifestyle            0.023176
Finance              0.020710
Sports               0.019477
Health & Fitness     0.018738
Music                0.016519
Book                 0.016272
Productivity         0.015286
News                 0.014300
Travel               0.013807
Food & Drink         0.010602
Weather              0.007643
Reference            0.004931
Navigation           0.004931
Business             0.004931
Catalogs             0.002219
Medical              0.001972
Name: prime_genre, dtype: float64

#### Find Most Popular Apps by Genre on the App Store

In [13]:
free_ios.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4056 entries, 0 to 7196
Data columns (total 16 columns):
id                  4056 non-null int64
track_name          4056 non-null object
size_bytes          4056 non-null int64
currency            4056 non-null object
price               4056 non-null float64
rating_count_tot    4056 non-null int64
rating_count_ver    4056 non-null int64
user_rating         4056 non-null float64
user_rating_ver     4056 non-null float64
ver                 4056 non-null object
cont_rating         4056 non-null object
prime_genre         4056 non-null object
sup_devices.num     4056 non-null int64
ipadSc_urls.num     4056 non-null int64
lang.num            4056 non-null int64
vpp_lic             4056 non-null int64
dtypes: float64(3), int64(8), object(5)
memory usage: 459.5+ KB


In [14]:
free_ios.groupby("prime_genre")['rating_count_tot'].mean().sort_values(ascending=False)

prime_genre
Reference            67447.900000
Music                56482.029851
Social Networking    53078.195804
Weather              47220.935484
Photo & Video        27249.892216
Navigation           25972.050000
Travel               20216.017857
Food & Drink         20179.093023
Sports               20128.974684
Health & Fitness     19952.315789
Productivity         19053.887097
Games                18924.688968
Shopping             18746.677686
News                 15892.724138
Utilities            14010.100917
Finance              13522.261905
Entertainment        10822.961078
Lifestyle             8978.308511
Book                  8498.333333
Business              6367.800000
Education             6266.333333
Catalogs              1779.555556
Medical                459.750000
Name: rating_count_tot, dtype: float64
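
A per-genre mean can be pulled upward by a handful of very large apps, so it may help to look at the median and the number of apps alongside it. The sketch below uses made-up numbers (the `toy` frame is hypothetical; only the column names come from the dataset) to show how the two statistics can disagree.

```python
import pandas as pd

# Hypothetical toy data: one giant app dominates the "Music" genre.
toy = pd.DataFrame({
    "prime_genre": ["Music", "Music", "Music", "Weather", "Weather"],
    "rating_count_tot": [1_000_000, 500, 300, 40_000, 60_000],
})

# Compute mean, median, and app count per genre in one pass,
# then rank genres by the median instead of the mean.
summary = (
    toy.groupby("prime_genre")["rating_count_tot"]
       .agg(["mean", "median", "count"])
       .sort_values("median", ascending=False)
)
print(summary)
```

Here "Music" has the higher mean but "Weather" has the higher median, which is the kind of gap worth checking before picking a genre.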

#### Find Most Popular Apps by Genre on Google Play

In [15]:
# strip "," and "+" (regex=False treats "+" literally), then convert to float;
# assign() returns a new frame, which avoids the SettingWithCopyWarning
installs = free_android['Installs'].str.replace(",", "", regex=False).str.replace("+", "", regex=False)
free_android = free_android.assign(Installs=installs.astype("float"))

free_android.groupby("Genres")['Installs'].mean().sort_values(ascending=False)


Genres
Communication                            3.832263e+07
Adventure;Action & Adventure             3.533333e+07
Video Players & Editors                  2.479007e+07
Social                                   2.325365e+07
Arcade                                   2.288837e+07
Casual                                   1.956922e+07
Puzzle;Action & Adventure                1.836667e+07
Photography                              1.773767e+07
Educational;Action & Adventure           1.701667e+07
Productivity                             1.673896e+07
Racing                                   1.591065e+07
Travel & Local                           1.405148e+07
Casual;Action & Adventure                1.291667e+07
Action                                   1.260359e+07
Strategy                                 1.112429e+07
Tools                                    1.068188e+07
Card;Action & Adventure                  1.000000e+07
Tools;Education                          1.000000e+07
Lifestyle;Pretend Pla