# Recommendation for Android and iOS mobile apps
#### Project Summary

In this guided project, you'll work as a data analyst for a company that builds Android and iOS mobile apps and makes them available on Google Play and the App Store.

The company only builds apps that are free to download and install, so its main source of revenue is in-app ads. Revenue for any given app therefore depends mostly on the number of users: the more users who see and engage with the ads, the better. The goal of this project is to analyze data to help the developers understand which types of apps are likely to attract more users.

In [1]:
import pandas as pd
from langdetect import detect

In [2]:
g_store = pd.read_csv('./data/googleplaystore.csv', sep=',')
a_store = pd.read_csv('./data/applestore.csv', sep=',')

**Checking the dimensions of the data**

In [3]:
g_store.shape

(10841, 13)

In [4]:
g_store.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [5]:
g_store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [6]:
g_store.describe(include='all')

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
count,10841,10841,9367.0,10841.0,10841,10841,10840,10841.0,10840,10841,10841,10833,10838
unique,9660,34,,6002.0,462,22,3,93.0,6,120,1378,2832,33
top,ROBLOX,FAMILY,,0.0,Varies with device,"1,000,000+",Free,0.0,Everyone,Tools,"August 3, 2018",Varies with device,4.1 and up
freq,9,1972,,596.0,1695,1579,10039,10040.0,8714,842,326,1459,2451
mean,,,4.193338,,,,,,,,,,
std,,,0.537431,,,,,,,,,,
min,,,1.0,,,,,,,,,,
25%,,,4.0,,,,,,,,,,
50%,,,4.3,,,,,,,,,,
75%,,,4.5,,,,,,,,,,


In [7]:
a_store.shape

(7197, 16)

In [8]:
a_store.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [9]:
#Number of non-null value per column with corresponding datatype
a_store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                7197 non-null   int64  
 1   track_name        7197 non-null   object 
 2   size_bytes        7197 non-null   int64  
 3   currency          7197 non-null   object 
 4   price             7197 non-null   float64
 5   rating_count_tot  7197 non-null   int64  
 6   rating_count_ver  7197 non-null   int64  
 7   user_rating       7197 non-null   float64
 8   user_rating_ver   7197 non-null   float64
 9   ver               7197 non-null   object 
 10  cont_rating       7197 non-null   object 
 11  prime_genre       7197 non-null   object 
 12  sup_devices.num   7197 non-null   int64  
 13  ipadSc_urls.num   7197 non-null   int64  
 14  lang.num          7197 non-null   int64  
 15  vpp_lic           7197 non-null   int64  
dtypes: float64(3), int64(8), object(5)
memory 

In [10]:
# Summary statistics for numerical and categorical columns
a_store.describe(include='all')

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,7197.0,7197,7197.0,7197,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197,7197,7197.0,7197.0,7197.0,7197.0
unique,,7195,,1,,,,,,1590.0,4,23,,,,
top,,Mannequin Challenge,,USD,,,,,,1.0,4+,Games,,,,
freq,,2,,7197,,,,,,317.0,4433,3862,,,,
mean,863131000.0,,199134500.0,,1.726218,12892.91,460.373906,3.526956,3.253578,,,,37.361817,3.7071,5.434903,0.993053
std,271236800.0,,359206900.0,,5.833006,75739.41,3920.455183,1.517948,1.809363,,,,3.737715,1.986005,7.919593,0.083066
min,281656500.0,,589824.0,,0.0,0.0,0.0,0.0,0.0,,,,9.0,0.0,0.0,0.0
25%,600093700.0,,46922750.0,,0.0,28.0,1.0,3.5,2.5,,,,37.0,3.0,1.0,1.0
50%,978148200.0,,97153020.0,,0.0,300.0,23.0,4.0,4.0,,,,37.0,5.0,1.0,1.0
75%,1082310000.0,,181924900.0,,1.99,2793.0,140.0,4.5,4.5,,,,38.0,5.0,8.0,1.0


## Data Cleaning & Preprocessing

In [11]:
def remove_wrong_data(data, column):
    """Delete `column` from `data` in place."""
    del data[column]

In [12]:
'''
The 'Android Ver' column contains inconsistent data
(for example, 'Varies with device' instead of a version),
so we delete the column.
'''

remove_wrong_data(g_store,'Android Ver')

In [13]:
g_store.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver'],
      dtype='object')

In [14]:
'''
Every app name is supposed to be unique. The App column has
9660 unique values, fewer than its 10841 non-null values,
which indicates the presence of duplicate names.
'''

# Find duplicates in the column 'App'
duplicate_rows = g_store[g_store.duplicated(['App'])]

In [15]:
# Number of duplicate rows
len(duplicate_rows)

1181

In [16]:
def data_without_duplicate(data,column):
    copy_without_duplicate = data.drop_duplicates(subset = [column])
    return copy_without_duplicate

In [17]:
google_no_duplicate = data_without_duplicate(g_store, 'App')

In [18]:
len(google_no_duplicate)

9660
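
`drop_duplicates` keeps the first occurrence of each name by default. As a possible alternative, the sketch below keeps the row with the most reviews for each app, on the assumption that the highest review count marks the most recent record. The helper name `keep_most_reviewed` and the sample frame are hypothetical; `Reviews` is converted with `pd.to_numeric` because the column is stored as `object` in this dataset.

```python
import pandas as pd

def keep_most_reviewed(data, name_col="App", reviews_col="Reviews"):
    """Return one row per app name, keeping the row with the most reviews."""
    # Reviews is an object column, so coerce it to numbers for ranking only.
    ranked = data.assign(
        _reviews=pd.to_numeric(data[reviews_col], errors="coerce").fillna(-1)
    )
    ranked = ranked.sort_values("_reviews", ascending=False, kind="stable")
    deduped = ranked.drop_duplicates(subset=[name_col], keep="first")
    # Drop the helper column and restore the original row order.
    return deduped.drop(columns="_reviews").sort_index()

# Hypothetical sample: app "A" appears three times with different review counts.
sample = pd.DataFrame({
    "App": ["A", "A", "B", "A"],
    "Reviews": ["10", "25", "7", "3"],
})
result = keep_most_reviewed(sample)
print(result)
```

Applied to this notebook, the call would be `keep_most_reviewed(g_store)`; the row count should still be 9660, but the surviving rows may differ from those kept by `data_without_duplicate`.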

In [59]:
# Removing non-English apps from the Google dataset
# First we create a column with the detected language of each app name

google_no_duplicate['Language'] = google_no_duplicate['App'].apply(lambda x: detect(x))

In [58]:
# Detecting the language of each app name in the Apple dataset

a_store['language'] = a_store['track_name'].apply(lambda x: detect(x))

In [None]:
# Selecting only English apps ('en' is the language code returned by detect)
google_eng = google_no_duplicate.loc[google_no_duplicate['Language'] == 'en']

In [61]:
# Google dataset with free apps
google_store = google_no_duplicate.loc[google_no_duplicate['Type']== 'Free']



In [62]:
# Apple dataset with free apps
apple = a_store.loc[a_store['price']== 0.0]


In [63]:
google=google_store.dropna()

In [64]:
# Re-inspecting the cleaned Google dataset
google.info()
google.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7589 entries, 0 to 10840
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             7589 non-null   object 
 1   Category        7589 non-null   object 
 2   Rating          7589 non-null   float64
 3   Reviews         7589 non-null   object 
 4   Size            7589 non-null   object 
 5   Installs        7589 non-null   object 
 6   Type            7589 non-null   object 
 7   Price           7589 non-null   object 
 8   Content Rating  7589 non-null   object 
 9   Genres          7589 non-null   object 
 10  Last Updated    7589 non-null   object 
 11  Current Ver     7589 non-null   object 
dtypes: float64(1), object(11)
memory usage: 770.8+ KB


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver
count,7589,7589,7589.0,7589.0,7589,7589,7589,7589.0,7589,7589,7589,7589
unique,7589,33,,5125.0,372,19,1,1.0,6,112,1210,2522
top,Grindr - Gay chat,FAMILY,,3.0,Varies with device,"1,000,000+",Free,0.0,Everyone,Tools,"August 3, 2018",Varies with device
freq,1,1454,,70.0,1106,1394,7589,7589.0,6102,653,236,952
mean,,,4.166504,,,,,,,,,
std,,,0.534153,,,,,,,,,
min,,,1.0,,,,,,,,,
25%,,,4.0,,,,,,,,,
50%,,,4.3,,,,,,,,,
75%,,,4.5,,,,,,,,,


## Analyze

In [65]:
# Counting the apps in each install bucket
google['Installs'].value_counts()

1,000,000+        1394
100,000+          1012
10,000,000+        935
10,000+            870
5,000,000+         607
1,000+             566
500,000+           492
50,000+            417
5,000+             359
100+               237
50,000,000+        202
100,000,000+       188
500+               163
10+                 51
50+                 42
500,000,000+        24
1,000,000,000+      20
5+                   9
1+                   1
Name: Installs, dtype: int64
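
The install buckets are stored as strings such as `1,000,000+`, so they cannot be ordered or averaged directly. A minimal sketch, assuming every label has the form shown above (digits with thousands separators and a trailing `+`), converts each label to its integer lower bound. The helper name `parse_installs` is hypothetical.

```python
def parse_installs(label):
    """Convert a label such as '1,000,000+' to its lower bound, 1000000."""
    return int(label.replace(",", "").rstrip("+"))

# A few labels taken from the value counts above.
labels = ["1,000,000+", "500+", "1,000,000,000+", "10,000+", "1+"]
ordered = sorted(labels, key=parse_installs, reverse=True)
print(ordered)
# ['1,000,000,000+', '1,000,000+', '10,000+', '500+', '1+']
```

In the notebook this could be applied as `google['Installs'].map(parse_installs)`. The result is only a lower bound on the true install count, so any average built on it understates the real figure.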

In [66]:
highest_installs = google.loc[google['Installs'] == '1,000,000,000+']


In [67]:
'''
Communication apps are the most common category among apps
with over one billion installs (6 of the 20 such apps), even
though this install bucket is only a small fraction of the
7589 apps in the cleaned sample.
'''
highest_installs['Category'].value_counts()

COMMUNICATION          6
SOCIAL                 3
TRAVEL_AND_LOCAL       2
VIDEO_PLAYERS          2
GAME                   1
TOOLS                  1
NEWS_AND_MAGAZINES     1
PHOTOGRAPHY            1
PRODUCTIVITY           1
ENTERTAINMENT          1
BOOKS_AND_REFERENCE    1
Name: Category, dtype: int64
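
To put these counts in proportion, the short sketch below recomputes each category's share of the 1,000,000,000+ bucket from the counts printed above. The dictionary is copied by hand from that output; the variable names are illustrative.

```python
# Category counts for apps with 1,000,000,000+ installs (from the output above).
top_bucket_counts = {
    "COMMUNICATION": 6, "SOCIAL": 3, "TRAVEL_AND_LOCAL": 2,
    "VIDEO_PLAYERS": 2, "GAME": 1, "TOOLS": 1,
    "NEWS_AND_MAGAZINES": 1, "PHOTOGRAPHY": 1, "PRODUCTIVITY": 1,
    "ENTERTAINMENT": 1, "BOOKS_AND_REFERENCE": 1,
}
total_top = sum(top_bucket_counts.values())
shares = {cat: n / total_top for cat, n in top_bucket_counts.items()}

cleaned_sample_size = 7589  # rows in the cleaned Google dataset
print(total_top)                                   # 20
print(f"{shares['COMMUNICATION']:.0%}")            # 30%
print(f"{total_top / cleaned_sample_size:.2%}")    # 0.26%
```

With pandas, the same shares should be available from `highest_installs['Category'].value_counts(normalize=True)`.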

In [73]:
nexthighest_installs = google.loc[google['Installs'] == '1,000,000+']

In [74]:
'''
We repeat the check for the most frequent install bucket.
Family apps are the most common category among apps
with 1,000,000+ installs.
'''
nexthighest_installs['Category'].value_counts()

FAMILY                 253
GAME                   151
TOOLS                   97
FINANCE                 57
HEALTH_AND_FITNESS      56
PRODUCTIVITY            56
PHOTOGRAPHY             49
NEWS_AND_MAGAZINES      47
EDUCATION               44
SPORTS                  43
LIFESTYLE               43
PERSONALIZATION         42
COMMUNICATION           40
TRAVEL_AND_LOCAL        38
ENTERTAINMENT           36
BUSINESS                34
SHOPPING                34
SOCIAL                  33
VIDEO_PLAYERS           33
FOOD_AND_DRINK          26
MAPS_AND_NAVIGATION     25
WEATHER                 21
HOUSE_AND_HOME          21
BOOKS_AND_REFERENCE     20
DATING                  19
MEDICAL                 16
PARENTING               13
AUTO_AND_VEHICLES       13
COMICS                  11
BEAUTY                   8
LIBRARIES_AND_DEMO       7
ART_AND_DESIGN           4
EVENTS                   4
Name: Category, dtype: int64
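
Raw counts favor large categories: FAMILY is also the largest category overall (1454 of the 7589 cleaned apps), so its 253 apps in this bucket are roughly 17% of the category. One way to correct for category size is to compute, for each category, the share of its apps that fall in each install bucket. The sketch below uses `pd.crosstab` with `normalize="index"` on a small made-up frame; the rows are hypothetical and only the column names match the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the Google dataset.
apps = pd.DataFrame({
    "Category": ["FAMILY"] * 4 + ["COMMUNICATION"] * 2,
    "Installs": ["1,000,000+", "1,000+", "1,000+", "10,000+",
                 "1,000,000+", "1,000,000,000+"],
})

# Each row sums to 1: the share of a category's apps in every install bucket.
bucket_share = pd.crosstab(apps["Category"], apps["Installs"], normalize="index")
print(bucket_share["1,000,000+"].sort_values(ascending=False))
```

In this toy frame both categories have one app in the 1,000,000+ bucket, yet the share is 0.5 for COMMUNICATION and 0.25 for FAMILY, which is the kind of difference raw counts hide.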

In [72]:
# Mean total rating count per genre for the free App Store apps
apple[['prime_genre','rating_count_tot']].groupby('prime_genre').mean()

prime_genre,rating_count_tot
Book,8498.333333
Business,6367.8
Catalogs,1779.555556
Education,6266.333333
Entertainment,10822.961078
Finance,13522.261905
Food & Drink,20179.093023
Games,18924.688968
Health & Fitness,19952.315789
Lifestyle,8978.308511
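
A per-genre mean can be pulled up by a handful of very popular apps, so it may help to view the mean next to the median and the number of apps in each genre, sorted from highest to lowest. The sketch below shows one way to do that with `groupby(...).agg(...)`, using the five rows shown in the `a_store.head()` output as sample data; the variable names are illustrative.

```python
import pandas as pd

# The five rows from a_store.head() above (all are free apps).
apps = pd.DataFrame({
    "track_name": ["Facebook", "Instagram", "Clash of Clans",
                   "Temple Run", "Pandora - Music & Radio"],
    "prime_genre": ["Social Networking", "Photo & Video", "Games",
                    "Games", "Music"],
    "rating_count_tot": [2974676, 2161558, 2130805, 1724546, 1126879],
})

summary = (
    apps.groupby("prime_genre")["rating_count_tot"]
    .agg(["mean", "median", "count"])       # one column per statistic
    .sort_values("mean", ascending=False)   # most-rated genres first
)
print(summary)
```

On the full data the equivalent call would replace `apps` with the notebook's `apple` frame.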


## Recommendation
From the analysis above, the recommended app categories for Android are Communication and Family, while the recommended category for the App Store is Books. These could be profitable for a developer, since users of such apps are assumed to spend enough time in them to engage with the ads.