# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets.

## Exploring the Data

We use two data sets:

* A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://www.kaggle.com/lava18/google-play-store-apps).
* A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

### Read files

In [None]:
from csv import reader

# Read the App Store file as a list of lists (one inner list per row)
with open('AppleStore.csv', encoding='utf8') as f:
    apple_data=list(reader(f))

# Keep the header row; the [:1] slice gives a list that contains that one row
apple_header=apple_data[:1]

# Keep the remaining rows as the Apple data
apple=apple_data[1:]

# Read the Google Play file the same way
with open('googleplaystore.csv', encoding='utf8') as f:
    google_data=list(reader(f))

# Keep the header row; the [:1] slice gives a list that contains that one row
google_header=google_data[:1]

# Keep the remaining rows as the Google data
google=google_data[1:]
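
The two reads above follow the same pattern, so they could be wrapped in one helper. The following is a minimal sketch, not part of the original notebook: `load_csv` is a hypothetical name, and it returns the header as a flat list (unlike `apple_header` and `google_header` above, which are lists containing one row).

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`."""
    # The context manager closes the file even if parsing fails.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    # The first row is the header, everything after it is data.
    return rows[0], rows[1:]
```

With this helper, a call such as `header, rows = load_csv('AppleStore.csv')` would replace three separate statements, assuming the file sits in the working directory.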

### Create explore_dataset() function to explore both data sets

In [None]:
def explore_dataset(data,start_index,end_index,rows_and_columns=False):
    # print only the requested slice of rows
    dataset_slice=data[start_index:end_index]
    print(dataset_slice)
    if rows_and_columns:
        print('no of rows %s'%(len(data)))
        print('no of columns %s'%(len(data[0])))

In [None]:
# explore the first six rows of the apple dataset

explore_dataset(apple,0,6,True)

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']]
no of rows 7197
no of columns 16


In [None]:
# explore the first six rows of the google dataset

explore_dataset(google,0,6,True)

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', '

#### Columns that look useful for the analysis
* In the Apple data set: 'price', 'user_rating', 'prime_genre', 'size_bytes'
* In the Google data set: 'Type', 'Price', 'Genres', 'Reviews', 'Installs', 'Size'

# Deleting Wrong Data

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

The Google Play data set has a [dedicated discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and one of the discussions there reports an error in row 10472. Let's print this row and compare it against the header and against another row that is correct.

In [None]:
print(google[10472])  #incorrect row
print('\n')
print(google_header) # google dataset header row
print(google[10467])       # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]
['FI CFL', 'FINANCE', '3.7', '112', '3.9M', '10,000+', 'Free', '0', 'Everyone', 'Finance', 'July 5, 2018', '1.1.1', '5.0 and up']


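Row 10472 has only 12 values instead of 13: its 'Category' value is missing, so every later value is shifted one column to the left. As a result the 'Rating' column holds 19, which is above the maximum rating of 5. We delete this row.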

In [None]:
print(len(google)) # before deleting row
del google[10472]
print(len(google)) #after deleting row

10841
10840
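
The row above was found through the Kaggle discussion, but the same kind of error can be detected directly, because a row with a missing value has fewer columns than the header. The following is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook.

```python
def find_malformed_rows(rows, header):
    """Return the indices of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Toy data shaped like the Google Play rows shown above
header = ["App", "Category", "Rating"]
rows = [
    ["FI CFL", "FINANCE", "3.7"],
    ["Life Made WI-Fi Touchscreen Photo Frame", "1.9"],  # 'Category' is missing
]
print(find_malformed_rows(rows, header))  # prints [1]
```

In this notebook the header is stored as a list containing one row, so the call would be `find_malformed_rows(google, google_header[0])`.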


# Removing Duplicate Entries: Part One

### Count the duplicate and unique app entries in both data sets
The Apple data set has no duplicate entries.

In [None]:
# Google dataset

duplicate_apps=[]
unique_apps=[]
for app in google:
    app_name=app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
print('No of duplicate entries in google dataset',len(duplicate_apps))
print('No of unique entries in google dataset',len(unique_apps))

# Apple dataset
# Note: column 0 of the Apple data is the app 'id'; the name ('track_name') is column 1

duplicate_apps_apple=[]
unique_apps_apple=[]
for app in apple:
    app_id=app[0]
    # compare against the Apple list, not the Google one
    if app_id in unique_apps_apple:
        duplicate_apps_apple.append(app_id)
    else:
        unique_apps_apple.append(app_id)
print('No of duplicate entries in apple dataset',len(duplicate_apps_apple))
print('No of unique entries in apple dataset',len(unique_apps_apple))


No of duplicate entries in google dataset 1181
No of unique entries in google dataset 9659
No of duplicate entries in apple dataset 0
No of unique entries in apple dataset 7197


In [None]:
# print few duplicate apps
print(duplicate_apps[:5])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In [None]:
# Print every row of one of the duplicated apps

for app in google:
    if app[0]==duplicate_apps[0]:
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [None]:
for app in google:
    if app[0]==duplicate_apps[5]:
        print(app)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [None]:
for app in google:
    if app[0]=="Garena Free Fire":
        print(app)

['Garena Free Fire', 'GAME', '4.5', '5465624', '53M', '100,000,000+', 'Free', '0', 'Teen', 'Action', 'August 3, 2018', '1.21.0', '4.0.3 and up']
['Garena Free Fire', 'GAME', '4.5', '5476569', '53M', '100,000,000+', 'Free', '0', 'Teen', 'Action', 'August 3, 2018', '1.21.0', '4.0.3 and up']
['Garena Free Fire', 'GAME', '4.5', '5476569', '53M', '100,000,000+', 'Free', '0', 'Teen', 'Action', 'August 3, 2018', '1.21.0', '4.0.3 and up']
['Garena Free Fire', 'GAME', '4.5', '5534114', '53M', '100,000,000+', 'Free', '0', 'Teen', 'Action', 'August 3, 2018', '1.21.0', '4.0.3 and up']


### Observation: for some duplicated apps all rows are identical; for others only the 'Reviews' value differs and the rest of the data is the same.

* The row with the highest number of reviews is most likely the most recent one, so I will keep that row and remove the other duplicates.

# Removing Duplicate Entries: Part Two

In [None]:
# check the index of the 'Reviews' column

print(google_header)
google_header[0].index('Reviews')

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]


3

* Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app.

In [None]:
# initialise an empty dictionary to store the maximum review count of each app
reviews_max={}

for row in google:
    app_name=row[0]
    reviews=float(row[3])
    if app_name in reviews_max and reviews_max[app_name]<reviews:
        reviews_max[app_name]=reviews
    # a separate 'if' instead of 'else', so a larger stored value is never overwritten
    if app_name not in reviews_max:
        reviews_max[app_name]=reviews
print(len(reviews_max))

9659


* I will use the reviews_max dictionary to remove the duplicate entries.

In [None]:
# create empty list to store cleaned data
google_clean=[]

#create empty list to store app names
already_added=[]

#loop through the google data set
for app_row in google:
    app_name=app_row[0]
    reviews=float(app_row[3])
    if (reviews==reviews_max[app_name]) and (app_name not in already_added):
        google_clean.append(app_row)
        already_added.append(app_name) # must stay inside the if block, so only kept apps are recorded

* Explore the cleaned Google data

In [None]:
print(google_clean[:5])
len(google_clean)

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


9659
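
The two-step approach above (build `reviews_max`, then filter with `already_added`) can also be done in one pass by storing the best row per app name directly. This is a sketch under the same column assumptions (name at index 0, reviews at index 3); `dedupe_by_max_reviews` is a hypothetical helper, and it returns rows in order of each app's first appearance, which may differ from the order of `google_clean`.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows: app "A" appears three times with different review counts
toy = [
    ["A", "GAME", "4.5", "10"],
    ["B", "TOOLS", "4.0", "5"],
    ["A", "GAME", "4.5", "12"],
    ["A", "GAME", "4.5", "12"],
]
print(dedupe_by_max_reviews(toy))
```

A dictionary lookup takes roughly constant time, whereas `app_name not in already_added` scans a list, so this version should scale better on larger data sets.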

# Removing Non-English Apps: Part One

* We use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience.

* The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127 (the ASCII range). We will use this to remove non-English apps.

### Write a function that takes a string and returns False if any character in the string falls outside the 0 to 127 range

In [None]:
# Approach 1

def english_apps(input_string):
    for s in input_string:
        if ord(s) not in range(0,128):
            return False
    # return True only after every character has been checked
    return True

# Approach 2

def englishh_apps(input_string):
    for s in input_string:
        if ord(s)>127:
            return False
    return True

# Both approaches are equivalent. The 'return True' must come after the loop:
# returning True inside the loop would end the check after the first character.

In [None]:
# Check that the functions classify app names correctly
# Both functions are tested here, but only one is needed later

print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(englishh_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))

print(english_apps('Instagram'))
print(englishh_apps('Instagram'))

False
False
True
True


In [None]:
englishh_apps('Instachat 😜')
# the emoji is outside the 0 to 127 range, so this English app name
# is classified as non-English

False

In [None]:
english_apps('😜')
# a single emoji on its own is rejected as well

False

# Removing Non-English Apps: Part Two

* The function does not count names such as 'Go™ Free Office Suite' and 'Instachat 😜' as English. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

* If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English.


In [None]:
ord('😜')

128540

 * To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [None]:
def english_appss(input_string):
    # count the characters that fall outside the ASCII range
    count=0
    for s in input_string:
        if ord(s)>127:
            count+=1
    # allow up to three non-ASCII characters
    return count<=3

In [None]:
english_appss('Instachat 😜😜😜😜')

False

In [None]:
english_appss('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [None]:
english_appss('Instachat 😜')

True

The function above now does a reasonably good job of filtering out non-English apps, although it is not perfect.
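
The same rule can be written more compactly, with the threshold exposed as a parameter so it is easy to tune. This is only a sketch of the "at most three non-ASCII characters" rule described above; `is_probably_english` and `max_non_ascii` are hypothetical names.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instachat 😜'))                    # True: one non-ASCII character
print(is_probably_english('Instachat 😜😜😜😜'))              # False: four non-ASCII characters
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```

The threshold is a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass.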

In [None]:
# For the Google data set

# initialise an empty list to store the rows of English-language apps only
english_nonduplicate=[]

# keep only the rows whose app name passes the English check

for app_row in google_clean:
    app_name=app_row[0]
    if english_appss(app_name):
        english_nonduplicate.append(app_row)

# For the Apple data set
# Note: app_row[0] is the 'id' column (the app name, 'track_name', is at index 1).
# An id contains only digits, so this check keeps every Apple row.

english_apple=[]

for app_row in apple:
    app_name=app_row[0]
    if english_appss(app_name):
        english_apple.append(app_row)

In [None]:
#explore data in new list having only english apps and non duplicated row
#entries in google data set
print(len(english_nonduplicate))
print(english_nonduplicate[:5])

9614
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


In [None]:
#explore data in new list having only english apps and non duplicated row
#entries in apple data set
print(len(english_apple))
print(english_apple[:5])

7197
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']]


# Isolate the Free Apps

In [None]:
# Google dataset

# check index of price item
price_index=google_header[0].index('Price')
print(price_index)

# check type of price value whether it is string or float
type(google[0][7])

# it is a string, so free apps are matched below by comparing against the string '0'

7


str

In [None]:
apple_header

[['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic']]

In [None]:
# Apple dataset

# check index of price item
price_index=apple_header[0].index('price')
print(price_index)

# check type of price value whether it is string or float
type(apple[0][4])

print(apple[:3])

4
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]


In [None]:
# For google dataset
#initialise list to store apps data which are free
free_english_nonduplicate=[]

# Loop over english_nonduplicate list and isolate free apps in separate list

for app_row in english_nonduplicate:
    if app_row[7]=='0':
        free_english_nonduplicate.append(app_row)


# For the Apple data set

# initialise a list to store the rows of free apps
free_english_apple=[]

# Loop over the english_apple list and isolate the free apps in a separate list

for app_row in english_apple:
    if app_row[4]=='0.0':
        free_english_apple.append(app_row)


In [None]:
# check no of rows left in google data set after above analysis
len(free_english_nonduplicate)

8864

In [None]:
# check no of rows left in apple data set after above analysis
len(free_english_apple)

4056
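
The comparisons above match the exact strings `'0'` and `'0.0'`, which works only as long as every free app uses exactly that spelling. A more tolerant check parses the price as a number first. This is a sketch, not the notebook's method: `is_free` is a hypothetical helper, and it assumes that paid prices are plain numbers with an optional leading `$`.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$4.99' equals zero."""
    # Assumption: the only non-numeric decoration is an optional leading '$'.
    return float(price.strip().lstrip('$')) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

Any other non-numeric value would raise a `ValueError`, which is arguably preferable to silently treating a malformed row as a paid app.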

# Most Common Apps by Genre: Part One

So far, I have spent a good amount of time cleaning the data:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

### Observation:
In the Google data set, the 'Category', 'Genres', 'Size', 'Current Ver' and 'Reviews' columns can be used for analysis.
* 'Genres' and 'Category' are the most suitable columns for building genre frequency tables in the Google data set.
* 'prime_genre' plays the same role in the Apple data set.

# Most Common Apps by Genre: Part Two

### Create freq_table() function to find the most common genres

In [None]:
def freq_table(dataset,index):
    table={}
    total=0
    for app_row in dataset:
        total+=1
        item=app_row[index]
        if item in table:
            table[item]+=1
        else:
            table[item]=1
    freq_percentage={}
    for key in table:
        freq_percentage[key]=(table[key]/total)*100
    return freq_percentage

### Create display_table() function to display frequency table of columns

In [None]:
def display_table(dataset,index):
    table=freq_table(dataset,index)
    table_display=[]
    for key in table:
        key_val_as_tuple=(table[key],key)
        table_display.append(key_val_as_tuple)
    table_sorted=sorted(table_display,reverse=True)
    for entry in table_sorted:
        print(entry[1],':',entry[0])
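
For comparison, the counting and sorting done by `freq_table()` and `display_table()` can be sketched with `collections.Counter` from the standard library. `freq_table_counter` is a hypothetical name and is not used elsewhere in this notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for column `index`, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows with the genre in column 1
toy_rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
print(freq_table_counter(toy_rows, 1))  # {'Games': 75.0, 'Music': 25.0}
```

Because dictionaries keep insertion order in Python 3.7 and later, the result is already sorted from most to least common.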

In [None]:
# find the index values of 'Genres' and 'Category'
print(google_header[0].index('Genres'))
print(google_header[0].index('Category'))

9
1


In [None]:
# display freq table for 'Genres' column
display_table(free_english_nonduplicate,9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

### Observation:
'Tools', 'Entertainment', 'Education' and 'Business' are the most common genres. This shows where most apps are published; it may also hint at user demand, in which case apps in these genres could be profitable.

In [None]:
# display freq table for 'Category' column
display_table(free_english_nonduplicate,1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

### Observation:
'FAMILY', 'GAME', 'TOOLS' and 'BUSINESS' are the most common categories.

In [None]:
# check index of 'prime_genre' column
apple_header[0].index('prime_genre')

11

In [None]:
# display freq table for 'prime_genre' column in apple dataset
display_table(free_english_apple,11)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


### Observation:
'Games', 'Entertainment', 'Photo & Video' and 'Social Networking' are the most common genres on the App Store.

# Most Common Apps by Genre: Part Three

Analyze the frequency table generated for the prime_genre column of the App Store data set.

* What is the most common genre? What is the runner-up?
* What other patterns do you see?
* What is the general impression: are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
* Can I recommend an app profile for the App Store market based on this frequency table alone?
* If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?


### Observation:
* 'Games' is the most common genre and 'Entertainment' is the runner-up. 'Medical' is the least common genre.
* Most of the apps are designed for entertainment; fewer are designed for practical purposes.
* A large number of apps in a genre does not necessarily mean that the genre has a large number of users; developers may simply favor it, as may be the case for games. It can still hint at user demand, because developers tend to build the kinds of apps that people like and use.

Analyze the frequency tables you generated for the Category and Genres columns of the Google Play data set.

* What are the most common genres?
* What other patterns do you see?
* Compare the patterns you see for the Google Play market with those you saw for the App Store market.
* Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?

### Observation:
* 'Tools' is the most common value in 'Genres' and 'FAMILY' is the most common value in 'Category'.
* Google Play has a more balanced mix of apps designed for practical purposes and for entertainment.
* In both stores, games and entertainment apps are among the most frequent genres.

# Most Popular Apps by Genre on the App Store

Which genres are the most popular (have the most users)?
The 'Installs' column would answer this directly, but it is missing from the App Store data set, so we use the total number of user ratings ('rating_count_tot') as a proxy.

In [None]:
# Create frequency table for prime_genre column
apple_genre_freq=freq_table(free_english_apple,11)
apple_genre_freq

{'Book': 1.6272189349112427,
 'Business': 0.4930966469428008,
 'Catalogs': 0.22189349112426035,
 'Education': 3.2544378698224854,
 'Entertainment': 8.234714003944774,
 'Finance': 2.0710059171597637,
 'Food & Drink': 1.0601577909270217,
 'Games': 55.64595660749507,
 'Health & Fitness': 1.8737672583826428,
 'Lifestyle': 2.3175542406311638,
 'Medical': 0.19723865877712032,
 'Music': 1.6518737672583828,
 'Navigation': 0.4930966469428008,
 'News': 1.4299802761341223,
 'Photo & Video': 4.117357001972387,
 'Productivity': 1.5285996055226825,
 'Reference': 0.4930966469428008,
 'Shopping': 2.983234714003945,
 'Social Networking': 3.5256410256410255,
 'Sports': 1.947731755424063,
 'Travel': 1.3806706114398422,
 'Utilities': 2.687376725838264,
 'Weather': 0.7642998027613412}

In [None]:
# Loop over apple_genre_freq
for genre in apple_genre_freq:
    total=0
    len_genre=0
    for app_row in free_english_apple:
        genre_app=app_row[11]
        if genre_app==genre:
            n_ratings=float(app_row[5])
            total+=n_ratings
            len_genre+=1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Music : 56482.02985074627
Health & Fitness : 19952.315789473683
Food & Drink : 20179.093023255813
Shopping : 18746.677685950413
Navigation : 25972.05
Photo & Video : 27249.892215568863
Sports : 20128.974683544304
Business : 6367.8
Social Networking : 53078.195804195806
Entertainment : 10822.961077844311
Lifestyle : 8978.308510638299
Utilities : 14010.100917431193
Travel : 20216.01785714286
News : 15892.724137931034
Games : 18924.68896765618
Medical : 459.75
Book : 8498.333333333334
Catalogs : 1779.5555555555557
Education : 6266.333333333333
Productivity : 19053.887096774193
Finance : 13522.261904761905
Weather : 47220.93548387097
Reference : 67447.9
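
The nested loop above rescans the whole data set once per genre and prints the averages in no particular order. The same averages can be accumulated in a single pass and sorted, which makes the ranking easier to read. This is a sketch on toy data; `average_by_group` is a hypothetical helper.

```python
def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the value column}, sorted by descending mean."""
    totals = {}  # group -> [running sum of values, number of rows]
    for row in rows:
        entry = totals.setdefault(row[group_idx], [0.0, 0])
        entry[0] += float(row[value_idx])
        entry[1] += 1
    averages = {group: s / n for group, (s, n) in totals.items()}
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))

# Toy rows: genre in column 0, rating count in column 1
toy_rows = [["Games", "100"], ["Games", "300"], ["Music", "500"]]
print(average_by_group(toy_rows, 0, 1))  # {'Music': 500.0, 'Games': 200.0}
```

For the App Store data in this notebook the call would be `average_by_group(free_english_apple, 11, 5)`. An average can be dominated by a few very large apps, so it is worth looking at the individual apps in a genre before drawing conclusions.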


# Most Popular Apps by Genre on Google Play

In [None]:
# Explore 'Installs' column in google dataset
display_table(free_english_nonduplicate,5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


* We don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

* We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
* To perform computations, however, we'll need to convert each install number from string to float.
* This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
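
The conversion described in the points above can be sketched as a small helper. `parse_installs` is a hypothetical name; it treats each bucket as its lower bound, as stated above.

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to the float 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))

print(parse_installs('100,000+'))  # 100000.0
print(parse_installs('0'))         # 0.0

# Without the clean-up the conversion fails:
try:
    float('100,000+')
except ValueError as error:
    print('conversion failed:', error)
```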

In [None]:
# Create a frequency table to get the unique app categories in the 'Category' column
category_freq=freq_table(free_english_nonduplicate,1)
print(category_freq)
category_freq
# The two outputs list the keys in a different order: print() uses the
# dictionary's own order, while the notebook's display of the bare expression
# pretty-prints the dictionary with its keys sorted.

{'COMMUNICATION': 3.2378158844765346, 'MAPS_AND_NAVIGATION': 1.3989169675090252, 'SOCIAL': 2.6624548736462095, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'PARENTING': 0.6543321299638989, 'GAME': 9.724729241877256, 'ART_AND_DESIGN': 0.6430505415162455, 'PRODUCTIVITY': 3.892148014440433, 'FINANCE': 3.7003610108303246, 'NEWS_AND_MAGAZINES': 2.7978339350180503, 'WEATHER': 0.8009927797833934, 'SPORTS': 3.395758122743682, 'COMICS': 0.6204873646209386, 'DATING': 1.861462093862816, 'FAMILY': 18.907942238267147, 'BOOKS_AND_REFERENCE': 2.1435018050541514, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'ENTERTAINMENT': 0.9589350180505415, 'BUSINESS': 4.591606498194946, 'TRAVEL_AND_LOCAL': 2.33528880866426, 'PHOTOGRAPHY': 2.944494584837545, 'EVENTS': 0.7107400722021661, 'MEDICAL': 3.531137184115524, 'SHOPPING': 2.2450361010830324, 'LIFESTYLE': 3.9034296028880866, 'BEAUTY': 0.5979241877256317, 'TOOLS': 8.461191335740072, 'PERSONALIZATION': 3.3167870036101084, 'VID

{'ART_AND_DESIGN': 0.6430505415162455,
 'AUTO_AND_VEHICLES': 0.9250902527075812,
 'BEAUTY': 0.5979241877256317,
 'BOOKS_AND_REFERENCE': 2.1435018050541514,
 'BUSINESS': 4.591606498194946,
 'COMICS': 0.6204873646209386,
 'COMMUNICATION': 3.2378158844765346,
 'DATING': 1.861462093862816,
 'EDUCATION': 1.1620036101083033,
 'ENTERTAINMENT': 0.9589350180505415,
 'EVENTS': 0.7107400722021661,
 'FAMILY': 18.907942238267147,
 'FINANCE': 3.7003610108303246,
 'FOOD_AND_DRINK': 1.2409747292418771,
 'GAME': 9.724729241877256,
 'HEALTH_AND_FITNESS': 3.0798736462093865,
 'HOUSE_AND_HOME': 0.8235559566787004,
 'LIBRARIES_AND_DEMO': 0.9363718411552346,
 'LIFESTYLE': 3.9034296028880866,
 'MAPS_AND_NAVIGATION': 1.3989169675090252,
 'MEDICAL': 3.531137184115524,
 'NEWS_AND_MAGAZINES': 2.7978339350180503,
 'PARENTING': 0.6543321299638989,
 'PERSONALIZATION': 3.3167870036101084,
 'PHOTOGRAPHY': 2.944494584837545,
 'PRODUCTIVITY': 3.892148014440433,
 'SHOPPING': 2.2450361010830324,
 'SOCIAL': 2.662454873646

In [None]:
# Loop over category freq
for category in category_freq:
    total=0
    len_category=0
    for app_row in free_english_nonduplicate:
        category_app=app_row[1]
        if category_app==category:
            n_installs=app_row[5].replace(',','').replace('+','')
            total+=float(n_installs)
            len_category+=1
    avg_installs=total/len_category
    print(category,':',avg_installs)

COMMUNICATION : 38456119.167247385
MAPS_AND_NAVIGATION : 4056941.7741935486
SOCIAL : 23253652.127118643
LIBRARIES_AND_DEMO : 638503.734939759
AUTO_AND_VEHICLES : 647317.8170731707
PARENTING : 542603.6206896552
GAME : 15588015.603248259
ART_AND_DESIGN : 1986335.0877192982
PRODUCTIVITY : 16787331.344927534
FINANCE : 1387692.475609756
NEWS_AND_MAGAZINES : 9549178.467741935
WEATHER : 5074486.197183099
SPORTS : 3638640.1428571427
COMICS : 817657.2727272727
DATING : 854028.8303030303
FAMILY : 3695641.8198090694
BOOKS_AND_REFERENCE : 8767811.894736841
HEALTH_AND_FITNESS : 4188821.9853479853
ENTERTAINMENT : 11640705.88235294
BUSINESS : 1712290.1474201474
TRAVEL_AND_LOCAL : 13984077.710144928
PHOTOGRAPHY : 17840110.40229885
EVENTS : 253542.22222222222
MEDICAL : 120550.61980830671
SHOPPING : 7036877.311557789
LIFESTYLE : 1437816.2687861272
BEAUTY : 513151.88679245283
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
VIDEO_PLAYERS : 24727872.452830188
EDUCATION : 1833495.145631068
F

## In this project, we went through a complete data science workflow:
* We started by clarifying the goal of our project.
* We collected relevant data.
* We cleaned the data to prepare it for analysis.
* We analyzed the cleaned data.