# Finding the most attractive App - Analyzing App Store and Google Play Markets

Our goal is to identify the profile of the most profitable apps by analyzing the App Store and Google Play markets.
This helps our developers make data-driven decisions about which kinds of apps are likely to attract more users.

## Opening and Exploring Data

Given the huge number of apps available, we analyze a sample of the data instead of the full markets. In particular, we use two existing data sets that are available at no cost:
* a data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](http://localhost:8826/edit/Dataquest_projects/App%20Profile/googleplaystore.csv).
* a data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](http://localhost:8826/edit/Dataquest_projects/App%20Profile/AppleStore.csv).

In [19]:
from csv import reader 
### The Google Play data set ###
opened_file = open('C:/Users/Acer/Dataquest_projects/App Profile/googleplaystore.csv', encoding='utf8')
read_file=reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

In [10]:
### The App Store data set ###
opened_file = open('C:/Users/Acer/Dataquest_projects/App Profile/AppleStore.csv', encoding='utf8')
read_file=reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
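
The two loading cells above follow the same pattern, so they could be folded into a single helper. The following is only a minimal sketch: the function name `load_csv` and the tiny demo file are assumptions made for illustration and are not part of the original notebook. It uses a `with` block so that the file is closed once it has been read.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline='' is the setting the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Demo on a throwaway file with made-up content, so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as demo_file:
        demo_file.write('App,Category\nInstagram,SOCIAL\n')
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With a helper like this, each data set could be loaded in one line, for example `android_header, android = load_csv(...)` with the path used in the cells above.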

We want to explore the two data sets, so we'll create a function named explore_data() that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

In [12]:
def explore_data(dataset, start, end, row_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if row_and_columns:
        print ('Number of rows:', len(dataset))
        print ('Number of columns:', len(dataset[0]))

        
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The **columns** that might be useful for our analysis are: **'App'**, **'Category'**, **'Reviews'**, **'Installs'**, **'Type'**, **'Price'** and **'Genres'**.

In [24]:
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


Looking at the **columns**, the ones that seem interesting for our analysis are: **"track_name"**, **"currency"**, **"price"**, **"rating_count_tot"** and **"prime_genre"**.

We can find the details about the other columns [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home). 

## Deleting Wrong Data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports an error in row 10472. Let's print this row and compare it with the header and with a row that is correct.

In [27]:
print (android[10472]) # incorrect row
print('\n')
print(android_header) # header
print('\n')
print(android[1]) # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


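Row 10472 corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*, and we can see that its rating is 19. This is clearly off because the maximum rating for a Google Play app is 5. Comparing the row with the header shows that the 'Category' value is missing, so every later value is shifted one column to the left. As a consequence, we'll delete this row.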

In [28]:
print(len(android))
del android[10472] #don't run this more than once
print (len(android))

10841
10840
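
The faulty row printed above has 12 values while the header has 13, so this kind of error can also be searched for programmatically instead of relying on a forum report. Below is a minimal sketch on a toy data set; the function name `find_malformed_rows` and the shortened sample rows are assumptions made for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of the rows whose length differs from the header's."""
    return [index for index, row in enumerate(dataset) if len(row) != len(header)]


# Shortened sample rows, for illustration only
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # the 'Category' value is missing
]

print(find_malformed_rows(sample_rows, sample_header))
```

A length check only catches rows with missing or extra values. A row with the right number of columns but a wrong value would still need a separate check.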


## Removing Duplicate Entries

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the Instagram app has four entries.

In [30]:
for app in android:
    name=app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [1]:
how_many_times_app = {}
for app in android:
    name = app[0]
    if name in how_many_times_app:
        how_many_times_app[name] += 1  # count one more occurrence of this app
    else:
        how_many_times_app[name] = 1

# how_many_times_app now maps each app name to the number of times it appears

In total, there are 1,181 cases where an app occurs more than once:

In [94]:
more_than_once=[]

times_more_than_one = 0

for app in how_many_times_app:
    if how_many_times_app[app]>1:
        more_than_once.append(app)
        times_more_than_one += how_many_times_app[app] - 1  # each app should appear only once, so every extra entry is a duplicate
            
print(times_more_than_one)  # total number of duplicate rows

1181
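
The same count can be obtained more compactly with `collections.Counter` from the standard library. This is only a sketch on a made-up list of names, not the code used in this notebook; the variable names are assumptions.

```python
from collections import Counter

# Made-up list of app names, for illustration only
sample_names = ['Instagram', 'Instagram', 'Box', 'Instagram', 'Box', 'Slack']

name_counts = Counter(sample_names)

# Apps that appear more than once
duplicate_apps = [name for name, count in name_counts.items() if count > 1]

# Number of rows to drop so that every app appears exactly once
extra_rows = sum(count - 1 for count in name_counts.values())

print(duplicate_apps)
print(extra_rows)
```

On the real data, `Counter(app[0] for app in android)` would play the role of `how_many_times_app`.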


We don't want to remove duplicate rows at random. Instead, for each app we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.

In [100]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the length of our data set minus 1,181.

In [103]:
print('Expected length', len(android) - 1181)
print('Actual length', len(reviews_max))

Expected length 9659
Actual length 9659



Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

* We start by initializing two empty lists, android_clean and already_added.
* We loop through the android data set, and for every iteration:
   * We isolate the name of the app and the number of reviews.
   * We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
     * The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
     * The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [104]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block
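
One possible refinement, shown here only as a sketch on toy data: `name not in already_added` scans a list on every iteration, whereas a set gives constant-time lookups on average. The names `sample_android`, `sample_reviews_max`, `sample_clean` and `seen_names` are assumptions made for illustration. The two Instagram rows are shortened from the output above, and the 'Example App' rows are made up.

```python
# Shortened rows: name, category, rating, number of reviews
sample_android = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'BUSINESS', '4.2', '1000'],  # made-up app with tied review counts
    ['Example App', 'BUSINESS', '4.2', '1000'],
]

# Highest number of reviews for each app
sample_reviews_max = {}
for app in sample_android:
    name, n_reviews = app[0], float(app[3])
    if n_reviews > sample_reviews_max.get(name, -1):
        sample_reviews_max[name] = n_reviews

# Keep one row per app: the first row that reaches the maximum
sample_clean = []
seen_names = set()
for app in sample_android:
    name, n_reviews = app[0], float(app[3])
    if sample_reviews_max[name] == n_reviews and name not in seen_names:
        sample_clean.append(app)
        seen_names.add(name)

print(sample_clean)
```

The result keeps the Instagram row with 66577446 reviews and only one of the two tied 'Example App' rows, which is the behaviour described in the list above.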

Now let's quickly explore the new data set, and confirm that the number of rows is 9,659.

In [105]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have 9659 rows, just as expected.

## Removing Non-English Apps

If we explore the data sets long enough, we'll notice that some app names suggest the apps are not aimed at an English-speaking audience.
Below, we see a couple of examples from both data sets:

In [107]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


So we'll remove every app whose name contains symbols that are not commonly used in English text. The characters typical of English text (the English alphabet, the digits, punctuation marks and symbols such as + and \*) are all encoded in the ASCII standard.
Each ASCII character corresponds to a number between 0 and 127, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

In [108]:
def is_english(string):
    
    for character in string:
        if ord(character)>127:
            return False
        
    return True

In [110]:
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instagram'))

False
True


The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

In [111]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [113]:
def is_english(string):
    no_ascii = 0
    for character in string:
        if ord(character)>127:
            no_ascii += 1
            
    if no_ascii > 3:
        return False
    else:
        return True 

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
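
The cutoff of three characters is a judgment call, so it can help to make it a parameter and try different values. The sketch below is a variant of the function above, not a replacement for it: the name `is_english_with_threshold` and the parameter `max_non_ascii` are assumptions made for illustration.

```python
def is_english_with_threshold(string, max_non_ascii=3):
    """Return True if `string` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_english_with_threshold('Docs To Go™ Free Office Suite'))
print(is_english_with_threshold('Instachat 😜'))
print(is_english_with_threshold('爱奇艺PPS -《欢乐颂2》电视剧热播'))

# With a cutoff of 0 the function behaves like the first, stricter version
print(is_english_with_threshold('Instachat 😜', max_non_ascii=0))
```

The filter remains a heuristic: a non-English name made of three or fewer non-ASCII characters still passes, so a few non-English apps may survive it.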
