## Profitable App Profiles for the App Store and Google Play Market

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. Suppose we're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users who use it. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.


## Opening and Exploring the data 
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.


Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:


- <a href = "https://www.kaggle.com/lava18/google-play-store-apps">A data set</a> containing data about approximately ten thousand Android apps from Google Play. Download it <a href = "https://www.kaggle.com/lava18/google-play-store-apps/download">here</a>.
- <a href = "https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps">A data set</a> containing data about approximately seven thousand iOS apps from the App Store. Download it <a href = "https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download">here</a>.


Let's start by opening the two data sets, and then continue by exploring the data.

In [1]:
# function to read a data set
import csv

def read_data(path, code="utf-8", want_header=False):
    '''Return the data set rows without the header row; optionally print the header.'''
    # the context manager closes the file once it has been read
    with open(path, encoding=code) as data_file:
        data = list(csv.reader(data_file))
    header = data[0]
    dataset = data[1:]
    # print the header only when requested
    if want_header:
        print("dataset header:\n\n", header)

    return dataset
    

In [2]:
# read each data set and print its header

android = read_data("googleplaystore.csv" , want_header=True)


dataset header:

 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [3]:
ios = read_data("AppleStore.csv" , want_header=True)

dataset header:

 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


- We can create a function to explore each data set and show some of its rows.

In [4]:
def data_information(dataset , start = 2, end = 5):
    """ display some rows and some information about a dataset"""
    for row in dataset[start : end]:    # looping into slice of the dataset
        print(row)
        print("\n")
    print("number of columns : {}".format(len(dataset[0])))
    print("\n number of rows : {}".format(len(dataset)))

In [5]:
#explore google play store
data_information(android , 7, 10)

['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up']


['Garden Coloring Book', 'ART_AND_DESIGN', '4.4', '13791', '33M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'September 20, 2017', '2.9.2', '3.0 and up']


['Kids Paint Free - Drawing Fun', 'ART_AND_DESIGN', '4.7', '121', '3.1M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'July 3, 2018', '2.8', '4.0.3 and up']


number of columns : 13

 number of rows : 10841


 We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are **'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'**.
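Note that `'Installs'` is stored as text such as `'1,000,000+'`, so it cannot be compared numerically as is. A minimal sketch of one way to convert it is shown here; `parse_installs` is a hypothetical helper that is not part of this notebook, and because the values are open-ended buckets, the result is only a lower bound on the real install count.

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to an integer lower bound."""
    # drop the thousands separators and the trailing plus sign
    return int(installs.replace(",", "").replace("+", ""))


# values copied from the rows printed above
print(parse_installs("1,000,000+"))  # 1000000
print(parse_installs("10,000+"))     # 10000
```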

In [6]:
# explore App store
data_information(ios)

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


number of columns : 16

 number of rows : 7197



We have 7197 iOS apps in this data set, and the columns that seem interesting are: **'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'**. Not all column names are self-explanatory in this case, but details about each column can be found in the data set <a href = "https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home">documentation.</a>

## Deleting wrong data

Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and we can see that its rating is 19. This is clearly off because the maximum rating for a Google Play app is 5. As mentioned in the <a href = "https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015" >discussions section</a>, this problem is caused by a missing value in the 'Category' column, which shifts every later value one column to the left. As a consequence, we'll delete this row.

In [7]:
#detect any rating more than 5

for row in android:
    rating = row[2]
    if float(rating) >  5:
        print(row , "\n")


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 



From the <a href = "https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015">discussion section</a>, we learn that the index of this row is 10472.

In [8]:
print(len(android))
# delete the faulty row; the name check keeps a re-run of this cell from deleting another row
if android[10472][0] == 'Life Made WI-Fi Touchscreen Photo Frame':
    del android[10472]
print(len(android))

10841
10840
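A row with a missing value can also be spotted without consulting the discussion thread, because it has fewer fields than the header. The sketch below illustrates the idea on two rows copied from the outputs above; `find_misaligned` is a hypothetical helper, and the header list is retyped by hand because `read_data()` does not return it.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

sample_rows = [
    ['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+',
     'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1',
     '4.2 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]


def find_misaligned(rows, expected_len):
    """Return (index, row) pairs whose field count differs from expected_len."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_len]


for index, row in find_misaligned(sample_rows, len(header)):
    print(index, len(row), row[0])
```

Run against the full `android` list, the same check would report the index directly, assuming no other row has the wrong number of fields.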


## Removing Duplicates

**Part 1**

- If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application **Instagram** has four entries: 



In [9]:
for app in android:
    name = app[0]
    if name == "Instagram":
        print(app, "\n")

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 



According to the <a href = "https://www.kaggle.com/lava18/google-play-store-apps/discussion">discussions</a>, there are 1,181 cases in total where an app occurs more than once.
Let's verify this:

In [10]:
names = []
unique_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    if name not in names:
        unique_apps.append(app)
        names.append(name)
    else:
        duplicate_apps.append(app)

print(len(unique_apps))
print(len(duplicate_apps))
print("number of duplicates is {} apps".format(len(android) - len(unique_apps)))

9659
1181
number of duplicates is 1181 apps
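The `name not in names` test above scans a list on every iteration, which gets slow as the data grows. As an alternative sketch, `collections.Counter` from the standard library counts every name in one pass. The short `sample_names` list below is made up for illustration; on the real data we would pass `[app[0] for app in android]` instead.

```python
from collections import Counter

# made-up sample; stands in for the app names of the android data set
sample_names = ["Instagram", "Box", "Instagram", "Infinite Painter",
                "Instagram", "Box"]

counts = Counter(sample_names)

# names that occur more than once, with their number of entries
repeated = {name: n for name, n in counts.items() if n > 1}

# rows that would have to go so that each app keeps exactly one entry
extra_rows = sum(n - 1 for n in repeated.values())

print(repeated)    # {'Instagram': 3, 'Box': 2}
print(extra_rows)  # 3
```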


In [11]:
print(duplicate_apps[:2])    


[['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'], ['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']]



We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed earlier for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:
- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [12]:
reviews_max = {}

for app in android :
    name = app[0]
    reviews = float(app[3])   # convert string dtype to float
    
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
        
    elif name not in reviews_max:
        reviews_max[name] = reviews
        
reviews_max

{'Photo Editor & Candy Camera & Grid & ScrapBook': 159.0,
 'Coloring book moana': 974.0,
 'U Launcher Lite – FREE Live Cool Themes, Hide Apps': 87510.0,
 'Sketch - Draw & Paint': 215644.0,
 'Pixel Draw - Number Art Coloring Book': 967.0,
 'Paper flowers instructions': 167.0,
 'Smoke Effect Photo Maker - Smoke Editor': 178.0,
 'Infinite Painter': 36815.0,
 'Garden Coloring Book': 13791.0,
 'Kids Paint Free - Drawing Fun': 121.0,
 'Text on Photo - Fonteee': 13880.0,
 'Name Art Photo Editor - Focus n Filters': 8788.0,
 'Tattoo Name On My Photo Editor': 44829.0,
 'Mandala Coloring Book': 4326.0,
 '3D Color Pixel by Number - Sandbox Art Coloring': 1518.0,
 'Learn To Draw Kawaii Characters': 55.0,
 'Photo Designer - Write your name with shapes': 3632.0,
 '350 Diy Room Decor Ideas': 27.0,
 'FlipaClip - Cartoon animation': 194216.0,
 'ibis Paint X': 224399.0,
 'Logo Maker - Small Business': 450.0,
 "Boys Photo Editor - Six Pack & Men's Suit": 654.0,
 'Superheroes Wallpapers | 4K Backgrounds': 


In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [18]:
print(len(android) - len(reviews_max))
print(len(duplicate_apps))

1181
1181



Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

- We start by initializing two empty lists, `android_clean` and `already_added`.
- We loop through the android data set, and for every iteration:

    - We isolate the name of the app and the number of reviews.
    - We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
  
        - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        - The name of the app is not already in the `already_added` list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == reviews`, we'll still end up with duplicate entries for some apps.

In [27]:
print(len(reviews_max))

android_clean = []
already_added = []

for app in android :
    name = app[0]
    reviews = float(app[3])
    if (reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9659
9659
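The two passes above (build `reviews_max`, then filter) can also be collapsed into a single pass that stores the best row per name directly. This is only a sketch of an alternative, not the method used in this notebook: `keep_most_reviewed` is a hypothetical helper, the sample rows are truncated to four columns for brevity, and the result is ordered by each app's first appearance rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # a strict '>' means ties keep the earlier row, so no duplicates remain
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# rows shortened to App, Category, Rating, Reviews for illustration
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

Because the dictionary is keyed by name, the separate `already_added` check is no longer needed in this variant.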


## Removing Non-English Apps
**Part One**

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [28]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ



We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We build this function below, using the built-in `ord()` function to find the encoding number of each character.

In [32]:
def is_english(string):
    for char in string:
        if ord(char) > 127 :
            return "NO not english"
    return "Yes is English"


print(is_english("Facebook"))
print(is_english("اسلام حسام سليمان ")) # Arabic text

Yes is English
NO not english
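To make the 0 to 127 range concrete, the sketch below prints the code points of a few ASCII characters and compares a hand-written range check with the built-in `str.isascii()` method, which is available from Python 3.7 onward. The helper name `all_ascii` is an assumption made for this example.

```python
# code points of a few characters from the ASCII range
for char in ["A", "z", "7", "+", "~"]:
    print(repr(char), ord(char))


def all_ascii(text):
    """Return True when every character has a code point of 127 or lower."""
    return all(ord(char) <= 127 for char in text)


print(all_ascii("Facebook"))          # True
print("Facebook".isascii())           # True, same check built into str
print("中国語 AQリスニング".isascii())  # False
```

Both checks agree with the strict version of `is_english()` defined above: a single character outside the ASCII range is enough to reject a name.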



The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

In [42]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

NO not english
NO not english
8482
128540


**Part Two**

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [49]:
def is_english(string):
    # count the characters that fall outside the ASCII range (0 to 127)
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
    # return real booleans: a non-empty string is always truthy,
    # so string return values would make every `if is_english(name)` test pass
    return non_ascii <= 3


print(is_english('Instachat 😜😜😜😜'))


print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

False
True
True



The function is still not perfect, and a few non-English apps might get past our filter, but this seems good enough at this point in our analysis; we shouldn't spend too much time on optimization yet.

Below, we use the `is_english()` function to filter out the non-English apps from both data sets:

In [52]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

print("android length ", len(android_english))


for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

print("ios length is ", len(ios_english))

android length  9614
ios length is  6183
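A filter like this only works if the test inside `if` evaluates to a real boolean, because any non-empty string counts as true in Python. The sketch below shows a compact variant with a configurable threshold, applied to names that appear earlier in this notebook; `count_non_ascii` and `keep_english` are hypothetical helpers, not part of the analysis.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for char in text if ord(char) > 127)


def keep_english(names, max_non_ascii=3):
    """Keep the names with at most max_non_ascii non-ASCII characters."""
    return [name for name in names if count_non_ascii(name) <= max_non_ascii]


sample_names = [
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '中国語 AQリスニング',
    'Instachat 😜😜😜😜',
]

print(keep_english(sample_names))
# ['Instachat 😜', 'Docs To Go™ Free Office Suite']
```

Comparing the length of the filtered list with the length of the input is a quick way to confirm that the filter actually removed something.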


In [69]:
# list the apps that the filter rejects
for app in android_clean:
    name = app[0]
    if not is_english(name):
        print(app)