Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [58]:
from csv import reader

# Read the Google Play data set; the with block closes the file when done
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]



In [59]:
# Read the App Store (iOS) data set; the with block closes the file when done
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [60]:
# Function to explore the data
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows start..end-1 and, optionally, the data set's dimensions."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.
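
Since each row is a plain list, it helps to know which index each of these columns occupies. The following is a minimal sketch that looks the positions up from the header printed above; the names `useful_columns` and `column_index` are introduced here for illustration only.

```python
# Header copied from the output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# Columns singled out in the text as potentially useful.
useful_columns = ['App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres']

# Map each column name to its position in a row.
column_index = {name: android_header.index(name) for name in useful_columns}

for name, idx in column_index.items():
    print(name, '->', idx)
```

Looking indices up by name like this is one way to avoid hard-coding the wrong position when reading a column such as 'Reviews'.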

In [61]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


### Data Cleaning

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [62]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and we can see that the rating is 19. This is clearly off because the maximum rating for a Google Play app is 5 (as mentioned in the discussions section, this problem is caused by a missing value in the 'Category' column). As a consequence, we'll delete this row.
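
The printed row has 12 fields while the header has 13 columns, which is why every value after the app name is shifted one position to the left. A rough way to catch rows like this without relying on the discussion section is to compare each row's length with the header's. The sketch below assumes a hypothetical helper `find_malformed_rows` and uses made-up toy data.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    return [(i, row) for i, row in enumerate(dataset) if len(row) != len(header)]

# Toy data, invented for illustration: the second row is missing one field.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]

bad = find_malformed_rows(toy_rows, toy_header)
print(bad)
```

Calling `find_malformed_rows(android, android_header)` on the real data should flag the same kind of row, assuming the data set has been loaded as in the cells above.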

In [63]:
print(len(android))
del android[10472] 
print(len(android))

10841
10840


In [64]:
# Count duplicate entries in the Google Play data set (by app name)
duplicated_data = []
unique_android_data = []

for i in android:
    name = i[0]
    if name in unique_android_data:
        duplicated_data.append(name)
    else:
        unique_android_data.append(name)

print("number of duplicated apps:", len(duplicated_data))
print("Examples of duplicated apps:", duplicated_data[2:9])

number of duplicated apps: 1181
Examples of duplicated apps: ['Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business']


In [65]:
# Count duplicate entries in the App Store data set (by app id, column 0)
duplicated_data = []
unique_ios_data = []

for i in ios:
    name = i[0]
    if name in unique_ios_data:
        duplicated_data.append(name)
    else:
        unique_ios_data.append(name)

print("number of duplicated apps:", len(duplicated_data))
print("Examples of duplicated apps:", duplicated_data[2:9])

number of duplicated apps: 0
Examples of duplicated apps: []


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.
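
To see how duplicate entries differ, one option is to collect every row that shares an app name and print them side by side. This is a minimal sketch: `rows_for_app` is a hypothetical helper, and the toy rows are invented for illustration rather than taken from the data set.

```python
def rows_for_app(dataset, app_name, name_index=0):
    """Collect every row whose name column equals app_name."""
    return [row for row in dataset if row[name_index] == app_name]

# Toy rows (name, category, rating, reviews), invented for illustration only.
toy_android = [
    ['Example App', 'SOCIAL', '4.5', '1000'],
    ['Other App', 'TOOLS', '4.0', '250'],
    ['Example App', 'SOCIAL', '4.5', '1040'],
]

duplicates = rows_for_app(toy_android, 'Example App')
for row in duplicates:
    print(row)
```

On the real data, the same helper could be called as `rows_for_app(android, 'Instagram')`, assuming `android` has been loaded as above.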

If we print the duplicate rows for a single app, such as Instagram, the main difference appears in the fourth position of each row, which holds the number of reviews. The differing numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows: rather than removing rows randomly, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [66]:
# Build a dictionary mapping each app name to its highest review count
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])  # 'Reviews' is at index 3 ('Rating' is at index 2)
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [67]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

- We start by initializing two empty lists, android_clean and already_added.
- We loop through the android data set, and for every iteration:
  - We isolate the name of the app and the number of reviews.
  - We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
    - The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
    - The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [68]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])  # convert to float so it can match the values in reviews_max
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block
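
The two passes described above can be sketched end to end on a small example. This is a variant of the same idea rather than the notebook's own code: it uses a set for `already_added` so that membership checks do not scan a list, and the function name `dedupe_by_max_reviews` and the toy rows are hypothetical.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    # First pass: record the highest review count seen for each app name.
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches the maximum for its name.
    clean = []
    already_added = set()
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if reviews_max[name] == n_reviews and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

# Toy rows (name, category, rating, reviews), invented for illustration only.
toy = [
    ['Example App', 'SOCIAL', '4.5', '1000'],
    ['Example App', 'SOCIAL', '4.5', '1040'],
    ['Tied App', 'TOOLS', '4.0', '250'],
    ['Tied App', 'TOOLS', '4.0', '250'],
]
toy_clean = dedupe_by_max_reviews(toy)
print(toy_clean)
```

The 'Tied App' rows show why the second condition matters: both entries match the maximum, so without the `already_added` check both would be kept.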

## Removing Non-English Apps
When exploring the data long enough, we find that both data sets contain apps whose names suggest they are not designed for an English-speaking audience.
English text mostly uses characters in the ASCII range (code points 0 to 127), so we use the built-in ord() function to find the code point of each character and flag names that contain characters outside that range.
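
As a quick illustration of what ord() returns, the sketch below prints the code point of a few characters taken from the app names used in this section and labels each one as inside or outside the ASCII range. The variable name `samples` is introduced here for illustration only.

```python
# Characters taken from app names that appear in this section.
samples = ['a', '爱', '™', '😜']

for ch in samples:
    code = ord(ch)
    # Code points 0 to 127 form the ASCII range.
    label = 'ASCII' if code <= 127 else 'non-ASCII'
    print(repr(ch), code, label)
```

Only the first character falls inside the ASCII range; the Chinese character, the trademark sign, and the emoji all have code points above 127.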

In [69]:
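def findstrings(string):
    # Return False as soon as a character falls outside the ASCII range (0 - 127)
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(findstrings('Facebook'))
print(findstrings('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(findstrings('Instachat 😜'))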

True
False
False


In [70]:
# If the input string has more than three characters that fall outside the
# ASCII range (0 - 127), the function returns False (the string is treated as
# non-English); otherwise it returns True.

def findstrings(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    return non_ascii <= 3

print(findstrings('Docs To Go™ Free Office Suite'))
print(findstrings('Instachat 😜'))

True
True
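
The cutoff of three non-ASCII characters is a heuristic, so it can be convenient to make it a parameter. The sketch below is one possible way to do that; the function name `is_probably_english` and the parameter `max_non_ascii` are hypothetical and not part of the notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

With the default threshold the first two names pass and the third does not; setting `max_non_ascii=0` reproduces the stricter behavior of the first version of the function.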


In [72]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if findstrings(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if findstrings(name):
        ios_english.append(app)
        
