 # Profitable App Profiles
 
This is a guided project. The aims are to:

1. Review basic, raw Python skills.
2. Get an idea of which kinds of apps are likely to attract more users. The premise is that we work at a company whose apps are free to download and install and are aimed at English-speaking audiences.
 
 For this, we'll be using two datasets:
 
1. Google Play data: collected in August 2018, covering approximately 10,000 Android apps.
2. Apple data: collected in July 2017, covering approximately 7,000 iOS apps.

---


In [1]:
# Imports
from csv import reader

In [2]:
# Open Google (encoding assumed to be UTF-8; the with-block closes the file for us)
with open("googleplaystore.csv", encoding="utf8") as file:
    google = list(reader(file))

# Open Apple
with open("AppleStore.csv", encoding="utf8") as file:
    apple = list(reader(file))

In [3]:
# Explore data
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # '\n' plus print's own newline leaves two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Google Data Exploration
explore_data(google, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
# Apple Data Exploration
explore_data(apple, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


---

We can see that for our **Google data**:

* There are 10842 rows (including header row)
* There are 13 columns

We can see that for our **Apple data**:

* There are 7198 rows (including header row)
* There are 16 columns

---

For our analysis, we want to identify the apps that are most profitable. We may therefore want to focus on the columns that relate to app reviews and price.
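
Because each dataset is a plain list of lists, columns are addressed by position. The following is a minimal sketch of looking those positions up from the header row instead of hard-coding them. `column_index` is a hypothetical helper, and `sample_header` is an abbreviated stand-in for the first columns of the Google header printed above.

```python
def column_index(header, name):
    """Return the position of `name` in the header row (raises ValueError if absent)."""
    return header.index(name)

# Abbreviated stand-in for the first eight columns of the Google header row.
sample_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price']

reviews_idx = column_index(sample_header, 'Reviews')
price_idx = column_index(sample_header, 'Price')
print(reviews_idx, price_idx)
```

With the real data, the same call could be made on `google[0]` or `apple[0]`.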

---

Next up, a Kaggle discussion notes that one of the rows is missing some data. Let's loop through all rows and find any that don't have the same number of entries as the header row.

In [6]:
for index, row in enumerate(google):
    if len(row) != len(google[0]):
        print(row)
        print(index)  # enumerate gives this row's own position, even if an identical row exists earlier

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


We've found the row. It has 12 entries instead of 13: the category is missing, so the remaining values are shifted one column to the left. Let's delete it.

In [7]:
del google[10473]
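
One caveat with deleting by a hard-coded index: if the cell is run a second time, it removes whichever valid row now sits at that position. A minimal sketch of a re-runnable alternative, using a hypothetical helper `drop_malformed_rows` and made-up sample rows, could look like this:

```python
def drop_malformed_rows(dataset):
    """Return a new list keeping only rows that are as long as the header row."""
    expected = len(dataset[0])
    return [row for row in dataset if len(row) == expected]

# Made-up sample: a header, two complete rows, and one row that is a field short.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good app', 'TOOLS', '4.1'],
    ['Broken app', '1.9'],
    ['Another app', 'GAME', '3.8'],
]

cleaned = drop_malformed_rows(sample)
print(len(sample), len(cleaned))
```

Running the helper again on `cleaned` changes nothing, which is the property the plain `del` lacks.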

From reading some more Kaggle discussions, it looks like the Google data may contain duplicate applications. Let's see if we can find them.

In [23]:
total = 0
count = 1
for row1 in google[1:]:
    count += 1
    for row2 in google[count:]:
        if row2[0] == row1[0]:
            total += 1
            break
print(total)

1181


The output shows 1181 duplicate rows, that is, rows whose app name appears again later in the dataset. We can also find duplicates like this:

In [24]:
duplicate_list = list()
unique_list = list()

for row in google[1:]:
    app = row[0]
    if app in unique_list:
        duplicate_list.append(app)
    else:
        unique_list.append(app)
        
print(len(duplicate_list))   

1181
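
Both approaches above rescan a list for every row, which gets slow as the data grows. As a hedged alternative, the same two totals can be obtained in a single pass with `collections.Counter` from the standard library. The names in `sample_names` are a made-up sample, not the real column.

```python
from collections import Counter

# Made-up sample of app names; with the real data this would be
# [row[0] for row in google[1:]].
sample_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']

name_counts = Counter(sample_names)

# A name seen n times contributes n - 1 duplicate rows.
duplicate_total = sum(n - 1 for n in name_counts.values())
unique_total = len(name_counts)
print(duplicate_total, unique_total)
```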


Let's explore the duplicates and see if we can remove them.

In [26]:
duplicate_list[:10]

['Quick PDF Scanner + OCR FREE',
 'Box',
 'Google My Business',
 'ZOOM Cloud Meetings',
 'join.me - Simple Meetings',
 'Box',
 'Zenefits',
 'Google Ads',
 'Google My Business',
 'Slack']

In [43]:
for row in google:
    if row[0] == "Instagram":
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can see a slight difference at index 3, which corresponds to the number of reviews. One criterion for removing the duplicates is to keep the record with the highest number of reviews (presumably the most recent entry) and remove the rest.
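
One detail to keep in mind when comparing review counts: `csv.reader` yields every field as a string, and strings are compared character by character rather than by numeric value. The short sketch below illustrates the difference. The first pair of values comes from the Instagram rows above, while the second pair is made up for illustration.

```python
# Same number of digits: string order happens to match numeric order.
same_length = '66577446' > '66577313'

# Different number of digits: '9' sorts after '1', so the string
# comparison disagrees with the numeric one.
as_text = '9' > '10'
as_number = int('9') > int('10')

print(same_length, as_text, as_number)
```

Converting with `int()` before comparing avoids this whenever two counts differ in length.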

In [46]:
dictionary = dict()

for row in google[1:]:
    app = row[0]
    review = int(row[3])  # convert so counts compare numerically, not as text
    if app not in dictionary:
        dictionary[app] = review
    else:
        if review > dictionary[app]:
            dictionary[app] = review

Now that we have a dictionary mapping each app name to its highest review count, we'll use it to build a de-duplicated copy of the Google dataset.

In [58]:
google_clean = list()
already_added = list()

for row in google[1:]:
    app = row[0]
    review = int(row[3])
    # keep only the first row that carries the highest review count for this app
    if (review == dictionary[app]) and (app not in already_added):
        google_clean.append(row)
        already_added.append(app)

len(google_clean)

9659

We have now successfully removed the duplicate rows, leaving 9659 unique apps. Note that `google_clean` does not include the header row.
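
As a closing sanity check, the whole de-duplication can be sketched as a single pass that remembers the best row seen so far for each app, followed by a check that no name repeats. `dedupe_by_max_reviews` is a hypothetical helper. The sample rows are shortened to four columns, and the Box row is made up for illustration.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first one with the largest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        if name not in best or int(row[reviews_idx]) > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened sample rows (no header); the Box values are illustrative only.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = dedupe_by_max_reviews(sample_rows)
names = [row[0] for row in deduped]

# No app name should appear more than once after de-duplication.
print(len(deduped), len(names) == len(set(names)))
```

The same uniqueness check, comparing `len(names)` with `len(set(names))`, could be applied to the names in `google_clean`.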