# Popularity Analysis of Free-to-Play Apps

The goal of this project is to do a basic data analysis of the summary data available from the Google Play Store and the Apple iOS Mobile App Store.

These data sets are available for download here:

 * [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps/home)
 * [Apple iOS Store Apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
 
This project will provide basic insight into the characteristics that are correlated with higher download rates of popular apps. 

In [1]:
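from csv import reader

# Load the iOS data; the context manager closes the file once it has been read.
# encoding='utf8' is set explicitly because app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_data = list(reader(opened_file))
ios_header = ios_data[0]
ios_data = ios_data[1:]

# Load the Google Play data the same way
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    ggl_data = list(reader(opened_file))
ggl_header = ggl_data[0]
ggl_data = ggl_data[1:]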

In [2]:
# Prints the rows dataset[start:end] and, if print_count is True,
# the number of rows and columns in the whole data set.
# Assumes the header row has already been removed from the data set.

def explore_data(dataset, start, end, print_count=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n') # creates space between rows
    if print_count:
        print('Number of rows:', len(dataset))
        print('Number of columns', len(dataset[0]))

In [5]:
print(ggl_header)
print(ggl_data[1:5])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
[['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


In [4]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [15]:
explore_data(ggl_data,0,2,print_count=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns 13


In [16]:
explore_data(ios_data,0,2,print_count=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns 16


The discussion page for the Google Play Store data reports duplicate entries in that data set. We need to check for duplicates and, for each app, keep only the most recent entry. Assuming review counts grow over time, the entry with the highest number of reviews is the most recent one. We should also check the iOS data, even though its discussion page does not mention duplicates.

In [14]:
unique_apps_ggl = []
duplicate_apps_ggl = []

for i in ggl_data:
    name = i[0]
    if name in unique_apps_ggl:
        duplicate_apps_ggl.append(name)
    else:
        unique_apps_ggl.append(name)
        
print('# of unique google apps: ', len(unique_apps_ggl))
print('# of duplicate google apps: ', len(duplicate_apps_ggl))

unique_apps_ios = []
duplicate_apps_ios = []

for i in ios_data:
    app_id = i[0]  # column 0 of the iOS data is the numeric 'id', not the app name
    if app_id in unique_apps_ios:
        duplicate_apps_ios.append(app_id)
    else:
        unique_apps_ios.append(app_id)
        
print('# of unique apple apps: ', len(unique_apps_ios))
print('# of duplicate apple apps: ', len(duplicate_apps_ios))

# of unique google apps:  9659
# of duplicate google apps:  1181
# of unique apple apps:  7197
# of duplicate apple apps:  0


In [18]:
for i in ggl_data:
    name = i[0]
    if name == 'Facebook':
        print(i)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


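It appears that only the Google data set contains duplicate entries (the iOS check compares the `id` column, which is the first column of that data set). Printing a likely duplicate shows that the only column that differs is the review count: 78,158,306 versus 78,128,208 in the Facebook rows above.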
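
To confirm which columns differ between two suspected duplicates, the rows can be compared position by position. The following is a minimal sketch that reuses the header and the two Facebook rows printed above; `differing_columns` is a hypothetical helper name, not part of the original notebook.

```python
# Header and the two Facebook rows, copied from the output above
ggl_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
              'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
              'Android Ver']
row_a = ['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device',
         '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018',
         'Varies with device', 'Varies with device']
row_b = ['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device',
         '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018',
         'Varies with device', 'Varies with device']


def differing_columns(header, first, second):
    """Return the names of the columns whose values differ between two rows."""
    return [col for col, a, b in zip(header, first, second) if a != b]


print(differing_columns(ggl_header, row_a, row_b))  # prints ['Reviews']
```

For this pair, only the `Reviews` column differs, which supports using the review count to pick the row to keep.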

Because there are 1,181 duplicate entries in the Google data set, after their removal the length of the data set should be reduced from 10,840 to 9,659.
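
The expected length follows from simple arithmetic: the number of rows minus the number of duplicate entries equals the number of unique names. Below is a minimal sketch of that relation on a small sample; the rows in `sample` are invented for illustration and carry only a name and a review count.

```python
# Toy rows (name, review count); invented for illustration only
sample = [
    ['App A', '10'],
    ['App B', '5'],
    ['App A', '12'],
    ['App C', '7'],
    ['App A', '12'],
]

names = [row[0] for row in sample]
n_rows = len(names)
n_unique = len(set(names))        # a set keeps one copy of each name
n_duplicates = n_rows - n_unique  # extra rows beyond the first per name

print('rows:', n_rows)
print('unique names:', n_unique)
print('duplicate entries:', n_duplicates)

# The same relation applied to the counts reported above
assert 10840 - 1181 == 9659
```

A set is used here only to count distinct names; membership tests on a set are also faster than on a list, although the difference is small at this data size.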

In [20]:
reviews_max = {}

for i in ggl_data:
    name = i[0]
    n_reviews = float(i[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


Above, we built the reviews_max dictionary, which maps each app name to its highest review count. If a name was already in the dictionary and the current row had more reviews, the stored count was updated to the higher value. If the name was not yet in the dictionary, we added it along with the row's review count.
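
The same update rule can be written more compactly with `dict.get`, which returns a default value when the key is missing. This is an alternative sketch on invented toy rows, not the code used in this notebook; the review count sits at index 3 to mirror the Google data layout.

```python
# Toy rows shaped like the Google data: name at index 0, reviews at index 3
sample = [
    ['App A', 'X', '4.0', '10'],
    ['App B', 'X', '4.1', '5'],
    ['App A', 'X', '4.0', '12'],
]

reviews_max = {}
for row in sample:
    name = row[0]
    n_reviews = float(row[3])
    # An unseen name defaults to the current count, so max() simply stores it
    reviews_max[name] = max(reviews_max.get(name, n_reviews), n_reviews)

print(reviews_max)  # prints {'App A': 12.0, 'App B': 5.0}
```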

Next, we can build a clean data set in an empty list, ggl_clean, and use a second list, ggl_added, to track which app names have already been added. The name check matters because duplicate rows can share the same maximum review count; without it, both rows would be kept. The two lists should end up the same length, and that length should match the expected 9,659.

In [22]:
ggl_clean = []
ggl_added = []

for i in ggl_data:
    name = i[0]
    n_reviews = float(i[3])
    if n_reviews == reviews_max[name] and name not in ggl_added:
        ggl_clean.append(i)
        ggl_added.append(name)

print(len(ggl_clean))
print(len(ggl_added))

9659
9659
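
A quick sanity check can confirm that the cleaned list has one row per app and that each kept row carries the highest review count. The sketch below wraps the two steps above into a hypothetical `dedupe_by_max_reviews` function and runs it on invented rows. The default column positions (name at index 0, reviews at index 3) follow the Google header shown earlier, and a set replaces the ggl_added list purely as an illustrative choice.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per name: the first row holding that name's highest review count."""
    # Pass 1: find the highest review count per name
    reviews_max = {}
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row that matches the maximum for each name
    clean, added = [], set()
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in added:
            clean.append(row)
            added.add(name)
    return clean


# Toy rows, invented for illustration only
sample = [
    ['App A', 'X', '4.0', '10'],
    ['App B', 'X', '4.1', '5'],
    ['App A', 'X', '4.0', '12'],
    ['App A', 'X', '4.0', '12'],  # tie on the maximum: only the first is kept
]

clean = dedupe_by_max_reviews(sample)
names = [row[0] for row in clean]

assert len(names) == len(set(names))  # one row per app
print(clean)
```

On the real data, the same uniqueness check could be applied to ggl_clean by comparing `len(ggl_clean)` with the number of distinct names it contains.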
