## App Store Analysis

The goal of this analysis is to take two data sets, one from Google Play and one from the iOS App Store, and break down which types of apps users are most attracted to.

The Google Play data was collected in August 2018 and contains data on approximately 10,000 Android apps.
The iOS data was collected in July 2017 and contains data on approximately 7,000 iOS apps.

We will focus on apps that are designed for an English-speaking audience and are free to download.

In [2]:
from csv import reader

# Open Google Play Store Data
# Set the encoding to avoid the error "'charmap' codec can't decode byte 0x90 in position"
# Prefix the path with r to make it a raw string, so backslashes are not treated as escape characters
opened_file = open(r"C:\Users\MalikSami\Desktop\Data_Analytics\App_Store_Analysis\googleplaystore.csv",
                  encoding="utf8")
read_file = reader(opened_file)

android = list(read_file)
android_header = android[0]
android = android[1:]


In [6]:
# Open iOS App Store Data
opened_file = open(r"C:\Users\MalikSami\Desktop\Data_Analytics\App_Store_Analysis\AppleStore.csv",
                  encoding="utf8")
read_file = reader(opened_file)

ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
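
The two loading cells above repeat the same steps and leave the files open. As a minimal sketch, the steps could be wrapped in a helper that closes the file automatically. The names `read_table` and `load_csv` are hypothetical and not part of the original notebook, and the in-memory sample only stands in for the real CSV files.

```python
import csv
import io


def read_table(file_obj):
    """Read every CSV row from an open file and split off the header row."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


def load_csv(path, encoding="utf8"):
    """Open a CSV file, read it, and close it again (hypothetical helper)."""
    # newline="" is what the csv module recommends when opening files
    with open(path, encoding=encoding, newline="") as f:
        return read_table(f)


# Small in-memory stand-in for a real file, so the sketch runs anywhere
sample = io.StringIO("App,Category\nInstagram,SOCIAL\n")
header, rows = read_table(sample)
print(header)
print(rows)
```

With the real files, `android_header, android = load_csv(path)` would replace the open, read and slice steps, where `path` is the location of `googleplaystore.csv`.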

In [8]:
# Explore Function
# start and end are integers that mark the beginning and end of the slice
# rows_and_columns will print the number of rows and columns if set to True
# Note: the data set should not include a header row, or the function will print the wrong number of rows
def explore_data(dataSet, start, end, rows_and_columns = False):
    dataSet_slice = dataSet[start:end]
    for row in dataSet_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataSet))
        print('Number of columns: ', len(dataSet[0]))
        
    

### Exploring the data

In [9]:
explore_data(android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


In [10]:
explore_data(ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


#### We can look at the header row of each data set to find relevant columns that can be useful in our analysis

In [13]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


##### In the Android data set we should consider looking at:
'App' , 'Category', 'Rating', 'Reviews', 'Price', 'Content Rating' and 'Genres'



In [16]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


##### In the iOS data set we should consider looking at:
'track_name', 'price', 'rating_count_tot',  'user_rating', 'cont_rating', 'prime_genre'

rating_count_tot = total number of ratings
cont_rating = recommended age

## Data Cleaning

Since we are only analyzing apps that are in English and free, we need to clean our data.
Any errors in the data also need to be addressed.

By running a for loop over the Android data, we can see that one row does not have the same length as the header: it has 12 fields instead of 13. The following code pinpoints where the error is.
From here we can delete that row from our data. Make sure to run the del statement only once, since running it again would delete whichever row has moved into index 10472.

In [18]:
for row in android:
    header_length = len(android_header)
    row_length = len(row)
    if row_length != header_length:
        print(row)
        print(android.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [19]:
print(len(android))
del(android[10472])
print(len(android))


10841
10840
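
Because `del` by index removes a different row each time the cell is re-run, one alternative is to filter by row length instead, which is safe to run repeatedly. This is only a sketch: `drop_malformed_rows` is a hypothetical helper, and the three-column demo data is made up to keep the example short.

```python
def drop_malformed_rows(rows, header):
    """Keep only the rows that have as many fields as the header."""
    expected = len(header)
    return [row for row in rows if len(row) == expected]


# Shortened, made-up stand-ins for android_header and android
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # one field missing
    ['Facebook', 'SOCIAL', '4.1'],
]

cleaned = drop_malformed_rows(demo_rows, demo_header)
# Running the filter a second time changes nothing
cleaned_again = drop_malformed_rows(cleaned, demo_header)
print(len(demo_rows), len(cleaned), len(cleaned_again))
```

On the real data this would be `android = drop_malformed_rows(android, android_header)`.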


#### Duplicate Data
Next, we need to search for duplicate entries in our data set.
We can create two lists, duplicate_apps and unique_apps, and run a for loop to separate the app names into the two lists.

In [20]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('Number of unique apps: ', len(unique_apps))

Number of duplicate apps:  1181
Number of unique apps:  9659
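
The loop above checks `name in unique_apps` against a list, which scans the whole list for every row. A sketch of the same count using `collections.Counter` from the standard library is shown below; `count_duplicates` is a hypothetical helper and the demo rows are invented.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of extra repeated rows)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one counts as a duplicate
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


demo_rows = [['Instagram'], ['Instagram'], ['Facebook'], ['Instagram']]
n_unique, n_duplicates = count_duplicates(demo_rows)
print('Number of duplicate apps: ', n_duplicates)
print('Number of unique apps: ', n_unique)
```

This counts duplicates the same way as the two-list approach: the first occurrence of a name is unique, and each later occurrence is a duplicate.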


We do not want to delete duplicates at random. Instead, we want to keep the most recent entry for each app and discard the older ones. The fourth column (index 3) holds the number of reviews. Since review counts grow over time, the row with the highest number of reviews is most likely the most recent one.

In [22]:
# Note the fourth column (index 3), which is the total number of reviews.
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

print('Expected length: ', len(android) - 1181)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Expected length:  9659


Now we can build a dictionary that maps each app name to its highest review count. If the app name is already in the dictionary and the current row's review count is greater than the stored value, we update the entry.

If the app name is not in the dictionary, then we create a new entry, where the key is the app name and the value is that row's review count.

In [27]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))
        
    

9659


Now we can create a cleaned version of the Android data as a new list (android_clean).

We loop over the rows and check two conditions: that the row's review count equals reviews_max[name], and that the app name is not already in the already_added list. The second check is needed because several duplicate rows can share the same maximum review count.
If both conditions are met, the row is added to android_clean and the app name is added to the already_added list.

In [28]:
# We store rows in android_clean
android_clean = []
# We store the app name in already_added
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

        
    

In [29]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13
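
The two passes above (building reviews_max, then filtering into android_clean) can also be folded into a single pass that stores the best row per app directly. This is a sketch of that alternative, not the notebook's method: `keep_most_reviewed` is a hypothetical name, and the demo rows are shortened to four columns.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict > means that on a tie the earlier row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened demo rows: name, category, rating, reviews
demo_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

demo_clean = keep_most_reviewed(demo_rows)
for row in demo_clean:
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, so the row order can differ from android_clean even though the same rows are kept.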


#### Cleaning the non-English apps

Now we can write a function to remove the non-English apps from the data.

The function takes a string and uses the built-in Python function ord(), which returns an integer representing a character's Unicode code point.
The characters commonly used in English text fall in the ASCII range (0 to 127), so a code point greater than 127 suggests that the character is not an English one.
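
To make the 127 threshold concrete, the short sketch below prints the code point of a few sample characters and whether each falls within the ASCII range. The sample characters are chosen only for illustration.

```python
sample_characters = ['a', 'Z', '~', '™', '爱', '😜']

# Map each character to its Unicode code point
codes = {character: ord(character) for character in sample_characters}

for character, code in codes.items():
    # The last value is True when the character is within the ASCII range
    print(repr(character), code, code <= 127)
```

Letters and common punctuation stay at or below 127, while symbols such as '™' and emoji are far above it, even though they often appear in English app names.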




In [32]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
        
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


True
False
False
False
