 # AppleStore and Google Playstore Data Analysis
 
In-app ads are the main source of revenue for developers at Ssylix Technologies LTD, because we build free apps for our users to download and install. This analysis uses data from the Apple App Store and the Google Play Store to give our developers useful information, such as which kinds of apps are most likely to attract users.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Our goal for this project is to analyze data from both stores to help our developers understand what types of apps are likely to attract more users.

In [1]:
from csv import reader

def open_sets(dataSets):
    # Read a CSV file into a list of rows; the with-block closes the file afterwards.
    with open(dataSets, encoding = 'utf8') as opened_file:
        app_data = list(reader(opened_file))

    return app_data

In [2]:
android = open_sets("googleplaystore.csv")
android_header = android[0]
android = android[1:]
ios = open_sets("AppleStore.csv")
ios_header = ios[0]
ios = ios[1:]

In [3]:
def explore_data(dataSets, start, end, rows_and_columns = False): 
    data = dataSets[start:end]
    for rows in data:
        print (rows)
        print ('\n')
    
    if rows_and_columns:
        print("Number of rows:", len(dataSets))
        print("Number of columns:", len(dataSets[0]))

## Delete incorrect rows

The function below deletes malformed rows: any row whose length differs from that of the first row in the data set. Because the header has already been split off, the first data row stands in for the header length.

Writing this as a function lets the same check run on both the Google Play Store and the Apple App Store data; the data set to check is passed as the argument.

The calls below the function show it in use. It prints each deleted row and then the list of deleted indices.

In [4]:
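def delete_incorrect_rows(dataSets):
    # The header was split off earlier, so the first row is the reference length.
    expected_length = len(dataSets[0])
    # Collect the indices first so the list is not modified while it is being scanned.
    rows_deleted = [i for i, row in enumerate(dataSets) if len(row) != expected_length]
    # Delete from the end so the remaining indices stay valid.
    for i in reversed(rows_deleted):
        print(dataSets[i])
        del dataSets[i]
    print("Incorrect row(s) deleted:", rows_deleted)
    print("\n")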

In [5]:
print(len(android),  len(ios))

10841 7197


In [6]:
print("Android DataSets".upper())
print(android_header)
print("\n")
delete_incorrect_rows(android)

print("IOS DataSets".upper())
print(ios_header)
print("\n")
delete_incorrect_rows(ios)

ANDROID DATASETS
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Incorrect row(s) deleted: [10472]


IOS DATASETS
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Incorrect row(s) deleted: []




In [7]:
print(len(android),  len(ios))

10840 7197
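The same length check can also be written without modifying the list in place. The following is a minimal sketch on made-up rows, not the notebook's function: `split_by_length`, `sample_header` and `sample_rows` are hypothetical names introduced only for illustration.

```python
def split_by_length(rows, expected_length):
    """Return (good_rows, bad_rows); bad_rows pairs each index with its row."""
    good_rows = []
    bad_rows = []
    for index, row in enumerate(rows):
        if len(row) == expected_length:
            good_rows.append(row)
        else:
            bad_rows.append((index, row))
    return good_rows, bad_rows


# Made-up sample: the header has three columns, the second data row is missing one.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

good, bad = split_by_length(sample_rows, len(sample_header))
print(len(good), bad)
```

Returning the rejected rows with their indices leaves the original list untouched, which makes it easier to inspect what was dropped before committing to the cleaned version.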


## Duplicate or unique function

This function takes four parameters:
* The data set (the rows read from googleplaystore.csv or AppleStore.csv)
* `start`, the beginning of an optional slice, defaulting to `None`
* `end`, the end of that slice, also defaulting to `None`
* `duplicate`, defaulting to `True`: when true the function returns the list of duplicated values, and when false the list of unique values

Values are compared on the first column of each row, which is the app name in the Android data and the id in the iOS data.

The cell below the function shows how it is called and what it returns for both duplicates and unique apps.

The `start` and `end` parameters are useful for inspecting part of the result when printing the full list is not needed.

In [8]:
def duplicate_or_unique(dataSet, start = None, end = None, duplicate=True):
    duplicate_apps = []
    unique_apps = []
    seen = set()  # set membership is much faster than scanning a list
    for app in dataSet:
        name = app[0]
        if name in seen:
            duplicate_apps.append(name)
        else:
            seen.add(name)
            unique_apps.append(name)

    # Slicing with None bounds returns a full copy, so no special case is needed.
    result = duplicate_apps if duplicate else unique_apps
    return result[start:end]


In [9]:
print ("ANDROID")
android_duplicates = duplicate_or_unique(android)
print(android_duplicates[1:4])
android_unique = duplicate_or_unique(android, duplicate = False)
print ("Total rows:", len(android_duplicates) + len(android_unique))
print ("Duplicate rows: ", len(android_duplicates), ",", "Unique rows: ", len(android_unique))

print ("\n")
print("IOS")
ios_duplicates = duplicate_or_unique(ios)
ios_unique = duplicate_or_unique(ios, duplicate = False)
print ("Total rows:", len(ios_duplicates) + len(ios_unique))
print ("Duplicate rows: ", len(ios_duplicates), ",", "Unique rows: ", len(ios_unique))

ANDROID
['Box', 'Google My Business', 'ZOOM Cloud Meetings']
Total rows: 10840
Duplicate rows:  1181 , Unique rows:  9659


IOS
Total rows: 7197
Duplicate rows:  0 , Unique rows:  7197
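As a cross-check on counts like the ones above, the standard library's `collections.Counter` gives the same two numbers in a few lines. This is only a sketch on made-up names (`sample_names` is hypothetical), not a replacement for the function in the notebook.

```python
from collections import Counter

# Made-up app names: 'Box' appears three times and 'Slack' twice.
sample_names = ['Box', 'Slack', 'Box', 'Zoom', 'Slack', 'Box']

name_counts = Counter(sample_names)

# Every occurrence after the first one counts as a duplicate row.
duplicate_rows = sum(count - 1 for count in name_counts.values())
unique_names = len(name_counts)

print("Duplicate rows:", duplicate_rows, ", Unique rows:", unique_names)
```

The two numbers always add up to the length of the input list, which is the same sanity check the notebook prints as "Total rows".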


## Clean Duplicates function
The function takes three parameters:
* The data set, read from **googleplaystore.csv** or **AppleStore.csv**
* ***index***, the column used to decide which of the duplicate rows to keep
* `float_`, defaulting to `True`, which controls whether the value in that column is converted to a float

Below, an index of 3 is passed so that duplicates are resolved by the Reviews column, with **float_** left at its default of `True`. The function returns a dictionary mapping each app name to its highest review count.

Starting from an empty dictionary, we store each app's value the first time its name appears and overwrite it only when a later duplicate row has a larger value:

```
if name in data and data[name] < cleanRow:
    data[name] = cleanRow
elif name not in data:
    data[name] = cleanRow
```

Here `cleanRow` is the value of the chosen column for the current row, converted to a float when `float_` is `True`.

In [10]:
def cleanDuplicates(dataSets, index, float_ = True):
    data = {}
    for app in dataSets:
        name = app[0]
        cleanRow = float(app[index]) if float_ else app[index]
        
        if name in data and data[name] < cleanRow:
            data[name] = cleanRow
        elif name not in data:
            data[name] = cleanRow
            
    return data

In [11]:
cleanedAndroidDuplicates = cleanDuplicates(android, 3)
print(len(cleanedAndroidDuplicates))

9659
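To see the keep-the-largest-value rule on a small scale, here is a minimal sketch with made-up rows. The names `sample_rows` and `max_reviews` and all the values are hypothetical; only the column positions (name at index 0, reviews at index 3) follow the Android data.

```python
# Made-up rows: name first, review count at index 3.
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '1200'],
    ['App A', 'SOCIAL', '4.5', '1350'],
    ['App B', 'BUSINESS', '4.2', '80'],
    ['App A', 'SOCIAL', '4.5', '1100'],
]

max_reviews = {}
for row in sample_rows:
    name = row[0]
    reviews = float(row[3])
    # Store the first value seen, then overwrite only with a larger one.
    if name not in max_reviews or max_reviews[name] < reviews:
        max_reviews[name] = reviews

print(max_reviews)
```

The dictionary ends up with one entry per name, which is why its length matches the number of unique apps reported earlier.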


## The CleanList function 

The function takes four parameters:
* `dataSets`, the rows read from **googleplaystore.csv** or **AppleStore.csv**
* `index`, the column to compare on
* `cleanedDuplicates`, the dictionary of cleaned duplicates, which is compared against the rows in `dataSets`
* `float_`, defaulting to `True`, which converts the value in that column to a float

Two lists, `cleanedList` and `alreadyAdded`, are created inside the function. A row is appended to `cleanedList` only if its value matches the one stored in `cleanedDuplicates` and its name is not yet in `alreadyAdded`. The second check matters when several duplicate rows share the same highest value.

Both lists are returned together as a tuple and can be accessed by indexing the result of the call.

An example of its use follows the function.

In [12]:
def cleanList(dataSets, index, cleanedDuplicates, float_ = True):
    cleanedList = []
    alreadyAdded = []
    for app in dataSets:
        app_name = app[0]
        cleanRow = float(app[index]) if float_ else app[index]
        
        if (cleanedDuplicates[app_name] == cleanRow) and (app_name not in alreadyAdded):
            cleanedList.append(app)
            alreadyAdded.append(app_name)
    
    return cleanedList, alreadyAdded

In [13]:
androidClean = cleanList(android, 3, cleanedAndroidDuplicates)
android_clean = androidClean[0]
already_added = androidClean[1]

print(len(android_clean), len(already_added))

9659 9659


## IsEnglish function 

The function takes a string and uses the `ord` Python [built-in function](https://docs.python.org/3/library/functions.html#ord) to convert each character to its Unicode code point.

Characters commonly used in English text fall within the ASCII range, 0 to 127. An `if` statement therefore checks whether each character's code point lies above that range.

The ***non_english*** counter starts at zero and is incremented for every character outside the range. Despite its name, the function returns this count rather than a boolean.

In [14]:
def isEnglish(string):
    non_english = 0
    for char in string:
        character = ord(char)
        if character > 127:
            non_english += 1
    
    return non_english

## returnEnglishApps function

The function takes two arguments:
- ***dataSet***, the rows of the data set to be filtered
- ***nameIndex***, the index of the column that holds the app name

An empty list, *englishApps*, is created. Each row whose name contains at most three non-ASCII characters, as counted by `isEnglish(string)`, is appended to it, and the list is returned. The allowance of three presumably keeps English names that include an emoji or a symbol such as ™.

In [15]:
def returnEnglishApps(dataSet, nameIndex):
    englishApps = []
    for app in dataSet:
        name = app[nameIndex]
        if isEnglish(name) <= 3:
            englishApps.append(app)
    
    return englishApps

In [16]:
android_english = returnEnglishApps(android_clean, 0)
ios_english = returnEnglishApps(ios, 1)

print(ios_header)
explore_data(ios_english, 0, end = 5, rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 6183
Number of columns: 16


## freeApps function

The freeApps function takes three parameters:
- `dataSet`, the cleaned list of English apps
- `index`, the position of the price column, which differs between the Apple and Android data
- **android**, defaulting to `True`, which selects the string that marks a free app: `'0'` in the Android price column and `'0.0'` in the Apple price column

Rows whose price matches are appended to `free`, which starts as an empty list and is returned at the end of the function.

The cell below the function prints the number of free apps found in each data set.

In [17]:
def freeApps(dataSet, index, android = True):
    free = []
    for app in dataSet:
        price = app[index]
        value = '0' if android else '0.0'
        if price == value:
            free.append(app)
            
    return free

In [18]:
free_iosEnglishApps = freeApps(ios_english, 4, False)
free_androidEnglishApps = freeApps(android_english, 7)

print(len(free_iosEnglishApps), len(free_androidEnglishApps))

3222 8864
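Because the comparison above is done on strings, each store needs its own marker for "free". One alternative is to compare prices numerically, sketched below. `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about how the price column is formatted, not something shown in the output above.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0' or '0.0'."""
    # Assumption: paid prices may carry a leading '$', which is stripped before parsing.
    return float(price_text.lstrip('$')) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, is_free(price))
```

With a numeric check, the same call would work for both data sets without a store-specific flag, at the cost of raising a `ValueError` on any price that cannot be parsed.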


In [19]:
def frequencyTable(dataSet, index):
    table = {}
    total = 0
    for app in dataSet:
        total += 1
        value = app[index]
        if value in table:
            table[value] += 1 
        else:
            table[value] = 1
    
    percentage = {}
    for key in table:
        percentage[key] =  ((table[key] / total) * 100)
    
    return percentage, table
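The count-then-percentage pattern used here can also be expressed with `collections.Counter`. The sketch below uses a made-up list of genres (`sample_genres` is hypothetical) to show the two tables side by side.

```python
from collections import Counter

# Made-up genre column values.
sample_genres = ['Games', 'Games', 'Music', 'Games', 'Education']

counts = Counter(sample_genres)
total = sum(counts.values())

# Convert each count into its share of all rows, in percent.
percentages = {genre: count / total * 100 for genre, count in counts.items()}

# Print the largest share first.
for genre, share in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ":", share)
```

Sorting on the dictionary items with a key function avoids building a separate list of `(value, key)` tuples just for ordering.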
            

In [20]:
for game in frequencyTable(free_androidEnglishApps, 9)[0]:
    if "Game" in game:
        print(game)

Education;Brain Games
Entertainment;Brain Games
Casual;Brain Games
Puzzle;Brain Games
Educational;Brain Games
Board;Brain Games
Parenting;Brain Games
Role Playing;Brain Games


In [21]:
def display_table(dataSet, index = False, percentage = True, fullAnalysis = True):
    if fullAnalysis:
        table = frequencyTable(dataSet, index)[0] if percentage else frequencyTable(dataSet, index)[1]
    else:
        table = dataSet
        
    display = []
    for key in table:
        display_tuple = (table[key], key)
        display.append(display_tuple)
    
    sort_display = sorted(display, reverse = True)
    for entry in sort_display:
        print(entry[1], ":", entry[0])

In [22]:
print("apple Genre column".upper())
ios_genre = display_table(free_iosEnglishApps, -5, 1)
print("\n")

print("android Genre column".upper())
android_genre = display_table(free_androidEnglishApps, 9, 1)
print("\n")

print("android category column".upper())
android_category = display_table(free_androidEnglishApps, 1, 0)

APPLE GENRE COLUMN
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


ANDROID GENRE COLUMN
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.4634

In [23]:
def appGenre(dataSet, genreIndex, installIndex, android = False):
    # The keys of the percentage table give the unique genres.
    # Alternative, step 1: len_of_genre = frequencyTable(dataSet, genreIndex)[1]
    app_genre = frequencyTable(dataSet, genreIndex)[0]
    data = {}
    for genre in app_genre:
        total_user_installs = 0
        len_genre = 0
        for app in dataSet:
            genre_app = app[genreIndex]
            if genre_app == genre:
                ins = app[installIndex]
                if android:
                    # Play Store install counts look like '1,000+'; strip the punctuation.
                    ins = ins.replace(",", "").replace("+", "")
                total_user_installs += float(ins)
                len_genre += 1
                # Alternative, step 2: len_genre = len_of_genre[genre]

        # Average value of the install column per app in this genre
        average_installs = total_user_installs / len_genre
        data[genre] = average_installs

    # display_table prints the averages sorted from highest to lowest
    return display_table(data, fullAnalysis = False)
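The install-count cleanup inside the function can be checked on its own. This is a small sketch in which `parse_installs` is a hypothetical helper; the input format follows the `'1,000+'` style value visible in the Android rows printed earlier.

```python
def parse_installs(text):
    """Turn an install string such as '1,000+' into a float."""
    return float(text.replace(',', '').replace('+', ''))


for raw in ['1,000+', '10,000,000+', '50+']:
    print(raw, "->", parse_installs(raw))
```

The trailing `+` suggests these values are lower bounds rather than exact counts, so averages built from them are probably best read as rough estimates.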

In [24]:
def checkGenre(dataSet, genre, genreIndex, nameIndex, installIndex):
    for app in dataSet:
        if app[genreIndex] == genre:
            print(app[nameIndex], ":", app[installIndex])

In [27]:
appGenre(free_androidEnglishApps, 9, 5, True)

Communication : 38456119.167247385
Adventure;Action & Adventure : 35333333.333333336
Video Players & Editors : 24947335.796178345
Social : 23253652.127118643
Arcade : 22888365.48780488
Casual : 19569221.602564104
Puzzle;Action & Adventure : 18366666.666666668
Photography : 17840110.40229885
Educational;Action & Adventure : 17016666.666666668
Productivity : 16787331.344927534
Racing : 15910645.681818182
Travel & Local : 14051476.145631067
Casual;Action & Adventure : 12916666.666666666
Action : 12603588.872727273
Strategy : 11199902.530864198
Tools : 10802461.246995995
Tools;Education : 10000000.0
Role Playing;Brain Games : 10000000.0
Lifestyle;Pretend Play : 10000000.0
Casual;Music & Video : 10000000.0
Card;Action & Adventure : 10000000.0
Adventure;Education : 10000000.0
News & Magazines : 9549178.467741935
Music : 9445583.333333334
Educational;Pretend Play : 9375000.0
Puzzle;Brain Games : 9280666.666666666
Word : 9094458.695652174
Racing;Action & Adventure : 8816666.666666666
Books & R