# Portfolio Project

In this project I combine what I have learned so far and apply it to a basic data analysis.

I take the role of a data analyst at a company that builds Android and iOS mobile apps. The apps are free to download, and revenue comes from in-app ads, so it depends on the number of active users who use the apps and engage with those ads.

Goal: Analyze data to help developers understand what types of apps are likely to attract more users.

## Loading the data
Performing necessary imports

In [237]:
from csv import reader

Defining a function to explore the dataset

In [238]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Opening the datasets

In [239]:
appleDataOpened = open("AppleStore.csv", encoding = "utf-8")
googleDataOpened = open("googleplaystore.csv", encoding = "utf-8")

Reading the data

In [240]:
appleData = list(reader(appleDataOpened))
googleData = list(reader(googleDataOpened))

In [241]:
appleHeader = appleData[0]
appleData = appleData[1:]
googleHeader = googleData[0]
googleData = googleData[1:]
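
The file objects opened above are never closed. A minimal sketch of the same load using a `with` block, which closes the file as soon as the rows are read. The helper name `load_csv` is hypothetical and not part of the original notebook:

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding="utf-8", newline="") as f:
        rows = list(reader(f))
    # The first row is the header, the rest are data rows
    return rows[0], rows[1:]
```

With this helper, the two datasets could be loaded as `appleHeader, appleData = load_csv("AppleStore.csv")` and likewise for `googleplaystore.csv`.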

Exploring the data
* Apple Store data

In [242]:
explore_data(dataset = appleData,start = 0, end = 5, rows_and_columns = True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


* Google Play Store data

In [243]:
explore_data(dataset = googleData,start = 0, end = 5, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


## Data processing
* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

App development is targeted at an English-speaking audience, hence:
* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.  

From the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) on Kaggle, we learn that row 10472 contains an error. Let us compare it against a correct row.

In [244]:
print(googleHeader)
print(googleData[13])
print(googleData[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Mandala Coloring Book', 'ART_AND_DESIGN', '4.6', '4326', '21M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 26, 2018', '1.0.4', '4.4 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


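The row has only 12 values instead of 13: the 'Category' value is missing, so every later value is shifted one column to the left (the rating '1.9' sits in the 'Category' position), and the 'Genres' field is empty.  
Removing the erroneous row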

In [245]:
del googleData[10472]
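
Instead of relying on a hard-coded index, malformed rows can be located by comparing each row's length with the header's. A small sketch on toy data; `find_malformed_rows` is a hypothetical helper, and the toy lists stand in for googleHeader and googleData:

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Toy data standing in for googleHeader / googleData
toy_header = ['App', 'Category', 'Rating']
toy_data = [
    ['Mandala Coloring Book', 'ART_AND_DESIGN', '4.6'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # 'Category' is missing
]

bad = find_malformed_rows(toy_data, toy_header)
print(bad)  # [1]

# Delete from the end so the remaining indices stay valid
for i in reversed(bad):
    del toy_data[i]
```

Because `del` changes the list in place, re-running a cell that deletes by a fixed index would remove a second, valid row. A length check like this one is safe to run repeatedly.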

### Removing duplicate entries

In [246]:
appCount = {}
for app in googleData:
    if(app[0] in appCount):
        appCount[app[0]] +=1
    else:
        appCount[app[0]] = 1

In [247]:
uniqueApps = []
duplicateApps = []
for app in appCount:
    if(appCount[app]>1):
        duplicateApps.append(app)
    else:
        uniqueApps.append(app)
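
The counting and splitting done in the two cells above can also be sketched with `collections.Counter` from the standard library. The function name `split_by_duplication` and the toy rows are illustrative only:

```python
from collections import Counter

def split_by_duplication(dataset, name_index=0):
    """Return (unique_names, duplicated_names) based on how often each name occurs."""
    counts = Counter(row[name_index] for row in dataset)
    unique = [name for name, n in counts.items() if n == 1]
    duplicated = [name for name, n in counts.items() if n > 1]
    return unique, duplicated

# Toy rows: only the app name column is needed here
toy = [['Instagram'], ['Instagram'], ['Example App'], ['Instagram']]
unique, duplicated = split_by_duplication(toy)
print(unique)      # ['Example App']
print(duplicated)  # ['Instagram']
```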

Examining the apps that have duplicate records

In [248]:
print(duplicateApps[:15],'\nThere are',len(duplicateApps),'duplicate apps.. The first 15 are displayed above')

['Google Drive', 'Just She - Top Lesbian Dating', 'Vigo Video', 'Meet24 - Love, Chat, Singles', 'BP Journal - Blood Pressure Diary', 'Word Search', 'Christian Dating For Free App', 'Toca Life: City', 'Live Talk - Free Text and Video Chat', 'Brilliant', 'Medical ID - In Case of Emergency (ICE)', 'Daily Manga - Comic & Webtoon', 'FP Notebook', 'Amino: Communities and Chats', 'MLB At Bat'] 
There are 798 duplicate apps.. The first 15 are displayed above


Examining an app and its duplicate records

In [249]:
subData = []
print(googleHeader)
for app in googleData:
    if(app[0]=='Instagram'):
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The records differ only in the 'Reviews' column, which suggests they were collected from the store at different times.  
We use this field to remove duplicates: for each app we keep only the record with the highest number of reviews.

Expected number of records after removal of duplicates in google playstore dataset

In [250]:
expectedRecordsGoogle = len(set(duplicateApps)) + len(uniqueApps) 
print(expectedRecordsGoogle)

9659


Creating a dictionary where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [251]:
appDictGoogle = {}
for app in googleData:
    name = app[0]
    reviews = int(app[3])
    if(name in appDictGoogle and appDictGoogle[name] < reviews):
        appDictGoogle[name] = reviews
    elif(name not in appDictGoogle):
        appDictGoogle[name] = reviews

We now have the highest review count for each app, which we treat as marking its most recent record.  
Removing the duplicates from googleData

In [252]:
cleanedData = []
already_added = []
for app in googleData:
    name = app[0]
    reviews = int(app[3])
    if(appDictGoogle[name] == reviews and name not in already_added):
        cleanedData.append(app)
        already_added.append(name)
recordsBefore = len(googleData)  # record the size before googleData is overwritten
googleData = cleanedData

In [253]:
print('Number of records in dataset before removing duplicates: ',recordsBefore)
print('Expected number of records in dataset after removing duplicates: ',expectedRecordsGoogle)
print('Number of records in dataset after removing duplicates: ',len(cleanedData))

Number of records in dataset before removing duplicates:  10840
Expected number of records in dataset after removing duplicates:  9659
Number of records in dataset after removing duplicates:  9659
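
The two passes above, together with the `already_added` list (whose membership test gets slower as the list grows), can be folded into a single pass. A sketch, assuming the review count sits at index 3 as in the Google Play data; `dedupe_by_max_reviews` is a hypothetical name and the rows are toy data:

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in dataset:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
cleaned = dedupe_by_max_reviews(toy)
print(len(cleaned))   # 2
print(cleaned[0][3])  # 66577446
```

One difference from the cells above: the result is ordered by each app's first appearance in the input, not by the position of its retained row.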


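Checking whether the App Store dataset has any duplicates (in this dataset column 0 holds the numeric app ID, so the check is by ID rather than by name)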

In [254]:
appCount = {}
for app in appleData:
    if(app[0] in appCount):
        appCount[app[0]] +=1
    else:
        appCount[app[0]] = 1
uniqueApps = []
duplicateApps = []
for app in appCount:
    if(appCount[app]>1):
        duplicateApps.append(app)
    else:
        uniqueApps.append(app)
print('Number of duplicates found: ',len(duplicateApps))

Number of duplicates found:  0


### Removing non-English entries
The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.  
Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not.  

Note: Some apps use emojis in their names. To avoid filtering these out, I remove an app only if its name contains more than 3 characters whose code point falls outside the ASCII range.

Function to find the names of apps that are non-English and need to be removed

In [255]:
def findNonEnglish(dataset):
    removeList = []
    for app in dataset:
        name = app[0]
        nameSplit = [x for x in name]
        count = 0
        for char in nameSplit:
            c_ord = ord(char)
            if(c_ord not in range(0,128)):
                count+=1
                if(count>3):
                    if(name not in removeList):
                        removeList.append(name)
    return(removeList)
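
The per-name test inside the function above can be isolated into a small predicate, which makes it easy to try on individual names. A sketch; `is_english` is a hypothetical helper, and its default threshold of 3 mirrors the `count>3` test in findNonEnglish:

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII (0-127)."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))                       # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_english('Instachat 😜'))                    # True
```

The emoji counts as a single character in Python 3, so a name with one or two emojis still passes.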

Function to remove unnecessary apps from an input list

In [256]:
def required(dataset,removeList):
    requiredApps = []
    for app in dataset:
        name = app[0]
        if(name not in removeList):
            requiredApps.append(app)
    return(requiredApps)

Removing the non-English apps from the Google Play Store data

In [257]:
removeList = findNonEnglish(googleData)
googleData = required(googleData,removeList)

In [258]:
print('Length of google playstore dataset after removing unnecessary apps: ',len(googleData))

Length of google playstore dataset after removing unnecessary apps:  9614


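Performing the same action for the App Store apps. Note that findNonEnglish and required both read app[0], which in the App Store data is the numeric app ID rather than the name (the name is at index 1), so no App Store rows are removed here and the count below stays at 7197.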

In [259]:
removeList = findNonEnglish(appleData)
appleData = required(appleData,removeList)

In [260]:
print('Length of apple appstore dataset after removing unnecessary apps: ',len(appleData))

Length of apple appstore dataset after removing unnecessary apps:  7197


### Isolating the free apps

In [261]:
googleFinal = []
appleFinal = []

for app in googleData:
    price = app[7]
    if price == '0':
        googleFinal.append(app)
        
for app in appleData:
    price = app[4]
    if price == '0.0':
        appleFinal.append(app)
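
The comparisons above depend on the exact strings '0' and '0.0'. A more tolerant sketch parses the price as a number first. It assumes, without this being shown in the outputs above, that paid prices may carry a leading currency symbol such as '$4.99'; `is_free` is a hypothetical helper:

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' denotes zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Values that cannot be parsed as a number are treated as not free
        return False

print(is_free('0'))         # True
print(is_free('0.0'))       # True
print(is_free('$4.99'))     # False
print(is_free('Everyone'))  # False
```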

In [262]:
print('Free Google Playstore apps: ',len(googleFinal))
print('Free Apple appstore apps: ',len(appleFinal))


Free Google Playstore apps:  8864
Free Apple appstore apps:  4056


Function to generate frequency table

In [263]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
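
For comparison, a sketch of the same percentage table built with `collections.Counter` from the standard library. `freq_table_counter` is a hypothetical name, and the rows are toy data:

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(toy, 1))  # {'Games': 75.0, 'Music': 25.0}
```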

Function to display the contents in descending order of value

In [264]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
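
The sort step above can also be written with a key function, which avoids building the intermediate list of tuples. A sketch on a toy table; `sorted_percentages` is a hypothetical name:

```python
def sorted_percentages(table):
    """Return (key, percentage) pairs, largest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

toy_table = {'Music': 25.0, 'Games': 75.0}
for key, pct in sorted_percentages(toy_table):
    print(f'{key} : {pct:.2f}')
# Games : 75.00
# Music : 25.00
```

The two approaches differ on ties. display_table sorts (value, key) tuples, so entries with equal percentages come out in reverse alphabetical order of the key, while this sketch leaves them in the order they appear in the table.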

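Analyzing the apps by genre to see which genre is the most common on each platform. Note that the tables below are computed over appleData and googleData (all apps left after the language filter), not over the free-only appleFinal and googleFinal lists.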

In [265]:
display_table(appleData, -5)

Games : 53.66124774211477
Entertainment : 7.433652910935113
Education : 6.294289287203002
Photo & Video : 4.849242740030569
Utilities : 3.4458802278727245
Health & Fitness : 2.501042100875365
Productivity : 2.473252744198972
Social Networking : 2.3204112824788106
Lifestyle : 2.0008336807002918
Music : 1.9174656106711132
Shopping : 1.6951507572599693
Sports : 1.5839933305543976
Book : 1.5562039738780047
Finance : 1.445046547172433
Travel : 1.1254689453939142
News : 1.0421008753647354
Weather : 1.0004168403501459
Reference : 0.8892594136445742
Food & Drink : 0.8753647353063776
Business : 0.7919966652771988
Navigation : 0.6391552035570377
Medical : 0.31957760177851885
Catalogs : 0.1389467833819647


In [266]:
display_table(googleData, 9)

Tools : 8.602038693571874
Entertainment : 5.793634283336801
Education : 5.231953401289786
Business : 4.358227584772207
Medical : 4.108591637195756
Personalization : 3.900561680882047
Productivity : 3.879758685250676
Lifestyle : 3.775743707093822
Finance : 3.588516746411483
Sports : 3.442895776991887
Communication : 3.2660703141252343
Action : 3.110047846889952
Health & Fitness : 2.995631370917412
Photography : 2.9124193883919283
News & Magazines : 2.600374453921365
Social : 2.485957977948825
Travel & Local : 2.26752652381943
Books & Reference : 2.26752652381943
Shopping : 2.090701060952777
Simulation : 1.9762845849802373
Arcade : 1.9138755980861244
Dating : 1.768254628666528
Casual : 1.7162471395881007
Video Players & Editors : 1.674641148325359
Maps & Navigation : 1.3417932182234242
Puzzle : 1.2377782400665696
Food & Drink : 1.1649677553567712
Role Playing : 1.0817557728312877
Strategy : 0.9777407946744331
Racing : 0.9465363012273768
Libraries & Demo : 0.8737258165175785
Auto & Vehicl

The most common genre on Google Play is "Tools", at about 8.6% of the apps.

Analyzing the Google Play data by the 'Category' column

In [267]:
display_table(googleData, 1)

FAMILY : 19.325982941543582
GAME : 9.819013938007073
TOOLS : 8.61244019138756
BUSINESS : 4.358227584772207
MEDICAL : 4.108591637195756
PERSONALIZATION : 3.900561680882047
PRODUCTIVITY : 3.879758685250676
LIFESTYLE : 3.786145204909507
FINANCE : 3.588516746411483
SPORTS : 3.3804867900977738
COMMUNICATION : 3.2660703141252343
HEALTH_AND_FITNESS : 2.995631370917412
PHOTOGRAPHY : 2.9124193883919283
NEWS_AND_MAGAZINES : 2.600374453921365
SOCIAL : 2.485957977948825
TRAVEL_AND_LOCAL : 2.2779280216351157
BOOKS_AND_REFERENCE : 2.26752652381943
SHOPPING : 2.090701060952777
DATING : 1.768254628666528
VIDEO_PLAYERS : 1.6954441439567296
MAPS_AND_NAVIGATION : 1.3417932182234242
FOOD_AND_DRINK : 1.1649677553567712
EDUCATION : 1.1025587684626585
ENTERTAINMENT : 0.9049303099646349
LIBRARIES_AND_DEMO : 0.8737258165175785
AUTO_AND_VEHICLES : 0.8737258165175785
WEATHER : 0.8217183274391513
HOUSE_AND_HOME : 0.7593093405450385
EVENTS : 0.6656958602038693
PARENTING : 0.6240898689411275
ART_AND_DESIGN : 0.6240