# Mobile Applications Data Analysis


### What the project is about:
Analyse data on mobile applications from the Apple App Store and Google Play: users, ratings, prices, etc.

### Goal of the project:
By analysing the data, understand which type of application is the most profitable, assuming it will be a free app whose main revenue comes from in-app ads.



## Read the data and identify what is valuable for our goal

In [143]:
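# Read the data files. Each data set is stored as a table:
# a list of rows, where every row is a list of strings.

import csv

android_aps_data = []
ios_aps_data = []

# An explicit encoding keeps non-ASCII app names (emoji, trademark signs) intact;
# newline='' is what the csv module recommends for files passed to csv.reader.
with open('google-play-store-apps/googleplaystore.csv', encoding='utf-8', newline='') as android_aps_data_file:
    csv_reader = csv.reader(android_aps_data_file)

    for row in csv_reader:
        android_aps_data.append(row)

with open('app-store-apple-data-set-10k-apps/AppleStore.csv', encoding='utf-8', newline='') as ios_aps_data_file:
    csv_reader = csv.reader(ios_aps_data_file)

    for row in csv_reader:
        ios_aps_data.append(row)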


In [144]:
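# Function to explore the data: prints the rows in dataset[start:stop]
# and, optionally, the total number of columns and rows.

def exloreData(dataset, start, stop, print_number_of_rows_and_columns=False):
    for row in dataset[start:stop]:
        print(row)
        print('\n')

    if print_number_of_rows_and_columns:
        # The number of columns is the length of the first (header) row.
        print('Columns in total: ', len(dataset[0]))
        print('Rows in total: ', len(dataset))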
    

In [145]:
# Print the Android column names and the first two data rows

exloreData(dataset=android_aps_data, start=0, stop=3, print_number_of_rows_and_columns=False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




The columns that may be useful for us are 'Category' [1], 'Rating' [2], 'Installs' [5], 'Type' [6], 'Content Rating' [8], 'Genres' [9] and 'Android Ver' [12] (indexes start at 0).
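The index of each column can be read off the header row instead of being counted by hand. A minimal sketch, using a copy of the header printed above; `header` and `column_index` are names introduced here for illustration and are not part of the notebook:

```python
# Header row copied from the output above; `header` is an illustrative name.
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

# Map each column name to its position in a row.
column_index = {name: position for position, name in enumerate(header)}

for name in ['Category', 'Rating', 'Installs', 'Type',
             'Content Rating', 'Genres', 'Android Ver']:
    print(name, column_index[name])
```

Looking columns up by name this way avoids off-by-one mistakes when a row is indexed later in the analysis.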

In [146]:
# Print the iOS column names and the first two data rows

exloreData(dataset=ios_aps_data, start=0, stop=3, print_number_of_rows_and_columns=False)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']




The columns that may be useful for us are 'price' [5], 'user_rating' [8], 'cont_rating' [11], 'prime_genre' [12], 'sup_devices.num' [13] and 'lang.num' [15].

## Clean data

#### Broken data

In [147]:
print(android_aps_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


As we can see, row 10473 has no 'Category' value, so all the remaining values are shifted one column to the left. This means the data for this app is broken, and it is better to **remove this row**.
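A row that has lost a field is shorter than the header row, so rows like this can be found by comparing lengths. A minimal sketch on made-up sample data; `find_malformed_rows` and `sample` are hypothetical names, not part of the notebook:

```python
def find_malformed_rows(dataset):
    """Return (index, field_count) for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(index, len(row))
            for index, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Made-up sample: the last row is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good app', 'TOOLS', '4.1'],
    ['Broken app', '1.9'],
]

malformed = find_malformed_rows(sample)
print(malformed)
```

The returned indexes refer to positions in the full data set, header included, so they can be passed to `del` directly.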

In [148]:
# removing broken data item

del android_aps_data[10473]

#### Duplicate data

Now we need to check whether there are duplicates in our data tables.

In [149]:
def check_for_duplicates(dataset, name_index=0):
    allData = {}
    dups = {}

    for index, data in enumerate(dataset):
        if data[name_index] in allData:
            allData[data[name_index]].append({
                'data': data,
                'index': index,
            })
        else:
            allData[data[name_index]] = [{
                'data': data,
                'index': index
            }]
    
    for key in allData:
        if len(allData[key]) > 1:
            dups[key] = allData[key]

    return dups
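When only the number of duplicates per name is needed, and not the rows themselves, the same check can be sketched more compactly with `collections.Counter`. The helper `count_duplicate_names` and the sample rows are assumptions made for illustration:

```python
from collections import Counter

def count_duplicate_names(dataset, name_index=0):
    """Return {name: occurrences} for names that appear more than once."""
    counts = Counter(row[name_index] for row in dataset)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up sample rows: 'Box' appears three times, 'Google' once.
sample_rows = [
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Google', 'TOOLS'],
    ['Box', 'BUSINESS'],
]

duplicate_counts = count_duplicate_names(sample_rows)
# Rows that would have to be removed to leave one entry per name.
extra_rows = sum(n - 1 for n in duplicate_counts.values())
print(duplicate_counts, extra_rows)
```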

In [150]:
# print all android apps names that have duplicates
android_dups = check_for_duplicates(android_aps_data[1:], 0)

count_dups = 0
for dup in android_dups:
    print('"' + dup + '"', "amount: ",  len(android_dups[dup]))
    count_dups += len(android_dups[dup]) - 1
    
print('Overall duplicates: ', count_dups)

"Coloring book moana" amount:  2
"Mcqueen Coloring pages" amount:  2
"UNICORN - Color By Number & Pixel Art Coloring" amount:  2
"Textgram - write on photos" amount:  2
"Wattpad 📖 Free Books" amount:  2
"Amazon Kindle" amount:  2
"Dictionary - Merriam-Webster" amount:  2
"NOOK: Read eBooks & Magazines" amount:  2
"Oxford Dictionary of English : Free" amount:  2
"Spanish English Translator" amount:  2
"NOOK App for NOOK Devices" amount:  2
"Ebook Reader" amount:  2
"English Dictionary - Offline" amount:  2
"Docs To Go™ Free Office Suite" amount:  2
"Google My Business" amount:  3
"OfficeSuite : Free Office + PDF Editor" amount:  2
"Curriculum vitae App CV Builder Free Resume Maker" amount:  2
"Facebook Pages Manager" amount:  2
"Box" amount:  3
"Call Blocker" amount:  2
"ZOOM Cloud Meetings" amount:  2
"Facebook Ads Manager" amount:  2
"Quick PDF Scanner + OCR FREE" amount:  3
"SignEasy | Sign and Fill PDF and other Documents" amount:  2
"Genius Scan - PDF Scanner" amount:  2
"Tiny

"HotelTonight: Book amazing deals at great hotels" amount:  3
"Moto File Manager" amount:  2
"Google" amount:  2
"Google Translate" amount:  2
"Cache Cleaner-DU Speed Booster (booster & cleaner)" amount:  2
"SHAREit - Transfer & Share" amount:  2
"Gboard - the Google Keyboard" amount:  3
"Share Music & Transfer Files - Xender" amount:  2
"Flashlight" amount:  2
"CM Flashlight (Compass, SOS)" amount:  2
"Mobi Calculator free & AD free!" amount:  2
"VPN Free - Betternet Hotspot VPN & Private Browser" amount:  2
"osmino Wi-Fi: free WiFi" amount:  2
"CM Locker - Security Lockscreen" amount:  2
"Nova Launcher" amount:  2
"ZEDGE™ Ringtones & Wallpapers" amount:  4
"CM Launcher 3D - Theme, Wallpapers, Efficient" amount:  2
"Smart Launcher 5" amount:  2
"Apex Launcher" amount:  3
"Yandex Browser with Protect" amount:  2
"Ringtone Maker" amount:  2
"Beautiful Widgets Pro" amount:  2
"Beautiful Widgets Free" amount:  2
"HD Widgets" amount:  2
"Backgrounds HD (Wallpapers)" amount:  2
"ai.type F

To decide which of the duplicates to keep, we need a priority rule. My understanding is that the first priority is the application's version ('Current Ver'): keep the entry with the latest version. If the versions of the duplicates are the same, keep the entry with more 'Reviews'.

In [151]:
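# LooseVersion is needed to compare version strings.
# Note: distutils was removed in Python 3.12, so this cell needs an older Python.
from distutils.version import LooseVersion

VERSION_INDEX = 11  # 'Current Ver'
REVIEWS_INDEX = 3   # 'Reviews'
indexes_to_remove = []

def is_better(candidate, current):
    """Return True if candidate should replace current as the entry to keep."""
    cand_ver, cur_ver = candidate[VERSION_INDEX], current[VERSION_INDEX]
    if 'Varies with device' not in (cand_ver, cur_ver) and cand_ver != cur_ver:
        return LooseVersion(cur_ver) < LooseVersion(cand_ver)
    # Versions are equal or cannot be compared: fall back to the review count,
    # compared as numbers rather than as strings.
    return int(current[REVIEWS_INDEX]) < int(candidate[REVIEWS_INDEX])

for name, apps in android_dups.items():
    latest_app = apps[0]
    for app in apps[1:]:
        if is_better(app['data'], latest_app['data']):
            # The previous best entry loses; +1 because the header row was skipped.
            indexes_to_remove.append(latest_app['index'] + 1)
            latest_app = app
        else:
            indexes_to_remove.append(app['index'] + 1)

# remove the found indexes, highest first so that earlier deletions do not shift later ones
for index in sorted(indexes_to_remove, reverse=True):
    del android_aps_data[index]

# check that there are no duplicates left
len(check_for_duplicates(android_aps_data[1:], 0)) == 0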


True
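The priority rule (latest version first, then most reviews) can also be sketched without `distutils`, by turning each version string into a tuple of integers and letting `max` pick the entry to keep. This is an illustration under stated assumptions, not the notebook's method: `version_key` and `pick_entry_to_keep` are hypothetical helpers, the sample rows are made up, and a version such as 'Varies with device' becomes an empty tuple, so it loses to any numbered version.

```python
import re

def version_key(version):
    """Turn '1.0.19' into (1, 0, 19); non-numeric parts are ignored."""
    return tuple(int(part) for part in re.findall(r'\d+', version))

def pick_entry_to_keep(rows, version_index=11, reviews_index=3):
    """Keep the row with the highest version, breaking ties by review count."""
    return max(rows, key=lambda row: (version_key(row[version_index]),
                                      int(row[reviews_index])))

# Made-up sample rows: [name, version, reviews].
sample = [
    ['Box', '4.2.1', '150'],
    ['Box', '4.10.0', '90'],
    ['Box', '4.10.0', '120'],
]

kept = pick_entry_to_keep(sample, version_index=1, reviews_index=2)
print(kept)
```

Comparing the raw strings would rank '4.2.1' above '4.10.0', which is why both the version and the review count are converted to numbers before they are compared.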

In [152]:
# print all ios apps names that have duplicates
ios_dups = check_for_duplicates(ios_aps_data[1:], 2)

for dup in ios_dups:
    print('"' + dup + '"', "amount: ",  len(ios_dups[dup]))

"VR Roller Coaster" amount:  2
"Mannequin Challenge" amount:  2
