# Analyzing Mobile App Data

In this project we analyze mobile app data with the goal of finding the kinds of apps that are likely to attract more users. We use two datasets:
- An Android dataset with approximately 10,000 apps from Google Play. [Source](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- An iOS dataset with approximately 7,000 apps from the App Store. [Source](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

## Opening and exploring the datasets

In [1]:
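from csv import reader

# Read each CSV file into a list of rows; row 0 is the header.
# The explicit encoding avoids platform-dependent decoding of non-ASCII app names.
with open('AppleStore.csv', encoding='utf8') as open_ios:
    ios_list = list(reader(open_ios))
with open('googleplaystore.csv', encoding='utf8') as open_android:
    android_list = list(reader(open_android))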


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## Exploring datasets

### First three rows of iOS dataset

In [3]:
explore_data(ios_list,1,4)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




### First three rows of Android dataset

In [4]:
explore_data(android_list,1,4)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




### Number of rows and columns

In [5]:
explore_data(ios_list, 0, 0, rows_and_columns = True)

Number of rows: 7198
Number of columns: 16


In [6]:
explore_data(android_list, 0, 0, rows_and_columns = True)

Number of rows: 10842
Number of columns: 13


### Column names

In [7]:
explore_data(ios_list,0,1)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




In [8]:
explore_data(android_list,0,1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




For our goal of finding attractive apps, we will use the columns track_name, rating_count_tot, user_rating and prime_genre for iOS apps. For Android, the corresponding columns are App, Reviews, Rating and Category.

## Data cleaning

Let's check whether any of the apps are missing values. We can do that by checking whether each app row has the same length as the header row.

In [9]:
row_number = 0
for app in android_list:
    if len(app) != len(android_list[0]):
        print('Corrupted data on row: ' + str(row_number))
    row_number += 1

Corrupted data on row: 10473


In [10]:
for app in ios_list[1:]:
    if len(app) != len(ios_list[0]):
        print(app)

We can see that one row in the Android dataset is indeed missing a value, while the iOS check printed nothing. We are going to delete that corrupted row from our dataset.

In [11]:
del android_list[10473]
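
Deleting by a hard-coded index only works while the list is unchanged: running the cell a second time would remove a valid row. As a hedged alternative, the sketch below filters on row length instead. The function name `drop_malformed_rows` and the toy data are illustrative assumptions, not part of the original notebook.

```python
def drop_malformed_rows(dataset):
    """Return (clean_rows, bad_indices); row 0 is assumed to be the header."""
    expected = len(dataset[0])
    clean_rows = []
    bad_indices = []
    for index, row in enumerate(dataset):
        if len(row) == expected:
            clean_rows.append(row)
        else:
            bad_indices.append(index)
    return clean_rows, bad_indices


# Toy data standing in for android_list (hypothetical values)
toy = [
    ['App', 'Category', 'Rating'],
    ['Foo', 'TOOLS', '4.1'],
    ['Bar', '1.9'],            # one value missing
    ['Baz', 'GAME', '3.9'],
]
clean, bad = drop_malformed_rows(toy)
print(bad)         # [2]
print(len(clean))  # 3
```

Because the function returns a new list, it can be re-run safely without removing further rows.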

From the App Store dataset [Discussion page](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176) we can find out that the dataset contains duplicate entries. To fix this we have to delete them.

In [12]:
duplicate_list = []
unique_list = []
for app in ios_list:
    if app[1] in unique_list:
        duplicate_list.append(app[1])
    else:
        unique_list.append(app[1])
print(duplicate_list)

['Mannequin Challenge', 'VR Roller Coaster']


There seem to be two duplicated names. Let's print the matching rows to explore further.

In [13]:
print(ios_list[0])
for app in ios_list:
    if app[1] in duplicate_list:
        print(app)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


Here we can see that the entries differ in several columns. We assume that the entry with the higher rating_count_tot is the more recent one, because it has collected more ratings. Let's delete the older duplicates.

In [14]:
print("With duplicates: " + str(len(ios_list[1:])))
reviews_max = {}
for app in ios_list[1:]:
    name = app[1]
    reviews = float(app[5])
    # Keep the highest rating count seen so far for each app name
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    elif name not in reviews_max:
        reviews_max[name] = reviews

print("Without duplicates:" + str(len(reviews_max)))

With duplicates: 7197
Without duplicates:7195


We can see that our new dictionary contains the expected number of entries. Now we can use it to build a clean iOS list.

In [15]:
ios_clean = []
already_added = []

for app in ios_list[1:]:
    name = app[1]
    n_reviews = float(app[5])
    if n_reviews == reviews_max[name] and name not in already_added:
        ios_clean.append(app)
        already_added.append(name)
print(len(ios_clean))
print(ios_clean[0:2])

7195
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


The iOS dataset has now been cleaned. It contains 7195 rows, meaning that two duplicates were removed. Let's also check whether the Android dataset has any duplicates.

In [16]:
duplicate_list = []
unique_list = []
for app in android_list:
    name = app[0]
    if name in unique_list:
        duplicate_list.append(name)
    else:
        unique_list.append(name)
print("There are " + str(len(duplicate_list)) + " duplicate values.")

There are 1181 duplicate values.


The Android dataset has a lot of duplicate entries. Let's print one of them to see why.

In [17]:
print(android_list[0])
for app in android_list:
    name = app[0]
    if name == duplicate_list[0]:
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


As with iOS, we can see that the "Reviews" column has different values, meaning that the same app was recorded more than once. Let's keep the row with the most reviews, which is presumably the most recent, and delete the ones that have fewer reviews.

In [18]:
#Creating a dictionary with name: reviews
print("Before deleting duplicates: " + str(len(android_list[1:])))
max_reviews = {}
for app in android_list[1:]:
    name = app[0]
    reviews = app[3]
    # Compare as numbers: comparing the raw strings would rank '9' above '10000'
    if name not in max_reviews:
        max_reviews[name] = reviews
    elif float(max_reviews[name]) < float(reviews):
        max_reviews[name] = reviews
print("After deleting duplicates: " + str(len(max_reviews)))
print("The difference is: " + str(len(android_list[1:]) - len(max_reviews)))

Before deleting duplicates: 10840
After deleting duplicates: 9659
The difference is: 1181


Now we have the expected number of apps in the dictionary, so let's use the dictionary to build a clean Android list.

In [19]:
android_clean = []
already_added = []
for app in android_list[1:]:
    name = app[0]
    reviews = app[3]
    if name not in already_added and reviews == max_reviews[name]:
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))

9659
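
The two-step pattern used for both datasets (build a dictionary of maximum review counts, then filter the rows) can also be folded into a single pass. The sketch below is one possible way to do that. The helper name `keep_most_reviewed` and the toy rows are assumptions made for illustration. Unlike the cells above, it returns rows in order of each name's first appearance.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the one with the highest review count.

    `rows` must not include the header. Ties keep the first row seen.
    """
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows (hypothetical): [name, reviews]
toy = [
    ['Quick PDF', '80805'],
    ['Other App', '9'],
    ['Quick PDF', '80804'],
    ['Other App', '10000'],
]
print(keep_most_reviewed(toy, 0, 1))
# [['Quick PDF', '80805'], ['Other App', '10000']]
```

The review counts are converted with float() before comparing, since as strings '9' would sort above '10000'.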


Now we have clean data. Next we want to remove the apps that are not aimed at an English-speaking audience. To do this we check app names for characters outside the ASCII range, meaning characters whose code point (as returned by ord()) is greater than 127. Let's create a function for this.

In [20]:
def english_word(word):
    for character in word:
        if ord(character) > 127:
            return False
    return True

In [21]:
print(english_word("Instagram"))
print(english_word("ääööåå"))

True
False


While this works as expected, it also rejects English app names that contain symbols such as ™ or emojis. For example, the following app is flagged as non-English:

In [22]:
print(english_word("Docs To Go™ Free Office Suite"))

False


To work around this, the easiest solution is to remove only the apps whose names contain more than three characters outside the ASCII range. It is not a perfect method, but it is fairly effective.

In [23]:
def english_word(word):
    amount = 0
    for character in word:
        if ord(character) > 127:
            amount += 1
    if amount > 3:
        return False
    return True
        

In [24]:
print(english_word("Docs To Go™ Free Office Suite"))
print(english_word('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False
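
To make the limits of this heuristic concrete, here is a small sketch that applies the same rule to a few more names. It restates the function with the threshold exposed as a parameter (`max_non_ascii` is a name introduced here, not in the original), and the extra app names are made up for illustration.

```python
def english_word(word, max_non_ascii=3):
    """Same rule as above: allow at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for character in word if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(english_word('Docs To Go™ Free Office Suite'))  # True  (1 non-ASCII character)
print(english_word('Instachat 😜'))                    # True  (1 non-ASCII character)
print(english_word('Party 🎉🎉🎉🎉'))                  # False (4 emojis, although the name is English)
print(english_word('中文'))                            # True  (only 2 non-ASCII characters)
```

The last two lines show both kinds of mistake the rule can make: an English name with many symbols is dropped, and a very short non-English name slips through.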


Let's now clean both datasets by deleting non-English apps.

In [25]:
ios_english = []
android_english = []

for app in ios_clean:
    name = app[1]
    if english_word(name):
        ios_english.append(app)

for app in android_clean:
    name = app[0]
    if english_word(name):
        android_english.append(app)

print("The length of iOS dataset is now: " + str(len(ios_english)) + ".")
print("The length of Android dataset is now: " + str(len(android_english)) + ".")

The length of iOS dataset is now: 6181.
The length of Android dataset is now: 9614.


## Filtering free apps

In [26]:
# Price is at index 4 for iOS and index 7 for Android.
# Each list starts with the header row, so the lengths printed below include it.
ios_free = [ios_list[0]]
android_free = [android_list[0]]
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
print("The length of iOS dataset is now: " + str(len(ios_free)) + ".")
print("The length of Android dataset is now: " + str(len(android_free)) + ".")

The length of iOS dataset is now: 3221.
The length of Android dataset is now: 8863.
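
The filter above relies on exact string matches ('0.0' for iOS, '0' for Android). A slightly more defensive variant converts the price to a number first. The sketch below assumes that paid prices may carry a leading dollar sign such as '$4.99'; that format does not appear in the rows shown above, so treat it as an assumption, as is the helper name `is_free`.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$4.99' equals zero."""
    # Assumption: a currency symbol, if present, is a leading '$'
    return float(price.replace('$', '')) == 0.0


print(is_free('0.0'))    # True
print(is_free('0'))      # True
print(is_free('$4.99'))  # False
```

With a helper like this, the same condition can be used for both datasets.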


## Finding the correct App for our project

Our goal is to create an app for both platforms: Android and iOS. Because of this, we need to find app profiles that perform well in both datasets. To do this we will create frequency tables for a few columns in our datasets. The columns of interest are App (track_name), Genres (prime_genre), Installs (rating_count_tot) and Rating (user_rating), with the iOS column name given in parentheses.

In [27]:
#A function that makes a frequency table with 2 columns
def freq_table(dataset, index):
    dictionary_2_columns = {}
    total = 0
    
    for app in dataset:
        total += 1
        column = app[index]
        if column in dictionary_2_columns:
            dictionary_2_columns[column] += 1
        else:
            dictionary_2_columns[column] = 1
            
    table_percentages = {}
    for key in dictionary_2_columns:
        percentage = (dictionary_2_columns[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

In [28]:
#A function that turns our dictionary to sorted table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
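
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is only a sketch of an alternative, not a replacement for the functions above. The name `freq_table_counter` and the toy rows are assumptions.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count
    return {key: count / total * 100 for key, count in counts.most_common()}


# Toy rows (hypothetical): [name, genre]
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
for genre, percentage in freq_table_counter(toy, 1).items():
    print(genre, ':', percentage)
# Games : 75.0
# Tools : 25.0
```

Since dictionaries keep insertion order, the result is already sorted and needs no separate display step.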

Let's try our functions on the English-language datasets (ios_english and android_english, which still include paid apps) with prime_genre, Genres and Category.

In [29]:
display_table(android_english, 9) #Genres

Tools : 8.591637195756189
Entertainment : 5.793634283336801
Education : 5.231953401289786
Business : 4.358227584772207
Medical : 4.108591637195756
Personalization : 3.900561680882047
Productivity : 3.879758685250676
Lifestyle : 3.775743707093822
Finance : 3.588516746411483
Sports : 3.442895776991887
Communication : 3.2660703141252343
Action : 3.110047846889952
Health & Fitness : 2.995631370917412
Photography : 2.9124193883919283
News & Magazines : 2.600374453921365
Social : 2.485957977948825
Travel & Local : 2.26752652381943
Books & Reference : 2.26752652381943
Shopping : 2.090701060952777
Simulation : 1.9762845849802373
Arcade : 1.9138755980861244
Dating : 1.7786561264822136
Casual : 1.7058456417724153
Video Players & Editors : 1.674641148325359
Maps & Navigation : 1.3417932182234242
Puzzle : 1.2377782400665696
Food & Drink : 1.1649677553567712
Role Playing : 1.0817557728312877
Strategy : 0.9777407946744331
Racing : 0.9465363012273768
Libraries & Demo : 0.8737258165175785
Auto & Vehic

In [30]:
display_table(android_english, 1) #Category

FAMILY : 19.346785937174953
GAME : 9.787809444560017
TOOLS : 8.602038693571874
BUSINESS : 4.358227584772207
MEDICAL : 4.108591637195756
PERSONALIZATION : 3.900561680882047
PRODUCTIVITY : 3.879758685250676
LIFESTYLE : 3.786145204909507
FINANCE : 3.588516746411483
SPORTS : 3.3804867900977738
COMMUNICATION : 3.2660703141252343
HEALTH_AND_FITNESS : 2.995631370917412
PHOTOGRAPHY : 2.9124193883919283
NEWS_AND_MAGAZINES : 2.600374453921365
SOCIAL : 2.485957977948825
TRAVEL_AND_LOCAL : 2.2779280216351157
BOOKS_AND_REFERENCE : 2.26752652381943
SHOPPING : 2.090701060952777
DATING : 1.7786561264822136
VIDEO_PLAYERS : 1.6954441439567296
MAPS_AND_NAVIGATION : 1.3417932182234242
FOOD_AND_DRINK : 1.1649677553567712
EDUCATION : 1.1129602662783442
ENTERTAINMENT : 0.9049303099646349
LIBRARIES_AND_DEMO : 0.8737258165175785
AUTO_AND_VEHICLES : 0.8737258165175785
WEATHER : 0.8217183274391513
HOUSE_AND_HOME : 0.7593093405450385
EVENTS : 0.6656958602038693
PARENTING : 0.6240898689411275
ART_AND_DESIGN : 0.62

In [31]:
display_table(ios_english, 11) #prime_genre

Games : 54.84549425659279
Entertainment : 7.264196731920401
Education : 6.633230868791458
Photo & Video : 5.516906649409481
Utilities : 3.446044329396538
Productivity : 2.7180067950169877
Health & Fitness : 2.6694709593916843
Music : 2.216469826888853
Social Networking : 2.0385050962627407
Sports : 1.682575635010516
Lifestyle : 1.6016825756350106
Shopping : 1.375182009383595
Weather : 1.116324219381977
Travel : 0.970716712506067
News : 0.9221808768807637
Book : 0.8898236531305614
Reference : 0.8574664293803592
Business : 0.8574664293803592
Finance : 0.7927519818799547
Food & Drink : 0.7118589225044492
Navigation : 0.4530011325028313
Medical : 0.33975084937712347
Catalogs : 0.08089305937550557


For iOS, Games is the most common genre, with almost 55% of apps. Entertainment (7.3%) and Education (6.6%) follow. For Android Genres, Tools comes first (8.6%), followed by Entertainment (5.8%) and Education (5.2%). For Android Category, the top three are Family (19.3%), Game (9.8%) and Tools (8.6%). The general impression is that most apps are built for gaming, entertainment, education or productivity.

Looking at the data, we can see that the Android Genres column is a more detailed version of Category. For our purposes Category is enough, so we will continue using that column from now on.

## Finding the genres with most users

Even though games dominate the App Store, that doesn't mean every game has a lot of players. Because of this, we are going to compute the average number of users per app in each genre. For Google Play data we can do this directly with the 'Installs' column, but the App Store data has no install counts, so we use rating_count_tot as a proxy.

In [32]:
prime_genre = freq_table(ios_english,11)

for genre in prime_genre:
    total = 0
    len_genre = 0
    # ios_english has no header row, so [1:] leaves its first app out of the averages
    for row in ios_english[1:]:
        genre_app = row[11]
        if genre_app == genre:
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
            
    avg_n_users = total / len_genre
    print(str(genre) + " " + str(avg_n_users))

Social Networking 36938.472
Photo & Video 14688.715542521993
Games 15595.726548672566
Music 29047.109489051094
Reference 27037.188679245282
Health & Fitness 10802.157575757576
Weather 23145.246376811596
Utilities 7927.525821596244
Travel 19030.183333333334
Shopping 26635.011764705883
News 16980.315789473683
Navigation 19370.821428571428
Lifestyle 8930.373737373737
Entertainment 8862.409799554565
Food & Drink 19934.386363636364
Sports 15350.913461538461
Book 10359.2
Finance 23353.530612244896
Education 2472.278048780488
Productivity 8508.089285714286
Business 5149.320754716981
Catalogs 3465.0
Medical 648.952380952381


The highest average number of ratings per app is in 'Social Networking', but this category includes apps like Instagram, Facebook and Twitter, so it isn't really one we want to compete in. Music includes apps like Spotify, so that isn't a very viable category either. Some candidates could be apps in categories like Health & Fitness, Lifestyle, Book or Sports.

Now let's take a look at Google Play data. If we look at one row, we can see that installations aren't as straightforward as expected.

In [33]:
print(android_english[1])

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


The app above has installations of '5,000,000+'. We need to convert these values from strings to numbers. We also can't know exactly what 5,000,000+ means: it could be 5,000,001 or even 5,999,999. Because of this we are going to take each value at face value, so 100,000+ for example is treated as 100,000.
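
The conversion described above can be sketched on its own as a small helper. The name `parse_installs` is an assumption introduced here, and the sketch returns an integer where the notebook's cells use float().

```python
def parse_installs(installs):
    """Convert a string such as '5,000,000+' to the integer 5000000."""
    return int(installs.replace(',', '').replace('+', ''))


print(parse_installs('5,000,000+'))  # 5000000
print(parse_installs('100,000+'))    # 100000
```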

In [40]:
android_genres = freq_table(android_english, 1)

for genre in android_genres:
    total = 0
    len_category = 0
    # android_english has no header row, so [1:] leaves its first app out of the averages
    for row in android_english[1:]:
        category_app = row[1]
        if category_app == genre:
            n_installs = row[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(genre + " " + str(avg_n_installs))

ART_AND_DESIGN 1919103.3898305085
AUTO_AND_VEHICLES 632501.3214285715
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 7641777.871559633
BUSINESS 1663758.627684964
COMICS 817657.2727272727
COMMUNICATION 35153714.17515924
DATING 824129.2807017544
EDUCATION 1770579.4392523365
ENTERTAINMENT 11375402.298850575
EVENTS 249580.640625
FINANCE 1319851.4028985507
FOOD_AND_DRINK 1891060.2767857143
HEALTH_AND_FITNESS 3972300.388888889
HOUSE_AND_HOME 1331540.5616438356
LIBRARIES_AND_DEMO 630903.6904761905
LIFESTYLE 1369954.7774725275
GAME 14227278.868225291
FAMILY 3344163.6580645163
MEDICAL 96691.58734177215
SOCIAL 22961790.384937238
SHOPPING 6966908.880597015
PHOTOGRAPHY 16604098.410714285
SPORTS 3373767.6861538463
TRAVEL_AND_LOCAL 13218662.767123288
TOOLS 9676869.30471584
PERSONALIZATION 4086652.4853333333
PRODUCTIVITY 15530942.008042896
PARENTING 525351.8333333334
WEATHER 4570892.658227848
VIDEO_PLAYERS 24121489.079754602
NEWS_AND_MAGAZINES 9472807.04
MAPS_AND_NAVIGATION 3900634.7286821706


One genre that stands out is Books and Reference, with an average of over 7.6 million installs per app. Book was also one of the genres in our iOS recommendations. One app idea to build on could be something that combines books and productivity.