# Analyzing Mobile App Data

This Jupyter notebook is the result of a Dataquest guided project. The objective is to find profitable mobile apps in two **datasets**:

<p style="display: flex; align-items:center;">
  <a href="https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps"><img src="https://img.shields.io/badge/Kaggle-20BEFF?style=for-the-badge&logo=Kaggle&logoColor=white" target="_blank"></a>&nbsp;&nbsp;<b>Apple Store data</b>&nbsp;(approximately seven thousand iOS apps)
</p>

<p style="display: flex; align-items:center;">
  <a href="https://www.kaggle.com/datasets/lava18/google-play-store-apps"><img src="https://img.shields.io/badge/Kaggle-20BEFF?style=for-the-badge&logo=Kaggle&logoColor=white" target="_blank"></a>&nbsp;&nbsp;<b>Google Play Store data</b>&nbsp;(approximately ten thousand Android apps)
</p>

To find profitable mobile apps, this project focuses on free apps that can generate revenue by displaying ads. Additionally, this notebook uses custom classes written to apply Object-Oriented Programming principles. The code for these classes is available [here](./FakePandas.py).
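For readers without the file at hand, the kind of class the notebook relies on can be pictured with a minimal sketch. This is a hypothetical stand-in, not the actual `FakePandas` code: the method names mirror calls made later in this notebook (`get_column`, `get_row`, `delete_row`, `shape`, `columns`), but their bodies are assumptions.

```python
# Hypothetical stand-in for illustration only; NOT the real FakePandas module.
class DataSet:
    def __init__(self, columns, data):
        self.columns = list(columns)             # column names, in order
        self.data = [list(row) for row in data]  # one list of strings per row

    @property
    def shape(self):
        """(number of rows, number of columns)."""
        return (len(self.data), len(self.columns))

    def get_column(self, column):
        """Return every value of one column as a list."""
        idx = self.columns.index(column)
        return [row[idx] for row in self.data]

    def __getitem__(self, column):
        # Allows dataset["track_name"] as a shortcut for get_column.
        return self.get_column(column)

    def get_row(self, idx):
        return self.data[idx]

    def delete_row(self, idx):
        del self.data[idx]


apps = DataSet(
    columns=["track_name", "price"],
    data=[["Facebook", "0.0"], ["Instagram", "0.0"]],
)
print(apps.shape)  # (2, 2)
```

Keeping rows as plain lists makes `delete_row` a one-liner, which is why the helper functions below delete indexes in descending order: removing a row shifts every later index down by one.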


# Importing Libs

In [745]:
import FakePandas as fpd

# Auxiliary Functions

In [746]:
def count_wrong_user_ratings(dataset, column):
  # Count the values in `column` that fall outside the valid 1-5 rating range.
  column_values = dataset.get_column(column)

  dataset_user_rating_errors = 0

  for value in column_values:
    if float(value) < 1 or float(value) > 5:
      dataset_user_rating_errors += 1

  return dataset_user_rating_errors

In [747]:
def delete_wrong_user_ratings(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []

  for idx, value in enumerate(column_values):
    if float(value) < 1 or float(value) > 5:
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

In [748]:
def filter_dict(freq_dict, num_items):
  filtered_dict_keys = list(freq_dict.keys())[:num_items]
  filtered_dict = {}

  for key in filtered_dict_keys:
    filtered_dict[key] = freq_dict[key]

  return filtered_dict

In [749]:
def print_items_in_dict(freq_dict, num_items=None):
  filtered_dict_keys = list(freq_dict.keys())

  if num_items is not None:
    filtered_dict_keys = filtered_dict_keys[:num_items]

  for key in filtered_dict_keys:
    print(f"{key} : {freq_dict[key]}")

In [750]:
def delete_duplicates_by_app_name(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []
  unique_apps    = []

  for idx, value in enumerate(column_values):
    if value not in unique_apps:
      unique_apps.append(value)
    else:
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

In [751]:
def is_english(string):
  # Treat a name as English when it has at most 3 characters outside the ASCII range.
  non_ascii = 0

  for character in string:
    if ord(character) > 127:
      non_ascii += 1

  return non_ascii <= 3

In [752]:
def delete_non_english_apps(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []

  for idx, value in enumerate(column_values):
    if not is_english(value):
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

In [753]:
def deleting_paid_apps(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []

  for idx, value in enumerate(column_values):
    if float(value.replace('$', '')) != 0.0:
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

In [754]:
def get_best_freq(genres_freq, percentage, size):
  current_percentage = 0
  best_freq          = {}

  for key, value in genres_freq.items():
    best_freq[key] = value / size
    current_percentage += best_freq[key]
    best_freq[key] = str(round(best_freq[key] * 100, 2)) + "%"
    
    if current_percentage >= percentage:
      break
  
  return current_percentage, best_freq

# Data Acquisition

In [755]:
data_applestore = fpd.read_csv("../data/AppleStore.csv")
data_playstore  = fpd.read_csv("../data/googleplaystore.csv")

In [756]:
data_applestore.shape

(7197, 16)

In [757]:
data_playstore.shape

(10840, 13)

Let's look at the first 5 apps in each dataset.

In [758]:
print("=== Apple Store ===\n")
data_applestore.head()

=== Apple Store ===

id: ['284882215', '389801252', '529479190', '420009108', '284035177']
track_name: ['Facebook', 'Instagram', 'Clash of Clans', 'Temple Run', 'Pandora - Music & Radio']
size_bytes: ['389879808', '113954816', '116476928', '65921024', '130242560']
currency: ['USD', 'USD', 'USD', 'USD', 'USD']
price: ['0.0', '0.0', '0.0', '0.0', '0.0']
rating_count_tot: ['2974676', '2161558', '2130805', '1724546', '1126879']
rating_count_ver: ['212', '1289', '579', '3842', '3594']
user_rating: ['3.5', '4.5', '4.5', '4.5', '4.0']
user_rating_ver: ['3.5', '4.0', '4.5', '4.0', '4.5']
ver: ['95.0', '10.23', '9.24.12', '1.6.2', '8.4.1']
cont_rating: ['4+', '12+', '9+', '9+', '12+']
prime_genre: ['Social Networking', 'Photo & Video', 'Games', 'Games', 'Music']
sup_devices.num: ['37', '37', '38', '40', '37']
ipadSc_urls.num: ['1', '0', '5', '5', '4']
lang.num: ['29', '29', '18', '1', '1']
vpp_lic: ['1', '1', '1', '1', '1']


In [759]:
print("=== Play Store ===\n")
data_playstore.head()

=== Play Store ===

App: ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book']
Category: ['ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN']
Rating: ['4.1', '3.9', '4.7', '4.5', '4.3']
Reviews: ['159', '967', '87510', '215644', '967']
Size: ['19M', '14M', '8.7M', '25M', '2.8M']
Installs: ['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+']
Type: ['Free', 'Free', 'Free', 'Free', 'Free']
Price: ['0', '0', '0', '0', '0']
Content Rating: ['Everyone', 'Everyone', 'Everyone', 'Teen', 'Everyone']
Genres: ['Art & Design', 'Art & Design;Pretend Play', 'Art & Design', 'Art & Design', 'Art & Design;Creativity']
Last Updated: ['January 7, 2018', 'January 15, 2018', 'August 1, 2018', 'June 8, 2018', 'June 20, 2018']
Current Ver: ['1.0.0', '2.0.0', '1.2.4', 'Varies with device', '1.1']
Android Ver: ['4.0.3 and u

# Deleting Data

## Wrong Value

Because of the way the reading algorithm was built, rows with missing values are already removed whenever the number of items in a row does not equal the number of columns.
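The row-length rule described above can be sketched with the standard `csv` module. This is an illustration under assumptions, not the actual `fpd.read_csv` implementation; the function name `read_rows` and the sample data are made up.

```python
import csv
import io


def read_rows(file_obj):
    """Return (header, rows), skipping rows whose length differs from the header."""
    reader = csv.reader(file_obj)
    header = next(reader)
    rows = [row for row in reader if len(row) == len(header)]
    return header, rows


# Made-up in-memory CSV; the second data row is missing one field.
sample = io.StringIO(
    "App,Category,Rating\n"
    "App A,TOOLS,4.5\n"
    "App B,1.9\n"
    "App C,GAME,4.1\n"
)
header, rows = read_rows(sample)
print(len(rows))  # 2
```

Note that this rule only catches rows with a missing separator. A row with an empty field between two commas still has the right number of items and is kept.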

The **Apple Store** dataset has 5 columns with _rating_ in their name.

In [760]:
[idx for idx, column in enumerate(data_applestore.columns) if "rating" in column]

[5, 6, 7, 8, 10]

They are:

- `rating_count_tot` : User Rating counts (for all versions)
- `rating_count_ver` : User Rating counts (for current version)
- `user_rating` : Average User Rating value (for all versions)
- `user_rating_ver` : Average User Rating value (for current version)
- `cont_rating` : Content Rating

`user_rating` and `user_rating_ver` are float values that should lie between 1 and 5. Let's check whether every value falls in that range.
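A plain count of errors does not say which values are out of range. As a complementary sketch, the out-of-range values can be tallied by distinct value, which helps tell a placeholder (such as a rating of 0) from corrupt entries. The helper name `out_of_range_values` and the sample list are assumptions for illustration; the real columns would come from `dataset.get_column(...)`.

```python
from collections import Counter


def out_of_range_values(values, low=1.0, high=5.0):
    """Count each distinct value that falls outside the closed range [low, high]."""
    return Counter(v for v in values if not low <= float(v) <= high)


# Made-up sample in the same string format as the dataset columns.
sample_ratings = ["3.5", "0.0", "4.5", "0.0", "5.0"]
print(out_of_range_values(sample_ratings))  # Counter({'0.0': 2})
```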

In [761]:
print("=== Apple Store ===")
print(f"Total of Errors on column <user_rating> : {count_wrong_user_ratings(data_applestore, 'user_rating')}")
print(f"Total of Errors on column <user_rating_ver> : {count_wrong_user_ratings(data_applestore, 'user_rating_ver')}")

=== Apple Store ===
Total of Errors on column <user_rating> : 929
Total of Errors on column <user_rating_ver> : 1443


So let's delete the rows with wrong values from the **Apple Store** dataset.

In [762]:
delete_wrong_user_ratings(data_applestore, "user_rating")
delete_wrong_user_ratings(data_applestore, "user_rating_ver")

In [763]:
print("=== Apple Store ===")
print(f"Total of Errors on column <user_rating> : {count_wrong_user_ratings(data_applestore, 'user_rating')}")
print(f"Total of Errors on column <user_rating_ver> : {count_wrong_user_ratings(data_applestore, 'user_rating_ver')}")

=== Apple Store ===
Total of Errors on column <user_rating> : 0
Total of Errors on column <user_rating_ver> : 0


In both stores, apps are rated from 1 to 5 stars.

The **Play Store** dataset has a single rating column, `Rating`.

In [764]:
print("=== Play Store ===")
print(f"Total of Errors on column <Rating> : {count_wrong_user_ratings(data_playstore, 'Rating')}")

=== Play Store ===
Total of Errors on column <Rating> : 0


## Duplicates

Let's check whether there are duplicates by app name.

In [765]:
freq_applestore = fpd.generate_frequency_dict(data_applestore, "track_name", sorted_dict=True)
freq_applestore_filtered_dict = filter_dict(freq_applestore, 10)

print_items_in_dict(freq_applestore_filtered_dict, 10)

Mannequin Challenge : 2
VR Roller Coaster : 2
Facebook : 1
Instagram : 1
Clash of Clans : 1
Temple Run : 1
Pandora - Music & Radio : 1
Pinterest : 1
Bible : 1
Candy Crush Saga : 1


In [766]:
freq_playstore = fpd.generate_frequency_dict(data_playstore, "App", sorted_dict=True)
freq_playstore_filtered_dict = filter_dict(freq_playstore, 10)

print_items_in_dict(freq_playstore_filtered_dict, 10)

ROBLOX : 9
CBS Sports App - Scores, News, Stats & Watch Live : 8
Duolingo: Learn Languages Free : 7
Candy Crush Saga : 7
8 Ball Pool : 7
ESPN : 7
Nick : 6
Subway Surfers : 6
Bubble Shooter : 6
slither.io : 6


As seen above, both datasets contain duplicated app names, so I will delete the duplicates.
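The deletion rule keeps the first row seen for each app name and removes every later repeat. A minimal sketch of that rule on a plain list of names follows. It uses a `set` for membership tests, whereas `delete_duplicates_by_app_name` uses a list; the result is the same, but a set avoids rescanning all names on every lookup. The function name `duplicate_indexes` and the sample list are made up for illustration.

```python
def duplicate_indexes(names):
    """Return the indexes of every repeat after the first occurrence of a name."""
    seen = set()
    to_delete = []

    for idx, name in enumerate(names):
        if name in seen:
            to_delete.append(idx)
        else:
            seen.add(name)

    return to_delete


# Made-up sample list of app names.
names = ["ROBLOX", "ESPN", "ROBLOX", "Nick", "ESPN", "ROBLOX"]
print(duplicate_indexes(names))  # [2, 4, 5]
```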

In [767]:
delete_duplicates_by_app_name(data_applestore, "track_name")

freq_applestore = fpd.generate_frequency_dict(data_applestore, "track_name", sorted_dict=True)
freq_applestore_filtered_dict = filter_dict(freq_applestore, 10)

print_items_in_dict(freq_applestore_filtered_dict, 10)

Facebook : 1
Instagram : 1
Clash of Clans : 1
Temple Run : 1
Pandora - Music & Radio : 1
Pinterest : 1
Bible : 1
Candy Crush Saga : 1
Spotify Music : 1
Angry Birds : 1


In [768]:
delete_duplicates_by_app_name(data_playstore, "App")

freq_playstore = fpd.generate_frequency_dict(data_playstore, "App", sorted_dict=True)
freq_playstore_filtered_dict = filter_dict(freq_playstore, 10)

print_items_in_dict(freq_playstore_filtered_dict, 10)

Photo Editor & Candy Camera & Grid & ScrapBook : 1
Coloring book moana : 1
U Launcher Lite – FREE Live Cool Themes, Hide Apps : 1
Sketch - Draw & Paint : 1
Pixel Draw - Number Art Coloring Book : 1
Paper flowers instructions : 1
Smoke Effect Photo Maker - Smoke Editor : 1
Infinite Painter : 1
Garden Coloring Book : 1
Kids Paint Free - Drawing Fun : 1


## Non-English Apps

Both datasets contain apps that are not directed toward an English-speaking audience.
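The `is_english` helper is a heuristic: it tolerates up to three characters outside the ASCII range so that names containing a trademark sign or an emoji are not discarded. The sketch below restates the same rule in a self-contained form; the `max_non_ascii` parameter is an addition for illustration, and the second sample name is made up.

```python
def is_english(string, max_non_ascii=3):
    """Allow up to `max_non_ascii` characters with a code point above 127."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_english("Instagram"))                      # True
print(is_english("Example Office™ Suite"))          # True, only one non-ASCII character
print(is_english("聚力视频HD-人民的名义,跨界歌王全网热播"))  # False
```

The threshold is a trade-off: a short non-English name with three or fewer non-ASCII characters still passes, so some non-English apps may remain after filtering.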

In [769]:
for idx, app_name in enumerate(data_applestore["track_name"]):
  if not is_english(app_name):
    print(data_applestore.get_row(idx))

    break

['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1']


In [770]:
for idx, app_name in enumerate(data_playstore["App"]):
  if not is_english(app_name):
    print(data_playstore.get_row(idx))

    break

['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']


Let's delete these apps.

In [771]:
delete_non_english_apps(data_applestore, "track_name")
delete_non_english_apps(data_playstore, "App")

# Isolating Free Apps

As stated in the introduction, the analysis focuses on free apps, so paid apps are removed from both datasets.
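The two stores format prices differently (`'0.0'` in the Apple Store sample, `'0'` in the Play Store sample, and possibly a leading `$` on paid apps), so the test for "free" has to parse the string first. A small sketch of that check, with the helper name `is_free` and the sample values chosen for illustration:

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$4.99' parses to zero."""
    return float(price.replace("$", "")) == 0.0


print([is_free(p) for p in ["0.0", "0", "$4.99", "2.99"]])
# [True, True, False, False]
```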

In [772]:
deleting_paid_apps(data_applestore, "price")
deleting_paid_apps(data_playstore, "Price")

Now, for the analysis, we have:

In [773]:
print(f"{data_applestore.shape[0]} Apple Store apps")
print(f"{data_playstore.shape[0]} Play Store apps")

2863 Apple Store apps
8862 Play Store apps


# Most Common Apps by Genre/Category

Another way to focus efforts on a specific group of apps is to filter the datasets by genre/category.
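The idea used in the cells that follow is to sort genres by frequency and keep the most common ones until they cover a target share of the dataset. A self-contained sketch of that idea with `collections.Counter` is shown here; it is not the code of `fpd.generate_frequency_dict` or `get_best_freq`, and the function name `top_genres` and the sample data are assumptions.

```python
from collections import Counter


def top_genres(genres, target_share):
    """Return (covered share, [(genre, share), ...]) for the most frequent
    genres, stopping as soon as their cumulative share reaches target_share."""
    counts = Counter(genres)
    total = sum(counts.values())
    selected = []
    covered_count = 0

    for genre, count in counts.most_common():
        selected.append((genre, count / total))
        covered_count += count  # accumulate integers to avoid float drift
        if covered_count / total >= target_share:
            break

    return covered_count / total, selected


# Made-up sample: 6 Games, 3 Entertainment, 1 Education.
sample = ["Games"] * 6 + ["Entertainment"] * 3 + ["Education"]
covered, selected = top_genres(sample, target_share=0.8)
print([genre for genre, _ in selected])  # ['Games', 'Entertainment']
```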

In [774]:
PERCENTAGE = 0.8

In [775]:
data_applestore_genres = fpd.generate_frequency_dict(data_applestore, "prime_genre", sorted_dict=True)

In [776]:
print_items_in_dict(data_applestore_genres)

Games : 1735
Entertainment : 228
Photo & Video : 131
Education : 95
Social Networking : 85
Shopping : 71
Utilities : 67
Music : 64
Health & Fitness : 55
Sports : 55
Productivity : 50
Lifestyle : 44
News : 31
Travel : 28
Finance : 28
Food & Drink : 23
Weather : 22
Reference : 16
Business : 13
Book : 10
Navigation : 4
Catalogs : 4
Medical : 4


In [777]:
best_genres_percentage, data_applestore_best_genres = get_best_freq(data_applestore_genres, percentage=PERCENTAGE, size=data_applestore.shape[0])

print(f"These apps genres represent {round(best_genres_percentage, 2) * 100}% of Apple Store Dataset:\n")
print_items_in_dict(data_applestore_best_genres)


These apps genres represent 82.0% of Apple Store Dataset:

Games : 60.6%
Entertainment : 7.96%
Photo & Video : 4.58%
Education : 3.32%
Social Networking : 2.97%
Shopping : 2.48%


In [778]:
data_playstore_genres     = fpd.generate_frequency_dict(data_playstore, "Genres", sorted_dict=True)
data_playstore_categories = fpd.generate_frequency_dict(data_playstore, "Category", sorted_dict=True)

In [779]:
print_items_in_dict(data_playstore_genres, 20)

Tools : 747
Entertainment : 538
Education : 474
Business : 407
Lifestyle : 345
Productivity : 345
Finance : 328
Medical : 312
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181


In [780]:
best_genres_percentage, data_playstore_best_genres = get_best_freq(data_playstore_genres, percentage=PERCENTAGE, size=data_playstore.shape[0])

print(f"These apps genres represent {round(best_genres_percentage, 2) * 100}% of Play Store Dataset:\n")
print_items_in_dict(data_playstore_best_genres)

These apps genres represent 80.0% of Play Store Dataset:

Tools : 8.43%
Entertainment : 6.07%
Education : 5.35%
Business : 4.59%
Lifestyle : 3.89%
Productivity : 3.89%
Finance : 3.7%
Medical : 3.52%
Sports : 3.46%
Personalization : 3.32%
Communication : 3.24%
Action : 3.1%
Health & Fitness : 3.08%
Photography : 2.95%
News & Magazines : 2.8%
Social : 2.66%
Travel & Local : 2.32%
Shopping : 2.25%
Books & Reference : 2.14%
Simulation : 2.04%
Dating : 1.86%
Arcade : 1.85%
Video Players & Editors : 1.77%
Casual : 1.76%


In [781]:
print_items_in_dict(data_playstore_categories, 20)

FAMILY : 1635
GAME : 875
TOOLS : 748
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 312
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 158


In [782]:
best_categories_percentage, data_playstore_best_categories = get_best_freq(data_playstore_categories, percentage=PERCENTAGE, size=data_playstore.shape[0])

print(f"These apps categories represent {round(best_categories_percentage, 2) * 100}% of Play Store Dataset:\n")
print_items_in_dict(data_playstore_best_categories)

These apps categories represent 80.0% of Play Store Dataset:

FAMILY : 18.45%
GAME : 9.87%
TOOLS : 8.44%
BUSINESS : 4.59%
LIFESTYLE : 3.9%
PRODUCTIVITY : 3.89%
FINANCE : 3.7%
MEDICAL : 3.52%
SPORTS : 3.4%
PERSONALIZATION : 3.32%
COMMUNICATION : 3.24%
HEALTH_AND_FITNESS : 3.08%
PHOTOGRAPHY : 2.95%
NEWS_AND_MAGAZINES : 2.8%
SOCIAL : 2.66%
TRAVEL_AND_LOCAL : 2.34%


The Play Store dataset also has an `Installs` column (number of user downloads/installs for the app). If we want to focus on a specific range of downloads (generally the largest possible), we can filter by that column and discover which genres to target for displaying ads.
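Because `Installs` holds labels such as `'1,000,000+'` rather than numbers, selecting a range (instead of one exact label) requires converting each label to its numeric lower bound first. A minimal sketch under that assumption, with hypothetical helper names and made-up rows:

```python
def parse_installs(value):
    """Turn a label such as '1,000,000+' into the integer lower bound 1000000."""
    return int(value.replace(",", "").replace("+", ""))


def rows_with_min_installs(rows, installs_idx, minimum):
    """Keep the rows whose install lower bound is at least `minimum`."""
    return [row for row in rows if parse_installs(row[installs_idx]) >= minimum]


# Made-up rows in the form [app name, installs label].
rows = [
    ["App A", "10,000+"],
    ["App B", "500,000,000+"],
    ["App C", "1,000,000,000+"],
    ["App D", "0"],
]
popular = rows_with_min_installs(rows, installs_idx=1, minimum=100_000_000)
print([row[0] for row in popular])  # ['App B', 'App C']
```

The labels are open-ended buckets, so the parsed number is only a lower bound on the real install count.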

In [783]:
list(fpd.generate_frequency_dict(data_playstore, "Installs").keys())

['10,000+',
 '500,000+',
 '5,000,000+',
 '50,000,000+',
 '100,000+',
 '50,000+',
 '1,000,000+',
 '10,000,000+',
 '5,000+',
 '100,000,000+',
 '1,000,000,000+',
 '1,000+',
 '500,000,000+',
 '500+',
 '100+',
 '50+',
 '10+',
 '1+',
 '5+',
 '0+',
 '0']

In [784]:
INSTALLS = '1,000,000,000+'

In [785]:
data = []

for idx, value in enumerate(data_playstore["Installs"]):
  if value == INSTALLS:
    data.append(data_playstore.get_row(idx))

filtered_dataset = fpd.DataSet(columns=data_playstore.columns, data=data)
      

In [786]:
filtered_dataset_genres = fpd.generate_frequency_dict(filtered_dataset, "Genres", sorted_dict=True)

print_items_in_dict(filtered_dataset_genres)

Communication : 6
Social : 3
Travel & Local : 2
Video Players & Editors : 2
Books & Reference : 1
Entertainment : 1
Arcade : 1
Photography : 1
Tools : 1
Productivity : 1
News & Magazines : 1


# Conclusions

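This notebook showed that simple operations are enough to extract insights from the two datasets.

After cleaning, Games accounts for 60.6% of the remaining Apple Store apps, while the Play Store apps are spread far more evenly across genres, led by Tools at 8.43%. Among Play Store apps with `1,000,000,000+` installs, Communication is the most frequent genre (6 apps). This information meets the initial need of identifying groups of free apps to focus on.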