# Analyzing Mobile App Data

This Jupyter notebook is the result of a guided project from Dataquest. The objective is to find profitable mobile apps in two **datasets**:

<p style="display: flex; align-items:center;">
  <a href="https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps"><img src="https://img.shields.io/badge/Kaggle-20BEFF?style=for-the-badge&logo=Kaggle&logoColor=white" target="_blank"></a>&nbsp;&nbsp;<b>Apple Store data</b>&nbsp;(approximately seven thousand iOS apps)
</p>

<p style="display: flex; align-items:center;">
  <a href="https://www.kaggle.com/datasets/lava18/google-play-store-apps"><img src="https://img.shields.io/badge/Kaggle-20BEFF?style=for-the-badge&logo=Kaggle&logoColor=white" target="_blank"></a>&nbsp;&nbsp;<b>Google Play Store data</b>&nbsp;(approximately ten thousand Android apps)
</p>

To find profitable mobile apps, this project focuses on free apps that can generate revenue by displaying ads. The notebook also uses custom classes written to apply Object-Oriented Programming principles. The code for these classes is available [here](./FakePandas.py).
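For readers who do not have `FakePandas.py` at hand, the sketch below shows one way a small column-oriented dataset class could support the calls used later in this notebook (`columns`, `get_column`, `get_row`, `delete_row`, and `dataset[column]`). The class name `TinyDataset` and its internals are assumptions made for illustration; the real module may be organized differently.

```python
class TinyDataset:
  """Hypothetical stand-in for a dataset class; not the actual FakePandas code."""

  def __init__(self, columns, rows):
    self.columns = list(columns)
    # Store one list of values per column name.
    self._data = {name: [row[i] for row in rows] for i, name in enumerate(self.columns)}

  def __getitem__(self, column):
    return self._data[column]

  def __len__(self):
    return len(self._data[self.columns[0]]) if self.columns else 0

  def get_column(self, column):
    return self._data[column]

  def get_row(self, idx):
    return [self._data[name][idx] for name in self.columns]

  def delete_row(self, idx):
    # Remove the value at position idx from every column.
    for name in self.columns:
      del self._data[name][idx]


apps = TinyDataset(
  ["track_name", "price"],
  [["Facebook", "0.0"], ["Instagram", "0.0"], ["Temple Run", "0.0"]],
)
apps.delete_row(1)
print(apps.get_column("track_name"))  # ['Facebook', 'Temple Run']
print(apps.get_row(1))                # ['Temple Run', '0.0']
```

Deleting by index shifts every later row up by one, which is why the helper functions in this notebook delete collected indices in descending order.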


# Importing Libraries

In [213]:
import FakePandas as fpd

# Auxiliary Functions

In [214]:
def count_wrong_user_ratings(dataset, column):
  # Count the values that fall outside the valid 1-5 rating range.
  column_values = dataset.get_column(column)

  dataset_user_rating_errors = 0

  for value in column_values:
    if float(value) < 1 or float(value) > 5:
      dataset_user_rating_errors += 1

  return dataset_user_rating_errors

In [215]:
def delete_wrong_user_ratings(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []

  for idx, value in enumerate(column_values):
    if float(value) < 1 or float(value) > 5:
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

In [216]:
def delete_duplicates_by_app_name(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []
  unique_apps    = []

  for idx, value in enumerate(column_values):
    if value not in unique_apps:
      unique_apps.append(value)
    else:
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

In [217]:
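def is_english(string):
  # Count the characters outside the ASCII range (code points above 127).
  non_ascii = 0

  for character in string:
    if ord(character) > 127:
      non_ascii += 1

  # Up to 3 non-ASCII characters are tolerated (symbols such as ™, emoji).
  return non_ascii <= 3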

In [218]:
def delete_non_english_apps(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []

  for idx, value in enumerate(column_values):
    if not is_english(value):
      rows_to_delete.append(idx)

  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)

# Data Acquisition

In [219]:
data_applestore = fpd.read_csv("../data/AppleStore.csv")
data_playstore  = fpd.read_csv("../data/googleplaystore.csv")

Let's look at the first 5 apps in each dataset:

In [220]:
data_applestore.head()

id: ['284882215', '389801252', '529479190', '420009108', '284035177']
track_name: ['Facebook', 'Instagram', 'Clash of Clans', 'Temple Run', 'Pandora - Music & Radio']
size_bytes: ['389879808', '113954816', '116476928', '65921024', '130242560']
currency: ['USD', 'USD', 'USD', 'USD', 'USD']
price: ['0.0', '0.0', '0.0', '0.0', '0.0']
rating_count_tot: ['2974676', '2161558', '2130805', '1724546', '1126879']
rating_count_ver: ['212', '1289', '579', '3842', '3594']
user_rating: ['3.5', '4.5', '4.5', '4.5', '4.0']
user_rating_ver: ['3.5', '4.0', '4.5', '4.0', '4.5']
ver: ['95.0', '10.23', '9.24.12', '1.6.2', '8.4.1']
cont_rating: ['4+', '12+', '9+', '9+', '12+']
prime_genre: ['Social Networking', 'Photo & Video', 'Games', 'Games', 'Music']
sup_devices.num: ['37', '37', '38', '40', '37']
ipadSc_urls.num: ['1', '0', '5', '5', '4']
lang.num: ['29', '29', '18', '1', '1']
vpp_lic: ['1', '1', '1', '1', '1']


In [221]:
data_playstore.head()

App: ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book']
Category: ['ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN']
Rating: ['4.1', '3.9', '4.7', '4.5', '4.3']
Reviews: ['159', '967', '87510', '215644', '967']
Size: ['19M', '14M', '8.7M', '25M', '2.8M']
Installs: ['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+']
Type: ['Free', 'Free', 'Free', 'Free', 'Free']
Price: ['0', '0', '0', '0', '0']
Content Rating: ['Everyone', 'Everyone', 'Everyone', 'Teen', 'Everyone']
Genres: ['Art & Design', 'Art & Design;Pretend Play', 'Art & Design', 'Art & Design', 'Art & Design;Creativity']
Last Updated: ['January 7, 2018', 'January 15, 2018', 'August 1, 2018', 'June 8, 2018', 'June 20, 2018']
Current Ver: ['1.0.0', '2.0.0', '1.2.4', 'Varies with device', '1.1']
Android Ver: ['4.0.3 and up', '4.0.3 and up', 

# Deleting Data

## Wrong Value

Because of how the reader was built, rows with missing values are already removed at load time whenever the number of items in a row does not equal the number of columns.
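The row-length check described above could look roughly like the following sketch. It uses the standard `csv` module; the function name `read_consistent_rows` and the sample data (including the deliberately broken row) are assumptions for illustration and are not taken from `FakePandas.py`.

```python
import csv
import io


def read_consistent_rows(handle):
  """Return (header, rows, skipped), keeping only rows as wide as the header."""
  reader = csv.reader(handle)
  header = next(reader)
  rows, skipped = [], 0

  for row in reader:
    if len(row) == len(header):
      rows.append(row)
    else:
      # A missing (or extra) field changes the row length, so the row is dropped.
      skipped += 1

  return header, rows, skipped


# Illustrative sample: the second data row is missing one field.
sample = io.StringIO(
  "App,Category,Rating\n"
  "Sketch - Draw & Paint,ART_AND_DESIGN,4.5\n"
  "Broken Row,4.1\n"
)
header, rows, skipped = read_consistent_rows(sample)
print(len(rows), skipped)  # 1 1
```

A check like this only catches rows where a field is absent altogether. A field that is present but empty keeps the row length intact and would not be filtered out.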

In the **Apple Store** dataset, 5 columns have _rating_ in their name:

In [222]:
[idx for idx, column in enumerate(data_applestore.columns) if "rating" in column]

[5, 6, 7, 8, 10]

They are:

- `rating_count_tot` : User Rating counts (for all versions)
- `rating_count_ver` : User Rating counts (for current version)
- `user_rating` : Average User Rating value (for all versions)
- `user_rating_ver` : Average User Rating value (for current version)
- `cont_rating` : Content Rating

`user_rating` and `user_rating_ver` should be float values between 1 and 5. Let's check whether every row satisfies this:

In [223]:
print("=== Apple Store ===")
print(f"Total of Errors on column <user_rating> : {count_wrong_user_ratings(data_applestore, 'user_rating')}")
print(f"Total of Errors on column <user_rating_ver> : {count_wrong_user_ratings(data_applestore, 'user_rating_ver')}")

=== Apple Store ===
Total of Errors on column <user_rating> : 929
Total of Errors on column <user_rating_ver> : 1443


So let's delete the rows with out-of-range values from the **Apple Store** dataset.

In [224]:
delete_wrong_user_ratings(data_applestore, "user_rating")
delete_wrong_user_ratings(data_applestore, "user_rating_ver")

In [225]:
print("=== Apple Store ===")
print(f"Total of Errors on column <user_rating> : {count_wrong_user_ratings(data_applestore, 'user_rating')}")
print(f"Total of Errors on column <user_rating_ver> : {count_wrong_user_ratings(data_applestore, 'user_rating_ver')}")

=== Apple Store ===
Total of Errors on column <user_rating> : 0
Total of Errors on column <user_rating_ver> : 0


In both stores, apps are rated from 1 to 5 stars.

In the **Play Store** dataset there is a single rating column, `Rating`:

In [226]:
print("=== Play Store ===")
print(f"Total of Errors on column <Rating> : {count_wrong_user_ratings(data_playstore, 'Rating')}")

=== Play Store ===
Total of Errors on column <Rating> : 0
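A count of zero does not by itself prove that every value is a usable rating. In Python, every comparison with NaN is false, so a value such as the string `'NaN'` would pass the `< 1 or > 5` test unnoticed. The sketch below illustrates this, assuming such values could appear in the column; `is_valid_rating` and `flagged_by_range_test` are hypothetical helpers, not part of the notebook.

```python
def flagged_by_range_test(value):
  """The range test used in count_wrong_user_ratings, reproduced for comparison."""
  return float(value) < 1 or float(value) > 5


def is_valid_rating(value, low=1.0, high=5.0):
  """True only for numbers inside [low, high]."""
  try:
    number = float(value)
  except ValueError:
    return False
  # A positive range check: NaN fails every comparison, so it is rejected here,
  # whereas the negated form (value < 1 or value > 5) lets it pass.
  return low <= number <= high


print(flagged_by_range_test("NaN"))  # False: NaN is not counted as an error
print(is_valid_rating("NaN"))        # False: NaN is not accepted as valid
print(is_valid_rating("4.1"))        # True
```

Counting the rows for which `is_valid_rating` returns `False` would therefore be a stricter check than the one used above.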


## Duplicates

Let's check whether either dataset has duplicates by app name:

In [227]:
freq_applestore = fpd.generate_frequency_dict(data_applestore, "track_name", sorted_dict=True)

freq_applestore

{'Mannequin Challenge': 2,
 'VR Roller Coaster': 2,
 'Facebook': 1,
 'Instagram': 1,
 'Clash of Clans': 1,
 'Temple Run': 1,
 'Pandora - Music & Radio': 1,
 'Pinterest': 1,
 'Bible': 1,
 'Candy Crush Saga': 1,
 'Spotify Music': 1,
 'Angry Birds': 1,
 'Subway Surfers': 1,
 'Fruit Ninja Classic': 1,
 'Solitaire': 1,
 'CSR Racing': 1,
 'Crossy Road - Endless Arcade Hopper': 1,
 'Injustice: Gods Among Us': 1,
 'Hay Day': 1,
 'Clear Vision (17+)': 1,
 'Minecraft: Pocket Edition': 1,
 'PAC-MAN': 1,
 'Calorie Counter & Diet Tracker by MyFitnessPal': 1,
 'DragonVale': 1,
 'The Weather Channel: Forecast, Radar & Alerts': 1,
 'Head Soccer': 1,
 'Google – Search made just for mobile': 1,
 'Despicable Me: Minion Rush': 1,
 'The Sims™ FreePlay': 1,
 'Google Earth': 1,
 'Plants vs. Zombies': 1,
 'Sonic Dash': 1,
 'Groupon - Deals, Coupons & Discount Shopping App': 1,
 '8 Ball Pool™': 1,
 'Tiny Tower - Free City Building': 1,
 'Jetpack Joyride': 1,
 'Bike Race - Top Motorcycle Racing Games': 1,
 'Sha

In [228]:
freq_playstore = fpd.generate_frequency_dict(data_playstore, "App", sorted_dict=True)

freq_playstore

{'ROBLOX': 9,
 'CBS Sports App - Scores, News, Stats & Watch Live': 8,
 'Duolingo: Learn Languages Free': 7,
 'Candy Crush Saga': 7,
 '8 Ball Pool': 7,
 'ESPN': 7,
 'Nick': 6,
 'Subway Surfers': 6,
 'Bubble Shooter': 6,
 'slither.io': 6,
 'Temple Run 2': 6,
 'Helix Jump': 6,
 'Zombie Catchers': 6,
 'Sniper 3D Gun Shooter: Free Shooting Games - FPS': 6,
 'Bowmasters': 6,
 'Bleacher Report: sports news, scores, & highlights': 6,
 'Viber Messenger': 5,
 'Netflix': 5,
 'Calorie Counter - MyFitnessPal': 5,
 'Plants vs. Zombies FREE': 5,
 'Granny': 5,
 'Angry Birds Classic': 5,
 'Flow Free': 5,
 'Zombie Tsunami': 5,
 'Farm Heroes Saga': 5,
 'MeetMe: Chat & Meet New People': 5,
 'Wish - Shopping Made Fun': 5,
 'eBay: Buy & Sell this Summer - Discover Deals Now!': 5,
 'BeautyPlus - Easy Photo Editor & Selfie Camera': 5,
 'MLB At Bat': 5,
 'theScore: Live Sports Scores, News, Stats & Videos': 5,
 'Yahoo Fantasy Sports - #1 Rated Fantasy App': 5,
 'Skyscanner': 5,
 'TripAdvisor Hotels Flights Re

As shown, both datasets contain duplicate entries. I will delete them, keeping only the first occurrence of each app name.
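The keep-the-first-occurrence rule can be sketched on plain lists as follows. This version tracks names in a `set`, so each membership test takes roughly constant time instead of scanning a list. The function name `drop_duplicate_names` and the placeholder values in the second column are assumptions for illustration.

```python
def drop_duplicate_names(rows, name_index=0):
  """Keep the first row for each app name, preserving the original order."""
  seen = set()
  unique_rows = []

  for row in rows:
    name = row[name_index]
    if name not in seen:
      seen.add(name)
      unique_rows.append(row)

  return unique_rows


# Placeholder rows: the second column only marks which copy survived.
rows = [
  ["ROBLOX", "first"],
  ["Netflix", "first"],
  ["ROBLOX", "second"],
]
print(drop_duplicate_names(rows))  # [['ROBLOX', 'first'], ['Netflix', 'first']]
```

Keeping the first occurrence is a simple choice rather than the only one. If the duplicates differ in other columns, another criterion (for example, the row with the most reviews) might retain fresher data.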

In [229]:
delete_duplicates_by_app_name(data_applestore, "track_name")

freq_applestore = fpd.generate_frequency_dict(data_applestore, "track_name", sorted_dict=True)

freq_applestore

{'Facebook': 1,
 'Instagram': 1,
 'Clash of Clans': 1,
 'Temple Run': 1,
 'Pandora - Music & Radio': 1,
 'Pinterest': 1,
 'Bible': 1,
 'Candy Crush Saga': 1,
 'Spotify Music': 1,
 'Angry Birds': 1,
 'Subway Surfers': 1,
 'Fruit Ninja Classic': 1,
 'Solitaire': 1,
 'CSR Racing': 1,
 'Crossy Road - Endless Arcade Hopper': 1,
 'Injustice: Gods Among Us': 1,
 'Hay Day': 1,
 'Clear Vision (17+)': 1,
 'Minecraft: Pocket Edition': 1,
 'PAC-MAN': 1,
 'Calorie Counter & Diet Tracker by MyFitnessPal': 1,
 'DragonVale': 1,
 'The Weather Channel: Forecast, Radar & Alerts': 1,
 'Head Soccer': 1,
 'Google – Search made just for mobile': 1,
 'Despicable Me: Minion Rush': 1,
 'The Sims™ FreePlay': 1,
 'Google Earth': 1,
 'Plants vs. Zombies': 1,
 'Sonic Dash': 1,
 'Groupon - Deals, Coupons & Discount Shopping App': 1,
 '8 Ball Pool™': 1,
 'Tiny Tower - Free City Building': 1,
 'Jetpack Joyride': 1,
 'Bike Race - Top Motorcycle Racing Games': 1,
 'Shazam - Discover music, artists, videos & lyrics': 1,


In [230]:
delete_duplicates_by_app_name(data_playstore, "App")

freq_playstore = fpd.generate_frequency_dict(data_playstore, "App", sorted_dict=True)

freq_playstore

{'Photo Editor & Candy Camera & Grid & ScrapBook': 1,
 'Coloring book moana': 1,
 'U Launcher Lite – FREE Live Cool Themes, Hide Apps': 1,
 'Sketch - Draw & Paint': 1,
 'Pixel Draw - Number Art Coloring Book': 1,
 'Paper flowers instructions': 1,
 'Smoke Effect Photo Maker - Smoke Editor': 1,
 'Infinite Painter': 1,
 'Garden Coloring Book': 1,
 'Kids Paint Free - Drawing Fun': 1,
 'Text on Photo - Fonteee': 1,
 'Name Art Photo Editor - Focus n Filters': 1,
 'Tattoo Name On My Photo Editor': 1,
 'Mandala Coloring Book': 1,
 '3D Color Pixel by Number - Sandbox Art Coloring': 1,
 'Learn To Draw Kawaii Characters': 1,
 'Photo Designer - Write your name with shapes': 1,
 '350 Diy Room Decor Ideas': 1,
 'FlipaClip - Cartoon animation': 1,
 'ibis Paint X': 1,
 'Logo Maker - Small Business': 1,
 "Boys Photo Editor - Six Pack & Men's Suit": 1,
 'Superheroes Wallpapers | 4K Backgrounds': 1,
 'Mcqueen Coloring pages': 1,
 'HD Mickey Minnie Wallpapers': 1,
 'Harley Quinn wallpapers HD': 1,
 'Color

## Non-English Apps

Both datasets contain apps that are not directed toward an English-speaking audience.

In [231]:
for idx, app_name in enumerate(data_applestore["track_name"]):
  if not is_english(app_name):
    print(data_applestore.get_row(idx))

    break

['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1']


In [232]:
for idx, app_name in enumerate(data_playstore["App"]):
  if not is_english(app_name):
    print(data_playstore.get_row(idx))

    break

['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']
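The heuristic behind `is_english` can be checked on a few names taken from the outputs above. The sketch restates the function in compact form with the threshold exposed as a parameter; the parameter name `max_non_ascii` is an assumption added for illustration.

```python
def is_english(string, max_non_ascii=3):
  """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
  non_ascii = sum(1 for character in string if ord(character) > 127)
  return non_ascii <= max_non_ascii


for name in ["Instagram", "The Sims™ FreePlay", "聚力视频HD-人民的名义,跨界歌王全网热播"]:
  print(name, is_english(name))
```

The first two names are accepted (the `™` symbol is a single non-ASCII character, well under the threshold), while the third is rejected. The tolerance of 3 is a trade-off: a short non-English name with three or fewer non-ASCII characters would still pass.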


Let's delete these apps.

In [233]:
delete_non_english_apps(data_applestore, "track_name")
delete_non_english_apps(data_playstore, "App")

# Isolating Free Apps

In [234]:
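def deleting_paid_apps(dataset, column):
  column_values  = dataset.get_column(column)
  rows_to_delete = []

  for idx, value in enumerate(column_values):
    # Assumption: some prices may carry a "$" prefix (e.g. "$4.99"),
    # which float() cannot parse, so strip it before converting.
    if float(value.lstrip("$")) != 0.0:
      rows_to_delete.append(idx)

  # Delete from the end so the remaining indices stay valid.
  for idx in sorted(rows_to_delete, reverse=True):
    dataset.delete_row(idx)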

# Most Common Apps by Genre

In [235]:
def generate_genre_frequency(dataset, column, sorted_dict=True):
    # Count how many apps fall under each value of the given column.
    freq = {}

    for value in dataset[column]:
        if value not in freq:
            freq[value] = 1
        else:
            freq[value] += 1

    # Optionally order the table from most to least common.
    if sorted_dict:
        return dict(sorted(freq.items(), key=lambda x: x[1], reverse=True))

    return freq

# Conclusions