---
---
---

# ✨ **CHALLENGE**: Introductory Data Analysis with App Store Data ✨

In this challenge, you'll put your descriptive data analysis skills to the test by exploring and evaluating an unfamiliar dataset before answering a series of predetermined and self-directed descriptive questions.

By the end of this notebook, you should feel much more comfortable with your analysis and exploration skill set, even when limited to just leveraging `numpy` and `pandas`.

---
---

## 💠 **PART ONE**: Importations and Initializations 💠

Whenever we work in data science, it helps to structure our development process like a sensible engineering pipeline.

Generally, it can be useful to identify all major dependencies and libraries up front so we don't have to worry about outstanding external dependencies later on.

As such, we'll start by importing relevant data structures, objects, and libraries needed to perform some exploratory data analyses - namely `numpy` and `pandas`.

In [1]:
import numpy as numpy    # Numerical Python Operations
import pandas as pandas      # DataFrame Operations


---

### 📌 **REQUIRED CHALLENGE!** 📌

> Take a moment to reflect on the full scale of libraries, packages, and dependencies available to you as a data analyst.
>
> **For this challenge, create a text cell below and write some thoughts on what other libraries and tools you'd like to use to augment your data analysis skills, given what you already know.**
>
> Feel free to be creative with this question; you do not have to reference only the specific libraries that we've covered in class.

---

We're now ready to get access to our dataset: the **App Store** data!

In [2]:
# Set the dataset path (relative to this notebook)
# Note: reuploaded into this repo for ease of viewing locally
PATH_DATASET = "AppleStore.csv"

# Save dataset as Pandas DataFrame
dataset = pandas.read_csv(PATH_DATASET)

In [3]:
dataset.head(1)

Unnamed: 0.1,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1


Before we jump too deeply into an analysis, it's important to apply some creative thinking to help us better understand our data.

---

### 📌 **REQUIRED CHALLENGE!** 📌

> If you recall, data dictionaries can help us enormously in understanding how our data is shaped, what features/variables it comprises, and what sorts of questions we can ask.
>
> Many times, we can access data dictionaries as metadata, external objects, or other files associated with a data download.
>
> However, for this exercise, let's see if you can intuit what each column most likely represents by creating your own data dictionary.
>
> **Your challenge is to fill in the following cell's data dictionary structure with interpretations for each of the dataset's features.**
>
> Apply logic and reasoning, and feel free to consult any external resources to stay confident about what each column most likely means across the data.
>
> (**NOTE**: If you become stumped and want to check your answers, feel free to **[access the data source](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)** to ***verify*** your data dictionary assignments and ensure you have the appropriate information before moving forward.)

---

#### 🔎 **App Store Data Dictionary** 🔍

- `Unnamed: 0`: row index carried over from the source file
- `id`: app ID
- `track_name`: app name
- `size_bytes`: app size in bytes
- `currency`: currency of the listed price
- `price`: price in the listed currency
- `rating_count_tot`: number of ratings across all versions
- `rating_count_ver`: number of ratings for the current version of the app
- `user_rating`: average user rating across all versions
- `user_rating_ver`: average user rating for the current version of the app
- `ver`: latest app version
- `cont_rating`: content rating
- `prime_genre`: primary genre
- `sup_devices.num`: number of devices supported by the app
- `ipadSc_urls.num`: number of screenshots shown for display
- `lang.num`: number of languages supported by the app
- `vpp_lic`: whether VPP (Volume Purchase Program) device-based licensing is enabled
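
One way to keep a hand-written data dictionary honest is to store it as a plain Python mapping and compare its keys against the DataFrame's columns. The sketch below is only an illustration: the `sample` frame and the abbreviated `data_dictionary` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Tiny stand-in for the App Store frame (invented values, real column names).
sample = pd.DataFrame({
    "id": [1, 2],
    "track_name": ["App A", "App B"],
    "price": [0.0, 1.99],
    "vpp_lic": [1, 1],
})

# Abbreviated, hypothetical data dictionary: column name -> interpretation.
data_dictionary = {
    "id": "app ID",
    "track_name": "app name",
    "price": "price in the listed currency",
}

# Columns present in the data but missing from the dictionary.
undocumented = [col for col in sample.columns if col not in data_dictionary]
print(undocumented)
```

Running the same comparison against the real `dataset.columns` would flag any feature the dictionary has not described yet.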

Now that we have our toolkit and dataset, it's time to start exploring our data... the best way you know how. After all, you're the one exploring it!

---
---

## 💠 **PART TWO**: Basic Data Exploration 💠

---

### 📌 **REQUIRED CHALLENGE!** 📌

> Data exploration is an incredibly important skill to hone and maintain, no matter the data!
>
> How else will you start developing the ability to ask interesting and relevant descriptive questions?
>
> **For this challenge, you will use as few or as many cells as you'd like to perform some basic data exploration and cleaning, ensuring that your data's integrity is as sound as possible.**
>
> This could mean imputing null values, categorizing data, combining features, dropping noisy columns, etc.
>
> In other words, now's the time for creative decision-making - do whatever you think is best to ensure your data is as sanitary as it could be while ideating on some interesting questions to ask!

---
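
The cleaning ideas listed above can start with a few quick integrity checks. The following is a minimal sketch on an invented `toy` frame (the column names mirror the dataset, the values do not), and median imputation is just one possible choice rather than a recommendation for this data.

```python
import pandas as pd

# Toy frame with invented values: one missing price and one duplicated row.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [0.0, 1.99, 1.99, None],
    "prime_genre": ["Games", "Music", "Music", "Games"],
})

null_counts = toy.isna().sum()            # missing values per column
duplicate_rows = toy.duplicated().sum()   # count of fully duplicated rows

# Drop exact duplicates, then fill the missing price with the median price.
cleaned = toy.drop_duplicates().copy()
cleaned["price"] = cleaned["price"].fillna(cleaned["price"].median())

print(null_counts["price"], duplicate_rows, len(cleaned))
```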

In [4]:
# Drop the unneeded 'Unnamed: 0' column; errors='ignore' keeps the cell safe to re-run
dataset.drop(columns='Unnamed: 0', inplace=True, errors='ignore')

dataset.head(2)

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1


In [5]:
# STUDENTS: write some code here! Feel free to create more cells too!

# Replace curly double quotes and en/em dashes with ASCII equivalents
def replace_smart_quotes(text):
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2014', '-').replace('\u2013', '-')
    return text

# Assign the result back, since .apply() returns a new Series
dataset['track_name'] = dataset['track_name'].apply(replace_smart_quotes)
dataset['track_name']


0                                         PAC-MAN Premium
1                               Evernote - stay organized
2         WeatherBug - Local Weather, Radar, Maps, Alerts
3       eBay: Best App to Buy, Sell, Save! Online Shop...
4                                                   Bible
                              ...                        
7192                                                Kubik
7193                                    VR Roller-Coaster
7194                Bret Michaels Emojis + Lyric Keyboard
7195            VR Roller Coaster World - Virtual Reality
7196                         Escape the Sweet Shop Series
Name: track_name, Length: 7197, dtype: object
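
As an alternative sketch, the same character substitution can be expressed with `Series.str.translate` and a `str.maketrans` table, which avoids calling a Python function per row. The `names` Series below is invented for illustration.

```python
import pandas as pd

# Invented sample names containing curly quotes and a long dash.
names = pd.Series(["\u201cQuoted\u201d App", "Notes \u2014 Pro", "Plain"])

# Map each typographic character to its ASCII counterpart.
table = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2014": "-", "\u2013": "-"})

cleaned_names = names.str.translate(table)
print(cleaned_names.tolist())
```

Either approach returns a new Series, so the result has to be assigned back to the column for the change to persist.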

---
---

## 💠 **PART THREE**: Descriptive Analysis 💠

---

### 📌 **REQUIRED CHALLENGE!** 📌

> Descriptive analyses are powered by one major idea: asking critical questions about our data's relationships, patterns, and distributions.
>
> However, before we get there, let's at least get some basic fundamentals down for assessing our descriptive analysis capabilities.
>
> **Your challenge is to answer the following five descriptive analysis questions as best as you can and as programmatically as you can, relying on `numpy` and `pandas` to get the job done.**
>
> For starters, we'll navigate through some predetermined descriptive questions - ensure you utilize the full range of your `numpy` and `pandas` analytical skills to answer them to the best of your ability!
>
> Additionally, major predetermined questions have some helper comments provided to assist you in streamlining your development process.

---

### 🔸 **Q1**: `What are the top ten highest rated free to play apps?`

(**NOTE**: _Use ratings of current versions._)

In [6]:
"""
What are the top ten highest rated (use rating of current version) free-to-play apps?

> STEP 1: Get all free-to-play apps.
> STEP 2: Sort by current version rating.
> STEP 3: Get top ten apps.
"""

free_apps = dataset[(dataset["price"] == 0.00)]
free_apps = free_apps.sort_values("user_rating_ver", ascending=False)

# get top ten
free_apps.head(10)

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
5903,1102828488,Delicious - Emily’s Message in a Bottle,336847872,USD,0.0,333,42,4.0,5.0,1.5,4+,Games,38,5,11,1
4124,1024000488,浪漫庄园(自由创造你的梦想),97181696,USD,0.0,36,17,4.5,5.0,1.3.9,12+,Games,40,5,1,1
5896,1102520605,CNN Politics,30449664,USD,0.0,254,1,3.0,5.0,2.8.2,4+,News,37,0,1,1
1234,510461758,Bike Race - Top Motorcycle Racing Games,120187904,USD,0.0,405007,4053,4.5,5.0,7.3.11,4+,Games,38,5,9,1
1236,510855668,Amazon Music,77778944,USD,0.0,106235,4605,4.5,5.0,6.5.0,4+,Music,37,5,6,1
4230,1032240256,【真・お絵かきパズル】〇〇投げてみた結果ｗｗ　完全無料！,69857280,USD,0.0,12,1,4.5,5.0,1.1.0,4+,Games,40,0,1,1
5791,1097469600,Beyond 14,99192832,USD,0.0,426,29,4.5,5.0,1.1.2,4+,Games,38,5,6,1
5778,1096825045,Coin Dozer: Casino,151451648,USD,0.0,2024,340,5.0,5.0,1.6,12+,Games,37,5,1,1
1252,515094775,Bazaart Photo Editor Pro and Picture Collage M...,120474624,USD,0.0,4909,479,4.5,5.0,4.6.3,4+,Photo & Video,37,2,13,1
3866,1001501844,Deliveroo: Restaurant Delivery - Order Food Ne...,62734336,USD,0.0,1702,580,4.5,5.0,2.16.1,12+,Food & Drink,37,0,9,1
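
All ten rows above share a 5.0 current-version rating, so sorting on the rating alone leaves the order among ties arbitrary. One possible refinement, which is an assumption and not part of the original question, is to break ties by the number of ratings for the current version. A sketch on invented data:

```python
import pandas as pd

# Invented apps; column names follow the dataset.
apps = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "price": [0.0, 0.0, 2.99, 0.0],
    "user_rating_ver": [5.0, 5.0, 5.0, 4.5],
    "rating_count_ver": [3, 900, 50, 10_000],
})

free = apps[apps["price"] == 0]

# Sort by rating first, then by rating count, both descending.
ranked = free.sort_values(
    ["user_rating_ver", "rating_count_ver"], ascending=[False, False]
)
print(ranked["track_name"].tolist())
```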


### 🔸 **Q2**: `What genre has the most apps?`

In [7]:
"""
What genre has the most apps?

> STEP 1: Use `.value_counts()` to combine app counts by genre.
"""

dataset["prime_genre"].value_counts().head(1)

prime_genre
Games    3862
Name: count, dtype: int64
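
If only the genre label or its share of the catalog is needed, `idxmax()` on the counts is a possible shortcut. The `genres` Series here is a made-up stand-in for `dataset["prime_genre"]`.

```python
import pandas as pd

# Invented genre labels standing in for the real column.
genres = pd.Series(["Games", "Games", "Music", "Games", "Book"], name="prime_genre")

counts = genres.value_counts()
top_genre = counts.idxmax()               # label with the largest count
top_share = counts.max() / counts.sum()   # fraction of all apps in that genre
print(top_genre, round(top_share, 2))
```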

### 🔸 **Q3**: `For paid apps that are sold in dollars, what is the median rating for each genre?`

In [8]:
"""
For paid apps that are sold in dollars, what is the median rating for each genre?

> STEP 1: Get all paid apps.
> STEP 2: Get a subset of paid apps sold in dollars (USD).
> STEP 3: Calculate median rating per genre.
"""

paid_apps = dataset[dataset["price"] != 0.00]
paid_apps_usd = paid_apps.query("currency == 'USD'")

paid_apps_usd.groupby('prime_genre')['user_rating'].median().reset_index()

Unnamed: 0,prime_genre,user_rating
0,Book,4.5
1,Business,4.0
2,Catalogs,4.5
3,Education,4.0
4,Entertainment,4.0
5,Finance,4.25
6,Food & Drink,4.0
7,Games,4.5
8,Health & Fitness,4.5
9,Lifestyle,4.0


### 🔸 **Q4**: `What is the average size in megabytes for each genre for apps rated 4.0 or higher?`

(**NOTE**: _Use ratings of current version._)

In [9]:
four_rated_apps = dataset.query('user_rating_ver >= 4.0')
mean_size_per_genre_of_four_rated_apps = four_rated_apps.groupby('prime_genre')['size_bytes'].mean().reset_index()

round(mean_size_per_genre_of_four_rated_apps['size_bytes'] / 1_000_000, 2)

0     235.47
1      62.80
2      75.63
3     191.34
4     126.20
5      94.73
6      81.94
7     293.07
8     101.13
9      68.78
10    543.63
11    121.64
12     73.05
13     66.27
14     71.62
15     90.29
16    112.73
17    104.06
18     94.62
19     90.73
20     68.40
21     57.80
22     73.82
Name: size_bytes, dtype: float64
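
The result above is indexed 0 to 22 because `reset_index()` moves the genre names into a separate column that the final expression does not select. A minimal sketch that keeps the genre as the index, using invented data and the same 1 MB = 1,000,000 bytes conversion as the cell above:

```python
import pandas as pd

# Invented apps; column names follow the dataset.
apps = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Music", "Music"],
    "size_bytes": [100_000_000, 300_000_000, 50_000_000, 70_000_000],
    "user_rating_ver": [4.5, 4.0, 3.0, 5.0],
})

mean_size_mb = (
    apps[apps["user_rating_ver"] >= 4.0]   # keep apps rated 4.0 or higher
    .groupby("prime_genre")["size_bytes"]
    .mean()
    .div(1_000_000)                         # bytes -> megabytes
    .round(2)
    .rename("mean_size_mb")
)
print(mean_size_mb)
```

Because `prime_genre` stays as the index, each mean size is printed next to its genre label.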

### 🔸 **Q5**: `What is the average price (in USD) by genre for paid apps rated higher than 3.0?`

(**NOTE**: _Assume that all non-USD prices are dropped from consideration._)

In [10]:
"""
What is the average price (in USD) by genre for paid apps rated higher than 3.0?
(NOTE: Assume that all non-USD prices are not considered.)

> STEP 1: Get all apps rated higher than 3.0.
> STEP 2: Get subset of data by paid apps.
> STEP 3: Get subset of data by USD currencies only.
> STEP 4: Calculate mean app price by genre.
"""

# paid_apps_usd (from Q3) already holds paid apps priced in USD
three_rated_apps = paid_apps_usd.query("user_rating_ver > 3.0")
three_rated_apps.groupby('prime_genre')['price'].mean().reset_index()
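
The steps listed in the docstring for this question can also be written as a single boolean mask followed by a grouped mean of `price`. This is a sketch on invented data, not the notebook's actual result.

```python
import pandas as pd

# Invented apps; column names follow the dataset.
apps = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Games", "Music", "Music"],
    "currency": ["USD", "USD", "USD", "USD", "EUR"],
    "price": [0.99, 4.99, 0.0, 2.99, 9.99],
    "user_rating_ver": [4.5, 3.5, 5.0, 2.0, 4.0],
})

# Combine the three filters (rating, paid, USD) into one mask.
mask = (
    (apps["user_rating_ver"] > 3.0)
    & (apps["price"] > 0)
    & (apps["currency"] == "USD")
)

mean_price = apps[mask].groupby("prime_genre")["price"].mean()
print(mean_price)
```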


Great work! However, answering some predetermined questions does not make you a professional data analyst.

At the end of the day, your own creativity and intrigue are what guide your analytical and technical hand.

---

### 📌 **REQUIRED CHALLENGE!** 📌

> It's worthwhile putting you in the batter's box and allowing you to ask some complex descriptive questions... that you'll also take the opportunity to answer.
>
> **Your challenge is to create _two_ (2) additional descriptive questions to ask and answer them using the cells provided below.**
>
> We want to be wary of straying too far into inferential-statistics or predictive-analytics questions, as those may be easy to ask but very challenging to answer effectively.
>
> As such, be direct and concise with your questions, but also don't be afraid of being ambitious; this is the way to becoming a better, more confident data scientist, after all!

---

### 🔸 **Q6**: Of the top 25 highly rated apps, what percentage of those apps are paid?

In [11]:
# STUDENTS: Write some code to answer your first proposed descriptive question!

# Of the top 25 highly rated apps, what percentage of those apps are paid?

top_25_apps = dataset.sort_values(by='user_rating', ascending=False).head(25)

paid_apps_count = top_25_apps[top_25_apps['price'] > 0]['id'].count()
total_apps_count = top_25_apps['id'].count()
percentage_paid_apps = (paid_apps_count / total_apps_count) * 100

print("{:.1f}% of the top 25 highly rated apps are paid.".format(percentage_paid_apps))

56.0% of the top 25 highly rated apps are paid.
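
A percentage like this can also be computed as the mean of a boolean Series, since `True` counts as 1 and `False` as 0. The `top_apps` frame below is invented for illustration.

```python
import pandas as pd

# Invented prices standing in for a top-N slice of the dataset.
top_apps = pd.DataFrame({"price": [0.0, 1.99, 0.0, 4.99, 0.99]})

# The mean of a boolean Series is the fraction of True values.
share_paid = (top_apps["price"] > 0).mean() * 100
print(f"{share_paid:.1f}% of these apps are paid.")
```

Note that when many apps share the top rating, which 25 rows `head(25)` returns depends on how ties are ordered, so the percentage may shift with a different tie-break.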


### 🔸 **Q7**: Of all paid apps sold in USD, what is the most common price point?

In [12]:
paid_apps_usd["price"].mode()[0]

0.99
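
`Series.mode()` returns every value tied for most frequent, so indexing with `[0]` silently picks the smallest one. A small sketch on invented prices shows how `value_counts()` can confirm whether a tie exists:

```python
import pandas as pd

# Invented prices with a deliberate tie between 0.99 and 2.99.
prices = pd.Series([0.99, 0.99, 2.99, 2.99, 4.99])

modes = prices.mode()            # all most-frequent values, sorted ascending
counts = prices.value_counts()   # how often each price occurs
print(modes.tolist(), counts.max())
```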

---
---

## 💠 **PART FOUR**: Conclusions 💠

With seven descriptive questions asked and answered and a data exploration under your belt, your basic skills have been put to the test!

Hopefully you feel proud of your ability to combine creative thinking with technical design and explore a dataset previously unknown to you.

There'll be a lot more opportunity to do so moving forward!

---
---
---