
---
---
---

# ✨ **CHALLENGE**: Introductory Data Analysis with App Store Data ✨

In this challenge, you'll put your descriptive data analysis skills to the test by exploring and evaluating an unfamiliar dataset before answering a range of predetermined and student-proposed descriptive questions.

By the end of this notebook, you should feel much more comfortable with your analysis and exploration skill set, even when limited to just leveraging `numpy` and `pandas`.

---
---

## 💠 **PART ONE**: Importations and Initializations 💠

Every time we work on a data science problem, we should design our development process to mirror a sensible engineering pipeline.

Generally, it can be useful to identify all major dependencies and libraries up front so we don't have to worry about outstanding external dependencies later on.

As such, we'll start by importing relevant data structures, objects, and libraries needed to perform some exploratory data analyses - namely `numpy` and `pandas`.

In [1]:
import numpy as np    # Numerical Python Operations
import pandas as pd   # DataFrame Operations

---

### 📌 **REQUIRED CHALLENGE!** 📌

> Take a moment to reflect on the full scale of libraries, packages, and dependencies available to you as a data analyst.
>
> **For this challenge, create a text cell below and write some thoughts on what other libraries and tools you'd like to use to augment your data analysis skills, given what you already know.**
>
> Feel free to be creative with this question; you do not have to reference only the specific libraries that we've covered in class.

---

We're now ready to get access to our dataset: the **App Store** data!

In [18]:
# Set the dataset path (relative to the notebook's working directory)
PATH_DATASET = "AppleStore.csv"

# Save dataset as Pandas DataFrame
dataset = pd.read_csv(PATH_DATASET)

In [4]:
dataset.head(1)

| | Unnamed: 0 | id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 281656475 | PAC-MAN Premium | 100788224 | USD | 3.99 | 21292 | 26 | 4.0 | 4.5 | 6.3.5 | 4+ | Games | 38 | 5 | 10 | 1 |


Before we jump too deeply into an analysis, it's important to apply some creative thinking to help us better understand our data.

---

### 📌 **REQUIRED CHALLENGE!** 📌

> If you recall, data dictionaries can help us enormously in understanding how our data is shaped, what features/variables it comprises, and what sorts of questions we can ask.
>
> Many times, we can access data dictionaries as metadata, external objects, or other files associated with a data download.
>
> However, for this exercise, let's see if you can intuit what each column most likely represents by creating your own data dictionary.
>
> **Your challenge is to fill in the following cell's data dictionary structure with interpretations for each of the dataset's features.**
>
> Apply logic and reasoning, and feel free to consult any external resources to stay confident about what each column most likely means across the data.
>
> (**NOTE**: If you become stumped and want to check your answers, feel free to **[access the data source](https://www.kaggle.com/datasets/gauthamp10/apple-appstore-apps)** to verify your data dictionary assignments and ensure you have the appropriate information before moving forward.)

---

#### 🔎 **App Store Data Dictionary** 🔍

- `Unnamed: 0`: ???
- `id`: ???
- `track_name`: ???
- `size_bytes`: ???
- `currency`: ???
- `price`: ???
- `rating_count_tot`: ???
- `rating_count_ver`: ???
- `user_rating`: ???
- `user_rating_ver`: ???
- `ver`: ???
- `cont_rating`: ???
- `prime_genre`: ???
- `sup_devices.num`: ???
- `ipadSc_urls.num`: ???
- `lang.num`: ???
- `vpp_lic`: ???
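When drafting a data dictionary, it can help to look at each column's type, number of distinct values, and one example value side by side. The following is a minimal sketch of that idea; the `sample` frame and its second row are hypothetical stand-ins, and in the notebook you would apply the same steps to `dataset`.

```python
import pandas as pd

# Hypothetical two-row sample that mimics a few of the real columns.
sample = pd.DataFrame({
    "track_name": ["PAC-MAN Premium", "Example Notes"],
    "size_bytes": [100788224, 5242880],
    "currency": ["USD", "USD"],
    "price": [3.99, 0.0],
    "cont_rating": ["4+", "12+"],
})

# One row per column: its dtype, number of distinct values, and an example value.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "example": sample.iloc[0],
})
print(summary)
```

A column with a single distinct value (such as `currency` in this sample) is usually a constant, while a column whose distinct count equals the row count is likely an identifier.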

Now that we have our toolkit and dataset, it's time to start exploring our data... the best way you know how. After all, you're the one exploring it!

---
---

## 💠 **PART TWO**: Basic Data Exploration 💠

---

### 📌 **REQUIRED CHALLENGE!** 📌

> Data exploration is an incredibly important skill to hone and maintain, no matter the data!
>
> How else will you start developing the ability to ask interesting and relevant descriptive questions?
>
> **For this challenge, you will use as few or as many cells as you'd like to perform some basic data exploration and cleaning, so that your data is as clean and consistent as possible.**
>
> This could mean imputing null values, categorizing data, combining features, dropping noisy columns, etc.
>
> In other words, now's the time for creative decision-making - do whatever you think is best to ensure your data is as sanitary as it could be while ideating on some interesting questions to ask!

---
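As one possible starting point, the cleaning steps mentioned above (dropping noisy columns, removing duplicates, imputing nulls, deriving features) can be chained together. This is only a sketch on hypothetical rows: the `raw` frame is invented for illustration, and treating a missing price as free is an assumption you would need to justify for real data.

```python
import pandas as pd

# Hypothetical rows standing in for the real file; "Unnamed: 0" mimics a saved row counter.
raw = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 3],
    "id": [101, 102, 103, 103],
    "track_name": ["A", "B", "C", "C"],
    "size_bytes": [1_000_000, 2_500_000, 40_000_000, 40_000_000],
    "price": [0.0, 1.99, None, None],
})

cleaned = (
    raw.drop(columns=["Unnamed: 0"])    # leftover row counter, carries no app information
       .drop_duplicates(subset="id")    # keep one row per app id
       .assign(
           price=lambda d: d["price"].fillna(0.0),         # assumption: missing price means free
           size_mb=lambda d: d["size_bytes"] / 1_000_000,  # derived feature in decimal megabytes
       )
       .reset_index(drop=True)
)
print(cleaned)
```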

In [6]:
dataset.describe()

| | Unnamed: 0 | id | size_bytes | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 | 7197.0 |
| mean | 4759.069612 | 863131000.0 | 199134500.0 | 1.726218 | 12892.91 | 460.373906 | 3.526956 | 3.253578 | 37.361817 | 3.7071 | 5.434903 | 0.993053 |
| std | 3093.625213 | 271236800.0 | 359206900.0 | 5.833006 | 75739.41 | 3920.455183 | 1.517948 | 1.809363 | 3.737715 | 1.986005 | 7.919593 | 0.083066 |
| min | 1.0 | 281656500.0 | 589824.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2090.0 | 600093700.0 | 46922750.0 | 0.0 | 28.0 | 1.0 | 3.5 | 2.5 | 37.0 | 3.0 | 1.0 | 1.0 |
| 50% | 4380.0 | 978148200.0 | 97153020.0 | 0.0 | 300.0 | 23.0 | 4.0 | 4.0 | 37.0 | 5.0 | 1.0 | 1.0 |
| 75% | 7223.0 | 1082310000.0 | 181924900.0 | 1.99 | 2793.0 | 140.0 | 4.5 | 4.5 | 38.0 | 5.0 | 8.0 | 1.0 |
| max | 11097.0 | 1188376000.0 | 4025970000.0 | 299.99 | 2974676.0 | 177050.0 | 5.0 | 5.0 | 47.0 | 5.0 | 75.0 | 1.0 |


In [7]:
dataset.isna().sum()

Unnamed: 0          0
id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
user_rating         0
user_rating_ver     0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
dtype: int64

In [9]:
dataset['id'].unique()

array([ 281656475,  281796108,  281940292, ..., 1187779532, 1187838770,
       1188375727])
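Printing `unique()` shows the distinct values but does not say whether any `id` repeats. A short sketch of a more direct check follows; the `ids` Series is hypothetical (three ids taken from the output above, with one repeated on purpose), and in the notebook you would use `dataset['id']` instead.

```python
import pandas as pd

# Hypothetical ids with one deliberate repeat.
ids = pd.Series([281656475, 281796108, 281940292, 281796108], name="id")

print(ids.is_unique)                             # False, because one id appears twice
print(ids.nunique(), len(ids))                   # distinct count vs. row count
print(ids[ids.duplicated(keep=False)].tolist())  # every occurrence of a repeated id
```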

Now that you've successfully cleaned and processed your data to an introductory extent, we can get started with some descriptive analyses!

---
---

## 💠 **PART THREE**: Descriptive Analyses 💠

---

### 📌 **REQUIRED CHALLENGE!** 📌

> Descriptive analyses are powered by one major idea: asking critical questions about our data's relationships, patterns, and distributions.
>
> However, before we get there, let's at least get some basic fundamentals down for assessing our descriptive analysis capabilities.
>
> **Your challenge is to answer the following five descriptive analysis questions as best as you can and as programmatically as you can, relying on `numpy` and `pandas` to get the job done.**
>
> For starters, we'll navigate through some predetermined descriptive questions - ensure you utilize the full range of your `numpy` and `pandas` analytical skills to answer them to the best of your ability!
>
> Additionally, major predetermined questions have some helper comments provided to assist you in streamlining your development process.

---

### 🔸 **Q1**: `What are the top ten highest rated free to play apps?`

(**NOTE**: _Use ratings of current versions._)

In [39]:
"""
What are the top ten highest rated (use rating of current version) free-to-play apps?

> STEP 1: Get all free-to-play apps.
> STEP 2: Sort by current version rating.
> STEP 3: Get top ten apps.

"""
free_apps = dataset[dataset['price'] == 0.0]
# Sort by the rating of the current version, as the question requires
top_10_free_apps = free_apps.sort_values(by='user_rating_ver', ascending=False).head(10)
app_names = top_10_free_apps['track_name'].tolist()
print(app_names)
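Many apps can share the same top rating, so which ten appear first depends on how ties are ordered. One way to make the ranking less arbitrary, sketched here on hypothetical rows, is to break ties by the number of ratings for the current version. Using `rating_count_ver` as the tie-breaker is a design choice for illustration, not something the question requires.

```python
import pandas as pd

# Hypothetical apps; three of them tie at a 5.0 current-version rating.
apps = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "price": [0.0, 0.0, 2.99, 0.0],
    "user_rating_ver": [5.0, 5.0, 5.0, 4.5],
    "rating_count_ver": [12, 340, 900, 50],
})

free = apps[apps["price"] == 0]

# Sort by rating first, then by rating count so that ties favor better-supported ratings.
top = free.sort_values(
    by=["user_rating_ver", "rating_count_ver"],
    ascending=[False, False],
).head(2)
print(top["track_name"].tolist())  # ['B', 'A']
```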


### 🔸 **Q2**: `What genre has the most apps?`

In [43]:
"""
What genre has the most apps?

> STEP 1: Use `.value_counts()` to combine app counts by genre.
"""
dataset['prime_genre'].value_counts().head(1)

prime_genre
Games    3862
Name: count, dtype: int64
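If only the genre name is needed rather than a one-row Series, `idxmax()` on the counts returns the label directly. A small sketch on a hypothetical `genres` Series:

```python
import pandas as pd

# Hypothetical genre labels.
genres = pd.Series(["Games", "Education", "Games", "Music", "Games"], name="prime_genre")

counts = genres.value_counts()
print(counts.idxmax())  # label of the most frequent genre
print(counts.max())     # how many apps it has

# Shares instead of raw counts can make the dominance of one genre easier to read.
shares = genres.value_counts(normalize=True)
print(shares.round(2))
```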

### 🔸 **Q3**: `For paid apps that are sold in dollars, what is the median rating for each genre?`

In [49]:
"""
For paid apps that are sold in dollars, what is the median rating for each genre?

> STEP 1: Get all paid apps.
> STEP 2: Get a subset of paid apps sold in dollars (USD).
> STEP 3: Calculate median rating per genre.
"""
# Paid apps have a price greater than zero; `price` is numeric, so compare against a number
paid_apps_dollars = dataset[(dataset['price'] > 0) & (dataset['currency'] == 'USD')]
median_rating_per_genre = paid_apps_dollars.groupby('prime_genre')['user_rating'].median()
print(median_rating_per_genre)
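A common pitfall when filtering for paid apps is comparing a numeric column against a string. Because `price` holds floats, a test such as `!= '0.00'` never finds a match and therefore keeps every row, free apps included. The sketch below illustrates the difference on a hypothetical `prices` Series.

```python
import pandas as pd

# Hypothetical prices: two free apps and two paid apps.
prices = pd.Series([0.0, 3.99, 0.0, 1.99], name="price")

mask_string = prices != "0.00"  # float vs. string: nothing is equal, so every entry is True
mask_number = prices > 0        # numeric comparison: True only for paid apps

print(mask_string.tolist())  # [True, True, True, True]
print(mask_number.tolist())  # [False, True, False, True]
```

Checking `mask.sum()` against the expected number of rows is a quick way to confirm that a filter is doing what you intend.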


### 🔸 **Q4**: `What is the average size in megabytes for each genre for apps rated 4.0 or higher?`

(**NOTE**: _Use ratings of current version._)

In [54]:
"""
What is the average size in megabytes for each genre for apps rated 4.0 or higher?
(NOTE: Use rating of current version.)

> STEP 1: Get all apps rated 4.0 or higher.
> STEP 2: Calculate mean app size per genre.
> STEP 3: Convert Series values from bytes to MB by dividing values by 1,000,000.
"""
# Filter on the rating of the current version, as the question requires
app_rated_over_4 = dataset[dataset['user_rating_ver'] >= 4.0]
mean_app_size_per_genre = app_rated_over_4.groupby('prime_genre')['size_bytes'].mean()
mean_app_size_per_genre_mb = mean_app_size_per_genre / 1_000_000
print(mean_app_size_per_genre_mb)
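The `size_bytes` column stores bytes, so dividing by 1,000,000 yields decimal megabytes (MB). Dividing by 1024² instead gives binary mebibytes (MiB), which is slightly smaller. As a quick worked example, using the `size_bytes` value from the first row shown by `dataset.head(1)`:

```python
# size_bytes of the first row shown by dataset.head(1)
size_bytes = 100_788_224

size_mb = size_bytes / 1_000_000  # decimal megabytes (MB)
size_mib = size_bytes / 1024**2   # binary mebibytes (MiB)

print(round(size_mb, 2))   # 100.79
print(round(size_mib, 2))  # 96.12
```

Either convention is defensible as long as it is stated; this notebook's steps use the decimal one.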


### 🔸 **Q5**: `What is the average price (in USD) by genre for paid apps rated higher than 3.0?`

(**NOTE**: _Assume that all non-USD prices are dropped from consideration._)

In [59]:
"""
What is the average price (in USD) by genre for paid apps rated higher than 3.0?
(NOTE: Assume that all non-USD prices are not considered.)

> STEP 1: Get all apps rated higher than 3.0.
> STEP 2: Get subset of data by paid apps.
> STEP 3: Get subset of data by USD currencies only.
> STEP 4: Calculate mean app price by genre.
"""
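app_rated_over_3_paid_apps_usd = dataset[
    (dataset['user_rating'] > 3.0)      # STEP 1: rated higher than 3.0
    & (dataset['price'] > 0.00)         # STEP 2: paid apps only
    & (dataset['currency'] == 'USD')    # STEP 3: USD prices only
]
mean_app_price_by_genre3 = app_rated_over_3_paid_apps_usd.groupby('prime_genre')['price'].mean()
print(mean_app_price_by_genre3)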
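The same filter-then-aggregate pattern can also be written as a single chain with `DataFrame.query`, which some readers find easier to scan. This is an alternative sketch on hypothetical rows, not a replacement for the boolean-mask approach; rounding and sorting are added only to make the result easier to read.

```python
import pandas as pd

# Hypothetical apps covering the three filter conditions.
apps = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Music", "Music", "Games"],
    "currency": ["USD"] * 5,
    "price": [0.99, 2.99, 4.99, 0.0, 1.99],
    "user_rating": [4.5, 3.5, 4.0, 5.0, 2.0],
})

mean_price = (
    apps.query("user_rating > 3.0 and price > 0 and currency == 'USD'")
        .groupby("prime_genre")["price"]
        .mean()
        .round(2)
        .sort_values(ascending=False)
)
print(mean_price)
```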

Great work! However, answering some predetermined questions does not make you a professional data analyst.

At the end of the day, your own creativity and intrigue are what guide your analytical and technical hand.

---

### 📌 **REQUIRED CHALLENGE!** 📌

> It's worthwhile putting you in the batter's box and allowing you to ask some complex descriptive questions... that you'll also take the opportunity to answer.
>
> **Your challenge is to create _two_ (2) additional descriptive questions to ask and answer them using the cells provided below.**
>
> We want to be wary of straying too far into inferential statistics or predictive analytics, as those questions may be easy to ask but very challenging to answer effectively.
>
> As such, be direct and concise with your questions, but also don't be afraid of being ambitious; this is the way to becoming a better, more confident data scientist, after all!

---

### 🔸 **Q6**: `[STUDENT SUBMISSION QUESTION]`

In [78]:
# STUDENTS: Write some code to answer your first proposed descriptive question!
'''
Which free apps have a user rating of 4.5?
'''
free_app_top_rating = dataset[(dataset['user_rating']==4.5) & (dataset['price']== 0.00)]
print(free_app_top_rating['track_name'].to_list())


['Bible', 'Sonos Controller', 'OpenTable - Restaurant Reservations', 'Chase Mobile℠', 'The Masters Tournament', 'WhatsApp Messenger', 'Zillow Real Estate - Homes for Sale & for Rent', 'radio.de - Der Radioplayer', 'AutoScout24 - mobile used & new car market', 'Sky Burger - Build & Match Food Free', 'Sleep Cycle alarm clock', 'Pixel Starships™ : 8Bit MMORPG', 'MyRadar NOAA Weather Radar Forecast', 'Waze - GPS Navigation, Maps & Real-time Traffic', 'Spotify Music', 'Starbucks', 'Walgreens – Pharmacy, Photo, Coupons and Shopping', 'Runtastic Running, Jogging and Walking Tracker', 'Calorie Counter & Diet Tracker by MyFitnessPal', 'BlaBlaCar - Trusted Carpooling', 'IMDb Movies & TV - Trailers and Showtimes', 'Angry Birds', 'Star Chart', 'Badoo - Meet New People, Chat, Socialize.', 'Venmo', 'Waterlogged - Daily Hydration Tracker', 'The Washington Post Classic', 'Groupon - Deals, Coupons & Discount Shopping App', 'Solitaire', 'Angry Birds HD', 'Dictionary.com Dictionary & Thesaurus for iPad',
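If the goal is to find the apps with the highest rating rather than a fixed value such as 4.5, the threshold can be computed from the data instead of being hard-coded. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical apps; the best rating among the free ones is 5.0.
apps = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "price": [0.0, 0.0, 1.99, 0.0],
    "user_rating": [4.5, 5.0, 5.0, 5.0],
})

free = apps[apps["price"] == 0]
best = free["user_rating"].max()            # highest rating among free apps
best_free = free[free["user_rating"] == best]

print(best)                               # 5.0
print(best_free["track_name"].tolist())   # ['B', 'D']
```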

### 🔸 **Q7**: `[STUDENT SUBMISSION QUESTION]`

In [86]:
# STUDENTS: Write some code to answer your second proposed descriptive question!
'''
Which apps sold in USD are expensive (priced at $200 or more)?
'''

expensive_app_in_usd = dataset[(dataset['price']>= 200.00) & (dataset['currency']=='USD')]
print(expensive_app_in_usd['track_name'].to_list())

['Proloquo2Go - Symbol-based AAC', 'LAMP Words For Life']


---
---

## 💠 **PART FOUR**: Conclusions 💠

With seven descriptive questions asked and answered and a data exploration under your belt, your basic skills have been put to the test!

Hopefully you feel proud of your ability to combine creative thinking with technical design and explore a dataset previously unknown to you.

There'll be a lot more opportunity to do so moving forward!

---
---
---