____________________
# Music 255: Pandas Homework

In this homework assignment, you will get a chance to work on:

* **data clean up**: type checking, formatting, variable range
* **filtering and sorting**: producing desired subsets of data
* **customization**: creating new columns, entries, merging datasets

Read more about Pandas [here](https://pandas.pydata.org/about/).

Find Pandas Tutorials [here](https://www.w3schools.com/python/pandas/default.asp).

Pandas Cheat Sheet:  [here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

Also remember the guide to various Pandas features in the **M225_B_Pandas Jupyter Notebook**, 
which uses data sets for the Beatles to demonstrate the methods.

#### Setup: Import Libraries

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

____________________

# Part 1: Import Data

As a reminder:


Pandas **data frames** are the basic unit upon which all operations take place.  Data frames are like spreadsheets, with columns and rows.

Indeed, Pandas can easily import spreadsheets in **CSV** (comma-separated values) format.  We will also import data from databases in **JSON** format (JavaScript Object Notation), similar to a Python dictionary.  There are special scripts for working with JSON, too.

Pandas can export as well as import these formats (among others).
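As a small illustration of importing and exporting, the sketch below round-trips a made-up two-row table through CSV and JSON text. The `toy` DataFrame and its values are invented for the example, and `io.StringIO` stands in for a file or URL.

```python
import io

import pandas as pd

# A made-up table, not taken from the assignment datasets
toy = pd.DataFrame({"track_name": ["Song A", "Song B"], "tempo": [120.5, 98.0]})

# Export to CSV text, then read it back into a new DataFrame
csv_text = toy.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Export to JSON text (one record per row), then read it back
json_text = toy.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Passing a URL string to `pd.read_csv` works the same way as passing the `StringIO` object here.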

In this portion of the homework, you will be asked to import one of the four datasets we compiled for you:

* Beatles: https://docs.google.com/spreadsheets/d/e/2PACX-1vS5BmNaCGD2fZW_csT6pQHD42pkAP9Dy-Vm5qnbQl03Y4ZvAbp_b5NRFNQ0fBSBCvpN6RCxowRc9AQ_/pub?output=csv  <br><br>
* Rolling Stones: https://docs.google.com/spreadsheets/d/e/2PACX-1vQHD3EgBmbdiAAhiS_wML0suRow6r8q26x8k3g7QqyjaGAjDyp0nEymaasQiuDgtDEYm0HOD1MZaJUJ/pub?output=csv <br><br>
* Led Zeppelin: https://docs.google.com/spreadsheets/d/e/2PACX-1vQx7QDXjfB4d_7tav01q0zyZMaGrffldBhUenDJaJt03DH4bTTOneM8srUnJ8zkQ425_xIIXXty5Dzy/pub?output=csv <br><br>
* Bob Dylan: https://docs.google.com/spreadsheets/d/e/2PACX-1vRW3Vi8boPnwbnAwI1sRkJ-D4mF24esW1KvWm7rMueD4hFp2Vu6zXHCgOtEgkD383ZHgQow0j2DKGEM/pub?output=csv

Use the cell below to save your dataset into a Pandas DataFrame:

In [30]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQx7QDXjfB4d_7tav01q0zyZMaGrffldBhUenDJaJt03DH4bTTOneM8srUnJ8zkQ425_xIIXXty5Dzy/pub?output=csv"
imported_dataset = pd.read_csv(url)

In the cell below, write code that outputs the first three rows of your DataFrame:

In [31]:
# make this cell output the first three rows of your DataFrame
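For reference, here is a minimal sketch of previewing the top of a table, using a made-up `toy` DataFrame rather than the assignment data.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "track_name": ["A", "B", "C", "D", "E"],
        "energy": [0.49, 0.44, 0.78, 0.57, 0.61],
    }
)

# head(n) returns the first n rows as a new DataFrame
first_three = toy.head(3)
print(first_three)
```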

It is also important to obtain the metadata for your dataset. Typically, a data sample is described by its mean, standard deviation, size, shape, and data types. Please use the cell below to print out all available information about your dataset (use Lab Notebook B for reference):

In [32]:
# describe your dataset
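The sketch below shows a few common ways to summarize a table, applied to a made-up `toy` DataFrame. Which of these the lab notebook uses is not stated here, so treat the selection as an assumption.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "track_name": ["A", "B", "C"],
        "energy": [0.2, 0.5, 0.8],
        "key": [4, 2, 9],
    }
)

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # data type of each column

# describe() reports count, mean, std, min, quartiles, and max
# for the numeric columns
summary = toy.describe()
print(summary)

# info() prints column names, non-null counts, and dtypes
toy.info()
```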

____________________
# Part 2: Data Clean Up

Once you're familiar with your data, the next step is usually to clean it. Some common practices include **type checking**, **dropping NaNs (Not a Number values)**, and **removing duplicate entries**. In the cells below, you'll get a chance to practice each of these techniques.

#### Step 2.1: Type Checking

One of the features provided by your Spotify Datasets is named "mode" and stands for the Musical Mode of a track. While Spotify's scores are typically binary (0 => Minor; 1 => Major), we sometimes encounter errors in these files: e.g. someone entered "Major" instead of 1.

First, we ask you to type check your DataFrame. There is a built-in Pandas attribute that does this. Please use the cell below to output the **data types for your DataFrame**:

In [33]:
# use this cell to check the data types of your DataFrame
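To see why a column with stray text labels is not treated as numeric, the sketch below reads a made-up three-row CSV in which the "mode" column mixes digits with the word "Major". The CSV text is invented for the example.

```python
import io

import pandas as pd

# Made-up CSV text: the "mode" column mixes digits and a text label
csv_text = "track_name,mode\nSong A,1\nSong B,Major\nSong C,0\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The mixed column gets a non-numeric dtype
# (shown as "object" in many pandas versions)
print(toy.dtypes)

# Every entry in the column was read as text, including "1" and "0"
print(toy["mode"].map(type).unique())
```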

At this point, you should see that the "mode" column has the dtype object. This is because the column mixes numeric entries (0's and 1's) with text entries ("Major" and "Minor"), so Pandas cannot give it a numeric dtype and falls back to the generic object dtype.

Next, we want to see all of the values that appear in the column. One way to check the value range is to call Python's **set()** function on the DataFrame column. Please output the **set of all values of the "mode" column** in the cell below:

In [34]:
# produce the set of all values in the "mode" column here
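As a quick illustration of value-range checking with `set()`, here is a sketch on a made-up column (the values are invented to mirror the kind of mix described above).

```python
import pandas as pd

# Made-up column with repeated and inconsistent labels
toy = pd.DataFrame({"mode": ["1", "Major", "0", "Minor", "1"]})

# set() keeps one copy of each distinct value
values = set(toy["mode"])
print(values)
```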

In [35]:
imported_dataset.head()

| Unnamed: 0.1 | Unnamed: 0 | artist | track_name | track_id | danceability | energy | key | loudness | mode | speechiness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | Album_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Led Zeppelin | You Shook Me - 23/3/69 Top Gear; Remaster | 4AIJz1t4ysqOT1c5BLSRQQ | 0.414 | 0.49 | 4 | -8.576 | 1 | 0.0545 | 0.0999 | 0.472 | 0.67 | 130.761 | 314013 | 3 | The Complete BBC Sessions (Remastered) |
| 1 | 1 | Led Zeppelin | I Can't Quit You Baby - 23/3/69 Top Gear;Remaster | 5QRD5sNh0aaWMyjTzQ0QIn | 0.411 | 0.442 | 2 | -12.276 | 1 | 0.262 | 0.0266 | 0.198 | 0.367 | 145.973 | 263627 | 3 | The Complete BBC Sessions (Remastered) |
| 2 | 2 | Led Zeppelin | Communication Breakdown - Live on Tasty Pop Su... | 0RR8wuHHc5NqSFxhPDDBNV | 0.367 | 0.779 | 9 | -8.334 | 1 | 0.0488 | 0.00708 | 0.288 | 0.794 | 90.354 | 191827 | 4 | The Complete BBC Sessions (Remastered) |
| 3 | 3 | Led Zeppelin | Dazed and Confused - 3/23/69 Top Gear;Remaster | 3vDA1z8UmHtVLQV7McvhEj | 0.303 | 0.571 | 11 | -11.521 | 0 | 0.0958 | 0.0757 | 0.114 | 0.422 | 152.001 | 399587 | 3 | The Complete BBC Sessions (Remastered) |
| 4 | 4 | Led Zeppelin | The Girl I Love She Got Long Black Wavy Hair -... | 7ecVrUYlhj6OrKTAK0oDzo | 0.264 | 0.609 | 9 | -10.992 | 0 | 0.0655 | 0.311 | 0.322 | 0.862 | 185.875 | 183227 | 4 | The Complete BBC Sessions (Remastered) |


In order to clean up this column, you'll need to **write a function** to clean up your data. This could be accomplished by **defining a function** and **applying it within the column**, or by utilizing the **lambda function definition**. 

Please use the cell below to **clean up the "mode" column**:

{'0', '1', 'Major', 'Minor'}
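The function-based approach described above can be sketched on a made-up column as follows. The helper name `clean_mode` and the decision to raise an error on unexpected labels are assumptions made for this example, not part of the assignment.

```python
import pandas as pd

# Made-up column mixing text digits, text labels, and an integer
toy = pd.DataFrame({"mode": ["1", "Major", "0", "Minor", 1]})


def clean_mode(value):
    """Map the labels in the toy column to 0 (minor) or 1 (major)."""
    text = str(value).strip().lower()
    if text in ("1", "major"):
        return 1
    if text in ("0", "minor"):
        return 0
    # Fail loudly rather than guess at an unknown label
    raise ValueError(f"Unexpected mode value: {value!r}")


# apply() calls the function once per entry in the column
toy["mode"] = toy["mode"].apply(clean_mode)
print(set(toy["mode"]))
```

The same logic could be written inline with a `lambda`, at the cost of losing the explicit error for unexpected values.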

At this point, your "mode" column should only feature 1's and 0's. Please use the cell below to produce the **set of all values** in the column: 

In [None]:
# produce the new set of all values in the "mode" column here

#### Step 2.2: NaN Check

Another useful data clean-up technique is getting rid of (known as "dropping") empty values. In Pandas, a missing value is usually shown as NaN, short for Not a Number; Python's None is also treated as missing. In your Spotify dataset, several track entries (represented as dataset rows) have NaN/None in the "valence" column.

In the cell below, produce a line of code that will **drop NaNs** in your dataset. Then, use Pandas to check whether any NaNs remain (see lab example).

In [None]:
# drop the None values; check if any left
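A minimal sketch of dropping and then re-checking missing values, using a made-up `toy` table with one missing "valence" entry:

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value
toy = pd.DataFrame(
    {
        "track_name": ["A", "B", "C"],
        "valence": [0.67, np.nan, 0.42],
    }
)

# Count missing values per column before cleaning
print(toy.isna().sum())

# dropna() returns a new DataFrame without rows that contain missing values
cleaned = toy.dropna()

# Total number of missing values left after cleaning
print(cleaned.isna().sum().sum())
```

Note that `dropna()` does not modify `toy` itself; the result has to be assigned to a variable to be kept.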

#### Step 2.3: Uniqueness

Finally, sometimes it is very important to check for the uniqueness of the data entries. In the case of your Spotify dataset, we want to make sure no entry is repeated -- as we know that the same song featured in two different albums would be represented as two distinct entries.

Please use the cell below to **remove duplicates** from your DataFrame.

In [None]:
# remove duplicates here
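The sketch below removes exact duplicate rows from a made-up `toy` table. Treating only fully identical rows as duplicates is an assumption of this example.

```python
import pandas as pd

# Made-up table: the third row repeats the first one exactly
toy = pd.DataFrame(
    {
        "track_name": ["A", "B", "A"],
        "album": ["X", "X", "X"],
        "tempo": [120.0, 98.0, 120.0],
    }
)

# duplicated() marks rows that repeat an earlier row
print(toy.duplicated().sum())

# Keep the first copy of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

`drop_duplicates` also accepts a `subset` argument if only some columns should be compared.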

____________________
# Part 3: Rearranging Data

At this point, you should have a cleaned-up and well-compiled dataset that is ready to be analyzed. As part of configuring your data to fit your analysis purposes, you will often have to **rearrange, sort, filter, group, or bin** your entries. 

#### Step 3.1: Column Order

Sometimes, you will have to **switch your columns around** so that your data appears in a particular order. Please use the cell below to swap the "loudness" and "mode" columns in your dataset. Output the dataset's column names to showcase your result.

In [15]:
# rearrange columns here: return the list of column names
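One way to swap two columns is to reorder the list of column names and then select the columns in the new order. The sketch below does this on a made-up four-column `toy` table.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "key": [4, 2],
        "loudness": [-8.6, -12.3],
        "mode": [1, 0],
        "tempo": [130.8, 146.0],
    }
)

# Swap the positions of the two names in the column list
columns = list(toy.columns)
i, j = columns.index("loudness"), columns.index("mode")
columns[i], columns[j] = columns[j], columns[i]

# Selecting with the reordered list returns the columns in that order
reordered = toy[columns]
print(list(reordered.columns))
```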

#### Step 3.2: Row Order

Similarly, you will sometimes have to **rearrange rows**. Use the cell below to swap rows 2 and 3. Output the head of the dataset to illustrate the change:

In [16]:
# swap rows 2 and 3 here
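The same reorder-then-select idea works for rows. This sketch swaps two rows of a made-up `toy` table by position; reading "rows 2 and 3" as zero-based positions is an assumption here.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({"track_name": ["A", "B", "C", "D", "E"]})

# Build the list of row positions and swap positions 2 and 3
order = list(range(len(toy)))
order[2], order[3] = order[3], order[2]

# iloc selects rows by position; reset_index renumbers them afterwards
swapped = toy.iloc[order].reset_index(drop=True)
print(swapped.head())
```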

#### Step 3.3: Sorting

Oftentimes, data needs to be sorted. Use the cell below to **sort the entries** in your DataFrame **based on energy**. That is, based on the values in the "energy" column (in ascending order). Output the first 10 rows of your dataset to illustrate the result: 

In [None]:
# sort your data here
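A minimal sorting sketch on a made-up `toy` table:

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "track_name": ["A", "B", "C"],
        "energy": [0.78, 0.44, 0.57],
    }
)

# sort_values returns a new DataFrame ordered by the given column
by_energy = toy.sort_values("energy", ascending=True)
print(by_energy.head(10))
```

The original index labels travel with their rows, so the printed index is no longer in order.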

#### Step 3.4: Binning 

Another important tool is being able to categorize data. Oftentimes, this is done through "binning" -- assigning the entry to one of several discrete categories based on some continuous value. In our specific example, we will use the values in the "danceability" column (expressed as floats ranging from 0 to 1) to classify a track as a Dance Tune (0 => Not a Dance Tune; 1 => Definitely a Dance Tune). 

First, you need to think about picking a certain threshold value. Is Get Back by the Beatles (0.628 danceability rating) a Dance Tune? How about Doctor Robert (0.392 danceability score)? Use the code cell below to **pick your danceability threshold value** and save it as a variable. 

In [None]:
# save your danceability threshold here

Edit this cell to provide **verbal explanation** regarding your choice:

#### Why I chose my threshold: 

Now, use the threshold you selected to **produce a column named "DanceTune"**. The new column should say *True* (boolean) **if a track has a danceability score equal to or above your chosen threshold**, and *False* otherwise. Output the column to illustrate your result:

In [None]:
# bin danceability here; output the new column
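Comparing a column against a number produces a column of booleans, which is all that two-way binning needs. In the sketch below, `example_threshold` is an arbitrary placeholder value, not a recommendation, and the `toy` table is made up.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "track_name": ["A", "B", "C"],
        "danceability": [0.628, 0.392, 0.5],
    }
)

# Arbitrary placeholder; pick and justify your own value
example_threshold = 0.5

# The comparison is evaluated for every row at once
toy["DanceTune"] = toy["danceability"] >= example_threshold
print(toy["DanceTune"])
```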

#### Step 3.5: Categorical Binning

Sometimes, a simple True/False classification isn't enough. For example, the "tempo" column provides the musical tempo of a given track in beats per minute (bpm); this value typically ranges between 1 and ~500 bpm. While it is possible to classify tracks into Slow and Not Slow, it might be more useful to categorize them into, for example, "Slow", "Medium", and "Fast".

Similarly to the previous task, use the space below to **come up with two relative thresholds** that would separate tempo into Slow, Medium, and Fast:

In [None]:
# save your tempo thresholds here

Edit this cell to provide **verbal explanation** regarding your choice:

#### Why I chose my thresholds: 

Now, use the thresholds you selected to **produce a column named "Pace"**. The new column should say "*Slow*" (String) if a track has a tempo score below your lower threshold, "*Medium*" (String) if it has a tempo score equal to or above your lower threshold but below your upper threshold, and "*Fast*" (String) if its score is equal to or above your upper threshold. Output the column to illustrate your result:

In [18]:
# add the new column here
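For three or more categories, `pd.cut` can assign labels from a list of bin edges. The sketch below uses a made-up `toy` table and placeholder thresholds `low` and `high`; assigning a tempo exactly at the upper threshold to "Fast" is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({"tempo": [72.0, 100.0, 119.9, 140.0, 185.9]})

# Arbitrary placeholders; pick and justify your own values
low, high = 100.0, 140.0

# right=False makes each bin include its left edge and exclude its right:
# [-inf, low) -> Slow, [low, high) -> Medium, [high, inf) -> Fast
toy["Pace"] = pd.cut(
    toy["tempo"],
    bins=[-np.inf, low, high, np.inf],
    labels=["Slow", "Medium", "Fast"],
    right=False,
).astype(str)  # convert the categorical result to plain strings

print(toy)
```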

#### Step 3.6: Merging Datasets

Finally, another useful thing is **combining several datasets** together. Use the cell below to import one of the other datasets (provided in the Import Data portion of this assignment) as a Pandas DataFrame and then merge it with your current dataset vertically **as a new dataset** (think about columns you have added). 

Output the new DataFrame to illustrate your result: 

In [None]:
# produce the new combined DataFrame here:
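Stacking two tables vertically can be done with `pd.concat`. The sketch below uses two made-up tables, `first` and `second`, where only the first has a "DanceTune" column, to show what happens to columns that one table lacks.

```python
import pandas as pd

# Made-up tables; only the first one has the extra "DanceTune" column
first = pd.DataFrame(
    {
        "artist": ["Band A", "Band A"],
        "tempo": [120.0, 98.0],
        "DanceTune": [True, False],
    }
)
second = pd.DataFrame({"artist": ["Band B"], "tempo": [140.0]})

# Stack the rows; ignore_index=True renumbers the combined index
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```

Rows from the table without the extra column get a missing value there, so the added columns may need to be recomputed or dropped after combining.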

____________________
# Part 4: Data Analysis

As a final portion of this assignment, we ask you to explore some Data Analysis techniques in Pandas.

#### Step 4.1: Correlation

One useful measure of how two variables relate to each other is the correlation coefficient (r); its square, R^2, is the coefficient of determination. Use the cell below to **find the correlation coefficient** between two variables (think columns) of your choice within your dataset:

In [19]:
# explore correlation here
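The sketch below computes a correlation on a made-up `toy` table. `Series.corr` returns Pearson's r by default; squaring it gives R^2 for a simple linear relationship. The column pairing is chosen only for illustration.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "energy": [0.2, 0.4, 0.6, 0.8],
        "loudness": [-14.0, -11.0, -9.0, -6.0],
    }
)

# Pearson correlation coefficient between two columns
r = toy["energy"].corr(toy["loudness"])
r_squared = r**2
print(r, r_squared)

# Correlation matrix for every pair of numeric columns
print(toy.corr())
```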

Use this cell to explain why you chose the two variables and what you were able to find:

#### Why I chose my variables:


#### Step 4.2: Count

Another tool you might find useful is **counting data entries** that match a given condition. Use the cell below to count DanceTunes that are Slow:

In [None]:
# count entries here
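Counting rows that match a condition comes down to building a boolean mask and summing it. The sketch below combines two conditions on a made-up `toy` table.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "DanceTune": [True, True, False, True],
        "Pace": ["Slow", "Fast", "Slow", "Slow"],
    }
)

# & combines the two conditions row by row; parentheses are required
mask = toy["DanceTune"] & (toy["Pace"] == "Slow")

# True counts as 1 and False as 0, so the sum is the number of matches
slow_dance_count = int(mask.sum())
print(slow_dance_count)
```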

#### Step 4.3: Simple Charts

Finally, it is extremely useful to produce visual results for your data. Use the cell below to **produce a simple scatter plot** using matplotlib or altair based on two variables of your choice:

In [None]:
# produce your scatter plot here
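A minimal matplotlib scatter plot on a made-up `toy` table is sketched below. The choice of columns and the labels are placeholders for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame(
    {
        "energy": [0.2, 0.4, 0.6, 0.8],
        "loudness": [-14.0, -11.0, -9.0, -6.0],
    }
)

# One point per row: x from one column, y from the other
fig, ax = plt.subplots()
ax.scatter(toy["energy"], toy["loudness"])
ax.set_xlabel("energy")
ax.set_ylabel("loudness")
ax.set_title("Energy vs. loudness (toy data)")
```

In a Jupyter notebook the figure is typically displayed when the cell finishes running.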

Use this cell to explain why you chose the variables to graph out and how one could interpret the results:

#### Thoughts:

____________________
# Reflections (optional)

Please use this space to reflect on your experience, ask any additional questions, or suggest changes to this or other assignments.