# Learn Data Science with Spotify
Welcome! This tutorial is a complement to the textbook for the edX course Data 8.1x, Data Science: Computational Thinking with Python. Since the textbook uses a custom `Table` object instead of a standard Pandas DataFrame, I thought it would be fun to show how to do the same operations in Pandas. We'll use your own Spotify listening history as a source of data, so you might learn something about your music tastes at the same time.

You can also follow it independently as a simple introduction to working with Pandas and learn some basic concepts of data science.

This is a Jupyter notebook, another standard data science tool. You can find tutorials online, but the basics are simple: cells contain text or code and can be modified. You can execute the code in a cell by selecting it and pressing the play button or Shift+Enter. If a cell fails to execute, it's usually because you didn't execute necessary code in a preceding cell. For this reason, it's usually best to execute all cells in order.

Before we get started, we need to do a bit of setup. First, let's import pandas and numpy, two standard data science packages for Python. Run the following cell (and all future code cells as you come to them):

In [1]:
import pandas as pd
import numpy as np

Next, we need to get your music history from Spotify. When you run the following cell, Spotify will ask you for permission for our app to access your top artists from your listening history and your followed artists. If you say yes, the app will download your top artists and followed artists and convert the data into CSV format. If you prefer to skip this step, you can use the data from my own listening history.

Note that you only need to run the following cell once to create the CSV file, or again if you want to update the file with your latest history.

If you prefer the app to only use your followed artists or your top artists, you can modify the cell below with a keyword argument. For example:

- `user_artists_csv(top=False)` will only get your followed artists
- `user_artists_csv(followed=False)` will only get your top artists

In [2]:
from dataspot import user_artists_csv
user_artists_csv()

Configuration Succesful
Configuration Succesful


Now that we have our CSV file, let's load it into Pandas as a DataFrame:

In [3]:
with open('data/user_artists.csv', 'r') as csv_file:
    user_artists = pd.read_csv(csv_file, index_col='name')
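
If you're curious what `index_col` does, here's a minimal sketch that loads a tiny CSV from a string instead of a file. The three rows are copied from my own default data, and the names `csv_text` and `sample` are just placeholders for this example.

```python
import io
import pandas as pd

# A tiny stand-in for data/user_artists.csv (three rows, two data columns).
csv_text = """name,followers,popularity
Kamaal Williams,58205,46
Madison McFerrin,38447,47
Scott Walker,94142,49
"""

# index_col='name' turns the 'name' column into the row labels (the index),
# so it no longer appears among the regular columns.
sample = pd.read_csv(io.StringIO(csv_text), index_col='name')

print(sample.index.name)     # name
print(list(sample.columns))  # ['followers', 'popularity']
```

`pd.read_csv` also accepts a file path directly, so `pd.read_csv('data/user_artists.csv', index_col='name')` should give the same result as the cell above.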

Good! We should now have everything we need to start following the textbook.

## Chapter 3.4
We'll pick things up at [chapter 3.4](https://www.inferentialthinking.com/chapters/03/4/Introduction_to_Tables.html), which introduces the concept of tables. The equivalent in Pandas is called a DataFrame. Let's see what it looks like in a Jupyter notebook by running the following cell:

In [4]:
user_artists

name,followers,popularity,total_genres,genre_0,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,...,genre_12,genre_13,genre_14,genre_15,genre_16,genre_17,genre_18,genre_19,id,uri
Kamaal Williams,58205,46,4,indie jazz,indie soul,neo r&b,uk contemporary jazz,,,,...,,,,,,,,,01mXk9IDlVczWwZvVHAiIS,spotify:artist:01mXk9IDlVczWwZvVHAiIS
Madison McFerrin,38447,47,5,a cappella,alternative r&b,indie jazz,indie soul,neo r&b,,,...,,,,,,,,,02zPEtdzUWnPToEVLRiQ7e,spotify:artist:02zPEtdzUWnPToEVLRiQ7e
Scott Walker,94142,49,10,art pop,art rock,baroque pop,dance rock,experimental,experimental rock,freak folk,...,,,,,,,,,04tBaW21jyUfeP5iqiKBVq,spotify:artist:04tBaW21jyUfeP5iqiKBVq
Jerico,168,2,1,doujin,,,,,,,...,,,,,,,,,050aWtsntLl4HdCJSoCNDa,spotify:artist:050aWtsntLl4HdCJSoCNDa
CocoRosie,297716,50,4,art pop,folktronica,freak folk,new weird america,,,,...,,,,,,,,,05fo024EFotg9songSENOZ,spotify:artist:05fo024EFotg9songSENOZ
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Free Nationals,121298,64,2,alternative r&b,indie soul,,,,,,...,,,,,,,,,4596e2d3KmYzAeVenjCxfj,spotify:artist:4596e2d3KmYzAeVenjCxfj
BADBADNOTGOOD,476274,63,5,alternative hip hop,canadian modern jazz,escape room,funk,indie soul,,,...,,,,,,,,,65dGLGjkw3UbddUg2GKQoZ,spotify:artist:65dGLGjkw3UbddUg2GKQoZ
King Geedorah,93479,55,3,alternative hip hop,hardcore hip hop,hip hop,,,,,...,,,,,,,,,77AKJs9SJqxHXbPgtJPKRa,spotify:artist:77AKJs9SJqxHXbPgtJPKRa
Les Hommes,2907,29,1,nu jazz,,,,,,,...,,,,,,,,,2Cvdhz7BUQbO4LxeBBQM8s,spotify:artist:2Cvdhz7BUQbO4LxeBBQM8s


Unless you barely listen to Spotify, the notebook probably can't show all the data at once. It should display the first five and last five rows of artists, as well as the first ten and last ten columns of data.

If you don't recognize the artists, it's possible the app failed to download your data from Spotify and you are using the default data from my own listening history. Make sure you ran the cell with the command `user_artists_csv()` (after executing all previous code cells in order). If that cell failed to execute properly, I'm sorry. You'll have to make do with the default data.

You might notice that the `name` column is in bold. That's because it's the index column, which we specified with the kwarg `index_col` when loading the CSV file. This will be useful later.

The Pandas `head` method is similar to the Table `show` method in the textbook. By default, it will display the first five rows:

In [None]:
user_artists.head()

If you give it an integer, it will display that many rows:

In [None]:
user_artists.head(2)

Pandas also offers a `tail` method to show the last rows instead of the first. Remember, you can change the code in the notebook to test it out for yourself. Try a different value than the default 8 we've given:

In [None]:
user_artists.tail(8)
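
If you'd like to check `head` and `tail` on something small enough to see in full, here's a quick sketch. The DataFrame `demo` is a stand-in built from five rows of my default data, not your real `user_artists`.

```python
import pandas as pd

# Five rows copied from the default data, with 'name' as the index.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142, 168, 297716],
     'popularity': [46, 47, 49, 2, 50]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker',
                    'Jerico', 'CocoRosie'], name='name'),
)

first_two = demo.head(2)  # the first two rows
last_two = demo.tail(2)   # the last two rows

print(list(first_two.index))  # ['Kamaal Williams', 'Madison McFerrin']
print(list(last_two.index))   # ['Jerico', 'CocoRosie']
```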

You can select a single column by name using square brackets. This doesn't change the original DataFrame and is equivalent to the Table `select` method in the textbook:

In [None]:
user_artists['followers']

You'll notice Pandas gives you some information about the column you selected: its name, the number of rows, and the Pandas data type of all entries.

The original DataFrame is unchanged:

In [None]:
user_artists

You can also select a single column using dot notation: `user_artists.followers`.
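
Here's a small sketch to confirm that both notations give back the same thing. It uses a three-row stand-in called `demo` rather than your real data. One caveat: dot notation only works when the column name is a valid Python identifier and doesn't clash with an existing DataFrame attribute or method, so square brackets are the safer habit.

```python
import pandas as pd

# Three rows copied from the default data.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142],
     'popularity': [46, 47, 49]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker'],
                   name='name'),
)

by_brackets = demo['followers']  # square-bracket selection
by_dot = demo.followers          # dot notation

print(type(by_brackets).__name__)      # Series
print(by_brackets.equals(by_dot))      # True
print(by_brackets.name, len(by_brackets))  # followers 3
```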

To select multiple columns, you have to pass a list of column names. Watch out for the double brackets here, which indicate the list:

In [None]:
user_artists[['followers', 'popularity']]

We can also drop columns we're not interested in for the moment with the `drop` method, which is similar to the method with the same name in the textbook. The simplest syntax is to pass a list of columns as a kwarg:

In [None]:
user_artists.drop(columns=['followers'])

Again, none of these methods modify the original DataFrame. If we want to work on a modified version of the DataFrame, we have to assign it a variable name. For example, if we wanted to work on music genres, we could save a DataFrame that only contained that information:

In [None]:
music_genres = user_artists.drop(columns=['followers','popularity'])

Now, we can refer to this new DataFrame anytime we want:

In [None]:
music_genres.head()
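
To convince yourself that `drop` returns a new DataFrame and leaves the original alone, you can try a sketch like this one. The names `demo` and `without_followers` are placeholders for this example.

```python
import pandas as pd

# Three rows copied from the default data.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142],
     'popularity': [46, 47, 49]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker'],
                   name='name'),
)

# drop() hands back a new DataFrame; we keep it by giving it a name.
without_followers = demo.drop(columns=['followers'])

print(list(without_followers.columns))  # ['popularity']
print(list(demo.columns))               # ['followers', 'popularity']
```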

The original table remains unchanged. Let's use it to create a simple table to study the popularity of your favorite artists:

In [None]:
artist_popularity = user_artists[['popularity','followers']]
artist_popularity

Of course, this table would be much more interesting if it were sorted! We can do this with the `sort_values` method, which is equivalent to the `sort` method in the textbook:

In [None]:
artist_popularity.sort_values('popularity')

Pandas has another sort function, `sort_index`, which always sorts according to the index column we specified when creating the DataFrame:

In [None]:
artist_popularity.sort_index()

Let's sort by popularity again, but this time, we'll put the most popular artists at the top of the list. In the textbook, the kwarg for this is `descending=True`, but in Pandas, we'll use `ascending=False` instead.

In [None]:
artist_popularity.sort_values('popularity', ascending=False)

Maybe you'll notice that some of your artists have the same popularity, but a different number of followers. We can sort by multiple columns to break the ties using the number of followers:

In [None]:
artist_popularity.sort_values(['popularity','followers'], ascending=False)
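
Here's a sketch of how the tie-breaking works. The four artists and their numbers are made up for illustration, since ties are easier to see on a tiny table. It also shows that `ascending` accepts a list with one value per column, in case you want to sort each column in a different direction.

```python
import pandas as pd

# Hypothetical artists with two popularity ties (not real Spotify data).
ties = pd.DataFrame(
    {'popularity': [50, 46, 50, 46],
     'followers': [1000, 300, 2000, 900]},
    index=pd.Index(['Artist A', 'Artist B', 'Artist C', 'Artist D'],
                   name='name'),
)

# Both columns descending: ties on popularity are broken by most followers.
both_desc = ties.sort_values(['popularity', 'followers'], ascending=False)
print(list(both_desc.index))  # ['Artist C', 'Artist A', 'Artist D', 'Artist B']

# Popularity descending, followers ascending within each tie.
mixed = ties.sort_values(['popularity', 'followers'], ascending=[False, True])
print(list(mixed.index))      # ['Artist A', 'Artist C', 'Artist B', 'Artist D']
```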

Again, the `sort_values` method doesn't change the original DataFrame. We can assign the new one a name, as for the previous methods, but we can also use the `inplace` kwarg to sort the original. The same kwarg can be used with the `sort_index` method.

In [None]:
artist_popularity.sort_values(['popularity','followers'], ascending=False, inplace=True)
artist_popularity
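
One detail that's easy to trip over: with `inplace=True`, the method sorts the DataFrame itself and returns `None`, so there's nothing useful to assign. Here's a small sketch on a stand-in DataFrame. I've added `.copy()` so that `ranking` is an independent object, which is my own choice for this example and not something the cell above does.

```python
import pandas as pd

# Five rows copied from the default data.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142, 168, 297716],
     'popularity': [46, 47, 49, 2, 50]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker',
                    'Jerico', 'CocoRosie'], name='name'),
)

# .copy() gives us an independent DataFrame to sort in place.
ranking = demo[['popularity', 'followers']].copy()

# With inplace=True, the return value is None and `ranking` itself is reordered.
result = ranking.sort_values('popularity', ascending=False, inplace=True)

print(result)               # None
print(list(ranking.index))
# ['CocoRosie', 'Scott Walker', 'Madison McFerrin', 'Kamaal Williams', 'Jerico']
```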

There are different ways to filter the data according to certain values, similarly to the `where` method in the textbook. One of the simplest is called boolean indexing, where we pass a condition to the DataFrame and only the rows where the value is True are returned:

In [None]:
artist_popularity[artist_popularity.followers > 50000] #Using dot notation for clarity. Alternate syntax below:
#artist_popularity[artist_popularity['followers'] > 50000]
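
It can help to look at the condition on its own before using it as a filter. In this sketch (again on a small stand-in called `demo`), the comparison produces a Series of True and False values, one per row, and the DataFrame keeps only the rows marked True.

```python
import pandas as pd

# Five rows copied from the default data.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142, 168, 297716],
     'popularity': [46, 47, 49, 2, 50]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker',
                    'Jerico', 'CocoRosie'], name='name'),
)

# The comparison is applied to every row and yields a boolean Series.
mask = demo.followers > 50000
print(mask.tolist())        # [True, False, True, False, True]

# Passing the mask to the DataFrame keeps only the rows marked True.
popular = demo[mask]
print(list(popular.index))  # ['Kamaal Williams', 'Scott Walker', 'CocoRosie']
```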

You can apply multiple conditions by linking them with the `&` operator. Don't forget to put each expression in parentheses, because the `&` operator has priority over comparison operators such as `>` and `<`.

In [None]:
artist_popularity[(artist_popularity.followers > 50000) & (artist_popularity.followers < 200000)]
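
Here's a sketch of why the parentheses matter, using a small stand-in DataFrame. Giving each condition its own name (my own habit, not a requirement) sidesteps the problem entirely. The second half shows the error you should expect if the parentheses are left out.

```python
import pandas as pd

# Five rows copied from the default data.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142, 168, 297716]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker',
                    'Jerico', 'CocoRosie'], name='name'),
)

# Naming each condition keeps the combined filter readable.
over_50k = demo.followers > 50000
under_200k = demo.followers < 200000
mid_sized = demo[over_50k & under_200k]
print(list(mid_sized.index))  # ['Kamaal Williams', 'Scott Walker']

# Without parentheses, Python evaluates `50000 & demo.followers` first and
# then treats the rest as a chained comparison, which Pandas rejects.
try:
    demo[demo.followers > 50000 & demo.followers < 200000]
    error_seen = False
except ValueError:
    error_seen = True
print(error_seen)  # True
```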

Let's play around with what we've learned a bit. Unless we have similar musical tastes, you'll have to modify the following cells according to your own Spotify data. This is on purpose! Doing things yourself is a better learning experience for most people.

Let's start by searching for our favorite artist. For me, that's the wonderful Jori Hulkkonen. Change the value of the following cell to your own favorite. Since we'll be exploring music genres, we'll use the original DataFrame, which is still intact. Since we've used the 'name' column of our CSV file as the index, we'll use `user_artists.index` instead of `user_artists.name` or `user_artists['name']`.

In [None]:
user_artists[user_artists.index == 'Jori Hulkkonen']

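You should have a single row matching your favorite artist. The result is still a DataFrame, just one with a single row. A single column (or a single row selected with `.loc`) is a different Pandas object known as a Series. As you might have guessed, a DataFrame is composed of many Series.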
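
To see which kind of object you get back, you can check its type. This sketch compares the boolean filter used above with the `.loc` indexer, which looks up a row by its index label. The DataFrame `demo` is a three-row stand-in from my default data.

```python
import pandas as pd

# Three rows copied from the default data.
demo = pd.DataFrame(
    {'followers': [58205, 38447, 94142],
     'popularity': [46, 47, 49]},
    index=pd.Index(['Kamaal Williams', 'Madison McFerrin', 'Scott Walker'],
                   name='name'),
)

# Filtering with a condition keeps the table shape: a one-row DataFrame.
row_frame = demo[demo.index == 'Scott Walker']
print(type(row_frame).__name__, row_frame.shape)  # DataFrame (1, 2)

# Looking up a single label with .loc returns that row as a Series.
row_series = demo.loc['Scott Walker']
print(type(row_series).__name__)  # Series
print(row_series['followers'])    # 94142
```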

What genre does your favorite artist belong to according to Spotify? In my case, there's only one genre listed, but your artist might have many genres listed in the row. Jori Hulkkonen's is listed in the `genre_0` column: "finnish electro". Sounds pretty unique! I wonder if I listen to any other artists in this genre? You probably don't, so replace `'finnish electro'` with whatever genre you see above that best fits your favorite artist. If you're not sure, just pick whatever is in the `genre_0` column. Make sure to keep the exact spelling or it won't work.

In [None]:
user_artists[user_artists.genre_0 == 'finnish electro']

Can you believe I listen to eight other Finnish electro artists? And I'm actually Canadian. They were all first suggested to me by Spotify, apart from Jori Hulkkonen himself. Maybe this classification is part of the reason why? We'd have to see their algorithms to know for sure, but it's a possibility.

What about you? Are you surprised by how many or how few artists came up? Three of my "finnish electro" artists only showed up when I changed the cell above to search the `genre_1` column. Try that and see what comes up. There are ways to search all the columns at once, of course, but they require methods we haven't seen yet, so we won't use them here. If you like learning on your own, feel free to google it and try it out in the cell below.

In [None]:

The previous cell is blank on purpose, because I hope you'll try to fill it out! The worst thing that can happen is you get an error message. Just modify your code and run it again until it works. If you're having trouble getting the syntax right, you can copy-paste code similar to what you're looking to do from elsewhere in the notebook and modify it to suit your needs. Good luck!