# SC207 - Session 3
# Exploring, structuring and visualising data with Pandas
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/python_pandas.jpg?raw=true" align="right">


- A major part of computational social science is the storing, manipulation and reporting of data. 
- Pandas is a powerful data management library specifically built for these kinds of tasks.
- It can handle very large amounts of data whilst remaining quick and responsive.

- We will be using Pandas throughout our practical sessions as a general purpose data management tool but this week we will focus on learning its features.

[__Pandas Documentation__](http://pandas.pydata.org/pandas-docs/stable/)



### The Data

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/spotify.png?raw=true" align="right" width=150>

Today we will be using data gathered from Spotify, the popular music streaming service. Spotify provides access to some of its data through its public API. This data has been collected and pre-prepared by your instructors.


### Imports
Importing modules is standard practice in Python. Rather than everyone writing their own code from scratch every time, modules allow us to integrate code developed by others into our own work. In most instances it is better to use a well-supported, pre-existing library than to write your own.

In [3]:
# Here we import the `pandas` module. We could simply use `import pandas`, however `as` allows us to use a shorter name.
# By convention, many modules are imported under these short names.

import pandas as pd

## Loading Data
Often (though not always) you will be importing data from a file into Pandas. Pandas can handle a range of import types...


<img src="https://github.com/Minyall/sc207_materials/blob/master/images/pandas_import.png?raw=true">

[Source - Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/io.html)

...some will be familiar to you, others less so.

Today's data is stored as a __CSV file__, a common format for storing data in a simple way that can be read by lots of different programs including text readers, Microsoft Excel, etc.

We need to provide either the relative or full path to the file so Python knows where in your computer to look. If the file is in the same place as your notebook you can provide the *relative* path which is the file path relative to the notebook. In our case the path is simply `spotify_top_songs.csv`. 

Whilst you're still learning it is best to just keep all relevant files in the same folder as your notebook.


In [76]:
filename = 'spotify_top_songs.csv'


In [77]:
# load your saved data as variable songs_df, df being short for DataFrame

songs_df = pd.read_csv(filename)
type(songs_df)

pandas.core.frame.DataFrame

In [78]:
# We can get a quick sense of the size of our dataset using .shape
# (number of rows, number of columns)

songs_df.shape

(1187, 13)
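
As a minimal sketch of what `pd.read_csv` does, the cell below reads CSV text held in memory rather than a file on disk. The name `toy_df` and the three-row sample (copied from rows shown in this session) are illustrative assumptions, not the course dataset itself.

```python
import io
import pandas as pd

# Toy CSV text standing in for a file on disk.
# The first line holds the column names, each later line is one row.
csv_text = """track_name,artists,release_year,popularity
Running Up That Hill (A Deal With God),Kate Bush,1985,96
As It Was,Harry Styles,2022,94
Glimpse of Us,Joji,2022,98
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(type(toy_df))
print(toy_df.shape)  # (number of rows, number of columns)
```

With a real file you would pass the path instead, exactly as in the `filename` cell above.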

## Viewing your data

In [80]:
# .head() shows us the top 5 rows
songs_df.head()

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
1,4Dvkj6JhhA12EX05fT7y2e,As It Was,Harry Styles,pop,2022,False,94,167303,Top 50 - United Kingdom,0.52,-5.338,0.0557,mixed_pop
2,40SBS57su9xLiE1WqkXOVr,Afraid To Feel,LF SYSTEM,***OOPS!***,2022,False,82,177524,Top 50 - United Kingdom,0.578,-3.929,0.114,mixed_pop
3,2KukL7UlQ8TdvpaA7bY3ZJ,BREAK MY SOUL,Beyoncé,dance pop,2022,False,90,278281,Top 50 - United Kingdom,0.687,-5.04,0.0826,mixed_pop
4,6xGruZOHLs39ZbVccQTuPZ,Glimpse of Us,Joji,alternative r&b,2022,False,98,233456,Top 50 - United Kingdom,0.44,-9.258,0.0531,mixed_pop


In [81]:
# .tail() shows us the last 5 rows

songs_df.tail()

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
1182,5d6ZRqgbz26Sg4bk1oifQw,Blue Suede Shoes,Carl Perkins,rock-and-roll,1957,False,57,134445,All Out 50s,0.548,-7.318,0.0364,all_out_decades
1183,6pPr1KLZit9FgFNhp7xE5m,Cheek To Cheek,Ella Fitzgerald,adult standards,1956,False,0,351893,All Out 50s,0.648,-13.395,0.0883,all_out_decades
1184,2k6qpHJsrKCCyvsHv2cPqR,Diana,Paul Anka,adult standards,1966,False,56,140520,All Out 50s,0.551,-8.49,0.0325,all_out_decades
1185,6lYeYgSkWh6TZDQy6YZuvG,Just A Gigolo - Remastered,Louis Prima,adult standards,1991,False,53,283200,All Out 50s,0.525,-11.987,0.0945,all_out_decades
1186,2R7uUQ0Dehu80gsOcydQC9,Bo Diddley,Bo Diddley,acoustic blues,1958,False,53,149013,All Out 50s,0.809,-12.484,0.0574,all_out_decades


In [82]:
# You can specify the number of rows to return

songs_df.head(10)

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
1,4Dvkj6JhhA12EX05fT7y2e,As It Was,Harry Styles,pop,2022,False,94,167303,Top 50 - United Kingdom,0.52,-5.338,0.0557,mixed_pop
2,40SBS57su9xLiE1WqkXOVr,Afraid To Feel,LF SYSTEM,***OOPS!***,2022,False,82,177524,Top 50 - United Kingdom,0.578,-3.929,0.114,mixed_pop
3,2KukL7UlQ8TdvpaA7bY3ZJ,BREAK MY SOUL,Beyoncé,dance pop,2022,False,90,278281,Top 50 - United Kingdom,0.687,-5.04,0.0826,mixed_pop
4,6xGruZOHLs39ZbVccQTuPZ,Glimpse of Us,Joji,alternative r&b,2022,False,98,233456,Top 50 - United Kingdom,0.44,-9.258,0.0531,mixed_pop
5,1PckUlxKqWQs3RlWXVBLw3,About Damn Time,Lizzo,dance pop,2022,True,95,191822,Top 50 - United Kingdom,0.836,-6.305,0.0656,mixed_pop
6,02MWAaffLxlfxAUY7c5dvx,Heat Waves,Glass Animals,gauze pop,2020,False,91,238805,Top 50 - United Kingdom,0.761,-6.9,0.0944,mixed_pop
7,0oiv4E896TUTTeQU0cmIui,Massive,Drake,canadian hip hop,2022,False,79,336924,Top 50 - United Kingdom,0.499,-6.774,0.0561,mixed_pop
8,4N5s8lPTsjI9EGP7K4SXzB,Green Green Grass,George Ezra,folk-pop,2022,False,69,167613,Top 50 - United Kingdom,0.685,-4.413,0.0595,mixed_pop
9,1qEmFfgcLObUfQm0j1W2CK,Late Night Talking,Harry Styles,pop,2022,False,95,177954,Top 50 - United Kingdom,0.714,-4.595,0.0468,mixed_pop


## Exercise 1

Using `songs_df`, view the
- top 20 rows
- last 30 rows


In [83]:
# Write your code for exercise 1 in this cell



In [84]:
# QUESTION
# What is the name of the song in the very last row of the dataframe? Assign the name as a string to the answer variable

answer = ''  # replace the empty string with the song name

if answer.lower() == songs_df.iloc[-1, 1].lower():
    print('Correct!')
else:
    print('Incorrect - Try again')


## Describing your DataFrame
<img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg" title='Pandas DataFrame' width="400" height="200"/>

DataFrames are like big Excel spreadsheets. They have...

- Rows
- Columns

...and `values` which is the data inside the cells of your spreadsheet.


In [85]:
# We can see a list of our columns...
songs_df.columns

Index(['track_id', 'track_name', 'artists', 'genre', 'release_year',
       'explicit', 'popularity', 'duration_ms', 'playlist_name',
       'danceability', 'loudness', 'speechiness', 'playlist_type'],
      dtype='object')

In [86]:
# Accessing the row index, i.e. the row labels

songs_df.index

RangeIndex(start=0, stop=1187, step=1)

In [87]:
# Accessing the DataFrame values

songs_df.values

array([['75FEaRjZTKLhTrFGsfMUXR',
        'Running Up That Hill (A Deal With God)', 'Kate Bush', ...,
        -13.123, nan, 'mixed_pop'],
       ['4Dvkj6JhhA12EX05fT7y2e', 'As It Was', 'Harry Styles', ...,
        -5.338, 0.0557, 'mixed_pop'],
       ['40SBS57su9xLiE1WqkXOVr', 'Afraid To Feel', 'LF SYSTEM', ...,
        -3.929, 0.114, 'mixed_pop'],
       ...,
       ['2k6qpHJsrKCCyvsHv2cPqR', 'Diana', 'Paul Anka', ..., -8.49,
        0.0325, 'all_out_decades'],
       ['6lYeYgSkWh6TZDQy6YZuvG', 'Just A Gigolo - Remastered',
        'Louis Prima', ..., -11.987, 0.0945, 'all_out_decades'],
       ['2R7uUQ0Dehu80gsOcydQC9', 'Bo Diddley', 'Bo Diddley', ...,
        -12.484, 0.0574, 'all_out_decades']], dtype=object)

The `.info()` method gives us an overview of our DataFrame, including...
- A summary of the index labels
- Information about the columns
- A 'Non-Null' count, i.e. how many 'cells' in the column have a value in them.
- The type (Dtype) of values that column holds.
    - Integer (int64) - e.g. 5
    - Float (float64)- e.g. 5.3
    - Boolean (bool) - e.g. True / False
    - Other (object) - Usually a string, but can also be any Python object e.g. lists, dictionaries, classes.
- A summary of how much computer memory the data needs.

In [88]:
# An informative overview of our DataFrame

songs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       1187 non-null   object 
 1   track_name     1187 non-null   object 
 2   artists        1187 non-null   object 
 3   genre          1187 non-null   object 
 4   release_year   1187 non-null   int64  
 5   explicit       1187 non-null   bool   
 6   popularity     1187 non-null   int64  
 7   duration_ms    1187 non-null   int64  
 8   playlist_name  1187 non-null   object 
 9   danceability   1187 non-null   float64
 10  loudness       1187 non-null   float64
 11  speechiness    1186 non-null   float64
 12  playlist_type  1187 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(6)
memory usage: 112.6+ KB
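
The pieces that `.info()` reports can also be inspected one at a time. The sketch below uses a small hand-made frame (`toy_df` is an assumed name, with one missing `speechiness` value, like row 0 of the real data) to show where the Dtype and Non-Null Count figures come from.

```python
import pandas as pd

# Toy frame: float('nan') marks a missing value.
toy_df = pd.DataFrame({
    'track_name': ['Running Up That Hill (A Deal With God)', 'As It Was', 'Afraid To Feel'],
    'popularity': [96, 94, 82],
    'explicit': [False, False, False],
    'speechiness': [float('nan'), 0.0557, 0.114],
})

print(toy_df.dtypes)        # the type of values each column holds
print(toy_df.count())       # non-null values per column
print(toy_df.isna().sum())  # missing values per column
```

A column whose non-null count is lower than the number of rows contains missing values, which is how the single gap in `speechiness` shows up in the real output above.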


## Accessing columns and rows

### Accessing columns
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/01_table_series.svg?raw=true" title='Pandas DataFrame'/>

In Pandas, a single column is known as a `Series`.

In [89]:
# We can select a single column like we would select a dictionary key

songs_track_names = songs_df['track_name']

In [90]:
songs_track_names.head()

0    Running Up That Hill (A Deal With God)
1                                 As It Was
2                            Afraid To Feel
3                             BREAK MY SOUL
4                             Glimpse of Us
Name: track_name, dtype: object

In [91]:
type(songs_track_names)

pandas.core.series.Series

Note that the Series retains the index labels from the original DataFrame, shown on the left. These are not simply the positions of the rows; they are the labels that identify each row.
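
To make the point about labels concrete, here is a small sketch with a hand-made frame whose labels are deliberately not 0, 1, 2. The frame `toy_df` and its labels 10, 20, 30 are assumptions for illustration only.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'track_name': ['As It Was', 'Afraid To Feel', 'Glimpse of Us'],
     'popularity': [94, 82, 98]},
    index=[10, 20, 30],  # custom row labels, so labels and positions differ
)

names = toy_df['track_name']

print(names.index.tolist())               # the labels carried over from the DataFrame
print(names.index.equals(toy_df.index))   # the Series shares the DataFrame's index
```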

In [92]:
# We can also select a subset of columns by using double brackets.

songs_track_artist_genre = songs_df[['track_name','artists','genre']]

songs_track_artist_genre.head()

Unnamed: 0,track_name,artists,genre
0,Running Up That Hill (A Deal With God),Kate Bush,art pop
1,As It Was,Harry Styles,pop
2,Afraid To Feel,LF SYSTEM,***OOPS!***
3,BREAK MY SOUL,Beyoncé,dance pop
4,Glimpse of Us,Joji,alternative r&b


In [93]:
# double-brackets returns a DataFrame, single brackets returns a Series

type(songs_track_artist_genre)

pandas.core.frame.DataFrame

In [94]:
# ...even if you only select one column
songs_single_track_name = songs_df['track_name']
songs_doubled_track_name = songs_df[['track_name']]

In [95]:
type(songs_single_track_name)

pandas.core.series.Series

In [96]:
type(songs_doubled_track_name)

pandas.core.frame.DataFrame

## Exercise 2
From `songs_df` access the column 'artists' and assign it to a new variable. Check the type to make sure it is a `Series`.

In [97]:
# Write your code for exercise 2 in this cell



### Accessing rows
Rows can be accessed in a number of ways...
- By referring to their row label - `.loc`
- By referring to their integer position (index position) - `.iloc`
- By filtering based on specific criteria

In [98]:
# Select the row with label 4

row_label_4 = songs_df.loc[4]
row_label_4

track_id          6xGruZOHLs39ZbVccQTuPZ
track_name                 Glimpse of Us
artists                             Joji
genre                    alternative r&b
release_year                        2022
explicit                           False
popularity                            98
duration_ms                       233456
playlist_name    Top 50 - United Kingdom
danceability                        0.44
loudness                          -9.258
speechiness                       0.0531
playlist_type                  mixed_pop
Name: 4, dtype: object

Pandas has flipped the data so that the column labels are on the left, and the row label is at the top. This is because not only is a single column a `Series`, but a single row is a `Series` too.

In [99]:
# We can check the type of our row
type(row_label_4)

pandas.core.series.Series
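
Because a row is a `Series`, its values can be looked up by column name. A minimal sketch on a two-row toy frame (`toy_df` and `row` are illustrative names):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'track_name': ['As It Was', 'Glimpse of Us'],
    'artists': ['Harry Styles', 'Joji'],
    'popularity': [94, 98],
})

row = toy_df.loc[1]

print(row.index.tolist())  # the column names have become the Series index
print(row.name)            # the row label has become the Series name
print(row['artists'])      # so a single value is looked up by column name
```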

In [100]:
# If we wanted to double check our results
songs_df.head()

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
1,4Dvkj6JhhA12EX05fT7y2e,As It Was,Harry Styles,pop,2022,False,94,167303,Top 50 - United Kingdom,0.52,-5.338,0.0557,mixed_pop
2,40SBS57su9xLiE1WqkXOVr,Afraid To Feel,LF SYSTEM,***OOPS!***,2022,False,82,177524,Top 50 - United Kingdom,0.578,-3.929,0.114,mixed_pop
3,2KukL7UlQ8TdvpaA7bY3ZJ,BREAK MY SOUL,Beyoncé,dance pop,2022,False,90,278281,Top 50 - United Kingdom,0.687,-5.04,0.0826,mixed_pop
4,6xGruZOHLs39ZbVccQTuPZ,Glimpse of Us,Joji,alternative r&b,2022,False,98,233456,Top 50 - United Kingdom,0.44,-9.258,0.0531,mixed_pop


In [101]:
# Select the row at index position 4

row_idx_4 = songs_df.iloc[4]
row_idx_4


track_id          6xGruZOHLs39ZbVccQTuPZ
track_name                 Glimpse of Us
artists                             Joji
genre                    alternative r&b
release_year                        2022
explicit                           False
popularity                            98
duration_ms                       233456
playlist_name    Top 50 - United Kingdom
danceability                        0.44
loudness                          -9.258
speechiness                       0.0531
playlist_type                  mixed_pop
Name: 4, dtype: object

It's the same... so what's the point?

By default, Pandas labels each row with its integer position, so the label and the position start out identical.


| index | label | name   |
|-------|-------|--------|
| 0     | 0     | Arthur |
| 1     | 1     | Betty  |
| 2     | 2     | Carole |

However, whilst positions are always fixed (the first row is always position 0), the labels travel with their rows when the data is reordered. If we reversed the order of the data by name...

| index | label | name |
|-------|-------|----------|
| 0     | 2     | Carole   |
| 1     | 1     | Betty    |
| 2     | 0     | Arthur   |
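
The two tables above can be reproduced with a three-row frame. This is only a sketch; `people` and `reversed_people` are assumed names, and `ascending=False` is used to reverse the alphabetical order.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Arthur', 'Betty', 'Carole']})

# Reverse the order by name; each row keeps its original label.
reversed_people = people.sort_values('name', ascending=False)

print(reversed_people.index.tolist())   # [2, 1, 0]
print(reversed_people.iloc[0]['name'])  # position 0 -> Carole
print(reversed_people.loc[0, 'name'])   # label 0    -> Arthur
```

`.iloc` always counts from the top of whatever order the data is currently in, while `.loc` follows the label wherever its row has moved.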



In [102]:
# Example using our data

songs_artist_alphabetical = songs_df.sort_values('artists')
songs_artist_alphabetical.head()

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
584,35zGjsxI020C2NPKp2fzS7,It's Gonna Be Me,*NSYNC,boy band,2000,False,62,191040,All Out 2000s,0.644,-4.666,0.0801,all_out_decades
477,3tjFYV6RSFtuktYl3ZtYcq,Mood (feat. iann dior),24kGoldn,cali rap,2020,True,32,140525,Every UK Number One: 2022,0.7,-3.558,0.0369,uk_no1
376,3tjFYV6RSFtuktYl3ZtYcq,Mood (feat. iann dior),24kGoldn,cali rap,2020,True,32,140525,Every Official UK Number 1 Ever,0.7,-3.558,0.0369,uk_no1
694,1JClFT74TYSXlzpagbmj0S,California Love - Original Version,2Pac,g funk,1998,False,65,285026,All Out 90s,0.767,-2.715,0.041,all_out_decades
687,7f0jXNMu2xjQUtmKMuWhGA,What's Up?,4 Non Blondes,new wave pop,1992,False,0,295493,All Out 90s,0.566,-9.875,0.0285,all_out_decades


In [103]:
songs_artist_alphabetical_idx_4 = songs_artist_alphabetical.iloc[4]
songs_artist_alphabetical_idx_4

track_id         7f0jXNMu2xjQUtmKMuWhGA
track_name                   What's Up?
artists                   4 Non Blondes
genre                      new wave pop
release_year                       1992
explicit                          False
popularity                            0
duration_ms                      295493
playlist_name               All Out 90s
danceability                      0.566
loudness                         -9.875
speechiness                      0.0285
playlist_type           all_out_decades
Name: 687, dtype: object

In [104]:
songs_artist_alphabetical_label_4 = songs_artist_alphabetical.loc[4]
songs_artist_alphabetical_label_4

track_id          6xGruZOHLs39ZbVccQTuPZ
track_name                 Glimpse of Us
artists                             Joji
genre                    alternative r&b
release_year                        2022
explicit                           False
popularity                            98
duration_ms                       233456
playlist_name    Top 50 - United Kingdom
danceability                        0.44
loudness                          -9.258
speechiness                       0.0531
playlist_type                  mixed_pop
Name: 4, dtype: object

## Exercise 3
- Using `.sort_values()` sort your dataframe by `track_name` and assign the result to a new variable.
- Using your sorted dataframe, access the row labelled `350` and assign it to its own variable.
- Using your sorted dataframe, access the row at index position `350` and assign it to its own variable.

In [105]:
# Use this cell for the exercise





## Filtering by conditions
Filtering allows you to select multiple rows based on particular criteria. Our dataset was produced by pulling track information from a range of different playlists on Spotify.

In [108]:
# We can see the list of playlists by asking for unique values in the playlist_name column

songs_df['playlist_name'].unique()

array(['Top 50 - United Kingdom', 'The Pop List', "Today's Top Hits",
       'Cheesy Hits!', 'Alt. Pop.', 'Hit Rewind',
       'Every Official UK Number 1 Ever', 'Every UK Number One: 2022',
       'All Out 2000s', 'All Out 2010s', 'All Out 90s', 'All Out 80s',
       'All Out 70s', 'All Out 60s', 'All Out 50s'], dtype=object)

Filtering can be done in many ways, but the standard approach is to use the following syntax...

`DataFrame[filter conditions]`
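
It can help to see what a filter condition actually is before using one. In the sketch below (toy data, with `toy_df` and `mask` as assumed names) the condition is a Series holding one `True`/`False` per row, and indexing with it keeps only the `True` rows.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'track_name': ['As It Was', 'Blue Suede Shoes', 'Glimpse of Us'],
    'playlist_name': ['Top 50 - United Kingdom', 'All Out 50s', 'Top 50 - United Kingdom'],
})

# Comparing a column to a value gives a boolean Series, one entry per row.
mask = toy_df['playlist_name'] == 'Top 50 - United Kingdom'
print(mask.tolist())  # [True, False, True]

# Passing that Series inside the brackets keeps only the rows marked True.
print(toy_df[mask])
```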

In [110]:
# Filter our dataframe to only show songs from Spotify's Top 50 UK playlist
# (the 50 most played songs in the UK on the day the dataset was produced)

filter_uk_top_50 = songs_df['playlist_name'] == 'Top 50 - United Kingdom'

songs_top_50 = songs_df[filter_uk_top_50]
songs_top_50

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
1,4Dvkj6JhhA12EX05fT7y2e,As It Was,Harry Styles,pop,2022,False,94,167303,Top 50 - United Kingdom,0.52,-5.338,0.0557,mixed_pop
2,40SBS57su9xLiE1WqkXOVr,Afraid To Feel,LF SYSTEM,***OOPS!***,2022,False,82,177524,Top 50 - United Kingdom,0.578,-3.929,0.114,mixed_pop
3,2KukL7UlQ8TdvpaA7bY3ZJ,BREAK MY SOUL,Beyoncé,dance pop,2022,False,90,278281,Top 50 - United Kingdom,0.687,-5.04,0.0826,mixed_pop
4,6xGruZOHLs39ZbVccQTuPZ,Glimpse of Us,Joji,alternative r&b,2022,False,98,233456,Top 50 - United Kingdom,0.44,-9.258,0.0531,mixed_pop
5,1PckUlxKqWQs3RlWXVBLw3,About Damn Time,Lizzo,dance pop,2022,True,95,191822,Top 50 - United Kingdom,0.836,-6.305,0.0656,mixed_pop
6,02MWAaffLxlfxAUY7c5dvx,Heat Waves,Glass Animals,gauze pop,2020,False,91,238805,Top 50 - United Kingdom,0.761,-6.9,0.0944,mixed_pop
7,0oiv4E896TUTTeQU0cmIui,Massive,Drake,canadian hip hop,2022,False,79,336924,Top 50 - United Kingdom,0.499,-6.774,0.0561,mixed_pop
8,4N5s8lPTsjI9EGP7K4SXzB,Green Green Grass,George Ezra,folk-pop,2022,False,69,167613,Top 50 - United Kingdom,0.685,-4.413,0.0595,mixed_pop
9,1qEmFfgcLObUfQm0j1W2CK,Late Night Talking,Harry Styles,pop,2022,False,95,177954,Top 50 - United Kingdom,0.714,-4.595,0.0468,mixed_pop


In [112]:
# We could filter again to then see which songs in the top 50 are explicit

filter_explicit = songs_top_50['explicit'] == True
songs_top_50[filter_explicit]

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
5,1PckUlxKqWQs3RlWXVBLw3,About Damn Time,Lizzo,dance pop,2022,True,95,191822,Top 50 - United Kingdom,0.836,-6.305,0.0656,mixed_pop
10,3F5CgOj3wFlRv51JsHbxhe,Jimmy Cooks (feat. 21 Savage),Drake,canadian hip hop,2022,True,92,218364,Top 50 - United Kingdom,0.529,-4.711,0.175,mixed_pop
11,7u3w4fQhulC1etJbdfmv3Q,IFTK,Tion Wayne,london rap,2022,True,79,190684,Top 50 - United Kingdom,0.614,-4.32,0.226,mixed_pop
13,59nOXPmaKlBfGMDeOVGrIK,WAIT FOR U (feat. Drake & Tems),Future,atl hip hop,2022,True,91,189893,Top 50 - United Kingdom,0.463,-4.474,0.34,mixed_pop
16,531KGXtBroSrOX9LVmiIgc,Starlight,Dave,uk hip hop,2022,True,83,211935,Top 50 - United Kingdom,0.954,-9.551,0.288,mixed_pop
17,5rF6YUIlgiat22OT1lWspJ,Seventeen Going Under,Sam Fender,modern rock,2021,True,75,297933,Top 50 - United Kingdom,0.48,-4.792,0.0362,mixed_pop
18,0wHFktze2PHC5jDt3B17DC,First Class,Jack Harlow,deep underground hip hop,2022,True,85,173947,Top 50 - United Kingdom,0.902,-5.902,0.109,mixed_pop
25,3pudQCMnsFGwOElTZmuml8,Baby,Aitch,manchester hip hop,2022,True,78,177733,Top 50 - United Kingdom,0.769,-5.772,0.214,mixed_pop
30,7kjANxR8XN4hCzLaSc2roy,go - goddard. Remix,Cat Burns,pop,2022,True,78,192514,Top 50 - United Kingdom,0.728,-6.601,0.13,mixed_pop
32,7fYRg3CEbk6rNCuzNzMT06,Potion (with Dua Lipa & Young Thug),Calvin Harris,dance pop,2022,True,89,214459,Top 50 - United Kingdom,0.824,-4.869,0.0473,mixed_pop


In [113]:
# Which songs in the top 50 were released before 2020?

filter_pre_2020 = songs_top_50['release_year'] < 2020
songs_top_50[filter_pre_2020]

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
14,2MuWTIM3b0YEAskbeeFE1i,Master Of Puppets,Metallica,hard rock,1986,False,78,515386,Top 50 - United Kingdom,0.543,-9.11,0.0353,mixed_pop
22,003vvx7Niy0yvhvHt4a68B,Mr. Brightside,The Killers,alternative rock,2004,False,85,222973,Top 50 - United Kingdom,0.352,-5.23,0.0747,mixed_pop
35,7ef4DlsgrMEH11cDZd32M6,One Kiss (with Dua Lipa),Calvin Harris,dance pop,2018,False,91,214846,Top 50 - United Kingdom,0.791,-3.24,0.11,mixed_pop
45,7jtQIBanIiJOMS6RyCx6jZ,Another Love,Tom Odell,chill pop,2013,True,55,244360,Top 50 - United Kingdom,0.442,-8.55,0.0451,mixed_pop
47,58ge6dfP91o9oXMzq3XkIS,505,Arctic Monkeys,garage rock,2007,False,82,253586,Top 50 - United Kingdom,0.52,-5.866,0.0543,mixed_pop
49,4RvWPyQ5RL0ao9LPZeSouE,Everybody Wants To Rule The World,Tears For Fears,new romantic,1985,False,86,251488,Top 50 - United Kingdom,0.645,-12.095,0.0527,mixed_pop


In [114]:
# in release year order...

songs_top_50[filter_pre_2020].sort_values('release_year')

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
49,4RvWPyQ5RL0ao9LPZeSouE,Everybody Wants To Rule The World,Tears For Fears,new romantic,1985,False,86,251488,Top 50 - United Kingdom,0.645,-12.095,0.0527,mixed_pop
14,2MuWTIM3b0YEAskbeeFE1i,Master Of Puppets,Metallica,hard rock,1986,False,78,515386,Top 50 - United Kingdom,0.543,-9.11,0.0353,mixed_pop
22,003vvx7Niy0yvhvHt4a68B,Mr. Brightside,The Killers,alternative rock,2004,False,85,222973,Top 50 - United Kingdom,0.352,-5.23,0.0747,mixed_pop
47,58ge6dfP91o9oXMzq3XkIS,505,Arctic Monkeys,garage rock,2007,False,82,253586,Top 50 - United Kingdom,0.52,-5.866,0.0543,mixed_pop
45,7jtQIBanIiJOMS6RyCx6jZ,Another Love,Tom Odell,chill pop,2013,True,55,244360,Top 50 - United Kingdom,0.442,-8.55,0.0451,mixed_pop
35,7ef4DlsgrMEH11cDZd32M6,One Kiss (with Dua Lipa),Calvin Harris,dance pop,2018,False,91,214846,Top 50 - United Kingdom,0.791,-3.24,0.11,mixed_pop


In [117]:
# Top 50 songs with a popularity over 90, sorted by popularity.
# The ascending keyword in .sort_values() controls whether the lowest (default) or highest values come first.

filter_most_popular = songs_top_50['popularity'] > 90

songs_top_50[filter_most_popular].sort_values('popularity', ascending=False)


Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
4,6xGruZOHLs39ZbVccQTuPZ,Glimpse of Us,Joji,alternative r&b,2022,False,98,233456,Top 50 - United Kingdom,0.44,-9.258,0.0531,mixed_pop
0,75FEaRjZTKLhTrFGsfMUXR,Running Up That Hill (A Deal With God),Kate Bush,art pop,1985,False,96,298933,Top 50 - United Kingdom,0.629,-13.123,,mixed_pop
5,1PckUlxKqWQs3RlWXVBLw3,About Damn Time,Lizzo,dance pop,2022,True,95,191822,Top 50 - United Kingdom,0.836,-6.305,0.0656,mixed_pop
9,1qEmFfgcLObUfQm0j1W2CK,Late Night Talking,Harry Styles,pop,2022,False,95,177954,Top 50 - United Kingdom,0.714,-4.595,0.0468,mixed_pop
1,4Dvkj6JhhA12EX05fT7y2e,As It Was,Harry Styles,pop,2022,False,94,167303,Top 50 - United Kingdom,0.52,-5.338,0.0557,mixed_pop
10,3F5CgOj3wFlRv51JsHbxhe,Jimmy Cooks (feat. 21 Savage),Drake,canadian hip hop,2022,True,92,218364,Top 50 - United Kingdom,0.529,-4.711,0.175,mixed_pop
31,3uUuGVFu1V7jTQL60S1r8z,Where Are You Now,Lost Frequencies,belgian edm,2021,False,92,148197,Top 50 - United Kingdom,0.671,-8.117,0.103,mixed_pop
33,0O6u0VJ46W86TxN9wgyqDj,I Like You (A Happier Song) (with Doja Cat),Post Malone,dfw rap,2022,True,92,192840,Top 50 - United Kingdom,0.733,-6.009,0.0751,mixed_pop
6,02MWAaffLxlfxAUY7c5dvx,Heat Waves,Glass Animals,gauze pop,2020,False,91,238805,Top 50 - United Kingdom,0.761,-6.9,0.0944,mixed_pop
12,5LYMamLv12UPbemOaTPyeV,Music For a Sushi Restaurant,Harry Styles,pop,2022,False,91,193813,Top 50 - United Kingdom,0.72,-4.652,0.04,mixed_pop


[Click here for more information on Spotify's popularity metric.](https://hexdocs.pm/spotify_web_api/Spotify.Tracks.html#t:popularity/0)

We broke the process down into stages by...
- First creating a separate `songs_top_50` variable by selecting all the rows in the appropriate playlist.
- Then creating filter variables before applying them to our new pre-filtered dataset.

Pandas also lets us combine these stages into a single command by joining filters with `&`. Each condition must be wrapped in its own parentheses. However, always aim for clarity and readability above complex solutions.

In [119]:
# Explicit Top 50 playlist songs with a popularity over 90, sorted by release year

songs_df[
    (songs_df['playlist_name'] == 'Top 50 - United Kingdom') &
    (songs_df['explicit'] == True) &
    (songs_df['popularity'] > 90)
].sort_values('release_year')

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
38,4ZtFanR9U6ndgddUvNcjcG,good 4 u,Olivia Rodrigo,pop,2021,True,91,178146,Top 50 - United Kingdom,0.563,-5.044,0.154,mixed_pop
5,1PckUlxKqWQs3RlWXVBLw3,About Damn Time,Lizzo,dance pop,2022,True,95,191822,Top 50 - United Kingdom,0.836,-6.305,0.0656,mixed_pop
10,3F5CgOj3wFlRv51JsHbxhe,Jimmy Cooks (feat. 21 Savage),Drake,canadian hip hop,2022,True,92,218364,Top 50 - United Kingdom,0.529,-4.711,0.175,mixed_pop
13,59nOXPmaKlBfGMDeOVGrIK,WAIT FOR U (feat. Drake & Tems),Future,atl hip hop,2022,True,91,189893,Top 50 - United Kingdom,0.463,-4.474,0.34,mixed_pop
33,0O6u0VJ46W86TxN9wgyqDj,I Like You (A Happier Song) (with Doja Cat),Post Malone,dfw rap,2022,True,92,192840,Top 50 - United Kingdom,0.733,-6.009,0.0751,mixed_pop
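
As a check that the staged and combined approaches agree, the sketch below applies both to a four-row toy frame (values copied from rows shown earlier; `toy_df`, `stepwise` and `combined` are assumed names). The parentheses matter because Python evaluates `&` before comparisons such as `==` and `>`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'track_name': ['About Damn Time', 'As It Was', 'Starlight', 'Mood (feat. iann dior)'],
    'playlist_name': ['Top 50 - United Kingdom', 'Top 50 - United Kingdom',
                      'Top 50 - United Kingdom', 'Every UK Number One: 2022'],
    'explicit': [True, False, True, True],
    'popularity': [95, 94, 83, 32],
})

# Stage by stage: each filter is built from, and applied to, the previous result.
top_50 = toy_df[toy_df['playlist_name'] == 'Top 50 - United Kingdom']
explicit_top_50 = top_50[top_50['explicit'] == True]
stepwise = explicit_top_50[explicit_top_50['popularity'] > 90]

# All at once: every condition refers to the original frame and sits in its own parentheses.
combined = toy_df[
    (toy_df['playlist_name'] == 'Top 50 - United Kingdom') &
    (toy_df['explicit'] == True) &
    (toy_df['popularity'] > 90)
]

print(stepwise.equals(combined))  # True
```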


## Exercise 4
Which song on Spotify's playlist 'All Out 50s' is the most popular song amongst Spotify listeners today?

1. Filter `songs_df` to only include tracks on the playlist `All Out 50s`.
2. Sort the filtered dataset by popularity, so that the most popular song is at the top.

In [131]:
# Write your code for the exercise here










## Summarising Data

We will learn much more about Pandas' ability to manipulate and aggregate data throughout the course. However, here is a small taste of its capabilities.

In [132]:
# Value counts provides a quick summary of how many times each value appears in a Series

songs_df['playlist_name'].value_counts()

All Out 80s                        100
Every Official UK Number 1 Ever    100
All Out 70s                        100
All Out 90s                        100
All Out 60s                        100
All Out 2000s                      100
All Out 2010s                      100
All Out 50s                        100
The Pop List                        80
Hit Rewind                          75
Alt. Pop.                           50
Cheesy Hits!                        50
Today's Top Hits                    50
Top 50 - United Kingdom             50
Every UK Number One: 2022           32
Name: playlist_name, dtype: int64

In [153]:
# How many times does an artist appear in the dataset?
songs_df['artists'].value_counts().head(10)

Ed Sheeran       20
Elton John       17
Taylor Swift     15
One Direction    13
Harry Styles     13
Lady Gaga        11
Shawn Mendes     11
Justin Bieber    11
Ariana Grande    10
Drake            10
Name: artists, dtype: int64
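
Counts are easier to compare when expressed as shares of the total. A minimal sketch on a toy Series (`playlists`, `counts` and `shares` are assumed names), using the `normalize` keyword of `.value_counts()`:

```python
import pandas as pd

playlists = pd.Series(['All Out 50s', 'All Out 50s', 'All Out 50s', 'Alt. Pop.'])

counts = playlists.value_counts()                # how many times each value appears
shares = playlists.value_counts(normalize=True)  # the same, as a fraction of all rows

print(counts)
print(shares)
```

The fractions always add up to 1, so multiplying by 100 gives percentages.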

In [142]:
# What is the average popularity score for our particular dataset?
# For Spotify overall, 50 is average
songs_df['popularity'].mean()


58.64448188711036

### Grouping
Pandas `.groupby()` is an incredibly powerful feature that allows us to ask complex questions of our data.

In [145]:
# What is the average popularity score of the tracks on each playlist?

songs_df.groupby('playlist_name')['popularity'].mean().sort_values(ascending=False)

playlist_name
Today's Top Hits                   85.8200
Top 50 - United Kingdom            85.0800
The Pop List                       73.4375
All Out 2000s                      65.7000
Hit Rewind                         65.0000
All Out 70s                        63.6000
Every UK Number One: 2022          61.3125
All Out 2010s                      59.5400
All Out 90s                        59.0600
All Out 80s                        58.7700
Every Official UK Number 1 Ever    50.4500
All Out 60s                        48.1700
Alt. Pop.                          46.9800
All Out 50s                        37.5600
Cheesy Hits!                       34.4000
Name: popularity, dtype: float64
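
One way to read `.groupby()` is as "filter once per group, then summarise each piece". The sketch below, on a four-row toy frame (`toy_df`, `grouped` and `manual` are assumed names), compares the grouped mean with the same figure worked out by filtering by hand.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'playlist_name': ['All Out 50s', 'All Out 50s',
                      'Top 50 - United Kingdom', 'Top 50 - United Kingdom'],
    'popularity': [57, 53, 94, 98],
})

# One mean per playlist, calculated in a single step.
grouped = toy_df.groupby('playlist_name')['popularity'].mean()
print(grouped)

# The same figure for one playlist, calculated with a filter.
manual = toy_df[toy_df['playlist_name'] == 'All Out 50s']['popularity'].mean()
print(manual)  # 55.0
```

`.groupby()` saves us from writing one filter per playlist, which matters more with 15 playlists than with two.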

In [147]:
# What is the average popularity score of tracks by release year?

songs_df.groupby('release_year')['popularity'].mean().sort_values(ascending=False)

release_year
1981    76.000000
2022    74.909605
1980    73.750000
1979    73.000000
2014    72.777778
          ...    
1960    38.333333
1952    38.000000
2015    37.166667
2011    33.437500
1988    32.833333
Name: popularity, Length: 70, dtype: float64

In [152]:
# What about grouping by explicit as well as release year?
songs_df.groupby(['release_year','explicit'])['popularity'].mean().sort_index(ascending=False)

release_year  explicit
2022          True        78.222222
              False       73.455285
2021          True        66.571429
              False       59.702128
2020          True        74.000000
                            ...    
1957          False       47.571429
1956          False       50.857143
1955          False       65.000000
1954          False       65.000000
1952          False       38.000000
Name: popularity, Length: 90, dtype: float64
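
Grouping by two columns gives a result with two levels of row labels. If that is awkward to work with, `.reset_index()` turns the labels back into ordinary columns. A minimal sketch on toy data (`toy_df`, `grouped` and `flat` are assumed names):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'release_year': [2022, 2022, 2022, 2021],
    'explicit': [True, False, False, True],
    'popularity': [95, 94, 98, 75],
})

# One mean for every (release_year, explicit) combination present in the data.
grouped = toy_df.groupby(['release_year', 'explicit'])['popularity'].mean()
print(grouped.index.nlevels)  # 2 levels of row labels

# Turn the two label levels back into regular columns.
flat = grouped.reset_index()
print(flat)
```

The flattened result is a normal DataFrame, so it can be filtered and sorted in the same way as `songs_df`.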