# Data Wrangling with Pandas
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/python_pandas.jpg?raw=true" align="right">


- A major part of computational social science is the storing, manipulation and reporting of data. 
- Pandas is a powerful data management library specifically built for these kinds of tasks.
- It can handle very large amounts of data whilst remaining quick and responsive.

- We will be using Pandas throughout our practical sessions as a general purpose data management tool, but this week we will focus on learning its core features.

[__Pandas Documentation__](http://pandas.pydata.org/pandas-docs/stable/)

The first thing we need to do is `import` the pandas library. This ensures it is available for us to use in this environment.

In [14]:
# Here we import the `pandas` module. We could simply use `import pandas`, but `as` lets us refer to it by a shorter name.
# By convention, many modules are imported under short names like this; `pd` is the usual one for pandas.

import pandas as pd



## Loading the Data

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/spotify.png?raw=true" align="right" width=150>

Today we will be using data gathered from Spotify, the popular music streaming service. Spotify provides access to some of its data through its public API. This data has been collected and pre-prepared by your instructors.


In [15]:
filename = 'test_spotify_top_songs.csv'
songs_df = pd.read_csv(filename)
type(songs_df)


pandas.core.frame.DataFrame
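
As a minimal sketch of what `pd.read_csv` does, the example below reads a tiny CSV held in memory instead of a file on disk. The three rows are copied from the Spotify data; the names `csv_text` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# A tiny, hypothetical extract using a few columns from the Spotify dataset.
csv_text = """track_name,artists,release_year,popularity
Strangers,Kenya Grace,2023,97
Paint The Town Red,Doja Cat,2023,87
Cruel Summer,Taylor Swift,2019,99
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))  # the same DataFrame class as songs_df
print(sample_df.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and each later line becomes one row.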

In [16]:
songs_df

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,5mjYQaktjmjcMKcUIcqz4s,Strangers,Kenya Grace,singer-songwriter pop,2023,False,97,172964,Top 50 - United Kingdom,0.628,-8.307,,mixed_pop
1,56y1jOTK0XSvJzVv9vHQBK,Paint The Town Red,Doja Cat,dance pop,2023,True,87,230480,Top 50 - United Kingdom,0.864,-7.683,0.1940,mixed_pop
2,1reEeZH9wNt4z1ePYLyC7p,greedy,Tate McRae,alt z,2023,True,56,131872,Top 50 - United Kingdom,0.750,-3.190,0.0322,mixed_pop
3,59NraMJsLaMCVtwXTSia8i,Prada,cassö,***OOPS!***,2023,True,94,132359,Top 50 - United Kingdom,0.638,-5.804,0.0375,mixed_pop
4,2FDTHlrBguDzQkp7PVj16Q,Sprinter,Dave,uk hip hop,2023,True,94,229133,Top 50 - United Kingdom,0.916,-8.067,0.2410,mixed_pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1275,07GtDOCxmye5KDWsTSACPk,Chantilly Lace,The Big Bopper,doo-wop,1958,False,57,145266,All Out 50s,0.489,-6.054,0.0858,all_out_decades
1276,3SQhmctWreNM0X6Zkm2K5R,Maybellene,Chuck Berry,blues,1959,False,57,143240,All Out 50s,0.756,-10.701,0.1120,all_out_decades
1277,7ycFNferNuk6wAjBa0vWvl,Little Darlin',The Diamonds,rhythm and blues,1996,False,57,129333,All Out 50s,0.631,-11.402,0.0420,all_out_decades
1278,1xVOttVNT27FBTD8iHjOfU,It's Only Make Believe,Conway Twitty,arkansas country,1959,False,57,132026,All Out 50s,0.461,-9.627,0.0598,all_out_decades


In [17]:
# .head() shows us the top 5 rows
songs_df.head()

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,5mjYQaktjmjcMKcUIcqz4s,Strangers,Kenya Grace,singer-songwriter pop,2023,False,97,172964,Top 50 - United Kingdom,0.628,-8.307,,mixed_pop
1,56y1jOTK0XSvJzVv9vHQBK,Paint The Town Red,Doja Cat,dance pop,2023,True,87,230480,Top 50 - United Kingdom,0.864,-7.683,0.194,mixed_pop
2,1reEeZH9wNt4z1ePYLyC7p,greedy,Tate McRae,alt z,2023,True,56,131872,Top 50 - United Kingdom,0.75,-3.19,0.0322,mixed_pop
3,59NraMJsLaMCVtwXTSia8i,Prada,cassö,***OOPS!***,2023,True,94,132359,Top 50 - United Kingdom,0.638,-5.804,0.0375,mixed_pop
4,2FDTHlrBguDzQkp7PVj16Q,Sprinter,Dave,uk hip hop,2023,True,94,229133,Top 50 - United Kingdom,0.916,-8.067,0.241,mixed_pop


In [18]:
# .tail() shows us the last 5 rows

songs_df.tail()

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
1275,07GtDOCxmye5KDWsTSACPk,Chantilly Lace,The Big Bopper,doo-wop,1958,False,57,145266,All Out 50s,0.489,-6.054,0.0858,all_out_decades
1276,3SQhmctWreNM0X6Zkm2K5R,Maybellene,Chuck Berry,blues,1959,False,57,143240,All Out 50s,0.756,-10.701,0.112,all_out_decades
1277,7ycFNferNuk6wAjBa0vWvl,Little Darlin',The Diamonds,rhythm and blues,1996,False,57,129333,All Out 50s,0.631,-11.402,0.042,all_out_decades
1278,1xVOttVNT27FBTD8iHjOfU,It's Only Make Believe,Conway Twitty,arkansas country,1959,False,57,132026,All Out 50s,0.461,-9.627,0.0598,all_out_decades
1279,3nUrhP3KuK4R1qdxRk2Kgo,Stupid Cupid,Connie Francis,adult standards,2005,False,57,133746,All Out 50s,0.609,-4.739,0.0389,all_out_decades


In [19]:
# You can specify the number of rows to return

songs_df.head(10)

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,5mjYQaktjmjcMKcUIcqz4s,Strangers,Kenya Grace,singer-songwriter pop,2023,False,97,172964,Top 50 - United Kingdom,0.628,-8.307,,mixed_pop
1,56y1jOTK0XSvJzVv9vHQBK,Paint The Town Red,Doja Cat,dance pop,2023,True,87,230480,Top 50 - United Kingdom,0.864,-7.683,0.194,mixed_pop
2,1reEeZH9wNt4z1ePYLyC7p,greedy,Tate McRae,alt z,2023,True,56,131872,Top 50 - United Kingdom,0.75,-3.19,0.0322,mixed_pop
3,59NraMJsLaMCVtwXTSia8i,Prada,cassö,***OOPS!***,2023,True,94,132359,Top 50 - United Kingdom,0.638,-5.804,0.0375,mixed_pop
4,2FDTHlrBguDzQkp7PVj16Q,Sprinter,Dave,uk hip hop,2023,True,94,229133,Top 50 - United Kingdom,0.916,-8.067,0.241,mixed_pop
5,1BxfuPKGuaTgP7aM0Bbdwr,Cruel Summer,Taylor Swift,pop,2019,False,99,178426,Top 50 - United Kingdom,0.552,-5.707,0.157,mixed_pop
6,5aIVCx5tnk0ntmdiinnYvw,Water,Tyla,***OOPS!***,2023,False,91,200255,Top 50 - United Kingdom,0.673,-3.495,0.0755,mixed_pop
7,1kuGVB7EU95pJObxwvfwKS,vampire,Olivia Rodrigo,pop,2023,True,95,219724,Top 50 - United Kingdom,0.511,-5.745,0.0578,mixed_pop
8,3vkCueOmm7xQDoJ17W1Pm3,My Love Mine All Mine,Mitski,brooklyn indie,2023,False,93,137773,Top 50 - United Kingdom,0.504,-14.958,0.0321,mixed_pop
9,2ZWmmrWUgDBcPSLihBMvhg,"Baddadan (feat. IRAH, Flowdan, Trigga & Takura)",Chase & Status,dancefloor dnb,2023,False,81,177291,Top 50 - United Kingdom,0.62,-0.504,0.252,mixed_pop
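
The behaviour of `.head()` and `.tail()` can be sketched on a small made-up table. `toy_df` and its `track_number` column are hypothetical and only there to make the row counts easy to check.

```python
import pandas as pd

# A made-up DataFrame with 20 rows, numbered 1 to 20.
toy_df = pd.DataFrame({"track_number": range(1, 21)})

first_three = toy_df.head(3)        # the first 3 rows
last_three = toy_df.tail(3)         # the last 3 rows
all_but_last_two = toy_df.head(-2)  # a negative number drops rows from the end

print(first_three)
print(last_three)
print(len(all_but_last_two))
```

Neither method changes `toy_df` itself; each returns a smaller DataFrame that you can inspect or assign to a variable.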


The `.info()` method gives us an overview of our DataFrame, including...
- A summary of the index labels
- Information about the columns
- A 'Non-Null' count, i.e. how many 'cells' in the column have a value in them.
- The type (Dtype) of values that column holds.
    - Integer (int64) - e.g. 5
    - Float (float64) - e.g. 5.3
    - Boolean (bool) - e.g. True / False
    - Other (object) - Usually a string, but can also be any Python object e.g. lists, dictionaries, classes.
- A summary of how much computer memory the data needs.

In [20]:
songs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1280 entries, 0 to 1279
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       1280 non-null   object 
 1   track_name     1280 non-null   object 
 2   artists        1280 non-null   object 
 3   genre          1280 non-null   object 
 4   release_year   1280 non-null   int64  
 5   explicit       1280 non-null   bool   
 6   popularity     1280 non-null   int64  
 7   duration_ms    1280 non-null   int64  
 8   playlist_name  1280 non-null   object 
 9   danceability   1280 non-null   float64
 10  loudness       1280 non-null   float64
 11  speechiness    1279 non-null   float64
 12  playlist_type  1280 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(6)
memory usage: 121.4+ KB
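
In the output above, `speechiness` has 1279 non-null values out of 1280 rows, so one cell is empty. As a rough sketch, the same kind of count can be produced directly with `.isna()` and `.count()`. The small `toy_df` below is hypothetical, built from the first three rows of the data.

```python
import numpy as np
import pandas as pd

# Three rows from the Spotify data; the first has no speechiness value.
toy_df = pd.DataFrame({
    "track_name": ["Strangers", "Paint The Town Red", "greedy"],
    "popularity": [97, 87, 56],
    "speechiness": [np.nan, 0.194, 0.0322],
})

# .isna() marks each empty cell as True; .sum() counts the Trues per column.
missing_per_column = toy_df.isna().sum()

# .count() gives the 'Non-Null Count' that .info() reports.
non_null_per_column = toy_df.count()

print(toy_df.dtypes)
print(missing_per_column)
print(non_null_per_column)
```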


We can also get some of this information separately, for example the row labels and the column names. Sometimes this is necessary if there are more columns than `.info()` wants to show.

In [7]:
songs_df.index

RangeIndex(start=0, stop=1280, step=1)

In [42]:
songs_df.columns

Index(['track_id', 'track_name', 'artists', 'genre', 'release_year',
       'explicit', 'popularity', 'duration_ms', 'playlist_name',
       'danceability', 'loudness', 'speechiness', 'playlist_type'],
      dtype='object')
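
If you want the row and column information as ordinary Python values, one possible approach is sketched below. `toy_df` is a made-up two-row table; `.shape` and `.tolist()` are standard pandas features that are not otherwise covered in this section.

```python
import pandas as pd

# A made-up DataFrame with two rows and two columns.
toy_df = pd.DataFrame({
    "track_name": ["Strangers", "greedy"],
    "artists": ["Kenya Grace", "Tate McRae"],
})

n_rows, n_cols = toy_df.shape            # (rows, columns) as a tuple
column_names = toy_df.columns.tolist()   # column names as a plain list
row_labels = toy_df.index.tolist()       # row labels as a plain list

print(n_rows, n_cols)
print(column_names)
print(row_labels)
```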

## Accessing columns, rows and cells
Being able to tell Pandas to provide us specific columns, specific rows, or even specific cells can be important when exploring data.

In [9]:
# We can access a single column by 
# providing the name in square brackets.
songs_df['track_name']

0                    Strangers
1           Paint The Town Red
2                       greedy
3                        Prada
4                     Sprinter
                 ...          
1275            Chantilly Lace
1276                Maybellene
1277            Little Darlin'
1278    It's Only Make Believe
1279              Stupid Cupid
Name: track_name, Length: 1280, dtype: object

In [22]:
# or multiple names in a list if we want a few columns

songs_df[['track_name','artists']]

Unnamed: 0,track_name,artists
0,Strangers,Kenya Grace
1,Paint The Town Red,Doja Cat
2,greedy,Tate McRae
3,Prada,cassö
4,Sprinter,Dave
...,...,...
1275,Chantilly Lace,The Big Bopper
1276,Maybellene,Chuck Berry
1277,Little Darlin',The Diamonds
1278,It's Only Make Believe,Conway Twitty


In [23]:
# We can also assign this to a new variable if we want quick access.
# Note: selecting a list of columns like this gives us a new DataFrame.
# If you plan to modify it, add .copy() to make it clearly independent of the original.

songs_track_artist = songs_df[['track_name','artists']]
songs_track_artist.head()

Unnamed: 0,track_name,artists
0,Strangers,Kenya Grace
1,Paint The Town Red,Doja Cat
2,greedy,Tate McRae
3,Prada,cassö
4,Sprinter,Dave
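
The difference between selecting with a single name and with a list of names can be sketched on a small made-up table. The names `toy_df`, `one_column`, `as_frame` and `subset` are hypothetical. Adding `.copy()` is a common way to make sure later changes to a subset cannot affect the original data.

```python
import pandas as pd

# A made-up DataFrame using two rows from the Spotify data.
toy_df = pd.DataFrame({
    "track_name": ["Strangers", "greedy"],
    "artists": ["Kenya Grace", "Tate McRae"],
    "popularity": [97, 56],
})

one_column = toy_df["track_name"]   # a single name gives a Series
as_frame = toy_df[["track_name"]]   # a list, even of one name, gives a DataFrame

# An explicit, independent copy of two columns.
subset = toy_df[["track_name", "popularity"]].copy()
subset["popularity"] = 0            # changes the copy only

print(type(one_column))
print(type(as_frame))
print(toy_df["popularity"].tolist())  # the original values are untouched
```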



We can access specific rows...
- By referring to their row label - `.loc`
- By referring to their row position (integer index) - `.iloc`


In [24]:
songs_df.loc[1]

track_id          56y1jOTK0XSvJzVv9vHQBK
track_name            Paint The Town Red
artists                         Doja Cat
genre                          dance pop
release_year                        2023
explicit                            True
popularity                            87
duration_ms                       230480
playlist_name    Top 50 - United Kingdom
danceability                       0.864
loudness                          -7.683
speechiness                        0.194
playlist_type                  mixed_pop
Name: 1, dtype: object

At the moment, row labels and row indexes (the row's position in the data) are the same by default. However, suppose we change the order by sorting the data by year.

In [30]:
by_year = songs_df.sort_values('release_year')
by_year

Unnamed: 0,track_id,track_name,artists,genre,release_year,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
1211,4I4aQGNJ2HufloNtB65nxR,That's Amore,Dean Martin,adult standards,1954,False,64,190400,All Out 50s,0.471,-13.600,0.0309,all_out_decades
1182,648TTtYB0bH0P8Hfy0FmkL,Unforgettable,Nat King Cole,adult standards,1954,False,71,191973,All Out 50s,0.349,-13.507,0.0310,all_out_decades
1257,7GnMzVWOHLBPcfco4L1GtE,Earth Angel,The Penguins,doo-wop,1954,False,58,171066,All Out 50s,0.487,-11.121,0.0271,all_out_decades
1259,0x0ffSAP6PkdoDgHOfroof,My Funny Valentine - Remastered,Frank Sinatra,adult standards,1954,False,58,150666,All Out 50s,0.257,-14.267,0.0332,all_out_decades
1204,1uRKT2LRANv4baowBWHfDS,(We're Gonna) Rock Around The Clock,Bill Haley & His Comets,rock-and-roll,1955,False,66,129893,All Out 50s,0.811,-6.317,0.1680,all_out_decades
...,...,...,...,...,...,...,...,...,...,...,...,...,...
254,65leXqfkdViSssEVN23uYL,Mama's Boy,Dominic Fike,alternative pop rock,2023,False,81,155721,alt/pop,0.807,-7.554,0.0321,mixed_pop
255,0LzidBf7cUsnZnG34OUPSF,Mosquito,PinkPantheress,bedroom pop,2023,False,76,146240,alt/pop,0.710,-4.032,0.1180,mixed_pop
256,2SXx7Ofa79CeJfio98aJcG,Eat Your Young,Hozier,irish singer-songwriter,2023,False,73,243946,alt/pop,0.546,-6.442,0.0286,mixed_pop
258,26vDr5jgWQoJOTH4Bu3KCQ,WORTHLESS,d4vd,bedroom pop,2023,False,69,163049,alt/pop,0.644,-6.299,0.0545,mixed_pop


To access the very first row of this dataset we either need to know the row label...

In [31]:
by_year.loc[1211]

track_id         4I4aQGNJ2HufloNtB65nxR
track_name                 That's Amore
artists                     Dean Martin
genre                   adult standards
release_year                       1954
explicit                          False
popularity                           64
duration_ms                      190400
playlist_name               All Out 50s
danceability                      0.471
loudness                          -13.6
speechiness                      0.0309
playlist_type           all_out_decades
Name: 1211, dtype: object

Or simply ask for the first row by index location like we would a list.

In [32]:
by_year.iloc[0]

track_id         4I4aQGNJ2HufloNtB65nxR
track_name                 That's Amore
artists                     Dean Martin
genre                   adult standards
release_year                       1954
explicit                          False
popularity                           64
duration_ms                      190400
playlist_name               All Out 50s
danceability                      0.471
loudness                          -13.6
speechiness                      0.0309
playlist_type           all_out_decades
Name: 1211, dtype: object

Or the very last row, again like a list.

In [33]:
by_year.iloc[-1]

track_id          5mjYQaktjmjcMKcUIcqz4s
track_name                     Strangers
artists                      Kenya Grace
genre              singer-songwriter pop
release_year                        2023
explicit                           False
popularity                            97
duration_ms                       172964
playlist_name    Top 50 - United Kingdom
danceability                       0.628
loudness                          -8.307
speechiness                          NaN
playlist_type                  mixed_pop
Name: 0, dtype: object

The index labels given to rows stick to them like an ID, whereas a row's index position can change.

| index | label | name   |
|-------|-------|--------|
| 0     | 0     | Arthur |
| 1     | 1     | Betty  |
| 2     | 2     | Carole |

After reversing the order of `name`:

| index | label | name |
|-------|-------|----------|
| 0     | 2     | Carole   |
| 1     | 1     | Betty    |
| 2     | 0     | Arthur   |
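
The two tables above can be reproduced with a short sketch. The DataFrame `people` is made up to match them, and reversing is done here by sorting `name` in descending order, which is one of several ways to get that result.

```python
import pandas as pd

# Labels 0, 1, 2 are assigned automatically when the DataFrame is created.
people = pd.DataFrame({"name": ["Arthur", "Betty", "Carole"]})

# Reverse the order of the rows by sorting names from Z to A.
reversed_people = people.sort_values("name", ascending=False)

by_label = reversed_people.loc[0, "name"]        # label 0 still belongs to Arthur
by_position = reversed_people["name"].iloc[0]    # position 0 is now Carole

print(reversed_people)
print(by_label, by_position)
```

The labels travelled with their rows, so `.loc[0]` and `.iloc[0]` now return different people.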

Finally we can access specific cells, or collections of cells using `.loc` and `.iloc` as well.

In [41]:
# Row named 2, column artists
songs_df.loc[2,'artists']

'Tate McRae'

In [36]:
# Rows named 2 to 4 (both included), columns artists and track_name
songs_df.loc[2:4,['artists','track_name']]

Unnamed: 0,artists,track_name
2,Tate McRae,greedy
3,cassö,Prada
4,Dave,Sprinter


In [40]:
# Rows at position -5 up to (but not including) -1, second column
songs_df.iloc[-5:-1, 1 ]

1275            Chantilly Lace
1276                Maybellene
1277            Little Darlin'
1278    It's Only Make Believe
Name: track_name, dtype: object
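
One detail worth noticing in the slices above is that `.loc` includes the end label, while `.iloc` stops before the end position, just as list slicing does. A minimal sketch on a made-up five-row table (`toy_df` is hypothetical, using track names from the data):

```python
import pandas as pd

# Five rows, labelled 0 to 4 by default.
toy_df = pd.DataFrame({
    "track_name": ["Strangers", "Paint The Town Red", "greedy", "Prada", "Sprinter"],
})

label_slice = toy_df.loc[2:4, "track_name"]   # labels 2, 3 and 4: three rows
position_slice = toy_df.iloc[2:4, 0]          # positions 2 and 3: two rows
to_the_end = toy_df.iloc[-2:, 0]              # leave the end blank to reach the last row

print(label_slice.tolist())
print(position_slice.tolist())
print(to_the_end.tolist())
```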

Often, if you are accessing specific rows, you'll use `.loc`. It is still helpful to know how it differs from `.iloc`, because sometimes you won't know the row labels but you will know the position.