# SQL Walkthrough Using Spotify Data

## The Data
The data comes from Yamac Eren Ay on Kaggle:
https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks

There are two CSV files, artists and tracks. Before they can be loaded into a database, a little preprocessing is needed: changing datatypes and filtering the rows. To reduce the file sizes, I keep only artists with at least 5000 followers, and tracks released since 2011 with a popularity of at least 50. Additionally, both files have columns that contain lists. Relational databases don't work well with list-valued columns, so these are expanded into their own tables, each with a many-to-one relationship to the original table.

### Load the data

In [1]:
import pandas as pd
import os

In [2]:
directory = os.getcwd()

artists_f = os.path.join(directory, 'Data', 'artists.csv')
tracks_f = os.path.join(directory, 'Data', 'tracks.csv')

# eval parses each cell into its corresponding Python object (here, a list) instead of leaving it as a string
artists = pd.read_csv(artists_f, converters={'genres': eval})
tracks = pd.read_csv(tracks_f, converters={'artists': eval, 'id_artists': eval})
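
Because eval executes whatever text it is given, a more cautious option for files from the internet is ast.literal_eval, which only accepts Python literals. The sketch below is an assumption about how it could be swapped in, not what the notebook ran; parse_list is a hypothetical helper name.

```python
import ast

def parse_list(cell: str) -> list:
    """Parse a string such as "['pop', 'uk pop']" into a Python list."""
    # literal_eval only accepts Python literals and never executes code
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value

genres = parse_list("['pop', 'uk pop']")
print(genres)
print(parse_list("[]"))

# Hypothetical drop-in for the read_csv call above:
# artists = pd.read_csv(artists_f, converters={'genres': parse_list})
```

The converter is called once per cell, so it can be passed to read_csv in the same way eval was.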

In [3]:
# Reorganize the artists dataframe
artists = artists[['id', 'name', 'genres', 'followers', 'popularity']]

In [4]:
# Size of the tables
print(artists.shape, tracks.shape)

(1162095, 5) (586672, 20)


In [33]:
artists.sample(5)

Unnamed: 0,id,name,genres,followers,popularity
1149654,4OrxApfruWyJFoW7suH1Df,Luís Simões,[],8.0,13
468841,4Kh1Sq1wtjB1MiBz4qCo8u,Rock Abruham,[],629.0,21
510170,0HzHpbqQVVFQfePP5tM0SV,Nevergreen,[hungarian metal],1137.0,16
750031,1kdR0b6LW5Wc09bGyDbO0M,420Family,[],260.0,1
908968,5HHtIXKH3FKomw5CXSJC49,Traphouse,[],305.0,5


In [34]:
tracks.sample(5)

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
330245,6NdJz5wvazC9BtcL9SNSY1,Sabai Di,26,247333,0,[Pang Nakarin],[6PzfeMetkoQYaQ5GgtInPO],1995-01-01,0.693,0.219,9,-17.781,1,0.036,0.000248,0.000108,0.141,0.431,122.328,4
368564,72TFH9AiVOvADiEx9oU9gx,Gondol Şarkısı,33,270000,0,[Emre Aracı],[14rmqpY9L6DirsiSUy5J3Z],2000-04-03,0.114,0.0967,4,-23.924,0,0.0528,0.864,0.0698,0.109,0.0978,67.84,4
277956,6S5QQ4byCmcm7OcgG3vsyZ,Now,49,218697,0,[Trouble Maker],[0ztjVBmFk6OuHq6XBBwMI9],2013-10-28,0.694,0.9,2,-3.31,0,0.0393,0.39,5e-06,0.27,0.6,117.94,4
141280,2gZO3cIDZXecSd3pvRjuVS,Dönemez Ki Bana,25,226000,0,[Gönül Yazar],[0W9Yg3XLyYxcesMVrAixZo],1969-01-01,0.443,0.476,7,-8.945,0,0.0297,0.65,0.0,0.359,0.471,105.054,4
194099,2vRxqkvydU8olNd6JSe7ND,The Air That I Breathe,27,230307,0,[Julio Iglesias],[4etuCZVdP8yiNPn4xf0ie5],1984-08-24,0.488,0.501,7,-8.726,1,0.0261,0.529,4.4e-05,0.415,0.173,94.973,4


### Filter and change datatypes

In [13]:
# Filter to only select artists with 5000 or more followers
top_artists = artists[artists['followers'] >= 5000].copy()

In [14]:
# Check datatypes for top_artists
top_artists.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88609 entries, 153 to 1162081
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          88609 non-null  object 
 1   name        88609 non-null  object 
 2   genres      88609 non-null  object 
 3   followers   88609 non-null  float64
 4   popularity  88609 non-null  int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 4.1+ MB


In [15]:
# Can't have half a follower, so convert followers to int64
top_artists['followers'] = top_artists['followers'].astype("int64")

In [16]:
# Convert release_date in tracks to a datetime variable
tracks['release_date'] = pd.to_datetime(tracks['release_date'])

In [17]:
# Filter to select tracks with a popularity of 50 or more that have been released since 2011
top_tracks = tracks[(tracks['popularity'] >= 50) & (tracks['release_date'] >= '2011-01-01')].copy()

In [18]:
# Check datatypes for top_tracks - they all look good
top_tracks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41912 entries, 73439 to 586670
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id                41912 non-null  object        
 1   name              41912 non-null  object        
 2   popularity        41912 non-null  int64         
 3   duration_ms       41912 non-null  int64         
 4   explicit          41912 non-null  int64         
 5   artists           41912 non-null  object        
 6   id_artists        41912 non-null  object        
 7   release_date      41912 non-null  datetime64[ns]
 8   danceability      41912 non-null  float64       
 9   energy            41912 non-null  float64       
 10  key               41912 non-null  int64         
 11  loudness          41912 non-null  float64       
 12  mode              41912 non-null  int64         
 13  speechiness       41912 non-null  float64       
 14  acousticness     

In [19]:
top_artists.sample(5)

Unnamed: 0,id,name,genres,followers,popularity
276666,6f9HZnIVnbEfyUaHRD6lzC,Tessie,[],5217,30
235871,78GTwoOGB8pub6iMy43fYc,Lucas Cervetti,"[432hz, musica de fondo, world meditation]",11851,42
236537,0Ku8nzqtaDxHYAo2kvCh4Z,Samba De La Muerte,[french indietronica],6148,31
378781,4azjRasNxgPzCKzAz1qGLv,OSAD VRT,[vietnamese hip hop],16439,25
1149349,71Cez1b1NqsxIn5u8XNiQD,The Dartmouth Aires,"[a cappella, college a cappella]",6631,25


In [20]:
top_tracks.sample(5)

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
210843,4icWKp3UtiqeuHmWvNfhNT,Cien,52,187747,0,[CNCO],[0eecdvMrqBftK0M1VKhaF4],2016-08-26,0.606,0.63,7,-4.472,1,0.0474,0.205,0.0,0.0966,0.397,160.171,4
114010,4F9jpNQDKRFoyM4Ebpni6S,Amadeus,61,207519,0,[Family and Friends],[2AmW5LU0vqfHoN2qvghRFe],2015-07-17,0.428,0.724,11,-7.264,0,0.0434,0.183,0.000136,0.251,0.184,146.917,4
444238,3AY6H68WNMDzWw9JhPW4Jv,"Fool's Gold - Recorded At RAK Studios, London",64,204295,0,[Niall Horan],[1Hsdzj7Dlq2I7tHP7501T4],2017-12-13,0.575,0.185,6,-11.988,1,0.0337,0.938,0.0,0.107,0.594,128.074,4
92544,2OKo7g3KfmCt3kyLvUAL0g,The Search,74,248040,0,[NF],[6fOMl44jA4Sp5b9PpYCkzz],2019-05-30,0.789,0.786,2,-4.788,1,0.297,0.596,0.0,0.0997,0.39,119.957,4
487791,1Dl4Dj73BJG6pK4PHMBx1y,Rettungsschwimmer,54,267796,0,[Bosshafte Beats],[4k3MIax7tvVw8G5ixZjRvm],2014-09-20,0.664,0.824,9,-9.259,0,0.0448,0.11,0.188,0.114,0.767,95.0,4


### Create additional tables

In [21]:
# Create the artist_genre table
artist_genre = top_artists[['id', 'name', 'genres']].copy()
top_artists = top_artists.drop(columns='genres')

In [22]:
# Explode out the genre list
artist_genre = artist_genre.explode('genres')

In [23]:
artist_genre.head()

Unnamed: 0,id,name,genres
153,7frYUe4C7A42uZqCzD34Y4,Sultaan,desi pop
153,7frYUe4C7A42uZqCzD34Y4,Sultaan,punjabi hip hop
153,7frYUe4C7A42uZqCzD34Y4,Sultaan,punjabi pop
154,6acbdy69rtlv8m9EW31MYl,Phyno,afro dancehall
154,6acbdy69rtlv8m9EW31MYl,Phyno,afropop
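
To make the reshaping concrete, here is a plain-Python sketch of what explode does to a list column, using two artists from the sample outputs above. It is an illustration only (the notebook itself uses pandas); the variable names are made up.

```python
# Two artist rows with a list-valued genres column (values taken from the outputs above)
artists_rows = [
    {"id": "7frYUe4C7A42uZqCzD34Y4", "name": "Sultaan",
     "genres": ["desi pop", "punjabi hip hop", "punjabi pop"]},
    {"id": "6f9HZnIVnbEfyUaHRD6lzC", "name": "Tessie", "genres": []},
]

artist_genre_rows = []
for row in artists_rows:
    # An empty list still produces one row with a missing genre,
    # mirroring DataFrame.explode, which emits NaN for empty lists
    for genre in (row["genres"] or [None]):
        artist_genre_rows.append({"id": row["id"], "name": row["name"], "genres": genre})

for r in artist_genre_rows:
    print(r)
```

Each artist id now appears once per genre (many rows pointing at one artist), and artists without genres keep a single row with a missing value, which is why the genres column has to allow NULL in the database.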


In [24]:
top_artists.head()

Unnamed: 0,id,name,followers,popularity
153,7frYUe4C7A42uZqCzD34Y4,Sultaan,53636,53
154,6acbdy69rtlv8m9EW31MYl,Phyno,72684,51
155,72578usTM6Cj5qWsi471Nc,Raghu Dixit,248568,52
156,4rK6HLvoZhLFUTcUhG9WfC,Deacon,5644,52
158,7b6Ui7JVaBDEfZB9k6nHL0,The Local Train,701766,57


In [25]:
# Create the artist_track table
artist_track = top_tracks[['id_artists', 'artists', 'id', 'name']].copy()
artist_track = artist_track.rename(columns = {'id': 'id_tracks', 'name':'tracks'})
top_tracks = top_tracks.drop(columns = ['artists', 'id_artists'])

In [26]:
# Explode out the artist list
artist_track = artist_track.explode(['id_artists', 'artists'])

In [27]:
# Make sure the id_artists (foreign key) are in the id (primary key) from the top_artists table
artist_track = artist_track[artist_track['id_artists'].isin(top_artists['id'])]

In [28]:
artist_track.head()

Unnamed: 0,id_artists,artists,id_tracks,tracks
73439,7gOdHgIoIKoe4i9Tta6qdD,Jonas Brothers,4zP7ADsgJgHGY6VzxbNp1z,Year 3000
76404,5ND0mGcL9SKSjWIjPd0xIb,Bowling For Soup,1AHGrKFv3nSCH9K7yg8gOz,Punk Rock 101
80314,7kwEvDE8e7EBGKh5bLczqQ,Anthem Lights,1dKDRs99KkNbtC9AHM7TLm,Best of 2012: Payphone / Call Me Maybe / Wide ...
80317,7kwEvDE8e7EBGKh5bLczqQ,Anthem Lights,65bcYKY0QzlXILxVuWspdT,Best of 2011: Just the Way You Are / For the F...
84076,7gP3bB2nilZXLfPHJhMdvc,Foster The People,7w87IxuO7BDcJ3YUqCyMTT,Pumped Up Kicks


In [29]:
top_tracks.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
73439,4zP7ADsgJgHGY6VzxbNp1z,Year 3000,67,201960,0,2019-05-09,0.659,0.857,11,-5.85,1,0.0437,0.0045,2e-06,0.335,0.798,106.965,4
76404,1AHGrKFv3nSCH9K7yg8gOz,Punk Rock 101,52,184322,0,2015-01-27,0.63,0.936,4,-4.576,1,0.084,0.00128,0.0,0.0823,0.733,117.962,4
80314,1dKDRs99KkNbtC9AHM7TLm,Best of 2012: Payphone / Call Me Maybe / Wide ...,55,209134,0,2015-10-16,0.375,0.418,11,-5.999,1,0.036,0.688,0.0,0.371,0.287,136.319,5
80317,65bcYKY0QzlXILxVuWspdT,Best of 2011: Just the Way You Are / For the F...,50,183814,0,2015-10-16,0.418,0.343,4,-7.492,1,0.0339,0.741,0.0,0.113,0.327,121.805,4
84076,7w87IxuO7BDcJ3YUqCyMTT,Pumped Up Kicks,85,239600,0,2011-05-23,0.733,0.71,5,-5.849,0,0.0292,0.145,0.115,0.0956,0.965,127.975,4


### Save files

In [25]:
# Convert DataFrames to csv files to load into the Postgres database
# os.path.join keeps the paths portable instead of hard-coding Windows separators
top_artists.to_csv(os.path.join(directory, 'Data', 'top_artists.csv'), sep=',', encoding='utf-8', index=False)
artist_genre.to_csv(os.path.join(directory, 'Data', 'artist_genre.csv'), sep=',', encoding='utf-8', index=False)
top_tracks.to_csv(os.path.join(directory, 'Data', 'top_tracks.csv'), sep=',', encoding='utf-8', index=False)
artist_track.to_csv(os.path.join(directory, 'Data', 'artist_track.csv'), sep=',', encoding='utf-8', index=False)

## The Database

### QuickDBD

https://app.quickdatabasediagrams.com/#/

QuickDBD can be used to draw the entity relationship diagram (ERD). It can also export a PostgreSQL script that creates the tables in our database.

### pgAdmin - CREATE, DROP, and BACKUP DATABASE

To create a new database in postgres you can use pgAdmin. Go to Object, Create, and Database. To drop a database in pgAdmin, right click on the database and select Delete/Drop. If you need to backup a database, then right click on it and select Backup. 

Once the database is created, the SQL file exported from QuickDBD can be loaded into the Query Editor and run to create the tables. The database can be viewed in pgAdmin, and under Schemas you can find the tables, which show all of their information, including columns and constraints. Import the csv files created above into their respective tables by right clicking on the table name and going to import/export. Select import at the top, then select the filename, format, encoding, whether it has a header, and which columns you want to import. Make sure to import the tables with primary keys first (top_artists, top_tracks), and then the tables with foreign keys connecting to them (artist_genre, artist_track); otherwise the foreign key constraints will reject rows whose referenced ids are not present yet.

### Connecting to the Database

To start, you'll want to install (pip install):
- ipython-sql - to get the %sql and %%sql magic commands
- sqlalchemy - a Python SQL toolkit
- psycopg2 - the driver that sends your SQL statements to the Postgres database
    
Next, load the ipython-sql extension and use the magic command to connect to the Postgres database
- The database URL for sqlalchemy is: dialect+driver://username:password@host:port/database 
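
As a small sketch of that URL format, the helper below assembles the string and percent-encodes the username and password, which matters when a password contains characters such as @ or /. build_db_url and the example password are hypothetical; only the URL layout comes from the line above.

```python
from urllib.parse import quote_plus

def build_db_url(user, password, host, port, database,
                 dialect="postgresql", driver=None):
    """Assemble dialect+driver://username:password@host:port/database."""
    scheme = f"{dialect}+{driver}" if driver else dialect
    # Percent-encode credentials so special characters don't break the URL
    return f"{scheme}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"

url = build_db_url("postgres", "p@ss/word", "localhost", 5432, "Spotify",
                   driver="psycopg2")
print(url)  # postgresql+psycopg2://postgres:p%40ss%2Fword@localhost:5432/Spotify
```

The resulting string can be passed to the %sql magic in place of a hand-typed URL.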

In [30]:
%load_ext sql

# Load the spotify database on localhost (remember to enter the password)
%sql postgresql://postgres:password@localhost:5432/Spotify
        
# To hide connection from outputs
%config SqlMagic.displaycon=False

## Table Queries

### CREATE TABLE Statement
Used to create a new table in a database.
- The table has a tablename, columns, and table constraints. 
- Each column has a column name, a data type, and column constraints.
- The data type determines what values a column can hold, such as INTEGER, REAL, DATE, VARCHAR(max length), or TEXT.

#### Constraints
These can be specified when the table is made or altered
- NOT NULL - Ensures that a column cannot have a NULL value.
- UNIQUE - Ensures that all values in the column are different. 
- PRIMARY KEY - A combination of NOT NULL and UNIQUE. A table can only have one primary key, which can be made of multiple fields (composite key). 
- FOREIGN KEY - References the primary key (or another unique column) of a row in another table, thus linking the two tables together. A table can have multiple foreign keys. 
    - ON DELETE SET NULL - If the referenced row is deleted, the foreign key in the rows that reference it is set to NULL.
    - ON DELETE CASCADE - If the referenced row is deleted, the rows that reference it are deleted as well. 
- CHECK - Ensures that all values in a column satisfy a boolean expression condition. 
- DEFAULT - Sets a default value for a column when no value is specified. 
- INDEX - Not a constraint, but often listed alongside them. An index (CREATE INDEX) speeds up lookups on a column at the cost of extra storage and slower writes. 
- AUTO_INCREMENT - Generates a unique number automatically when a new record is inserted into a table. AUTO_INCREMENT is the MySQL keyword; PostgreSQL uses SERIAL or GENERATED ... AS IDENTITY instead.
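
The behaviour of these constraints can be tried out without a Postgres server. The sketch below uses Python's built-in sqlite3 module as a stand-in, so details may differ from Postgres (for example, SQLite only enforces foreign keys after a PRAGMA). The demo_artists and demo_genres tables are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE demo_artists (
    id TEXT NOT NULL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    popularity INTEGER NOT NULL DEFAULT 0 CHECK (popularity BETWEEN 0 AND 100)
);
CREATE TABLE demo_genres (
    artist_id TEXT NOT NULL REFERENCES demo_artists (id) ON DELETE CASCADE,
    genre TEXT
);
""")

conn.execute("INSERT INTO demo_artists (id, name) VALUES ('a1', 'Demo Artist')")
conn.execute("INSERT INTO demo_genres VALUES ('a1', 'pop')")

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

# DEFAULT filled in the popularity that was not supplied
default_popularity = conn.execute(
    "SELECT popularity FROM demo_artists WHERE id = 'a1'").fetchone()[0]
unique_rejected = violates(
    "INSERT INTO demo_artists (id, name) VALUES ('a2', 'Demo Artist')")
check_rejected = violates(
    "INSERT INTO demo_artists (id, name, popularity) VALUES ('a3', 'Other', 101)")
fk_rejected = violates("INSERT INTO demo_genres VALUES ('missing', 'rock')")

# ON DELETE CASCADE: deleting the artist also deletes the rows that reference it
conn.execute("DELETE FROM demo_artists WHERE id = 'a1'")
remaining_genres = conn.execute("SELECT COUNT(*) FROM demo_genres").fetchone()[0]

print(default_popularity, unique_rejected, check_rejected, fk_rejected, remaining_genres)
```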

In [None]:
# %%sql
# This was part of the file used to create the tables in pgAdmin

CREATE TABLE "top_artists" (
    "id" TEXT   NOT NULL,
    "name" TEXT   NOT NULL,
    "followers" INTEGER   NOT NULL,
    "popularity" SMALLINT   NOT NULL,
    CONSTRAINT "pk_top_artists" PRIMARY KEY (
        "id"
     )
);

CREATE TABLE "artist_genre" (
    "id" TEXT   NOT NULL,
    "name" TEXT   NOT NULL,
    "genres" TEXT
);

CREATE TABLE "top_tracks" (
    "id" TEXT   NOT NULL,
    "name" TEXT   NOT NULL,
    "popularity" SMALLINT   NOT NULL,
    "duration_ms" INTEGER   NOT NULL,
    "explicit" SMALLINT   NOT NULL,
    "release_date" DATE   NOT NULL,
    "danceability" REAL   NOT NULL,
    "energy" REAL   NOT NULL,
    "key" SMALLINT   NOT NULL,
    "loudness" REAL   NOT NULL,
    "mode" SMALLINT   NOT NULL,
    "speechiness" REAL   NOT NULL,
    "acousticness" REAL   NOT NULL,
    "instrumentalness" REAL   NOT NULL,
    "liveness" REAL   NOT NULL,
    "valence" REAL   NOT NULL,
    "tempo" REAL   NOT NULL,
    "time_signature" SMALLINT   NOT NULL,
    CONSTRAINT "pk_top_tracks" PRIMARY KEY (
        "id"
     )
);

CREATE TABLE "artist_track" (
    "id_artists" TEXT   NOT NULL,
    "artists" TEXT   NOT NULL,
    "id_tracks" TEXT   NOT NULL,
    "tracks" TEXT   NOT NULL
);

### DROP TABLE Statement
Used to drop an existing table, removing both its data and its definition. Be careful with this. 
- Alternatively, TRUNCATE TABLE tablename; deletes all of the rows but keeps the table itself. 

In [None]:
# %%sql
DROP TABLE "table_name";

### ALTER TABLE Statement
Used to add, delete, or modify columns in an existing table.
- Also used to add and drop various constraints on an existing table. 
- Here are some examples:

In [None]:
# %%sql
# This was the other part of the file used to create the tables in pgAdmin

ALTER TABLE "artist_genre" ADD CONSTRAINT "fk_artist_genre_id" FOREIGN KEY("id")
REFERENCES "top_artists" ("id");

ALTER TABLE "artist_track" ADD CONSTRAINT "fk_artist_track_id_artists" FOREIGN KEY("id_artists")
REFERENCES "top_artists" ("id");

ALTER TABLE "artist_track" ADD CONSTRAINT "fk_artist_track_id_tracks" FOREIGN KEY("id_tracks")
REFERENCES "top_tracks" ("id");


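-- Additional examples (PostgreSQL syntax)

ALTER TABLE "table_name"
ADD COLUMN "column_name" TEXT;

-- Change a column's data type (MySQL would use MODIFY COLUMN)
ALTER TABLE "table_name"
ALTER COLUMN "column_name" TYPE INTEGER USING "column_name"::INTEGER;

-- Rename a column (MySQL would use CHANGE)
ALTER TABLE "table_name"
RENAME COLUMN "column_name" TO "new_column_name";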

ALTER TABLE "table_name"
ADD FOREIGN KEY ("new_column_name")
REFERENCES "other_table"("other_column")
ON DELETE SET NULL;

ALTER TABLE "table_name"
DROP COLUMN "new_column_name";

## Querying the Database to Select Information from a Single Table

### SELECT & LIMIT
To look at one or more columns from a table. Use * to represent all of the columns.

- The LIMIT command will determine how many entries are shown, which is important for large datasets
- Leave it out if you want to see all of the entries

In [35]:
%%sql
SELECT *
FROM artist_track
LIMIT 5

5 rows affected.


id_artists,artists,id_tracks,tracks
7gOdHgIoIKoe4i9Tta6qdD,Jonas Brothers,4zP7ADsgJgHGY6VzxbNp1z,Year 3000
5ND0mGcL9SKSjWIjPd0xIb,Bowling For Soup,1AHGrKFv3nSCH9K7yg8gOz,Punk Rock 101
7kwEvDE8e7EBGKh5bLczqQ,Anthem Lights,1dKDRs99KkNbtC9AHM7TLm,Best of 2012: Payphone / Call Me Maybe / Wide Awake / Starships / We Are Young
7kwEvDE8e7EBGKh5bLczqQ,Anthem Lights,65bcYKY0QzlXILxVuWspdT,Best of 2011: Just the Way You Are / For the First Time / Someone Like You / Superbass / Grenade / Without You
7gP3bB2nilZXLfPHJhMdvc,Foster The People,7w87IxuO7BDcJ3YUqCyMTT,Pumped Up Kicks


In [36]:
%sql SELECT * FROM top_tracks LIMIT 5

5 rows affected.


id,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
4zP7ADsgJgHGY6VzxbNp1z,Year 3000,67,201960,0,2019-05-09,0.659,0.857,11,-5.85,1,0.0437,0.0045,1.93e-06,0.335,0.798,106.965,4
1AHGrKFv3nSCH9K7yg8gOz,Punk Rock 101,52,184322,0,2015-01-27,0.63,0.936,4,-4.576,1,0.084,0.00128,0.0,0.0823,0.733,117.962,4
1dKDRs99KkNbtC9AHM7TLm,Best of 2012: Payphone / Call Me Maybe / Wide Awake / Starships / We Are Young,55,209134,0,2015-10-16,0.375,0.418,11,-5.999,1,0.036,0.688,0.0,0.371,0.287,136.319,5
65bcYKY0QzlXILxVuWspdT,Best of 2011: Just the Way You Are / For the First Time / Someone Like You / Superbass / Grenade / Without You,50,183814,0,2015-10-16,0.418,0.343,4,-7.492,1,0.0339,0.741,0.0,0.113,0.327,121.805,4
7w87IxuO7BDcJ3YUqCyMTT,Pumped Up Kicks,85,239600,0,2011-05-23,0.733,0.71,5,-5.849,0,0.0292,0.145,0.115,0.0956,0.965,127.975,4


### Comments
SQL comments can be used to explain a SQL statement, or to prevent part of a statement from executing
- -- starts a single-line comment; everything from it to the end of the line is ignored
- /* multi-line comments */ can be used to comment out several lines or just part of a line

In [37]:
%%sql
SELECT name, followers, popularity -- Selecting these three columns
FROM top_artists
/* WHERE followers > 1000000
ORDER BY followers DESC */ 
LIMIT 5 

5 rows affected.


name,followers,popularity
Sultaan,53636,53
Phyno,72684,51
Raghu Dixit,248568,52
Deacon,5644,52
The Local Train,701766,57


### WHERE
Used to select records that fulfill some condition 
- Uses =, >, <, >=, <=, <>, IN, BETWEEN, LIKE
- Can be combined with AND, OR, and NOT operators, which can be combined: WHERE NOT, AND NOT, OR NOT

In [49]:
%%sql
SELECT *
FROM top_artists
WHERE followers < 10000000 AND popularity >= 90
LIMIT 10;

10 rows affected.


id,name,followers,popularity
2tIP7SsRs7vjIcLrU85W8J,The Kid LAROI,1624015,90
0eDvMgVFoNV3TpwtrVCoTj,Pop Smoke,5076597,92
7iK8PXO48WeuP03g8YR51W,Myke Towers,5001808,95
5cj0lLjcoR7YOSnhnX0Po5,Doja Cat,6208117,91
4fxd5Ee7UefO4CUXgwJ7IP,Giveon,946550,91
4r63FhuTkUYltbVAg5TQnk,DaBaby,6485079,93
6jGMq4yGs7aQzuGsMgVgZR,Lil Tjay,2889175,91
6AgTAQt8XS6jRWi4sX7w49,Polo G,3657199,91
28gNT5KBp7IjEOQoevXf9N,Camilo,8342103,90
3meJIgRw7YleJrmbpbJK6S,Die drei ???,613060,90


The IN operator allows you to specify multiple values in the WHERE clause.

In [52]:
%%sql 
SELECT *
FROM artist_genre
WHERE name IN ('Drake', 'Taylor Swift', 'Ed Sheeran');

11 rows affected.


id,name,genres
06HL4z0CvFAxyc27GXpf02,Taylor Swift,pop
06HL4z0CvFAxyc27GXpf02,Taylor Swift,post-teen pop
6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,pop
6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,uk pop
3TVXtAsR1Inumwj472S9r4,Drake,canadian hip hop
3TVXtAsR1Inumwj472S9r4,Drake,canadian pop
3TVXtAsR1Inumwj472S9r4,Drake,hip hop
3TVXtAsR1Inumwj472S9r4,Drake,pop rap
3TVXtAsR1Inumwj472S9r4,Drake,rap
3TVXtAsR1Inumwj472S9r4,Drake,toronto rap


The BETWEEN operator allows you to select values within a given range. Both endpoints are included. Values can be numbers, text, or dates. 

In [53]:
%%sql
SELECT *
FROM top_tracks
WHERE release_date BETWEEN '2013-06-01' AND '2013-09-01'
LIMIT 10;

10 rows affected.


id,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
3U4isOIWM3VvDubwSI3y7a,All of Me,87,269560,0,2013-08-30,0.422,0.264,8,-7.064,1,0.0322,0.922,0.0,0.132,0.331,119.93,4
0NlGoUyOJSuSHmngoibVAs,All I Want,84,305747,0,2013-06-17,0.209,0.412,0,-9.733,1,0.0443,0.172,0.15,0.0843,0.162,86.26,3
3JvKfv6T31zO0ini8iNItO,Another Love,83,244360,1,2013-06-24,0.445,0.537,4,-8.532,0,0.04,0.695,1.65e-05,0.0944,0.131,122.769,4
3sNVsP50132BTNlImLx70i,Bound 2,80,229147,1,2013-06-18,0.367,0.665,1,-2.821,1,0.0465,0.145,0.0,0.113,0.31,148.913,4
5anCkDvJ17aznvK5TED5uo,Hail to the King,77,305907,0,2013-08-23,0.58,0.916,3,-4.358,0,0.0387,0.000297,0.0259,0.126,0.683,118.004,4
3QHMxEOAGD51PDlbFPHLyJ,Vivir Mi Vida,77,252347,0,2013-07-23,0.656,0.877,0,-3.231,0,0.0342,0.345,0.0,0.349,0.894,105.018,4
79MSEdtXuudhGhC5AtG07g,Break from Toronto,76,99213,1,2013-07-01,0.596,0.678,9,-5.18,1,0.0335,0.0199,0.004,0.418,0.259,117.066,4
722tgOgdIbNe3BEyLnejw4,Black Skinhead,76,188013,1,2013-06-18,0.766,0.809,1,-6.123,1,0.279,0.0011,0.0,0.168,0.325,130.127,4
2uwnP6tZVVmTovzX5ELooy,Power Trip (feat. Miguel),75,241160,1,2013-06-18,0.667,0.608,1,-7.054,1,0.216,0.324,0.000198,0.426,0.475,99.992,4
4KlL5Bwlm4yHYxr0B2rHci,Heal,74,193080,0,2013-06-24,0.445,0.179,1,-12.938,1,0.0396,0.952,0.00056,0.107,0.119,72.246,4


The LIKE operator allows you to search for a specified pattern in a column by using wildcards. 
- Wildcards are used to substitute one or more characters in a string. 
- %  Represents zero, one, or multiple characters.
- _  Represents a single character. 
- The two wildcards can be combined in one pattern, and each can appear more than once.
- In PostgreSQL, LIKE is case-sensitive; ILIKE is the case-insensitive variant.

In [54]:
%%sql
SELECT *
FROM top_artists
WHERE name LIKE 'Lil %'
LIMIT 10;

10 rows affected.


id,name,followers,popularity
6FXCc0FAXCsG2WFR1plJjx,Lil Berete,31749,51
7jVv8c5Fj3E9VhNjxT4snq,Lil Nas X,4562300,89
6fU9vzuziiZNiZWawoCz2x,Lil Rain,6554,42
3FNZcjyqT7F5upP99JV0oN,Lil Debbie,55395,46
6y16j3NM7yVHoPxDjVquuq,Lil Golu,15792,37
7iDeMFJKjI1ak40N3hoYOZ,Lil Xxel,51524,60
2Kv0ApBohrL213X9avMrEn,Lil Silva,23408,52
1vOh8jgNLFHFxMY8i0lEKr,Lil Halima,9911,41
5HPsVk1MblCoa44WLJsQwN,Lil Suzy,45172,43
1qKzKUnuQsjB83hBZffoq0,Lil Rick,11790,40
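
A quick way to get a feel for the two wildcards is to try a few patterns on a tiny table. This sketch uses Python's built-in sqlite3 as a stand-in for Postgres. One difference to keep in mind: SQLite's LIKE ignores case for ASCII letters while Postgres's does not, so the patterns here match the stored case exactly. The names table is made up, and 'Lilly Wood' is an invented entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?)",
    [("Lil Nas X",), ("Lil Tjay",), ("Lilly Wood",), ("DaBaby",), ("Kygo",)],
)

def matching(pattern):
    """Return the names that match a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM names WHERE name LIKE ? ORDER BY name", (pattern,))
    return [row[0] for row in rows]

with_space = matching("Lil %")    # 'Lil' followed by a space, then anything
without_space = matching("Lil%")  # anything starting with 'Lil'
ends_with_by = matching("%by")    # anything ending in 'by'
one_char = matching("_ygo")       # exactly one character, then 'ygo'

print(with_space, without_space, ends_with_by, one_char)
```

The trailing space in 'Lil %' is what keeps names like 'Lilly Wood' out of the result.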


### ORDER BY
Sorts the results by one or more columns
- Sort ascending with ASC (the default) or descending with DESC
- When ordering by multiple columns, each later column is only used to break ties among rows that have the same value in the earlier columns

In [56]:
%%sql
SELECT *
FROM top_tracks
WHERE key = 10
ORDER BY popularity DESC, duration_ms ASC
LIMIT 10;

10 rows affected.


id,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
7lPN2DXiMsVn7XUKtOW1CS,drivers license,99,242014,1,2021-01-08,0.585,0.436,10,-8.761,1,0.0601,0.721,1.31e-05,0.105,0.132,143.874,4
1tkg4EHVoqnhR6iFEXb60y,What You Know Bout Love,91,160000,1,2020-07-03,0.709,0.548,10,-8.493,1,0.353,0.65,1.59e-06,0.133,0.543,83.995,4
6fRxMU4LWwyaSSowV441IU,Beautiful Mistakes (feat. Megan Thee Stallion),90,227395,0,2021-03-03,0.713,0.676,10,-5.483,1,0.027,0.0377,0.0,0.154,0.721,99.048,4
2QjOHCTQ1Jl3zawyYOpxh6,Sweater Weather,90,240400,0,2013-04-19,0.612,0.807,10,-2.81,1,0.0336,0.0495,0.0177,0.101,0.398,124.053,4
5RubKOuDoPn5Kj5TLVxSxY,TE MUDASTE,88,130014,1,2020-11-27,0.811,0.637,10,-4.835,0,0.0591,0.234,0.000572,0.118,0.471,92.025,4
0nbXyq5TXYPCO7pr3N8S4I,The Box,88,196653,1,2019-12-06,0.896,0.586,10,-6.687,0,0.0559,0.104,0.0,0.79,0.642,116.971,4
0pqnGHJpmpxLKifKRmU6WP,Believer,88,204347,0,2017-06-23,0.776,0.78,10,-4.374,0,0.128,0.0622,0.0,0.081,0.666,124.949,4
3UHPGOkUcE4hE7sqBF4Snt,Film out,88,214621,0,2021-04-01,0.499,0.709,10,-6.404,1,0.134,0.296,0.0,0.331,0.314,164.032,4
2r6OAV3WsYtXuXjvJ1lIDi,Hello (feat. A Boogie Wit da Hoodie),87,190534,1,2020-07-20,0.905,0.647,10,-5.065,0,0.107,0.0187,0.0,0.282,0.367,130.97,4
5uCax9HTNlzGybIStD3vDh,Say You Won't Let Go,87,211467,0,2016-10-28,0.358,0.557,10,-7.398,1,0.059,0.695,0.0,0.0902,0.494,85.043,4
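
The tie-breaking role of the second sort column is easy to see on a handful of rows. The sketch below copies four tracks from the output above into an in-memory SQLite table (a stand-in for Postgres; demo_tracks is a made-up table name) and sorts them the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tracks (name TEXT, popularity INTEGER, duration_ms INTEGER)")
conn.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?, ?)",
    [
        ("The Box", 88, 196653),
        ("TE MUDASTE", 88, 130014),
        ("Believer", 88, 204347),
        ("Sweater Weather", 90, 240400),
    ],
)

# Highest popularity first; among equally popular tracks, shortest first
ordered = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM demo_tracks ORDER BY popularity DESC, duration_ms ASC")
]
print(ordered)
```

The three tracks tied at popularity 88 come out in order of increasing duration, after the single track at 90.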


### Aliases
Can be used to give a table or a column a temporary name. This can make the query or its output more readable, and the alias only exists for the duration of that query. 
- To do this, write the column or table and then write AS new_name
- Can combine multiple columns using CONCAT(column, column2) AS new_name
- Can also add the table name (or its alias) in front of the column name to make it clear which table a column comes from when querying multiple tables - table.column

In [59]:
%%sql
SELECT COUNT(name) AS "Number of Artists"
FROM top_artists AS "Artist Profile";

1 rows affected.


Number of Artists
88609


In [60]:
%%sql
SELECT a.name, a.popularity
FROM top_artists AS a
WHERE popularity > 85
LIMIT 10;

10 rows affected.


name,popularity
Lil Nas X,89
The Kid LAROI,90
Lewis Capaldi,86
Ava Max,86
Shawn Mendes,89
Justin Quiles,87
Kygo,86
Pop Smoke,92
Bad Bunny,98
Juice WRLD,96


In [71]:
%%sql
SELECT CONCAT(name, ' | ', genres) AS "Artist and Genre"
FROM artist_genre
LIMIT 10; -- Not really applicable here, but it's an example

10 rows affected.


Artist and Genre
Sultaan | desi pop
Sultaan | punjabi hip hop
Sultaan | punjabi pop
Phyno | afro dancehall
Phyno | afropop
Phyno | azontobeats
Phyno | nigerian hip hop
Phyno | nigerian pop
Raghu Dixit | filmi
Raghu Dixit | indian folk


### Aggregate Functions
Can be used on a column in a table to compute a single value from many rows.
- MIN()  Returns the smallest value of the selected column.
- MAX()  Returns the largest value of the selected column.
- COUNT()  Returns the number of rows that match the specified criteria. COUNT(column) skips NULL values, while COUNT(*) counts every row. 
- AVG()  Returns the average value of a numeric column - NULL values ignored.
- SUM()  Returns the total sum of a numeric column - NULL values ignored. 
- ROUND(value, 2)  Not an aggregate itself, but often wrapped around one. Rounds a number to the specified number of decimal places.
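
How NULL values affect these functions is worth checking on a tiny example. The following sketch uses Python's built-in sqlite3 as a stand-in for Postgres; the scores table and its values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("b", 20), ("c", None)],  # None is stored as NULL
)

total_rows, non_null_scores, average, total = conn.execute(
    "SELECT COUNT(*), COUNT(score), AVG(score), SUM(score) FROM scores"
).fetchone()

print(total_rows, non_null_scores, average, total)
```

COUNT(*) sees all three rows, COUNT(score) only the two with a value, and the average is taken over those two values rather than over three rows.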

In [74]:
%%sql 
SELECT MIN(popularity) AS "Min Popularity", 
    MAX(popularity) AS "Max Popularity", 
    COUNT(name) AS "Number of Artists", 
    ROUND(AVG(followers),0) AS "Average Followers", 
    SUM(followers) AS "Total Followers"
FROM top_artists;  -- Sum is not applicable here since the same spotify account could be following multiple artists

1 rows affected.


Min Popularity,Max Popularity,Number of Artists,Average Followers,Total Followers
0,100,88609,129848,11505685225


We could use aggregate functions to compare the average danceability, loudness, and tempo for all tracks vs tracks with a popularity above 80. We can also use count to determine how many tracks are in each category. 

In [82]:
%%sql
SELECT AVG(danceability) AS "Average Danceability", 
    AVG(loudness) AS "Average Loudness", 
    AVG(tempo) AS "Average Tempo", 
    COUNT(name) AS "Number of Tracks"
FROM top_tracks
WHERE popularity > 80;

1 rows affected.


Average Danceability,Average Loudness,Average Tempo,Number of Tracks
0.6823338521484652,-6.314658337933783,120.88065827915707,641


In [83]:
%%sql
SELECT AVG(danceability) AS "Average Danceability", 
    AVG(loudness) AS "Average Loudness", 
    AVG(tempo) AS "Average Tempo", 
    COUNT(name) AS "Number of Tracks"
FROM top_tracks;

1 rows affected.


Average Danceability,Average Loudness,Average Tempo,Number of Tracks
0.6427108298943242,-6.963591022972694,121.26952222804802,41912


### GROUP BY
Groups rows that have the same values into summary rows, like average loudness per genre.
- It's often used with aggregate functions (MIN, MAX, COUNT, AVG, SUM) to group the result-set by one or more columns.

In [88]:
%%sql
SELECT COUNT(name) AS "Number of Tracks", 
ROUND(AVG(duration_ms),3) AS "Average Duration in ms", 
MAX(popularity) AS "Max Popularity", key
FROM top_tracks
GROUP BY key
ORDER BY key ASC;

12 rows affected.


Number of Tracks,Average Duration in ms,Max Popularity,key
4698,216848.213,100,0
4625,214160.79,96,1
3752,222190.474,94,2
1410,217935.726,89,3
3024,220368.643,98,4
3438,216232.148,96,5
3272,215296.037,94,6
4340,220399.182,93,7
3154,216160.777,95,8
3789,221194.171,90,9


### HAVING Clause 
Filters groups after aggregation. It exists because the WHERE clause is evaluated before the rows are grouped, so it cannot contain aggregate functions.

In [93]:
%%sql
SELECT COUNT(name) AS "Number of Tracks", 
AVG(duration_ms) AS "Average Duration in ms", 
MAX(popularity) AS "Max Popularity", key
FROM top_tracks
GROUP BY key
HAVING MAX(popularity) > 95
ORDER BY COUNT(name) DESC
LIMIT 15;

6 rows affected.


Number of Tracks,Average Duration in ms,Max Popularity,key
4698,216848.2128565347,100,0
4625,214160.78983783783,96,1
3618,216337.31288004425,97,11
3438,216232.14776032575,96,5
3024,220368.64252645505,98,4
2792,216813.43624641837,99,10
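
The difference between filtering rows with WHERE and filtering groups with HAVING can be sketched on six made-up rows. This uses Python's built-in sqlite3 as a stand-in for Postgres; demo_tracks and track_key are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tracks (name TEXT, track_key INTEGER, popularity INTEGER)")
conn.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?, ?)",
    [("t1", 0, 100), ("t2", 0, 60),
     ("t3", 1, 96), ("t4", 1, 55),
     ("t5", 3, 89), ("t6", 3, 70)],
)

# HAVING: group first, then keep only groups whose maximum exceeds 95
having_rows = conn.execute("""
    SELECT track_key, COUNT(*), MAX(popularity)
    FROM demo_tracks
    GROUP BY track_key
    HAVING MAX(popularity) > 95
    ORDER BY track_key
""").fetchall()

# WHERE: drop individual rows first, then group whatever is left
where_rows = conn.execute("""
    SELECT track_key, COUNT(*), MAX(popularity)
    FROM demo_tracks
    WHERE popularity > 95
    GROUP BY track_key
    ORDER BY track_key
""").fetchall()

print(having_rows)
print(where_rows)
```

Both queries return keys 0 and 1, but the counts differ: HAVING keeps every track in the surviving groups, while WHERE has already discarded the less popular tracks before counting.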


## Querying Information from Multiple Tables and Combining the Results
Thus far we have only queried one table at a time, but there are multiple tables in the database. To query multiple tables and combine the results, there are a few options: 
- Use information from one table to search in another, using a column they share.
- Combine rows and columns from two or more tables, based on a shared column.
- Combine the result sets of several queries, provided they have the same number of columns, with similar datatypes, in the same order.

### NESTED QUERIES
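A nested query (subquery) runs one query inside another. Here, WHERE and IN use the result of a query on one table to filter the same column in another table. Matching on id instead of name would be more robust, since artist names are not guaranteed to be unique.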

In [96]:
%%sql
SELECT name, followers, popularity
FROM top_artists
WHERE name IN (
    SELECT name
    FROM artist_genre
    WHERE genres = 'pop'
)
ORDER BY followers DESC
LIMIT 15;

15 rows affected.


name,followers,popularity
Ed Sheeran,78900234,92
Ariana Grande,61301006,95
Justin Bieber,44606973,100
Rihanna,42244011,92
Billie Eilish,41792604,92
Taylor Swift,38869193,98
Shawn Mendes,32419313,89
The Weeknd,31308207,96
Maroon 5,30291109,91
Marshmello,30244604,88


### JOIN Clause 
Used to combine rows from two or more tables, based on a shared column

- (INNER) JOIN - Returns records that have matching values in both tables.
- LEFT (OUTER) JOIN - Returns all records from the left table, and matching records from the right table.
- RIGHT (OUTER) JOIN - Returns all records from the right table, and matching records from the left table.
- FULL (OUTER) JOIN - Returns all records from both tables, matching them where possible and filling in NULLs where there is no match on the other side. 

In [102]:
%%sql
SELECT ta.name, ta.followers, ta.popularity, ag.genres
FROM top_artists AS ta
JOIN artist_genre AS ag
ON ta.id=ag.id
WHERE ta.name IN ('Ed Sheeran', 'Taylor Swift', 'Justin Bieber')
LIMIT 10;

7 rows affected.


name,followers,popularity,genres
Taylor Swift,38869193,98,pop
Taylor Swift,38869193,98,post-teen pop
Ed Sheeran,78900234,92,pop
Ed Sheeran,78900234,92,uk pop
Justin Bieber,44606973,100,canadian pop
Justin Bieber,44606973,100,pop
Justin Bieber,44606973,100,post-teen pop


In [103]:
%%sql
SELECT ta.name, ta.followers, ag.genres, at.tracks
FROM ((top_artists AS ta
JOIN artist_genre AS ag
       ON ta.id=ag.id)
      JOIN artist_track AS at
      ON ta.id=at.id_artists)
WHERE ta.name LIKE 'Ed Sheeran'
LIMIT 10;

10 rows affected.


name,followers,genres,tracks
Ed Sheeran,78900234,pop,Happier - Acoustic
Ed Sheeran,78900234,pop,Cold Coffee
Ed Sheeran,78900234,pop,I Was Made For Loving You
Ed Sheeran,78900234,pop,Thinking out Loud
Ed Sheeran,78900234,pop,Photograph
Ed Sheeran,78900234,pop,The Man
Ed Sheeran,78900234,pop,Take It Back
Ed Sheeran,78900234,pop,Gold Rush - Deluxe Edition
Ed Sheeran,78900234,pop,Little Bird - Deluxe Edition
Ed Sheeran,78900234,pop,This


In [104]:
%%sql
SELECT at.artists, tt.name, tt.duration_ms, tt.danceability
FROM artist_track AS at
FULL JOIN top_tracks AS tt
ON at.id_tracks=tt.id
WHERE tt.popularity > 80 AND tt.tempo < 100
ORDER BY at.artists ASC
LIMIT 15;

15 rows affected.


artists,name,duration_ms,danceability
21 Savage,Opp Stoppa (feat. 21 Savage),135431,0.829
24kGoldn,Mood (feat. iann dior),140533,0.701
24kGoldn,"Mood (Remix) feat. Justin Bieber, J Balvin & iann dior",192745,0.721
50 Cent,The Woo (feat. 50 Cent & Roddy Ricch),201600,0.49
A$AP Rocky,Praise The Lord (Da Shine) (feat. Skepta),205040,0.854
Adam Levine,Stereo Hearts (feat. Adam Levine),210960,0.646
Alexander 23,IDK You Yet,184638,0.648
Ali Gatie,It's You,212607,0.732
Anne-Marie,2002,186987,0.697
Anne-Marie,FRIENDS,202621,0.626
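
To see how the join type changes which rows come back, here is a small sketch comparing an inner join with a left join. It uses Python's built-in sqlite3 as a stand-in for Postgres, and only covers INNER and LEFT because older SQLite versions lack RIGHT and FULL joins. The demo tables and the 'No Genre Artist' row are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_artists (id TEXT, name TEXT)")
conn.execute("CREATE TABLE demo_genres (id TEXT, genres TEXT)")
conn.executemany("INSERT INTO demo_artists VALUES (?, ?)",
                 [("a1", "Ed Sheeran"), ("a2", "No Genre Artist")])
conn.executemany("INSERT INTO demo_genres VALUES (?, ?)",
                 [("a1", "pop"), ("a1", "uk pop")])

# INNER JOIN: only artists that have at least one matching genre row
inner_rows = conn.execute("""
    SELECT a.name, g.genres
    FROM demo_artists AS a
    JOIN demo_genres AS g ON a.id = g.id
    ORDER BY a.name, g.genres
""").fetchall()

# LEFT JOIN: every artist, with NULL (None in Python) where no genre matches
left_rows = conn.execute("""
    SELECT a.name, g.genres
    FROM demo_artists AS a
    LEFT JOIN demo_genres AS g ON a.id = g.id
    ORDER BY a.name, g.genres
""").fetchall()

print(inner_rows)
print(left_rows)
```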


### UNION Operator
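Used to combine the result sets of two or more SELECT statements.
- The SELECT statements must have the same number of columns, with similar data types, in the same order.
- UNION removes duplicate rows from the combined result; use UNION ALL to keep them.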

In [116]:
%%sql
SELECT name AS "Track or Artist", popularity
FROM top_tracks
WHERE explicit=0 AND popularity >= 95
UNION
SELECT name, popularity
FROM top_artists
WHERE followers > 30000000 AND popularity >= 95
ORDER BY "Track or Artist" ASC
LIMIT 15;

12 rows affected.


Track or Artist,popularity
Ariana Grande,95
Astronaut In The Ocean,98
Bad Bunny,98
Blinding Lights,96
BTS,96
Drake,98
Justin Bieber,100
Leave The Door Open,96
Taylor Swift,98
telepatía,97
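
UNION drops duplicate rows from the combined result, while UNION ALL keeps them. A minimal sketch of that difference, again using Python's built-in sqlite3 as a stand-in for Postgres, with two made-up single-column tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list_a (name TEXT)")
conn.execute("CREATE TABLE list_b (name TEXT)")
conn.executemany("INSERT INTO list_a VALUES (?)", [("BTS",), ("Drake",)])
conn.executemany("INSERT INTO list_b VALUES (?)", [("Drake",), ("Kygo",)])

# UNION: 'Drake' appears in both tables but only once in the result
union_names = [r[0] for r in conn.execute(
    "SELECT name FROM list_a UNION SELECT name FROM list_b ORDER BY name")]

# UNION ALL: every row from both queries is kept
union_all_names = [r[0] for r in conn.execute(
    "SELECT name FROM list_a UNION ALL SELECT name FROM list_b ORDER BY name")]

print(union_names)
print(union_all_names)
```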


## Change the Contents of a Table

### INSERT INTO
Inserts new records into a table. If a row references a value in another table that hasn't been inserted yet, you may need to insert NULL for that foreign key first and update it later. 

In [None]:
%%sql
INSERT INTO top_artists (id, name, followers, popularity)  -- Adds to the specified columns
VALUES ('New Artist ID', 'New Artist', 10000, 70);  -- popularity is a SMALLINT column, so use an integer

Can also copy data from one table and insert it into another table. This can be useful for building a smaller table to export to a file.
- The target table (small_artists here) must already exist, and the data types in the source and target tables must match. 
- The existing records are unaffected. 

In [None]:
%%sql
INSERT INTO small_artists
SELECT * FROM top_artists
WHERE followers < 100000;

### UPDATE
Used to modify existing records in a table. 
- It's important to be careful about which records are updated in the WHERE clause. If this is missing, then all the values will be updated.

In [None]:
%%sql
UPDATE top_artists
SET followers=15000, popularity=75
WHERE name='New Artist';
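
One way to guard against an UPDATE that touches more rows than intended is to run it inside a transaction and check the affected row count before committing. The sketch below shows the idea with Python's built-in sqlite3 as a stand-in for Postgres; the demo_artists table and its rows are made up, and the same pattern in Postgres would use BEGIN and ROLLBACK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_artists (name TEXT, followers INTEGER)")
conn.executemany("INSERT INTO demo_artists VALUES (?, ?)",
                 [("New Artist", 10000), ("Other Artist", 500)])
conn.commit()

# Mistake: the WHERE clause is missing, so every row is updated
cursor = conn.execute("UPDATE demo_artists SET followers = 15000")
rows_touched_by_mistake = cursor.rowcount
if rows_touched_by_mistake != 1:
    conn.rollback()  # undo the uncommitted update

# Intended statement: only the matching row changes
cursor = conn.execute(
    "UPDATE demo_artists SET followers = 15000 WHERE name = ?", ("New Artist",))
rows_touched = cursor.rowcount
conn.commit()

followers = dict(conn.execute("SELECT name, followers FROM demo_artists"))
print(rows_touched_by_mistake, rows_touched, followers)
```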

### DELETE
Used to delete existing records in a table. 

In [None]:
%%sql
DELETE FROM top_artists;  -- This will delete all of the records in the top_artists table, but keep the table intact

In [None]:
%%sql
DELETE FROM top_artists
WHERE name='New Artist';