<a href="https://colab.research.google.com/github/DanielBerkes/git-intro/blob/master/spotify_aws_data_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The following command installs the `spotipy` library in the Jupyter notebook environment:
- `!` indicates that the command should be executed in the system's shell, not in the Python interpreter.
- `pip` is the package installer for Python, used to install and manage additional libraries.
- `install` is the pip command to install a package.
- `spotipy` is a lightweight Python library for the Spotify Web API.

In [1]:
!pip install spotipy

Collecting spotipy
  Downloading spotipy-2.24.0-py3-none-any.whl (30 kB)
Collecting redis>=3.5.3 (from spotipy)
  Downloading redis-5.0.7-py3-none-any.whl (252 kB)
Installing collected packages: redis, spotipy
Successfully installed redis-5.0.7 spotipy-2.24.0


The following code imports the `spotipy` library, the `SpotifyClientCredentials` class from `spotipy.oauth2` for authenticating with Spotify's Web API, and `pandas` (as `pd`) for the DataFrame processing later in the notebook.

In [2]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

Creates an instance of `SpotifyClientCredentials` with your Spotify application's `client_id` and `client_secret` for authentication with Spotify's Web API:

In [3]:
client_credentials_manager = SpotifyClientCredentials(client_id="80a16ccb47644b35a438945b39bf057c", client_secret="4aaeb6bf308344b39e132f5b4153cddb")
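
Hardcoding the `client_id` and `client_secret` in a notebook exposes them to anyone who can read the file. A minimal sketch of one alternative, assuming the credentials are stored in environment variables: the helper `load_spotify_credentials` and the demo values are hypothetical, while `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names Spotipy itself looks up when no arguments are passed.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# A plain dict stands in for os.environ here; the values are placeholders.
demo_env = {
    "SPOTIPY_CLIENT_ID": "my-client-id",
    "SPOTIPY_CLIENT_SECRET": "my-client-secret",
}
client_id, client_secret = load_spotify_credentials(demo_env)

# In the notebook, the values would then be passed on unchanged:
# client_credentials_manager = SpotifyClientCredentials(
#     client_id=client_id, client_secret=client_secret)
```

Calling `load_spotify_credentials()` with no argument reads from the real process environment, so the secrets never appear in the notebook itself.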

Initializes the Spotipy client with the provided `client_credentials_manager` for making requests to Spotify's Web API.

In [4]:
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Defines the Spotify playlist link that will be used to fetch playlist data.

In [5]:
playlist_link = "https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF?si=1333723a6eff4b7f"

Extracts the playlist ID from the Spotify playlist link by taking the last path segment of the URL and dropping the query string. Despite its name, `playlist_URI` holds the bare playlist ID, which is what `sp.playlist_tracks` expects as its argument.

In [6]:
playlist_URI = playlist_link.split("/")[-1].split('?')[0]
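
The string splitting above works for links of this exact shape. As a hedged alternative, the same extraction can be sketched with the standard library's URL parser, which also tolerates a trailing slash and rejects links that are not playlists. The helper name `extract_playlist_id` is hypothetical.

```python
from urllib.parse import urlparse


def extract_playlist_id(link):
    """Return the playlist ID from an open.spotify.com playlist URL."""
    # urlparse separates the path from the query string (?si=...).
    path_parts = [part for part in urlparse(link).path.split("/") if part]
    # Expect a path ending in playlist/<id>; anything else is rejected.
    if len(path_parts) < 2 or path_parts[-2] != "playlist":
        raise ValueError(f"Not a playlist link: {link}")
    return path_parts[-1]


example_link = "https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF?si=1333723a6eff4b7f"
example_id = extract_playlist_id(example_link)
```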

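Fetches the playlist's tracks from Spotify's Web API using the extracted ID and stores the response in `data`:

In [7]: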
data = sp.playlist_tracks(playlist_URI)
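
`sp.playlist_tracks` returns a single page of results (Spotipy's default `limit` is 100), and the response carries a `next` field pointing at the following page, if any. For playlists longer than one page, a minimal sketch of collecting every item could look like the following. `fetch_all_playlist_items` and `FakeClient` are hypothetical names; the fake client only mimics the `playlist_tracks` and `next` methods so the sketch runs without network access.

```python
def fetch_all_playlist_items(client, playlist_id):
    """Collect the items from every page of a playlist."""
    page = client.playlist_tracks(playlist_id)
    items = list(page["items"])
    # Keep following the 'next' link until the last page is reached.
    while page["next"]:
        page = client.next(page)
        items.extend(page["items"])
    return items


class FakeClient:
    """Hypothetical stand-in for spotipy.Spotify that serves two fixed pages."""

    def __init__(self):
        self._pages = [
            {"items": ["t1", "t2"], "next": "page-2"},
            {"items": ["t3"], "next": None},
        ]

    def playlist_tracks(self, playlist_id):
        return self._pages[0]

    def next(self, page):
        return self._pages[1] if page["next"] else None


all_items = fetch_all_playlist_items(FakeClient(), "demo-playlist-id")
```

With the real client, the call would be `fetch_all_playlist_items(sp, playlist_URI)`, and the loops below would iterate over the returned list instead of `data['items']`.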

Iterates through the playlist data to extract album information and appends it to `album_list`:
- `data['items']` contains the playlist tracks
- for each track, extracts the album ID, name, release date, total tracks, and Spotify URL
- constructs a dictionary `album_element` with the extracted album information
- appends `album_element` to `album_list`

Using dictionaries to store individual album data ensures all relevant information is kept together, while the list organizes these dictionaries for efficient processing and analysis.

In [8]:
album_list = []
for row in data['items']:
  album_id = row['track']['album']['id']
  album_name = row['track']['album']['name']
  album_release_date = row['track']['album']['release_date']
  album_total_tracks = row['track']['album']['total_tracks']
  album_url = row['track']['album']['external_urls']['spotify']
  album_element = {
      'album_id': album_id,
      'name': album_name,
      'release_date': album_release_date,
      'total_tracks': album_total_tracks,
      'url': album_url}
  album_list.append(album_element)
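
The loop above assumes every playlist item has a `track` object. In the Web API's playlist responses, `track` can be null for items that are no longer available, in which case `row['track']['album']` would raise a `TypeError`. A defensive sketch, using a hypothetical helper `album_from_row` and made-up sample rows:

```python
def album_from_row(row):
    """Return an album dict for one playlist item, or None if the item has no track."""
    track = row.get("track")
    if not track:
        return None
    album = track["album"]
    return {
        "album_id": album["id"],
        "name": album["name"],
        "release_date": album["release_date"],
        "total_tracks": album["total_tracks"],
        "url": album["external_urls"]["spotify"],
    }


# Made-up rows: one complete item and one whose track is missing.
sample_items = [
    {"track": {"album": {
        "id": "album-1",
        "name": "Demo Album",
        "release_date": "2024-01-01",
        "total_tracks": 10,
        "external_urls": {"spotify": "https://open.spotify.com/album/album-1"},
    }}},
    {"track": None},
]

# Items without a track are skipped instead of raising an error.
safe_album_list = [
    album for album in map(album_from_row, sample_items) if album is not None
]
```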




Iterates through the playlist data to extract artist information and appends it to `artist_list`:
- `data['items']` contains the playlist tracks
- for each track, iterates through the list of artists associated with the track
- constructs a dictionary `artist_dict` with the artist ID, name, and `href` (the artist's Web API endpoint URL) for structured storage
- appends `artist_dict` to `artist_list`, allowing for easy iteration, sorting, and filtering of multiple artists

Using dictionaries to store individual artist data ensures all relevant information is kept together, while the list organizes these dictionaries for efficient processing and analysis.

In [9]:
artist_list = []
for row in data['items']:
  # Each playlist item holds its track under the 'track' key.
  for artist in row['track']['artists']:
    artist_dict = {
        'artist_id': artist['id'],
        'artist_name': artist['name'],
        # 'href' is the artist's Web API endpoint, not the open.spotify.com page
        'external_url': artist['href']
    }
    artist_list.append(artist_dict)

Iterates through the playlist data to extract song information and appends it to `song_list`:
- `data['items']` contains the playlist tracks
- for each track, extracts the song ID, name, duration in milliseconds, Spotify URL, popularity, and the date the song was added to the playlist
- also extracts the album ID and the ID of the first artist associated with the track
- constructs a dictionary `song_element` with the extracted song information for structured storage
- appends `song_element` to `song_list`, allowing for easy iteration, sorting, and filtering of multiple songs

Using dictionaries to store individual song data ensures all relevant information is kept together, while the list organizes these dictionaries for efficient processing and analysis.

In [10]:
song_list = []
for row in data['items']:
  song_id = row['track']['id']
  song_name = row['track']['name']
  song_duration = row['track']['duration_ms']
  song_url = row['track']['external_urls']['spotify']
  song_popularity = row['track']['popularity']
  song_added = row['added_at']
  album_id = row['track']['album']['id']
  artist_id = row['track']['artists'][0]['id']
  song_element = {
      'song_id': song_id,
      'song_name': song_name,
      'song_ms': song_duration,
      'url': song_url,
      'popularity': song_popularity,
      'song_added': song_added,
      'album_id': album_id,
      'artist_id': artist_id
      }
  song_list.append(song_element)
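
The `song_added` and `song_ms` values are stored exactly as the API returns them: a timestamp string and a duration in milliseconds. A small sketch of converting them for analysis, assuming `added_at` uses the `YYYY-MM-DDTHH:MM:SSZ` form (UTC). The helper names and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def parse_added_at(timestamp):
    """Parse an 'added_at' string such as '2024-07-01T12:30:00Z' into an aware datetime."""
    # Assumption: the timestamp is UTC with a trailing 'Z' and no fractional seconds.
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def duration_minutes(duration_ms):
    """Convert a duration in milliseconds to minutes."""
    return duration_ms / 60000


added = parse_added_at("2024-07-01T12:30:00Z")
minutes = duration_minutes(210000)
```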

Converts the `album_list` to a Pandas DataFrame for easier data manipulation and analysis:
- `album_list` is a list of dictionaries containing album information
- `pd.DataFrame.from_dict(album_list)` creates a DataFrame from this list, organizing the data into a tabular format with columns corresponding to the dictionary keys

In [11]:
album_df = pd.DataFrame.from_dict(album_list)


Removes duplicates from the `album_df` DataFrame based on the `album_id` column to ensure each album appears only once:
- `album_df.drop_duplicates(subset=['album_id'])` identifies and drops rows with duplicate `album_id` values
- this is needed because several playlist tracks can come from the same album, which would otherwise be listed more than once

In [12]:
album_df = album_df.drop_duplicates(subset=['album_id'])
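
To illustrate what `drop_duplicates(subset=[...])` does, here is a small sketch on made-up rows (the sample album IDs and names are hypothetical, not taken from the playlist). By default, pandas keeps the first occurrence of each duplicated key and preserves the original index labels.

```python
import pandas as pd

# Two tracks from the same album produce two identical album rows.
sample_albums = [
    {"album_id": "a1", "name": "First Album"},
    {"album_id": "a2", "name": "Second Album"},
    {"album_id": "a1", "name": "First Album"},
]

sample_df = pd.DataFrame(sample_albums)

# Rows are compared on 'album_id' only; the first row per ID is kept.
deduped_df = sample_df.drop_duplicates(subset=["album_id"])
```

If a contiguous index is wanted afterwards, `deduped_df.reset_index(drop=True)` renumbers the remaining rows.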

In [14]:
artist_df = pd.DataFrame.from_dict(artist_list)

In [None]:
artist_df = artist_df.drop_duplicates(subset=['artist_id'])