## Look what you made us do?

A Data-Driven Exploration of Taylor's Discography

This project is a deep dive into Taylor Swift's music. We're going to use the power of data to uncover the hidden secrets within her discography.

First, we run an ETL pass to create the three datasets our analysis needs.

### DATA 1: Feature Frenzy: The Invisible String

We're bringing in Spotify's API like a decoder ring for Taylor's music.  This lets us create data points that describe a song's sonic personality,  like:

* `Danceability:` How likely are you to **shake it off** to this song? <br>
* `Valence:` Happy and carefree like **22** or a touch melancholic like **Teardrops on My Guitar**? <br>
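
To make these two features a little more concrete: Spotify documents both danceability and valence as numbers between 0.0 and 1.0. The sketch below turns such a pair into a rough label. The function name, the 0.5 cut-off, and the sample values are assumptions invented for illustration; they are not Spotify definitions and not part of this project's pipeline.

```python
def describe_track(danceability: float, valence: float, cutoff: float = 0.5) -> str:
    """Turn two 0.0-1.0 audio features into a short, human-readable label.

    The cutoff is a hypothetical threshold chosen for this example only.
    """
    for name, value in (("danceability", danceability), ("valence", valence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    groove = "danceable" if danceability >= cutoff else "laid-back"
    mood = "upbeat" if valence >= cutoff else "melancholic"
    return f"{groove}, {mood}"


# Hypothetical feature values, made up for this example (not real Spotify output)
print(describe_track(danceability=0.8, valence=0.9))
print(describe_track(danceability=0.4, valence=0.2))
```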

### DATA 2: Blank Space: (Fill it with Repeated Rhymes)

Ever wondered which words Taylor loves to use together? Or how her rhyming style has evolved throughout her career?  <br>This is where we dissect her rhymes line by line and see if they're **never ever getting back together**.
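
One simple way to approximate "do these two words rhyme?", and the idea behind the rhyme score computed later in this notebook, is to count how many trailing letters the words share. Here is a minimal standalone sketch; the function name `shared_suffix_length` is ours, and the approach looks only at spelling, so it can miss rhymes that are spelled differently.

```python
def shared_suffix_length(word1: str, word2: str) -> int:
    """Count how many trailing letters two words have in common."""
    score = 0
    # Walk both words from the last letter backwards until they differ
    for a, b in zip(reversed(word1), reversed(word2)):
        if a != b:
            break
        score += 1
    return score


print(shared_suffix_length("mine", "fine"))     # 3 (shared ending "ine")
print(shared_suffix_length("through", "blue"))  # 0, even though the words rhyme when sung
```

Because the comparison is purely orthographic, a higher score suggests a rhyme but a low score does not rule one out.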

### DATA 3: Call It What You Want (Love Story, Breakup Song, You Decide!)

We've compiled a list of words that scream "love song" and another list that embodies all things "breakup."<br> Then, we'll use sentiment analysis tools to understand the overall emotional tone of the lyrics.<br> Sad words = breakup anthem, happy words = love song celebration!

Based on this analysis, we'll classify each song into three categories:

* Love Songs
* Breakup Songs
* The ever-intriguing "Unknown" Category
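
The three-way decision can be sketched on its own, separate from the text processing. The thresholds below mirror the `classify_song` function defined later in this notebook; the function name `classify_by_counts` and the sample inputs are hypothetical. The sentiment value is assumed to be a VADER compound score, which ranges from -1 (most negative) to 1 (most positive).

```python
def classify_by_counts(love_count: int, breakup_count: int, sentiment: float) -> str:
    """Apply the notebook's rule to pre-computed word counts and a sentiment score."""
    # More love words than breakup words wins outright
    if love_count > breakup_count:
        return "Love Song"
    # Breakup words must dominate AND the overall tone must not be clearly positive
    if breakup_count > love_count and sentiment < 0.1:
        return "Breakup Song"
    # Ties, or breakup-heavy lyrics with a positive tone, stay unclassified
    return "Unknown"


# Hypothetical inputs, invented for this example
print(classify_by_counts(love_count=3, breakup_count=1, sentiment=0.0))
print(classify_by_counts(love_count=1, breakup_count=3, sentiment=-0.5))
print(classify_by_counts(love_count=1, breakup_count=3, sentiment=0.5))
```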




In [None]:
#@title AllImports
import pandas as pd
import gensim
from gensim import corpora

import re
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('punkt')
nltk.download("stopwords")
nltk.download("wordnet")

nltk.download('vader_lexicon')
from wordcloud import WordCloud
import matplotlib.pyplot as plt

from gensim.models import LdaModel
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
#@title AllFiles

import os
# Project folder on Google Drive (a hard-coded path, not the process's working directory)
current_directory = "/content/drive/MyDrive/Colab Notebooks/TS Rant"
archive_directory = os.path.join(current_directory, "archive")

# List all regular files in the archive folder
file_paths = [os.path.join(archive_directory, f) for f in os.listdir(archive_directory) if os.path.isfile(os.path.join(archive_directory, f))]

#@title Create DataFrame

## Read every file in file_paths and concatenate the results into one DataFrame
frames = []
for file_path in file_paths:
  df = pd.read_csv(file_path)
  # Label each row with the part of the file name that precedes the first "-"
  df["File"] = os.path.basename(file_path).split("-")[0]
  frames.append(df)
  #scan_data(df)
maindf = pd.concat(frames)
#scan_data(maindf)

#@title Function to preprocess text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

#@title Get Lyrics DataFrame
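# Join each track's lines with a space so the last word of one line is not glued to the first word of the next
justlyric = maindf.groupby("track_title").agg({"lyric": " ".join})
justlyric["preprocesslyric"] = justlyric["lyric"].apply(preprocess_text)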
justlyric.reset_index(inplace=True)

# Data ETL

## Topics of Songs


In [None]:
#@title List of Words
breakup_words = [
    "tears", "heartbreak", "alone", "pain", "goodbye", "sadness", "miss", "cry",
    "empty", "regret", "broken", "forget", "lost", "painful", "memories", "farewell",
    "lonely", "sorrow", "heartache", "hurt", "part", "betray", "teardrops", "sigh",
    "grief", "let go", "missed", "longing", "end", "break", "painfully", "ache",
    "mistake", "leave", "depart", "teardrop", "disappointment", "forlorn",
    "lament", "anguish", "despair", "grieve", "suffer", "regretful", "bittersweet",
    "heartfelt", "desolate", "unrequited", "downhearted", "melancholy", "pining",
    "rejection", "unhappy", "woeful", "woe", "disheartened", "betrayal",
    "heartrend", "heartrending", "heartstrings", "melancholic", "mourn", "pang",
    "separation", "unrequited", "anguished", "crush", "devastate", "devastation",
    "heartbrokenness", "lamentation", "loneliness", "lost",
    "romantic rejection", "romantic loss", "sad", "sob", "sorrowful", "tearful",
    "wail", "weep", "weeping", "whimper", "yowl"
]

love_words = [
    "cherished", "adored", "treasured", "devotion", "affection", "intimacy",
    "companionship", "soulmate", "passion", "inseparable", "blissful", "enchanted",
    "euphoric", "contentment", "harmony", "trust", "respect", "understanding",
    "supportive", "encouraging", "compassionate", "playful", "laughter", "joy",
    "excitement", "admiration", "gratitude", "appreciation", "vulnerability",
    "honesty", "communication", "growth", "security", "stability", "commitment",
    "loyalty", "partnership", "teamwork", "desire", "adventure", "future",
    "togetherness", "forever", "yearning", "dream", "hold", "inspire", "moonlight",
    "promise", "pure", "smitten", "spark", "starry", "sunshine", "truelove",
    "unconditional", "whisper", "yearn", "bliss", "destiny", "endless", "everlasting",
    "flame", "gaze", "harmony", "magic", "paradise", "poetry", "sunset", "wonder", "love", "lover", 'marry'

]

love_words = preprocess_text(" ".join(love_words))
breakup_words = preprocess_text(" ".join(breakup_words))

In [None]:
#@title Classify Song Function
def get_sentiment_score(lyrics):
  analyzer = SentimentIntensityAnalyzer()
  sentiment = analyzer.polarity_scores(lyrics)
  return sentiment['compound']

def check_domain_words(lyrics, word_list):
  count = 0
  for word in word_list:
    if word in lyrics:
      count += 1
  return count


def classify_song(new_lyrics):
  cleaned_new_lyrics = preprocess_text(new_lyrics)
  sentiment_score = get_sentiment_score(new_lyrics)
  love_word_count = check_domain_words(cleaned_new_lyrics, love_words)
  breakup_word_count = check_domain_words(cleaned_new_lyrics, breakup_words)
  print(f"Love count {love_word_count}")
  print(f"Breakup count {breakup_word_count}")
  # Rule-based classification: word counts decide, and sentiment gates the breakup label
  if love_word_count > breakup_word_count:# or sentiment_score > 0.5:
    return "Love Song"

  elif breakup_word_count > love_word_count and sentiment_score < 0.1:
    return "Breakup Song"
  else:
    return "Unknown"



In [None]:
#@title Assigning Topics
justlyric["Type"]=justlyric["lyric"].apply(lambda x:classify_song(x))

In [None]:
#@title SavingDF
justlyric.to_excel(current_directory+"/LyricsFeatures.xlsx", index=False)

## Counting Repeated Rhymes/Combination of words
What kinds of repetition show up when we look across different songs?
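
Before the full pipeline, here is a minimal sketch of the counting idea on a toy token list: every ordered pair of words taken from two different positions is counted, and pairs of identical words are dropped. The function name `count_word_pairs` and the sample tokens are assumptions for illustration; the notebook's own implementation works on preprocessed lyric blocks instead.

```python
from collections import Counter
from itertools import permutations


def count_word_pairs(tokens):
    """Count ordered pairs of different words drawn from two different positions."""
    # permutations() pairs up positions, so a repeated word contributes once per occurrence
    pairs = Counter(permutations(tokens, 2))
    # Drop pairs such as ("love", "love") that only reflect a word repeating itself
    return Counter({pair: n for pair, n in pairs.items() if pair[0] != pair[1]})


# Toy tokens, invented for this example
counts = count_word_pairs(["love", "story", "love"])
print(counts.most_common())
```

Since the pairs are ordered, `("love", "story")` and `("story", "love")` are counted separately, which is worth keeping in mind when reading the totals.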

In [None]:
#@title Function to build word matrix
def build_word_matrix(paragraph):
  """Count ordered pairs of words taken from two different positions in the text."""
  word_matrix = {}
  words = preprocess_text(paragraph)
  # Visit every position, including the last word
  for i in range(len(words)):
    for j in range(len(words)):
      if i == j:
        continue  # never pair a word with itself at the same position
      word_pair = (words[i], words[j])
      word_matrix[word_pair] = word_matrix.get(word_pair, 0) + 1
  return word_matrix

In [None]:
#@title Iterate and create the final word matrix
lines_consider = 10  # number of consecutive lyric lines pooled into one block
pair_frames = []
# Walk maindf in full blocks of lines_consider rows; leftover rows at the end are skipped
for iloc in range(0,maindf.shape[0]-maindf.shape[0]%lines_consider,lines_consider):
  main_sentence = " ".join(maindf.iloc[iloc:iloc+lines_consider]["lyric"])
  # Build the pair counts once per block
  matrix = build_word_matrix(main_sentence)

  df = pd.DataFrame.from_dict(matrix, orient='index', columns=['Count'])
  df.index = pd.MultiIndex.from_tuples(df.index, names=['Word1', 'Word2'])
  pair_frames.append(df)

new_df = pd.concat(pair_frames)
new_df.reset_index(inplace=True)
# Keep only pairs of two different words, then total the counts across all blocks
new_df_1 = new_df[new_df["Word1"]!=new_df["Word2"]]
new_df_1_counter = new_df_1.groupby(["Word1", "Word2"], as_index=False)["Count"].sum().sort_values(by="Count", ascending=False)
new_df_1_counter.reset_index(drop=True, inplace=True)

In [None]:
#@title get Tracks which use combination of words using lookup
def get_tracks(row):
  a = row["Word1"]
  b = row["Word2"]
  listofsongs = justlyric[justlyric['preprocesslyric'].apply(lambda x:(a in x)&(b in x))]["track_title"].to_list()
  return listofsongs

new_df_1_counter["List Of Songs"]=new_df_1_counter.apply(lambda x:get_tracks(x),axis=1)
new_df_1_counter["total songs"] = new_df_1_counter["List Of Songs"].apply(lambda x:len(x))
new_df_1_counter.to_excel("/content/drive/MyDrive/Colab Notebooks/TS Rant/WordCounts.xlsx")

In [None]:
#@title Get Tracks Rhyme Score
def rhyme_score(row):
  score=0
  word1 = row["Word1"]
  word2 = row["Word2"]
  for i in range(1, min(len(word1), len(word2)) + 1):
      if word1[-i] == word2[-i]:
          score += 1
      else:
          break
  return score

new_df_1_counter["RhymeScore"]=new_df_1_counter.apply(lambda x:rhyme_score(x),axis=1)

In [None]:
#@title SaveDF
new_df_1_counter[new_df_1_counter['RhymeScore']>1].sort_values(by=['total songs',"RhymeScore"], ascending=False).to_excel("/content/drive/MyDrive/Colab Notebooks/TS Rant/WordCounts.xlsx", index=False)

## Song Features


In [None]:
#@title Import Spotify API Libraries
!pip install spotipy
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import numpy as np
from google.colab import userdata
clientpass = userdata.get('cleint_password')



In [None]:
#@title Authentication - without user
client_credentials_manager = SpotifyClientCredentials(client_id='f76ee452a0464e9198b742acce8b3ff6', client_secret=clientpass)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [None]:
#@title GetTrackDetails - Function
FEATURE_COLUMNS = ['track_title','danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature']

def get_track_details(track_name):
  try:
    # Take the top search hit for the title and fetch its audio features
    input_song = sp.search(f'track:{track_name}%',limit=1)['tracks']['items'][0]['id']
    input_song_features = sp.audio_features(input_song)
    track_details = pd.DataFrame(input_song_features)
    track_details["track_title"]=track_name
    track_details = track_details[FEATURE_COLUMNS]

  except Exception:
    # Lookup failed (no search hit, missing features, or an API error):
    # return an empty frame, so this track adds no rows to the results
    track_details = pd.DataFrame(columns=FEATURE_COLUMNS)
  return track_details

In [None]:
#@title Create SongFeaturesDF
song_features = pd.DataFrame()
for song in maindf["track_title"].unique():
  song_df = get_track_details(song)
  song_features = pd.concat([song_features, song_df])

song_features["popularity"]=song_features["id"].apply(lambda x: sp.track(x)["popularity"])

In [None]:
#@title Any Null Values
song_features.isnull().sum().values!=0

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

In [None]:
#@title SaveDF
song_features.to_excel("/content/drive/MyDrive/Colab Notebooks/TS Rant/SongFeatures.xlsx", index=False)