In [1]:
# import packages
#install.packages("rjson")
library(rjson)
library(dplyr)
library(stringr)
library(data.table)
library(tidyverse)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1     v readr   1.3.1
v tibble  2.1.1     v purrr   0.3.2
v tidyr   0.8.3     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x data.table::between() masks dplyr::between()
x dplyr::filter()       masks stats::filter()
x data.table::first()   masks dplyr::first()
x dplyr::lag()          masks stats::lag()


In [2]:
# read the json formatted text in a dataframe
result <- fromJSON(file="track_lyrics.txt")
df <- stack(unlist(result))
# Convert lyricsIds into strings and not factors
df['ind']<- lapply(df['ind'], as.character)  

Most songs carry their structure tags in square brackets [], with some exceptions: a few lyrics also put other content in [] (*ex: lyrics:7jS9N1cXM3b7oF35P2G6pm*). 
<br/>Parentheses () hold extra lyrical content in most songs, although a few songs place structure tags such as chorus inside parentheses. We will first analyze and clean the lyrics that contain such bracketed content.

In [None]:
# pattern to find bracketed tag content -> () and []
pattern = "[\\(\\[].*?[\\]\\)]" 
# Extracts strings that match the pattern and returns a matrix 
bracketed_patterns <- str_extract_all(df$values, pattern = pattern, simplify = TRUE)
# Add lyricsIds to the matrix
m <- cbind(df$ind, bracketed_patterns)
# File contains [] bracketed content corresponding to a particular lyricsId
write.csv(m, 'tags.csv')
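
To see what the bracket pattern above captures, here is a minimal sketch on a made-up lyric string. The toy text and the names `lyric` and `found` are assumptions for illustration; base R `regmatches()` with `perl = TRUE` stands in for `str_extract_all()` so the snippet has no package dependencies.

```r
# Toy lyric invented for this illustration (not from the dataset)
lyric <- "[Verse 1]\nHello (hello) world\n[Chorus: Artist]\nLa la (x2)"

# Same pattern as in the cell above: an opening ( or [, a lazy run of
# characters, then a closing ] or )
pattern <- "[\\(\\[].*?[\\]\\)]"

# gregexpr() finds every match; regmatches() returns the matched substrings
found <- regmatches(lyric, gregexpr(pattern, lyric, perl = TRUE))[[1]]
print(found)
```

Note that the pattern does not require the opening and closing brackets to be of the same kind, so a span such as `(text]` would also be captured.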

In [4]:
tags <- read.csv("tags.csv", stringsAsFactors=FALSE)
tail(tags, 10)

Unnamed: 0,X,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V183,V184,V185,V186,V187,V188,V189,V190,V191,V192
35527,35527,lyrics:2JT6RdmoR8L6yPHwVcvwZO,[Intro],[Strophe 1],[Refrain],[Post-Refrain],[Strophe 2],[Refrain],[Outro],,...,,,,,,,,,,
35528,35528,lyrics:2xYZtNV1Mn9pPc6nYSzg1Z,[Verse 1],[Pre-Chorus],[Chorus],[Verse 2],[Pre-Chorus],[Chorus],[Bridge],[Chorus],...,,,,,,,,,,
35529,35529,lyrics:76nvqWPFucUra1xCkN1tnD,[Verse 1],[Chorus],[Verse 2],[Chorus],[Bridge],[Verse 3],[Chorus],,...,,,,,,,,,,
35530,35530,lyrics:6epvwUINain4iSHCTWA0sj,[Verse 1],[Chorus],[Verse 2],[Chorus],[Break],[Chorus],,,...,,,,,,,,,,
35531,35531,lyrics:6oCs7dQGdcU83QL5q0bfZX,[Intro],[Verse 1],[Hook],[Verse 2],[Verse 3],[Hook],[Verse 4],(Repeat),...,,,,,,,,,,
35532,35532,lyrics:44ut2KJnt0HQMnUaMlDv9W,[Intro],[Verse 1],[Chorus: Sample + Eminem],[Verse 2],(chig-chigga-ret-ret),[Chorus: Sample + Eminem],[Verse 3],[Chorus: Sample + Eminem],...,,,,,,,,,,
35533,35533,lyrics:0yN760GAUdsTiV4aJbbI5y,[Verse 1],[Chorus 1],[Verse 2],[Chorus 2],[Outro],[Verse 3],,,...,,,,,,,,,,
35534,35534,lyrics:2p4ghzfBBcxQhduiTMhivf,[Hook],[Part 1],[Hook],[Part 2],(sie wollen es),[Hook],,,...,,,,,,,,,,
35535,35535,lyrics:04000SjlfJUmFqFlE6ipYs,[Verse 1],[?],[Pre-Chorus],[Chorus],(it’s time to push back),(it's time to push back),[Verse 2],[Pre-Chorus],...,,,,,,,,,,
35536,35536,lyrics:6DuGNlfZwr8HtGzosIE6Ur,,,,,,,,,...,,,,,,,,,,


Many lyrics combine the structure tag with the artist name, in formats such as [Intro: will.i.am], [Hook:Amewu], or [Verse 1 - Carter]. To make the unique tags easier to find, we strip this extra content from each tag and later ignore any tag longer than 20 characters.
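
As a rough sketch of this cleaning step, the snippet below applies a single-pipe version of the regular expression to a handful of example tags. The vector `raw_tags` and the final `trimws()` call are assumptions added for illustration and are not part of the notebook's own cell.

```r
# Example tags in the formats mentioned above (chosen for illustration)
raw_tags <- c("[Intro: will.i.am]", "[Hook:Amewu]", "[Verse 1 - Carter]",
              "[Chorus 2]", "(Repeat)")

# Drop the enclosing bracket characters
inner <- substring(raw_tags, 2, nchar(raw_tags) - 1)

# Strip " - artist", ": artist", producer credits, a trailing digit,
# and anything after a slash
cleaned <- gsub("( -.*)|(:.*)|(Produced.*)|(\\d$)|(/.*)", "", inner)

# Assumption: trim the space left behind when a trailing digit is removed
cleaned <- trimws(cleaned)
print(cleaned)
```

In this sketch "[Verse 1 - Carter]" keeps its number, because `\\d$` only matches a digit at the very end of the original string; a second pass would be needed to remove it.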

In [5]:
cleaned_tags <- tags %>% select(-c(X, V1)) 
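# Drop the enclosing brackets, then strip artist names, producer credits, a trailing digit, and anything after "/"
cleaned_tags[] <- lapply(cleaned_tags, function (x) {s <- substring(x, 2, nchar(x)-1); gsub("( -.*)|(:.*)|(Produced.*)|(\\d$)|(/.*)","",s)})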
tail(cleaned_tags, 10)

Unnamed: 0,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,...,V183,V184,V185,V186,V187,V188,V189,V190,V191,V192
35527,Intro,Strophe,Refrain,Post-Refrain,Strophe,Refrain,Outro,,,,...,,,,,,,,,,
35528,Verse,Pre-Chorus,Chorus,Verse,Pre-Chorus,Chorus,Bridge,Chorus,,,...,,,,,,,,,,
35529,Verse,Chorus,Verse,Chorus,Bridge,Verse,Chorus,,,,...,,,,,,,,,,
35530,Verse,Chorus,Verse,Chorus,Break,Chorus,,,,,...,,,,,,,,,,
35531,Intro,Verse,Hook,Verse,Verse,Hook,Verse,Repeat,Verse,Verse,...,,,,,,,,,,
35532,Intro,Verse,Chorus,Verse,chig-chigga-ret-ret,Chorus,Verse,Chorus,Outro,,...,,,,,,,,,,
35533,Verse,Chorus,Verse,Chorus,Outro,Verse,,,,,...,,,,,,,,,,
35534,Hook,Part,Hook,Part,sie wollen es,Hook,,,,,...,,,,,,,,,,
35535,Verse,?,Pre-Chorus,Chorus,it’s time to push back,it's time to push back,Verse,Pre-Chorus,Chorus,it's time to push back,...,,,,,,,,,,
35536,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# Find a list of unique tags
unique_tags <- c()
for (col in colnames(cleaned_tags)){
    for (row in 1:nrow(cleaned_tags)){
        item <- cleaned_tags[row, col]
        # nchar() gives the string length; length() of a single string is always 1
        if (!is.na(item) && nchar(item) <= 20 && !(item %chin% unique_tags)){
            unique_tags <- c(unique_tags, item)
        }
    }
}
write(unique_tags, "uniquetags.txt")
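
The nested loop above can also be written without explicit iteration. The following is a sketch only, on a small made-up data frame (`cleaned_tags_toy` is a hypothetical stand-in for `cleaned_tags`); it assumes the intent is to keep distinct tags of at most 20 characters.

```r
# Hypothetical stand-in for cleaned_tags
cleaned_tags_toy <- data.frame(
  V2 = c("Intro", "Verse", ""),
  V3 = c("Verse", "Chorus", "a very long bracketed lyric line"),
  stringsAsFactors = FALSE
)

# Flatten column by column, the same order in which the loop visits cells
all_tags <- unlist(cleaned_tags_toy, use.names = FALSE)

# Keep non-missing tags of at most 20 characters, then deduplicate
keep <- !is.na(all_tags) & nchar(all_tags) <= 20
unique_tags_toy <- unique(all_tags[keep])
print(unique_tags_toy)
```

This avoids growing a vector inside a loop, which matters on a table with tens of thousands of rows and nearly two hundred columns.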


In [13]:
# Note: The uniquetags file contains the song structure tags as well as some additional lyrical content tags. 
# After analyzing the file, I could find the following main structure tags used in the lyrics

some_tags <- c("intro", "verse", "outro", "pre-chorus", "chorus", "post-chorus", "pre-hook", "hook", "post-hook", "pre-refrain", "refrain", "post-refrain", "strophe", "break", "bridge", "instrumental", "solo", "ad-lib", "couplet", "build", "interlude", "repeated", "blank", "?")
tag_count <- rep(0, length(some_tags))
names(tag_count) <- some_tags
blank <- c()
repeated <- c()
quesIds <- c()

rows <- nrow(cleaned_tags)
for (row in 1:rows){
    ls <- tolower(unlist(cleaned_tags[row,]))
    if (all(ls == "")) {
        tag_count["blank"] <- tag_count["blank"] + 1
        blank <- c(blank, row)
    } else{
        if (any(grepl("intro", ls)))
            tag_count['intro'] <- tag_count['intro'] + 1
        if (any(grepl("outro", ls)) || any(ls == "conclusion"))
            tag_count['outro'] <- tag_count['outro'] + 1
        if (any(grepl("vers", ls)))
            tag_count['verse'] <- tag_count['verse'] + 1
        if (any(ls == "prechorus") || any(ls == "pre-chorus"))
            tag_count['pre-chorus'] <- tag_count['pre-chorus'] + 1
        if (any(ls == "chorus"))
            tag_count["chorus"] <- tag_count["chorus"] + 1
        if (any(ls == "postchorus") || any(ls == "post-chorus"))
            tag_count['post-chorus'] <- tag_count['post-chorus'] + 1
        if (any(ls == "prehook") || any(ls == "pre-hook"))
            tag_count['pre-hook'] <- tag_count['pre-hook'] + 1
        if (any(ls == "hook"))
            tag_count["hook"] <- tag_count["hook"] + 1
        if (any(ls == "posthook") || any(ls == "post-hook"))
            tag_count['post-hook'] <- tag_count['post-hook'] + 1
        if (any(ls == "prerefrain") || any(ls == "pre-refrain"))
            tag_count['pre-refrain'] <- tag_count['pre-refrain'] + 1
        if (any(ls == "refrain"))
            tag_count["refrain"] <- tag_count["refrain"] + 1
        if (any(ls == "postrefrain") || any(ls == "post-refrain"))
            tag_count['post-refrain'] <- tag_count['post-refrain'] + 1
        if (any(ls == "strophe") || any(ls == "strofa"))
            tag_count["strophe"] <- tag_count["strophe"] + 1
        if (any(grepl("break", ls)))
            tag_count["break"] <- tag_count["break"] + 1
        if (any(ls == "bridge"))
            tag_count["bridge"] <- tag_count["bridge"] + 1
        if (any(ls == "interlude"))
            tag_count["interlude"] <- tag_count["interlude"] + 1
        if (any(grepl("solo", ls)))
            tag_count["solo"] <- tag_count["solo"] + 1
        if (any(ls == "instrumental"))
            tag_count["instrumental"] <- tag_count["instrumental"] + 1
        if (any(grepl("couplet", ls)))
            tag_count["couplet"] <- tag_count["couplet"] + 1
        if (any(ls == "buildup") || any(ls == "build up") || any(ls == "build") || any(ls == "build-up"))
            tag_count["build"] <- tag_count["build"] + 1
        if (any(ls == "ad lib") || any(ls == "ad-lib") || any(ls == "ad-libs") || any(ls == "ad libs"))
            tag_count["ad-lib"] <- tag_count["ad-lib"] + 1
        if (any(ls == "?")){
            tag_count["?"] <- tag_count["?"] + 1
            quesIds <- c(quesIds, row)
        }
        if (any(grepl("repeat", ls)) || any(grepl("\\dx", ls)) || any(grepl("x\\d", ls))){
            tag_count["repeated"] <- tag_count["repeated"] + 1
            repeated <- c(repeated, row)
        }
        
    }
    
     
}


In [14]:
# Distribution of tags
tag_count
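
The long chain of `if` statements above counts, for each tag, the number of songs in which it occurs at least once. A table-driven variant is sketched below; the list `tag_rows`, the named vector `patterns`, and the reduced set of four tags are assumptions for illustration, not the notebook's full tag list.

```r
# Each element holds the lower-cased tags of one song (toy data)
tag_rows <- list(
  c("intro", "verse 1", "chorus"),
  c("verse", "hook", "verse"),
  c("", "")
)

# One regular expression per structure tag; anchored patterns mimic the
# exact comparisons (ls == "chorus"), unanchored ones mimic grepl("vers", ls)
patterns <- c(intro = "intro", verse = "vers", chorus = "^chorus$", hook = "^hook$")

# For every pattern, count the songs that contain at least one matching tag
counts <- sapply(patterns, function(p) {
  sum(sapply(tag_rows, function(tags) any(grepl(p, tags))))
})
print(counts)
```

Adding a new tag then only requires a new entry in `patterns` rather than another `if` block.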

In [15]:
# Get the lyricsIds corresponding to the blank, repeated, and "?" rows
lyricIds <- tags$V1
blankIds <- lyricIds[blank]
repeatIds <- lyricIds[repeated]
quesIds <- lyricIds[quesIds]
write(blankIds, "blankIds.txt")
write(repeatIds, "repeatIds.txt")
write(quesIds, "quesIds.txt")

**Missing Text**

In [18]:
# pattern to find missing text -> two structure tags in immediate succession, like [chorus] [bridge],
# indicate that the text of the first tag is missing
pattern = "\\[.*?\\]\\s*\\[.*?\\]" 
# Extracts strings that match the pattern and returns a list with one element per song
missing_text <- str_extract_all(df$values, pattern = pattern)
s <- sapply(missing_text, function(x) length(x) > 0)

print(sum(s))
lw <- which(s == TRUE)
indices <- df$ind[lw]
write(indices, "missingText.txt") 

[1] 3967


The actual count is lower than 3967, because the pattern also captures false positives such as:<br>
1. \[Letra de "Mi Buen Amor" ft. Bunbury\] \[Intro\] -> the first tag is a title header, so no text is missing
2. \[Instrumental\] \[Chorus\] -> an instrumental section has no lyrics, so no text is missing
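
A small sketch of how such false positives arise, and of one possible way to flag them. The three toy lyrics and the short header list (`Letra de`, `Instrumental`) are assumptions for illustration; the real data contains more non-lyrical tags than these two.

```r
# Toy lyrics: a true missing-text case, a title-header case, and a normal song
lyrics <- c(
  "[Chorus]\n\n[Bridge]\nSome words",
  "[Letra de \"X\" ft. Y]\n\n[Intro]\nWords",
  "[Verse]\nWords\n\n[Chorus]\nMore"
)

# Two tags separated only by whitespace
pattern <- "\\[.*?\\]\\s*\\[.*?\\]"
hits <- grepl(pattern, lyrics, perl = TRUE)

# First adjacent tag pair of every lyric that matched
pairs <- regmatches(lyrics, regexpr(pattern, lyrics, perl = TRUE))

# Flag pairs whose first tag is a known non-lyrical header (assumed list)
likely_false <- grepl("^\\[(Letra de|Instrumental)", pairs, ignore.case = TRUE)
print(hits)
print(likely_false)
```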

# Fill in missing text 
**The count obtained above is an approximate estimate that includes false positives. Below we use some regex processing to get more accurate counts and to fill in the missing text.**<br/>
*I observed the following patterns in many songs (Note: there are exceptions, ex: lyrics:0RBQjtCbJ2GXfSdN9MwJGt, where the tags belonging to a paragraph are themselves separated by \n\n)*
1. Repeat tag right after \n\n, ex: \n\n(2x) text \n\n<br/>
    In this case, the "text" is repeated twice 
2. Tag with \n\n both before and after it, ex: \n\n[Hook](2x)\n\n<br/>
    In this case, the [Hook] part that appeared earlier in the lyrics is repeated twice
3. Repeat tag right before \n\n, ex: \n\n text (2x)\n\n <br/>
   In this case, the text between the previous \n\n and (2x) is repeated twice
4. Repeat count inside the structure tag, ex: \n\n[Hook 2x]\n text \n\n<br/>
   In this case, the text between the two \n\n is repeated twice
5. Tags like [Repeat] or [Repeated] do not say how many times the text has to be repeated -> count such songs
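
The repetition count in patterns such as (2x), [Hook 2x], or [Chorus x4] can be pulled out with one regular expression. The helper `extract_count()` below is a hypothetical name and a base-R sketch only; it returns `NA` for tags that carry no count, such as [Repeat].

```r
# Return the repetition count encoded in a tag, or NA if there is none
extract_count <- function(tag) {
  # "2x" (digits then x) or "x2" (x then digits), case-insensitive
  m <- regmatches(tag, regexpr("\\d+x(?![0-9])|x\\d+", tag,
                               perl = TRUE, ignore.case = TRUE))
  if (length(m) == 0) return(NA_real_)
  # Remove every non-digit character and convert what is left
  as.numeric(gsub("\\D", "", m))
}

tags <- c("[Hook 2x]", "[Chorus x4]", "(2x)", "[Refrain]", "[Chorus 1, 2x]", "[Repeat]")
counts <- vapply(tags, extract_count, numeric(1), USE.NAMES = FALSE)
print(counts)
```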

In [94]:
# Each song is separated into paragraphs where the delimiter is '\n\n'
# We will group the songs by "ind" and give a sequence number for each paragraph which will be later used to combine the text
df <- df %>% group_by(ind) %>% separate_rows(values, sep = "\n\n") %>% filter(values != "") %>% mutate(seqn = seq(from = 1, to = n(), by = 1.0))
# Extract the structure tag of each paragraph into a separate column
# Remove the tag from the text
df <- df %>% mutate(tags = str_extract(values, regex("(^\\[.*\\])|(^\\(.*\\))", ignore_case=TRUE))) %>% mutate(text = str_replace(values, regex("(^\\[.*\\])|(^\\(.*\\))", ignore_case=TRUE), replacement="")) %>% select(ind, seqn, text, tags)

head(df)

ind,seqn,text,tags
lyrics:7gnEnrwH5VLmM9rJuXphyL,1,Una isla en el mar Sin corrientes navegar Soy un barco de papel Al atardecer,
lyrics:7gnEnrwH5VLmM9rJuXphyL,2,Junto con la soledad Me persigue la verdad Solo siento tu calor Aquí sigo yo,
lyrics:7gnEnrwH5VLmM9rJuXphyL,3,"Esperándote Esperándote Esperándote Esperando, esperando Esperándote Con tus fotografías En el mar ya perdidas Esperándote, esperándote",
lyrics:7gnEnrwH5VLmM9rJuXphyL,4,En el reflejo puedo ver Tú a mi lado aquella vez Al tocarte se borró Solo una ilusión,
lyrics:7gnEnrwH5VLmM9rJuXphyL,5,Y dulces palabras de tu voz Desde aquí escucho yo Y debería dejarte ir Pero sigo aquí,
lyrics:7gnEnrwH5VLmM9rJuXphyL,6,"Esperándote Esperándote Esperándote Esperando, esperando Esperándote Con tus fotografías En el mar ya perdidas Esperándote, esperándote",


In [4]:
head(df %>% filter(!(is.na(tags))), 10)

ind,seqn,text,tags
lyrics:1oUKoIM88V89qcL6Sm6GvH,1,,[Instrumental]
lyrics:2Uebx28wsRfcxQpB7R0xA5,1,"Alle dachten, unser Lied wär' vorbei Sie wollten nur, dass wir uns verspiel'n Doch jeden Ton haben wir genauso gemeint Jede Note wird uns neu inspirier'n",[Strophe 1]
lyrics:2Uebx28wsRfcxQpB7R0xA5,2,"Bist du laut, bin ich gerne mal still Denn dein Sound trägt mich überallhin Nur zu zweit ergeben wir Sinn, Sinn",[Pre-Refrain]
lyrics:2Uebx28wsRfcxQpB7R0xA5,3,"Wie Schwarz und Weiß auf einem Klavier, yeah Mein Song funktioniert nur mit dir, yeah Wie Schwarz und Weiß auf einem Klavier, yeah Will jeden Tag mit dir komponier'n, -nier'n, -nier'n",[Refrain]
lyrics:2Uebx28wsRfcxQpB7R0xA5,4,"Und wir tanzen als wär'n wir hier allein Bis die Nacht im roten Morgen ertrinkt Bei jeder Schnapsidee bist du mit dabei Und geh'n die Lichter aus, versteh'n wir uns blind",[Strophe 2]
lyrics:2Uebx28wsRfcxQpB7R0xA5,5,"Bist du laut, bin ich gerne mal still Denn dein Sound trägt mich überallhin Nur zu zweit ergeben wir Sinn, Sinn",[Pre-Refrain]
lyrics:2Uebx28wsRfcxQpB7R0xA5,6,"Wie Schwarz und Weiß auf einem Klavier, yeah Mein Song funktioniert nur mit dir, yeah Wie Schwarz und Weiß auf einem Klavier, yeah Will jeden Tag mit dir komponier'n, -nier'n Wie Schwarz und Weiß auf einem Klavier (nur du und ich) Mein Song funktioniert nur mit dir ([?]) Auch wenn nicht jeder Ton harmoniert, yeah (nur du und ich) Wie Schwarz und Weiß auf einem Klavier, -vier, -vier",[Refrain]
lyrics:2Uebx28wsRfcxQpB7R0xA5,7,"(Break it down!) Yeah, du-du-du-du-du-du-du-du Zu zweit schaffen wir's im Akkord, -kord, -kord",[Bridge]
lyrics:2Uebx28wsRfcxQpB7R0xA5,8,"Wie Schwarz und Weiß auf einem Klavier, yeah Mein Song funktioniert nur mit dir, yeah Wie Schwarz und Weiß auf einem Klavier, yeah Will jeden Tag mit dir komponier'n, -nier'n Wie Schwarz und Weiß auf einem Klavier (nur du und ich) Mein Song funktioniert nur mit dir ([?]) Auch wenn nicht jeder Ton harmoniert, yeah (nur du und ich) Wie Schwarz und Weiß auf einem Klavier, -vier, -vier",[Refrain]
lyrics:1IqKlSMBlwWF0kxY2fBstA,1,"Yeah, it's over You can bet in mid-October I will still be ranting 'bout most early May Yeah, he's a winner He's a goddamn sinner While he dines, I'm on the wrong side of the day And I said, ""I don't understand why I'm fumbling after."" You're the reason I cannot forget this season Or the letter when you first referred to it And I said",[Verse 1]


There are six different cases to handle
1. Case: Structure tags with repetition number mentioned within them, ex:[Chorus 1, 2x]  and text column is not empty.<br/>
   Solution: Extract the repetition number and copy the text (number) times.
2. Case: Structure tags with repetition number mentioned within them, ex:[2x Refrain] but text column is empty.<br/>
   Solution: Retrieve the text for the song corresponding to the Tag (ex: Refrain) and copy the text (number) times.
3. Case: Structure tags with no repetition number but text column is blank. (This is a case of Missing Text)<br/> 
   Solution: Retrieve and add the text for the song corresponding to the Tag
4. Case: Structure tags with no repetition number but text column is present, ex: [chorus] <br/>
   Solution: Search the text for any repetitions of lines or words   
5. Case: Missing Tag column but text is present.<br/>
   Solution: Search the text for any repetitions of lines or words
6. Case: Tags which say only [Repeat] but do not provide any count.<br/>
   Solution: Remove the tag and keep a count of such songs 
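
The six cases above can be expressed as a small decision function over one paragraph's tag and text. This is a sketch under stated assumptions: `classify_case()` is a hypothetical helper, a missing tag is represented as `NA` or an empty string, and a bare repeat tag is taken to be exactly [Repeat] or [Repeated] (in either bracket style).

```r
# Map one (tag, text) pair to the case number from the list above
classify_case <- function(tag, text) {
  has_tag  <- !is.na(tag) && tag != ""
  has_text <- !is.na(text) && text != ""
  # Repetition count such as "2x" or "x2" inside the tag
  has_count <- has_tag && grepl("\\dx|x\\d", tag, ignore.case = TRUE)
  # A tag that says only Repeat / Repeated, with no count
  bare_repeat <- has_tag &&
    grepl("^[\\[(]repeat(ed)?[\\])]$", tag, ignore.case = TRUE, perl = TRUE)

  if (bare_repeat) return(6L)
  if (!has_tag) return(if (has_text) 5L else NA_integer_)
  if (has_count) return(if (has_text) 1L else 2L)
  if (has_text) 4L else 3L
}

print(classify_case("[Chorus 1, 2x]", "la la"))  # tag with count, text present
print(classify_case("[2x Refrain]", ""))         # tag with count, text missing
print(classify_case("[Chorus]", ""))             # plain tag, text missing
```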

In [5]:
# Let us extract the tags with repetitions 
d1 <- df %>% filter(str_detect(tags, regex("(\\dx)|(x\\d)", ignore_case=TRUE))) 

# Extract repeated count
d1 <- d1 %>% mutate(repetition = case_when(str_detect(tags, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_extract(tags, regex("\\d+x", ignore_case=TRUE)), str_detect(tags, regex("x\\d", ignore_case=TRUE)) ~ str_extract(tags, regex("x\\d+", ignore_case=TRUE)) , TRUE ~ ""))
d1 <- d1 %>% mutate(count = as.numeric(str_extract(repetition, regex("\\d+")))) %>% select(-repetition) 

head(d1, 10)

ind,seqn,text,tags,count
lyrics:2e7dA6ow15brIG92PiaWtR,1,"Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth",[Refrain x4],4
lyrics:2e7dA6ow15brIG92PiaWtR,4,"Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth",[Refrain x4],4
lyrics:6RCHCcIwDcZtTMEARo5du1,6,"I would be nothing, oh well I would be something less than better Without You",(Bridge x2),2
lyrics:7nIQJzRSR1q6nU1hUP58sy,6,,[2x Refrain],2
lyrics:5WX61BvCKJmkkpmRLipPaW,2,LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO!,[Chorus x4],4
lyrics:5WX61BvCKJmkkpmRLipPaW,5,,[Chorus x4],4
lyrics:5WX61BvCKJmkkpmRLipPaW,8,,[Chorus x8],8
lyrics:4nveEBWpCr7AE31MxDGVJy,11,Bevor wir eure Füße küssen Werden wir sie euch brechen Mit einer Knarre zwischen deinen Zähnen Kannst du nur in Vokalen sprechen,"[Chorus 1, 2x]",2
lyrics:3URx9ueVCeabhgqCRCBBqE,6,,(Chorusx3),3
lyrics:4BodkdQk2Ouo6BFbLqoWtG,6,"I say Tonight I've fallen and I can't get up I need your loving hands to come and pick me up And every night I miss you I can just look up And know the stars are Holding you, holding you, holding you Tonight",[Chorus X2],2


In [6]:
# CASE 1 -> Text column present
df_text <- d1 %>% filter(text != "")

# Most lines start with '\n', but we append a space to each copy so that repeated copies never run together
# Use the vectorized & (not ||) so that the condition is evaluated for every row
df_text <- df_text %>% mutate(updated = case_when(!is.na(count) & count != 0 ~ str_dup(paste(text, ""), count), TRUE ~ text))
head(df_text)

ind,seqn,text,tags,count,updated
lyrics:2e7dA6ow15brIG92PiaWtR,1,"Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth",[Refrain x4],4,"Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth"
lyrics:2e7dA6ow15brIG92PiaWtR,4,"Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth",[Refrain x4],4,"Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth Everybody moves, nobody get hurt (bo, bo) Tonight we'll shake the earth"
lyrics:6RCHCcIwDcZtTMEARo5du1,6,"I would be nothing, oh well I would be something less than better Without You",(Bridge x2),2,"I would be nothing, oh well I would be something less than better Without You I would be nothing, oh well I would be something less than better Without You"
lyrics:5WX61BvCKJmkkpmRLipPaW,2,LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO!,[Chorus x4],4,LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO! LET YOURSELF GO!
lyrics:4nveEBWpCr7AE31MxDGVJy,11,Bevor wir eure Füße küssen Werden wir sie euch brechen Mit einer Knarre zwischen deinen Zähnen Kannst du nur in Vokalen sprechen,"[Chorus 1, 2x]",2,Bevor wir eure Füße küssen Werden wir sie euch brechen Mit einer Knarre zwischen deinen Zähnen Kannst du nur in Vokalen sprechen Bevor wir eure Füße küssen Werden wir sie euch brechen Mit einer Knarre zwischen deinen Zähnen Kannst du nur in Vokalen sprechen
lyrics:4BodkdQk2Ouo6BFbLqoWtG,6,"I say Tonight I've fallen and I can't get up I need your loving hands to come and pick me up And every night I miss you I can just look up And know the stars are Holding you, holding you, holding you Tonight",[Chorus X2],2,"I say Tonight I've fallen and I can't get up I need your loving hands to come and pick me up And every night I miss you I can just look up And know the stars are Holding you, holding you, holding you Tonight I say Tonight I've fallen and I can't get up I need your loving hands to come and pick me up And every night I miss you I can just look up And know the stars are Holding you, holding you, holding you Tonight"


In [95]:
# clean the tags in the bigger table (df)
df <- df %>% mutate(new_tags = gsub("[^[:alnum:]]", "", tags))
head(df %>% filter(!(is.na(tags))))

ind,seqn,text,tags,new_tags
lyrics:1oUKoIM88V89qcL6Sm6GvH,1,,[Instrumental],Instrumental
lyrics:2Uebx28wsRfcxQpB7R0xA5,1,"Alle dachten, unser Lied wär' vorbei Sie wollten nur, dass wir uns verspiel'n Doch jeden Ton haben wir genauso gemeint Jede Note wird uns neu inspirier'n",[Strophe 1],Strophe1
lyrics:2Uebx28wsRfcxQpB7R0xA5,2,"Bist du laut, bin ich gerne mal still Denn dein Sound trägt mich überallhin Nur zu zweit ergeben wir Sinn, Sinn",[Pre-Refrain],PreRefrain
lyrics:2Uebx28wsRfcxQpB7R0xA5,3,"Wie Schwarz und Weiß auf einem Klavier, yeah Mein Song funktioniert nur mit dir, yeah Wie Schwarz und Weiß auf einem Klavier, yeah Will jeden Tag mit dir komponier'n, -nier'n, -nier'n",[Refrain],Refrain
lyrics:2Uebx28wsRfcxQpB7R0xA5,4,"Und wir tanzen als wär'n wir hier allein Bis die Nacht im roten Morgen ertrinkt Bei jeder Schnapsidee bist du mit dabei Und geh'n die Lichter aus, versteh'n wir uns blind",[Strophe 2],Strophe2
lyrics:2Uebx28wsRfcxQpB7R0xA5,5,"Bist du laut, bin ich gerne mal still Denn dein Sound trägt mich überallhin Nur zu zweit ergeben wir Sinn, Sinn",[Pre-Refrain],PreRefrain


In [96]:
# CASE 2 and CASE 3 - (Missing text)
# For case 2, firstly, we will find the missing text and then duplicate
# We blank out tags such as Instrumental, which lead to false positives in the missing-text case. There may be more such tags; I have only removed the ones I could find
df <- df %>% mutate(new_tags = str_replace(new_tags, regex("(.*Instrumental.*)|(.*Guitar.*)|(.*Piano.*)|(.*Flute.*)|(Letrade.*)|(.*Violin.*)|(.*Music.*)|(.*Nonlyrical.*)|(.*Break.*)|(Writtenby.*)|(Producedby.*)|(Directedby.*)|(Videoby.*)|(.*Interlude.*)|(Songtext.*)|(Versurilepiesei.*)", ignore_case=TRUE), replacement=""))

# Count the non-empty text and tag rows
# This is done to handle exceptional cases where the tags belonging to a paragraph are themselves separated from it by "\n\n". 
# Since we split on "\n\n", such cases leave the tag column blank
# Comparing the text and tag counts identifies such cases, since lyrics written according to the Genius guidelines have one tag per paragraph, so text_count equals tag_count
count_df <- df %>% group_by(ind) %>% mutate(text_count = sum(text != ""), tag_count = sum(new_tags != "")) %>% select(ind, text_count, tag_count) %>% distinct()
head(count_df)

ind,text_count,tag_count
lyrics:7gnEnrwH5VLmM9rJuXphyL,11,
lyrics:1oUKoIM88V89qcL6Sm6GvH,0,0.0
lyrics:69jpwMvg0nQvUp967EyKlQ,6,
lyrics:4ZdUIPWmzvCON2oCqCbG9n,1,
lyrics:0VsSTcBOG7aN2iWPt84fYx,6,
lyrics:2Uebx28wsRfcxQpB7R0xA5,8,8.0
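
The blank `tag_count` values in the output above are consistent with `sum()` returning `NA` whenever a song has at least one paragraph without a tag. The sketch below reproduces this on toy data in base R; the data frame `toy` and the `na.rm = TRUE` variant are assumptions added for comparison and are not what the notebook cell does.

```r
# Song "a": tagged paragraphs, one with empty text. Song "b": one untagged paragraph.
toy <- data.frame(
  ind      = c("a", "a", "b", "b", "b"),
  text     = c("la", "", "x", "y", "z"),
  new_tags = c("Chorus", "Chorus", NA, "Verse", "Hook"),
  stringsAsFactors = FALSE
)
by_song <- split(toy, toy$ind)

text_count <- vapply(by_song, function(d) sum(d$text != ""), integer(1))
# Strict count: a single NA tag makes the whole sum NA
tag_strict <- vapply(by_song, function(d) sum(d$new_tags != ""), integer(1))
# Lenient count: NA tags are simply skipped
tag_narm <- vapply(by_song, function(d) sum(d$new_tags != "", na.rm = TRUE), integer(1))

# Songs flagged as having fewer texts than tags; NA comparisons are dropped
flagged <- names(text_count)[which(text_count < tag_strict)]
print(flagged)
```

With the strict count, songs that mix tagged and untagged paragraphs never reach the missing-text step, which may or may not be the intended behavior.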


In [9]:
# find cases where text_count is less than tag_count
missing <- count_df %>% filter(text_count < tag_count) %>% select(ind) %>% unlist()
missing_df <- df %>% filter(ind %in% missing)
# find tags with no text
temp <- missing_df %>% filter(text == "")

# among those rows, find the tags that carry a repetition count
d2 <- temp %>% filter(str_detect(new_tags, regex("(\\dx)|(x\\d)", ignore_case=TRUE)))

# Extract repeated count
d2 <- d2 %>% mutate(repetition = case_when(str_detect(new_tags, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_extract(new_tags, regex("\\d+x", ignore_case=TRUE)), str_detect(new_tags, regex("x\\d", ignore_case=TRUE)) ~ str_extract(new_tags, regex("x\\d+", ignore_case=TRUE)) , TRUE ~ ""))
d2 <- d2 %>% mutate(count = as.numeric(str_extract(repetition, regex("\\d+")))) %>% select(-repetition) 
d2 <- d2 %>% mutate(only_tags = case_when(str_detect(new_tags, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ gsub("[^[:alnum:]]", "", str_replace(new_tags, regex("\\d+x", ignore_case=TRUE), replacement="")), str_detect(new_tags, regex("x\\d", ignore_case=TRUE)) ~ gsub("[^[:alnum:]]", "", str_replace(new_tags, regex("x\\d+", ignore_case=TRUE), replacement="")) , TRUE ~ ""))

head(d2)

ind,seqn,text,tags,new_tags,count,only_tags
lyrics:76scZFO4mJYr1uOBuMbsoM,4,,[Hook x2],Hookx2,2,Hook
lyrics:22toXNoPAv9mo7S1ooO9f8,4,,[Chorus: 3x],Chorus3x,3,Chorus
lyrics:1pU1mucfoUalVw9apwnDhh,6,,[Chorus x2],Chorusx2,2,Chorus
lyrics:7zRHAPEXcjhTBl5DYxApsB,8,,[Chorus x2],Chorusx2,2,Chorus
lyrics:2cGlXalkjDkmHj1UsqfP4c,4,,[Refrain 2x],Refrain2x,2,Refrain
lyrics:7x2GpUybeyWCdEZoLNcFAI,5,,[2x Hook],2xHook,2,Hook


In [10]:
# Remaining tags without repetition
d3 <- temp %>% filter(!(str_detect(new_tags, regex("(\\dx)|(x\\d)", ignore_case=TRUE))))
head(d3)

ind,seqn,text,tags,new_tags
lyrics:0YdQC0bMttPyFCqL3cEJd0,5,,[Hook],Hook
lyrics:7zWhta5mrjEwBBwki9vao7,4,,[Chorus],Chorus
lyrics:3SppxNj9EhZJ8VBfxruP1S,3,,[Outro],Outro
lyrics:6Wor2nJG9F0n6qJczmHR58,4,,[Chorus],Chorus
lyrics:3zo3hPlqMWO9NezEMl3Kjh,5,,[Mini Bridge],MiniBridge
lyrics:3zo3hPlqMWO9NezEMl3Kjh,7,,[Bridge],Bridge


In [11]:
# Remove repetition string from the tags in missing_df table
missing_df <- missing_df %>% mutate(new_tags = case_when(str_detect(new_tags, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ gsub("[^[:alnum:]]", "", str_replace(new_tags, regex("\\d+x", ignore_case=TRUE), replacement="")), str_detect(new_tags, regex("x\\d", ignore_case=TRUE)) ~ gsub("[^[:alnum:]]", "", str_replace(new_tags, regex("x\\d+", ignore_case=TRUE), replacement="")) , TRUE ~ new_tags))

# Method to find text corresponding to the missing text
# Match the tag with previous tags such that the sequence number of rows is less than the current row
# If exact match is not found, then search for substring matches (example case: current tag [Hook], previous tag [Hook: ABC])
# If the partial substring match gives result with more than one row then we leave such text blank to handle the case manually
find_text <- function(row, df, colName, cleaned){
    search_tag <- row[[colName]]
    if(!cleaned)
        search_tag <- gsub("[^[:alnum:]]", "", search_tag)
    temp <- subset(df, (ind == row[['ind']]) & (seqn < row[['seqn']]) & (new_tags == search_tag))
    temp <- temp %>% filter(text != "")
    # To capture cases where the structure tags are not within brackets. ex: lyrics:7nIQJzRSR1q6nU1hUP58sy
    if(nrow(temp) > 1)
        return("")
    if(nrow(temp) == 0){
        temp <- subset(df, (ind == row[['ind']]) & (seqn < row[['seqn']]))
        tags <- temp[["new_tags"]]
        res <- grepl(search_tag, tags)
        temp <- temp %>% filter(res)
        temp <- temp %>% filter(text != "")
        if(nrow(temp) == 0 || nrow(temp) > 1)
            return("")
        else
            return(temp$text)
        
    }
    return(temp$text)
    
}

# iterate over the dataframe and assign values to the text column
for(i in 1:nrow(d2)){
    text <- find_text(d2[i, ], missing_df, 'only_tags', TRUE)
    d2[i, ]$text <- text
}

head(d2, 4)


ind,seqn,text,tags,new_tags,count,only_tags
lyrics:76scZFO4mJYr1uOBuMbsoM,4,"Du weißt, dass du für mich alles bist Baby, ich lass dich nie mehr gehen Du siehst nie wieder Tageslicht Ansonsten wird es dir an gar nichts fehlen",[Hook x2],Hookx2,2,Hook
lyrics:22toXNoPAv9mo7S1ooO9f8,4,He says times like these I don't want to be a superstar Cause reality tv killed them all in America Though the sun always shines in the magazines Tonight can we be free to be who we really are,[Chorus: 3x],Chorus3x,3,Chorus
lyrics:1pU1mucfoUalVw9apwnDhh,6,I won't lie I won't sin Maybe I don't wanna go Can't you wait Maybe I don't wanna go,[Chorus x2],Chorusx2,2,Chorus
lyrics:7zRHAPEXcjhTBl5DYxApsB,8,"Change, oh you wear me out You’re something I know nothing about I’m the same Don’t you call me up for nothing Because today you’re stressing me out",[Chorus x2],Chorusx2,2,Chorus
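
The lookup strategy used to fill these rows (exact tag match among earlier paragraphs first, then a substring match, giving up when the result is ambiguous) can be sketched on a single toy song. `find_previous_text()` is a hypothetical simplified helper working on one song's data frame; it is not the notebook's `find_text()` and uses fixed-string matching as an assumption.

```r
# Look up the text of an earlier paragraph whose tag matches `tag`
find_previous_text <- function(song, seqn, tag) {
  # Only earlier paragraphs that actually have text are candidates
  earlier <- song[song$seqn < seqn & song$text != "", ]
  exact <- earlier[!is.na(earlier$new_tags) & earlier$new_tags == tag, ]
  if (nrow(exact) == 1) return(exact$text)
  if (nrow(exact) > 1) return("")   # ambiguous, leave for manual inspection
  # Fall back to substring matches, e.g. "Hook" inside "HookABC"
  partial <- earlier[!is.na(earlier$new_tags) &
                       grepl(tag, earlier$new_tags, fixed = TRUE), ]
  if (nrow(partial) == 1) partial$text else ""
}

song <- data.frame(
  seqn     = 1:4,
  new_tags = c("Verse1", "HookABC", "Verse2", "Hook"),
  text     = c("v1", "hook words", "v2", ""),
  stringsAsFactors = FALSE
)
print(find_previous_text(song, 4, "Hook"))   # single substring match
print(find_previous_text(song, 4, "Verse"))  # two candidates, so left blank
```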


In [12]:
# Write the lyricsIds where the text data could not be found to a file for manual inspection
exceptions <- d2 %>% filter(text == "") %>% select(ind) %>% unlist()
write(exceptions, "exceptions.txt")

# repeat the values in the text column as per count
# Use the vectorized & (not ||) so that the condition is evaluated for every row
d2 <- d2 %>% mutate(updated = case_when(!is.na(count) & count != 0 ~ str_dup(text, count), TRUE ~ text))
head(d2, 2)

ind,seqn,text,tags,new_tags,count,only_tags,updated
lyrics:76scZFO4mJYr1uOBuMbsoM,4,"Du weißt, dass du für mich alles bist Baby, ich lass dich nie mehr gehen Du siehst nie wieder Tageslicht Ansonsten wird es dir an gar nichts fehlen",[Hook x2],Hookx2,2,Hook,"Du weißt, dass du für mich alles bist Baby, ich lass dich nie mehr gehen Du siehst nie wieder Tageslicht Ansonsten wird es dir an gar nichts fehlen Du weißt, dass du für mich alles bist Baby, ich lass dich nie mehr gehen Du siehst nie wieder Tageslicht Ansonsten wird es dir an gar nichts fehlen"
lyrics:22toXNoPAv9mo7S1ooO9f8,4,He says times like these I don't want to be a superstar Cause reality tv killed them all in America Though the sun always shines in the magazines Tonight can we be free to be who we really are,[Chorus: 3x],Chorus3x,3,Chorus,He says times like these I don't want to be a superstar Cause reality tv killed them all in America Though the sun always shines in the magazines Tonight can we be free to be who we really are He says times like these I don't want to be a superstar Cause reality tv killed them all in America Though the sun always shines in the magazines Tonight can we be free to be who we really are He says times like these I don't want to be a superstar Cause reality tv killed them all in America Though the sun always shines in the magazines Tonight can we be free to be who we really are


In [13]:
# Repeat the same steps for the table without repetition tags
for(i in 1:nrow(d3)){
    text <- find_text(d3[i, ], missing_df,'new_tags', TRUE)
    d3[i, ]$text <- text
}

# append the exceptions to the file
exceptions <- d3 %>% filter(text == "") %>% select(ind) %>% unlist()
write(exceptions, "exceptions.txt", append=TRUE)

head(d3, 4)

ind,seqn,text,tags,new_tags
lyrics:0YdQC0bMttPyFCqL3cEJd0,5,"No matter how far we go We forever be real, we'll never fold Know my niggas stay down, we all we know We gotta get it so we live it just remember what we do and won't stop",[Hook],Hook
lyrics:7zWhta5mrjEwBBwki9vao7,4,When I'm alone I make believe that I'm in a different time and place where Nobody wants to know my name And no one will recognize my face,[Chorus],Chorus
lyrics:3SppxNj9EhZJ8VBfxruP1S,3,,[Outro],Outro
lyrics:6Wor2nJG9F0n6qJczmHR58,4,"Sahti, kannusta kaadetaan Kalja, kannateltavaksi Saaren sahti, kannon kalja Tuopista tulinen iltamme Sahti, kannusta kaadetaan Kalja, kannateltavaksi Saaren sahti, kannon kalja Tuopista tulinen iltamme",[Chorus],Chorus


In [14]:
# Now let us combine the lyrics that have been updated so far
# We have the main table 'df' and three other tables 'df_text', 'd2', 'd3'

t1 <- df_text %>% select(ind, seqn, updated, tags)
t2 <- d2 %>% select(ind, seqn, updated, tags)
colnames(d3)[3] <- 'updated'
t3 <- d3 %>% select(ind, seqn, updated, tags)

# combine the three tables -> each table has four columns: 'ind', 'seqn', 'updated' (new text), 'tags'
combined <- rbind(t1, t2, t3)
# find the remaining paragraphs of song from the main table using indices from combined
indices <- combined %>% select(ind) %>% distinct() %>% unlist()
part <- df %>% filter(ind %in% indices) %>% select(ind, seqn, text, tags)
colnames(part)[3] <- "updated"
# extract the original rows from the 'part' table that have not been updated
extras <- anti_join(part, combined, by=c('ind', 'seqn'))
# merge these rows with combined table
final_df <- rbind(combined, extras) 
# group by song and arrange the rows by sequence number, so that the paragraphs can later be joined in order
final_df <- final_df %>% group_by(ind) %>% arrange(ind, seqn)
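
The combine step above can be tried on toy data. This is a minimal sketch, not the notebook's actual tables: the data frames and their values are made up for illustration, and only the `anti_join`, `rbind`, `arrange` sequence mirrors the cell.

```r
library(dplyr)

# Toy stand-ins for the notebook's tables (values are invented for illustration)
part <- data.frame(
  ind = c("a", "a", "a"),
  seqn = c(1, 2, 3),
  updated = c("verse one", "", "verse two"),
  stringsAsFactors = FALSE
)
combined <- data.frame(
  ind = "a",
  seqn = 2,
  updated = "chorus text filled in",
  stringsAsFactors = FALSE
)

# Rows of `part` whose (ind, seqn) key does not appear in `combined`
extras <- anti_join(part, combined, by = c("ind", "seqn"))

# Stack the updated and the untouched rows, then restore paragraph order
toy_final <- rbind(combined, extras) %>% arrange(ind, seqn)
print(toy_final)
```

The empty paragraph at `seqn == 2` is replaced by its updated version, while the other two paragraphs are carried over unchanged.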

In [15]:
ls <- final_df %>% filter(updated=="") %>% ungroup() %>% select(tags) %>% unlist()
write(ls, "ls.txt")

Some Manual changes

In [487]:
# There were 1229 rows with updated="", but most of them were non-lyrical parts.
# The tags were manually checked and text was updated wherever necessary
final_df[final_df$ind == 'lyrics:07vOXII9g8ERQLtVZpYGvu' & final_df$seqn == 6, ]$updated <- str_dup(final_df[final_df$ind == 'lyrics:07vOXII9g8ERQLtVZpYGvu' & final_df$seqn == 4, ]$updated, 2)
final_df[final_df$ind == 'lyrics:0b16PJlDwSaBPqM9uyyujg' & final_df$seqn == 5, ]$updated <- final_df[final_df$ind == 'lyrics:0b16PJlDwSaBPqM9uyyujg' & final_df$seqn == 3, ]$updated
final_df[final_df$ind == 'lyrics:0b16PJlDwSaBPqM9uyyujg' & final_df$seqn == 7, ]$updated <- '\nMonday morning\nMonday morning\nOh\nOh' 
final_df[final_df$ind == 'lyrics:0E0JKMR4uiCZhpI3brAoxI' & final_df$seqn == 4, ]$updated <- paste(final_df[final_df$ind == 'lyrics:0E0JKMR4uiCZhpI3brAoxI' & final_df$seqn == 1, ]$updated, final_df[final_df$ind == 'lyrics:0E0JKMR4uiCZhpI3brAoxI' & final_df$seqn == 3, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:0e2woI6ayIRo8nqin2Ky06' & final_df$seqn == 2, ]$updated <- paste(str_dup('\nDipdapdudadingding', 4), '\nDipdudadodeydau\nHmmmmm, Hmmmm', sep="")
final_df[final_df$ind == 'lyrics:0hKr166QnNZ0a37G4UO0VY' & final_df$seqn == 5, ]$updated <- final_df[final_df$ind == 'lyrics:0hKr166QnNZ0a37G4UO0VY' & final_df$seqn == 1, ]$updated 
final_df[final_df$ind == 'lyrics:0lJoAtJYGRN7PslBDmrb88' & final_df$seqn == 7, ]$updated <- final_df[final_df$ind == 'lyrics:0lJoAtJYGRN7PslBDmrb88' & final_df$seqn == 4, ]$updated 
final_df[final_df$ind == 'lyrics:0QepvU0N2fC2B5uIPafO1q' & final_df$seqn == 8, ]$updated <- paste(final_df[final_df$ind == 'lyrics:0QepvU0N2fC2B5uIPafO1q' & final_df$seqn == 6, ]$updated, final_df[final_df$ind == 'lyrics:0QepvU0N2fC2B5uIPafO1q' & final_df$seqn == 3, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:0ryawd1Tj2c0y4ddq6X0Nu' & final_df$seqn == 7, ]$updated <- paste(final_df[final_df$ind == 'lyrics:0ryawd1Tj2c0y4ddq6X0Nu' & final_df$seqn == 4, ]$updated, final_df[final_df$ind == 'lyrics:0ryawd1Tj2c0y4ddq6X0Nu' & final_df$seqn == 5, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:0tgORTeHCkavGIMoyH5HCl' & final_df$seqn == 6, ]$updated <- final_df[final_df$ind == 'lyrics:0tgORTeHCkavGIMoyH5HCl' & final_df$seqn == 2, ]$updated
final_df[final_df$ind == 'lyrics:0uQSHrFDVtbQyxNFCwm1EC' & final_df$seqn == 5, ]$updated <- paste(final_df[final_df$ind == 'lyrics:0uQSHrFDVtbQyxNFCwm1EC' & final_df$seqn == 2, ]$updated, final_df[final_df$ind == 'lyrics:0uQSHrFDVtbQyxNFCwm1EC' & final_df$seqn == 3, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:0yhfs0tJi01RoeXyh1oFUx' & final_df$seqn == 12, ]$updated <- final_df[final_df$ind == 'lyrics:0yhfs0tJi01RoeXyh1oFUx' & final_df$seqn == 7, ]$updated
final_df[final_df$ind == 'lyrics:1CmwjNrIK3fDuDOa8ZHsOT' & final_df$seqn == 23, ]$updated <- final_df[final_df$ind == 'lyrics:1CmwjNrIK3fDuDOa8ZHsOT' & final_df$seqn == 6, ]$updated
final_df[final_df$ind == 'lyrics:1LStwIpx0LZYUplBJEn50y' & final_df$seqn == 11, ]$updated <- final_df[final_df$ind == 'lyrics:1LStwIpx0LZYUplBJEn50y' & final_df$seqn == 1, ]$updated
final_df[final_df$ind == 'lyrics:1lUZ5H9aXN0iJjnz1vZ3TX' & final_df$seqn == 6, ]$updated <- str_dup('\nNigga guard yo face!', 8)
final_df[final_df$ind == 'lyrics:5jMvwTgvkTc9730UgUf8Fj' & final_df$seqn == 6, ]$updated <- paste(final_df[final_df$ind == 'lyrics:5jMvwTgvkTc9730UgUf8Fj' & final_df$seqn == 6, ]$updated, final_df[final_df$ind == 'lyrics:5jMvwTgvkTc9730UgUf8Fj' & final_df$seqn == 4, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:1yDYKhv8XeaJ41L0Ov4EYm' & final_df$seqn == 7, ]$updated <- str_dup(final_df[final_df$ind == 'lyrics:1yDYKhv8XeaJ41L0Ov4EYm' & final_df$seqn == 2, ]$updated, 2)
final_df[final_df$ind == 'lyrics:7u58EuBVtq3XGonlqATqJ4' & final_df$seqn == 5, ]$updated <- final_df[final_df$ind == 'lyrics:7u58EuBVtq3XGonlqATqJ4' & final_df$seqn == 2, ]$updated
final_df[final_df$ind == 'lyrics:7ujppwhl3JrUoBsKjLWADd' & final_df$seqn == 5, ]$updated <- paste(str_dup(final_df[final_df$ind == 'lyrics:7ujppwhl3JrUoBsKjLWADd' & final_df$seqn == 2, ]$updated, 2), final_df[final_df$ind == 'lyrics:7ujppwhl3JrUoBsKjLWADd' & final_df$seqn == 3, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:4KywNDM23e0y9zzrDKRof2' & final_df$seqn == 5, ]$updated <- str_dup('\nUnd ich bin immer noch hier',4)
final_df[final_df$ind == 'lyrics:48d7VJbQF6OWxoSdoLKnsy' & final_df$seqn == 5, ]$updated <- paste('\nNow I saw a man stand up one day and fight to save his life\nJust a common worker, supporting his kids and his wife\nPut a plug in his jug, things looked up for sure\nBut the whole damn thing of it was there is no blasted cure', final_df[final_df$ind == 'lyrics:48d7VJbQF6OWxoSdoLKnsy' & final_df$seqn == 2, ]$updated, sep="")
final_df[final_df$ind == 'lyrics:7I0OuZtXqOH25wtmNovWPu' & final_df$seqn == 7, ]$updated <- "\nOh, oh\nOh, oh"

write.csv(final_df, "final.csv")

Songs Without Tags

In [17]:
# Let us find the songs without any tags
notags <- count_df %>% filter((text_count > 0) & is.na(tag_count)) %>% select(ind) %>% unlist()
no_tags <- df %>% filter(ind %in% notags)
no_tags <- no_tags %>% filter(text != "") %>% select(ind, seqn, text, new_tags)
print(dim(no_tags))
head(no_tags)

[1] 130180      4


ind,seqn,text,new_tags
lyrics:7gnEnrwH5VLmM9rJuXphyL,1,Una isla en el mar Sin corrientes navegar Soy un barco de papel Al atardecer,
lyrics:7gnEnrwH5VLmM9rJuXphyL,2,Junto con la soledad Me persigue la verdad Solo siento tu calor Aquí sigo yo,
lyrics:7gnEnrwH5VLmM9rJuXphyL,3,"Esperándote Esperándote Esperándote Esperando, esperando Esperándote Con tus fotografías En el mar ya perdidas Esperándote, esperándote",
lyrics:7gnEnrwH5VLmM9rJuXphyL,4,En el reflejo puedo ver Tú a mi lado aquella vez Al tocarte se borró Solo una ilusión,
lyrics:7gnEnrwH5VLmM9rJuXphyL,5,Y dulces palabras de tu voz Desde aquí escucho yo Y debería dejarte ir Pero sigo aquí,
lyrics:7gnEnrwH5VLmM9rJuXphyL,6,"Esperándote Esperándote Esperándote Esperando, esperando Esperándote Con tus fotografías En el mar ya perdidas Esperándote, esperándote",


Songs With Text-Count Greater Than Tag-Count

In [54]:
gttag <- count_df %>% filter(text_count > tag_count) %>% select(ind) %>% unlist()
gt_tag <- df %>% filter(ind %in% gttag) 
gt_tag <- gt_tag %>% mutate(text = str_replace(text, regex(".*Genius-Deutschland-Community.*", ignore_case=TRUE), replacement=""))
gt_tag <- gt_tag %>% select(ind, seqn, text, new_tags)
gt_tag <- gt_tag[-c(5),]
print(dim(gt_tag))
head(gt_tag,3)

[1] 9502    4


ind,seqn,text,new_tags
lyrics:78YmhYOIuIiIPhacVSluQi,1,"Du musst raus an die Luft In den Wind und du trittst Nüchtern auf kalten Asphalt Hörst keinen Laut Keinen Mucks Sondern nur wie dein Schritt In den Schluchten der Straßen verhallt Blickst dich um folgst dem Rauschen Der Boden vibriert Dein Puls steigt, weil dich etwas treibt Zu den flimmernden Lichtern, wo sich alles verliert Die Stille, der Raum und die Zeit Und du fällst in ein Meer aus Rot und aus Weiß Und gehst mit dem Takt der Dich trägt Nichts wiegt mehr schwer Du wirst ruhig, du wirst leicht, wie der Wind der Wolken bewegt",StropheI
lyrics:78YmhYOIuIiIPhacVSluQi,2,"Halte nicht an, bleibe nicht stehen Niemand und nichts hält dich auf Keine Tür, keine Wand, kein Gesetz, kein Problem Nichts unterbricht deinen Lauf Jeder Muskel verspannt, es glühen die Lungen Aber das bringt dich nicht raus Du atmest konstant, kommst du an deine Grenzen Dann gehst du drüber hinaus Die Straße wird breiter, wohin du auch siehst Siehst du nur nicht zurück Alles wird leichter, je weiter du gehst Du wächst mit jedem Schritt Die Welt zieht vorbei, die Gedanken sind frei Der Boden unter dir brennt Im Feuer, im Rausch, im Tunnel, fast taub Weil alles so laut in dir schreit: RENN!",Chorus
lyrics:78YmhYOIuIiIPhacVSluQi,3,,


Songs With Text-Count Equal To Tag-Count

In [19]:
eqtag <- count_df %>% filter((text_count == tag_count) & tag_count>0) %>% select(ind) %>% unlist()
eq_tag <- df %>% filter(ind %in% eqtag) 
eq_tag <- eq_tag %>% filter(text != "") %>% select(ind, seqn, text, new_tags)
print(dim(eq_tag))
head(eq_tag)

[1] 105542      4


ind,seqn,text,new_tags
lyrics:2Uebx28wsRfcxQpB7R0xA5,1,"Alle dachten, unser Lied wär' vorbei Sie wollten nur, dass wir uns verspiel'n Doch jeden Ton haben wir genauso gemeint Jede Note wird uns neu inspirier'n",Strophe1
lyrics:2Uebx28wsRfcxQpB7R0xA5,2,"Bist du laut, bin ich gerne mal still Denn dein Sound trägt mich überallhin Nur zu zweit ergeben wir Sinn, Sinn",PreRefrain
lyrics:2Uebx28wsRfcxQpB7R0xA5,3,"Wie Schwarz und Weiß auf einem Klavier, yeah Mein Song funktioniert nur mit dir, yeah Wie Schwarz und Weiß auf einem Klavier, yeah Will jeden Tag mit dir komponier'n, -nier'n, -nier'n",Refrain
lyrics:2Uebx28wsRfcxQpB7R0xA5,4,"Und wir tanzen als wär'n wir hier allein Bis die Nacht im roten Morgen ertrinkt Bei jeder Schnapsidee bist du mit dabei Und geh'n die Lichter aus, versteh'n wir uns blind",Strophe2
lyrics:2Uebx28wsRfcxQpB7R0xA5,5,"Bist du laut, bin ich gerne mal still Denn dein Sound trägt mich überallhin Nur zu zweit ergeben wir Sinn, Sinn",PreRefrain
lyrics:2Uebx28wsRfcxQpB7R0xA5,6,"Wie Schwarz und Weiß auf einem Klavier, yeah Mein Song funktioniert nur mit dir, yeah Wie Schwarz und Weiß auf einem Klavier, yeah Will jeden Tag mit dir komponier'n, -nier'n Wie Schwarz und Weiß auf einem Klavier (nur du und ich) Mein Song funktioniert nur mit dir ([?]) Auch wenn nicht jeder Ton harmoniert, yeah (nur du und ich) Wie Schwarz und Weiß auf einem Klavier, -vier, -vier",Refrain


Instrumental or non-lyrical songs

In [97]:
instr <- count_df %>% filter((text_count==0 & tag_count>=0)|(is.na(text_count) & is.na(tag_count))|(is.na(text_count) & tag_count>=0)) %>% select(ind) %>% unlist()
instrumental <- df %>% filter(ind %in% instr) 
instrumental <- instrumental %>% select(ind, seqn, text, tags)
print(instrumental %>% distinct(ind) %>% nrow())
write(instr, "instrumental.txt")
head(instrumental)

[1] 1360


ind,seqn,text,tags
lyrics:1oUKoIM88V89qcL6Sm6GvH,1,,[Instrumental]
lyrics:6ekMALFUBI4AaxazzS3hiC,1,,[Instrumental]
lyrics:7EfuY2EFxL0pRVl27RuvIa,1,,[Instrumental]
lyrics:6d4G5sbAUby9l3opHgqww9,1,,[Instrumental]
lyrics:2tAeN2TKlQLOoSPXtARzBV,1,,[Instrumental]
lyrics:3yfHjH8MGxlHF5pzZcYZ8F,1,,[Instrumental]


Combine the tables with text except instrumental

In [57]:
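# rename by name rather than by position, since the column index depends on how final_df was loaded
names(final_df)[names(final_df) == "updated"] <- "text"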
colnames(no_tags)[4] <- "tags"
colnames(gt_tag)[4] <- "tags"
colnames(eq_tag)[4] <- "tags"
final_df <- final_df %>% group_by(ind) %>% select(ind, seqn, text, tags)
merged <- rbind(final_df, no_tags, eq_tag, gt_tag)

In [58]:
# Covers case 4 and case 5
# Now we will find lyrical content with repetition symbols in the text itself -> these were not extracted as tags earlier
# We will extract the sentences to be duplicated and then repeat them
# Case A: When the repetition symbol (x2) is at the beginning, the whole paragraph is repeated
  # There were 6 songs with the repetition symbol at both the beginning and the end. I have manually verified those and handled the exceptions
# Case B: When there is only one line -> the whole line is repeated
  # There were 22 songs with multiple repetitions -> handled manually
# When the repetition symbol (x4) is in the middle of the paragraph, it refers to the sentence prior to it
# We will find the location of the repetition symbol and label it as "B" (Beginning), "I" (Inside) or "O" (Outside, i.e. at the end of the text)
d <- merged %>% filter(str_detect(text, regex("(\\dx)|(x\\d)", ignore_case=TRUE))) 
d <- d %>% mutate(text = str_replace(text, regex("[:!]" ,ignore_case=TRUE), replacement=""))
d <- d %>% mutate(repetition = case_when(str_detect(text, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_extract(text, regex("\\d+x", ignore_case=TRUE)), str_detect(text, regex("x\\d", ignore_case=TRUE)) ~ str_extract(text, regex("x\\d+", ignore_case=TRUE)) , TRUE ~ ""))
d <- d %>% mutate(count = as.numeric(str_extract(repetition, regex("\\d+")))) %>% select(-repetition) 
d <- d %>% mutate(present = case_when(str_detect(text, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_count(text, regex("\\d+x", ignore_case=TRUE)), str_detect(text, regex("x\\d", ignore_case=TRUE)) ~ str_count(text, regex("x\\d+", ignore_case=TRUE))))
d <- d %>% mutate(loc = case_when(str_detect(trimws(text), regex("(^(\\n)?[\\(\\[{]?\\d+x[}\\]\\)]?)|(^(\\n)?[\\(\\[{]?x\\d+[\\)\\]}]?)", ignore_case=TRUE))~"B", str_detect(trimws(text), regex("([\\(\\[{]?\\d+x[\\)\\]}]?$)|([\\(\\[{]?x\\d+[\\)\\]}]?$)", ignore_case=TRUE))~"O", TRUE~"I"))
d <- d %>% mutate(lines = str_count(text, "\n"))
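# str_length(text) is the number of characters in each paragraph
d <- d %>% mutate(updated = case_when((lines>0 & loc=="B")~str_dup(text, count), (lines==1)~str_dup(text, count),(lines==0 & str_length(text) >= 3)~str_dup(text, count), TRUE~text))

# rows with fewer than 3 characters of text -> the text has to be looked up through the tag
no_lines <- d %>% filter(lines == 0 & (str_length(text) < 3) & !is.na(tags))

for(i in seq_len(nrow(no_lines))){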
    text <- find_text(no_lines[i, ], df, 'tags', FALSE)
    index <- no_lines[i, ]$ind
    seqn <- no_lines[i, ]$seqn
    d[(d$ind == index) & (d$seqn == seqn), ]$text <- text
    d[(d$ind == index) & (d$seqn == seqn), ]$updated <- str_dup(text, d[(d$ind == index) & (d$seqn == seqn), ]$count)
}

head(d, 4)


ind,seqn,text,tags,count,present,loc,lines,updated
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Pause silence Another moment dropped off Left behind and Hanging still You won't see me (x4) I can't see you (x4),[Bridge],4,2,O,6,
lyrics:08MsLUAnUDjyFUrZKPqW5k,2,So lonely [x12],[Chorus],12,1,O,1,So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12]
lyrics:08MsLUAnUDjyFUrZKPqW5k,4,So lonely [x12],[Chorus],12,1,O,1,So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12] So lonely [x12]
lyrics:08MsLUAnUDjyFUrZKPqW5k,5,"I feel lonely, I'm so lonely, I feel so low [x2]",[Outro],2,1,O,1,"I feel lonely, I'm so lonely, I feel so low [x2] I feel lonely, I'm so lonely, I feel so low [x2]"


In [59]:
# Repetition symbols that are inside the text, or paragraphs with multiple symbols
# We separate the text at the delimiter "\n"
inside <- d %>% filter((present>1 & updated == "") | (loc == "I"))
inside <- inside %>% group_by(ind) %>% separate_rows(text, sep = "\n") %>% filter(text != "") %>% mutate(line_seq = seq(from = 1, to = n(), by = 1.0))
inside <- inside %>% select(ind, seqn, text, tags, line_seq)
head(inside)

ind,seqn,text,tags,line_seq
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Pause silence,[Bridge],1
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Another moment dropped off,[Bridge],2
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Left behind and,[Bridge],3
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Hanging still,[Bridge],4
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,You won't see me (x4),[Bridge],5
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,I can't see you (x4),[Bridge],6


In [60]:
# We follow the same steps as earlier to extract the repetition symbol from each line and then duplicate the lines
inside <- inside %>% mutate(repetition = case_when(str_detect(text, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_extract(text, regex("\\d+x", ignore_case=TRUE)), str_detect(text, regex("x\\d", ignore_case=TRUE)) ~ str_extract(text, regex("x\\d+", ignore_case=TRUE)) , TRUE ~ ""))
inside <- inside %>% mutate(count = as.numeric(str_extract(repetition, regex("\\d+")))) %>% select(-repetition) 
inside <- inside %>% mutate(text = case_when(str_detect(text, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_replace(text, regex("[\\[\\({]?\\d+x[}\\)\\]]?", ignore_case=TRUE), replacement=""), str_detect(text, regex("x\\d", ignore_case=TRUE)) ~ str_replace(text, regex("[\\[\\({]?x\\d+[}\\)\\]]?", ignore_case=TRUE), replacement="") , TRUE ~ text))
# duplicate the lines where count is not NA
inside <- inside %>% mutate(text = case_when(!is.na(count) ~ str_dup(paste(text, ""), count), TRUE~text))
head(inside)

ind,seqn,text,tags,line_seq,count
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Pause silence,[Bridge],1,
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Another moment dropped off,[Bridge],2,
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Left behind and,[Bridge],3,
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Hanging still,[Bridge],4,
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,You won't see me You won't see me You won't see me You won't see me,[Bridge],5,4.0
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,I can't see you I can't see you I can't see you I can't see you,[Bridge],6,4.0
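
The line-level expansion shown in this output can be sketched as a standalone function. `expand_line_repeats` is a hypothetical helper written for illustration, not part of the notebook; it assumes each repetition symbol applies only to the line it sits on, and its regex would also match a digit next to an "x" inside an ordinary word, so it is a rough approximation.

```r
library(stringr)

# Hypothetical helper: expand "(x4)" / "[2x]" style symbols on each line
expand_line_repeats <- function(paragraph) {
  lines <- str_split(paragraph, "\n")[[1]]
  lines <- lines[lines != ""]
  marker <- regex("[\\(\\[\\{]?(\\d+x|x\\d+)[\\)\\]\\}]?", ignore_case = TRUE)
  expanded <- vapply(lines, function(line) {
    found <- str_extract(line, marker)
    if (is.na(found)) return(line)            # no symbol: keep the line as is
    n <- as.integer(str_extract(found, "\\d+"))
    clean <- str_trim(str_replace(line, marker, ""))
    paste(rep(clean, n), collapse = " ")      # repeat the cleaned line n times
  }, character(1), USE.NAMES = FALSE)
  paste(expanded, collapse = "\n")
}

toy_bridge <- expand_line_repeats("Hanging still\nYou won't see me (x4)")
cat(toy_bridge, "\n")
```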


In [61]:
# Then combine the text to form the paragraph again
inside <- inside %>% select(ind, seqn, text)
inside <- inside %>% group_by(ind, seqn) %>% summarise(text = paste(text, collapse = "\n"))
head(inside,3)

ind,seqn,text
lyrics:01QwobyKNu7WRCVuTQbRDN,5,"""Hey, hey, Hello Mary Lou, Goodbye heart Sweet Mary Lou, I'm so in love with you I knew Mary Lou, We'd never part So Hello Mary Lou, Goodbye heart So Hello Mary Lou, Goodbye heart Yes, Hello Mary Lou, Goodbye heart"
lyrics:043Re81uRxCx2Nw6LfAheU,4,Who that is? What it say? Who that is? What it say? Who that is? What it say? What it look like? What it look like baby? Who that is? What it say? Who that is? What it say? Who that is? What it say? What it look like? What it look like baby?
lyrics:043Re81uRxCx2Nw6LfAheU,8,Chorus Who that is? What it say? Who that is? What it say? Who that is? What it say? What it look like? What it look like baby? Who that is? What it say? Who that is? What it say? Who that is? What it say? What it look like? What it look like baby?


In [63]:
# update the main table
for(i in 1:nrow(inside)){
    index <- inside[i,]$ind
    seq <- inside[i,]$seqn
    d[(d$ind == index) & (d$seqn == seq), ]$updated <- inside[i, ]$text
}
# remove the repetition symbols from the text
d <- d %>% mutate(updated = case_when(str_detect(updated, regex("\\dx(?![0-9])", ignore_case=TRUE)) ~ str_replace_all(updated, regex("[\\(\\[{]?\\d+x[}\\]\\)]?", ignore_case=TRUE), replacement=""), str_detect(updated, regex("x\\d", ignore_case=TRUE)) ~ str_replace_all(updated, regex("[\\(\\[{]?x\\d+[\\)\\]}]?", ignore_case=TRUE), replacement="") , TRUE ~ updated))

# Text with the repetition symbol at the end ("O") does not follow a clear pattern. Sometimes the whole paragraph is repeated,
# whereas sometimes it is only one line or a few words.
# There are 328 such songs.
# We note the lyrics ids of songs containing such text, or where the updated column is empty
no_pattern <- d %>% filter(loc=="O" | updated=="") %>% select(ind) %>% distinct() %>% unlist()
write(no_pattern, "repeated_no_pattern.txt")

head(d,4)

ind,seqn,text,tags,count,present,loc,lines,updated
lyrics:07TwX5DB1dq9tJ1L7z77qb,8,Pause silence Another moment dropped off Left behind and Hanging still You won't see me (x4) I can't see you (x4),[Bridge],4,2,O,6,Pause silence Another moment dropped off Left behind and Hanging still You won't see me You won't see me You won't see me You won't see me I can't see you I can't see you I can't see you I can't see you
lyrics:08MsLUAnUDjyFUrZKPqW5k,2,So lonely [x12],[Chorus],12,1,O,1,So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely
lyrics:08MsLUAnUDjyFUrZKPqW5k,4,So lonely [x12],[Chorus],12,1,O,1,So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely So lonely
lyrics:08MsLUAnUDjyFUrZKPqW5k,5,"I feel lonely, I'm so lonely, I feel so low [x2]",[Outro],2,1,O,1,"I feel lonely, I'm so lonely, I feel so low I feel lonely, I'm so lonely, I feel so low"


In [65]:
# Now let us update the text in "merged" table 
for(i in 1:nrow(d)){
    index <- d[i,]$ind
    seq <- d[i,]$seqn
    merged[(merged$ind == index) & (merged$seqn == seq), ]$text <- d[i, ]$updated
}

**Now we will count the songs with a [Repeat] tag and also identify structure tags that appear inside the lyrics without brackets.**

In [387]:
m <- merged %>% mutate(repeats = str_detect(tags, regex("Repeat(s)?(ed)?", ignore_case=TRUE))) %>% filter(repeats == TRUE) %>% distinct(ind) %>% nrow()
ls <- merged %>% mutate(repeats = str_detect(tags, regex("Repeat(s)?(ed)?", ignore_case=TRUE))) %>% filter(repeats == TRUE) %>% distinct(ind) %>% unlist()
write(ls, "repeat_tags.txt")

# Find structure tags in the text without brackets
pattern <- "(^((Pre)?(Post)?-?Chorus:?))|(^((Pre)?(Post)?-?Refrain:?))|([\\(\\[{]repeat:?[}\\]\\)])"
p <- merged %>% mutate(is_present = str_extract(text, regex(pattern, ignore_case=TRUE)))
n <- p %>% mutate(struc = str_detect(is_present, regex(pattern, ignore_case=TRUE))) %>% filter(struc == TRUE)
head(n)

X,ind,seqn,text,tags,is_present,struc
2960,lyrics:1TAzhvKe3iFfvHVVE9foCy,9,A low [repeat] Can you hear me? A low [repeat] Can't you hear me? A low [repeat] Can you hear me? A low [repeat],[Bridge],[repeat],True
9934,lyrics:5tUXQvc0yMQD42799A1mec,2,"Guess who loves you more Oh love guess who loves you more than he did girl, guess who treats you betta than he did, me girl me that thats right me when you gon see wake up and see (repeat)",(Chorus:),(repeat),True
12715,lyrics:7HDCn9GB7qInP5bOHOCPpt,3,I know what I want I'll say what I want And no-one can take it away I know what I want I'll say what I want And no-one can take it away [repeat],[Chorus],[repeat],True
12718,lyrics:7HDCn9GB7qInP5bOHOCPpt,6,I know what I want I'll say what I want And no-one can take it away I know what I want I'll say what I want And no-one can take it away [repeat],[Chorus],[repeat],True
13352,lyrics:7zWhta5mrjEwBBwki9vao7,6,We are alone [repeat],[Outro],[repeat],True
13829,lyrics:5zSeoUY9JbMwJHDS26xSJN,2,"Chorus: Airwoven divinity, mind, soul in victory Cured of her malady, could this be my lost; Ligeia",,Chorus:,True


537 rows were identified with the pattern. The missing text rows were manually handled. 
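
To see what the detection pattern picks up, it can be run on a few toy strings. This is a small sketch: the `samples` vector is made up for illustration, while `pattern` is copied from the cell above.

```r
library(stringr)

# Pattern copied from the notebook cell above
pattern <- "(^((Pre)?(Post)?-?Chorus:?))|(^((Pre)?(Post)?-?Refrain:?))|([\\(\\[{]repeat:?[}\\]\\)])"

# Invented sample paragraphs
samples <- c(
  "Chorus: Airwoven divinity",
  "We are alone [repeat]",
  "Pre-Chorus\nHold on",
  "No tag in this paragraph"
)

# str_extract returns the first match per string, or NA when nothing matches
hits <- str_extract(samples, regex(pattern, ignore_case = TRUE))
print(hits)
```

Note that the Chorus and Refrain alternatives are anchored with `^`, so they only fire at the very start of a paragraph, whereas a bracketed repeat marker is found anywhere in the text.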

In [389]:
# Now we clean the patterns from the text and note the count of (repeat) tags
p <- n %>% mutate(repeats = str_detect(is_present, regex("Repeat(s)?(ed)?", ignore_case=TRUE))) %>% filter(repeats == TRUE) %>% distinct(ind) %>% nrow()
ls <- n %>% mutate(repeats = str_detect(is_present, regex("Repeat(s)?(ed)?", ignore_case=TRUE))) %>% filter(repeats == TRUE) %>% distinct(ind) %>% unlist()
write(ls, "repeat_tags.txt", append=TRUE)

sprintf("There are %i + %i = %i songs with [Repeat] tag in the lyrics", m, p, m+p)

In [439]:
pattern = "(^\\[.*\\])|^\\(.*\\)|^.*:|^( ?-?\\*?CHORUS ?-?)|^(Pre-Chorus)|^(VERSE \\d)|(\\((Repeat)\\))|(\\[(Repeat)\\])|(^(Refrain)+)"
merged <- merged %>% mutate(text = case_when(str_detect(text, regex(pattern, ignore_case=TRUE)) ~ str_replace(text, regex(pattern, ignore_case=TRUE), replacement=""), TRUE ~ text))
head(merged, 3)

X.1,X,ind,seqn,text,tags
1,1,lyrics:00WVclSOdTPuLYTott68Wl,1,Will you listen to what I said? Do you believe that I’m worth it? I can’t fall asleep I can’t live with my dreams,[Verse 1]
2,2,lyrics:00WVclSOdTPuLYTott68Wl,2,"It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head",[Chorus x2]
3,3,lyrics:00WVclSOdTPuLYTott68Wl,3,"You’re running from everything That’s seeping from the depths of me You’re begging for anything That’s a life I’ll never see I’ll keep searching, I’m angry! I’ve been haunted by my dreams!",[Verse 2]


In [519]:
# Some rows in the merged table were duplicated in the process. For such pairs, the text of the row with the lower row id
# has been correctly replicated or substituted, whereas the text of the row with the higher row id is incorrect.
# This is the case for 332 songs.
# Let us remove the incorrect rows before combining
t <- merged %>% select(ind, seqn)
ids <- t[duplicated(t), ] %>% select(ind) %>% distinct() %>% unlist()
merged <- merged %>% select(ind, seqn, text)
id <- as.numeric(rownames(merged))
merged <- cbind(id=id, merged)

for(i in seq_along(ids)){
    index <- ids[[i]]
    num <- merged %>% filter(ind == index) %>% select(seqn) %>% distinct() %>% unlist()
    for(v in num){
        data <- merged[(merged$ind == index & merged$seqn == v), ]
        if(nrow(data)==2){
            maximum <- max(data$id)
            merged[(merged$ind == index & merged$seqn == v & merged$id == maximum), ]$text = ""
        }
    }
}

merged <- merged %>% filter(text != "") %>% select(ind, seqn, text)
head(merged, 15)

ind,seqn,text
lyrics:00WVclSOdTPuLYTott68Wl,1,Will you listen to what I said? Do you believe that I’m worth it? I can’t fall asleep I can’t live with my dreams
lyrics:00WVclSOdTPuLYTott68Wl,2,"It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head"
lyrics:00WVclSOdTPuLYTott68Wl,3,"You’re running from everything That’s seeping from the depths of me You’re begging for anything That’s a life I’ll never see I’ll keep searching, I’m angry! I’ve been haunted by my dreams!"
lyrics:00WVclSOdTPuLYTott68Wl,4,"It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head"
lyrics:00WVclSOdTPuLYTott68Wl,5,It’s hard to know myself Trapped in my own head It’s hard to know myself Trapped in my own head
lyrics:00WVclSOdTPuLYTott68Wl,6,"It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head"
lyrics:00WVclSOdTPuLYTott68Wl,7,"I'm trapped, I'm trapped Trapped in my head I'm trapped, I'm trapped Trapped in my own head"
lyrics:023Jtd3l1gEeWgL1w1m8bt,1,I was guided by voices to the backyard where I hear That someone is talking about someone we used to know I think that I know what’s gotta be done I’ll take a minute alone and decide what to do
lyrics:023Jtd3l1gEeWgL1w1m8bt,2,"You know I’d leave any party for you 'Cause no party’s so sweet as a party of two Sugar, I got no question of the right thing to do Oh, you know I’d leave any party for you"
lyrics:023Jtd3l1gEeWgL1w1m8bt,3,"You planned meeting me on your way home And I tried reaching you on your new flip phone You didn’t have the ringer on, so I couldn’t warn you That a stranger is here looking for a reunion"
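
As a possible alternative to the loop, the same idea (keep the row with the lowest id for each paragraph) can be sketched with `dplyr::distinct`. This is not the notebook's method: the toy table is invented, and unlike the loop, which only blanks the later row when a paragraph has exactly two copies, `distinct` keeps the first row regardless of how many copies exist.

```r
library(dplyr)

# Invented table with one duplicated (ind, seqn) pair
toy <- data.frame(
  id = 1:4,
  ind = c("a", "a", "a", "b"),
  seqn = c(1, 2, 2, 1),
  text = c("verse", "chorus (correct)", "chorus (stale)", "intro"),
  stringsAsFactors = FALSE
)

# Keep the first row for every (ind, seqn) pair, i.e. the one with the lowest id
deduped <- toy %>%
  arrange(id) %>%
  distinct(ind, seqn, .keep_all = TRUE)
print(deduped)
```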


In [520]:
# Finally, let us combine the lyrics for each song
merged <- merged %>% group_by(ind) %>% summarise(text = paste(text, collapse = "\n\n")) %>% select(ind, text)

# verify the result for the two lyrics ids shown above
merged %>% filter(ind == "lyrics:00WVclSOdTPuLYTott68Wl" | ind == "lyrics:023Jtd3l1gEeWgL1w1m8bt")

ind,text
lyrics:00WVclSOdTPuLYTott68Wl,"Will you listen to what I said? Do you believe that I’m worth it? I can’t fall asleep I can’t live with my dreams It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head You’re running from everything That’s seeping from the depths of me You’re begging for anything That’s a life I’ll never see I’ll keep searching, I’m angry! I’ve been haunted by my dreams! It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head It’s hard to know myself Trapped in my own head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head It’s hard to know myself Trapped in my own head I’m trapped, I’m trapped Trapped in my head I'm trapped, I'm trapped Trapped in my head I'm trapped, I'm trapped Trapped in my own head"
lyrics:023Jtd3l1gEeWgL1w1m8bt,"I was guided by voices to the backyard where I hear That someone is talking about someone we used to know I think that I know what’s gotta be done I’ll take a minute alone and decide what to do You know I’d leave any party for you 'Cause no party’s so sweet as a party of two Sugar, I got no question of the right thing to do Oh, you know I’d leave any party for you You planned meeting me on your way home And I tried reaching you on your new flip phone You didn’t have the ringer on, so I couldn’t warn you That a stranger is here looking for a reunion You know I’d leave any party for you 'Cause no party’s so sweet as our party of two I’m getting tired of these clowns and balloons Oh, you know I’d leave any party for you I won’t lead you astray You know, I’ve got your back any day Oh no, I got your back every day, every day Every day, every day And you know I would leave, I would leave I would leave, I would leave You know I’d leave any party for you No party’s so sweet as our party of two"


In [98]:
# Let's look at the remaining instrumental songs before removing them
merged %>% filter(str_detect(text, regex("(^([\\n\\(\\[{\\* ]*(instrumental ?(track)?)[\\n\\]\\)}\\* ]*)$)", ignore_case=TRUE))) %>% head()

ind,text
lyrics:001VMKfkHZrlyj7JlQbQFL,Instrumental
lyrics:00T4mz4MQjrjatqWKZIHax,Instrumental
lyrics:01sn9NAAu2Nbf3T8HapOfi,Instrumental
lyrics:02m6fS1f1bsQxcLPdJvocl,Instrumental
lyrics:03AILJKNtXyZmvZV06yIsW,Instrumental
lyrics:03iwF1wfUp5XGkRz9RPCDG,Instrumental
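
The behaviour of the instrumental filter can be checked on a few toy strings. In this sketch the `samples` are made up for illustration and the pattern is copied from the cell above; it only matches when the whole lyric consists of the word "instrumental" (optionally followed by "track") plus surrounding brackets, asterisks, spaces or newlines.

```r
library(stringr)

# Pattern copied from the notebook cell above
instrumental_pattern <- regex(
  "(^([\\n\\(\\[{\\* ]*(instrumental ?(track)?)[\\n\\]\\)}\\* ]*)$)",
  ignore_case = TRUE
)

# Invented sample lyrics
samples <- c(
  "Instrumental",
  "[Instrumental Track]",
  "* instrumental *",
  "Instrumental break, then the verse starts"
)

is_instr <- str_detect(samples, instrumental_pattern)
print(is_instr)
```

The last sample is not flagged because real lyrics follow the word, which is why this filter is unlikely to remove songs that merely mention an instrumental section.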


In [116]:
indices <- merged %>% filter(str_detect(text, regex("(^([\\n\\(\\[{\\* ]*(instrumental ?(track)?)[\\n\\]\\)}\\* ]*)$)", ignore_case=TRUE))) %>% select(ind) %>% unlist()
write(indices, "instrumental.txt", append=TRUE)
merged <- merged %>% filter(!(ind %chin% indices))

In [10]:
# Let us identify the songs with [?]
ids <- merged %>% filter(str_detect(text, regex("\\[\\?\\]"))) %>% select(ind) %>% unlist()
write(ids, "question_mark.txt")

In [120]:
write.csv(merged, "cleaned_lyrics.csv", row.names=FALSE)

# Summary
1. The cleaned lyrics dataset contains 34169 songs whereas the original data contained 35536 songs. <br/><br/>
2. Out of 35536 songs, 1789 are instrumental songs. The lyrics ids of these songs are present in the **"instrumental.txt"** file.<br/><br/>
3. Songs with ids lyrics:5PQmSHzWnlgG4EBuIqjac2, lyrics:1Qf7H7d1TNP8VfPVDRLY0D are present in the original data without any text. There is a Genius lyrics URL for these two songs, but the page is not responsive, so they have been removed from the cleaned data.<br/><br/>
4. The following ids have missing text that I could not correctly identify:
   ("lyrics:48ncRBVLgiu8MY7O70VVw5", "lyrics:0OTrzRHzb8nALYYUGxUoBV", "lyrics:1KEKtRzR7YA2x8hss5qLSv", "lyrics:5g2lGWu4HTnNsSfo2Bruzr", "lyrics:6mv2faMbSa4MNtUIXu0AJT", "lyrics:7jrstlBpa8SAOWZLeITu3o", "lyrics:7nD87C4i8S8KrpMBuWJyDe", "lyrics:0yIIcN1RIMPChsrrZWbum0")<br/><br/>
5. There are 328 songs with a repeat symbol (e.g. [2x] or [x4]) at the end of a sentence in some part of the lyrics. For these, only the symbols ([2x], [x4]) have been removed from the text. They could not be expanded because there is no consistent pattern: in some cases the whole paragraph is repeated, whereas in others only the preceding sentence is. The lyrics ids of these songs are present in the **"repeated_no_pattern.txt"** file.<br/><br/>
6. There are 82 songs with [repeat] tags in some part of the lyrics. These are not expanded because the number of times the text should be repeated is not clear. The [repeat] tag has been removed from the text. The lyrics ids of these songs are present in the **"repeat_tags.txt"** file.<br/><br/>
7. There are 420 songs with [?] in some sentences of the lyrics. As per the Genius guidelines, [?] should be used when one is unclear about part of the lyrics. I compared some of these songs with lyrics from other websites: sometimes nothing is actually missing even though the Genius text contains [?], and sometimes a word is missing. The lyrics ids of such songs are in **"question_mark.txt"**.<br/><br/>
8. All other structure tags are removed from the lyrics using regular expressions (a few lyrics may still contain tags that the regex did not capture). Missing text has been added for other songs and repetition has been processed.

