# Corpus Analysis on **Ice Nine Kills** Lyrics

> Conducting corpus analysis on the lyrics of the singles from Ice Nine Kills' two horror-themed albums, *The Silver Scream* and *The Silver Scream 2: Welcome to Horrorwood*.

## Installing, Importing, and Processing

In [1]:
%%capture
import spacy

In [2]:
%%capture
!spacy download en_core_web_sm

In [3]:
%%capture
import os
from spacy import displacy
import pandas as pd
pd.options.mode.chained_assignment = None 
import plotly.express as px

In [4]:
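lyrics = []
file_names = []

# List and open files from the same directory so the cell does not depend on the working directory
data_dir = '/Users/praga/Desktop/Collecting Data/CDResubA2/Data'
for _file_name in sorted(os.listdir(data_dir)):  # sorted for a reproducible row order
    if _file_name.endswith('.txt'):
        with open(os.path.join(data_dir, _file_name), 'r', encoding='utf-8') as f:
            lyrics.append(f.read())
        file_names.append(_file_name)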
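
The loading step can also be wrapped in a small reusable function. The sketch below is one possible version, not the notebook's own code: `load_lyrics` is a hypothetical helper, and the temporary directory with toy files stands in for the real `Data` folder.

```python
import tempfile
from pathlib import Path


def load_lyrics(data_dir):
    """Return (file_names, lyrics) for every .txt file in data_dir, sorted by name."""
    file_names, lyrics = [], []
    for path in sorted(Path(data_dir).glob('*.txt')):
        file_names.append(path.name)
        lyrics.append(path.read_text(encoding='utf-8'))
    return file_names, lyrics


# Toy demonstration in a throwaway directory (placeholder contents, not real lyrics)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'b_song.txt').write_text('second file', encoding='utf-8')
    Path(tmp, 'a_song.txt').write_text('first file', encoding='utf-8')
    Path(tmp, 'metadata.csv').write_text('ignored', encoding='utf-8')
    demo_names, demo_lyrics = load_lyrics(tmp)

print(demo_names)
```

Sorting the paths keeps the row order stable across machines, and the `with` block in the demonstration cleans up the toy files afterwards.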

In [5]:
d = {'Filename': file_names, 'Text': lyrics}

In [6]:
song_df = pd.DataFrame(d)

In [7]:
song_df.head()

Unnamed: 0,Filename,Text
0,american_nightmare.txt,Getting ready for bed at a regular time\nIs on...
1,assault_n_batteries.txt,Breaking news alert\nA deadly shootout at a lo...
2,enjoy_your_slay.txt,"Going down, sir? (Indeed)\nHere you are\nPlagu..."
3,funeral_derangements.txt,"Slave to the plot, let them rot\nOr bring them..."
4,grave_mistake.txt,Here lies a lifeless bride and groom\nTill dea...


In [8]:
song_df['Text'] = song_df['Text'].str.replace('\n', ' ', regex = True).str.strip()
song_df.head()

Unnamed: 0,Filename,Text
0,american_nightmare.txt,Getting ready for bed at a regular time Is one...
1,assault_n_batteries.txt,Breaking news alert A deadly shootout at a loc...
2,enjoy_your_slay.txt,"Going down, sir? (Indeed) Here you are Plagued..."
3,funeral_derangements.txt,"Slave to the plot, let them rot Or bring them ..."
4,grave_mistake.txt,Here lies a lifeless bride and groom Till deat...


In [9]:
metadata_df = pd.read_csv('metadata.csv')

In [10]:
metadata_df.head()

Unnamed: 0,Filename,Title,Track Listing,Song Length,Release Date,Album,Horror Reference,Lyric Source
0,american_nightmare.txt,The American Nightmare,1,4:11,20-06-2018,The Silver Scream,A Nightmare on Elm Street,LyricFind
1,thank_god_its_friday.txt,Thank God It's Friday,2,4:24,13-07-2018,The Silver Scream,Friday the 13th,LyricFind
2,enjoy_your_slay.txt,Enjoy Your Slay (featuring Sam Kubrick of Shie...,8,4:16,26-05-2017,The Silver Scream,The Shining,musixmatch
3,grave_mistake.txt,A Grave Mistake,6,3:04,14-09-2018,The Silver Scream,The Crow,musixmatch
4,stabbing_in_the_dark.txt,Stabbing in the Dark,3,4:36,19-10-2018,The Silver Scream,Halloween,musixmatch


In [11]:
# Treat '.txt' as a literal string; as a regex, '.' would match any character
song_df['Filename'] = song_df['Filename'].str.replace('.txt', '', regex=False)
# Title in the metadata differs from the filename, so Filename is the merge key; strip .txt from the metadata df as well
metadata_df['Filename'] = metadata_df['Filename'].str.replace('.txt', '', regex=False)
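
Stripping an extension with `str.replace` behaves differently depending on whether the pattern is read as a regular expression. The sketch below illustrates the difference on a made-up filename (an assumption for illustration; none of the filenames displayed in this notebook contain `txt` outside the extension).

```python
import pandas as pd

# Hypothetical filename chosen so that 'txt' also appears after another character
names = pd.Series(['atxt_notes.txt'])

as_regex = names.str.replace('.txt', '', regex=True)      # '.' matches any character
as_literal = names.str.replace('.txt', '', regex=False)   # only the literal '.txt'
as_suffix = names.str.replace(r'\.txt$', '', regex=True)  # only a trailing '.txt'

print(as_regex[0], as_literal[0], as_suffix[0])
```

The regex version also removes `atxt` at the start of the name, while the literal and anchored versions remove only the extension.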

In [12]:
playlist_df = metadata_df.merge(song_df, on = 'Filename')

In [13]:
playlist_df.head()

Unnamed: 0,Filename,Title,Track Listing,Song Length,Release Date,Album,Horror Reference,Lyric Source,Text
0,american_nightmare,The American Nightmare,1,4:11,20-06-2018,The Silver Scream,A Nightmare on Elm Street,LyricFind,Getting ready for bed at a regular time Is one...
1,thank_god_its_friday,Thank God It's Friday,2,4:24,13-07-2018,The Silver Scream,Friday the 13th,LyricFind,He drowned in all our sins He drowned in our m...
2,enjoy_your_slay,Enjoy Your Slay (featuring Sam Kubrick of Shie...,8,4:16,26-05-2017,The Silver Scream,The Shining,musixmatch,"Going down, sir? (Indeed) Here you are Plagued..."
3,grave_mistake,A Grave Mistake,6,3:04,14-09-2018,The Silver Scream,The Crow,musixmatch,Here lies a lifeless bride and groom Till deat...
4,stabbing_in_the_dark,Stabbing in the Dark,3,4:36,19-10-2018,The Silver Scream,Halloween,musixmatch,In calculated silence Captivated by the violen...
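
Because `merge` defaults to an inner join, any song whose filename appears in only one of the two frames is dropped without warning. A minimal sketch of how that could be checked follows; the toy frames reuse filenames from this notebook, but the mismatch between them is invented for illustration.

```python
import pandas as pd

# Toy frames: the filenames come from the notebook, the mismatch is an assumption
meta = pd.DataFrame({'Filename': ['american_nightmare', 'grave_mistake'],
                     'Title': ['The American Nightmare', 'A Grave Mistake']})
songs = pd.DataFrame({'Filename': ['american_nightmare', 'funeral_derangements'],
                      'Text': ['placeholder text', 'placeholder text']})

# how='outer' keeps unmatched rows; indicator=True adds a '_merge' column
check = meta.merge(songs, on='Filename', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Filename', '_merge']]
print(unmatched)
```

Rows marked `left_only` exist only in the metadata, and rows marked `right_only` exist only among the lyric files.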


## Text Enrichment with spaCy

In [14]:
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [15]:
def process_lyrics(text):
    return nlp(text)

In [16]:
playlist_df['Doc'] = playlist_df['Text'].apply(process_lyrics)

## Text Reduction

### Tokenization

In [17]:
def get_token(doc):
    return [token.text for token in doc]

In [18]:
playlist_df['Tokens'] = playlist_df['Doc'].apply(get_token)

In [19]:
tokens = playlist_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,Getting ready for bed at a regular time Is one...,"[Getting, ready, for, bed, at, a, regular, tim..."
1,He drowned in all our sins He drowned in our m...,"[He, drowned, in, all, our, sins, He, drowned,..."
2,"Going down, sir? (Indeed) Here you are Plagued...","[Going, down, ,, sir, ?, (, Indeed, ), Here, y..."
3,Here lies a lifeless bride and groom Till deat...,"[Here, lies, a, lifeless, bride, and, groom, T..."
4,In calculated silence Captivated by the violen...,"[In, calculated, silence, Captivated, by, the,..."


### Lemmatization

In [20]:
def get_lemma(doc):
    return [token.lemma_ for token in doc]

In [21]:
playlist_df['Lemmas'] = playlist_df['Doc'].apply(get_lemma)

## Text Annotation

### Parts of Speech (POS) Tagging

In [22]:
def get_pos(doc): 
    return [(token.pos_, token.tag_) for token in doc]

In [23]:
playlist_df['POS'] = playlist_df['Doc'].apply(get_pos) 

In [24]:
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

In [25]:
playlist_df['Proper_Nouns'] = playlist_df['Doc'].apply(extract_proper_nouns) 

In [26]:
#testing progress
list(playlist_df.loc[[0, 4], 'Proper_Nouns'])

[['David', 'Dreams', 'Craven', 'Sweet', 'Wicked', 'morgue', 'Five', 'Seven'],
 ['Fall',
  'Haddonfield',
  'Knife',
  'Day',
  'Knife',
  'Fall',
  'Scream',
  'Halloween',
  'Fall',
  'Orange',
  'Grove',
  'Ave',
  'Suspect',
  'Michael',
  'Myers',
  'Michael',
  'Fall']]
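
A frequency count makes the dominant proper nouns of a song easier to see. The sketch below tallies the list printed above for the song at index 4 ("Stabbing in the Dark") with the standard library's `Counter`; the variable names are illustrative.

```python
from collections import Counter

# Proper nouns printed above for the song at index 4 ("Stabbing in the Dark")
proper_nouns = ['Fall', 'Haddonfield', 'Knife', 'Day', 'Knife', 'Fall', 'Scream',
                'Halloween', 'Fall', 'Orange', 'Grove', 'Ave', 'Suspect',
                'Michael', 'Myers', 'Michael', 'Fall']

noun_counts = Counter(proper_nouns)
print(noun_counts.most_common(3))
# → [('Fall', 4), ('Knife', 2), ('Michael', 2)]
```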

# Analysis of the Corpus

The demonstration above shows that the proper nouns in each song are often enough for an avid horror fan to guess the song's inspiration with high accuracy, even without listening to it. Songs inspired by cult classics may also be easy for most people to identify, simply because the source material is so pervasive and so commonly referenced.

Named Entity Recognition (NER) offers a more sophisticated way to answer the following questions:
How many times does each song reference a person, a place, or a time? And which song has the most references to a person, a place, or a time?

## NER and Annotation Analysis

In [27]:
labels = nlp.get_pipe("ner").labels
for label in labels:
    print(label + ' : ' + spacy.explain(label))

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


In [28]:
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

playlist_df['Named_Entities'] = playlist_df['Doc'].apply(extract_named_entities)
playlist_df['Named_Entities']

0     [PERSON, PERSON, PERSON, PRODUCT, TIME, NORP, ...
1     [LOC, NORP, DATE, LOC, LOC, DATE, PERSON, DATE...
2     [ORG, CARDINAL, PERSON, CARDINAL, DATE, PERSON...
3                                        [PERSON, DATE]
4     [DATE, TIME, ORG, PERSON, TIME, PERSON, TIME, ...
5     [ORG, TIME, DATE, ORDINAL, ORG, DATE, ORG, TIM...
6     [PERSON, PERSON, PERSON, PERSON, PERSON, PERSO...
7     [PERSON, PERSON, PERSON, PERSON, PERSON, DATE,...
8     [PRODUCT, PERSON, CARDINAL, PERSON, NORP, PERS...
9     [CARDINAL, DATE, PERSON, QUANTITY, CARDINAL, C...
10                                         [DATE, DATE]
11             [ORG, TIME, PRODUCT, PERSON, TIME, TIME]
12    [ORDINAL, CARDINAL, WORK_OF_ART, PERSON, CARDI...
13    [TIME, TIME, CARDINAL, WORK_OF_ART, TIME, TIME...
14        [WORK_OF_ART, PERSON, PERSON, PERSON, PERSON]
Name: Named_Entities, dtype: object

In [29]:
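# Renamed so it does not overwrite the label-extracting function defined in the previous cell
def extract_named_entity_spans(doc):
    # Return the entity spans themselves (the matched text), not their labels
    return list(doc.ents)

playlist_df['NE_Words'] = playlist_df['Doc'].apply(extract_named_entity_spans)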
playlist_df['NE_Words']

0     [(David), (Makes), (Rest), (Craven), (night), ...
1     [(Crystal, Lake, Ki, -, Ki, -, Ki), (Throats),...
2     [(Lies), (one), (Cursed), (One), (five, -, yea...
3                                    [(Payback), (May)]
4     [(Fifteen, years, ago), (the, midnight, hour),...
5     [(Santa, Clause), (tonight), (five, years, old...
6     [(Eating), (Villains), (Eating), (Villains), (...
7     [(Georgie, You), (Pick), (Prey), (Georgie), (D...
8     [(Valentino), (gore), (spleens), (Devil), (Ame...
9     [(one), (the, day), (Andy), (two, -, foot), (O...
10             [(a, rainy, day, Fall), (a, rainy, day)]
11    [(Yeah, Sometimes, Sometimes), (tonight), (Spa...
12    [(14th), (more, than, one), (Love), (Blunt), (...
13    [(tonight), (tonight), (one), (Listen, to, Mot...
14    [(Cut), (Horrorwood), (Horrorwood), (Silver, S...
Name: NE_Words, dtype: object

In [30]:
# Define a function to count the named entity types
def count_entity_types(entity_types):
    # Initialize counters
    person_count = 0
    place_count = 0
    time_count = 0

    # Loop through the entity labels and count
    for entity_type in entity_types:
        if entity_type == 'PERSON':
            person_count += 1
        elif entity_type == 'LOC':  # LOC covers non-GPE locations only; GPE (countries, cities, states) is not counted
            place_count += 1
        elif entity_type in ('DATE', 'TIME'):  # Treat both 'DATE' and 'TIME' as time references
            time_count += 1

    return person_count, place_count, time_count

# Apply the function to each row's entity labels and store the results
playlist_df[['person_count', 'place_count', 'time_count']] = playlist_df.apply(
    lambda row: pd.Series(count_entity_types(row['Named_Entities'])),
    axis=1
)

# Show the DataFrame section with the counts
title = playlist_df.columns[1]
counts = playlist_df.columns[-3:]

# Create the new DataFrame
count_df = playlist_df[[title] + list(counts)]
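
The counting cell above can also be expressed as a lookup from entity label to category. The sketch below is an alternative formulation, not the notebook's code: `CATEGORY_OF_LABEL` and `count_categories` are hypothetical names, and the default mapping mirrors the cell above in treating only `LOC` as a place.

```python
from collections import Counter

# Assumed grouping that mirrors the cell above: only LOC counts as a place
CATEGORY_OF_LABEL = {'PERSON': 'person', 'LOC': 'place', 'DATE': 'time', 'TIME': 'time'}


def count_categories(labels, mapping=CATEGORY_OF_LABEL):
    """Count labels per category; labels missing from the mapping are ignored."""
    counts = Counter(mapping[label] for label in labels if label in mapping)
    return counts['person'], counts['place'], counts['time']


# Labels shown above for row 11 of Named_Entities
row_11 = ['ORG', 'TIME', 'PRODUCT', 'PERSON', 'TIME', 'TIME']
print(count_categories(row_11))
# → (1, 0, 3)

# A wider notion of "place" only needs a wider mapping (an illustrative choice)
wider = {**CATEGORY_OF_LABEL, 'GPE': 'place', 'FAC': 'place'}
print(count_categories(['GPE', 'LOC'], wider))
```

Keeping the grouping in a dictionary makes the definition of "place" explicit and easy to change, for example to include cities and buildings.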


### Results

In [31]:
count_df

Unnamed: 0,Title,person_count,place_count,time_count
0,The American Nightmare,4,0,4
1,Thank God It's Friday,3,8,6
2,Enjoy Your Slay (featuring Sam Kubrick of Shie...,5,0,1
3,A Grave Mistake,1,0,1
4,Stabbing in the Dark,5,0,7
5,Merry Axe-mas,2,0,14
6,Savages,10,0,0
7,"IT Is The End (featuring Peter ""JR"" Wasilewski...",7,0,2
8,Hip to Be Scared (featuring Jacoby Shaddix),8,0,0
9,Assault & Batteries,4,0,1


In [32]:
most_ppl = playlist_df.sort_values(by = 'person_count', ascending = False).head(1)['Title'].values[0]
most_plc = playlist_df.sort_values(by = 'place_count', ascending = False).head(1)['Title'].values[0]
most_time = playlist_df.sort_values(by = 'time_count', ascending = False).head(1)['Title'].values[0]

In [33]:
print(f'"{most_ppl}" has the highest person_count in this playlist')
print(f'"{most_plc}" has the highest place_count in this playlist')
print(f'"{most_time}" has the highest time_count in this playlist') 

"Savages" has the highest person_count in this playlist
"Thank God It's Friday" has the highest place_count in this playlist
"Merry Axe-mas" has the highest time_count in this playlist


The new DataFrame `count_df` lists each song with its person, place, and time counts, and the code cells that follow it report the song with the highest count in each category.
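
The same "highest count" lookup can be done for all three columns at once with `idxmax`. The sketch below uses four rows copied from the `count_df` output above; `sample`, `top_titles`, and `tied_for_place` are illustrative names.

```python
import pandas as pd

# Four rows copied from the count_df output above
sample = pd.DataFrame({
    'Title': ["Thank God It's Friday", 'Merry Axe-mas', 'Savages',
              'Hip to Be Scared (featuring Jacoby Shaddix)'],
    'person_count': [3, 2, 10, 8],
    'place_count': [8, 0, 0, 0],
    'time_count': [6, 14, 0, 0],
})

# idxmax returns the index label of the first maximum in each column
count_columns = ['person_count', 'place_count', 'time_count']
top_titles = sample.set_index('Title')[count_columns].idxmax()
print(top_titles)

# idxmax resolves ties silently; listing every row at the maximum makes them visible
tied_for_place = sample.loc[sample['place_count'] == sample['place_count'].max(), 'Title'].tolist()
print(tied_for_place)
```

If two songs shared a maximum, `idxmax` (like sorting and taking the first row) would report only one of them, so the explicit comparison is worth running when ties matter.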

## Cleaning up and Exporting CSV

In [34]:
# The Doc and Proper_Nouns columns are no longer needed
playlist_df = playlist_df.drop(columns=['Doc', 'Proper_Nouns'])

In [35]:
playlist_df.head(1) #test

Unnamed: 0,Filename,Title,Track Listing,Song Length,Release Date,Album,Horror Reference,Lyric Source,Text,Tokens,Lemmas,POS,Named_Entities,NE_Words,person_count,place_count,time_count
0,american_nightmare,The American Nightmare,1,4:11,20-06-2018,The Silver Scream,A Nightmare on Elm Street,LyricFind,Getting ready for bed at a regular time Is one...,"[Getting, ready, for, bed, at, a, regular, tim...","[get, ready, for, bed, at, a, regular, time, b...","[(VERB, VBG), (ADJ, JJ), (ADP, IN), (NOUN, NN)...","[PERSON, PERSON, PERSON, PRODUCT, TIME, NORP, ...","[(David), (Makes), (Rest), (Craven), (night), ...",4,0,4


In [36]:
#saving dataframe as csv file
playlist_df.to_csv('INK_silverscream_singles_with_spaCy_tags.csv') 
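
One thing to keep in mind when reloading the exported file: CSV stores each list as its string representation. The sketch below shows the round trip for a column of string lists such as `Named_Entities`, using one row from the output above and an in-memory CSV instead of a file. It assumes the lists contain plain strings; `NE_Words` holds spaCy spans, which would likely not parse back this way.

```python
import ast
import io

import pandas as pd

# One row taken from the notebook output: "A Grave Mistake" has labels [PERSON, DATE]
df = pd.DataFrame({'Title': ['A Grave Mistake'], 'Named_Entities': [['PERSON', 'DATE']]})
csv_text = df.to_csv(index=False)  # with no path, to_csv returns the CSV as a string

# Read back naively: the list comes back as its string representation
raw = pd.read_csv(io.StringIO(csv_text))

# Read back with a converter that parses the string into a Python list again
restored = pd.read_csv(io.StringIO(csv_text),
                       converters={'Named_Entities': ast.literal_eval})

print(type(raw.loc[0, 'Named_Entities']).__name__,
      type(restored.loc[0, 'Named_Entities']).__name__)
# → str list
```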