<a href="https://colab.research.google.com/github/SijieQiu/CD_A4/blob/main/TS_Song_Ranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Analysis with spaCy
## Analysis of the Top 50 Songs by Taylor Swift
#### https://variety.com/lists/best-taylor-swift-songs-ranked/tim-mcgraw/

### Introduction
“I’m doing good, I’m on some new shit,” Taylor Swift softly declared at the outset of “Folklore,” and truer sentiments are never spoken than in the case of the woman who somehow manages to be the best and most prolific songwriter in pop. Consider that, just since the last time she went on tour, Swift has released six albums, four of which were all-new, two of which dug into her vaults and proved she’s even more of a constant song fount than we knew. New shit is her brand, maybe her compulsion too, and definitely our pleasure.

So sometimes it takes a special occasion to tear yourself away from the chronic relistenability of a “Midnights” to ask yourself: What do “Speak Now,” “1989,” “Reputation,” et al. have to say to me today?

If your favorite Swift song is missing from this highly subjective, critical 50-best list, rest assured that it’s probably in our unspoken 51st or 52nd slot. And know that on any given day, the winds might have blown differently and we might even have put “Shake It Off” or “Love Story” on the list. For now, there were just too many brilliant deep cuts to consider to let all the bigger hits hog the top ranks of the canon. And having just named “Midnights” the album of the year, I didn’t think it was too soon to push some cuts from that record onto this “Eras”-spanning list. Are you ready for it? (Boom, boom, boom.)

### Installing, Importing and Preprocessing

In [None]:
# Install spaCy, plotly and nbformat.
%pip install spacy
%pip install plotly
%pip install nbformat --upgrade



In [None]:
import spacy

In [None]:
!python -m spacy download en_core_web_sm
import spacy

# Load the downloaded model
nlp = spacy.load("en_core_web_sm")

# Import os to upload documents and metadata
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Import graphing package
import plotly.graph_objects as go
import plotly.express as px

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 73.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load

In [None]:
from google.colab import files
uploaded = files.upload()

Saving 1.txt to 1.txt
Saving 2.txt to 2.txt
Saving 3.txt to 3.txt
Saving 4.txt to 4.txt
Saving 5.txt to 5.txt
Saving 6.txt to 6.txt
Saving 7.txt to 7.txt
Saving 8.txt to 8.txt
Saving 9.txt to 9.txt
Saving 10.txt to 10.txt
Saving 11.txt to 11.txt
Saving 12.txt to 12.txt
Saving 13.txt to 13.txt
Saving 14.txt to 14.txt
Saving 15.txt to 15.txt
Saving 16.txt to 16.txt
Saving 17.txt to 17.txt
Saving 18.txt to 18.txt
Saving 19.txt to 19.txt
Saving 20.txt to 20.txt


In [None]:
import io

# Create empty lists for file names and contents
texts = []
file_names = []

# Iterate through each uploaded file
for _file_name in uploaded.keys():
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Read contents of each text file and append to the text list
        text_content = io.StringIO(uploaded[_file_name].decode('utf-8')).read()
        texts.append(text_content)
        # Append name of each file to file name list
        file_names.append(_file_name)
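
Outside Colab, `files.upload()` is not available. The following is a minimal sketch of the same loading step for a local folder. The helper name `load_texts` and the folder layout (UTF-8 files named 1.txt, 2.txt, and so on) are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path


def load_texts(folder):
    """Return (file_names, texts) for every .txt file in `folder`.

    Hypothetical helper: assumes UTF-8 encoded text files.
    """
    # Sort numerically so that 10.txt comes after 9.txt rather than after 1.txt;
    # files without a purely numeric name are placed after the numbered ones
    def sort_key(path):
        stem = path.stem
        return (0, int(stem), "") if stem.isdigit() else (1, 0, stem)

    paths = sorted(Path(folder).glob("*.txt"), key=sort_key)
    file_names = [p.name for p in paths]
    texts = [p.read_text(encoding="utf-8") for p in paths]
    return file_names, texts
```

The two returned lists line up index by index, so they can be placed in a dictionary and turned into a DataFrame in the same way as the uploaded files.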

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

In [None]:
# Turn dictionary into a dataframe
song_df = pd.DataFrame(d)

In [None]:
song_df.head()

Unnamed: 0,Filename,Text
0,1.txt,You Belong With Me\r\nIt’s as eternally teenag...
1,2.txt,"All Too Well\r\nIn its original, truncated for..."
2,3.txt,Get Away Car\r\nSwift was playing with a lot o...
3,4.txt,Style\r\nSwift’s partner in this deeply sexy t...
4,5.txt,"Mirrorball\r\n“I’ll be your mirror,” the Velve..."


Some of the texts contain extra whitespace characters, such as tabs (\t) and line breaks (\r\n or \n), as the output above shows. Each run of these characters can be replaced by a single space using the str.replace() method with a regular expression.



In [None]:
# Collapse each run of whitespace into a single space and trim the ends
song_df['Text'] = song_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
song_df.head()

Unnamed: 0,Filename,Text
0,1.txt,You Belong With Me It’s as eternally teenaged ...
1,2.txt,"All Too Well In its original, truncated form o..."
2,3.txt,Get Away Car Swift was playing with a lot of f...
3,4.txt,Style Swift’s partner in this deeply sexy trac...
4,5.txt,"Mirrorball “I’ll be your mirror,” the Velvet U..."
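
To see what the pattern does on a single string, here is a small plain-Python sketch of the same replacement using the standard `re` module. The function name `collapse_whitespace` and the sample string are made up for illustration.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim both ends."""
    # \s matches spaces, tabs, \r and \n; the raw string keeps the backslash intact
    return re.sub(r"\s+", " ", text).strip()


sample = "You Belong With Me\r\nIt’s as eternally\t teenaged  "
print(collapse_whitespace(sample))
```

Writing the pattern as a raw string (`r"\s+"`) avoids Python treating `\s` as an invalid string escape.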


The resulting DataFrame is now ready for analysis.

## Text Enrichment with spaCy


### Creating Doc Object



In [None]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

After the function is defined, use .apply() to apply it to every cell in a given DataFrame column. In this case, nlp will run on each cell in the Text column of the song_df DataFrame, creating a Doc object from every text. These Doc objects will be stored in a new column of the DataFrame called Doc.

In [None]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each song text
song_df['Doc'] = song_df['Text'].apply(process_text)

### Text Reduction

#### Tokenization

A critical first step spaCy performs is tokenization, or the segmentation of strings into individual words and punctuation markers. Tokenization enables spaCy to parse the grammatical structures of a text and identify characteristics of each word, like its part-of-speech.

To retrieve a tokenized version of each text in the DataFrame, I’ll write a function that iterates through any given Doc object and returns all tokens found within it.

In [None]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

In [None]:
# Run the token retrieval function on the doc objects in the dataframe
song_df['Tokens'] = song_df['Doc'].apply(get_token)
song_df.head()

Unnamed: 0,Filename,Text,Doc,Tokens
0,1.txt,You Belong With Me It’s as eternally teenaged ...,"(You, Belong, With, Me, It, ’s, as, eternally,...","[You, Belong, With, Me, It, ’s, as, eternally,..."
1,2.txt,"All Too Well In its original, truncated form o...","(All, Too, Well, In, its, original, ,, truncat...","[All, Too, Well, In, its, original, ,, truncat..."
2,3.txt,Get Away Car Swift was playing with a lot of f...,"(Get, Away, Car, Swift, was, playing, with, a,...","[Get, Away, Car, Swift, was, playing, with, a,..."
3,4.txt,Style Swift’s partner in this deeply sexy trac...,"(Style, Swift, ’s, partner, in, this, deeply, ...","[Style, Swift, ’s, partner, in, this, deeply, ..."
4,5.txt,"Mirrorball “I’ll be your mirror,” the Velvet U...","(Mirrorball, “, I, ’ll, be, your, mirror, ,, ”...","[Mirrorball, “, I, ’ll, be, your, mirror, ,, ”..."


In [None]:
tokens = song_df[['Text', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Text,Tokens
0,You Belong With Me It’s as eternally teenaged ...,"[You, Belong, With, Me, It, ’s, as, eternally,..."
1,"All Too Well In its original, truncated form o...","[All, Too, Well, In, its, original, ,, truncat..."
2,Get Away Car Swift was playing with a lot of f...,"[Get, Away, Car, Swift, was, playing, with, a,..."
3,Style Swift’s partner in this deeply sexy trac...,"[Style, Swift, ’s, partner, in, this, deeply, ..."
4,"Mirrorball “I’ll be your mirror,” the Velvet U...","[Mirrorball, “, I, ’ll, be, your, mirror, ,, ”..."


#### Lemmatization
Another process performed by spaCy is lemmatization, or the retrieval of the dictionary root word of each word (for example “brighten” for “brightening”). I’ll perform a similar set of steps to those above to create a function to call the lemmas from the Doc object, then apply it to the DataFrame.

In [None]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
song_df['Lemmas'] = song_df['Doc'].apply(get_lemma)

In [None]:
token_count = song_df['Tokens'].apply(lambda x: x.count('write')).sum()
lemma_count = song_df['Lemmas'].apply(lambda x: x.count('write')).sum()
print(f'"Write" appears in the text tokens column {token_count} times.')
print(f'"Write" appears in the lemmas column {lemma_count} times.')

"Write" appears in the text tokens column 1 times.
"Write" appears in the lemmas column 6 times.


As expected, there are more instances of “write” in the Lemmas column, as the lemmatization process has grouped inflected word forms (such as writes, writing, wrote) into the base word “write.”
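
The counting logic can be sketched without pandas. The example below uses two made-up rows standing in for the Tokens and Lemmas columns; `count_word` is a hypothetical helper, not part of the notebook.

```python
def count_word(column, word):
    """Count exact matches of `word` across a list of token lists."""
    return sum(tokens.count(word) for tokens in column)


# Made-up stand-ins for two rows of the Tokens and Lemmas columns
tokens = [["She", "writes", "songs"], ["Swift", "wrote", "it", "to", "write", "more"]]
lemmas = [["she", "write", "song"], ["Swift", "write", "it", "to", "write", "more"]]

print(count_word(tokens, "write"))  # only the literal token "write" matches
print(count_word(lemmas, "write"))  # "writes" and "wrote" are now counted too
```

Note that `list.count` is case-sensitive, so a capitalized “Write” at the start of a sentence would not be counted in the Tokens column.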

### Text Annotation

#### Part of Speech Tagging

SpaCy facilitates two levels of part-of-speech tagging: coarse-grained tagging, which predicts the simple universal part-of-speech of each token in a text (such as noun, verb, adjective, adverb), and detailed tagging, which uses a larger, more fine-grained set of part-of-speech tags (for example 3rd person singular present verb). The part-of-speech tags used are determined by the English language model we use. In this case, I'll use the small English model (en_core_web_sm); the differences between the available models are described on spaCy’s website.

We can call the part-of-speech tags in the same way as the lemmas. Create a function to extract them from any given Doc object and apply the function to each Doc object in the DataFrame. The function I’ll create will extract both the coarse- and fine-grained part-of-speech for each token (token.pos_ and token.tag_, respectively).

In [None]:
# Define a function to retrieve parts of speech from a doc object
def get_pos(doc):
    # Return the coarse- and fine-grained part of speech text for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Run the part-of-speech retrieval function on the doc objects in the dataframe
song_df['POS'] = song_df['Doc'].apply(get_pos)

In [None]:
# Create a list of part of speech tags
list(song_df['POS'])

[[('PRON', 'PRP'),
  ('VERB', 'VBP'),
  ('ADP', 'IN'),
  ('PRON', 'PRP'),
  ('PRON', 'PRP'),
  ('VERB', 'VBZ'),
  ('ADV', 'RB'),
  ('ADV', 'RB'),
  ('VERB', 'VBN'),
  ('ADP', 'IN'),
  ('PRON', 'NN'),
  ('PROPN', 'NNP'),
  ('ADV', 'RB'),
  ('VERB', 'VBD'),
  ('PUNCT', ','),
  ('CCONJ', 'CC'),
  ('PRON', 'DT'),
  ('VERB', 'VBZ'),
  ('DET', 'DT'),
  ('NOUN', 'NN'),
  ('PART', 'TO'),
  ('VERB', 'VB'),
  ('PART', 'TO'),
  ('VERB', 'VB'),
  ('ADP', 'IN'),
  ('ADP', 'IN'),
  ('PRON', 'PRP'),
  ('PUNCT', '.'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NN'),
  ('AUX', 'VBZ'),
  ('ADJ', 'JJ'),
  ('ADP', 'IN'),
  ('PRON', 'NN'),
  ('SCONJ', 'IN'),
  ('PART', 'RB'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('ADJ', 'JJ'),
  ('PUNCT', 'HYPH'),
  ('NOUN', 'NN'),
  ('NOUN', 'NN'),
  ('PUNCT', 'HYPH'),
  ('NOUN', 'NN'),
  ('PUNCT', '.'),
  ('PRON', 'WP'),
  ('ADP', 'IN'),
  ('PRON', 'PRP'),
  ('PROPN', 'NNP'),
  ('NOUN', 'NNS'),
  ('VERB', 'VBG'),
  ('PRON', 'PRP'),
  ('AUX', 'PRP'),
  ('AUX', 'VB'),


In [None]:
spacy.explain("IN")

'conjunction, subordinating or preposition'
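
A long list of tag pairs is hard to read at a glance. One way to summarize it, sketched here with the standard library only, is to tally the coarse-grained tags. The helper `count_coarse_pos` is hypothetical, and the sample holds the first six pairs of the output above.

```python
from collections import Counter


def count_coarse_pos(pos_pairs):
    """Tally the coarse tag from a list of (coarse, fine) tuples."""
    return Counter(coarse for coarse, _fine in pos_pairs)


# First six (coarse, fine) pairs from the POS output above
sample = [
    ("PRON", "PRP"), ("VERB", "VBP"), ("ADP", "IN"),
    ("PRON", "PRP"), ("PRON", "PRP"), ("VERB", "VBZ"),
]
counts = count_coarse_pos(sample)
print(counts.most_common(2))
```

Applied to a whole column, for instance with `song_df['POS'].apply(count_coarse_pos)`, this should give one tally per text.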

Next, I'll extract only the words that have been tagged as proper nouns:

In [None]:
# Define function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Apply function to Doc column and store resulting proper nouns in new column
song_df['Proper_Nouns'] = song_df['Doc'].apply(extract_proper_nouns)

Listing the proper nouns in each text can help us ascertain the texts’ subjects. Let’s list the proper nouns in two different texts, the text located in row 2 of the DataFrame and the text located in row 19.

In [None]:
list(song_df.loc[[2, 19], 'Proper_Nouns'])

[['Car',
  'Swift',
  'Reputation',
  'Max',
  'Martin',
  'Shellback',
  'Jack',
  'Antonoff',
  'Swift',
  'Getaway',
  'Car',
  'Getaway',
  'Car',
  'Bonnie',
  'Clyde',
  'Swift',
  'Ridin'],
 ['Cruel', 'LoverFest', 'Vocoder', 'ELO', 'Banamarama']]
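
Because the function works token by token, multi-word names are split (“Max”, “Martin”) and repeated mentions appear several times. A small sketch for reducing such a list to its distinct entries while keeping first-appearance order follows; `unique_in_order` is a hypothetical helper, and the list is copied from the row 2 output above.

```python
def unique_in_order(items):
    """Drop repeats while keeping the order of first appearance."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(items))


# Proper nouns of row 2, copied from the output above
row_2 = ['Car', 'Swift', 'Reputation', 'Max', 'Martin', 'Shellback', 'Jack',
         'Antonoff', 'Swift', 'Getaway', 'Car', 'Getaway', 'Car', 'Bonnie',
         'Clyde', 'Swift', 'Ridin']

print(unique_in_order(row_2))
```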

#### Named Entity Recognition

SpaCy can tag named entities in the text, such as names, dates, organizations, and locations. Call the full list of named entities and their descriptions using this code:

In [None]:
# Get all NE labels and assign to variable
labels = nlp.get_pipe("ner").labels

# Print each label and its description
for label in labels:
    print(label + ' : ' + spacy.explain(label))

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


I’ll create a function to extract the named entity tags from each Doc object and apply it to the Doc objects in the DataFrame, storing the named entities in a new column:

In [None]:
# Define function to extract named entities from doc objects
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

# Apply function to Doc column and store resulting named entities in new column
song_df['Named_Entities'] = song_df['Doc'].apply(extract_named_entities)
song_df['Named_Entities']

0                         [DATE, PERSON, ORDINAL, DATE]
1     [WORK_OF_ART, WORK_OF_ART, ORG, TIME, NORP, TI...
2     [PERSON, ORG, PERSON, PRODUCT, PERSON, GPE, OR...
3     [PERSON, PERSON, PERSON, PERSON, PERSON, PERSO...
4                    [ORG, PERSON, ORG, DATE, CARDINAL]
5                    [ORG, PERSON, ORG, DATE, CARDINAL]
6                    [WORK_OF_ART, PERSON, ORG, PERSON]
7     [ORDINAL, LOC, CARDINAL, WORK_OF_ART, WORK_OF_...
8        [WORK_OF_ART, ORDINAL, PERSON, CARDINAL, DATE]
9                          [ORG, WORK_OF_ART, CARDINAL]
10         [PERSON, ORG, PERSON, PERSON, CARDINAL, ORG]
11           [LOC, ORDINAL, GPE, TIME, PERSON, ORDINAL]
12                [DATE, PERSON, CARDINAL, WORK_OF_ART]
13                  [TIME, CARDINAL, WORK_OF_ART, DATE]
14                        [ORG, PERSON, CARDINAL, NORP]
15    [WORK_OF_ART, WORK_OF_ART, CARDINAL, ORDINAL, ...
16                   [LOC, ORG, CARDINAL, DATE, PERSON]
17                   [LOC, ORG, CARDINAL, DATE, 
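
To compare how often each entity type occurs across the whole corpus rather than per text, the per-row lists can be flattened and tallied. This is a standard-library sketch using three rows (0, 4 and 9) copied from the output above; the variable names are illustrative.

```python
from collections import Counter
from itertools import chain

# Rows 0, 4 and 9 copied from the Named_Entities output above
rows = [
    ["DATE", "PERSON", "ORDINAL", "DATE"],
    ["ORG", "PERSON", "ORG", "DATE", "CARDINAL"],
    ["ORG", "WORK_OF_ART", "CARDINAL"],
]

# Flatten the per-text lists into one stream of labels, then tally them
label_counts = Counter(chain.from_iterable(rows))
print(label_counts.most_common())
```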

Add another column with the words and phrases identified as named entities:



In [None]:
# Define function to extract the words and phrases tagged as named entities from doc objects
# (named differently from the label function above so that it is not overwritten)
def extract_named_entity_words(doc):
    return list(doc.ents)

# Apply function to Doc column and store the resulting entity spans in new column
song_df['NE_Words'] = song_df['Doc'].apply(extract_named_entity_words)
song_df['NE_Words']

0     [(decade), (Swift), (second), (all, 21st, cent...
1     [(All, Too, Well), (Taylor, ’s, Version), (lin...
2     [(Max, Martin), (Shellback), (Jack, Antonoff),...
3     [(Style, Swift, ’s), (James, Dean), (Taylor), ...
4     [(Velvet, Underground), (Swift), (Folklore), (...
5     [(Velvet, Underground), (Swift), (Folklore), (...
6     [(We, Are, Never), (Max, Martin), (Shellback),...
7     [(first), (Swift), (one), (Right, Where, You, ...
8       [(Midnights), (first), (Swift), (100), (weeks)]
9                     [(Folklore), (Gold, Rush), (one)]
10    [(Grammys), (the, “, Speak, Now, ”), (David), ...
11    [(Swift), (first), (America), (late, -, night)...
12    [(2016), (Swift), (one), (Your, Good, Girl, ’s...
13    [(6, minutes, and, 44, seconds), (One), (Midni...
14           [(Lover, One), (Swift), (6/8), (American)]
15    [(Love, Story), (White, Horse), (Fifteen), (fi...
16    [(Swifties), (Folklore), (one), (years, later)...
17    [(Swifties), (Folklore), (one), (years, la
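
The Named_Entities and NE_Words columns are parallel: both are built from doc.ents in the same order, so the n-th label belongs to the n-th phrase. A minimal sketch of pairing them follows; `pair_entities` is a hypothetical helper, and the sample values are the first three entities of row 0 in the two outputs above, written as plain strings.

```python
def pair_entities(words, labels):
    """Pair each entity's text with its label; both lists must be parallel."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return list(zip(words, labels))


# First three entities of row 0, taken from the two outputs above
words = ["decade", "Swift", "second"]
labels = ["DATE", "PERSON", "ORDINAL"]
print(pair_entities(words, labels))
```

Within spaCy itself, the same pairs can be collected in one pass with `[(ent.text, ent.label_) for ent in doc.ents]`.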

Then I'll visualize the words and their named entity tags in a single text. Call the Doc object of the second text (index 1, “All Too Well”) and use displacy.render to visualize the text with the named entities highlighted and tagged:

In [None]:
# Extract the Doc object of the second text (index 1)
doc = song_df['Doc'][1]

# Visualize named entity tagging in a single text
displacy.render(doc, style='ent', jupyter=True)

### Download Enriched Dataset

To save the dataset of Doc objects, text reductions and linguistic annotations generated with spaCy, I'll write the song_df DataFrame to a .csv file and download it to my local computer:

In [None]:
# Save DataFrame as a csv file in the current working directory of the Colab runtime
song_df.to_csv('TS_song_ranking_with_spaCy_tags.csv')

In [None]:
from google.colab import files
files.download('TS_song_ranking_with_spaCy_tags.csv')
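
One caveat when reloading the file later: a .csv stores every cell as text, so list-valued columns such as Tokens come back as strings rather than lists. The sketch below illustrates this with the standard `csv` module and shows how `ast.literal_eval` can restore a list of strings. The sample row is made up, and the assumption is that the list was written in its usual Python string form.

```python
import ast
import csv
import io

# Write one row whose second cell is the string form of a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Filename", "Tokens"])
writer.writerow(["1.txt", str(["You", "Belong", "With", "Me"])])

# Read it back: the list is now a plain string
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
print(type(row["Tokens"]))

# ast.literal_eval safely parses the string back into a list
tokens = ast.literal_eval(row["Tokens"])
print(tokens[:2])
```

With pandas, the same parser can be passed at load time, for example `pd.read_csv(path, converters={'Tokens': ast.literal_eval})`. This only works for columns holding lists of plain strings; the Doc and NE_Words columns hold spaCy objects, which are written out as text only and would need to be regenerated by running the pipeline again.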
