# Text Analysis with spaCy
Compared to NLTK, spaCy is a more modern and efficient NLP toolkit that plays nicely with newer approaches like vector embeddings and transformer-based large language models.

Lots more details at https://spacy.io/

This notebook contains some basic demos, using data from the same source we'll use in this week's datathon.

In [None]:
import pandas as pd
import spacy
from spacy import displacy
from collections import Counter
import altair as alt

In [None]:
# ⇣ spaCy's default small English language model — comes loaded by default and does most of the basics
#nlp = spacy.load("en_core_web_sm")

# ⇣ spaCy's large English language model with word vectors (~560 MB download); needed to play with embeddings, similarity, etc.
!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

Running spaCy's initial processing pipeline (applied by calling `nlp()`) gives us a bunch of features we'd have needed to handle more manually with NLTK. Let's try that with just one song.

In [None]:
# A sample of 50 songs from one artist, drawn from Genius.com
one_artist = pd.read_excel('https://drive.google.com/uc?export=download&id=1LIKWcgLHw19lS174dpA2sfhn3hA5ZPZh')
one_artist.head()

In [None]:
one_song = one_artist[one_artist['title'] == 'It’s All About the Pentiums'].iloc[0].lyrics
one_song_doc = nlp(one_song)

In [None]:
one_song_doc

The resulting object gives us access to things like
- Language (note that `lang_` reports the language of the loaded pipeline; spaCy doesn't detect it from the text)

In [None]:
print("Language:",one_song_doc.lang_)

- Tokenization

In [None]:
print("Tokens:", [token.text for token in one_song_doc])

- Lemmatization (converting terms into their 'base forms').

In [None]:
print("Lemmatized:", [token.lemma_ for token in one_song_doc])

- Helpers for handling stopwords and more (here we lemmatize, lowercase, remove stopwords, and keep only tokens that are alphabetic or look like numbers)

In [None]:
print("Minus stopwords, etc.:", [token.lemma_.lower() for token in one_song_doc if not token.is_stop and (token.is_alpha or token.like_num)])

- Part-of-speech tagging.
(spaCy's part-of-speech codes are listed here: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py)

In [None]:
print("Parts of Speech:", [token.pos_ + ": " + token.text for token in one_song_doc])
print("Just the Verbs:", [token.lemma_ for token in one_song_doc if token.pos_ == "VERB"])
print("Just the Nouns:", [token.lemma_ for token in one_song_doc if token.pos_ == "NOUN" or token.pos_ == "PROPN"])

spaCy can also extract noun phrases and detect named entities.
(We're doing all of this with the default pipeline, so results are hit-or-miss, but you can build your own pipelines if you want better performance.)

In [None]:
print("Noun phrases:", [chunk.text for chunk in one_song_doc.noun_chunks ])
print("Entities:", [entity.label_ + ": " + entity.text for entity in one_song_doc.ents])

There's also some basic visualization support via spaCy's built-in `displaCy` visualizer if you want to see entities inline, inspect the parse tree, etc.

In [None]:
displacy.render(one_song_doc,style='ent',jupyter=True)

# Frequency plots
Using these basics we can also start creating our own charts to examine the text.

In [None]:
# Renders a quick Altair frequency chart from a list of strings
def frequency_chart(phrase_list, top=50, normalize=True, ymax=.1, color=None):
  counter = Counter(phrase_list)
  if normalize:
    # Convert raw counts into each item's share of the total
    total = sum(counter.values())
    for k in counter:
      counter[k] /= total
  # Named columns make the encodings below easier to read
  top_counts = pd.DataFrame(counter.most_common(top), columns=['word','value'])
  chart = alt.Chart(top_counts).mark_bar()
  if normalize:
    chart = chart.encode(
      x=alt.X('word:O',sort='-y',title='word'),
      y=alt.Y('value:Q',title='% of words',axis=alt.Axis(format='%'),scale=alt.Scale(domain=[0,ymax])))
  else:
    chart = chart.encode(
      x=alt.X('word:O',sort='-y',title='word'),
      y=alt.Y('value:Q',title='count'))
  if color:
    chart = chart.configure_mark(color=color)
  return chart

Using this plus the tokens and annotations returned by spaCy, we can pretty quickly examine some characteristics of the text.

In [None]:
# Look at the 20 most-used words in the song.
terms = [token.lower_ for token in one_song_doc if not (token.is_stop or token.is_punct or token.is_space)]
frequency_chart(terms,20)
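
The normalization step inside `frequency_chart` boils down to dividing each count by the total. Here's a minimal sketch of just that step using only the standard library. The token list is made up for illustration, and `relative_frequencies` is a hypothetical helper (not part of spaCy or Altair).

```python
from collections import Counter

def relative_frequencies(items, top=None):
    """Return (item, share) pairs, most frequent first."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return []  # nothing to count
    return [(item, n / total) for item, n in counts.most_common(top)]

# Made-up tokens, for illustration only
toy_tokens = ["ram", "cpu", "ram", "modem", "ram", "cpu"]
freqs = relative_frequencies(toy_tokens, top=2)
print(freqs)
```

Here "ram" accounts for half of the toy tokens and "cpu" for a third, which is what the y-axis of the normalized chart would show.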

In [None]:
# Look at the different parts of speech used
terms = [token.pos_ for token in one_song_doc]
frequency_chart(terms,20,normalize=False,color='orange')

# Word embeddings
If we're using a language model that includes word vectors, spaCy can also generate vector embeddings for individual tokens. It can also return embeddings for whole documents (generated by averaging the vectors of the words in them).

The assigned word vectors are based on how the words tend to occur in written text (as captured in the corpus the model's vectors were trained on, **not** just in the current document). This particular approach is efficient, but spaCy also lets you use a variety of other pretrained language models that can give better results. Lots more detail here: https://spacy.io/usage/embeddings-transformers
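
To make the "averaging" idea concrete, here's a small sketch with NumPy. The three-dimensional vectors below are made up for illustration (the real `en_core_web_lg` vectors have 300 dimensions), and `average_vector` is a hypothetical helper, not a spaCy function.

```python
import numpy as np

# Toy 3-dimensional "word vectors" (made-up values, for illustration only)
toy_vectors = {
    "fast": np.array([1.0, 0.0, 2.0]),
    "computer": np.array([3.0, 2.0, 0.0]),
    "chip": np.array([2.0, 4.0, 1.0]),
}

def average_vector(tokens, vectors):
    """Average the vectors of the given tokens into one document-level vector."""
    return np.mean([vectors[t] for t in tokens], axis=0)

doc_vec = average_vector(["fast", "computer", "chip"], toy_vectors)
print(doc_vec)
```

Each dimension of the result is just the mean of that dimension across the words, so a document vector lands somewhere "in the middle" of its words.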

In [None]:
# Here's the first token in this song.
print(one_song_doc[0])
print(one_song_doc[0].vector.shape)


In [None]:
# And its 300-dimensional embedding.
one_song_doc[0].vector

Create a dataframe with embeddings for all nouns in the song that occur in the model's vocabulary. The boolean `is_oov` indicates whether a token is "out-of-vocabulary": if it is, spaCy doesn't have a vector for the word, so we'll ignore it.

In [None]:
# Filter once so the labels and the vectors stay aligned
noun_tokens = [token for token in one_song_doc if not token.is_oov and not token.is_stop and token.pos_ in ("PROPN", "NOUN")]
nouns = [token.lower_ for token in noun_tokens]
noun_vecs = [token.vector for token in noun_tokens]
noun_vecs_df = pd.DataFrame(noun_vecs)
noun_vecs_df

## Word embeddings PCA
Let's try projecting these into 2D space with PCA to have a look.
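
If you're curious what PCA is doing under the hood, here's a rough NumPy-only sketch: center the data, take an SVD, and project onto the first two directions. The data is random and `pca_2d` is a hypothetical helper, so treat this as an illustration rather than a replacement for scikit-learn's `PCA` (the signs of the components may differ from scikit-learn's output).

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by variance explained
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:2].T, explained[:2]

rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5))  # 20 made-up "word vectors" with 5 dimensions
coords, explained = pca_2d(toy)
print(coords.shape)  # one (pc1, pc2) pair per row
print(explained)     # share of the variance captured by each component
```

The `explained` values are worth a glance in practice: if the first two components capture only a small share of the variance, the 2D picture is hiding a lot of the structure.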

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Run PCA on the feature set dataframe
pca = PCA(n_components = 2)
noun_principal_components = pca.fit_transform(noun_vecs_df)

In [None]:
# Stick back into a DataFrame with the original labels and plot
noun_pca = pd.DataFrame(noun_principal_components)
noun_pca['word'] = nouns
noun_pca.columns = ['pc1','pc2','word']
noun_pca = noun_pca.groupby('word',as_index=False).mean()
noun_pca

In [None]:
# Plot words
scatter = alt.Chart(noun_pca).mark_point().encode(
    x="pc1",
    y="pc2",
    tooltip=['pc1','pc2','word'])
text = scatter.mark_text(align='left',baseline='middle', dx=10).encode(text="word")
(scatter + text).properties(width=600,height=600).configure_axis(grid=False)

# Document-Level Embeddings
We can also jump up a level and look at aggregated vectors for whole songs.

In [None]:
# Experiment 1 — Just take spaCy's aggregated vectors for each song
song_vecs = [nlp(song['lyrics']).vector for index,song in one_artist.iterrows()]
song_vecs = pd.DataFrame(song_vecs)
song_vecs.head()

In [None]:
# Experiment 2  — Generate averaged vectors for each song using all the nouns (even repeats)
song_vecs_nounsonly = []
for index, song in one_artist.iterrows():
  song_noun_vecs = [token.vector for token in nlp(song['lyrics']) if not token.is_oov and not token.is_stop and (token.pos_ == "PROPN" or token.pos_ == 'NOUN')]
  song_vecs_nounsonly.append(pd.DataFrame(song_noun_vecs).mean())
song_vecs_nounsonly = pd.DataFrame(song_vecs_nounsonly)
song_vecs_nounsonly.head()

In [None]:
# Experiment 3  — Generate averaged vectors for each song using all the nouns (no repeats)
song_vecs_nounsonly_unique = []
for index, song in one_artist.iterrows():
  nouns_seen = set()
  song_noun_vecs = []
  for token in nlp(song['lyrics']):
    if token.text not in nouns_seen and not token.is_oov and not token.is_stop and (token.pos_ == "PROPN" or token.pos_ == 'NOUN'):
      nouns_seen.add(token.text)
      song_noun_vecs.append(token.vector)
  song_vecs_nounsonly_unique.append(pd.DataFrame(song_noun_vecs).mean())
song_vecs_nounsonly_unique = pd.DataFrame(song_vecs_nounsonly_unique)
song_vecs_nounsonly_unique.head()
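
The difference between Experiments 2 and 3 is how much weight a repeated word gets. Here's a toy sketch with NumPy; the two-dimensional vectors and the token list are made up for illustration.

```python
import numpy as np

# Made-up 2-dimensional vectors, for illustration only
toy_vectors = {
    "paradise": np.array([4.0, 0.0]),
    "butter": np.array([0.0, 2.0]),
}
tokens = ["paradise", "paradise", "paradise", "butter"]

# Experiment 2 style: every occurrence counts, so repeated words dominate
with_repeats = np.mean([toy_vectors[t] for t in tokens], axis=0)

# Experiment 3 style: each distinct word counts once
unique_tokens = list(dict.fromkeys(tokens))  # drops duplicates, keeps first-seen order
without_repeats = np.mean([toy_vectors[t] for t in unique_tokens], axis=0)

print(with_repeats, without_repeats)
```

With repeats, the average is pulled toward "paradise" (the chorus effect); without repeats, both words contribute equally. Which one is "right" depends on whether you think repetition is signal or noise for your question.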

In [None]:
# Run PCA to project our 300-dimensional space down to 2D
pca = PCA(n_components = 2)

# Our three different Experiments
songs_principal_components = pca.fit_transform(song_vecs)
# songs_principal_components = pca.fit_transform(song_vecs_nounsonly)
# songs_principal_components = pca.fit_transform(song_vecs_nounsonly_unique)

songs_pca = pd.DataFrame(songs_principal_components)
songs_pca.columns = ['pc1','pc2']
songs_pca = pd.concat([songs_pca,one_artist],axis=1)
songs_pca['lyrics_chars'] = songs_pca['lyrics'].map(lambda l: len(str(l)))
songs_pca.sample(3)

In [None]:
# Plot songs
scatter = alt.Chart(songs_pca).mark_point().encode(
    x='pc1',
    y='pc2',
    size='lyrics_chars',
    color=alt.Color('source:N',scale=alt.Scale(scheme='category20')),
    tooltip=['pc1','pc2','title','artist_names','album'],
    href='url')
text = alt.Chart(songs_pca).mark_text(align='left',baseline='middle', dx=10).encode(
    x='pc1', y='pc2', text="title")
(scatter+text).properties(width=600,height=600).configure_axis(grid=False)

## Semantic Similarity
spaCy can also compare two documents directly: `similarity()` returns a score based on their word vectors, where higher means more similar.

In [None]:
amish_paradise = nlp(one_artist.iloc[0]['lyrics'])
albuquerque = nlp(one_artist.iloc[2]['lyrics'])
yoda = nlp(one_artist.iloc[11]['lyrics'])

print('"Amish Paradise" semantic similarity to..')
print(' "Albuquerque":', amish_paradise.similarity(albuquerque))
print(' "Yoda":', amish_paradise.similarity(yoda))
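
According to the spaCy docs, the default similarity estimate is the cosine similarity of the averaged word vectors. Here's a minimal NumPy sketch of that measure on made-up vectors. `cosine_similarity` is a hypothetical helper, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # assumption for this sketch: a zero vector is similar to nothing
    return float(a @ b / denom)

# Made-up 2-dimensional "document vectors"
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, diagonal, orthogonal)
```

Note that cosine similarity ignores vector length, so a long song and a short song can still score as very similar if their averaged vectors point the same way.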

There are many additional approaches you can try here.

See the spaCy docs and intro course here: https://spacy.io/usage/spacy-101