## Importing libraries

In [2]:
# !python -m spacy download en_core_web_sm


In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 18500000

In [3]:
import pandas as pd
df = pd.read_csv("sample_text.csv", encoding="latin1")
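
A side note on encoding="latin1": Latin-1 maps every byte to some character, so the read never fails, but bytes such as 0x85 and 0x92 come through as invisible control characters. That would explain the '\x85' and 'education\x92s' entries in the word counts further down. If the file was actually saved as Windows-1252 (an assumption, since the CSV itself isn't shown), decoding with cp1252 turns those bytes into an ellipsis and a curly apostrophe. A minimal sketch on a made-up byte string:

```python
# Hypothetical bytes, as a Windows-1252 file would store a curly apostrophe and an ellipsis
raw = "Education for education\u2019s sake\u2026".encode("cp1252")

# Latin-1 decodes without error but yields control characters \x92 and \x85
as_latin1 = raw.decode("latin1")

# cp1252 recovers the intended punctuation
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

If the assumption holds, passing encoding="cp1252" to pd.read_csv would be the equivalent change for the real file.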

In [4]:
df.head()

|   | Response |
|---|----------|
| 0 | Safety has been a facade. Education for educat... |
| 1 | More time should be spent with education aroun... |
| 2 | I am a special education teacher with autistic... |
| 3 | It has been a very challenging year and I pers... |
| 4 | I support the decision to keep students in sch... |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Response  6 non-null      object
dtypes: object(1)
memory usage: 180.0+ bytes


In [6]:
df.describe()

|        | Response |
|--------|----------|
| count  | 6 |
| unique | 6 |
| top    | Safety has been a facade. Education for educat... |
| freq   | 1 |


The first step is to join all the responses into a single string, since I want to analyze the responses as a whole. For this, I use pandas’ handy str.cat method to concatenate all the strings in the Response column, separated by spaces.

In [7]:
all_text = df.Response.str.cat(sep = ' ')
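
To make the behaviour of str.cat concrete, here is a small sketch on a made-up Series (the three values are hypothetical, not from the survey). With no other arguments besides sep, missing values are left out of the joined string rather than raising an error.

```python
import pandas as pd

# Hypothetical responses, including one missing value
responses = pd.Series(["Safety first.", None, "More time please."])

# Join every non-missing string with a single space
joined = responses.str.cat(sep=" ")

# The same result using plain Python on the non-missing values
joined_plain = " ".join(responses.dropna())

print(joined)
```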

Now create a spaCy document with that text. I don’t need the named entity recognizer (NER) so I disable that to save on memory and computing time.

In [8]:
doc = nlp(all_text, disable = ['ner'])

This does a few things: it splits the text into individual words and tags them with their part-of-speech, like nouns, verbs, adjectives, etc. It also recognizes common words (stop-words) like “and”, “I”, and “with” that don’t have much meaning and can be excluded from word counts.

Now I can do an overall word frequency analysis to see the most common words that aren’t stop words or punctuation marks.

In [9]:
from collections import Counter
words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
word_freq.most_common(20)

[('school', 5),
 ('teacher', 5),
 ('year', 5),
 ('support', 5),
 ('education', 3),
 ('\x85', 3),
 ('teaching', 3),
 ('health', 3),
 ('feel', 3),
 ('case', 2),
 ('student', 2),
 ('enhanced', 2),
 ('invisible', 2),
 ('aide', 2),
 ('mental', 2),
 ('way', 2),
 ('safety', 1),
 ('facade', 1),
 ('education\x92s', 1),
 ('sake', 1)]
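
A note on reading this list: several words are tied at the same count, and Counter.most_common keeps tied items in the order they were first seen, so the order within a tie reflects position in the text and not importance. A minimal sketch with a made-up word list:

```python
from collections import Counter

# Hypothetical tokens: "school" and "year" both occur twice
toy_words = ["school", "year", "teacher", "year", "school", "support"]

toy_freq = Counter(toy_words)

# Ties are returned in first-seen order: "school" before "year"
top_three = toy_freq.most_common(3)
print(top_three)
```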

Note that I asked for the lemma_ attribute of each token, which is the lemmatized version of a word. That means that variations of a word, like “am”, “is” and “are”, are all standardized to a root form like “be”, and plural nouns are reduced to their singular forms. Already you can get a sense of what teachers talk about the most. The verbs are worth a closer look, “feel” in particular. What are teachers feeling? What do they need? spaCy can help us answer this with pattern matching.

Pattern matching in linguistics is a bit like regular expressions, but for language. Instead of matching a sequence of characters, you can match a sequence of word types. For example, what are the most common adjective-noun phrases?

In [10]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# One dict per token: an adjective immediately followed by a noun
pattern = [{'POS':'ADJ'}, {'POS':'NOUN'}]
matcher.add('ADJ_PHRASE', [pattern])

matches = matcher(doc, as_spans=True)
phrases = [span.text.lower() for span in matches]
# Build the counter once, after all phrases are collected
phrase_freq = Counter(phrases)

phrase_freq.most_common(30)

[('mental health', 2),
 ('more time', 1),
 ('proper mask', 1),
 ('transparent information', 1),
 ('positive cases', 1),
 ('special education', 1),
 ('autistic students', 1),
 ('enhanced cleaning', 1),
 ('medical aides', 1),
 ('quadruple duties', 1),
 ('discriminated event', 1),
 ('final year', 1),
 ('challenging year', 1),
 ('little understanding', 1),
 ('many people', 1),
 ('sure kids', 1),
 ('bigger focuses', 1),
 ('curricular outcomes', 1)]

Note how pattern is defined: a list of dictionaries, each describing one token, here by its part of speech (POS). In this case, it’s an adjective followed by a noun, so spaCy looks for every instance of that sequence in the text. “Mental health” is the only phrase that appears more than once in this sample.
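
To see the pattern in isolation, here is a minimal sketch that skips the trained model and assigns the POS tags by hand. The sentence and its tags are made up for illustration; spacy.blank("en") only supplies a vocabulary and tokenizer.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

toy_nlp = spacy.blank("en")  # no trained pipeline needed for this sketch

# Hypothetical sentence with hand-assigned part-of-speech tags
toy_words = ["Mental", "health", "is", "a", "big", "concern"]
toy_pos = ["ADJ", "NOUN", "AUX", "DET", "ADJ", "NOUN"]
toy_doc = Doc(toy_nlp.vocab, words=toy_words, pos=toy_pos)

toy_matcher = Matcher(toy_nlp.vocab)
toy_matcher.add("ADJ_PHRASE", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

# Each match is an adjective token directly followed by a noun token
toy_phrases = sorted(span.text.lower() for span in toy_matcher(toy_doc, as_spans=True))
print(toy_phrases)
```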

Now let’s look for the most common adjectives that follow “I am” or “I feel”. The pattern has to be more complex here, because these are all valid constructions that we’d like to capture:

-   I feel exhausted
-   I really feel exhausted
-   We’re pretty exhausted

For this, spaCy’s Matcher allows wildcards.

In [11]:
feel_adj = []
matcher = Matcher(nlp.vocab)
pattern = [
    {'LOWER': {'IN': ['i', 'we']}}, {'OP': '?'},
    {'LOWER': {'IN': ['feel', 'am', "'m", 'are', "'re"]}},
    {'OP': '?'}, {'OP': '?'}, {'POS': 'ADJ'}
]

matcher.add('FeelAdj', [pattern])
matches = matcher(doc, as_spans=True)

for span in matches:
    feel_adj.extend([token.lemma_ for token in span if token.pos_ == 'ADJ'])

Counter(feel_adj).most_common(20)


[('invisible', 2), ('special', 1), ('alone', 1)]

The pattern now asks for the following: the lower-case version of “I” or “we” (so it’s case insensitive), any token in zero or one occurrence (operator ‘?’, like regex), the lower-case version of any of “feel”, “am”, “are” and contractions thereof, up to two optional filler tokens, and an adjective.

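The loop then goes through the matches, keeps only the adjectives in each span, and adds their lemmas to a list. This is incredibly informative. When we asked teachers to write whatever they wanted, many expressed feelings of exhaustion, concern, and fear; in the six-response sample shown here, the adjectives that surface are “invisible” and “alone”. The “special” hit comes from “I am a special education teacher”, where the adjective modifies a noun and does not describe the speaker.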
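
The optional tokens are easier to follow on toy input. The sketch below runs the same pattern over three hand-tagged, made-up sentences (the tags are assigned manually, so no trained model is involved) and shows both the intended matches and an attributive adjective slipping through.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

toy_nlp = spacy.blank("en")  # vocabulary only; POS tags are supplied by hand below

feel_pattern = [
    {"LOWER": {"IN": ["i", "we"]}}, {"OP": "?"},
    {"LOWER": {"IN": ["feel", "am", "'m", "are", "'re"]}},
    {"OP": "?"}, {"OP": "?"}, {"POS": "ADJ"},
]
feel_matcher = Matcher(toy_nlp.vocab)
feel_matcher.add("FeelAdj", [feel_pattern])

def matched_adjectives(words, pos):
    """Return the set of adjectives found inside any match of the pattern."""
    toy_doc = Doc(toy_nlp.vocab, words=words, pos=pos)
    spans = feel_matcher(toy_doc, as_spans=True)
    return {tok.text for span in spans for tok in span if tok.pos_ == "ADJ"}

# No filler words at all
plain = matched_adjectives(["I", "feel", "exhausted"], ["PRON", "VERB", "ADJ"])

# One filler before the verb and one after it
filler = matched_adjectives(
    ["I", "really", "feel", "so", "exhausted"],
    ["PRON", "ADV", "VERB", "ADV", "ADJ"],
)

# The adjective modifies a noun, yet the pattern still matches "I am a special"
attributive = matched_adjectives(
    ["I", "am", "a", "special", "education", "teacher"],
    ["PRON", "AUX", "DET", "ADJ", "NOUN", "NOUN"],
)
print(plain, filler, attributive)
```

The last case suggests reading the counts with some care: the pattern constrains word order, not grammatical role.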

Here’s a pattern that looks for phrases that start with “I/we want/need”, followed by a noun, with optional filler words in between. On this small sample it finds nothing, hence the empty list.

In [13]:
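want_nouns = []
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER' : {'IN' : ['i', 'we']}}, {'IS_ALPHA':True, 'OP': '?'},
           {'LOWER': {'IN' : ['need', 'want']}}, {'IS_ALPHA': True, 'OP': '?'},
           {'IS_ALPHA': True, 'OP': '?'}, {'POS': 'NOUN'}]

matcher.add("WantPhrase", [pattern])
matches = matcher(doc, as_spans=True)

# Collect the noun lemmas from each match, as in the previous cell
for span in matches:
    want_nouns.extend([token.lemma_ for token in span if token.pos_ == 'NOUN'])

Counter(want_nouns).most_common(20)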

[]


Another flavour of spaCy’s Matcher is the PhraseMatcher, which looks for instances of a specific phrase that you define. Let’s say I want to find the words that most frequently occur near the phrase “mental health”. Look at how I define span in the code below: it grabs the 10 tokens before and after “mental health”. Then I strip out stop-words and punctuation and count the lemmas that remain. The words “mental” and “health” themselves sit inside each window, so they show up in the counts too.

In [14]:
from spacy.matcher import PhraseMatcher

mental_health_colloc = []
matcher = PhraseMatcher(nlp.vocab, attr = 'LOWER') 
# The attr above ensures all instances are converted to lower-case so the search is case-insensitive

pattern = [nlp.make_doc('mental health')]
matcher.add('mentalHealth', pattern) 
matches = matcher(doc)

for match_id, start, end in matches:
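    # Clamp the window so a match near either end of the doc cannot wrap around or overrun
    span = doc[max(start - 10, 0) : min(end + 10, len(doc))]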
    mental_health_colloc.extend([token.lemma_.lower() for token in span if not token.is_stop and not token.is_punct]) 

Counter(mental_health_colloc).most_common(20)

[('health', 3),
 ('aide', 2),
 ('mental', 2),
 ('work', 1),
 ('custodian', 1),
 ('medical', 1),
 ('family', 1),
 ('counsellor', 1),
 ('\x85', 1),
 ('finally', 1),
 ('teacher', 1),
 ('absolutely', 1),
 ('vaccine', 1),
 ('effect', 1),
 ('people', 1),
 ('make', 1),
 ('sure', 1),
 ('kid', 1),
 ('happy', 1)]
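
One caveat about a window written as start-10: when a match falls within the first ten tokens, start-10 is negative, and a negative slice index counts from the end of the sequence. The sketch below uses a plain Python list as a stand-in for the Doc (assuming Doc slicing follows the same negative-index convention) with made-up token positions.

```python
# Hypothetical token list standing in for a spaCy Doc
toy_tokens = [f"t{i}" for i in range(40)]

# A match near the beginning of the text
start, end = 3, 5

# start - 10 is -7, which Python reads as "7 from the end", so the slice is empty
naive_window = toy_tokens[start - 10 : end + 10]

# Clamping both ends keeps the window inside the sequence
clamped_window = toy_tokens[max(start - 10, 0) : min(end + 10, len(toy_tokens))]

print(len(naive_window), len(clamped_window))
```

With the clamped version, a match at the very start simply gets a shorter left-hand context.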