# How to 'find stuff' in your text: using spaCy as a tool for discourse analysis

We start by importing the necessary libraries and loading our data into the notebook.

In [None]:
import os
import numpy as np

In [None]:
# Read the text file (the "with" statement closes the file automatically)
with open(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\data\politicians\english\trump\trump_03_01_2020.txt", encoding="utf8") as text:
    speech = text.read()

In [None]:
speech # Display the raw speech text

In [None]:
print(speech) # Display the speech with formatting

To get rid of "disturbing elements" in the raw text, such as "\n", we clean up the speech a bit

In [None]:
# Clean the speech text by replacing newline characters with spaces

speech_clean = speech.replace("\n", " ")
speech_clean

In [None]:
print(speech_clean)
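
Replacing `"\n"` handles line breaks, but tabs or runs of several spaces would remain in the text. Below is a minimal sketch of a slightly more thorough clean-up, assuming it is acceptable to collapse every run of whitespace into a single space. The function name `normalize_whitespace` and the sample string are made up for illustration.

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() without arguments splits on any run of whitespace
    # and drops leading/trailing whitespace
    return " ".join(text.split())


# Made-up sample text, not taken from the speech
raw = "My fellow Americans,\n\nthank you.\tGood   evening.\n"
print(normalize_whitespace(raw))
```

With the speech from above, the call would be `speech_clean = normalize_whitespace(speech)`.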

We want to use the library spaCy to perform our discourse analysis.

For guides and documentation on spaCy, see https://spacy.io/

In [None]:
import spacy # this imports the spacy library

In [None]:
nlp = spacy.load("en_core_web_lg") # specify which language model you want to use. This is the large English model

Now we want to transform our text into an nlp-element (a spaCy Doc object). We use our language model to do that.
By doing this, metadata such as part-of-speech tags, sentence boundaries, morphological information etc. are added to the text and to every word. We can use 'calls' (code commands) to retrieve this information when needed.

In [None]:
doc = nlp(speech_clean) # save text as nlp-element under the variable name 'doc'

In [None]:
type(speech_clean) # Display the type of the cleaned speech text

In [None]:
type(doc) # Display the type of the Spacy document

We can now use the functionality from spaCy in our analysis of the document. For example, we can go through each sentence one by one.

In [None]:
print(list(doc.sents)) # every sentence in a list. The sentences are separated by commas

In [None]:
for sentence in doc.sents: # Print sentences one by one
    print(sentence)

# Keyword analysis
In "Acceptable Bias? Using corpus linguistic methods with critical discourse analysis", Paul Baker uses a "keyword" analysis. To choose appropriate keywords for your analysis, you could draw on knowledge from your field, on close reading etc. Another approach is to let your data decide for you (this would be an explorative approach). Are some words more frequent than others? Are there specific patterns?

In the following we will find the most frequent words within the speech and base our analysis on those words.

We use the built-in functionality that labels every word with its part-of-speech tag:

Alphabetical listing

- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary
- CCONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other

Source: Universal Dependencies, https://universaldependencies.org/u/pos/ (© 2014–2022 Universal Dependencies contributors)

In [None]:
# Print tokens with their part-of-speech tags

for token in doc:
    print(token.text, token.pos_)
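
To get a quick overview of the tags, one option is to count how often each tag occurs. The sketch below is only an illustration: `pos_distribution` is a hypothetical helper, and the `Token` namedtuple is a stand-in that mimics the `text` and `pos_` attributes of spaCy tokens, so the example runs without a language model.

```python
from collections import Counter, namedtuple


def pos_distribution(tokens):
    """Count how many tokens carry each part-of-speech tag."""
    return Counter(token.pos_ for token in tokens)


# Stand-in for spaCy tokens (assumption: only .text and .pos_ are needed here)
Token = namedtuple("Token", ["text", "pos_"])
sample = [
    Token("The", "DET"),
    Token("people", "NOUN"),
    Token("want", "VERB"),
    Token("peace", "NOUN"),
    Token(".", "PUNCT"),
]

# Show the most frequent tag in the sample
print(pos_distribution(sample).most_common(1))
```

With the real document, the call would be `pos_distribution(doc)`.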

In order to look at relevant words in the text, we are interested in the nouns

In [None]:
nouns = [token.text for token in doc if token.pos_ == "NOUN"] # Extract nouns from the document

In [None]:
nouns # prints list of nouns within the document

In [None]:
print(sorted(nouns))

In order to treat different forms of a word as the same word, we lemmatize our words.
Lemmatization reduces words to their base (dictionary) form, which groups the different inflected forms of a word together.

In [None]:
for token in doc:
    print(token.text, token.lemma_)

In [None]:
lemma_nouns = [token.lemma_ for token in doc if token.pos_ == "NOUN"] # Extract lemmatized nouns from the document

In [None]:
print(sorted(lemma_nouns))
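
The list of lemmas can still contain very common words or symbols that were tagged as nouns. One possible refinement, sketched under assumptions, is to lowercase the lemmas and skip stop words and non-alphabetic tokens. `lemma_`, `pos_`, `is_alpha` and `is_stop` are attributes of spaCy tokens; the helper name `clean_noun_lemmas` and the `Token` stand-in are hypothetical and only there so the example runs on its own.

```python
from collections import namedtuple


def clean_noun_lemmas(tokens):
    """Return lowercased noun lemmas, skipping stop words and non-alphabetic tokens."""
    return [
        token.lemma_.lower()
        for token in tokens
        if token.pos_ == "NOUN" and token.is_alpha and not token.is_stop
    ]


# Stand-in for spaCy tokens with made-up values
Token = namedtuple("Token", ["lemma_", "pos_", "is_alpha", "is_stop"])
sample = [
    Token("People", "NOUN", True, False),  # kept, lowercased
    Token("be", "AUX", True, True),        # dropped: not a noun
    Token("nation", "NOUN", True, False),  # kept
    Token("%", "NOUN", False, False),      # dropped: not alphabetic
]
print(clean_noun_lemmas(sample))
```

With the real document, this would be `lemma_nouns = clean_noun_lemmas(doc)`. Whether stop words should be removed depends on your research question, so treat the filter as optional.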

# The most frequent words

When we are working with one text, it is pretty straightforward to count the most frequent words. But when working with several texts, you should use a counter function for this task.

In [None]:
# Here we define a function that counts the frequency of every word in a list of words and returns every word
# with its number of appearances, sorted from the most to the least frequent word

from collections import Counter # imports a counting class from the library 'collections'
def sorted_count(list_of_str): # defines the function 'sorted_count' with the input 'list_of_str'
    b = Counter(list_of_str) # counts the frequency of every word in list_of_str and assigns the result to b
    c = sorted(b.items(), key=lambda item: item[1], reverse=True) # sorts the (word, count) pairs in b by count, highest first, and assigns them to c
    return c # returns the sorted counts 'c'

In [None]:
sorted_count(lemma_nouns) # here we call the function on our list of strings "lemma_nouns"
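
As an alternative sketch, `Counter` has a built-in `most_common` method that returns the same kind of sorted list, and its `update` method makes it easy to add up counts from several texts. The function names `top_words` and `count_across_texts` and the two small word lists are made up for illustration.

```python
from collections import Counter


def top_words(list_of_str, n=5):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(list_of_str).most_common(n)


def count_across_texts(lists_of_words):
    """Add up word counts from several word lists (one list per text)."""
    total = Counter()
    for words in lists_of_words:
        total.update(words)  # adds the counts of this text to the running total
    return total


# Made-up word lists standing in for the lemmatized nouns of two speeches
speech_a = ["people", "world", "people"]
speech_b = ["people", "nation"]

print(top_words(speech_a, 1))
print(count_across_texts([speech_a, speech_b])["people"])
```

With the data from above, `top_words(lemma_nouns)` would give the five most frequent noun lemmas.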

We see the five most frequent words are: 
    
```
('people', 6),
('world', 5),
('nation', 3),
('military', 3),
('leadership', 3)
```

We want to use "people" as our keyword in the following analysis.

In what context does 'people' appear? 
Let's first find the sentences containing the word, in order to make a close reading.

In [None]:
# Define a function to find sentences containing a specific keyword

def sentence_with_word(nlp_object, keyword): # the function "sentence_with_word" takes an nlp-element and a keyword as input
    sentences = [sentence for sentence in nlp_object.sents] # makes a list with all the sentences in the text
    with_keyword = [] # makes an empty list as placeholder for all the sentences containing the keyword
    for sentence in sentences: # looks at every sentence in the list of sentences
        tokens = [str(token) for token in sentence] # makes a list of every word in the sentence
        if keyword in tokens: # checks whether the keyword is in the list of words from the sentence
            with_keyword.append(sentence) # if yes, append the whole sentence to the list "with_keyword"
    if len(with_keyword) > 0: # checks whether we have any sentences in the list of sentences with the keyword
        return with_keyword # if yes, return the list of sentences with the keyword
    else: # if the list is empty
        return "Keyword not in text" # return this text

In [None]:
sentence_with_word(doc, "people")

In [None]:
with_people = sentence_with_word(doc, "people")
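
Note that the function above compares the keyword with the exact text of each token, so "People" at the start of a sentence or an inflected form would not match, and a missing keyword returns a string instead of a list. A possible variant, sketched under assumptions, matches on the lowercased lemma and always returns a list. The name `sentences_with_lemma` and the `Token` stand-in are hypothetical; the stand-in only mimics the `text` and `lemma_` attributes of spaCy tokens so the example runs without a language model.

```python
from collections import namedtuple


def sentences_with_lemma(sentences, keyword):
    """Return every sentence that contains a token whose lemma matches the keyword."""
    keyword = keyword.lower()
    return [
        sentence
        for sentence in sentences
        if any(token.lemma_.lower() == keyword for token in sentence)
    ]


# Stand-in data: each sentence is a list of tokens with made-up lemmas
Token = namedtuple("Token", ["text", "lemma_"])
sample_sentences = [
    [Token("Peoples", "people"), Token("differ", "differ")],
    [Token("The", "the"), Token("world", "world")],
]

matches = sentences_with_lemma(sample_sentences, "people")
print(len(matches))
```

With the real document, the call would be `sentences_with_lemma(doc.sents, "people")`. Returning an empty list when nothing matches means that later loops over the result simply do nothing instead of failing.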

We can read the associated adjectives for ourselves when we are working with a single text file. But we want to make a program that finds the associated adjectives, so that we can use it when working with several text files.

# Simple example of extracting adjectives

In [None]:
test = "We have green apples and sour lemons given with great love from the King of the seas"

In [None]:
doc_test = nlp(test)

We want to find the associated adjectives, and use 'noun chunks' in order to do so. Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

In [None]:
for chunk in doc_test.noun_chunks: # find noun chunks
    print(chunk.text)

In [None]:
# some other features to get from the chunks
for chunk in doc_test.noun_chunks:
    print(chunk.text, "\n", "Text of root:", chunk.root.text, "\n", "Dependency tag:", chunk.root.dep_, "\n",
          "Root in sentence:", chunk.root.head.text, "\n")

Please read the documentation if you want to dig deeper into noun chunks: https://spacy.io/usage/linguistic-features

We now want to go through each noun chunk in the text

In [None]:
for chunk in doc_test.noun_chunks: # for every chunk in the noun chunks of the nlp-document
    for token in chunk: # for every word in the chunk
        print(token.text, token.pos_) # print the text of the word and the part-of-speech tag of the word
    

Then we extract the adjectives

In [None]:
adjectives = [] # makes an empty list as a placeholder for the adjectives

for chunk in doc_test.noun_chunks: # for every chunk in the noun chunks of the nlp-document
    for token in chunk: # for every word in the chunk
        if token.pos_ == "ADJ": # if the part of speech tag of the word is "ADJ"
            adjectives.append(token.text) # append the word to the list "adjectives"
            
adjectives

# Let's apply it in our analysis

Now we want to find the adjectives related to "people" in the original speech by Trump

In [None]:
with_people

In [None]:
for sentence in with_people: # for every sentence in the list of sentences
    for chunk in sentence.noun_chunks: # for every noun chunk in the sentences
        print(chunk.text) # print the noun chunk

But we only want the chunks describing the word "people"

In [None]:
chunks_with_people = [] # makes an empty list as a placeholder for the noun chunks

for sentence in with_people: # for every sentence in the list of sentences
    for chunk in sentence.noun_chunks: # for every noun chunk in the sentences
        if str(chunk.root) == "people": # if the root of the chunk is "people"
            chunks_with_people.append(chunk) # append the noun chunk to the list "chunks_with_people"
            
chunks_with_people

Now we can extract the adjectives from these noun chunks

In [None]:
adjectives = [] # makes an empty list as a placeholder for the adjectives

for chunk in chunks_with_people: # for every chunk in the list chunks_with_people
    for token in chunk: # for every word in the chunk
        if token.pos_ == "ADJ": # if the part of speech tag is "ADJ"
            adjectives.append(token.text) # append the word to the list "adjectives"
            
adjectives # display the list "adjectives"
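
Since the goal is a program that can be reused on several text files, the two steps above (select the noun chunks whose root is the keyword, then collect their adjectives) could be wrapped in one function. This is a minimal sketch, not a definitive implementation: `adjectives_for_keyword` is a hypothetical name, and the `Token` and `Chunk` stand-ins only mimic the parts of spaCy's tokens and noun chunks that the function uses (`text`, `pos_`, `root`, and iteration over tokens), so the example runs without a language model.

```python
from collections import namedtuple


def adjectives_for_keyword(noun_chunks, keyword):
    """Collect the adjectives from noun chunks whose root word is the keyword."""
    adjectives = []
    for chunk in noun_chunks:
        # only look at chunks that describe the keyword
        if chunk.root.text.lower() == keyword.lower():
            adjectives.extend(token.text for token in chunk if token.pos_ == "ADJ")
    return adjectives


# Stand-ins for spaCy tokens and noun chunks (made-up example phrases)
Token = namedtuple("Token", ["text", "pos_"])


class Chunk:
    def __init__(self, tokens, root):
        self.tokens = tokens  # the words in the chunk
        self.root = root      # the head noun of the chunk

    def __iter__(self):
        return iter(self.tokens)


people = Token("people", "NOUN")
military = Token("military", "NOUN")
chunks = [
    Chunk([Token("the", "DET"), Token("great", "ADJ"), Token("American", "ADJ"), people], people),
    Chunk([Token("a", "DET"), Token("strong", "ADJ"), military], military),
]

print(adjectives_for_keyword(chunks, "people"))
```

With a real document, the call would be `adjectives_for_keyword(doc.noun_chunks, "people")`, which could then be repeated for each text file.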

# Exercise: Make the same analysis for another frequent word in the speech

Help:
- start by finding the sentences which include your word
- find the noun chunks in the sentences which include your word
- find the specific noun chunks that describe your word


(SOLUTION is further down. But try your best and ask for help before looking at the solution)

# Solution

We remember the five most frequent words:

```
('people', 6),
('world', 5),
('nation', 3),
('military', 3),
('leadership', 3)
```

I now want to use "military" as our keyword in the following analysis.


I use my function from earlier to find the sentences that include the word "military"

In [None]:
sentence_with_word(doc, "military")

I save these sentences by assigning them to a variable "with_military"

In [None]:
with_military = sentence_with_word(doc, "military")

I now find the noun chunks by using the code from above and replacing the word "people" with "military"

In [None]:
chunks_with_military = [] # makes an empty list as a placeholder for the noun chunks

for sentence in with_military: # for every sentence in the list of sentences
    for chunk in sentence.noun_chunks: # for every noun chunk in the sentences
        if str(chunk.root) == "military": # if the root of the chunk is "military"
            chunks_with_military.append(chunk) # append the noun chunk to the list "chunks_with_military"
            
chunks_with_military