# Extracting part of speech from ELTeC-ENG

This notebook is an adaptation of a great Colab by Borja Navarro, prepared for the LT4DH course at the University of the Basque Country.

This version (still to be cleaned up) uses English resources, whereas Borja Navarro's original uses Spanish ones.

Original data here:

Borja Navarro Colorado | University of Alicante

In this case, part-of-speech information has not been manually annotated in the corpus. The novels must first be analyzed with an NLP system, and the linguistic information extracted afterwards. The NLP system used is [SpaCy](https://spacy.io/).

The notebook shows:

- how to open a novel from ELTeC in Colab and analyze it with SpaCy, and
- how to work with SpaCy's output for DH.


## Loading the ELTeC-ENG corpus in Colab

In [None]:
import zipfile

!wget "https://github.com/COST-ELTeC/ELTeC-eng/archive/refs/heads/master.zip" # paste here corpus url

zip_ref = zipfile.ZipFile('master.zip', 'r') #Opens the zip file in read mode
zip_ref.extractall() #Extracts files here (/content/)
zip_ref.close() 
!rm master.zip #Removes ZIP to save space

--2023-04-16 12:59:39--  https://github.com/COST-ELTeC/ELTeC-eng/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/COST-ELTeC/ELTeC-eng/zip/refs/heads/master [following]
--2023-04-16 12:59:39--  https://codeload.github.com/COST-ELTeC/ELTeC-eng/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 20.27.177.114
Connecting to codeload.github.com (codeload.github.com)|20.27.177.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [        <=>         ]  91.32M  3.95MB/s    in 22s     

2023-04-16 13:00:02 (4.06 MB/s) - ‘master.zip’ saved [95756782]
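
Once the archive is extracted, it can help to confirm which novel files are available before choosing one. The following is a minimal sketch, assuming the archive unpacks to a folder with a `level1` subdirectory of `.xml` files (as the paths used later in this notebook suggest). The helper name `list_novels` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def list_novels(corpus_dir):
    """Return the sorted names of the XML files found in corpus_dir."""
    folder = Path(corpus_dir)
    if not folder.is_dir():
        # Nothing extracted yet (or a different path): return an empty list.
        return []
    return sorted(p.name for p in folder.glob("*.xml"))

# Assumed Colab path; adjust it if the archive was extracted elsewhere.
for name in list_novels("/content/ELTeC-eng-master/level1")[:5]:
    print(name)
```

Any of the printed names can then be pasted into the file-name variable used in the analysis cells.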



## SpaCy: downloading and installing

[SpaCy](https://spacy.io/) is an NLP system. It analyzes part of speech and lemmas, syntax (dependencies) and named entities.

Three steps:

1. Import SpaCy to Colab
2. Download the language module (English)
3. Activate module


In [None]:
import spacy

!python -m spacy download en_core_web_sm #Download English module (the "small" module in this case: "sm").

import en_core_web_sm
nlp_eng = en_core_web_sm.load() #Load English analyzer in "nlp_eng".

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 72.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Analyzing a novel from ELTeC-ENG

Once we have downloaded the corpus and activated SpaCy, let's analyze one novel.

First, select a novel from the [ELTeC-ENG](https://github.com/COST-ELTeC/ELTeC-eng/tree/master/level1) corpus and copy its file name. Then paste the name into the variable "novela_name". In this example, the file is ENG18621_Braddon.xml.

In [None]:
import os
from bs4 import BeautifulSoup

novela_name = "ENG18621_Braddon.xml" # T1: ENG18471_Bronte.xml      //  T2:  ENG18621_Braddon.xml         Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r') as tei: #Opens the file
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. No letters (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n'

print('Analyzing PoS and lemmas')
analisis = nlp_eng(novela_text) #Here the novel is analyzed with SpaCy. All the analysis is stored in "analisis" variable.
print('Done!')

Analyzing ENG18471_Bronte.xml
Opening the file and extracting text
Analyzing PoS and lemmas
Done!
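
The extraction step above (every `<p>` inside an element marked `type="chapter"`) can also be illustrated on a small inline document using only the standard library. This is a sketch under stated assumptions, not the notebook's method: it uses `xml.etree.ElementTree` instead of BeautifulSoup, assumes the files declare the standard TEI namespace, and restricts the search to `<div>` elements. The function name `chapter_paragraphs` and the sample XML are made up for the example.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace (assumed to be declared in the corpus files).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def chapter_paragraphs(xml_string):
    """Return the text of every <p> inside a <div type="chapter">."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for chapter in root.iterfind(".//tei:div[@type='chapter']", TEI_NS):
        for p in chapter.iterfind(".//tei:p", TEI_NS):
            # itertext() also collects text nested in child elements such as <hi>.
            paragraphs.append("".join(p.itertext()).strip())
    return paragraphs

# Made-up sample: the title page paragraph is outside any chapter.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<front><div type="titlepage"><p>Title page</p></div></front>
<body><div type="chapter"><head>I</head>
<p>First <hi>paragraph</hi>.</p><p>Second paragraph.</p></div></body>
</text></TEI>"""

print(chapter_paragraphs(sample))
```

Only the two chapter paragraphs are returned; the title page text is skipped, which mirrors the "only chapters" restriction noted in the cell above.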


Now all the analysis is stored in the "analisis" variable. It only remains to iterate over it and extract the information, in this case part of speech. For how to extract information about syntax, named entities, etc., see [SpaCy 101](https://spacy.io/usage/spacy-101).

In [None]:
NVA = '\tNovel\tNouns\tVerbs\tAdjectives\tUnique_nouns\tUnique_verbs\tUnique_adjs\n' #

nom_novela = 'The Daughters of Danaus'
#nom_novela = novela_name

from collections import Counter

# Lemma frequencies for each part of speech of interest
noun_counts = Counter()
verb_counts = Counter()
adj_counts = Counter()

# Counting lemmas (token.lemma_) groups inflected forms together.
# To count surface forms instead, use token.text.lower() in place of token.lemma_.
for token in analisis:
  if token.pos_ == 'NOUN':
    noun_counts[token.lemma_] += 1
  elif token.pos_ == 'VERB':
    verb_counts[token.lemma_] += 1
  elif token.pos_ == 'ADJ':
    adj_counts[token.lemma_] += 1




# Sort the noun_counts dictionary by appearance count in descending order
sorted_nouns = sorted(noun_counts.items(), key=lambda x: x[1], reverse=True)
sorted_verbs = sorted(verb_counts.items(), key=lambda x: x[1], reverse=True)
sorted_adjs = sorted(adj_counts.items(), key=lambda x: x[1], reverse=True)

# Print the sorted nouns and their appearance counts
for i, (noun, count) in enumerate(sorted_nouns):
    if i >= 50:
        break
    print(noun, count)

print("\n-----------------\n")

for i, (verb, count) in enumerate(sorted_verbs):
    if i >= 50:
        break
    print(verb, count)

print("\n-----------------\n")

for i, (adj, count) in enumerate(sorted_adjs):
    if i >= 50:
        break
    print(adj, count)






master 183
door 161
hand 150
time 148
house 143
eye 140
day 139
father 126
face 111
night 111
room 107
man 103
head 103
thing 96
way 93
word 92
hour 88
child 87
heart 83
place 80
fire 80
bed 76
window 74
evening 70
lady 68
minute 68
cousin 67
kitchen 66
servant 65
side 64
arm 63
morning 62
book 61
one 59
year 59
life 58
mind 57
stair 57
home 55
chair 53
friend 52
mistress 50
death 49
companion 47
table 45
road 45
boy 44
world 44
grange 44
dog 42

-----------------

have 560
say 533
go 407
come 357
do 314
see 312
take 258
make 239
think 237
tell 213
get 209
look 189
know 184
answer 180
hear 161
cry 157
leave 152
ask 147
let 143
be 140
give 139
sit 124
reply 121
keep 119
speak 119
wish 113
feel 107
turn 104
put 100
return 96
begin 93
seem 91
bring 90
want 89
call 88
find 84
love 83
stand 83
exclaim 82
enter 80
continue 78
run 77
live 73
lie 72
hold 69
walk 67
grow 67
bear 67
show 66
set 65

-----------------

little 174
good 124
other 113
young 113
own 107
more 100
last 94
bad 91
old 85
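
The counting cell defines a tab-separated header (`Novel`, `Nouns`, `Verbs`, `Adjectives`, `Unique_nouns`, `Unique_verbs`, `Unique_adjs`) but never fills it in. A minimal sketch of how one row for that table might be built from the three count dictionaries is given below. The function name `summary_row` and the toy counts are assumptions for illustration; the leading tab simply mirrors the empty first column of the header.

```python
def summary_row(title, noun_counts, verb_counts, adj_counts):
    """Build one tab-separated row: totals first, then unique-lemma counts."""
    groups = (noun_counts, verb_counts, adj_counts)
    totals = [sum(g.values()) for g in groups]   # all occurrences per category
    uniques = [len(g) for g in groups]           # distinct lemmas per category
    fields = [title] + [str(n) for n in totals + uniques]
    return "\t" + "\t".join(fields) + "\n"

# Toy counts, not taken from any novel.
row = summary_row(
    "Example",
    {"door": 3, "hand": 2},
    {"say": 4},
    {"little": 1, "good": 1},
)
print(repr(row))
```

Appending such rows to the header string would give a small table that can be compared across novels.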


In [None]:
import os
import spacy
from bs4 import BeautifulSoup

nlp_eng = spacy.load("en_core_web_sm")

novela_name = "ENG18621_Braddon.xml" # Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r') as tei: #Opens the file
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. No letters (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n'

print('Done reading the novel text')

palabra_busqueda = input('Enter the word you want to search for: ')
doc = nlp_eng(novela_text)
for token in doc:
  if token.text.lower() == palabra_busqueda.lower():
    print(token.text, token.pos_)

Analyzing ENG18471_Bronte.xml
Opening the file and extracting text
Done reading the novel text
Enter the word you want to search for: domestic
domestic ADJ
domestic ADJ


In [None]:
import os
import spacy
from bs4 import BeautifulSoup

nlp_eng = spacy.load("en_core_web_sm")

novela_name = "ENG18621_Braddon.xml" # Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r') as tei: #Opens the file
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. No letters (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n'

print('Done reading the novel text')

palabra_busqueda = input('Enter the word you want to search for: ')
doc = nlp_eng(novela_text)

pos_count = {} # Dictionary storing the count of each POS tag

for token in doc:
  if token.text.lower() == palabra_busqueda.lower():
    if token.pos_ in pos_count:
      pos_count[token.pos_] += 1 # Increment the count if the tag is already in the dictionary
    else:
      pos_count[token.pos_] = 1 # Add the tag to the dictionary the first time it appears

if len(pos_count) == 0:
  print('Word not found:', palabra_busqueda)
else:
  print('POS tag counts for', palabra_busqueda)
  for pos in pos_count:
    print(pos, ':', pos_count[pos])


Analyzing ENG18471_Bronte.xml
Opening the file and extracting text
Done reading the novel text
Enter the word you want to search for: domestic


In [None]:
import os
import spacy
from bs4 import BeautifulSoup

nlp_eng = spacy.load("en_core_web_sm")

novela_name = "ENG18621_Braddon.xml" # Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r') as tei: #Opens the file
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. No letters (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n'

print('Done reading the novel text')

palabra_busqueda = input('Enter the word you want to search for: ')
doc = nlp_eng(novela_text)

pos_count = {} # Dictionary storing the count of each POS tag
sentences_with_word = [] # List storing the sentences in which the search word appears

for sent in doc.sents:
  if palabra_busqueda.lower() in sent.text.lower(): # Quick substring pre-filter
    found = False # Becomes True only when a whole token matches the search word
    for token in sent:
      if token.text.lower() == palabra_busqueda.lower():
        found = True
        if token.pos_ in pos_count:
          pos_count[token.pos_] += 1 # Increment the count if the tag is already in the dictionary
        else:
          pos_count[token.pos_] = 1 # Add the tag to the dictionary the first time it appears
    if found: # Skip sentences where the word only occurs inside a longer word
      sentences_with_word.append(sent.text.strip())

if len(pos_count) == 0:
  print('Word not found:', palabra_busqueda)
else:
  print('POS tag counts for', palabra_busqueda)
  for pos in pos_count:
    print(pos, ':', pos_count[pos])
  print('\nSentences containing', palabra_busqueda)
  for sentence in sentences_with_word:
    print(sentence)
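
Python's `in` operator on strings is a substring test, so a check like `palabra_busqueda.lower() in sent.text.lower()` is also true for sentences that only contain a longer word such as "domestics". The sketch below contrasts it with a whole-word check based on a regular expression. The helper name `contains_word` and the sample sentence are made up for the example, and the `\b` boundaries assume an ordinary alphabetic search word.

```python
import re

def contains_word(sentence, word):
    """True if word occurs in sentence as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

# Made-up sentence in which the search word appears only inside a longer word.
sentence = "The domestics had retired for the night."
print("domestic" in sentence.lower())        # substring test
print(contains_word(sentence, "domestic"))   # whole-word test
```

Comparing `token.text.lower()` against the search word, as the loops in this notebook do, gives the same whole-word behaviour at token level.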


In [None]:
import os
import spacy
from bs4 import BeautifulSoup

nlp_eng = spacy.load("en_core_web_sm")

novela_name = "ENG18621_Braddon.xml" # Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r') as tei: #Opens the file
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. No letters (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n'

print('Done reading the novel text')

palabra_busqueda = input('Enter the word you want to search for: ')
doc = nlp_eng(novela_text)

pos_count = {} # Dictionary storing the count of each POS tag
sentences_with_word = [] # List storing the sentences in which the search word appears

for sent in doc.sents:
  if palabra_busqueda.lower() in sent.text.lower(): # Quick substring pre-filter
    pos_dict = {} # Dictionary storing the POS tag of the search word in the current sentence
    for token in sent:
      if token.text.lower() == palabra_busqueda.lower():
        if token.pos_ in pos_count:
          pos_count[token.pos_] += 1 # Increment the count if the tag is already in the dictionary
        else:
          pos_count[token.pos_] = 1 # Add the tag to the dictionary the first time it appears
        pos_dict[token.text] = token.pos_ # Record the POS tag of the search word for this sentence
    if pos_dict: # Skip sentences where the word only occurs inside a longer word
      sentences_with_word.append((sent.text.strip(), pos_dict)) # Store the sentence together with its pos_dict

if len(pos_count) == 0:
  print('Word not found:', palabra_busqueda)
else:
  print('POS tag counts for', palabra_busqueda)
  for pos in pos_count:
    print(pos, ':', pos_count[pos])
  print('\nSentences containing', palabra_busqueda)
  for sentence, pos_dict in sentences_with_word:
    print(sentence)
    for word, pos in pos_dict.items():
      print(f"  '{word}' -> {pos}")
    print()
