# 1. Text Preprocessing

#### Step 1: Randomly collect 400 sentences from the 1347-page book (FullText.pdf).
#### Note: The following pages are NOT taken into account: 
 *  The Hobbit: Page 1-2 (Cover Art, Special Note)
 *  The Fellowship of the Ring: Page 217-225 (Cover Art, Table of Contents, Foreword)
 *  The Two Towers: Page 664 (Cover Art)
 *  The Return of the King: Page 1021 (Cover Art)
 *  Page 1329 (Definitions of LotR related terminology) 
 *  Page 1330-1342 (Maps)
 *  Page 1343-1346 (Genealogy)
 *  Page 1347 (Disclaimer)
 
#### This leaves us with 1315 pages (FullText_removed.pdf). Chapter Titles and Book Numerations are NOT considered as sentences. 
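
As a quick sanity check on the page count, the exclusions listed above can be summed directly. This is only a sketch: the page ranges are copied from the list, and the names `EXCLUDED_RANGES` and `count_pages` are made up for illustration.

```python
# Excluded page ranges (inclusive), copied from the list above.
EXCLUDED_RANGES = [
    (1, 2),        # The Hobbit: cover art, special note
    (217, 225),    # The Fellowship of the Ring: cover art, contents, foreword
    (664, 664),    # The Two Towers: cover art
    (1021, 1021),  # The Return of the King: cover art
    (1329, 1329),  # terminology definitions
    (1330, 1342),  # maps
    (1343, 1346),  # genealogy
    (1347, 1347),  # disclaimer
]
TOTAL_PAGES = 1347


def count_pages(ranges):
    """Count the pages covered by a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


excluded = count_pages(EXCLUDED_RANGES)
remaining = TOTAL_PAGES - excluded
print(excluded, remaining)  # 32 1315
```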

## 1.1 Text Extraction
#### In order to annotate the words in Lord of the Rings, we need to extract the text from the PDF.

In [91]:
# Imports
import csv
import random
import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from pdfminer.high_level import extract_text

# Download the Punkt models used by sent_tokenize and word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tobiasmichelsen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [60]:
def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text


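# Remove chapter titles and book numerations before sentence splitting

def filter_lines(lines):
    # Lines that start with "Chapter" followed by a word or number
    chapter_pattern = re.compile(r'^Chapter \w+', re.IGNORECASE)
    # Lines of the form "* BOOK I *"
    book_pattern = re.compile(r'^\* BOOK \w+ \*$', re.IGNORECASE)
    # Lines made up only of letters and spaces, treated here as titles.
    # Caveat: this also drops any body line that contains no punctuation or digits.
    title_pattern = re.compile(r'^[A-Za-z ]+$')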
    
    filtered_lines = []
    for line in lines:
        stripped_line = line.strip()
        if not (chapter_pattern.match(stripped_line) or
                book_pattern.match(stripped_line) or
                title_pattern.match(stripped_line)):
            filtered_lines.append(stripped_line)
    return filtered_lines

def split_into_sentences(text):
    sentences = sent_tokenize(text)
    return sentences


pdf_path = '/Users/tobiasmichelsen/Downloads/FullText_removed.pdf'
text = extract_text_from_pdf(pdf_path)

# Drop title lines first, then rejoin the remaining lines and split into sentences
lines = text.splitlines()
filtered_lines = filter_lines(lines)

filtered_text = ' '.join(filtered_lines)
sentences = split_into_sentences(filtered_text)
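
To see what the three patterns in `filter_lines` keep and drop, the sketch below runs the same regular expressions over a few sample lines. Apart from the opening sentence of The Hobbit, the sample lines are hypothetical and may not match the actual layout of the PDF. The helper name `is_dropped` is also an assumption made for this illustration.

```python
import re

# Same patterns as in filter_lines
CHAPTER = re.compile(r'^Chapter \w+', re.IGNORECASE)
BOOK = re.compile(r'^\* BOOK \w+ \*$', re.IGNORECASE)
TITLE = re.compile(r'^[A-Za-z ]+$')


def is_dropped(line):
    """Return True if filter_lines would remove this line."""
    stripped = line.strip()
    return bool(CHAPTER.match(stripped)
                or BOOK.match(stripped)
                or TITLE.match(stripped))


sample_lines = [
    "Chapter I",                                      # hypothetical chapter heading
    "* BOOK I *",                                     # hypothetical book numeration
    "AN UNEXPECTED PARTY",                            # hypothetical title line
    "In a hole in the ground there lived a hobbit.",  # body text with punctuation
    "and the hobbit was very well to do",             # hypothetical wrapped body line
    "",                                               # empty line
]

for line in sample_lines:
    print(repr(line), "dropped" if is_dropped(line) else "kept")
```

The fifth line shows the main risk of the letters-and-spaces pattern: a wrapped line of body text without punctuation is indistinguishable from a title, so it is removed along with the real titles. Empty lines are kept because the pattern requires at least one character.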


### Test Output

In [61]:
print("Example Sentence:")
print(sentences[0])

Example Sentence:
  In a hole in the ground there lived a hobbit.


## 1.2 Divide Sentences into Training/Test Subsets:

In [72]:
random.seed(42)

selected_sentences = random.sample(sentences, 1400)

training_set = selected_sentences[:1000]
test_set = selected_sentences[1000:]
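
The split above can also be written as a small reusable function. The following is a minimal sketch on toy data, not the notebook's actual code: `split_sample` and the toy sentences are assumptions, and a local `random.Random` instance is used instead of the global seed so that other random calls are unaffected.

```python
import random


def split_sample(items, n_train, n_test, seed=42):
    """Draw n_train + n_test items without replacement and split them."""
    rng = random.Random(seed)  # local generator, global random state untouched
    picked = rng.sample(items, n_train + n_test)
    return picked[:n_train], picked[n_train:]


# Toy stand-in for the extracted sentences
toy_sentences = [f"Sentence {i}." for i in range(50)]

train, test = split_sample(toy_sentences, n_train=10, n_test=4)
print(len(train), len(test))       # 10 4
print(set(train).isdisjoint(test)) # True for this toy list, where all items are unique
```

Note that `random.sample` draws by position, so if the same sentence text occurs more than once in the book (a short exclamation, for example), copies of it could land in both subsets. Deduplicating the sentence list before sampling would avoid that.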

### Save Subsets as .txt files:

In [73]:
with open('training_sentences.txt', 'w', encoding='utf-8') as train:
    for sentence in training_set:
        train.write(sentence + "\n")
        
with open('testing_sentences.txt', 'w', encoding='utf-8') as test:
    for sentence in test_set:
        test.write(sentence + "\n")

In [None]:
#training_set

## Annotation Guide (Beginning, Inside, Outside)

* O: Outside of any named entity. This tag is used for words that aren't part of an entity.

* B-PER: Beginning of a person's name. This tag indicates that the word is the beginning of a person's name. If the name is only one word long, it still gets tagged as B-PER.

* I-PER: Inside a person's name. This tag is used for all words that are part of a person's name but are not the first word.

* B-LOC: Beginning of a location name. This tag denotes that the word is the beginning of a location name. Like B-PER, it's used even if the location is only one word.

* I-LOC: Inside a location name. This tag is for words that are part of a location's name but are not the first word.

* B-ORG: Beginning of an organization name. This tag indicates the start of an organization's name and is used even for single-word organization names.

* I-ORG: Inside an organization name. It's used for words within an organization's name that are not the first word.
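
The tagging scheme above can be illustrated with a short, made-up sentence and a consistency check. The sentence, its tags, and the helper `is_valid_bio` are assumptions for illustration only; they are not taken from the sampled data. The check encodes the rule implied by the guide: an `I-` tag may only continue an entity of the same type.

```python
VALID_TAGS = {"O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"}


def is_valid_bio(tags):
    """Check that every tag is known and every I-X follows B-X or I-X."""
    previous = "O"
    for tag in tags:
        if tag not in VALID_TAGS:
            return False
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        previous = tag
    return True


# Hypothetical example sentence with one two-word person and one location
tokens = ["Frodo", "Baggins", "left", "the", "Shire", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")

print(is_valid_bio(tags))            # True
print(is_valid_bio(["O", "I-PER"]))  # False: I-PER without a preceding B-PER
```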

In [None]:
# Tag descriptions shown at every prompt
tag_description = {
    "O": "Outside any named entity",
    "B-PER": "Beginning of a person’s name",
    "I-PER": "Inside a person’s name",
    "B-LOC": "Beginning of a location",
    "I-LOC": "Inside a location",
    "B-ORG": "Beginning of an organization",
    "I-ORG": "Inside an organization",
}

def annotate_sentence(sentence, writer, sentence_id):
    # Prompt for one tag per token and write each (id, token, tag) row to the CSV
    print("\nAnnotating sentence:", sentence)
    for token in word_tokenize(sentence):
        print(f"\nToken: {token}")
        for tag, desc in tag_description.items():
            print(f"{tag}: {desc}")
        tag_choice = input("Enter the tag for the token above: ")
        writer.writerow([sentence_id, token, tag_choice])

def handle_annotation(sentences_set, set_name):
    file_path = f'{set_name}_annotations.csv'
    # Append mode, so rows from an earlier annotation session are preserved
    with open(file_path, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        # Check if we're starting fresh or appending
        csvfile.seek(0, 2)  # Move to the end of file
        if csvfile.tell() == 0:  # If file is empty, write the header
            writer.writerow(["Sentence_ID", "Token", "Tag"])
        for sentence_id, sentence in enumerate(sentences_set, start=1):
            annotate_sentence(sentence, writer, sentence_id)
            writer.writerow([])  # Add a blank row for readability

# Starting annotation for the training set
print("Starting annotation for the training set...")
handle_annotation(training_set, "training")

# Starting annotation for the validation set
# (the held-out subset is named test_set in section 1.2)
print("\nStarting annotation for the validation set...")
handle_annotation(test_set, "validation")

Starting annotation for the training set...

Annotating sentence: He  was interested in  roots and beginnings; he dived into  deep pools; he  burrowed under  trees and  growing plants; he tunnelled  into green  mounds;  and  he ceased to  look up at the hill-tops, or the leaves on trees, or  the flowers opening in the air: his head and his eyes were downward.

Token: He
O: Outside any named entity
B-PER: Beginning of a person’s name
I-PER: Inside a person’s name
B-LOC: Beginning of a location
I-LOC: Inside a location
B-ORG: Beginning of an organization
I-ORG: Inside an organization
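
Because the prompt accepts any string, a typo such as `B-PRE` would be written to the CSV unchanged. One possible safeguard is to validate the input before writing it, as in the sketch below. It is not part of the original annotation loop: the function `prompt_for_tag`, its `read` parameter, and the retry limit are all assumptions for illustration. Passing the reader as an argument lets the function be exercised with scripted answers instead of live keyboard input.

```python
VALID_TAGS = ("O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG")


def prompt_for_tag(token, read=input, max_attempts=3):
    """Ask for a tag until a valid one is entered or attempts run out."""
    for _ in range(max_attempts):
        choice = read(f"Enter the tag for '{token}': ").strip().upper()
        if choice in VALID_TAGS:
            return choice
        print(f"'{choice}' is not a valid tag; expected one of: {', '.join(VALID_TAGS)}")
    raise ValueError(f"No valid tag entered for token {token!r}")


# Scripted answers standing in for keyboard input
answers = iter(["b-per ", "X", "O"])


def scripted_input(prompt):
    return next(answers)


print(prompt_for_tag("Frodo", read=scripted_input))  # B-PER (input is normalised)
print(prompt_for_tag("ran", read=scripted_input))    # O, after "X" is rejected
```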


In [88]:
sentences

['Example sentence one.',
 'Example sentence two.',
 'In a hole in the ground there lived a hobbit.']