This notebook performs seven text-processing tasks with spaCy and PyMuPDF on three documents: tekstas.txt, tekstas2.txt, and tekstas3.pdf.

Task 1: Removing Punctuation and Stopwords from a Text File
Uses spaCy to remove punctuation, stopwords, and newline symbols from the given text file.

Task 2: Printing the First Three Named Entities
Extracts and prints the first three named entities along with their types from the text file.

Task 3: Visualizing Named Entities in the Fourth Sentence
Uses spaCy's displaCy visualizer to render the named entities in the fourth sentence of the text file, applying custom background colors and gradients.

Task 4: Removing Unnecessary Spaces and Printing the Document Text
Removes unnecessary spaces from the text extracted from a PDF file and prints the cleaned text.

Task 5: Providing POS Tag Frequencies
Extracts text from a PDF file and calculates the frequencies of POS tags, then prints the results.

Task 6: Visualizing Named Entities in the First Three Sentences
Extracts the first three sentences from a PDF file and visualizes the named entities, applying custom styles.

Task 7: Printing Stopwords in the Document Text
Extracts text from a PDF file and prints the stopwords found in the document.

#### <font color="red">To perform the tasks, use the text in the document tekstas.txt.</font>

#### 1. Remove all punctuation, stopwords, and newline symbols (i.e., "\n") from the existing text (i.e., the entire read text).

In [2]:
import spacy

with open('tekstas.txt', 'r', encoding='utf-8') as file:
    text = file.read()

nlp = spacy.load("en_core_web_sm")

stop_words = spacy.lang.en.stop_words.STOP_WORDS

doc = nlp(text)

# Drop stopwords, then remove newline characters and the listed punctuation marks.
new_text = ' '.join(token.text for token in doc if token.text.lower() not in stop_words)
new_text = new_text.replace('\n', '').translate(str.maketrans('', '', '.,?!()[]{}:;\"\''))

# Collapse the leftover runs of spaces into single spaces.
new_text = ' '.join(new_text.split())

print(new_text)


Foolish Donkey salt seller carry salt bag donkey market day way cross stream day donkey suddenly tumbled stream salt bag fell water salt dissolved water bag light carry donkey happy donkey started play trick day salt seller came understand trick decided teach lesson day loaded cotton bag donkey played trick hoping cotton bag lighter dampened cotton heavy carry donkey suffered learnt lesson play trick anymore day seller happy
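
The cleaning step above chains several string operations. As a minimal stdlib-only sketch of the same idea (strip punctuation, drop stopwords, collapse whitespace), the block below assumes a tiny hypothetical stopword set `SAMPLE_STOP_WORDS`, a hypothetical helper `clean_text`, and a made-up sample sentence. It does not use spaCy's tokenizer or its full `STOP_WORDS` list, so results on real text may differ.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# spaCy's STOP_WORDS contains several hundred entries.
SAMPLE_STOP_WORDS = {"a", "the", "to", "used", "his", "on", "every"}

def clean_text(text, stop_words=SAMPLE_STOP_WORDS):
    """Remove punctuation and stopwords, then collapse all whitespace."""
    # Delete every ASCII punctuation character in one pass.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also discards "\n" and repeated spaces.
    kept = [word for word in no_punct.split() if word.lower() not in stop_words]
    return " ".join(kept)

# Hypothetical sample sentence, not read from tekstas.txt.
sample = "A salt seller used to carry\nthe salt bag, on his donkey, to the market."
print(clean_text(sample))  # salt seller carry salt bag donkey market
```

Filtering after splitting avoids the separate cleanup pass for leftover spaces, because removed words never reach the final `join`.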


#### <font color="red">To perform the tasks, use the text in the document tekstas2.txt.</font>

#### 2. Print the first three named entities in the text

In [5]:
import spacy

with open('tekstas2.txt', 'r', encoding='utf-8') as file:
    text = file.read()

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

entities = [(ent.text, ent.label_) for ent in doc.ents[:3]]

print("The first three named entities in the text:")
for entity in entities:
    print(f"{entity[0].ljust(65)} {entity[1].ljust(8)} {spacy.explain(entity[1])}")


The first three named entities in the text:
The Faculty of Mathematics and Informatics of Vilnius University  ORG      Companies, agencies, institutions, etc.
four                                                              CARDINAL Numerals that do not fall under another type
Data Science                                                      ORG      Companies, agencies, institutions, etc.


#### 3. Visualize named entities in the fourth sentence. Change the background color of elements and apply a gradient effect (choose colors at your discretion).

In [6]:
import spacy
from spacy import displacy
from spacy.tokens import Span

with open('tekstas2.txt', 'r', encoding='utf-8') as file:
    text = file.read()

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

PERSON = doc.vocab.strings[u'PERSON']

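target_word = "Paulius"
positions = [token.i for token in doc if token.text == target_word]
start_index = positions[0]
# Span boundaries are token indices (end exclusive), not character offsets,
# so a single-token entity ends at start_index + 1.
# Widen end_index if the name should cover more than one token.
end_index = start_index + 1
new_ent = Span(doc, start_index, end_index, label=PERSON)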

fourth_sentence = list(doc.sents)[3]

entities = list(doc.ents) + [new_ent] 
doc.ents = entities

options = {
    "colors": {"ORG": "linear-gradient(90deg, #00ff00, #ff0000)",
               "CARDINAL": "linear-gradient(90deg, #ff00ff, #0000ff)",
               "MONEY": "linear-gradient(90deg, #00ffff, #0000ff)",
               "DATE": "linear-gradient(90deg, #00ffff, #0000ff)",
               "PERSON": "linear-gradient(90deg, #00ffff, #0000ff)"},
}

html_output = displacy.render(fourth_sentence, style="ent", options=options, jupyter=False)

with open("Fourth_sentence.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_output)


#### <font color="red">To perform the tasks, use the text in the document tekstas3.pdf.</font>

#### 4. Remove unnecessary spaces and print the available text.

In [None]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_file):
    with fitz.open(pdf_file) as pdf_document:
        text = ''
        for page_number in range(pdf_document.page_count):
            page = pdf_document[page_number]
            text += page.get_text()
    return text

pdf_file = 'tekstas3.pdf'
text_from_pdf = extract_text_from_pdf(pdf_file)

# split() breaks on any run of whitespace, so re-joining with single spaces removes the extras.
normalized_text = ' '.join(text_from_pdf.split())

print(normalized_text)
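
The whitespace cleanup above relies on `str.split()` with no arguments, which splits on any run of spaces, tabs, or newlines and ignores leading and trailing whitespace. A small sketch of that behavior, assuming a hypothetical helper `normalize_whitespace` and a made-up sample string standing in for PDF text:

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())

# Hypothetical sample imitating text extracted from a PDF page.
raw = "  Text   extracted\nfrom a \t PDF   often has \n\n ragged spacing. "
print(normalize_whitespace(raw))  # Text extracted from a PDF often has ragged spacing.

# For typical text, a regex replacement followed by strip() gives the same result.
assert normalize_whitespace(raw) == re.sub(r"\s+", " ", raw).strip()
```

The `split`/`join` form needs no import and trims both ends in the same step, which is why it is often preferred over the regex version for simple cleanup.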

#### 5. Provide a list of POS tag frequencies in the document.

In [32]:
import spacy
import fitz

def extract_text_from_pdf(pdf_file):
    with fitz.open(pdf_file) as pdf_document:
        text = ''
        for page_number in range(pdf_document.page_count):
            page = pdf_document[page_number]
            text += page.get_text()
    return text

nlp = spacy.load("en_core_web_sm")

pdf_file = 'tekstas3.pdf'
doc = nlp(extract_text_from_pdf(pdf_file))

pos_frequencies = doc.count_by(spacy.attrs.POS)

for pos in sorted(pos_frequencies):
    pos_text = doc.vocab[pos].text
    print(f"{pos}. {pos_text.ljust(7)} : {pos_frequencies[pos]}")


84. ADJ   : 45
85. ADP   : 63
86. ADV   : 10
87. AUX   : 15
89. CCONJ : 15
90. DET   : 51
91. INTJ  : 1
92. NOUN  : 79
93. NUM   : 26
94. PART  : 4
95. PRON  : 7
96. PROPN : 80
97. PUNCT : 60
98. SCONJ : 10
100. VERB  : 33
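
The listing above is ordered by spaCy's internal tag ID. As a sketch, the same counts can be ranked by frequency and shown with each tag's share of the total. The counts are copied from the output above; the name `pos_counts` is introduced here for illustration.

```python
from collections import Counter

# Counts copied from the output above (tag name -> frequency).
pos_counts = Counter({
    "ADJ": 45, "ADP": 63, "ADV": 10, "AUX": 15, "CCONJ": 15,
    "DET": 51, "INTJ": 1, "NOUN": 79, "NUM": 26, "PART": 4,
    "PRON": 7, "PROPN": 80, "PUNCT": 60, "SCONJ": 10, "VERB": 33,
})

# Total number of tokens across the listed tags.
total = sum(pos_counts.values())

# most_common() sorts by count, highest first.
for tag, count in pos_counts.most_common():
    print(f"{tag.ljust(6)}: {str(count).rjust(3)}  ({count / total:.1%})")
```

With these numbers, proper nouns and common nouns lead the ranking, followed by adpositions.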


#### 6. Visualize named entities in the first three sentences.

In [41]:
import spacy
from spacy import displacy
import fitz

def extract_text_from_pdf(pdf_file):
    with fitz.open(pdf_file) as pdf_document:
        text = ''
        for page_number in range(pdf_document.page_count):
            page = pdf_document[page_number]
            text += page.get_text()
    return text

pdf_file = 'tekstas3.pdf'

nlp = spacy.load("en_core_web_sm")

doc = nlp(extract_text_from_pdf(pdf_file))

# displacy.render accepts a list of Span objects, so the sentences can be passed directly.
first_three_sentences = list(doc.sents)[:3]

options = {
    "colors": {"ORG": "linear-gradient(90deg, #00ff00, #ff0000)",
               "CARDINAL": "linear-gradient(90deg, #ff00ff, #0000ff)",
               "MONEY": "linear-gradient(90deg, #00ffff, #0000ff)",
               "DATE": "linear-gradient(90deg, #00ffff, #0000ff)",
               "PERSON": "linear-gradient(90deg, #00ffff, #0000ff)",
               "LAW": "linear-gradient(90deg, #00ffff, #0000ff)"},
}

html_output = displacy.render(first_three_sentences, style="ent", options=options, jupyter=False)

with open("first_three_sentences.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_output)

#### 7. Print the stopwords (irrelevant words) in the text

In [43]:
import fitz
import spacy

def extract_text_from_pdf(pdf_file):
    with fitz.open(pdf_file) as pdf_document:
        text = ''
        for page_number in range(pdf_document.page_count):
            page = pdf_document[page_number]
            text += page.get_text()
    return text

pdf_file = 'tekstas3.pdf'

nlp = spacy.load("en_core_web_sm")

stop_words = spacy.lang.en.stop_words.STOP_WORDS

doc = nlp(extract_text_from_pdf(pdf_file))

# Collect every token whose lowercase form is in the stopword list.
stop_word_tokens = [token.text for token in doc if token.text.lower() in stop_words]

print(stop_word_tokens)

['is', 'a', 'which', 'is', 'the', 'first', 'and', 'in', 'as', 'well', 'as', 'one', 'of', 'the', 'and', 'most', 'in', 'and', 'it', 'is', "'s", 'among', 'the', 'Top', 'in', 'the', 'The', 'in', 'as', 'the', 'was', 'the', 'third', 'after', 'the', 'and', 'the', 'in', 'the', 'Due', 'to', 'the', 'of', 'the', 'the', 'was', 'down', 'and', 'its', 'until', 'In', 'the', 'of', 'I', 'the', 'to', 'it', 'by', 'the', 'and', 'by', 'It', 'as', 'in', 'the', 'of', 'in', 'the', 'was', 'by', 'the', 'from', 'and', 'then', 'after', 'of', 'by', 'a', 'of', 'after', 'from', 'to', 'when', 'it', 'was', 'as', 'the', 'In', 'the', 'of', 'and', 'of', 'was', 'to', 'its', 'in', 'the', 'of', 'the', 'it', 'its', 'as', 'one', 'of', 'the', 'in', 'It', 'is', 'of', 'fifteen', 'that', 'more', 'than', 'in', 'a', 'of', 'for', 'over', 'is', 'for', 'its', 'and', 'in', 'by', 'the', 'of', 'the', 'such', 'as', 'the', 'and', 'and', 'other', 'Since', 'has', 'been', 'a', 'of', 'a', 'of', 'the', 'and', 'since', 'it', 'has', 'to', 'the', '
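
A flat list of stopwords is hard to scan. One possible follow-up, sketched here with `collections.Counter`, is to tally how often each stopword occurs. The list `first_stopwords` holds only the first twenty items of the output above, and both it and `stopword_counts` are names introduced for illustration.

```python
from collections import Counter

# First twenty items of the printed stopword list above.
first_stopwords = [
    'is', 'a', 'which', 'is', 'the', 'first', 'and', 'in', 'as', 'well',
    'as', 'one', 'of', 'the', 'and', 'most', 'in', 'and', 'it', 'is',
]

# Lowercase before counting so that "The" and "the" are tallied together.
stopword_counts = Counter(word.lower() for word in first_stopwords)

# Show the five most frequent stopwords in this sample.
for word, count in stopword_counts.most_common(5):
    print(f"{word.ljust(6)}: {count}")
```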