This Python notebook demonstrates basic text processing with the spaCy and NLTK libraries. It covers tokenization, sentence segmentation, lemmatization, and stemming on an English text, followed by tokenization of a Lithuanian text, so the two libraries can be compared on the same tasks.

1. Tokenization with spaCy:
   Uses spaCy to tokenize the text and count the words and punctuation marks.

2. Tokenization with NLTK:
   Tokenizes the same text with NLTK and counts the tokens, including those that are not purely alphabetic.

3. Sentence Segmentation with spaCy:
   Uses spaCy to segment the text into sentences and counts them.

4. Sentence Segmentation with NLTK:
   Applies NLTK for sentence segmentation and calculates the total number of sentences in the text.

5. Printing the Fourth Sentence:
   Prints the fourth sentence from the provided text.

6. Tokenizing the Fourth Sentence:
   Splits the fourth sentence into words and punctuation marks using NLTK.

7. Lemmatizing the Words of the Fourth Sentence:
   Lemmatizes the words of the fourth sentence using spaCy, obtaining their base forms.

8. Stemming the Lemmas of the Fourth Sentence:
   Applies NLTK's Porter stemmer to the spaCy lemmas of the fourth sentence.

9. Tokenizing the First Sentence of Lithuanian Text:
   Tokenizes the first sentence of the Lithuanian text into words and punctuation marks using NLTK.


In [5]:
import spacy

text="Welcome to Vilnius University – the oldest and largest Lithuanian higher education institution. Since its establishment in the 16th century, Vilnius University, as integral part of European science and culture has embodied the concept­ of a classical university and the unity of studies and research. Vilnius University is an active participant in international scientific and academic activities and boasts many prominent scientists, professors and graduates. Scientific development and the expanding relations with global research centres have contributed to the variety of research and studies at Vilnius University. With the support of social partners, the university educates globally–minded specialists who successfully integrate in the modern European community."

## Splitting text into text units (Tokenization)

#### 1. How many words and punctuation marks are there in the text?
Calculate using the spaCy library

In [6]:

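# Load the small English pipeline and process the text
nlp = spacy.load("en_core_web_sm")
nlp_text = nlp(text)

# token.is_punct marks punctuation tokens; every other token is counted as a word
word_number = len([token for token in nlp_text if not token.is_punct])
punctuation_sign_number = len([token for token in nlp_text if token.is_punct])

# Total number of tokens (words plus punctuation marks)
print(word_number + punctuation_sign_number)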


117


#### 2. How many words and punctuation marks are there in the text?
Compute using the NLTK library

In [33]:

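from nltk.tokenize import word_tokenize

# word_tokenize returns words and punctuation marks together
tokens = word_tokenize(text)

token_number = len(tokens)

# Counts every token that is not purely alphabetic: punctuation marks,
# but also tokens such as "16th" that mix digits and letters
non_alpha_number = len([token for token in tokens if not token.isalpha()])

print(f"Number of tokens (words and punctuation marks): {token_number}")
print(f"Number of non-alphabetic tokens: {non_alpha_number}")

Number of tokens (words and punctuation marks): 115
Number of non-alphabetic tokens: 13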
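
Because `str.isalpha()` returns `False` for any token that contains a non-letter, a filter based on `not token.isalpha()` also catches tokens that are not punctuation marks. The sketch below shows one possible way to separate the two cases with the standard-library `unicodedata` module. The helper `is_punctuation` and the sample token list are assumptions made for illustration; they are not part of the notebook or of NLTK.

```python
import unicodedata


def is_punctuation(token: str) -> bool:
    """Return True if every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


# Hypothetical sample of tokens, of the kind a word tokenizer might return
tokens = ["the", "16th", "century", ",", "–", "globally–minded", "specialists", "."]

# Tokens made up only of punctuation characters
punctuation = [t for t in tokens if is_punctuation(t)]

# Tokens that are merely "not purely alphabetic"
non_alpha = [t for t in tokens if not t.isalpha()]

print(punctuation)
print(non_alpha)
```

In this sample, `16th` and `globally–minded` fail the `isalpha()` check although neither is a punctuation mark, so the two counts differ.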


#### 3. How many sentences are there in the text?
Calculate using the spaCy library

In [10]:

nlp = spacy.load("en_core_web_sm")

nlp_text = nlp(text)

number_sentences = len(list(nlp_text.sents))

print(f"Number of sentences: {number_sentences}")

Number of sentences: 5


#### 4. How many sentences are there in the text?
Compute using the NLTK library

In [14]:

import nltk

# Uncomment on first run to download the Punkt sentence tokenizer data
#nltk.download('punkt')

sentences = nltk.sent_tokenize(text)

sentence_count = len(sentences)

print(f"Number of sentences: {sentence_count}")

Number of sentences: 5


#### 5. Print the fourth sentence of the given text

In [15]:

#nltk.download('punkt')

sentences = nltk.sent_tokenize(text)

fourth_sentence = sentences[3]
print(f"Fourth sentence: {fourth_sentence}")

Fourth sentence: Scientific development and the expanding relations with global research centres have contributed to the variety of research and studies at Vilnius University.


#### 6. Break the fourth sentence into words and punctuation marks

In [16]:

#nltk.download('punkt')

sentences = nltk.sent_tokenize(text)

fourth_sentence = sentences[3]
words_and_punctuation = nltk.word_tokenize(fourth_sentence)

print(f"Split fourth sentence: {words_and_punctuation}")

Split fourth sentence: ['Scientific', 'development', 'and', 'the', 'expanding', 'relations', 'with', 'global', 'research', 'centres', 'have', 'contributed', 'to', 'the', 'variety', 'of', 'research', 'and', 'studies', 'at', 'Vilnius', 'University', '.']


#### 7. Lemmatizing the Words of the Fourth Sentence

In [17]:

nlp = spacy.load("en_core_web_sm")

sentences = nltk.sent_tokenize(text)

fourth_sentence = sentences[3]
nlp_fourth_sentence = nlp(fourth_sentence)

lemmas = [token.lemma_ for token in nlp_fourth_sentence]
print(f"Lemmas: {lemmas}")


Lemmas: ['scientific', 'development', 'and', 'the', 'expand', 'relation', 'with', 'global', 'research', 'centre', 'have', 'contribute', 'to', 'the', 'variety', 'of', 'research', 'and', 'study', 'at', 'Vilnius', 'University', '.']


#### 8. Stemming the Lemmas of the Fourth Sentence

In [19]:

from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")

sentences = nltk.sent_tokenize(text)

fourth_sentence = sentences[3]

doc = nlp(fourth_sentence)

lemmas = [token.lemma_ for token in doc]

porter_stemmer = PorterStemmer()

stems = [porter_stemmer.stem(word) for word in lemmas]

print(f"Stems: {stems}")


Stems: ['scientif', 'develop', 'and', 'the', 'expand', 'relat', 'with', 'global', 'research', 'centr', 'have', 'contribut', 'to', 'the', 'varieti', 'of', 'research', 'and', 'studi', 'at', 'vilniu', 'univers', '.']
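
To see how lemmas and stems differ, it can help to line the three printed lists up side by side. The sketch below is one possible way to do that in plain Python. The lists are copied from the outputs above, so it runs without spaCy or NLTK; the names `rows` and `changed` are chosen here for illustration.

```python
# Lists copied from the outputs printed in sections 6, 7 and 8
tokens = ['Scientific', 'development', 'and', 'the', 'expanding', 'relations',
          'with', 'global', 'research', 'centres', 'have', 'contributed', 'to',
          'the', 'variety', 'of', 'research', 'and', 'studies', 'at', 'Vilnius',
          'University', '.']
lemmas = ['scientific', 'development', 'and', 'the', 'expand', 'relation',
          'with', 'global', 'research', 'centre', 'have', 'contribute', 'to',
          'the', 'variety', 'of', 'research', 'and', 'study', 'at', 'Vilnius',
          'University', '.']
stems = ['scientif', 'develop', 'and', 'the', 'expand', 'relat', 'with',
         'global', 'research', 'centr', 'have', 'contribut', 'to', 'the',
         'varieti', 'of', 'research', 'and', 'studi', 'at', 'vilniu',
         'univers', '.']

# Pair each token with its lemma and with the stem of that lemma
rows = list(zip(tokens, lemmas, stems))

# Keep only the rows where stemming changed the lemma
changed = [(tok, lem, stem) for tok, lem, stem in rows if lem != stem]

for tok, lem, stem in rows:
    print(f"{tok:<12} {lem:<12} {stem}")

print(f"Lemmas altered by stemming: {len(changed)} of {len(rows)}")
```

The lemmas are dictionary forms such as `study` and `centre`, while the stems of those lemmas (`studi`, `centr`) are truncated strings that need not be real words.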


## Working with Lithuanian text

In [20]:
text2="Vilniaus universitetas (VU) – pirmasis ir didžiausias Lietuvos universitetas, įsikūręs šalies sostinėje Vilniuje ir turintis po padalinį Kaune ir Šiauliuose. Įkurtas 1579 m. VU yra šalies lyderis absoliučioje daugumoje mokslo ir studijų krypčių. Būdamas viena seniausių ir žymiausių Vidurio ir Rytų Europos aukštųjų mokyklų, VU darė didelę įtaką ne tik Lietuvos, bet ir kaimyninių šalių kultūriniam gyvenimui, išugdė ne vieną mokslininkų, poetų, kultūros veikėjų kartą. VU profesoriavo ir mokėsi daug garsių asmenybių: medikai vokiečiai Johanas Frankas ir jo sūnus Jozefas Frankas, istorikas Joachimas Lelevelis, poetai Adomas Mickevičius ir Julius Slovackis, istorikas Simonas Daukantas, rašytojas, poetas ir literatūros mokslininkas Česlovas Milošas. VU įkurtas sklindant renesanso, reformacijos ir katalikiškosios reformos idėjoms. Už VU anksčiau Europoje įkurti tik Prahos, Krokuvos, Pečo, Budapešto, Bratislavos ir Karaliaučiaus universitetai."

#### 9. Tokenizing the First Sentence of Lithuanian Text

In [22]:
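# Note: nltk.sent_tokenize defaults to language="english", so the English
# Punkt model is applied to the Lithuanian text here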

sentences = nltk.sent_tokenize(text2)

first_sentence = sentences[0]

words_and_punctuation = nltk.word_tokenize(first_sentence)
print(f"Tokenized first sentence: {words_and_punctuation}")

Tokenized first sentence: ['Vilniaus', 'universitetas', '(', 'VU', ')', '–', 'pirmasis', 'ir', 'didžiausias', 'Lietuvos', 'universitetas', ',', 'įsikūręs', 'šalies', 'sostinėje', 'Vilniuje', 'ir', 'turintis', 'po', 'padalinį', 'Kaune', 'ir', 'Šiauliuose', '.']
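
Sentence segmentation is harder than splitting on periods, because a period can also end an abbreviation. The Lithuanian text contains "Įkurtas 1579 m. VU yra ...", where `m.` appears to be an abbreviation rather than a sentence end. The sketch below uses a deliberately naive regular-expression splitter to show the problem; `naive_sentence_split` is a hypothetical helper written for this illustration and is not how NLTK's Punkt tokenizer works.

```python
import re


def naive_sentence_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


# One sentence taken from the Lithuanian text above
sample = ("Įkurtas 1579 m. VU yra šalies lyderis absoliučioje daugumoje "
          "mokslo ir studijų krypčių.")

parts = naive_sentence_split(sample)

print(len(parts))
for part in parts:
    print(part)
```

The naive rule cuts the sample after `m.`, which is why it is worth inspecting the segments that any tokenizer produces for a language it was not trained on.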
