Step 1: Installing the required libraries and loading the text

In [2]:
# Install the required libraries
!pip install datasets nltk spacy -q

In [3]:
# Import the tools we need
from datasets import Dataset
import pandas as pd
import spacy
import nltk
from nltk import pos_tag, word_tokenize, download
from spacy import displacy

In [4]:
# Download and load the spaCy model
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 83.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [6]:
# Download the NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [7]:
!wget -O text.txt https://raw.githubusercontent.com/NastiaSheffka/comp-ling/refs/heads/main/gensim/crime_and_punishment.txt

--2025-02-19 00:16:54--  https://raw.githubusercontent.com/NastiaSheffka/comp-ling/refs/heads/main/gensim/crime_and_punishment.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1292968 (1.2M) [text/plain]
Saving to: ‘text.txt’


2025-02-19 00:16:54 (19.5 MB/s) - ‘text.txt’ saved [1292968/1292968]
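
The `wget` call above only works where the `wget` binary is available. A minimal sketch of a portable alternative using only the standard library follows; the helper name `fetch_if_missing` is hypothetical and not part of the original notebook. It skips the download when the file is already on disk, which avoids re-fetching on every notebook run.

```python
import os
import urllib.request


def fetch_if_missing(url: str, path: str) -> str:
    """Download `url` to `path` unless a file already exists there.

    Hypothetical helper: the original notebook uses `!wget` instead.
    """
    if not os.path.exists(path):
        # urlretrieve writes the response body straight to `path`
        urllib.request.urlretrieve(url, path)
    return path
```

Under these assumptions, calling `fetch_if_missing(url, "text.txt")` with the same raw.githubusercontent.com URL passed to `wget` above should produce the same `text.txt`.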



In [8]:
with open("text.txt", "r", encoding="utf-8") as file:
    text = file.read()

print(text[:400])

Genre
prose_classic


Author Info
Fyodor Dostoevsky


Crime and Punishment


Dostoevsky’s drama of sin, guilt, and redemption transforms the sordid story of an old woman’s murder into the nineteenth century’s profoundest and most compelling philosophical novel. Raskolnikov, an impoverished student living in the St. Petersburg of the tsars, is determined to overreach his humanity and assert his unt
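
The preview shows that the file starts with metadata lines (genre, author, title) before the actual prose. If those lines should not end up in the first sentence, one option is to cut everything up to and including the title line. The sketch below assumes the title sits on a line of its own, as in the preview; `strip_header` is a hypothetical helper, and the sample string is a shortened copy of the preview.

```python
def strip_header(text: str, title: str) -> str:
    """Return the text that follows the first line equal to `title`.

    If no such line exists, the text is returned unchanged.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == title:
            # Keep everything after the title line, minus leading blank lines
            return "\n".join(lines[i + 1:]).lstrip("\n")
    return text


# Shortened sample modelled on the preview printed above
sample = (
    "Genre\nprose_classic\n\n\nAuthor Info\nFyodor Dostoevsky\n\n\n"
    "Crime and Punishment\n\n\nDostoevsky’s drama of sin."
)
body = strip_header(sample, "Crime and Punishment")
print(body)
```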


Step 2: Part-of-speech tagging with NLTK

In [9]:
# Helper function for part-of-speech tagging with NLTK
def nltk_pos_tagging(text):
    tokens = word_tokenize(text)
    return pos_tag(tokens)

References:

*   word_tokenize: https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize
*   pos_tag: https://www.nltk.org/api/nltk.tag.html#nltk.tag.pos_tag
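
The text writes possessives with a typographic apostrophe (for example "Dostoevsky’s"). `word_tokenize` may split that character off as its own token instead of producing the usual `'s` clitic. A minimal sketch of a pre-processing step that maps typographic single quotes to the ASCII apostrophe is shown below; `normalize_apostrophes` is a hypothetical helper, and whether this normalization is desirable depends on the task.

```python
def normalize_apostrophes(text: str) -> str:
    """Replace typographic single quotes (U+2018, U+2019) with ASCII '."""
    return text.replace("\u2019", "'").replace("\u2018", "'")


sample = "Dostoevsky\u2019s drama of sin"
normalized = normalize_apostrophes(sample)
print(normalized)
```

The normalized string can then be passed to `nltk_pos_tagging` in place of the raw text.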

In [10]:
from nltk.tokenize import sent_tokenize

In [11]:
text_tokenized = sent_tokenize(text)

nltk_pos_results = [nltk_pos_tagging(sent) for sent in text_tokenized]

print(nltk_pos_results[:2])

[[('Genre', 'NNP'), ('prose_classic', 'JJ'), ('Author', 'NNP'), ('Info', 'NNP'), ('Fyodor', 'NNP'), ('Dostoevsky', 'NNP'), ('Crime', 'NNP'), ('and', 'CC'), ('Punishment', 'NNP'), ('Dostoevsky', 'NNP'), ('’', 'NNP'), ('s', 'JJ'), ('drama', 'NN'), ('of', 'IN'), ('sin', 'NN'), (',', ','), ('guilt', 'NN'), (',', ','), ('and', 'CC'), ('redemption', 'NN'), ('transforms', 'VBZ'), ('the', 'DT'), ('sordid', 'JJ'), ('story', 'NN'), ('of', 'IN'), ('an', 'DT'), ('old', 'JJ'), ('woman', 'NN'), ('’', 'NNP'), ('s', 'NN'), ('murder', 'NN'), ('into', 'IN'), ('the', 'DT'), ('nineteenth', 'JJ'), ('century', 'NN'), ('’', 'NNP'), ('s', 'NN'), ('profoundest', 'NN'), ('and', 'CC'), ('most', 'RBS'), ('compelling', 'JJ'), ('philosophical', 'JJ'), ('novel', 'NN'), ('.', '.')], [('Raskolnikov', 'NNP'), (',', ','), ('an', 'DT'), ('impoverished', 'JJ'), ('student', 'NN'), ('living', 'NN'), ('in', 'IN'), ('the', 'DT'), ('St.', 'NNP'), ('Petersburg', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('tsars', 'NNS'), (',', ','),
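
A quick way to sanity-check tagger output like this is to count how often each tag occurs. The sketch below works on any list of tagged sentences in the `(token, tag)` format returned by `nltk_pos_tagging`; the helper name `tag_frequencies` is hypothetical, and the sample is a small excerpt of the output above.

```python
from collections import Counter


def tag_frequencies(tagged_sentences):
    """Count tag occurrences across a list of [(token, tag), ...] sentences."""
    return Counter(tag for sentence in tagged_sentences for _, tag in sentence)


# Small excerpt of the tagged output printed above
sample = [
    [("Raskolnikov", "NNP"), (",", ","), ("an", "DT"),
     ("impoverished", "JJ"), ("student", "NN")],
    [("the", "DT"), ("sordid", "JJ"), ("story", "NN")],
]
freqs = tag_frequencies(sample)
print(freqs.most_common(3))
```

On the full corpus, `tag_frequencies(nltk_pos_results)` would give the overall tag distribution.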

Step 3: Annotation with spaCy

In [12]:
def spacy_annotation(text):
    doc = nlp(text)

    pos_tags = [(token.text, token.pos_) for token in doc]
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]
    dependencies = [(token.text, token.dep_, token.head.text) for token in doc]

    return {
        "pos_tags": pos_tags,
        "named_entities": named_entities,
        "dependencies": dependencies
    }

In [13]:
spacy_results = [spacy_annotation(sent) for sent in text_tokenized]

print(spacy_results[:2])

[{'pos_tags': [('Genre', 'PROPN'), ('\n', 'SPACE'), ('prose_classic', 'ADP'), ('\n\n\n', 'SPACE'), ('Author', 'PROPN'), ('Info', 'PROPN'), ('\n', 'SPACE'), ('Fyodor', 'PROPN'), ('Dostoevsky', 'PROPN'), ('\n\n\n', 'SPACE'), ('Crime', 'PROPN'), ('and', 'CCONJ'), ('Punishment', 'PROPN'), ('\n\n\n', 'SPACE'), ('Dostoevsky', 'PROPN'), ('’s', 'PART'), ('drama', 'NOUN'), ('of', 'ADP'), ('sin', 'NOUN'), (',', 'PUNCT'), ('guilt', 'NOUN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('redemption', 'NOUN'), ('transforms', 'VERB'), ('the', 'DET'), ('sordid', 'VERB'), ('story', 'NOUN'), ('of', 'ADP'), ('an', 'DET'), ('old', 'ADJ'), ('woman', 'NOUN'), ('’s', 'PART'), ('murder', 'NOUN'), ('into', 'ADP'), ('the', 'DET'), ('nineteenth', 'ADJ'), ('century', 'NOUN'), ('’s', 'PART'), ('profoundest', 'ADJ'), ('and', 'CCONJ'), ('most', 'ADV'), ('compelling', 'ADJ'), ('philosophical', 'ADJ'), ('novel', 'NOUN'), ('.', 'PUNCT')], 'named_entities': [('Genre', 'PERSON'), ('Info\nFyodor Dostoevsky', 'PERSON'), ('Dostoevsky
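
The output contains entries such as `('\n', 'SPACE')`: spaCy keeps runs of whitespace as tokens and tags them `SPACE`. If these are not wanted in the dataset, they can be filtered out after the fact. The sketch below is one possible post-processing step; `drop_space_tokens` is a hypothetical helper, and the sample is copied from the output above.

```python
def drop_space_tokens(pos_tags):
    """Remove (token, pos) pairs whose POS tag is SPACE."""
    return [(text, pos) for text, pos in pos_tags if pos != "SPACE"]


# Excerpt of the spaCy output printed above
sample = [
    ("Genre", "PROPN"), ("\n", "SPACE"), ("prose_classic", "ADP"),
    ("\n\n\n", "SPACE"), ("Author", "PROPN"),
]
cleaned = drop_space_tokens(sample)
print(cleaned)
```

The same effect could be achieved inside `spacy_annotation` by skipping tokens for which `token.is_space` is true.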

Reference:
https://spacy.io/api/token#text

Step 4: Converting the results into a DataFrame

In [14]:
data = []

# Build one row per sentence; the loop variable is named `sentence`
# so that it does not overwrite the full `text` loaded earlier
for i, sentence in enumerate(text_tokenized):
    data.append({
        "text": sentence,
        "nltk_pos": nltk_pos_results[i],
        "spacy_pos": spacy_results[i]["pos_tags"],
        "named_entities": spacy_results[i]["named_entities"],
        "dependencies": spacy_results[i]["dependencies"]
    })

df = pd.DataFrame(data)
df.head()

Unnamed: 0,text,nltk_pos,spacy_pos,named_entities,dependencies
0,Genre\nprose_classic\n\n\nAuthor Info\nFyodor ...,"[(Genre, NNP), (prose_classic, JJ), (Author, N...","[(Genre, PROPN), (\n, SPACE), (prose_classic, ...","[(Genre, PERSON), (Info\nFyodor Dostoevsky, PE...","[(Genre, nmod, Dostoevsky), (\n, dep, Genre), ..."
1,"Raskolnikov, an impoverished student living in...","[(Raskolnikov, NNP), (,, ,), (an, DT), (impove...","[(Raskolnikov, PROPN), (,, PUNCT), (an, DET), ...","[(Raskolnikov, PERSON), (the St. Petersburg, G...","[(Raskolnikov, nsubj, is), (,, punct, Raskolni..."
2,"When he commits an act of murder and theft, he...","[(When, WRB), (he, PRP), (commits, VBZ), (an, ...","[(When, SCONJ), (he, PRON), (commits, VERB), (...",[],"[(When, advmod, commits), (he, nsubj, commits)..."
3,"The best known of Dostoevsky’s masterpieces,Cr...","[(The, DT), (best, JJS), (known, NN), (of, IN)...","[(The, DET), (best, ADV), (known, VERB), (of, ...","[(Dostoevsky, ORG), (Crime and Punishment, ORG)]","[(The, det, known), (best, advmod, known), (kn..."
4,Award-winning translators Richard Pevear and L...,"[(Award-winning, JJ), (translators, NNS), (Ric...","[(Award, NOUN), (-, PUNCT), (winning, VERB), (...","[(Richard Pevear, PERSON), (Larissa Volokhonsk...","[(Award, npadvmod, winning), (-, punct, winnin..."
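
Note that the `nltk_pos` and `spacy_pos` columns are not aligned token by token: in row 4, NLTK keeps "Award-winning" as one token while spaCy splits it into three. A minimal sketch for measuring this mismatch per row is given below; `token_count_gap` is a hypothetical helper, and the sample uses only the tokens visible in the table above.

```python
def token_count_gap(nltk_pos, spacy_pos):
    """Return len(NLTK tokens) - len(spaCy tokens), ignoring SPACE tokens."""
    spacy_tokens = [text for text, pos in spacy_pos if pos != "SPACE"]
    return len(nltk_pos) - len(spacy_tokens)


# First tokens of row 4 as shown in the table above
nltk_sample = [("Award-winning", "JJ")]
spacy_sample = [("Award", "NOUN"), ("-", "PUNCT"), ("winning", "VERB")]
gap = token_count_gap(nltk_sample, spacy_sample)
print(gap)
```

A non-zero gap means the two tag sequences for that sentence cannot be compared position by position without an explicit alignment step.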


Step 5: Visualizing the results

In [15]:
!wget -O text_1.txt https://raw.githubusercontent.com/vifirsanova/compling/refs/heads/main/tasks/task1/data.txt

--2025-02-19 00:30:25--  https://raw.githubusercontent.com/vifirsanova/compling/refs/heads/main/tasks/task1/data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 444304 (434K) [text/plain]
Saving to: ‘text_1.txt’


2025-02-19 00:30:25 (7.70 MB/s) - ‘text_1.txt’ saved [444304/444304]



In [19]:
with open("text_1.txt", "r", encoding="utf-8") as file:
    text_1 = sent_tokenize(file.read())

# text_1 is a list of sentences, so this slice selects up to 400 sentences, not 400 characters
print(text_1[:400])

['Title: 5 Common SQL Mistakes and How to Avoid Them\n\nEver wondered how to effectively use 5 common sql mistakes and how to avoid them.', 'This topic is crucial for developers because it helps in creating efficient and reusable code.', "We'll cover the basics and also discuss some advanced techniques.", "By the end of this article, you'll have a clear understanding of how to implement this in your projects.", "To deepen your understanding, explore the additional resources we've shared.", "Don't hesitate to experiment and learn.", "Title: Understanding Python Generators\n\nIf you're new to coding, this guide will help you understand understanding python generators.", 'Through this blog, we aim to provide a clear explanation of the concepts, complete with code snippets and real-world applications.', "If you've been avoiding this topic, now's the time to tackle it!", 'We hope this guide has provided you with a solid foundation.', 'If you have any questions, feel free to leave a comment 
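
In this output each "Title: ..." line is glued to the sentence that follows it, because the two are separated by a blank line rather than by sentence-final punctuation. One possible fix is to split the raw text on blank lines before sentence tokenization. The sketch below shows only the splitting step; `split_blocks` is a hypothetical helper, and the sample is a shortened version of the data above.

```python
import re


def split_blocks(raw: str) -> list[str]:
    """Split text on blank lines and drop empty blocks."""
    return [block.strip() for block in re.split(r"\n\s*\n", raw) if block.strip()]


# Shortened sample modelled on the data printed above
sample = (
    "Title: Understanding Python Generators\n\n"
    "If you're new to coding, this guide will help."
)
blocks = split_blocks(sample)
print(blocks)
```

Each resulting block could then be passed to `sent_tokenize` separately, so that titles stay in their own units.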

In [20]:
example_text = text_1[0]
doc = nlp(example_text)
displacy.render(doc, style="dep", jupyter=True, options={"distance": 120})

Step 7: Uploading to HF Datasets

In [25]:
from datasets import Dataset
dataset = Dataset.from_pandas(df)

# 'my_token' is a placeholder; pass a real Hugging Face access token with write permission
dataset.push_to_hub("nastia-sheff/task_2", token='my_token')

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/13 [00:00<?, ?ba/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['text', 'nltk_pos', 'spacy_pos', 'named_entities', 'dependencies'],
    num_rows: 12954
})
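
Hard-coding an access token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment variable instead is shown below. The variable name `HF_TOKEN` matches the secret name mentioned in the warning above; the helper `get_hf_token` is hypothetical.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face token from an environment variable.

    Raises RuntimeError if the variable is unset or empty.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token
```

With the variable set, the upload call would become `dataset.push_to_hub("nastia-sheff/task_2", token=get_hf_token())`.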

In [26]:
dataset

Dataset({
    features: ['text', 'nltk_pos', 'spacy_pos', 'named_entities', 'dependencies'],
    num_rows: 12954
})

Dataset link: https://huggingface.co/datasets/nastia-sheff/task_2