# Practical Assignment 6: Using spaCy for Named Entity Recognition (NER)

**Goal:**
Learn to extract named entities (NER) from text with the spaCy library, visualize them, and add custom entities.


## Step 1: Installing spaCy and the English model

Run the following commands in a terminal to install spaCy and the English model:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```


In [28]:
import spacy

# Load the model
nlp = spacy.load("en_core_web_sm")
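
Before loading the model, it can help to confirm that both installation commands succeeded. The snippet below is a minimal sketch of such a check; the helper name `describe_setup` is hypothetical and not part of the assignment, while `spacy.util.is_package` is the spaCy utility that reports whether a pip-installed package with the given name exists.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the model installed in Step 1


def describe_setup(model_name: str = MODEL_NAME) -> str:
    """Return a one-line summary of the spaCy version and model availability.

    Hypothetical helper for illustration only.
    """
    status = "installed" if is_package(model_name) else "missing"
    return f"spaCy {spacy.__version__}; model {model_name}: {status}"


print(describe_setup())
```

If the summary ends in `missing`, rerun `python -m spacy download en_core_web_sm` before calling `spacy.load`.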


## Step 2: Extracting named entities from text

Use spaCy to analyze the text and extract named entities.


In [29]:
text = """
Kazakhstan hosted Expo 2017 in Astana, marking the first time a Central Asian country organized a world exposition.
The event attracted over 4 million visitors from around the world and featured participation from 115 countries and 22 international organizations.
The theme of the exhibition was 'Future Energy,' aiming to address global challenges related to renewable energy, energy efficiency, and environmental sustainability.
Pavilions showcased innovations in solar, wind, and hydro energy, as well as smart grid technologies.

The United Nations, the World Bank, the European Union, and the International Renewable Energy Agency (IRENA) were among the key participants.
Notable world leaders, including Emmanuel Macron, Vladimir Putin, and Xi Jinping, visited the event, reinforcing Kazakhstan's role in international diplomacy.
During the exposition, major technology and energy corporations such as Tesla, Siemens, General Electric, Shell, and Rosatom presented cutting-edge developments in sustainable energy.

The financial investment in the Expo 2017 infrastructure exceeded $3 billion.
Following the event, the Expo site was transformed into the Astana International Financial Centre (AIFC), which became a hub for financial and technological innovation in Eurasia.
Additionally, the area now hosts the Nazarbayev University research campus and the Astana Hub for IT startups.

Expo 2017 was widely recognized for its success in promoting green technology and sustainable development, leaving a long-term economic and environmental impact on Kazakhstan.
"""

doc = nlp(text)

print("Extracted Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Extracted Named Entities:
Kazakhstan (GPE)
Astana (GPE)
first (ORDINAL)
Central Asian (NORP)
over 4 million (CARDINAL)
115 (CARDINAL)
22 (CARDINAL)
Future Energy (ORG)
The United Nations (ORG)
the World Bank (ORG)
the European Union (ORG)
the International Renewable Energy Agency (ORG)
Emmanuel Macron (PERSON)
Vladimir Putin (PERSON)
Xi Jinping (PERSON)
Kazakhstan (GPE)
Tesla (ORG)
Siemens (ORG)
General Electric (ORG)
Shell (ORG)
Rosatom (PERSON)
$3 billion (MONEY)
Expo (EVENT)
the Astana International Financial Centre (AIFC (ORG)
Eurasia (GPE)
Nazarbayev University (ORG)
the Astana Hub (FAC)
Kazakhstan (GPE)
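
The labels in the output (GPE, NORP, FAC, and so on) are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description of each one; the label list below is copied from the output above, and no trained model is needed for this lookup.

```python
import spacy

# Labels that appear in the extraction output above
labels = ["GPE", "ORDINAL", "NORP", "CARDINAL", "ORG",
          "PERSON", "MONEY", "EVENT", "FAC"]

# Map each label to spaCy's built-in glossary description
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label}: {description}")
```

Reading the descriptions next to the output makes the misclassifications easier to spot: Rosatom, a corporation, is tagged PERSON, and the exhibition theme "Future Energy" is tagged ORG.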


## Step 3: Grouping entities by category (type)

We use a dictionary to group the entities by their type (label).


In [30]:
from collections import defaultdict

entities_by_label = defaultdict(list)

for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

print("Entities grouped by category:")
for label, entities in entities_by_label.items():
    print(f"{label}: {', '.join(entities)}")


Entities grouped by category:
GPE: Kazakhstan, Astana, Kazakhstan, Eurasia, Kazakhstan
ORDINAL: first
NORP: Central Asian
CARDINAL: over 4 million, 115, 22
ORG: Future Energy, The United Nations, the World Bank, the European Union, the International Renewable Energy Agency, Tesla, Siemens, General Electric, Shell, the Astana International Financial Centre (AIFC, Nazarbayev University
PERSON: Emmanuel Macron, Vladimir Putin, Xi Jinping, Rosatom
MONEY: $3 billion
EVENT: Expo
FAC: the Astana Hub
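
The grouped output repeats an entity once per mention, so Kazakhstan appears three times under GPE. One possible variation, sketched below, counts mentions instead of listing them. The `pairs` list is a hand-copied subset of the output above so that the snippet runs on its own; in the notebook it could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the output above (a subset, for illustration)
pairs = [
    ("Kazakhstan", "GPE"), ("Astana", "GPE"), ("Kazakhstan", "GPE"),
    ("Eurasia", "GPE"), ("Kazakhstan", "GPE"),
    ("Tesla", "ORG"), ("Siemens", "ORG"),
]

# One Counter per label: entity text -> number of mentions
counts_by_label = defaultdict(Counter)
for text, label in pairs:
    counts_by_label[label][text] += 1

for label, counts in counts_by_label.items():
    summary = ", ".join(f"{text} x{n}" for text, n in counts.most_common())
    print(f"{label}: {summary}")
```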


## Step 4: Visualizing named entities with displaCy

We create a visualization of the extracted entities and save it to an HTML file.


In [31]:
from spacy import displacy

# Visualize the entities in Jupyter
displacy.render(doc, style="ent", jupyter=True)


In [32]:
# Save the visualization to an HTML file
# jupyter=False makes render() return the markup as a string instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)

if html:  # Make sure render() did not return None
    with open("ner_visualization.html", "w", encoding="utf-8") as f:
        f.write(html)
    print("HTML file saved successfully!")
else:
    print("Error: displacy.render() returned None.")


HTML file saved successfully!
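
displaCy also accepts an `options` dictionary for the entity style. The sketch below highlights only ORG entities and assigns them a custom color. To stay runnable without the trained model it uses a blank English pipeline with hand-set entities; the sample sentence, the variable names, and the color value are assumptions chosen for illustration.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenization only, so entities are set by hand here
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("Tesla and Siemens presented in Astana.")
sample_doc.set_ents([
    Span(sample_doc, 0, 1, label="ORG"),  # Tesla
    Span(sample_doc, 2, 3, label="ORG"),  # Siemens
    Span(sample_doc, 5, 6, label="GPE"),  # Astana
])

# "ents" restricts which labels are highlighted; "colors" overrides label colors
options = {"ents": ["ORG"], "colors": {"ORG": "#ffd966"}}

# page=True wraps the markup in a complete HTML page
sample_html = displacy.render(
    sample_doc, style="ent", options=options, page=True, jupyter=False
)
print(sample_html[:80])
```

The same `options` argument can be passed to the `displacy.render` calls in the notebook cells above.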


## Step 5: Adding custom entities manually

If the model fails to recognize the entities you need, you can add them manually.


In [33]:
from spacy.tokens import Span

custom_text = "Satbayev University is one of the leading universities in Almaty."
doc_custom = nlp(custom_text)  # Process text with spaCy

# Create a new entity (Satbayev University as ORG) covering tokens 0 and 1
new_entity = Span(doc_custom, 0, 2, label="ORG")

# Overwrite the entities with the new one, leaving all others unmodified
doc_custom.set_ents([new_entity], default="unmodified")

# Print the updated entities
print("Updated entities:")
for ent in doc_custom.ents:
    print(f"{ent.text} ({ent.label_})")

# Visualize in Jupyter Notebook
displacy.render(doc_custom, style="ent", jupyter=True)


Updated entities:
Satbayev University (ORG)
Almaty (GPE)
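
Setting a `Span` by hand fixes one document at a time. A different technique, not used in the assignment itself, is spaCy's rule-based `entity_ruler` component, which applies the same patterns to every text the pipeline processes. The sketch below runs it on a blank English pipeline so that it works without the trained model; the patterns and variable names are illustrative assumptions.

```python
import spacy

# Blank pipeline with only a rule-based entity matcher
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Exact-phrase patterns; labels follow the scheme used by the trained model
ruler.add_patterns([
    {"label": "ORG", "pattern": "Satbayev University"},
    {"label": "GPE", "pattern": "Almaty"},
])

doc_rules = nlp_rules(
    "Satbayev University is one of the leading universities in Almaty."
)
rule_entities = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(rule_entities)
```

With the `en_core_web_sm` pipeline from Step 1, the ruler would typically be added with `nlp.add_pipe("entity_ruler", before="ner")` so that the rule-based matches take precedence over the statistical predictions.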
