# Practical Assignment 6: Using spaCy for Named Entity Recognition (NER)

**Goal:**
Learn to extract named entities from text with the spaCy library, visualize them, and add custom entities.


## Step 1: Installing spaCy and the English model

Run the following commands in a terminal to install spaCy and the English model:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```


In [31]:
import spacy

# Load the model
nlp = spacy.load("en_core_web_sm")
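
If the model package is not installed, `spacy.load` raises an `OSError`. A minimal sketch of a guarded load follows. The helper name `load_english_model` and the fallback to a blank pipeline are illustrative assumptions, not part of the assignment; a blank pipeline only tokenizes, so it finds no entities.

```python
import spacy


def load_english_model(name="en_core_web_sm"):
    """Hypothetical helper: load a spaCy pipeline, or fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found.
        print(f"Model '{name}' not found. Run: python -m spacy download {name}")
        # Assumption: a tokenizer-only pipeline is an acceptable stand-in here.
        return spacy.blank("en")


nlp_checked = load_english_model()
print(nlp_checked.pipe_names)  # contains "ner" only when the model was loaded
```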


## Step 2: Extracting named entities from text

Use spaCy to analyze the text and extract named entities.


In [32]:
text = """
The Mars 2020 mission, launched by NASA on July 30, 2020, from Cape Canaveral Space Force Station, successfully landed the Perseverance rover on the surface of Mars on February 18, 2021.
This marked a significant milestone in space exploration, as Perseverance was tasked with searching for signs of ancient microbial life and collecting samples for future missions.

The rover, which carries the Ingenuity helicopter, landed in the Jezero Crater, an area believed to have hosted a lake billions of years ago. Scientists from the Jet Propulsion Laboratory (JPL) and the California Institute of Technology (Caltech) are leading the mission, collaborating with international space agencies, including the European Space Agency (ESA).

Perseverance is equipped with advanced instruments such as the SuperCam, Mastcam-Z, and the PIXL (Planetary Instrument for X-ray Lithochemistry), which allow it to analyze the Martian surface.
NASA estimates that the mission's total cost is approximately $2.7 billion.

During its exploration, Perseverance has already sent thousands of images back to Earth, revealing details about the planet's geology and climate history. The Ingenuity helicopter, originally planned for five test flights, has exceeded expectations by completing more than 60 successful flights, proving the feasibility of powered flight in Mars' thin atmosphere.

One of the mission’s most ambitious goals is the Mars Sample Return program, a joint initiative between NASA and ESA. The plan involves sending a retrieval mission by 2033 to collect the rock and soil samples gathered by Perseverance and bring them back to Earth for in-depth analysis.

With the success of Perseverance and Ingenuity, NASA is preparing for future manned missions to Mars, potentially in the late 2030s. Private companies such as SpaceX, led by Elon Musk, are also developing technologies to support human exploration of the Red Planet.

This mission is expected to pave the way for a deeper understanding of Mars, providing critical insights into planetary evolution, habitability, and the potential for human colonization in the future.
"""

doc = nlp(text)

print("Extracted Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Extracted Named Entities:
Mars 2020 (DATE)
NASA (ORG)
July 30, 2020 (DATE)
Cape Canaveral Space Force Station (ORG)
Perseverance (PERSON)
Mars (LOC)
February 18, 2021 (DATE)
Perseverance (PERSON)
Ingenuity (ORG)
the Jezero Crater (LOC)
billions of years ago (DATE)
the Jet Propulsion Laboratory (ORG)
JPL (ORG)
the California Institute of Technology (ORG)
Caltech (ORG)
the European Space Agency (ORG)
ESA (ORG)
SuperCam (ORG)
PIXL (ORG)
Martian (NORP)
NASA (ORG)
approximately $2.7 billion (MONEY)
Perseverance (PERSON)
thousands (CARDINAL)
Earth (LOC)
Ingenuity (ORG)
five (CARDINAL)
more than 60 (CARDINAL)
Mars (LOC)
One (CARDINAL)
the Mars Sample Return (LOC)
NASA (ORG)
ESA (ORG)
2033 (DATE)
Perseverance (PERSON)
Earth (LOC)
Perseverance and Ingenuity (ORG)
NASA (ORG)
Mars (LOC)
the late 2030s (DATE)
SpaceX (PERSON)
Elon Musk (PERSON)
the Red Planet (LOC)
Mars (LOC)
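
Some labels in this output, such as `NORP`, are not self-explanatory. As a small sketch, `spacy.explain` looks up a short description of a label in spaCy's built-in glossary. The `labels` list below is copied by hand from the output above rather than read from `doc`.

```python
import spacy

# Labels that appear in the output above (copied manually for this sketch).
labels = ["DATE", "ORG", "PERSON", "LOC", "NORP", "MONEY", "CARDINAL"]

# spacy.explain returns a description string, or None for an unknown term.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<9} {description}")
```

In the notebook, `sorted({ent.label_ for ent in doc.ents})` would collect the same labels from the document itself.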


## Step 3: Grouping entities by category (type)

Use a dictionary to group the entities by their type (label).


In [33]:
from collections import defaultdict

entities_by_label = defaultdict(list)

for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

print("Entities grouped by category:")
for label, entities in entities_by_label.items():
    print(f"{label}: {', '.join(entities)}")


Entities grouped by category:
DATE: Mars 2020, July 30, 2020, February 18, 2021, billions of years ago, 2033, the late 2030s
ORG: NASA, Cape Canaveral Space Force Station, Ingenuity, the Jet Propulsion Laboratory, JPL, the California Institute of Technology, Caltech, the European Space Agency, ESA, SuperCam, PIXL, NASA, Ingenuity, NASA, ESA, Perseverance and Ingenuity, NASA
PERSON: Perseverance, Perseverance, Perseverance, Perseverance, SpaceX, Elon Musk
LOC: Mars, the Jezero Crater, Earth, Mars, the Mars Sample Return, Earth, Mars, the Red Planet, Mars
NORP: Martian
MONEY: approximately $2.7 billion
CARDINAL: thousands, five, more than 60, One
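
The grouped lists above repeat the same entity several times (NASA appears four times under ORG). A possible refinement, sketched here with the standard library only, counts each distinct entity per label. The `pairs` list is a hand-copied subset of the output above and stands in for `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter, defaultdict

# Hand-copied subset of (entity text, label) pairs, for illustration only.
pairs = [
    ("NASA", "ORG"),
    ("Perseverance", "PERSON"),
    ("Mars", "LOC"),
    ("Perseverance", "PERSON"),
    ("ESA", "ORG"),
    ("NASA", "ORG"),
    ("Earth", "LOC"),
    ("Mars", "LOC"),
    ("NASA", "ORG"),
]

# One Counter per label: entity text -> number of occurrences.
counts_by_label = defaultdict(Counter)
for text, label in pairs:
    counts_by_label[label][text] += 1

for label, counts in counts_by_label.items():
    summary = ", ".join(f"{text} ({n})" for text, n in counts.most_common())
    print(f"{label}: {summary}")
```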


## Step 4: Visualizing named entities with displaCy

Render the extracted entities and save the visualization to an HTML file.


In [34]:
from spacy import displacy

# Visualize the entities in Jupyter
displacy.render(doc, style="ent", jupyter=True)


In [35]:
# Save the visualization to an HTML file
html = displacy.render(doc, style="ent", jupyter=False)  # jupyter=False returns the markup as a string

if html:  # Guard against an empty or None result
    with open("ner_visualization.html", "w", encoding="utf-8") as f:
        f.write(html)
    print("HTML file saved successfully!")
else:
    print("Error: displacy.render() returned None.")


HTML file saved successfully!
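
displaCy can also render entities supplied as plain dictionaries (`manual=True`), and it accepts an `options` dictionary that restricts the highlighted labels and sets their colors. The sketch below assumes spaCy v3. The sentence, the helper `entity_dict`, and the color values are made up for illustration, and no trained model is needed.

```python
from spacy import displacy

sentence = "NASA landed the Perseverance rover on Mars."


def entity_dict(text, phrase, label):
    """Hypothetical helper: build a displaCy entity dict from a substring."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Manual mode expects the text plus entity character offsets, ordered by start.
manual_doc = {
    "text": sentence,
    "ents": [
        entity_dict(sentence, "NASA", "ORG"),
        entity_dict(sentence, "Mars", "LOC"),
    ],
    "title": None,
}

# "ents" limits which labels are highlighted; "colors" maps labels to CSS colors.
options = {"ents": ["ORG", "LOC"], "colors": {"ORG": "#aa9cfc", "LOC": "#feca74"}}

html_custom = displacy.render(
    manual_doc, style="ent", manual=True, options=options, jupyter=False
)
print(html_custom[:80])
```

The same `options` dictionary can be passed when rendering a processed `doc` instead of a manual dictionary.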


## Step 5: Adding custom entities manually

If the model fails to recognize the entities you need, you can add them manually.


In [36]:
from spacy.tokens import Span

custom_text = """
Satbayev University, officially known as the Kazakh National Research Technical University, is one of the leading institutions of higher education in Almaty, Kazakhstan.
Founded in 1934, it has played a crucial role in the development of engineering, science, and technology in Central Asia.

The university is named after Kanysh Satbayev, a renowned Kazakh geologist and the first president of the Academy of Sciences of Kazakhstan.
Over the years, Satbayev University has collaborated with major international institutions, including the Massachusetts Institute of Technology (MIT), Moscow State University, and the Technical University of Munich.

With more than 12,000 students enrolled, the university offers programs in engineering, artificial intelligence, cybersecurity, and renewable energy.
In 2022, it was ranked among the top 500 technical universities worldwide according to the QS World University Rankings.

The university campus is located near Dostyk Avenue in Almaty and includes state-of-the-art research laboratories, an innovation hub, and the Satbayev Space Center.
This facility actively contributes to Kazakhstan’s space research programs in collaboration with Kazcosmos and the Baikonur Cosmodrome.

Notable alumni of Satbayev University include leading scientists, politicians, and entrepreneurs who have contributed to Kazakhstan’s industrial and technological advancements.
In recent years, the university has been actively involved in projects related to artificial intelligence and sustainable energy, receiving funding from the World Bank and the European Union.

With a rich history and a strong focus on innovation, Satbayev University continues to be a driving force in shaping the future of technology and education in Kazakhstan and beyond.
"""

doc_custom = nlp(custom_text)  # Process the text with spaCy

# Create a new entity (Satbayev University as ORG).
# Token 0 is the newline that opens custom_text, so the name covers tokens 1 and 2.
new_entity = Span(doc_custom, 1, 3, label="ORG")

# Set the new entity and leave all other entities unmodified
doc_custom.set_ents([new_entity], default="unmodified")

# Print the updated entities
print("Updated entities:")
for ent in doc_custom.ents:
    print(f"{ent.text} ({ent.label_})")

# Visualize in Jupyter Notebook
displacy.render(doc_custom, style="ent", jupyter=True)


Updated entities:
Satbayev University (ORG)
the Kazakh National Research Technical University (ORG)
Almaty (GPE)
Kazakhstan (GPE)
1934 (DATE)
Central Asia (LOC)
Kanysh Satbayev (PERSON)
Kazakh (ORG)
first (ORDINAL)
the Academy of Sciences of Kazakhstan (ORG)
the years (DATE)
Satbayev University (ORG)
the Massachusetts Institute of Technology (ORG)
MIT (ORG)
Moscow State University (ORG)
the Technical University of Munich (ORG)
more than 12,000 (CARDINAL)
2022 (DATE)
500 (CARDINAL)
the QS World University Rankings (ORG)
Dostyk Avenue (FAC)
Almaty (GPE)
the Satbayev Space Center (ORG)
Kazakhstan (GPE)
Kazcosmos (PERSON)
Satbayev University (ORG)
Kazakhstan (GPE)
recent years (DATE)
the World Bank (ORG)
the European Union (ORG)
Satbayev University (ORG)
Kazakhstan (GPE)
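
Hard-coded token indices are easy to get wrong: a triple-quoted string that starts with a line break yields a whitespace token at index 0, which shifts every following index by one. One way around this, sketched below under the assumption of spaCy v3, is to locate the phrase by character offsets with `Doc.char_span`. A blank English pipeline and a shortened sample text are used so that the snippet runs without a trained model.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained components
sample = "\nSatbayev University is located in Almaty, Kazakhstan."
doc_sample = nlp_blank(sample)

phrase = "Satbayev University"
start_char = sample.index(phrase)

# char_span returns None when the offsets do not align with token boundaries.
span = doc_sample.char_span(start_char, start_char + len(phrase), label="ORG")
if span is None:
    raise ValueError(f"'{phrase}' does not align with token boundaries")

# Set only this span; annotations on all other tokens stay as they were.
doc_sample.set_ents([span], default="unmodified")

print([(ent.text, ent.label_, ent.start, ent.end) for ent in doc_sample.ents])
```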

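
When the same correction is needed across many texts, a rule-based component may be more practical than editing each `Doc` by hand. The following is a minimal sketch using spaCy v3's `entity_ruler` on a blank pipeline. The two patterns, the `FAC` label for the cosmodrome, and the sample sentence are assumptions chosen for illustration (the output above labels Kazcosmos as PERSON).

```python
import spacy

nlp_rules = spacy.blank("en")  # no trained model required for this sketch
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        # A string pattern matches the exact phrase.
        {"label": "ORG", "pattern": "Kazcosmos"},
        # A token pattern uses one dict per token; LOWER ignores letter case.
        {"label": "FAC", "pattern": [{"LOWER": "baikonur"}, {"LOWER": "cosmodrome"}]},
    ]
)

doc_rules = nlp_rules("The center works with Kazcosmos and the Baikonur Cosmodrome.")
for ent in doc_rules.ents:
    print(f"{ent.text} ({ent.label_})")
```

With the trained model loaded earlier, the same component could be added via `nlp.add_pipe("entity_ruler", before="ner")` so that the rules are applied before the statistical recognizer runs.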