### Building an NLP pipeline in Python
A typical natural language processing pipeline looks like this:

Input: raw text

- sentence segmentation

- tokenization (lexical analysis)

- part-of-speech tagging

- lemmatization

- stop word identification

- dependency parsing

- noun phrase detection

- named entity recognition

- coreference resolution
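
To make the idea of chained stages concrete, here is a minimal sketch of the first few steps (sentence segmentation, tokenization, stop word filtering) in plain Python. It is a deliberately naive, regex-based stand-in, not how spaCy works internally: spaCy uses trained statistical models for these steps. The function names and the tiny stop word list are hypothetical and exist only for illustration.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "by", "is", "it", "of", "the", "was", "who"}

def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def content_words(tokens):
    """Drop punctuation and stop words, returning lowercase content words."""
    return [t.lower() for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]

def run_pipeline(text):
    """Chain the stages: each stage consumes the output of the previous one."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        results.append({
            "sentence": sentence,
            "tokens": tokens,
            "content": content_words(tokens),
        })
    return results

sample = "London is the capital of England. It was founded by the Romans."
for item in run_pipeline(sample):
    print(item["content"])
```

In spaCy the same information is exposed on the processed document, for example through `doc.sents`, `token.pos_`, `token.lemma_`, `token.is_stop`, `token.dep_`, `doc.noun_chunks` and `doc.ents`.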

In [1]:
#!pip3 install -U spacy

In [2]:
#!python -m spacy download en_core_web_lg

In [3]:
import spacy

nlp = spacy.load('en_core_web_lg')

In [4]:
text = """I know a girl who is a hardcore Trump supporter, and I've been
asking her questions just to see how they could possibly support
him and the GOP. some of the things these people believe is mind
boggling. Their only source of information, is from other like
minded people that continuously re enforce their narrative and
makes them ignorant and oblivious to anything else. Some of the
gems this dumbass has tried to defend:
Trump has only golfed twice in his presidency, and much less than
Obama; because if he had golfed more, someone must have mentioned
it. And when I showed her with sources how much Trump has golfed,
the reply was a shrug and fake news
People attacking the Capitol were ANTIFA and far left that were
dressed as Trump supporters to make them look bad. ( I had no
comment on this one, as I legitimately couldn't believe someone
could be so dumb)
There was massive election fraud, as evident by multiple blogs
where people claimed they saw it with their own eyes and had
proof. When I asked for the proof, she claimed the left media and
the Jews won't allow it to be leaked to the public.
Black women in this country abort 485,000 babies/yr after giving
birth. I told her that's not abortion, and there is no source
verifying that. Once again, silence. Talking to her made me
realize, it absolutely doesn't matter what you show them, what you
prove with reliable sources, they'll simply won't believe you.
Trump's narrative of fake news from early on in his presidency
allowed him and his supporters to simply dismiss any bit of
information that didn't align with their views. So in short, I
don't think there's any hope in making these people realize what
the truth is, quite frankly, it's exhaustive; and I feel my brain
cells commit suicide after talking to her and other Trumpsters.
"""

In [5]:
doc = nlp(text)

for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

Trump (ORG)
GOP (ORG)
Trump (ORG)
Obama (ORG)
Trump (PERSON)
Capitol (ORG)
ANTIFA (NORP)
Trump (ORG)
Jews (NORP)
485,000 (CARDINAL)
Trump (ORG)
Trumpsters (NORP)
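
The output above labels the same surface form inconsistently: "Trump" is tagged as `ORG` in most places and as `PERSON` only once. A small sketch like the following can surface such conflicts. The `(text, label)` pairs are copied from the output above; the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the NER output above.
entities = [
    ("Trump", "ORG"), ("GOP", "ORG"), ("Trump", "ORG"), ("Obama", "ORG"),
    ("Trump", "PERSON"), ("Capitol", "ORG"), ("ANTIFA", "NORP"),
    ("Trump", "ORG"), ("Jews", "NORP"), ("485,000", "CARDINAL"),
    ("Trump", "ORG"), ("Trumpsters", "NORP"),
]

# Count how often each label was assigned to each entity text.
labels_by_text = defaultdict(Counter)
for ent_text, ent_label in entities:
    labels_by_text[ent_text][ent_label] += 1

# Entity texts that received more than one distinct label.
inconsistent = [t for t, counts in labels_by_text.items() if len(counts) > 1]

for ent_text in inconsistent:
    print(ent_text, dict(labels_by_text[ent_text]))
```

This matters for any downstream step that filters on a single label, such as redacting only `PERSON` entities.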


In [6]:
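import spacy

nlp = spacy.load('en_core_web_lg')

def replace_name_with_placeholder(token):
    # Redact tokens that belong to a PERSON entity, keeping their trailing whitespace
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED]" + token.whitespace_
    # token.string was removed in spaCy 3; text_with_ws is its replacement
    return token.text_with_ws

def scrub(text):
    doc = nlp(text)
    # Span.merge() was removed in spaCy 3; merge multi-token entities with the retokenizer
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)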

s = """
I know a girl who is a hardcore Trump supporter, and I've been
asking her questions just to see how they could possibly support
him and the GOP. some of the things these people believe is mind
boggling. Their only source of information, is from other like
minded people that continuously re enforce their narrative and
makes them ignorant and oblivious to anything else. Some of the
gems this dumbass has tried to defend:
Trump has only golfed twice in his presidency, and much less than
Obama; because if he had golfed more, someone must have mentioned
it. And when I showed her with sources how much Trump has golfed,
the reply was a shrug and fake news
People attacking the Capitol were ANTIFA and far left that were
dressed as Trump supporters to make them look bad. ( I had no
comment on this one, as I legitimately couldn't believe someone
could be so dumb)
There was massive election fraud, as evident by multiple blogs
where people claimed they saw it with their own eyes and had
proof. When I asked for the proof, she claimed the left media and
the Jews won't allow it to be leaked to the public.
Black women in this country abort 485,000 babies/yr after giving
birth. I told her that's not abortion, and there is no source
verifying that. Once again, silence. Talking to her made me
realize, it absolutely doesn't matter what you show them, what you
prove with reliable sources, they'll simply won't believe you.
Trump's narrative of fake news from early on in his presidency
allowed him and his supporters to simply dismiss any bit of
information that didn't align with their views. So in short, I
don't think there's any hope in making these people realize what
the truth is, quite frankly, it's exhaustive; and I feel my brain
cells commit suicide after talking to her and other Trumpsters.
"""

print(scrub(s))


I know a girl who is a hardcore Trump supporter, and I've been
asking her questions just to see how they could possibly support
him and the GOP. some of the things these people believe is mind
boggling. Their only source of information, is from other like
minded people that continuously re enforce their narrative and
makes them ignorant and oblivious to anything else. Some of the
gems this dumbass has tried to defend:
Trump has only golfed twice in his presidency, and much less than
Obama; because if he had golfed more, someone must have mentioned
it. And when I showed her with sources how much [REDACTED] has golfed,
the reply was a shrug and fake news
People attacking the Capitol were ANTIFA and far left that were
dressed as Trump supporters to make them look bad. ( I had no
comment on this one, as I legitimately couldn't believe someone
could be so dumb)
There was massive election fraud, as evident by multiple blogs
where people claimed they saw it with their own eyes and had
proof.
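
Only one mention of "Trump" is redacted in the output above, because the model tagged the other mentions as `ORG` rather than `PERSON`. As a sketch of one possible workaround, redaction can be done on character offsets with a configurable set of labels. The function `redact_spans` below is hypothetical and uses only the standard library; it assumes the spans carry the same information spaCy exposes as `ent.start_char`, `ent.end_char` and `ent.label_`. The sample sentence and its labels are made up for illustration.

```python
def redact_spans(text, spans, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace every span whose label is in `labels` with `placeholder`.

    `spans` is an iterable of (start_char, end_char, label) triples.
    Spans that overlap an already redacted region are skipped.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels or start < cursor:
            continue
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(placeholder)         # the entity itself is dropped
        cursor = end
    pieces.append(text[cursor:])           # remaining tail
    return "".join(pieces)

sample = "Trump has golfed more than Obama."
# Hypothetical labels that mimic the inconsistent tagging seen above.
tagged = [("Trump", "PERSON"), ("Obama", "ORG")]
spans = [(sample.find(name), sample.find(name) + len(name), label)
         for name, label in tagged]

print(redact_spans(sample, spans))
print(redact_spans(sample, spans, labels=("PERSON", "ORG")))
```

Widening the label set catches mislabeled names but also redacts genuine organizations, so the choice depends on whether over-redaction or leakage is the bigger risk.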



### Extracting facts from text

In [7]:
pip install textacy

Note: you may need to restart the kernel to use updated packages.


In [8]:
import textacy

In [9]:
nlp = spacy.load('en_core_web_lg')

text = """London is the capital and most populous city of England and  the United Kingdom.  
Standing on the River Thames in the south east of the island of Great Britain, 
London has been a major settlement  for two millennia.  It was founded by the Romans, 
who named it Londinium.
"""

doc = nlp(text)

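# entity and cue are passed by keyword; newer textacy releases accept them only as keywords
statements = textacy.extract.semistructured_statements(doc, entity="London", cue="be")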


print("Here are the things I know about London:")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Here are the things I know about London:
 - the capital and most populous city of England and  the United Kingdom.  

 - a major settlement  for two millennia.  
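
To show what kind of pattern is being matched here, the following is a rough regex-based approximation of "entity + form of *be* + fact". It is a swapped-in technique for illustration only: textacy works on spaCy's parse of the document, not on regular expressions, and the function name `naive_be_statements` is hypothetical.

```python
import re

def naive_be_statements(text, entity):
    """Yield (subject, verb, fact) for clauses shaped like '<entity> is ...'."""
    # Collapse runs of whitespace so line breaks do not interrupt a match.
    flat = " ".join(text.split())
    pattern = re.compile(
        rf"\b({re.escape(entity)})\s+"
        r"(is|was|are|were|has been|have been)\s+"
        r"([^.]+)\."
    )
    for match in pattern.finditer(flat):
        yield match.group(1), match.group(2), match.group(3).strip()

london_text = """London is the capital and most populous city of England and  the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement  for two millennia.  It was founded by the Romans,
who named it Londinium.
"""

for subject, verb, fact in naive_be_statements(london_text, "London"):
    print(f" - {fact}")
```

The sentence "It was founded by the Romans" is missed by both approaches, because linking "It" back to London requires coreference resolution.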


In [10]:
nlp = spacy.load('en_core_web_lg')

text_1 = """Moscow is the capital and largest city of Russia. The city stands on the Moskva River in Central Russia, with a population estimated at 12.4 million residents within the city limits, while over 17 million residents in the urban area,and over 20 million residents in the Moscow Metropolitan Area. The city covers an area of 2,511 square kilometres (970 sq mi), while the urban area covers 5,891 square kilometres (2,275 sq mi), and the metropolitan area covers over 26,000 square kilometres (10,000 sq mi). Moscow is among the world's largest cities, being the largest city entirely in Europe, the largest urban area in Europe, the largest metropolitan area in Europe, and the largest city by land area on the European continent. 
Originally established in 1147, Moscow grew to become a prosperous and powerful city that served as the capital of the Grand Duchy that bears its namesake. When the Grand Duchy of Moscow evolved into the Tsardom of Russia, Moscow still remained as the political and economic center for most of the Tsardom's history. When the Tsardom was reformed into the Russian Empire, the capital was moved from Moscow to Saint Petersburg, diminishing the influence of the city. The capital was then moved back to Moscow following the Russian Revolution and the city was brought back as the political centre of the Russian SFSR and the Soviet Union. In the aftermath of the dissolution of the Soviet Union, Moscow remained as the capital city of the contemporary and newly established Russian Federation.
"""

doc = nlp(text_1)

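# entity and cue are passed by keyword; newer textacy releases accept them only as keywords
statements = textacy.extract.semistructured_statements(doc, entity="Moscow", cue="be")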

print("Here are the things I know about Moscow:")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Here are the things I know about Moscow:
 - the capital and largest city of Russia


The most frequent noun chunks in the text about Moscow:

In [11]:
import spacy
import textacy.extract

nlp = spacy.load('en_core_web_lg')

text = """Moscow is the capital and largest city of Russia. The city stands on the Moskva River in Central Russia, with a population estimated at 12.4 million residents within the city limits, while over 17 million residents in the urban area,and over 20 million residents in the Moscow Metropolitan Area. The city covers an area of 2,511 square kilometres (970 sq mi), while the urban area covers 5,891 square kilometres (2,275 sq mi), and the metropolitan area covers over 26,000 square kilometres (10,000 sq mi). Moscow is among the world's largest cities, being the largest city entirely in Europe, the largest urban area in Europe, the largest metropolitan area in Europe, and the largest city by land area on the European continent. 
Originally established in 1147, Moscow grew to become a prosperous and powerful city that served as the capital of the Grand Duchy that bears its namesake. When the Grand Duchy of Moscow evolved into the Tsardom of Russia, Moscow still remained as the political and economic center for most of the Tsardom's history. When the Tsardom was reformed into the Russian Empire, the capital was moved from Moscow to Saint Petersburg, diminishing the influence of the city. The capital was then moved back to Moscow following the Russian Revolution and the city was brought back as the political centre of the Russian SFSR and the Soviet Union. In the aftermath of the dissolution of the Soviet Union, Moscow remained as the capital city of the contemporary and newly established Russian Federation."""

doc = nlp(text)

noun_chunks = textacy.extract.noun_chunks(doc, min_freq=3)

noun_chunks = map(str, noun_chunks)
noun_chunks = map(str.lower, noun_chunks)

for noun_chunk in set(noun_chunks):
    if len(noun_chunk.split(" ")) > 1:
        print(noun_chunk)

largest city
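
The filtering logic in the cell above (normalize case, keep phrases seen at least `min_freq` times, drop single words) can be sketched without any NLP library. The function `frequent_phrases` and the sample list of chunks below are hypothetical; in the notebook the chunks come from `textacy.extract.noun_chunks`.

```python
from collections import Counter

def frequent_phrases(chunks, min_freq=3):
    """Return multi-word phrases that occur at least `min_freq` times, ignoring case."""
    counts = Counter(chunk.lower() for chunk in chunks)
    return sorted(
        phrase for phrase, n in counts.items()
        if n >= min_freq and len(phrase.split()) > 1
    )

# Made-up chunk list for illustration only.
chunks = [
    "largest city", "Largest City", "the capital", "largest city",
    "Moscow", "Moscow", "Moscow", "the capital",
]

print(frequent_phrases(chunks))
```

Here "moscow" reaches the frequency threshold but is dropped as a single word, and "the capital" is dropped because it occurs only twice.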
