In [1]:
! pip install spacy
! python -m spacy download en_core_web_sm
! pip install wikipedia

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=faa1a1ef952753e8c8ad3462034b8e625a1aeb4afc0cf6387e4fe7d196431ed9
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed w

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy import displacy
import pandas as pd
import re
import itertools
from IPython.display import display, HTML
import wikipedia

In [3]:
df = pd.read_csv('/content/biology_new.csv')
df = df[['text', 'title', 'date', 'link', 'cleaned_text', 'fully_cleaned_text']]
df.head()

Unnamed: 0,text,title,date,link,cleaned_text,fully_cleaned_text
0,FOR THE past four billion years or so the only...,The promise and perils of synthetic biology,Apr 4th 2019,https://www.economist.com/leaders/2019/04/04/t...,for the past four billion years or so the only...,past four billion years way life earth produce...
1,IN A former leatherworks just off Euston Road ...,Will artificial intelligence help to crack bio...,Jan 7th 2017,https://www.economist.com/science-and-technolo...,in a former leatherworks just off euston road ...,former leatherworks euston road london hopeful...
2,“How many cells are there in a human being?” I...,The idea of “holobionts” represents a paradigm...,Jun 14th 2023,https://www.economist.com/science-and-technolo...,how many cells are there in a human being it...,many cells human sounds like question nerdy pu...
3,LIVING creatures are jolly useful. Farmers rea...,The remarkable promise of cell-free biology,May 4th 2017,https://www.economist.com/leaders/2017/05/04/t...,living creatures are jolly useful farmers rear...,living creatures jolly useful farmers rear ani...
4,"A broken brain, hidden inside a skull, is hard...",Better brain biology will deliver better medic...,Sep 21st 2022,https://www.economist.com/technology-quarterly...,a broken brain hidden inside a skull is harder...,broken brain hidden inside skull harder diagno...


In [6]:
# Based on our experience with scientific texts in biomedical discourse, we assumed that a quotation may be accompanied by its author's name
# Function that extracts personal names (PERSON entities) from a text
def extract_names(text):
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return names

# Apply the function to the 'text' column and store the result in a new 'names' column
df['names'] = df['text'].apply(extract_names)

# Show the DataFrame with the extracted names
df = df[['text', 'title', 'date', 'link', 'names']]
df.head()

Unnamed: 0,text,title,date,link,names
0,FOR THE past four billion years or so the only...,The promise and perils of synthetic biology,Apr 4th 2019,https://www.economist.com/leaders/2019/04/04/t...,[Fossil]
1,IN A former leatherworks just off Euston Road ...,Will artificial intelligence help to crack bio...,Jan 7th 2017,https://www.economist.com/science-and-technolo...,"[Chris Bishop, Chan Zuckerberg, Richard Mead, ..."
2,“How many cells are there in a human being?” I...,The idea of “holobionts” represents a paradigm...,Jun 14th 2023,https://www.economist.com/science-and-technolo...,"[Thomas Bell, Scott Gilbert, Joan Roughgarden,..."
3,LIVING creatures are jolly useful. Farmers rea...,The remarkable promise of cell-free biology,May 4th 2017,https://www.economist.com/leaders/2017/05/04/t...,[Genzyme]
4,"A broken brain, hidden inside a skull, is hard...",Better brain biology will deliver better medic...,Sep 21st 2022,https://www.economist.com/technology-quarterly...,"[Daniel Karlin, Alto Neuroscience, Dr Etkin, A..."


In [7]:
for value in df['names']:
    print(value)

['Fossil']
['Chris Bishop', 'Chan Zuckerberg', 'Richard Mead', 'Watson', 'Antonio Criminsi', 'Dr Bishop', 'Watson', 'Isaac Newton']
['Thomas Bell', 'Scott Gilbert', 'Joan Roughgarden', 'Thomas Juenger', 'Dr Juenger', 'Mixotricha', 'Lynn Margulis', 'Buchnera', 'Buchnera', 'Buchnera', 'Mitochondria', 'Jean-Michel', 'Cassandra Allsup', 'Isabelle George', 'Richard Lankau', 'Madison', 'Madeleine van Oppen', 'Raquel Peixoto']
['Genzyme']
['Daniel Karlin', 'Alto Neuroscience', 'Dr Etkin', 'Alto', 'Uta Frith', 'Ms Bingham', 'Dr Etkin', 'Neumora', 'Paul Berns', 'Neumora', 'the Michael J. Fox Foundation', 'Neumora', 'John Dunlop', 'Neumora', 'Zolgensma', 'sod1', 'Sabah Oney', 'Vigil Neuroscience', 'Neumora', 'Vigil', 'Duncan Emerton', 'Jeff Jonas', 'Dr Jonas']
['ALFRED NOBEL', 'Stanley Whittingham', 'Whittingham', 'John Goodenough', 'Akira Yoshino', 'Yoshino', 'Yoshino', 'Dr Yoshino', 'Dr Yoshino', 'Queloz', 'Dr Queloz', 'Doppler', 'Doppler', 'Dr Queloz', 'James Peebles', 'Martin Rees', 'Astrono
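The raw lists still contain repeats (for example, 'Watson' appears twice in the second article), so counting mentions before deduplication can hint at whom an article is mainly about. The following is a minimal sketch using only the standard library; the helper name `count_mentions` is hypothetical, and the sample list is copied from the output above.

```python
from collections import Counter

def count_mentions(names):
    """Return (name, count) pairs, most frequent first."""
    return Counter(names).most_common()

# Names extracted from the second article (see the output above)
sample = ['Chris Bishop', 'Chan Zuckerberg', 'Richard Mead', 'Watson',
          'Antonio Criminsi', 'Dr Bishop', 'Watson', 'Isaac Newton']

for name, count in count_mentions(sample)[:3]:
    print(name, count)
```

Note that 'Chris Bishop' and 'Dr Bishop' are counted separately here; merging such variants would need an extra normalization step.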

In [8]:
# Merge the lists and sort the names alphabetically, after some preprocessing:

# Helper that strips the possessive 's (and any stray apostrophes) from a matched word
def replacer(match):
  return match.group(0).replace("’s", "").replace("'s", "").replace("’", "").replace("'", "")

name = "Dr Clauser’s"
# Inside a character class "|" is a literal pipe, so only the two apostrophe forms are listed
reg = r"\w+[’']\w*"
name = re.sub(reg, replacer, name)
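The pattern above removes every apostrophe inside a matched word, so a surname such as O’Brien would be collapsed to OBrien. If that matters, a stricter variant could strip only a possessive at the end of a word. This is a sketch under that assumption, not the notebook's own method; `strip_trailing_possessive` and `TRAILING_POSSESSIVE` are hypothetical names.

```python
import re

# Assumption: a possessive is an apostrophe (straight or curly), optionally
# followed by "s", sitting at the very end of a word.
TRAILING_POSSESSIVE = re.compile(r"(\w)[’']s?(?=\s|$)")

def strip_trailing_possessive(name):
    """Drop word-final possessive endings but keep apostrophes inside a word."""
    return TRAILING_POSSESSIVE.sub(r"\1", name)

for raw in ["Dr Clauser’s", "Keats’", "O’Brien’s"]:
    print(raw, "->", strip_trailing_possessive(raw))
```

The lookahead `(?=\s|$)` is what keeps the apostrophe in O’Brien intact: it is followed by a letter, not by whitespace or the end of the string.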

In [9]:
# Drop every name that contains a digit
def del_digit(names):
  return [name for name in names if not any(ch.isdigit() for ch in name)]

# Remove duplicate names (the original order is not preserved)
def del_duplicate(names):
  return list(set(names))

# Strip possessive endings and stray apostrophes from every name
def del_apostrophy(names):
  reg = r"\w+[’']\w*"
  return [re.sub(reg, replacer, name) for name in names]

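# Sort key: the second word (usually the surname) if the name has several words, otherwise the whole name
def sort(name):
  parts = name.split()
  return parts[1].lower() if len(parts) > 1 else name.lower()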

names = df['names'].values
names = list(itertools.chain.from_iterable(names))
names = del_digit(names)
names = del_apostrophy(names)
names = del_duplicate(names)
names = sorted(names, key=sort)
names

['Influenza A',
 'Krista A. McNally',
 'AAAS',
 'Aadhaar',
 'Aaron',
 'Abba',
 'Abbott',
 'Tony Abbott',
 'Abdellaoui',
 'Abdel Abdellaoui',
 'Salim Abdool Karim',
 'Dr Abel',
 'Laurent Abel',
 'Abisko',
 'Norman Abjorensen',
 'Mr Abjorensen',
 'Accuracy',
 'Acetobacter',
 'Enoch Achigan-Dako',
 'Dr Achigan-Dako',
 'Achilles',
 'Achondroplasia',
 'Josef Ackermann',
 'Ackman',
 'Bill Ackman',
 'Actin',
 'Adalimumab',
 'Adam',
 'Ansel Adams',
 'John Adams',
 'Richard Adams',
 'Rachel Adato',
 'Adderall',
 'Eric Adelberger',
 'Dr Adelberger',
 'Konrad Adenauer',
 'Akinwumi Adesina',
 'Adesina',
 'Tedros Adhanom Ghebreyesus',
 'Adleman',
 'DNA.Dr Adleman',
 'Aedes',
 'Aequorea',
 'P. aeruginosa',
 'Afeyan',
 'Affara',
 'Nabeel Affara',
 'Affymetrix',
 'Arman Afifi',
 'Mr Afifi',
 'Aftonbladet',
 'Steven Agaba',
 'Agamemnon',
 'Louis Agassiz',
 'Agassiz',
 'Pierre Agostini',
 'Agouron',
 'Agra',
 'Peter Agre',
 'Hind Agro',
 'David Agus',
 'Ahab',
 'Said Ahamada',
 'Ahamada',
 'Abiy Ahmed',
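The sorted list places variants of the same person next to each other ('Abbott' and 'Tony Abbott', 'Dr Abel' and 'Laurent Abel'). One possible next step, sketched here as an assumption rather than as part of the original workflow, is to group names that share a sort key. `surname_key` mirrors the `sort` function above; `group_by_surname` is a hypothetical helper.

```python
from itertools import groupby

def surname_key(name):
    """Same key as the notebook's sort(): second word if present, else the whole name."""
    parts = name.split()
    return parts[1].lower() if len(parts) > 1 else name.lower()

def group_by_surname(names):
    """Group names whose sort key is identical."""
    ordered = sorted(names, key=surname_key)
    return {key: list(items) for key, items in groupby(ordered, key=surname_key)}

sample = ['Abbott', 'Tony Abbott', 'Dr Abel', 'Laurent Abel', 'Salim Abdool Karim']
for key, variants in group_by_surname(sample).items():
    print(key, variants)
```

The key is only a heuristic: for 'Salim Abdool Karim' it picks 'abdool', and unrelated people who share a surname would land in the same group.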

In [10]:
# Retrieve the article whose text mentions the name we are looking for
df[df['text'].str.contains('William Blake', case=False)]

Unnamed: 0,text,title,date,link,names
7,BridgemanNATURE is full of surprises. When ato...,Biology's Big Bang,Jun 14th 2007,https://www.economist.com/leaders/2007/06/14/b...,"[James Chadwick, Samuel Goldwyn, William Blake..."


In [16]:
# Show the full text of that article
idx = 7
res_df = df.iloc[idx, :]
res_df['text']

"BridgemanNATURE is full of surprises. When atoms were first proved to exist (and that was a mere century ago), they were thought to be made only of electrons and protons. That explained a lot, but it did not quite square with other observations. Then, in 1932, James Chadwick discovered the neutron. Suddenly everything made sense—so much sense that it took only another 13 years to build an atomic bomb.It is probably no exaggeration to say that biology is now undergoing its “neutron moment”. For more than half a century the fundamental story of living things has been a tale of the interplay between genes, in the form of DNA, and proteins, which the genes encode and which do the donkey work of keeping living organisms living. The past couple of years, however, have seen the rise and rise of a third type of molecule, called RNA. The analogy is not perfect. Unlike the neutron, RNA has been known about for a long time. Until the past couple of years, however, its role had seemed restricted 

In [17]:
# Show the article's metadata
res_df

text     BridgemanNATURE is full of surprises. When ato...
title                                   Biology's Big Bang
date                                         Jun 14th 2007
link     https://www.economist.com/leaders/2007/06/14/b...
names    [James Chadwick, Samuel Goldwyn, William Blake...
Name: 7, dtype: object

In [19]:
# Highlight the named entities in the text
idx = 7
res_df = df.iloc[idx, :]
doc = nlp(res_df['text'])
# jupyter=False makes render() return the markup as a string instead of displaying it itself
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))

But physics gave the 20th century a more subtle boon than mere power. It also brought an understanding of the vastness of the universe and humanity's insignificant place in it. ***It allowed people, in William Blake's phrase, to hold infinity in the palm of a hand, and eternity in an hour.*** Biology, though, does more than describe humanity's place in the universe. It describes humanity itself. And here, surprisingly, the rise of RNA may be an important part of that description.

***As Samuel Goldwyn so wisely advised, never make predictions—especially about the future.*** But here is one: the analogy between 20th-century physics and 21st-century biology will continue, for both good and ill.
(an interesting quotation in terms of its authorship and its variants)

In [18]:
# Look the name up on Wikipedia
page = wikipedia.page("James Chadwick", auto_suggest=False)
page.title
page.content

'Sir James Chadwick,  (20 October 1891 – 24 July 1974) was an English physicist who was awarded the 1935 Nobel Prize in Physics for his discovery of the neutron in 1932. In 1941, he wrote the final draft of the MAUD Report, which inspired the U.S. government to begin serious atom bomb research efforts. He was the head of the British team that worked on the Manhattan Project during World War II. He was knighted in Britain in 1945 for his achievements in physics.\nChadwick graduated from the Victoria University of Manchester in 1911, where he studied under Ernest Rutherford (known as the "father of nuclear physics"). At Manchester, he continued to study under Rutherford until he was awarded his MSc in 1913. The same year, Chadwick was awarded an 1851 Research Fellowship from the Royal Commission for the Exhibition of 1851. He elected to study beta radiation under Hans Geiger in Berlin. Using Geiger\'s recently developed Geiger counter, Chadwick was able to demonstrate that beta radiation
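Many extracted "names" are not people at all ('Fossil', 'Genzyme'), so a lookup over the whole list has to tolerate failures. The sketch below keeps the lookup function pluggable so that it runs offline; `summarize_names` and `fake_fetch` are hypothetical. With the `wikipedia` package one could, as an assumption, pass `lambda n: wikipedia.summary(n, sentences=2, auto_suggest=False)` as `fetch_summary` and `(wikipedia.exceptions.PageError, wikipedia.exceptions.DisambiguationError)` as `errors`.

```python
def summarize_names(names, fetch_summary, errors=(Exception,)):
    """Map each name to a summary string, or to None when the lookup fails.

    fetch_summary is any callable name -> str; `errors` lists the exception
    types that should be treated as "no usable page".
    """
    summaries = {}
    for name in names:
        try:
            summaries[name] = fetch_summary(name)
        except errors:
            summaries[name] = None
    return summaries

# Offline stand-in for a real lookup, so the sketch runs without network access
def fake_fetch(name):
    known = {"James Chadwick": "English physicist who discovered the neutron in 1932."}
    return known[name]  # raises KeyError for anything else

result = summarize_names(["James Chadwick", "Fossil"], fake_fetch, errors=(KeyError,))
print(result)
```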

In [20]:
# Build HTML that lists the extracted entities, with personal names highlighted
html = "<div style='line-height: 2.5;'>"
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        html += f"<span style='background-color: #ffcccb;'>{ent.text}</span> "
    else:
        html += f"{ent.text} "
html += "</div>"

# Display the HTML
display(HTML(html))
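The cell above prints only the entities themselves, without the surrounding sentence. To keep the context, one option is to rebuild the full text from character offsets, which spaCy exposes as `ent.start_char`, `ent.end_char` and `ent.label_`. The following sketch takes plain `(start, end, label)` tuples so that it runs without a model; `highlight_spans` is a hypothetical helper, and the spans are assumed not to overlap.

```python
from html import escape

def highlight_spans(text, spans, label="PERSON", color="#ffcccb"):
    """Return HTML for `text` with spans of the given label highlighted.

    spans: iterable of (start_char, end_char, label) tuples, non-overlapping.
    """
    parts = []
    cursor = 0
    for start, end, span_label in sorted(spans):
        parts.append(escape(text[cursor:start]))   # plain text before the span
        chunk = escape(text[start:end])
        if span_label == label:
            chunk = f"<span style='background-color: {color};'>{chunk}</span>"
        parts.append(chunk)
        cursor = end
    parts.append(escape(text[cursor:]))            # remaining tail
    return "<div style='line-height: 2.5;'>" + "".join(parts) + "</div>"

text = "Then, in 1932, James Chadwick discovered the neutron."
start = text.index("James Chadwick")
spans = [(start, start + len("James Chadwick"), "PERSON")]
html_out = highlight_spans(text, spans)
print(html_out)
```

With a spaCy `doc`, the spans could be built as `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`. Escaping the text also avoids broken markup when an article contains characters such as `<` or `&`.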

In [None]:
idx = 201
res_df = df.iloc[idx, :]
res_df['text']

"SOME are born great. Some achieve greatness. Some have greatness thrust upon them. Substitute “fame” for “greatness” and you have an updated version of Shakespeare's quip that applies nicely to this year's Nobel prize for medicine, which was awarded for the development of in vitro fertilisation (IVF). The born-famous was Louise Brown, the world's first test-tube baby. The achiever of fame, celebrated at the time in newspapers and on television, was Patrick Steptoe, the gynecologist who created Ms Brown in his laboratory in 1978. And the man who has had fame thrust upon him, a mere 32 years after the event, is Robert Edwards, who spent more than two decades developing the science that IVF relies on. Dr Edwards was honoured for this work by the Karolinska Institute, on October 4th (though the prize will not actually be handed over until December). Steptoe died in 1988, and prizes are not awarded posthumously, so Dr Edwards scoops the whole pool of SKr10m ($1.5m).Dr Edwards began his wor

Act II Scene 5 of ***Twelfth Night*** by W. Shakespeare:

"Some are born great. Some achieve greatness. Some have greatness thrust upon them."


In [None]:
res_df

text     SOME are born great. Some achieve greatness. S...
title                      The 2010 Nobel prizes: Medicine
date                                          Oct 4th 2010
link     https://www.economist.com/babbage/2010/10/04/t...
names    [Shakespeare, Louise Brown, Patrick Steptoe, M...
Name: 201, dtype: object

In [None]:
df[df['text'].str.contains('John Keats', case=False)]

Unnamed: 0,text,title,date,link,names
80,The Origins of Creativity. By Edward Wilson. L...,What makes humans inventive?,Jan 11th 2018,https://www.economist.com/books-and-arts/2018/...,"[Edward Wilson, Allen Lane, Anthony Brandt, Da..."


In [None]:
idx = 80
res_df = df.iloc[idx, :]
res_df['text']

'The Origins of Creativity. By Edward Wilson. Liveright; 198 pages; $24.95. Allen Lane; £20.The Runaway Species. By Anthony Brandt and David Eagleman. Catapult; 287 pages; $28. Canongate; £20.DOES science spoil beauty? John Keats, an English Romantic poet, thought so. When Sir Isaac Newton separated white light into its prismatic colours, the effect, Keats wrote, was to “unweave a rainbow”. By explaining how rainbows occurred, the mystery and the lustre were lost. The idea that science and the arts are distinct, incompatible cultures is an enduring one. Two new books seem to cut to the heart of the matter: human creativity.Edward Wilson, 88 and the author of “The Origins of Creativity”, is the grand old man of Harvard biology. His speciality is myrmecology—the study of ants. For a short book, “The Origins of Creativity” is brimming with ideas, many of which wander, as Mr Wilson’s writing often does, beyond the brief of the title. Ultimately, though, everything in the book ties back to 

In [None]:
doc = nlp(res_df['text'])
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))


In [None]:
page = wikipedia.page("John Keats", auto_suggest=False)
page.title
page.content

'John Keats (31 October 1795 – 23 February 1821) was an English poet of the second generation of Romantic poets, along with Lord Byron and Percy Bysshe Shelley. His poems had been in publication for less than four years when he died of tuberculosis at the age of 25. They were indifferently received in his lifetime, but his fame grew rapidly after his death. By the end of the century, he was placed in the canon of English literature, strongly influencing many writers of the Pre-Raphaelite Brotherhood; the Encyclopædia Britannica of 1888 called one ode "one of the final masterpieces". Jorge Luis Borges named his first time reading Keats an experience he felt all his life. Keats had a style "heavily loaded with sensualities", notably in the series of odes. Typically of the Romantics, he accentuated extreme emotion through natural imagery. Today his poems and letters remain among the most popular and analysed in English literature – in particular "Ode to a Nightingale", "Ode on a Grecian U

In [None]:
doc = nlp(res_df['text'])
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))

As Jonathan Swift put it in a much-misquoted poem, ***“So, naturalists observe, a flea/Hath smaller fleas that on him prey”***. Parasites, in other words, are everywhere. They are also, usually, more abundant than their hosts. An astute observer might therefore have suspected that the actual most-common species on Earth would be a “flea” that parasitised P. ubique, rather than the bacterium itself.

The reason is that a virus relies for its growth and reproduction on the metabolic processes of the cell it infects. This means viruses themselves are hard to parasitise, since they do no work on which another organism can free-ride. Which is why the next two lines of Swift’s poem, ***“And these have smaller fleas to bite ’em/And so proceed ad infinitum”***, are wrong—and why, because HTVC010P itself can have no parasites, it probably really is the commonest organism on the planet