### Import Libraries

In [1]:
!pip install rake_nltk
!pip install pytextrank

In [2]:
# web scraping libraries
from bs4 import BeautifulSoup
import requests

In [3]:
# string processing libraries
import re
from string import digits, punctuation

In [4]:
# keyword extraction libraries
import spacy
import pytextrank

### Web Scraping

In [5]:
url = 'https://en.wikipedia.org/wiki/Dog'

In [6]:
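# download the page; fail early on HTTP errors instead of parsing an error page
response = requests.get(url, timeout=10)
response.raise_for_status()
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# concatenate the text of every <p> element
text = ''.join(element.text for element in soup.find_all('p'))
print(text[:300])  # show first 300 characters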


The dog (Canis familiaris[4][5] or Canis lupus familiaris[5]) is a domesticated descendant of the wolf. Also called the domestic dog, it is derived from the extinct Pleistocene wolf,[6][7] and the modern wolf is the dog's nearest living relative.[8] The dog was the first species to be domesticated,


### Preprocessing

In [7]:
# load the small English spaCy pipeline once and reuse it
nlp1 = spacy.load('en_core_web_sm')

def lemmatize(text):
    """Return text with every token replaced by its lemma, joined by spaces."""
    doc = nlp1(text)
    return " ".join(token.lemma_ for token in doc)

In [8]:
# remove brackets and numbers
remove_digits = str.maketrans('', '', digits+'[]')
text = text.translate(remove_digits)

# remove excess space
text = re.sub(' +', ' ', text)

# convert to lower case
text = text.lower()

# lemmatization
text = lemmatize(text)

# show first 300 characters
print(text[:300])


 the dog ( canis familiaris or canis lupus familiaris ) be a domesticate descendant of the wolf . also call the domestic dog , it be derive from the extinct pleistocene wolf , and the modern wolf be the dog 's near living relative . the dog be the first specie to be domesticate , by hunter - gather
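The cleaning step above removes every digit, not only the citation markers, so numbers in the running text are lost as well. The sketch below reproduces that step on a made-up sentence and compares it with an alternative that strips only bracketed markers such as `[4]`. Both function names and the sample sentence are illustrative assumptions, and the regex variant is not what this notebook uses.

```python
import re
from string import digits

def strip_digits_and_brackets(raw):
    # Same approach as the notebook cell: delete all digits and square
    # brackets, collapse repeated spaces, then lower-case.
    table = str.maketrans('', '', digits + '[]')
    return re.sub(' +', ' ', raw.translate(table)).lower()

def strip_citation_markers(raw):
    # Alternative (an assumption, not the notebook's method): delete only
    # bracketed numeric markers like [4], leaving other numbers intact.
    without_markers = re.sub(r'\[\d+\]', '', raw)
    return re.sub(' +', ' ', without_markers).lower()

# Made-up sample sentence in the style of the scraped text
sample = "The dog[4][5] was domesticated over 14,000 years ago.[8]"

print(strip_digits_and_brackets(sample))
print(strip_citation_markers(sample))
```

For keyword extraction the lost numbers may not matter, but the second variant could be preferable if the cleaned text is reused for anything that depends on dates or quantities.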


### Keyword Extraction

In [9]:
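def get_top_ten_keywords(text):
    """Return the ten highest-ranked TextRank phrases in text."""
    # load spacy model
    nlp = spacy.load('en_core_web_sm')
    # add PyTextRank to spacy pipeline
    nlp.add_pipe('textrank')
    doc = nlp(text)
    # doc._.phrases is sorted by rank, highest first
    return [phrase.text for phrase in doc._.phrases[:10]]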

In [10]:
get_top_ten_keywords(text)

['dog breed',
 'dog',
 'other dog',
 'pet dog',
 'dog behavior',
 'dog meat',
 'domestic dog',
 'dog tail',
 'dog disease',
 'guard dog']
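Some of the returned phrases, such as 'other dog', add little beyond the head word. One possible post-processing step is sketched below, assuming a small hand-picked set of generic modifiers. The helper `drop_generic_phrases` and the `GENERIC_MODIFIERS` set are hypothetical and not part of PyTextRank or spaCy.

```python
# Hypothetical set of leading words that carry little meaning on their own;
# the exact contents are an assumption and should be tuned per corpus.
GENERIC_MODIFIERS = {"other", "many", "some", "such", "several"}

def drop_generic_phrases(phrases, modifiers=GENERIC_MODIFIERS):
    """Keep phrases unless they are multi-word and start with a generic modifier."""
    kept = []
    for phrase in phrases:
        words = phrase.split()
        if len(words) > 1 and words[0] in modifiers:
            continue  # e.g. 'other dog' is skipped
        kept.append(phrase)
    return kept

# The keyword list returned by the cell above
keywords = ['dog breed', 'dog', 'other dog', 'pet dog', 'dog behavior',
            'dog meat', 'domestic dog', 'dog tail', 'dog disease', 'guard dog']

filtered = drop_generic_phrases(keywords)
print(filtered)
```

The filter preserves the original ranking order, so the remaining phrases can still be read from most to least important.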