# spaCy in Python
[Reference 1: spaCy models documentation](https://spacy.io/usage/models) <br>
[Reference 2: Extract Keywords Using spaCy in Python (Medium)](https://medium.com/better-programming/extract-keywords-using-spacy-in-python-4a8415478fbf)

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading https://files.pythonhosted.org/packages/10/b5/c7a92c7ce5d4b353b70b4b5b4385687206c8b230ddfe08746ab0fd310a3a/spacy-2.3.2-cp36-cp36m-manylinux1_x86_64.whl (9.9MB)
Collecting thinc==7.4.1
  Downloading https://files.pythonhosted.org/packages/10/ae/ef3ae5e93639c0ef8e3eb32e3c18341e511b3c515fcfc603f4b808087651/thinc-7.4.1-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
Installing collected packages: thinc, spacy
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed spacy-2.3.2 thinc-7.4.1


In [3]:
# en_core_web_lg (large)
!python -m spacy download en_core_web_lg
# en_core_web_md (medium)
# !python -m spacy download en_core_web_md
# en_core_web_sm (small)
# !python -m spacy download en_core_web_sm

Collecting en_core_web_lg==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz (782.7MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.3.1-cp36-none-any.whl size=782936124 sha256=c45a16be18f9e599b54858cc0a0b4d41c54ee092904bc155c098209fbf92b58e
  Stored in directory: /tmp/pip-ephem-wheel-cache-9dq0hj8i/wheels/ce/4d/1b/bc6cabb6df139c5f0318927be3ae9e51363fb44d6ea328d3f4
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [4]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.6/dist-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   --> 2.3.1
package   en-core-web-lg   en_core_web_lg   2.3.1   ✔

Use the following commands to update the model packages:
python -m spacy download en_core_web_sm



In [14]:
import spacy
from collections import Counter
from string import punctuation
import en_core_web_lg

In [11]:
# Alternative: load the spaCy model by name instead of via the imported package
# nlp = spacy.load("en_core_web_lg")



In [15]:
nlp = en_core_web_lg.load()

In [16]:
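def get_hotwords(text):
    """Return the lowercased proper nouns, adjectives, and nouns found in text."""
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN']  # parts of speech kept as keywords
    doc = nlp(text.lower())
    for token in doc:
        # Skip stop words and punctuation
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        # Keep only tokens tagged with one of the wanted parts of speech
        if token.pos_ in pos_tag:
            result.append(token.text)
    return result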
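
One detail in the filter above: `token.text in punctuation` is a substring test against the `string.punctuation` constant, so a multi-character token such as `...` does not match it. Such tokens are usually still dropped by the part-of-speech check that follows. A minimal sketch of a per-character check, assuming a hypothetical helper name `is_punctuation` that is not part of the original notebook:

```python
from string import punctuation


def is_punctuation(text):
    """Return True if text is non-empty and every character is punctuation.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return bool(text) and all(ch in punctuation for ch in text)


# Compare the substring test with the per-character test
print("..." in punctuation)       # substring test on the constant
print(is_punctuation("..."))      # per-character test
print(is_punctuation("medium"))   # an ordinary word
```

The helper could replace the `token.text in punctuation` condition if multi-character punctuation tokens need to be caught before the part-of-speech check.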

In [17]:
output = get_hotwords('''Welcome to Medium! Medium is a publishing 
                        platform where people can read important, 
                        insightful stories on the topics that matter 
                        most to them and share ideas with the world.''')
print(output)

['welcome', 'medium', 'medium', 'publishing', 'platform', 'people', 'important', 'insightful', 'stories', 'topics', 'ideas', 'world']


In [18]:
# Remove duplicate items
output = set(get_hotwords('''Welcome to Medium! Medium is a publishing 
                        platform where people can read important, 
                        insightful stories on the topics that matter 
                        most to them and share ideas with the world.'''))
print(output)

{'platform', 'insightful', 'ideas', 'medium', 'stories', 'important', 'world', 'welcome', 'publishing', 'people', 'topics'}
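
A set discards the order in which the keywords appeared, and its print order can differ between runs. Where first-appearance order matters, one option is `dict.fromkeys`, which keeps insertion order in Python 3.7+ (and in CPython 3.6 as an implementation detail). The sketch below uses the keyword list printed earlier as a literal so that it runs without the model; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# Keyword list copied from the get_hotwords output shown earlier
keywords = ['welcome', 'medium', 'medium', 'publishing', 'platform',
            'people', 'important', 'insightful', 'stories', 'topics',
            'ideas', 'world']

print(unique_in_order(keywords))
```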


In [20]:
# Generate hashtags from keywords
output = set(get_hotwords('''Welcome to Medium! Medium is a publishing 
                        platform where people can read important, 
                        insightful stories on the topics that matter 
                        most to them and share ideas with the world.'''))
hashtags = [('#' + x) for x in output]
print(' '.join(hashtags))

#platform #insightful #ideas #medium #stories #important #world #welcome #publishing #people #topics


In [22]:
# Sort by frequency
# Generate hashtags from the five most common keywords
# Count the full keyword list: a set would give every keyword a count of 1
output = get_hotwords('''Welcome to Medium! Medium is a publishing 
                        platform where people can read important, 
                        insightful stories on the topics that matter 
                        most to them and share ideas with the world.''')
hashtags = [('#' + x[0]) for x in Counter(output).most_common(5)]
print(' '.join(hashtags))

#medium #welcome #publishing #platform #people
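
The hashtag steps can be folded into one helper. This is a sketch under assumptions: `top_hashtags` is a hypothetical name, and it takes an already extracted keyword list (such as the return value of `get_hotwords`) so that it runs without the model. It expects the list with duplicates intact, because the duplicates are what carry the frequency information.

```python
from collections import Counter


def top_hashtags(keywords, n=5):
    """Return up to n hashtags, most frequent keyword first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = Counter(keywords)
    return ['#' + word for word, _ in counts.most_common(n)]


# Keyword list copied from the get_hotwords output shown earlier
keywords = ['welcome', 'medium', 'medium', 'publishing', 'platform',
            'people', 'important', 'insightful', 'stories', 'topics',
            'ideas', 'world']

print(' '.join(top_hashtags(keywords)))
```

With this sample, `medium` is the only keyword that appears more than once, so it is the only one whose position is decided by frequency.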
