The tutorial link is: https://www.youtube.com/watch?v=KOCnVyxVks8

# Linguistic feature extraction

First, let's define some concepts:

- Tokenization: The process of segmenting text into words, punctuation marks, etc.

- Part-of-speech Tagging: Assigning word types to tokens, such as verb or noun.

- Dependency Parsing: Assigning syntactic dependency labels that describe the relations between individual tokens, such as subject or object.

- Lemmatization: Assigning the **base form** of a word. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".

- Sentence Boundary Detection (SBD): Finding and segmenting individual sentences. For example, a sentence usually ends at a final period.

- Named Entity Recognition: Labelling named "real world" objects, like persons, companies or locations. 

- Entity linking (EL): Disambiguating textual entities to unique identifiers in a Knowledge Base. 

- Similarity: Comparing two words to see how alike they are. For example, "cat" and "dog" are used in similar contexts.

- Text Classification: Assigning categories or labels to a whole document, or parts of a document. 

- Rule-based Matching: Finding sequences of tokens based on their texts and linguistic annotations, similar to regex. 

- Serialization: Saving objects to files or byte strings. 
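
To make the Similarity item above more concrete, here is a minimal sketch of the idea that words used in the same contexts end up with similar vectors. It is a toy illustration in plain Python, not how spaCy computes similarity; the corpus, the window size, and the helper names `context_vector` and `cosine` are assumptions made up for this example.

```python
import math
from collections import Counter

def context_vector(target, sentences, window=2):
    """Count the words found within `window` positions of `target` (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == target:
                lo = max(0, i - window)
                hi = min(len(words), i + window + 1)
                # Add every neighbour in the window except the target itself
                counts.update(w for j, w in enumerate(words[lo:hi], start=lo) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus (an assumption for illustration only)
corpus = [
    "the cat sleeps on the sofa",
    "the dog sleeps on the sofa",
    "the car needs new tires",
]

cat = context_vector("cat", corpus)
dog = context_vector("dog", corpus)
car = context_vector("car", corpus)

sim_cat_dog = cosine(cat, dog)
sim_cat_car = cosine(cat, car)
print("cat vs dog:", round(sim_cat_dog, 3))
print("cat vs car:", round(sim_cat_car, 3))
```

In this toy corpus "cat" and "dog" share all of their neighbouring words, so they score higher than "cat" and "car", which only share "the".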

spaCy is designed to receive raw text, process it, and output a Doc object that contains a variety of annotations.

In [0]:
# spaCy provides pretrained models. You can browse them at: https://spacy.io/models/en
import spacy
nlp = spacy.load('en_core_web_sm')  # Load the small English model (en_core_web_sm)

In [4]:
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
# Tokenization: iterating over the Doc yields its tokens.
# With "isn't" instead of "is", we would see "is" and "n't" as two separate tokens.

for token in doc:
  print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


### Part-of-Speech Tagging (PoS)

In [6]:
# spaCy hashes all the words into a shared vocabulary
doc

Apple is looking at buying U.K. startup for $1 billion

In [14]:
# Lemmatization, part-of-speech tags, and stop words
for token in doc:
  print(token.text, token.lemma_, 'POS: ', token.pos_, 'Stop_word:', token.is_stop)

# token.pos_ (part of speech) tells us what kind of word each token is.
# For example, "looking" is a verb (VERB) and "$" is a symbol (SYM).

# token.is_stop tells us whether the token is a stop word.

Apple Apple POS:  PROPN Stop_word: False
is be POS:  VERB Stop_word: True
looking look POS:  VERB Stop_word: False
at at POS:  ADP Stop_word: True
buying buy POS:  VERB Stop_word: False
U.K. U.K. POS:  PROPN Stop_word: False
startup startup POS:  NOUN Stop_word: False
for for POS:  ADP Stop_word: True
$ $ POS:  SYM Stop_word: False
1 1 POS:  NUM Stop_word: False
billion billion POS:  NUM Stop_word: False


### Dependency parsing: how one token depends on another token



In [17]:
for chunk in doc.noun_chunks:
  print(chunk.text, 'Root: ', chunk.root.text, 'dep:', chunk.root.dep_)

Apple Root:  Apple dep: nsubj
U.K. startup Root:  startup dep: dobj


### Named Entity Recognition: we can recognise the class of the words in our sentences

In [20]:
# We can recognise organizations (ORG), geopolitical entities (GPE), and amounts of money (MONEY)
for ent in doc.ents:
  print(ent.text, '---', ent.label_)

Apple --- ORG
U.K. --- GPE
$1 billion --- MONEY


### Sentence Segmentation

In [24]:
#doc.sents
for sent in doc.sents:
  print(sent)

Apple is looking at buying U.K. startup for $1 billion


In [26]:
doc = nlp("Welcome to KGP Talkie. Thanks for watching. Please like and subscribe")
for sent in doc.sents:
  print(sent)

Welcome to KGP Talkie.
Thanks for watching.
Please like and subscribe


In [0]:
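# How to write a custom rule
doc = nlp("Welcome to.*.KGP Talkie.*.Thanks for watching. Please like and subscribe")
def set_rule(doc):
  # Mark the token that follows every '.*.' token as the start of a new sentence
  for token in doc[:-1]:
    if token.text == '.*.':
      doc[token.i + 1].is_sent_start = True
  # Return once, after the loop has visited every token
  return doc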

In [33]:
for sent in doc.sents:
  print(sent)

Welcome to.*.KGP Talkie.*.Thanks for watching.
Please like and subscribe
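
The output above is not split at `.*.`. One likely reason, judging from the token listing in this notebook, is that the tokenizer keeps `.*.` glued to its neighbours (`to.*.KGP`), so the check `token.text == '.*.'` never matches. The sketch below shows the intended logic on plain Python lists, assuming the delimiter arrives as its own token. The helper names `mark_sentence_starts` and `split_sentences` are hypothetical and are not part of spaCy.

```python
def mark_sentence_starts(tokens, delimiter=".*."):
    """Return one boolean per token: True where a new sentence begins."""
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    # Skip the last token: nothing can start after it
    for i, tok in enumerate(tokens[:-1]):
        if tok == delimiter:
            starts[i + 1] = True
    return starts

def split_sentences(tokens, delimiter=".*."):
    """Group tokens into sentences using the start flags."""
    sentences = []
    for tok, is_start in zip(tokens, mark_sentence_starts(tokens, delimiter)):
        if is_start:
            sentences.append([])
        sentences[-1].append(tok)
    return [" ".join(s) for s in sentences]

# Delimiter as a separate token: the rule fires
separate = ["Welcome", "to", ".*.", "KGP", "Talkie", ".*.", "Thanks", "for", "watching"]
split_result = split_sentences(separate)

# Delimiter glued to its neighbours, as in the notebook's token listing: no match
glued = ["Welcome", "to.*.KGP", "Talkie.*.Thanks", "for", "watching"]
glued_result = split_sentences(glued)

print(split_result)
print(glued_result)
```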


In [35]:
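# We can add pipeline components to the nlp object!
# Note: passing the function itself is the spaCy v2 style; v3 expects the name of a registered component.
nlp.add_pipe(set_rule, before='parser')
# To remove a component, pass its name: nlp.remove_pipe('set_rule')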
for token in doc:
  print(token.text)

Welcome
to.*.KGP
Talkie.*.Thanks
for
watching
.
Please
like
and
subscribe
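
The token listing above is unchanged because `doc` was created before the component was added; a pipeline only runs on text passed to it afterwards. The following toy sketch illustrates that ordering idea and what `before=` means. It is a simplified stand-in written for this notebook, not spaCy's implementation: `ToyPipeline`, `tokenize`, `parser`, and the dictionary-based "doc" are all assumptions.

```python
class ToyPipeline:
    """A hypothetical, minimal pipeline: an ordered list of (name, function) pairs."""

    def __init__(self):
        self.components = []

    def add_pipe(self, func, name=None, before=None):
        name = name or func.__name__
        entry = (name, func)
        if before is None:
            self.components.append(entry)
        else:
            # Insert in front of the component called `before`
            names = [n for n, _ in self.components]
            self.components.insert(names.index(before), entry)

    def remove_pipe(self, name):
        self.components = [(n, f) for n, f in self.components if n != name]

    def __call__(self, text):
        doc = {"text": text, "history": []}
        for _, func in self.components:
            doc = func(doc)
        return doc

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    doc["history"].append("tokenize")
    return doc

def parser(doc):
    doc["history"].append("parser")
    return doc

def set_rule(doc):
    doc["history"].append("set_rule")
    return doc

pipeline = ToyPipeline()
pipeline.add_pipe(tokenize)
pipeline.add_pipe(parser)
pipeline.add_pipe(set_rule, before="parser")

pipe_names = [name for name, _ in pipeline.components]
result = pipeline("Welcome to KGP Talkie")  # processed after set_rule was added
print(pipe_names)
print(result["history"])

pipeline.remove_pipe("set_rule")
names_after = [name for name, _ in pipeline.components]
```

In the notebook itself, re-running `doc = nlp(...)` after `add_pipe` would be needed for the new component to see the text.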


## We can also use spaCy's visualization tool

In [36]:
from spacy import displacy

displacy.render(doc, style='dep')

[Output: raw SVG markup of the dependency-parse visualization for the sentence above; truncated in the source]

In [37]:
displacy.render(doc, style='dep',options={'compact':True})

[Output: raw SVG markup of the compact dependency-parse visualization; truncated in the source]