<a href="https://colab.research.google.com/github/Shindora/NLP-with-PyTorch/blob/master/Chapter2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Corpus, Tokens, Types**<br>
- All NLP methods, classic or modern, begin with a text dataset called a **corpus**. It usually contains raw text (in ASCII or UTF-8).
- The raw text is a sequence of characters (bytes), but it is usually useful to group those characters into contiguous units called **tokens**. In English, tokens correspond to words and numeric sequences separated by whitespace or punctuation.
- The process of breaking a text down into tokens is called **tokenization**.
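
To make the idea concrete, here is a minimal sketch that uses only Python's standard `re` module. The pattern and the `simple_tokenize` name are illustrative assumptions, not how spaCy or NLTK tokenize. It treats every run of word characters as one token and every punctuation character as its own token, so a contraction such as "don't" is split into three pieces.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word or a numeric sequence; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Mary, don't slap the green witch.".lower())
print(tokens)
# ['mary', ',', 'don', "'", 't', 'slap', 'the', 'green', 'witch', '.']
```

Library tokenizers add language-specific rules on top of this kind of splitting, which is why they can handle contractions, hashtags, and emoticons more sensibly.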

In [0]:
import spacy
# 'en' is the spaCy v2 shortcut; in spaCy v3+ load a named model such as 'en_core_web_sm'
nlp=spacy.load('en')
text="Mary, don't slap the green witch."
print([str(token) for token  in nlp(text.lower())]) 

['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch', '.']


In [0]:
from nltk.tokenize import TweetTokenizer
tweet=u"Snow White and the Seven Degrees #MakeAMovieCold@midnight:) "
tokenizer=TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':)']


**Types** are the unique tokens present in a corpus. The set of all types in a corpus is its vocabulary or
lexicon. Words can be divided into **content words** and **stopwords**. Stopwords, such as articles and
prepositions, serve mostly a grammatical purpose, like filler holding up the content words.
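
The distinction between tokens, types, and stopwords can be sketched in plain Python. The toy corpus and the very short stopword list below are assumptions made up for illustration; real stopword lists are much longer.

```python
# Toy corpus of already-tokenized sentences (hand-written for illustration).
corpus = [
    ['mary', 'slapped', 'the', 'green', 'witch', '.'],
    ['the', 'witch', 'slapped', 'mary', '.'],
]

# Hypothetical, deliberately tiny stopword list.
stopwords = {'the', 'a', 'an', 'of', 'in'}

tokens = [token for sentence in corpus for token in sentence]
vocabulary = set(tokens)  # the types: unique tokens in the corpus

# Content types: drop stopwords and non-alphabetic tokens such as punctuation.
content_types = sorted(t for t in vocabulary if t not in stopwords and t.isalpha())

print(len(tokens), len(vocabulary))  # 11 tokens, 6 types
print(content_types)                 # ['green', 'mary', 'slapped', 'witch']
```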

**Unigrams, Bigrams, Trigrams, …, N-grams**<br>
N-grams are fixed-length (n) sequences of consecutive tokens occurring in the text.

In [0]:
def n_grams(text,n):
  '''
  takes tokens or text, returns a list of n-grams
  '''
  return [text[i:i+n] for i in range (len(text)-n+1)]
cleaned=['mary',',',"n't",'slap','green','witch','.']
print(n_grams(cleaned,3))

[['mary', ',', "n't"], [',', "n't", 'slap'], ["n't", 'slap', 'green'], ['slap', 'green', 'witch'], ['green', 'witch', '.']]


**Lemmas and Stems**<br>
- **Lemmas** are the root forms of words. For example, "fly" can be inflected into many different words (flew, flies, flown, flying, ...), and "fly" is the lemma for all of these seemingly different words.<br>
It is often useful to reduce tokens to their lemmas to *keep the dimensionality of the vector representations low.*


In [0]:
nlp=spacy.load('en')
doc=nlp(u"he was running late")
for token in doc:
  print('{}-->{}'.format(token,token.lemma_))

he-->-PRON-
was-->be
running-->run
late-->late


**Stemming** is the poor man’s lemmatization.<br> It uses handcrafted rules to strip the endings
of words and reduce them to a common form called a stem.

In [0]:
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer() 
   
#doc=word_tokenize(u"Programers program with programing languages")
doc=['programers','program','with','programing','languages']   
for token in doc:
  print('{}-->{}'.format(token,ps.stem(token)))

programers-->program
program-->program
with-->with
programing-->program
languages-->languag
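
To show what "handcrafted rules" means in practice, here is a toy suffix-stripping sketch. It is not the Porter algorithm: the suffix list, the `toy_stem` name, and the minimum stem length are assumptions chosen only so that the example stays short.

```python
# Hypothetical suffix rules, checked in order; this is NOT the Porter algorithm.
SUFFIXES = ['ing', 'ers', 'er', 'es', 's']

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

words = ['programers', 'program', 'with', 'programing', 'languages']
stems = [toy_stem(w) for w in words]
print(stems)  # ['program', 'program', 'with', 'program', 'languag']
```

Rules this crude overstrip easily: `toy_stem('string')` returns `'str'`. Real stemmers add conditions on the remaining stem to limit such errors.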


**Categorizing sentences and documents**<br>
Problems such as assigning topic
labels, predicting the sentiment of reviews, filtering spam emails, identifying languages, and triaging
email can all be framed as supervised document classification (categorization) problems.
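
As one possible illustration of the supervised framing, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing on a hand-written spam/ham set. The classifier choice, the training sentences, and the `fit` and `predict` names are all assumptions for illustration; any supervised classifier could take its place.

```python
import math
from collections import Counter, defaultdict

# Hand-written toy training set (an assumption for illustration, not real data).
train = [
    ("win a free prize now", "spam"),
    ("free money win now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]

def fit(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def predict(text, label_counts, word_counts, vocab):
    """Return the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("free prize now", *model))          # spam
print(predict("monday project meeting", *model))  # ham
```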

**Categorizing words: POS Tagging**<br>
We can extend the concept of labeling from documents to individual words or tokens. A common
example of categorizing words is part-of-speech (POS) tagging.

In [0]:
doc=nlp(u"Mary slapped the green witch.")
for token in doc:
  print("{}->{}".format(token,token.pos_))

Mary->PROPN
slapped->VERB
the->DET
green->ADJ
witch->NOUN
.->PUNCT


**Categorizing Spans: Chunking and Named Entity Recognition**<br>
We often need to label a span of text, that is, a contiguous run of one or more tokens.<br>
For example, "Mary slapped the green witch." -> [NP Mary] [VP slapped] [NP the green witch]. This is called *chunking* or *shallow parsing*.<br>
*Shallow parsing* aims to derive higher-order units
composed of the grammatical atoms, like nouns, verbs, adjectives, and so on. If you do not have
data to train models for shallow parsing, you can approximate it by writing
regular expressions over the part-of-speech tags.

In [0]:
doc=nlp(u'Mary slapped the green witch.')
for chunk in doc.noun_chunks:
  print("{}->{}".format(chunk,chunk.label_))

Mary->NP
the green witch->NP
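
The regular-expression approach mentioned above can be sketched without a trained chunker. The example below reuses the (token, tag) pairs from the POS tagging output earlier in this notebook. The noun-phrase rule (optional determiner, any adjectives, one or more nouns) is a hypothetical, deliberately simple pattern, not spaCy's own logic.

```python
import re

# (token, POS) pairs copied from the POS tagging output earlier in this notebook.
tagged = [('Mary', 'PROPN'), ('slapped', 'VERB'), ('the', 'DET'),
          ('green', 'ADJ'), ('witch', 'NOUN'), ('.', 'PUNCT')]

# Encode the tag sequence as a string with one space-terminated tag per token.
tag_string = ''.join(pos + ' ' for _, pos in tagged)

# Hypothetical noun-phrase rule: optional determiner, any adjectives, one or more nouns.
np_pattern = re.compile(r'\b(?:DET )?(?:ADJ )*(?:(?:NOUN|PROPN) )+')

chunks = []
for match in np_pattern.finditer(tag_string):
    # Convert character offsets back to token indices by counting spaces.
    start = tag_string[:match.start()].count(' ')
    end = tag_string[:match.end()].count(' ')
    chunks.append(' '.join(token for token, _ in tagged[start:end]))

print(chunks)  # ['Mary', 'the green witch']
```

Matching over the tag string rather than the words keeps the rule independent of vocabulary: any determiner, adjective, and noun sequence is caught by the same pattern.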


**Structure of sentences**<br>
Whereas shallow parsing identifies phrasal units, the task of identifying the relationship between them
is called parsing.<br>
Parse trees indicate how different grammatical units in a sentence are related hierarchically.
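
One simple way to picture this hierarchy is a nested data structure. The tree below is written by hand for the example sentence and is an illustrative assumption, not the output of a parser. Each node is a tuple of a label followed by its children, and each leaf is a plain string.

```python
# Hand-written constituency tree for "Mary slapped the green witch"
# (an illustrative assumption, not the output of a parser).
tree = ('S',
        ('NP', 'Mary'),
        ('VP', 'slapped',
         ('NP', 'the', 'green', 'witch')))

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def show(node, depth=0):
    """Print one node per line, indented by its depth in the tree."""
    if isinstance(node, str):
        print('  ' * depth + node)
    else:
        print('  ' * depth + node[0])
        for child in node[1:]:
            show(child, depth + 1)

show(tree)
print(leaves(tree))     # ['Mary', 'slapped', 'the', 'green', 'witch']
print(leaves(tree[2]))  # words under the VP: ['slapped', 'the', 'green', 'witch']
```

Unlike the flat chunks from shallow parsing, the object noun phrase here sits inside the verb phrase, which is the kind of relationship a full parse records.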