# Natural Language Processing
Natural Language Processing (NLP) is the subfield of computer science that studies how computers can process and analyze human language. For example, in NLP we examine documents to extract meaningful information and train computers to run algorithms that find this information automatically. Examples of NLP tasks are:
-   Text processing
-   Speech recognition
-   Speech synthesis

This notebook focuses on **text processing** and gives a brief introduction to it. The fundamental steps of text processing in NLP are: stop word removal, tokenization, stemming, and part-of-speech tagging.

## Stop words
Some common words carry little useful information most of the time and can be removed from the text. These words are called **stop words**. The list can also be defined by the user, depending on the goal of the information extraction. In Python, well-known packages such as `spaCy` and `NLTK` provide ready-made stop word lists.
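As a minimal sketch of user-defined stop words, the snippet below filters a sentence against a small hand-written set. The names `CUSTOM_STOP_WORDS` and `remove_stop_words` are hypothetical and chosen only for illustration, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
# Hypothetical custom stop word list; choose the words to suit your own task.
CUSTOM_STOP_WORDS = {"the", "is", "a", "of", "in", "and", "to"}

def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Return the words of `text` that are not stop words (case-insensitive)."""
    words = text.split()  # naive whitespace split, for illustration only
    return [w for w in words if w.lower() not in stop_words]

filtered = remove_stop_words("The cat is in the garden")
print(filtered)  # ['cat', 'garden']
```

The same function would also accept a library-provided list, such as the NLTK or spaCy sets, passed as `stop_words`.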

In [2]:
# Stop words in NLTK
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # uncomment on first use to fetch the corpus
print("Here is the list of stop words in NLTK package:")
set(stopwords.words('english'))

Here is the list of stop words in NLTK package:


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [6]:
# stop words in spaCy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
print("Below is the list of stop words in spaCy:\n")
list(STOP_WORDS)

Below is the list of stop words in spaCy:



['show',
 'would',
 'within',
 'via',
 'thus',
 'serious',
 '’ll',
 'next',
 'ours',
 'across',
 'also',
 'therefore',
 'whenever',
 'why',
 'per',
 'three',
 'more',
 'behind',
 'fifteen',
 'had',
 'twelve',
 'toward',
 'others',
 'already',
 'nothing',
 'eight',
 'on',
 'too',
 'ever',
 'same',
 'there',
 'where',
 'rather',
 'thereafter',
 'that',
 'whereupon',
 'much',
 'doing',
 'other',
 'anyone',
 'whole',
 'beyond',
 'should',
 'out',
 'five',
 'part',
 'except',
 'myself',
 'your',
 'hereby',
 'someone',
 'yourselves',
 'so',
 'moreover',
 'will',
 'one',
 'only',
 'fifty',
 'yours',
 'or',
 'every',
 'thence',
 'two',
 'at',
 'done',
 "'d",
 '‘re',
 'between',
 '‘ll',
 'above',
 "'re",
 'since',
 'otherwise',
 'sometimes',
 'thereupon',
 'perhaps',
 'noone',
 '‘ve',
 'might',
 'both',
 'wherever',
 'all',
 'call',
 'n’t',
 'ten',
 'yourself',
 'four',
 'front',
 'seems',
 'mostly',
 'among',
 'becoming',
 'never',
 'twenty',
 'for',
 'latterly',
 'first',
 'made',
 'latter',


## Tokenization
Splitting text into sentences and words is called tokenization. Tokens can also group several consecutive words together: a sequence of n consecutive words is called an n-gram. For example, when each pair of adjacent words in a sentence is taken together, the result is a set of 2-grams (bigrams).
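To make the n-gram idea concrete, here is a small sketch in plain Python. The helper name `ngrams` is hypothetical, and the input is assumed to be already split into lowercase words.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "this sentence will be tokenized here".split()
bigrams = ngrams(words, 2)
print(bigrams)
# [('this', 'sentence'), ('sentence', 'will'), ('will', 'be'),
#  ('be', 'tokenized'), ('tokenized', 'here')]
```

A sentence with k words yields k - n + 1 n-grams, so the six words here give five 2-grams.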

In [7]:
# Tokenizing with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This sentence will be tokenized here.")

for token in doc:
    print(token.text)

This
sentence
will
be
tokenized
here
.


For more information about tokenization, see the [spaCy](https://spacy.io/usage/spacy-101) and [NLTK](https://www.nltk.org/api/nltk.tokenize.html) documentation.

In [16]:
# Tokenizing with NLTK
from nltk.tokenize import word_tokenize
text = ["This sentence will be tokenized here."]
word_tokenize(text[0])

['This', 'sentence', 'will', 'be', 'tokenized', 'here', '.']

In [19]:
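# Extracting 2-grams with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Keep at most 10 features, each a pair of adjacent words (2-grams)
counter = CountVectorizer(max_features=10,
                          ngram_range=(2, 2))

counter.fit(text)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("2-grams:\n", counter.get_feature_names_out().tolist())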

2-grams:
 ['be tokenized', 'sentence will', 'this sentence', 'tokenized here', 'will be']


## Stemming & Lemmatization
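Many words share the same root, for example buy, buys, buying, and bought. **Stemming** is the process of reducing words to their root (stem), usually by stripping suffixes with fixed rules. As a result, a stemmer may leave irregular forms such as bought unchanged and may produce stems that are not real words.
**Lemmatization** maps each inflected form of a word to its dictionary form (lemma), so that all forms of the word are analyzed as a single item.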
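The difference can be sketched with two deliberately tiny toy functions. Neither is the Porter algorithm or spaCy's lemmatizer: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical and exist only to illustrate rule-based stripping versus dictionary lookup.

```python
# Toy rule-based stemmer: strip one known suffix if enough of the word remains.
SUFFIXES = ("ing", "s")  # assumed, far smaller than any real rule set

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Toy lemmatizer: look the word up in a hand-written dictionary of lemmas.
TOY_LEMMAS = {"buys": "buy", "buying": "buy", "bought": "buy"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["buy", "buys", "buying", "bought"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# buy buy buy
# buys buy buy
# buying buy buy
# bought bought buy
```

The irregular form bought is left unchanged by the suffix rules, while the dictionary lookup maps it to buy.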

In [9]:
nlp = spacy.load("en_core_web_sm")

doc = nlp('This sentence needs(needed, need) to be lemmitized.')
for token in doc:
    print(token.text, token.lemma_)

This this
sentence sentence
needs(needed needs(needed
, ,
need need
) )
to to
be be
lemmitized lemmitize
. .


In [12]:
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
text = "walk, walking, walks, walked, went, go, goes"

# Tokenize the string, then print the Porter stem of every token (commas included)
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, stemmer.stem(w)))


Stemming for walk is walk
Stemming for , is ,
Stemming for walking is walk
Stemming for , is ,
Stemming for walks is walk
Stemming for , is ,
Stemming for walked is walk
Stemming for , is ,
Stemming for went is went
Stemming for , is ,
Stemming for go is go
Stemming for , is ,
Stemming for goes is goe
