# NLP

Definition (source: Wikipedia)
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

## NLTK

The **Natural Language Toolkit**, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.[4] NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit,[5] plus a cookbook.[6] (Source: Wikipedia)

Further information
- https://github.com/nltk/nltk
- http://www.nltk.org/

## Input Data ##

- **Token**: a unit of text. Depending on the task, tokens can be paragraphs, sentences, or individual words.

In [6]:
text = "Here is an exercise. You will need to do all preprocessing steps on it."

## Data Preparation Steps ##

- **Tokenisation**: splitting a text into smaller units (tokens).

### Step 1: Tokenisation  ###

In [7]:
# The nltk library must be installed on your machine (for example with: pip install nltk)

In [8]:
import nltk
# Split the text into tokens on whitespace
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenized_sentene = tokenizer.tokenize(text)

In [9]:
# WhitespaceTokenizer splits on whitespace (space, tab, newline), so punctuation stays attached to the neighbouring word
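Splitting on whitespace leaves punctuation attached to words. The sketch below is not part of the original notebook: it uses only the standard library and compares a plain whitespace split with an assumed regular expression that separates runs of word characters from runs of punctuation. The variable names are illustrative.

```python
import re

text = "Here is an exercise. You will need to do all preprocessing steps on it."

# Whitespace splitting keeps punctuation glued to words ("exercise.", "it.").
whitespace_tokens = text.split()

# Assumed alternative pattern: runs of word characters and runs of
# punctuation become separate tokens.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(punct_aware_tokens)
```

Which behaviour is preferable depends on the later steps: a stemmer, for instance, will not recognise a suffix that is followed by a full stop.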

### Step 2: Stemming ###


**Stem definition**
In linguistics, a stem is a part of a word. The term is used with slightly different meanings depending on the morphology of the language in question. In Athabaskan linguistics, for example, a verb stem is a root that cannot appear on its own, and that carries the tone of the word. Athabaskan verbs typically have two stems in this analysis, each preceded by prefixes.

A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty as based on that stem. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word: for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.
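The Porter examples from this paragraph can be checked directly with NLTK's PorterStemmer, which needs no extra data download. This is a small sketch; the list names are illustrative and not part of the original notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Word forms taken from the paragraph above.
argue_forms = ["argue", "argued", "argues", "arguing", "argus"]
fish_forms = ["fishing", "fished"]

argue_stems = [stemmer.stem(word) for word in argue_forms]
fish_stems = [stemmer.stem(word) for word in fish_forms]

print(argue_stems)  # every form maps to the non-word stem "argu"
print(fish_stems)   # both forms map to "fish"
```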


The stem of the verb wait is wait: it is the part that is common to all its inflected variants.
- wait (infinitive)
- wait (imperative)
- waits (present, 3rd person, singular)
- wait (present, other persons and/or plural)
- waited (simple past)
- waited (past participle)
- waiting (progressive)

Source: Wikipedia.
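For a regular verb such as wait, the idea of "the part common to all inflected variants" can be illustrated very literally by taking the longest shared prefix of the forms. This is only a naive illustration using the standard library, not how the Porter stemmer works: real stemmers apply suffix rules, and a shared prefix fails for irregular forms.

```python
import os

# Inflected forms of "wait" from the list above.
wait_forms = ["wait", "waits", "waited", "waiting"]

# commonprefix compares strings character by character.
shared_prefix = os.path.commonprefix(wait_forms)

print(shared_prefix)  # wait
```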

In [10]:
import nltk
from nltk.stem import PorterStemmer
# Reduce each token to its stem with the Porter algorithm
stemmer = PorterStemmer()
for token in tokenized_sentene:
    print(stemmer.stem(token))

here
is
an
exercise.
you
will
need
to
do
all
preprocess
step
on
it.


### Step 3: Lemmatization ###

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.[1]

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.

In [5]:
import nltk
nltk.download('wordnet')  # Downloads the WordNet data. On MS Azure we had to do this manually. It takes a while.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for token in tokenized_sentene:
    print(lemmatizer.lemmatize(token))


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Maria\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


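NameError: name 'tokenized_sentene' is not defined

Note: this error appears because the lemmatization cell was executed before the tokenisation cell that defines tokenized_sentene (its execution count is In [5], lower than that of the tokenisation cell). Run the cells in order to get the lemmatized tokens.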

**WordNet** is a lexical database of English.

https://www.nltk.org/howto/wordnet.html

Further information
- nltk.download('wordnet') downloads the WordNet data that the lemmatizer uses as its dictionary.
- https://wordnet.princeton.edu/

### Stemming vs Lemmatization ###

Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research.

### More normalization steps (optional) ###

In [None]:
# Lowercase all tokens in tokenized_sentene
for token in tokenized_sentene:
    print(token.lower())
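Another possible normalization step, sketched here under the assumption that punctuation is not needed downstream, is to strip leading and trailing punctuation while lowercasing. The token list is written out by hand so the example runs on its own; it matches the whitespace tokens of the exercise sentence.

```python
import string

tokens = ["Here", "is", "an", "exercise.", "You", "will", "need", "to",
          "do", "all", "preprocessing", "steps", "on", "it."]

# Lowercase each token and strip punctuation from both ends.
normalized = [token.lower().strip(string.punctuation) for token in tokens]

# Drop tokens that consisted only of punctuation (none in this sentence).
normalized = [token for token in normalized if token]

print(normalized)
```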
