## NLP

    Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

**Topics to learn**

1. [Tokenization](#Tokenization) (Pre-trained or Unsupervised)
    - [Word Tokenize](#NLTK-word-tokenizer-nltk.word_tokenize())
    - [Sentence Tokenize](#NLTK-sentence-tokenizer-nltk.sent_tokenize())
    - [Token Span](#NLTK-token-span-WhitespaceTokenizer)
    - [Text Processing using PlaintextCorpusReader](#Simple-Text-Processing-NLTK)
        - [Methods](#Methods-in-PlaintextCorpusReader)
    - [Using NLTK Text module](#Using-the-Text-method-to-analyze-the-text-nltk.text.Text)
        - [Methods](#Methods)
2. Stemming and Stop Words
3. Lemmatization and POS tagging
4. NER, Chunking and Chinking
5. Word Cloud

### Tokenization

    The conversion of a string of words to a list of words is called tokenization.
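
As a minimal sketch of the idea, using only Python's standard library rather than NLTK, the snippet below compares a plain whitespace split with a small regular expression that also separates punctuation. The pattern and variable names are illustrative assumptions, not how NLTK tokenizes.

```python
import re

text = "Thou canst not, then."

# Splitting on whitespace leaves punctuation attached to the words.
whitespace_tokens = text.split()

# Illustrative pattern: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Thou', 'canst', 'not,', 'then.']
print(regex_tokens)       # ['Thou', 'canst', 'not', ',', 'then', '.']
```

The difference between the two lists is the reason dedicated tokenizers exist: deciding where punctuation and contractions belong is a design choice.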

Resources:
1. <a href="http://www.tulane.edu/~howard/NLP/nlp.html#tokenization-again"> Howard </a>

#### NLTK word tokenizer ```nltk.word_tokenize() ```

    Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved 
    TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameters
- text (str) – text to split into words
- language (str) – the model name in the Punkt corpus
- preserve_line (bool) – An option to keep the text as a single unit and not sentence-tokenize it first.

Points
- It splits standard contractions, e.g. “don’t” -> “do”, “n’t” and “they’ll” -> “they”, “’ll”.
- It treats most punctuation characters as separate tokens.
- It splits off commas and single quotes, when followed by whitespace.
- It separates periods that appear at the end of a line.
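
The points above can be checked with a short sketch. It assumes `TreebankWordTokenizer` as a stand-in for `nltk.word_tokenize()`, because the plain Treebank tokenizer needs no model download; `word_tokenize()` builds on an improved version and first splits sentences, so its output may differ on other inputs. The sample sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# Assumption: the plain Treebank tokenizer stands in for nltk.word_tokenize here.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("They'll agree, don't worry.")

print(tokens)
# ['They', "'ll", 'agree', ',', 'do', "n't", 'worry', '.']
```

Both contractions are split, the comma becomes its own token, and the final period is separated from the last word.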

#### NLTK sentence tokenizer ```nltk.sent_tokenize()``` 

    Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently 
    PunktSentenceTokenizer for the specified language).

Parameters
- text – text to split into sentences
- language – the model name in the Punkt corpus
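
A rough sketch of sentence splitting follows. It assumes an untrained `PunktSentenceTokenizer` with default parameters so that no model download is needed; the pre-trained model that `nltk.sent_tokenize()` loads knows abbreviations and collocations and may split other texts differently. The sample text is made up for illustration.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Assumption: an untrained Punkt tokenizer, not the pre-trained model
# that nltk.sent_tokenize() loads for a given language.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize("This above all. To thine own self be true.")

for sentence in sentences:
    print(sentence)
```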

#### NLTK token-span ```WhitespaceTokenizer```

    NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string 
    slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
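
To illustrate the slice semantics, here is a small sketch on a made-up string: each `(start, end)` tuple can be used directly to slice the original text and recover the token.

```python
from nltk.tokenize import WhitespaceTokenizer

text = "to thine own self"

# span_tokenize is a generator, so wrap it in list() to see the spans.
spans = list(WhitespaceTokenizer().span_tokenize(text))
print(spans)  # [(0, 2), (3, 8), (9, 12), (13, 17)]

# Each span slices the original string back to its token.
tokens = [text[start:end] for start, end in spans]
print(tokens)  # ['to', 'thine', 'own', 'self']
```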
    

In [None]:
import nltk
S = '''This above all: to thine own self be true, 
    And it must follow, as the night the day, 
    Thou canst not then be false to any man.'''

In [None]:
# Word Tokenizer

nltk.word_tokenize(S) 

In [None]:
# Sentence Tokenizer

nltk.sent_tokenize(S) 

In [None]:
nltk.WhitespaceTokenizer().span_tokenize(S) # Returns a generator of (start, end) spans

In [None]:
list(nltk.WhitespaceTokenizer().span_tokenize(S))

In [None]:
nltk.download() # Opens the interactive downloader; word_tokenize and sent_tokenize need the Punkt models

#### Simple Text Processing NLTK

    One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text 
    amenable to computational analysis. It does so by including a module of corpus readers, which pre-process 
    files for certain tasks or formats. Most of them are specialized for particular corpora, so we will start 
    with the basic one, called the PlaintextCorpusReader

In [None]:
from nltk.corpus import PlaintextCorpusReader
Reader = PlaintextCorpusReader('./data/', 'review.txt', encoding='utf-8')

In [None]:
Words = Reader.words()
print(len(Words))
Words[:50]

#### Methods in PlaintextCorpusReader
- **Reader.raw()** # Returns the contents of the file as a single string
- **Reader.sents()** # Tokenizes the text into a list of sentences, each of which is a list of strings
- **Reader.fileids()** # Returns the list of file identifiers that the reader covers
- **Reader.abspath('review.txt')** # Returns a FileSystemPathPointer to that file
- **Reader.root** # The FileSystemPathPointer to the corpus root directory (here `./data/`)
- **Reader.encoding('review.txt')** # Returns the encoding of the file being read
- **Reader.readme()** # Returns the README for the corpus, which is absent in this case
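
A few of these methods can be tried on a throwaway corpus. The sketch below assumes a hypothetical file `sample.txt` written to a temporary directory, so it does not depend on `./data/review.txt`; `sents()` is left out because it needs the Punkt models to be downloaded.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical corpus: one small text file in a temporary directory.
with tempfile.TemporaryDirectory() as corpus_dir:
    with open(os.path.join(corpus_dir, "sample.txt"), "w", encoding="utf-8") as f:
        f.write("Hello, world! Tokenize me.")

    reader = PlaintextCorpusReader(corpus_dir, r".*\.txt", encoding="utf-8")
    file_ids = reader.fileids()                     # identifiers relative to the root
    raw_text = reader.raw()                         # file contents as one string
    words = list(reader.words())                    # tokens from the default word tokenizer
    file_encoding = reader.encoding("sample.txt")   # encoding declared for that file

print(file_ids)
print(words)
print(file_encoding)
```

The second argument of the constructor may be a single file name, as in the cell above, or a regular expression matching several files, as in this sketch.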

#### Using the ```Text``` method to analyze the text ```nltk.text.Text```

    The methods of the Text class provide a shortcut to text analysis.

#### Methods

 
- collocations(num=20, window_size=2)
        
        A collocation is a group of words that occur together frequently in a text.
        Print collocations derived from the text, ignoring stopwords.

- collocation_list(num=20, window_size=2)
        
        Return collocations derived from the text, ignoring stopwords.

- common_contexts(words, num=20)
 
        Find contexts where the specified words appear; list most frequent common contexts first.

- concordance(word, width=79, lines=25)

        It is often helpful to know the context of a word. The concordance view prints a certain number of 
        characters before and after every occurrence of a given word.
    
- concordance_list(word, width=79, lines=25)
    
        Generate a concordance for "word" with the specified context window.
        Word matching is not case-sensitive.

- similar(word, num=20)
    
        Distributional similarity: find other words which appear in the
        same contexts as the specified word; list most similar words first.
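
As a small sketch of how these methods behave, the snippet below builds a `Text` from a made-up token list (an assumption for illustration; a real `Text` is usually built from a corpus reader) and uses `concordance_list()`, which returns its matches instead of printing them.

```python
from nltk.text import Text

# Hypothetical toy token list, chosen so the matches are easy to count.
tokens = "the cat sat on the mat and the cat slept".split()
text = Text(tokens)

# Each returned line records the matched word and its token position.
lines = text.concordance_list("cat")
offsets = [line.offset for line in lines]

print(offsets)            # token positions of each match
print(text.count("cat"))  # number of occurrences of the word
```

`collocations()` and `collocation_list()` are not shown here because they rely on the NLTK stopwords corpus being downloaded.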
 

In [None]:
from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

def textLoader(doc, loc='', encoding='utf-8'):
    """Read the file `doc` under directory `loc` and wrap its words in an nltk Text."""
    return Text(PlaintextCorpusReader(loc, doc, encoding=encoding).words())

In [None]:
review = textLoader('review.txt', './data')

In [None]:
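# review.collocations() fails in some NLTK releases (a known bug),
# so print the result of collocation_list() instead.
# Depending on the NLTK version, each item is a string or a tuple of words.
print('; '.join(c if isinstance(c, str) else ' '.join(c) for c in review.collocation_list()))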

In [None]:
review.common_contexts(['jackie'])

In [None]:
review.concordance('jackie')

In [None]:
review.similar('sahara')