### *Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence that also draws on information engineering and human-computer interaction. It focuses on processing and analyzing large amounts of natural language data efficiently. This is hard because reading and understanding language is far more complex than it seems at first glance.*

*Freely Available Pre-trained Models*:
1. fastText
2. TensorFlow Hub
3. Hugging Face

#### Information Extraction (IE)
*Power of RegEx*
- A regular expression (RegEx) is a sequence of characters that defines a search pattern
- Helpful Links:
    - [Pythex](https://pythex.org/)
    - [regular expression 101](https://regex101.com/)

In [1]:
import re
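
As a small illustration of a search pattern used for information extraction, the sketch below pulls e-mail-like strings and ISO-style dates out of a sentence. The sample text and both patterns are assumptions made up for this example, and the e-mail pattern is deliberately simplified rather than a complete validator.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "Contact ahmed@example.com or sara@example.org before 2024-05-17."

# Simplified e-mail pattern (an assumption, not a full address validator).
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# ISO-style date: four digits, two digits, two digits, captured as groups.
date_pattern = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

emails = email_pattern.findall(sample)
dates = date_pattern.findall(sample)  # list of (year, month, day) tuples

print(emails)  # ['ahmed@example.com', 'sara@example.org']
print(dates)   # [('2024', '05', '17')]
```

Because the date pattern contains capture groups, `findall` returns tuples of the groups instead of the whole match.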

#### Tokenization
- A fundamental step in the NLP pipeline.
- Divides a text (string) input into smaller units known as tokens.
- A tokenizer segments unstructured natural language text into distinct chunks, each treated as a separate element.
- Tokens can be words, characters, subwords, or sentences.
- It makes text easier for different models to interpret.
- Applications: text processing, language modelling, machine translation, and many other NLP tasks.

<img src='Tokenization-in-Natural-Language-Processing.png'>

*Types of Tokenization*:
- Word Tokenization : the text is divided into individual words.
- Character Tokenization : the text is split into a sequence of individual characters.
- Sentence Tokenization : a paragraph or larger body of text is divided into separate sentences, each treated as a token.
- Subword Tokenization : strikes a balance between word and character tokenization by breaking text into units that are larger than a single character but smaller than a full word.
- N-gram Tokenization : the text is split into contiguous sequences of n items (words or characters), where n is fixed.
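
The types above can be sketched with the standard library alone. This is a rough illustration, not how NLTK implements them: the sample text, the regular expressions, and the `ngrams` helper are assumptions chosen for this example. Subword tokenization is left out because it needs a learned vocabulary.

```python
import re

text = "Hello everyone. I hope you are well."

# Word tokenization: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character becomes a token.
chars = list("Hello")

# Sentence tokenization (naive): split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

def ngrams(tokens, n):
    """Return all contiguous windows of n tokens (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# N-gram tokenization: here word bigrams (n = 2).
bigrams = ngrams(["I", "hope", "you", "are", "well"], 2)

print(words)      # ['Hello', 'everyone', '.', 'I', 'hope', 'you', 'are', 'well', '.']
print(chars)      # ['H', 'e', 'l', 'l', 'o']
print(sentences)  # ['Hello everyone.', 'I hope you are well.']
print(bigrams)    # [('I', 'hope'), ('hope', 'you'), ('you', 'are'), ('are', 'well')]
```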

*Popular Approaches of Tokenization*:
- Whitespace Tokenization
- Statistical Tokenization
- Transformer-based Tokenization
- Rule-based Tokenization
- Byte-pair Tokenization
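
To give a feel for the byte-pair approach, here is a toy sketch of its merge-learning loop: repeatedly find the most frequent adjacent pair of symbols and fuse it into one symbol. The corpus, the frequencies, and the helper names are made up for illustration, it works on characters rather than raw bytes, and it is not meant to mirror any particular library's implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (invented): each word as a tuple of characters, with a frequency.
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = merge_pair(vocab, pair)

print(merges)
print(vocab)
```

After a few merges, frequent character sequences such as `est` become single symbols, which is how subword units emerge from the data.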

*Sentence_Tokenizer*

In [5]:
# Sentence Tokenization using sent_tokenize
from nltk.tokenize import sent_tokenize

text = "Hello everyone. I am Ahmed Akram Amer. I hope , i will be a successful."
sent_tokenize(text=text, language='english')

# sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module.
# The tokenizer has already been trained, so it knows which characters and punctuation mark the beginning and end of a sentence.

['Hello everyone.',
 'I am Ahmed Akram Amer.',
 'I hope , i will be a successful.']
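
To see why a trained tokenizer is used here instead of a plain rule, consider a naive splitter that breaks after every sentence-ending punctuation mark. The function name and sample sentence below are assumptions for illustration only.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Amer arrived. He was early."
pieces = naive_sentence_split(sample)

print(pieces)  # ['Dr.', 'Amer arrived.', 'He was early.']
```

The rule wrongly treats the abbreviation `Dr.` as a sentence of its own. Handling abbreviations like this is the kind of case a trained sentence tokenizer such as Punkt is designed for.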

In [None]:
# Sentence Tokenization using PunktSentenceTokenizer
import nltk.data

# Load the pre-trained PunktSentenceTokenizer from the English pickle file
# (assumes an NLTK version and data package that still ship Punkt as pickle files)
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)

# Text in other languages can be tokenized by loading the pickle file for that language instead of the English one.

['Hello everyone.',
 'I am Ahmed Akram Amer.',
 'I hope , i will be a successful.']

In [8]:
# Tokenize a Spanish text into sentences using pre-trained Punkt tokenizer for Spanish.

spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

['Hola amigo.', 'Estoy bien.']

*Word_Tokenizer*

In [None]:
# Word Tokenization using word_tokenize
from nltk.tokenize import word_tokenize

text = "Hello, World From Egypt!\nHello!"
word_tokenize(text=text, language='english', preserve_line=False)

# word_tokenize is a wrapper function: unless preserve_line=True, it first splits the text into sentences,
# then applies a Treebank-style word tokenizer to each sentence.

['Hello', ',', 'World', 'From', 'Egypt', '!', 'Hello', '!']

In [17]:
# Word Tokenization using TreebankWordTokenizer
# (the tokenizer class, not TreebankWordDetokenizer, which joins tokens back into a string)
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text, convert_parentheses=False)

['Hello', ',', 'World', 'From', 'Egypt', '!', 'Hello', '!']

In [18]:
# Word Tokenization using WordPunctTokenizer
# splits words based on punctuation boundaries.
# Each punctuation mark is treated as a separate token.

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
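
The behaviour above can be reproduced with the `re` module alone. As far as the NLTK documentation describes it, WordPunctTokenizer is built on the pattern `\w+|[^\w\s]+`; the sketch below applies that pattern directly, and the second sample string is made up for illustration.

```python
import re

# A run of word characters, or a run of characters that are neither word nor space.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_PUNCT.findall("Let's see how it's working.")
print(tokens)  # ['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

# Adjacent punctuation marks stay together because of the trailing '+'.
grouped = WORD_PUNCT.findall("Wait... what?!")
print(grouped)  # ['Wait', '...', 'what', '?!']
```

So a punctuation mark becomes its own token only when it is not directly next to another punctuation character.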

In [19]:
# Word Tokenization using Regular Expression
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(pattern=r'\w+')

tokenizer.tokenize('Cristiano Ronaldo is the Best Player in the World, Hey CR7!')

['Cristiano',
 'Ronaldo',
 'is',
 'the',
 'Best',
 'Player',
 'in',
 'the',
 'World',
 'Hey',
 'CR7']
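
With the pattern `\w+`, only runs of letters, digits, and underscores are kept, so punctuation is dropped and contractions are split at the apostrophe. The sketch below uses plain `re.findall` to show this and one possible alternative pattern that keeps contractions together. The sample sentence and the alternative pattern are assumptions for illustration.

```python
import re

sentence = "Don't stop, CR7!"

# Same pattern as in the cell above: punctuation disappears, "Don't" is split.
plain = re.findall(r"\w+", sentence)
print(plain)  # ['Don', 't', 'stop', 'CR7']

# Alternative: allow an apostrophe followed by more word characters inside a token.
keep_contractions = re.findall(r"\w+(?:'\w+)*", sentence)
print(keep_contractions)  # ["Don't", 'stop', 'CR7']
```

Which pattern is appropriate depends on whether downstream steps need punctuation or contractions preserved.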