# Advanced Text Preprocessing
1. **Text Cleaning**
    - Removing digits and words containing digits
    - Removing newline characters and extra spaces
    - Removing HTML tags
    - Removing URLs
    - Removing punctuation
    

2. **Basic Text Preprocessing**
    - Case folding
    - Expand contractions
    - Chat word treatment
    - Handle emojis
    - Spelling correction
    - Tokenization
    - Creating N-grams
    - Stop word removal
 
 
3. **Advanced Preprocessing**
    - Stemming
    - Lemmatization
    - POS tagging
    - NER
    - Parsing
    - Coreference Resolution

## 1) Stemming
- ***Word Normalization***
     - Case Folding
     - Stemming
     - Lemmatization
     >- ***Examples*** 
          - Consult      --> Consult
          - Consultant   --> Consult
          - Consultants  --> Consult
          - Consulting   --> Consult
          - Consultative --> Consult

- Reducing multiple word forms to a common root shrinks the number of distinct words (the vocabulary) while largely preserving the meaning of the sentence. A smaller vocabulary usually helps the performance of machine learning algorithms
- Both stemming and lemmatization aim to reduce terms to a base form, and both are heavily used in information retrieval
- ***Stemming*** algorithms use fixed rules, such as cutting off a prefix or suffix, to derive the base/root word. They do so even if the resulting stem is not a valid word in the language. Stemming is fast because it cuts words without considering their context.
>- ***Examples***
      - Studies --> Studi
      - Cries   --> Cri
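
The effect described above can be sketched with a deliberately tiny suffix-stripping stemmer. The rule list `SUFFIX_RULES` and the helper `toy_stem` are hypothetical names invented for this illustration. They are far cruder than Porter's algorithm and only show how fixed rules shrink the vocabulary while sometimes producing non-words.

```python
# Toy suffix rules, checked in order (an assumption for illustration, not Porter's rules)
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def toy_stem(word: str) -> str:
    """Lowercase the word and apply the first suffix rule that matches."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two remaining characters so very short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["Studies", "Cries", "waiting", "waited", "waits", "wait"]
stems = [toy_stem(w) for w in words]

print(stems)  # ['studi', 'cri', 'wait', 'wait', 'wait', 'wait']
# Distinct words before and after stemming
print(len({w.lower() for w in words}), "->", len(set(stems)))  # 6 -> 3
```

Six distinct surface forms collapse to three stems, and two of those stems (`studi`, `cri`) are not dictionary words, which is exactly the trade-off stemming makes.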

***Stemming Algorithms***
1) Porter Stemmer
2) Snowball Stemmer
3) Lancaster Stemmer
4) Regex-based Stemmer
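
As a rough idea of what a regex-based stemmer does, the sketch below strips whatever a regular expression matches, but only for words of a minimum length. It uses only the standard `re` module. The factory `make_regex_stemmer` and its `min_length` parameter are hypothetical names chosen for this illustration; NLTK's `RegexpStemmer` exposes a similar idea through a pattern and a `min` argument.

```python
import re


def make_regex_stemmer(pattern: str, min_length: int = 0):
    """Build a stemmer that deletes every match of `pattern` from a word."""
    compiled = re.compile(pattern)

    def stem(word: str) -> str:
        # Words shorter than min_length are returned untouched
        if len(word) < min_length:
            return word
        return compiled.sub("", word)

    return stem


stem = make_regex_stemmer(r"ing$|s$|e$|able$", min_length=4)

for word in ["cars", "was", "compute", "advisable", "running"]:
    print(word, "-->", stem(word))
```

The pattern is the whole "algorithm" here, so the quality of the stems depends entirely on how carefully the suffix list is written.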
- ***Downloading & Installing the Libraries***

In [1]:
import sys
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q numpy pandas scikit-learn
!{sys.executable} -m pip install -q nltk spacy gensim wordcloud textblob contractions text-clean unicode

### 1) Stemming Using NLTK's `PorterStemmer`
- One of the most common and effective stemming tools is Porter's algorithm, developed by Martin Porter in 1980

In [3]:
# Import the toolkit and the stemmer classes from the stem module
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.regexp import RegexpStemmer

ps = PorterStemmer()
print(dir(ps))

['MARTIN_EXTENSIONS', 'NLTK_EXTENSIONS', 'ORIGINAL_ALGORITHM', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_apply_rule_list', '_contains_vowel', '_ends_cvc', '_ends_double_consonant', '_has_positive_measure', '_is_consonant', '_measure', '_replace_suffix', '_step1a', '_step1b', '_step1c', '_step2', '_step3', '_step4', '_step5a', '_step5b', 'mode', 'pool', 'stem', 'vowels']


In [5]:
ps._step1a('ponies')
# Step 1a handles plural suffixes: here the ending 'ies' is rewritten to 'i'

'poni'

In [6]:
ps._step1a('cutting')

'cutting'

In [8]:
ps._step1b('cutting')
# Step 1b strips 'ing' from the end, then reduces the doubled consonant 'tt' to 't'

'cut'
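
To show what a single Porter step looks like, here is a minimal sketch of step 1a as described in the original 1980 algorithm (the plural-suffix rules). The function name `porter_step1a` is hypothetical. NLTK's default mode adds a few extensions, so its private `_step1a` may differ on edge cases.

```python
def porter_step1a(word: str) -> str:
    """Sketch of original Porter step 1a: try plural suffixes, longest first."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word           # no plural suffix, e.g. cutting -> cutting


for word in ["caresses", "ponies", "caress", "cats", "cutting"]:
    print(word, "-->", porter_step1a(word))
```

Because the rules are tried longest suffix first, `caress` is protected by the `ss` rule before the plain `s` rule can strip it.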

In [10]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize 

mystr = 'Wait waiting waits waited \
         consult, consultative, consulting, consultation\
         university, universe, universal\
         studies, cry , cries bicycle, USA\
          data datum'
ps = PorterStemmer()
for i in word_tokenize(mystr):
    root_words = ps.stem(i)
    print(i, ' ----> ', root_words)

Wait  ---->  wait
waiting  ---->  wait
waits  ---->  wait
waited  ---->  wait
consult  ---->  consult
,  ---->  ,
consultative  ---->  consult
,  ---->  ,
consulting  ---->  consult
,  ---->  ,
consultation  ---->  consult
university  ---->  univers
,  ---->  ,
universe  ---->  univers
,  ---->  ,
universal  ---->  univers
studies  ---->  studi
,  ---->  ,
cry  ---->  cri
,  ---->  ,
cries  ---->  cri
bicycle  ---->  bicycl
,  ---->  ,
USA  ---->  usa
data  ---->  data
datum  ---->  datum


>- ***Over-stemming:*** Too much of the word is cut off (e.g. studies --> studi), or words with different stems are mapped to the same stem (e.g. university, universal and universe are all wrongly reduced to the same stem, univers)
>- ***Under-stemming:*** The opposite of over-stemming: words that share a stem are mapped to different stems (e.g. data and datum should reduce to a single stem, but are wrongly left as the two different stems data and datum)
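
One way to make these two error types concrete is to compare the stems against a hand-made grouping of words that should belong together. In the sketch below, `porter_stems` is copied from the Porter output above, while `concept_of` is a hypothetical gold grouping written only for this illustration.

```python
from itertools import combinations

# Word -> stem pairs taken from the Porter output shown earlier
porter_stems = {
    "university": "univers",
    "universe": "univers",
    "universal": "univers",
    "data": "data",
    "datum": "datum",
}

# Hypothetical gold labels: words with the same label should share a stem
concept_of = {
    "university": "university",
    "universe": "universe",
    "universal": "universal",
    "data": "datum",
    "datum": "datum",
}

over_stemmed, under_stemmed = [], []
for a, b in combinations(porter_stems, 2):
    same_stem = porter_stems[a] == porter_stems[b]
    same_concept = concept_of[a] == concept_of[b]
    if same_stem and not same_concept:
        over_stemmed.append((a, b))   # different words merged into one stem
    elif same_concept and not same_stem:
        under_stemmed.append((a, b))  # related words kept apart

print("Over-stemmed pairs: ", over_stemmed)
print("Under-stemmed pairs:", under_stemmed)
```

With these assumed labels, the three `univers` words show up as over-stemmed pairs and `data`/`datum` as the single under-stemmed pair.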

### 2) Lemmatization Using NLTK's `WordNetLemmatizer`
- ***Stemming*** algorithms use fixed rules, such as cutting off a prefix or suffix, to derive the base/root word. They do so even if the stem itself is not a valid word in the language. Stemming is faster because it cuts words without knowing the context
- It is a rule-based approach
- Converting a word to its root form by stemming may produce a string that is not a real word
- Stemming is preferred when the meaning of the words is not important to the analysis
- For example: "Studies" --> "Studi"
- ***Lemmatization*** uses knowledge of the language to derive the base/root word, also known as the lemma. It ensures that the root word (lemma) belongs to the language. Because lemmatization looks words up in something like a dictionary, it is more time-consuming
- It is a dictionary-based approach
- It always returns a dictionary word when converting to the root form
- Lemmatization is preferred when the meaning of the words is important to the analysis
- For example: question answering

In [12]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

mystr = 'Wait waiting waits waited \
         consult, consultative, consulting, consultation\
         university, universe, universal\
         studies, cry , cries bicycle, USA\
          data datum'
for i in nltk.word_tokenize(mystr):
    print(i, '-->', lemmatizer.lemmatize(i))

Wait --> Wait
waiting --> waiting
waits --> wait
waited --> waited
consult --> consult
, --> ,
consultative --> consultative
, --> ,
consulting --> consulting
, --> ,
consultation --> consultation
university --> university
, --> ,
universe --> universe
, --> ,
universal --> universal
studies --> study
, --> ,
cry --> cry
, --> ,
cries --> cry
bicycle --> bicycle
, --> ,
USA --> USA
data --> data
datum --> datum
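
`WordNetLemmatizer.lemmatize()` accepts an optional `pos` argument that defaults to noun (`'n'`), which is why verb forms such as `waiting` and `waited` come back unchanged in the output above. The dictionary-based idea, and the role of the part of speech, can be sketched without WordNet. The lookup table `LEMMAS` and the function `toy_lemmatize` below are hypothetical and cover only a handful of words.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("studies", "n"): "study",
    ("cries", "n"): "cry",
    ("waits", "n"): "wait",
    ("studies", "v"): "study",
    ("waits", "v"): "wait",
    ("waiting", "v"): "wait",
    ("waited", "v"): "wait",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary lemma, or the word unchanged if there is no entry."""
    return LEMMAS.get((word.lower(), pos), word)


for word in ["studies", "waiting", "waited"]:
    print(word, "| as noun:", toy_lemmatize(word), "| as verb:", toy_lemmatize(word, pos="v"))
```

With the real lemmatizer, calling `lemmatizer.lemmatize('waiting', pos='v')` should likewise reduce the word to `wait`, so feeding in POS tags generally improves lemmatization.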


### 3) POS tagging
- Part-of-speech (POS) tagging is the process of assigning a part of speech to each word in a text
- POS tagging is essentially a classification problem in which each word in the text is assigned the appropriate part of speech
- These are the ten parts of speech (POS)
>- Noun
>- Pronoun
>- Verb
>- Adverb
>- Article
>- Adjective
>- Punctuation
>- Interjection
>- Conjunction
>- Numeral
- POS are divided into two broad categories
> 1) Closed Class Type: (Prepositions, Articles, Pronouns)
> 2) Open Class Type: (Nouns, Verbs, Adjectives, Adverbs)
- These are the three types of POS tagging:
> 1) Rule-based POS tagging
> 2) Stochastic POS tagging (Hidden Markov Model and the Viterbi algorithm)
> 3) Transformation-based POS tagging (E. Brill's tagger)
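
The rule-based idea, together with the closed/open class distinction above, can be sketched in plain Python: closed-class words are looked up in a small table, and open-class words are guessed from their suffix. The table `CLOSED_CLASS`, the list `SUFFIX_RULES`, and the function `toy_pos_tag` are hypothetical and far too small for real use.

```python
import re

# Closed-class words are few, so a lookup table can cover them (tiny sample here)
CLOSED_CLASS = {
    "the": "DET", "a": "DET", "an": "DET",
    "it": "PRON", "he": "PRON", "she": "PRON",
    "over": "ADP", "in": "ADP", "on": "ADP",
    "and": "CCONJ", "or": "CCONJ",
}

# Open-class words are guessed from their suffix; the first matching rule wins
SUFFIX_RULES = [
    (re.compile(r".+ly$"), "ADV"),
    (re.compile(r".+(ing|ed)$"), "VERB"),
    (re.compile(r".+(ous|ful|able)$"), "ADJ"),
]


def toy_pos_tag(tokens):
    """Tag each token using the lookup table first, then the suffix rules."""
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in CLOSED_CLASS:
            tag = CLOSED_CLASS[lower]
        else:
            # Fall back to NOUN when no suffix rule matches
            tag = next((t for pattern, t in SUFFIX_RULES if pattern.match(lower)), "NOUN")
        tagged.append((token, tag))
    return tagged


tagged = toy_pos_tag("The dog quickly jumped over the famous fence".split())
for token, tag in tagged:
    print(f"{token:10}{tag}")
```

Rules like these ignore context entirely, so a word such as "make" would always receive the same tag. Stochastic and transformation-based taggers exist largely to address that limitation.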

#### POS tagging using Spacy
- POS tagging with the spaCy library is straightforward. We pass the text to the loaded pipeline to get a `Doc` object, iterate over its tokens, and use the `pos_` and `tag_` attributes to print the coarse-grained and fine-grained POS tags respectively. spaCy also provides a description of each tag through the `spacy.explain()` function

In [1]:
# Load spaCy's small English pipeline
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# This is a pre-trained model, let us check the pipes of the pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

>- A pipe is an individual component of a pipeline
>- In spaCy, different pipes perform different tasks. The tokenizer splits the text into individual tokens, the parser parses the text, and the NER component identifies entities and labels them accordingly. All of this data is stored in the `Doc` object.
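
The pipeline idea can be illustrated with a small sketch in which each pipe is a function that receives a document, adds its own annotations, and passes the document on. This is not how spaCy is implemented: the plain dictionary stands in for the `Doc` object, and the names `tokenize`, `mark_title_case`, and `run_pipeline` are hypothetical.

```python
def tokenize(doc):
    """First pipe: split the raw text on whitespace and store the tokens."""
    doc["tokens"] = doc["text"].split()
    return doc


def mark_title_case(doc):
    """Second pipe: flag capitalised tokens, relying on the tokens added earlier."""
    doc["is_title"] = [token.istitle() for token in doc["tokens"]]
    return doc


# An ordered list of (name, component) pairs, loosely mirroring nlp.pipe_names
pipeline = [("tokenizer", tokenize), ("title_marker", mark_title_case)]


def run_pipeline(text):
    """Create a document and pass it through every pipe in order."""
    doc = {"text": text}
    for _name, component in pipeline:
        doc = component(doc)
    return doc


pipe_names = [name for name, _ in pipeline]
doc = run_pipeline("Zoro jumped over the lazy dog")
print(pipe_names)
print(doc["tokens"])
print(doc["is_title"])
```

The order matters: the second pipe fails if the first has not run, in the same way that later spaCy components build on annotations produced by earlier ones.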

In [3]:
mytext = "The quick brown box name is Zoro and it high jumped over the lazy dog's back"
Doc = nlp(mytext)
print(type(Doc))

<class 'spacy.tokens.doc.Doc'>


In [4]:
Doc

The quick brown box name is Zoro and it high jumped over the lazy dog's back

In [6]:
# Doc level attributes
print(dir(Doc))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merge', '_context', '_get_array_attrs', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'copy', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_dict', 'from_disk', 'from_docs', 'from_json', 'get_extension', 'get_lca_matrix', 'has_annotation', 'has_extension', 'has_unknown_spaces', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'noun_chunks', 'noun_chunks_iterator', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_ents', 'set_extension', 'similarity', 'spans', 'tensor', 'text', 'te

In [8]:
# Token level attributes
print(dir(Doc[0]))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_dep', 'has_extension', 'has_head', 'has_morph', 'has_vector', 'head', 'i', 'idx', 'iob_strings', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex', 'lex_id', 'like_email', 'like

In [9]:
# Print a bold header row using ANSI escape codes, then one row per token
print('\033[1m')
print(f'{"Token":{10}}{"Coarse":{10}}{"Fine-grained":{21}}{"Explanation"}')
print('\033[m')
for token in Doc:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{21}}{spacy.explain(token.tag_)}')

Token     Coarse    Fine-grained         Explanation
The       DET       DT                   determiner
quick     ADJ       JJ                   adjective (English), other noun-modifier (Chinese)
brown     PROPN     NNP                  noun, proper singular
box       NOUN      NN                   noun, singular or mass
name      NOUN      NN                   noun, singular or mass
is        AUX       VBZ                  verb, 3rd person singular present
Zoro      PROPN     NNP                  noun, proper singular
and       CCONJ     CC                   conjunction, coordinating
it        PRON      PRP                  pronoun, personal
high      ADV       RB                   adverb
jumped    VERB      VBD                  verb, past tense
over      ADP       IN                   conjunction, subordinating or preposition
the       DET       DT                   determiner
lazy      ADJ       JJ                   adjective (English), other noun-modifier (Chinese)
do

>- To view the coarse-grained POS tag, use `token.pos_`
>- To view the fine-grained POS tag, use `token.tag_`
>- To view the description of either type of tag, use `spacy.explain(tag)`

***Example #02***

In [10]:
doc = nlp('I am going to make dinner')
word = doc[4]
print(word, '--->', word.pos_, word.tag_, spacy.explain(word.tag_))

make ---> VERB VB verb, base form


In [12]:
doc = nlp('What is the make of your laptop?')
word = doc[3]
print(word, '--->', word.pos_, word.tag_, spacy.explain(word.tag_))

make ---> NOUN NN noun, singular or mass


#### Visualizing Parts of Speech Using `displacy`
- To render POS tags visually, we can use spaCy's `displacy` module
- In a Jupyter Notebook: `displacy.render()`
- In other environments: `displacy.serve()`

In [13]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Arif is playing Cricket')
displacy.render(doc)

### 4) Named Entity Recognition (NER)
- An ***Entity*** is a thing denoted by a noun or noun phrase. It can be the subject or the object of a sentence
- A ***Named Entity*** refers to a real-world or conceptual object that can be represented by a proper noun
- ***Named Entity Recognition (NER)*** is a subtask of `Information Extraction` that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, medical codes, date and time expressions, monetary values, percentages, etc.
- It is also known as entity identification, entity chunking, or entity extraction

***Example:*** Arif (`person`) has moved to Karachi (`GPE`) where he will be playing (`event`) basketball (`product`) on 26 Dec 2023 (`date`)

***Use Cases of NER***
- 1) Coreference Resolution
- 2) Information Extraction
- 3) Search Engines
- 4) Recommendation Systems
- 5) Question answering systems and chatbots

***Approaches to NER***
- Dictionary-Based Approach
- Rule-Based Approach
- Machine Learning-Based Approach
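
The first two approaches can be combined in a few lines of standard-library Python: a dictionary (gazetteer) lookup for known names and a regular expression for dates. The `GAZETTEER` entries, `DATE_PATTERN`, and `toy_ner` are hypothetical and only cover the example sentence used in this section. The labels follow the spaCy label names used in this notebook.

```python
import re

# Dictionary-based: a tiny hypothetical gazetteer mapping known names to labels
GAZETTEER = {"Arif": "PERSON", "Karachi": "GPE", "Punjab University": "ORG"}

# Rule-based: a regular expression for dates written like "26 Dec 2023"
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"
)


def toy_ner(text):
    """Return (entity text, label) pairs in the order they appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    # Sort by character offset so entities come out in reading order
    return [(span, label) for _, span, label in sorted(found)]


text = "Arif has moved to Karachi where he will be playing basketball on 26 Dec 2023"
entities = toy_ner(text)
for span, label in entities:
    print(span, "--->", label)
```

A lookup like this cannot recognise names it has never seen, which is the main reason machine learning-based models are preferred for open-ended text.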

#### NER using spacy
***Example # 01***

In [14]:
# Load spaCy's small English pipeline
import spacy
nlp = spacy.load('en_core_web_sm')

In [15]:
# This is a pre-trained model, let us check the pipes of the pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [16]:
# The pre-trained model supports the following named entity labels
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [21]:
mystr = 'Iqbal works full time in Punjab University since 2022 and part time in twitter'
doc = nlp(mystr)
for i in doc.ents:
    print(i.text, '--->', i.label_)
# Visualization of NER
displacy.render(doc, style = 'ent', jupyter = True)

Punjab University ---> ORG
2022 ---> DATE


### 5) Coreference Resolution
- Coreference resolution is the process of identifying all noun phrases that refer to the same entity

***Use Cases of Coreference Resolution***
- 1) Information Retrieval
- 2) Text Summarization
- 3) Machine Translation
- 4) Question Answering Systems
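
To make the definition of coreference concrete, the sketch below links each pronoun to the nearest preceding entity with a compatible label. This is a naive heuristic written for illustration, not how production coreference systems work. The table `PRONOUN_LABELS`, the function `naive_coref`, and the hand-supplied `entities` mapping are all hypothetical.

```python
# Which entity labels each pronoun may refer to (a small assumed sample)
PRONOUN_LABELS = {
    "he": {"PERSON"},
    "she": {"PERSON"},
    "it": {"GPE", "ORG", "PRODUCT"},
}


def naive_coref(tokens, entities):
    """Map each pronoun index to the index of its guessed antecedent.

    `entities` maps a token index to an entity label, e.g. {0: "PERSON"}.
    """
    links = {}
    for i, token in enumerate(tokens):
        allowed = PRONOUN_LABELS.get(token.lower())
        if allowed is None:
            continue  # not a pronoun we know about
        # Walk backwards to the nearest earlier entity with a compatible label
        for j in range(i - 1, -1, -1):
            if entities.get(j) in allowed:
                links[i] = j
                break
    return links


tokens = "Arif has moved to Karachi where he will be playing basketball".split()
entities = {0: "PERSON", 4: "GPE"}  # labels supplied by hand for this example

links = naive_coref(tokens, entities)
for pronoun, antecedent in links.items():
    print(tokens[pronoun], "->", tokens[antecedent])  # he -> Arif
```

The label check is what stops "he" from being linked to the closer mention "Karachi", which hints at why coreference resolution typically builds on NER output.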