### Text Preprocessing

Text preprocessing transforms raw text into a form that a machine learning algorithm can work with.
Common preprocessing steps include:
1. Remove punctuation
2. Tokenization
3. Remove stopwords
4. Lemmatization
5. Stemming
6. Remove numbers
7. Lowercasing
8. Remove special characters
9. Remove extra spaces
10. Remove URLs
11. Remove HTML tags
12. Remove emojis
13. Remove emoticons
14. Remove non-ASCII characters
15. Other task-specific cleanup steps
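
Several of the cleanup steps listed above can be chained in a single function. The following is a minimal sketch using only Python's standard library. The function name `clean_text` and the regular expressions are illustrative assumptions rather than a standard recipe, and real pipelines usually tune each step to the task.

```python
import html
import re

# Illustrative patterns (assumptions, not a standard): tune them for real data.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Apply a fixed sequence of simple cleanup steps to one string."""
    text = html.unescape(text)           # decode entities such as &amp;
    text = URL_RE.sub(" ", text)         # remove URLs
    text = TAG_RE.sub(" ", text)         # remove HTML tags
    text = text.lower()                  # lowercase
    # After lowercasing, drop everything except a-z and whitespace:
    # punctuation, digits, emojis and other non-ASCII characters.
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()  # collapse extra spaces


print(clean_text("<p>Visit https://example.com NOW!!  Only 100% free 😀</p>"))
# visit now only free
```

The order matters in this sketch: URLs and tags are removed before punctuation, because stripping punctuation first would break the patterns that detect them.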


1. Tokenization
    Tokenization is the process of breaking text into smaller units called tokens. Tokens can be words, sentences, or subwords. It is usually one of the first steps in an NLP or text analysis pipeline, because later steps operate on tokens. The two most common types are word tokenization, which splits text into individual words, and sentence tokenization, which splits text into individual sentences.
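
As a rough illustration of the two types, the sketch below splits text with regular expressions. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive: for example, an abbreviation such as "Dr." would be treated as the end of a sentence.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hi John, how are you? I am fine."
print(simple_sent_tokenize(text))
# ['Hi John, how are you?', 'I am fine.']
print(simple_word_tokenize("Hi John, how are you?"))
# ['Hi', 'John', ',', 'how', 'are', 'you', '?']
```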

2. Text Normalization
    Text normalization is the process of transforming text into a single canonical form, for example by converting it to lowercase and removing special characters, extra spaces, numbers, and stopwords. It reduces the size of the vocabulary, which can improve the performance of the model.
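
The effect on vocabulary size can be seen on a tiny example. This is a sketch under assumptions: the three sample strings and the `normalize` helper are made up for illustration.

```python
import re

docs = ["The cat sat.", "the CAT sat!", "The  cat   SAT"]


def normalize(text):
    """Lowercase, drop special characters, and collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Vocabulary = set of distinct whitespace-separated tokens.
raw_vocab = {tok for d in docs for tok in d.split()}
norm_vocab = {tok for d in docs for tok in normalize(d).split()}

print(len(raw_vocab), len(norm_vocab))
# 7 3
```

Before normalization, "The", "the", "sat." and "sat!" all count as different tokens; afterwards only "the", "cat" and "sat" remain.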

3. Part of Speech tagging
    Part of speech (POS) tagging is the process of assigning a part of speech to each word in a sentence. The part of speech indicates the role of the word in the sentence. Traditional English grammar lists eight parts of speech: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. Taggers typically use finer-grained tag sets, such as the Penn Treebank tags (for example NN, VBG, PRP). POS tagging helps in understanding the grammatical structure of the text and extracting useful information from it.

4. Dependency Parsing
    Dependency parsing is the process of analyzing the grammatical structure of a sentence. It identifies the relationships between the words in a sentence and represents them as a tree structure. This helps in understanding the syntactic structure of the text and extracting useful information from it.

About the module:
- Corpus, Tokens, Ngram
- Tokenization
- Stemming 
- Lemmatization
- Stopwords
- Part of Speech Tagging
- Dependency parsing

1. Corpus : Collection of text documents, Corpus > Documents > Paragraphs > Sentences > Tokens

2. Tokens : Words, sentences, or, more generally, the smallest units a text is split into.

3. Ngram : A contiguous sequence of n items from a given sample of text or speech, i.e., a combination of n adjacent words or characters. For the sentence "I Love my phone very much":
    a. Unigrams(n=1) : I, Love, my, phone, very, much
    b. Bigrams(n=2) : I Love, Love my, my phone, phone very, very much
    c. Trigrams(n=3) : I Love my, Love my phone, my phone very, phone very much
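
The n-grams listed above can be generated with a sliding window over the token list. A minimal sketch follows; the helper name `make_ngrams` is an assumption for illustration.

```python
def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I Love my phone very much".split()

print(make_ngrams(tokens, 2))
# [('I', 'Love'), ('Love', 'my'), ('my', 'phone'), ('phone', 'very'), ('very', 'much')]
print(len(make_ngrams(tokens, 3)))
# 4
```

A sequence of length L yields L - n + 1 n-grams, so six tokens give five bigrams and four trigrams.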

4. Tokenization: Process of splitting a text object into smaller units. Many tokenization techniques and tokenizers are available,
    for example the White Space Tokenizer, Unigram Tokenizer, and Regex Tokenizer.
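
To show how the choice of tokenizer changes the result, the sketch below compares a whitespace split with a regex-based split on the same string. The sample sentence and the pattern are illustrative assumptions.

```python
import re

text = "Let's meet at 5pm, okay?"

# Whitespace tokenizer: punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Regex tokenizer: keep runs of letters, digits and apostrophes only.
regex_tokens = re.findall(r"[A-Za-z0-9']+", text)

print(whitespace_tokens)
# ["Let's", 'meet', 'at', '5pm,', 'okay?']
print(regex_tokens)
# ["Let's", 'meet', 'at', '5pm', 'okay']
```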

5. Normalization: A morpheme is the smallest meaningful unit of a word; the root morpheme is the word's base form. Structure of a token: <prefix> <morpheme> <suffix>
    Example: Antinationalist -> Anti National ist
    Normalization is the process of converting a token into its base form. It helps reduce data dimensionality and clean the text. Types: Stemming and Lemmatization

#### Stemming: 
Process of reducing inflected words to their word stem, base, or root form by stripping suffixes or prefixes.

- Ideal result: Entitling -> Entitle, Entitled -> Entitle, Entitles -> Entitle
- Types: Porter Stemmer, Snowball Stemmer, Lancaster Stemmer
- Advantages: Fast, easy to implement, reduces data dimensionality
- Disadvantages: Overstemming, produces non-existent words, produces forms that are not semantically correct

In practice the stemmed form is often not a regular word and has no meaning of its own.
Example: Entitling -> Entitl, Entitled -> Entitl, Entitles -> Entitl
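
The suffix-stripping idea can be sketched in a few lines. This toy stemmer is an assumption made for illustration: it is not the Porter, Snowball, or Lancaster algorithm, and its suffix list and length rule are arbitrary.

```python
# Toy suffix list, checked in this order (illustrative only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Entitling", "Entitled", "Entitles"]:
    print(w, "->", toy_stem(w))
# Entitling -> entitl
# Entitled -> entitl
# Entitles -> entitl

print(toy_stem("news"))
# new
```

All three forms collapse to the non-word "entitl", and "news" is overstemmed to "new", which illustrates the disadvantages of stemming.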

#### Lemmatization:
Process of converting a word into its base form (lemma). It is similar to stemming but more sophisticated: it uses a vocabulary and morphological analysis of the word to return its dictionary form.

- Example: Entitling -> Entitle, Entitled -> Entitle, Entitles -> Entitle
- Advantages: Produces real words, reduces data dimensionality, reduces overstemming
- Disadvantages: Slow, complex, requires a dictionary

Lemmatization is often preferred over stemming when output quality matters more than speed, because it relies on the dictionary meaning of a word rather than on suffix stripping alone.
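
The dictionary-based idea can be sketched as a lookup keyed on the word and its part of speech. The table below is a hypothetical, hand-made stand-in for a real lexical resource, and `toy_lemmatize` is an illustrative name.

```python
# Hand-made (word, part of speech) -> lemma table; a real lemmatizer
# consults a full dictionary instead (assumption for illustration).
LEMMA_TABLE = {
    ("entitling", "v"): "entitle",
    ("entitled", "v"): "entitle",
    ("entitles", "v"): "entitle",
    ("running", "v"): "run",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma; unknown (word, pos) pairs are returned lowercased."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("Entitled", pos="v"))
# entitle
print(toy_lemmatize("running", pos="v"))
# run
print(toy_lemmatize("running", pos="n"))
# running
```

The same surface form can have different lemmas depending on its part of speech, which is why lemmatizers usually accept a POS argument.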

6. Stopwords: 
Words that are very common in the text and carry little useful information for many tasks. They are often removed from the text before further processing. Example: is, am, are, the, a, an, in, on, at, to, etc.
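
Stopword removal is a simple membership test against a word list. The sketch below uses only the example words given above; real stopword lists are much longer.

```python
# Small stopword set built from the examples above (not a complete list).
STOPWORDS = {"is", "am", "are", "the", "a", "an", "in", "on", "at", "to"}


def remove_stopwords(tokens):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "I am going to the market".split()
print(remove_stopwords(tokens))
# ['I', 'going', 'market']
```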


7. Part of Speech Tagging: 
Defines the syntactic context and role of words in a sentence by assigning a part of speech to each word. A word's tag is determined by its relations with adjacent words, and is assigned using either machine learning or rule-based methods.

* Uses of POS Tags:
    - text cleaning 
    - feature engineering
    - word sense disambiguation
    - sentiment analysis
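
Two of these uses, text cleaning and feature engineering, can be sketched on an already tagged sentence. The `tagged` list below is typed in by hand using Penn Treebank style tags; it is not produced by running a tagger here.

```python
from collections import Counter

# Hand-written (word, tag) pairs in Penn Treebank style (assumed input).
tagged = [
    ("I", "PRP"), ("will", "MD"), ("be", "VB"), ("going", "VBG"),
    ("to", "TO"), ("the", "DT"), ("market", "NN"), (".", "."),
]

# Text cleaning: keep only nouns, verbs, adjectives and adverbs.
content_words = [w for w, tag in tagged if tag.startswith(("NN", "VB", "JJ", "RB"))]

# Feature engineering: count coarse tag classes (first two characters of the tag).
coarse_counts = Counter(tag[:2] for _, tag in tagged)

print(content_words)
# ['be', 'going', 'market']
print(coarse_counts["VB"], coarse_counts["NN"])
# 2 1
```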

8. Constituency Grammar: 
Constituents : Words, phrases, or groups of words that act as a single unit in a sentence.
    - Noun Phrase(NP) : A phrase that has a noun as its head. Example: The big dog
    - Verb Phrase(VP) : A phrase that has a verb as its head. Example: is barking
    - Adjective Phrase(AP) : A phrase that has an adjective as its head. Example: big (in "The big dog")
    - Adverb Phrase(ADVP) : A phrase that has an adverb as its head. Example: loudly (in "The dog is barking loudly")
    - Prepositional Phrase(PP) : A phrase that has a preposition as its head. Example: in the house
    - Sentence(S) : A sentence is a group of words that expresses a complete thought. Example: The dog is barking loudly in the house
    - Clause(C) : A clause is a group of words that contains a subject and a verb. Example: The dog is barking
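
A constituency parse can be represented as a nested structure and searched for phrases of a given type. The tree below is a hand-written parse of the example sentence, so its bracketing and tags are assumptions rather than parser output; the helper names are also illustrative.

```python
# Each node is (label, child, ...); a leaf is (pos_tag, word).
# Hand-written parse of "The big dog is barking loudly in the house".
TREE = (
    "S",
    ("NP", ("DT", "The"), ("JJ", "big"), ("NN", "dog")),
    (
        "VP",
        ("VBZ", "is"),
        ("VBG", "barking"),
        ("ADVP", ("RB", "loudly")),
        ("PP", ("IN", "in"), ("NP", ("DT", "the"), ("NN", "house"))),
    ),
)


def is_leaf(node):
    """A leaf pairs a tag with a word string."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words covered by a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]


def phrases(node, wanted):
    """Collect the text of every phrase-level constituent labelled `wanted`."""
    if is_leaf(node):
        return []
    found = [" ".join(leaves(node))] if node[0] == wanted else []
    for child in node[1:]:
        found.extend(phrases(child, wanted))
    return found


print(phrases(TREE, "NP"))
# ['The big dog', 'the house']
print(phrases(TREE, "PP"))
# ['in the house']
```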

9. Dependency Grammar:
Dependency parsing analyzes the grammatical structure of a sentence by identifying the relationships between its words and representing them as a tree. It is used in Named Entity Recognition, Sentiment Analysis, Question Answering, Machine Translation, etc.
In short, every word in a sentence is linked, directly or indirectly, to the root word of the sentence (usually the main verb).
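
The idea that every word leads to the root can be sketched by storing, for each word, the index of its head. The head assignments below are written by hand for illustration and are not the output of a parser.

```python
words = ["The", "dog", "is", "barking", "loudly"]

# heads[i] is the index of the head of words[i]; -1 marks the root.
# Hand-assigned: The -> dog, dog -> barking, is -> barking, loudly -> barking.
heads = [1, 3, 3, -1, 3]


def path_to_root(i):
    """Follow head links from word i until the root is reached."""
    path = [words[i]]
    while heads[i] != -1:
        i = heads[i]
        path.append(words[i])
    return path


for i in range(len(words)):
    print(" -> ".join(path_to_root(i)))
# The -> dog -> barking
# dog -> barking
# is -> barking
# barking
# loudly -> barking
```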





## Implementation

In [1]:
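# Tokenizing needs NLTK's 'punkt' data; download it once with nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
from nltk.tokenize import sent_tokenize, word_tokenize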

text = "Hi John, How are you doing? I will be going to the market. Let's Catchup"

In [2]:
word_tokenize(text)

['Hi',
 'John',
 ',',
 'How',
 'are',
 'you',
 'doing',
 '?',
 'I',
 'will',
 'be',
 'going',
 'to',
 'the',
 'market',
 '.',
 'Let',
 "'s",
 'Catchup']

In [3]:
sent_tokenize(text)

['Hi John, How are you doing?',
 'I will be going to the market.',
 "Let's Catchup"]

In [8]:
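# Stemming with the Porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

# Every variant except "pythonly" reduces to the same stem
for w in example_words:
    print(ps.stem(w))

# Stems are not always real words, e.g. "increases" -> "increas"
print(ps.stem("playing"))
print(ps.stem("increases"))
print(ps.stem("lively"))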



python
python
python
python
pythonli
play
increas
live


In [12]:
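# Lemmatization with WordNet (needs the data from nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

# Words that WordNet cannot reduce are returned unchanged
for w in example_words:
    print(lemmatizer.lemmatize(w))

# The default part of speech is noun; pass pos="v" to lemmatize as a verb
print(lemmatizer.lemmatize("increases"))
print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("running", pos="n"))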


python
pythoner
pythoning
pythoned
pythonly
increase
run
running


In [15]:
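import nltk
# Download the tokenizer and POS tagger data used in this section.
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')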


In [16]:
# POS tagging (pos_tag returns Penn Treebank tags such as NNP, VBG)
from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "Hi John, How are you doing? I will be going to the market. Let's Catchup"
tokens = word_tokenize(text)
pos_tag(tokens)

[('Hi', 'NNP'),
 ('John', 'NNP'),
 (',', ','),
 ('How', 'NNP'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('doing', 'VBG'),
 ('?', '.'),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('market', 'NN'),
 ('.', '.'),
 ('Let', 'VB'),
 ("'s", 'POS'),
 ('Catchup', 'NNP')]

In [18]:
# WordNet: list the synsets (word senses) of a word
from nltk.corpus import wordnet

wordnet.synsets("Book")

[Synset('book.n.01'),
 Synset('book.n.02'),
 Synset('record.n.05'),
 Synset('script.n.01'),
 Synset('ledger.n.01'),
 Synset('book.n.06'),
 Synset('book.n.07'),
 Synset('koran.n.01'),
 Synset('bible.n.01'),
 Synset('book.n.10'),
 Synset('book.n.11'),
 Synset('book.v.01'),
 Synset('reserve.v.04'),
 Synset('book.v.03'),
 Synset('book.v.04')]

In [20]:
# N-grams over word tokens (n = 2 gives bigrams)
from nltk import ngrams
from nltk.tokenize import word_tokenize
sentence = "I love to play cricket"

n = 2

for i in ngrams(word_tokenize(sentence), n):
    print(i)

('I', 'love')
('love', 'to')
('to', 'play')
('play', 'cricket')
