## Text Preprocessing 1 (Cleaning)


#### 1) Tokenization:
    * Takes a paragraph or sentence and converts it into sentences and then words (tokens).
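
As a rough illustration of tokenization, the sketch below splits a paragraph into sentences and each sentence into words using only regular expressions. The names `split_sentences` and `split_words` are made up for this example, and the regex rules are a naive stand-in for a real tokenizer such as the NLTK one used later in these notes.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence: str) -> list[str]:
    """Keep runs of letters and apostrophes as words; punctuation is dropped."""
    return re.findall(r"[A-Za-z']+", sentence)

paragraph = "He is a good boy. She is a good girl."
sentences = split_sentences(paragraph)           # paragraph -> sentences
words = [split_words(s) for s in sentences]      # sentences -> words
print(sentences)
print(words)
```

Abbreviations such as "Dr." would break this simple rule, which is one reason to prefer a trained tokenizer in practice.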

#### Stop words:
    * Removing very common words (is, are, he, ...) that add little meaning for the task.

#### 2) Stemming:
    * Process of reducing words to their base word stem.
    * We try to find the base stem of the word by stripping suffixes.
    Eg: historical, history --stemming--> histori (the stem is often not a meaningful word, though in some cases it is)
    * In short, we get a root/base form of the word.

    Pro - Stemming is really fast | Con - The stem may lose the meaning of the word

    > Use case:
        1) Spam classifier
        2) Review classifier

#### 3) Lemmatization:
    * Overcomes the main con of stemming.
    * historical, history --lemmatization--> history (meaningful word)
    * Identifies the meaningful base word (lemma) from a dictionary

    Pro - Meaningful word | Con - It is slow, as it looks words up in a dictionary.

    > Use case:
        1) Text Summarization
        2) Language translation
        3) Chat bot
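
To make the trade-off between the two concrete, here is a toy sketch contrasting rule-based suffix stripping with a dictionary lookup. `SUFFIXES`, `LEMMA_DICT`, `toy_stem` and `toy_lemmatize` are all hypothetical and far simpler than the Porter stemmer or WordNet; the dictionary entries just mirror the example used in the notes above.

```python
# Toy illustration only: hand-written rules and a tiny hand-written dictionary.
SUFFIXES = ("ical", "ing", "ed", "es", "s", "y")
LEMMA_DICT = {"historical": "history", "histories": "history"}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; fast, but the result may not be a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

for w in ["History", "historical"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer needs no data at all, while the lemmatizer is only as good as the dictionary behind it, which is where its extra cost comes from.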
        


## Text Preprocessing 2 (Convert words to vectors)

[OHE (One Hot Encoding), Bag of Words, TF-IDF (Term Frequency-Inverse Document Frequency), Word2Vec]

#### Basic Terminologies used in NLP:
    1) Corpus --> paragraph
    2) Documents --> sentences
    3) Vocabulary --> unique words in the paragraph/sentences
    4) Words

### 1) Bag of Words:

    D1 -> He is a good boy
    D2 -> She is a good girl
    D3 -> Boy and girl are good

    * Use stop word removal to drop the unimportant words like [is, are, he...]. Make sure all the words are converted to lowercase.

        Note: Treating these words as unimportant applies mainly to tasks like sentiment classification and toxicity classification. In some other cases they would be important.

    D1 -> good boy
    D2 -> good girl
    D3 -> boy girl good.
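
The lowercasing and stop word removal step can be sketched as follows. `STOP_WORDS` here is a hypothetical, hand-picked set that only covers the three example docs; a real pipeline would use a full list such as the NLTK one shown later in these notes.

```python
# Hypothetical stop word list, just large enough for D1..D3.
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}

docs = ["He is a good boy", "She is a good girl", "Boy and girl are good"]

def clean(doc: str) -> str:
    """Lowercase the doc and drop every stop word."""
    return " ".join(w for w in doc.lower().split() if w not in STOP_WORDS)

cleaned = [clean(d) for d in docs]
print(cleaned)  # ['good boy', 'good girl', 'boy girl good']
```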

    * Identify the Vocabularies and its frequency

        Vocabulary | Frequency
        
        good            3
        boy             2
        girl            2
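
One way to build this vocabulary table is with `collections.Counter` from the standard library. This is a minimal sketch that assumes the docs have already been lowercased and stripped of stop words.

```python
from collections import Counter

cleaned = ["good boy", "good girl", "boy girl good"]

# Count every word across all docs; the keys form the vocabulary.
freq = Counter(word for doc in cleaned for word in doc.split())

for word, count in freq.most_common():
    print(f"{word:<8}{count}")
```

Sorting by frequency is useful later if the vocabulary has to be capped at the most common words.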

    * Convert to features
        For each doc (after stop word removal), count how often each vocabulary word occurs and store the count under its feature (f); this turns the doc into a vector

                f1     | f2    | f3
                good     boy     girl
            D1  1          1       0
            D2  1          0       1
            D3  1          1       1  

        --> if in case D1 is like below
            D1 -> He is a good good boy --> good good boy

                f1     | f2    | f3
                good     boy     girl
            D1  2          1       0
            D2  1          0       1
            D3  1          1       1  

        --> In binary BoW we only use 1s and 0s: if the doc contains the word the value is 1, else 0, no matter how often the word occurs.
                f1     | f2    | f3
                good     boy     girl
            D1  1          1       0
            D2  1          0       1
            D3  1          1       1 
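
The count and binary variants can be compared side by side in a small sketch. The helper names `count_vector` and `binary_vector` are illustrative, and the vocabulary order (good, boy, girl) is fixed by hand to match f1, f2, f3 in the tables.

```python
from collections import Counter

vocabulary = ["good", "boy", "girl"]                # f1, f2, f3
docs = ["good good boy", "good girl", "boy girl good"]

def count_vector(doc: str, vocab: list[str]) -> list[int]:
    """Count BoW: how many times each vocabulary word occurs in the doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def binary_vector(doc: str, vocab: list[str]) -> list[int]:
    """Binary BoW: 1 if the vocabulary word occurs at all, else 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocab]

count_matrix = [count_vector(d, vocabulary) for d in docs]
binary_matrix = [binary_vector(d, vocabulary) for d in docs]
print(count_matrix)   # [[2, 1, 0], [1, 0, 1], [1, 1, 1]]
print(binary_matrix)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
```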

    Advantages:
        1) Simple and intuitive

    Disadvantages:
        1) Sparsity - each doc is a 1xN vector, and N grows with the number of words in the vocabulary, so most entries are 0.
        2) OOV (out of vocabulary) - If a new word appears in a doc, it can't be handled.
        3) Word ordering is completely lost.
        4) Semantic meaning is lost.
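
Two of these disadvantages are easy to reproduce. In the sketch below, `bow` is a hypothetical helper and "clever" is an arbitrary word chosen because it is not in the vocabulary.

```python
vocabulary = ["good", "boy", "girl"]

def bow(doc: str, vocab: list[str]) -> list[int]:
    """Count BoW vector over a fixed vocabulary."""
    words = doc.lower().split()
    return [words.count(term) for term in vocab]

# Ordering is lost: both word orders give the same vector.
forward = bow("good boy", vocabulary)
reverse = bow("boy good", vocabulary)
print(forward == reverse)

# OOV: 'clever' is not in the vocabulary, so it is silently ignored.
with_new_word = bow("good clever boy", vocabulary)
print(with_new_word)
```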

In BoW we cannot capture the semantic meaning; to capture some of it we can use N-grams.

### N-Grams
    * Bigrams, Trigrams, ... N gram

#### Bi-gram and Tri Gram.

    Let's take the below docs
    D1 -> good boy
    D2 -> good girl
    D3 -> boy girl good.

        f1     | f2    | f3     | f4            | f5
         good     boy     girl  good boy        good girl      
     D1  1          1       0       1               0
     D2  1          0       1       0               1
     D3  1          1       1       0               0

        f1,f2,f3 - Uni Gram
        f4, f5 - Bi-Gram
     * How do we take the bigrams?

        Eg: 
            1) I am suganth -> Has 2 Bi-Grams

                I am , am suganth
            
            2) I am not feeling well -> Has 3 Tri Grams

                I am not, am not feeling, not feeling well.
            
        From the above examples we can observe that N-grams keep neighbouring words together, which preserves some semantic meaning.
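
The sliding-window idea behind these examples can be sketched in a few lines. `ngrams` is an illustrative helper written for these notes, not a library function, and it splits on whitespace only.

```python
def ngrams(sentence: str, n: int) -> list[str]:
    """Return every run of n consecutive words in the sentence."""
    tokens = sentence.split()
    # A sentence with k words has k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("I am suganth", 2)
trigrams = ngrams("I am not feeling well", 3)
print(bigrams)   # ['I am', 'am suganth']
print(trigrams)  # ['I am not', 'am not feeling', 'not feeling well']
```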

#### Representation of N grams:

    (1,3) -> we will create the features from unigram to trigram.

    Eg:

    D1 - I am not feeling well

    D1  f1|     f2|     f3|     f4| .....  f6|    f7|......fn
        I       am      not     feeling    I am   am not..not feeling well
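
A range such as (1,3) can be expanded into the full feature list with a short loop. This is a sketch under the assumption of whitespace tokenization; `ngram_features` is a made-up helper, and its `ngram_range` argument is named after the similarly named parameter of scikit-learn's `CountVectorizer`.

```python
def ngram_features(sentence: str, ngram_range: tuple[int, int] = (1, 3)) -> list[str]:
    """List all n-grams for every n in the inclusive range (low, high)."""
    tokens = sentence.split()
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        features.extend(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return features

features = ngram_features("I am not feeling well")
for index, feature in enumerate(features, start=1):
    print(f"f{index}: {feature}")
```

For this five-word doc the loop yields 5 unigrams, 4 bigrams and 3 trigrams, so f6 is "I am" and the last feature is "not feeling well", matching the layout above.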


In [1]:
import nltk

In [2]:
paragraph = """
Reviewers say 'Breaking Bad' is celebrated for its intricate storytelling, 
complex characters, and moral ambiguity. Walter White's transformation captivates 
audiences, praised for meticulous writing and compelling arcs. Bryan Cranston 
and Aaron Paul deliver standout performances. Cinematography, dark humor, and 
exploration of human nature are frequently highlighted. Often compared to 
'The Sopranos' and 'The Wire', it is considered a modern TV masterpiece. 
Minor flaws are noted, but overall quality is exceptional.
"""
paragraph

"\nReviewers say 'Breaking Bad' is celebrated for its intricate storytelling, \ncomplex characters, and moral ambiguity. Walter White's transformation captivates \naudiences, praised for meticulous writing and compelling arcs. Bryan Cranston \nand Aaron Paul deliver standout performances. Cinematography, dark humor, and \nexploration of human nature are frequently highlighted. Often compared to \n'The Sopranos' and 'The Wire', it is considered a modern TV masterpiece. \nMinor flaws are noted, but overall quality is exceptional.\n"

In [3]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [4]:
## tokenization -- convert paragraph -> sentences -> words

nltk.download('punkt')
sentences=nltk.sent_tokenize(paragraph) # in order to use it we need to download the punkt package in nltk


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\suganth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
#Stemming
stemmer=PorterStemmer()
stemmer.stem('history')
#stemmer.stem('goes')

'histori'

In [6]:
## lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
lemmatizer.lemmatize('history')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\suganth\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'history'

In [8]:
import re
corpus = []
for sentence in sentences:
    # replace every non-letter character with a space, then lowercase
    review=re.sub('[^a-zA-Z]',' ',sentence)
    review=review.lower()
    corpus.append(review)

In [9]:
corpus

[' reviewers say  breaking bad  is celebrated for its intricate storytelling   complex characters  and moral ambiguity ',
 'walter white s transformation captivates  audiences  praised for meticulous writing and compelling arcs ',
 'bryan cranston  and aaron paul deliver standout performances ',
 'cinematography  dark humor  and  exploration of human nature are frequently highlighted ',
 'often compared to   the sopranos  and  the wire   it is considered a modern tv masterpiece ',
 'minor flaws are noted  but overall quality is exceptional ']

In [11]:
##English stop words
nltk.download('stopwords')
stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suganth\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 ...]

In [20]:
## stemming
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in stop_words:
            print(stemmer.stem(word))

review
say
break
bad
celebr
intric
storytel
complex
charact
moral
ambigu
walter
white
transform
captiv
audienc
prais
meticul
write
compel
arc
bryan
cranston
aaron
paul
deliv
standout
perform
cinematographi
dark
humor
explor
human
natur
frequent
highlight
often
compar
soprano
wire
consid
modern
tv
masterpiec
minor
flaw
note
overal
qualiti
except


In [22]:
## Lemmatization
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in stop_words:
            print(lemmatizer.lemmatize(word))


reviewer
say
breaking
bad
celebrated
intricate
storytelling
complex
character
moral
ambiguity
walter
white
transformation
captivates
audience
praised
meticulous
writing
compelling
arc
bryan
cranston
aaron
paul
deliver
standout
performance
cinematography
dark
humor
exploration
human
nature
frequently
highlighted
often
compared
soprano
wire
considered
modern
tv
masterpiece
minor
flaw
noted
overall
quality
exceptional
