## NLP Pipeline
**Text** ➡️ **Text Preprocessing** ➡️ **Feature Extraction / Feature Vectorization** ➡️ **Structured & Numerical Data**

### Text Preprocessing Challenges
- Lowercase
- Spelling Mistakes / Correction
- Emoji Conversion
- HTML Tags
- URLs
- Chat Words (OMG, ASAP, etc.)
- Punctuation
- Stop Words
- Tokenization
- Stemming
- Lemmatization

In [1]:
text = 'I am Data Scientist.'

### Lower Case

In [2]:
text.lower()

'i am data scientist.'

### Spelling Correction

In [3]:
text1 = 'I am Dala Scintistey.'

In [4]:
import autocorrect

In [5]:
speller = autocorrect.Speller()

In [6]:
speller.autocorrect_sentence(text)  # note: `text` is passed here, not the misspelled `text1`, so nothing is corrected

'I am Data Scientist.'

### Emoji Conversion

In [7]:
text = "Biryani is Tasty😋"

In [8]:
import emoji

In [9]:
emoji.demojize(text).replace(":",' ')

'Biryani is Tasty face_savoring_food '

### Chat Words

In [1]:
chat = {
    "brb": "be right back",
    "btw": "by the way",
    "lol": "laugh out loud",
    "omg": "oh my god",
    "ttyl": "talk to you later",
    "idk": "I don't know",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "fyi": "for your information",
    "smh": "shaking my head",
    "np": "no problem",
    "tbh": "to be honest",
    "wbu": "what about you",
    "bc": "because",
    "afaik": "as far as I know",
    "asap": "as soon as possible",
    "atm": "at the moment",
    "bbl": "be back later",
    "bfn": "bye for now",
    "bff": "best friends forever",
    "cu": "see you",
    "cya": "see you",
    "dm": "direct message",
    "fb": "Facebook",
    "ftw": "for the win",
    "gg": "good game",
    "gr8": "great",
    "gtg": "got to go",
    "hbu": "how about you",
    "ily": "I love you",
    "jk": "just kidding",
    "lmao": "laughing my ass off",
    "lmk": "let me know",
    "nvm": "never mind",
    "omw": "on my way",
    "plz": "please",
    "ppl": "people",
    "rofl": "rolling on the floor laughing",
    "thx": "thanks",
    "u": "you",
    "ur": "your",
    "yolo": "you only live once",
    "yw": "you're welcome",
    "ty": "thank you",
    "abt" : "about"
}

In [11]:
text = "omg idk abt it ty so much."

In [12]:
# Look up each whole word; str.replace would also rewrite matching substrings inside other words
text = ' '.join(chat.get(word, word) for word in text.split())

text

"oh my god I don't know about it thank you so much."

### HTML Tags

In [13]:
text = '''<!DOCTYPE html>
<html>
<head>
    <title>Sample Text Data</title>
</head>
<body>
    <h1>Welcome to the Sample Text Data</h1>
    <p>This is a <strong>paragraph</strong> with some <em>HTML</em> tags for <a href="https://example.com">linking</a>.</p>
    <div>
        <p>Another paragraph in a <code>&lt;div&gt;</code> element.</p>
        <ul>
            <li>List item one</li>
            <li>List item two</li>
            <li>List item three</li>
        </ul>
    </div>
    <footer>
        <p>Contact us at <a href="mailto:example@example.com">example@example.com</a></p>
    </footer>
</body>
</html>
'''

In [14]:
pattern = r'<[^>]+>'

In [15]:
import re

In [16]:
print(re.sub(pattern," ",text))

 
 
 
     Sample Text Data 
 
 
     Welcome to the Sample Text Data 
     This is a  paragraph  with some  HTML  tags for  linking . 
     
         Another paragraph in a  &lt;div&gt;  element. 
         
             List item one 
             List item two 
             List item three 
         
     
     
         Contact us at  example@example.com  
     
 
 



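
The tag regex above leaves HTML entities such as `&lt;div&gt;` untouched and produces a lot of blank space. A minimal sketch of one way to tidy this up, assuming the standard-library `html.unescape` is acceptable here; the helper name `strip_html` and the sample string are made up for illustration:

```python
import html
import re

TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_html(raw):
    """Remove tags, decode entities, and collapse whitespace (hypothetical helper)."""
    # Drop tags first, so a decoded '<div>' is not mistaken for a real tag
    no_tags = TAG_PATTERN.sub(' ', raw)
    # Turn entities such as '&lt;' and '&gt;' back into characters
    decoded = html.unescape(no_tags)
    # Collapse runs of spaces and newlines into single spaces
    return re.sub(r'\s+', ' ', decoded).strip()

sample = '<p>Another paragraph in a <code>&lt;div&gt;</code> element.</p>'
cleaned = strip_html(sample)
print(cleaned)
```

For messy real-world pages a proper HTML parser is usually a safer choice than a regex; this sketch only covers simple, well-formed markup.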
### URLs

In [17]:
text = '''Check out the latest updates on technology at TechCrunch https://techcrunch.com.

For coding tutorials and resources, visit freeCodeCamp https://www.freecodecamp.org.

If you are looking for academic research papers, Google Scholar https://scholar.google.com is a great resource.

For news and cu'''

In [18]:
pattern = r'www\.\S+|https?://\S+'  # escape the dot and drop the stray trailing space

In [19]:
print(re.sub(pattern,"",text))

Check out the latest updates on technology at TechCrunch 

For coding tutorials and resources, visit freeCodeCamp 

If you are looking for academic research papers, Google Scholar  is a great resource.

For news and cu
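
Because `\S+` runs up to the next whitespace, the pattern above also removes the sentence-ending period after `https://techcrunch.com.`. A possible refinement, sketched here with a hypothetical helper `remove_urls` and a made-up sample sentence, stops the match before trailing punctuation:

```python
import re

# Optional leading spaces, then an http(s):// or www. URL.
# The negative lookbehind makes the match end before trailing
# sentence punctuation such as '.', ',', '!' or '?'.
URL_PATTERN = re.compile(r'[ \t]*(?:https?://|www\.)\S+(?<![.,!?;:])')

def remove_urls(text):
    """Delete URLs but keep the punctuation that follows them (hypothetical helper)."""
    return URL_PATTERN.sub('', text)

sample = ('Visit freeCodeCamp https://www.freecodecamp.org. '
          'Google Scholar https://scholar.google.com is great.')
cleaned = remove_urls(sample)
print(cleaned)
```

The leading `[ \t]*` also swallows the space in front of the URL, so no double spaces are left behind.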


### Punctuation

In [20]:
text = "i'm data scientist, data analyst and ML enginering. and you?"

In [21]:
pattern = r'[^\w]'

In [22]:
re.sub(pattern," ",text)

'i m data scientist  data analyst and ML enginering  and you '
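
Note that `[^\w]` also replaces the apostrophe, so `i'm` becomes `i m`. If contractions should survive, one option is to keep apostrophes that sit between two word characters. The helper name `remove_punctuation` below is an assumption made for this sketch:

```python
import re

def remove_punctuation(text):
    """Strip punctuation but keep apostrophes inside words (hypothetical helper)."""
    # Replace everything except word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", ' ', text)
    # Drop apostrophes that are not surrounded by word characters
    text = re.sub(r"(?<!\w)'|'(?!\w)", ' ', text)
    # Collapse the leftover runs of spaces
    return re.sub(r'\s+', ' ', text).strip()

cleaned = remove_punctuation("i'm data scientist, data analyst and ML enginering. and you?")
print(cleaned)
```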

### Stop Words

In [23]:
import nltk

In [24]:
from nltk.corpus import stopwords

In [25]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91830\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
text = "i am data scientist  data analyst and ML enginering  and you"

In [27]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [28]:
stop_words = set(stopwords.words('english'))  # build the set once instead of on every iteration
doc = [word for word in text.split() if word not in stop_words]

In [29]:
' '.join(doc)

'data scientist data analyst ML enginering'

## Tokenization
- Tokenization splits a document into words (tokens) or sentences.

In [30]:
text = "I am data analyst. I am Engineer."

In [31]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [32]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91830\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [33]:
sent_tokenize(text)

['I am data analyst.', 'I am Engineer.']

In [34]:
word_tokenize(text)

['I', 'am', 'data', 'analyst', '.', 'I', 'am', 'Engineer', '.']
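
To get a feel for what the tokenizers do, here is a rough regex-based approximation. It is only a sketch: NLTK's tokenizers handle abbreviations, contractions and many other cases that these two hypothetical helpers (`simple_word_tokenize`, `simple_sent_tokenize`) do not.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark
    return re.findall(r'\w+|[^\w\s]', text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'
    return re.split(r'(?<=[.!?])\s+', text.strip())

text = "I am data analyst. I am Engineer."
words = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)
print(sentences)
print(words)
```

On this short input the result happens to match the NLTK output above; on text such as "Dr. Smith arrived." the naive sentence splitter would break after "Dr.".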

## Stemming
- Stemming reduces a tokenized word to a base form (stem) by stripping suffixes. The stem is not always a dictionary word (for example, "goes" becomes "goe").
- **Types of stemming:**
  1. Porter Stemmer
     - English only
  2. Lancaster Stemmer
     - English only
     - More aggressive (it strips more suffixes)
  3. Snowball Stemmer
     - Supports multiple languages

In [35]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

In [36]:
port = PorterStemmer()

In [37]:
port.stem("coming")

'come'

In [38]:
port.stem("goes")

'goe'
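
The `'goe'` result shows that a stemmer applies suffix rules without checking a dictionary. The toy function below illustrates the idea only; it is not the Porter algorithm, and the suffix list and `min_stem` threshold are assumptions chosen for this sketch.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ('ing', 'ed', 'es', 's')

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

results = {w: naive_stem(w) for w in ['coming', 'goes', 'played', 'cats']}
print(results)
```

Like a real stemmer, this rule-based approach is fast but can produce non-words (`'com'`, `'goe'`), which is the trade-off that lemmatization addresses.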

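## Lemmatization
- Lemmatization converts a word to its dictionary form (lemma).
- It is slower than stemming because it relies on a dictionary lookup (WordNet here).
- It is generally more accurate than stemming: the result is a real word.
- `WordNetLemmatizer.lemmatize` treats the word as a noun unless a `pos` argument is given.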

In [39]:
from nltk.stem import WordNetLemmatizer

In [40]:
lemma = WordNetLemmatizer()

In [41]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\91830\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [42]:
lemma.lemmatize("comes")

'come'

In [57]:
lemma.lemmatize("goes")

'go'

## POS Tagging

In [44]:
from nltk.tag import pos_tag

In [45]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\91830\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [46]:
text = ["i","am","data"]
pos_tag(text)

[('i', 'NN'), ('am', 'VBP'), ('data', 'NNS')]

# Text Preprocessing

In [47]:
import autocorrect
from nltk.stem import PorterStemmer, WordNetLemmatizer
import emoji
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [48]:
def text_preprocess(text):
    # Relies on the `chat` dictionary defined in the Chat Words section
    speller = autocorrect.Speller()
    stem = PorterStemmer()
    lemma = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # build once; set lookups are fast

    text = text.lower()  # lowercase for uniformity
    # Remove URLs and HTML tags first: the emoji step below replaces ':' and would break 'https://'
    text = re.sub(r'www\.\S+|https?://\S+', ' ', text)  # remove URLs
    text = re.sub(r'<[^>]+>', ' ', text)  # remove HTML tags
    text = speller.autocorrect_sentence(text)  # correct spelling mistakes
    text = emoji.demojize(text).replace(':', ' ')  # convert emoji to their text names
    text = re.sub(r"[^a-zA-Z0-9']", ' ', text)  # replace punctuation with spaces
    text = re.sub(r'\b[0-9]+\b', ' ', text)  # remove standalone numbers (tokens like "10k" are kept)
    text = ' '.join(chat.get(i, i) for i in text.split())  # expand chat words
    text = word_tokenize(text)  # word tokenization
    text = [stem.stem(i) for i in text]  # stemming
    text = [lemma.lemmatize(i) for i in text]  # lemmatization
    text = [i for i in text if i not in stop_words]  # remove stop words

    return text

In [49]:
import pandas as pd

In [50]:
df = pd.read_csv(r"C:\Users\91830\Downloads\Data Science  Course\Machine Learning\Text Processing (NLP)\sample_text_dataset.csv")

In [51]:
df

| Unnamed: 0 | id | text |
|---|---|---|
| 0 | 1 | OMG! I can not believe it! 😂😂 #unbelievable |
| 1 | 2 | I just had the best pizza ever!!! 🍕🍕 @BestPizz... |
| 2 | 3 | Weather today is so nice, totally loving it. ☀️🌴 |
| 3 | 4 | Ugh, Monday mornings are the worst 😩 |
| 4 | 5 | Just finished a 10k run, feeling great! #fitness |
| 5 | 6 | Going to the movies tonight. Any recommendatio... |
| 6 | 7 | Why is this happening to me? 😢 |
| 7 | 8 | I love my new phone! 📱 So fast and sleek. |
| 8 | 9 | Had a great time at the beach today with frien... |
| 9 | 10 | I can not wait for the weekend! 🎉🎉 |


In [52]:
df['text'].apply(text_preprocess)

0    [oh, god, believ, face, tear, joy, face, tear,...
1     [best, pizza, ever, pizza, pizza, bestpizzaplac]
2    [weather, today, nice, total, love, sun, palm,...
3               [gh, monday, morn, worst, weari, face]
4                 [finish, 10k, run, feel, great, fit]
5    [go, movi, tonight, ani, recommend, clapper, b...
6                        [whi, thi, happen, cri, face]
7        [love, new, phone, mobil, phone, fast, sleek]
8    [great, time, beach, today, friend, beach, umb...
9        [wait, weekend, parti, popper, parti, popper]
Name: text, dtype: object

# Feature Vectorization
- Feature vectorization converts the preprocessed text into numerical form.
- **Types of Feature Vectorization**
  1. Heuristic
     - re
     - WordNet
  2. ML
     - One Hot Encoding
     - Index Based
     - BOW (Bag Of Words)
     - TF-IDF
  3. DL Approach
     - Word2Vec
     - fastText
     - GPT
     - LLMs, etc.

## One Hot Encoding
- OHE is a binary representation of the text: each word is marked as present or absent in a document.
- 0 -> Absent | 1 -> Present
- Words -> Columns | Documents -> Rows
- **Steps to implement OHE:**
  1. Preprocess the text.
  2. Tokenize the text data.
  3. Create the vocabulary.
  4. Assign binary values to the words: 0 -> Absent | 1 -> Present.

#### Pros / Cons of applying OHE
- **Pros**
  - Intuitive.
  - Simple to understand.
- **Cons**
  - Input shape is not constant.
    - The input shape changes with the size of the vocabulary.
  - Order/sequence is missing.
    - Contextual meaning is lost.
  - Sparse matrix.
    - Most of the entries are zeros. (The opposite of a sparse matrix is a **dense matrix**, which has few zeros.)
    - Multicollinearity: columns can carry almost the same data, which means they are strongly related.
    - ML models struggle to capture patterns from such data.
  - OOV (out of vocabulary).
    - Words outside the set of unique words seen during training.
    - The ML algorithm is trained on the vocabulary of the training data. If a new word that is not in the vocabulary appears during testing, the model cannot represent it.
  - Lack of semantics.
    - The relationship between words is not captured.
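
The OHE steps can be sketched in plain Python. The two tiny documents are made up for illustration and are assumed to be already preprocessed and tokenized; `one_hot` is a hypothetical helper name.

```python
# Steps 1-2: documents that are already preprocessed and tokenized (made-up sample)
docs = [["data", "science", "rocks"], ["data", "analyst"]]

# Step 3: the vocabulary is the sorted set of unique words across all documents
vocab = sorted({word for doc in docs for word in doc})

# Step 4: one row per document, one column per vocabulary word
def one_hot(doc, vocab):
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]

matrix = [one_hot(doc, vocab) for doc in docs]
print(vocab)
print(matrix)

# An unseen word ("engineer") has no column, so it is silently lost (the OOV problem)
unseen = one_hot(["data", "engineer"], vocab)
print(unseen)
```

The number of columns equals the vocabulary size, which is why the input shape changes whenever the vocabulary does.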

## Index Based Encoding
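- The idea behind index-based encoding is to map each word to one index, i.e., a number. The first step is to create a dictionary that maps words to indexes; each document then becomes the sequence of its word indexes.
- **Pros**
  - Intuitive.
  - Simple to understand.
  - Preserves word order, so some contextual meaning is retained.
- **Cons**
  - Input shape is not constant.
    - Documents of different lengths give sequences of different lengths, so padding or truncation is needed.
  - The index values are arbitrary: a larger number does not mean a more important or more related word.
  - OOV (out of vocabulary).
  - Lack of semantics.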
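
A minimal sketch of index-based encoding in plain Python. Reserving index 0 for padding and 1 for out-of-vocabulary words is a common convention assumed here, not something the notes prescribe; the sample documents and the `encode` helper are made up.

```python
docs = [["data", "science", "rocks"], ["data", "analyst"]]  # made-up, already tokenized

# Assumed convention: 0 pads short documents, 1 stands for unknown words
PAD, OOV = 0, 1

# Build the word -> index dictionary in order of first appearance
vocab = {}
for doc in docs:
    for word in doc:
        vocab.setdefault(word, len(vocab) + 2)

def encode(doc, max_len):
    """Map words to indexes, then truncate or pad to a fixed length."""
    ids = [vocab.get(word, OOV) for word in doc][:max_len]
    return ids + [PAD] * (max_len - len(ids))

encoded = [encode(doc, max_len=4) for doc in docs]
print(vocab)
print(encoded)
print(encode(["data", "engineer"], max_len=4))
```

Padding to `max_len` is one way to give every document the same input shape.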
 
## BOW (Bag Of Words)
- BOW is used to convert preprocessed text into structured, numerical data.
- BOW represents each document by <ins>counts</ins>, i.e., the number of occurrences of each word in the <ins>document (row/cell)</ins>.
- BOW gives importance to words in the document: a word that occurs more often gets a larger value.
- **Steps to implement BOW:**
  1. Preprocess the text.
  2. Tokenize the text data.
  3. Create the vocabulary.
  4. Assign values based on the number of occurrences.
- **Pros**
  - Intuitive.
  - Simple to understand.
  - Gives importance to words.
- **Cons**
  - Input shape is not constant.
    - Solution: max_features
  - Order/sequence is missing.
    - Solution: n-grams.
  - Sparse matrix.
  - OOV (out of vocabulary).
  - Lack of semantics.
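
The BOW steps, together with the two listed solutions, can be sketched in plain Python. The helper names (`build_vocab`, `bow_vector`, `ngrams`) and the sample documents are assumptions for illustration; here `max_features` simply keeps the most frequent words, which is one plausible reading of that option.

```python
from collections import Counter

docs = [["data", "science", "data"], ["data", "analyst"]]  # made-up, already tokenized

def build_vocab(docs, max_features=None):
    """Collect the vocabulary, optionally keeping only the most frequent words."""
    counts = Counter(word for doc in docs for word in doc)
    # Most frequent first; ties are broken alphabetically for a stable order
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return sorted(ranked[:max_features])

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

def ngrams(tokens, n=2):
    """Join n consecutive tokens so that some word order is kept."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

vocab = build_vocab(docs)
matrix = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
print(build_vocab(docs, max_features=2))
print(ngrams(docs[0]))
```

Capping the vocabulary fixes the number of columns, and feeding n-grams instead of single words into the same counting step recovers a little of the lost ordering.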