## **NLP Lecture 01 -> On notebook**
- Introduction to NLP
- What is Natural Language?
- Real-world applications
- Common NLP tasks
- Approaches to NLP
- Challenges in NLP

## **NLP Lecture 02 -> On notebook**
- NLP Pipeline
  - Data Acquisition
  - Text Preparation
  - Feature Engineering
  - Modelling
  - Deployment

## **Text Preparation: Cleaning**

## a. Removing HTML Tags

In [None]:
sample_text = '<html> <head> <style> </style> </head> <body> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p> </body> </html>'
sample_text

'<html> <head> <style> </style> </head> <body> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p> </body> </html>'

In [None]:
# Strip HTML tags with a non-greedy regex: '<', then as few
# characters as possible, then '>'
import re

def htmltag_strip(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

In [None]:
htmltag_strip(sample_text)

'      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.   '
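The regex above removes only the tags themselves: it leaves the contents of `<style>` or `<script>` elements in place, misses tags that span several lines (`.` does not match a newline by default), and leaves runs of whitespace behind. As a rough alternative sketch, the standard library's `html.parser.HTMLParser` can do the stripping instead. The names `_TextExtractor` and `html_to_text` are hypothetical helpers introduced for illustration and are not part of the original notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes, then collapse any run of whitespace to one space
        return " ".join("".join(self._chunks).split())


def html_to_text(data):
    """Return the visible text of an HTML string (assumed helper name)."""
    parser = _TextExtractor()
    parser.feed(data)
    parser.close()
    return parser.text()


print(html_to_text('<div\nclass="x">Hi</div> <style>p{}</style>there'))
```

Because adjacent text nodes are concatenated without a separator, block elements with no whitespace between them (for example `<p>a</p><p>b</p>`) still run together, just as they do with the regex version.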

## b. Unicode Normalization

In [None]:
emoji_text = 'Hii👋❤️️, How 🛩🛥 are you🤦‍♀️. I was 🏟🥇 waiting 🌟💖 🇮🇩 since🕒💘. Hope we will ☎️💄 meet tomorrow ✡️🔀 with your 😂🌹 family✍️👍.'
emoji_text

'Hii👋❤️️, How 🛩🛥 are you🤦\u200d♀️. I was 🏟🥇 waiting 🌟💖 🇮🇩 since🕒💘. Hope we will ☎️💄 meet tomorrow ✡️🔀 with your 😂🌹 family✍️👍.'

In [None]:
# Encode to UTF-8 to inspect the raw bytes behind each emoji
emoji_text.encode('utf-8')

b'Hii\xf0\x9f\x91\x8b\xe2\x9d\xa4\xef\xb8\x8f\xef\xb8\x8f, How \xf0\x9f\x9b\xa9\xf0\x9f\x9b\xa5 are you\xf0\x9f\xa4\xa6\xe2\x80\x8d\xe2\x99\x80\xef\xb8\x8f. I was \xf0\x9f\x8f\x9f\xf0\x9f\xa5\x87 waiting \xf0\x9f\x8c\x9f\xf0\x9f\x92\x96 \xf0\x9f\x87\xae\xf0\x9f\x87\xa9 since\xf0\x9f\x95\x92\xf0\x9f\x92\x98. Hope we will \xe2\x98\x8e\xef\xb8\x8f\xf0\x9f\x92\x84 meet tomorrow \xe2\x9c\xa1\xef\xb8\x8f\xf0\x9f\x94\x80 with your \xf0\x9f\x98\x82\xf0\x9f\x8c\xb9 family\xe2\x9c\x8d\xef\xb8\x8f\xf0\x9f\x91\x8d.'
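Encoding to UTF-8 exposes the underlying bytes but does not normalize anything. A minimal sketch of normalization itself, assuming the standard library's `unicodedata.normalize` is an acceptable tool here; `normalize_text` is a hypothetical helper name, and the sample strings are illustrative rather than taken from the notebook.

```python
import unicodedata


def normalize_text(text, form="NFKC"):
    """Return `text` in the given Unicode normalization form (assumed helper)."""
    return unicodedata.normalize(form, text)


# 'e' followed by a combining acute accent: two code points, one visible character
decomposed = "cafe\u0301"
composed = normalize_text(decomposed, "NFC")
print(len(decomposed), len(composed))

# NFKC additionally folds compatibility characters, such as the 'fi'
# ligature (U+FB01) and full-width Latin letters, into their plain forms
print(normalize_text("\ufb01le"))
print(normalize_text("\uff28\uff49"))
```

NFC only composes canonically equivalent sequences, while NFKC also rewrites compatibility characters, so NFKC is the more aggressive choice and can discard distinctions that matter in some corpora.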

## c. Spelling Check

In [None]:
incorrect_text = 'ceertain conditions duriing sevaral ggeneration aree modified in the saame maner.'
incorrect_text

'ceertain conditions duriing sevaral ggeneration aree modified in the saame maner.'

In [None]:
# Correct misspellings with TextBlob's built-in spelling corrector
from textblob import TextBlob
txtblob = TextBlob(incorrect_text)
txtblob.correct()

TextBlob("certain conditions during several generation are modified in the same manner.")
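To make the idea behind spelling correction concrete, here is a toy sketch that maps each word to its closest entry in a known vocabulary using the standard library's `difflib.get_close_matches`. This is a swapped-in technique for illustration only and is not how TextBlob works internally; `VOCAB`, `correct_word`, `correct_text`, and the `0.8` cutoff are all assumptions.

```python
import difflib
import re

# Tiny hypothetical vocabulary; a real corrector would use a large word list
VOCAB = ["certain", "conditions", "during", "several", "generation",
         "are", "modified", "in", "the", "same", "manner"]


def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Correct every alphabetic token in `text`, leaving punctuation untouched."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group(0)), text)


print(correct_text("ceertain conditions duriing sevaral ggeneration "
                   "aree modified in the saame maner."))
```

Like the TextBlob result above, this corrects each word in isolation, so a grammatical slip such as "several generation" passes through unchanged.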

## d. Tokenization

In [None]:
dummy='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
dummy

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'

In [None]:
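# Import NLTK and download the Punkt sentence-tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead)
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

# Split the paragraph into sentences
sents = sent_tokenize(dummy)
sents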

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
 'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.',
 'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.',
 'Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.']

In [None]:
# Word-tokenize each sentence (the loop variable `word` holds one sentence per iteration)
for word in sents:
    print(word_tokenize(word))

['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
['Ut', 'enim', 'ad', 'minim', 'veniam', ',', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut', 'aliquip', 'ex', 'ea', 'commodo', 'consequat', '.']
['Duis', 'aute', 'irure', 'dolor', 'in', 'reprehenderit', 'in', 'voluptate', 'velit', 'esse', 'cillum', 'dolore', 'eu', 'fugiat', 'nulla', 'pariatur', '.']
['Excepteur', 'sint', 'occaecat', 'cupidatat', 'non', 'proident', ',', 'sunt', 'in', 'culpa', 'qui', 'officia', 'deserunt', 'mollit', 'anim', 'id', 'est', 'laborum', '.']
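For intuition about what the tokenizers are doing, here is a rough regex-only sketch. It is not NLTK's Punkt or Treebank algorithm and will mis-split abbreviations such as "e.g."; `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical names introduced here.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive assumption)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sample = "Ut enim ad minim veniam, quis nostrud. Duis aute irure dolor!"
for sentence in simple_sent_tokenize(sample):
    print(simple_word_tokenize(sentence))
```

On plain prose like the Lorem ipsum paragraph this behaves much like the NLTK calls above, with each punctuation mark emitted as its own token.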


In [None]:
# Character-level tokenization: print the first nine characters of the last sentence
for char in sents[-1][:9]:
    print(char)

E
x
c
e
p
t
e
u
r
