## Text Preprocessing

Preprocessing is an important step in NLP. The better the text is preprocessed, the better the model tends to perform. Which operations to apply depends on the model we will use.


In [1]:
import nltk 
import string
import re

In [2]:
input_str= "Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains."

## Lower Casing

In [3]:
def lowercase_text(text):
  return text.lower()
  

In [4]:
lowercase_text(input_str)

'recurrent models typically factor computation along the symbol positions of the input and output sequences. aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. this inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. the fundamental constraint of sequential computation, however, remains.'

## Removal of Numerical Values

In [7]:
def remove_num(text):
  result= re.sub(r'\d+', '' , text)
  return result

In [8]:
remove_num(input_str)

'Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht− and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [] and conditional computation [], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.'
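
Note that stripping every digit leaves empty brackets (`[]`) where the citations were and turns `ht−1` into `ht−`. The following is a minimal sketch of one way to avoid that, assuming bracketed numeric citations should be dropped entirely and only whitespace-delimited numbers removed. The name `remove_citations_and_nums` is hypothetical and not part of the original notebook.

```python
import re

def remove_citations_and_nums(text):
    # Drop bracketed numeric citations such as "[21]" or "[3, 7]",
    # including the whitespace in front of them, so no "[]" is left behind.
    text = re.sub(r'\s*\[\d+(?:\s*,\s*\d+)*\]', '', text)
    # Remove only numbers that stand alone between whitespace,
    # so tokens like "ht−1" keep their digit.
    return re.sub(r'(?<!\S)\d+(?!\S)', '', text)

sample = "hidden state ht−1, tricks [21] and computation [32], 3 layers"
result = remove_citations_and_nums(sample)
print(result)
```

Whether digits inside tokens should survive depends on the downstream model, so treat the two patterns as a starting point.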

## Removal of Punctuation

In [9]:
def remove_punct(text):
  translator=str.maketrans('','', string.punctuation)
  return text.translate(translator)

In [10]:
remove_punct(input_str)

'Recurrent models typically factor computation along the symbol positions of the input and output sequences Aligning the positions to steps in computation time they generate a sequence of hidden states ht as a function of the previous hidden state ht−1 and the input for position t This inherently sequential nature precludes parallelization within training examples which becomes critical at longer sequence lengths as memory constraints limit batching across examples Recent work has achieved significant improvements in computational efficiency through factorization tricks 21 and conditional computation 32 while also improving model performance in case of the latter The fundamental constraint of sequential computation however remains'
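
`string.punctuation` contains only ASCII punctuation, which is why the Unicode minus sign in `ht−1` survives in the output above. The sketch below filters by Unicode category instead. The function name `remove_punct_unicode` is an assumption for illustration. Removing the symbol categories as well also drops characters such as `$`, `+`, and most emoji, which may or may not be wanted.

```python
import unicodedata

def remove_punct_unicode(text):
    # Keep a character unless its Unicode general category is
    # punctuation (P*) or symbol (S*).
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(('P', 'S'))
    )

cleaned = remove_punct_unicode("hidden state ht−1, “quoted” text…")
print(cleaned)
```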

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [42]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Stopword Removal and Tokenization

In [16]:
def rem_stopwords(text):
  # NLTK's English stopword list is lowercase, so this comparison is case-sensitive
  stop_words = set(stopwords.words('english'))
  word_tokens = word_tokenize(text)
  filtered_text = [word for word in word_tokens if word not in stop_words]
  return filtered_text

In [24]:
a=rem_stopwords(input_str)

In [27]:
a

['Recurrent',
 'models',
 'typically',
 'factor',
 'computation',
 'along',
 'symbol',
 'positions',
 'input',
 'output',
 'sequences',
 '.',
 'Aligning',
 'positions',
 'steps',
 'computation',
 'time',
 ',',
 'generate',
 'sequence',
 'hidden',
 'states',
 'ht',
 ',',
 'function',
 'previous',
 'hidden',
 'state',
 'ht−1',
 'input',
 'position',
 't.',
 'This',
 'inherently',
 'sequential',
 'nature',
 'precludes',
 'parallelization',
 'within',
 'training',
 'examples',
 ',',
 'becomes',
 'critical',
 'longer',
 'sequence',
 'lengths',
 ',',
 'memory',
 'constraints',
 'limit',
 'batching',
 'across',
 'examples',
 '.',
 'Recent',
 'work',
 'achieved',
 'significant',
 'improvements',
 'computational',
 'efficiency',
 'factorization',
 'tricks',
 '[',
 '21',
 ']',
 'conditional',
 'computation',
 '[',
 '32',
 ']',
 ',',
 'also',
 'improving',
 'model',
 'performance',
 'case',
 'latter',
 '.',
 'The',
 'fundamental',
 'constraint',
 'sequential',
 'computation',
 ',',
 'however',
 ',',
 'remains',
 '.']
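
Capitalized stopwords such as `'This'` and `'The'` remain in the output because the tokens are compared to a lowercase stopword list as-is. Below is a minimal sketch of a case-insensitive filter. The function name and the tiny `demo_stop_words` set are assumptions for illustration. In the notebook, the set would be `set(stopwords.words('english'))`.

```python
def remove_stopwords_casefold(tokens, stop_words):
    # Compare the lowercased token against the stopword set,
    # but keep the original casing of the tokens that survive.
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Tiny illustrative stopword set (hypothetical, not the NLTK list).
demo_stop_words = {"this", "the", "of", "a"}
demo_tokens = ["This", "inherently", "sequential", "nature", "The", "constraint"]
filtered = remove_stopwords_casefold(demo_tokens, demo_stop_words)
print(filtered)
```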

In [26]:
from collections import Counter

# a is already a list of tokens, so it can be counted directly
cnt = Counter(a)

cnt.most_common(10)

[(',', 7),
 ('computation', 4),
 ('.', 4),
 ('positions', 2),
 ('input', 2),
 ('sequence', 2),
 ('hidden', 2),
 ('sequential', 2),
 ('examples', 2),
 ('[', 2)]

## Stemming 

In [34]:
ps = PorterStemmer()

In [35]:
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

In [36]:
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [37]:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

In [38]:
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

it
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


## Lemmatization

In [46]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
  
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("worst :", lemmatizer.lemmatize("worst",pos="a"))  
print("better :", lemmatizer.lemmatize("better", pos ="a"))

rocks : rock
corpora : corpus
worst : bad
better : good


## Conversion of Emoji to Words
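
Instead of deleting emoji, they can be replaced with a textual name so that the signal they carry is kept. The sketch below is one possible approach using only the standard library's `unicodedata` module. The function name `emoji_to_words` and the `:name:` output format are assumptions. It handles single-code-point emoji only, not multi-code-point sequences such as skin-tone or ZWJ combinations.

```python
import unicodedata

def emoji_to_words(text):
    out = []
    for ch in text:
        # Category "So" (Symbol, other) covers most pictographic emoji.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Fall back to the original character if it has no name.
            out.append(f":{name.lower().replace(' ', '_')}:" if name else ch)
        else:
            out.append(ch)
    return "".join(out)

converted = emoji_to_words("game is on 🔥")
print(converted)
```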

## Removal of Emojis

In [47]:
def remove_emoji(text):
    # The parameter is named "text" so it does not shadow the imported string module
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range; also matches many non-emoji characters (e.g. CJK)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("game is on 🔥🔥")

'game is on '

## Removal of URLs

In [49]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [50]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

In [51]:
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'

## Removal of HTML Tags

In [52]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI
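
The regular expression above works for simple markup but leaves entities such as `&amp;` untouched. As an alternative, here is a minimal sketch built on the standard library's `html.parser`. The class `_TextExtractor` and the function `strip_html` are hypothetical names. Note that the contents of `<script>` or `<style>` elements would still be returned as text.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)

def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

stripped = strip_html("<p>Tom &amp; Jerry</p>")
print(stripped)
```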



## Chat Words Conversion

In [53]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [54]:
# Build the abbreviation -> expansion mapping from the "KEY=VALUE" lines
chat_words_map_dict = {}
for line in chat_words_str.split("\n"):
    if line != "":
        cw, cw_expanded = line.split("=", 1)
        chat_words_map_dict[cw] = cw_expanded
# Set of abbreviations for fast membership tests
chat_words_list = set(chat_words_map_dict)

def chat_words_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

chat_words_conversion("one minute BRB")

'one minute Be Right Back'
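
Because `chat_words_conversion` splits on whitespace, an abbreviation followed by punctuation (for example `BRB!`) is not found in the mapping. The following is a minimal sketch of a regex-based variant that looks up each alphanumeric run separately. The function name `expand_chat_words` and the small `demo_chat_words` dictionary are assumptions for illustration. The full mapping would be `chat_words_map_dict`.

```python
import re

# Small illustrative subset of the chat-word mapping (hypothetical).
demo_chat_words = {
    "BRB": "Be Right Back",
    "ASAP": "As Soon As Possible",
    "IMO": "In My Opinion",
}

def expand_chat_words(text, mapping):
    def replace(match):
        word = match.group(0)
        # Look the word up case-insensitively; keep it unchanged if unknown.
        return mapping.get(word.upper(), word)
    # \w+ matches each alphanumeric run, so adjacent punctuation stays in place.
    return re.sub(r"\w+", replace, text)

expanded = expand_chat_words("one minute, brb! reply asap.", demo_chat_words)
print(expanded)
```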