# Preprocessing in NLP

![nlp.png](attachment:nlp.png)

# 1: Lowercasing

In [6]:
def convert_lowercase(text):
    # str.lower() returns a new, lowercased copy of the string
    return text.lower()

In [7]:
lc = "AI's Impact Across Industries Continues to Grow, Driving Innovation and Reshaping the Future of Technology."

In [8]:
convert_lowercase(lc)

"ai's impact across industries continues to grow, driving innovation and reshaping the future of technology."

# 2: Removing HTML Tags

In [9]:
import re

def remove_html_tags(text):
    # Non-greedy match of everything between '<' and the next '>'
    re_html = re.compile(r'<.*?>')
    return re_html.sub('', text)

In [18]:
text = '<h2>Welcome to our Website!</h2><p> Explore the services, we offer and discover the possibilities.</p><blockquote> Life is what happens when you are busy making other plans.<cite> Happy Learning</cite></blockquote>'

In [19]:
print(remove_html_tags(text))

Welcome to our Website! Explore the services, we offer and discover the possibilities. Life is what happens when you are busy making other plans. Happy Learning
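
The regex above only drops tags, so HTML entities such as `&amp;` stay in the text. A minimal sketch of one way to handle both, assuming the input is simple markup rather than arbitrary HTML (`strip_html` is a hypothetical helper name, and a regex is not a full HTML parser):

```python
import re
from html import unescape

# Anything that looks like <...> is treated as a tag (assumption: no '>' inside attributes)
TAG_RE = re.compile(r'<[^>]+>')

def strip_html(text):
    # Remove the tags first, then decode entities such as &amp; or &lt;
    return unescape(TAG_RE.sub('', text))

print(strip_html('<p>Fish &amp; chips</p>'))
```

For messy real-world pages, a dedicated parser is usually the safer choice.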


# 3: Removing URLs

In [20]:
text = 'My medium link: https://medium.com/@borhadepiyush, My kaggle link : https://www.kaggle.com/piyushborhade, My linkedIn link: https://www.linkedin.com/in/piyush-borhade/'

In [21]:
def remove_url(text):
    # Raw string so that \S and \. reach the regex engine unchanged
    re_url = re.compile(r'https?://\S+|www\.\S+')
    return re_url.sub('', text)

In [25]:
print(f'Text before removing URL: {text}')

Text before removing URL: My medium link: https://medium.com/@borhadepiyush, My kaggle link : https://www.kaggle.com/piyushborhade, My linkedIn link: https://www.linkedin.com/in/piyush-borhade/


In [26]:
print(f'Text after removing URL : {remove_url(text)}')

Text after removing URL : My medium link:  My kaggle link :  My linkedIn link: 
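
Note that `\S+` also swallows punctuation glued to the end of a URL, which is why the commas after the first two links disappear in the output above. A possible variant that hands trailing punctuation back to the text, as a sketch (`remove_url_keep_punct` and the `TRAILING` character set are illustrative choices, not part of the original notebook):

```python
import re

URL_RE = re.compile(r'(?:https?://|www\.)\S+')
TRAILING = '.,;:!?)'  # assumption: these characters rarely end a real URL

def remove_url_keep_punct(text):
    def _repl(match):
        url = match.group(0)
        core = url.rstrip(TRAILING)
        # Return only the punctuation that was attached to the end of the URL
        return url[len(core):]
    return URL_RE.sub(_repl, text)

print(remove_url_keep_punct('See https://example.com, then reply.'))
```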


# 4: Removing Punctuation

In [27]:
import string
exclude = string.punctuation

def remove_punc(text):
    return text.translate(str.maketrans('', '', exclude))

In [28]:
text = 'In the ever-evolving realm of artificial intelligence (AI), breakthroughs and innovations punctuate the landscape with the fervor of discovery! As algorithms crunch vast datasets, insights emerge like exclamation points, illuminating the potential for transformative change. The collaboration of human intellect and machine learning, marked by hyphens of synergy, propels the field forward with relentless determination.'

In [29]:
print(f'Text before removing punctuation: {text}')

Text before removing punctuation: In the ever-evolving realm of artificial intelligence (AI), breakthroughs and innovations punctuate the landscape with the fervor of discovery! As algorithms crunch vast datasets, insights emerge like exclamation points, illuminating the potential for transformative change. The collaboration of human intellect and machine learning, marked by hyphens of synergy, propels the field forward with relentless determination.


In [32]:
text_without_punc = remove_punc(text)
print(f'Text after removing punctuation : {text_without_punc}')

Text after removing punctuation : In the everevolving realm of artificial intelligence AI breakthroughs and innovations punctuate the landscape with the fervor of discovery As algorithms crunch vast datasets insights emerge like exclamation points illuminating the potential for transformative change The collaboration of human intellect and machine learning marked by hyphens of synergy propels the field forward with relentless determination
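
Deleting punctuation outright fuses hyphenated words, as `everevolving` shows in the output above. One alternative, sketched here under the assumption that word boundaries matter more than contractions, is to map each punctuation character to a space and then collapse the whitespace (`remove_punc_keep_boundaries` is a hypothetical name):

```python
import string

# Map every punctuation character to a single space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def remove_punc_keep_boundaries(text):
    # split() + join() collapses the runs of spaces left behind
    return ' '.join(text.translate(PUNCT_TO_SPACE).split())

print(remove_punc_keep_boundaries('ever-evolving realm (AI), wow!'))
```

The trade-off is that contractions and possessives are split too: `AI's` becomes `AI s`.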


# 5: Chat Word Treatment

In [40]:
chat_words = {
    'FYI': 'for your information',
    'LOL': 'laugh out loud',
    'DM': 'Direct Message',
    'BTW': 'By The Way',
}

In [41]:
def chat_words_conv(text):
    new_text = []
    for word in text.split():
        if word.upper() in chat_words:
            new_text.append(chat_words[word.upper()])
        else:
            new_text.append(word)

    return ' '.join(new_text)

In [44]:
text = 'BTW I sent you a DM yesterday with some details about the upcoming event.'
print(chat_words_conv(text))

By The Way I sent you a Direct Message yesterday with some details about the upcoming event.
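
Because the function splits on whitespace, a token such as `BTW,` (with a comma attached) is not found in the dictionary and stays unexpanded. A sketch of a regex-based variant that matches on word boundaries instead; the dictionary is repeated so the cell runs on its own, and `chat_words_conv_regex` is a hypothetical name:

```python
import re

chat_words = {
    'FYI': 'for your information',
    'LOL': 'laugh out loud',
    'DM': 'Direct Message',
    'BTW': 'By The Way',
}

# One alternation of all abbreviations, matched as whole words, case-insensitively
CHAT_RE = re.compile(
    r'\b(' + '|'.join(map(re.escape, chat_words)) + r')\b',
    re.IGNORECASE,
)

def chat_words_conv_regex(text):
    # Look up the upper-cased match so 'btw' and 'BTW' expand the same way
    return CHAT_RE.sub(lambda m: chat_words[m.group(0).upper()], text)

print(chat_words_conv_regex('BTW, check your DM.'))
```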


# 6: Spelling Correction

In [47]:
from textblob import TextBlob

In [53]:
text = 'The algorithim they used for data anlysis was extremly sophistcated and effecient, The Ariificial Intelligance conference is happening next week, and I am super exited to attend!'

In [54]:
textblob_ = TextBlob(text)

In [55]:
print(f'Corrected text: {textblob_.correct().string}')

Corrected text: The algorithim they used for data analysis was extremely sophisticated and efficient, The Artificial Intelligence conference is happening next week, and I am super excited to attend!


# 7: Removing Stop Words

In [56]:
from nltk.corpus import stopwords
stopwords_english = stopwords.words('english')

In [57]:
def remove_stopwords(text):
    # The comparison is case-sensitive and NLTK's list is lowercase,
    # so capitalized words such as 'In' are kept.
    new_text = [word for word in text.split() if word not in stopwords_english]
    return ' '.join(new_text)

In [58]:
text = 'In the vast world of technology, the evolution of artificial intelligence has reshaped industries and paved the way for innovative solutions.'

In [59]:
print(f'Text before removing stop words: {text}')

Text before removing stop words: In the vast world of technology, the evolution of artificial intelligence has reshaped industries and paved the way for innovative solutions.


In [60]:
print(f'Text after removing stop words : {remove_stopwords(text)}')

Text after removing stop words : In vast world technology, evolution artificial intelligence reshaped industries paved way innovative solutions.
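
The leading `In` survives in the output above because the lookup is case-sensitive. A minimal sketch of a case-insensitive version follows. To stay self-contained it uses a tiny hand-written stop list (`SAMPLE_STOPWORDS`, an illustrative stand-in for NLTK's list), and `remove_stopwords_ci` is a hypothetical name:

```python
# Illustrative stand-in; in the notebook you would pass set(stopwords_english) instead
SAMPLE_STOPWORDS = {'in', 'the', 'of', 'has', 'and', 'for'}

def remove_stopwords_ci(text, stop_words=SAMPLE_STOPWORDS):
    # Lowercase only for the lookup so the kept words retain their original casing
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords_ci('In the vast world of technology'))
```

Using a set rather than a list also makes each membership test constant-time on average, which helps on large corpora.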


# 8: Handling Emojis

In [63]:
import emoji

In [64]:
text = 'He is suffering from fever 🤒'

In [65]:
print(emoji.demojize(text))

He is suffering from fever :face_with_thermometer:
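
If the `emoji` package is not available, the standard library can give a rough approximation. The sketch below replaces every character in the Unicode category `So` (Symbol, other) with its Unicode name; this is an assumption-laden shortcut, since that category also covers non-emoji symbols such as `©`, and multi-codepoint emoji sequences are not handled. `describe_symbols` is a hypothetical name:

```python
import unicodedata

def describe_symbols(text):
    out = []
    for ch in text:
        name = unicodedata.name(ch, '')
        if unicodedata.category(ch) == 'So' and name:
            # e.g. 'FACE WITH THERMOMETER' -> ':face_with_thermometer:'
            out.append(':' + name.lower().replace(' ', '_') + ':')
        else:
            out.append(ch)
    return ''.join(out)

print(describe_symbols('He is suffering from fever 🤒'))
```

The generated labels follow Unicode character names, so they will not always match the aliases produced by `emoji.demojize`.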


# 9: Tokenization

In [88]:
import nltk

In [89]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [90]:
from nltk.tokenize import word_tokenize

In [91]:
sent_1 = 'Reading is to the mind what exercise is to the body'

In [92]:
print(word_tokenize(sent_1))

['Reading', 'is', 'to', 'the', 'mind', 'what', 'exercise', 'is', 'to', 'the', 'body']


# 10: Stemming

In [78]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

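def perform_stemming(text):
    # Splitting on whitespace leaves punctuation attached, so a token
    # such as 'swimming,' is not reduced to its stem.
    new_text = [ps.stem(word) for word in text.split()]
    return ' '.join(new_text)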

In [79]:
text = 'She spends her weekends swimming, finding solace in the rhythmic motion of swimming laps'
perform_stemming(text)

'she spend her weekend swimming, find solac in the rhythmic motion of swim lap'
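
The individual steps are usually chained, and the order matters: tags should go before URLs (otherwise `\S+` can swallow a closing tag), and URLs before punctuation (otherwise `://` is destroyed and the URL pattern no longer matches). A minimal sketch using only the standard-library steps from this notebook; `basic_clean` is a hypothetical name and the ordering is one reasonable choice, not the only one:

```python
import re
import string

HTML_RE = re.compile(r'<.*?>')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def basic_clean(text):
    text = text.lower()                 # 1. lowercase
    text = HTML_RE.sub('', text)        # 2. strip HTML tags
    text = URL_RE.sub('', text)         # 3. strip URLs
    text = text.translate(PUNCT_TABLE)  # 4. strip punctuation
    return ' '.join(text.split())       # 5. collapse leftover whitespace

print(basic_clean('<p>Visit https://example.com NOW!</p>'))
```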
