Text preprocessing is an essential step in natural language processing (NLP) that involves preparing and cleaning text data before applying any machine learning or deep learning models. It ensures that the data is in a suitable format for analysis and modeling. Common text preprocessing techniques include:
1. Lower casing
2. Removal of HTML Tags
3. Removal of URLs
4. Removal of Punctuations
5. Removal of Stop words
6. Removal of emojis
7. Conversion of emojis to words
8. Removal of frequent words
9. Stemming
10. Lemmatization
11. Chat words conversion
12. Spelling correction

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

In [None]:

df = pd.read_csv("/kaggle/input/twitter-dataset/sample.csv", nrows=5000)

In [None]:
df.head()

In [None]:
# Copy the column so later assignments do not act on a view of df
df_text = df[['text']].copy()
df_text['text'] = df_text['text'].astype(str)

In [None]:
df_text.head()

### 1. Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input text into a single casing format so that 'text', 'Text' and 'TEXT' are treated the same way. This is needed because string comparison in Python is case sensitive, so these three would otherwise count as different tokens.

This is most helpful for text featurization techniques such as frequency counts and TF-IDF, because it merges variants of the same word, which reduces duplication and gives correct counts and TF-IDF values.

It may not be helpful for tasks like part-of-speech tagging (where proper casing carries information about nouns and so on) and sentiment analysis (where upper casing can signal anger and so on).

In [None]:
df_text['lower_text'] = df_text['text'].str.lower()
df_text.head()
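
To see why lower casing matters for count-based features, here is a minimal sketch that counts tokens before and after lower casing. The sample sentence is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

sample = "Text is text and TEXT is Text"  # made-up example sentence

# Without lower casing, each casing variant gets its own count
raw_counts = Counter(sample.split())

# After lower casing, all variants collapse into a single token
lower_counts = Counter(sample.lower().split())

print(raw_counts)
print(lower_counts)
```

The three casing variants are counted separately in the first counter and merged into one entry in the second.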

### 2. Removal of HTML Tags
Another common preprocessing technique that comes in handy in many places is the removal of HTML tags. This is especially useful if we scrape the data from different websites, since we might end up with HTML strings as part of our text.

First, let us try to remove the HTML tags using regular expressions.

In [None]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'',text)

In [None]:
text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))
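
The regular expression above removes tags but leaves character entities such as `&amp;` untouched. As an alternative, here is a rough sketch using the parser from Python's standard library. The class name `TextExtractor` and the function `strip_html` are hypothetical helpers introduced only for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML string (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; entities are already decoded here
        self.chunks.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

sample = "<p>H2O &amp; <b>AutoML</b></p>"
print(strip_html(sample))
```

Note that this sketch also keeps the contents of script and style elements, which would need extra handling on real web pages.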

### 3. Removal of URLs
The next preprocessing step is to remove any URLs present in the data. For example, if we are analyzing tweets, there is a good chance that a tweet will contain a URL, and we will probably want to remove it before further analysis.

In [None]:
def remove_url(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [None]:
text = "NLP Future  blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_url(text)

### 4. Removal of Punctuations
Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization process that helps to treat 'hurray' and 'hurray!' in the same way.

We also need to choose the list of punctuation characters to exclude carefully, depending on the use case. For example, string.punctuation in Python contains the following symbols:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

We can add or remove characters as needed.

In [None]:
df_text

In [None]:
df_text.drop(['lower_text'],axis=1,inplace=True)

In [None]:
PUNCT_TO_REMOVE = string.punctuation
def remove_punc(text):
    return text.translate(str.maketrans('','',PUNCT_TO_REMOVE))

In [None]:
df_text['Punc_text'] = df_text['text'].apply(lambda text : remove_punc(text))
df_text.head()
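
As an example of adjusting the punctuation list to the use case, the sketch below keeps `#` and `@` so that hashtags and mentions survive in tweets. Which characters to keep is an illustrative assumption, and the names `KEEP`, `CUSTOM_PUNCT` and `remove_custom_punc` are introduced only for this example.

```python
import string

# Keep '#' and '@' so hashtags and mentions survive (an illustrative choice)
KEEP = "#@"
CUSTOM_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP)

# Build the translation table once and reuse it for every text
TABLE = str.maketrans("", "", CUSTOM_PUNCT)

def remove_custom_punc(text):
    return text.translate(TABLE)

print(remove_custom_punc("Hurray!!! @h2oai released #AutoML, finally."))
```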

### 5. Removal of Stopwords
Stopwords are commonly occurring words in a language, like 'the', 'a' and so on. Most of the time they can be removed from the text, as they don't provide valuable information for downstream analysis. In cases like part-of-speech tagging we should not remove them, as they provide very valuable information about the POS.

These stopword lists are already compiled for different languages and we can safely use them. For example, the stopword list for the English language from the nltk package can be seen below.

In [None]:
from nltk.corpus import stopwords
', '.join(stopwords.words('english'))

In [None]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    # The nltk stopword list is lower case, so compare the lower-cased word
    return " ".join([word for word in str(text).split() if word.lower() not in STOPWORDS])

In [None]:
df_text["text_stop"] = df_text["Punc_text"].apply(lambda text: remove_stopwords(text))
df_text.head()
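
A stopword list can also be adjusted to the task. The sketch below keeps negations, which often matter for sentiment analysis. It uses a tiny hand-written set in place of the nltk list so that it runs without any downloads. The set contents and the name `remove_custom_stopwords` are assumptions made for this example.

```python
# Tiny hand-written stand-in for stopwords.words('english') (illustrative only)
BASE_STOPWORDS = {"the", "a", "is", "was", "this", "not", "no"}

# Negations carry meaning for sentiment analysis, so keep them in the text
NEGATIONS = {"not", "no"}
CUSTOM_STOPWORDS = BASE_STOPWORDS - NEGATIONS

def remove_custom_stopwords(text):
    # Compare in lower case so that "This" is treated like "this"
    return " ".join(w for w in text.split() if w.lower() not in CUSTOM_STOPWORDS)

print(remove_custom_stopwords("This movie was not good"))
```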

### 6. Removal of Emojis
With the growing use of social media platforms, there has been an explosion in the use of emojis in day-to-day communication as well. We will probably need to remove these emojis for some of our textual analysis.

In [None]:
def remove_emoji(text):
    # Character class covering several emoji-related Unicode ranges
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range; also matches many non-emoji characters such as CJK text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [None]:
remove_emoji("There is 🔥🔥 in the house")

### 7. Conversion of Emoji to Words
Instead of removing emojis, we can convert them to words so that the information they carry is kept. Neel Shah has put together a list of emojis with the corresponding words as part of his GitHub repo. In the cell below, we use the demojize function from the emoji package, which replaces each emoji with its textual name.

In [None]:
import emoji
print(emoji.demojize("Python is 🔥😂"))
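
The dictionary-based approach mentioned above can also be sketched without any third-party package. The mapping below is a tiny illustrative sample, not the actual list from the repo, and `convert_emojis` is a hypothetical helper name.

```python
import re

# Tiny illustrative mapping; a real dictionary would cover far more emojis
EMOJI_TO_WORD = {
    "🔥": "fire",
    "😂": "face_with_tears_of_joy",
}

# One alternation pattern over all known emojis
EMOJI_PATTERN = re.compile("|".join(re.escape(e) for e in EMOJI_TO_WORD))

def convert_emojis(text):
    # Wrap each name in colons so it stays recognizable as a converted emoji
    return EMOJI_PATTERN.sub(lambda m: ":" + EMOJI_TO_WORD[m.group()] + ":", text)

print(convert_emojis("Python is 🔥😂"))
```

Emojis that are missing from the dictionary are left unchanged, so coverage depends entirely on the mapping used.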

### 8. Removal of Frequent Words
In the previous preprocessing step, we removed stopwords based on language information. But if we have a domain-specific corpus, we might also have some frequent words that are of little importance to us.

So this step is to remove the frequent words in the given corpus. If we use something like TF-IDF, this is largely taken care of automatically, because words that appear in most documents receive a low weight.


In [None]:
from collections import Counter
cnt = Counter()

In [None]:
for text in df_text['text_stop'].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(20)
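
Once the counts are available, the most frequent words can be filtered out. The sketch below shows one way to do this on a made-up mini corpus that stands in for the `text_stop` column. The cutoff of two words and the names `FREQWORDS` and `remove_freqwords` are assumptions chosen for this example.

```python
from collections import Counter

# Made-up mini corpus standing in for df_text['text_stop']
corpus = [
    "support team please help",
    "please help account locked",
    "please help reset password",
]

cnt = Counter(word for text in corpus for word in text.split())

# Treat the top-N words as "frequent"; N = 2 is arbitrary for this toy corpus
FREQWORDS = {word for word, _ in cnt.most_common(2)}

def remove_freqwords(text):
    return " ".join(w for w in text.split() if w not in FREQWORDS)

cleaned = [remove_freqwords(t) for t in corpus]
print(FREQWORDS)
print(cleaned)
```

On the real data the same function could be applied to the column with `apply`, using a larger cutoff such as the top 10 or 20 words.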

### 9. Stemming
Stemming is a text preprocessing technique used in Natural Language Processing (NLP) that reduces words to their root form or stem. The goal is to strip affixes (mostly suffixes) so that related words are brought down to a common base form, which is useful in various NLP tasks like text classification, sentiment analysis, and information retrieval.
* "running", "runs" and "run" all stem to the base form "run".
* Stems are not always valid words: the Porter stemmer reduces "studies" to "studi".
* Irregular forms are not connected: "ran" is not reduced to "run", and "better" is not reduced to "good". That requires lemmatization.
  
By reducing words to their stems, stemming can help models recognize different forms of a word as the same entity, improving efficiency and effectiveness in downstream tasks.

In [None]:
from nltk.stem.porter import PorterStemmer
df_text.drop(['text_stop','Punc_text'],axis=1,inplace=True)

In [None]:
stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

In [None]:
df_text['stemmed_text'] = df_text['text'].apply(lambda text: stem_words(text))
df_text.head()
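
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer. It is not the Porter algorithm, which applies a much larger set of ordered rules. The suffix list, the minimum stem length and the name `naive_stem` are all assumptions made for illustration.

```python
# Hand-picked suffixes, checked in this order (illustration only)
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    for suffix in SUFFIXES:
        # Only strip when enough of the word remains to be a plausible stem
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walked", "walks", "studies", "sing"]:
    print(w, "->", naive_stem(w))
```

Even this crude version shows the typical behavior: related forms collapse to one stem, and some stems (such as the one produced for "studies") are not valid words.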

### 10. Lemmatization
Lemmatization is a text preprocessing technique in Natural Language Processing (NLP) that involves converting words into their base or root form, called a lemma. Unlike stemming, which may reduce words to an arbitrary root (sometimes resulting in non-existent words), lemmatization ensures that the base form of the word is a valid word in the language. Lemmatization also takes into account the context and part of speech (POS) of the word to produce the correct base form.

* "running" becomes "run"
* "better" becomes "good" (when it is used as an adjective)
* "cats" becomes "cat"

In [None]:
import spacy
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def lemmatize_words(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])



In [None]:
df_text["text_lemmatized"] = df_text["text"].apply(lambda text: lemmatize_words(text))
df_text.head()

In [None]:
# Show the lemma of each single-word example
[nlp(word)[0].lemma_ for word in ("running", "better", "walking")]

### 11. Chat Words Conversion
This is an important text preprocessing step if we are dealing with chat data. People use a lot of abbreviated words in chat, so it might be helpful to expand those words for our analysis.

We got a good list of chat slang words from this repo and can use it for the conversion here. More words can be added to this list as needed.

In [None]:
chat_words = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What It\'s Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laugher'
}

In [None]:
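def chat_conversion(text):
    new_text = []
    for w in text.split():
        # Strip surrounding punctuation so that "AFAIK," still matches "AFAIK"
        core = w.strip(string.punctuation)
        if core and core.upper() in chat_words:
            new_text.append(w.replace(core, chat_words[core.upper()]))
        else:
            new_text.append(w)

    return ' '.join(new_text)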

In [None]:
text = "AFAIK, I will be AFK for a while"
converted_text = chat_conversion(text)
print(converted_text)

### 12. Spelling Correction
Another important text preprocessing step is spelling correction. Typos are common in text data, and we might want to correct them before we do our analysis.

In [None]:
from textblob import TextBlob

incorrect_text = 'ceertain connditions during swval genrtations are moddified in same mannerr'
textblb = TextBlob(incorrect_text)
textblb.correct().string
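
To give an intuition for how a frequency-based spelling corrector can work, here is a minimal sketch in the spirit of Peter Norvig's well-known approach. It is not TextBlob's actual implementation. The word counts are a toy assumption, and only candidates one edit away are considered.

```python
from collections import Counter
import string

# Toy word frequencies; a real corrector would count words in a large corpus
WORD_COUNTS = Counter({"certain": 5, "conditions": 4, "modified": 3, "manner": 2, "same": 6})

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    # Pick the most frequent known candidate; leave the word alone if none exists
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

sentence = "ceertain connditions moddified in same mannerr"
print(" ".join(correct(w) for w in sentence.split()))
```

Words that are neither known nor within one edit of a known word are returned unchanged, which is why the quality of such a corrector depends heavily on the vocabulary it was built from.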