In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1 style='text-align:center;'><b>Basic Text-Preprocessing in NLP</b></h1>

<h5>1. Lowercasing</h5>
<h5>2. Removing HTML tags</h5>
<h5>3. Remove URLs</h5>
<h5>4. Remove Punctuation</h5>
<h5>5. Chat word treatment</h5>
<h5>6. Spelling Correction</h5>
<h5>7. StopWords removal</h5>
<h5>8. Handling Emojis</h5>

## 1) Lowercasing:

1. Lowercasing ensures that 'apple' and 'Apple' are treated as the same word, improving text normalization and reducing vocabulary size.

2. By applying lowercasing, NLP models can capture word semantics more effectively and achieve better performance in tasks like text classification and sentiment analysis.

    However, lowercasing is not always appropriate; it depends on the NLP task and domain. In named entity recognition or part-of-speech tagging, for example, capitalization can be crucial for accurate identification. Lowercasing also needs care in languages where case carries linguistic information or maps irregularly (e.g., German, which capitalizes all nouns, or Turkish, which distinguishes dotted and dotless "i").

In [3]:
def lower_case(text):
    return text.lower()

In [5]:
text = "Hi! What's going on."
lower_case(text)

"hi! what's going on."

### for dataset - example
df['review'] = df['review'].str.lower()
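
The caveat about language-specific casing can be illustrated with Python's built-in string methods. This is a minimal sketch, not part of the original notebook: `normalize_case` and its `aggressive` flag are hypothetical names, and `str.casefold()` is shown as one possible option when the goal is caseless matching rather than display.

```python
def normalize_case(text, aggressive=False):
    """Lowercase text; with aggressive=True use casefold() for caseless matching."""
    # casefold() applies extra mappings that lower() does not, e.g. German "ß" -> "ss"
    return text.casefold() if aggressive else text.lower()


print(normalize_case("Apple"))                    # apple
print(normalize_case("Straße"))                   # straße
print(normalize_case("Straße", aggressive=True))  # strasse
```

Whether the extra folding helps depends on the task; for most English-only datasets `lower()` and `casefold()` give the same result.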

## 2) Remove HTML tags:

1. "Removing HTML tags is essential in NLP as it helps extract clean and meaningful text from web pages, ensuring accurate analysis and understanding of the content."

2. "Eliminating HTML tags in NLP preprocessing allows models to focus solely on the textual information, improving the quality of language processing tasks such as sentiment analysis, information extraction, and text classification."

    * Here we use a regex to eliminate HTML tags.

In [6]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

In [7]:
text = 'i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. the plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). while some may be disappointed when they realize this is not match point 2: risk addiction, i thought it was proof that woody allen is still fully in control of the style many of us have grown to love.<br /><br />this was the most i\'d laughed at one of woody\'s comedies in years (dare i say a decade?). while i\'ve never been impressed with scarlet johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />this may not be the crown jewel of his career, but it was wittier than "devil wears prada" and more interesting than "superman" a great comedy to go see with friends.'
remove_html_tags(text)

'i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. the plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). while some may be disappointed when they realize this is not match point 2: risk addiction, i thought it was proof that woody allen is still fully in control of the style many of us have grown to love.this was the most i\'d laughed at one of woody\'s comedies in years (dare i say a decade?). while i\'ve never been impressed with scarlet johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.this may not be the crown jewel of his career, but it was wittier than "devil wears prada" and more interesting than "superman" a great comedy to go see with friends.'

### for dataset - use apply()
df['review'] = df['review'].apply(remove_html_tags)
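
In the output above, deleting `<br />` outright glues neighbouring sentences together ("love.this"). One possible variation, sketched here under the assumption that tags act as separators, replaces each tag with a space and also decodes HTML entities with the standard-library `html` module. `strip_html` is a hypothetical name, and a regex like this is not a full HTML parser.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')


def strip_html(text):
    """Replace tags with a space, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(' ', text)        # a space keeps adjacent words apart
    decoded = html.unescape(no_tags)       # e.g. "&amp;" -> "&"
    return re.sub(r'\s+', ' ', decoded).strip()


sample = 'grown to love.<br /><br />this was fun &amp; witty'
print(strip_html(sample))  # grown to love. this was fun & witty
```

For heavily nested or malformed markup, a dedicated HTML parser is usually a safer choice than a regex.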

## 3) Remove URLs

1. `Noise Reduction`: URLs often contain machine-generated codes, special characters, or irrelevant information that can introduce noise into the text data. Removing URLs helps to clean the text and eliminate unwanted distractions, allowing NLP models to focus on the relevant content.

2. `Data Consistency`: URLs are unique identifiers for web pages and can vary significantly in their structure and length. By removing URLs, the text data is made more consistent, ensuring that different instances of the same URL do not create unnecessary duplication or affect the analysis.

3. `Bias Mitigation`: URLs often contain domain names or specific web sources that may introduce biases in the analysis. Removing URLs helps to mitigate such biases and ensures that the NLP model focuses solely on the textual content, leading to more objective and unbiased results.

4. `Improved Generalization`: NLP models trained on specific web data or social media text may encounter a large number of URLs. Removing URLs allows the models to generalize better by not relying on specific web sources or references during analysis, making them more applicable to different domains and datasets.

5. `Privacy Protection`: URLs may contain sensitive or personally identifiable information. By removing URLs, privacy concerns are addressed, ensuring that confidential or private information is not inadvertently exposed during NLP processing or analysis.

Overall, removing URLs as part of the NLP preprocessing pipeline helps to enhance data cleanliness, consistency, bias mitigation, generalization, and privacy protection, leading to more accurate and reliable results in various NLP tasks.

#### `We use regex to remove URLs`

In [8]:
text1 = 'Google link is www.google.com'
text2 = 'regular expressions practice www.regex101.com'

In [9]:
def remove_url(text):
    # Match http(s):// or www. prefixes; the dot is escaped so it matches literally
    pattern = re.compile(r'(?:https?://|www\.)\S+')
    return pattern.sub(r'', text)

In [10]:
remove_url(text1)

'Google link is '

In [11]:
remove_url(text2)

'regular expressions practice '

### for dataset - use apply()
df['review'] = df['review'].apply(remove_url)
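
If the fact that a link was present is itself useful, an alternative to deletion is to map every URL to a single placeholder token, which keeps the text consistent without keeping the individual addresses. The sketch below is an assumption-laden illustration: `replace_url` and the `<URL>` token are hypothetical choices, and the pattern will also swallow punctuation that directly follows a link.

```python
import re

# Matches http://, https:// or www. followed by any run of non-space characters
URL_RE = re.compile(r'(?:https?://|www\.)\S+', flags=re.IGNORECASE)


def replace_url(text, token='<URL>'):
    """Replace every URL-like substring with a placeholder token."""
    return URL_RE.sub(token, text)


print(replace_url('Google link is www.google.com'))               # Google link is <URL>
print(replace_url('docs at https://example.com/a?b=1 and more'))  # docs at <URL> and more
```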

## 4) Remove Punctuation

By removing punctuation, NLP practitioners can simplify text, enhance tokenization, reduce noise, and improve the accuracy and efficiency of various language processing tasks.

In [1]:
import string
result = string.punctuation

result

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [2]:
sentence = "Hey, Tony !, How are you?"

In [None]:
def remove_punctuation(text):
    new_str = ''
    for w in text:
        if w in string.punctuation:
            continue 
        else:
            new_str+=w
    
    return new_str

In [16]:
remove_punctuation(sentence)

'Hey Tony  How are you'
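
The character-by-character loop works, but it builds a new string on every step. A commonly used alternative is `str.translate` with a deletion table, which does the same work in a single pass. This is a sketch; `remove_punctuation_fast` is a hypothetical name.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation_fast(text):
    """Delete ASCII punctuation in one pass; whitespace is left untouched."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation_fast("Hey, Tony !, How are you?"))  # Hey Tony  How are you
```

On a DataFrame column this can be applied in the same way as the other helpers, e.g. with `apply()`.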

## 5) Chat-Word Treatment

In [2]:
chat_word = {
  "AFAIK": "As Far As I Know",
  "AFK": "Away From Keyboard",
  "ASAP": "As Soon As Possible",
  "ATK": "At The Keyboard",
  "ATM": "At The Moment",
  "A3": "Anytime, Anywhere, Anyplace",
  "BAK": "Back At Keyboard",
  "BBL": "Be Back Later",
  "BBS": "Be Back Soon",
  "BFN": "Bye For Now",
  "B4N": "Bye For Now",
  "BRB": "Be Right Back",
  "BRT": "Be Right There",
  "BTW": "By The Way",
  "B4": "Before",
  "CU": "See You",
  "CUL8R": "See You Later",
  "CYA": "See You",
  "FAQ": "Frequently Asked Questions",
  "FC": "Fingers Crossed",
  "FWIW": "For What It's Worth",
  "FYI": "For Your Information",
  "GAL": "Get A Life",
  "GG": "Good Game",
  "GN": "Good Night",
  "GMTA": "Great Minds Think Alike",
  "GR8": "Great!",
  "G9": "Genius",
  "IC": "I See",
  "ICQ": "I Seek you (also a chat program)",
  "ILU": "I Love You",
  "IMHO": "In My Honest/Humble Opinion",
  "IMO": "In My Opinion",
  "IOW": "In Other Words",
  "IRL": "In Real Life",
  "KISS": "Keep It Simple, Stupid",
  "LDR": "Long Distance Relationship",
  "LMAO": "Laugh My A.. Off",
  "LOL": "Laughing Out Loud",
  "LTNS": "Long Time No See",
  "L8R": "Later",
  "MTE": "My Thoughts Exactly",
  "M8": "Mate",
  "NRN": "No Reply Necessary",
  "OIC": "Oh I See",
  "PITA": "Pain In The A..",
  "PRT": "Party",
  "PRW": "Parents Are Watching",
  "QPSA": "Que Pasa?",
  "ROFL": "Rolling On The Floor Laughing",
  "ROFLOL": "Rolling On The Floor Laughing Out Loud",
  "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
  "SK8": "Skate",
  "STATS": "Your sex and age",
  "ASL": "Age, Sex, Location",
  "THX": "Thank You",
  "TTFN": "Ta-Ta For Now!",
  "TTYL": "Talk To You Later",
  "U": "You",
  "U2": "You Too",
  "U4E": "Yours For Ever",
  "WB": "Welcome Back",
  "WTF": "What The F...",
  "WTG": "Way To Go!",
  "WUF": "Where Are You From?",
  "W8": "Wait...",
  "7K": "Sick:-D Laughter"
}

In [3]:
def remove_chatword(text):
    new_list = []
    
    for word in text.split():
        if word.upper() in chat_word:
            new_list.append(chat_word[word.upper()])
        else:
            new_list.append(word)
            
    return ' '.join(new_list)  # join with spaces so expanded words stay separated

In [4]:
remove_chatword('WB Jack!')

'Welcome Back Jack!'
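
Splitting on whitespace misses abbreviations that have punctuation attached, such as "brb," or "LOL!". One way around this, sketched below, is to substitute on word tokens with a regex so the surrounding punctuation is preserved. `CHAT_WORDS` is a hypothetical three-entry subset of the dictionary above, and `expand_chat_words` is an illustrative name.

```python
import re

# Hypothetical subset of the chat_word dictionary, kept small for the example
CHAT_WORDS = {
    "BRB": "Be Right Back",
    "IMO": "In My Opinion",
    "WB": "Welcome Back",
}

TOKEN_RE = re.compile(r'\b\w+\b')


def expand_chat_words(text, mapping=CHAT_WORDS):
    """Expand known abbreviations token by token, leaving punctuation in place."""
    def _swap(match):
        word = match.group(0)
        return mapping.get(word.upper(), word)  # unknown words pass through unchanged
    return TOKEN_RE.sub(_swap, text)


print(expand_chat_words('brb, IMO this is fine. WB Jack!'))
# Be Right Back, In My Opinion this is fine. Welcome Back Jack!
```

Because the lookup is case-insensitive, short entries such as "U" or "ATM" can over-expand ordinary text, so the dictionary may need pruning for a given dataset.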

## 6) Spelling Correction

Spelling correction is an essential step in natural language processing (NLP) for several reasons:

1. `Improved Text Understanding`: Correcting spelling mistakes helps improve the overall understanding of the text. Spelling errors can lead to ambiguity, misinterpretation, or difficulty in comprehending the intended meaning. By applying spelling correction, the text becomes more coherent and easier to process.

2. `Better Information Retrieval`: Spelling errors in search queries or document collections can lead to poor search results. Correcting spelling mistakes in queries helps retrieve more relevant documents, enhancing the effectiveness of information retrieval systems. Similarly, spell-corrected document text can improve the accuracy of search engines in matching user queries with relevant content.

3. `Language Modeling`: Spelling correction contributes to language modeling tasks by improving the accuracy of language models. Language models learn from patterns in the training data, including correct spellings. Correcting spelling mistakes helps align the input data with the expected patterns, leading to more accurate language generation or prediction.

4. `Text Preprocessing`: Spelling correction is often performed as part of text preprocessing to ensure consistent and standardized text data. This step is particularly useful when working with user-generated content, social media data, or data from sources where spelling errors are common. Correcting spelling mistakes helps normalize the text and ensure consistent representations.

5. `Data Cleaning and Quality`: Spelling mistakes can be viewed as noise or inaccuracies in the text data. Cleaning up these errors improves the overall quality and reliability of the data. It helps to eliminate inconsistencies, enhances data integrity, and ensures that downstream NLP tasks are based on accurate and reliable input.

6. `Text Analysis and Sentiment Analysis`: Spelling correction is vital in text analysis tasks, such as sentiment analysis or opinion mining. Accurate sentiment analysis relies on correctly understanding the sentiment-bearing words in the text. Correcting spelling errors ensures that sentiment analysis models capture the intended sentiment more accurately.

Overall, spelling correction is crucial in NLP to enhance text understanding, improve search and retrieval, facilitate language modeling, ensure data quality, and enable accurate analysis of textual data. It helps align the text with expected linguistic patterns, leading to more accurate and meaningful results in various NLP applications.

In [5]:
from textblob import TextBlob

In [6]:
incorrect_text = 'ceertain conditions during seveal qnenerations aree modified in the saame maner.'

correct_text = TextBlob(incorrect_text)

correct_text.correct().string

'certain conditions during several generations are modified in the same manner.'
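
To show the basic idea behind dictionary-based correction without any third-party package, the sketch below uses the standard-library `difflib` to pick the closest word from a vocabulary. This is a swapped-in technique for illustration only, not how TextBlob works internally; `VOCAB` is a hypothetical toy word list, and `correct_word`, `correct_sentence`, and the `cutoff` value are assumptions.

```python
import difflib

# Hypothetical toy vocabulary; a real corrector would use a large word list
VOCAB = ["certain", "conditions", "during", "several", "generations",
         "are", "modified", "in", "the", "same", "manner"]


def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_sentence(text):
    """Correct each whitespace-separated token independently."""
    return ' '.join(correct_word(w) for w in text.split())


print(correct_sentence('seveal conditions are modified in the saame maner'))
# several conditions are modified in the same manner
```

This ignores context and word frequency, so it is only a starting point compared with a purpose-built spell checker.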

## 7) StopWords Removal

Stopwords are commonly used words in a language that often do not carry significant meaning or contribute much to the overall understanding of a text. Here are some reasons why removing stopwords is beneficial in natural language processing (NLP):

1. `Noise Reduction`: Stopwords can be considered noise in text data. They are frequently occurring words that appear in almost every document and do not convey specific information about the content. Removing stopwords helps to reduce the noise in the text and allows the focus to be on the more meaningful words.

2. `Memory and Storage Efficiency`: Stopwords are typically high-frequency words in a language. Removing them reduces the overall size of the text data, which can be advantageous in terms of memory usage and storage requirements, particularly when dealing with large corpora or datasets.

3. `Speed and Performance`: Removing stopwords can lead to faster processing and improved performance in NLP tasks. Since stopwords are frequently encountered, their removal reduces the number of tokens that need to be processed, resulting in faster execution of tasks like tokenization, parsing, or analysis.

4. `Improved Feature Extraction`: In many NLP tasks, such as text classification, sentiment analysis, or topic modeling, stopwords do not contribute much to the overall meaning or context of the text. By removing stopwords, the more meaningful words and phrases can stand out and be better utilized for feature extraction, resulting in more relevant and informative features.

5. `Enhanced Interpretability`: Removing stopwords can improve the interpretability of NLP models. When stopwords are present, they can dominate the feature space and affect the model's interpretation or the importance assigned to other more meaningful words. Removing stopwords allows the model to focus on the words that carry greater semantic value.

6. `Better Generalization`: Stopwords are commonly used across different domains and texts. By removing them, the resulting text data can be more generalized and applicable to a wider range of NLP tasks and analysis. This helps to avoid overfitting to domain-specific or language-specific stopwords.

It's worth noting that the decision to remove stopwords depends on the specific NLP task, dataset, and the context in which the analysis is performed. In some cases, retaining stopwords may be important, especially when analyzing specific linguistic patterns, understanding document structure, or preserving certain language characteristics.

In [7]:
from nltk.corpus import stopwords

In [8]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [11]:
def remove_stopword(text):
    # Build the stopword set once; set lookups avoid reloading the list for every word
    stop_words = set(stopwords.words('english'))
    # Compare in lowercase so capitalized stopwords such as "This" are also dropped
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)

In [12]:
sentence = "This is an example sentence demonstrating stopword removal."

remove_stopword(sentence)

'example sentence demonstrating stopword removal.'
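
As noted above, removing stopwords is not always desirable. Negations such as "not" and "no" appear in the NLTK English list, and dropping them can flip the meaning of a review. The sketch below shows one way to keep selected words; `STOPWORDS` is a hypothetical mini list so the example runs without NLTK data, and `KEEP` and the function name are illustrative.

```python
# Hypothetical mini stopword list so the sketch runs without downloading NLTK data
STOPWORDS = {"this", "is", "an", "a", "the", "not", "no", "was", "it"}
KEEP = {"not", "no"}  # negations that often matter for sentiment


def remove_stopwords_keep_negation(text, stop=STOPWORDS, keep=KEEP):
    """Drop stopwords case-insensitively, except those explicitly kept."""
    active = stop - keep
    return ' '.join(w for w in text.split() if w.lower() not in active)


print(remove_stopwords_keep_negation("This movie was not good"))  # movie not good
```

With the full NLTK list, the same pattern would be `set(stopwords.words('english')) - KEEP`.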

## 8) Handling Emojis

While emojis can provide valuable insights and context in certain NLP tasks, there are situations where removing emojis from text data is beneficial. Here are some reasons why removing emojis can be useful in NLP:

1. `Simplifying Text`: Emojis are visual elements that can add complexity to the text data. In some cases, they might not carry significant information or contribute to the specific NLP task at hand. Removing emojis simplifies the text, making it easier to process and analyze.

2. `Noise Reduction`: In certain NLP tasks, emojis can be considered noise or irrelevant information. For example, in tasks like text classification or topic modeling, emojis might not contribute meaningfully to the classification or topic identification. Removing emojis helps to reduce noise and focus on the essential textual content.

3. `Tokenization and Feature Extraction`: Emojis can pose challenges during tokenization and feature extraction processes. Emojis are represented as Unicode characters, and their inclusion in text can lead to increased vocabulary size and tokenization complexity. By removing emojis, tokenization becomes more straightforward, and feature extraction can focus on textual information.

4. `Language Model Training`: In some cases, emojis may not align with the training objectives of language models. Language models typically learn patterns and relationships based on textual information rather than visual elements like emojis. Removing emojis ensures that the language model focuses solely on the textual content and learns relevant patterns.

5. `Language Agnostic Analysis`: Emojis may not be universally understood across different languages and cultures. If the NLP task requires language-agnostic analysis, removing emojis can help in achieving language independence and avoid potential issues with emoji interpretations or misinterpretations.

6. `Text Normalization`: In text normalization processes, such as stemming or lemmatization, emojis may not have standard linguistic representations. Removing emojis before normalization ensures that the text is consistently processed based on linguistic rules and patterns.

It's important to note that the decision to remove or retain emojis depends on the specific NLP task, dataset, and the importance of emojis for the analysis. In some cases, emojis can provide valuable information and context, and it may be necessary to retain and handle them appropriately.

In [13]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [14]:
remove_emoji('hello 🙌🙌')

'hello '
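
The four ranges above do not cover every emoji. For instance, U+1F923 (rolling on the floor laughing) and U+2728 (sparkles) fall outside them. A possible extension, sketched below, adds three more Unicode blocks; it is still not exhaustive, and `remove_emoji_extended` is a hypothetical name.

```python
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)


def remove_emoji_extended(text):
    """Remove characters from the listed emoji-related Unicode blocks."""
    return EMOJI_RE.sub('', text)


# \U0001F923 and \u2728 are both removed; the spaces around them remain
print(repr(remove_emoji_extended('so funny \U0001F923 \u2728')))  # 'so funny  '
```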

In [15]:
#replace emoji with its meaning
import emoji

def replace_emoji(text):
    # Return the converted text so the function can also be used with apply()
    return emoji.demojize(text)

print(replace_emoji('Python is 🔥'))

Python is :fire:


In [16]:
print(replace_emoji('This movie is 😂'))

This movie is :face_with_tears_of_joy:
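
The steps covered in this notebook are order-sensitive: HTML tags and URLs have to be handled before punctuation is stripped, otherwise the characters that identify them are already gone. The sketch below chains a few of the steps using only the standard library; `preprocess` is a hypothetical helper, and the exact order and choice of steps are assumptions that depend on the task.

```python
import re
import string

PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def preprocess(text):
    """Lowercase, drop HTML tags and URLs, strip punctuation, tidy whitespace."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)                 # tags first, while < and > still exist
    text = re.sub(r'(?:https?://|www\.)\S+', ' ', text)  # URLs before punctuation removal
    text = text.translate(PUNCT_TABLE)
    return ' '.join(text.split())                        # collapse repeated spaces


print(preprocess('Great movie!<br />See www.example.com for MORE.'))
# great movie see for more
```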
