# Text Preprocessing

This notebook walks through a set of functions that clean text data and prepare it for model training.

We will be going through the following Text Cleaning Pipeline:

- Text to Lowercase
- Clean HTML Text
- Remove URLs
- Split on Numbers
- Expand Contractions
- Remove Punctuation
- Remove Stopwords
- Remove Numbers
- Shrink Extra Spaces

__Libraries Used:__

1. [Pandas](https://pandas.pydata.org/)
2. [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/#)
3. [String](https://docs.python.org/3/library/string.html)

__Let's get started!__

First, let's install the libraries and import them.

In [None]:
%pip install -q pandas
%pip install -q beautifulsoup4

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import string

### Create the Dataset

In [None]:
text = """Hey Amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX IT! \n <html> Amazon2022 © <br/> <br /> </html>"""

In [None]:
df = pd.DataFrame([text], columns=["text"])

In [None]:
df.iat[0,0]

### Text to Lowercase

In [6]:
df["text"] = df["text"].str.lower()

In [7]:
df.iat[0,0]

"hey amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix it! \n <html> amazon2022 © <br/> <br /> </html>"

### Clean HTML Text

[Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/#) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [8]:
# Cleaning HTML Characters
df["text"] = df["text"].apply(
    lambda x: BeautifulSoup(x, "html.parser").text
)

In [9]:
df.iat[0,0]

"hey amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix it! \n  amazon2022 ©   "

### Remove URLs

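Regular expressions are patterns used to match character combinations in strings. The pattern in the next cell matches `http://` or `https://` followed by a run of non-whitespace characters, so each URL is removed up to the next space.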

[Reference](https://en.wikipedia.org/wiki/Regular_expression)
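
As a quick check outside pandas, the sketch below applies the same pattern with Python's built-in `re` module. The sample sentence and the `URL_PATTERN` name are illustrative assumptions, not part of the notebook's data.

```python
import re

# Same pattern as the notebook cell: "http" or "https", then "://",
# one arbitrary character, then a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?:\/\/.\S+")

sample = "see https://example.com/orders?id=1 and http://example.org now"

# Replace every match with an empty string.
without_urls = URL_PATTERN.sub("", sample)
print(repr(without_urls))
```

Each removed URL leaves a double space behind; the last step of the pipeline collapses these.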

In [45]:
# Remove URLs
df["text"] = df["text"].str.replace(
    r"https?:\/\/.\S+", "", regex=True
)

In [46]:
df.iat[0,0]

"hey amazon - my package never arrived but it shows it's delivered  please fix it! \n  amazon2022 ©   "

### Split on Numbers

In [12]:
# Split Numbers from Words
df["text"] = df["text"].str.split(
    r"(\d+)", regex=True
)
df["text"] = df["text"].apply(
    " ".join
)
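
The split above relies on the capturing group in `(\d+)`: with a group, the matched digits are kept as separate items instead of being discarded. A minimal sketch with the built-in `re` module, using a made-up sample string, shows the difference.

```python
import re

sample = "amazon2022 order66shipped"

# Without a capturing group, the digit runs are dropped.
dropped = re.split(r"\d+", sample)

# With a capturing group, the digit runs are kept as their own items.
kept = re.split(r"(\d+)", sample)

# Joining with a space separates the numbers from the surrounding words.
spaced = " ".join(kept)
print(dropped, kept, repr(spaced), sep="\n")
```

Joining can introduce double spaces where a space already followed a number; these are cleaned up at the end of the pipeline.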

In [13]:
df.iat[0,0]

"hey amazon - my package never arrived but it shows it's delivered  please fix it! \n  amazon 2022  ©   "

### Expand Contractions


In [14]:
# Expand Contractions: map each contracted suffix to its expanded form
contractions = {
        "'s": " is",
        "n't": " not",
        "'m": " am",
        "'ll": " will",
        "'d": " would",
        "'ve": " have",
        "'re": " are",
    }

In [15]:
for key, value in contractions.items():
    # regex=False: the keys are literal strings, not patterns
    df["text"] = df["text"].str.replace(key, value, regex=False)

In [16]:
df.iat[0,0]

'hey amazon - my package never arrived but it shows it is delivered  please fix it! \n  amazon 2022  ©   '
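
Suffix replacement is simple but has edge cases: `can't` becomes `ca not`, `won't` becomes `wo not`, and possessives such as `amazon's` become `amazon is`. The sketch below shows one possible refinement on a plain string, handling a few irregular forms before the generic suffixes. The `IRREGULAR` table and the `expand_contractions` helper are hypothetical names introduced here, not part of the notebook, and the possessive case is left unresolved.

```python
# Irregular forms that the generic "n't" rule would mangle (assumed list).
IRREGULAR = {
    "won't": "will not",
    "can't": "cannot",
    "shan't": "shall not",
}

# Generic suffix rules, same as the notebook's dictionary.
SUFFIXES = {
    "'s": " is",
    "n't": " not",
    "'m": " am",
    "'ll": " will",
    "'d": " would",
    "'ve": " have",
    "'re": " are",
}


def expand_contractions(text):
    """Expand contractions in lowercased text (illustrative helper)."""
    # Handle whole-word irregular forms first.
    for word, expansion in IRREGULAR.items():
        text = text.replace(word, expansion)
    # Then apply the generic suffix rules.
    for suffix, expansion in SUFFIXES.items():
        text = text.replace(suffix, expansion)
    return text


print(expand_contractions("i can't say it's fine, we won't go"))
```

The helper assumes the text is already lowercased and uses straight apostrophes, as it is at this point in the pipeline.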

### Remove Punctuation

In [17]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
# Extra Punctuation List
extra_punct = [
        ",", ".", '"', ":", ")", "(", "!", "?", "|", ";", "'", "&", 
        "/", "[", "]", ">", "%", "=", "#", "*", "+", "\\", "•", "~", 
        "@", "·", "_", "{", "}", "©", "^", "®", "`", "<", "→", "°", 
        "€", "™", "›", "♥", "←", "×", "§", "″", "′", "Â", "█", "½", 
        "à", "…", "“", "★", "”", "–", "●", "â", "►", "−", "¢", "²", 
        "¬", "░", "¶", "↑", "±", "¿", "▾", "═", "¦", "║", "―", "¥", 
        "▓", "—", "‹", "─", "▒", "：", "¼", "⊕", "▼", "▪", "†", "■", 
        "’", "▀", "¨", "▄", "♫", "☆", "é", "¯", "♦", "¤", "▲", "è", 
        "¸", "¾", "Ã", "⋅", "‘", "∞", "∙", "）", "↓", "、", "│", "（", 
        "»", "，", "♪", "╩", "╚", "³", "・", "╦", "╣", "╔", "╗", "▬", 
        "❤", "ï", "Ø", "¹", "≤", "‡", "√", "«", "»", "´", "º", "¾", 
        "¡", "§",
        ]

punctuation = list(string.punctuation) + extra_punct

In [19]:
for punct in punctuation:
    # regex=False treats characters such as "." or "(" literally
    df["text"] = df["text"].str.replace(punct, "", regex=False)

In [20]:
df.iat[0,0]

'hey amazon  my package never arrived but it shows it is delivered  please fix it \n  amazon 2022     '
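
Because every entry in the punctuation list is a single character, the loop can also be expressed as one pass with a translation table. This is a sketch of that alternative on a plain string, not the notebook's method; `removal_table` is a name chosen here, and `"©"` stands in for the longer `extra_punct` list.

```python
import string

# Build one table that maps every unwanted character to None.
# "©" is a stand-in for the notebook's full extra_punct list.
removal_table = str.maketrans("", "", string.punctuation + "©")

sample = "please fix it! © ok?"

# A single translate call removes all listed characters at once.
cleaned = sample.translate(removal_table)
print(repr(cleaned))
```

pandas exposes the same operation as `Series.str.translate`, so the table could be applied to the whole column in one call.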

### Remove Stopwords

In [33]:
# Stopwords Removal
stopwords = [
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you're", 
    "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", 
    "him", "his", "himself", "she", "she's", "her", "hers", "herself", "it", "it's", 
    "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", 
    "who", "whom", "this", "that", "that'll", "these", "those", "am", "is", "are", 
    "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", 
    "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", 
    "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", 
    "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", 
    "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", 
    "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", 
    "nor", "not", "only", "own", "same", "so", "than", "too", "very", "can", "will", "just", "don", 
    "don't", "should", "should've", "now", "ll", "re", "ve", "ain", "aren", "aren't", "couldn", "couldn't", 
    "didn", "didn't", "doesn", "doesn't", "hadn", "hadn't", "hasn", "hasn't", "haven", "haven't", "isn", 
    "isn't", "ma", "mightn", "mightn't", "mustn", "mustn't", "needn", "needn't", "shan", "shan't", "shouldn", 
    "shouldn't", "wasn", "wasn't", "weren", "weren't", "won", "won't", "wouldn", "wouldn't",
]

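Alternatively, load a similar list from NLTK:

```
import nltk
from nltk.corpus import stopwords as nltk_stopwords

nltk.download("stopwords", quiet=True)  # one-time corpus download
stopwords = nltk_stopwords.words("english")
```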

In [34]:
pattern = r'\b(?:{})\b'.format('|'.join(stopwords))
df['text'] = df['text'].str.replace(pattern, '', regex=True)

In [35]:
df.iat[0,0]

'hey amazon   package never arrived   shows   delivered  please fix  \n  amazon 2022     '
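
The `\b` word boundaries in the pattern matter: without them, a stopword such as `it` would also be cut out of longer words like `item`. The sketch below demonstrates this on a plain string with a made-up three-word stop list. The use of `re.escape` is an added precaution, not something the notebook does, for lists whose entries might contain regex metacharacters.

```python
import re

# Hypothetical short stop list; the notebook's list is much longer.
stop_words = ["it", "my", "but"]

# Escape each word, then join into one alternation wrapped in word boundaries.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, stop_words)))

sample = "my item never arrived but it shows"

# Whole-word matches are removed; "item" is left intact.
filtered = re.sub(pattern, "", sample)
print(repr(filtered))
```

Removed words leave their surrounding spaces behind, which is why the output contains runs of spaces until the final cleanup step.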

### Remove Numbers

In [36]:
# Remove Numbers
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)

In [37]:
df.iat[0,0]

'hey amazon   package never arrived   shows   delivered  please fix  \n  amazon      '

### Shrink Extra Spaces

In [38]:
# Shrink Extra Spaces
df["text"] = df["text"].str.replace(r"\s+", " ", regex=True).str.strip()

In [39]:
df.iat[0,0]

'hey amazon package never arrived shows delivered please fix amazon'
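
For reference, the same cleanup on a plain string can be written in two equivalent ways. This is a minimal sketch with a made-up sample; both variable names are chosen here for illustration.

```python
import re

sample = "  hey amazon   package \n  amazon      "

# Regex approach: collapse every whitespace run to one space, then trim.
collapsed = re.sub(r"\s+", " ", sample).strip()

# Split/join approach: split() with no arguments already drops
# leading, trailing, and repeated whitespace.
joined = " ".join(sample.split())

print(repr(collapsed), repr(joined))
```

Both treat newlines and tabs as whitespace, so the `\n` left over from the original text disappears as well.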

Now that the text preprocessing pipeline is complete, let's compare the starting text with the final preprocessed text.

In [40]:
print(f"{text}\n\n\n{df.iat[0,0]}") 

Hey Amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX IT! 
 <html> Amazon2022 © <br/> <br /> </html>


hey amazon package never arrived shows delivered please fix amazon


### Text Preprocessing Function

In [41]:
def text_preprocessing(df):
    """
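    Clean the "text" column of a DataFrame for model training.

    The column is overwritten, so the input DataFrame is modified in place.

    Args:
        df (pd.DataFrame): Input DataFrame with a "text" column of strings.

    Returns:
        pd.DataFrame: The same DataFrame, with "text" cleaned.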
    """
    
    # Convert to Lowercase
    df["text"] = df["text"].str.lower()
    # Cleaning HTML Characters
    df["text"] = df["text"].apply(
        lambda x: BeautifulSoup(x, "html.parser").text
    )
    # Remove URLs
    df["text"] = df["text"].str.replace(
        r"https?:\/\/.\S+", "", regex=True
    )
    # Split Numbers from Words
    df["text"] = df["text"].str.split(r"(\d+)", regex=True)
    df["text"] = df["text"].apply(" ".join)
    # Expand Contractions
    contractions = {
        "'s": " is",
        "n't": " not",
        "'m": " am",
        "'ll": " will",
        "'d": " would",
        "'ve": " have",
        "'re": " are",
    }
    for key, value in contractions.items():
        df["text"] = df["text"].str.replace(key, value, regex=False)
    # Extra Punctuation List
    extra_punct = [
        ",", ".", '"', ":", ")", "(", "!", "?", "|", ";", "'", "&", 
        "/", "[", "]", ">", "%", "=", "#", "*", "+", "\\", "•", "~", 
        "@", "·", "_", "{", "}", "©", "^", "®", "`", "<", "→", "°", 
        "€", "™", "›", "♥", "←", "×", "§", "″", "′", "Â", "█", "½", 
        "à", "…", "“", "★", "”", "–", "●", "â", "►", "−", "¢", "²", 
        "¬", "░", "¶", "↑", "±", "¿", "▾", "═", "¦", "║", "―", "¥", 
        "▓", "—", "‹", "─", "▒", "：", "¼", "⊕", "▼", "▪", "†", "■", 
        "’", "▀", "¨", "▄", "♫", "☆", "é", "¯", "♦", "¤", "▲", "è", 
        "¸", "¾", "Ã", "⋅", "‘", "∞", "∙", "）", "↓", "、", "│", "（", 
        "»", "，", "♪", "╩", "╚", "³", "・", "╦", "╣", "╔", "╗", "▬", 
        "❤", "ï", "Ø", "¹", "≤", "‡", "√", "«", "»", "´", "º", "¾", 
        "¡", "§",
    ]
    punctuation = list(string.punctuation) + extra_punct
    for punct in punctuation:
        df["text"] = df["text"].str.replace(punct, "", regex=False)
    # Stopwords Removal
    stopwords = [
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you're", 
        "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", 
        "him", "his", "himself", "she", "she's", "her", "hers", "herself", "it", "it's", 
        "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", 
        "who", "whom", "this", "that", "that'll", "these", "those", "am", "is", "are", 
        "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", 
        "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", 
        "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", 
        "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", 
        "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", 
        "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", 
        "nor", "not", "only", "own", "same", "so", "than", "too", "very", "can", "will", "just", "don", 
        "don't", "should", "should've", "now", "ll", "re", "ve", "ain", "aren", "aren't", "couldn", "couldn't", 
        "didn", "didn't", "doesn", "doesn't", "hadn", "hadn't", "hasn", "hasn't", "haven", "haven't", "isn", 
        "isn't", "ma", "mightn", "mightn't", "mustn", "mustn't", "needn", "needn't", "shan", "shan't", "shouldn", 
        "shouldn't", "wasn", "wasn't", "weren", "weren't", "won", "won't", "wouldn", "wouldn't",
    ]
    pattern = r'\b(?:{})\b'.format('|'.join(stopwords))
    df['text'] = df['text'].str.replace(pattern, '', regex=True)
    # Remove Numbers
    df['text'] = df['text'].str.replace(r'\d+', '', regex=True)
    # Shrink Extra Spaces
    df["text"] = df["text"].str.replace(r"\s+", " ", regex=True).str.strip()

    return df

In [42]:
df = pd.DataFrame([text], columns=["text"])

In [43]:
text

"Hey Amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX IT! \n <html> Amazon2022 © <br/> <br /> </html>"

In [44]:
text_preprocessing(df).iat[0, 0]

'hey amazon package never arrived shows delivered please fix amazon'