# <font color='red'><b>Text Preprocessing</b></font>

*  <b>In any machine learning task, preprocessing plays a key role. It is as important as model building.
*  In this notebook, we will create a class with all the preprocessing functions.</b>

<br>

<font color='blue'><b>Import packages</b></font>

In [0]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string

In [22]:
nltk.download('stopwords') # download stopwords from nltk
from nltk.corpus import stopwords
from string import digits
from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer
!pip install pyspellchecker
from spellchecker import SpellChecker

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<font color='blue'><b>Creating a class with all the preprocessing steps</b></font>

In [0]:
class TextPreprocess():
  ''' With this class, we can do lower casing, removal of punctuations, removal of stopwords,  removal of digits, removal of html tags, 
      removal of emoji's, removal of url's, stemming, spelling correction'''
  def __init__(self):
      pass
  def clean_raw_text(self,text,remove_html=False,lower_case = False,remove_punctuation = False,remove_stopwords=False,remove_digits=False,remove_emoji=False,remove_urls=False,stemming=False,spell_correction=False):
    r'''
    text --> text data to be cleaned
    if remove_html is True, it will remove the html tags and then return the text
    if lower_case is True, it will convert the text to lower case
    if remove_punctuation is True, it will remove the punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)
    if remove_stopwords is True, it will remove all the stopwords in our text
    if remove_digits is True, it will remove all the digits in our text
    if remove_emoji is True, it will remove the emojis in our text
    if remove_urls is True, it will remove the urls in our text
    if stemming is True, it will do stemming
    if spell_correction is True, it will correct our spellings
    (the docstring is a raw string so that the backslash in the punctuation list is not read as an escape)
    '''
    if remove_html:
      text = BeautifulSoup(text, "lxml").text
    if remove_urls:
      # urls are removed before punctuation, otherwise '://' and '.' are stripped
      # first and the pattern below no longer matches
      url_pattern = re.compile(r'https?://\S+|www\.\S+')
      text = url_pattern.sub(r'', text)
    if lower_case:
      text = str(text).lower()
    if remove_punctuation:
      text = text.translate(str.maketrans('', '', string.punctuation))
    if remove_stopwords:
      stop_words = set(stopwords.words('english'))
      text = ' '.join([word for word in str(text).split() if word not in stop_words])
    if remove_digits:
      text = text.translate(str.maketrans('', '', digits))
    if remove_emoji:
      # https://stackoverflow.com/a/49146722/330558
      emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
      text = emoji_pattern.sub(r'', text)
    if stemming:
      stemmer = PorterStemmer()
      text = (' '.join([stemmer.stem(word) for word in str(text).split()]))
    if spell_correction:
      # https://norvig.com/spell-correct.html
      spell = SpellChecker()
      corrected_text = []
      misspelled_words = spell.unknown(text.split())
      for word in text.split():
          if word in misspelled_words:
              # correction() can return None when no candidate is found; keep the original word then
              corrected_text.append(spell.correction(word) or word)
          else:
              corrected_text.append(word)
      text = " ".join(corrected_text) 
    return text
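
One detail of the stopword step is easy to miss: the NLTK English stopword list is lower case, and the comparison in `clean_raw_text` is case-sensitive, so a capitalised stopword such as "This" is kept unless `lower_case=True` is also passed. Below is a minimal sketch of that behaviour. `STOP_WORDS` is a small hand-written stand-in for the NLTK list and `drop_stopwords` is a hypothetical helper, not part of the class.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
STOP_WORDS = {"this", "is", "my", "on", "a", "the"}

def drop_stopwords(text, lower_case=False):
    """Remove stopwords with the same split/join approach used in clean_raw_text."""
    if lower_case:
        text = text.lower()
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(drop_stopwords("This is my notebook"))                   # This notebook
print(drop_stopwords("This is my notebook", lower_case=True))  # notebook
```

In practice this suggests passing `lower_case=True` together with `remove_stopwords=True`.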

<font color='blue'><b>Creating an instance of the class</b></font>

In [0]:
preprocess=TextPreprocess()

In [25]:
print(preprocess.__doc__) # .__doc__ will print the documentation

 With this class, we can do lower casing, removal of punctuations, removal of stopwords,  removal of digits, removal of html tags, 
      removal of emoji's, removal of url's, stemming, spelling correction


In [26]:
import inspect
print(inspect.signature(preprocess.clean_raw_text)) # this will print the parameters for the function

(text, remove_html=False, lower_case=False, remove_punctuation=False, remove_stopwords=False, remove_digits=False, remove_emoji=False, remove_urls=False, stemming=False, spell_correction=False)


In [0]:
text ='Hi, Am Sakesh Pusuluri, This is my notebook on basic preprocessing, you can read more about text preprocessing from this link : https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing/data'

In [28]:
print(text)

Hi, Am Sakesh Pusuluri, This is my notebook on basic preprocessing, you can read more about text preprocessing from this link : https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing/data


<font color='blue'><b> Removing URLs and lower casing text </b></font>

In [29]:
preprocess.clean_raw_text(text,lower_case=True,remove_urls=True)

'hi, am sakesh pusuluri, this is my notebook on basic preprocessing, you can read more about text preprocessing from this link : '

<font color='lightgreen'><b>Yayyyyyy !... URLs have been removed and the text has been converted to lower case </b></font>
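
When URL removal is combined with punctuation removal, the order of the two steps matters: once `://` and `.` have been stripped, the URL pattern has nothing left to match. The sketch below illustrates this with the same regular expression used in the class. The helper names and the example.com address are made up for the illustration.

```python
import re
import string

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')  # same pattern as in the class
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_urls(text):
    """Delete anything that looks like an http(s) or www link."""
    return URL_PATTERN.sub('', text)

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)

sample = "read more at https://www.example.com/page today"

urls_first = strip_punctuation(strip_urls(sample))
punct_first = strip_urls(strip_punctuation(sample))

print(repr(urls_first))   # 'read more at  today'
print(repr(punct_first))  # 'read more at httpswwwexamplecompage today'
```

Removing URLs first leaves only a doubled space behind, while removing punctuation first leaves a mangled token in the text.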

<font color='blue'><b>Removing punctuation, digits and emojis</b> </font>

In [0]:
text=" Lahari music : South India's Largest Music Company #Lahari ❤ Music # No. 1 🎶"

In [31]:
preprocess.clean_raw_text(text,remove_punctuation=True,remove_digits=True,remove_emoji=True)

' Lahari music  South Indias Largest Music Company Lahari  Music  No  '

<font color='lightgreen'><b>Booom !... Punctuation, digits and emojis are removed </b></font>
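
A caveat about the emoji pattern: its last range, `\U000024C2-\U0001F251`, is very wide and also covers many non-emoji characters, including CJK ideographs. The sketch below shows the effect and compares it with a narrower variant. The narrower pattern is only an assumption for illustration (it simply drops the wide range) and will miss some symbols that the original catches.

```python
import re

# Same character class as in clean_raw_text.
broad_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags
                           "\U00002702-\U000027B0"
                           "\U000024C2-\U0001F251"  # very wide range
                           "]+", flags=re.UNICODE)

# Illustrative narrower variant: the wide range is left out.
narrow_pattern = re.compile("["
                            "\U0001F600-\U0001F64F"
                            "\U0001F300-\U0001F5FF"
                            "\U0001F680-\U0001F6FF"
                            "\U0001F1E0-\U0001F1FF"
                            "\U00002702-\U000027B0"
                            "]+", flags=re.UNICODE)

print(repr(broad_pattern.sub('', "日本語 text")))      # ' text'
print(repr(narrow_pattern.sub('', "日本語 text 🎶")))  # '日本語 text '
```

For English-only text the wide range is harmless, but for multilingual data it is worth checking what the pattern actually deletes.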

<font color='blue'><b> Spell correction </b></font>

In [0]:
text ="Am waiting outside ,come sono, let's go " 

In [41]:
preprocess.clean_raw_text(text,remove_punctuation=True,spell_correction=True)

'Am waiting outside come soon lets go'

<font color='lightgreen'><b>🤩 !... Spelling has been corrected</b></font> 
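
To give a feel for what happens behind the correction, here is a simplified sketch of the idea described in the Norvig article linked in the class: generate every string one edit away from the word and pick the most frequent known candidate. This is not the pyspellchecker implementation. `WORD_COUNTS` is a toy vocabulary with made-up counts, and only a single edit is considered.

```python
import string

# Hypothetical toy vocabulary with made-up counts; a real checker uses a large frequency list.
WORD_COUNTS = {"soon": 50, "son": 20, "come": 80, "go": 90, "waiting": 30, "outside": 25}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, or the word itself."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible found, leave the word unchanged
    return max(candidates, key=WORD_COUNTS.get)

print(correct("sono"))  # soon
```

Both "son" (one deletion) and "soon" (one transposition) are candidates for "sono"; the higher count decides in favour of "soon".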