## Steps for Preprocessing of Text in NLP

- Remove Newlines and Tabs
- Remove HTML Tags
- Remove Links
- Remove Whitespace
- Remove Accented Characters
- Case Conversion
- Reduce Repeated Characters and Punctuation
- Expand Contractions
- Remove Special Characters
- Remove Stop Words
- Spelling Correction
- Lemmatization (converting words to their root forms)


## Importing Packages

In [2]:
!pip install unidecode
!pip install autocorrect
!pip install nltk


In [3]:
import pandas as pd
import numpy as np
import re
import time
import unidecode
import nltk
import zipfile
import io
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller
import string
import timeit
from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Remove Newlines and Tabs

In [4]:
def new_line_tabs(text):
  # Replace the literal two-character sequence '\n' before lone backslashes;
  # otherwise the backslash is consumed first and a stray 'n' is left behind
  result = text.replace('\\n', ' ').replace('\\', ' ').replace('\t', ' ').replace('\n', ' ').replace('. com', '.com')
  return result
check = 'This is her \\ first day at this place.\n Please,\t Be nice to her.\\n'
print(check)
check1 = new_line_tabs(check)
print(check1)

This is her \ first day at this place.
 Please,	 Be nice to her.\n
This is her   first day at this place.  Please,  Be nice to her. 


## Removing HTML Tags

In [5]:
def remove_HTML_Tags(text):
  soup = BeautifulSoup(text, 'html.parser')
  # get_text is the current name of the method; the separator keeps words from
  # adjacent tags from running together
  result = soup.get_text(separator = ' ')
  return result
check = 'This is a <b>nice</b> place to live.<h1>Hello World</h1>'
print(check)
check1 = remove_HTML_Tags(check)
print(check1)

This is a <b>nice</b> place to live.<h1>Hello World</h1>
This is a  nice  place to live. Hello World


## Remove Links

In [6]:
def remove_links(text):
  rem_http = re.sub(r'http\S+', '', text)
  rem_com = re.sub(r'\ [A-Za-z]*\.com',' ',rem_http)
  return rem_com
check = ' website: catster.com  visit: https://catster.com//how-to-feed-cats adasfas ad as'
print(check)
check1 = remove_links(check)
print(check1)

 website: catster.com  visit: https://catster.com//how-to-feed-cats adasfas ad as
 website:   visit:  adasfas ad as
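
The function above only catches `http...` URLs and bare `.com` names that follow a space. A broader variant could look like the sketch below. The name `remove_links_broad` and the list of top-level domains (`com`, `org`, `net`) are assumptions chosen for illustration, not part of the original notebook, and the list would need extending for real data.

```python
import re

# URLs that start with a scheme or with "www."
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+', re.IGNORECASE)
# Bare domain names ending in a few common TLDs (illustrative list, an assumption)
BARE_DOMAIN_PATTERN = re.compile(
    r'\b[\w-]+(?:\.[\w-]+)*\.(?:com|org|net)\b(?:/\S*)?', re.IGNORECASE)

def remove_links_broad(text):
    # Remove full URLs first, then any remaining bare domain names
    text = URL_PATTERN.sub('', text)
    return BARE_DOMAIN_PATTERN.sub('', text)

print(remove_links_broad('see www.example.org/page and catster.com now'))
```

The leftover double spaces can be collapsed by the whitespace step that follows.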


## Remove Whitespace

In [7]:
def remove_whitespaces(text):
  pattern = re.compile(r'\s+')
  rws = re.sub(pattern, ' ', text)
  result = rws.replace('?', ' ? ').replace(')', ') ')
  return result
check = 'How   are  \t you \n   doing? (pakistan)ABC hello                world!'
print(check)
check1 = remove_whitespaces(check)
print(check1)


How   are  	 you 
   doing? (pakistan)ABC hello                world!
How are you doing ?  (pakistan) ABC hello world!


## Remove Accented Characters

In [8]:
def remove_accented_Word(text):
  result = unidecode.unidecode(text)
  return result
check = 'Málaga, àéêöhello'
print(check)
check1 = remove_accented_Word(check)
print(check1)

Málaga, àéêöhello
Malaga, aeeohello
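
If only accent marks need to be dropped and an extra dependency is unwanted, Unicode decomposition from the standard library is one possible alternative. This is a rough sketch rather than a replacement for `unidecode`: the helper name `strip_accents` is hypothetical, and characters with no decomposition (for example `ß` or `ø`) are left unchanged, whereas `unidecode` transliterates them.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into a base letter plus combining mark(s)
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep every other character
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Málaga, àéêöhello'))  # Malaga, aeeohello
```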


## Case Conversion

In [9]:
def case_conversion(text):
  if isinstance(text, pd.Series):
      result =  text.str.lower()
      return result
  elif isinstance(text, str):
      result =  text.lower()
      return result

check = 'Pakistan Zinda BAD!'
print(check)
check1 = case_conversion(check)
print(check1)


Pakistan Zinda BAD!
pakistan zinda bad!


## Reduce Repeated Characters and Punctuation

In [10]:
def reduce_repeated_characters(text):
    Pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)
    Formatted_text = Pattern_alpha.sub(r"\1\1", text)
    Pattern_Punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')
    Combined_Formatted = Pattern_Punct.sub(r'\1', Formatted_text)
    Final_Formatted = re.sub(' {2,}',' ', Combined_Formatted)
    return Final_Formatted

check = 'Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)'
print(check)
check1 = reduce_repeated_characters(check)
print(check1)

Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
Reallyy, Greeaatt !?.;:)


## Expand Contractions

In [11]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
}

In [12]:
def expand_contraction_words(text, cont_map = CONTRACTION_MAP):
  # Look up each space-separated token as a whole (the keys are lowercase, so run
  # this after case conversion). Whole-token lookup keeps the shorter key "can't"
  # from partially rewriting "can't've".
  items = [cont_map.get(item, item) for item in text.split(' ')]
  return ' '.join(items)
check = "ain't , aren't , can't , cause , can't've"
print(check)
check1 = expand_contraction_words(check)
print(check1)


ain't , aren't , can't , cause , can't've
is not , are not , cannot , cause , cannot have
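
Because the text is split on single spaces, a contraction followed directly by punctuation, such as `can't,`, may be missed. One possible alternative is a single regex built from the map keys. The sketch below is an illustration under assumptions: `SAMPLE_MAP` is a small stand-in for `CONTRACTION_MAP`, and the name `expand_contractions_regex` is hypothetical.

```python
import re

# Small illustrative subset; the full CONTRACTION_MAP could be passed instead
SAMPLE_MAP = {
    "can't": "cannot",
    "can't've": "cannot have",
    "it's": "it is",
    "don't": "do not",
}

def expand_contractions_regex(text, cont_map=SAMPLE_MAP):
    # Longest keys first, so "can't've" is tried before "can't"
    keys = sorted(cont_map, key=len, reverse=True)
    # The lookarounds require that the match is not glued to a letter, digit
    # or apostrophe, while still allowing adjacent punctuation such as a comma
    pattern = re.compile(
        r"(?<![\w'])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w'])",
        re.IGNORECASE)
    # Look up the lowercase form, so the expansion is always lowercase
    return pattern.sub(lambda m: cont_map[m.group(1).lower()], text)

print(expand_contractions_regex("It's late, I can't've missed it; don't worry."))
```

Note that the replacement is always the lowercase form from the map, so `It's` becomes `it is`.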


## Removing Special Characters

In [13]:
def remove_special_characters(text):
  # Keep letters, digits and : $ , % . ? ! - ; the hyphen goes last in the class,
  # because "$-," in the middle is read as a range that also lets ' ( ) * + through
  result = re.sub(r'[^a-zA-Z0-9:$,%.?!-]+', ' ', text)
  return result
check = ' Hello, K-a-j-a-l. Thi*s is $100.05 : the payment that you will receive! (Is this okay?) '
print(check)
check1 = remove_special_characters(check)
print(check1)

 Hello, K-a-j-a-l. Thi*s is $100.05 : the payment that you will receive! (Is this okay?) 
 Hello, K-a-j-a-l. Thi s is $100.05 : the payment that you will receive! Is this okay? 


## Removing Stop Words

In [14]:
stoplist = set(stopwords.words('english'))
def remove_stopwords(text):
  # Tokenize the text directly: wrapping it in repr() first glued a quote onto
  # the first word, so a leading stop word such as "This" was never removed
  result = [word for word in word_tokenize(str(text)) if word.lower() not in stoplist]
  return ' '.join(result)
check = 'This is Kajal from the delhi who was came here to study.'
print(check)
check1 = remove_stopwords(check)
print(check1)

This is Kajal from the delhi who was came here to study.
Kajal delhi came study .


## Spelling Correction

In [15]:
def spell_correction(text):
  spell = Speller(lang='en')
  result = spell(text)
  return result
check = 'This is Oberois from Dlhi who came hiree to stuydd'
print(check)
check1 = spell_correction(check)
print(check1)

This is Oberois from Dlhi who came hiree to stuydd
This is Oberoi from Delhi who came hired to study


## Lemmatization

In [22]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatization(text):
  # Lemmatize each whitespace-separated token as a verb ('v'), then re-join with
  # spaces; joining directly avoids str(list), which needed brackets, quotes and
  # commas stripped afterwards and so also deleted them from the text itself
  result = [lemmatizer.lemmatize(w,'v') for w in tokenizer.tokenize(text)]
  return ' '.join(result)
check = 'cats ext having root words only, no tense form, no plural are forms cats dogs'
print(check)
check1 = lemmatization(check)
print(check1)

cats ext having root words only, no tense form, no plural are forms cats dogs
cat ext have root word only, no tense form, no plural be form cat dog


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
