### Text Preprocessing

#### Basic Text Preprocessing:
1. Lowercasing.
2. Removing HTML tags.
3. Removing punctuation.
4. Chat word treatment.
5. Spelling correction.
6. Removing stop words.
7. Handling emojis.
8. Tokenization.
9. Stemming.
10. Lemmatization.

In [2]:
import pandas as pd

In [31]:
data = {'review':['Hi My Name Is Manoj ND','<CENTER>This will center your contents</CENTER>']}
df = pd.DataFrame(data)

In [32]:
#### Lowercasing

df['review'] = df['review'].str.lower()
df

Unnamed: 0,review
0,hi my name is manoj nd
1,<center>this will center your contents</center>


In [33]:
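### Remove HTML tags and URLs using regular expressions

# A non-greedy pattern matches everything between "<" and ">", so each tag is removed on its own.
# URLs can be removed the same way with a pattern such as r'https?://\S+|www\.\S+'.

import re
def remove_html_tags(text):
    pattern = re.compile(r'<.*?>')
    return pattern.sub(r'', text)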

In [34]:
df.dtypes

review    object
dtype: object

In [None]:
df['review'].apply(remove_html_tags)
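
The heading above mentions URLs as well as HTML tags. The following is a minimal sketch of doing both in one helper; the function name `strip_tags_and_urls`, the URL pattern, and the sample string are assumptions made for illustration, and the URL pattern is deliberately simple rather than a complete URL grammar.

```python
import re

# Assumed patterns: a non-greedy tag matcher and a simple URL matcher.
TAG_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_tags_and_urls(text):
    """Remove HTML tags first, then URLs, then collapse leftover whitespace."""
    text = TAG_PATTERN.sub('', text)
    text = URL_PATTERN.sub('', text)
    return " ".join(text.split())

sample = '<center>visit https://example.com for details</center>'
print(strip_tags_and_urls(sample))  # visit for details
```

Tags are stripped before URLs so that a closing tag directly after a link is not swallowed by the `\S+` part of the URL pattern.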

In [None]:
# Remove punctuation

# string.punctuation contains: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

import string
exclude = string.punctuation

In [37]:
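def remove_punc(text):
    # Delete each punctuation character with a separate pass over the text.
    for char in exclude:
        text = text.replace(char, '')
    return text

# The loop above makes one pass per punctuation character, which is slow on large datasets.
# str.translate deletes all of them in a single pass:

def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))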

In [None]:
df['review'].apply(remove_punc1)
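
Deleting punctuation outright can glue together words that were separated only by a punctuation mark. A minimal sketch of the two choices, deleting versus replacing with a space, is shown below; the function names and the sample string are hypothetical.

```python
import string

# Translation table that deletes every punctuation character.
DELETE_TABLE = str.maketrans('', '', string.punctuation)
# Translation table that maps every punctuation character to a space.
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def delete_punctuation(text):
    return text.translate(DELETE_TABLE)

def space_out_punctuation(text):
    # Replace punctuation with spaces, then collapse repeated whitespace.
    return " ".join(text.translate(SPACE_TABLE).split())

sample = "good,bad...okay!"
print(delete_punctuation(sample))     # goodbadokay
print(space_out_punctuation(sample))  # good bad okay
```

Which variant fits depends on the data: for reviews typed without spaces after commas, the second one keeps word boundaries intact.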

##### Chat word treatment

e.g. rm, asap, gn

We want to expand these into their full forms: remove, as soon as possible, good night.


In [38]:
# Either prepare the short-form-to-full-form mapping yourself or take one from an online source.
# chat_words is a dictionary that maps each upper-case short form to its full form; it must be defined before this function is called.

def chat_conversion(text):
    new_text = []
    for a in text.split():
        if a.upper() in chat_words:
            new_text.append(chat_words[a.upper()])
        else:
            new_text.append(a)
    return " ".join(new_text)
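
The function above depends on a `chat_words` dictionary that is not shown. A self-contained sketch with a small, hypothetical mapping follows; the entries and the name `expand_chat_words` are made up for illustration, and a real list would be much longer.

```python
# Hypothetical mapping for illustration only.
chat_words = {
    'ASAP': 'as soon as possible',
    'GN': 'good night',
    'FYI': 'for your information',
}

def expand_chat_words(text):
    # Replace each whitespace-separated token that has an entry in chat_words.
    return " ".join(chat_words.get(word.upper(), word) for word in text.split())

print(expand_chat_words('reply asap and gn'))  # reply as soon as possible and good night
```

Because the lookup is done per whitespace-separated token, a short form followed directly by punctuation (for example `asap!`) is left unchanged, so it helps to remove punctuation first.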

In [53]:
#### Spelling correction
# We can use TextBlob, NLTK, or one of many other libraries

from textblob import TextBlob

incorrect_text = 'Certin conditins during several genration are modfied in the same error'
textblob =  TextBlob(incorrect_text)
result  = textblob.correct()
print(textblob)
print('The corrected one is below ----->')
print(result)

Certin conditins during several genration are modfied in the same error
The corrected one is below ----->
Certain conditions during several generation are modified in the same error


In [45]:
textblob

TextBlob("Certin conditins during several genration are modfied in the same error")

In [57]:
#### Removing stop words
# e.g. a, the, of, are, my
# We use NLTK's stop word list (run nltk.download('stopwords') once if the corpus is not installed).

from nltk.corpus import stopwords
stopwords.words('english')


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [58]:
stop_words = set(stopwords.words('english'))  # build the lookup set once, not once per word

def remove_stopwords(text):
    # Keep only the words that are not stop words; dropping them avoids leftover double spaces.
    return " ".join(word for word in text.split() if word not in stop_words)



In [59]:
df['review'].apply(remove_stopwords)

0                         hi name manoj nd
1    <center>this center contents</center>
Name: review, dtype: object

In [None]:
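#### Handling emojis

# Two options: 1. remove the emojis, 2. replace them with their meaning

# Remove emojis with a regular expression
def remove_emoji(text):
    # Illustrative Unicode ranges only (an assumption, not exhaustive); extend the character class as needed.
    emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+")
    return emoji_pattern.sub(r'', text)

# Replace emojis with their meaning
import emoji
emoji.demojize

# emoji.demojize(text) returns the text with each emoji replaced by its name between colons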
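
If the `emoji` package is not available, the "replace with the meaning" idea can be approximated with the standard library. The sketch below swaps in `unicodedata.name` instead of `emoji.demojize`; the code point ranges and the function names are assumptions for illustration, and the names it produces come from the Unicode character database, so they can differ from the ones the `emoji` package uses.

```python
import unicodedata

# Illustrative code point ranges only (an assumption, not a complete emoji definition).
EMOJI_RANGES = [(0x1F300, 0x1F5FF), (0x1F600, 0x1F64F), (0x1F680, 0x1F6FF)]

def is_emoji(char):
    return any(start <= ord(char) <= end for start, end in EMOJI_RANGES)

def describe_emojis(text):
    """Replace each emoji in the assumed ranges with :its_unicode_name:."""
    parts = []
    for char in text:
        if is_emoji(char):
            # Fall back to the generic label 'emoji' when the character has no name.
            name = unicodedata.name(char, 'emoji').lower().replace(' ', '_')
            parts.append(f':{name}:')
        else:
            parts.append(char)
    return "".join(parts)

print(describe_emojis('great movie \U0001F600'))  # great movie :grinning_face:
```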

### Tokenization

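##### Converting the text into tokens.
#### A tokenizer has to handle the following four cases:
1. Prefix: characters attached to the start of a word, e.g. ( "
2. Suffix: characters attached to the end of a word, e.g. km ) . ? !
3. Infix: characters between words, e.g. - -- / ...
4. Exception: special cases that need their own rule, e.g. let's, U.S.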
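
The four cases above can be sketched with a single regular expression. This is a toy tokenizer under stated assumptions, not how NLTK or spaCy work internally: the exception list, the pattern, and the name `simple_tokenize` are all hypothetical.

```python
import re

# Hypothetical exception list: strings that must stay as a single token.
EXCEPTIONS = ["U.S.", "let's"]

TOKEN_PATTERN = re.compile(
    "|".join(re.escape(e) for e in EXCEPTIONS)  # exceptions are tried first
    + r"|\d+"         # digit runs, so a unit suffix such as "km" splits off
    + r"|[^\W\d]+"    # runs of letters (and underscores)
    + r"|[^\w\s]"     # any single punctuation character (prefix, suffix, infix)
)

def simple_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("I moved to the U.S. (10km away), let's go!"))
# ['I', 'moved', 'to', 'the', 'U.S.', '(', '10', 'km', 'away', ')', ',', "let's", 'go', '!']
```

A plain `split()` on the same sentence would keep `(10km` and `away),` as single tokens, which is why prefix and suffix handling matters.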

In [2]:
# We have to be very careful with tokenization
#### 1. Using the split function

sent1 = 'I am going to Delhi'
sent1.split()
print(sent1.split())
######

sent2 = 'I am going to Delhi. I will stay in my friends house'
sent2.split('.')

['I', 'am', 'going', 'to', 'Delhi']


['I am going to Delhi', ' I will stay in my friends house']

In [3]:
##### Better ways to tokenize

##NLTK
from nltk.tokenize import word_tokenize, sent_tokenize

word_tokenize(sent1)

### spaCy
import spacy

In [None]:

nlp = spacy.load('en_core_web_sm')

doc1 = nlp(sent1)

In [6]:
#### Stemming: the process of reducing inflected words to a root form (stem), mapping a group of related words to
## the same stem even if the stem itself is not a valid word in the language.


from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()


In [7]:
def stem_word(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [None]:
#### Lemmatization: reduces inflected words to a root form that is guaranteed to be a valid word in the language.
## Lemma: a lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
# Stemming is faster than lemmatization.
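
The difference between the two can be sketched with toy versions of each. The suffix rules below are NOT the Porter algorithm, and the lookup table merely stands in for a real dictionary; both, along with the function names, are assumptions for illustration.

```python
# Toy suffix rules for illustration only; this is not the Porter algorithm.
SUFFIXES = ['ing', 'ed', 'es', 's']

# Hypothetical lookup table standing in for a real dictionary.
LEMMAS = {'studies': 'study', 'studying': 'study', 'went': 'go'}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown.
    return LEMMAS.get(word, word)

for w in ['studies', 'studying', 'went']:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies studi study
# studying study study
# went went go
```

The stem `studi` is not an English word, while every lemma is; the rule-based stemmer needs no dictionary, which is why stemming tends to be the faster of the two.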