# Basic steps for cleaning text data
  1. Removing punctuation
  2. Converting to upper or lower case
  3. Removing stopwords (a, an, the, ...)
  4. Removing unwanted text ('\n', '\x', ...)
  5. Tokenization (word tokenization, sentence tokenization; e.g. "Hii I am hari" is split into separate words, one per line)
  6. Lemmatization and stemming (stemming reduces a word to its root, which might not be a meaningful word, e.g. playing -> play;
                                 lemmatization returns a meaningful dictionary word)

# Advanced text cleaning steps:
  1. Normalization (mapping a short form to its full form, e.g. tc -> take care)
  2. Correction of typos (mapping a misspelled word to the correct word with the help of a dictionary)
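
The two advanced steps can be sketched with the standard library alone. This is a minimal illustration, not a production spell checker: the `ABBREVIATIONS` table and the `VOCABULARY` list are hypothetical stand-ins for real resources, and `difflib.get_close_matches` is just one simple way to find the nearest dictionary word.

```python
import difflib

# Hypothetical lookup table: short forms mapped to their expanded meaning.
ABBREVIATIONS = {"tc": "take care", "u": "you", "gn": "good night"}

# Hypothetical vocabulary used as the reference dictionary for typo correction.
VOCABULARY = ["message", "period", "sentence", "punctuation", "friend"]


def normalize(tokens):
    """Expand known abbreviations; leave every other token unchanged."""
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]


def correct_typo(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(normalize(["tc", "friend"]))
print(correct_typo("mesage"))
print(correct_typo("emoji"))
```

The `cutoff` value controls how similar a word must be before it is replaced; a higher value makes the correction more conservative.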

In [96]:
import nltk
import re
import numpy as np
import pandas as pd

In [103]:
text = ["!But, even   though it seems like there aren’t any rules when it comes to writing a text message", 
"there are some unspoken general guidelines—especially when    it comes to punctuation. Do you include a period at the end of a sentence?", 
"Can you use a smiley face emoji as a period instead?", 
"The answers to these questions—and many more—will vary. If you’re talking to your coworker, maybe leave out the snarky periods." ,
"But your friends will likely appreciate some creative emoji game to end a witty comment ?"]

In [104]:
text = str(text)  # str() of a list keeps the brackets and quotes; the punctuation step removes them

In [109]:
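from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

lm = nltk.WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups

def text_clean(text):
    """Clean a raw string and return a list of lemmatized tokens."""
    # Replace punctuation and other non-alphanumeric characters with spaces
    text = re.sub(r"[^A-Za-z0-9]", " ", text)

    # Lower case
    text = text.lower()

    # Tokenize into words
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [tok for tok in tokens if tok not in stop_words]

    # Lemmatize (every token is treated as a noun by default)
    return [lm.lemmatize(tok) for tok in tokens]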
    

In [110]:
text_clean(text)

['even',
 'though',
 'seems',
 'like',
 'rule',
 'come',
 'writing',
 'text',
 'message',
 'unspoken',
 'general',
 'guideline',
 'especially',
 'come',
 'punctuation',
 'include',
 'period',
 'end',
 'sentence',
 'use',
 'smiley',
 'face',
 'emoji',
 'period',
 'instead',
 'answer',
 'question',
 'many',
 'vary',
 'talking',
 'coworker',
 'maybe',
 'leave',
 'snarky',
 'period',
 'friend',
 'likely',
 'appreciate',
 'creative',
 'emoji',
 'game',
 'end',
 'witty',
 'comment']

## Step 1 --- Removing extra spaces

Text data often contains extra spaces between words or before and after a sentence. We start by removing these extra spaces with a regular expression.
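
As a minimal sketch of this step when the sentences are kept as separate strings, the same regular expression can be applied to each one. The helper name `squeeze_spaces` and the sample sentences are made up for illustration.

```python
import re

# Sample sentences with leading, trailing, and repeated inner whitespace.
sentences = ["  even   though it seems ", "when    it comes\tto punctuation\n"]


def squeeze_spaces(sentence):
    """Collapse whitespace runs to one space and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()


cleaned = [squeeze_spaces(s) for s in sentences]
print(cleaned)
```

The pattern `\s+` also matches tabs and newlines, so those are collapsed along with ordinary spaces.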

In [37]:
text = re.sub(r"\s+", " ", text)  # collapse each run of whitespace into a single space
text

"['!But, even though it seems like there aren’t any rules when it comes to writing a text message', 'there are some unspoken general guidelines—especially when it comes to punctuation. Do you include a period at the end of a sentence?', 'Can you use a smiley face emoji as a period instead?', 'The answers to these questions—and many more—will vary. If you’re talking to your coworker, maybe leave out the snarky periods.', 'But your friends will likely appreciate some creative emoji game to end a witty comment ?']"

## Step 2 --- Removing punctuation

Punctuation rarely adds value to this kind of analysis. When punctuation is attached to a word, that word no longer matches the same word written without punctuation.

In [38]:
"hi." == "hi"

False

In [39]:
"hi" == "hi"

True

In [50]:
text = re.sub(r"[^0-9A-Za-z]", " ", text)  # replace every non-alphanumeric character with a space
text

' but  even   though it seems like there aren t any rules when it comes to writing a text messagethere are some unspoken general guidelines especially when    it comes to punctuation  do you include a period at the end of a sentence can you use a smiley face emoji as a period instead the answers to these questions and many more will vary  if you re talking to your coworker  maybe leave out the snarky periods but your friends will likely appreciate some creative emoji game to end a witty comment  '
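
One side effect of the ASCII-only pattern is worth seeing in isolation: it splits contractions and drops accented letters. The sketch below compares it with a looser pattern that keeps Unicode word characters and apostrophes. The second pattern is an assumption offered for comparison, not the approach used in this notebook, and the sample string is made up.

```python
import re

# Sample text with a curly apostrophe (\u2019) and an accented letter (\u00e9).
sample = "Hi. If you\u2019re talking, aren't they? Caf\u00e9!"

# Same pattern as above: anything that is not an ASCII letter or digit becomes a space.
ascii_only = re.sub(r"[^0-9A-Za-z]", " ", sample)

# Alternative: keep Unicode word characters, whitespace, and apostrophes;
# replace everything else with a space.
keep_contractions = re.sub(r"[^\w\s'\u2019]", " ", sample)

print(ascii_only.split())
print(keep_contractions.split())
```

Which pattern fits better depends on the later steps: the stopword list must contain the same token forms (for example `aren` and `t`, or `aren't`) that the punctuation step produces.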

## Step 3 --- Case normalization

In this step, we convert all characters in the text to either upper or lower case. String comparison in Python is case sensitive, so NLP and nlp are treated as different words. A string can be converted to lower or upper case with:
str.lower() or str.upper().
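
A small sketch of why this matters when counting distinct tokens. The token list is made up, and `str.casefold()` is shown only as an optional, more aggressive variant for non-English text.

```python
tokens = ["NLP", "nlp", "Nlp"]

# Without normalization the three spellings count as three distinct tokens.
print(len(set(tokens)))

# After lower-casing they collapse into a single token.
lowered = [t.lower() for t in tokens]
print(len(set(lowered)))

# casefold() additionally maps characters such as the German sharp s (\u00df) to "ss".
same = "STRASSE".casefold() == "Stra\u00dfe".casefold()
print(same)
```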

In [95]:
text = text.lower()  # lower-case the whole string in one call
text

"['!but, even   though it seems like there aren’t any rules when it comes to writing a text message', 'there are some unspoken general guidelines—especially when    it comes to punctuation. do you include a period at the end of a sentence?', 'can you use a smiley face emoji as a period instead?', 'the answers to these questions—and many more—will vary. if you’re talking to your coworker, maybe leave out the snarky periods.', 'but your friends will likely appreciate some creative emoji game to end a witty comment ?']"

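## Step 4 --- Tokenization

Tokenization splits text into smaller units and collects them in a list, e.g. each sentence becomes a list of words. NLTK provides several tokenizers, including word_tokenize (words), sent_tokenize (sentences), and regexp_tokenize (pattern-based).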
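
To illustrate the difference between word-level and sentence-level tokenization without any downloads, here is a rough regex-based sketch. It is a simplified stand-in, not how the NLTK tokenizers work internally, and it will mishandle abbreviations such as "e.g.".

```python
import re

sample = "Hii I am hari. Do you include a period?"

# Word-level tokens: runs of letters, digits, or underscores.
words = re.findall(r"\w+", sample)

# Sentence-level tokens: split at whitespace that follows '.', '?' or '!'.
sentences = re.split(r"(?<=[.?!])\s+", sample)

print(words)
print(sentences)
```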

In [52]:
from nltk.tokenize import word_tokenize

In [54]:
word_tok=word_tokenize(text)
word_tok

['but',
 'even',
 'though',
 'it',
 'seems',
 'like',
 'there',
 'aren',
 't',
 'any',
 'rules',
 'when',
 'it',
 'comes',
 'to',
 'writing',
 'a',
 'text',
 'messagethere',
 'are',
 'some',
 'unspoken',
 'general',
 'guidelines',
 'especially',
 'when',
 'it',
 'comes',
 'to',
 'punctuation',
 'do',
 'you',
 'include',
 'a',
 'period',
 'at',
 'the',
 'end',
 'of',
 'a',
 'sentence',
 'can',
 'you',
 'use',
 'a',
 'smiley',
 'face',
 'emoji',
 'as',
 'a',
 'period',
 'instead',
 'the',
 'answers',
 'to',
 'these',
 'questions',
 'and',
 'many',
 'more',
 'will',
 'vary',
 'if',
 'you',
 're',
 'talking',
 'to',
 'your',
 'coworker',
 'maybe',
 'leave',
 'out',
 'the',
 'snarky',
 'periods',
 'but',
 'your',
 'friends',
 'will',
 'likely',
 'appreciate',
 'some',
 'creative',
 'emoji',
 'game',
 'to',
 'end',
 'a',
 'witty',
 'comment']

## Step 5 --- Removing stopwords

Stopwords include words such as I, he, she, and, but, was, were, being, and have, which add little meaning to the data. Removing them reduces the number of features in our data. They are removed after the text has been tokenized.

In [29]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [55]:
stop_words = set(stopwords.words('english'))  # load the list once instead of once per token
st = [i for i in word_tok if i not in stop_words]
st

['even',
 'though',
 'seems',
 'like',
 'rules',
 'comes',
 'writing',
 'text',
 'messagethere',
 'unspoken',
 'general',
 'guidelines',
 'especially',
 'comes',
 'punctuation',
 'include',
 'period',
 'end',
 'sentence',
 'use',
 'smiley',
 'face',
 'emoji',
 'period',
 'instead',
 'answers',
 'questions',
 'many',
 'vary',
 'talking',
 'coworker',
 'maybe',
 'leave',
 'snarky',
 'periods',
 'friends',
 'likely',
 'appreciate',
 'creative',
 'emoji',
 'game',
 'end',
 'witty',
 'comment']
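
The filtering logic itself does not depend on NLTK. The sketch below uses a small hypothetical stopword set to show the same idea; a real list, such as NLTK's English one, is much longer.

```python
# Hypothetical mini stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "it", "to", "at", "of", "do", "you"}

tokens = ["do", "you", "include", "a", "period", "at", "the", "end", "of", "a", "sentence"]

# A set gives constant-time membership tests, so the filter stays fast on long texts.
filtered = [tok for tok in tokens if tok not in STOPWORDS]
print(filtered)
```

Because the set is an ordinary Python object, task-specific words can be added to or removed from it before filtering, for example to keep negations such as "not" in sentiment analysis.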

## Step 6 --- Lemmatization and stemming

a. Stemming: A technique that reduces a word to its root form by stripping suffixes. The stemmed word might not be a dictionary word, i.e. it is not guaranteed to be meaningful. Two common stemmers are the Porter stemmer and the Snowball stemmer (an improved version of the Porter stemmer).

b. Lemmatization: Reduces a word to its dictionary form, called the lemma. NLTK's WordNet lemmatizer treats every word as a noun by default. Lemmatization is more accurate than stemming because it relies on a vocabulary and the word's part of speech rather than on suffix rules alone, so it is more complex and takes more time. It is used where the meaning of the words must be retained.
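
The contrast between the two ideas can be shown with a toy example. The suffix stripper below is deliberately naive and is not the Porter algorithm, and the `LEMMAS` table is a hypothetical stand-in for a real dictionary such as WordNet.

```python
# Toy suffix stripper (NOT the Porter algorithm): removes the first matching suffix.
SUFFIXES = ("ing", "es", "ly", "s")

# Hypothetical lemma table standing in for a real dictionary such as WordNet.
LEMMAS = {"studies": "study", "playing": "play", "friends": "friend", "generally": "general"}


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for w in ["studies", "playing", "friends", "generally"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Here "studies" is stemmed to the non-word "studi" but lemmatized to "study". In NLTK, `WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to noun, which is why verb forms such as "writing" and "talking" come back unchanged in the lemmatizer output further down.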

In [59]:
stemmer = nltk.PorterStemmer()
sm = [stemmer.stem(i) for i in st]  # keep the stemmer and its output under separate names
sm

['even',
 'though',
 'seem',
 'like',
 'rule',
 'come',
 'write',
 'text',
 'messagether',
 'unspoken',
 'gener',
 'guidelin',
 'especi',
 'come',
 'punctuat',
 'includ',
 'period',
 'end',
 'sentenc',
 'use',
 'smiley',
 'face',
 'emoji',
 'period',
 'instead',
 'answer',
 'question',
 'mani',
 'vari',
 'talk',
 'cowork',
 'mayb',
 'leav',
 'snarki',
 'period',
 'friend',
 'like',
 'appreci',
 'creativ',
 'emoji',
 'game',
 'end',
 'witti',
 'comment']

In [61]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [63]:
lemmatizer = nltk.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(i) for i in st]  # a separate name keeps the lm used by text_clean intact
lemmas

['even',
 'though',
 'seems',
 'like',
 'rule',
 'come',
 'writing',
 'text',
 'messagethere',
 'unspoken',
 'general',
 'guideline',
 'especially',
 'come',
 'punctuation',
 'include',
 'period',
 'end',
 'sentence',
 'use',
 'smiley',
 'face',
 'emoji',
 'period',
 'instead',
 'answer',
 'question',
 'many',
 'vary',
 'talking',
 'coworker',
 'maybe',
 'leave',
 'snarky',
 'period',
 'friend',
 'likely',
 'appreciate',
 'creative',
 'emoji',
 'game',
 'end',
 'witty',
 'comment']