**1. Removing HTML tags**

**Regex method**

In [1]:
import re

In [2]:
html_text = '<p>HTML list come in two flavors: ordered and unordered. Ordered list tags automatically inserts the right numbers for each of the list items, where as the unordered list tag inserts bullets.</p> <ul> <li>First item in the list</li> <li>Second item in the list</li> <li>Third item in the list</li> </ul>'

In [3]:
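# Non-greedy match: everything from '<' up to the nearest '>'
clean = re.compile(r'<.*?>')
# Replace every tag with an empty string
cleantext = clean.sub('', html_text)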

In [4]:
cleantext

'HTML list come in two flavors: ordered and unordered. Ordered list tags automatically inserts the right numbers for each of the list items, where as the unordered list tag inserts bullets.  First item in the list Second item in the list Third item in the list '
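
The regex removes the tags but leaves character references such as `&amp;` untouched and keeps the doubled spaces where tags used to be. A minimal sketch of one way to handle both, using only the standard library; `strip_tags` is a hypothetical helper name, and replacing each tag with a space (rather than an empty string) is an assumption that suits running text but inserts a space inside inline markup such as `<b>bold</b>text`.

```python
import html
import re

# [^>]+ also matches tags that span several lines, which '.' would not
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(markup):
    """Remove tags, decode character references, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", markup)   # a space keeps adjacent words from fusing
    decoded = html.unescape(no_tags)    # &amp; -> &, &lt; -> <, ...
    return " ".join(decoded.split())    # collapse whitespace runs and trim the ends

print(strip_tags("<p>Fish &amp; chips</p> <ul> <li>First item</li> </ul>"))
```

Decoding happens after tag removal so that a decoded `&lt;` is never mistaken for the start of a tag.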

**BeautifulSoup Method**

In [5]:
from bs4 import BeautifulSoup

In [6]:
html_text = '<p>HTML list come in two flavors: ordered and unordered. Ordered list tags automatically inserts the right numbers for each of the list items, where as the unordered list tag inserts bullets.</p> <ul> <li>First item in the list</li> <li>Second item in the list</li> <li>Third item in the list</li> </ul>'

In [7]:
cleantext = BeautifulSoup(html_text, "html.parser").text

In [8]:
cleantext

'HTML list come in two flavors: ordered and unordered. Ordered list tags automatically inserts the right numbers for each of the list items, where as the unordered list tag inserts bullets.  First item in the list Second item in the list Third item in the list '
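
When BeautifulSoup is not available, a similar result can be sketched with the standard library's `html.parser` module, which is the parser the `"html.parser"` argument refers to. The class and function names below (`TextExtractor`, `html_to_text`) are hypothetical, and skipping `<script>` and `<style>` contents is a design choice of this sketch, not a description of what BeautifulSoup does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()      # convert_charrefs=True by default, so &amp; arrives as &
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text nodes, then collapse whitespace runs
    return " ".join("".join(parser.parts).split())

print(html_to_text("<p>Hello <b>world</b></p><script>var x = 1;</script> <ul><li>A &amp; B</li></ul>"))
```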

**2. Removing stop words**

**NLTK method**

In [9]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

In [10]:
stop_words = set(stopwords.words("english"))

In [11]:
text = 'Machine Learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'
text

'Machine Learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'

In [12]:
word_tokens = word_tokenize(text)
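# The comparison is case-sensitive and the stop-word list is lowercase,
# so capitalized stop words such as "It" are kept
filtered_text = [word for word in word_tokens if word not in stop_words]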

In [13]:
' '.join(filtered_text)

"Machine Learning ( ML ) study computer algorithms improve automatically experience . [ 1 ] [ 2 ] It seen subset artificial intelligence . machine learning algorithms build mathematical model based sample data , known `` training data '' , order make predictions decisions without explicitly programmed ."

**3. Removing extra spaces**

In [14]:
text = 'Machine learning (ML) is      the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. Machine learning algorithms build                a mathematical model based on sample data,      known as "training data", in order to make predictions or decisions without being      explicitly programmed to do so.'
text

'Machine learning (ML) is      the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. Machine learning algorithms build                a mathematical model based on sample data,      known as "training data", in order to make predictions or decisions without being      explicitly programmed to do so.'

In [15]:
txt = text.split()

In [16]:
' '.join(txt)

'Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'
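
`split()` followed by `join` treats every whitespace character the same way, so newlines are flattened too. If line or paragraph breaks need to survive, one option is to collapse only spaces and tabs within each line. This is a sketch under that assumption; `squeeze_spaces` is a hypothetical helper name.

```python
import re

def squeeze_spaces(text):
    """Collapse runs of spaces and tabs within each line, keeping line breaks."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)

print(squeeze_spaces("Machine learning   is\tfun.\n  Second     line.  "))
```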

**4. Converting numbers to text**

In [17]:
from num2words import num2words as n2w
import spacy
nlp = spacy.load('en_core_web_sm')

In [18]:
text = 'I will be there by 3. Its 5 am now. Can the meeting be shifted to 7.'

text

'I will be there by 3. Its 5 am now. Can the meeting be shifted to 7.'

In [19]:
doc = nlp(text)
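# Assumes every NUM token is written in digits; number words such as "three"
# are also tagged NUM by spaCy and cannot be parsed by n2w()
tokens = [n2w(token.text) if token.pos_ == 'NUM' else token.text for token in doc]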

In [20]:
' '.join(tokens)

'I will be there by three . Its five am now . Can the meeting be shifted to seven .'
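
Joining tokens with spaces leaves a gap before each period. One alternative is to substitute the numbers in place with a regex so the original spacing is untouched. The sketch below does that for standalone one- or two-digit integers only. `small_number_to_words` is a hypothetical stand-in for `num2words` so the snippet runs without third-party packages, and `spell_out_numbers` is likewise an illustrative name.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def small_number_to_words(n):
    """Hypothetical stand-in for num2words, valid for 0 <= n <= 99 only."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

# One or two digits that are not part of a longer number, a word, or a decimal
NUMBER_RE = re.compile(r"(?<![\w.])\d{1,2}(?!\w|\.\d)")

def spell_out_numbers(text, to_words=small_number_to_words):
    return NUMBER_RE.sub(lambda m: to_words(int(m.group())), text)

print(spell_out_numbers("I will be there by 3. Its 5 am now. Can the meeting be shifted to 7."))
```

Since `num2words` accepts integers, passing it as `to_words` should work as well, in which case the `{1,2}` limit in the pattern could be relaxed.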

**5. Lowercasing the text**

In [21]:
text = 'Machine Learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'
text

'Machine Learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'

In [22]:
lowered_text = text.lower()

In [23]:
lowered_text

'machine learning (ml) is the study of computer algorithms that improve automatically through experience.[1][2] it is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'
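
For English text `lower()` is usually enough. When text may contain other languages, `str.casefold()` is the more aggressive option intended for caseless comparison. A short sketch of the difference; `normalize_case` is a hypothetical helper name.

```python
def normalize_case(text):
    """Fold case for comparison purposes; stronger than lower() for some scripts."""
    return text.casefold()

print("Straße".lower())      # straße
print("Straße".casefold())   # strasse
print(normalize_case("Machine Learning (ML)"))
```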

**6. Tokenization**

In [24]:
from nltk.tokenize import word_tokenize

In [25]:
text = 'Machine Learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'
text

'Machine Learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.'

In [26]:
word_tokens = word_tokenize(text)

In [27]:
word_tokens

['Machine',
 'Learning',
 '(',
 'ML',
 ')',
 'is',
 'the',
 'study',
 'of',
 'computer',
 'algorithms',
 'that',
 'improve',
 'automatically',
 'through',
 'experience',
 '.',
 '[',
 '1',
 ']',
 '[',
 '2',
 ']',
 'It',
 'is',
 'seen',
 'as',
 'a',
 'subset',
 'of',
 'artificial',
 'intelligence',
 '.',
 'machine',
 'learning',
 'algorithms',
 'build',
 'a',
 'mathematical',
 'model',
 'based',
 'on',
 'sample',
 'data',
 ',',
 'known',
 'as',
 '``',
 'training',
 'data',
 "''",
 ',',
 'in',
 'order',
 'to',
 'make',
 'predictions',
 'or',
 'decisions',
 'without',
 'being',
 'explicitly',
 'programmed',
 'to',
 'do',
 'so',
 '.']
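
For quick experiments where NLTK's tokenizer data is not installed, a rough approximation can be sketched with a single regex. It is not equivalent to `word_tokenize`: quotation marks stay as `"` instead of becoming `` `` `` and `''`, and contractions such as "don't" stay in one piece. `simple_tokenize` is a hypothetical helper name.

```python
import re

# A word (optionally with an internal apostrophe), or any single non-space symbol
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_RE.findall(text)

print(simple_tokenize('known as "training data", in order'))
print(simple_tokenize("experience.[1][2] It"))
```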

**7. Stemming**

In [28]:
from nltk.stem import PorterStemmer

In [29]:
ps = PorterStemmer()

In [30]:
words = ["sitting", "thinking", "going", "linked", "likely"] 
  
for w in words: 
    print(w, " : ", ps.stem(w))

sitting  :  sit
thinking  :  think
going  :  go
linked  :  link
likely  :  like


**8. Lemmatization**

In [31]:
from nltk.stem import WordNetLemmatizer 

In [32]:
lemmatizer = WordNetLemmatizer() 

In [33]:
# pos defaults to 'n' (noun); pass 'v' for verbs and 'a' for adjectives
print("rocks :", lemmatizer.lemmatize("rocks"))
print("eating :", lemmatizer.lemmatize("eating", pos='v'))
print("worse :", lemmatizer.lemmatize("worse", pos='a'))

rocks : rock
eating : eat
worse : bad


**9. Spell Checker**

In [34]:
from nltk.tokenize import word_tokenize
from textblob import TextBlob 

In [35]:
text = 'I am ging there. Will brng the thngs from tem'

In [36]:
tokens = word_tokenize(text)
# Correct each token independently; correct() returns a TextBlob, so convert it back to str
res = [str(TextBlob(token).correct()) for token in tokens]

In [37]:
' '.join(res)

'I am going there . Will bring the things from them'
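
The idea behind token-level correction can be illustrated without TextBlob by picking the closest entry from a word list with `difflib.get_close_matches`. This is a swapped-in technique, not how TextBlob works internally. The tiny `VOCAB` list and the helper names `correct_token` and `correct_text` are hypothetical; a usable checker would need a full dictionary and word frequencies.

```python
import difflib
import re

# Hypothetical toy vocabulary; a real checker needs a full word list
VOCAB = ["i", "am", "going", "there", "will", "bring", "the", "things", "from", "them"]

def correct_token(token, vocab=VOCAB):
    """Return the closest vocabulary word, or the token itself if nothing is close."""
    if not token.isalpha() or token.lower() in vocab:
        return token    # leave punctuation and known words untouched
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=0.6)
    return matches[0] if matches else token

def correct_text(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)    # words, or single punctuation marks
    return " ".join(correct_token(tok) for tok in tokens)

print(correct_text("I am ging there. Will brng the thngs from tem"))
```

Because each token is corrected in isolation, neither this sketch nor the per-token loop above can use context to choose between equally close candidates.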