## Text pre-processing in Python

### Text normalization includes
(1) Converting text to lowercase<br>
(2) Removing punctuation<br>
(3) Removing white spaces<br>
(4) Stop word removal<br>

#### Converting text to lowercase

In [1]:
input_str = "I am working on Text Pre-Processing" 
print(input_str.lower())

i am working on text pre-processing


#### Removing punctuation
The following code removes every punctuation character listed in `string.punctuation`.

In [10]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [11]:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
result = input_str.translate(str.maketrans("","",string.punctuation))
print(result)

This is an example of string with punctuation


#### Removing white spaces
To remove leading and trailing white space, you can use `strip()`.

In [12]:
input_str = " \t a string example\t "
input_str = input_str.strip()
input_str

'a string example'
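`strip()` only touches the two ends of the string. As a minimal sketch, runs of white space inside the text can be collapsed as well by splitting on white space and re-joining with single spaces; the sample string here is made up for illustration.

```python
# split() with no arguments splits on any run of spaces, tabs, or newlines
# and discards empty pieces, so re-joining leaves exactly one space between words.
input_str = " \t a   string\texample\t "
collapsed = " ".join(input_str.split())
print(collapsed)  # a string example
```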

#### Removing URLs

In [17]:
import re
input_str = "https://malpurepriya27.wixsite.com/nlpbasics. This is removing urls"
url = re.compile(r'https?://\S+|www\.\S+')
print(url.sub(r'',input_str))

 This is removing urls
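The normalization steps above can be chained into one helper. The sketch below is one possible ordering, not a fixed recipe: the function name `normalize` and the sample sentence (with the placeholder domain `example.com`) are assumptions made for illustration. URLs are removed before punctuation because stripping `:` and `/` first would leave the URL pattern with nothing to match.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, drop URLs, drop punctuation, then tidy white space."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)     # must run while "://" is still intact
    text = text.translate(PUNCT_TABLE)   # delete every char in string.punctuation
    return " ".join(text.split())        # strip ends and collapse inner runs

print(normalize("  Visit https://example.com/docs NOW!!  It's   Pre-Processing.  "))
# visit now its preprocessing
```

Note that removing punctuation also merges hyphenated words and contractions ("pre-processing" becomes "preprocessing"), which may or may not be what a given task needs.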


### Tokenization
The process of splitting text into smaller pieces called tokens.
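To make the idea concrete, here is a rough regex-based tokenizer using only the standard library. It is a simplified sketch (the name `simple_tokenize` is hypothetical), and dedicated tokenizers such as NLTK's `word_tokenize` treat cases like contractions with more care.

```python
import re

def simple_tokenize(text):
    # \w+      : a run of letters, digits, or underscores (one word token)
    # [^\w\s]  : any single character that is neither a word character
    #            nor white space, i.e. one punctuation mark per token
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Tokenization splits text, doesn't it?"))
# ['Tokenization', 'splits', 'text', ',', 'doesn', "'", 't', 'it', '?']
```

The output shows the weakness of the naive pattern: "doesn't" is broken into three tokens.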

#### Removing stop words

In [19]:
# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once beforehand
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [22]:
from nltk.tokenize import word_tokenize
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']


You can also use scikit-learn or spaCy to remove stop words.
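The NLTK stop-word list is lowercase, so a case-sensitive membership test lets capitalized stop words such as "The" through, and punctuation tokens like `'.'` also survive. A possible variant, sketched with a deliberately tiny hand-written stop list (`SMALL_STOP_WORDS` and `remove_stop_words` are hypothetical names, not part of any library), compares in lowercase and drops punctuation-only tokens:

```python
import string

# Hypothetical, deliberately tiny stop list for illustration only;
# in practice you would pass in a full list such as NLTK's.
SMALL_STOP_WORDS = {"is", "a", "for", "to", "with", "the"}

def remove_stop_words(tokens, stop_words=SMALL_STOP_WORDS):
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, compared case-insensitively
        if all(ch in string.punctuation for ch in token):
            continue  # token consists only of punctuation (or is empty)
        kept.append(token)
    return kept

tokens = ["The", "tokenizer", "is", "a", "tool", "for", "text", "."]
print(remove_stop_words(tokens))  # ['tokenizer', 'tool', 'text']
```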

### Stemming
Reducing words to their stem or root form.<br>
Example: books → book <br><br>
Two main algorithms:<br>
    (1) Porter stemming algorithm <br>
    (2) Lancaster stemming algorithm

In [26]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# `result` is the stop-word-filtered token list from the previous step
for word in result:
    print(stemmer.stem(word))

nltk
lead
platform
build
python
program
work
human
languag
data
.
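Stems such as "languag" are not dictionary words because stemmers work by stripping suffixes with rules rather than by looking words up. The toy function below illustrates that idea only. It is not the Porter or Lancaster algorithm, and its suffix list and minimum stem length are arbitrary assumptions.

```python
def naive_stem(word, suffixes=("ing", "es", "s", "e")):
    """Toy suffix stripper: NOT the Porter algorithm, just an illustration."""
    word = word.lower()
    for suffix in suffixes:
        # Strip the first matching suffix, but keep at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["leading", "programs", "language", "books", "data"]:
    print(w, "->", naive_stem(w))
# leading -> lead
# programs -> program
# language -> languag
# books -> book
# data -> data
```

Real stemmers apply many more rules and conditions, so this sketch will disagree with them on most other words.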


### Lemmatization
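Lemmatization is similar to stemming in that it also reduces words to a root form. <br>The difference is that it does not simply chop off inflections; <br>it returns a word with a dictionary meaning (the lemma). <br>Note that `WordNetLemmatizer.lemmatize` treats a word as a noun unless a `pos` argument says otherwise, which is why verb forms such as "been" come back unchanged in the example below.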

In [30]:
from nltk.tokenize import word_tokenize
input_str="been had done languages cities mice"
input_str=word_tokenize(input_str)

In [31]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
for word in input_str:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse


### Part of speech tagging
Assigns a part of speech to each word of the given text.<br>
`NLTK`, `spaCy`, and `TextBlob` all include POS taggers.

In [35]:
input_str= "Parts of speech examples: an article, to write, interesting, easily, and, of"

from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
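The tags follow the Penn Treebank convention, where noun tags start with `NN` and verb tags start with `VB`. As a small sketch of how such output might be used downstream, the snippet below filters the tag list printed above (copied in by hand, so it runs without TextBlob) by tag prefix.

```python
# Tag list copied from the TextBlob output above
tags = [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'),
        ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'),
        ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

# Keep words whose tag starts with the noun or verb prefix
nouns = [word for word, tag in tags if tag.startswith('NN')]
verbs = [word for word, tag in tags if tag.startswith('VB')]

print(nouns)  # ['Parts', 'speech', 'examples', 'article']
print(verbs)  # ['write', 'interesting']
```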
