### Introduction


- This notebook is a beginner's guide to the main steps of text preprocessing.  
- These steps convert unstructured text into a structured form that a model can work with.  
- Source of tutorial: [FreeCodeCamp](https://www.freecodecamp.org/news/natural-language-processing-techniques-for-beginners/)

In [5]:
import re

### Pre-processing steps:

<table>
    <thead>
        <tr>
            <th>Step</th>
            <th>Explanation</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Tokenization</td>
            <td>split the text into word or sentence tokens</td>
        </tr>
        <tr>
            <td>Normalization</td>
            <td>convert to lowercase; remove numbers, punctuation, accents, and special characters (using Unicode), etc.</td>
        </tr>
        <tr>
            <td>Stemming</td>
            <td>reduce words to a root form by stripping affixes; faster but less accurate than lemmatization</td>
        </tr>
        <tr>
            <td>Lemmatization</td>
            <td>reduce words to their dictionary form (lemma) using vocabulary and part of speech; more accurate than stemming</td>
        </tr>
        <tr>
            <td>Remove stop-words</td>
            <td>remove words like "the" and "an" that contribute little to the overall meaning</td>
        </tr>
    </tbody>
</table>

### 1. Tokenization  

i) **Word Tokenization-** each word is a token. The text is broken down into individual words.  
ii) **Sentence Tokenization-** each sentence is a token. The text is broken down into individual sentences.
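
As a rough illustration of sentence tokenization, the sketch below splits text after `.`, `!` or `?` when whitespace follows. The helper name `sentence_tokenize` and the sample string are made up for this example, and the rule is deliberately naive: abbreviations such as "Dr." would be split incorrectly.

```python
import re

def sentence_tokenize(text):
    """Naively split text into sentences after '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [s for s in sentences if s]

sample = 'I like tea. Do you? Great!'
print(sentence_tokenize(sample))
# ['I like tea.', 'Do you?', 'Great!']
```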

#### Some tokenization algorithms
1. **Whitespace tokenization-** split the text on whitespace.
2. **Pattern tokenization (regex)-** split the text wherever a given regular expression matches.
3. **Punctuation-based tokenization-** split the text on whitespace and punctuation, etc.

The choice of algorithm depends on the task at hand.

In [7]:
text = 'For the powerful, crimes are those that others commit.'

In [8]:
# Whitespace tokenization with Python's built-in str.split(); punctuation stays attached to the words
words = text.split()
print(words)

['For', 'the', 'powerful,', 'crimes', 'are', 'those', 'that', 'others', 'commit.']
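
The whitespace split leaves punctuation attached to tokens such as `'powerful,'` and `'commit.'`. A minimal sketch of the other two approaches listed above, using only the `re` module (the variable names `pattern_tokens` and `punct_tokens` are chosen for this example):

```python
import re

text = 'For the powerful, crimes are those that others commit.'

# Pattern tokenization: split on every run of non-word characters.
pattern_tokens = [t for t in re.split(r'\W+', text) if t]
print(pattern_tokens)
# ['For', 'the', 'powerful', 'crimes', 'are', 'those', 'that', 'others', 'commit']

# Punctuation-based tokenization: keep each punctuation mark as its own token.
punct_tokens = re.findall(r'\w+|[^\w\s]', text)
print(punct_tokens)
# ['For', 'the', 'powerful', ',', 'crimes', 'are', 'those', 'that', 'others', 'commit', '.']
```

Whether punctuation should be dropped or kept as separate tokens depends on the downstream task.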


### 2. Normalization

Normalization makes the text consistent:

- convert everything to lowercase.
- remove numbers and punctuation marks.
- remove accents using Unicode.

#### a. Convert to lowercase

In [9]:
text2 = 'To THe mooN and bAcK.'
text2_lower = text2.lower()
print(text2_lower)

to the moon and back.


#### b. Remove Punctuation marks

In [10]:
# A selection of punctuation marks to strip; '!' is not in the set, so it survives in the output.
# re.compile() turns the pattern into a reusable regex object.
punctuation_marks = re.compile(r'[{};():,."/<>-]')
text3 = 'shefali: (This is a cool day!), <as of now>,'
text3_clean = punctuation_marks.sub('', text3)
print(text3_clean)

shefali This is a cool day! as of now
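
The normalization list also mentions removing numbers, and the pattern above covers only some punctuation marks. One possible sketch that handles both, assuming ASCII punctuation is all that needs removing (`text5` and the other variable names are made up for this example):

```python
import re
import string

text5 = 'In 2023, over 300 people attended (wow)!'

# Drop every run of digits.
no_numbers = re.sub(r'\d+', '', text5)

# Drop every ASCII punctuation character listed in string.punctuation.
no_punct = no_numbers.translate(str.maketrans('', '', string.punctuation))

# Collapse the extra whitespace left behind by the removals.
clean = ' '.join(no_punct.split())
print(clean)
# In over people attended wow
```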


#### c. Remove accents

In [12]:
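import unicodedata
text4 = 'After the café closed, René decided to visit his mère.'
# NFD splits 'é' into 'e' + a combining accent (U+0300 to U+036F), which the regex then removes.
accents_regex = re.compile('[\u0300-\u036F]')
accents_regex.sub('', unicodedata.normalize('NFD', text4))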

'After the cafe closed, Rene decided to visit his mere.'

### 3. Stemming

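Stemming reduces words to a root form by stripping suffixes.  
For example, ***"study, studying, studied"*** all reduce to one shared stem. The stem need not be a dictionary word: a Porter stemmer produces ***"studi"*** for all three.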
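
To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm, only a handful of hand-picked rules (the `SUFFIX_RULES` table and `simple_stem` function are invented for this sketch). In practice a library stemmer such as NLTK's `PorterStemmer` would be used instead.

```python
# Toy rules: (suffix to strip, replacement). Longer suffixes are checked first.
SUFFIX_RULES = [('ies', 'i'), ('ied', 'i'), ('ing', ''), ('ed', ''), ('s', '')]

def simple_stem(word):
    """Toy stemmer: apply the first matching suffix rule, then map a final 'y' to 'i'."""
    stem = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters remain.
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[:-len(suffix)] + replacement
            break
    if stem.endswith('y') and len(stem) > 3:
        stem = stem[:-1] + 'i'
    return stem

print([simple_stem(w) for w in ['study', 'studying', 'studied']])
# ['studi', 'studi', 'studi']
```

Note that the shared stem is not a real word, which is typical of stemming.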

### 4. Lemmatization

Lemmatization is generally more accurate than stemming.  
It considers the part of speech, for example whether a word is used as a noun or a verb.  
Before reducing a word to its base form (lemma), it takes the structure of the word into account.  
It is more computationally expensive than stemming because it needs this extra vocabulary and part-of-speech information.  
<hr>

For example, ***"I drink water every day"*** and ***"I like that drink."***  
The word "drink" is a verb in the first sentence and a noun in the second.  
Lemmatization takes such differences into account.
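
The role of part of speech can be sketched with a tiny lookup table. Everything here is hypothetical: `LEMMA_TABLE` is a hand-written mini-lexicon and `lemmatize` is an illustrative helper, whereas a real lemmatizer relies on a full dictionary and a part-of-speech tagger.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ('drinks', 'verb'): 'drink',
    ('drank', 'verb'): 'drink',
    ('drinks', 'noun'): 'drink',
    ('saw', 'verb'): 'see',
    ('saw', 'noun'): 'saw',
}

def lemmatize(word, pos):
    """Look up the lemma for a (word, pos) pair; fall back to the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)

print(lemmatize('saw', 'verb'))    # see
print(lemmatize('saw', 'noun'))    # saw
print(lemmatize('drank', 'verb'))  # drink
```

The same surface form, "saw", maps to different lemmas depending on its part of speech, which a stemmer cannot distinguish.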

### 5. Stop Words

Stop words include "the", "an", "that", "to", "because", "as", etc.  
These words add little meaning to a sentence.  
Removing them reduces the size of the dataset and can improve the efficiency of the model.
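
A minimal sketch of stop-word removal on an already tokenized sentence. The `STOP_WORDS` set is a small hand-written sample (the words listed above plus a few extras such as "for" and "are"), not a complete list; libraries ship much longer ones.

```python
# Small illustrative stop-word set; real lists are far longer.
STOP_WORDS = {'the', 'an', 'a', 'that', 'to', 'because', 'as', 'for', 'are'}

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ['For', 'the', 'powerful', 'crimes', 'are', 'those', 'that', 'others', 'commit']
print(remove_stop_words(tokens))
# ['powerful', 'crimes', 'those', 'others', 'commit']
```

Comparing in lowercase means that "For" and "for" are treated alike, which is why normalization usually comes before this step.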