# Working with Text Data - Text Preprocessing and Feature Extraction (Text to Numerical Vector)

##### Text Data

Text analysis is a major application field for machine learning algorithms. Some of the major application areas of Natural Language Processing (NLP) include:

1. Spell Checker, Keyword Search, etc.
2. Sentiment Analysis, Spam Classification
3. Machine Translation
4. Chatbots/Dialog Systems
5. Question Answering Systems

However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
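
To make the fixed-size requirement concrete, here is a minimal sketch that turns documents of different lengths into count vectors of equal length. The toy corpus and the helper name `to_count_vector` are made up for illustration, and words are split on whitespace only.

```python
from collections import Counter

# Toy corpus: three documents of different lengths (made up for illustration).
docs = [
    "the cat sat",
    "the cat sat on the mat",
    "dogs chase the cat",
]

# Build a fixed vocabulary from every whitespace-separated word in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def to_count_vector(doc, vocabulary):
    """Return one count per vocabulary word, so every vector has the same length."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Every vector has one entry per vocabulary word, regardless of how long the original document was.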

##### Why is NLP hard?
1. Complexity of representation
 Poems, idioms, sarcasm, etc. often mean something different from the literal meaning of their words.
 Example 1: This task is a piece of cake.
 Example 2: You have a football game tomorrow. Break a leg!

2. Ambiguity in Natural Language
 Ambiguity means uncertainty of meaning.
 For example: The car hit the pole while it was moving. (Here "it" could refer to either the car or the pole.)

#### Text Preprocessing

1. Removing special characters and punctuation
2. Converting text to lower case
3. Tokenization
4. Removing stop words
5. Stemming or Lemmatization

### Feature Extraction Techniques (Convert Text to Numerical Vectors)

1. Bag of Words
2. TF-IDF (Term Frequency - Inverse Document Frequency)
3. Word2Vec (by Google)
4. GloVe (Global Vectors by Stanford) - Not Covered in this notebook
5. Pretrained GloVe Embeddings
6. FastText (by Facebook) - Not Covered in this notebook
7. ELMo (Embeddings from Language Models) - Not Covered in this notebook
8. BERT (Bidirectional Encoder Representations from Transformers)
9. GPT (Generative Pre-trained Transformer)
10. LLMs (Large Language Models)
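
As a rough illustration of how TF-IDF weighs terms, the sketch below uses the plain textbook variant: term frequency is the count divided by the document length, and inverse document frequency is `log(N / df)`. This is an assumption made for illustration; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ. The corpus and the helper name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); "the" occurs in every document.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Map each term of one document to tf * idf (unsmoothed textbook variant)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, doc_weights in zip(docs, weights):
    print(doc, "->", {term: round(w, 3) for term, w in doc_weights.items()})
```

With this variant a term that occurs in every document (here "the") gets a weight of zero, while a term found in only one document (such as "cat") gets the highest weight.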

### Text Preprocessing Steps

Text preprocessing consists of a few essential steps that clean the available data and remove noise from it.

1. Removing Special Characters and Punctuation - Special characters include ^, ~, @, $, etc. Punctuation marks include the period (.), the question mark (?), the comma (,), etc.

2. Converting to Lower Case - We convert the whole text corpus to lower case to reduce the size of the vocabulary of our text data.

3. Tokenization (Sentence Tokenization and Word Tokenization) - This is a simple step to break the text into sentences or words.

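4. Removing Stop Words - Stop words are very common words that contribute little to the meaning of a sentence, so for many tasks we can remove them without losing much information. For example: it, was, any, then, a, is, by, etc. are stop words. Some care is needed, because standard stop word lists also contain words such as "not", which matter for tasks like sentiment analysis.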

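5. Stemming or Lemmatization - Stemming is the process of removing suffixes and reducing a word to some root form, which need not be a valid word. For example: warm, warmer, warming can be converted to warm. Lemmatization instead maps a word to its dictionary form (lemma) using a vocabulary such as WordNet, so the result is always a valid word.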
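
Steps 4 and 5 can be sketched without any library as follows. This is a toy illustration only: `STOP_WORDS` is a hypothetical mini list (NLTK's English list is much longer), and `toy_stem` is a naive suffix stripper, not the Porter algorithm. Real stemmers such as NLTK's `PorterStemmer` apply many more rules and can give different results.

```python
# Hypothetical mini stop-word list, far smaller than NLTK's English list.
STOP_WORDS = {"it", "was", "any", "then", "a", "is", "by", "the"}

# Suffixes the toy stemmer strips, checked in this order.
SUFFIXES = ("ing", "er", "ed", "s")


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "it was warmer by the fire then".split()
content_tokens = remove_stop_words(tokens)
stems = [toy_stem(token) for token in content_tokens]

print(content_tokens)  # tokens left after stop-word removal
print(stems)           # the same tokens after naive suffix stripping
print([toy_stem(word) for word in ["warm", "warmer", "warming"]])
```

The length check keeps very short words from being truncated. Even so, a rule this simple will mangle many real words, which is why dedicated stemmers and lemmatizers exist.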

Install nltk
! pip install nltk

In [2]:
import nltk

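# Download the Punkt tokenizer models (used for sentence and word tokenization)
nltk.download('punkt')
# Download the stop word lists
nltk.download('stopwords')
# Download WordNet and the Open Multilingual WordNet data before applying the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')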

[nltk_data] Downloading package punkt to /home/idreesy31/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/idreesy31/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/idreesy31/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/idreesy31/nltk_data...


True

In [4]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer


In [5]:
raw_text = "This 1is Natural-LAnguage-Processing. In this example wE are goIng to Learn variouS text9 preprocessing steps."

print(raw_text)

This 1is Natural-LAnguage-Processing. In this example wE are goIng to Learn variouS text9 preprocessing steps.


In [7]:
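# Replace every character except letters and periods with a space.
# This removes digits and special characters; periods are kept for sentence tokenization.
text = re.sub(r"[^a-zA-Z.]", ' ', raw_text)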
print(text)

This  is Natural LAnguage Processing. In this example wE are goIng to Learn variouS text  preprocessing steps.
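
One detail worth checking in a pattern like this is the upper bound of the upper-case range. In ASCII, the range `A-z` also covers the six symbols that sit between `Z` and `a` (the square brackets, backslash, caret, underscore and backtick), so they are not removed. The sketch below compares the two ranges on a made-up sample string.

```python
import re

# Made-up sample containing symbols that fall between 'Z' and 'a' in ASCII.
sample = "snake_case [draft] x^2 text9"

# A-z spans code points 65-122, which also covers [ \ ] ^ _ and the backtick.
loose = re.sub(r"[^a-zA-z.]", " ", sample)

# A-Z stops at 'Z', so those symbols are replaced like any other non-letter.
strict = re.sub(r"[^a-zA-Z.]", " ", sample)

print(repr(loose))
print(repr(strict))
```

On the `raw_text` used in this notebook both patterns give the same result, because it contains none of those six symbols.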


In [8]:
# change sentence to lower case
text = text.lower()
print(text)

this  is natural language processing. in this example we are going to learn various text  preprocessing steps.
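
The effect of lower-casing on vocabulary size can be seen in a small sketch. The sample string is made up for illustration.

```python
# Made-up sample in which the same two words appear in different casings.
mixed_case = "Text text TEXT Data data"

# The vocabulary is the set of distinct whitespace-separated tokens.
vocab_before = set(mixed_case.split())
vocab_after = set(mixed_case.lower().split())

print(len(vocab_before), sorted(vocab_before))
print(len(vocab_after), sorted(vocab_after))
```

The trade-off is that lower-casing also merges words whose meaning depends on case, such as "US" and "us".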


In [9]:
# tokenize text into sentences
my_sentences = sent_tokenize(text)
print(my_sentences)

['this  is natural language processing.', 'in this example we are going to learn various text  preprocessing steps.']
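
Word tokenization is the other half of the tokenization step. As a rough stand-in that needs no downloaded models, a regular expression can split each sentence into runs of letters and single punctuation marks. The helper `simple_word_tokenize` is hypothetical and only handles lower-case ASCII text. NLTK's `word_tokenize`, imported above, covers contractions and other cases that this sketch does not.

```python
import re

# The sentences produced by the sentence tokenization step above.
my_sentences = [
    "this  is natural language processing.",
    "in this example we are going to learn various text  preprocessing steps.",
]


def simple_word_tokenize(sentence):
    """Return runs of lower-case letters and individual non-space symbols."""
    return re.findall(r"[a-z]+|[^\sa-z]", sentence)


my_words = [simple_word_tokenize(sentence) for sentence in my_sentences]

for words in my_words:
    print(words)
```

Because the pattern never matches whitespace, the repeated spaces left behind by the cleaning step do not produce empty tokens.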
