# Text Data - Preprocessing and Text to Numerical Vector

# Text Preprocessing 

1. Tokenisation

2. Removing special characters

3. Converting sentences to lower case

4. Removing stop words

5. Stemming or Lemmatization

# Techniques to convert Text to Numerical Vectors

1. Bag of Words

2. TF-IDF (Term Frequency - Inverse Document Frequency)

3. Word2Vec (by Google)

4. GloVe (Global Vectors by Stanford) - Not Covered in this notebook

5. Pretrained GloVe Embeddings

6. FastText (by Facebook) - Not Covered in this notebook

7. ELMo (Embeddings from Language Models) - Not Covered in this notebook

8. BERT (Bidirectional Encoder Representations from Transformers)

# Data Preparation

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
lst_text = ['it Was the best oF Times $', 
            'It was The worst of times.',
            'IT 9 was tHe age Of wisdom', 
            'it was thE age of foolishness']

df = pd.DataFrame({'text': lst_text})

df.head()

Unnamed: 0,text
0,it Was the best oF Times $
1,It was The worst of times.
2,IT 9 was tHe age Of wisdom
3,it was thE age of foolishness


In [4]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 398.8 kB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2022.9.13-cp310-cp310-win_amd64.whl (267 kB)
     ------------------------------------ 267.7/267.7 kB 232.1 kB/s eta 0:00:00
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     -------------------------------------- 78.5/78.5 kB 483.9 kB/s eta 0:00:00
Installing collected packages: tqdm, regex, nltk
Successfully installed nltk-3.7 regex-2022.9.13 tqdm-4.64.1


In [5]:
import nltk
nltk.download('stopwords')
# Downloading wordnet before applying Lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SNEGHAL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SNEGHAL\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\SNEGHAL\AppData\Roaming\nltk_data...


True

In [6]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [7]:
## initialise the inbuilt Stemmer
stemmer = PorterStemmer()

In [8]:
## We can also use Lemmatizer instead of Stemmer
lemmatizer = WordNetLemmatizer()

# Text Preprocessing Steps

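Text preprocessing cleans the raw text and removes noise before it is converted to numbers.

1. Removing Special Characters and Punctuation - Symbols, digits, and punctuation rarely help a word-count model, so we replace them with spaces.

2. Converting to Lower Case - We convert the whole text corpus to lower case so that variants such as "The", "the", and "tHe" map to one token, which reduces the size of the vocabulary.

3. Removing Stop Words - Stop words are very frequent words that carry little meaning on their own, e.g. it, was, any, then, a, is, by. For bag-of-words style models we can usually remove them without losing much information, although the meaning can change in some cases (for example, "not" is in NLTK's English stop-word list).

4. Stemming or Lemmatization - Both reduce a word to a base form. Stemming strips suffixes with heuristic rules, so the result may not be a dictionary word (e.g. warming and warmed become warm, but language becomes languag). Lemmatization looks words up in a vocabulary (WordNet here) and returns a valid dictionary form, the lemma.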
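
To make the lower-casing step concrete, here is a minimal sketch that counts the distinct tokens in the four sample sentences from the Data Preparation section, with and without lower-casing. It uses only the standard library. The helper name `vocabulary` is hypothetical and not part of the notebook.

```python
import re

# Sample sentences from the Data Preparation section
lst_text = ['it Was the best oF Times $',
            'It was The worst of times.',
            'IT 9 was tHe age Of wisdom',
            'it was thE age of foolishness']

def vocabulary(sentences, lowercase):
    """Return the set of distinct alphabetic tokens (hypothetical helper)."""
    vocab = set()
    for sentence in sentences:
        # Replace every non-letter with a space, as the notebook's regex does
        cleaned = re.sub("[^a-zA-Z]", " ", sentence)
        if lowercase:
            cleaned = cleaned.lower()
        vocab.update(cleaned.split())
    return vocab

case_sensitive = vocabulary(lst_text, lowercase=False)
case_folded = vocabulary(lst_text, lowercase=True)

print(len(case_sensitive))  # 19 distinct tokens when case is preserved
print(len(case_folded))     # 10 distinct tokens after lower-casing
```

Lower-casing alone nearly halves the vocabulary on this tiny corpus, because spellings such as "the", "The", "tHe", and "thE" collapse into one token.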

In [9]:
raw_text = "This 1is Natural-LAnguage-Processing."
print(raw_text)

This 1is Natural-LAnguage-Processing.


In [10]:
# Removing special characters and digits
sentence = re.sub("[^a-zA-Z]", " ", raw_text)
print(sentence)

This  is Natural LAnguage Processing 


In [11]:
# change sentence to lower case
sentence = sentence.lower()
print(sentence)

this  is natural language processing 


In [12]:
# tokenize into words
tokens = sentence.split()
print(tokens)

['this', 'is', 'natural', 'language', 'processing']


In [13]:
# Removing stop words (build the set once instead of re-reading the list for every token)
stop_words = set(stopwords.words("english"))
clean_tokens = [t for t in tokens if t not in stop_words]
print(clean_tokens)

['natural', 'language', 'processing']


In [14]:
# Stemming
clean_tokens_stem = [stemmer.stem(word) for word in clean_tokens]
print(clean_tokens_stem)

['natur', 'languag', 'process']


In [15]:
# Lemmatizing
clean_tokens_lem = [lemmatizer.lemmatize(word) for word in clean_tokens]
print(clean_tokens_lem)

['natural', 'language', 'processing']


In [17]:
def preprocess(raw_text, flag):
    # Removing special characters and digits
    sentence = re.sub("[^a-zA-Z]", " ", raw_text)
    
    # change sentence to lower case
    sentence = sentence.lower()

    # tokenize into words
    tokens = sentence.split()
    
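    # remove stop words (build the set once; set lookups are much faster than
    # re-reading the stop-word list for every token)
    stop_words = set(stopwords.words("english"))
    clean_tokens = [t for t in tokens if t not in stop_words]

    # Stemming/Lemmatization
    if flag == 'stem':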
        clean_tokens = [stemmer.stem(word) for word in clean_tokens]
    else:
        clean_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]
    
    return pd.Series([" ".join(clean_tokens), len(clean_tokens)])

In [18]:
temp_df = df['text'].apply(lambda x : preprocess(x, 'stem'))

temp_df.head()

Unnamed: 0,0,1
0,best time,2
1,worst time,2
2,age wisdom,2
3,age foolish,2


In [19]:
temp_df.columns = ['clean_text_stem', 'text_length_stem']

temp_df.head()

Unnamed: 0,clean_text_stem,text_length_stem
0,best time,2
1,worst time,2
2,age wisdom,2
3,age foolish,2


In [20]:
df = pd.concat([df, temp_df], axis=1)

df.head()

Unnamed: 0,text,clean_text_stem,text_length_stem
0,it Was the best oF Times $,best time,2
1,It was The worst of times.,worst time,2
2,IT 9 was tHe age Of wisdom,age wisdom,2
3,it was thE age of foolishness,age foolish,2


In [21]:
temp_df = df['text'].apply(lambda x: preprocess(x, 'lemma'))

temp_df.head()

Unnamed: 0,0,1
0,best time,2
1,worst time,2
2,age wisdom,2
3,age foolishness,2


In [22]:
temp_df.columns = ['clean_text_lemma', 'text_length_lemma']

temp_df.head()

Unnamed: 0,clean_text_lemma,text_length_lemma
0,best time,2
1,worst time,2
2,age wisdom,2
3,age foolishness,2


In [23]:
df = pd.concat([df, temp_df], axis=1)

df.head()


Unnamed: 0,text,clean_text_stem,text_length_stem,clean_text_lemma,text_length_lemma
0,it Was the best oF Times $,best time,2,best time,2
1,It was The worst of times.,worst time,2,worst time,2
2,IT 9 was tHe age Of wisdom,age wisdom,2,age wisdom,2
3,it was thE age of foolishness,age foolish,2,age foolishness,2


# Bag of Words

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
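
As a rough illustration of the idea, the sketch below builds a bag-of-words matrix by hand using only the standard library. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which is how scikit-learn's `CountVectorizer` orders its columns. The function name `bag_of_words` is hypothetical.

```python
from collections import Counter

# Lemmatized documents produced by the preprocessing step
docs = ['best time', 'worst time', 'age wisdom', 'age foolishness']

# Vocabulary: every distinct token, sorted alphabetically to fix the column order
vocab = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in doc; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word
dtm = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```

Each row only records how many times each word occurs, so "best time" and "time best" would produce the same vector.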

In [24]:
df.head()

Unnamed: 0,text,clean_text_stem,text_length_stem,clean_text_lemma,text_length_lemma
0,it Was the best oF Times $,best time,2,best time,2
1,It was The worst of times.,worst time,2,worst time,2
2,IT 9 was tHe age Of wisdom,age wisdom,2,age wisdom,2
3,it was thE age of foolishness,age foolish,2,age foolishness,2


In [25]:
# Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vocab = CountVectorizer()

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.

dtm = vocab.fit_transform(df['clean_text_lemma'])

# fit_transform() could also be done separately, as shown below
# vocab.fit(df.clean_text_lemma)
# dtm = vocab.transform(df.clean_text_lemma)

In [26]:
# We can look at unique words by using 'vocabulary_'

vocab.vocabulary_

{'best': 1, 'time': 3, 'worst': 5, 'age': 0, 'wisdom': 4, 'foolishness': 2}

In [27]:
# Observe that the type of dtm is sparse

print(type(dtm))

<class 'scipy.sparse._csr.csr_matrix'>


In [28]:
# Let's now print the shape of this dtm

print(dtm.shape)

# o/p -> (4, 6)
# i.e -> 4 documents and 6 unique words

(4, 6)


In [29]:
# Let's look at the dtm

print(dtm)

# Remember that dtm is a sparse matrix, i.e. zeros are not stored.
# Let's read the first line of output -> (0, 1)    1
# (0, 1) means document 0 and the word at vocabulary index 1 (indices start from 0).
# We have 4 documents and 6 unique words in total.
# The trailing 1 is the number of occurrences of that word in that document.
# In plain English:
# (0, 1)    1 -> 'best' occurs 1 time in document 0.
# Try to interpret -> (3, 2)   1

  (0, 1)	1
  (0, 3)	1
  (1, 3)	1
  (1, 5)	1
  (2, 0)	1
  (2, 4)	1
  (3, 0)	1
  (3, 2)	1
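
The sparse listing above can be expanded by hand. The following sketch copies the printed (document, word index) entries into a plain dictionary and fills a dense matrix from them. It uses only the standard library, and names such as `entries` and `index_to_word` are illustrative rather than part of SciPy's API.

```python
# (document index, word index) -> count, copied from the printed sparse matrix
entries = {(0, 1): 1, (0, 3): 1,
           (1, 3): 1, (1, 5): 1,
           (2, 0): 1, (2, 4): 1,
           (3, 0): 1, (3, 2): 1}

# Word index -> word, i.e. the inverse of vocab.vocabulary_
index_to_word = {0: 'age', 1: 'best', 2: 'foolishness',
                 3: 'time', 4: 'wisdom', 5: 'worst'}

n_docs, n_words = 4, 6

# Start from an all-zero matrix and fill in only the stored entries
dense = [[0] * n_words for _ in range(n_docs)]
for (doc_id, word_id), count in entries.items():
    dense[doc_id][word_id] = count
    print(f"document {doc_id}: '{index_to_word[word_id]}' occurs {count} time(s)")

stored = len(entries)
total = n_docs * n_words
print(f"{stored} of {total} cells are non-zero")
```

Only 8 of the 24 cells hold a value here. On a realistic corpus with thousands of vocabulary words the non-zero fraction is far smaller, which is why the vectorizer returns a sparse matrix.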


In [30]:
# Since the dtm is sparse, let's convert it into a NumPy array.

print(dtm.toarray())

[[0 1 0 1 0 0]
 [0 0 0 1 0 1]
 [1 0 0 0 1 0]
 [1 0 1 0 0 0]]


In [31]:
sorted(vocab.vocabulary_)

['age', 'best', 'foolishness', 'time', 'wisdom', 'worst']

In [32]:
pd.DataFrame(dtm.toarray(), columns=sorted(vocab.vocabulary_))

Unnamed: 0,age,best,foolishness,time,wisdom,worst
0,0,1,0,1,0,0
1,0,0,0,1,0,1
2,1,0,0,0,1,0
3,1,0,1,0,0,0


In [33]:
# Unigrams and bigrams (1-grams and 2-grams)

vocab = CountVectorizer(ngram_range=(1, 2))

dtm = vocab.fit_transform(df.clean_text_stem)

In [34]:
print(vocab.vocabulary_)

{'best': 3, 'time': 6, 'best time': 4, 'worst': 8, 'worst time': 9, 'age': 0, 'wisdom': 7, 'age wisdom': 2, 'foolish': 5, 'age foolish': 1}


In [35]:
# convert sparse matrix to numpy array
print(dtm.toarray())

[[0 0 0 1 1 0 1 0 0 0]
 [0 0 0 0 0 0 1 0 1 1]
 [1 0 1 0 0 0 0 1 0 0]
 [1 1 0 0 0 1 0 0 0 0]]


In [36]:
pd.DataFrame(dtm.toarray(), columns=sorted(vocab.vocabulary_))

Unnamed: 0,age,age foolish,age wisdom,best,best time,foolish,time,wisdom,worst,worst time
0,0,0,0,1,1,0,1,0,0,0
1,0,0,0,0,0,0,1,0,1,1
2,1,0,1,0,0,0,0,1,0,0
3,1,1,0,0,0,1,0,0,0,0
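
To see where the ten columns come from, here is a small standard-library sketch that generates the unigrams and bigrams of each stemmed document. The helper `ngrams` is hypothetical. It assumes whitespace tokenization and joins the tokens of each n-gram with a single space.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Stemmed documents produced by the preprocessing step
docs = ['best time', 'worst time', 'age wisdom', 'age foolish']

# Union of all n-grams, sorted alphabetically to fix the column order
vocab = sorted({g for doc in docs for g in ngrams(doc.split())})

print(ngrams('best time'.split()))
print(vocab)
```

Each two-word document contributes two unigrams and one bigram. After removing duplicates ("time" and "age" each appear twice), 10 distinct features remain.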


# Term Frequency Inverse Document Frequency

In the BoW approach, all words in the text are treated as equally important, i.e. there is no notion of some words in a document being more important than others. TF-IDF, or term frequency-inverse document frequency, addresses this issue. It aims to quantify the importance of a given word relative to the other words in the document and in the corpus.

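1. Term Frequency (TF): how often a word occurs in a document.

2. Inverse Document Frequency (IDF): how rare a word is across the documents of the corpus.

TF(t, d) = (number of times term t occurs in document d) / (total number of terms in d)

IDF(t) = log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents that contain t

The TF-IDF score is the product TF × IDF, so a word scores highly when it is frequent in a document but rare in the corpus. Note that scikit-learn's TfidfVectorizer uses the raw count for TF and, by default, a smoothed IDF, ln((1 + N) / (1 + df(t))) + 1, followed by L2 normalisation of each row.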

In [38]:
# TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

dtm = vectorizer.fit_transform(df.clean_text_lemma)

In [39]:
print(vectorizer.vocabulary_)

{'best': 1, 'time': 3, 'worst': 5, 'age': 0, 'wisdom': 4, 'foolishness': 2}


In [40]:
# convert sparse matrix to a NumPy array
print(dtm.toarray())

[[0.         0.78528828 0.         0.6191303  0.         0.        ]
 [0.         0.         0.         0.6191303  0.         0.78528828]
 [0.6191303  0.         0.         0.         0.78528828 0.        ]
 [0.6191303  0.         0.78528828 0.         0.         0.        ]]
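
The values above can be reproduced by hand. The sketch below follows what `TfidfVectorizer` does with its default settings: raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalisation of each row. It uses only the standard library and assumes whitespace tokenization. The names `doc_freq` and `tfidf_row` are hypothetical.

```python
import math

# Lemmatized documents produced by the preprocessing step
docs = ['best time', 'worst time', 'age wisdom', 'age foolishness']
tokenized = [doc.split() for doc in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed IDF (scikit-learn default): ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Raw term count times IDF, then scale the row to unit L2 length."""
    weights = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(tokens) for tokens in tokenized]
for row in matrix:
    print([round(v, 8) for v in row])
```

In the first document, "time" appears in two of the four documents while "best" appears in only one, so "best" receives the larger weight (about 0.785 versus 0.619).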


In [41]:
pd.DataFrame(dtm.toarray(), columns=sorted(vectorizer.vocabulary_))

Unnamed: 0,age,best,foolishness,time,wisdom,worst
0,0.0,0.785288,0.0,0.61913,0.0,0.0
1,0.0,0.0,0.0,0.61913,0.0,0.785288
2,0.61913,0.0,0.0,0.0,0.785288,0.0
3,0.61913,0.0,0.785288,0.0,0.0,0.0
