#   <font color='Brown'> Working with Text Data
    
#### <font color='Blue'> Topics Covered:



1.   **Text cleaning**: Understanding the principles of regex patterns
2.   **Tokenization**: Creating tokens with an NLP library
3.   **Lemmatization**: Understanding and applying lemmatization
4.   **Stop Words**: Understanding what stop words are and how to remove them
5.   **TF-IDF Vectorizer**: Using the TF-IDF vectorizer and understanding the feature importance of terms







---


<span style='color:Orange'> - Arpendu Ganguly</span>

![](https://miro.medium.com/v2/resize:fit:1024/1*vXKKe3J-lfi1YQ7HC6onxQ.jpeg)

## What are Regular Expressions?

**Regular expressions (RegEx)** are sequences of characters that define search patterns, used mainly to find or replace text that matches them.

---


Typical uses of regular expressions include extracting all hashtags from a tweet, or pulling email IDs and phone numbers out of large unstructured text.
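
As a minimal sketch of the use cases just mentioned, the snippet below pulls hashtags and email addresses out of a short string. The sample text is made up for illustration, and the email pattern is a deliberately simplified assumption rather than a complete validator.

```python
import re

# Hypothetical sample text, made up for illustration.
sample = "Loving the #NLP course! Contact team@example.com or admin@example.org #regex"

# A hashtag: '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", sample)

# A simplified email pattern (an assumption); real addresses can be more complex.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", sample)

print(hashtags)
print(emails)
```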



### I. Common Regex Functions used in NLP

Python has a built-in module known as “re” for Regular Expressions. Some common functions from this module are as follows:

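*   **re.search()**: Scans the text for the pattern and returns a Match object for the first match (or None if there is no match).
*   **re.match()**: Matches only if the pattern is present at the start of the string.
*   **re.sub()**: Replaces one or more matches with a string.
*   **re.compile()**: Compiles the pattern into a reusable pattern object, which avoids re-parsing it on repeated searches.
*   **re.findall()**: Returns a list containing all matches.
*   **re.split()**: Returns a list where the string has been split at each match.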









In [None]:
#re.search( ):
#Checks whether the given regular expression pattern is present in the input string. It returns the first occurrence of the pattern anywhere in the string, not just at the beginning.

#Syntax: re.search(pattern, string)
#'re.I': ignores the case (uppercase or lowercase) of the text
#'re.M': multi-line mode, where '^' and '$' also match at the start and end of each line
#re.search(pattern, string, flags=re.I | re.M)
#Example1
import re
text_1 = "Ivy offers multiple Data Science Modules.Ivy also offers Data Engineering modules"
result_1 = re.search(r"Ivy",text_1)
print("with matching case:",result_1.group())

#Example2
import re
text_2 = "Ivy offers multiple Data Science Modules"
result_2 = re.search(r"ivy",text_2,flags=re.I)
print("without matching case:",result_2.group())


#Example3
import re
text_3 = "There is a lot of potential in Data Science"
result_3 = re.search(r"Ivy",text_3,flags=re.I)
print("with no match here:",result_3)


with matching case: Ivy
without matching case: Ivy
with no match here: None
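
The cell above mentions `re.M` without exercising it. The following sketch, using a made-up three-line string, shows how the flag changes what `^` matches.

```python
import re

# Hypothetical multi-line string, made up for illustration.
notes = "Data Science is fun\nIvy offers courses\nivy also offers projects"

# Without re.M, '^' matches only at the very start of the whole string.
no_flag = re.findall(r"^ivy", notes, flags=re.I)

# With re.M, '^' also matches right after each newline.
with_flag = re.findall(r"^ivy", notes, flags=re.I | re.M)

print("without re.M:", no_flag)
print("with re.M:", with_flag)
```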


In [None]:
#re.match( )
#This function matches only if the pattern is present at the very start of the string.
#Syntax: re.match(pattern, string)

#Example1: the pattern is at the start of the string
import re
text_1 = "Ivy offers multiple Data Science Modules"
result_1 = re.match(r"Ivy",text_1)
print("Match Output:",result_1.group())

#Example2: the pattern occurs later in the string, so re.match() returns None
text_2 = "It offers multiple Data Science Modules.Ivy is Good"
result_2 = re.match(r"Ivy",text_2)
print("Match Output:",result_2)



Match Output: Ivy
Match Output: None


In [None]:
#re.sub( ):
#Substitutes every substring that matches the pattern with another string
#Syntax: re.sub(pattern, replacement, input_text)

#Example1
import re
text_1 = "I love R in Data Science Programming. SPSS is versatile"
result_1 = re.sub(r"R|SPSS","Python",text_1)
print("old text:",text_1)
print("new text:",result_1)

old text: I love R in Data Science Programming. SPSS is versatile
new text: I love Python in Data Science Programming. Python is versatile
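
Two details of `re.sub()` are easy to miss: the optional `count` argument limits how many matches are replaced, and a short pattern such as `R` can also match inside longer words unless it is wrapped in `\b` word boundaries. The strings below are illustrative only.

```python
import re

text = "I love R in Data Science Programming. SPSS is versatile"

# count=1 stops after the first replacement, so "SPSS" is left untouched.
first_only = re.sub(r"R|SPSS", "Python", text, count=1)

# Without word boundaries, a bare "R" also matches inside longer words.
loose = re.sub(r"R", "Python", "R and Ruby")

# \b restricts the match to "R" as a whole word.
strict = re.sub(r"\bR\b", "Python", "R and Ruby")

print(first_only)
print(loose)
print(strict)
```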


In [None]:
#re.compile( )
#Compiles the regular expression pattern into a pattern object that can be reused, which avoids re-parsing the pattern on every search.
#Syntax: re.compile(pattern, flags=0)

#Example1
import re
ivy_pattern = re.compile("Ivy")
result_1 = ivy_pattern.findall("Ivy offers Data Science Course. Lot of students are prospering with Ivy")
result_2 = ivy_pattern.findall("NLP courses are Good in Ivy")
print("1st instance:",result_1)
print("2nd instance:",result_2)

1st instance: ['Ivy', 'Ivy']
2nd instance: ['Ivy']
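
A compiled pattern can also carry flags and exposes the same operations as the module-level functions. The sketch below assumes a case-insensitive pattern and uses made-up sentences.

```python
import re

# Compile once with a flag, then reuse the pattern object for several operations.
ivy = re.compile(r"ivy", flags=re.I)

mentions = ivy.findall("Ivy offers NLP. IVY alumni love ivy.")
normalized = ivy.sub("Ivy", "IVY alumni love ivy.")

print(mentions)
print(normalized)
```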


In [None]:
#re.findall( )
#Returns all the occurrences of the pattern in the string as a list
#Syntax: re.findall(pattern, string)

#Example1
import re

result_1 = re.findall("Deep Learning","Deep Learning is an emerging field. There are various sub-fields in Deep Learning")
print("Find All Results Given:",result_1)

Find All Results Given: ['Deep Learning', 'Deep Learning']


In [None]:
#re.split( )
#returns a list where the string has been split at each match.
#Syntax: re.split(splitterm, phrase)

#Example1
import re

split_term = '@'
phrase = 'My email is my_first_name@gmail.com'

result_1 = re.split(split_term,phrase)
print("Split Results Given:",result_1)

Split Results Given: ['My email is my_first_name', 'gmail.com']
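
Because the split term is itself a regex, `re.split()` can break a string on several delimiters at once. This is a small sketch with a made-up record that mixes commas, semicolons, and spaces.

```python
import re

# Hypothetical record with inconsistent delimiters.
record = "apple, banana;cherry  mango"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", record)

print(parts)
```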


### II. Special Characters in Regex



**Metacharacters**

There are several ways to express repetition in a pattern:

1.   A pattern followed by the metacharacter `*` is repeated zero or more times.
2.   Replace the `*` with `+` and the pattern must appear at least once.
3.   Using `?` means the pattern appears zero or one time.
4.   For a specific number of occurrences, use `{n}` after the pattern, where n is the number of times the pattern should repeat.
5.   Use `{x,y}` where x is the minimum number of repetitions and y is the maximum.
6.   Leaving out y, as in `{x,}`, means the pattern appears at least x times, with no maximum.











In [None]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches for each pattern
    '''
    for pattern in patterns:
        print("Searching the phrase using the re check:",pattern)
        print(re.findall(pattern,phrase))
        print('\n')

In [None]:
test_phrase = 'gaga....gggaaa...gaaagaaa...agag...agggg...agggg'

test_pattern = ['ga*',    # g followed by zero or more a's
                'ga+',    # g followed by one or more a's
                'ga?',    # g followed by zero or one a
                'ga{3}',  # g followed by exactly three a's
                'ga{1,3}' # g followed by one to three a's
               ]

multi_re_find(test_pattern,test_phrase)

Searching the phrase using the re check: ga*
['ga', 'ga', 'g', 'g', 'gaaa', 'gaaa', 'gaaa', 'ga', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g']


Searching the phrase using the re check: ga+
['ga', 'ga', 'gaaa', 'gaaa', 'gaaa', 'ga']


Searching the phrase using the re check: ga?
['ga', 'ga', 'g', 'g', 'ga', 'ga', 'ga', 'ga', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g']


Searching the phrase using the re check: ga{3}
['gaaa', 'gaaa', 'gaaa']


Searching the phrase using the re check: ga{1,3}
['ga', 'ga', 'gaaa', 'gaaa', 'gaaa', 'ga']
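
The open-ended form `{x,}` from the list of repetition metacharacters is not exercised by the patterns above. A minimal sketch on a similar test phrase:

```python
import re

phrase = 'gaga....gggaaa...gaaagaaa...agag...agggg...agggg'

# g followed by at least two a's, with no upper limit
at_least_two_a = re.findall(r'ga{2,}', phrase)

# at least two consecutive g's
at_least_two_g = re.findall(r'g{2,}', phrase)

print(at_least_two_a)
print(at_least_two_g)
```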







**Special Sequences**

Special sequences (escape codes) match specific types of characters in your data, such as digits, non-digits, and whitespace. They are written by prefixing a character with a backslash `\`.



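| Character | Description | Example Pattern Code | Example Match |
|---|---|---|---|
| `\d` | A digit | `file_\d\d` | `file_25` |
| `\D` | A non-digit | `\D\D\D` | `ABC` |
| `\w` | Alphanumeric (letters, digits, and underscore) | `\w-\w\w\w` | `A-b_1` |
| `\W` | Non-alphanumeric | `\W\W\W\W` | `!*+)` |
| `\s` | Whitespace | `a\sb\sc` | `a b c` |
| `\S` | Non-whitespace | `\S\S\S\S` | `This` |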
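
The pattern/match pairs listed above can be checked directly. This sketch uses `re.fullmatch()`, which succeeds only when the whole sample string matches the pattern.

```python
import re

# (pattern, sample) pairs taken from the special-sequence examples above
examples = [
    (r"file_\d\d", "file_25"),
    (r"\D\D\D", "ABC"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"\W\W\W\W", "!*+)"),
    (r"a\sb\sc", "a b c"),
    (r"\S\S\S\S", "This"),
]

for pattern, sample in examples:
    matched = re.fullmatch(pattern, sample) is not None
    print(f"{pattern!r:15} {sample!r:10} {matched}")
```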











In [None]:
text = 'Jersey number of MS Dhoni is 7, his twitter account is @MSDhoni. '

patterns=[ r'\d+', # sequence of digits
           r'\D+', # sequence of non-digits
           r'\s+', # sequence of whitespace
           r'\S+', # sequence of non-whitespace
           r'\w+', # alphanumeric characters
           r'\W+', # non-alphanumeric
          ]

multi_re_find(patterns,text)

Searching the phrase using the re check: \d+
['7']


Searching the phrase using the re check: \D+
['Jersey number of MS Dhoni is ', ', his twitter account is @MSDhoni. ']


Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: \S+
['Jersey', 'number', 'of', 'MS', 'Dhoni', 'is', '7,', 'his', 'twitter', 'account', 'is', '@MSDhoni.']


Searching the phrase using the re check: \w+
['Jersey', 'number', 'of', 'MS', 'Dhoni', 'is', '7', 'his', 'twitter', 'account', 'is', 'MSDhoni']


Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' @', '. ']




### III. Character Sets in Regex



**Character Sets**
Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs.

For example, the input [ab] searches for occurrences of either a or b.












In [None]:
test_phrase = 'gaga....gggaaa...gaaagaaa...agag...agggg...agggg'

test_pattern = ['[ga]',   # either g or a
                'g[ga]+', # g followed by one or more g or a
               ]

multi_re_find(test_pattern,test_phrase)

Searching the phrase using the re check: [ga]
['g', 'a', 'g', 'a', 'g', 'g', 'g', 'a', 'a', 'a', 'g', 'a', 'a', 'a', 'g', 'a', 'a', 'a', 'a', 'g', 'a', 'g', 'a', 'g', 'g', 'g', 'g', 'a', 'g', 'g', 'g', 'g']


Searching the phrase using the re check: g[ga]+
['gaga', 'gggaaa', 'gaaagaaa', 'gag', 'gggg', 'gggg']




### IV. Exclusions in Regex



**Exclusions**

`^` excludes characters when it is placed at the start of a bracket expression.

For example, `[^!.? ]` matches any character that is not `!`, `.`, `?`, or a space.

Add a `+` to match one or more such characters in a row. On ordinary text this effectively extracts the words.

In [None]:
text = "What is the jersey number of Cristiano Ronaldo?? Is it 7 or 9!"
re.findall(r'[^!.? ]+',text)

['What',
 'is',
 'the',
 'jersey',
 'number',
 'of',
 'Cristiano',
 'Ronaldo',
 'Is',
 'it',
 '7',
 'or',
 '9']

### V. Character Ranges in Regex



**Character Ranges**

Character ranges let you define a character set that includes all of the contiguous characters between a start and stop point. The format used is [start-end].

For example, [a-g] matches any single lowercase letter from a through g.

In [None]:
test_phrase = 'It was 7 in Manchester United, he was given that number right after Beckham left.'

test_patterns=['[a-z]+',      # sequences of lower case letters
               '[A-Z]+',      # sequences of upper case letters
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+', # one upper case letter followed by lower case letters
               '[0-9]+'       # sequences of digits
              ]


multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: [a-z]+
['t', 'was', 'in', 'anchester', 'nited', 'he', 'was', 'given', 'that', 'number', 'right', 'after', 'eckham', 'left']


Searching the phrase using the re check: [A-Z]+
['I', 'M', 'U', 'B']


Searching the phrase using the re check: [a-zA-Z]+
['It', 'was', 'in', 'Manchester', 'United', 'he', 'was', 'given', 'that', 'number', 'right', 'after', 'Beckham', 'left']


Searching the phrase using the re check: [A-Z][a-z]+
['It', 'Manchester', 'United', 'Beckham']


Searching the phrase using the re check: [0-9]+
['7']
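
Ranges and exclusions combine naturally for text cleaning. The sketch below, on a made-up sentence, removes every character that is not a letter, digit, or whitespace, then lower-cases the result.

```python
import re

# Hypothetical noisy input, made up for illustration.
raw = "Hello!!! Is it 7, or 9??"

# [^...] negates the set: drop anything outside a-z, A-Z, 0-9 and whitespace.
letters_digits_spaces = re.sub(r"[^a-zA-Z0-9\s]", "", raw)
cleaned = letters_digits_spaces.lower()

print(cleaned)
```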




### VI. Tokenization



Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

In [None]:
pip install --user -U nltk



In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""
print(word_tokenize(text))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']
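
For comparison, a rough tokenizer can be sketched with the regex tools from the earlier sections. This is only an approximation under simple assumptions (words may contain hyphens, every other non-space character is its own token); it is not how `word_tokenize` works internally.

```python
import re

text = "Founded in 2002, SpaceX's mission is to build a self-sustaining city on Mars."

# Either a word (optionally hyphenated) or a single punctuation character.
tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

print(tokens)
```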


### VII. Lemmatization



Lemmatization reduces inflected words to their root word. It is the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning.

Unlike stemming, which simply strips word endings, lemmatization relies on accurately determining the intended part of speech and the meaning of a word from its context.

For example, the highlighted words below all share the lemma **play**:

*   I like **playing**
*   I will **play** tomorrow
*   Everyone **played** at the school
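
To make the contrast with stemming concrete, here is a small sketch using NLTK's `PorterStemmer`, which needs no extra data download. The word list is an assumption chosen for illustration; note that a stemmer can return truncated forms that are not dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; the first three are the inflections of "play" used above.
words = ["playing", "played", "plays", "studies"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:10} -> {stem}")
```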





In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
nltk.download("omw-1.4")
# Initialize the WordNet lemmatizer
wnl = WordNetLemmatizer()
example_words = ["program","programming","programer","programs","programmed"]
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in example_words:
    # pos="v" treats every word as a verb; other accepted values:
    # "n" - nouns, "a" - adjectives, "r" - adverbs, "s" - satellite adjectives
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word,pos="v")))

--Word--            --Lemma--           
program             program             
programming         program             
programer           programer           
programs            program             
programmed          program             


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### VIII. Stop Words



Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. In NLP, stop words are usually removed before analysis because they carry little meaning on their own.

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""

# Extend the built-in English list with a custom stop word
stop_words = stopwords.words('english')
new_stop_words = ['sample']
stop_words.extend(new_stop_words)
stop_words = set(stop_words)

word_tokens = word_tokenize(example_sent)

# Case-insensitive: lower-case each token before checking it against stop_words
filtered_lower = [w for w in word_tokens if w.lower() not in stop_words]

# Case-sensitive: 'This' is kept because only 'this' is in stop_words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_lower)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
['This', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [None]:
stop_words = stopwords.words('english')
new_stop_words = ['sample']
stop_words.extend(new_stop_words)
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### IX. Term Frequency – Inverse Document Frequency (TF-IDF)

Term Frequency – Inverse Document Frequency (TF-IDF) is a popular statistical technique utilized in natural language processing and information retrieval to assess a term’s significance in a document in comparison to a group of documents, known as a corpus. The technique employs a text vectorization process to transform words in a text document into numerical values that denote their importance.

Term Frequency: The TF of a term is the number of times the term appears in a document, divided by the total number of terms in the document.

$$\mathrm{TF} = \frac{\text{number of times the term appears in the document}}{\text{total number of terms in the document}}$$

Inverse Document Frequency: The IDF of a term is inversely related to the proportion of documents in the corpus that contain the term. Words found in only a small percentage of documents (e.g., technical jargon) receive higher importance values than words common across all documents (e.g., a, the, and).

$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents in the corpus that contain the term}}\right)$$

The TF-IDF of a term is calculated by multiplying its TF and IDF scores.
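
The two formulas can be applied by hand in a few lines. This is a minimal sketch in plain Python, assuming whitespace tokenization and the natural logarithm, on the same three-sentence corpus used in this section's cells. The helper names `tf`, `idf`, and `tf_idf` are hypothetical, and `idf` assumes the term occurs in at least one document.

```python
import math

corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science courses",
    "data scientists analyze data",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc_tokens):
    # Occurrences of the term divided by the total number of terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    # log(total documents / documents containing the term)
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

# Scores for two terms in the third document
for term in ["data", "analyze"]:
    print(term, round(tf_idf(term, docs[2], docs), 4))
```

With these unsmoothed formulas, "data" scores zero because it appears in every document, while "analyze" scores higher because it appears in only one.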

In [None]:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data' ]

In [None]:
words_set = set()

for doc in  corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:',len(words_set))
print('The words in the corpus: \n', words_set)

Number of words in the corpus: 14
The words in the corpus: 
 {'courses', 'one', 'important', 'of', 'scientists', 'the', 'analyze', 'best', 'science', 'data', 'most', 'fields', 'this', 'is'}


In [None]:
tr_idf_model  = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)

In [None]:
print(type(tf_idf_vector), tf_idf_vector.shape)

<class 'scipy.sparse._csr.csr_matrix'> (3, 14)


In [None]:
tf_idf_array = tf_idf_vector.toarray()
print(tf_idf_array)

[[0.         0.         0.         0.18952581 0.32089509 0.32089509
  0.24404899 0.32089509 0.48809797 0.24404899 0.48809797 0.
  0.24404899 0.        ]
 [0.         0.40029393 0.40029393 0.23642005 0.         0.
  0.30443385 0.         0.30443385 0.30443385 0.30443385 0.
  0.30443385 0.40029393]
 [0.54270061 0.         0.         0.64105545 0.         0.
  0.         0.         0.         0.         0.         0.54270061
  0.         0.        ]]


In [None]:
words_set = tr_idf_model.get_feature_names_out()
print(words_set)

['analyze' 'best' 'courses' 'data' 'fields' 'important' 'is' 'most' 'of'
 'one' 'science' 'scientists' 'the' 'this']


In [None]:
df_tf_idf = pd.DataFrame(tf_idf_array, columns = words_set)
df_tf_idf

|   | analyze | best | courses | data | fields | important | is | most | of | one | science | scientists | the | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.189526 | 0.320895 | 0.320895 | 0.244049 | 0.320895 | 0.488098 | 0.244049 | 0.488098 | 0.0 | 0.244049 | 0.0 |
| 1 | 0.0 | 0.400294 | 0.400294 | 0.23642 | 0.0 | 0.0 | 0.304434 | 0.0 | 0.304434 | 0.304434 | 0.304434 | 0.0 | 0.304434 | 0.400294 |
| 2 | 0.542701 | 0.0 | 0.0 | 0.641055 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542701 | 0.0 | 0.0 |
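
The values above are not what the plain TF × IDF formulas give ("data" is non-zero even though it appears in every document). To my understanding, scikit-learn's `TfidfVectorizer` by default uses raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the third row under those assumptions; the helper names are hypothetical.

```python
import math
from collections import Counter

corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science courses",
    "data scientists analyze data",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def smoothed_idf(term):
    # Assumed scikit-learn default: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for tokens in docs if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then L2-normalize the row
    counts = Counter(tokens)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row = tfidf_row(docs[2])
print({term: round(value, 6) for term, value in sorted(row.items())})
```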


### X. Frequency of Words

We can use FreqDist from the NLTK package to count the number of times each word appears in a text.

In [None]:
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

example_text = "This is a sample text. We need to calculate the frequency of each word."

tokens = word_tokenize(example_text)

# Derive the frequency distribution
freq_dist_tokens = FreqDist(tokens)

for word, frequency in freq_dist_tokens.items():
    print(f"{word}: {frequency}")


This: 1
is: 1
a: 1
sample: 1
text: 1
.: 2
We: 1
need: 1
to: 1
calculate: 1
the: 1
frequency: 1
of: 1
each: 1
word: 1


In [None]:
freq_dist_tokens

FreqDist({'.': 2, 'This': 1, 'is': 1, 'a': 1, 'sample': 1, 'text': 1, 'We': 1, 'need': 1, 'to': 1, 'calculate': 1, ...})
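
`FreqDist` builds on Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The tokenization here is a simple regex approximation (an assumption), not `word_tokenize`.

```python
import re
from collections import Counter

example_text = "This is a sample text. We need to calculate the frequency of each word."

# Words, or single punctuation characters, as tokens
tokens = re.findall(r"\w+|[^\w\s]", example_text)
counts = Counter(tokens)

# The most frequent tokens first
print(counts.most_common(3))
```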

**Useful Links**

*   https://docs.python.org/3/library/re.html#regular-expression-syntax