# **Import Regex Library**

In [None]:
import re
print(re.__version__)

# **Detect and Remove HTML Tags**

In [None]:
text = "<p>Follow this <b>website</b> for more details. </p>"
x = re.findall("<.*?>", text)
print(x)
z = re.sub("<.*?>", "", text)
print(z)

# **Removing special characters and keeping only letters and numbers**

In [None]:
text = "2022 #Partner of the Year Finalist Learning #Award%%"
x = re.sub("[^a-zA-Z0-9]+", " ", text)
print(x)

# **Keeping only letters**

In [None]:
text = "2022 #Partner of the Year Finalist Learning #Award%%"
x = re.sub("[^a-zA-Z]+", " ", text)
print(x)

# **Detect and Remove URLs**

In [None]:
text = "Visit www.cloudthat.com and login to https://www.skillpipe.com/#/account/login"
x = re.findall(r"https?://\S+|www\.\S+", text)  # raw string avoids invalid-escape warnings
print(x)
z = re.sub(r"https?://\S+|www\.\S+", "", text)
print(z)

# **Detect and Remove Email IDs**

In [None]:
text = "Please send your feedback to ctml@cloudthat.com or ctml123@gmail.com"
x = re.findall(r"[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}", text)  # raw string for the escapes
print(x)
z = re.sub(r"[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}", "", text)
print(z)

# **Replacing Multi-Spaces**

In [None]:
text = "2022 Partner              of the Year          Finalist Learning Award"
x = re.sub(r"\s+", " ", text)  # collapse any run of whitespace into one space
print(x)
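
The individual regex steps above can be chained into one cleaning function. The following is a minimal sketch, not a definitive implementation: the helper name `clean_text`, the pattern constants, and the order of the steps are assumptions made for illustration. URLs and email addresses are removed before the special-character step, because that step would otherwise break them into fragments.

```python
import re

# Patterns mirror the ones used in the cells above, written as raw strings.
HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"[a-zA-Z0-9_\-.]+@[a-zA-Z0-9_\-.]+\.[a-zA-Z]{2,5}")
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")
MULTI_SPACE = re.compile(r"\s+")


def clean_text(text):
    """Hypothetical helper: apply the cleaning steps in a fixed order."""
    text = HTML_TAG.sub(" ", text)    # drop HTML tags first
    text = URL.sub(" ", text)         # then URLs
    text = EMAIL.sub(" ", text)       # then email addresses
    text = NON_ALNUM.sub(" ", text)   # keep letters and digits only
    return MULTI_SPACE.sub(" ", text).strip()


sample = "<p>Mail ctml@cloudthat.com   or visit www.cloudthat.com #Award%%</p>"
print(clean_text(sample))
```

Precompiling the patterns with `re.compile` is optional; it mainly keeps the function body readable when the same patterns are applied to many documents.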

# **Import NLTK**

In [None]:
import nltk
print(nltk.__version__)

In [None]:
nltk.download('punkt')

paragraph = """Diversity is an important issue for any modern business, but it’s not enough to simply hire people of different nationalities, races, genders and sexual orientations. 
                Everyone needs to feel welcome, safe and free to be themselves in the workplace. 
                If you focus on diversity, equity and inclusion (DEI) in your workplace, your business’s culture and bottom line will benefit. 
                Inclusive workplaces go the extra mile to consider the safety and comfortability of all employees, especially those in marginalized groups. 
                For example, gendered bathrooms have the potential to make transgender and gender-nonconforming employees uncomfortable, especially in light of controversial “bathroom bills” in multiple states that could or already do impact transgender people’s rights. 
                On a broader level, inclusive spaces can be created simply by spending time with one another. 
                Consider hosting team lunches and other informal events where employees can casually connect with each other. 
                If your company is bigger, creating an in-office support group or network for diverse employees can help them connect with others who share their experiences."""

# **Tokenization**

Tokenization is the process of breaking a given text into smaller units called tokens. Words, numbers, and punctuation marks can all be tokens. It may also be called word segmentation.

NLTK provides several tokenizers, and we can choose among them based on our requirements. The tokenizers and how to import them are as follows:

1. sent_tokenize: This function divides the input text into sentences. We can import it with the following command: <br>
from nltk.tokenize import sent_tokenize

In [None]:
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
print(len(sentences))

2. word_tokenize: This function divides the input text into words. We can import it with the following command: <br>
from nltk.tokenize import word_tokenize

In [None]:
words = nltk.word_tokenize(paragraph)
print(words)
print(len(words))

3. WordPunctTokenizer: This class divides the input text into words and punctuation marks. We can import it with the following command: <br>
from nltk.tokenize import WordPunctTokenizer

In [None]:
tk = nltk.WordPunctTokenizer()
wordsp = tk.tokenize(paragraph)
print(wordsp)
print(len(wordsp))
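
To make the idea of tokenization concrete, here is a rough sketch that uses only the standard `re` module. The function names are hypothetical. The word-level pattern `\w+|[^\w\s]+` is, to our knowledge, the one WordPunctTokenizer is built on; the sentence splitter is a deliberately naive stand-in and will mis-split abbreviations such as "Mr." that `sent_tokenize` is designed to handle.

```python
import re


def simple_word_punct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


def simple_sentence_split(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Everyone needs to feel welcome. It's not enough, is it?"
print(simple_sentence_split(sample))
print(simple_word_punct_tokenize(sample))
```

Note how the apostrophe in "It's" becomes its own token here, whereas `word_tokenize` typically keeps the contraction suffix together.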

In [None]:
from nltk.util import ngrams
n = 3  # n = 3 yields trigrams; use n = 2 for bigrams
sentence = 'You will face many defeats in life, but never let yourself be defeated.'
trigrams = ngrams(sentence.split(), n)

for item in trigrams:
    print(item)
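
An n-gram is a sliding window of n consecutive tokens. As a sketch of what happens under that definition, the same result can be produced without NLTK by zipping shifted copies of the token list. The helper name `simple_ngrams` is made up for this example.

```python
def simple_ngrams(tokens, n):
    """Return the n-grams of a token list as tuples (sliding window of size n)."""
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "never let yourself be defeated".split()
print(simple_ngrams(tokens, 2))  # bigrams
print(simple_ngrams(tokens, 3))  # trigrams
```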

# **Printing Stop Words in English**

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

print(stopwords.words('english'))

# **Stemming**

Language includes many variations, in the sense that a word can take different forms, for example democracy, democratic, and democratization. For machine learning projects, it is important for machines to understand that such words share the same base form. That is why it is useful to extract the base forms of words when analyzing text.<br>
Stemming is a heuristic process that extracts the base forms of words by chopping off their ends.

The different stemmers provided by the NLTK module are as follows:

1. PorterStemmer: This stemmer uses Porter’s algorithm to extract the base form of the words.

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["python","pythoner","pythoning","pythoned","pythonly"]
for word in words:
    print(word,"--->",ps.stem(word))

2. LancasterStemmer: This stemmer uses the Lancaster algorithm to extract the base form of the words. We can import it with the following command: <br>
from nltk.stem.lancaster import LancasterStemmer<br>
For example, given the word ‘writing’ as input, this stemmer outputs ‘writ’.

In [None]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
words = ['eating','eats','eaten','puts','putting']
for word in words:
    print(word,"--->",lancaster.stem(word)) 

3. SnowballStemmer: This stemmer uses the Snowball algorithm to extract the base form of the words. We can import it with the following command: <br>
from nltk.stem.snowball import SnowballStemmer<br>
For example, given the word ‘writing’ as input, this stemmer outputs ‘write’.

In [None]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')
words = ['generous','generate','generously','generation']
for word in words:
    print(word,"--->",snowball.stem(word))

4. RegexpStemmer: This stemmer identifies morphological affixes using regular expressions and discards any substring that matches them. The min argument sets the minimum length a word must have to be stemmed.

In [None]:
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['mass','was','bee','computer','advisable']
for word in words:
    print(word,"--->",regexp.stem(word))
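
The behaviour of the regex stemmer can be approximated with plain `re.sub`. The sketch below is an assumption about how the stemmer works, not its actual source: words shorter than the minimum length are returned unchanged, and otherwise any suffix matching the pattern is removed. The function name `regex_stem` and its parameter names are hypothetical.

```python
import re


def regex_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Remove a matching suffix unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word  # too short to stem safely
    return re.sub(pattern, "", word)


for word in ["mass", "was", "bee", "computer", "advisable"]:
    print(word, "--->", regex_stem(word))
```

The example also shows the weakness of purely pattern-based stemming: "mass" loses its final "s" even though it is not a plural.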

Porter vs Snowball vs Lancaster vs Regex Stemming in NLTK

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))

Stemming a Text File with NLTK

In [None]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

paragraph = """Diversity is an important issue for any modern business, but it’s not enough to simply hire people of different nationalities, races, genders and sexual orientations. 
                Everyone needs to feel welcome, safe and free to be themselves in the workplace. 
                If you focus on diversity, equity and inclusion (DEI) in your workplace, your business’s culture and bottom line will benefit. 
                Inclusive workplaces go the extra mile to consider the safety and comfortability of all employees, especially those in marginalized groups. 
                For example, gendered bathrooms have the potential to make transgender and gender-nonconforming employees uncomfortable, especially in light of controversial “bathroom bills” in multiple states that could or already do impact transgender people’s rights. 
                On a broader level, inclusive spaces can be created simply by spending time with one another. 
                Consider hosting team lunches and other informal events where employees can casually connect with each other. 
                If your company is bigger, creating an in-office support group or network for diverse employees can help them connect with others who share their experiences."""              

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []

for sentence in sentences:
    sentence = re.sub("[^a-zA-Z]", " ", sentence)  # keep letters only
    sentence = sentence.lower()
    words = nltk.word_tokenize(sentence)
    # drop stop words, then stem what remains
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))

print(corpus)    

Tagging Parts of Speech: Part of speech is a grammatical term for the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.<br>(Requires nltk.download('averaged_perceptron_tagger'))

In English, there are eight parts of speech:

1. Noun
2. Pronoun
3. Adjective
4. Verb
5. Adverb
6. Preposition
7. Conjunction
8. Interjection

In [None]:
from nltk.tokenize import word_tokenize

quote = """If you wish to make an apple pie from scratch, you must first invent the universe."""
words_in_quote = word_tokenize(quote)

nltk.pos_tag(words_in_quote)

# **Lemmatization**

Lemmatization is another way to extract the base form of words, normally by removing inflectional endings with the help of a vocabulary and morphological analysis. The base form produced by lemmatization is called the lemma.<br> Lemmatization is similar to stemming, but it takes context into account, so it links words with similar meanings to one word.

1. WordNetLemmatizer: This lemmatizer extracts the base form of a word depending on whether it is used as a noun or as a verb.

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # the lemmatizer needs the WordNet corpus
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('omw-1.4')

paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a 
               transcendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []

for sentence in sentences:
    sentence = re.sub("[^a-zA-Z]", " ", sentence)  # keep letters only
    sentence = sentence.lower()
    words = nltk.word_tokenize(sentence)
    # drop stop words, then lemmatize what remains
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))

print(corpus)  

A stemming algorithm works by cutting the suffix from a word. In a broader sense, it cuts off either the beginning or the end of the word.

In contrast, lemmatization is a more powerful operation that takes the morphological analysis of words into consideration. It returns the lemma, which is the base form of all of a word's inflectional forms. In-depth linguistic knowledge is required to create the dictionaries and look up the proper form of the word. Stemming is a general operation, while lemmatization is an intelligent operation in which the proper form is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features.
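
The contrast between the two approaches can be illustrated with a toy example. Both functions below are hypothetical sketches: `toy_stem` only chops a few suffixes, and `toy_lemmatize` looks words up in a tiny hand-written table standing in for a real dictionary such as WordNet.

```python
# Hypothetical, hand-written lookup table; a real lemmatizer relies on a
# full dictionary and morphological rules.
LEMMA_TABLE = {"better": "good", "corpora": "corpus", "rocks": "rock", "was": "be"}


def toy_stem(word):
    """Chop a few common suffixes without checking that the result is a word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMA_TABLE.get(word, word)


for word in ["rocks", "corpora", "better", "was"]:
    print(f"{word:10}stem: {toy_stem(word):10}lemma: {toy_lemmatize(word)}")
```

Suffix chopping handles the regular plural "rocks", but only the dictionary lookup can map irregular forms such as "corpora" or "better" to their base forms.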

Chunking: While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.
<br>
Note: A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.
<br>
Here are some examples:
<br>
“A planet”<br>
“A tilting planet”<br>
“A swiftly tilting planet”<br>

Chunking makes use of POS tags to group words and apply chunk tags to those groups. Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

In [None]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."
words_in_lotr_quote = nltk.word_tokenize(lotr_quote)
words_in_lotr_quote

In [None]:
lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)
lotr_pos_tags

DT is a determiner<br>
VBP is a verb (non-third-person singular present)<br>
JJ is an adjective<br>
IN is a preposition<br>
NN is a noun

The result is a list of tuples of all the words in the quote, along with their POS tags.<br>In order to chunk, you first need to define a chunk grammar.

Create a chunk grammar with one regular expression rule:

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

According to the rule that we created, chunks:
1. Start with an optional (?) determiner (DT)
2. Can have any number (*) of adjectives (JJ)
3. End with a noun (NN)

Create a chunk parser with this grammar:

In [None]:
chunk_parser = nltk.RegexpParser(grammar)

In [None]:
tree = chunk_parser.parse(lotr_pos_tags)

![image-2.png](attachment:image-2.png)

In [None]:
tree

We got two noun phrases:
1. 'a dangerous business' has a determiner, an adjective, and a noun.
2. 'door' has just a noun.
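
To see what the noun-phrase rule matches, it can be imitated with an ordinary regular expression over the sequence of tags. This is a simplified sketch of the idea, not how the chunk parser is actually implemented. The function name `np_chunks` is hypothetical, and the tags below are written by hand for illustration; a real tagger may label some tokens differently.

```python
import re


def np_chunks(tagged):
    """Return the word groups matched by <DT>?<JJ>*<NN> in a tagged sentence."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"(<DT>)?(<JJ>)*<NN>", tag_string):
        # Convert character offsets back to token positions by counting "<".
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


# Hand-written tags, assumed for this example only.
tagged = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
          ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
          ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
          (".", ".")]
print(np_chunks(tagged))
```

Because the pattern requires the exact tag NN, the proper noun "Frodo" (NNP) is not picked up as a chunk.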

Chinking: Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.

In [None]:
lotr_pos_tags

The next step is to create a grammar to determine what you want to include and exclude in your chunks.

In [None]:
grammar = """
Chunk: {<.*>+}
       }<JJ>{"""

1. The first rule of your grammar is {<.*>+}. This rule has curly braces that face inward ({}) because it’s used to determine what patterns you want to include in your chunks. In this case, you want to include everything: <.*>+.

2. The second rule of your grammar is }<JJ>{. This rule has curly braces that face outward (}{) because it’s used to determine what patterns you want to exclude in your chunks. In this case, you want to exclude adjectives: <JJ>.

Create a chunk parser with this grammar:

In [None]:
chunk_parser = nltk.RegexpParser(grammar)

Now chunk the sentence with the chink you specified:

In [None]:
tree = chunk_parser.parse(lotr_pos_tags)

In [None]:
tree

In this case, ('dangerous', 'JJ') was excluded from the chunks because it’s an adjective (JJ). 
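
The effect of chinking can be sketched without NLTK: start from one chunk containing everything, then split it wherever an excluded tag occurs. The function name `chink` and the hand-written tags are assumptions for illustration only.

```python
from itertools import groupby


def chink(tagged, excluded_tag="JJ"):
    """Split a tagged sentence into chunks, leaving out tokens with excluded_tag."""
    chunks = []
    # groupby yields alternating runs of excluded and non-excluded tokens.
    for is_excluded, group in groupby(tagged, key=lambda pair: pair[1] == excluded_tag):
        if not is_excluded:
            chunks.append([word for word, _ in group])
    return chunks


# Hand-written tags, assumed for this example only.
tagged = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
          ("business", "NN"), (",", ","), ("Frodo", "NNP")]
print(chink(tagged))
```

The adjective ends the first chunk and starts a second one, which mirrors the two chunks on either side of the excluded token.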

<b>Named Entity Recognition (NER)</b>

Named entities are noun phrases that refer to specific locations, people, organizations, and so on. With named entity recognition, we will be able to find the named entities in texts and also determine what kind of named entity they are.

In [None]:
tree = nltk.ne_chunk(lotr_pos_tags)

In [None]:
tree

Create a string from which to extract named entities. We are using this quote from The War of the Worlds:

In [None]:
quote = """Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that for countless centuries Mars has been the star of war—but failed to
interpret the fluctuating appearances of the markings they mapped so well. All that time the Martians must have been getting ready.
During the opposition of 1894 a great light was seen on the illuminated part of the disk, first at the Lick Observatory, then by Perrotin of Nice,
and then by other observers. English readers heard of it first in the issue of Nature dated August 2."""

In [None]:
def extract_ne(quote):
    words = word_tokenize(quote, language="english")
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    return set(
        " ".join(i[0] for i in t)
        for t in tree
        if hasattr(t, "label") and t.label() == "NE")

With this function, we gather all named entities, with no repeats. To do this, we tokenize by word, apply part-of-speech tags to those words, and then extract named entities based on those tags.

In [None]:
extract_ne(quote)

We got the following:
1. An institution: 'Lick Observatory'
2. A planet: 'Mars'
3. A publication: 'Nature'
4. People: 'Perrotin', 'Schiaparelli'