# NLP-Natural Language Processing

![image.png](attachment:4627240a-c9fc-4768-854c-67b967f35867.png)

# Introduction

`Natural Language Processing (NLP)` is a branch of `artificial intelligence (AI)` that focuses on enabling `computers to understand`, `interpret`, and `generate human language` in a way that is `both meaningful and useful`.

# What is NLP?
* `NLP` stands for `Natural Language Processing`, a field at the intersection of `computer science`, `linguistics` (the study of human language), and `artificial intelligence`.
* It is the technology that `machines use to understand`, `analyse`, `manipulate`, and `interpret human languages`.
*  It helps developers to organize knowledge for performing tasks such as `translation`, `automatic summarization`, `Named Entity Recognition (NER)`, `speech recognition`, `relationship extraction`, and `topic segmentation`.

![image.png](attachment:c6878b04-749d-450c-8061-284b7d11f7c8.png)

## NLP can be broadly categorized into two types:

### Natural Language Understanding (NLU): 
* It helps the `machine to understand` and `analyse human language` by `extracting metadata` from content, such as `concepts`, `entities`, `keywords`, `emotion`, `relations`, and `semantic roles`.
* NLU is mainly used in business applications to understand a customer's problem in both spoken and written language.

### Natural Language Generation (NLG):
* Natural Language Generation (NLG) acts as a `translator that converts computerized data into a natural language representation`. It mainly involves `text planning, sentence planning, and text realization`.


![image.png](attachment:31dc90d0-f88c-4f20-9d48-220afcf4a0ed.png)

# Applications of NLP
Common applications of NLP include the following.

## 1. Question Answering

Question Answering focuses on building systems that automatically answer the questions asked by humans in a natural language.

![image.png](attachment:20c60e77-c109-4b86-b71f-171281048021.png)

## 2. Spam Detection

Spam detection is used to keep unwanted e-mails from reaching a user's inbox.

![image.png](attachment:62b37405-35c3-4796-9b8b-5fc3901449d1.png)

## 3. Sentiment Analysis

* `Sentiment Analysis` is also known as `opinion mining`. It is used on the `web to analyse the attitude, behaviour, and emotional state of the sender`. 
* This application is implemented through a combination of NLP (Natural Language Processing) and statistics: it assigns a polarity value to the text (`positive, negative, or neutral`) and can also identify the mood of the context (happy, sad, angry, etc.).

## Example using a movie review:

- If someone writes: `"This movie was awesome!"`, the sentiment analysis would tell us it's `positive because the word "awesome" indicates they liked the movie`.
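
The idea of assigning a polarity from individual words can be sketched with a toy word list. This is only an illustration: the `POSITIVE` and `NEGATIVE` sets and the `sentiment()` helper are hypothetical, and real systems typically rely on much larger lexicons or trained models.

```python
import string

# Hypothetical, tiny sentiment lexicons used only for illustration.
POSITIVE = {"awesome", "great", "good"}
NEGATIVE = {"awful", "boring", "bad"}

def sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    # Split on whitespace, then strip surrounding punctuation and lowercase.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("This movie was awesome!"))  # positive
print(sentiment("The plot was boring."))     # negative
print(sentiment("I watched a movie."))       # neutral
```

A word-counting approach like this ignores negation and context ("not awesome" would still score as positive), which is one reason practical systems go beyond simple lexicons.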

![image.png](attachment:400f3c1b-71bb-40dd-bb86-716d4f25148e.png)

## 4. Machine Translation

Machine translation is used to translate text or speech from one natural language to another natural language.

![image.png](attachment:f0b43eb2-6fd8-46ee-ab98-58b9176a8ab7.png)

## 5. Spelling correction

Spelling correction detects misspelled words and suggests or applies fixes. Office software such as Microsoft Word and PowerPoint ships with built-in spelling correction.

![image.png](attachment:1976edca-3912-4ce0-80a1-6f20950bde1a.png)

## 6. Chatbot

Chatbots are one of the important applications of NLP. Many companies use them to provide customer chat services.

![image.png](attachment:76d37c16-04f4-480b-b3f3-1b22e90c9a09.png)

# Various Steps in NLP
## Tokenization
Tokenization refers to the procedure of `splitting a sentence into its constituent tokens`, which are typically words and punctuation marks.

<img src = "https://miro.medium.com/v2/resize:fit:812/1*kb1f7z_6fvG9WpINv4rcLg.png">

* For example, consider this sentence: `"I am reading a book."` Here, our task is to `extract words/tokens from this sentence`.
* After passing this sentence to a tokenization program, the extracted words/tokens would be ` "I", "am", "reading", "a", "book", and "."`.

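Tokens can also be grouped into n-grams, where an n-gram is a sequence of n consecutive items from a given text:

- unigrams: single tokens (n = 1)
- bigrams: pairs of consecutive tokens (n = 2)
- trigrams: triples of consecutive tokens (n = 3)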

![image.png](attachment:02476ad8-2e83-41e9-988e-ec7b9af20a87.png)
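
As a minimal sketch of how n-grams can be built from a list of tokens, the helper below slides a window of size n over the tokens. The function name `ngrams` is our own choice for this illustration and uses only plain Python.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "am", "reading", "a", "book"]

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

For five tokens this yields five unigrams, four bigrams, and three trigrams.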

In [1]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import nltk
from nltk import word_tokenize

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [4]:
word_tokenize("i am learning Python for NLP")

['i', 'am', 'learning', 'Python', 'for', 'NLP']

* The `word_tokenize()` function splits a `sentence into words/tokens`.
* It takes the sentence as a string argument.
* The `result is a list`, which we `store in the words variable` in the next cell.

In [5]:
words = word_tokenize("I am reading NLP Fundamentals")
print(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


In [29]:
paragraph='''
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.[32]

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.[33][34]

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[35] Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.[36]

Python consistently ranks as one of the most popular programming languages, and has gained widespread use in the machine learning community.[37][38][39][40]'''
paragraph

'\nPython is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.[32]\n\nPython is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.[33][34]\n\nGuido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[35] Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.[36]\n\nPython consistently ranks as one of the most popular programming languages, and has gained widespread use in the machine learning community.[37][38][39][40]'

In [31]:
import nltk
from nltk import sent_tokenize  # paragraph to sentence
from nltk import word_tokenize  # sentence to word

In [32]:
sentences = sent_tokenize(paragraph)
sentences

['\nPython is a high-level, general-purpose programming language.',
 'Its design philosophy emphasizes code readability with the use of significant indentation.',
 '[32]\n\nPython is dynamically typed and garbage-collected.',
 'It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.',
 'It is often described as a "batteries included" language due to its comprehensive standard library.',
 '[33][34]\n\nGuido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.',
 '[35] Python 2.0 was released in 2000.',
 'Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions.',
 'Python 2.7.18, released in 2020, was the last release of Python 2.',
 '[36]\n\nPython consistently ranks as one of the most popular programming languages, and has gained widespread use in the machine lear
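
In the output above, citation markers such as `[32]` end up at the start of the following sentence. Assuming these markers are not needed for the analysis, one option is to strip them before sentence tokenization. The sketch below uses a regular expression on a shortened sample string; `strip_citations` is a hypothetical helper name.

```python
import re

def strip_citations(text):
    """Remove bracketed numeric citation markers such as [32] or [33][34]."""
    return re.sub(r"\[\d+\]", "", text)

# Shortened sample in the same style as the paragraph above.
sample = "It uses significant indentation.[32]\n\nPython is dynamically typed.[33][34]"
cleaned = strip_citations(sample)
print(repr(cleaned))
```

The cleaned text can then be passed to `sent_tokenize()` in place of the original paragraph.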

In [48]:
for i in sentences:
    print(word_tokenize(i))

['Python', 'is', 'a', 'high-level', ',', 'general-purpose', 'programming', 'language', '.']
['Its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'the', 'use', 'of', 'significant', 'indentation', '.']
['[', '32', ']', 'Python', 'is', 'dynamically', 'typed', 'and', 'garbage-collected', '.']
['It', 'supports', 'multiple', 'programming', 'paradigms', ',', 'including', 'structured', '(', 'particularly', 'procedural', ')', ',', 'object-oriented', 'and', 'functional', 'programming', '.']
['It', 'is', 'often', 'described', 'as', 'a', '``', 'batteries', 'included', "''", 'language', 'due', 'to', 'its', 'comprehensive', 'standard', 'library', '.']
['[', '33', ']', '[', '34', ']', 'Guido', 'van', 'Rossum', 'began', 'working', 'on', 'Python', 'in', 'the', 'late', '1980s', 'as', 'a', 'successor', 'to', 'the', 'ABC', 'programming', 'language', 'and', 'first', 'released', 'it', 'in', '1991', 'as', 'Python', '0.9.0', '.']
['[', '35', ']', 'Python', '2.0', 'was', 'released', 'in

In [47]:
word_tokenize(sentences)             # fails: word_tokenize() expects a single string, not a list; use a for loop or list comprehension

TypeError: expected string or bytes-like object

In [49]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)

['Python', 'is', 'a', 'high-level', ',', 'general-purpose', 'programming', 'language', '.']
['Its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'the', 'use', 'of', 'significant', 'indentation', '.']
['[', '32', ']', 'Python', 'is', 'dynamically', 'typed', 'and', 'garbage-collected', '.']
['It', 'supports', 'multiple', 'programming', 'paradigms', ',', 'including', 'structured', '(', 'particularly', 'procedural', ')', ',', 'object-oriented', 'and', 'functional', 'programming', '.']
['It', 'is', 'often', 'described', 'as', 'a', '``', 'batteries', 'included', "''", 'language', 'due', 'to', 'its', 'comprehensive', 'standard', 'library', '.']
['[', '33', ']', '[', '34', ']', 'Guido', 'van', 'Rossum', 'began', 'working', 'on', 'Python', 'in', 'the', 'late', '1980s', 'as', 'a', 'successor', 'to', 'the', 'ABC', 'programming', 'language', 'and', 'first', 'released', 'it', 'in', '1991', 'as', 'Python', '0.9.0', '.']
['[', '35', ']', 'Python', '2.0', 'was', 'released', 'in

# Part-of-Speech (PoS) Tagging

* PoS tagging refers to the `process of labeling each word in a sentence with its part of speech`, such as noun, verb, or adjective.
* We extract the `part of speech of each token in a sentence` so that we can filter for the parts of speech that are of `interest and analyze them`.
* For example, if we look at the sentence `"The sky is blue"`, tokenization gives us four tokens: `"The"`, `"sky"`, `"is"`, and `"blue"`. Using a PoS tagger, we then assign a part of speech to each word/token.


![](https://byteiota.com/wp-content/uploads/2021/01/POS-Tagging-1200x600.jpg)

* This process is essential for various downstream `NLP tasks like named entity recognition, parsing, sentiment analysis, and machine translation`.

![image.png](attachment:b011c4c1-acbc-4841-b2b7-b61b32accb18.png)

In [50]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [10]:
import nltk
from nltk import word_tokenize
words = word_tokenize("The sky is blue")           # PoS tagging is applied to the tokenized words
print(words)
nltk.pos_tag(words)

['The', 'sky', 'is', 'blue']


[('The', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('blue', 'JJ')]

In [12]:
n=word_tokenize("blue is my favrt color")
nltk.pos_tag(n)

[('blue', 'NN'),
 ('is', 'VBZ'),
 ('my', 'PRP$'),
 ('favrt', 'JJ'),
 ('color', 'NN')]

In [52]:
import nltk
from nltk import word_tokenize
words = word_tokenize("I am reading NLP Fundamentals")
print(words)
nltk.pos_tag(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


[('I', 'PRP'),
 ('am', 'VBP'),
 ('reading', 'VBG'),
 ('NLP', 'NNP'),
 ('Fundamentals', 'NNS')]
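
Once tokens are tagged, filtering for the parts of speech of interest is a simple list operation. The sketch below reuses the tagged output shown above as literal data, so it does not call NLTK itself; `filter_by_tag` is a hypothetical helper. It relies on the Penn Treebank convention that noun tags start with `NN` and verb tags start with `VB`.

```python
# Tagged tokens copied from the pos_tag() output above.
tagged = [('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'),
          ('NLP', 'NNP'), ('Fundamentals', 'NNS')]

def filter_by_tag(tagged_tokens, prefix):
    """Keep the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = filter_by_tag(tagged, "NN")
verbs = filter_by_tag(tagged, "VB")
print(nouns)  # ['NLP', 'Fundamentals']
print(verbs)  # ['am', 'reading']
```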

In [53]:

for i in sent_tokenize(paragraph):
    print(nltk.pos_tag((word_tokenize(i))))

[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), (',', ','), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
[('Its', 'PRP$'), ('design', 'NN'), ('philosophy', 'NN'), ('emphasizes', 'VBZ'), ('code', 'JJ'), ('readability', 'NN'), ('with', 'IN'), ('the', 'DT'), ('use', 'NN'), ('of', 'IN'), ('significant', 'JJ'), ('indentation', 'NN'), ('.', '.')]
[('[', 'RB'), ('32', 'CD'), (']', 'JJ'), ('Python', 'NNP'), ('is', 'VBZ'), ('dynamically', 'RB'), ('typed', 'JJ'), ('and', 'CC'), ('garbage-collected', 'JJ'), ('.', '.')]
[('It', 'PRP'), ('supports', 'VBZ'), ('multiple', 'JJ'), ('programming', 'NN'), ('paradigms', 'NN'), (',', ','), ('including', 'VBG'), ('structured', 'VBN'), ('(', '('), ('particularly', 'RB'), ('procedural', 'NN'), (')', ')'), (',', ','), ('object-oriented', 'JJ'), ('and', 'CC'), ('functional', 'JJ'), ('programming', 'NN'), ('.', '.')]
[('It', 'PRP'), ('is', 'VBZ'), ('often', 'RB'), ('described', 'VBN'), ('as', 'IN'), ('a'

In [54]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print(nltk.pos_tag(words))

['Python', 'is', 'a', 'high-level', ',', 'general-purpose', 'programming', 'language', '.']
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), (',', ','), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
['Its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'the', 'use', 'of', 'significant', 'indentation', '.']
[('Its', 'PRP$'), ('design', 'NN'), ('philosophy', 'NN'), ('emphasizes', 'VBZ'), ('code', 'JJ'), ('readability', 'NN'), ('with', 'IN'), ('the', 'DT'), ('use', 'NN'), ('of', 'IN'), ('significant', 'JJ'), ('indentation', 'NN'), ('.', '.')]
['[', '32', ']', 'Python', 'is', 'dynamically', 'typed', 'and', 'garbage-collected', '.']
[('[', 'RB'), ('32', 'CD'), (']', 'JJ'), ('Python', 'NNP'), ('is', 'VBZ'), ('dynamically', 'RB'), ('typed', 'JJ'), ('and', 'CC'), ('garbage-collected', 'JJ'), ('.', '.')]
['It', 'supports', 'multiple', 'programming', 'paradigms', ',', 'including', 'structured', '(', 'particularly', '

## Why is PoS tagging important?

### Syntactic Analysis: 
* Identifying the part of speech of `each word helps in understanding the grammatical structure of a sentence`.
* This information is `crucial for syntactic parsing`, which involves `analyzing the sentence's structure and relationships between words`.

### Semantic Analysis: 
* `Different parts of speech often carry different semantic meanings`.
* For example, `verbs typically denote actions`, `nouns represent objects`, and `adjectives describe qualities`.
* Knowing the part of speech can aid in `understanding the intended meaning of a sentence`.

# Stop Word Removal

* `Stop words` are `common words that mainly support the construction of sentences`.
* `We often remove stop words from our analysis` because they usually `contribute little to the meaning of the sentences they appear in`.
* They occur very frequently, so removing them shrinks the data without losing much of the sense of a sentence. Some stop words do carry meaning, however: dropping "not", for example, can flip the sentiment of a sentence, so whether to remove them depends on the task.
* Examples of stop words include articles `(e.g., "the", "a", "an")`, conjunctions `(e.g., "and", "but", "or")`, prepositions `(e.g., "in", "on", "at")`, and pronouns `(e.g., "I", "you", "he", "she")`.

![](https://cdn.botpenguin.com/assets/website/Stop_Words_d01a1e75a3.png)

In [14]:
import nltk
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [22]:
# To get the list of stop words provided for the English language,
# pass the language name (in lowercase) to the words() function.
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [23]:
from nltk import word_tokenize
from nltk.corpus import stopwords


for sentence in sentences:
    print([token for token in word_tokenize(sentence) if token not in stop_words])

['Python', 'high-level', ',', 'general-purpose', 'programming', 'language', '.']
['Its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'use', 'significant', 'indentation', '.']
['[', '32', ']', 'Python', 'dynamically', 'typed', 'garbage-collected', '.']
['It', 'supports', 'multiple', 'programming', 'paradigms', ',', 'including', 'structured', '(', 'particularly', 'procedural', ')', ',', 'object-oriented', 'functional', 'programming', '.']
['It', 'often', 'described', '``', 'batteries', 'included', "''", 'language', 'due', 'comprehensive', 'standard', 'library', '.']
['[', '33', ']', '[', '34', ']', 'Guido', 'van', 'Rossum', 'began', 'working', 'Python', 'late', '1980s', 'successor', 'ABC', 'programming', 'language', 'first', 'released', '1991', 'Python', '0.9.0', '.']
['[', '35', ']', 'Python', '2.0', 'released', '2000', '.']
['Python', '3.0', ',', 'released', '2008', ',', 'major', 'revision', 'completely', 'backward-compatible', 'earlier', 'versions', '.']
['Python', '2

In [24]:
for i in sentences:
    print([ii for ii in word_tokenize(i) if ii in stop_words ])

['is', 'a']
['with', 'the', 'of']
['is', 'and']
['and']
['is', 'as', 'a', 'to', 'its']
['on', 'in', 'the', 'as', 'a', 'to', 'the', 'and', 'it', 'in', 'as']
['was', 'in']
['in', 'was', 'a', 'not', 'with']
['in', 'was', 'the', 'of']
['as', 'of', 'the', 'most', 'and', 'has', 'in', 'the']
[]


In [25]:
sentence = "I am learning Python. It is one of the most popular programming languages"
sentence_words = word_tokenize(sentence)
print(sentence_words)

['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'languages']


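To remove the stop words, we loop through each word in the sentence, keep only the words that are not in the stop word list, and then join the remaining words back into a sentence. The comparison is case-sensitive and the list is all lowercase, so capitalized words such as "I" and "It" are not removed.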

In [26]:
' '.join([i for i in sentence_words if i not in stop_words])

'I learning Python . It one popular programming languages'

In [29]:
sentence_stops = ' '.join([word for word in sentence_words if word  in stop_words])
print(sentence_stops)

am is of the most
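
In the outputs above, capitalized words such as "I" and "It" survive because the membership test is case-sensitive while the stop word list is lowercase. A minimal sketch of a case-insensitive filter follows. To stay self-contained it uses a small hand-written stop word set rather than the NLTK list, and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical, tiny subset of stop words for illustration only.
STOP_WORDS = {"i", "am", "is", "it", "of", "the", "most"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing otherwise."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["I", "am", "learning", "Python", ".", "It", "is", "one",
          "of", "the", "most", "popular", "programming", "languages"]

print(" ".join(remove_stop_words(tokens)))
# learning Python . one popular programming languages
```

Lowercasing only for the comparison keeps the original casing of the remaining tokens, which matters for later steps such as named entity recognition.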


## Why are stop words removed?

### Noise Reduction: 
* Stop words are frequently removed to reduce the `noise in text data`.
* By removing these common words, the focus can be shifted towards the more meaningful content words, which are typically `nouns, verbs, adjectives, and adverbs`.

### Memory and Processing Efficiency:
* Stop words removal can help `reduce the size` of the `vocabulary` and the `computational resources needed` for `text processing tasks such as indexing, searching, and analysis`.

# Text Normalization


* Some words are `spelt`, `pronounced`, or `represented differently` but refer to the same thing, for example, `Mumbai and Bombay, or US and United States`.
* Although they look different, they `mean the same thing`.
* There are `also different forms of words that need to be converted into base forms`.
* For example, words such as ``"does" and "doing"``, when converted to their base form, become ` "do" `.
* Along these lines, ``text normalization is a process wherein different variations of text get converted into a standard form``, so that equivalent words are treated alike in later analysis.
* There are various ways of normalizing text, such as ``spelling correction, stemming, and lemmatization``.

In [30]:
sentence = "I visited US from UK on 22-10-18"
sentence

'I visited US from UK on 22-10-18'

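We want to replace "US" with "United States", "UK" with "United Kingdom", and "18" with "2018". To do so, we chain calls to the string `replace()` method and store the updated output in the `normalized_sentence` variable. Note that `replace()` substitutes every occurrence of the substring, so this simple approach only works when the patterns appear nowhere else in the text.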

In [31]:
normalized_sentence = sentence.replace("US", "United States").replace("UK", "United Kingdom").replace("18", "2018")
normalized_sentence

'I visited United States from United Kingdom on 22-10-2018'
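
Plain substring replacement would also rewrite "US" inside a word like "BUS", or a standalone "18". One possible refinement, sketched here with the standard `re` module, is to match abbreviations only as whole words and to expand the year only inside a `dd-mm-yy` date. The `ABBREVIATIONS` mapping, the `normalize` helper, and the assumption that two-digit years belong to the 2000s are all illustrative choices.

```python
import re

# Hypothetical mapping of abbreviations to their expanded forms.
ABBREVIATIONS = {"US": "United States", "UK": "United Kingdom"}

def normalize(text):
    """Expand whole-word abbreviations and two-digit years in dd-mm-yy dates."""
    for short, full in ABBREVIATIONS.items():
        # \b ensures the abbreviation is matched only as a whole word.
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    # Assumes every two-digit year refers to the 2000s.
    text = re.sub(r"\b(\d{2})-(\d{2})-(\d{2})\b", r"\1-\2-20\3", text)
    return text

print(normalize("I visited US from UK on 22-10-18"))
# I visited United States from United Kingdom on 22-10-2018
print(normalize("BUS 18 left on 18-10-18"))
# BUS 18 left on 18-10-2018
```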

## Lower casing: 
* Converting all letters in the text to lowercase. This ensures that capitalization variations do not affect text analysis. For example:

Original: "The Quick Brown Fox"

Normalized: "the quick brown fox"

## Removing Punctuation: 
* Eliminating punctuation marks from the text. Punctuation often doesn't carry much semantic meaning and can be safely disregarded in many NLP tasks. For example:

Original: "Hello, world!"

Normalized: "Hello world"
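
Both of these steps can be done with Python's built-in string tools. The sketch below is one simple way to do it; the helper names are our own, and `string.punctuation` covers ASCII punctuation only.

```python
import string

def lowercase(text):
    """Convert all letters to lowercase."""
    return text.lower()

def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(lowercase("The Quick Brown Fox"))     # the quick brown fox
print(remove_punctuation("Hello, world!"))  # Hello world
```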

# Spelling Correction

* Spelling correction is an `important task in many NLP projects`, especially when working with user-generated text.
* It can be `time consuming`, but `without it there is a high chance of losing required information`.
* We make use of the `autocorrect` Python library to correct spellings.

In [15]:
!pip install autocorrect

Defaulting to user installation because normal site-packages is not writeable
Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py): started
  Building wheel for autocorrect (setup.py): finished with status 'done'
  Created wheel for autocorrect: filename=autocorrect-2.6.1-py3-none-any.whl size=622382 sha256=b8ff4eed1ca380864b2cc404ae180786d1a2840bad3e0d5fd7df6aee7521650f
  Stored in directory: c:\users\elcot\appdata\local\pip\cache\wheels\ab\0f\23\3c010c3fd877b962146e7765f9e9b08026cac8b035094c5750
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.6.1


In [32]:
from autocorrect import Speller
spell=Speller()
spell("ambrell")

'umbrella'

In [33]:
spell("Pythom")

'Python'

In [35]:
spell('dancig')

'dancing'

In [39]:
spell('libraey')

'library'

In [53]:
# Spelling correction of a word and a sentence
# To correct the spelling of a word, pass the misspelled word as an argument to spell().
spell('Natureal')

'Natural'

# Examples

In [48]:
#In order to correct the spelling of a sentence, we first need to tokenize it into words
sentence = word_tokenize("Ntural Luanguage Processin deals with the art of extracting insightes from Natural Languaes")
print(sentence)

['Ntural', 'Luanguage', 'Processin', 'deals', 'with', 'the', 'art', 'of', 'extracting', 'insightes', 'from', 'Natural', 'Languaes']


In [52]:
print(' '.join([spell(i) for i in sentence]))

Natural Language Processing deals with the art of extracting insights from Natural Languages


In the preceding output, all of the wrongly spelled words have been corrected. This is not guaranteed, however, and results may differ between versions of the library. For example, changing "Processin" to "Procession" or to "Processing" requires the same number of edits, so a corrector that ranks candidates without looking at context can pick the wrong one. To handle such cases reliably, we need spelling correctors that are aware of context.
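
The tie between the two candidates for "Processin" can be checked with a small edit-distance function. This is a standard Levenshtein distance written in plain Python for illustration; it is not necessarily how the `autocorrect` library ranks its candidates.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("Processin", "Processing"))  # 1
print(edit_distance("Processin", "Procession"))  # 1
```

Both candidates are one edit away, so edit distance alone cannot decide between them.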

# Stemming

* Converting words to their `base or root form`.
* This reduces inflectional forms and variants to their common base form. For example:

Original: "running"

Normalized: "run"

* In languages such as `English`, words get transformed into `various forms when being used in a sentence`.
* For example, the word `"product"` might get transformed into `"production"` when referring to the process of making something, or into `"products"` in plural form.
* It is often useful to convert these words into their base forms, as they carry closely related meanings. Stemming is a process that helps us do so. The following figure illustrates how words are reduced to their base forms:

![image.png](attachment:2a640918-8a46-4622-b9af-15d654f30d28.png)

In [43]:
import nltk
stemmer = nltk.stem.PorterStemmer()
stemmer.stem("production")

'product'

In [44]:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
stemmer.stem("products")

'product'

In [45]:
print(stemmer.stem("fairly")) # 'fairli' is not a dictionary word; stems need not be real words
print(stemmer.stem('fires'))
print(stemmer.stem('Firing'))

fairli
fire
fire


## Why is stemming used?

### Normalization: 
* Stemming helps in normalizing words by `reducing them to their base form`.
*  This ensures that `different inflected forms of a word are treated as the same word`, which simplifies `text analysis`.

### Reducing Vocabulary Size: 
* Stemming `reduces the vocabulary size` by `collapsing words with similar meanings into a single representation`.
* This can improve `computational efficiency and reduce the complexity of NLP models`.
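
To make the vocabulary-size point concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm used above; the `SUFFIXES` rule set and `naive_stem` helper are hypothetical and far too crude for real use, but they show how several word forms collapse into one entry.

```python
# Hypothetical, tiny rule set; suffixes are tried in this order.
SUFFIXES = ("ing", "ion", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["product", "products", "production", "fires", "firing"]
stems = {w: naive_stem(w) for w in words}
print(stems)
print(len(set(words)), "->", len(set(stems.values())))  # 5 -> 2
```

Five distinct words collapse to two stems, `product` and `fir`. The second is not a real word, which mirrors the `fairli` result from the Porter stemmer.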

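# Lemmatization
- Sometimes, the stemming process leads to inappropriate results.
- For example, in the stemming cells above, the word "fairly" got transformed to "fairli", which has no meaning.
- To overcome these problems with stemming, we make use of lemmatization.
- In this process, an additional check is made by looking through a dictionary to extract the base form (lemma) of a word.
- However, this additional check slows down the process.
- `WordNetLemmatizer.lemmatize()` treats its input as a noun unless a `pos` argument is supplied (for example, `pos='v'` for verbs), which is why a verb form such as "changing" is returned unchanged in the cells below.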

## Extracting the base word using Lemmatization

In [46]:
#the lemmatization process to produce the proper form of a given word
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

In [47]:
lemmatizer = WordNetLemmatizer()
#Bring the word to its proper form by using the lemmatize() method of the WordNetLemmatizer class.
lemmatizer.lemmatize('products')

'product'

In [48]:
lemmatizer.lemmatize('production')

'production'

In [60]:
lemmatizer.lemmatize("fairly")

'fairly'

In [49]:
lemmatizer.lemmatize("changing")

'changing'

![image.png](attachment:aa99e243-5af6-41d5-a97a-72d59e055fad.png)

# Named Entity Recognition (NER)
- Named Entity Recognition (NER) is a natural language processing task that involves `identifying and categorizing named entities` within a text into `predefined categories` such as `names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages,` etc.
- Named entities (typically proper nouns) are usually not present in dictionaries, so we need to treat them separately.
- The goal of NER is to `extract and classify relevant entities from unstructured text to understand the key elements mentioned in the text and their relationships`.
- To get a better understanding of this process, we'll work through an example.

In [58]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# Sample text
text = "We are reading a book published by Packt which is based out of Birmingham.."
print(text)

# Tokenize the text into words
words = word_tokenize(text)
print(words)

# Part-of-speech tagging
pos_tags = pos_tag(words)
print('pos tag :',pos_tags)

# Named Entity Recognition
ner_tags = ne_chunk(pos_tags)
print('ne_chunk :',ner_tags)

# Print named entities
for chunk in ner_tags:
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))


We are reading a book published by Packt which is based out of Birmingham..
['We', 'are', 'reading', 'a', 'book', 'published', 'by', 'Packt', 'which', 'is', 'based', 'out', 'of', 'Birmingham', '..']
pos tag : [('We', 'PRP'), ('are', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('published', 'VBN'), ('by', 'IN'), ('Packt', 'NNP'), ('which', 'WDT'), ('is', 'VBZ'), ('based', 'VBN'), ('out', 'IN'), ('of', 'IN'), ('Birmingham', 'NNP'), ('..', 'NNP')]
ne_chunk : (S
  We/PRP
  are/VBP
  reading/VBG
  a/DT
  book/NN
  published/VBN
  by/IN
  (PERSON Packt/NNP)
  which/WDT
  is/VBZ
  based/VBN
  out/IN
  of/IN
  (ORGANIZATION Birmingham/NNP)
  ../NNP)
PERSON Packt
ORGANIZATION Birmingham


In [54]:
for i in ner_tags:
    if hasattr(i,'label'):
        print(i)

(PERSON Packt/NNP)
(ORGANIZATION Birmingham/NNP)


In [12]:
#In this exercise, we will find out the named entities in a sentence.
import nltk
from nltk import word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = """We are reading a book published by Packt
which is based out of Birmingham."""
              

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [63]:
i = nltk.ne_chunk(nltk.pos_tag(word_tokenize(sentence)),binary=True)
print(i)

(S
  We/PRP
  are/VBP
  reading/VBG
  a/DT
  book/NN
  published/VBN
  by/IN
  (NE Packt/NNP)
  which/WDT
  is/VBZ
  based/VBN
  out/IN
  of/IN
  (NE Birmingham/NNP)
  ./.)


In [64]:
[a for a in i if isinstance(a, nltk.Tree)]  # keep only the named-entity subtrees, including multi-word ones

[Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]

## Why Named Entity Recognition is used:

### Information Extraction: 
* NER helps in `extracting valuable information from text by identifying important entities`.
* For example, in news articles, NER can be used to extract `names of people, organizations, and locations mentioned in the article`.

### Facilitating Search and Retrieval: 
* NER aids in `improving search and retrieval systems` by `identifying specific entities mentioned in documents`.
* Users `can search for documents containing particular entities`, `making information retrieval more accurate and efficient`.

# Bag of Words

- The bag-of-words model converts text into a numerical representation (numerical feature vectors) that can be used to train models with machine learning algorithms.
- Here are the key steps of fitting a bag-of-words model:

    1. Create a vocabulary index of the words or tokens from the entire set of documents. The vocabulary can be indexed in alphabetical order.
    2. Construct a numerical feature vector for each document that records how frequently each vocabulary word appears in that document. Each feature vector will be sparse, as the words in any one document are only a small subset of all the words (the bag of words) present in the entire set of documents.

- The picture below illustrates this concept. Note the following:

   - The header lists the unique words across all three documents shown in the first column.
   - Against each document, each number represents the number of occurrences of that word. For example, in the first document, “bird” occurred 5 times, “the” occurred 2 times, and “about” occurred 1 time.


![image.png](attachment:046e3e0d-cdce-46dd-98e3-f129fbaca52d.png)
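
The two steps above can be sketched with the standard library alone. This is only a minimal illustration, not how `CountVectorizer` is implemented: the three toy documents are hypothetical, tokenization is a plain whitespace split, and `to_count_vector` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
documents = [
    "the bird sat on the wall",
    "the cat chased the bird",
    "a bird in the hand",
]

# Step 1: build the vocabulary in alphabetical order and give each word an index.
tokenized = [doc.lower().split() for doc in documents]
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# Step 2: one count vector per document; most entries stay 0 (sparse).
def to_count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(list(vocabulary))
for vector in vectors:
    print(vector)
```

Each row has one column per vocabulary word, so the matrix width grows with the vocabulary while each row stays mostly zeros.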

## Creating a bag-of-words model using Python Sklearn

- To construct a bag-of-words model based on the word counts in the respective documents, the CountVectorizer class implemented in scikit-learn is used.

- CountVectorizer (sklearn.feature_extraction.text.CountVectorizer) is used to fit the bag-of-words model.
- The fit_transform method of CountVectorizer takes an array of text data, which can be documents or sentences.

In [2]:
paragraph='''

 is an education initiative with a keen vision to impart education which is relevant to the real-world industry and intellectual needs and it is part of Sabari Foundation.

With decades of experience in the field of education, we believe that by harnessing the new age digital teaching methodologies, we can provide the quintessential learning experience.

Rajalakshmi Institutions established in 1997 is the largest private engineering college group in Tamil Nadu in terms of student enrolment. Having an enviable placement record, we have always been keen on imparting the latest and most relevant education to students ensuring industry readiness. The cornerstone of the Rajalakshmi experience has been a learner centric approach to ensure each and every learner has comfortably completed the Learning Objectives of the program they have signed up for.

We at REV are on a mission of leveraging the wealth of experience and knowledge available across our Institutions and inculcating it to the new age digital citizens.

Our contact and digital programs have been designed by industry and academic doyens taking into account of all the leading and global corporate demands.
'''

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer




In [5]:
import nltk
ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []

In [28]:
print(sentences)

['\n\n is an education initiative with a keen vision to impart education which is relevant to the real-world industry and intellectual needs and it is part of Sabari Foundation.', 'With decades of experience in the field of education, we believe that by harnessing the new age digital teaching methodologies, we can provide the quintessential learning experience.', 'Rajalakshmi Institutions established in 1997 is the largest private engineering college group in Tamil Nadu in terms of student enrolment.', 'Having an enviable placement record, we have always been keen on imparting the latest and most relevant education to students ensuring industry readiness.', 'The cornerstone of the Rajalakshmi experience has been a learner centric approach to ensure each and every learner has comfortably completed the Learning Objectives of the program they have signed up for.', 'We at REV are on a mission of leveraging the wealth of experience and knowledge available across our Institutions and inculca

In [7]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [8]:
import re

In [33]:
stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    print(review)
    review = review.lower()
    print("Lower:" ,review)
    review = review.split()
    print("Split:",review)
    review = [ps.stem(word) for word in review if word not in stop_words]
    print("Stemming:",review)
    review = ' '.join(review)
    print("After joining:",review)
    corpus.append(review)

print(corpus)

 Python is a high level  general purpose programming language 
Lower:  python is a high level  general purpose programming language 
Split: ['python', 'is', 'a', 'high', 'level', 'general', 'purpose', 'programming', 'language']
Stemming: ['python', 'high', 'level', 'gener', 'purpos', 'program', 'languag']
After joining: python high level gener purpos program languag
Its design philosophy emphasizes code readability with the use of significant indentation 
Lower: its design philosophy emphasizes code readability with the use of significant indentation 
Split: ['its', 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'the', 'use', 'of', 'significant', 'indentation']
Stemming: ['design', 'philosophi', 'emphas', 'code', 'readabl', 'use', 'signific', 'indent']
After joining: design philosophi emphas code readabl use signific indent
      Python is dynamically typed and garbage collected 
Lower:       python is dynamically typed and garbage collected 
Split: ['python', 'is

In [15]:
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []
stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]  # drop stop words, stem the rest
    review = ' '.join(review)
    corpus.append(review)
    
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1

In [34]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
print(X[1:100])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]]


In [35]:
X.shape

(18, 134)

In [36]:
cv.vocabulary_

{'educ': 35,
 'initi': 61,
 'keen': 64,
 'vision': 129,
 'impart': 56,
 'relev': 109,
 'real': 106,
 'world': 133,
 'industri': 60,
 'intellectu': 63,
 'need': 83,
 'part': 90,
 'sabari': 113,
 'foundat': 46,
 'decad': 26,
 'experi': 43,
 'field': 44,
 'believ': 11,
 'har': 54,
 'new': 84,
 'age': 4,
 'digit': 30,
 'teach': 123,
 'methodolog': 79,
 'provid': 98,
 'quintessenti': 101,
 'learn': 72,
 'rajalakshmi': 102,
 'institut': 62,
 'establish': 41,
 'largest': 67,
 'privat': 95,
 'engin': 37,
 'colleg': 16,
 'group': 52,
 'tamil': 122,
 'nadu': 82,
 'term': 124,
 'student': 118,
 'enrol': 38,
 'enviabl': 40,
 'placement': 93,
 'record': 107,
 'alway': 5,
 'latest': 70,
 'ensur': 39,
 'readi': 105,
 'cornerston': 24,
 'learner': 73,
 'centric': 12,
 'approach': 6,
 'everi': 42,
 'comfort': 17,
 'complet': 20,
 'object': 85,
 'program': 97,
 'sign': 114,
 'rev': 110,
 'mission': 80,
 'leverag': 75,
 'wealth': 130,
 'knowledg': 65,
 'avail': 7,
 'across': 3,
 'inculc': 58,
 'citizen':

In [37]:
X

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [38]:
import pandas as pd
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,124,125,126,127,128,129,130,131,132,133
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
6,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
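
The integer column labels in the table above are the indices stored in `cv.vocabulary_`. As a minimal sketch of reading them back as terms, the snippet below inverts a hand-copied subset of that mapping. The names `vocabulary_subset` and `index_to_term` are illustrative.

```python
# A few entries copied from the cv.vocabulary_ output (term -> column index).
vocabulary_subset = {"educ": 35, "keen": 64, "vision": 129, "world": 133}

# Invert the mapping so a column index can be looked up by number.
index_to_term = {index: term for term, index in vocabulary_subset.items()}

# Row 0 of the table has a 1 in columns 129 and 133.
print(index_to_term[129], index_to_term[133])
```

Both terms occur in the first sentence of the paragraph, which is why row 0 is non-zero in those columns. With the fitted vectorizer at hand, passing `columns=cv.get_feature_names_out()` to `pd.DataFrame` should label every column the same way.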


### Term Frequency-Inverse Document Frequency (TF-IDF)
- Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

- TF-IDF is a scoring measure widely used in information retrieval (IR) and summarization.
- A term's score rises with how often it appears in a given document and falls with how many documents in the corpus contain it.

![image.png](attachment:6fe8db77-fc48-47a4-b009-24c3231bfc11.png)

In [32]:
'hi i learn python l , my fed also learn python'

'hi i learn python l , my fed also learn python'

In [33]:
(2/10)

0.2
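
The `2/10` above is the term frequency of `learn` in the example sentence: it occurs twice among ten word tokens. Here is a small sketch of that calculation, assuming term frequency is defined as occurrences divided by total tokens (one common convention; some libraries use raw counts instead). The helper name `term_frequency` is illustrative.

```python
import re
from collections import Counter

text = "hi i learn python l , my fed also learn python"

# Keep alphabetic tokens only, so the stray comma is not counted as a term.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

def term_frequency(term):
    """Relative frequency of `term` among all tokens in the sentence."""
    return counts[term] / len(tokens)

print(len(tokens), term_frequency("learn"))
```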

![image.png](attachment:5c7d741f-163d-4338-9105-7d5d1f7d54d1.png)

# Implementation

In [39]:
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []
stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
    

In [40]:
corpus

['python high level gener purpos program languag',
 'design philosophi emphas code readabl use signific indent',
 'python dynam type garbag collect',
 'support multipl program paradigm includ structur particularli procedur object orient function program',
 'often describ batteri includ languag due comprehens standard librari',
 'guido van rossum began work python late successor abc program languag first releas python',
 'python releas',
 'python releas major revis complet backward compat earlier version',
 'python releas last releas python',
 'python consist rank one popular program languag gain widespread use machin learn commun',
 '']

In [41]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()

In [42]:
X

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.4403944 , 0.        , 0.4403944 , 0.        ,
        0.        , 0.29585139, 0.        , 0.        , 0.        ,
        0.4403944 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.29585139,
        0.4403944 , 0.22170927, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.35966056,
        0.        , 0.        , 0.        , 0.        , 0.        ,
   

In [43]:
X.shape

(11, 63)

In [44]:
import pandas as pd
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,53,54,55,56,57,58,59,60,61,62
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.359661,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.307425,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.484875,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.484875,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.294426,0.0,0.294426,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.349601,0.0,0.0,0.0,0.0,0.0,0.0,0.349601,...,0.349601,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.296595,0.0,0.0,0.296595,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.296595,0.0,0.0,0.0,0.296595,0.0,0.0,0.296595
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.360264,0.0,0.0,0.0,0.0,0.0,0.360264,0.360264,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.360264,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.303077,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.259059,0.0,0.0,0.303077,0.0


In [45]:
# Get the feature names (vocabulary terms), in column order
feature_names = cv.get_feature_names_out()
feature_names

array(['abc', 'backward', 'batteri', 'began', 'code', 'collect', 'commun',
       'compat', 'complet', 'comprehens', 'consist', 'describ', 'design',
       'due', 'dynam', 'earlier', 'emphas', 'first', 'function', 'gain',
       'garbag', 'gener', 'guido', 'high', 'includ', 'indent', 'languag',
       'last', 'late', 'learn', 'level', 'librari', 'machin', 'major',
       'multipl', 'object', 'often', 'one', 'orient', 'paradigm',
       'particularli', 'philosophi', 'popular', 'procedur', 'program',
       'purpos', 'python', 'rank', 'readabl', 'releas', 'revis', 'rossum',
       'signific', 'standard', 'structur', 'successor', 'support', 'type',
       'use', 'van', 'version', 'widespread', 'work'], dtype=object)

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "To be or not to be, that is the question.",
    "The quick brown fox jumps over the lazy dog.",
    "A stitch in time saves nine."
]

# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to your data and compute IDF values
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Print IDF values
print("IDF values:")
for word, idf_value in zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_):
    print(f"   {word}: {idf_value:.4f}")


IDF values:
   be: 1.6931
   brown: 1.6931
   dog: 1.6931
   fox: 1.6931
   in: 1.6931
   is: 1.6931
   jumps: 1.6931
   lazy: 1.6931
   nine: 1.6931
   not: 1.6931
   or: 1.6931
   over: 1.6931
   question: 1.6931
   quick: 1.6931
   saves: 1.6931
   stitch: 1.6931
   that: 1.6931
   the: 1.2877
   time: 1.6931
   to: 1.6931
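
To see where these numbers come from, the sketch below recomputes them with the standard library. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (lower-casing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization); it is a hand-rolled approximation for illustration, not the library's code, and `tfidf_vector` is an illustrative helper name.

```python
import math
import re
from collections import Counter

corpus = [
    "To be or not to be, that is the question.",
    "The quick brown fox jumps over the lazy dog.",
    "A stitch in time saves nine.",
]

# Mimic the default tokenizer: lower-case, keep words of two or more characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(tokens):
    """Raw count times IDF, scaled to unit Euclidean (L2) length."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}

first_doc = tfidf_vector(tokenized[0])
print(round(idf["the"], 4), round(idf["be"], 4), round(first_doc["be"], 4))
```

`the` appears in two of the three documents, so its IDF is lower than that of every other term, each of which appears in only one document.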


In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = [
    "To be or not to be, that is the question.",
    "The quick brown fox jumps over the lazy dog.",
    "A stitch in time saves nine."
]

# TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get the feature names and the dense TF-IDF matrix
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_values = tfidf_matrix.toarray()
tfidf_values

array([[0.54275734, 0.        , 0.        , 0.        , 0.        ,
        0.27137867, 0.        , 0.        , 0.        , 0.27137867,
        0.27137867, 0.        , 0.27137867, 0.        , 0.        ,
        0.        , 0.27137867, 0.20639047, 0.        , 0.54275734],
       [0.        , 0.32767345, 0.32767345, 0.32767345, 0.        ,
        0.        , 0.32767345, 0.32767345, 0.        , 0.        ,
        0.        , 0.32767345, 0.        , 0.32767345, 0.        ,
        0.        , 0.        , 0.49840822, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.4472136 ,
        0.        , 0.        , 0.        , 0.4472136 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.4472136 ,
        0.4472136 , 0.        , 0.        , 0.4472136 , 0.        ]])

In [55]:
# Print TF-IDF values for each document and feature (word)
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")  # one heading per document (documents holds three sentences)
    for j, word in enumerate(feature_names):
        print(f"   {word}: {tfidf_values[i][j]:.4f}")
    print()

# Print IDF values
print("IDF values:")
for j, word in enumerate(feature_names):
    idf_value = tfidf_vectorizer.idf_[j]
    print(f"   {word}: {idf_value:.4f}")

Document 1:
   be: 0.5428
   brown: 0.0000
   dog: 0.0000
   fox: 0.0000
   in: 0.0000
   is: 0.2714
   jumps: 0.0000
   lazy: 0.0000
   nine: 0.0000
   not: 0.2714
   or: 0.2714
   over: 0.0000
   question: 0.2714
   quick: 0.0000
   saves: 0.0000
   stitch: 0.0000
   that: 0.2714
   the: 0.2064
   time: 0.0000
   to: 0.5428

Document 2:
   be: 0.0000
   brown: 0.3277
   dog: 0.3277
   fox: 0.3277
   in: 0.0000
   is: 0.0000
   jumps: 0.3277
   lazy: 0.3277
   nine: 0.0000
   not: 0.0000
   or: 0.0000
   over: 0.3277
   question: 0.0000
   quick: 0.3277
   saves: 0.0000
   stitch: 0.0000
   that: 0.0000
   the: 0.4984
   time: 0.0000
   to: 0.0000

Document 3:
   be: 0.0000
   brown: 0.0000
   dog: 0.0000
   fox: 0.0000
   in: 0.4472
   is: 0.0000
   jumps: 0.0000
   lazy: 0.0000
   nine: 0.4472
   not: 0.0000
   or: 0.0000
   over: 0.0000
   question: 0.0000
   quick: 0.0000
   saves: 0.4472
   stitch: 0.4472
   that: 0.0000
   the: 0.0000
   time: 0.4472
   to: 0.0000

IDF values:
 

## spaCy
* `spaCy` is a `popular and efficient library for natural language processing (NLP)` in Python.
* It provides `pre-trained models for various languages, allowing you to perform tasks such as tokenization, part-of-speech tagging, named entity recognition, and more`.
* Here's a brief overview of how to use spaCy for basic NLP tasks:

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/180px-SpaCy_logo.svg.png)

In [44]:
!pip install spacy==3.7.4

Defaulting to user installation because normal site-packages is not writeable
Collecting spacy==3.7.4
  Downloading spacy-3.7.4-cp39-cp39-win_amd64.whl (12.2 MB)
Collecting smart-open<7.0.0,>=5.2.1
  Downloading smart_open-6.4.0-py3-none-any.whl (57 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.8-cp39-cp39-win_amd64.whl (39 kB)
Collecting thinc<8.3.0,>=8.2.2
  Downloading thinc-8.2.3-cp39-cp39-win_amd64.whl (1.5 MB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.4.0-py3-none-any.whl (182 kB)
Collecting pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4
  Downloading pydantic-2.7.1-py3-none-any.whl (409 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.9-cp39-cp39-win_amd64.whl (122 kB)
Collecting wasabi<1.2.0,>=0.9.1
  Downloading wasabi-1.1.2-py3-none-any.whl (27 kB)
Collecting weasel<0.4.0,>=0.1.0
  Downloading weasel-0.3.4-py3-none-any.whl (50 kB)
Collecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.8-cp39-cp39-win_amd64.whl (483 kB)
Collecting spacy-lega



In [62]:
!python -m spacy download en_core_web_md   #installs medium-sized English model (en_core_web_md) for spaCy NLP library.

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [56]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_md')

# Example text
text = """Elon Reeve Musk is a businessman and investor.
     He is the founder, chairman, CEO, and CTO of SpaceX, angel investor, CEO, product architect, and former chairman of Tesla,
     Inc. owner, executive chairman, and CTO of X Corp. founder of the Boring Company and xAI co-founder of Neuralink and OpenAI, and president of the Musk Foundation. 
     He is the second wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index, 
     and $182.6 billion according to Forbes, primarily from his ownership stakes in Tesla and SpaceX."""

# Process the text with SpaCy
doc = nlp(text)
doc

Elon Reeve Musk is a businessman and investor.
     He is the founder, chairman, CEO, and CTO of SpaceX, angel investor, CEO, product architect, and former chairman of Tesla,
     Inc. owner, executive chairman, and CTO of X Corp. founder of the Boring Company and xAI co-founder of Neuralink and OpenAI, and president of the Musk Foundation. 
     He is the second wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index, 
     and $182.6 billion according to Forbes, primarily from his ownership stakes in Tesla and SpaceX.

In [57]:
# Tokenization with stopwords removal
filtered_tokens = [token.text for token in doc if not token.is_stop]
print("Filtered Tokens (without stopwords):", filtered_tokens)


Filtered Tokens (without stopwords): ['Elon', 'Reeve', 'Musk', 'businessman', 'investor', '.', '\n     ', 'founder', ',', 'chairman', ',', 'CEO', ',', 'CTO', 'SpaceX', ',', 'angel', 'investor', ',', 'CEO', ',', 'product', 'architect', ',', 'chairman', 'Tesla', ',', '\n     ', 'Inc.', 'owner', ',', 'executive', 'chairman', ',', 'CTO', 'X', 'Corp.', 'founder', 'Boring', 'Company', 'xAI', 'co', '-', 'founder', 'Neuralink', 'OpenAI', ',', 'president', 'Musk', 'Foundation', '.', '\n     ', 'second', 'wealthiest', 'person', 'world', ',', 'estimated', 'net', 'worth', 'US$', '232', 'billion', 'December', '2023', ',', 'according', 'Bloomberg', 'Billionaires', 'Index', ',', '\n     ', '$', '182.6', 'billion', 'according', 'Forbes', ',', 'primarily', 'ownership', 'stakes', 'Tesla', 'SpaceX.']


In [59]:
#Lemmatization
lemmas = [token.lemma_ for token in doc if not token.is_stop]
print("Lemmas (without stopwords):", lemmas)


Lemmas (without stopwords): ['Elon', 'Reeve', 'Musk', 'businessman', 'investor', '.', '\n     ', 'founder', ',', 'chairman', ',', 'ceo', ',', 'CTO', 'SpaceX', ',', 'angel', 'investor', ',', 'ceo', ',', 'product', 'architect', ',', 'chairman', 'Tesla', ',', '\n     ', 'Inc.', 'owner', ',', 'executive', 'chairman', ',', 'CTO', 'X', 'Corp.', 'founder', 'Boring', 'Company', 'xAI', 'co', '-', 'founder', 'Neuralink', 'OpenAI', ',', 'president', 'Musk', 'Foundation', '.', '\n     ', 'second', 'wealthy', 'person', 'world', ',', 'estimate', 'net', 'worth', 'us$', '232', 'billion', 'December', '2023', ',', 'accord', 'Bloomberg', 'Billionaires', 'Index', ',', '\n     ', '$', '182.6', 'billion', 'accord', 'Forbes', ',', 'primarily', 'ownership', 'stake', 'Tesla', 'SpaceX.']


In [6]:
# Part-of-speech tagging
pos_tags = [(token.text, token.pos_) for token in doc if not token.is_stop]
print("Part-of-speech tags (without stopwords):", pos_tags)


Part-of-speech tags (without stopwords): [('Elon', 'PROPN'), ('Reeve', 'PROPN'), ('Musk', 'PROPN'), ('businessman', 'NOUN'), ('investor', 'NOUN'), ('.', 'PUNCT'), ('\n     ', 'SPACE'), ('founder', 'NOUN'), (',', 'PUNCT'), ('chairman', 'NOUN'), (',', 'PUNCT'), ('CEO', 'NOUN'), (',', 'PUNCT'), ('CTO', 'PROPN'), ('SpaceX', 'PROPN'), (',', 'PUNCT'), ('angel', 'NOUN'), ('investor', 'NOUN'), (',', 'PUNCT'), ('CEO', 'NOUN'), (',', 'PUNCT'), ('product', 'NOUN'), ('architect', 'NOUN'), (',', 'PUNCT'), ('chairman', 'NOUN'), ('Tesla', 'PROPN'), (',', 'PUNCT'), ('\n     ', 'SPACE'), ('Inc.', 'PROPN'), ('owner', 'NOUN'), (',', 'PUNCT'), ('executive', 'ADJ'), ('chairman', 'NOUN'), (',', 'PUNCT'), ('CTO', 'PROPN'), ('X', 'PROPN'), ('Corp.', 'PROPN'), ('founder', 'NOUN'), ('Boring', 'PROPN'), ('Company', 'PROPN'), ('xAI', 'PROPN'), ('co', 'NOUN'), ('-', 'NOUN'), ('founder', 'NOUN'), ('Neuralink', 'PROPN'), ('OpenAI', 'PROPN'), (',', 'PUNCT'), ('president', 'NOUN'), ('Musk', 'PROPN'), ('Foundation', 'P

In [7]:
# Named Entity Recognition (NER)
for ent in doc.ents:
    print(ent.text,"||",ent.label_,"||",spacy.explain(ent.label_))

Elon Reeve Musk || PERSON || People, including fictional
SpaceX || ORG || Companies, agencies, institutions, etc.
Tesla || ORG || Companies, agencies, institutions, etc.
X Corp. || ORG || Companies, agencies, institutions, etc.
the Boring Company || ORG || Companies, agencies, institutions, etc.
xAI || PERSON || People, including fictional
Neuralink || ORG || Companies, agencies, institutions, etc.
OpenAI || ORG || Companies, agencies, institutions, etc.
the Musk Foundation || ORG || Companies, agencies, institutions, etc.
second || ORDINAL || "first", "second", etc.
US$232 billion || MONEY || Monetary values, including unit
December 2023 || DATE || Absolute or relative dates or periods
the Bloomberg Billionaires Index || ORG || Companies, agencies, institutions, etc.
$182.6 billion || MONEY || Monetary values, including unit
Forbes || ORG || Companies, agencies, institutions, etc.
Tesla || ORG || Companies, agencies, institutions, etc.
SpaceX. || ORG || Companies, agencies, institutio

#### Visualization: Using displaCy
spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

In [60]:
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

## List of five NLP-powered applications:

* ``Google Assistant``: Virtual assistant for tasks and information retrieval.
* ``Amazon Alexa``: Voice-controlled assistant for home and tasks.
* ``IBM Watson``: Enterprise solutions for search and conversational AI.
* ``Microsoft Azure Cognitive Services``: Suite for language understanding and analysis.
* ``OpenAI GPT Models``: Used for chatbots, content generation, and more.