<h1 style="background: linear-gradient(to right, #ff9a9e, #fad0c4); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">NLP (Natural Language Processing)</h1>

 <span style="background: linear-gradient(to right, #ff9a9e, #fad0c4); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Natural Language Processing (NLP) is a field that gives machines the ability to read, understand, and derive meaning from human language. It covers the interaction between computers and humans through natural language, for tasks such as translation, sentiment analysis, and speech recognition.</span> 


In [2]:
import nltk

<small>The `Natural Language Toolkit` (NLTK) is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for `classification`, `tokenization`, `stemming`, `tagging`, `parsing`, and `semantic reasoning`.</small>

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\home\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
text = "Hello world, i use linux"
text

'Hello world, i use linux'

In [5]:

text.split(' ')

['Hello', 'world,', 'i', 'use', 'linux']

* `word_tokenize()` splits a text into individual tokens (words and punctuation marks).
* `sent_tokenize()` splits a text into individual sentences.
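
To make the difference from `str.split` concrete, here is a minimal sketch of punctuation-aware splitting that uses only the standard `re` module. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the regular expressions are only a rough approximation: NLTK's tokenizers handle many more cases, such as abbreviations and contractions.

```python
import re

def simple_word_tokenize(text):
    # A token is a run of word characters, or any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when whitespace follows.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Hello world, i use linux. It works!"
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

Unlike `text.split(' ')`, this keeps the comma as its own token instead of leaving it attached to `world`.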

In [6]:
from nltk.tokenize import word_tokenize,sent_tokenize
word_tokenize(text)

['Hello', 'world', ',', 'i', 'use', 'linux']

In [7]:
sent_tokenize(text)

['Hello world, i use linux']

In [8]:
# print every token except the comma
for word in word_tokenize(text):
    if word != ',':
        print(word)
     

Hello
world
i
use
linux


<strong style='color:orange;font-size: 25px;'>Stemming and Lemmatization</strong>

In [9]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\home\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\home\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [10]:
from nltk.stem import WordNetLemmatizer,PorterStemmer

stem = PorterStemmer()
lem = WordNetLemmatizer()


### Lemmatization : 
<small>`Lemmatization` reduces a word to its dictionary form (its lemma). `WordNetLemmatizer.lemmatize()` treats the word as a noun unless a `pos` argument is given.</small>

In [11]:
print(lem.lemmatize('change'))
print(lem.lemmatize('changer'))
print(lem.lemmatize('changes'))

change
changer
change


### Stemming :
<small>`Stemming` reduces a word to its stem by stripping suffixes with fixed rules, so the result is not always a dictionary word.</small>

In [12]:
print(stem.stem("run"))
print(stem.stem('runner'))

run
runner
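
The contrast between the two approaches can be sketched in plain Python. This is a toy illustration only, not the Porter or WordNet algorithms: the suffix list `SUFFIXES`, the lookup table `LEMMAS`, and both helper functions are hypothetical and chosen just to show rule-based stripping versus dictionary lookup.

```python
# Toy illustration only: these are NOT the Porter or WordNet algorithms.
SUFFIXES = ("ing", "es", "ed", "s")  # hypothetical suffix rules
LEMMAS = {"changes": "change", "changing": "change", "mice": "mouse"}  # hypothetical dictionary

def toy_stem(word):
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["changes", "changing", "mice", "changer"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based version turns `changes` into `chang`, which is not a word, while the lookup version returns `change`. A word missing from the dictionary, such as `changer`, passes through unchanged.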


### Stopwords
<p>Stopwords are commonly used words in a language that are often filtered out in NLP because they carry little meaning on their own.<br>Examples:<br>"is", "the", "and", "in", "it", "you", "to", "for".</p>

In [13]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\home\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from nltk.corpus import stopwords

In [15]:
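# list of English stopwords
# note: this rebinds the name `stopwords` from the corpus reader to a plain list,
# so re-running this cell without re-running the import above raises an AttributeError
stopwords = stopwords.words('english')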

In [16]:
len(stopwords)

179

In [17]:
txt = "Using Linux is like having a secret superpower that makes Windows machines always crashing."
text = word_tokenize(txt)

In [18]:
# removing stopwords
corpus = ''
for word in text:
    if word.lower() not in stopwords and len(word)>2:
        corpus += word + " "


In [19]:
print(txt) # with stop words
print(corpus) # without stop words

Using Linux is like having a secret superpower that makes Windows machines always crashing.
Using Linux like secret superpower makes Windows machines always crashing 
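
Building the string by concatenation leaves a trailing space in the result. A minimal alternative sketch, assuming the tokens are already available as a list, filters with a list comprehension and joins once at the end. The small set `MINI_STOPWORDS` and the helper `remove_stopwords` are hypothetical stand-ins so the example runs without NLTK data; the real English list is much larger.

```python
# Hypothetical mini stopword set, just large enough for this sentence.
MINI_STOPWORDS = {"is", "having", "a", "that"}

def remove_stopwords(tokens, stopword_set=MINI_STOPWORDS, min_len=3):
    # Keep tokens that are not stopwords (case-insensitive)
    # and have at least `min_len` characters.
    return [t for t in tokens
            if t.lower() not in stopword_set and len(t) >= min_len]

tokens = ["Using", "Linux", "is", "like", "having", "a", "secret",
          "superpower", "that", "makes", "Windows", "machines",
          "always", "crashing", "."]

cleaned = " ".join(remove_stopwords(tokens))
print(cleaned)
```

Using a `set` for the stopwords makes each membership test constant time, and `" ".join(...)` avoids the trailing space.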


In [81]:
corpus = "where every user is a self-appointed expert and every system update feels like rolling the dice in a game of digital roulette. Installing software is like navigating a labyrinth of dependencies and package managers, with the occasional sacrifice to the Linux gods for good measure. Troubleshooting is an art form, involving hours of scouring obscure forums and deciphering cryptic error messages. And don't even get me started on hardware compatibility – it's like playing Russian roulette with your peripherals"

In [82]:
words = []
for word in word_tokenize(corpus):
    if word.lower() not in stopwords and len(word)>2:
        words.append(word.lower())

In [84]:
set(words)

{'art',
 'compatibility',
 'cryptic',
 'deciphering',
 'dependencies',
 'dice',
 'digital',
 'error',
 'even',
 'every',
 'expert',
 'feels',
 'form',
 'forums',
 'game',
 'get',
 'gods',
 'good',
 'hardware',
 'hours',
 'installing',
 'involving',
 'labyrinth',
 'like',
 'linux',
 'managers',
 'measure',
 'messages',
 "n't",
 'navigating',
 'obscure',
 'occasional',
 'package',
 'peripherals',
 'playing',
 'rolling',
 'roulette',
 'russian',
 'sacrifice',
 'scouring',
 'self-appointed',
 'software',
 'started',
 'system',
 'troubleshooting',
 'update',
 'user'}

In [90]:
from tensorflow.keras.preprocessing.text import Tokenizer
tk = Tokenizer()

<small>The `Tokenizer` class vectorizes a text corpus by turning each text either into a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, or based on tf-idf.</small>

In [91]:
corpus = ['Wolverine has Claws','Ironman has arcReactor']

<small>`fit_on_texts()` is used to update the internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency.</small>

In [93]:
tk.fit_on_texts(corpus)

In [95]:
tk.word_index

{'has': 1, 'wolverine': 2, 'claws': 3, 'ironman': 4, 'arcreactor': 5}
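
The frequency-based ordering can be approximated in plain Python. This is a simplified sketch, not the Keras implementation: the helper `build_word_index` is hypothetical and only lower-cases and splits on whitespace, whereas `Tokenizer` also filters punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    # Count lower-cased, whitespace-separated tokens across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent first; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    # Indices start at 1; index 0 is never assigned to a word.
    return {word: i for i, word in enumerate(ranked, start=1)}

index = build_word_index(['Wolverine has Claws', 'Ironman has arcReactor'])
print(index)
```

`has` appears twice, so it gets index 1; the remaining words appear once each and keep the order in which they were first seen.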

<small>`texts_to_sequences()` transforms each text in the given list of texts to a sequence of integers.</small>

In [100]:
tk.texts_to_sequences(corpus)

[[2, 1, 3], [4, 1, 5]]

In [104]:
tok = Tokenizer(num_words = 4)
corps = ['water is cold','black tea is hot']
tok.fit_on_texts(corps)
tok.word_index

{'is': 1, 'water': 2, 'cold': 3, 'black': 4, 'tea': 5, 'hot': 6}

In [105]:
tok.texts_to_sequences(corps)

[[2, 1, 3], [1]]
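
With `num_words = 4`, `word_index` still lists every word, but `texts_to_sequences()` keeps only words whose index is below 4, which is why `black`, `tea`, and `hot` are dropped. A minimal sketch of that filtering step, using a hypothetical helper `texts_to_sequences_limited` and plain whitespace splitting rather than the Keras internals:

```python
def texts_to_sequences_limited(texts, word_index, num_words=None):
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word not in the vocabulary
            if num_words is not None and idx >= num_words:
                continue  # index outside the num_words limit
            seq.append(idx)
        sequences.append(seq)
    return sequences

word_index = {'is': 1, 'water': 2, 'cold': 3, 'black': 4, 'tea': 5, 'hot': 6}
limited = texts_to_sequences_limited(
    ['water is cold', 'black tea is hot'], word_index, num_words=4)
print(limited)
```

Because index 0 is never assigned to a word, a limit of `num_words` keeps at most `num_words - 1` distinct words.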