<a href="https://colab.research.google.com/github/Rishabh9559/Data_science/blob/main/Phase%204%20NLP/Context_to_vector/NLP_Libraries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLTK - Natural Language Toolkit

NLTK is an open-source Python platform that provides a suite of libraries and programs to help developers, researchers, and students perform Natural Language Processing (NLP) tasks.

**It is used for tasks such as:**
* Classification: Assigning text to predefined categories.
* Tokenization: Breaking down text into smaller units, like words or sentences.
* Tagging: Labeling words with their part of speech.
* Stemming and Lemmatization: Reducing words to their base or root form.
* Parsing: Analyzing the grammatical structure of sentences.
* Semantic Reasoning: Understanding the meaning of text.

In [1]:
import nltk
import spacy

In [2]:
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
spacy.cli.download("en_core_web_sm")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [12]:
# Load NLTK's English stop-word list as a set for fast membership tests
stop_words = set(stopwords.words('english'))
print(stop_words)

{'ain', 'don', 'was', 'this', 'then', "we'll", "doesn't", "they've", 'before', 'ma', 'you', 'won', 'how', 'needn', 'the', "hadn't", 'aren', "it'll", 'than', 'o', 'while', 'once', 'having', "it's", "they'd", 'now', 'with', 'nor', 'out', 'were', "you're", "it'd", "shan't", 'more', 're', 'by', "she'd", "she'll", "we're", 'do', 'does', "he's", 'me', 'ours', 'that', 'all', 'they', 'her', 'didn', 'which', 'itself', 'both', 'should', 'did', "we've", 'm', 'further', 'wasn', 'of', "i'm", "couldn't", 'had', 'if', "he'll", 'him', 'ourselves', "haven't", 's', 'their', 'under', 'haven', 'hers', 'can', 'down', 'up', "you'd", 'mustn', 'each', 'why', 'couldn', 'it', 'on', 'between', 'some', 'off', "they'll", 'be', 've', 'hadn', 'we', "weren't", 'not', 'over', 'any', 'few', 'shouldn', 'weren', 'will', 'd', 'into', 'wouldn', 'being', 'when', "wouldn't", 'these', 'against', 'here', 'our', 'shan', 'just', 'in', 'he', 'themselves', 'yourself', "you've", "shouldn't", 'above', 'yours', "won't", 'his', 'himse

In [13]:
sentence = 'Natural Language Processing (NLP) is a field that ' \
'combines computer science, artificial intelligence and ' \
'language studies. It helps computers understand, process and ' \
'create human language in a way that makes sense and is useful. ' \
'With the growing amount of text data from social media, ' \
'websites and other sources, NLP is becoming a key tool to gain ' \
'insights and automate tasks like analyzing text or translating ' \
'languages.'

1. Tokenize and convert to lowercase

In [20]:
token = word_tokenize(sentence)

In [21]:
print(token)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'field', 'that', 'combines', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'language', 'studies', '.', 'It', 'helps', 'computers', 'understand', ',', 'process', 'and', 'create', 'human', 'language', 'in', 'a', 'way', 'that', 'makes', 'sense', 'and', 'is', 'useful', '.', 'With', 'the', 'growing', 'amount', 'of', 'text', 'data', 'from', 'social', 'media', ',', 'websites', 'and', 'other', 'sources', ',', 'NLP', 'is', 'becoming', 'a', 'key', 'tool', 'to', 'gain', 'insights', 'and', 'automate', 'tasks', 'like', 'analyzing', 'text', 'or', 'translating', 'languages', '.']


In [28]:
lower_tokens = [t.lower() for t in token]
lower_tokens

['natural',
 'language',
 'processing',
 '(',
 'nlp',
 ')',
 'is',
 'a',
 'field',
 'that',
 'combines',
 'computer',
 'science',
 ',',
 'artificial',
 'intelligence',
 'and',
 'language',
 'studies',
 '.',
 'it',
 'helps',
 'computers',
 'understand',
 ',',
 'process',
 'and',
 'create',
 'human',
 'language',
 'in',
 'a',
 'way',
 'that',
 'makes',
 'sense',
 'and',
 'is',
 'useful',
 '.',
 'with',
 'the',
 'growing',
 'amount',
 'of',
 'text',
 'data',
 'from',
 'social',
 'media',
 ',',
 'websites',
 'and',
 'other',
 'sources',
 ',',
 'nlp',
 'is',
 'becoming',
 'a',
 'key',
 'tool',
 'to',
 'gain',
 'insights',
 'and',
 'automate',
 'tasks',
 'like',
 'analyzing',
 'text',
 'or',
 'translating',
 'languages',
 '.']

2. Remove stop words and non-alphabetic tokens

Stop words in NLP are common, high-frequency words like "a," "the," and "is" that are often removed from text because they add little meaning to the core content.

Removing them reduces noise and computational load, helping NLP tasks like search, summarization, and machine learning focus on more significant words.

In [29]:
filtered_tokens = [t for t in lower_tokens if t.isalpha() and t not in stop_words]


In [30]:
print(filtered_tokens)

['natural', 'language', 'processing', 'nlp', 'field', 'combines', 'computer', 'science', 'artificial', 'intelligence', 'language', 'studies', 'helps', 'computers', 'understand', 'process', 'create', 'human', 'language', 'way', 'makes', 'sense', 'useful', 'growing', 'amount', 'text', 'data', 'social', 'media', 'websites', 'sources', 'nlp', 'becoming', 'key', 'tool', 'gain', 'insights', 'automate', 'tasks', 'like', 'analyzing', 'text', 'translating', 'languages']
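
One side effect of the `t.isalpha()` check in the filter above is that it drops every token containing a digit or a hyphen, not only punctuation. The sketch below illustrates this with made-up tokens and a one-word stand-in for the stop-word list (both are assumptions for illustration, not data from this notebook). The `loose` variant is one possible alternative, not something the notebook itself uses.

```python
# Hypothetical tokens, chosen to show what str.isalpha() rejects.
tokens = ["covid-19", "3d", "e-mail", "nlp", "data", "the", "."]
stop_words = {"the"}  # tiny stand-in for NLTK's English stop-word list

# Same filter as in the notebook cell: letters only, and not a stop word.
strict = [t for t in tokens if t.isalpha() and t not in stop_words]

# Looser variant (an assumption, not part of the notebook): keep any token
# with at least one letter or digit, so only pure punctuation is dropped.
loose = [t for t in tokens
         if any(ch.isalnum() for ch in t) and t not in stop_words]

print(strict)  # ['nlp', 'data']
print(loose)   # ['covid-19', '3d', 'e-mail', 'nlp', 'data']
```

Which variant is appropriate depends on whether tokens such as product codes or hyphenated terms carry meaning for the task at hand.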


3. Stemming and lemmatization

Stemming and lemmatization are two text normalization techniques in NLP that reduce words to a **base or root form**.

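**Stemming** uses simple heuristics to cut off word endings, which can result in a non-word (for example, "languag" for "language").

**Lemmatization** uses a dictionary and the word's part of speech to find a valid base form (the lemma). It is usually more accurate but slower, while stemming is faster but cruder. Note that NLTK's `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verb forms such as "growing" and "becoming" are returned unchanged when it is called with the word alone.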
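
The difference between the two approaches can be sketched in plain Python. This is a toy illustration only: `toy_stem`, `toy_lemmatize`, the suffix list, and the mini-dictionary are all hypothetical, and they are far simpler than the Porter algorithm or WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "es", "s")  # hypothetical suffix list, checked in order

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
TOY_LEMMAS = {"studies": "study", "media": "medium", "making": "make"}

def toy_lemmatize(word: str) -> str:
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["studies", "media", "making", "languages"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['studi', 'media', 'mak', 'languag']
print(lemmas)  # ['study', 'medium', 'make', 'languages']
```

The rule-based function needs no vocabulary but produces non-words such as "studi", while the lookup always returns real words but only for forms its dictionary covers ("languages" is left untouched here).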

In [31]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply each normalizer to every filtered token
stemmed_word = [stemmer.stem(t) for t in filtered_tokens]
lemmatized_word = [lemmatizer.lemmatize(t) for t in filtered_tokens]

In [32]:
print(stemmed_word)

['natur', 'languag', 'process', 'nlp', 'field', 'combin', 'comput', 'scienc', 'artifici', 'intellig', 'languag', 'studi', 'help', 'comput', 'understand', 'process', 'creat', 'human', 'languag', 'way', 'make', 'sens', 'use', 'grow', 'amount', 'text', 'data', 'social', 'media', 'websit', 'sourc', 'nlp', 'becom', 'key', 'tool', 'gain', 'insight', 'autom', 'task', 'like', 'analyz', 'text', 'translat', 'languag']


In [33]:
print(lemmatized_word)

['natural', 'language', 'processing', 'nlp', 'field', 'combine', 'computer', 'science', 'artificial', 'intelligence', 'language', 'study', 'help', 'computer', 'understand', 'process', 'create', 'human', 'language', 'way', 'make', 'sense', 'useful', 'growing', 'amount', 'text', 'data', 'social', 'medium', 'website', 'source', 'nlp', 'becoming', 'key', 'tool', 'gain', 'insight', 'automate', 'task', 'like', 'analyzing', 'text', 'translating', 'language']


In [34]:
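# Sentence segmentation with spaCy
nlp = spacy.load('en_core_web_sm')
doc = nlp(sentence)

# Use a separate loop variable so the original `sentence` string is not overwritten
for sent in doc.sents:
    print(sent)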

Natural Language Processing (NLP) is a field that combines computer science, artificial intelligence and language studies.
It helps computers understand, process and create human language in a way that makes sense and is useful.
With the growing amount of text data from social media, websites and other sources, NLP is becoming a key tool to gain insights and automate tasks like analyzing text or translating languages.


In [None]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 67.5 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
from gensim.models import Word2Vec
sentences = [['natural', 'language', 'processing', 'nlp', 'field', 'combine', 'computer', 'science', 'artificial', 'intelligence', 'language', 'study', 'help', 'computer', 'understand', 'process', 'create', 'human', 'language', 'way', 'make', 'sense', 'useful', 'growing', 'amount', 'text', 'data', 'social', 'medium', 'website', 'source', 'nlp', 'becoming', 'key', 'tool', 'gain', 'insight', 'automate', 'task', 'like', 'analyzing', 'text', 'translating', 'language']]
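# min_count=1 keeps every word; the default would discard words that appear fewer than 5 times
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=4)
print(model.wv['natural'])  # 50-dimensional vector for 'natural'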

[-0.01037544 -0.01479442 -0.00582388 -0.00172863  0.00705572  0.01948378
 -0.00677855  0.00380354  0.01936202  0.00306317  0.00197299  0.01960474
  0.01859092  0.01541613 -0.01234105  0.01996797  0.01169797  0.01814533
 -0.0039904   0.00669987  0.01366711 -0.00778751  0.0132857   0.00512573
  0.01862746 -0.00607161 -0.00621874  0.01243078 -0.01815649 -0.01450798
 -0.01300005 -0.00149815 -0.00472604  0.01363104  0.01847318 -0.00181952
  0.00282565  0.00404071 -0.0040396  -0.01606868  0.0148821  -0.0085958
  0.00915304  0.01817941  0.00608644  0.00627758  0.00812366 -0.00540243
  0.00764954  0.00067525]
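
`Word2Vec` expects an iterable of token lists and treats each inner list as one sentence. The cell above passes all tokens as a single list, so context windows run across the original sentence boundaries. A minimal sketch of building one token list per sentence follows. It assumes a naive regex-based split rather than NLTK's tokenizers, and the sample text and stop-word set are made up for illustration.

```python
import re

# Hypothetical two-sentence sample text.
text = ("NLP combines computer science and language studies. "
        "It helps computers understand human language.")

STOP_WORDS = {"and", "it"}  # tiny stand-in for a real stop-word list

# Naive sentence split: break after ., ! or ? followed by whitespace.
raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# One token list per sentence: lowercase, letters only, stop words removed.
corpus = [
    [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
    for s in raw_sentences
]

print(corpus)
# [['nlp', 'combines', 'computer', 'science', 'language', 'studies'],
#  ['helps', 'computers', 'understand', 'human', 'language']]
```

A nested list shaped like `corpus` could then be passed to `Word2Vec` in place of the single-list `sentences` used in the cell above.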


In [None]:
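# Each element of `sentences` is a list of words, so indexing model.wv with it
# returns a 2-D array with one row (vector) per word.
for sent_tokens in sentences:
    print(model.wv[sent_tokens])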

[[-0.01037544 -0.01479442 -0.00582388 ... -0.00540243  0.00764954
   0.00067525]
 [-0.00108949  0.00043467  0.01020348 ...  0.019192    0.00998422
   0.01846741]
 [ 0.01946244 -0.01958262 -0.01299524 ...  0.00045938  0.01892731
  -0.00520865]
 ...
 [-0.01632091  0.00896744 -0.00827308 ... -0.01409276  0.00181619
   0.012786  ]
 [ 0.00018957  0.0061436  -0.01361971 ... -0.00540813 -0.00872874
  -0.00206447]
 [-0.00108949  0.00043467  0.01020348 ...  0.019192    0.00998422
   0.01846741]]
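
Word vectors such as these are commonly compared with cosine similarity. The sketch below computes it in plain Python on made-up three-dimensional vectors, not on the model's output. The helper name `cosine_similarity` is an assumption for illustration. gensim offers `model.wv.similarity(word1, word2)` for trained vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors chosen to make the geometry obvious.
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # same direction as v1, so similarity is close to 1
v3 = [0.0, 0.0, 5.0]  # orthogonal to v1, so similarity is 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With a training corpus as small as the single sentence used here, similarities between the learned vectors should not be expected to reflect real word meaning.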
