1. *Tokenization:*
   - NLTK provides tools for breaking down a text into words or sentences. Tokenization is a fundamental step in natural language processing.

2. *Stopwords:*
   - NLTK includes a list of common stopwords for various languages. Stopwords are words that are commonly used in a language but typically do not contribute much to the meaning of a text (e.g., "the," "and," "is").

3. *Stemming and Lemmatization:*
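   - NLTK supports stemming and lemmatization, two techniques for reducing words to a base form. Stemming strips suffixes with heuristic rules, so the result may not be a real word; lemmatization maps a word to its dictionary form. Both shrink the vocabulary, which reduces the dimensionality of the data and can improve the performance of certain NLP tasks.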

4. *Part-of-Speech Tagging:*
   - NLTK provides tools for assigning parts of speech (e.g., noun, verb, adjective) to words in a text. Part-of-speech tagging is crucial for understanding the grammatical structure of a sentence.

5. *Named Entity Recognition (NER):*
   - NLTK includes tools for identifying named entities (e.g., persons, organizations, locations) in text. NER is important for extracting structured information from unstructured text.

6. *WordNet Interface:*
   - NLTK interfaces with WordNet, a lexical database of the English language. WordNet allows for exploration of word meanings, synonyms, antonyms, and more.

7. *Frequency Distribution and Concordance:*
   - NLTK facilitates the analysis of word frequency distributions in a text and provides tools for examining concordances (occurrences of a word in context).

8. *Collocations:*
   - NLTK supports the identification of collocations, which are sequences of words that often appear together. Collocations can provide insights into the relationships between words.

9. *Parsing:*
   - NLTK includes parsers for syntactic analysis, which let you analyze the grammatical structure of sentences against a grammar.

10. *Machine Learning with NLTK:*
    - NLTK provides tools for integrating machine learning techniques into natural language processing tasks, including classification, clustering, and sentiment analysis.
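
As one illustration of the features listed above, the parsing item can be sketched with a toy context-free grammar. This is a minimal sketch, assuming `nltk` is importable in the current environment; the grammar and the sentence are hypothetical and chosen only to keep the example small. It needs no corpus downloads.

```python
from nltk import CFG, ChartParser

# Hypothetical toy grammar: one sentence pattern, a handful of words.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = ChartParser(grammar)
sentence = "the dog chased the cat".split()

# parse() yields every tree the grammar licenses for the token list.
trees = list(parser.parse(sentence))
for tree in trees:
    print(tree)
```

Every word in the input must be covered by the grammar, otherwise the parser raises an error, so a grammar this small only works for sentences built from its own vocabulary.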


In [1]:
!pip install nltk
!pip install spacy




In [2]:
import nltk
import spacy


In [3]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Text Tokenization and Stopwords

In [4]:
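from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Tokenizer models needed by word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

text = "NLTK is a powerful library for natural language processing."
tokens = word_tokenize(text)

# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]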

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Original Tokens: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', '.']
Filtered Tokens: ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', '.']
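
The filtered tokens can feed the stemming and frequency-distribution features from the overview. The following is a minimal sketch rather than part of the original notebook: the token list is copied from the output above and hard-coded so the block runs without any downloads, and `sample_tokens` is a hypothetical list added only so that one word repeats.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer

# Filtered tokens copied from the cell output above (hard-coded so the
# sketch does not depend on the punkt or stopwords downloads).
filtered_tokens = ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', '.']

# Stemming: strip suffixes heuristically; stems need not be dictionary words.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in filtered_tokens]
print("Stems:", stems)

# Frequency distribution over a hypothetical token list with one repeat.
sample_tokens = ["text", "mining", "turns", "raw", "text", "into", "structured", "data"]
freq = FreqDist(sample_tokens)
print("Count of 'text':", freq["text"])
print("Most common:", freq.most_common(1))
```

`FreqDist` behaves like a counter keyed by token, so `freq["text"]` returns how many times that token occurs.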


spaCy - Named Entity Recognition (NER)

In [6]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)

In [8]:
nlp = spacy.load('en_core_web_sm')

sentence = "Albert Einstein was born on March 14, 1879, in Ulm, Germany."
doc = nlp(sentence)

print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")


Named Entities:
Albert Einstein - PERSON
March 14, 1879 - DATE
Ulm - GPE
Germany - GPE
