SPACY

SpaCy is an open-source Natural Language Processing (NLP) library in Python, designed for fast and efficient text processing. It is widely used for tokenization, part-of-speech (POS) tagging, named entity recognition (NER), dependency parsing, text classification, and more.


We use SpaCy for the following reasons:
1) It is fast and efficient.
2) It has pretrained models, which are trained on large datasets.
3) It is easy to use.
4) It can be integrated with TensorFlow and PyTorch to apply deep learning to datasets.


Install SpaCy using "pip install spacy"

Step 1: Loading a Language Model
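SpaCy needs a language model to process text. Pretrained models are installed separately from the library (for example, "python -m spacy download en_core_web_sm" for the small English model). Once loaded, the model lets us analyze text using various NLP techniques.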

Step 2: Processing Text with SpaCy
Once the model is loaded, we can pass text to it for analysis. SpaCy breaks down the text into structured components like words, sentences, and punctuation.

Step 3: Tokenization
Tokenization is the process of splitting a sentence into words, punctuation marks, and symbols. Each of these elements is called a token.
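
To make the idea of a token concrete, here is a minimal, library-free sketch. It is not SpaCy's tokenizer, which applies language-specific rules and exceptions; the `simple_tokenize` helper is a hypothetical name used only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation or symbol character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Set an alarm for 6 AM tomorrow.")
print(tokens)
```

Each word and the final period come back as separate tokens. A rule this simple will mishandle cases such as contractions or URLs, which is one reason to prefer a library tokenizer in practice.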

Step 4: Part-of-Speech (POS) Tagging
Each word in a sentence has a grammatical role, such as a noun, verb, adjective, or adverb. SpaCy automatically identifies the part of speech (POS) for each token.

Step 5 : Lemmatization
Lemmatization is the process of reducing a word to its base or dictionary form. For example, "running" becomes "run", and "children" becomes "child".
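
The idea can be sketched as a dictionary lookup. The table below is a hypothetical, hand-written stand-in for the vocabulary a real lemmatizer uses, and `lookup_lemma` is an illustrative name, not a SpaCy function.

```python
# Hypothetical mini lookup table; a real lemmatizer relies on a full
# vocabulary and usually on the word's part of speech as well.
LEMMA_TABLE = {
    "running": "run",
    "children": "child",
    "better": "good",
}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for word in ["running", "Children", "table"]:
    print(word, "->", lookup_lemma(word))
```

Words missing from the table are returned unchanged (apart from lowercasing), which shows why coverage of the underlying dictionary matters.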

Step 6 : Named Entity Recognition (NER)
Named Entity Recognition (NER) identifies important names, dates, places, organizations, and monetary amounts in text.

Step 7 : Dependency Parsing
Dependency parsing examines how words in a sentence relate to each other. It identifies the main verb and its subjects, objects, and modifiers.
Example: for the sentence "Set an alarm for 6 AM tomorrow."
"Set" is the main verb
"alarm" is the object
"6 AM" is the time
"tomorrow" is an additional detail

Step 8 : Sentence Segmentation
This step involves splitting text into meaningful sentences. Many NLP applications need to process one sentence at a time.

Step 9 : Stopword Removal
Stopwords are common words like "the", "is", "in", and "of" that don't add much meaning to a sentence. Removing them helps improve efficiency in text analysis.
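
A minimal sketch of the filtering step, assuming a tiny hand-picked stopword set (the `STOPWORDS` set and `remove_stopwords` helper are illustrative; libraries ship much longer lists):

```python
# A tiny hand-picked stopword set, for illustration only.
STOPWORDS = {"the", "is", "in", "of", "a", "an", "and"}

def remove_stopwords(tokens):
    """Keep only tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "cat", "is", "in", "the", "garden"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Only the content-bearing words remain, so later steps have fewer tokens to process.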

Step 10 : Word Similarity
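Word similarity measures how close in meaning two words are, typically by comparing word vectors learned from the contexts in which the words appear. In SpaCy, meaningful similarity scores need a model that includes word vectors. This is useful for semantic search and recommendations.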
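
One common way to compare two word vectors is cosine similarity. The sketch below assumes made-up 3-dimensional vectors purely for illustration; real models use vectors with hundreds of dimensions learned from large corpora.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only, not taken from any model.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_car = cosine_similarity(vectors["cat"], vectors["car"])
print("cat vs dog:", sim_cat_dog)
print("cat vs car:", sim_cat_car)
```

With these toy values, "cat" scores closer to "dog" than to "car", which is the behavior a similarity measure is expected to show.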

Step 11: Custom NLP Pipelines
A pipeline in SpaCy consists of multiple NLP processes executed in sequence. You can also create custom components.
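
The "executed in sequence" idea can be sketched with plain functions. This is a simplified illustration with hypothetical component names; in SpaCy itself, components receive and return a Doc object rather than a list of strings.

```python
def lowercase(tokens):
    """Component 1: convert every token to lowercase."""
    return [t.lower() for t in tokens]

def drop_punctuation(tokens):
    """Component 2: keep only purely alphanumeric tokens."""
    return [t for t in tokens if t.isalnum()]

def run_pipeline(tokens, components):
    """Feed the output of each component into the next one."""
    for component in components:
        tokens = component(tokens)
    return tokens

pipeline = [lowercase, drop_punctuation]
result = run_pipeline(
    ["Set", "an", "alarm", "for", "6", "AM", "tomorrow", "."], pipeline
)
print(result)
```

Adding a custom step then amounts to writing one more function and placing it at the right position in the list.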

Step 12: Saving and Loading Models
Once you train a SpaCy model, you can save it and reuse it later without retraining.

NLTK

Install NLTK using "pip install nltk"

Step 1: Loading and Importing Text Data
Once NLTK is installed, you need to load text data for analysis. This can be from books, documents, web articles, or even live-streaming data. 

Step 2 : Tokenization
Tokenization splits text into smaller parts, such as words or sentences. There are two main types:

Word Tokenization (splitting text into individual words)
Sentence Tokenization (splitting text into separate sentences)

Step 3 : Stopword Removal
Stopwords are common words like "is", "the", "in", and "and" that do not add much meaning to a sentence. Removing them improves efficiency.


Step 4 : Stemming
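Stemming reduces a word to a root form by stripping suffixes with simple rules. The result depends on the stemmer and is not always a real word or the correct root. For example, an aggressive suffix stripper gives: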

"running" → "run"
"better" → "bet"

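A crude suffix stripper along these lines can be sketched as follows. The suffix list and the `naive_stem` helper are assumptions made for illustration; this is not one of NLTK's stemmers, which use more careful rule sets.

```python
SUFFIXES = ("ing", "er", "ed", "s")

def naive_stem(word):
    """Strip one known suffix, then undouble a trailing consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # "runn" becomes "run", "bett" becomes "bet".
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

for word in ["running", "better", "jumped", "cats"]:
    print(word, "->", naive_stem(word))
```

The "better" case shows the weakness of pure suffix stripping: the output is a different word that has nothing to do with the original meaning.
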
Step 5: Lemmatization
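Lemmatization is similar to stemming but uses a vocabulary to convert words to their dictionary form. In NLTK, the WordNet lemmatizer needs the part of speech to get the results below (for example, "better" maps to "good" only when it is treated as an adjective).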

"running" → "run"
"better" → "good"

Step 6: Part-of-Speech (POS) Tagging
POS tagging assigns grammatical roles to words, such as noun, verb, adjective, and adverb.

Step 7 : Named Entity Recognition (NER)
NER identifies names, locations, dates, organizations, and monetary values in a text.

Step 8 : Synonyms and Word Relationships with WordNet
NLTK includes WordNet, a large lexical database of English words. It helps find synonyms, antonyms, and word relationships.

Step 9: Sentence Segmentation
Sentence segmentation breaks a large document into separate sentences for easier processing.
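
A rule-based sketch of the idea, assuming that a sentence ends at ".", "!" or "?" followed by whitespace. The `split_sentences` helper is hypothetical; trained tokenizers handle abbreviations such as "Dr." that this rule would split incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "I am Anand. I am a student of KMIT."
sentences = split_sentences(text)
print(sentences)
```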

Step 10 : Sentiment Analysis
Sentiment analysis determines if a text expresses positive, negative, or neutral emotion.

Step 11 : Text Classification
Text classification involves categorizing text into predefined categories using machine learning models.
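
As one possible illustration, here is a small bag-of-words Naive Bayes classifier written from scratch. The training sentences, labels, and function names are made up for this sketch; it is not NLTK's classifier API.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration only.
training_data = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot terrible acting", "neg"),
]
model = train_naive_bayes(training_data)
print(classify("great acting", model))
print(classify("boring movie", model))
```

With only four training sentences the model is a toy, but the steps (count words per category, then score new text against each category) are the same ones a larger classifier follows.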

Step 12: Custom NLP Pipelines and Model Training
NLTK allows users to build custom NLP models for text classification, chatbot training, and predictive text analysis.

TEXTBLOB

Install TextBlob using "pip install textblob"

Step 1 : Creating a TextBlob Object
TextBlob processes text by converting it into an object that allows for various NLP operations.

Step 2 : Tokenization (Splitting Text into Words & Sentences)
TextBlob can split text into individual words and sentences, making it useful for text analysis, summarization, and keyword extraction.

Step 3: Part-of-Speech (POS) Tagging
Each word in a sentence is labeled with its grammatical category (noun, verb, adjective, etc.), which helps in understanding sentence structure and meaning.

Step 4 : Noun Phrase Extraction
TextBlob can identify important noun phrases in a sentence. These are groups of words that include a noun and describe key elements of the text.

Step 5 : Sentiment Analysis
TextBlob provides a sentiment polarity score (ranging from -1 for negative to +1 for positive) and a subjectivity score (ranging from 0 to 1, showing how opinion-based the text is).
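
To show how a polarity score in that range can arise, here is a toy lexicon-based scorer. The word scores and the `polarity` helper are invented for this sketch and do not reproduce TextBlob's own lexicon or algorithm.

```python
# Hypothetical word scores in the range [-1, 1], for illustration only.
LEXICON = {
    "good": 0.7,
    "great": 0.8,
    "love": 0.5,
    "bad": -0.7,
    "terrible": -1.0,
    "hate": -0.8,
}

def polarity(text):
    """Average the scores of known words; 0.0 if none are known."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

for sentence in ["I love this great phone", "terrible service", "the sky"]:
    print(sentence, "->", polarity(sentence))
```

Because the result is an average of values between -1 and +1, it always stays inside that range, and text with no scored words comes out neutral.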

Step 6: Spelling Correction
TextBlob can detect and correct spelling mistakes automatically.
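
One simple way to approximate this is to pick the closest word from a known vocabulary. The sketch below uses Python's standard `difflib` module with a tiny, made-up vocabulary; it is an illustration of the idea, not TextBlob's correction method.

```python
import difflib

# Tiny made-up vocabulary; a real corrector uses a large word list.
VOCABULARY = ["spelling", "mistake", "automatic", "correct"]

def correct_word(word, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the input if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word

for word in ["speling", "mistaek", "python"]:
    print(word, "->", correct_word(word))
```

The `cutoff` value controls how similar a candidate must be before it replaces the input, so words outside the vocabulary are left alone instead of being forced onto a poor match.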

Step 7 : Word & Sentence Frequency Analysis
TextBlob can count the occurrences of words and sentences in a given text.
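
The counting itself can be sketched with the standard library alone, here using `collections.Counter` on a regex word split (a library-free illustration rather than TextBlob's own interface):

```python
import re
from collections import Counter

text = "I am Anand. I am a student of KMIT."

# \w+ keeps runs of word characters and drops punctuation.
words = re.findall(r"\w+", text)
counts = Counter(words)

print(counts)
print("total words:", sum(counts.values()))
```

`Counter` behaves like a dictionary from word to count, so `counts["am"]` gives the number of times "am" appears.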

Step 8 : Translation & Language Detection
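Older TextBlob releases supported automatic translation between languages and language detection through the Google Translate API. These features were deprecated in later releases, so check whether your installed version still provides them.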

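Step 9 : Named Entity Recognition (NER) with TextBlob
Named Entity Recognition identifies names of people, locations, organizations, and dates within text. TextBlob does not ship a dedicated NER component, so noun phrase extraction (Step 4) is typically used as a rough approximation, or TextBlob is combined with NLTK or SpaCy for this step.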

Step 10: Text Classification
TextBlob allows machine learning-based text classification, where text is categorized into predefined groups.

Step 11: Working with Custom Text Processing Pipelines
Advanced users can train custom models using TextBlob for specialized NLP tasks.

In [4]:
pip install nltk


In [1]:
import nltk

In [2]:
dir(nltk)

['ARLSTem',
 'ARLSTem2',
 'AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 ...

In [7]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [15]:
import nltk
nltk.download('punkt', download_dir='D:/nltk_data')


[nltk_data] Downloading package punkt to D:/nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [5]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [6]:
import nltk

# Tell NLTK where the 'punkt' data was downloaded
nltk.data.path.append('D:/nltk_data')

text = "I am Anand. I am a student of KMIT."

# Split the text into sentences
sentences = sent_tokenize(text)

print(sentences)


['I am Anand.', 'I am a student of KMIT.']


In [7]:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

text = "I am Anand. I am a student of KMIT."

sentences = tokenizer.tokenize(text)

print(sentences)


['I am Anand.', 'I am a student of KMIT.']


In [8]:
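from nltk.tokenize import TreebankWordTokenizer
treetokenizer = TreebankWordTokenizer()
# The Treebank tokenizer only splits off the final period of the text,
# so 'Anand.' stays as one token in the output
treetokenizer.tokenize(text)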

['I', 'am', 'Anand.', 'I', 'am', 'a', 'student', 'of', 'KMIT', '.']

In [9]:
import nltk
from nltk.tokenize import word_tokenize
word_tokenize(text)

['I', 'am', 'Anand', '.', 'I', 'am', 'a', 'student', 'of', 'KMIT', '.']

In [10]:
text1 = 'This is my first sentence. This is my second sentence. Is this third one?'

In [11]:
s = sent_tokenize(text1)
s

['This is my first sentence.',
 'This is my second sentence.',
 'Is this third one?']

In [12]:
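from nltk.tokenize import PunktSentenceTokenizer

# Create the tokenizer here so the cell does not depend on an earlier one
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text1)
print(sentences)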

['This is my first sentence.', 'This is my second sentence.', 'Is this third one?']


In [13]:
from nltk.tokenize import TreebankWordTokenizer
treetokenizer = TreebankWordTokenizer()
treetokenizer.tokenize(text1)

['This',
 'is',
 'my',
 'first',
 'sentence.',
 'This',
 'is',
 'my',
 'second',
 'sentence.',
 'Is',
 'this',
 'third',
 'one',
 '?']

In [14]:
word_tokenize(text1)

['This',
 'is',
 'my',
 'first',
 'sentence',
 '.',
 'This',
 'is',
 'my',
 'second',
 'sentence',
 '.',
 'Is',
 'this',
 'third',
 'one',
 '?']

In [15]:
import re

words = re.findall(r'\w+', text1)
print(words)

print('count : ',len(words))

['This', 'is', 'my', 'first', 'sentence', 'This', 'is', 'my', 'second', 'sentence', 'Is', 'this', 'third', 'one']
count :  14


In [16]:
words = re.findall(r'\w+', text)
print(words)

print('count : ',len(words))

['I', 'am', 'Anand', 'I', 'am', 'a', 'student', 'of', 'KMIT']
count :  9


In [17]:
word_tokenize("Hello KMIT! I am Anand")

['Hello', 'KMIT', '!', 'I', 'am', 'Anand']

In [18]:
low = text1.lower()
low

'this is my first sentence. this is my second sentence. is this third one?'

In [19]:
words = re.findall(r'\w+', text)
words,len(words)

(['I', 'am', 'Anand', 'I', 'am', 'a', 'student', 'of', 'KMIT'], 9)

In [20]:
# splitting each word manually
ans_list = []
list_of_words = text.split(" ")
for word in list_of_words:
    # raw string avoids the invalid "\." escape warning
    inner_words = re.split(r",|\.", word)
    for inner_most in inner_words:
        if inner_most != '':
            ans_list.append(inner_most)
ans_list

['I', 'am', 'Anand', 'I', 'am', 'a', 'student', 'of', 'KMIT']

In [21]:
## frequency of words
freq = {}  # named 'freq' to avoid shadowing the built-in map()
for word in ans_list:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
freq

{'I': 2, 'am': 2, 'Anand': 1, 'a': 1, 'student': 1, 'of': 1, 'KMIT': 1}