# Natural Language Processing (NLP)
#### Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and human (natural) languages.
## Comprehensive NLP Pipeline
#### A well-defined NLP pipeline involves several processes that transform raw text into structured, meaningful data. Here's the sequence of processes commonly followed:
### 1. Text Input: Raw text data from documents, web pages, social media, etc.
#### • The text data is collected and fed into the system for processing.
### 2. Text Preprocessing: The text is cleaned and prepared for further analysis.
#### • Tokenization: The text is split into smaller units such as words or sentences.
#### • Lowercasing: All characters are converted to lowercase for consistency.
#### • Stopword Removal: Common, non-informative words (like "and" or "the") are removed.
#### • Punctuation Removal: Unnecessary punctuation marks (e.g., commas, periods) are removed.
#### • Stemming/Lemmatization: Words are reduced to their root form (e.g., "running" → "run").
#### • Spell Correction: Misspelled words are corrected to their correct form.
### 3. Text Representation: Converting text into a machine-readable format.
#### • Bag of Words (BoW): The text is represented as a vector with the frequency of each word.
#### • TF-IDF: Weights are applied to words based on their frequency and importance in the document.
#### • Word Embeddings (e.g., Word2Vec, GloVe): Words are converted into dense vectors that capture their meaning.
#### • Sentence Embeddings (e.g., BERT): Entire sentences are represented as vectors that capture the overall meaning.
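
The Bag of Words and TF-IDF representations above can be sketched with the standard library alone. This is a minimal illustration, not a production implementation: the three-document corpus and the helper name `tf_idf` are assumptions made up for the example, and the weighting shown (term frequency times `log(N / df)`) is only one common textbook variant.

```python
import math
from collections import Counter

# Tiny illustrative corpus of already-tokenized documents (an assumption)
docs = [
    ["i", "love", "programming"],
    ["programming", "is", "fun"],
    ["i", "love", "fun"],
]

# Bag of Words: one Counter of term frequencies per document
bow = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc_index):
    """Relative term frequency times log(N / df), for terms present in the corpus."""
    tf = bow[doc_index][term] / len(docs[doc_index])
    idf = math.log(len(docs) / df[term])
    return tf * idf

print(bow[0])
print(round(tf_idf("is", 1), 3))           # "is" occurs in one document only
print(round(tf_idf("programming", 1), 3))  # "programming" occurs in two documents
```

In this sketch, a word that appears in fewer documents receives a higher weight than an equally frequent word that appears in many documents.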
### 4. Named Entity Recognition (NER): Identifying entities such as names, dates, and organizations.
#### • Specific entities such as people, locations, and dates are recognized and classified.
### 5. Syntactic Analysis: Analyzing the grammatical structure of the text.
#### • Part-of-Speech (POS) Tagging: Each word is labeled with its grammatical role (e.g., noun, verb).
#### • Dependency Parsing: The relationships between words in a sentence are identified.
#### • Constituency Parsing: The sentence structure is broken down into nested components (e.g., noun phrases, verb phrases).
### 6. Feature Engineering: Extracting and selecting meaningful features to enhance model performance.
#### • N-gram Extraction: Word combinations (e.g., bigrams, trigrams) are generated for analysis.
#### • TF-IDF Scaling: Weights are applied to words to highlight their importance in the context.
#### • Custom Features: Specific features are created based on domain knowledge or problem requirements.
#### • Sentiment Scores: Emotional tone (positive, negative, neutral) is extracted from the text.
#### • POS Tags as Features: Part-of-speech tags are included as features to improve model predictions.
#### • Text Length Metrics: Features like sentence or document length are included for better analysis.
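
As a rough sketch of the N-gram extraction step listed above, the snippet below slides a window of size `n` over a token list. The function name `ngrams` and the sample tokens are assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Sample tokens (an assumption, not data from the notebook)
tokens = ["i", "love", "natural", "language", "processing"]

bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # triples of adjacent words
print(bigrams)
print(trigrams)
```

If `n` is larger than the number of tokens, the function returns an empty list.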
### 7. Task-Specific Processes: Processing based on the desired NLP task.
#### • Sentiment Analysis: The emotional tone of the text is classified (positive, negative, neutral).
#### • Text Summarization: A concise version of the text is generated.
#### • Text Classification: The text is assigned to a predefined category or label.
#### • Machine Translation: The text is translated from one language to another.
### 8. Output Generation: Generating meaningful insights or predictions based on the processed text.
#### • The system produces structured output such as predictions, translations, summaries, or classifications.
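
To make the flow from raw text to structured data concrete, here is a minimal end-to-end sketch of the first few pipeline stages (input, preprocessing, representation) using only the standard library. The tiny `STOP_WORDS` set, the `preprocess` helper, and the regular expression are illustrative assumptions; the NLTK-based cells later in this notebook cover the same steps with proper tooling.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop word set (an assumption; real lists are much longer)
STOP_WORDS = {"i", "and", "it", "is", "the", "a"}

def preprocess(text):
    """Lowercase, tokenize, drop punctuation tokens, then drop stop words."""
    # Words become one token each; every punctuation mark becomes its own token
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    return [t for t in tokens if t not in STOP_WORDS]

text = "I love programming, and it is fun!"  # 1. Text input
tokens = preprocess(text)                    # 2. Text preprocessing
features = Counter(tokens)                   # 3. Bag-of-words representation
print(tokens)
print(features)
```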




## 1. Tokenization
#### • Word Tokenization: Breaks the text into words.
#### • Sentence Tokenization: Splits the text into sentences.
#### • Character Tokenization: Breaks the text into individual characters.
#### • Whitespace Tokenization: Splits based on spaces between words.
#### • Punctuation-Aware Tokenization: Treats punctuation marks as separate tokens.

In [9]:
# Install NLTK (the leading "!" runs a shell command from a notebook cell)
!pip install nltk
import nltk
nltk.download('punkt_tab')  # Download the data needed for tokenization





[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\neeru\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [13]:
from nltk.tokenize import word_tokenize
text = "Neeraja love programming!"
words = word_tokenize(text)  # Tokenize into words
print(words)

['Neeraja', 'love', 'programming', '!']


In [17]:
# Sentence tokenization splits a text into sentences.
from nltk.tokenize import sent_tokenize
# Sample text
text = "I love programming. It's fun!"
# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokens:", sentences)

Sentence Tokens: ['I love programming.', "It's fun!"]


In [19]:
# Character tokenization splits text into individual characters.
# Example word
word = "programming"
# Character Tokenization (manual splitting)
char_tokens = list(word)
# Print the result
print("Character Tokenization:", char_tokens)

Character Tokenization: ['p', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']


In [1]:
# Whitespace tokenization splits the text on whitespace characters.
# Example sentence
text = "I love programming."
# Whitespace tokenization (punctuation stays attached to the neighboring word)
whitespace_tokens = text.split()
# Print the result
print("Whitespace Tokenization:", whitespace_tokens)

Whitespace Tokenization: ['I', 'love', 'programming.']


In [21]:
# Punctuation-aware tokenization handles punctuation separately from words.
# Example sentence with punctuation
text = "Hello! How are you doing?"
# Word Tokenization (with punctuation handling)
word_tokens_with_punct = word_tokenize(text)
# Print the result
print("Punctuation-Aware Tokenization:", word_tokens_with_punct)

Punctuation-Aware Tokenization: ['Hello', '!', 'How', 'are', 'you', 'doing', '?']


In [25]:
from nltk.corpus import stopwords
# Example sentence (tokens)
tokens = ["I", "love", "programming", "and", "it", "is", "fun"]
tokens

['I', 'love', 'programming', 'and', 'it', 'is', 'fun']

In [11]:
import nltk
nltk.download('punkt')  # This is the standard Punkt tokenizer model


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\neeru\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
# Get the list of stop words in English
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [29]:
# Print the stop words list (optional)
print("Stop Words List:", stop_words)

Stop Words List: {'mustn', 'on', 'whom', 'didn', "don't", 'an', 'his', 'll', 'into', 'by', "we've", 'yours', 'why', 'some', 'these', 'am', "you'd", "i've", 'themselves', 'she', 'at', 'this', 'be', 'yourselves', 'haven', 'ain', "aren't", 'is', "shan't", "didn't", 'when', "that'll", 'own', 'of', 'as', 'most', 'same', 'the', "it's", 'both', 'because', 'aren', 'will', 'isn', 'yourself', 'm', 'that', 'are', "weren't", 'had', 'they', "i'd", "you've", 'all', 'ourselves', "couldn't", "needn't", "mightn't", 'its', 'have', 'being', 'about', 'shan', 'now', 'here', 'ma', "it'd", 'than', 'were', 'then', 'too', "you'll", 'should', 'to', "hasn't", 'has', 'having', "wouldn't", 'off', 'doesn', "they're", "he'd", "they'd", "won't", 'against', 'won', 'there', 'hadn', 'through', 't', "we're", 'been', 'such', "haven't", 's', 'down', 'couldn', 'below', "you're", "mustn't", 'doing', 'our', 'where', 'over', 'but', 'from', 'which', 'can', 'do', 'it', 'further', 'while', 'only', 'for', 'them', 'a', 'me', 'more'

In [31]:
nltk.download('stopwords')  # Download the stop words list (needed before calling stopwords.words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\neeru\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [33]:
# Remove stop words (compare in lowercase, because the stop word list is lowercase)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Original tokens: ["I", "love", "programming", "and", "it", "is", "fun"]
print("Tokens after Stop Word Removal:", filtered_tokens)

Tokens after Stop Word Removal: ['love', 'programming', 'fun']


#### Understanding List Comprehension
[word for word in tokens if word.lower() not in stop_words]

This is a list comprehension in Python: a concise way to create a new list by iterating over an existing iterable and, optionally, keeping only the items that satisfy a condition.

The general form is:

new_list = [expression for item in iterable if condition]
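
To show what the general form expands to, the sketch below writes the stop word filter once as an explicit loop and once as a list comprehension. The small `stop_words` set here is an assumption for illustration; NLTK's English list is much longer.

```python
tokens = ["I", "love", "programming", "and", "it", "is", "fun"]
# Small illustrative stop word set (an assumption, not NLTK's full list)
stop_words = {"i", "and", "it", "is"}

# Explicit loop: iterate, test the condition, append the item
filtered_loop = []
for word in tokens:
    if word.lower() not in stop_words:
        filtered_loop.append(word)

# Equivalent list comprehension: [expression for item in iterable if condition]
filtered_comp = [word for word in tokens if word.lower() not in stop_words]

print(filtered_loop)
print(filtered_loop == filtered_comp)
```

Both versions build the same list; the comprehension simply folds the loop, the condition, and the append into one expression.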

In [45]:
basket = ["apple", "banana", "orange", "kiwi", "grape"]
long_fruits = [fruit for fruit in basket if len(fruit) > 5]

print(long_fruits)


['banana', 'orange']


In [49]:
# Another example: a comprehension without a condition, doubling every number
numbers = [1, 2, 3, 4, 5]
doubled = [num * 2 for num in numbers]
print(doubled)

[2, 4, 6, 8, 10]


### Removing Punctuation

In [53]:
import string
# Remove tokens that are punctuation marks.
# filtered_tokens is a flat list of strings, so iterate over it directly;
# treating each token as a list would split the words into characters.
tokens_no_punct = [word for word in filtered_tokens if word not in string.punctuation]
print(tokens_no_punct)


['love', 'programming', 'fun']


### Removing Special Characters

Special characters such as “#”, “@”, or emoji may need to be removed or kept, depending on their relevance to the task.

In [57]:
import re
# Remove special characters from each token
def remove_special_chars(tokens):
    # Keep only ASCII letters and digits in each token
    return [re.sub(r'[^A-Za-z0-9]+', '', word) for word in tokens]
# filtered_tokens is a flat list of strings, so pass it in directly
reviews_cleaned = remove_special_chars(filtered_tokens)
print(reviews_cleaned)

['love', 'programming', 'fun']


### Stemming

In [59]:
from nltk.stem import PorterStemmer # Stemming algorithm
from nltk.tokenize import word_tokenize # For breaking text into tokens
# The Porter Stemmer is a popular algorithm for stemming in English.
stemmer = PorterStemmer()
# Sample Text
text = "I am loving the process of learning and understanding NLP concepts."
# Tokenize the Text
tokens = word_tokenize(text)
print("Tokens:", tokens)

Tokens: ['I', 'am', 'loving', 'the', 'process', 'of', 'learning', 'and', 'understanding', 'NLP', 'concepts', '.']


In [61]:
# apply stemming to each word in the list of tokens
stemmed_words = [stemmer.stem(word) for word in tokens]
print("Stemmed Words:", stemmed_words)


Stemmed Words: ['i', 'am', 'love', 'the', 'process', 'of', 'learn', 'and', 'understand', 'nlp', 'concept', '.']


### Lemmatization with WordNet

In [64]:
# Import the Required Libraries
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
# We need WordNet and Punkt for tokenization and lemmatization
nltk.download('wordnet') # For lexical database
nltk.download('omw-1.4') # Optional for extended multilingual support
nltk.download('punkt') # For tokenization

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\neeru\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\neeru\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\neeru\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [66]:
# Initialize the Lemmatizer
lemmatizer = WordNetLemmatizer()
# Sample text
text = "The leaves on the tree were falling. She was running quickly but got tired."
# Break sentence into words
tokens = word_tokenize(text)
print("Tokens:", tokens)


Tokens: ['The', 'leaves', 'on', 'the', 'tree', 'were', 'falling', '.', 'She', 'was', 'running', 'quickly', 'but', 'got', 'tired', '.']


In [68]:
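# Apply lemmatization. Without a POS argument every word is treated as a noun,
# which is why "was" becomes "wa" and "running" is left unchanged.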
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatized Words:", lemmatized_words)

Lemmatized Words: ['The', 'leaf', 'on', 'the', 'tree', 'were', 'falling', '.', 'She', 'wa', 'running', 'quickly', 'but', 'got', 'tired', '.']


In [72]:
# Lemmatization (with POS tagging)
# Import necessary libraries
from nltk.corpus import wordnet  # WordNet POS constants
from nltk.tag import pos_tag  # Assigns POS tags to words
from nltk.stem import WordNetLemmatizer  # WordNet-based lemmatizer
# Function to map a word's POS tag to the matching WordNet POS constant
def get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()  # Get the POS tag's first letter
    # Dictionary of {tag prefix: WordNet POS constant}
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # Default to noun if tag is not in the dictionary
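
The function above is defined but not applied in this excerpt. As a rough, NLTK-free sketch of the mapping step on its own, the snippet below converts Penn Treebank tags into the single-letter codes that NLTK's `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV` constants stand for. The helper name `penn_to_wordnet` and the hand-written `(word, tag)` pairs are assumptions for illustration; in a real run the tags would come from `pos_tag(tokens)`, and each code would be passed on as `lemmatizer.lemmatize(word, pos=code)`.

```python
# Single-letter POS codes used by WordNet
# (the values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB, wordnet.ADV in NLTK)
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code; default to noun."""
    return WORDNET_POS.get(tag[:1].upper(), "n")

# Hand-written (word, tag) pairs standing in for the output of pos_tag(tokens)
tagged = [("leaves", "NNS"), ("were", "VBD"), ("falling", "VBG"),
          ("quickly", "RB"), ("tired", "JJ"), ("the", "DT")]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Tagging the whole token list at once, rather than one word at a time as `get_wordnet_pos` does, lets the tagger use the surrounding words as context.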

In [17]:
!pip install spacy


Collecting spacy
  Downloading spacy-3.8.5-cp312-cp312-win_amd64.whl.metadata (28 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.12-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.11-cp312-cp312-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.6-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.1-cp312-cp312-win_amd64

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.2.5 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.5 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.5 which is incompatible.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.2.5 which is incompatible.
