# Foundations of NLP: From Tokenization to Encoding

# 1. Tokenization
Tokenize a paragraph into sentences and words using nltk.sent_tokenize() and nltk.word_tokenize()

In [3]:
# Step-1: Import required libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Step-2: Download the necessary NLTK resources
nltk.download("punkt")

'''
# step-3: Sample Sentence
input_text = " Environment pollution is a complex and urgent global issue that threatens the health of our planet and its " \
"inhabitants . It encompasses air pollution , water pollution, and soil pollution , each with its set of consequences . Air" \
" pollution , primarily caused by industrial activities and transportation, results in respiratory illnesses , cardiovascular" \
" problems , and environmental damage . Water pollution , often due to industrial discharges and inadequate waste management , " \
"endangers aquatic ecosystems and human health . Soil pollution, stemming from the use of chemicals and improper waste disposal , " \
" harms agriculture and food security . Awareness and education play a crucial role in addressing pollution . By adopting eco-friendly " \
"lifestyles , conserving resources, and supporting policies that prioritize environmental protection , we can contribute to a cleaner , " \
" healthier planet . Pollution knows no borders , and it affects us all . It is our responsibility to take action and work together to" \
" ensure a sustainable and pollution-free future for generations to come . Environmental pollution is the contamination of air , water , " \
"and land by harmful substances . The increasing number of factories , vehicles , and the disposal of waste in water bodies are major" \
" causes of pollution . It harms not only the environment but also human health . The burning of fossil fuels and deforestation further" \
" contribute to air pollution and global warming . To reduce pollution , we need to use eco-friendly products , plant trees , and reduce" \
" waste . We must all take responsibility for protecting our environment for a better tomorrow."
'''

# Step-3: Read a paragraph from the user
input_text = input("Enter input_text")

# Step-4: Perform text tokenization

# Word tokenization
word_token = word_tokenize(input_text)

# Sentence tokenization
sentence_token = sent_tokenize(input_text)

# Step-5: Display the result
print("Word_Tokens:", word_token)
print("Sentence_Tokens:", sentence_token)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Word_Tokens: ['Hi', ',', 'I', 'am', 'Manasa', '.', 'I', 'am', 'the', 'Trainee', 'of', 'Skill', 'Share', 'Technologies', '.', 'I', 'am', 'learning', 'Machine', 'Learning', 'from', 'the', 'faculty', 'of', 'Skill', 'Share', '.']
Sentence_Tokens: ['Hi , I am Manasa .', 'I am the Trainee of Skill Share Technologies .', 'I am learning Machine Learning from the faculty of Skill Share.']


What is Tokenization?
Tokenization is the process of breaking a text, such as a sentence or paragraph, into smaller units called tokens. Depending on the type of tokenization, the tokens can be words (with punctuation marks as separate tokens) or whole sentences.

Why is Tokenization important in Natural Language Processing?
Machines cannot process raw text the way humans do. Tokenization breaks the complete text into smaller, meaningful units (tokens) that a program can count, compare, and analyze one by one. Later steps such as stemming, stopword removal, and POS tagging all operate on these tokens.
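
To make the idea concrete, here is a minimal sketch of tokenization that uses only Python's re module. The function names simple_sent_tokenize and simple_word_tokenize and both regular expressions are illustrative assumptions. This is not the algorithm NLTK uses, and it will mishandle cases such as abbreviations ("Dr.") that a trained sentence tokenizer is designed to handle.

```python
import re

def simple_sent_tokenize(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def simple_word_tokenize(text):
    """Naive word splitter: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Hi, I am Manasa. I am learning NLP."
print(simple_sent_tokenize(sample))
print(simple_word_tokenize(sample))
```

The word-level result treats each punctuation mark as its own token, which is the same convention visible in the Word_Tokens output above.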

Code Explanation
----------------
import nltk: Imports the Natural Language Toolkit (NLTK), a Python library for processing human language data.
NLTK gives access to tokenizers (word_tokenize, sent_tokenize), stopword lists, POS taggers, and more.

from nltk.tokenize import sent_tokenize, word_tokenize: Imports two specific functions:
sent_tokenize: Splits text into sentences.

word_tokenize: Splits text into words and punctuation marks.
Tokenization is usually the first step in text analysis; these functions break the text down for further processing.

nltk.download("punkt"): Downloads the pre-trained Punkt tokenizer data.
sent_tokenize uses this data to find sentence boundaries, and word_tokenize relies on it as well. If the data is not installed, both functions raise a LookupError. Depending on the NLTK version, the error message may name a resource called "punkt_tab"; in that case, download that resource too.

input_text = input("Enter input_text"): Asks the user to enter a string (a sentence or paragraph) at runtime and stores it in the variable input_text.
This is the text that gets tokenized.

word_token = word_tokenize(input_text): Splits the input text into a list of words and punctuation marks.

sentence_token = sent_tokenize(input_text): Splits the input text into a list of sentences.
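
The download step can be made conditional so that a resource is fetched only when it is missing. The sketch below is one possible way to do that; the helper name ensure_nltk_resource is hypothetical, and the path "tokenizers/punkt" assumes the usual location of the punkt package inside nltk_data.

```python
import nltk

def ensure_nltk_resource(resource_path, package):
    """Download `package` only when `resource_path` cannot be found locally.

    Returns True if a download was attempted, False if the resource was already there.
    """
    try:
        nltk.data.find(resource_path)   # raises LookupError when the resource is missing
        return False
    except LookupError:
        nltk.download(package, quiet=True)
        return True

# Assumed resource path for the punkt package.
ensure_nltk_resource("tokenizers/punkt", "punkt")
```

Calling the helper before tokenizing avoids repeating the download attempt on every run.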

# 2. Text Preprocessing: Stemming
Use PorterStemmer to reduce words to their base form.

In [4]:
# Step-1: Import required Libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
'''
#Step-2: Download the necessary NLTK resources
nltk.download("punkt")
'''
# Step-3: Initialize the Stemmer
stemmer = PorterStemmer()

# Step-4: Sample Sentence
input_text1 = "Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share."
'''
# Step-4: Create a Paragraph
input_text1 = input("Enter input_text:")
'''
# Step-5: Tokenize the paragraph(Split into words)
word_token = word_tokenize(input_text1)

# Step-6: Apply Stemming
stemmed_text = [stemmer.stem(text) for text in word_token]

# Step-7: Display Result
print("Original Input_Text:",input_text1)
print("Stemmed_Text:",stemmed_text)



Original Input_Text: Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share.
Stemmed_Text: ['hi', ',', 'i', 'am', 'manasa', '.', 'i', 'am', 'the', 'traine', 'of', 'skill', 'share', 'technolog', '.', 'i', 'am', 'learn', 'machin', 'learn', 'from', 'the', 'faculti', 'of', 'skill', 'share', '.']


What is Stemming?
Stemming is the process of reducing a word to its base or stem form, often by chopping off suffixes such as -ing, -ed, or -ly.
Here we use the PorterStemmer, a rule-based stemmer that removes common endings.

Difference between Stem and Root?
Stem: The part of a word that remains after removing suffixes or prefixes.
     Output: May be incomplete or not a valid English word
     Example: "Running" → Stem: "run" (a valid word in this case)
     Example: "Studies" → Stem: "studi" (not a valid English word)
     Speed: Fast (simple rules)
Root: The original, meaningful core word from which related words are derived.
     Output: Usually a valid word in the language
     Example: "Running" → Root: "run"
     Example: "Studies" → Root: "study"
     Speed: Slower (requires deeper analysis)

Why can stemming affect the meaning?
Stemming can change the meaning of words because it removes affixes (such as -ing, -ed, -es) by rule, without considering the context or grammar of the word.
Stemming algorithms can also produce stems that are not valid English words. Such stems do not carry a clear meaning on their own and may be hard for humans to read.
Examples:
"connect", "connection", "connected" → "connect" (this is fine)
"universe", "university" → "univers" (very different meanings, same stem)
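
The examples above can be checked with a short script that groups words by the stem PorterStemmer assigns to them. The word list and the variable names are chosen for illustration only; PorterStemmer needs no extra NLTK download.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connection", "connected", "universe", "university", "studies"]

# Group the original words by the stem they are reduced to.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, originals in groups.items():
    print(stem, "<-", originals)
```

Words that land in the same group are indistinguishable to any later step that only sees the stems, which is harmless for the "connect" family but not for "universe" and "university".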

Code Explanation
----------------
import nltk: Imports the Natural Language Toolkit library, used for processing human language data.

from nltk.stem import PorterStemmer: Imports the PorterStemmer class, a popular rule-based stemming algorithm that removes common endings such as -ing, -ed, and -s.

stemmer = PorterStemmer(): Creates an object called stemmer from the PorterStemmer class.
This object provides the .stem() method, which is called on each word to be reduced to its stem.

stemmed_text = [stemmer.stem(text) for text in word_token]: Uses a list comprehension to stem each word token.
stem(text) takes one token and returns its stem. PorterStemmer also lowercases the token by default, which is why "Hi" becomes "hi" in the output.

Example: "learning" → "learn", "Technologies" → "technolog"
Stemming reduces the different forms of a word to a common base, which makes text analysis simpler.

# 3. Text Preprocessing: Lemmatization
Use WordNetLemmatizer and compare with Stemming.

In [5]:
# Step-1: Import required Libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Step-2: Download necessary NLTK resources
#nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

# Step-3: Initialize the Lemmatizer
lemmatizer = WordNetLemmatizer()

# Step-4: Create a Paragraph
input_text2 = "Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share."
'''
# Step-4: Create a Paragraph
input_text2 = input("Enter Paragraph:")
'''

# Step-5: Tokenize the Sentence
word_token2 = word_tokenize(input_text2)

# Step-6: Applying Lemmatization
lemmatize_word = [lemmatizer.lemmatize(text) for text in word_token2]

# Step-7: Display the result
print("Original_text:",input_text2)
print("Lemmatized_word:",lemmatize_word)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Original_text: Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share.
Lemmatized_word: ['Hi', ',', 'I', 'am', 'Manasa', '.', 'I', 'am', 'the', 'Trainee', 'of', 'Skill', 'Share', 'Technologies', '.', 'I', 'am', 'learning', 'Machine', 'Learning', 'from', 'the', 'faculty', 'of', 'Skill', 'Share', '.']


What is Lemmatization?
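Lemmatization reduces a word to its base or dictionary form (its lemma).
Unlike stemming, it looks words up in a vocabulary (here, WordNet) and takes the part of speech into account, so the result is a valid word. Reducing different word forms to one standard form makes text easier to analyze and compare.
Note that WordNetLemmatizer.lemmatize() treats every word as a noun unless a pos argument is given. That is why "learning" and "am" are unchanged in the output above; capitalized tokens such as "Technologies" were also left as they were.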


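When is Lemmatization more appropriate than Stemming?
Lemmatization is more appropriate when the output must consist of real words, for example when the results are shown to users or when the exact meaning matters.
Lemmatization returns proper words found in the dictionary.
Stemming may return broken or meaningless stems, but it is faster and can be enough when only a rough grouping of word forms is needed.
Example:
Stemming: "studies" → "studi"
Lemmatization: "studies" → "study"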

Code Explanation
----------------
WordNetLemmatizer: An NLTK class that converts words to their dictionary form (lemma). Lemmatized text is easier to use in later steps such as search, classification, or language understanding.

"wordnet": WordNet, a large lexical database of English words, used to look up lemmas.
"omw-1.4": Open Multilingual WordNet, which adds WordNet data for other languages. Some NLTK versions expect it to be installed alongside "wordnet".
"punkt": Required for tokenizing text into words or sentences. Its download is commented out here because an earlier cell already fetched it.
Lemmatization needs a vocabulary to work correctly, and these downloads provide it.

lemmatizer = WordNetLemmatizer(): Creates an object of the WordNetLemmatizer class.
This object provides the .lemmatize() method, which is called on each word.

lemmatize_word = [lemmatizer.lemmatize(text) for text in word_token2]: A list comprehension that runs .lemmatize() on each token in the tokenized list. Because no pos argument is passed, every token is treated as a noun.


# 4. Stopwords Removal
Remove common stopwords using NLTK

In [6]:
# Step-1: Import the libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Step-2: Download necessary NLTK resources
nltk.download("punkt")
nltk.download("stopwords")

# Step-3: Generate a Sentence or a paragraph by Static or Dynamic 
# STATIC
input_text4 = "Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share."
# DYNAMIC
# input_text4 = input("Enter INPUT:")

# Step-4: Perform Tokenization(word_tokenization)
words = word_tokenize(input_text4)

# Step-5: Stopwords in English
stop_words = set(stopwords.words('english'))

# Step-6: Remove Stopwords
filtered_words =[word for word in words if word.lower() not in stop_words]

# Step-7: Display Result
print("Original_Text:",input_text4)
print("Filtered_text:",filtered_words)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Original_Text: Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share.
Filtered_text: ['Hi', ',', 'Manasa', '.', 'Trainee', 'Skill', 'Share', 'Technologies', '.', 'learning', 'Machine', 'Learning', 'faculty', 'Skill', 'Share', '.']


What are Stopwords?
Stopwords are the common words of a language, such as "the", "is", and "of", that are often filtered out during text processing because they carry little meaning for many NLP tasks.
They appear very frequently but usually do not help much in understanding the main meaning or intent of a sentence.

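When should we keep or remove Stopwords?
We remove stopwords when they:
Do not add meaningful information
Increase data size without benefit
Slow down algorithms unnecessarily
We keep them when they carry meaning for the task, for example negations such as "not" in sentiment analysis, or when word order and grammar matter (as in translation or question answering).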
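
The following sketch illustrates why removal is not always safe. It uses a tiny hand-written stopword set as a stand-in (an assumption made so the example runs without any download; NLTK's English list is much longer and also contains negations such as "not").

```python
# Illustrative stopword set, NOT the NLTK list.
stop_words = {"the", "is", "this", "a", "not"}
negations = {"not", "no", "nor"}

tokens = ["This", "movie", "is", "not", "good"]

# Removing every stopword also drops the negation.
remove_all = [t for t in tokens if t.lower() not in stop_words]

# Excluding negations from the stopword set preserves the meaning.
keep_negations = [t for t in tokens if t.lower() not in (stop_words - negations)]

print(remove_all)
print(keep_negations)
```

The first list reads as a positive statement although the original sentence was negative, which is the kind of case where stopwords (or at least some of them) should be kept.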

Code Explanation:
-----------------
stopwords: An NLTK corpus module that provides lists of common words (e.g., "is", "the", "and") for several languages. Together with word_tokenize, it lets us split the text and filter out the common words.
nltk.download("stopwords"): Downloads the stopword lists (English is used here). Stopword removal will not work until they are installed.
stop_words = set(stopwords.words('english')): Loads the English stopwords, such as "the", "is", and "am", into a set so that membership checks are fast.
These words usually add little meaning and can be filtered out for many NLP tasks.

filtered_words =[word for word in words if word.lower() not in stop_words]: A list comprehension.
It lowercases each token for the comparison (the stopword list is lowercase) and keeps the original token only if it is not a stopword.
Punctuation tokens such as "," and "." are not stopwords, so they remain in the output.
This step removes noise from the data and keeps the focus on the important words.

# 5. Part-of-Speech (POS) Tagging
Use nltk.pos_tag() on a sentence.

In [7]:
# Step1: Import Libraries
import nltk
from nltk.tokenize import word_tokenize

# Step-2: Download the necessary NLTK resources
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger_eng")

# Step-3: Generate a Sentence or a paragraph by Static or Dynamic 
# STATIC
input_text4 = "Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share."
# DYNAMIC
# input_text4 = input("Enter INPUT:")

# Step-4: Perform Tokenization(word_tokenization)
word_token4 = word_tokenize(input_text4)

# Step-5: Applying POS tags
tags = nltk.pos_tag(word_token4)

# Step-6: Display Result
print("Original_Text:",input_text4)
print("input_text4 and Tags:",tags)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Original_Text: Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share.
input_text4 and Tags: [('Hi', 'NNP'), (',', ','), ('I', 'PRP'), ('am', 'VBP'), ('Manasa', 'NNP'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('the', 'DT'), ('Trainee', 'NNP'), ('of', 'IN'), ('Skill', 'NNP'), ('Share', 'NNP'), ('Technologies', 'NNPS'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('from', 'IN'), ('the', 'DT'), ('faculty', 'NN'), ('of', 'IN'), ('Skill', 'NNP'), ('Share', 'NNP'), ('.', '.')]


What is POS Tagging?
POS tagging is the process of identifying the part of speech (noun, verb, adjective, etc.) of each word in a sentence.
It tells us what grammatical role each word plays in the sentence.

Importance of POS Tagging in Syntactic and Semantic Analysis?
Syntactic analysis focuses on the structure of a sentence. POS tagging helps with:
Sentence Parsing: Identifies how words are structured (subject-verb-object). Example: "Manasa reads books" → NNP VBZ NNS
Phrase Chunking: Helps form noun phrases and verb phrases.
Disambiguation: Helps resolve grammatical ambiguities (e.g., "book" as a noun vs. a verb).
Grammar Checking: Detects syntactic errors by checking POS sequences.

Semantic analysis focuses on the meaning behind words and phrases. POS tagging helps with:
Word Sense Disambiguation: Knowing a word's POS helps pick its correct meaning.
Sentiment Analysis: Identifies key adjectives and adverbs (e.g., happy, badly).
Entity Recognition: Helps tag proper nouns (people, places, etc.).
Question Answering: Helps determine expected answer types based on question words (e.g., "Where" → location, "Who" → person).

Code Explanation
----------------
nltk.download("averaged_perceptron_tagger_eng"): Downloads the English POS tagger model, which uses the averaged perceptron algorithm to assign parts of speech.
Both "punkt" and this model are required; without them, the tokenizer and the tagger raise a LookupError.

nltk.pos_tag(word_token4): Returns a list of (word, tag) pairs, one for each token in word_token4. The tags follow the Penn Treebank tag set (for example, NNP = proper noun, VBP = present-tense verb, DT = determiner).
This is the core of the program. It tells you what grammatical role each word plays.
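
Because the tagger returns plain (word, tag) pairs, they can be filtered with ordinary Python. The sketch below copies a few pairs from the output shown earlier in this section and picks out nouns and verbs by tag prefix; the variable names are illustrative.

```python
# A few (word, tag) pairs copied from the tagger output of this section.
tags = [("I", "PRP"), ("am", "VBP"), ("learning", "VBG"),
        ("Machine", "NNP"), ("Learning", "NNP"), ("from", "IN"),
        ("the", "DT"), ("faculty", "NN")]

# Penn Treebank convention: noun tags start with "NN", verb tags with "VB".
nouns = [word for word, tag in tags if tag.startswith("NN")]
verbs = [word for word, tag in tags if tag.startswith("VB")]

print("Nouns:", nouns)
print("Verbs:", verbs)
```

The same pattern works on the full list returned by nltk.pos_tag().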


# 6. Named Entity Recognition (NER)
Perform NER using spaCy

In [8]:
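# Step-1: Import libraries
import spacy

# Step-2: Load the pre-trained English pipeline
nlp = spacy.load("en_core_web_sm")

# Step-3: Process the text
text = "Hi , I am Manasa . I am the Trainee of Skill Share Technologies . I am learning Machine Learning from the faculty of Skill Share."
doc = nlp(text)

# Step-4: Print every token
# Note: iterating over doc yields tokens; the named entities found by the
# pipeline are stored separately in doc.ents.
print("Tokens:")
for ent in doc:
    print(ent.text)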

Tokens:
Hi
,
I
am
Manasa
.
I
am
the
Trainee
of
Skill
Share
Technologies
.
I
am
learning
Machine
Learning
from
the
faculty
of
Skill
Share
.


What is NER?
NER stands for Named Entity Recognition.
It is a Natural Language Processing (NLP) technique that identifies named entities in a text and classifies them into predefined categories such as:
Person names, Locations, Organizations, Dates, Times, Monetary values, Percentages.

How is NER used in real-world applications such as resumes and news?
1. Resumes
Automatically extract and organize key information from job applicants' resumes.
Uses:
Speeds up resume screening
Enables applicant ranking
Fills candidate profiles automatically

2. News Classification & Summarization
Identify key people, places, and events in news articles.
Uses:
Summarize news into key points
Group articles by topic or people
Detect breaking news or trends

Code Explanation
----------------
import spacy: Imports the spaCy library, which is used for NLP tasks such as tokenization, POS tagging, and Named Entity Recognition (NER).
nlp = spacy.load("en_core_web_sm"): Loads a pre-trained English pipeline.
"en_core_web_sm" stands for:
"en" = English language
"core" = core (general-purpose) features
"web" = trained on web data
"sm" = small (lightweight version)
nlp is now a pipeline object that processes English text (tokenizes, tags, recognizes entities, etc.).
doc = nlp(text): Passes the input text through the spaCy pipeline.
doc is a spaCy Doc object, which contains the tokens, their POS tags, and the named entities.
for ent in doc:
    print(ent.text)
This loop iterates over the doc object.
Despite its name, ent is a token here, not a named entity: iterating over a Doc yields its tokens.
ent.text returns the original word or punctuation symbol.
So this loop prints all the tokens one by one. The named entities themselves are stored in doc.ents, where each entity has a text and a label_ attribute.
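
To show what reading entities from doc.ents looks like, here is a minimal sketch. As an assumption made for reproducibility, it swaps the pre-trained model for a blank English pipeline with spaCy's rule-based entity_ruler component and two hand-written patterns, so the result does not depend on what a statistical model predicts. With spacy.load("en_core_web_sm"), the loop over doc.ents would be written the same way, but the entities found may differ.

```python
import spacy

# Assumption: a blank pipeline plus a rule-based EntityRuler stands in for
# the pre-trained model used in the cell above.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Manasa"},
    {"label": "ORG", "pattern": "Skill Share Technologies"},
])

doc = nlp("Hi, I am Manasa. I am the Trainee of Skill Share Technologies.")

# doc.ents holds the entity spans; each span exposes its text and its label.
entities = [(ent.text, ent.label_) for ent in doc.ents]
for text, label in entities:
    print(text, "->", label)
```

Unlike the token loop, this loop yields multi-word spans such as an organization name as a single item together with its category.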

# 7. One-Hot Encoding
Use OneHotEncoder to encode categorical variables like gender

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Gender':['Male','Female','Male','Female','Other']
})
'''
df = pd.read_csv("C:/python_Files/housing.csv")
x = df.iloc[:,-1]
x_reshaped = x.values.reshape(-1, 1)
'''
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Gender']])
encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out(['Gender']))
print(encoded_df)



   Gender_Female  Gender_Male  Gender_Other
0            0.0          1.0           0.0
1            1.0          0.0           0.0
2            0.0          1.0           0.0
3            1.0          0.0           0.0
4            0.0          0.0           1.0


How does One-Hot Encoding work?
One-Hot Encoding converts categorical data (such as words or labels) into a numerical format that machines can process.
Each distinct category gets its own column. A row has a 1 in the column of its category and 0 in all other columns, as in the Gender_Female, Gender_Male, and Gender_Other columns above.
We need encodings like this in NLP because computers cannot work with text directly, but they can work with numbers. Common techniques for converting words into numbers include:
One-Hot Encoding, Bag of Words, TF-IDF, Word Embeddings (Word2Vec, GloVe)
One-Hot Encoding is the simplest among them.
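
The mechanism is simple enough to write by hand. The function below is a pure-Python sketch for illustration only (the name one_hot_encode is hypothetical and this is not how scikit-learn implements it): it gives every distinct value a column and marks exactly one position per row.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct value."""
    categories = sorted(set(values))                  # fixed, sorted column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                         # exactly one "hot" position
        rows.append(row)
    return categories, rows

categories, rows = one_hot_encode(["Male", "Female", "Male", "Female", "Other"])
print(categories)
for row in rows:
    print(row)
```

The rows match the pattern of the encoded_df table printed by the cell above, with integers in place of floats.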

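Real-life uses (e.g., customer segmentation)
Customer segmentation is the process of dividing customers into groups based on shared characteristics, such as:
Age, Gender, Location, Purchase behavior, Product preference
Categorical attributes such as gender or location are typically one-hot encoded before a clustering or classification model can use them.
Businesses use the resulting segments to personalize marketing, improve recommendations, and target promotions effectively.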

What happens with unknown labels?
Unknown labels are categories that were not seen during fit(). They cause problems when new data containing such categories is passed to transform().
Example: the encoder is fitted on
Gender: ['Male', 'Female']
After fitting the encoder, we try to transform:
Gender: ['Other']  # <-- new, unseen label
'Other' is an unknown category, and by default (handle_unknown='error') transform() raises a ValueError. Setting handle_unknown='ignore' makes the encoder output a row of all zeros for such a category instead.

Code Explanation
----------------
import pandas as pd: pandas is a library for handling structured data.
pd is just an alias used for convenience. Here pandas holds the sample data in a DataFrame; in the commented-out variant it also reads the CSV file.
from sklearn.preprocessing import OneHotEncoder: Imports the OneHotEncoder class from scikit-learn.
encoder = OneHotEncoder(sparse_output=False): Creates an instance of the OneHotEncoder.
sparse_output=False: Outputs a dense NumPy array instead of a sparse matrix, which is easier to print and work with.
We want a human-readable table instead of a compressed sparse format.
encoded = encoder.fit_transform(df[['Gender']]): Fits the encoder to the data (fit) and transforms it (transform) in one step. In the commented-out CSV variant, x_reshaped would be passed instead.
The result, encoded, is a 2D NumPy array of 1s and 0s in which each column represents one category.
This is the actual One-Hot Encoding step that converts categorical text into numeric format.
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Gender'])): Converts the encoded NumPy array back into a pandas DataFrame, which makes the encoded data readable and usable like the original DataFrame.



