**Name: Ankitha**

**Date: 01/June/2025**

# **STEP 1**

In [7]:
# Load the CSV file
import pandas as pd
df = pd.read_csv("/content/NLP_Assignment_Sentences.csv")

df.head()

Unnamed: 0,SentenceID,Sentence
0,1,I love programming in Python.
1,2,Natural Language Processing is fascinating.
2,3,Spacy and NLTK are popular NLP libraries.
3,4,Machine learning enables predictive analysis.
4,5,Data preprocessing is a crucial step in NLP.


**Load the Dataset**

We load the dataset using the pandas library. The function pd.read_csv() reads the CSV file from the specified path into a DataFrame, a data structure that allows for efficient data manipulation and analysis.

In [7]:
!pip install pandas nltk

import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    # nltk.data.find raises LookupError when the resource is not installed
    print("Punkt tokenizer not found, downloading...")
    nltk.download('punkt')



In [12]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd # Import pandas

# Load the CSV file
df = pd.read_csv("/content/NLP_Assignment_Sentences.csv")

# Download the necessary NLTK resources
nltk.download('punkt')
# Download punkt_tab resource as suggested by the traceback
nltk.download('punkt_tab')


# Tokenize each sentence
df['Tokens'] = df['Sentence'].apply(word_tokenize)

# Display the DataFrame with the new 'Tokens' column
print(df[['Sentence', 'Tokens']].head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


                                        Sentence  \
0                  I love programming in Python.   
1    Natural Language Processing is fascinating.   
2      Spacy and NLTK are popular NLP libraries.   
3  Machine learning enables predictive analysis.   
4   Data preprocessing is a crucial step in NLP.   

                                              Tokens  
0              [I, love, programming, in, Python, .]  
1  [Natural, Language, Processing, is, fascinatin...  
2  [Spacy, and, NLTK, are, popular, NLP, librarie...  
3  [Machine, learning, enables, predictive, analy...  
4  [Data, preprocessing, is, a, crucial, step, in...  


⭐ TOKENIZATION

Tokenization is the process of breaking down a sentence or text into individual units called tokens. These tokens are typically words, but can also include punctuation marks or other meaningful elements.

# **🔨 What We Used**

word_tokenize: a function from the NLTK library that splits text into words and punctuation using robust linguistic rules.

apply(): a pandas method used to apply the tokenization function to each sentence in the dataset.
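To see why a dedicated tokenizer is used instead of plain whitespace splitting, here is a minimal sketch. The simple_tokenize function and its regular expression are illustrative assumptions and not what word_tokenize does internally; NLTK's rules are more elaborate (for example, for contractions).

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters, or any single
    # character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I love programming in Python."

whitespace_tokens = sentence.split()
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)  # ['I', 'love', 'programming', 'in', 'Python.']
print(regex_tokens)       # ['I', 'love', 'programming', 'in', 'Python', '.']
```

Whitespace splitting leaves the full stop glued to "Python", while the rule-based version separates it into its own token, which matches the Tokens column printed above for this sentence.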

# **STEP 2 (Stemming)**

In [13]:
# Step 2: Text preprocessing - stem every token with the Porter stemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
df['Stemmed'] = df['Tokens'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])

# 📌WHAT IS STEMMING?


# Definition:
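Stemming is a rule-based process that reduces words to a base form (the stem) by stripping affixes. The Porter stemmer used here removes suffixes and lowercases the token.
Example:
Stemming "running" and "runs" with the Porter stemmer gives "run", while the irregular form "ran" is left unchanged as "ran".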
# Characteristics:
Rule-based: It relies on predefined rules to remove affixes.
Faster: Generally faster than lemmatization due to its rule-based nature.
Less accurate: May not always result in a valid base word, and can be less accurate than lemmatization in some cases.
# Use cases:
Stemming is often used in situations where speed is prioritized over accuracy, such as in information retrieval systems where the goal is to find relevant documents quickly.
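The rule-based idea can be sketched with a toy suffix stripper. This is not the Porter algorithm, which applies several ordered rule phases; the suffix list and the minimum stem length below are assumptions chosen only for illustration.

```python
# Hypothetical rule set: suffixes are tried in this order
SUFFIXES = ["ing", "es", "ed", "s"]

def toy_stem(word, min_stem_length=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: len(word) - len(suffix)]
    return word

words = ["programming", "runs", "libraries", "ran", "is"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['programm', 'run', 'librari', 'ran', 'is']
```

The output shows both characteristics listed above: the rules are fast and simple, but they can yield stems that are not real words ("programm", "librari") and they do not handle irregular forms such as "ran".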

# **STEP 2 (Lemmatization)**

In [14]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
df['Lemmatized'] = df['Tokens'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...



# **📌What is lemmatization?**


Lemmatization is a text normalization technique used in natural language processing.

It reduces a word to its base or dictionary form, known as the lemma.

Unlike stemming, lemmatization ensures that the resulting word is linguistically valid.

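It can take into account the part of speech of the word. The lemmatize() call above does not pass a part of speech, so every token is treated as a noun by default, which is why verb forms such as "programming" stay unchanged in this notebook.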

Lemmatization requires the use of a lexical database or dictionary.

It helps in standardizing words for better text analysis.

This process improves the accuracy of many NLP tasks.

Lemmatization is slower than stemming due to its complexity.

It is useful in applications like search engines, chatbots, and machine translation.
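The role of the part of speech can be illustrated with a toy dictionary lookup. The LEMMA_TABLE entries and the toy_lemmatize function are hypothetical stand-ins for a real lexical database such as WordNet; the default pos="n" mirrors the default of WordNetLemmatizer.lemmatize(word, pos='n').

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_TABLE = {
    ("libraries", "n"): "library",
    ("programming", "v"): "program",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists for this POS
    return LEMMA_TABLE.get((word.lower(), pos), word)

as_noun = toy_lemmatize("programming")        # treated as a noun by default
as_verb = toy_lemmatize("programming", "v")   # treated as a verb
print(as_noun, as_verb)                       # programming program
print(toy_lemmatize("libraries"))             # library
print(toy_lemmatize("ran", "v"))              # run
```

In a real pipeline, the POS tags from a tagger would usually be mapped to the lemmatizer's POS codes and passed along with each token, so that verbs and adjectives are reduced correctly as well.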





# **📌 Sample Comparison**

In [15]:
# Compare stemming vs lemmatization on 2 sample sentences
df[['Tokens', 'Stemmed', 'Lemmatized']].head(2)

Unnamed: 0,Tokens,Stemmed,Lemmatized
0,"[I, love, programming, in, Python, .]","[i, love, program, in, python, .]","[I, love, programming, in, Python, .]"
1,"[Natural, Language, Processing, is, fascinatin...","[natur, languag, process, is, fascin, .]","[Natural, Language, Processing, is, fascinatin..."


# **STEP 3 (Stopwords Removal)**


In [17]:
from nltk.corpus import stopwords
import nltk # Import nltk explicitly
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Remove stopwords
df['No_Stopwords'] = df['Tokens'].apply(lambda tokens: [word for word in tokens if word.lower() not in stop_words])

# Display the DataFrame with the correct column name 'Sentence'
df[['Sentence', 'No_Stopwords']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Sentence,No_Stopwords
0,I love programming in Python.,"[love, programming, Python, .]"
1,Natural Language Processing is fascinating.,"[Natural, Language, Processing, fascinating, .]"
2,Spacy and NLTK are popular NLP libraries.,"[Spacy, NLTK, popular, NLP, libraries, .]"
3,Machine learning enables predictive analysis.,"[Machine, learning, enables, predictive, analy..."
4,Data preprocessing is a crucial step in NLP.,"[Data, preprocessing, crucial, step, NLP, .]"


# **📌 What are stopwords?**





Stopwords are commonly used words in a language that are often removed during natural language processing (NLP) tasks because they carry little or no meaningful content. Words like "the," "is," "in," "and," and "of" are considered stopwords in English. These words appear frequently in text but do not contribute significantly to the overall meaning, especially in tasks like text classification, search engines, or topic modeling. By removing stopwords, NLP systems can focus on more important words that carry substantial information. However, the use and removal of stopwords depend on the context of the application. In some cases, such as sentiment analysis or language translation, stopwords may still be important for understanding meaning or tone. Many NLP libraries, like NLTK and spaCy, provide predefined lists of stopwords, but custom lists can also be created depending on the specific needs of a project. Removing stopwords is usually a standard step in the text preprocessing pipeline to simplify and speed up analysis.

2. When should we keep or remove them?


The decision to keep or remove stopwords in a natural language processing task depends on the specific goal and context of the project. In many cases, stopwords are removed to reduce noise and improve performance, especially in tasks like text classification, topic modeling, and information retrieval. These common words—such as "the," "is," "and," or "of"—appear frequently in language but usually do not contribute significant meaning when identifying the main themes or topics of a document. Removing them helps simplify the data and focus on more meaningful content words.

However, there are situations where keeping stopwords is essential. In sentiment analysis, for example, words like "not" or "was" can greatly influence the sentiment of a sentence, and removing them could lead to incorrect interpretations. Similarly, in machine translation or chatbot applications, stopwords are important for maintaining correct grammar and conveying precise meanings. Tasks that rely on syntactic or grammatical structure, such as part-of-speech tagging or dependency parsing, also require stopwords to be preserved. Ultimately, whether to keep or remove stopwords should be based on the specific requirements of the task, and it’s often beneficial to test both approaches to determine which yields better results.
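One way to act on this trade-off is to remove stopwords but keep a small allow-list of words that matter for the task. The sketch below uses a tiny hand-written stopword set and negation list as assumptions; with NLTK, the set would come from stopwords.words('english') instead.

```python
# Hypothetical, deliberately tiny stopword list for illustration
STOP_WORDS = {"the", "is", "was", "not", "a", "in", "and", "of"}
# Words we may want to keep for sentiment analysis (assumed list)
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep=frozenset()):
    # Drop a token only if it is a stopword and not on the keep list
    return [t for t in tokens if t.lower() not in STOP_WORDS or t.lower() in keep]

tokens = ["The", "movie", "was", "not", "good"]

plain = remove_stopwords(tokens)
with_negation = remove_stopwords(tokens, keep=NEGATIONS)

print(plain)          # ['movie', 'good']
print(with_negation)  # ['movie', 'not', 'good']
```

Without the allow-list, the negative review is reduced to "movie good", which reverses its apparent sentiment; keeping "not" preserves the meaning while still removing the other common words.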









# **STEP 4 (Part of Speech (POS) Tagging)**





In [30]:
import nltk
import pandas as pd

# Download required NLTK resources, including the specific English tagger
try:
    nltk.download('averaged_perceptron_tagger')
    # Download the specific English averaged perceptron tagger
    nltk.download('averaged_perceptron_tagger_eng')
except Exception as e:
    print(f"Error downloading NLTK resource: {e}")

# Assuming df is already created with a 'Tokens' column from the previous step
# POS tagging
df['POS_Tags'] = df['Tokens'].apply(nltk.pos_tag)

# Extract tags
df['Nouns'] = df['POS_Tags'].apply(lambda tags: [word for word, pos in tags if pos.startswith('NN')])
df['Verbs'] = df['POS_Tags'].apply(lambda tags: [word for word, pos in tags if pos.startswith('VB')])
df['Adjectives'] = df['POS_Tags'].apply(lambda tags: [word for word, pos in tags if pos.startswith('JJ')])
df['Adverbs'] = df['POS_Tags'].apply(lambda tags: [word for word, pos in tags if pos.startswith('RB')])

# Display the updated DataFrame
print(df[['Tokens', 'POS_Tags', 'Nouns', 'Verbs', 'Adjectives', 'Adverbs']])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


                                              Tokens  \
0              [I, love, programming, in, Python, .]   
1  [Natural, Language, Processing, is, fascinatin...   
2  [Spacy, and, NLTK, are, popular, NLP, librarie...   
3  [Machine, learning, enables, predictive, analy...   
4  [Data, preprocessing, is, a, crucial, step, in...   

                                            POS_Tags  \
0  [(I, PRP), (love, VBP), (programming, VBG), (i...   
1  [(Natural, JJ), (Language, NNP), (Processing, ...   
2  [(Spacy, NN), (and, CC), (NLTK, NNP), (are, VB...   
3  [(Machine, NN), (learning, VBG), (enables, NNS...   
4  [(Data, NNP), (preprocessing, NN), (is, VBZ), ...   

                              Nouns                Verbs    Adjectives Adverbs  
0                          [Python]  [love, programming]            []      []  
1            [Language, Processing]    [is, fascinating]     [Natural]      []  
2     [Spacy, NLTK, NLP, libraries]                [are]     [popular]      []  
3 

# 📌 WHAT IS Part of Speech (POS) Tagging?


Part of Speech (POS) tagging is a fundamental task in Natural Language Processing (NLP) that involves assigning a grammatical category (or "tag") to each word in a sentence based on its role and context. These tags represent the word’s part of speech, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or interjection. For example, in the sentence "I love coding," POS tagging would label "I" as a pronoun (PRP), "love" as a verb (VBP), and "coding" as either a noun (NN) or a gerund (VBG), depending on the tagger; in the output above, NLTK tags "programming" as VBG.

POS tagging is crucial because it helps machines understand the syntactic structure of a sentence, which is essential for tasks like text analysis, machine translation, sentiment analysis, and information extraction. For instance, knowing whether "book" is a noun ("I read a book") or a verb ("I book a flight") changes the sentence's meaning.

The process typically involves tokenizing the text into words and then applying a tagging algorithm. Tools like NLTK, spaCy, or Stanford NLP use pre-trained models to perform POS tagging. These models are trained on large annotated corpora, such as the Penn Treebank, which provide standardized tag sets (e.g., NN for nouns, VB for verbs).

There are different approaches to POS tagging:

Rule-Based: Uses predefined grammatical rules to assign tags. For example, a word following "the" is often a noun.

Statistical: Relies on probabilistic models like Hidden Markov Models (HMMs) to predict tags based on word sequences and their likelihood.

Neural/Deep Learning: Uses models like transformers (e.g., BERT) to capture contextual relationships, improving accuracy for ambiguous words.

POS tagging can handle challenges like homonyms (words with multiple meanings) by considering the surrounding context. For example, in "I can can a can," the first "can" is a modal verb (MD), the second is a verb (VB), and the third is a noun (NN).

Accuracy in POS tagging depends on the language and domain, as some languages have complex grammar, and slang or technical terms can confuse models. Modern taggers achieve over 97% accuracy on standard English datasets but may struggle with informal text or low-resource languages.
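The rule-based approach and the "book" example can be sketched with a toy tagger that looks only at the previous word. The word lists and rules below are assumptions made for illustration; they are far simpler than the averaged perceptron model that nltk.pos_tag uses.

```python
PRONOUNS = {"i", "you", "we", "they"}
DETERMINERS = {"a", "an", "the"}

def toy_pos_tag(tokens):
    tagged = []
    for i, token in enumerate(tokens):
        word = token.lower()
        previous = tokens[i - 1].lower() if i > 0 else ""
        if word in PRONOUNS:
            tag = "PRP"
        elif word in DETERMINERS:
            tag = "DT"
        elif previous in DETERMINERS:
            tag = "NN"   # rule: a word right after a determiner is a noun
        elif previous in PRONOUNS:
            tag = "VBP"  # rule: a word right after a pronoun is a verb
        else:
            tag = "NN"   # fallback guess
        tagged.append((token, tag))
    return tagged

noun_case = toy_pos_tag(["I", "read", "a", "book"])
verb_case = toy_pos_tag(["I", "book", "a", "flight"])
print(noun_case)  # [('I', 'PRP'), ('read', 'VBP'), ('a', 'DT'), ('book', 'NN')]
print(verb_case)  # [('I', 'PRP'), ('book', 'VBP'), ('a', 'DT'), ('flight', 'NN')]
```

The same word "book" receives a different tag in each sentence purely because of its context, which is the behaviour that statistical and neural taggers learn from annotated data instead of hand-written rules.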

# **Step 5: Named Entity Recognition (NER)**

In [32]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Download the missing resource
nltk.download('maxent_ne_chunker_tab')

from nltk import ne_chunk

# Perform NER
df['NER'] = df['POS_Tags'].apply(ne_chunk)
# Corrected column name from 'Sentences' to 'Sentence'
print(df[['Sentence', 'NER']].head())

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


                                        Sentence  \
0                  I love programming in Python.   
1    Natural Language Processing is fascinating.   
2      Spacy and NLTK are popular NLP libraries.   
3  Machine learning enables predictive analysis.   
4   Data preprocessing is a crucial step in NLP.   

                                                 NER  
0  [(I, PRP), (love, VBP), (programming, VBG), (i...  
1  [(Natural, JJ), (Language, NNP), (Processing, ...  
2  [[(Spacy, NN)], (and, CC), [(NLTK, NNP)], (are...  
3  [[(Machine, NN)], (learning, VBG), (enables, N...  
4  [[(Data, NNP)], (preprocessing, NN), (is, VBZ)...  


# **📌WHAT IS Named Entity Recognition (NER)?**


Named Entity Recognition (NER) is a key NLP task that identifies and classifies named entities in text into predefined categories like person names, organizations, locations, dates, and more. For example, in the sentence "Elon Musk founded SpaceX in California on March 14, 2002," NER would tag "Elon Musk" as a person, "SpaceX" as an organization, "California" as a location, and "March 14, 2002" as a date. This process helps machines understand structured information within unstructured text, making it vital for applications like information retrieval, question answering, and knowledge graph construction. NER typically follows tokenization and POS tagging, as understanding a word’s role aids in entity identification.

There are several approaches to NER: rule-based methods use handcrafted patterns (e.g., regular expressions) to detect entities, while statistical methods like Conditional Random Fields (CRFs) leverage annotated data to predict entity boundaries and types. Modern NER often employs deep learning models, such as BiLSTM-CRF or transformers like BERT, which excel at capturing contextual nuances, especially for ambiguous entities (e.g., "Jordan" as a person or location).

Tools like spaCy, NLTK, and Stanford NLP provide pre-trained NER models, supporting multiple languages, though performance varies with language complexity and domain-specific terms. For instance, with spaCy in Python, running nlp = spacy.load("en_core_web_sm") followed by doc = nlp("Apple is in California") gives a doc whose entities (doc.ents) would typically include "Apple" as an organization and "California" as a location.

NER faces challenges like entity ambiguity, nested entities (e.g., "University of California" is an organization that contains the location "California"), and out-of-vocabulary terms in specialized domains like medicine or law. Fine-tuning models on domain-specific data can improve accuracy, but it requires annotated corpora, which can be scarce for low-resource languages.

On standard English datasets like CoNLL-2003, state-of-the-art NER systems achieve F1 scores above 90%, but real-world noisy text, such as social media, often lowers performance. NER is a building block for advanced NLP tasks, enabling structured data extraction and enhancing text understanding for applications like chatbots, search engines, and automated content analysis.
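The rule-based approach mentioned above can be sketched with a regular expression for dates and a small lookup list for locations. The pattern, the KNOWN_LOCATIONS set, and the label names are assumptions for illustration; this is not how nltk.ne_chunk or spaCy detect entities.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches dates written like "March 14, 2002"
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
# Hypothetical gazetteer (lookup list) of known locations
KNOWN_LOCATIONS = {"California", "Texas"}

def toy_ner(text):
    entities = [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]
    for word in re.findall(r"\w+", text):
        if word in KNOWN_LOCATIONS:
            entities.append((word, "LOCATION"))
    return entities

text = "Elon Musk founded SpaceX in California on March 14, 2002."
entities = toy_ner(text)
print(entities)  # [('March 14, 2002', 'DATE'), ('California', 'LOCATION')]
```

Handcrafted rules like these are precise for well-formatted patterns such as dates, but they miss anything outside the pattern or the lookup list (here, "Elon Musk" and "SpaceX"), which is why statistical and neural models are preferred for open-ended entity types.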

# **STEP 6 (One Hot Encoding)**

In [36]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer # Import CountVectorizer

# Assuming df is your DataFrame with tokenized sentences
# Example: df['Tokens'] = [['I', 'love', 'programming'], ['Python', 'is', 'great']]

# Join the tokens back into strings for CountVectorizer
# CountVectorizer expects strings as input, not lists of tokens
df['Token_String'] = df['Tokens'].apply(lambda tokens: ' '.join(tokens))

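# Initialize the CountVectorizer
# Note: by default it stores word counts (not strict 0/1 indicators), lowercases the text,
# and drops punctuation and single-character tokens such as "I" and "a";
# pass binary=True to get pure presence/absence values
vectorizer = CountVectorizer()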

# Fit and transform the token strings into a count matrix
count_matrix = vectorizer.fit_transform(df['Token_String'])

# Display the shape of the count matrix
print("Shape of count matrix:", count_matrix.shape)

# Optional: Convert the count matrix back to a DataFrame for better readability
# Note: CountVectorizer returns a sparse matrix, converting to dense might use a lot of memory for large datasets
count_df = pd.DataFrame(count_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(count_df.head())

Shape of count matrix: (5, 25)
   analysis  and  are  crucial  data  enables  fascinating  in  is  language  \
0         0    0    0        0     0        0            0   1   0         0   
1         0    0    0        0     0        0            1   0   1         1   
2         0    1    1        0     0        0            0   0   0         0   
3         1    0    0        0     0        1            0   0   0         0   
4         0    0    0        1     1        0            0   1   1         0   

   ...  nlp  nltk  popular  predictive  preprocessing  processing  \
0  ...    0     0        0           0              0           0   
1  ...    0     0        0           0              0           1   
2  ...    1     1        1           0              0           0   
3  ...    0     0        0           1              0           0   
4  ...    1     0        0           0              1           0   

   programming  python  spacy  step  
0            1       1      0     0

# **📌What Is One Hot Encoding**





One hot encoding is a fundamental technique used in natural language processing (NLP) to convert text data into a numerical format that machine learning models can understand. Computers do not inherently understand words, sentences, or language; they only process numbers. Therefore, we need a method to represent words as numbers without losing their identity. One hot encoding helps achieve this by assigning each unique word in a vocabulary a distinct binary vector.

In one hot encoding, we begin by creating a vocabulary of all unique words present in our dataset. Suppose we have five unique words: "apple", "banana", "cat", "dog", and "elephant". Each word is represented as a vector of length equal to the vocabulary size, which is five in this case. For "apple", the vector would be [1, 0, 0, 0, 0]; for "banana", it would be [0, 1, 0, 0, 0], and so on. Only one position in the vector has a value of 1, indicating the presence of that specific word, while all other positions are 0.

This binary representation makes it easy to distinguish between different words. Each word is mutually exclusive in its representation. That means no two words share the same encoding, and each is orthogonal to the others in vector space. This is important because it prevents the model from assuming any inherent ordering or relationship between the words based on their encoding. Unlike label encoding, where numbers might imply ranking (like 1, 2, 3), one hot encoding avoids such misleading implications.
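The vectors described above, and the claim that they are orthogonal, can be checked with a few lines of plain Python. The one_hot and dot helpers are illustrative names, not library functions.

```python
vocab = ["apple", "banana", "cat", "dog", "elephant"]
# Map each word to its position in the vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

apple = one_hot("apple")
banana = one_hot("banana")
print(apple)               # [1, 0, 0, 0, 0]
print(banana)              # [0, 1, 0, 0, 0]
print(dot(apple, banana))  # 0 -> different words are orthogonal
print(dot(apple, apple))   # 1
```

Because every pair of distinct words has a dot product of 0, the encoding carries no notion of ordering or similarity between words.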

One hot encoding is particularly useful in early-stage NLP models where word meaning isn’t derived from context. It’s often used before applying more advanced representations like word embeddings (Word2Vec, GloVe) or contextual embeddings (BERT, GPT). It is simple, fast, and works well for small vocabularies and short texts. However, it has limitations when applied to large datasets. As the vocabulary grows, the length of each one hot vector increases, leading to high memory consumption and sparse matrices with lots of zeros.

For example, if your vocabulary contains 10,000 words, each one hot vector will be 10,000 elements long with only a single 1 and 9,999 zeros. This is computationally inefficient. Moreover, one hot encoding does not capture any semantic similarity between words. For instance, “king” and “queen” will have completely different one hot vectors, even though they are related in meaning. As a result, while one hot encoding is effective for simple tasks and demonstrations, more sophisticated models often rely on embedding techniques that capture context and relationships between words.

In sentence processing, we apply one hot encoding to each word in the sentence using the predefined vocabulary. The entire sentence can then be represented as a matrix of binary vectors, one for each word. This makes it suitable for feeding into neural networks, especially in tasks like sentiment analysis, text classification, or intent detection. One hot encoding is also commonly used in bag-of-words models and feedforward neural networks as input features.
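A minimal sketch of the sentence-as-matrix idea, using a small assumed vocabulary: each row is the one hot vector of one word, and taking the column-wise maximum collapses the matrix into a single binary bag-of-words vector.

```python
vocab = ["i", "love", "programming", "in", "python"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

sentence_tokens = ["i", "love", "python"]

# One row per word: the word order of the sentence is preserved
sentence_matrix = [one_hot(token) for token in sentence_tokens]
for row in sentence_matrix:
    print(row)
# [1, 0, 0, 0, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 0, 0, 1]

# Column-wise maximum: 1 if the word occurs anywhere in the sentence
bag_of_words = [max(column) for column in zip(*sentence_matrix)]
print(bag_of_words)  # [1, 1, 0, 0, 1]
```

The matrix keeps the position of every word, while the collapsed vector only records which words are present. The CountVectorizer table printed in the cell above has this second form, with one row per sentence.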

While using one hot encoding, special care must be taken to handle unknown or out-of-vocabulary words. If a word appears during testing that was not present in the training vocabulary, it cannot be properly encoded unless a placeholder token like <UNK> is used. Libraries such as scikit-learn offer ways to handle unknowns gracefully: OneHotEncoder with handle_unknown='ignore' encodes an unseen category as an all-zero vector instead of raising an error.
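The placeholder-token strategy can be sketched as follows. Reserving index 0 for <UNK> and the encode helper are assumptions for illustration; the vocabulary is a toy one.

```python
UNK = "<UNK>"
# Reserve the first slot for unknown words (an assumed convention)
vocab = [UNK, "apple", "banana", "cat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def encode(word):
    vector = [0] * len(vocab)
    # Any word outside the vocabulary is mapped to the <UNK> slot
    index = word_to_index.get(word, word_to_index[UNK])
    vector[index] = 1
    return vector

known = encode("banana")
unknown = encode("nonexistent")
print(known)    # [0, 0, 1, 0]
print(unknown)  # [1, 0, 0, 0]
```

With a dedicated <UNK> slot, an unseen word still produces a valid one hot vector, so a model can learn how to treat unknown words. An all-zero vector, by contrast, simply signals that no known category matched.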

In conclusion, one hot encoding is a basic yet powerful technique to represent categorical text data numerically. It is simple to implement, interpretable, and serves as a good starting point for understanding word representation in NLP. Despite its limitations in scalability and lack of semantic depth, it remains an important concept in machine learning and natural language processing. Understanding one hot encoding helps build the foundation for more complex text representation methods used in deep learning models today. It exemplifies the importance of data preprocessing and feature engineering in NLP pipelines.

**📌Add Unknown Word Example**

In [15]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Example vocabulary
vocab = ['apple', 'banana', 'cat']

# Step 1: Initialize the encoder with handle_unknown='ignore'
# Note: newer scikit-learn versions renamed the 'sparse' argument to 'sparse_output';
# with the default settings the encoder returns a sparse matrix
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(np.array(vocab).reshape(-1, 1))

# Step 2: Try transforming a known and an unknown word
print("Known word 'banana':")
print(encoder.transform([['banana']]))

print("\nUnknown word 'nonexistent':")
print(encoder.transform([['nonexistent']]))

Known word 'banana':
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1 stored elements and shape (1, 3)>
  Coords	Values
  (0, 1)	1.0

Unknown word 'nonexistent':
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 0 stored elements and shape (1, 3)>
