## JAGATHA MANASA                                                                                                             1/06/2025

# Performing NLP tasks by loading CSV File
## 1. Load the dataset and perform Tokenization

In [None]:
# Step-1: Import libraries
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

# Step-2: Download necessary resources (uncomment on the first run;
# recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
#nltk.download('punkt')

# Step-3: Load the dataset using pandas
data = pd.read_csv("NLP_Assignment_Sentences.csv")

# Step-4: Extract the necessary columns from the entire dataset(Slicing)
x = data['Sentence']

# Step-5: Process the sentences 
for row in x:

    # Step-6: Perform word_tokenization
    tokens = word_tokenize(row)

    # Step-7: Display the Result
    print("Original Sentence:",row)
    print("Word_Tokenization:",tokens)
    print("  ")

Original Sentence: I love programming in Python.
Word_Tokenization: ['I', 'love', 'programming', 'in', 'Python', '.']
  
Original Sentence: Natural Language Processing is fascinating.
Word_Tokenization: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
  
Original Sentence: Spacy and NLTK are popular NLP libraries.
Word_Tokenization: ['Spacy', 'and', 'NLTK', 'are', 'popular', 'NLP', 'libraries', '.']
  
Original Sentence: Machine learning enables predictive analysis.
Word_Tokenization: ['Machine', 'learning', 'enables', 'predictive', 'analysis', '.']
  
Original Sentence: Data preprocessing is a crucial step in NLP.
Word_Tokenization: ['Data', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'NLP', '.']
  



## Code Objective:

This code uses NLTK to split each sentence in a CSV file into word and punctuation tokens.

### Code Explanation:

- `nltk`: a natural language processing library.
- `pandas`: used to read and handle the CSV file.
- `word_tokenize()`: a function from `nltk.tokenize` that splits a sentence into individual word and punctuation tokens.
- `data = pd.read_csv(...)`: loads the CSV file into a DataFrame called `data`.
- `x = data['Sentence']`: extracts the "Sentence" column from the DataFrame; `x` is now a pandas Series containing all the sentences.
- `for` loop: iterates over each sentence (`row`) in the "Sentence" column and uses `word_tokenize()` to split it into words and punctuation.
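
To make the tokenization step concrete, the sketch below approximates it with the standard library only. It is a stand-in under stated assumptions, not NLTK's algorithm: `simple_tokenize` and the in-memory `SAMPLE_CSV` are hypothetical names, and the regular expression merely separates runs of word characters from single punctuation marks, whereas `word_tokenize()` also handles contractions and abbreviations.

```python
import csv
import io
import re

# Hypothetical in-memory stand-in for the CSV file, with a "Sentence" column.
SAMPLE_CSV = (
    "Sentence\n"
    "I love programming in Python.\n"
    "Data preprocessing is a crucial step in NLP.\n"
)

def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens.

    Rough approximation only; this is not how NLTK's word_tokenize works.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)

# Read the "Sentence" column, then tokenize each row.
sentences = [row["Sentence"] for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
tokenized = [simple_tokenize(sentence) for sentence in sentences]

for sentence, tokens in zip(sentences, tokenized):
    print("Original Sentence:", sentence)
    print("Tokens:", tokens)
```

For these simple sentences the tokens match the output shown above; on input such as "don't" the two approaches diverge, which is one reason to prefer the library tokenizer for real text.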

## 2. Text Preprocessing Using PorterStemmer and WordNetLemmatizer

In [8]:
# Step-1: Import necessary functions
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer , WordNetLemmatizer

# Step-2: Initialize Stemmer and lemmatizer
Stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Step-3: Extract the necessary columns from the entire dataset(Slicing)
x = data['Sentence']

# Step-4: Process the sentences 
for row in x:

    # Step-5: Perform word_tokenization
    word_tokens = word_tokenize(row)
    
    # Step-6: Apply Stemming and Lemmatization
    stemmed_text = [Stemmer.stem(text) for text in word_tokens]
    lemmatized_text = [lemmatizer.lemmatize(text) for text in word_tokens]

    # Step-7: Display the Result
    print("Original Sentence:",row)
    print("Tokenized_words:",word_tokens)
    print("Stemmed_text:",stemmed_text)
    print("Lemmatized_text:",lemmatized_text)
    print("   ")


Original Sentence: I love programming in Python.
Tokenized_words: ['I', 'love', 'programming', 'in', 'Python', '.']
Stemmed_text: ['i', 'love', 'program', 'in', 'python', '.']
Lemmatized_text: ['I', 'love', 'programming', 'in', 'Python', '.']
   
Original Sentence: Natural Language Processing is fascinating.
Tokenized_words: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
Stemmed_text: ['natur', 'languag', 'process', 'is', 'fascin', '.']
Lemmatized_text: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
   
Original Sentence: Spacy and NLTK are popular NLP libraries.
Tokenized_words: ['Spacy', 'and', 'NLTK', 'are', 'popular', 'NLP', 'libraries', '.']
Stemmed_text: ['spaci', 'and', 'nltk', 'are', 'popular', 'nlp', 'librari', '.']
Lemmatized_text: ['Spacy', 'and', 'NLTK', 'are', 'popular', 'NLP', 'library', '.']
   
Original Sentence: Machine learning enables predictive analysis.
Tokenized_words: ['Machine', 'learning', 'enables', 'predictive', 'analysis', '


## Code Objective:

The code processes each sentence from the CSV file. For each sentence, it:
- Tokenizes the sentence into words.
- Stems each word using the `PorterStemmer`.
- Lemmatizes each word using the `WordNetLemmatizer`.

### Code Explanation:

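- `word_tokenize`: splits sentences into words.
- `PorterStemmer`: reduces words to a stem by stripping suffixes with fixed rules. It also lowercases its input, and the stem is not always a dictionary word (for example, "natur" and "librari" in the output above).
- `WordNetLemmatizer`: reduces words to their base dictionary form. `lemmatize()` treats every word as a noun unless a `pos` argument is passed, which is why "libraries" becomes "library" while "programming" and "is" are left unchanged.
- `Stemmer = PorterStemmer()` and `lemmatizer = WordNetLemmatizer()`: create instances of the stemmer and the lemmatizer.
- `for` loop: goes through each sentence in the dataset and uses `word_tokenize()` to split it into words and punctuation.
- `stemmed_text = [Stemmer.stem(text) for text in word_tokens]`: applies stemming to each token.
- `lemmatized_text = [lemmatizer.lemmatize(text) for text in word_tokens]`: applies lemmatization to each token.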
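
The difference between the two outputs comes down to rules versus lookup. The sketch below is a toy illustration, neither the Porter algorithm nor WordNet: `crude_stem`, `toy_lemmatize`, and the three-entry `LEMMA_TABLE` are hypothetical. It shows why a stemmer can produce non-words and why a lemmatizer's answer depends on the part of speech it is given.

```python
# Toy illustration only: not the Porter algorithm and not WordNet.

SUFFIXES = ("ing", "ies", "es", "s")  # checked in this order

def crude_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical (word, part of speech) -> lemma dictionary.
LEMMA_TABLE = {
    ("libraries", "n"): "library",
    ("programming", "v"): "program",
    ("enables", "v"): "enable",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; the default is noun."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for w in ["libraries", "programming", "enables"]:
    print(w, "| stem:", crude_stem(w),
          "| noun lemma:", toy_lemmatize(w),
          "| verb lemma:", toy_lemmatize(w, pos="v"))
```

With NLTK itself, the analogous step is passing a part of speech to the lemmatizer, for example `lemmatizer.lemmatize("programming", pos="v")`.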

## 3. Stopword Removal

In [9]:
# Step-1: Import necessary functions
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Step-2: Download necessary resources
#nltk.download('stopwords')

# Step-3: Extract the necessary columns from the entire dataset(Slicing)
x = data['Sentence']

# Step-4: Build the English stopword set once, before the loop
stop_words = set(stopwords.words('english'))

# Step-5: Process the sentences
for row in x:

    # Step-6: Perform word_tokenization
    word_token = word_tokenize(row)

    # Step-7: Remove Stopwords
    filtered_words = [word for word in word_token if word.lower() not in stop_words]

    # Step-8: Display the Result
    print("Original Sentence:",row)
    print("Filtered_Text:",filtered_words)
    print(" ")

Original Sentence: I love programming in Python.
Filtered_Text: ['love', 'programming', 'Python', '.']
 
Original Sentence: Natural Language Processing is fascinating.
Filtered_Text: ['Natural', 'Language', 'Processing', 'fascinating', '.']
 
Original Sentence: Spacy and NLTK are popular NLP libraries.
Filtered_Text: ['Spacy', 'NLTK', 'popular', 'NLP', 'libraries', '.']
 
Original Sentence: Machine learning enables predictive analysis.
Filtered_Text: ['Machine', 'learning', 'enables', 'predictive', 'analysis', '.']
 
Original Sentence: Data preprocessing is a crucial step in NLP.
Filtered_Text: ['Data', 'preprocessing', 'crucial', 'step', 'NLP', '.']
 



## Code Objective:

- Tokenize each sentence in the dataset.
- Remove common stopwords such as "is", "the", and "in", which carry little information for many NLP tasks.
- Print the filtered (cleaned) version of each sentence.

### Code Explanation:

- `stopwords`: a built-in NLTK list of very common English words (such as "the", "is", and "at") that add little meaning in tasks like classification.
- `word_tokenize`: splits a sentence into individual tokens (words and punctuation).
- `for` loop: goes through each sentence in the dataset and uses `word_tokenize()` to split it into words and punctuation.
- `filtered_words = [word for word in word_token if word.lower() not in stop_words]`:
    - Loops through each token in `word_token`.
    - Converts it to lowercase with `word.lower()` so the comparison is case-insensitive (the NLTK list is lowercase).
    - Keeps it only if it is not in the stopword set. Punctuation such as "." is not a stopword, so it stays in the output.
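
The filtering rule can be tried without any downloads. The sketch below assumes a small hand-written `STOP_WORDS` set standing in for NLTK's much longer list, and a hypothetical helper `filter_tokens` that can also drop punctuation tokens, which the notebook code keeps.

```python
# Hypothetical mini stopword set; NLTK's stopwords.words('english') is much longer.
STOP_WORDS = {"i", "in", "is", "a", "and", "are", "the"}

def filter_tokens(tokens, stop_words=STOP_WORDS, drop_punctuation=True):
    """Remove stopwords (case-insensitively) and, optionally, punctuation tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # common word, dropped
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue  # token has no letters or digits, e.g. "."
        kept.append(token)
    return kept

tokens = ["I", "love", "programming", "in", "Python", "."]
print(filter_tokens(tokens))
print(filter_tokens(tokens, drop_punctuation=False))
```

Whether punctuation should be dropped depends on the downstream task; the flag keeps both behaviours available.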

## 4. Part-of-Speech Tagging

In [11]:
# Step-1: Import necessary functions
import nltk
from nltk.tokenize import word_tokenize

# Step-2: Download necessary resources
#nltk.download('averaged_perceptron_tagger')

# Step-3: Extract the necessary columns from the entire dataset(Slicing)
x = data['Sentence']

# Step-4: Process the sentences 
for row in x:

    # Step-5: Perform word_tokenization
    token = word_tokenize(row)

    # Step-6: Applying pos_tag
    tags = nltk.pos_tag(token)

    # Step-7: Display the Result
    print("Original Sentence:",row)
    print("Text Tokens and Tags:",tags)
    print(" ")

Original Sentence: I love programming in Python.
Text Tokens and Tags: [('I', 'PRP'), ('love', 'VBP'), ('programming', 'VBG'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
 
Original Sentence: Natural Language Processing is fascinating.
Text Tokens and Tags: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fascinating', 'VBG'), ('.', '.')]
 
Original Sentence: Spacy and NLTK are popular NLP libraries.
Text Tokens and Tags: [('Spacy', 'NN'), ('and', 'CC'), ('NLTK', 'NNP'), ('are', 'VBP'), ('popular', 'JJ'), ('NLP', 'NNP'), ('libraries', 'NNS'), ('.', '.')]
 
Original Sentence: Machine learning enables predictive analysis.
Text Tokens and Tags: [('Machine', 'NN'), ('learning', 'VBG'), ('enables', 'NNS'), ('predictive', 'JJ'), ('analysis', 'NN'), ('.', '.')]
 
Original Sentence: Data preprocessing is a crucial step in NLP.
Text Tokens and Tags: [('Data', 'NNP'), ('preprocessing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('crucial', 'JJ'), ('step', 'NN'), ('in', '


## Code Objective:

- Tokenizes each sentence from the dataset.
- Applies POS tagging to each token (e.g., identifying nouns, verbs, adjectives).
- Prints the tagged result for each sentence.

### Code Explanation:

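- `nltk.download('averaged_perceptron_tagger')`: downloads the pretrained POS tagger model required to label each word with its part of speech.
- `for` loop: goes through each sentence in the dataset and uses `word_tokenize()` to split it into words and punctuation.
- `tags = nltk.pos_tag(token)`: applies POS tagging to the tokens using NLTK's `pos_tag()` function. It returns a list of tuples of the form `('word', 'POS-tag')`, using Penn Treebank tags such as `NN` (noun), `VBZ` (verb, third-person singular present), and `JJ` (adjective).
- The tagger is statistical and can be wrong: in the output above, "enables" is tagged `NNS` (plural noun) although it is used as a verb.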
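
Because the result is a plain list of `(word, tag)` tuples, it can be filtered or counted with ordinary Python. The sketch below reuses the tagged output of the first sentence from above as literal data; the helper `words_with_tag_prefix` is a hypothetical name, and grouping by tag prefix is one possible convention rather than an NLTK feature.

```python
from collections import Counter

# Tagged output for the first sentence, copied from the result above.
tags = [("I", "PRP"), ("love", "VBP"), ("programming", "VBG"),
        ("in", "IN"), ("Python", "NNP"), (".", ".")]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tags, "NN")   # covers NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tags, "VB")   # covers VB, VBD, VBG, VBN, VBP, VBZ
tag_counts = Counter(tag for _, tag in tags)

print("Nouns:", nouns)
print("Verbs:", verbs)
print("Tag counts:", dict(tag_counts))
```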


## 5. Named Entity Recognition (NER)

In [None]:
# Step-1: Import libraries
import spacy

# Step-2: Load the English Model
nlp = spacy.load("en_core_web_sm")

# Step-3: Extract the necessary columns from the entire dataset(Slicing)
x = data['Sentence']
print(x)
# Step-4: Process the sentences 
for row in x:
    doc = nlp(row)

    # Step-5: Display the Result
    if doc.ents:
        print("  Named Entities:")
        for ent in doc.ents:
            print(f"{ent.text} -> {ent.label_}")
            print(" ")
    else:
        print("Sentence does not contain any Named Entity")
    
    

0                    I love programming in Python.
1      Natural Language Processing is fascinating.
2        Spacy and NLTK are popular NLP libraries.
3    Machine learning enables predictive analysis.
4     Data preprocessing is a crucial step in NLP.
Name: Sentence, dtype: object
  Named Entities:
Python -> GPE
 
  Named Entities:
Natural Language Processing -> ORG
 
  Named Entities:
Spacy -> PERSON
 
NLTK -> ORG
 
NLP -> ORG
 
Sentence does not contain any Named Entity
  Named Entities:
NLP -> ORG
 



## Code Objective:

This code uses spaCy's built-in NER capabilities to extract and print named entities from each sentence in a CSV file column.

### Code Explanation:

- `import spacy`: imports the spaCy NLP library, which handles tasks such as tokenization, POS tagging, and Named Entity Recognition (NER).
- `spacy.load("en_core_web_sm")`: loads the small English model (`en_core_web_sm`). This pretrained pipeline includes a tokenizer, a POS tagger, and a named entity recognizer.
- `nlp(row)`: passes the sentence (`row`) through the NLP pipeline. The result, `doc`, is a spaCy `Doc` object that holds the linguistic information about the sentence: tokens, POS tags, and named entities (via `doc.ents`).
- `doc.ents`: returns a tuple of the named entities found (person names, organizations, dates, and so on). If it is not empty, the sentence contains named entities.
- `{ent.text} -> {ent.label_}`: for each entity:
    - `ent.text`: the entity text as it appears in the sentence (e.g., "Python").
    - `ent.label_`: the entity type (e.g., `ORG` for organizations, `GPE` for geopolitical entities such as countries and cities).

The labels are model predictions and can be wrong: in the output above, "Python" is labeled `GPE` and "Spacy" is labeled `PERSON`.
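
Once entities are extracted, it is often handy to collect them by type instead of printing them one at a time. The sketch below assumes the entities have already been gathered as `(text, label)` pairs, for instance with `[(ent.text, ent.label_) for ent in doc.ents]`; the pairs are copied from the output above, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("Python", "GPE"),
    ("Natural Language Processing", "ORG"),
    ("Spacy", "PERSON"),
    ("NLTK", "ORG"),
    ("NLP", "ORG"),
    ("NLP", "ORG"),
]

def group_by_label(pairs):
    """Map each entity label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

Grouping this way also makes questionable predictions easier to spot, since each label's entities appear side by side.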

## 6. One-Hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Step 1: Get first 3 sentences
x = data['Sentence'].iloc[:3]

# Step 2: Tokenize sentences into words and flatten
tokenized = [sentence.lower().split() for sentence in x]
flat_words = [[word] for sentence in tokenized for word in sentence]  # Must be 2D for encoder


# Step 3: Initialize and apply OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(flat_words)

# Step 4: Create DataFrame with proper column names
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Word']))


# Optional: Add original words as a reference
words = [word[0] for word in flat_words]
encoded_df.insert(0, 'Word', words)

# Print result
print(encoded_df)



            Word  Word_and  Word_are  Word_fascinating.  Word_i  Word_in  \
0              i       0.0       0.0                0.0     1.0      0.0   
1           love       0.0       0.0                0.0     0.0      0.0   
2    programming       0.0       0.0                0.0     0.0      0.0   
3             in       0.0       0.0                0.0     0.0      1.0   
4        python.       0.0       0.0                0.0     0.0      0.0   
5        natural       0.0       0.0                0.0     0.0      0.0   
6       language       0.0       0.0                0.0     0.0      0.0   
7     processing       0.0       0.0                0.0     0.0      0.0   
8             is       0.0       0.0                0.0     0.0      0.0   
9   fascinating.       0.0       0.0                1.0     0.0      0.0   
10         spacy       0.0       0.0                0.0     0.0      0.0   
11           and       1.0       0.0                0.0     0.0      0.0   
12          


## Code Objective:

- Take the first 3 sentences from the dataset.
- Tokenize them into individual words.
- Apply one-hot encoding using scikit-learn's `OneHotEncoder`.
- Return a DataFrame in which each word is represented as a one-hot encoded vector.

### Code Explanation:

- `from sklearn.preprocessing import OneHotEncoder`: imports `OneHotEncoder` from scikit-learn, which transforms categorical data into binary vectors.
- `data['Sentence'].iloc[:3]`: selects the first 3 sentences from the "Sentence" column.
- `[sentence.lower().split() for sentence in x]`: converts each sentence to lowercase and splits it on whitespace. Because this is not `word_tokenize()`, punctuation stays attached to words (e.g., "python." and "fascinating.").
- `[[word] for sentence in tokenized for word in sentence]`: flattens the list of lists and wraps each word in its own list. This is required because `OneHotEncoder` expects a 2D array in which each row is a sample (here, a word).
- `OneHotEncoder(sparse_output=False)`: creates the encoder; `sparse_output=False` makes it return a dense NumPy array instead of a sparse matrix.
- `encoder.fit_transform(flat_words)`: fits the encoder to the word list and transforms the words into binary vectors. Each word becomes a vector with 1 in that word's column and 0 elsewhere.
- `pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Word']))`: creates a pandas DataFrame from the encoded result. `get_feature_names_out(['Word'])` returns the column names, e.g., `['Word_love', 'Word_python.', ...]`.
- `words = [word[0] for word in flat_words]`: extracts the original words from the nested list `flat_words`.
- `encoded_df.insert(0, 'Word', words)`: inserts the original words as the first column of the DataFrame for easier interpretation.
- `print(encoded_df)`: displays the final one-hot encoded table, where each row represents a word and each column is a binary indicator showing whether that word matches a specific vocabulary word.
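
To see what the encoder produces without scikit-learn, the plain-Python sketch below builds the same kind of vectors by hand. `one_hot_encode` is a hypothetical helper, not scikit-learn's implementation; it sorts the distinct words so the columns come out in alphabetical order, as in the table above.

```python
def one_hot_encode(words):
    """Return (vocabulary, vectors): one 0/1 vector per input word."""
    vocabulary = sorted(set(words))              # one column per distinct word
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for word in words:
        vector = [0] * len(vocabulary)
        vector[index[word]] = 1                  # single 1 in the word's column
        vectors.append(vector)
    return vocabulary, vectors

words = "i love programming in python.".split()
vocabulary, vectors = one_hot_encode(words)

print(vocabulary)
for word, vector in zip(words, vectors):
    print(f"{word:12s} {vector}")
```

Every vector is as long as the vocabulary and contains exactly one 1, which is why the table grows one column wider for each new distinct word.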

### Why is one-hot encoding useful?

Machine learning algorithms need numeric input, and one-hot encoding is a common way to represent text as numbers.
The encoded data can be used as input features for models such as logistic regression or SVMs.
One-hot encoding is simple and interpretable, but each vector is as long as the vocabulary and says nothing about how similar two words are, so for larger vocabularies or deep learning tasks, dense embeddings such as Word2Vec or BERT are usually a better fit.

### Output for an unknown word using Keras `one_hot`

In [None]:
# Step-1: Import libraries
import tensorflow 

# Step-2: import necessary functions
from tensorflow.keras.preprocessing.text import one_hot

# Step-3: Define a vocabulary size (required!) for hashing
vocab_size = 100
# Step-4: Initializing Unknown Word
word = "transformer"

# Step-5: Perform OneHotEncoding for Unknown Word
encoded = one_hot(word, vocab_size)

# Step-6: Display the Result
print(f"One-hot encoded value for '{word}':", encoded)


One-hot encoded value for 'transformer': [35]



## Code Objective:

- This code uses hashing-based encoding, which does not care whether a word was seen before: a hash function turns any word into an integer index in a fixed range.
- Even an unknown word is encoded to an integer. This allows the model to:
    - Handle new words at test time.
    - Avoid failing with "unknown token" errors.

### Code Explanation:

- `from tensorflow.keras.preprocessing.text import one_hot`: imports the `one_hot` function from Keras.
  📌 Note: `one_hot` here is not true one-hot encoding; it is a hashing-based index generator.
- `word = "transformer"`: a word that is not in our training data, chosen to simulate how unknown words are handled.
- `one_hot(word, vocab_size)`: hashes the word and returns an integer index, a pseudo-ID for that word.
    - For "transformer" and `vocab_size = 100`, it might return something like `[35]`.
    - The function never looks at a vocabulary list, so "transformer" gets an index even though it was never seen in training.
    - By default it uses Python's built-in `hash`, so the index is stable within one session but can change between runs (string hashing is randomized unless `PYTHONHASHSEED` is fixed).
    - It maps any word to an integer from 1 to `vocab_size - 1`; index 0 is reserved. Different words can collide on the same index.
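
The hashing idea can be reproduced with the standard library. The sketch below is an assumption-based illustration rather than the Keras implementation: `stable_hash_index` is a hypothetical helper that uses MD5 so the index is the same on every run, and it leaves index 0 unused to mirror the reserved index described above.

```python
import hashlib

def stable_hash_index(word, vocab_size):
    """Map a word to an integer in [1, vocab_size - 1] using MD5.

    Index 0 is left unused, mirroring the reserved index in Keras.
    """
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

vocab_size = 100
idx = stable_hash_index("transformer", vocab_size)
print("Index for 'transformer':", idx)

# 200 distinct words cannot fit into 99 slots without collisions.
many_words = [f"word{i}" for i in range(200)]
distinct_indices = {stable_hash_index(w, vocab_size) for w in many_words}
print("Distinct indices used by 200 words:", len(distinct_indices))
```

The second print shows the trade-off of hashing: no vocabulary has to be stored, but once there are more words than slots, several words necessarily share an index.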

