### Name : Yuvraj Pardeshi
### Roll No : 43548
### Sub - NLP
### Assignment No - 3 
Perform text cleaning, lemmatization (any method), stop-word removal (any method), and 
label encoding. Create representations using TF-IDF. Save outputs.

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [2]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Yuvraj
[nltk_data]     Pardeshi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Yuvraj
[nltk_data]     Pardeshi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Yuvraj
[nltk_data]     Pardeshi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Sample data
data = {
    'text': ['This is an example of text cleaning and lemmatization.',
             'We will remove stop words from this text.',
             'Label encoding is important for machine learning tasks.',
             'TF-IDF is a popular technique for text representation.'],
    'label': ['cleaning', 'removal', 'encoding', 'representation']
}

# Convert data to DataFrame
df = pd.DataFrame(data)
df

                                                text           label
0  This is an example of text cleaning and lemmat...        cleaning
1          We will remove stop words from this text.         removal
2  Label encoding is important for machine learni...        encoding
3  TF-IDF is a popular technique for text represe...  representation


In [4]:
# Build these once instead of on every function call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Text cleaning function
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace
    text = text.lower()  # Convert text to lowercase
    return text

# Lemmatization function (WordNetLemmatizer treats each token as a noun by default)
def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(word) for word in tokens)

# Remove stop words function
def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join(word for word in tokens if word not in stop_words)
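
One side effect of the cleaning regex is worth seeing on a concrete input: deleting the hyphen fuses `TF-IDF` into the single token `tfidf`. The sketch below repeats `clean_text` from the cell above and compares it with `clean_text_spaced`, a hypothetical variant (not part of the assignment) that replaces removed characters with a space instead.

```python
import re

def clean_text(text):
    # Same logic as the notebook's function: keep letters and whitespace only
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

def clean_text_spaced(text):
    # Hypothetical variant: swap each removed character for a space,
    # then collapse runs of whitespace into one space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip().lower()

sample = 'TF-IDF is a popular technique for text representation.'
print(clean_text(sample))         # tfidf is a popular technique for text representation
print(clean_text_spaced(sample))  # tf idf is a popular technique for text representation
```

Which behaviour is preferable depends on whether hyphenated terms should count as one feature or two in the later TF-IDF step.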

In [6]:
df['cleaned_text'] = df['text'].apply(clean_text)
df['cleaned_text']

0    this is an example of text cleaning and lemmat...
1             we will remove stop words from this text
2    label encoding is important for machine learni...
3    tfidf is a popular technique for text represen...
Name: cleaned_text, dtype: object

In [9]:
df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)
df['lemmatized_text'] 

0    this is an example of text cleaning and lemmat...
1              we will remove stop word from this text
2    label encoding is important for machine learni...
3    tfidf is a popular technique for text represen...
Name: lemmatized_text, dtype: object

In [11]:
df['processed_text'] = df['lemmatized_text'].apply(remove_stopwords)
df['processed_text'] 

0               example text cleaning lemmatization
1                             remove stop word text
2    label encoding important machine learning task
3       tfidf popular technique text representation
Name: processed_text, dtype: object
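
The label-encoding step maps each distinct class name to an integer. As a rough sketch of what scikit-learn's `LabelEncoder` does for string labels, the plain-Python snippet below numbers the sorted unique classes. The helper name `encode_labels` is made up for illustration.

```python
def encode_labels(labels):
    # Sorted unique classes get consecutive integer codes, starting at 0
    classes = sorted(set(labels))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[name] for name in labels], mapping

labels = ['cleaning', 'removal', 'encoding', 'representation']
codes, mapping = encode_labels(labels)
print(mapping)  # {'cleaning': 0, 'encoding': 1, 'removal': 2, 'representation': 3}
print(codes)    # [0, 2, 1, 3]
```

Because the codes simply follow alphabetical order, they identify a class but imply no ranking between classes. `LabelEncoder.inverse_transform` recovers the original names.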

In [12]:
# Label encoding
label_encoder = LabelEncoder()
df['encoded_label'] = label_encoder.fit_transform(df['label'])

# Print processed data
print("Processed Data:")
print(df)

Processed Data:
                                                text           label  \
0  This is an example of text cleaning and lemmat...        cleaning   
1          We will remove stop words from this text.         removal   
2  Label encoding is important for machine learni...        encoding   
3  TF-IDF is a popular technique for text represe...  representation   

                                        cleaned_text  \
0  this is an example of text cleaning and lemmat...   
1           we will remove stop words from this text   
2  label encoding is important for machine learni...   
3  tfidf is a popular technique for text represen...   

                                     lemmatized_text  \
0  this is an example of text cleaning and lemmat...   
1            we will remove stop word from this text   
2  label encoding is important for machine learni...   
3  tfidf is a popular technique for text represen...   

                                   processed_text  encoded_la
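
The TF-IDF values in this notebook can be checked by hand. The sketch below assumes `TfidfVectorizer` is left at its defaults (smoothed idf, raw term counts, L2 row normalization) and that splitting on whitespace matches its tokenization for these four sentences. The helper `tfidf_row` is illustrative and not part of scikit-learn.

```python
import math

docs = [
    'example text cleaning lemmatization',
    'remove stop word text',
    'label encoding important machine learning task',
    'tfidf popular technique text representation',
]
tokenized = [doc.split() for doc in docs]
# Alphabetically sorted vocabulary, matching the column order of the matrix
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    df_t = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + df_t)) + 1

def tfidf_row(tokens):
    # Raw term count times idf, then scale the row to unit Euclidean length
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

row0 = tfidf_row(tokenized[0])
print(round(row0[vocab.index('example')], 4))  # 0.5417
print(round(row0[vocab.index('text')], 4))     # 0.3458
```

The word `text` appears in three of the four sentences, so its idf is lower and it receives a smaller weight than words that occur in only one sentence.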

In [15]:
# TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_representation = tfidf_vectorizer.fit_transform(df['processed_text'])

# Print TF-IDF representation in detail
print("\nSentences with TF-IDF Representation:")
for i, sentence in enumerate(df['processed_text']):
    print("Sentence:", sentence)
    print("TF-IDF Representation:", tfidf_representation[i].toarray())
    print()


Sentences with TF-IDF Representation:
Sentence: example text cleaning lemmatization
TF-IDF Representation: [[0.5417361  0.         0.5417361  0.         0.         0.
  0.5417361  0.         0.         0.         0.         0.
  0.         0.         0.34578314 0.         0.        ]]

Sentence: remove stop word text
TF-IDF Representation: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.5417361  0.         0.5417361
  0.         0.         0.34578314 0.         0.5417361 ]]

Sentence: label encoding important machine learning task
TF-IDF Representation: [[0.         0.40824829 0.         0.40824829 0.40824829 0.40824829
  0.         0.40824829 0.         0.         0.         0.
  0.40824829 0.         0.         0.         0.        ]]

Sentence: tfidf popular technique text representation
TF-IDF Representation: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.47633035 0.         0.47633035 0.
  0. 