Assignment 3: Perform text cleaning, lemmatization (any method), stop word removal (any method), and
label encoding. Create representations using TF-IDF. Save the outputs.

# Text Cleaning, Lemmatization, Stop Word Removal, Label Encoding & TF-IDF

**Step 1: Import Libraries & Download NLTK Data**

In [5]:
import nltk
import pandas as pd
import numpy as np
import re
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

**Step 2: Sample Data**

In [2]:
data = {
    "text": [
        "I love NLP and Machine Learning!",
        "This is my 3rd NLP assignment.",
        "Text preprocessing is very important.",
        "I enjoy working with Python for NLP."
    ],
    "label": ["positive", "neutral", "neutral", "positive"]
}

df = pd.DataFrame(data)
df

,text,label
0,I love NLP and Machine Learning!,positive
1,This is my 3rd NLP assignment.,neutral
2,Text preprocessing is very important.,neutral
3,I enjoy working with Python for NLP.,positive


**Step 3: Text Cleaning**

In [3]:
def clean_text(text):
    text = text.lower()                      # lowercase
    text = re.sub(r'[^a-z\s]', '', text)     # remove punctuation & numbers
    text = re.sub(r'\s+', ' ', text).strip() # remove extra spaces
    return text

df["cleaned_text"] = df["text"].apply(clean_text)
df

,text,label,cleaned_text
0,I love NLP and Machine Learning!,positive,i love nlp and machine learning
1,This is my 3rd NLP assignment.,neutral,this is my rd nlp assignment
2,Text preprocessing is very important.,neutral,text preprocessing is very important
3,I enjoy working with Python for NLP.,positive,i enjoy working with python for nlp
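Because the regex deletes digits character by character, "3rd" survives as the fragment "rd", which then shows up as a feature later on. One possible alternative, sketched here as an assumption rather than part of the assignment, is to drop any token that contains a digit before stripping punctuation. The function name `clean_text_drop_digit_tokens` is hypothetical.

```python
import re

def clean_text_drop_digit_tokens(text):
    """Variant of clean_text that removes whole tokens containing digits."""
    text = text.lower()                           # lowercase
    text = re.sub(r'\b\w*\d\w*\b', ' ', text)     # drop tokens such as "3rd" entirely
    text = re.sub(r'[^a-z\s]', '', text)          # remove remaining punctuation
    text = re.sub(r'\s+', ' ', text).strip()      # collapse extra spaces
    return text

cleaned = clean_text_drop_digit_tokens("This is my 3rd NLP assignment.")
print(cleaned)  # this is my nlp assignment
```

Whether the fragment matters depends on the data; on a four-sentence sample it only adds one spurious vocabulary entry.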


**Step 4: Lemmatization**

In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

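def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    # lemmatize() defaults to pos='n' (noun), so verb forms such as
    # "working" are left unchanged unless a POS tag is passed
    lemmas = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(lemmas)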

df["lemmatized_text"] = df["cleaned_text"].apply(lemmatize_text)
df

,text,label,cleaned_text,lemmatized_text
0,I love NLP and Machine Learning!,positive,i love nlp and machine learning,i love nlp and machine learning
1,This is my 3rd NLP assignment.,neutral,this is my rd nlp assignment,this is my rd nlp assignment
2,Text preprocessing is very important.,neutral,text preprocessing is very important,text preprocessing is very important
3,I enjoy working with Python for NLP.,positive,i enjoy working with python for nlp,i enjoy working with python for nlp


**Step 5: Stop Word Removal**

In [7]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    filtered = [word for word in tokens if word not in stop_words]
    return " ".join(filtered)

df["final_text"] = df["lemmatized_text"].apply(remove_stopwords)
df

,text,label,cleaned_text,lemmatized_text,final_text
0,I love NLP and Machine Learning!,positive,i love nlp and machine learning,i love nlp and machine learning,love nlp machine learning
1,This is my 3rd NLP assignment.,neutral,this is my rd nlp assignment,this is my rd nlp assignment,rd nlp assignment
2,Text preprocessing is very important.,neutral,text preprocessing is very important,text preprocessing is very important,text preprocessing important
3,I enjoy working with Python for NLP.,positive,i enjoy working with python for nlp,i enjoy working with python for nlp,enjoy working python nlp
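Since the labels here are sentiment-like, it may be worth keeping negation words, which general-purpose stop lists (including NLTK's English list) tend to contain. The sketch below illustrates the idea with a tiny stand-in stop list so that it runs without NLTK data; in the notebook, `stop_words` would be passed instead. The function name and the choice of negation words are assumptions.

```python
# Stand-in for set(stopwords.words('english')); deliberately tiny.
demo_stop_words = {"i", "is", "this", "do", "not", "no", "and", "the"}

# Words to keep even though they are in the stop list (an assumption
# about what matters for a sentiment-style task).
negations = {"not", "no", "nor"}

def remove_stopwords_keep_negations(text, stop_words, keep):
    """Remove stop words from already-cleaned text, except those in `keep`."""
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w not in effective)

result = remove_stopwords_keep_negations("i do not love this", demo_stop_words, negations)
print(result)  # not love
```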


**Step 6: Label Encoding**

In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df["label_encoded"] = label_encoder.fit_transform(df["label"])

df

,text,label,cleaned_text,lemmatized_text,final_text,label_encoded
0,I love NLP and Machine Learning!,positive,i love nlp and machine learning,i love nlp and machine learning,love nlp machine learning,1
1,This is my 3rd NLP assignment.,neutral,this is my rd nlp assignment,this is my rd nlp assignment,rd nlp assignment,0
2,Text preprocessing is very important.,neutral,text preprocessing is very important,text preprocessing is very important,text preprocessing important,0
3,I enjoy working with Python for NLP.,positive,i enjoy working with python for nlp,i enjoy working with python for nlp,enjoy working python nlp,1
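The integer codes follow the sorted order of the class names, which is why "neutral" becomes 0 and "positive" becomes 1. The pure-Python sketch below mirrors that mapping for illustration; it is not how `LabelEncoder` is implemented. In the notebook itself, `label_encoder.classes_` and `label_encoder.inverse_transform` give the same information.

```python
labels = ["positive", "neutral", "neutral", "positive"]

# Sorted unique class names define the integer codes.
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}
id_to_class = {idx: name for name, idx in class_to_id.items()}

encoded = [class_to_id[name] for name in labels]
decoded = [id_to_class[idx] for idx in encoded]

print(class_to_id)  # {'neutral': 0, 'positive': 1}
print(encoded)      # [1, 0, 0, 1]
```

Keeping the mapping alongside the encoded column makes it possible to turn predictions back into readable labels later.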


**Step 7: TF-IDF Vectorization**

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["final_text"])

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
tfidf_df

,assignment,enjoy,important,learning,love,machine,nlp,preprocessing,python,rd,text,working
0,0.0,0.0,0.0,0.541736,0.541736,0.541736,0.345783,0.0,0.0,0.0,0.0,0.0
1,0.644503,0.0,0.0,0.0,0.0,0.0,0.411378,0.0,0.0,0.644503,0.0,0.0
2,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.57735,0.0
3,0.0,0.541736,0.0,0.0,0.0,0.0,0.345783,0.0,0.541736,0.0,0.0,0.541736
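To see where these numbers come from, the weights can be recomputed by hand. With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, multiplies it by the raw term count, and then scales each row to unit L2 length. The sketch below assumes whitespace tokenization, which matches the vectorizer's default tokenizer on this sample because every token has at least two characters.

```python
import math
from collections import Counter

docs = [
    "love nlp machine learning",
    "rd nlp assignment",
    "text preprocessing important",
    "enjoy working python nlp",
]

def tfidf_rows(docs):
    """Recompute default-style TF-IDF rows (smoothed idf, L2 norm) by hand."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for toks in tokenized for t in toks})
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 length of the row
        rows.append({t: raw.get(t, 0.0) / norm for t in vocab})
    return vocab, rows

vocab, rows = tfidf_rows(docs)
print(round(rows[0]["love"], 6), round(rows[0]["nlp"], 6))
print(round(rows[1]["rd"], 6), round(rows[2]["text"], 6))
```

"nlp" appears in three of the four documents, so its idf is lower and it receives a smaller weight than words that occur in only one document.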

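The assignment also asks for the outputs to be saved, which the steps above do not yet do. A minimal sketch, assuming CSV is an acceptable format: the helper `save_outputs` and both file names are hypothetical, and small stand-in frames are used so the block runs on its own. In the notebook, `df` and `tfidf_df` would be passed instead, along with a real output folder.

```python
import os
import tempfile

import pandas as pd

def save_outputs(df, tfidf_df, out_dir):
    """Write the processed text table and the TF-IDF matrix to CSV files."""
    os.makedirs(out_dir, exist_ok=True)
    processed_path = os.path.join(out_dir, "processed_text.csv")  # hypothetical file name
    tfidf_path = os.path.join(out_dir, "tfidf_features.csv")      # hypothetical file name
    # index=False avoids an extra unnamed index column when the file is read back
    df.to_csv(processed_path, index=False)
    tfidf_df.to_csv(tfidf_path, index=False)
    return processed_path, tfidf_path

# Stand-in frames for demonstration only.
demo_df = pd.DataFrame({"final_text": ["love nlp machine learning"], "label_encoded": [1]})
demo_tfidf = pd.DataFrame({"love": [0.5], "nlp": [0.3]})

paths = save_outputs(demo_df, demo_tfidf, tempfile.mkdtemp())
print(paths)
```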