# 3. Perform text cleaning, lemmatization (any method), stop-word removal (any method),
# and label encoding. Create representations using TF-IDF. Save outputs.

Text Cleaning:

Text cleaning is the process of preparing raw text data for analysis or processing by removing irrelevant characters, symbols, or formatting.
It typically involves steps such as:
Lowercasing: Converting all text to lowercase to ensure consistency.
Removing punctuation: Eliminating punctuation marks like commas, periods, and quotation marks.
Removing special characters: Getting rid of non-alphanumeric characters such as emojis or symbols.
Removing numbers: Excluding numerical digits that may not contribute to the textual meaning.
Handling whitespace: Normalizing spaces, tabs, or line breaks.
Removing HTML tags: Stripping out HTML tags if the text includes web content.
Correcting spelling: Optionally, correcting spelling errors using techniques like spell-checking.
Text cleaning helps improve the quality and consistency of textual data for further analysis or modeling.
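
The steps listed above can be sketched as a single function. This is a minimal illustration using only the standard library, not the notebook's own cleaning routine: the name `clean_text`, the `remove_numbers` flag, and the sample string are assumptions made for the example, and the character filter assumes English text (it would also drop accented letters). Spelling correction is left out.

```python
import html
import re

def clean_text(text: str, remove_numbers: bool = True) -> str:
    """Apply the cleaning steps in order and return the normalized string."""
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = html.unescape(text)                   # decode entities such as &amp;
    text = text.lower()                          # lowercase for consistency
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)         # drop digit runs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation, emojis, symbols
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

sample = "<p>Hello,   World!! Visit us in 2024 &amp; say hi :)</p>"
print(clean_text(sample))                        # hello world visit us in say hi
print(clean_text(sample, remove_numbers=False))  # hello world visit us in 2024 say hi
```

Whether to remove numbers depends on the task, which is why it is a switch here rather than a fixed step.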

Lemmatization:

Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma.
Unlike stemming, which simply chops off prefixes or suffixes to derive a root (stem) that may not be a real word, lemmatization uses context (such as part of speech) and morphological analysis to produce valid lemmas.
For example, the lemma of "running" is "run," and the lemma of "better" is "good."
Lemmatization often requires a dictionary or lexicon to map words to their respective lemmas.
It is useful in text normalization tasks where maintaining the integrity of words is important, such as in search engines or machine translation systems.
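
To make the contrast with stemming concrete, here is a toy sketch. Both `naive_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration, and the four-entry `LEMMA_LEXICON` stands in for the much larger dictionary a real lemmatizer (such as WordNet) relies on.

```python
# Toy lexicon: a real lemmatizer ships a far larger dictionary.
LEMMA_LEXICON = {
    "running": "run",
    "better": "good",
    "mice": "mouse",
    "studies": "study",
}

def naive_stem(word: str) -> str:
    """Crude suffix stripping; the result need not be a real word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMA_LEXICON.get(word, word)

for w in ["running", "better", "mice", "studies"]:
    print(f"{w:8} stem={naive_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer turns "running" into "runn" and leaves "better" untouched, while the lookup returns "run" and "good". Irregular forms like these are why a lexicon is needed.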

Removing Stop Words:

Stop words are very common words that are often filtered out during text processing because they carry little semantic content on their own.
Examples of stop words include "the," "is," "and," "in," "of," etc.
Removing stop words helps reduce noise in text data and focuses attention on more meaningful words that carry important information.
However, the list of stop words may vary depending on the specific application or language, and it may be necessary to customize the stop word list accordingly.
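
Customizing the list might look like the sketch below. The small `BASE_STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for illustration; in practice the base list would come from a library such as NLTK. The example keeps "not" (useful when negation matters, as in sentiment analysis) and adds a domain-specific filler word.

```python
# Illustrative base list; a library-provided list would be much longer.
BASE_STOP_WORDS = {"the", "is", "and", "in", "of", "not", "this"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

# Keep the negation word, and treat "document" as noise for this corpus.
custom_stop_words = (BASE_STOP_WORDS - {"not"}) | {"document"}

tokens = "this document is not the first".split()
print(remove_stop_words(tokens, BASE_STOP_WORDS))    # ['document', 'first']
print(remove_stop_words(tokens, custom_stop_words))  # ['not', 'first']
```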

Label Encoding:

Label encoding is a process of converting categorical labels or classes into numerical representations.
It is commonly used in machine learning algorithms that require numerical input, such as regression or classification models.
Each unique label or class is assigned a unique integer value.
Label encoding is straightforward and can be done using simple mapping or encoding schemes.
However, it's essential to ensure that the numerical representations do not imply any ordinal relationship between the categories unless such a relationship exists.
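Label encoding differs from one-hot encoding, where each category is represented by a binary vector. The integers produced by label encoding are arbitrary identifiers, but a model may still read them as ordered (for example, treating C = 2 as greater than A = 0); one-hot vectors carry no such implied order.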
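
The two encodings can be compared side by side with a plain mapping. This is a sketch under simple assumptions, not a replacement for a library encoder: the variable names are made up for the example, and classes are numbered in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes.

```python
labels = ["A", "B", "C", "A"]

# Label encoding: one integer per distinct class, assigned in sorted order.
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}
encoded = [label_to_id[label] for label in labels]

# One-hot encoding: one binary vector per sample, with a single 1.
one_hot = [[1 if i == code else 0 for i in range(len(classes))] for code in encoded]

# Inverse mapping, needed to turn predictions back into class names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(encoded)   # [0, 1, 2, 0]
print(one_hot)   # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print([id_to_label[i] for i in encoded])  # ['A', 'B', 'C', 'A']
```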

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import pickle

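# Download NLTK resources (stop-word list, WordNet lexicon, Punkt tokenizer models)
# Note: recent NLTK releases look for 'punkt_tab' instead of 'punkt'; download that
# resource as well if word_tokenize raises a LookupError.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')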

# Sample data
data = {'Text': [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
],
    'Label': ['A', 'B', 'C', 'A']}

df = pd.DataFrame(data)

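# Text Cleaning, Stop-Word Removal, and Lemmatization
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words, then lemmatize."""
    words = nltk.word_tokenize(text.lower())
    # lemmatize() treats each word as a noun unless a pos argument is supplied
    words = [lemmatizer.lemmatize(word) for word in words if word.isalpha() and word not in stop_words]
    return ' '.join(words)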

df['Cleaned_Text'] = df['Text'].apply(preprocess_text)

# Label Encoding
label_encoder = LabelEncoder()
df['Encoded_Label'] = label_encoder.fit_transform(df['Label'])

# TF-IDF Representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['Cleaned_Text'])

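# Save Outputs
# cleaned_data.csv holds the original text, cleaned text, labels, and encoded labels.
df.to_csv('cleaned_data.csv', index=False)
# The pickled matrix is a SciPy sparse matrix with no column names; pickle
# tfidf_vectorizer and label_encoder too if new text must be transformed
# or encoded labels decoded later.
with open('tfidf_matrix.pkl', 'wb') as tfidf_file:
    pickle.dump(X_tfidf, tfidf_file)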


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
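
The cell above relies on `TfidfVectorizer` for the TF-IDF step. To show roughly what it computes, here is a standard-library sketch, assuming the vectorizer's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The `tfidf` function and the `docs` list are assumptions for illustration: the strings are meant to resemble cleaned text, and tokens are split on whitespace, which is simpler than the vectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        term_freq = Counter(doc)
        weights = [term_freq[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["first document", "document second document", "third one", "first document"]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first row, "first" receives a higher weight than "document" because it appears in fewer documents, which is the effect the IDF term is meant to produce.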
