# Text Cleaning for ISL_CLSRT Dataset

This notebook demonstrates how to clean sentence-level data from the ISL_CLSRT dataset for Sign Language Translation tasks. It covers:
- Lowercasing
- Punctuation removal
- Whitespace normalization
- Tokenization
- Stopword removal
- Conversion to gloss-style (uppercase tokens)


In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords (one-time)
nltk.download('stopwords')


In [None]:
df = pd.read_csv('isl_train_meta.csv')
df.head()


In [None]:
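# English stopwords from NLTK, stored as a set for fast membership checks
stop_words = set(stopwords.words('english'))

def clean_text(sentence, remove_stopwords=True):
    """Return an uppercase, gloss-style version of `sentence`."""
    # Missing values (e.g. NaN from read_csv) are not strings; map them to ''
    if not isinstance(sentence, str):
        return ''
    sentence = sentence.lower()
    # Drop punctuation: anything that is not a word character or whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Collapse runs of whitespace and trim both ends
    sentence = re.sub(r'\s+', ' ', sentence).strip()
    tokens = sentence.split()
    if remove_stopwords:
        tokens = [word for word in tokens if word not in stop_words]
    gloss = ' '.join(tokens).upper()
    return gloss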
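
To make the effect of each step visible, here is a minimal sketch that records the intermediate result of every stage for one sentence. It is an illustration rather than part of the pipeline: `trace_cleaning` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's full English list so the example runs without any download.

```python
import re

# Hypothetical mini stopword set for illustration only;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"is", "the", "a", "an", "to"}

def trace_cleaning(sentence, stop_words=DEMO_STOP_WORDS):
    """Return the intermediate result of each cleaning step as a dict."""
    steps = {}
    steps["lowercased"] = sentence.lower()
    # Remove every character that is neither a word character nor whitespace
    steps["no_punctuation"] = re.sub(r"[^\w\s]", "", steps["lowercased"])
    # Collapse whitespace runs to single spaces and trim the ends
    steps["normalized"] = re.sub(r"\s+", " ", steps["no_punctuation"]).strip()
    steps["tokens"] = steps["normalized"].split()
    steps["content_tokens"] = [t for t in steps["tokens"] if t not in stop_words]
    steps["gloss"] = " ".join(steps["content_tokens"]).upper()
    return steps

trace = trace_cleaning("  Where is the   nearest hospital?  ")
for name, value in trace.items():
    print(f"{name:>15}: {value!r}")
```

With this demo stopword set, the final `gloss` entry is `'WHERE NEAREST HOSPITAL'`.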


In [None]:
df['cleaned_gloss'] = df['Sentences'].apply(clean_text)
df[['Sentences', 'cleaned_gloss']].head()
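
One thing worth checking in the output: general-purpose stopword lists such as NLTK's English list include negations ("not", "no") and question words ("what", "where", "how"), which often carry meaning in sign language glosses. The sketch below shows one possible way to keep them by subtracting a keep-list from the stopword set. All names here (`base_stop_words`, `keep_words`, `to_gloss`) are hypothetical, and `base_stop_words` is a small illustrative stand-in for the NLTK list, not the list itself.

```python
import re

# Illustrative subset of a general-purpose stopword list (assumption: stands in
# for the much longer NLTK list used in the notebook).
base_stop_words = {"i", "do", "not", "is", "the", "a", "where", "are", "you", "how"}

# Hypothetical keep-list: negations and question words to preserve in glosses.
keep_words = {"not", "no", "never", "what", "where", "when", "who", "why", "how"}

# Set difference removes the keep-list entries from the stopword set
gloss_stop_words = base_stop_words - keep_words

def to_gloss(sentence, stop_words):
    """Lowercase, strip punctuation, drop stopwords, and uppercase the rest."""
    tokens = re.sub(r"[^\w\s]", "", sentence.lower()).split()
    return " ".join(t for t in tokens if t not in stop_words).upper()

sentence = "I do not know where the school is."
default_gloss = to_gloss(sentence, base_stop_words)
custom_gloss = to_gloss(sentence, gloss_stop_words)
print(default_gloss)
print(custom_gloss)
```

Here the default set yields `KNOW SCHOOL`, while the adjusted set yields `NOT KNOW WHERE SCHOOL`. Whether to keep such words depends on the gloss conventions of the target annotations, so treat the keep-list as a starting point to adapt.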


In [None]:
df.to_csv('isl_train_meta_cleaned.csv', index=False)
print("Cleaned file saved as isl_train_meta_cleaned.csv")


### Summary:
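This notebook lowercased each sentence, stripped punctuation and extra whitespace, removed English stopwords, and converted the remaining tokens to uppercase gloss-style strings. The result is stored in the `cleaned_gloss` column of `isl_train_meta_cleaned.csv`.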
