# Text Preprocessing for Sentiment140

In this notebook, we will process the dataset to prepare it for training.  
This includes:
- Tokenization
- Lemmatization
- Formatting the dataset for Word2Vec training

The cleaned dataset from the EDA step will be used as input.


In [5]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

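# Download required NLTK resources
nltk.download("punkt", quiet=True)      # tokenizer models used by word_tokenize
nltk.download("punkt_tab", quiet=True)  # tokenizer tables needed by newer NLTK releases
nltk.download("wordnet", quiet=True)    # lexical database used by WordNetLemmatizer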


True

## Loading the Processed Dataset

We load the cleaned dataset produced by the previous EDA step.  
It is stored in `data/processed/sentiment140_clean.csv` (read below via a path relative to this notebook)  
and has already undergone stopword removal and basic text cleaning; the result is in the `clean_text` column.


In [6]:
df = pd.read_csv("../data/processed/sentiment140_clean.csv")

df.head()


| | sentiment | text | tweet_length | clean_text |
|---|---|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 115 | switchfoot bummer shoulda david carr third |
| 1 | 0 | is upset that he can't update his Facebook by ... | 111 | upset cant update facebook texting might cry r... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... | 89 | kenichan dived many times ball managed save re... |
| 3 | 0 | my whole body feels itchy and like its on fire | 47 | whole body feels itchy fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... | 111 | nationwideclass behaving mad cant |


## Tokenization and Lemmatization

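Each tweet is converted into a sequence of words using tokenization.  
Since stopwords were already removed in EDA, we only split the text into individual words, drop non-alphabetic and single-character tokens, and apply **lemmatization** to standardize the vocabulary.  
Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so plurals are reduced (`times` becomes `time`) while inflected verbs such as `dived` or `behaving` are left unchanged.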
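
The token filter used by `preprocess_text` in the next cell can be tried on its own, without NLTK. The sketch below is only an illustration: `keep_token` is a hypothetical helper and the sample tokens are invented, not taken from the dataset.

```python
def keep_token(token: str) -> bool:
    """Return True for purely alphabetic tokens longer than one character."""
    return token.isalpha() and len(token) > 1


# Invented sample tokens, chosen to hit each rejection case:
# digits mixed with letters, a single letter, an apostrophe, a pure number
sample_tokens = ["love", "2day", "u", "can't", "100", "great"]

kept = [token for token in sample_tokens if keep_token(token)]
print(kept)
```

Only `love` and `great` pass both conditions; every other sample token fails either `isalpha()` or the length check.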


In [7]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Tokenizes and lemmatizes the input text, dropping non-alphabetic and single-character tokens.
    
    Args:
        text (str): Input text to preprocess.
    
    Returns:
        list: List of lemmatized tokens.
    """
    # Handle NaN or non-string values
    if not isinstance(text, str):
        return []
    
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Keep alphabetic tokens longer than one character, then lemmatize
    # (lemmatize() defaults to pos="n", i.e. noun lemmatization)
    lemmatized_tokens = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token.isalpha() and len(token) > 1
    ]
    
    return lemmatized_tokens


In [9]:
df["tokens"] = df["clean_text"].apply(preprocess_text)

df[["clean_text", "tokens"]].head()



| | clean_text | tokens |
|---|---|---|
| 0 | switchfoot bummer shoulda david carr third | [switchfoot, bummer, shoulda, david, carr, third] |
| 1 | upset cant update facebook texting might cry r... | [upset, cant, update, facebook, texting, might... |
| 2 | kenichan dived many times ball managed save re... | [kenichan, dived, many, time, ball, managed, s... |
| 3 | whole body feels itchy fire | [whole, body, feel, itchy, fire] |
| 4 | nationwideclass behaving mad cant | [nationwideclass, behaving, mad, cant] |
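
The overview lists formatting the dataset for Word2Vec training as the last step. Word2Vec implementations such as gensim's typically take the corpus as an iterable of token lists, one list per document. A minimal sketch of that conversion follows, assuming that tweets whose tokens were all filtered out should be dropped; `token_lists` is a hypothetical stand-in for `df["tokens"].tolist()`.

```python
# Hypothetical stand-in for df["tokens"].tolist()
token_lists = [
    ["whole", "body", "feel", "itchy", "fire"],
    [],  # a tweet whose tokens were all filtered out
    ["nationwideclass", "behaving", "mad", "cant"],
]

# Keep only non-empty token lists: one list of strings per tweet
sentences = [tokens for tokens in token_lists if tokens]

# Quick sanity check on corpus size and vocabulary
vocabulary = {token for sentence in sentences for token in sentence}
print(f"{len(sentences)} sentences, {len(vocabulary)} distinct tokens")
```

Whether empty lists should be dropped here or earlier, at the DataFrame level, depends on whether the sentiment labels need to stay aligned with the sentences.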


In [13]:
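# Save the tokenized dataset for the next step.
# Note: to_csv writes each token list as its string representation,
# so the "tokens" column must be parsed back into lists after loading.
output_path = "../data/processed/sentiment140_tokenized.csv"
df[["sentiment", "clean_text", "tokens"]].to_csv(output_path, index=False)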
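
One thing to keep in mind for the next step: CSV has no list type, so the `tokens` column comes back as plain strings when the file is read. A minimal sketch of the round trip, assuming `ast.literal_eval` is an acceptable way to parse the column; `df_demo` is an invented two-row frame and the in-memory buffer stands in for the real output file.

```python
import ast
import io

import pandas as pd

# Small invented frame standing in for the tokenized dataset
df_demo = pd.DataFrame(
    {
        "sentiment": [0, 0],
        "tokens": [["whole", "body", "feel"], ["mad", "cant"]],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# After loading, each cell in "tokens" is a string, not a list
loaded = pd.read_csv(buffer)
print(type(loaded["tokens"].iloc[0]))

# ast.literal_eval parses the string back into a Python list
loaded["tokens"] = loaded["tokens"].apply(ast.literal_eval)
print(type(loaded["tokens"].iloc[0]))
```

The same parsing can be done at load time by passing `converters={"tokens": ast.literal_eval}` to `pd.read_csv`. For a dataset of this size, a format that preserves list columns may be worth considering instead of CSV.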