## Explanation of the Notebook
This notebook takes a dataset of Spanish sentences and applies Natural Language Processing (NLP) to analyze and transform the text into useful features.

In [None]:
from IPython.display import display, HTML

display(HTML('<a href="../Documents/Preliminary feedback task 4_team16.pdf" target="_blank">Open PDF</a>'))

# Feature Overview

**Important Libraries**

- **pandas** → Handles data in tables
- **spaCy** → Identifies parts of speech (nouns, verbs, etc.)
- **NumPy** → Performs mathematical operations
- **NLTK** → Breaks sentences into words
- **GoogleTranslator** → Translates Spanish text to English
- **TfidfVectorizer** → Converts words into numerical importance scores
- **TextBlob** → Analyzes sentiment (positive or negative)
- **Word2Vec** → Converts words into numerical representations

| **Feature Name**           | **Description**                                                    |
|----------------------------|--------------------------------------------------------------------|
| **Sentence_English**       | English translation of the Spanish sentence                        |
| **POS_Tags**               | Part-of-speech tags for each word (e.g., nouns, verbs, adjectives) |
| **TF_IDF**                 | Word importance scores based on Term Frequency–Inverse Document Frequency |
| **Sentiment_Score**        | Polarity score indicating if the sentence is positive or negative  |
| **Pretrained_Embeddings**  | Sentence-level vector using Google’s pretrained Word2Vec model     |
| **Custom_Embeddings**      | Sentence-level vector using a Word2Vec model trained on the dataset |
| **Sentence_Length**        | Number of tokens in the cleaned sentence                           |


#### Methodology with Code Snippets and Descriptions

| **Step**                        | **Why** (Purpose of the Step)                                                                                                      | **Code Snippet**                                                                                                                   |
|---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| **Data Loading**                | To begin processing, we need to load the dataset and ensure there are no missing values that could break downstream operations.    | <pre>df = pd.read_csv('path/to/augmented_dataset.csv')<br>df['text'] = df['text'].fillna('')</pre>                                |
| **POS Tagging**                 | Part-of-speech tags reveal sentence structure and grammatical roles, which can influence how emotions are expressed.               | <pre>def pos_tagging(text):<br>    doc = nlp_en(text)<br>    return [token.pos_ for token in doc]<br>df['POS_Tags'] = df['text'].apply(pos_tagging)</pre> |
| **Text Cleaning**              | To remove noise and standardise the input, ensuring the models and features are based on relevant, clean tokens.                   | <pre>def clean_text(text):<br>    ...  # your full cleaning logic<br>df['Clean_Sentence'] = df['text'].apply(clean_text)</pre>     |
| **TF-IDF Calculation**          | Highlights which words are most important in each sentence, helping models focus on emotionally meaningful terms.                  | <pre>vectorizer = TfidfVectorizer()<br>tfidf_matrix = vectorizer.fit_transform(df['Clean_Sentence'])<br>df['TF_IDF'] = list(tfidf_matrix.toarray())</pre> |
| **Sentiment Analysis**          | Captures the emotional tone (positive, negative, or neutral) of each sentence as a numerical feature.                              | <pre>def sentiment_score(text):<br>    return TextBlob(text).sentiment.polarity<br>df['Sentiment_Score'] = df['text'].apply(sentiment_score)</pre> |
| **Pretrained Word Embeddings**  | Uses external linguistic knowledge (Google News corpus) to convert words into meaningful numeric representations.                   | <pre>word_vectors = api.load("word2vec-google-news-300")<br>def get_embedding(text):<br>    words = word_tokenize(text.lower())<br>    vectors = [word_vectors[word] for word in words if word in word_vectors]<br>    return np.mean(vectors, axis=0) if vectors else np.zeros(300)<br>df['Pretrained_Embeddings'] = df['text'].apply(get_embedding)</pre> |
| **Custom Word2Vec Embeddings**  | Trains embeddings on your specific dataset to capture context-specific emotional expressions not covered in pretrained models.      | <pre>corpus = df['Clean_Sentence'].apply(word_tokenize).tolist()<br>custom_model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)<br>def get_custom_embedding(text):<br>    tokens = word_tokenize(text)<br>    vectors = [custom_model.wv[token] for token in tokens if token in custom_model.wv]<br>    return np.mean(vectors, axis=0) if vectors else np.zeros(100)<br>df['Custom_Embeddings'] = df['Clean_Sentence'].apply(get_custom_embedding)</pre> |
| **Sentence Length Calculation** | Emotionally rich sentences tend to vary in length; token count is a simple but effective feature capturing that variation.          | <pre>df['Sentence_Length'] = df['Clean_Sentence'].apply(lambda x: len(word_tokenize(x)))</pre>                                     |
| **Output Saving**               | Saves all engineered features into a file for model training, sharing, or further analysis.                                        | <pre>df.to_excel('FINAL_DATASET.xlsx', index=False)</pre>                                                                          |
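
The Text Cleaning row in the table elides the function body. A minimal standalone sketch of such a cleaning step, using only the standard library, might look like the following. The function name `clean_text_sketch` and the tiny `STOPWORDS` set are illustrative assumptions (the notebook itself relies on NLTK's English stopword list and `word_tokenize`), and whitespace splitting stands in for real tokenization.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english').
STOPWORDS = {"i", "the", "a", "is", "and", "to", "it", "at"}

def clean_text_sketch(text: str) -> str:
    """Lowercase, strip URLs, mentions and non-letters, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs first
    text = re.sub(r"@\w+", " ", text)                    # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)                # keep letters and whitespace only
    tokens = text.split()                                # split() also collapses whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean_text_sketch("I LOVED it!! See https://example.com @bob #happy 2024"))
```

Removed spans are replaced with a space rather than an empty string so that neighbouring words are not glued together; the final `split()` then discards the extra whitespace.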


## Pipeline

In [1]:
import re
import string

# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Natural Language Processing libraries
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob
from deep_translator import GoogleTranslator

# Machine Learning and Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Word Embeddings
import gensim.downloader as api
from gensim.models import Word2Vec

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Load spaCy models for Spanish and English
nlp_es = spacy.load('es_core_news_sm')
nlp_en = spacy.load('en_core_web_sm')

# Initialize stopwords
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package punkt to /Users/Buas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/Buas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/Buas/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

- Loaded the data from a CSV file into a pandas DataFrame
- The `text` column is already in English, so no translator object is created here (`GoogleTranslator` is imported but not used in the cells below)

In [2]:
# Load the dataset
df =  pd.read_csv('../task_4/augmented_dataset_reduced.csv') 

df.head()

Unnamed: 0,text,main_category
0,there is absolutely no question personality ef...,happiness
1,i cannot find year stats anyhow i needed the e...,disgust
2,hated her and that was honestly a more offensi...,anger
3,pikachu shocked i hope we give him a fair chan...,happiness
4,worst bit is that it keeps snowballing and gro...,anger


In [3]:
# POS Tagging
# Changed POS tag representation from a concatenated string to a list.
# Previously, the POS tags were joined into a single string (e.g., "NOUN VERB ADJ"),
# which lost the sequential structure and individual token information.
# By storing the POS tags as a list, we preserve their order and details,
# enabling more effective downstream processing (e.g., one-hot encoding or embedding)
# and potentially enhancing model performance.
def pos_tagging_list(text):
    doc = nlp_en(text)
    return [token.pos_ for token in doc]

# Apply the function to every row
df['POS_Tags'] = df['text'].apply(pos_tagging_list)
df.head()

Unnamed: 0,text,main_category,POS_Tags
0,there is absolutely no question personality ef...,happiness,"[PRON, VERB, ADV, DET, NOUN, NOUN, NOUN, DET, ..."
1,i cannot find year stats anyhow i needed the e...,disgust,"[PRON, AUX, PART, VERB, NOUN, NOUN, ADV, PRON,..."
2,hated her and that was honestly a more offensi...,anger,"[VERB, PRON, CCONJ, PRON, AUX, ADV, DET, ADV, ..."
3,pikachu shocked i hope we give him a fair chan...,happiness,"[PROPN, VERB, PRON, VERB, PRON, VERB, PRON, DE..."
4,worst bit is that it keeps snowballing and gro...,anger,"[ADJ, NOUN, AUX, SCONJ, PRON, VERB, VERB, CCON..."
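
The comment in the cell above mentions encoding the tag lists for downstream models. One simple option, sketched here under stated assumptions, is to turn each variable-length tag list into a fixed-length vector of relative tag frequencies. The helper `pos_count_vector` and the choice of tags in `TAG_ORDER` are hypothetical and not part of the notebook; note that this representation discards tag order, so it complements the list column rather than replacing it.

```python
from collections import Counter

# Fixed tag order; a subset of the Universal POS tags that spaCy emits.
# Which tags to keep is an assumption made for this sketch.
TAG_ORDER = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "AUX", "DET"]

def pos_count_vector(tags, tag_order=TAG_ORDER):
    """Return the relative frequency of each tag in tag_order for one sentence."""
    counts = Counter(tags)
    total = len(tags)
    if total == 0:
        return [0.0] * len(tag_order)
    return [counts[tag] / total for tag in tag_order]

example = ["PRON", "VERB", "ADV", "DET", "NOUN", "NOUN"]
print(pos_count_vector(example))
```

With a DataFrame like the one above, this could be applied as `df['POS_Tags'].apply(pos_count_vector)`.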


In [5]:
# Text cleaning before TF-IDF
#  - Convert text to lowercase
#  - Remove digits
#  - Remove links and URLs
#  - Collapse whitespace and newlines
#  - Remove mentions and '#' symbols
#  - Remove remaining non-alphabetic characters (including punctuation)
#  - Tokenize the text
#  - Remove stopwords


def clean_text(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs (http\S+ also matches https links)
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace (including newlines)
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # Remove mentions
    text = re.sub(r'#', '', text)  # Strip the '#' symbol; the hashtag word itself is kept
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters, punctuation included
    text = re.sub(r'[\r\n]+', ' ', text)  # No-op safeguard: newlines were already collapsed above
    text = text.translate(str.maketrans('', '', string.punctuation))  # No-op safeguard: punctuation was already removed above
    tokens = word_tokenize(text)  # Tokenize the text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)

# Apply the cleaning function to the sentences
df['Clean_Sentence'] = df['text'].apply(clean_text)

# Now apply TF-IDF on the cleaned text:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Clean_Sentence'])
df['TF_IDF'] = list(tfidf_matrix.toarray())

df.head()

Unnamed: 0,text,main_category,POS_Tags,Clean_Sentence,TF_IDF
0,there is absolutely no question personality ef...,happiness,"[PRON, VERB, ADV, DET, NOUN, NOUN, NOUN, DET, ...",absolutely question personality effects outcom...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,i cannot find year stats anyhow i needed the e...,disgust,"[PRON, AUX, PART, VERB, NOUN, NOUN, ADV, PRON,...",find year stats anyhow needed excel files list...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,hated her and that was honestly a more offensi...,anger,"[VERB, PRON, CCONJ, PRON, AUX, ADV, DET, ADV, ...",hated honestly offensive reaction everyone lau...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,pikachu shocked i hope we give him a fair chan...,happiness,"[PROPN, VERB, PRON, VERB, PRON, VERB, PRON, DE...",pikachu shocked hope give fair chance succeeds,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,worst bit is that it keeps snowballing and gro...,anger,"[ADJ, NOUN, AUX, SCONJ, PRON, VERB, VERB, CCON...",worst bit keeps snowballing growing suddenly m...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
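
To make the numbers in the `TF_IDF` column less opaque, here is a small pure-Python sketch of the weighting. It is intended to mirror the default scheme documented for scikit-learn's `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row), but not its tokenisation: the helper `tfidf_sketch` is hypothetical and simply splits on whitespace.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Compute smoothed, L2-normalised TF-IDF rows for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows

vocab, rows = tfidf_sketch(["hated offensive reaction", "hope fair chance", "hated fair"])
for term, weight in zip(vocab, rows[2]):
    if weight:
        print(term, round(weight, 3))
```

Each row has one entry per vocabulary term, which is why the previews above are almost entirely `0.0`: a sentence only has non-zero weights for the few terms it contains. Terms that appear in fewer documents receive a larger weight than terms shared across many documents.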


In [6]:
# Sentiment Analysis
def sentiment_score(sentence):
    # Polarity ranges from -1.0 (negative) to 1.0 (positive); 0.0 is neutral
    return TextBlob(sentence).sentiment.polarity

# Score the raw text and preview the result
df['Sentiment_Score'] = df['text'].apply(sentiment_score)
df.head()

Unnamed: 0,text,main_category,POS_Tags,Clean_Sentence,TF_IDF,Sentiment_Score
0,there is absolutely no question personality ef...,happiness,"[PRON, VERB, ADV, DET, NOUN, NOUN, NOUN, DET, ...",absolutely question personality effects outcom...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.05
1,i cannot find year stats anyhow i needed the e...,disgust,"[PRON, AUX, PART, VERB, NOUN, NOUN, ADV, PRON,...",find year stats anyhow needed excel files list...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0
2,hated her and that was honestly a more offensi...,anger,"[VERB, PRON, CCONJ, PRON, AUX, ADV, DET, ADV, ...",hated honestly offensive reaction everyone lau...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.2
3,pikachu shocked i hope we give him a fair chan...,happiness,"[PROPN, VERB, PRON, VERB, PRON, VERB, PRON, DE...",pikachu shocked hope give fair chance succeeds,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.233333
4,worst bit is that it keeps snowballing and gro...,anger,"[ADJ, NOUN, AUX, SCONJ, PRON, VERB, VERB, CCON...",worst bit keeps snowballing growing suddenly m...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.26875
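
Polarity is a continuous score in the range -1.0 to 1.0. If a coarse label is more convenient, one possible mapping is sketched below. The function `polarity_to_label` and its `threshold` of 0.1 are arbitrary assumptions for illustration, not something the notebook defines.

```python
def polarity_to_label(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity in [-1, 1] to a coarse label; threshold is an arbitrary choice."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Scores taken from the output above.
for score in (-0.05, 0.0, -0.2, 0.233333, -0.26875):
    print(score, polarity_to_label(score))
```

In the preview above, row 0 is labelled `happiness` yet scores -0.05, so polarity is best treated as one feature among several rather than a proxy for the emotion category.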


In [7]:
# Load pretrained embeddings 
word_vectors = api.load("word2vec-google-news-300")  

def get_word_embedding(sentence, word_vectors):
    words = word_tokenize(sentence.lower())
    vectors = [word_vectors[word] for word in words if word in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word_vectors.vector_size)

# Embed the raw text and preview the result
df['Pretrained_Embeddings'] = df['text'].apply(lambda x: get_word_embedding(x, word_vectors))
df.head()


Unnamed: 0,text,main_category,POS_Tags,Clean_Sentence,TF_IDF,Sentiment_Score,Pretrained_Embeddings
0,there is absolutely no question personality ef...,happiness,"[PRON, VERB, ADV, DET, NOUN, NOUN, NOUN, DET, ...",absolutely question personality effects outcom...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.05,"[0.06916882, -0.004824684, 0.030941918, 0.0643..."
1,i cannot find year stats anyhow i needed the e...,disgust,"[PRON, AUX, PART, VERB, NOUN, NOUN, ADV, PRON,...",find year stats anyhow needed excel files list...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[0.024938246, 0.053811464, 0.06385085, 0.08510..."
2,hated her and that was honestly a more offensi...,anger,"[VERB, PRON, CCONJ, PRON, AUX, ADV, DET, ADV, ...",hated honestly offensive reaction everyone lau...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.2,"[0.061091498, -0.051133376, 0.0779149, 0.09205..."
3,pikachu shocked i hope we give him a fair chan...,happiness,"[PROPN, VERB, PRON, VERB, PRON, VERB, PRON, DE...",pikachu shocked hope give fair chance succeeds,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.233333,"[0.057531737, 0.0421875, 0.100134276, 0.069775..."
4,worst bit is that it keeps snowballing and gro...,anger,"[ADJ, NOUN, AUX, SCONJ, PRON, VERB, VERB, CCON...",worst bit keeps snowballing growing suddenly m...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.26875,"[0.10927473, 0.010367076, 0.029471261, 0.10525..."
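
The sentence vector above is the mean of the vectors of all in-vocabulary words, with a zero vector as the fallback. The sketch below reproduces that logic on a toy lookup table and additionally reports coverage, i.e. the share of tokens that were found. The helper `mean_pool` and the two-dimensional `toy_vectors` are made up for illustration; only the averaging and fallback behaviour follow the notebook.

```python
def mean_pool(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are found.

    Also returns the fraction of tokens that had a vector (coverage).
    """
    found = [vectors[t] for t in tokens if t in vectors]
    coverage = len(found) / len(tokens) if tokens else 0.0
    if not found:
        return [0.0] * dim, coverage
    pooled = [sum(vec[i] for vec in found) / len(found) for i in range(dim)]
    return pooled, coverage

# Toy 2-dimensional "embeddings"; the values are invented for this example.
toy_vectors = {"fair": [1.0, 0.0], "chance": [0.0, 1.0]}

print(mean_pool(["fair", "chance", "pikachu"], toy_vectors, dim=2))
print(mean_pool(["pikachu"], toy_vectors, dim=2))
```

Tracking coverage is useful because sentences with low coverage, or the all-zero fallback, carry little information; lowercasing the text before lookup may also cause some case-sensitive entries in a pretrained vocabulary to be missed.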


In [None]:
# Custom embedding using Word2Vec

# Tokenize each cleaned sentence
df['tokens'] = df['Clean_Sentence'].apply(word_tokenize)

# Create a corpus: a list of token lists (one per sentence)
corpus = df['tokens'].tolist()

# Train the custom Word2Vec model on the corpus
# vector_size is the dimension of the word vectors
# window is the maximum distance between the current and predicted word within a sentence
# min_count is the minimum frequency a word needs to be kept in the vocabulary (1 keeps every word)
# workers is the number of worker threads used for training

custom_model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4) 

# Function to get the average word embedding for a given sentence
def get_custom_embedding(sentence, model):
    tokens = word_tokenize(sentence)
    # Get vectors for each token in the sentence if available in the model vocabulary
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    # Return the average of all token vectors, or a zero vector if none are found
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Apply the custom embedding function to each cleaned sentence
df['Custom_Embeddings'] = df['Clean_Sentence'].apply(lambda x: get_custom_embedding(x, custom_model))
df.head()


Unnamed: 0,text,main_category,POS_Tags,Clean_Sentence,TF_IDF,Sentiment_Score,Pretrained_Embeddings,tokens,Custom_Embeddings
0,there is absolutely no question personality ef...,happiness,"[PRON, VERB, ADV, DET, NOUN, NOUN, NOUN, DET, ...",absolutely question personality effects outcom...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.05,"[0.06916882, -0.004824684, 0.030941918, 0.0643...","[absolutely, question, personality, effects, o...","[-0.0019956154, 0.012318241, 0.0036612581, 0.0..."
1,i cannot find year stats anyhow i needed the e...,disgust,"[PRON, AUX, PART, VERB, NOUN, NOUN, ADV, PRON,...",find year stats anyhow needed excel files list...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[0.024938246, 0.053811464, 0.06385085, 0.08510...","[find, year, stats, anyhow, needed, excel, fil...","[-0.0045811804, 0.013830799, 0.0021979574, -0...."
2,hated her and that was honestly a more offensi...,anger,"[VERB, PRON, CCONJ, PRON, AUX, ADV, DET, ADV, ...",hated honestly offensive reaction everyone lau...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.2,"[0.061091498, -0.051133376, 0.0779149, 0.09205...","[hated, honestly, offensive, reaction, everyon...","[-0.008252272, 0.017440444, 0.0027486824, -0.0..."
3,pikachu shocked i hope we give him a fair chan...,happiness,"[PROPN, VERB, PRON, VERB, PRON, VERB, PRON, DE...",pikachu shocked hope give fair chance succeeds,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.233333,"[0.057531737, 0.0421875, 0.100134276, 0.069775...","[pikachu, shocked, hope, give, fair, chance, s...","[-0.0032568697, 0.012707858, 0.0024554152, -0...."
4,worst bit is that it keeps snowballing and gro...,anger,"[ADJ, NOUN, AUX, SCONJ, PRON, VERB, VERB, CCON...",worst bit keeps snowballing growing suddenly m...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.26875,"[0.10927473, 0.010367076, 0.029471261, 0.10525...","[worst, bit, keeps, snowballing, growing, sudd...","[-0.007715346, 0.02062087, 0.0048620435, -0.00..."
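
To illustrate what the `window` parameter means, the sketch below lists the (center, context) word pairs that fall within a fixed window around each token. The helper `context_pairs` is hypothetical and deliberately simplified: the library's actual training procedure involves further details (for example, it may use smaller effective windows and subsample frequent words), so this only shows which neighbours a given window size can reach.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["pikachu", "shocked", "hope", "give", "fair"]
for pair in context_pairs(tokens, window=1):
    print(pair)
```

Because the cleaned sentences are short once stopwords are removed, a window of 5 covers most or all of a typical sentence in this dataset.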


In [9]:
# Sentence Length
# Counts the tokens in the cleaned sentence, i.e. after digits,
# punctuation, and stopwords have been removed, so the value reflects
# the number of content words rather than raw characters or words.

def get_sentence_length(sentence):
    tokens = word_tokenize(sentence)
    return len(tokens)

df['Sentence_Length'] = df['Clean_Sentence'].apply(get_sentence_length)
df.head(2)

Unnamed: 0,text,main_category,POS_Tags,Clean_Sentence,TF_IDF,Sentiment_Score,Pretrained_Embeddings,tokens,Custom_Embeddings,Sentence_Length
0,there is absolutely no question personality ef...,happiness,"[PRON, VERB, ADV, DET, NOUN, NOUN, NOUN, DET, ...",absolutely question personality effects outcom...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.05,"[0.06916882, -0.004824684, 0.030941918, 0.0643...","[absolutely, question, personality, effects, o...","[-0.0019956154, 0.012318241, 0.0036612581, 0.0...",12
1,i cannot find year stats anyhow i needed the e...,disgust,"[PRON, AUX, PART, VERB, NOUN, NOUN, ADV, PRON,...",find year stats anyhow needed excel files list...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[0.024938246, 0.053811464, 0.06385085, 0.08510...","[find, year, stats, anyhow, needed, excel, fil...","[-0.0045811804, 0.013830799, 0.0021979574, -0....",10


In [None]:
# Save the DataFrame, including the new feature columns, to an Excel file
df.to_excel('/task_4/ver_2_FINAL_DATASET_revised.xlsx', index=False)