### Imports

In [15]:
import spacy
import numpy as np
import pandas as pd
from collections import Counter
from datasets import load_dataset
from gensim.models import Word2Vec
from googletrans import Translator
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

#### Change your CSV and Spacy language if needed here

- Loading the training and test datasets from CSV files.
- Importing a custom Dutch simplification corpus from Hugging Face (`wiki_simplifications_dutch_dedup_split`).
- Initializing a SpaCy NLP pipeline for Dutch using the `nl_core_news_sm` model. To use the English model instead, load it with `nlp = spacy.load("en_core_web_sm")`.
- Displaying the first few rows of the test dataset to inspect its structure.

In [16]:
# Data used for training
training_data = pd.read_csv("data/train_dataset.csv", encoding="utf-8")
data = pd.read_csv("data/test_dataset.csv", encoding="utf-8")

custom_corpus = load_dataset("BramVanroy/wiki_simplifications_dutch_dedup_split")
custom_corpus = custom_corpus["train"].to_pandas()

nlp = spacy.load("nl_core_news_sm")

print(data.head())

                                            Sentence    Emotion
0               van jullie het eiland weer verlaten.    neutral
1  Maar zie het als een compliment, want eigenlij...  happiness
2  zien als de grootste bedreiging voor hun relatie.       fear
3                    OkÃ©, hier zijn ze, de koppels!  happiness
4  De koppels zien elkaar een laatste keer terug,...    sadness


Drop the first row of the test set (index label 0). Both columns, `Sentence` and `Emotion`, are kept.

In [17]:
df = data.drop(index=0)

### Extracting POS Tag Proportions with SpaCy

This script calculates the proportion of each Universal POS tag in a text using SpaCy. The steps include:

- Defining a list of Universal POS tags as used by SpaCy.
- Implementing the `pos_tag_proportions` function, which:
  - Processes a text with the active SpaCy NLP pipeline.
  - Counts the occurrence of each POS tag (excluding spaces).
  - Calculates the relative frequency (proportion) of each POS tag in the text.
- Applying this function to the `"Sentence"` column of a DataFrame `df`.
- Expanding the results into separate columns and merging them back into the original DataFrame.
- Displaying the updated DataFrame with added POS proportion features.
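
The proportion step on its own can be illustrated without loading a SpaCy model. The sketch below assumes the tags are already available as a plain list; `tag_proportions` and the shortened `UPOS_TAGS` list are hypothetical names used only for this illustration. It counts all tags in a single pass with `Counter` instead of calling `list.count` once per tag.

```python
from collections import Counter

# Hypothetical, shortened tag list for illustration only
UPOS_TAGS = ["NOUN", "VERB", "DET", "PUNCT"]

def tag_proportions(tags, all_tags=UPOS_TAGS):
    """Return {"POS_<tag>": share} for an already-tagged sentence."""
    counts = Counter(tags)  # one pass over the tag list
    total = len(tags)
    return {
        f"POS_{tag}": (counts[tag] / total if total else 0.0)
        for tag in all_tags
    }

# Hand-written tag sequence, not real SpaCy output
example = tag_proportions(["DET", "NOUN", "VERB", "NOUN", "PUNCT"])
print(example)
```

An empty tag list yields 0.0 for every tag, which mirrors the `total > 0` guard in the notebook's function.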

In [18]:
# List of all Universal POS tags from spaCy
all_pos_tags = [
    "CCONJ", "VERB", "PRON", "SCONJ", "DET", "NOUN", "PUNCT", 
    "ADJ", "ADV", "ADP", "PROPN", "AUX", "INTJ", "NUM", "SPACE", "SYM", "X"
]

def pos_tag_proportions(text):
    doc = nlp(str(text))
    # Extract POS tags, skipping whitespace tokens
    tags = [token.pos_ for token in doc if not token.is_space]
    total = len(tags)
    counts = {}
    
    # Count occurrences for each POS tag
    for tag in all_pos_tags:
        counts[f"POS_{tag}"] = tags.count(tag) / total if total > 0 else 0
    
    return counts

# Apply the function and convert list of dicts to a DataFrame
pos_props_df = df["Sentence"].apply(pos_tag_proportions).apply(pd.Series)

# Join these columns back to original DataFrame
df = pd.concat([df, pos_props_df], axis=1)

# Check result
print(df.head())

                                            Sentence    Emotion  POS_CCONJ  \
1  Maar zie het als een compliment, want eigenlij...  happiness   0.117647   
2  zien als de grootste bedreiging voor hun relatie.       fear   0.000000   
3                    OkÃ©, hier zijn ze, de koppels!  happiness   0.000000   
4  De koppels zien elkaar een laatste keer terug,...    sadness   0.066667   
5  Dat is super zenuwachtig, want je weet niet ho...       fear   0.071429   

   POS_VERB  POS_PRON  POS_SCONJ   POS_DET  POS_NOUN  POS_PUNCT   POS_ADJ  \
1  0.176471  0.176471   0.117647  0.117647  0.117647   0.058824  0.058824   
2  0.111111  0.111111   0.111111  0.111111  0.222222   0.111111  0.111111   
3  0.000000  0.100000   0.000000  0.100000  0.100000   0.300000  0.000000   
4  0.133333  0.066667   0.000000  0.200000  0.200000   0.133333  0.066667   
5  0.071429  0.214286   0.000000  0.000000  0.071429   0.142857  0.000000   

    POS_ADV   POS_ADP  POS_PROPN   POS_AUX  POS_INTJ  POS_NUM  POS_S

### Extracting POS Tags from Text with SpaCy

This code extracts Universal POS tags from each sentence in a DataFrame using SpaCy. The process includes:

- Defining the `extract_pos_tags` function, which:
  - Processes a text string with the active SpaCy NLP pipeline.
  - Returns a list of POS tags for each token, excluding spaces.
- Applying the function to the `"Sentence"` column of the DataFrame `df` to create a new column `"POS_tags"`, which contains the list of POS tags for each sentence.
- Optionally printing the `"Sentence"` and corresponding `"POS_tags"` columns to inspect the output.


In [19]:
def extract_pos_tags(text):
    doc = nlp(str(text))
    return [token.pos_ for token in doc if not token.is_space]

# Apply to dataframe
df["POS_tags"] = df["Sentence"].astype(str).apply(extract_pos_tags)

# Optional: check the result
print(df[["Sentence", "POS_tags"]].head())

                                            Sentence  \
1  Maar zie het als een compliment, want eigenlij...   
2  zien als de grootste bedreiging voor hun relatie.   
3                    OkÃ©, hier zijn ze, de koppels!   
4  De koppels zien elkaar een laatste keer terug,...   
5  Dat is super zenuwachtig, want je weet niet ho...   

                                            POS_tags  
1  [CCONJ, VERB, PRON, SCONJ, DET, NOUN, PUNCT, C...  
2  [VERB, SCONJ, DET, ADJ, NOUN, ADP, PRON, NOUN,...  
3  [PROPN, PROPN, PUNCT, ADV, AUX, PRON, PUNCT, D...  
4  [DET, NOUN, VERB, PRON, DET, ADJ, NOUN, ADV, P...  
5  [PRON, AUX, NOUN, ADV, PUNCT, CCONJ, PRON, VER...  


### Generating TF-IDF Features

Creates TF-IDF features from text data used for training. The steps include:

- Initializing a `TfidfVectorizer` from scikit-learn to convert text into numerical feature vectors based on word importance.
- Fitting the vectorizer to the `"Sentence"` column and transforming the sentences into a TF-IDF matrix.
- Converting the TF-IDF matrix to a dense array and:
  - Temporarily storing the full TF-IDF vectors as a list in the `"TF-IDF"` column.
  - Calculating the **mean TF-IDF score** per sentence (averaged over the whole vocabulary) and overwriting the `"TF-IDF"` column with that value, so only the mean is kept.
- Displaying the first few rows of the updated DataFrame and checking the data types.
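
To see why a per-sentence mean over the whole vocabulary tends to be very small, the weighting can be reproduced by hand on a toy corpus. This is a minimal sketch, assuming scikit-learn's default weighting (smoothed idf, L2-normalized rows) and replacing its tokenizer with a plain whitespace split; `tfidf_rows` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalized rows (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["de koppels zien elkaar", "de koppels zijn hier", "bepalen het tempo"]
vocab, rows = tfidf_rows(docs)

# Mean over the whole vocabulary, analogous to np.mean(..., axis=1)
mean_scores = [sum(row) / len(vocab) for row in rows]
print(len(vocab), [round(m, 4) for m in mean_scores])
```

Each row has unit L2 norm, but most of its entries are zero, so dividing the row sum by the vocabulary size gives a small number that shrinks further as the vocabulary grows. Mean scores are therefore only comparable within one fitted vectorizer.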

In [20]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the sentences
tfidf_matrix = vectorizer.fit_transform(df["Sentence"].astype(str))


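# Dense copy of the TF-IDF matrix (one row per sentence)
tfidf_dense = tfidf_matrix.toarray()

# The second assignment overwrites the first, so only the mean score is kept
df["TF-IDF"] = list(tfidf_dense)
df["TF-IDF"] = np.mean(tfidf_dense, axis=1)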

# Show first few rows
print(df.head())
print(df.dtypes)

                                            Sentence    Emotion  POS_CCONJ  \
1  Maar zie het als een compliment, want eigenlij...  happiness   0.117647   
2  zien als de grootste bedreiging voor hun relatie.       fear   0.000000   
3                    OkÃ©, hier zijn ze, de koppels!  happiness   0.000000   
4  De koppels zien elkaar een laatste keer terug,...    sadness   0.066667   
5  Dat is super zenuwachtig, want je weet niet ho...       fear   0.071429   

   POS_VERB  POS_PRON  POS_SCONJ   POS_DET  POS_NOUN  POS_PUNCT   POS_ADJ  \
1  0.176471  0.176471   0.117647  0.117647  0.117647   0.058824  0.058824   
2  0.111111  0.111111   0.111111  0.111111  0.222222   0.111111  0.111111   
3  0.000000  0.100000   0.000000  0.100000  0.100000   0.300000  0.000000   
4  0.133333  0.066667   0.000000  0.200000  0.200000   0.133333  0.066667   
5  0.071429  0.214286   0.000000  0.000000  0.071429   0.142857  0.000000   

   ...   POS_ADP  POS_PROPN   POS_AUX  POS_INTJ  POS_NUM  POS_SPACE 

### Pretrained Word Embedding Generation

Prepares text data by tokenizing, lemmatizing, and generating sentence-level embedding features using a pretrained Word2Vec model:

#### Loading Pretrained Word Embeddings
- A pretrained Word2Vec model (`GoogleNews-vectors-negative300.bin`) is loaded using Gensim's `KeyedVectors`.
- The dimensionality of the word embeddings is read from `pretrained_model.vector_size` and stored in `embedding_dim`.

#### Computing Sentence-Level Average Embeddings
- The `get_average_embedding` function:
  - Iterates over each token in a token list.
  - Checks if the token exists in the pretrained model's vocabulary.
  - Collects valid word vectors and computes their mean to obtain a single embedding per sentence.
  - If no valid tokens are found, returns a zero vector of appropriate dimension.
- The sentence-level average embeddings are computed for all token lists and stored in the `"Pretrained_Embeddings"` column of `df`.
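
Because tokens missing from the model are silently skipped, it can help to measure how many tokens actually receive a vector. This matters here because the GoogleNews vectors were trained on English news text while the sentences are Dutch. The sketch below is one possible check; `vocabulary_coverage` and the toy data are hypothetical and stand in for the notebook's `tokens` and the model's vocabulary.

```python
def vocabulary_coverage(token_lists, vocabulary):
    """Share of all tokens that have an entry in `vocabulary`."""
    total = sum(len(tokens) for tokens in token_lists)
    if total == 0:
        return 0.0
    known = sum(
        1 for tokens in token_lists for tok in tokens if tok in vocabulary
    )
    return known / total

# Toy vocabulary standing in for the keys of a pretrained model
toy_vocab = {"de", "super", "tempo"}
toy_tokens = [["de", "koppel", "zien"], ["super", "zenuwachtig"], []]

print(vocabulary_coverage(toy_tokens, toy_vocab))  # 0.4
```

With the notebook's objects, the equivalent call would be `vocabulary_coverage(tokens, pretrained_model.key_to_index)`. A low value means many sentence embeddings rest on only a few tokens, or fall back to the zero vector.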


In [21]:
df['Sentence'] = df["Sentence"].astype(str)

def spacy_tokenizer(text):
    doc = nlp(text) 
    tokens = [token.lemma_.lower() for token in doc]
    return tokens

tokens = df['Sentence'].apply(spacy_tokenizer)
print(tokens.head())


1    [maar, zien, het, als, een, compliment, ,, wan...
2    [zien, als, de, groot, bedreiging, voor, hun, ...
3        [okã, ©, ,, hier, zijn, ze, ,, de, koppel, !]
4    [de, koppel, zien, elkaar, een, laat, keer, te...
5    [dat, zijn, super, zenuwachtig, ,, want, je, w...
Name: Sentence, dtype: object


In [22]:
pretrained_path = 'data/GoogleNews-vectors-negative300.bin'
pretrained_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
embedding_dim = pretrained_model.vector_size

In [23]:
def get_average_embedding(token_list, model, embedding_dim):
    valid_vectors = []
    for token in token_list:
        if token in model.key_to_index:
            valid_vectors.append(model[token])
    if not valid_vectors:
        return np.zeros(embedding_dim)
    return np.mean(valid_vectors, axis=0)



avg_embeddings = []
for token_list in tokens:
    vec = get_average_embedding(token_list, pretrained_model, embedding_dim)
    avg_embeddings.append(vec)

df['Pretrained_Embeddings'] = avg_embeddings

In [24]:
display(df.head())

Unnamed: 0,Sentence,Emotion,POS_CCONJ,POS_VERB,POS_PRON,POS_SCONJ,POS_DET,POS_NOUN,POS_PUNCT,POS_ADJ,...,POS_PROPN,POS_AUX,POS_INTJ,POS_NUM,POS_SPACE,POS_SYM,POS_X,POS_tags,TF-IDF,Pretrained_Embeddings
1,"Maar zie het als een compliment, want eigenlij...",happiness,0.117647,0.176471,0.176471,0.117647,0.117647,0.117647,0.058824,0.058824,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[CCONJ, VERB, PRON, SCONJ, DET, NOUN, PUNCT, C...",0.003594,"[0.012676239, 0.17316228, 0.0743214, 0.0382080..."
2,zien als de grootste bedreiging voor hun relatie.,fear,0.0,0.111111,0.111111,0.111111,0.111111,0.222222,0.111111,0.111111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[VERB, SCONJ, DET, ADJ, NOUN, ADP, PRON, NOUN,...",0.0026,"[-0.029785156, 0.12813313, 0.04682414, 0.03367..."
3,"OkÃ©, hier zijn ze, de koppels!",happiness,0.0,0.0,0.1,0.0,0.1,0.1,0.3,0.0,...,0.2,0.1,0.0,0.0,0.0,0.0,0.0,"[PROPN, PROPN, PUNCT, ADV, AUX, PRON, PUNCT, D...",0.002229,"[0.020166015, 0.1265625, 0.095947266, -0.08520..."
4,"De koppels zien elkaar een laatste keer terug,...",sadness,0.066667,0.133333,0.066667,0.0,0.2,0.2,0.133333,0.066667,...,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,"[DET, NOUN, VERB, PRON, DET, ADJ, NOUN, ADV, P...",0.003266,"[0.016427612, 0.07313013, 0.015861511, -0.0024..."
5,"Dat is super zenuwachtig, want je weet niet ho...",fear,0.071429,0.071429,0.214286,0.0,0.0,0.071429,0.142857,0.0,...,0.0,0.214286,0.0,0.0,0.0,0.0,0.0,"[PRON, AUX, NOUN, ADV, PUNCT, CCONJ, PRON, VER...",0.002981,"[0.10219505, 0.19249378, 0.0288835, 0.03053977..."


### Training Custom Word2Vec Embeddings and Generating Sentence-Level Features

This code trains a custom Word2Vec embedding model on a subset of the `custom_corpus` dataset and computes sentence-level average embeddings for the main DataFrame `df`. The process includes:

#### Preprocessing Custom Corpus
- The `'prompt'` column of `custom_corpus` is converted to string type.
- Only the first 10,000 rows are selected for efficiency.
- The `spacy_tokenizer` function (lemmatization and lowercasing) is applied to tokenize the prompts, generating token lists for training.

#### Training a Custom Word2Vec Model
- A Word2Vec model is trained on the tokenized custom corpus using the Gensim library with parameters:
  - Embedding dimension: 100
  - Context window size: 5
  - Minimum token count threshold: 10 (to ignore infrequent words)
  - Parallel workers: 40 (for faster training)
  - Number of epochs: 5
- The resulting word vectors are stored in `custom_word_vectors`.

#### Computing Average Custom Embeddings
- The function `get_average_custom_embedding` calculates the mean embedding vector for a list of tokens using the custom-trained embeddings.
- For each sentence in the main DataFrame (`df`), the average embedding is computed based on its tokens and stored in the `"Custom_Embeddings"` column.
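
The effect of the minimum-count threshold can be previewed before training. The sketch below only mimics the frequency filter (tokens seen fewer than `min_count` times are dropped from the vocabulary) and does not train anything; `surviving_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter

def surviving_vocabulary(token_lists, min_count):
    """Tokens that occur at least `min_count` times across all token lists."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}

toy_corpus = [["de", "kat", "slapen"], ["de", "hond", "slapen"], ["de", "kat"]]
kept = surviving_vocabulary(toy_corpus, min_count=2)

print(sorted(kept))  # ['de', 'kat', 'slapen']
```

Tokens that fall below the threshold get no vector, so they are skipped by the `key_to_index` check when sentence averages are computed.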


In [None]:
# Ensure 'prompt' is string type
custom_corpus['prompt'] = custom_corpus['prompt'].astype(str)

# Limit to the first 10,000 rows for efficiency
custom_corpus_small = custom_corpus.head(10_000)

# Define tokenizer with lemmatization and lowercasing
def spacy_tokenizer(text):
    doc = nlp(text)
    tokens = [token.lemma_.lower() for token in doc]
    return tokens

# Apply tokenizer to the reduced dataset
custom_corpus_small_tokens = custom_corpus_small['prompt'].apply(spacy_tokenizer)

# Show first few token lists
print(custom_corpus_small_tokens.head())


In [None]:
custom_embedding_dim = 100
custom_model = Word2Vec(
    sentences=custom_corpus_small_tokens, 
    vector_size=custom_embedding_dim,
    window=5,          
    min_count=10,       
    workers=40,         
    epochs=5           
)

custom_word_vectors = custom_model.wv
embedding_dim_custom = custom_model.vector_size

In [None]:
def get_average_custom_embedding(token_list, model, embedding_dim):
    valid_vectors = []
    for token in token_list:
        if token in model.key_to_index:
            valid_vectors.append(model[token])
    if not valid_vectors:
        return np.zeros(embedding_dim)
    return np.mean(valid_vectors, axis=0)



custom_avg_embeddings = []
for token_list in tokens:
    vec = get_average_custom_embedding(token_list, custom_word_vectors, embedding_dim_custom)
    custom_avg_embeddings.append(vec)

df['Custom_Embeddings'] = custom_avg_embeddings

In [None]:
display(df.head())

Unnamed: 0,Sentence,Emotion,POS_CCONJ,POS_VERB,POS_PRON,POS_SCONJ,POS_DET,POS_NOUN,POS_PUNCT,POS_ADJ,...,POS_AUX,POS_INTJ,POS_NUM,POS_SPACE,POS_SYM,POS_X,POS_tags,TF-IDF,Pretrained_Embeddings,Custom_Embeddings
1,"Bewust, niet bewust. ik vind het goed DAT het is.",surprise,0.0,0.076923,0.230769,0.076923,0.0,0.0,0.230769,0.230769,...,0.076923,0.0,0.0,0.0,0.0,0.0,"[ADJ, PUNCT, ADV, ADJ, PUNCT, PRON, VERB, PRON...",0.000674,"[0.059051514, 0.21899414, 0.003074646, -0.0087...","[-0.021206133, 0.14562613, -0.037825022, -0.05..."
2,"Er fysiek wel oordelen open, maar het is niet.",sadness,0.090909,0.090909,0.090909,0.0,0.0,0.0,0.181818,0.181818,...,0.090909,0.0,0.0,0.0,0.0,0.0,"[ADV, ADJ, ADV, VERB, ADJ, PUNCT, CCONJ, PRON,...",0.000829,"[0.04988752, 0.14551653, -0.0022495815, 0.0167...","[-0.01877182, 0.13091229, -0.03340487, -0.0473..."
3,"Er staat fontein oordelen open, maar het is niet.",sadness,0.090909,0.181818,0.090909,0.0,0.0,0.0,0.181818,0.181818,...,0.090909,0.0,0.0,0.0,0.0,0.0,"[ADV, VERB, ADJ, VERB, ADJ, PUNCT, CCONJ, PRON...",0.000845,"[0.0663147, 0.14313616, -0.0046561104, 0.01651...","[-0.018213319, 0.13521256, -0.03366551, -0.048..."
4,"Er staat bron voor open, maar het are niet.",sadness,0.090909,0.090909,0.0,0.0,0.090909,0.090909,0.181818,0.181818,...,0.0,0.0,0.0,0.0,0.0,0.0,"[ADV, VERB, NOUN, ADP, ADJ, PUNCT, CCONJ, DET,...",0.000854,"[0.009531657, 0.08696832, 0.008748372, 0.01874...","[-0.016320882, 0.12591653, -0.03068872, -0.045..."
5,kram doen.,disgust,0.0,0.333333,0.0,0.0,0.0,0.333333,0.333333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,"[NOUN, VERB, PUNCT]",0.000423,"[-0.04345703, -0.0020446777, 0.13183594, 0.025...","[-0.024421837, 0.16432925, -0.043763738, -0.06..."


In [None]:
print(df.columns)

Index(['Sentence', 'Emotion', 'POS_CCONJ', 'POS_VERB', 'POS_PRON', 'POS_SCONJ',
       'POS_DET', 'POS_NOUN', 'POS_PUNCT', 'POS_ADJ', 'POS_ADV', 'POS_ADP',
       'POS_PROPN', 'POS_AUX', 'POS_INTJ', 'POS_NUM', 'POS_SPACE', 'POS_SYM',
       'POS_X', 'POS_tags', 'TF-IDF', 'Pretrained_Embeddings',
       'Custom_Embeddings'],
      dtype='object')


### Pretrained Word Embeddings in Machine Learning

Implements a basic text classification pipeline that predicts emotions based on average word embeddings extracted from sentences:

#### Computing Average Word Embeddings
- The function `get_avg_embedding`:
  - Splits each sentence into tokens (words).
  - Retrieves the pretrained embedding vector for each token if available.
  - Computes the mean embedding vector for the sentence.
  - Returns a zero vector if no tokens are found in the embedding model.
- Average embeddings are computed for each sentence in the `data` DataFrame and stored in a new `"avg_embed"` column.

#### Preparing Features and Labels
- The feature matrix `X` is constructed by stacking all average embedding vectors.
- The target labels `y` are extracted from the `"Emotion"` column.

#### Training and Evaluating a Logistic Regression Model
- The dataset is split into training (80%) and testing (20%) subsets.
- A `LogisticRegression` classifier is trained on the training embeddings and labels.
- Predictions are made on the test set.
- The classification performance is evaluated and summarized with `classification_report`, showing precision, recall, F1-score, and support for each emotion class.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def get_avg_embedding(sentence, model, dim):
    tokens = sentence.split()
    vectors = [model[word] for word in tokens if word in model]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(dim)

# Compute embeddings on the test set (data)
data["avg_embed"] = data["Sentence"].astype(str).apply(
    lambda s: get_avg_embedding(s, pretrained_model, embedding_dim)
)

# Feature matrix
X = np.vstack(data["avg_embed"].values)

# Labels
y = data["Emotion"].astype(str)  # or replace with correct label column name

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

       anger       0.68      0.24      0.36        54
     disgust       0.47      0.23      0.31        69
        fear       0.45      0.51      0.48       195
   happiness       0.37      0.24      0.29       182
     neutral       0.35      0.56      0.43       231
     sadness       0.39      0.37      0.38       208
    surprise       0.49      0.46      0.47       240

    accuracy                           0.41      1179
   macro avg       0.46      0.37      0.39      1179
weighted avg       0.43      0.41      0.41      1179
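
To put the 0.41 accuracy in context, a majority-class baseline can be derived from the support column alone. This is a minimal sketch; the dictionary simply restates the support values from the report, and the variable names are hypothetical.

```python
# Support values copied from the classification report
support = {
    "anger": 54, "disgust": 69, "fear": 195, "happiness": 182,
    "neutral": 231, "sadness": 208, "surprise": 240,
}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the most frequent class
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 2))  # surprise 0.2
```

Always predicting `surprise` would score about 0.20 on this split, so the logistic regression roughly doubles that baseline.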



### Sentiment Analysis
Performs the following steps to analyze sentiment for non-English sentences:

#### Translation to English
- Uses the `googletrans` library's `Translator` with a lightweight service URL (`translate.googleapis.com`).
- Defines an asynchronous function `translate_bulk` to translate a list of sentences (`df['Sentence']`) to English in bulk.
- Collects all translated texts in `translations_list`.

#### Sentiment Analysis on Translated Text
- Initializes VADER's `SentimentIntensityAnalyzer`, a lexicon- and rule-based sentiment analysis tool tuned for English.
- Applies VADER’s `polarity_scores` method to each translated sentence, generating sentiment scores including:
  - Positive, negative, neutral, and compound scores.
- Extracts the `compound` score, which summarizes overall sentiment polarity, and stores it in the original DataFrame under the `"Sentiment_Score"` column.
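
If a categorical label is needed instead of the raw compound score, a simple threshold rule can be applied. The sketch below assumes the ±0.05 cut-offs commonly suggested in VADER's documentation; `label_from_compound` is a hypothetical helper and the threshold is a parameter that may need tuning for translated text.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

labels = [label_from_compound(c) for c in (0.62, 0.0, -0.3)]
print(labels)  # ['positive', 'neutral', 'negative']
```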

In [None]:
from googletrans import Translator
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

async def translate_bulk(sentences):
    # Translate all sentences to English in one bulk request
    translator = Translator(service_urls=['translate.googleapis.com'])
    translations = await translator.translate(sentences, dest='en')
    return [translation.text for translation in translations]

translations_list = await translate_bulk(df['Sentence'].to_list())

# Apply sentiment analysis on the translated sentences.
# Reuse df's index: df starts at label 1 (row 0 was dropped), so a default
# 0-based Series would be shifted by one row when assigned back to df.
sentiment = pd.Series(translations_list, index=df.index).apply(analyzer.polarity_scores)

# Extract the compound score
df['Sentiment_Score'] = sentiment.apply(lambda score_dict: score_dict['compound'])

df

Unnamed: 0,Sentence,Emotion,POS_CCONJ,POS_VERB,POS_PRON,POS_SCONJ,POS_DET,POS_NOUN,POS_PUNCT,POS_ADJ,...,POS_INTJ,POS_NUM,POS_SPACE,POS_SYM,POS_X,POS_tags,TF-IDF,Pretrained_Embeddings,Custom_Embeddings,Sentiment_Score
1,"Bewust, niet bewust. ik vind het goed DAT het is.",surprise,0.000000,0.076923,0.230769,0.076923,0.000000,0.000000,0.230769,0.230769,...,0.0,0.0,0.0,0.0,0.0,"[ADJ, PUNCT, ADV, ADJ, PUNCT, PRON, VERB, PRON...",0.000674,"[0.059051514, 0.21899414, 0.003074646, -0.0087...","[-0.021206133, 0.14562613, -0.037825022, -0.05...",0.0
2,"Er fysiek wel oordelen open, maar het is niet.",sadness,0.090909,0.090909,0.090909,0.000000,0.000000,0.000000,0.181818,0.181818,...,0.0,0.0,0.0,0.0,0.0,"[ADV, ADJ, ADV, VERB, ADJ, PUNCT, CCONJ, PRON,...",0.000829,"[0.04988752, 0.14551653, -0.0022495815, 0.0167...","[-0.01877182, 0.13091229, -0.03340487, -0.0473...",0.0
3,"Er staat fontein oordelen open, maar het is niet.",sadness,0.090909,0.181818,0.090909,0.000000,0.000000,0.000000,0.181818,0.181818,...,0.0,0.0,0.0,0.0,0.0,"[ADV, VERB, ADJ, VERB, ADJ, PUNCT, CCONJ, PRON...",0.000845,"[0.0663147, 0.14313616, -0.0046561104, 0.01651...","[-0.018213319, 0.13521256, -0.03366551, -0.048...",0.0
4,"Er staat bron voor open, maar het are niet.",sadness,0.090909,0.090909,0.000000,0.000000,0.090909,0.090909,0.181818,0.181818,...,0.0,0.0,0.0,0.0,0.0,"[ADV, VERB, NOUN, ADP, ADJ, PUNCT, CCONJ, DET,...",0.000854,"[0.009531657, 0.08696832, 0.008748372, 0.01874...","[-0.016320882, 0.12591653, -0.03068872, -0.045...",0.0
5,kram doen.,disgust,0.000000,0.333333,0.000000,0.000000,0.000000,0.333333,0.333333,0.000000,...,0.0,0.0,0.0,0.0,0.0,"[NOUN, VERB, PUNCT]",0.000423,"[-0.04345703, -0.0020446777, 0.13183594, 0.025...","[-0.024421837, 0.16432925, -0.043763738, -0.06...",0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5890,En klauteren er dan op.,neutral,0.166667,0.166667,0.000000,0.000000,0.000000,0.000000,0.166667,0.000000,...,0.0,0.0,0.0,0.0,0.0,"[CCONJ, VERB, ADV, ADV, ADP, PUNCT]",0.000605,"[-0.0087890625, 0.16479492, -0.041625977, 0.01...","[-0.012502238, 0.1236767, -0.031620137, -0.048...",0.0
5891,Als je er eenmaal op staat dan manoeuvreer je ...,neutral,0.000000,0.153846,0.153846,0.076923,0.076923,0.076923,0.076923,0.000000,...,0.0,0.0,0.0,0.0,0.0,"[SCONJ, PRON, ADV, ADV, ADP, VERB, ADV, VERB, ...",0.000905,"[0.07126465, 0.19904785, -0.03173828, 0.031010...","[-0.01296855, 0.11175762, -0.027165066, -0.043...",0.0
5892,Je teamgenoten die hier aan de touw staan...,neutral,0.000000,0.111111,0.222222,0.000000,0.111111,0.222222,0.111111,0.000000,...,0.0,0.0,0.0,0.0,0.0,"[PRON, NOUN, PRON, ADV, ADP, DET, NOUN, VERB, ...",0.000786,"[0.034505207, 0.1398112, 0.07434082, 0.0005086...","[-0.018213194, 0.09685853, -0.028583333, -0.03...",0.0
5893,bepalen het tempo.,neutral,0.000000,0.250000,0.000000,0.000000,0.250000,0.250000,0.250000,0.000000,...,0.0,0.0,0.0,0.0,0.0,"[VERB, DET, NOUN, PUNCT]",0.000476,"[0.21679688, 0.3828125, 0.024902344, -0.039306...","[-0.020713434, 0.16700783, -0.042052798, -0.05...",0.0


### Noun Chunks in Sentences Using SpaCy

Defines a function that counts the noun chunks in each sentence and applies it to a DataFrame column:

- The `count_noun_chunks` function:
  - Processes the input text with SpaCy’s NLP pipeline.
  - Extracts all noun chunks (contiguous noun phrases).
  - Returns the count of these noun chunks.

- The function is applied to the `"Sentence"` column of the DataFrame `df`, creating a new column `"NounChunkCount"` with the counts.


In [None]:
# Function to count noun chunks
def count_noun_chunks(text):
    doc = nlp(text)
    return len(list(doc.noun_chunks))

# Apply to your DataFrame
df['NounChunkCount'] = df['Sentence'].apply(count_noun_chunks)

In [None]:
df.to_csv('NLP_features_Test.csv', index=False)
print("DataFrame Exported to CSV:", df.head())

DataFrame Exported to CSV:                                             Sentence   Emotion  POS_CCONJ  \
1  Bewust, niet bewust. ik vind het goed DAT het is.  surprise   0.000000   
2     Er fysiek wel oordelen open, maar het is niet.   sadness   0.090909   
3  Er staat fontein oordelen open, maar het is niet.   sadness   0.090909   
4        Er staat bron voor open, maar het are niet.   sadness   0.090909   
5                                         kram doen.   disgust   0.000000   

   POS_VERB  POS_PRON  POS_SCONJ   POS_DET  POS_NOUN  POS_PUNCT   POS_ADJ  \
1  0.076923  0.230769   0.076923  0.000000  0.000000   0.230769  0.230769   
2  0.090909  0.090909   0.000000  0.000000  0.000000   0.181818  0.181818   
3  0.181818  0.090909   0.000000  0.000000  0.000000   0.181818  0.181818   
4  0.090909  0.000000   0.000000  0.090909  0.090909   0.181818  0.181818   
5  0.333333  0.000000   0.000000  0.000000  0.333333   0.333333  0.000000   

   ...  POS_NUM  POS_SPACE  POS_SYM  POS_X  \
1
