# Commonlit Homework Lecture 2 Andrii Shevtsov

In the previous home task, my scores for the LightGBM model were:

Validation = `0.76897`

Leaderboard:
![Old lightgbm leaderboard scores](https://i.imgur.com/QLtgeZV.png)

For the linear regression, the validation score was ≈ `0.8`.

## Imports and constants

In [2]:
!pip install lightgbm optuna



In [35]:
import re
import ipywidgets as widgets

import numpy as np
from scipy import sparse
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import gensim
import optuna
import string

from ydata_profiling import ProfileReport
from nltk.corpus import stopwords
from nltk import tokenize
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler
from lightgbm import LGBMRegressor
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from scipy import spatial
from tqdm import tqdm
from time import time

%matplotlib inline

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
SUMMARIES_TRAIN_FILE = '../../data/commonlit_evaluate_student_summaries/summaries_train.csv'
SUMMARIES_TEST_FILE = '../../data/commonlit_evaluate_student_summaries/summaries_test.csv'
PROMPTS_TRAIN_FILE = '../../data/commonlit_evaluate_student_summaries/prompts_train.csv'
PROMPTS_TEST_FILE = '../../data/commonlit_evaluate_student_summaries/prompts_test.csv'

## Datasets and competition description

The dataset comprises about 24,000 summaries written by students in grades 3-12 of passages on a variety of topics and genres. These summaries have been assigned scores for both content and wording. The goal of the competition is to predict content and wording scores for summaries on unseen topics.

### Goal of the Competition

The goal of this competition is to assess the quality of summaries written by students in grades 3-12. You'll build a model that evaluates how well a student represents the main idea and details of a source text, as well as the clarity, precision, and fluency of the language used in the summary. You'll have access to a collection of real student summaries to train your model.

Your work will assist teachers in evaluating the quality of student work and also help learning platforms provide immediate feedback to students.

### File and Field Information

- **summaries_train.csv** - Summaries in the training set.
    - `student_id` - The ID of the student writer.
    - `prompt_id` - The ID of the prompt which links to the prompt file.
    - `text` - The full text of the student's summary.
    - `content` - The content score for the summary. The first target.
    - `wording` - The wording score for the summary. The second target.
- **summaries_test.csv** - Summaries in the test set. Contains all fields above except content and wording.
- **prompts_train.csv** - The four training set prompts. Each prompt comprises the complete summarization assignment given to students.
    - `prompt_id` - The ID of the prompt which links to the summaries file.
    - `prompt_question` - The specific question the students are asked to respond to.
    - `prompt_title` - A short-hand title for the prompt.
    - `prompt_text` - The full prompt text.
- **prompts_test.csv** - The test set prompts. Contains the same fields as above. The prompts here are only an example. The full test set has a large number of prompts. **The train / public test / private test splits do not share any prompts.**
- **sample_submission.csv** - A submission file in the correct format. See the Evaluation page for details.

This is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. The full test set comprises about 17,000 summaries from a large number of prompts.

### Evaluation

Submissions are scored using MCRMSE, mean columnwise root mean squared error:

![MCRMSE](https://latex.codecogs.com/png.latex?\dpi{150}&space;\fn_phv&space;\text{MCRMSE}&space;=&space;\frac{1}{m}&space;\sum_{j=1}^{m}&space;\sqrt{\frac{1}{n}&space;\sum_{i=1}^{n}&space;(y_{ij}&space;-&space;\hat{y}_{ij})^2})
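
The metric is easy to compute by hand. The sketch below is an illustrative pure-Python version, not the competition's scoring code; the function name `mcrmse` and the toy values are assumptions made for this example.

```python
import math

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE for two row-major lists of equal shape."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmses = []
    for j in range(n_cols):
        # RMSE of one target column (e.g. content or wording)
        squared_errors = [(y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows)]
        column_rmses.append(math.sqrt(sum(squared_errors) / n_rows))
    # Average the per-column RMSE values
    return sum(column_rmses) / n_cols

# Toy data: three summaries, two targets (content, wording)
y_true = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
y_pred = [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
score = mcrmse(y_true, y_pred)
print(f"MCRMSE = {score:.4f}")
```

Because the RMSE is taken per column before averaging, a large error on one target cannot be hidden by a small error on the other.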

## Datasets import

In [53]:
summaries_train_df = pd.read_csv(SUMMARIES_TRAIN_FILE)
summaries_test_df = pd.read_csv(SUMMARIES_TEST_FILE)
prompts_train_df = pd.read_csv(PROMPTS_TRAIN_FILE)
prompts_test_df = pd.read_csv(PROMPTS_TEST_FILE)

In [54]:
summaries_train_df.head()

|   | student_id | prompt_id | text | content | wording |
|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 |
| 1 | 0020ae56ffbf | ebad26 | They would rub it up with soda to make the sme... | -0.548304 | 0.506755 |
| 2 | 004e978e639e | 3b9047 | In Egypt, there were many occupations and soci... | 3.128928 | 4.231226 |
| 3 | 005ab0199905 | 3b9047 | The highest class was Pharaohs these people we... | -0.210614 | -0.471415 |
| 4 | 0070c9e7af47 | 814d6b | The Third Wave developed  rapidly because the ... | 3.272894 | 3.219757 |


In [55]:
prompts_train_df.head()

|   | prompt_id | prompt_question | prompt_title | prompt_text |
|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... |
| 1 | 3b9047 | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... |
| 2 | 814d6b | Summarize how the Third Wave developed over su... | The Third Wave | Background \r\nThe Third Wave experiment took ... |
| 3 | ebad26 | Summarize the various ways the factory would u... | Excerpt from The Jungle | With one member trimming beef in a cannery, an... |


## EDA

EDA would have been here, but for this notebook I see no value in repeating or improving it from the previous notebook.

## Data preparation

Hard preprocessing (stop-word removal followed by stemming or lemmatization) and soft preprocessing will probably benefit different kinds of vectorization techniques, text-based features, and final models.

In [9]:
nltk.download('stopwords')

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'good', 'bad', 'people']) #stopwords extended a bit
def preprocess_hard_base(text, join_back=True):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Stop Words Cleaning
        if (
            token not in gensim.parsing.preprocessing.STOPWORDS and 
            token not in stop_words
        ):
            result.append(token)
    if join_back:
        result = " ".join(result)
    return result

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andrii\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
def preprocess_hard_stemming(text, join_back=True, stemmer = PorterStemmer()):
    tokens = preprocess_hard_base(text, join_back=False)
    
    result = [stemmer.stem(word) for word in tokens]
    if join_back:
        result = " ".join(result)
    
    return result

In [29]:
nltk.download('wordnet')
nltk.download('omw-1.4')

def preprocess_hard_lemmatizing(text, join_back=True, lemmatizer = WordNetLemmatizer()):
    tokens = preprocess_hard_base(text, join_back=False)
    
    result = [lemmatizer.lemmatize(word) for word in tokens]
    if join_back:
        result = " ".join(result)
    
    return result

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Andrii\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Andrii\AppData\Roaming\nltk_data...


Let's compare two NLTK stemmers, the Porter stemmer and the Snowball stemmer, along with the WordNet lemmatizer.

In [32]:
original_text = """For example:
• The company / organization / country / people names. "Steve has created Apple" and "Steve has created an apple" should definitely have different sentence embeddings and different scores in the classifier of news article topics. Also, in this case, it can be detrimental for NER tasks and even part of speech tagging (if the company is named something like "Beautiful").
• Sentiment analysis tasks. For example, "WHAT IS GOING ON???" should have a bigger value for the "fury" class score than the simple "What is going on?" (assuming your preprocessing also converts consecutive punctuation signs to a single one).
• Sequence to sequence text generation. People wouldn't like ChatGPT and other instruction LLMs so much if they were creating text in lowercase (and even more if it was stemmed or lemmatized). That's why it should train almost without preprocessing in common NLP sense.
"""

print("Porter    :", preprocess_hard_stemming(original_text))
print()
print("Snowball  :", preprocess_hard_stemming(original_text, stemmer=SnowballStemmer("english")))
print()
print("Lemmatizer:", preprocess_hard_lemmatizing(original_text))

Porter    : exampl compani organ countri name steve creat appl steve creat appl definit differ sentenc embed differ score classifi news articl topic case detriment ner task speech tag compani name like beauti sentiment analysi task exampl go bigger valu furi class score simpl go assum preprocess convert consecut punctuat sign singl sequenc sequenc text gener like chatgpt instruct llm creat text lowercas stem lemmat train preprocess common nlp sens

Snowball  : exampl compani organ countri name steve creat appl steve creat appl definit differ sentenc embed differ score classifi news articl topic case detriment ner task speech tag compani name like beauti sentiment analysi task exampl go bigger valu furi class score simpl go assum preprocess convert consecut punctuat sign singl sequenc sequenc text generat like chatgpt instruct llms creat text lowercas stem lemmat train preprocess common nlp sens

Lemmatizer: example company organization country name steve created apple steve created app

The only differences between stemmers are: 
- generation -> gener / generat; 
- LLMs -> llm / llms.

In most cases, the choice of stemmer should not matter much. Let's use PorterStemmer, as its tokens are sometimes shorter.

The lemmatizer's output is much more readable and accurate, but it is expected to require more computation.

In [56]:
start_time = time()
summaries_train_df['text_hard_preprocessed_stemmed'] = summaries_train_df['text'].apply(preprocess_hard_stemming)
print("Total processing time is:", time() - start_time, "secs")

Total processing time is: 4.924031734466553 secs


In [57]:
start_time = time()
summaries_train_df['text_hard_preprocessed_lemmatized'] = summaries_train_df['text'].apply(preprocess_hard_lemmatizing)
print("Total processing time is:", time() - start_time, "secs")

Total processing time is: 1.721034288406372 secs


The lemmatizer somehow worked faster than the stemmer. Interesting...

> **TODO**: refactor stemmer and lemmatizer preprocessing functions to a single one
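
One way to approach this refactoring is a single function that takes the token normalizer as a parameter. The sketch below is only an outline under stated assumptions: `simple_tokenize` is a stand-in for `gensim.utils.simple_preprocess`, and `toy_stem` is a hypothetical normalizer used to keep the example self-contained.

```python
import re

def simple_tokenize(text):
    # Stand-in tokenizer (assumption): lowercase alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

def preprocess_hard(text, normalize=None, stop_words=frozenset(), join_back=True):
    """Tokenize, drop stop words, then optionally normalize every token."""
    tokens = [token for token in simple_tokenize(text) if token not in stop_words]
    if normalize is not None:
        tokens = [normalize(token) for token in tokens]
    return " ".join(tokens) if join_back else tokens

def toy_stem(token):
    # Hypothetical normalizer for the demo: strips a trailing "s"
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

demo = preprocess_hard(
    "The stemmers and the lemmatizers",
    normalize=toy_stem,
    stop_words={"the", "and"},
)
print(demo)
```

In the notebook, `normalize` could be `PorterStemmer().stem` or `WordNetLemmatizer().lemmatize`, which would replace the two separate functions with one.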

In [45]:
def collapse_dots(text):
    # Collapse sequential dots
    collapsed = re.sub(r"\.+", ".", text)
    # Collapse dots separated by whitespaces, repeating until nothing changes.
    # Each pass works on the result of the previous one, not on the original text.
    while True:
        output = re.sub(r"\.(( )*)\.", ".", collapsed)
        if output == collapsed:
            return output
        collapsed = output

# Check how it will influence different ML models
def process_soft(text):
    if isinstance(text, str):
        text = " ".join(tokenize.sent_tokenize(text))
        text = re.sub(r"http\S+", "", text)
        text = re.sub(r"\n+", ". ", text)
        for symb in ["!", ",", ":", ";", "?"]:
            text = re.sub(rf"\{symb}\.", symb, text)
        text = re.sub("[^а-яА-Яa-zA-Z0-9!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ё]+", " ", text)
        text = re.sub(r"#\S+", "", text)
        text = collapse_dots(text)
        text = text.strip()
    return text
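
The dot-collapsing step can also be expressed as one regular expression, which avoids the loop. This is a minimal sketch, assuming only spaces may separate the dots; `collapse_dots_single_pass` is a hypothetical name and is not used elsewhere in the notebook.

```python
import re

def collapse_dots_single_pass(text: str) -> str:
    # A dot followed by one or more "optional spaces + dot" groups becomes a single dot
    return re.sub(r"\.(?: *\.)+", ".", text)

samples = ["Wait... what", "End. . . Next", "No change. Here"]
results = [collapse_dots_single_pass(sample) for sample in samples]
for sample, result in zip(samples, results):
    print(repr(sample), "->", repr(result))
```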

In [58]:
start_time = time()
summaries_train_df['text_soft_preprocessed'] = summaries_train_df['text'].apply(process_soft)
print("Total processing time is:", time() - start_time, "secs")

Total processing time is: 0.8830337524414062 secs


In [47]:
summaries_train_df

|   | student_id | prompt_id | text | content | wording | text_hard_preprocessed_stemmed | text_hard_preprocessed_lemmatized | text_soft_preprocessed |
|---|---|---|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 | wave experimentto react new leader govern gain... | wave experimentto reacted new leader governmen... | The third wave was an experimentto see how peo... |
| 1 | 0020ae56ffbf | ebad26 | They would rub it up with soda to make the sme... | -0.548304 | 0.506755 | rub soda smell away wouldnt smell meat toss fl... | rub soda smell away wouldnt smell meat tossed ... | They would rub it up with soda to make the sme... |
| 2 | 004e978e639e | 3b9047 | In Egypt, there were many occupations and soci... | 3.128928 | 4.231226 | egypt occup social class involv day day live i... | egypt occupation social class involved day day... | In Egypt, there were many occupations and soci... |
| 3 | 005ab0199905 | 3b9047 | The highest class was Pharaohs these people we... | -0.210614 | -0.471415 | highest class pharaoh god nd highest class gon... | highest class pharaoh god nd highest class gon... | The highest class was Pharaohs these people we... |
| 4 | 0070c9e7af47 | 814d6b | The Third Wave developed  rapidly because the ... | 3.272894 | 3.219757 | wave develop rapidli student genuinli believ b... | wave developed rapidly student genuinly believ... | The Third Wave developed rapidly because the s... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7160 | ff7c7e70df07 | ebad26 | They used all sorts of chemical concoctions to... | 0.205683 | 0.380538 | sort chemic concoct meat fine shown quot mirac... | sort chemical concoction meat fine shown quote... | They used all sorts of chemical concoctions to... |
| 7161 | ffc34d056498 | 3b9047 | The lowest classes are slaves and farmers slav... | -0.308448 | 0.048171 | lowest class slave farmer slave taken war farm... | lowest class slave farmer slave taken war farm... | The lowest classes are slaves and farmers slav... |
| 7162 | ffd1576d2e1b | 3b9047 | they sorta made people start workin... | -1.408180 | -0.493603 | sorta start work structour they barley got pai... | sorta start working structour theyed barley go... | they sorta made people start working on the st... |
| 7163 | ffe4a98093b2 | 39c16e | An ideal tragety has three elements that make ... | -0.393310 | 0.627128 | ideal trageti element ideal start great traged... | ideal tragety element ideal start great traged... | An ideal tragety has three elements that make ... |


In [60]:
prompts_train_df['prompt_question_hard_preprocessed_stemmed'] = prompts_train_df['prompt_question'].apply(preprocess_hard_stemming)
prompts_train_df['prompt_title_hard_preprocessed_stemmed'] = prompts_train_df['prompt_title'].apply(preprocess_hard_stemming)
prompts_train_df['prompt_text_hard_preprocessed_stemmed'] = prompts_train_df['prompt_text'].apply(preprocess_hard_stemming)

prompts_train_df['prompt_question_hard_preprocessed_lemmatized'] = prompts_train_df['prompt_question'].apply(preprocess_hard_lemmatizing)
prompts_train_df['prompt_title_hard_preprocessed_lemmatized'] = prompts_train_df['prompt_title'].apply(preprocess_hard_lemmatizing)
prompts_train_df['prompt_text_hard_preprocessed_lemmatized'] = prompts_train_df['prompt_text'].apply(preprocess_hard_lemmatizing)

prompts_train_df['prompt_question_soft_preprocessed'] = prompts_train_df['prompt_question'].apply(process_soft)
prompts_train_df['prompt_title_soft_preprocessed'] = prompts_train_df['prompt_title'].apply(process_soft)
prompts_train_df['prompt_text_soft_preprocessed'] = prompts_train_df['prompt_text'].apply(process_soft)

prompts_train_df

|   | prompt_id | prompt_question | prompt_title | prompt_text | prompt_question_hard_preprocessed_stemmed | prompt_title_hard_preprocessed_stemmed | prompt_text_hard_preprocessed_stemmed | prompt_question_hard_preprocessed_lemmatized | prompt_title_hard_preprocessed_lemmatized | prompt_text_hard_preprocessed_lemmatized | prompt_question_soft_preprocessed | prompt_title_soft_preprocessed | prompt_text_soft_preprocessed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... | summar element ideal tragedi describ aristotl | tragedi | chapter sequel said proceed consid poet aim av... | summarize element ideal tragedy described aris... | tragedy | chapter sequel said proceed consider poet aim ... | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 . As the sequel to what has already... |
| 1 | 3b9047 | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... | complet sentenc summar structur ancient egypti... | egyptian social structur | egyptian societi structur like pyramid god ra ... | complete sentence summarize structure ancient ... | egyptian social structure | egyptian society structured like pyramid god r... | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... |
| 2 | 814d6b | Summarize how the Third Wave developed over su... | The Third Wave | Background \r\nThe Third Wave experiment took ... | summar wave develop short period time experi end | wave | background wave experi took place cubberley hi... | summarize wave developed short period time exp... | wave | background wave experiment took place cubberle... | Summarize how the Third Wave developed over su... | The Third Wave | Background . The Third Wave experiment took pl... |
| 3 | ebad26 | Summarize the various ways the factory would u... | Excerpt from The Jungle | With one member trimming beef in a cannery, an... | summar way factori cover spoil meat cite evid ... | excerpt jungl | member trim beef canneri work sausag factori f... | summarize way factory cover spoiled meat cite ... | excerpt jungle | member trimming beef cannery working sausage f... | Summarize the various ways the factory would u... | Excerpt from The Jungle | With one member trimming beef in a cannery, an... |


### Features from the Commonlit authors

From https://www.kaggle.com/code/gusthema/commonlit-evaluate-student-summaries-w-tfdf

In [52]:
# Count the stop words in the text.
def count_stopwords(text: str) -> int:
    stopword_list = set(stopwords.words('english'))
    words = text.split()
    stopwords_count = sum(1 for word in words if word.lower() in stopword_list)
    return stopwords_count

# Count the punctuations in the text.
# punctuation_set -> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
def count_punctuation(text: str) -> int:
    punctuation_set = set(string.punctuation)
    punctuation_count = sum(1 for char in text if char in punctuation_set)
    return punctuation_count

# Count the numbers (runs of consecutive digits) in the text.
def count_numbers(text: str) -> int:
    numbers = re.findall(r'\d+', text)
    numbers_count = len(numbers)
    return numbers_count

# This function applies all the above preprocessing functions on a text feature.
def streamlit_feature_engineer(dataframe: pd.DataFrame, feature: str = 'text', preprocessed_hard: bool = False) -> pd.DataFrame:
    dataframe[f'{feature}_word_cnt'] = dataframe[feature].apply(lambda x: len(x.split(' ')))
    dataframe[f'{feature}_length'] = dataframe[feature].apply(lambda x: len(x))
    if not preprocessed_hard:
        dataframe[f'{feature}_stopword_cnt'] = dataframe[feature].apply(lambda x: count_stopwords(x))
        dataframe[f'{feature}_punct_cnt'] = dataframe[feature].apply(lambda x: count_punctuation(x))
        dataframe[f'{feature}_number_cnt'] = dataframe[feature].apply(lambda x: count_numbers(x))
    return dataframe
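
One detail worth keeping in mind: `len(x.split(' '))` and `len(x.split())` can disagree when a text contains consecutive spaces, which may explain small differences between the word counts of raw and soft-preprocessed texts. A quick illustration on a made-up sample string:

```python
sample = "The Third Wave developed  rapidly"  # note the double space

count_single_space = len(sample.split(" "))  # splits on every single space, keeps empty strings
count_whitespace = len(sample.split())       # splits on runs of whitespace

print(count_single_space, count_whitespace)  # 6 5
```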

In [62]:
summaries_train_df = streamlit_feature_engineer(summaries_train_df)
summaries_train_df = streamlit_feature_engineer(summaries_train_df, feature = "text_hard_preprocessed_stemmed", preprocessed_hard=True)
summaries_train_df = streamlit_feature_engineer(summaries_train_df, feature = "text_hard_preprocessed_lemmatized", preprocessed_hard=True)
summaries_train_df = streamlit_feature_engineer(summaries_train_df, feature = "text_soft_preprocessed")

In [63]:
prompts_ids_to_is = {prompt_id: i for i, prompt_id in zip(prompts_train_df.index, prompts_train_df['prompt_id'])}
summaries_train_df['prompt_i'] = summaries_train_df['prompt_id'].apply(lambda prompt_id: prompts_ids_to_is[prompt_id])
summaries_train_df.head()

|   | student_id | prompt_id | text | content | wording | text_hard_preprocessed_stemmed | text_hard_preprocessed_lemmatized | text_soft_preprocessed | text_word_cnt | text_length | ... | text_hard_preprocessed_stemmed_word_cnt | text_hard_preprocessed_stemmed_length | text_hard_preprocessed_lemmatized_word_cnt | text_hard_preprocessed_lemmatized_length | text_soft_preprocessed_word_cnt | text_soft_preprocessed_length | text_soft_preprocessed_stopword_cnt | text_soft_preprocessed_punct_cnt | text_soft_preprocessed_number_cnt | prompt_i |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 | wave experimentto react new leader govern gain... | wave experimentto reacted new leader governmen... | The third wave was an experimentto see how peo... | 61 | 346 | ... | 28 | 173 | 28 | 201 | 61 | 346 | 25 | 3 | 0 | 2 |
| 1 | 0020ae56ffbf | ebad26 | They would rub it up with soda to make the sme... | -0.548304 | 0.506755 | rub soda smell away wouldnt smell meat toss fl... | rub soda smell away wouldnt smell meat tossed ... | They would rub it up with soda to make the sme... | 52 | 244 | ... | 14 | 80 | 14 | 82 | 52 | 244 | 30 | 2 | 0 | 3 |
| 2 | 004e978e639e | 3b9047 | In Egypt, there were many occupations and soci... | 3.128928 | 4.231226 | egypt occup social class involv day day live i... | egypt occupation social class involved day day... | In Egypt, there were many occupations and soci... | 235 | 1370 | ... | 101 | 638 | 101 | 706 | 235 | 1370 | 98 | 38 | 0 | 1 |
| 3 | 005ab0199905 | 3b9047 | The highest class was Pharaohs these people we... | -0.210614 | -0.471415 | highest class pharaoh god nd highest class gon... | highest class pharaoh god nd highest class gon... | The highest class was Pharaohs these people we... | 25 | 157 | ... | 14 | 89 | 14 | 95 | 25 | 157 | 11 | 6 | 2 | 1 |
| 4 | 0070c9e7af47 | 814d6b | The Third Wave developed  rapidly because the ... | 3.272894 | 3.219757 | wave develop rapidli student genuinli believ b... | wave developed rapidly student genuinly believ... | The Third Wave developed rapidly because the s... | 206 | 1225 | ... | 87 | 581 | 87 | 658 | 203 | 1222 | 92 | 30 | 3 | 2 |


## Modelling

Let's use Ridge regression and LightGBM again, with several improvements:
- A separate model for each of the two target scores (content and wording).
- Different vectorization techniques applied to the whole dataset.
- Evaluation of the different preprocessing variants.
- Cosine similarity between the summaries and each prompt field separately, instead of between the summaries and the merged prompt fields.
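
The last point amounts to computing one similarity score per summary against a field of its own prompt. Below is a minimal sketch with plain Python lists standing in for TF-IDF vectors; the helper names and the toy vectors are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rowwise_prompt_similarity(prompt_vectors, prompt_index, summary_vectors):
    # Pair each summary with the vector of the prompt it was written for
    return [
        cosine_similarity(prompt_vectors[i], summary)
        for i, summary in zip(prompt_index, summary_vectors)
    ]

prompt_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]   # one row per prompt
summary_vectors = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
prompt_index = [0, 1, 1]                                # prompt row for each summary
scores = rowwise_prompt_similarity(prompt_vectors, prompt_index, summary_vectors)
print(scores)
```

Running the same routine once per prompt field (text, title, question) yields three separate similarity features per summary.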

### Ridge regression

In [66]:
def ridge_pipeline_evaluation(alpha,
                              features: list,
                              features_to_scale: list,
                              target : str="wording", 
                              vectorizer=TfidfVectorizer(
                                    analyzer='word',
                                    stop_words='english',
                                    ngram_range=(1, 3),
                                    lowercase=True,
                                    min_df=1,
                                    max_features=30000
                                ),
                              vectorizer_feature: str = "text",
                              prompt_processed_features: dict = {
                                  'prompt_question': 'prompt_question',
                                  'prompt_title': 'prompt_title',
                                  'prompt_text': 'prompt_text'
                              },
                              verbose=False):
    
    metrics_lists={'train': [], 'val': []}
    
    if verbose:
        print(f"Ridge starting, alpha={alpha}")
    
    for i in tqdm(range(len(prompts_train_df))):
        test_prompt_id = prompts_train_df.loc[i, 'prompt_id']
        summaries_train, summaries_val = summaries_train_df[~(summaries_train_df['prompt_id'] == test_prompt_id)], summaries_train_df[summaries_train_df['prompt_id'] == test_prompt_id]

        X_train, y_train = summaries_train.loc[:, ['prompt_i', *features]], summaries_train.loc[:, target]
        X_val, y_val = summaries_val.loc[:, ['prompt_i', *features]], summaries_val.loc[:, target]

        vectorizer = vectorizer.fit(X_train[vectorizer_feature])
        train_summaries_vectors = vectorizer.transform(X_train[vectorizer_feature])
        val_summaries_vectors = vectorizer.transform(X_val[vectorizer_feature])
        
        prompts_texts_vectors = vectorizer.transform(prompts_train_df[prompt_processed_features['prompt_text']])
        prompts_titles_vectors = vectorizer.transform(prompts_train_df[prompt_processed_features['prompt_title']])
        prompts_questions_vectors = vectorizer.transform(prompts_train_df[prompt_processed_features['prompt_question']])

        scaler = RobustScaler().fit(X_train[features_to_scale])
        X_train[features_to_scale] = scaler.transform(X_train[features_to_scale])
        X_val[features_to_scale] = scaler.transform(X_val[features_to_scale])

        y_scaler = RobustScaler().fit(y_train.to_numpy().reshape(-1, 1))
        y_train_scaled = y_scaler.transform(y_train.to_numpy().reshape(-1, 1))

        # spatial.distance.cosine only accepts a pair of 1-D dense vectors, so the row-wise
        # similarity is computed directly on the sparse matrices instead.
        # Assumption: rows are L2-normalized (the TfidfVectorizer default, norm='l2'),
        # so the dot product of two rows equals their cosine similarity.
        train_prompt_rows = X_train['prompt_i'].to_numpy()
        val_prompt_rows = X_val['prompt_i'].to_numpy()

        cosine_scores_train_prompts_texts = np.asarray(prompts_texts_vectors[train_prompt_rows].multiply(train_summaries_vectors).sum(axis=1))
        cosine_scores_val_prompts_texts = np.asarray(prompts_texts_vectors[val_prompt_rows].multiply(val_summaries_vectors).sum(axis=1))
        
        cosine_scores_train_prompts_titles = np.asarray(prompts_titles_vectors[train_prompt_rows].multiply(train_summaries_vectors).sum(axis=1))
        cosine_scores_val_prompts_titles = np.asarray(prompts_titles_vectors[val_prompt_rows].multiply(val_summaries_vectors).sum(axis=1))
        
        cosine_scores_train_prompts_questions = np.asarray(prompts_questions_vectors[train_prompt_rows].multiply(train_summaries_vectors).sum(axis=1))
        cosine_scores_val_prompts_questions = np.asarray(prompts_questions_vectors[val_prompt_rows].multiply(val_summaries_vectors).sum(axis=1))

        X_train = sparse.hstack((
            train_summaries_vectors,
            sparse.coo_matrix(cosine_scores_train_prompts_texts),
            sparse.coo_matrix(cosine_scores_train_prompts_titles),
            sparse.coo_matrix(cosine_scores_train_prompts_questions),
            sparse.coo_matrix(X_train[features].to_numpy()),
        ))
        X_val = sparse.hstack((
            val_summaries_vectors,
            sparse.coo_matrix(cosine_scores_val_prompts_texts),
            sparse.coo_matrix(cosine_scores_val_prompts_titles),
            sparse.coo_matrix(cosine_scores_val_prompts_questions),
            sparse.coo_matrix(X_val[features].to_numpy()),
        ))

        model = Ridge(alpha=alpha)
        model.fit(X_train, y_train_scaled)
        y_train_pred_scaled = model.predict(X_train)
        y_val_pred_scaled = model.predict(X_val)

        y_train_pred = y_scaler.inverse_transform(y_train_pred_scaled)
        y_val_pred = y_scaler.inverse_transform(y_val_pred_scaled)

        train_mse = mean_squared_error(y_train_pred, y_train)
        val_mse = mean_squared_error(y_val_pred, y_val)
        
        if verbose:
            print(f"Train MSE for {target}: {train_mse:.3f}, Val MSE for {target}: {val_mse:.3f}")
            
        metrics_lists['train'].append(train_mse)
        metrics_lists['val'].append(val_mse)
    
    metrics_avgs = {name: sum(metrics_list)/len(metrics_list) for name, metrics_list in metrics_lists.items()}
    return metrics_avgs
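
The evaluation loop holds out one prompt at a time, trains on the remaining prompts, and averages the per-fold errors. The splitting logic on its own can be sketched as follows; the generator name `leave_one_group_out` and the short list of prompt IDs are illustrative assumptions.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, val_indices) for every distinct group."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, group in enumerate(groups) if group != held_out]
        val_idx = [i for i, group in enumerate(groups) if group == held_out]
        yield held_out, train_idx, val_idx

# Prompt ID of each summary (toy sample)
prompt_ids = ["814d6b", "ebad26", "3b9047", "3b9047", "814d6b"]
folds = list(leave_one_group_out(prompt_ids))
for held_out, train_idx, val_idx in folds:
    print(held_out, train_idx, val_idx)
```

Splitting by prompt mirrors the competition setup, where test prompts are never seen in training. scikit-learn's `LeaveOneGroupOut` offers the same kind of split if a library implementation is preferred.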