**Stage 1** (run in Colab; only the code is shown here): summarization of the news texts. This was the most time-consuming part of the work: summarizing 10,000 news articles took 4 hours in Colab on a GPU. The result is a separate dataset containing the source data plus a column with the summaries, used for the subsequent experiments. Summarization was done with the BART transformer model from the Hugging Face library:
https://huggingface.co/facebook/bart-large-cnn
Result: file sc454k_10k_summary_rs42.parquet

In [None]:
# Stage 1: Summarization (completed) -> sc454k_10k_summary_rs42.parquet
# Assumes df (the source news DataFrame with an 'Article' column) is already loaded.
from tqdm.notebook import tqdm
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)

def summarize_text(text):
    # Keep only the first 1024 characters (a character cut, not a token cut)
    text = text[:1024]
    return summarizer(text, max_length=150, min_length=50, do_sample=False)[0]['summary_text']

tqdm.pandas(desc="Processing articles")

df['summary'] = df['Article'].progress_apply(summarize_text)
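
The cell above cuts each article at a fixed character count, which can end the input mid-sentence. A minimal sketch of an alternative, assuming the same 1024-character limit: cut at the last sentence boundary that fits. The helper name `truncate_at_sentence` is hypothetical and was not part of the original run.

```python
def truncate_at_sentence(text: str, max_chars: int = 1024) -> str:
    """Cut text to at most max_chars, preferring the last sentence end."""
    if len(text) <= max_chars:
        return text
    head = text[:max_chars]
    # Find the last sentence-ending punctuation that is followed by a space.
    cut = max(head.rfind(". "), head.rfind("! "), head.rfind("? "))
    if cut == -1:
        return head  # no boundary found: fall back to a hard character cut
    return head[:cut + 1]  # keep the punctuation mark itself


sample = "First sentence. " * 100
short = truncate_at_sentence(sample, max_chars=50)
print(len(short), repr(short))
```

The fallback keeps the original behavior for texts without any sentence boundary inside the limit, so the function never returns more than `max_chars` characters.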

In [6]:
import torch

if torch.cuda.is_available():
    print("GPU is available!")
    print(f"Device Name: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is not available.")

GPU is available!
Device Name: NVIDIA GeForce RTX 3080 Ti


In [30]:
# Check the summarization results
import numpy as np
import pandas as pd

file_path = "sc454k_10k_summary_rs42.parquet"

df = pd.read_parquet(file_path)

display(df.info())
display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Symbol                 10000 non-null  object 
 1   Security               10000 non-null  object 
 2   Sector                 9889 non-null   object 
 3   Industry               9889 non-null   object 
 4   URL                    10000 non-null  object 
 5   Date                   10000 non-null  object 
 6   RelatedStocksList      7302 non-null   object 
 7   Article                10000 non-null  object 
 8   Title                  9987 non-null   object 
 9   articleType            10000 non-null  object 
 10  Publication            9994 non-null   object 
 11  Author                 7003 non-null   object 
 12  weighted_avg_-96_hrs   10000 non-null  float64
 13  weighted_avg_-48_hrs   10000 non-null  float64
 14  weighted_avg_-24_hrs   10000 non-null  float64
 15  wei

None

Unnamed: 0,Symbol,Security,Sector,Industry,URL,Date,RelatedStocksList,Article,Title,articleType,...,weighted_avg_6_hrs,weighted_avg_8_hrs,weighted_avg_12_hrs,weighted_avg_24_hrs,weighted_avg_48_hrs,weighted_avg_72_hrs,weighted_avg_96_hrs,weighted_avg_360_hrs,weighted_avg_720_hrs,summary
0,PZZA,"Papa John's International, Inc.",Consumer Discretionary,Restaurants,https://www.nasdaq.com/articles/papa-johns-sha...,"Jun 18, 2019 07:39 PM ET",Markets,A **Papa John’s** (NASDAQ:) partnership with S...,Papa John’s Shaq Deal: 7 Things About Shaquill...,News,...,49.4716,48.9466,49.1189,49.0284,49.0755,46.8525,44.1457,44.7833,45.1615,PZZA stock is down 2.4% Tuesday. NBA legend Sh...
1,LOVE,The Lovesac Company,Consumer Discretionary,Other Specialty Stores,https://www.nasdaq.com/articles/monday-10-17-i...,"Oct 17, 2022 02:45 PM ET",FRD|Markets,Bargain hunters are wise to pay careful attent...,"Monday 10/17 Insider Buying Report: FRD, LOVE",News,...,21.0903,21.8352,22.5121,22.811,22.1818,21.3805,21.1099,24.7275,26.6617,Bargain hunters are wise to pay careful attent...
2,RCKT,"Rocket Pharmaceuticals, Inc.",Health Care,Biotechnology: Pharmaceutical Preparations,https://www.nasdaq.com/press-release/rocket-ph...,"Feb 09, 2023 04:07 PM ET",,"CRANBURY, N.J.--(BUSINESS WIRE)--\n [Rocket Ph...",Rocket Pharmaceuticals to Present at the SVB S...,Press Release,...,20.0081,20.2062,19.71,19.6557,20.0553,19.8299,19.8,18.8052,14.8639,"Rocket Pharmaceuticals, Inc. is a leading late..."
3,BMBL,Bumble Inc.,Technology,Computer Software: Programming Data Processing,https://www.nasdaq.com/articles/commit-to-purc...,"Jun 27, 2023 01:35 PM ET",Markets,Investors eyeing a purchase of Bumble Inc (Sym...,"Commit To Purchase Bumble At $10, Earn 12.5% U...",News,...,16.834,16.4434,16.8182,16.7866,16.7633,16.8915,16.8013,18.8289,18.7471,Selling a put does not give an investor access...
4,CSIQ,Canadian Solar Inc.,Technology,Semiconductors,https://www.nasdaq.com/articles/jinkosolar-jks...,"Jan 15, 2023 02:19 AM ET",Stocks|SOL|JKS|SEDG,"**JinkoSolar Holding Co., Ltd**. [JKS](https:/...",JinkoSolar (JKS) Introduces Upgraded Tiger Neo...,News,...,42.6765,42.6765,42.423,42.423,42.4461,41.6878,40.2684,38.7182,40.5397,"JinkoSolar Holding Co., Ltd. (JKS) recently un..."


**Stage 2**: Sentiment analysis of the summary texts. Result: file sc454k_10k_summary_sentiment_rs42.parquet

In [32]:
# Stage 2: Sentiment analysis for summary (completed) -> sc454k_10k_summary_sentiment_rs42.parquet
from tqdm.notebook import tqdm
from transformers import pipeline

output_file = "sc454k_10k_summary_sentiment_rs42.parquet"

if 'summary' not in df.columns:
    raise ValueError("The 'summary' column is missing in the input file. Ensure Step 1 is completed properly.")

sentiment_analyzer = pipeline(
    "text-classification",
    model="ProsusAI/finbert",
    tokenizer="ProsusAI/finbert",
    device=0
)

def analyze_sentiment(text):
    # The pipeline returns only the top-scoring label and its score
    results = sentiment_analyzer(text)
    result = results[0]
    label = result['label']
    score = result['score']

    if label == "positive":
        positive_prob = score
        negative_prob = 1 - score  # remaining mass (negative + neutral)
    elif label == "negative":
        negative_prob = score
        positive_prob = 1 - score  # remaining mass (positive + neutral)
    else:  # Neutral case: no directional signal, so split evenly
        positive_prob = 0.5
        negative_prob = 0.5

    return positive_prob, negative_prob

tqdm.pandas(desc="Analyzing sentiment")

df[['positive_prob', 'negative_prob']] = df['summary'].progress_apply(
    lambda text: pd.Series(analyze_sentiment(text))
)

df.to_parquet(output_file, index=False)
print(f"Sentiment analysis completed. Results saved to {output_file}")


Device set to use cuda:0


Analyzing sentiment:   0%|          | 0/10000 [00:00<?, ?it/s]

Sentiment analysis completed. Results saved to sc454k_10k_summary_sentiment_rs42.parquet
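
The Stage 2 cell keeps only the top label, so every neutral summary collapses to 0.5/0.5. A sketch of a lossless alternative follows, assuming the classifier is asked for the scores of all three classes. To my understanding the Transformers text-classification pipeline does this when called with `top_k=None`, but that is worth verifying against the installed version. The helper `scores_to_probs` and the example numbers are hypothetical, not real model output.

```python
def scores_to_probs(all_scores):
    """Convert [{'label': ..., 'score': ...}, ...] into a label -> probability dict."""
    probs = {item["label"].lower(): float(item["score"]) for item in all_scores}
    # Make sure all three FinBERT classes are present even if the input omitted one.
    for label in ("positive", "negative", "neutral"):
        probs.setdefault(label, 0.0)
    return probs


# Illustrative scores for one summary (made-up numbers).
example = [
    {"label": "positive", "score": 0.10},
    {"label": "negative", "score": 0.15},
    {"label": "neutral", "score": 0.75},
]
probs = scores_to_probs(example)
print(probs["positive"], probs["negative"], probs["neutral"])
```

With per-class scores, a mostly neutral summary that leans slightly negative stays distinguishable from one that leans slightly positive.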


**Stage 3**: Computing embeddings for the summaries. Result: file sc454k_10k_summary_embeddings_sentiment_rs42.parquet

In [157]:
# Stage 3: Embeddings for summary (completed) -> sc454k_10k_summary_embeddings_sentiment_rs42.parquet

import torch
from tqdm.notebook import tqdm
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA

output_file = "sc454k_10k_summary_embeddings_sentiment_rs42.parquet"

if 'summary' not in df.columns:
    raise ValueError("The 'summary' column is missing in the input file. Ensure Step 1 is completed properly.")

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
print("FinBERT model loaded.")


def generate_embeddings(texts, tokenizer, model, output_dim=64):
    all_embeddings = []
    
    for text in tqdm(texts, desc="Generating FinBERT embeddings"):
        inputs = tokenizer(
            text, 
            return_tensors="pt", 
            truncation=True, 
            max_length=512, 
            padding="max_length"
        ).to(device)
        
        with torch.no_grad():
            outputs = model(**inputs)
        
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
        all_embeddings.append(cls_embedding.flatten())
    
    all_embeddings = np.vstack(all_embeddings)
    
    # Apply PCA to reduce to output_dim
    pca = PCA(n_components=output_dim, random_state=42)
    reduced_embeddings = pca.fit_transform(all_embeddings)
    
    return reduced_embeddings

# Apply the embedding function to the 'summary' column
embeddings_num = 16  # 64 in the run whose outputs are shown below (embedding_0..embedding_63)
summaries = df['summary'].tolist()
reduced_embeddings = generate_embeddings(summaries, tokenizer, model, output_dim=embeddings_num)

# Add embeddings as new columns in the existing DataFrame
for i in range(embeddings_num):
    df[f'embedding_{i}'] = reduced_embeddings[:, i]

df.to_parquet(output_file, index=False)
print(f"Updated DataFrame with embeddings saved to {output_file}")

FinBERT model loaded.


Generating FinBERT embeddings:   0%|          | 0/10000 [00:00<?, ?it/s]

Updated DataFrame with embeddings saved to sc454k_10k_summary_embeddings_sentiment_rs42.parquet


In [38]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 97 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Symbol                 10000 non-null  object 
 1   Security               10000 non-null  object 
 2   Sector                 9889 non-null   object 
 3   Industry               9889 non-null   object 
 4   URL                    10000 non-null  object 
 5   Date                   10000 non-null  object 
 6   RelatedStocksList      7302 non-null   object 
 7   Article                10000 non-null  object 
 8   Title                  9987 non-null   object 
 9   articleType            10000 non-null  object 
 10  Publication            9994 non-null   object 
 11  Author                 7003 non-null   object 
 12  weighted_avg_-96_hrs   10000 non-null  float64
 13  weighted_avg_-48_hrs   10000 non-null  float64
 14  weighted_avg_-24_hrs   10000 non-null  float64
 15  wei

Unnamed: 0,Symbol,Security,Sector,Industry,URL,Date,RelatedStocksList,Article,Title,articleType,...,embedding_54,embedding_55,embedding_56,embedding_57,embedding_58,embedding_59,embedding_60,embedding_61,embedding_62,embedding_63
0,PZZA,"Papa John's International, Inc.",Consumer Discretionary,Restaurants,https://www.nasdaq.com/articles/papa-johns-sha...,"Jun 18, 2019 07:39 PM ET",Markets,A **Papa John’s** (NASDAQ:) partnership with S...,Papa John’s Shaq Deal: 7 Things About Shaquill...,News,...,-0.428295,2.006473,0.394459,-1.294207,-0.588683,0.261005,-0.277107,-0.401201,0.993497,-1.298946
1,LOVE,The Lovesac Company,Consumer Discretionary,Other Specialty Stores,https://www.nasdaq.com/articles/monday-10-17-i...,"Oct 17, 2022 02:45 PM ET",FRD|Markets,Bargain hunters are wise to pay careful attent...,"Monday 10/17 Insider Buying Report: FRD, LOVE",News,...,0.373199,0.129881,-0.472512,-0.091278,-0.054997,0.237271,1.03059,0.954953,-0.40263,0.109326
2,RCKT,"Rocket Pharmaceuticals, Inc.",Health Care,Biotechnology: Pharmaceutical Preparations,https://www.nasdaq.com/press-release/rocket-ph...,"Feb 09, 2023 04:07 PM ET",,"CRANBURY, N.J.--(BUSINESS WIRE)--\n [Rocket Ph...",Rocket Pharmaceuticals to Present at the SVB S...,Press Release,...,0.828223,0.03973,-0.057006,-0.236145,0.212412,-0.291634,-0.168695,0.052711,0.324847,0.354604
3,BMBL,Bumble Inc.,Technology,Computer Software: Programming Data Processing,https://www.nasdaq.com/articles/commit-to-purc...,"Jun 27, 2023 01:35 PM ET",Markets,Investors eyeing a purchase of Bumble Inc (Sym...,"Commit To Purchase Bumble At $10, Earn 12.5% U...",News,...,-0.093483,-0.643263,-0.205195,0.656,-0.17482,0.320681,0.322185,-0.346533,-0.240105,-0.264372
4,CSIQ,Canadian Solar Inc.,Technology,Semiconductors,https://www.nasdaq.com/articles/jinkosolar-jks...,"Jan 15, 2023 02:19 AM ET",Stocks|SOL|JKS|SEDG,"**JinkoSolar Holding Co., Ltd**. [JKS](https:/...",JinkoSolar (JKS) Introduces Upgraded Tiger Neo...,News,...,0.045486,-0.707836,-0.528663,0.043703,0.113261,0.347916,0.209496,0.63364,-0.18271,-0.004339


**Stage 4**: Experiments with linear regression

In [79]:
# Stage 4: Linreg

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

df = pd.read_parquet("sc454k_10k_summary_embeddings_sentiment_rs42.parquet")

df = df[df['weighted_avg_0_hrs'] > 0]

# Compute the 4-hour price change (percent) after filtering
df['price_4h_change_percent'] = ((df['weighted_avg_4_hrs'] - df['weighted_avg_0_hrs']) / df['weighted_avg_0_hrs'] * 100).round(2)

# Drop rows with NaN values in the target or features
df = df.dropna(subset=['price_4h_change_percent'])

# Select features (embeddings + probabilities) and target
feature_columns = [f"embedding_{i}" for i in range(64)] + ["positive_prob", "negative_prob"]
X = df[feature_columns]
y = df['price_4h_change_percent']

display(X.head())
display(y.head())
display(X.info())
display(y.info())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")

Unnamed: 0,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,embedding_5,embedding_6,embedding_7,embedding_8,embedding_9,...,embedding_56,embedding_57,embedding_58,embedding_59,embedding_60,embedding_61,embedding_62,embedding_63,positive_prob,negative_prob
0,-4.851347,3.120531,-1.313104,1.09799,0.165416,-3.166709,-1.919456,-3.700295,1.397489,2.963977,...,0.394459,-1.294207,-0.588683,0.261005,-0.277107,-0.401201,0.993497,-1.298946,0.434472,0.565528
1,7.717551,5.86755,-1.688272,0.512769,1.687632,-2.153067,-0.399103,-0.039187,-1.714151,-0.24712,...,-0.472512,-0.091278,-0.054997,0.237271,1.03059,0.954953,-0.40263,0.109326,0.5,0.5
2,9.45083,4.293439,2.219385,-1.031255,0.517836,1.113833,-1.125759,-1.134538,-0.453618,-0.09801,...,-0.057006,-0.236145,0.212412,-0.291634,-0.168695,0.052711,0.324847,0.354604,0.5,0.5
3,5.271941,3.637712,-5.173738,0.358423,-1.138867,-3.150231,1.813922,2.00427,-2.041636,-0.370303,...,-0.205195,0.656,-0.17482,0.320681,0.322185,-0.346533,-0.240105,-0.264372,0.5,0.5
4,8.795382,-3.835295,3.250404,-3.269783,-1.900187,-1.254677,-2.295327,0.176403,-1.355946,-0.118056,...,-0.528663,0.043703,0.113261,0.347916,0.209496,0.63364,-0.18271,-0.004339,0.706716,0.293284


0   -0.72
1    0.00
2    0.07
3   -0.02
4    0.00
Name: price_4h_change_percent, dtype: float64

<class 'pandas.core.frame.DataFrame'>
Index: 9706 entries, 0 to 9999
Data columns (total 66 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   embedding_0    9706 non-null   float32
 1   embedding_1    9706 non-null   float32
 2   embedding_2    9706 non-null   float32
 3   embedding_3    9706 non-null   float32
 4   embedding_4    9706 non-null   float32
 5   embedding_5    9706 non-null   float32
 6   embedding_6    9706 non-null   float32
 7   embedding_7    9706 non-null   float32
 8   embedding_8    9706 non-null   float32
 9   embedding_9    9706 non-null   float32
 10  embedding_10   9706 non-null   float32
 11  embedding_11   9706 non-null   float32
 12  embedding_12   9706 non-null   float32
 13  embedding_13   9706 non-null   float32
 14  embedding_14   9706 non-null   float32
 15  embedding_15   9706 non-null   float32
 16  embedding_16   9706 non-null   float32
 17  embedding_17   9706 non-null   float32
 18  embedding_18 

None

<class 'pandas.core.series.Series'>
Index: 9706 entries, 0 to 9999
Series name: price_4h_change_percent
Non-Null Count  Dtype  
--------------  -----  
9706 non-null   float64
dtypes: float64(1)
memory usage: 151.7 KB


None

Mean Squared Error (MSE): 36.29354984352492
R-squared (R2): -0.021691268963867527
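
To read the negative R2 above: R2 = 1 - SS_res / SS_tot, so always predicting the mean of the targets gives exactly 0, and anything below 0 is worse than that constant baseline. A small worked sketch, using the first five target values shown above and a made-up prediction vector (`noisy_pred` is illustrative only):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed with plain Python."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [-0.72, 0.0, 0.07, -0.02, 0.0]      # first five targets from the output above
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)
noisy_pred = [0.5, -0.5, 0.5, -0.5, 0.5]     # made-up predictions unrelated to the target

r2_mean = r2_score_manual(y_true, mean_pred)
r2_noisy = r2_score_manual(y_true, noisy_pred)
print(r2_mean, r2_noisy)
```

A value of about -0.02 on the test set therefore means the fitted regression is roughly on par with, and slightly worse than, predicting a constant.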


**Conclusions**: the results are unsatisfactory; we try to improve them with an ensemble method, Gradient Boosting. 
The model is nonlinear and is used here as an experiment.

In [71]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.3-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.3-py3-none-win_amd64.whl (124.9 MB)

In [89]:
from xgboost import XGBRegressor

df = pd.read_parquet('sc454k_10k_summary_embeddings_sentiment_rs42.parquet')

# Add the target column
df['price_24h_change_percent'] = ((df['weighted_avg_24_hrs'] - df['weighted_avg_0_hrs']) / df['weighted_avg_0_hrs'] * 100).round(2)

# Remove rows with infinite or NaN values in the target
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['price_24h_change_percent'])

X = df[[col for col in df.columns if col.startswith('embedding_')] + ['positive_prob', 'negative_prob']]
y = df['price_24h_change_percent']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Gradient Boosting model
model = XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    random_state=42
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")

Mean Squared Error (MSE): 696.075783830103
R-squared (R2): -0.2786938547389144


**Conclusions**: the results are still unsatisfactory.

Takeaways from the regression experiments:

Regressing the exact price change on financial news alone proved infeasible here, given the complexity and ambiguity of financial markets. Prices depend not only on news but also on many other factors, such as macroeconomic data, technical indicators, market sentiment, and algorithmic trading. Financial news typically provides only general context, which is not enough for an accurate numerical price forecast.

**Stage 5**: Moving to binary classification

Moving to classification, that is, predicting the direction of the price move (up or down), lets us:
- Reduce noise: simplifying the task to a binary prediction makes the model more robust to irrelevant data.
- Improve interpretability: classification models are easier to interpret and to use in trading strategies.
- Match market practice: traders are more often interested in the direction of a change than in the exact price.

Classification therefore offers a more attainable and practically meaningful goal for analyzing financial news.

As an experiment, we try a nonlinear method: the XGBoost classifier, based on gradient boosting.
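
The binary target described above can be sketched in plain Python. One detail matters: a missing or non-positive base price makes the percent change undefined, and such rows are best excluded before the label is computed, because a comparison like `NaN > 0` silently yields `False` (that is, label 0). The helper `direction_label` and the sample prices are hypothetical.

```python
import math


def direction_label(price_now, price_later):
    """Return 1 (up), 0 (down or flat), or None when the change is undefined."""
    if price_now is None or price_later is None:
        return None
    if math.isnan(price_now) or math.isnan(price_later) or price_now <= 0:
        return None
    change_pct = (price_later - price_now) / price_now * 100
    return 1 if change_pct > 0 else 0


rows = [(49.0, 49.5), (21.0, 20.5), (0.0, 10.0), (float("nan"), 5.0), (16.8, 16.8)]
labels = [direction_label(now, later) for now, later in rows]
print(labels)  # [1, 0, None, None, 0]
```

Rows that come back as `None` would be dropped rather than counted as "down"; a flat price is labeled 0, matching the `> 0` rule used in the notebook.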

In [95]:
# Stage 5: Pivot to binary classification

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df = pd.read_parquet('sc454k_10k_summary_embeddings_sentiment_rs42.parquet')

df['price_24h_change_percent'] = ((df['weighted_avg_24_hrs'] - df['weighted_avg_0_hrs']) / df['weighted_avg_0_hrs'] * 100).round(2)

# Create binary target: 1 for price increase, 0 for no increase (decrease or flat)
df['price_up'] = (df['price_24h_change_percent'] > 0).astype(int)

# Note: 'price_up' is already an int column here, so this dropna removes no rows;
# rows with a NaN price change were labeled 0 by the comparison above
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['price_up'])

# Features: embeddings and sentiment probabilities
X = df[[col for col in df.columns if col.startswith('embedding_')] + ['positive_prob', 'negative_prob']]
y = df['price_up']

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=10, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.49

Classification Report:
              precision    recall  f1-score   support

           0       0.49      0.53      0.51       994
           1       0.49      0.46      0.48      1006

    accuracy                           0.49      2000
   macro avg       0.49      0.49      0.49      2000
weighted avg       0.49      0.49      0.49      2000


Confusion Matrix:
[[523 471]
 [545 461]]


**Conclusions**: the results are still unsatisfactory. We will try other algorithms and a different target variable: the direction of the price move one hour, rather than one day, after the news is published. We will also try excluding the sentiment features. Finally, we will tune the hyperparameters with GridSearch.

We start with the support vector machine.

In [125]:
# SVM
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

df = pd.read_parquet('sc454k_10k_summary_embeddings_sentiment_rs42.parquet')

df['price_1h_change_percent'] = ((df['weighted_avg_1_hrs'] - df['weighted_avg_0_hrs']) / df['weighted_avg_0_hrs'] * 100).round(2)

# Create binary target: 1 for price increase, 0 for price decrease
df['price_up'] = (df['price_1h_change_percent'] > 0).astype(int)

# Drop rows with missing or infinite values in the target
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['price_up'])

# Features: embeddings without sentiment probabilities
X = df[[col for col in df.columns if col.startswith('embedding_')]]
y = df['price_up']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Hyperparameter tuning via GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    #'gamma': ['scale', 'auto', 0.1, 1],
    'kernel': ['linear'],
    'class_weight': ['balanced']
}

# Initialize SVC
svc = SVC(random_state=42)

# Run GridSearchCV to select the hyperparameters
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best parameters found: ", grid_search.best_params_)

# Predict with the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters found:  {'C': 0.1, 'class_weight': 'balanced', 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.50

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.48      0.58      1443
           1       0.29      0.56      0.38       557

    accuracy                           0.50      2000
   macro avg       0.51      0.52      0.48      2000
weighted avg       0.61      0.50      0.53      2000


Confusion Matrix:
[[693 750]
 [247 310]]


**Conclusions:** 
- Accuracy is no better than random guessing.
- The classes are imbalanced: class 0 substantially outnumbers class 1 (1443 vs. 557 in the test set). The class_weight='balanced' parameter tries to compensate for the imbalance, but the results are still insufficient.
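
With imbalanced classes, plain accuracy is easier to judge next to two reference numbers: the majority-class baseline and balanced accuracy (the mean of per-class recall). A short sketch that derives them from the confusion matrix printed above; the helper name `summarize_confusion` is hypothetical.

```python
def summarize_confusion(cm):
    """cm is [[TN, FP], [FN, TP]], the layout printed by sklearn's confusion_matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Accuracy of always predicting the more frequent class.
    majority_baseline = max(tn + fp, fn + tp) / total
    recall_0 = tn / (tn + fp)
    recall_1 = tp / (fn + tp)
    balanced_accuracy = (recall_0 + recall_1) / 2
    return accuracy, majority_baseline, balanced_accuracy


acc, baseline, bal_acc = summarize_confusion([[693, 750], [247, 310]])
print(acc, baseline, bal_acc)
```

For the SVM run this gives an accuracy of about 0.50 against a majority baseline of about 0.72, and a balanced accuracy of about 0.52, so the model sits close to chance once the class ratio is taken into account.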

The model struggles with the class imbalance, so balancing techniques are needed. First, without balancing, we try a computationally lighter linear algorithm, Logistic Regression, with parameters tuned via GridSearch. We change the target interval to 12 hours after the news is published.

In [163]:
# LogReg

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


df = pd.read_parquet('sc454k_10k_summary_embeddings_sentiment_rs42.parquet')

df['price_12h_change_percent'] = ((df['weighted_avg_12_hrs'] - df['weighted_avg_0_hrs']) / df['weighted_avg_0_hrs'] * 100).round(2)

# Create binary target: 1 for price increase, 0 for price decrease
df['price_up'] = (df['price_12h_change_percent'] > 0).astype(int)

# Drop rows with missing or infinite values in the target
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['price_up'])

# Features: embeddings and sentiment probabilities
X = df[[col for col in df.columns if col.startswith('embedding_')] + ['positive_prob', 'negative_prob']]
y = df['price_up']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Hyperparameter grid for logistic regression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear'],
    'class_weight': ['balanced']
}

log_reg = LogisticRegression(solver='liblinear', penalty='l2', class_weight='balanced', max_iter=1000, random_state=42)

grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'C': 0.01, 'class_weight': 'balanced', 'solver': 'lbfgs'}
Accuracy: 0.53

Classification Report:
              precision    recall  f1-score   support

           0       0.57      0.54      0.55      1087
           1       0.48      0.51      0.49       913

    accuracy                           0.53      2000
   macro avg       0.52      0.52      0.52      2000
weighted avg       0.53      0.53      0.53      2000


Confusion Matrix:
[[587 500]
 [450 463]]


**Conclusions**: even without balancing, the results improved. Accuracy rose slightly (now a little better than random guessing), and the model shows no strong bias toward one class, but it still does not separate the classes well. 
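
How much better than guessing is an accuracy of about 0.53 on 2000 test samples? A rough check, assuming the test samples are independent and using a normal approximation to the binomial: compare the observed accuracy with a coin flip (0.5) and with the majority-class rate (1087 / 2000). The counts come from the confusion matrix above; the helper `z_vs_reference` is hypothetical.

```python
import math


def z_vs_reference(correct, total, p0=0.5):
    """Normal-approximation z-score of the observed accuracy against a reference rate p0."""
    acc = correct / total
    se = math.sqrt(p0 * (1 - p0) / total)
    return (acc - p0) / se


correct = 587 + 463          # diagonal of the confusion matrix above
total = 2000
majority_rate = 1087 / total  # share of class 0 in the test set

z_chance = z_vs_reference(correct, total, 0.5)
z_majority = z_vs_reference(correct, total, majority_rate)
print(round(z_chance, 2), round(z_majority, 2))
```

Under these assumptions the result is a little over two standard errors above a coin flip, but below the rate obtained by always predicting class 0, which supports reading the improvement as marginal.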

Let's balance the data and look at the result. 

In [165]:
# LogReg balanced
from sklearn.utils import resample

df = pd.read_parquet('sc454k_10k_summary_embeddings_sentiment_rs42.parquet')

df['price_12h_change_percent'] = ((df['weighted_avg_12_hrs'] - df['weighted_avg_0_hrs']) / df['weighted_avg_0_hrs'] * 100).round(2)

df['price_up'] = (df['price_12h_change_percent'] > 0).astype(int)

df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['price_up'])

X = df[[col for col in df.columns if col.startswith('embedding_')] + ['positive_prob', 'negative_prob']]
y = df['price_up']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Undersample class 0
X_train_df = pd.DataFrame(X_train)
y_train_df = pd.Series(y_train)

# Reset indices so that features and labels align
X_train_df.reset_index(drop=True, inplace=True)
y_train_df.reset_index(drop=True, inplace=True)

# Split the data into the two classes
class_0 = X_train_df[y_train_df == 0]
class_1 = X_train_df[y_train_df == 1]

# Reduce the number of class 0 examples
class_0_downsampled = resample(class_0, 
                               replace=False,  # sample without replacement
                               n_samples=len(class_1),  # same count as class 1
                               random_state=42)

# Combine the data
X_train_balanced = pd.concat([class_0_downsampled, class_1])
y_train_balanced = pd.concat([pd.Series([0] * len(class_0_downsampled)), pd.Series([1] * len(class_1))])

# Re-scale the balanced data (this re-fits the scaler on the balanced training set;
# X_test keeps the scaling fitted on the full dataset above)
X_train_balanced = scaler.fit_transform(X_train_balanced)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear'],
    'class_weight': ['balanced']
}

log_reg = LogisticRegression(solver='liblinear', penalty='l2', class_weight='balanced', max_iter=1000, random_state=42)

grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_balanced, y_train_balanced)

print("Best parameters found: ", grid_search.best_params_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'C': 0.01, 'class_weight': 'balanced', 'solver': 'lbfgs'}
Accuracy: 0.52

Classification Report:
              precision    recall  f1-score   support

           0       0.57      0.51      0.54      1087
           1       0.48      0.53      0.50       913

    accuracy                           0.52      2000
   macro avg       0.52      0.52      0.52      2000
weighted avg       0.53      0.52      0.52      2000


Confusion Matrix:
[[557 530]
 [426 487]]


**Conclusions**: the results got slightly worse, but overall stayed at the same level as in the previous experiment.
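
One detail of the classification cells is worth a note: the scaler is fitted on the full dataset before the train/test split, and in the balanced experiment it is fitted again on the undersampled training set while the test set keeps the earlier scaling. A minimal sketch of the usual pattern, in plain Python for a single feature column: learn the statistics on the training data only, then apply the same parameters to both parts. The helpers `fit_scaler` and `apply_scaler` are hypothetical stand-ins for a scaler's fit and transform steps.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    mu, sigma = mean(column), pstdev(column)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant columns


def apply_scaler(column, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in column]


train = [1.0, 2.0, 3.0, 4.0]
test = [5.0, 6.0]

params = fit_scaler(train)                # statistics come from the training split only
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)  # same parameters, no re-fitting
print(train_scaled, test_scaled)
```

Given how close all results are to chance, this is unlikely to change the conclusions here, but it keeps the test set untouched by anything learned during preprocessing.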

**Final conclusions**:
The following observations were made during the work:
- The sentiment features do not affect model quality: removing them left the metrics unchanged.
- The number of embedding features does not affect model quality: 16, 64, and 768 were tested, and the metrics barely changed.
- Changing the time interval in the target does not affect model quality: 1, 2, 4, 8, 12, and 24 hours were tested, and the metrics barely changed.

Even after class balancing, model accuracy remains low. This suggests that linear models are not well suited to this task and that more complex models are worth exploring, for example nonlinear ones. It also makes sense to try more involved data processing, such as polynomial features and feature engineering. Another direction is to increase the amount of training data, keeping in mind that for this task summarization and the subsequent processing steps needed to obtain embeddings are fairly time-consuming. 