# Feature Engineering for Cryptocurrency Market Sentiment & Price Data 2025

This notebook applies feature engineering to the Cryptocurrency Market Sentiment & Price Data 2025 dataset: lag and rolling price features, categorical encoding, and numerical scaling. The goal is to produce features that improve machine learning models predicting cryptocurrency prices from market sentiment and price data.

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the dataset
data = pd.read_csv('../../Dataset/crypto_sentiment_prediction_dataset.csv')
data.head()

Unnamed: 0,timestamp,cryptocurrency,current_price_usd,price_change_24h_percent,trading_volume_24h,market_cap_usd,social_sentiment_score,news_sentiment_score,news_impact_score,social_mentions_count,fear_greed_index,volatility_index,rsi_technical_indicator,prediction_confidence
0,2025-06-04 20:36:49,Algorand,0.3427,-5.35,1716266.1,1762124000.0,0.367,0.374,1.87,13,53.2,95.1,37.2,78.1
1,2025-06-04 20:48:25,Cosmos,12.042,5.14,10520739.91,209917800000.0,-0.278,-0.107,1.01,600,43.5,76.7,65.0,66.7
2,2025-06-04 21:28:54,Cosmos,11.7675,-6.12,642191.11,175536700000.0,-0.255,0.211,5.69,279,49.1,60.4,32.3,77.4
3,2025-06-04 21:57:48,Ethereum,2861.2829,-11.54,5356227.76,47864190000000.0,-0.531,-0.081,5.11,3504,37.0,100.0,63.0,81.7
4,2025-06-04 22:06:40,Solana,95.3583,5.79,735971.56,266761100000.0,0.369,0.248,1.82,3236,61.7,67.5,55.4,81.8


## 1. Handling Missing Values

Before feature engineering, we check the dataset for missing values, which can be handled either by imputation or by removing the affected rows. The check below returns an empty Series, so no column in this dataset contains missing values.

In [2]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]

Series([], dtype: int64)
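
The dataset has no gaps, but the two options mentioned above (removal and imputation) could look roughly like the following if gaps did appear. This is a minimal sketch on a hypothetical toy frame; the per-coin median is an illustrative choice, not a method the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (the real dataset has none)
toy = pd.DataFrame({
    "cryptocurrency": ["Cosmos", "Cosmos", "Solana", "Solana"],
    "current_price_usd": [12.0, np.nan, 95.0, 97.0],
    "social_sentiment_score": [0.1, 0.3, np.nan, 0.5],
})

# Option 1: removal. Drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: imputation. Fill each numeric gap with the median of the same coin,
# so that one coin's price level is never used to fill another coin's gap.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
per_coin_median = toy.groupby("cryptocurrency")[numeric_cols].transform("median")
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(per_coin_median)

print(len(dropped))
print(imputed.isnull().sum().sum())
```

Removal is the simpler option when few rows are affected. Imputation keeps the rows, but the filled values are estimates.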

## 2. Feature Creation

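We create new features that may help in predicting cryptocurrency prices. The cell below applies `shift` and `rolling` to the whole frame in row order. Because consecutive rows belong to different coins, a lag can hold another coin's price: in the output, `price_lag_1` for the first Cosmos row is 0.3427, the Algorand price. The rolling window also includes the current row, so `rolling_mean` and `rolling_std` contain the target itself. The planned features are:
- **Lag Features**: Previous values of the target variable.
- **Rolling Statistics**: Moving averages and standard deviations.
- **Sentiment Scores**: Aggregate sentiment scores. The dataset already provides `social_sentiment_score` and `news_sentiment_score`; the cell below does not compute a further aggregate.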
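
For the sentiment item, one possible aggregation over the sentiment columns already in the dataset is sketched here. The derived column names (`sentiment_mean`, `sentiment_gap`, `news_sentiment_weighted`) and the choice of formulas are assumptions for illustration, not features defined by the dataset.

```python
import pandas as pd

# Two rows copied from the data.head() output
sample = pd.DataFrame({
    "social_sentiment_score": [0.367, -0.278],
    "news_sentiment_score": [0.374, -0.107],
    "news_impact_score": [1.87, 1.01],
})

# Unweighted mean of the two sentiment sources
sample["sentiment_mean"] = sample[
    ["social_sentiment_score", "news_sentiment_score"]
].mean(axis=1)

# Disagreement between social and news sentiment
sample["sentiment_gap"] = (
    sample["social_sentiment_score"] - sample["news_sentiment_score"]
)

# News sentiment weighted by its impact score (an illustrative choice)
sample["news_sentiment_weighted"] = (
    sample["news_sentiment_score"] * sample["news_impact_score"]
)

print(sample.round(4))
```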

In [5]:
# Create lag features
for lag in range(1, 4):
    data[f'price_lag_{lag}'] = data['current_price_usd'].shift(lag)

# Create rolling features
data['rolling_mean'] = data['current_price_usd'].rolling(window=3).mean()
data['rolling_std'] = data['current_price_usd'].rolling(window=3).std()

# Display the new features
data[['current_price_usd', 'price_lag_1', 'price_lag_2', 'price_lag_3', 'rolling_mean', 'rolling_std']].head()

Unnamed: 0,current_price_usd,price_lag_1,price_lag_2,price_lag_3,rolling_mean,rolling_std
0,0.3427,,,,,
1,12.042,0.3427,,,,
2,11.7675,12.042,0.3427,,8.050733,6.676764
3,2861.2829,11.7675,12.042,0.3427,961.697467,1645.089248
4,95.3583,2861.2829,11.7675,12.042,989.469567,1621.576616
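
Since the rows mix several coins, a per-coin variant of these features may be preferable. The sketch below uses hypothetical toy prices, sorts by coin and time, and computes the lag and rolling mean inside each coin group. The column name `rolling_mean_prev2` and the window size are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data: two coins interleaved in time
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-06-04 20:00", "2025-06-04 20:10", "2025-06-04 20:20",
        "2025-06-04 20:30", "2025-06-04 20:40", "2025-06-04 20:50",
    ]),
    "cryptocurrency": ["Cosmos", "Solana", "Cosmos", "Solana", "Cosmos", "Solana"],
    "current_price_usd": [12.0, 95.0, 11.0, 96.0, 13.0, 94.0],
})

# Sort so that "previous row" means "previous observation of the same coin"
df = df.sort_values(["cryptocurrency", "timestamp"]).reset_index(drop=True)
prices = df.groupby("cryptocurrency")["current_price_usd"]

# Lag computed within each coin; the first row of every coin is NaN
df["price_lag_1"] = prices.shift(1)

# Mean of the two previous prices of the same coin. Shifting before rolling
# keeps the current price (the target) out of its own feature.
df["rolling_mean_prev2"] = prices.transform(
    lambda s: s.shift(1).rolling(window=2).mean()
)

print(df)
```

With this layout the first Solana row gets a missing lag rather than the last Cosmos price.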


## 3. Encoding Categorical Variables

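We one-hot encode the categorical columns (label encoding is an alternative). Selecting every `object` column picks up `timestamp` as well as `cryptocurrency`, because `timestamp` was read as a string, so the output contains one `timestamp_...` indicator column per distinct timestamp. Parsing `timestamp` as a datetime before this step avoids that.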

In [6]:
# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

# One-hot encode categorical variables
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
data.head()

Unnamed: 0,current_price_usd,price_change_24h_percent,trading_volume_24h,market_cap_usd,social_sentiment_score,news_sentiment_score,news_impact_score,social_mentions_count,fear_greed_index,volatility_index,...,timestamp_2025-07-04 19:58:28,cryptocurrency_Avalanche,cryptocurrency_Bitcoin,cryptocurrency_Cardano,cryptocurrency_Chainlink,cryptocurrency_Cosmos,cryptocurrency_Ethereum,cryptocurrency_Polkadot,cryptocurrency_Polygon,cryptocurrency_Solana
0,0.3427,-5.35,1716266.1,1762124000.0,0.367,0.374,1.87,13,53.2,95.1,...,False,False,False,False,False,False,False,False,False,False
1,12.042,5.14,10520739.91,209917800000.0,-0.278,-0.107,1.01,600,43.5,76.7,...,False,False,False,False,False,True,False,False,False,False
2,11.7675,-6.12,642191.11,175536700000.0,-0.255,0.211,5.69,279,49.1,60.4,...,False,False,False,False,False,True,False,False,False,False
3,2861.2829,-11.54,5356227.76,47864190000000.0,-0.531,-0.081,5.11,3504,37.0,100.0,...,False,False,False,False,False,False,True,False,False,False
4,95.3583,5.79,735971.56,266761100000.0,0.369,0.248,1.82,3236,61.7,67.5,...,False,False,False,False,False,False,False,False,False,True
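
One way to keep `timestamp` out of the one-hot encoding is to parse it first and derive a few calendar features from it. This is a sketch on three rows taken from the head of the dataset; the derived column names `hour` and `day_of_week` are assumptions for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "timestamp": ["2025-06-04 20:36:49", "2025-06-04 20:48:25", "2025-06-04 22:06:40"],
    "cryptocurrency": ["Algorand", "Cosmos", "Solana"],
    "current_price_usd": [0.3427, 12.042, 95.3583],
})

# Parse the timestamp so it is a datetime column rather than an object column
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Compact calendar features instead of one indicator per distinct timestamp
raw["hour"] = raw["timestamp"].dt.hour
raw["day_of_week"] = raw["timestamp"].dt.dayofweek  # Monday=0

# One-hot encode only the column that is truly categorical
encoded = pd.get_dummies(raw, columns=["cryptocurrency"], drop_first=True)

print(encoded.columns.tolist())
```

Naming the column explicitly in `columns=` is less convenient than `select_dtypes`, but it prevents high-cardinality string columns from being expanded by accident.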


## 4. Scaling Numerical Features

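We standardize the numerical features to zero mean and unit variance, which can help models that are sensitive to feature scale. The selection below covers every `float64` and `int64` column, so the target `current_price_usd` and the lag and rolling features are scaled as well, and the scaler is fit on the full dataset rather than on a training split.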

In [7]:
# Scale numerical features
scaler = StandardScaler()
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
data.head()

Unnamed: 0,current_price_usd,price_change_24h_percent,trading_volume_24h,market_cap_usd,social_sentiment_score,news_sentiment_score,news_impact_score,social_mentions_count,fear_greed_index,volatility_index,...,timestamp_2025-07-04 19:58:28,cryptocurrency_Avalanche,cryptocurrency_Bitcoin,cryptocurrency_Cardano,cryptocurrency_Chainlink,cryptocurrency_Cosmos,cryptocurrency_Ethereum,cryptocurrency_Polkadot,cryptocurrency_Polygon,cryptocurrency_Solana
0,-0.338078,-0.666587,-0.560222,-0.288936,1.173421,1.203785,-1.062343,-0.482004,0.202417,0.87865,...,False,False,False,False,False,False,False,False,False,False
1,-0.337149,0.644844,0.621688,-0.287619,-0.953525,-0.354293,-1.56478,-0.247264,-0.523417,0.012385,...,False,False,False,False,False,True,False,False,False,False
2,-0.337171,-0.76285,-0.704406,-0.287837,-0.87768,0.675788,1.169413,-0.375631,-0.104379,-0.755013,...,False,False,False,False,False,True,False,False,False,False
3,-0.111032,-1.440443,-0.071595,0.013782,-1.787815,-0.270073,0.83056,0.914039,-1.009801,1.10934,...,False,False,False,False,False,False,True,False,False,False
4,-0.330537,0.726105,-0.691817,-0.28726,1.180016,0.79564,-1.091554,0.806867,0.838458,-0.420748,...,False,False,False,False,False,False,False,False,False,True
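
A common alternative is to leave the target unscaled and to fit the scaler on the training rows only, so that statistics from later rows do not leak into training. The sketch below assumes a simple chronological split on hypothetical toy values; the split point and the feature list are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame, already in chronological order
df = pd.DataFrame({
    "trading_volume_24h": [1.0e6, 2.0e6, 3.0e6, 4.0e6, 10.0e6],
    "fear_greed_index": [40.0, 50.0, 60.0, 50.0, 90.0],
    "current_price_usd": [10.0, 11.0, 12.0, 13.0, 14.0],  # target, left unscaled
})

feature_cols = ["trading_volume_24h", "fear_greed_index"]

# Chronological split: the first four rows train, the last row tests
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# Fit on the training rows only, then reuse the same statistics for the test rows
scaler = StandardScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
test[feature_cols] = scaler.transform(test[feature_cols])

print(scaler.mean_)
print(test)
```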


## 5. Finalizing the Feature Set

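After feature engineering, we separate the target, `current_price_usd`, from the feature matrix `X`. At this point `X` still contains the one-hot `timestamp_...` columns and the NaN values that the lag and rolling features introduce in the first rows.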

In [9]:
# Define target and features
X = data.drop(columns=['current_price_usd'])  # Features
y = data['current_price_usd']  # Target variable

# Display the final feature set
X.head()

Unnamed: 0,price_change_24h_percent,trading_volume_24h,market_cap_usd,social_sentiment_score,news_sentiment_score,news_impact_score,social_mentions_count,fear_greed_index,volatility_index,rsi_technical_indicator,...,timestamp_2025-07-04 19:58:28,cryptocurrency_Avalanche,cryptocurrency_Bitcoin,cryptocurrency_Cardano,cryptocurrency_Chainlink,cryptocurrency_Cosmos,cryptocurrency_Ethereum,cryptocurrency_Polkadot,cryptocurrency_Polygon,cryptocurrency_Solana
0,-0.666587,-0.560222,-0.288936,1.173421,1.203785,-1.062343,-0.482004,0.202417,0.87865,-0.880145,...,False,False,False,False,False,False,False,False,False,False
1,0.644844,0.621688,-0.287619,-0.953525,-0.354293,-1.56478,-0.247264,-0.523417,0.012385,0.959349,...,False,False,False,False,False,True,False,False,False,False
2,-0.76285,-0.704406,-0.287837,-0.87768,0.675788,1.169413,-0.375631,-0.104379,-0.755013,-1.204372,...,False,False,False,False,False,True,False,False,False,False
3,-1.440443,-0.071595,0.013782,-1.787815,-0.270073,0.83056,0.914039,-1.009801,1.10934,0.827011,...,False,False,False,False,False,False,True,False,False,False
4,0.726105,-0.691817,-0.28726,1.180016,0.79564,-1.091554,0.806867,0.838458,-0.420748,0.324128,...,False,False,False,False,False,False,False,False,False,True
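
Before modeling, the rows with missing lag values need to be handled, and the encoding and scaling steps can be bundled into a single transformer. The sketch below uses `ColumnTransformer` from scikit-learn on a hypothetical toy frame. The column lists are assumptions, and in practice the transformer would be fit on a training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with per-coin lag features already computed
df = pd.DataFrame({
    "cryptocurrency": ["Cosmos", "Cosmos", "Cosmos", "Solana", "Solana", "Solana"],
    "current_price_usd": [12.0, 11.0, 13.0, 95.0, 96.0, 94.0],
    "price_lag_1": [np.nan, 12.0, 11.0, np.nan, 95.0, 96.0],
    "fear_greed_index": [43.5, 49.1, 50.0, 61.7, 60.0, 58.0],
})

# Lag features leave NaN in the first row of each coin; drop those rows
model_df = df.dropna(subset=["price_lag_1"]).reset_index(drop=True)

X = model_df.drop(columns=["current_price_usd"])  # features
y = model_df["current_price_usd"]                 # target, left unscaled

# Scale numeric columns and one-hot encode the coin name in one step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price_lag_1", "fear_greed_index"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cryptocurrency"]),
])

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```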


## Conclusion

In this notebook, we performed feature engineering on the Cryptocurrency Market Sentiment & Price Data 2025 dataset: we checked for missing values (none were found), created lag and rolling price features, one-hot encoded the categorical variables, and scaled the numerical features. The next steps are model training and evaluation using the engineered features.