## Data Preprocessing and Feature Engineering

In this notebook, we prepare the scraped cryptocurrency dataset for machine learning tasks.  
The preprocessing phase focuses on cleaning the data, handling missing values, removing duplicates, and creating meaningful features that can improve model performance.

The final output of this notebook is a clean, transformed dataset saved as a CSV file, which will be used in the machine learning modeling stage.


## Import Required Libraries

This section imports all necessary Python libraries used for data manipulation, preprocessing, normalization, and imputation.


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

## Load the Dataset

The historical cryptocurrency dataset collected during the web scraping stage is loaded from a CSV file.
This dataset contains weekly (Sunday) snapshots of the top cryptocurrencies.

In [2]:
df = pd.read_csv("historical_crypto_sundays.csv")
df.head()

Unnamed: 0,Date,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24hr),% 1h,% 24h,% 7d
0,2013-04-28,Bitcoin,BTC,1488567000.0,134.21,11091325,,0.64,0.0,0.0
1,2013-04-28,Litecoin,LTC,74637020.0,4.3484,17164230,,0.8,0.0,0.0
2,2013-04-28,Peercoin,PPC,7250187.0,0.3865,18757362,,0.005,0.0,0.0
3,2013-04-28,Namecoin,NMC,5995997.0,1.1072,5415300,,0.005,0.0,0.0
4,2013-04-28,Terracoin,TRC,1503099.0,0.6469,2323570,,0.61,0.0,0.0


## Initial Data Exploration

We inspect the dataset to understand its structure, data types, number of rows and columns, and overall data quality.
This step helps identify missing values, duplicate records, and formatting issues.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                6607 non-null   object 
 1   Name                6607 non-null   object 
 2   Symbol              6607 non-null   object 
 3   Market Cap          6607 non-null   float64
 4   Price               6607 non-null   float64
 5   Circulating Supply  6607 non-null   int64  
 6   Volume (24hr)       6260 non-null   float64
 7   % 1h                6607 non-null   float64
 8   % 24h               6607 non-null   float64
 9   % 7d                6607 non-null   float64
dtypes: float64(6), int64(1), object(3)
memory usage: 516.3+ KB


In [4]:
df.shape

(6607, 10)

## Remove Duplicate Records

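Duplicate records give the affected observations extra weight during training and can leak between training and test splits.
In this step, rows that are identical in every column are removed. The shape stays at (6607, 10), so the dataset contained no exact duplicates.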

In [5]:
df = df.drop_duplicates()
df.shape

(6607, 10)
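
`drop_duplicates()` without arguments only removes rows that match in every column. A stricter check, sketched below on invented rows, looks for repeated (`Date`, `Symbol`) pairs. Treating that pair as the unique key is an assumption about this dataset, not something established in the notebook.

```python
import pandas as pd

# Toy rows invented for illustration; the second BTC row differs only in Price.
toy = pd.DataFrame({
    "Date": ["2013-04-28", "2013-04-28", "2013-04-28"],
    "Symbol": ["BTC", "BTC", "LTC"],
    "Price": [134.21, 134.25, 4.3484],
})

# Full-row comparison: no two rows match in every column.
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: flags the repeated (Date, Symbol) pair.
key_dupes = int(toy.duplicated(subset=["Date", "Symbol"]).sum())

# Keep the first row seen for each key.
deduped = toy.drop_duplicates(subset=["Date", "Symbol"], keep="first")
```

Which row to keep (`first`, `last`, or an aggregate) depends on how the scraper produced the repeats, so that choice is left open here.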

## Missing Value Analysis

Before proceeding with feature engineering, we analyze the dataset to identify missing values in each column.
Special attention is given to numerical columns used for modeling.

In [6]:
# Check for missing data
df.isnull().sum()

Date                    0
Name                    0
Symbol                  0
Market Cap              0
Price                   0
Circulating Supply      0
Volume (24hr)         347
% 1h                    0
% 24h                   0
% 7d                    0
dtype: int64
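
The preview of the first rows shows missing volume for the 2013 snapshots, which suggests the gaps may be concentrated in time rather than spread at random. A minimal sketch of how that could be checked, using invented toy values in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration.
toy = pd.DataFrame({
    "Date": ["2013-04-28", "2013-05-05", "2014-01-05", "2014-01-12"],
    "Volume (24hr)": [np.nan, np.nan, 1.2e6, np.nan],
})

# Count missing volume values per calendar year.
year = pd.to_datetime(toy["Date"]).dt.year
missing_by_year = toy["Volume (24hr)"].isna().groupby(year).sum()

# Overall share of rows with missing volume.
missing_share = toy["Volume (24hr)"].isna().mean()
```

If the missing values cluster in the earliest years, neighbors found by KNN will mostly come from later periods, which is worth keeping in mind when interpreting the imputed volumes.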

## Missing Value Imputation using KNN

To handle missing values, a K-Nearest Neighbors (KNN) imputation approach is used.
Before imputation, numerical features are rescaled to the [0, 1] range with Min-Max scaling so that large-magnitude columns such as `Market Cap` do not dominate the distance calculation used to find neighbors.

Steps involved:
- Normalize numerical features
- Apply KNN imputation
- Reverse normalization to restore original scales

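KNN imputation can give more plausible estimates than mean or median imputation when the missing column is correlated with the other features, although this is not guaranteed and is not measured in this notebook.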

In [7]:
# Select numerical columns for imputation
numeric_cols = ['Market Cap', 'Price', 'Circulating Supply', '% 1h', '% 24h', '% 7d', 'Volume (24hr)']

# Normalization
scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# KNN Imputation
imputer = KNNImputer(n_neighbors=3)
df_imputed = df_normalized.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df_normalized[numeric_cols])

# Reverse normalization
df_imputed[numeric_cols] = scaler.inverse_transform(df_imputed[numeric_cols])

df = df_imputed.copy()
df

Unnamed: 0,Date,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24hr),% 1h,% 24h,% 7d
0,2013-04-28,Bitcoin,BTC,1.488567e+09,134.2100,1.109132e+07,2.073296e+08,0.640,0.00,0.00
1,2013-04-28,Litecoin,LTC,7.463702e+07,4.3484,1.716423e+07,2.086202e+08,0.800,0.00,0.00
2,2013-04-28,Peercoin,PPC,7.250187e+06,0.3865,1.875736e+07,5.939582e+06,0.005,0.00,0.00
3,2013-04-28,Namecoin,NMC,5.995997e+06,1.1072,5.415300e+06,5.939582e+06,0.005,0.00,0.00
4,2013-04-28,Terracoin,TRC,1.503099e+06,0.6469,2.323570e+06,9.445917e+07,0.610,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...
6602,2025-12-21,USDC,USDC,7.709223e+10,0.9998,7.710616e+10,5.054509e+09,0.005,-0.01,-0.02
6603,2025-12-21,Solana,SOL,7.084461e+10,125.9900,5.622967e+08,2.335972e+09,0.190,0.15,-2.70
6604,2025-12-21,TRON,TRX,2.729582e+10,0.2883,9.468736e+10,5.485263e+08,0.030,2.41,3.99
6605,2025-12-21,Dogecoin,DOGE,2.201681e+10,0.1311,1.679931e+11,6.623896e+08,0.440,-0.68,-2.32
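
One way to compare KNN against a simpler baseline is to hide some known values, impute them, and measure the error. The sketch below does this on synthetic data. The helper `masked_imputation_error`, its parameters, and the toy columns are all hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler


def masked_imputation_error(frame, target, n_neighbors=3, frac=0.1, seed=0):
    """Hide a fraction of known `target` values and compare KNN vs. median imputation.

    Hypothetical helper: returns the mean absolute error of each method
    on the hidden values.
    """
    rng = np.random.default_rng(seed)
    known = frame.index[frame[target].notna()].to_numpy()
    n_hidden = max(1, int(len(known) * frac))
    hidden = rng.choice(known, size=n_hidden, replace=False)
    truth = frame.loc[hidden, target]

    masked = frame.copy()
    masked.loc[hidden, target] = np.nan

    # Same pipeline as the notebook: scale -> KNN impute -> inverse scale.
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(masked)
    filled = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)
    restored = pd.DataFrame(
        scaler.inverse_transform(filled),
        index=masked.index,
        columns=masked.columns,
    )
    knn_pred = restored.loc[hidden, target]

    # Baseline: a single median computed from the values that remain visible.
    median_pred = masked[target].median()

    return {
        "knn_mae": float((knn_pred - truth).abs().mean()),
        "median_mae": float((truth - median_pred).abs().mean()),
    }


# Synthetic data: volume is loosely proportional to market cap.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"Market Cap": rng.lognormal(20, 2, 200)})
toy["Volume (24hr)"] = toy["Market Cap"] * rng.uniform(0.01, 0.1, 200)

errors = masked_imputation_error(toy, "Volume (24hr)")
```

On the real dataset, the call would pass `df[numeric_cols]` before imputation. The result depends on the data and on `n_neighbors`, so no outcome is assumed here.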


## Verification After Imputation

After applying KNN imputation, the dataset is checked again to confirm that all missing values have been successfully handled.

In [8]:
df.isnull().sum()

Date                  0
Name                  0
Symbol                0
Market Cap            0
Price                 0
Circulating Supply    0
Volume (24hr)         0
% 1h                  0
% 24h                 0
% 7d                  0
dtype: int64

## Date Conversion and Sorting

The `Date` column is converted to a datetime format to enable time-based analysis.
The dataset is then sorted chronologically to support time-series feature engineering.


In [9]:
# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# Verify the data type conversion
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                6607 non-null   datetime64[ns]
 1   Name                6607 non-null   object        
 2   Symbol              6607 non-null   object        
 3   Market Cap          6607 non-null   float64       
 4   Price               6607 non-null   float64       
 5   Circulating Supply  6607 non-null   float64       
 6   Volume (24hr)       6607 non-null   float64       
 7   % 1h                6607 non-null   float64       
 8   % 24h               6607 non-null   float64       
 9   % 7d                6607 non-null   float64       
dtypes: datetime64[ns](1), float64(7), object(2)
memory usage: 516.3+ KB
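
Once `Date` is a datetime column, the claim that every snapshot falls on a Sunday can be checked directly. A minimal sketch on a few dates; the last one is an invented Monday added to show what a violation looks like:

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2013-04-28", "2013-05-05", "2025-12-21", "2025-12-22"]),
    format="%Y-%m-%d",
)

# pandas encodes the day of week as Monday=0 ... Sunday=6.
is_sunday = dates.dt.dayofweek == 6

# Any rows listed here would break the weekly-snapshot assumption.
non_sunday = dates[~is_sunday]
```

On the real data, the same check would be `df[df['Date'].dt.dayofweek != 6]`, which is expected to be empty if the scraper only collected Sundays.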


In [10]:
# A stable sort keeps the original row order within each date
df = df.sort_values(by='Date', kind='stable')


## Feature Engineering

New features are created to enhance the predictive power of machine learning models:

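- **Market Dominance:** A cryptocurrency’s share of the combined market capitalization of all coins in the same date’s snapshot. Because each snapshot covers only the top cryptocurrencies, this approximates rather than equals the share of the whole market.
- **Price Change (7 Days):** The percentage change in price relative to the same symbol’s previous snapshot. Since the data is collected on Sundays, this measures weekly momentum whenever the symbol appears in consecutive snapshots; the first observation of each symbol has no previous value and is left as `NaN`.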
- **Year and Month:** Extracted from the date to capture seasonal and long-term trends.

These engineered features help models learn market behavior more effectively.
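
As a small worked example of the dominance calculation, the sketch below uses two invented snapshots with two coins each. Within each date the shares should add up to 1.

```python
import pandas as pd

# Toy market caps invented for illustration.
toy = pd.DataFrame({
    "Date": ["2013-04-28"] * 2 + ["2013-05-05"] * 2,
    "Symbol": ["BTC", "LTC", "BTC", "LTC"],
    "Market Cap": [900.0, 100.0, 750.0, 250.0],
})

# transform('sum') broadcasts each date's total back onto its rows.
total_per_date = toy.groupby("Date")["Market Cap"].transform("sum")
toy["Market_Dominance"] = toy["Market Cap"] / total_per_date

# Sanity check: shares within one date sum to 1.
dominance_sums = toy.groupby("Date")["Market_Dominance"].sum()
```

Here BTC holds 0.90 of the first snapshot and 0.75 of the second, with LTC taking the remainder.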

In [11]:
# Market dominance proxy
df['Market_Dominance'] = df['Market Cap'] / df.groupby('Date')['Market Cap'].transform('sum')

# Weekly momentum feature (price change compared to the symbol's previous snapshot)
df['Price_Change_7d'] = df.groupby('Symbol')['Price'].pct_change()

# Extracting date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

df.head()


Unnamed: 0,Date,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24hr),% 1h,% 24h,% 7d,Market_Dominance,Price_Change_7d,Year,Month
0,2013-04-28,Bitcoin,BTC,1488567000.0,134.21,11091325.0,207329600.0,0.64,0.0,0.0,0.941809,,2013,4
1,2013-04-28,Litecoin,LTC,74637020.0,4.3484,17164230.0,208620200.0,0.8,0.0,0.0,0.047222,,2013,4
2,2013-04-28,Peercoin,PPC,7250187.0,0.3865,18757362.0,5939582.0,0.005,0.0,0.0,0.004587,,2013,4
3,2013-04-28,Namecoin,NMC,5995997.0,1.1072,5415300.0,5939582.0,0.005,0.0,0.0,0.003794,,2013,4
4,2013-04-28,Terracoin,TRC,1503099.0,0.6469,2323570.0,94459170.0,0.61,0.0,0.0,0.000951,,2013,4
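
`pct_change()` compares each row with the previous row for the same symbol, whatever the gap between them. If a coin drops out of the snapshots for a few weeks, the resulting value spans more than seven days. A hedged sketch of a gap-aware variant, using a hypothetical symbol `ABC` with one missing week:

```python
import pandas as pd

# Toy prices invented for illustration; 2020-01-19 is deliberately absent.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-26"]),
    "Symbol": ["ABC"] * 3,
    "Price": [10.0, 11.0, 22.0],
}).sort_values(["Symbol", "Date"])

by_symbol = toy.groupby("Symbol")

# Change relative to the previous available row for the symbol.
raw_change = by_symbol["Price"].pct_change()

# Number of days since that previous row.
days_since_prev = by_symbol["Date"].diff().dt.days

# Keep the change only when the previous row is exactly one week earlier.
toy["Price_Change_7d"] = raw_change.where(days_since_prev == 7)
```

Whether to blank out multi-week changes or keep them is a modeling decision; this variant simply makes the feature match its name.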


## Save Processed Dataset

The final cleaned and feature-engineered dataset is saved as a CSV file.
This file will be used as input for the machine learning modeling notebook.


In [12]:
df.to_csv("processed_crypto_data.csv", index=False)
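
CSV files do not store data types, so the `Date` column comes back as text unless it is parsed again on load. A minimal round-trip sketch, using an in-memory buffer and invented values instead of the real file:

```python
import io

import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2013-04-28", "2013-05-05"]),
    "Symbol": ["BTC", "BTC"],
    "Price": [100.0, 110.0],
})

# Write to an in-memory buffer instead of disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: Date is loaded as text.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Read with parse_dates: Date is restored as datetime.
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["Date"])
```

In the modeling notebook, the equivalent call would be `pd.read_csv("processed_crypto_data.csv", parse_dates=["Date"])`.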


## Summary

In this notebook, the following preprocessing and feature engineering steps were completed:

- Loaded the historical cryptocurrency dataset
- Removed duplicate records
- Handled missing values using KNN imputation
- Temporarily normalized numerical features for KNN imputation, then restored their original scales
- Converted and sorted date values
- Created meaningful market-based and time-based features
- Saved the final processed dataset for machine learning

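The dataset is now ready for model training and evaluation in the next stage. Note that `Price_Change_7d` is `NaN` for the first observation of each symbol, so the modeling notebook needs to drop or fill those rows.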