<a href="https://colab.research.google.com/github/KRamBalaji/Ether_Intraday_Prediction/blob/main/ML_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project -

## **1. Define Problem Statement**

**Problem Statement:**

> To predict IntraDay Ether (ETH) prices using machine learning, enabling informed trading decisions and potential profit maximization.

### 1.1. Industry & Problem Type



* **Industry:** Cryptocurrency Trading/Financial Markets
* **Problem Type:** This is a **supervised regression** problem: we use historical Ether price data (labeled data) to train a model that predicts future prices. Because the observations are ordered in time, **time series analysis** techniques also apply.

### 1.2. Business Objective



* **Why this problem?** Accurate prediction of Ether prices can provide a significant advantage in trading, allowing for better timing of buy and sell orders.
* **Desired Outcome:** Develop a model capable of predicting IntraDay Ether prices with a certain level of accuracy (to be defined in the evaluation metrics section) to assist in trading decisions and potentially increase profitability.

### 1.3. Constraints & Limitations

* **Data Availability:** While historical Ether price data is generally available, obtaining high-quality, granular IntraDay data might require access to specific APIs or data providers.
* **Computational Power:** Training complex machine learning models, especially with large datasets, may require significant computational resources. Depending on the chosen model and data size, we might need to consider cloud-based solutions like Google Colab for training.
* **Obstacle:** The cryptocurrency market is highly volatile and influenced by various external factors. Achieving consistently accurate predictions can be challenging.
* **Budget:** Depending on the data sources and computational resources required, there might be associated costs.

### 1.4. Evaluation Metrics

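* **Optimization Required:** Since the target is a continuous price, the model should be optimized for low prediction error. If the predictions are later converted into buy/sell signals, classification metrics such as precision and the rate of false positives/negatives also become relevant.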
* **KPIs Tracking:** Key performance indicators (KPIs) could include:
 * **Mean Absolute Error (MAE):** Measures the average absolute difference between predicted and actual prices.
 * **Root Mean Squared Error (RMSE):** Gives a higher weight to large errors.
 * **R-squared:** Indicates the proportion of variance in the dependent variable (price) explained by the model.
 * **Profitability/Returns:** Simulating trading strategies based on model predictions to assess potential gains.

* **Required Testing:** We'll need to rigorously test the model on unseen data (a holdout set or through cross-validation) to ensure its generalization ability and avoid overfitting.
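
The error metrics listed above can be computed directly from arrays of actual and predicted prices. The following is a minimal sketch using NumPy; the function name `regression_kpis` and the toy numbers are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def regression_kpis(y_true, y_pred):
    """Return MAE, RMSE and R-squared for a set of price predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    mae = np.mean(np.abs(errors))            # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # penalises large errors more
    ss_res = np.sum(errors ** 2)             # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Toy values for illustration only (hypothetical, not from the dataset)
actual = np.array([300.0, 302.0, 301.0, 305.0])
predicted = np.array([301.0, 301.0, 303.0, 304.0])
kpis = regression_kpis(actual, predicted)
print(kpis)
```

MAE and RMSE are expressed in the same unit as the price, which makes them easier to relate to trading costs than R-squared.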

### 1.5. Target Audience Relevancy

* **Model Prediction Usage:** The primary target audience would be cryptocurrency traders or investors.
* **Speed of Prediction:** For IntraDay trading, prediction speed is crucial. The model should be able to generate predictions quickly to enable timely trading decisions.

### 1.6. Data Availability

* **Ease of Data Collection:** As mentioned earlier, historical Ether price data is readily available, but obtaining high-quality, granular IntraDay data might require some effort.
* **Necessary Features:** Potential features could include:
 * Historical price data (Open, High, Low, Close)
 * Trading volume
 * Market sentiment (derived from social media or news articles)
 * Technical indicators (e.g., moving averages, RSI)
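
As an illustration of the technical indicators mentioned above, the sketch below derives a simple moving average and an RSI from a `Close` column with pandas. The function name `add_indicators`, the 14-bar window, and the synthetic price series are assumptions made for the example. The RSI here uses plain rolling averages of gains and losses, which is a simplified variant (Wilder's original formulation uses exponential smoothing).

```python
import numpy as np
import pandas as pd

def add_indicators(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Add a simple moving average and a rolling-mean RSI based on 'Close'."""
    out = prices.copy()

    # Simple moving average over the last `window` bars
    out["sma"] = out["Close"].rolling(window).mean()

    # RSI: ratio of average gains to average losses, mapped to 0..100
    delta = out["Close"].diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = gain / loss
    out["rsi"] = 100 - 100 / (1 + rs)
    return out

# Synthetic random-walk prices, for demonstration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Close": 300 + rng.normal(0, 1, 200).cumsum()})
enriched = add_indicators(demo)
print(enriched.tail())
```

The first rows of each indicator are `NaN` because a full window of history is not yet available; those rows are usually dropped before training.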

### 1.7. Scope of the Solution

* **Capabilities:** The solution aims to provide:
 * Accurate IntraDay Ether price predictions.
 * Potential trading signals (buy/sell) based on predictions.
 * Visualization of predicted prices and historical data.
* **Expectations:** It's important to note that this solution is intended to be a tool to assist in trading decisions, not a guaranteed money-making machine. The cryptocurrency market is inherently unpredictable, and no model can perfectly predict future prices.

### 1.8. Deployment Considerations

* **Platform:** Google Colab could be used for development and initial deployment.
* **Integration:**
 * For wider accessibility, we could consider deploying the model as a web application or integrating it with a trading platform.
 * An API could be developed to provide predictions to other applications or systems.
* **IntraDay Prices:** We'll need to ensure the data pipeline can continuously fetch and update IntraDay Ether prices to keep the model's predictions relevant.

## **2. Data Collection**

### 2.1. Source Identification

* **Reliable Sources:** For this project, we can consider the following source:
 * **Kaggle:** Kaggle is an online community platform for data scientists and machine learning enthusiasts. It's owned by Google and serves several key purposes:
   * **Datasets:** Kaggle provides a vast repository of publicly available datasets that users can explore, download, and use for their projects. You can find datasets on various topics, from image recognition to natural language processing.
   * **Notebooks:** Kaggle offers cloud-based Jupyter Notebooks, allowing users to write and execute code directly on the platform. These notebooks are integrated with GPUs and TPUs for accelerated computation.
   * **Community:** Kaggle fosters a vibrant community of data scientists and machine learning practitioners. Users can share their work, collaborate on projects, ask questions, and learn from each other through forums and discussions.

### 2.2. Data Volume Required

* **Sufficient Data:** The amount of data required depends on the complexity of the model and the desired accuracy. Generally, more data is better for training machine learning models, especially for time series analysis. For this project, aiming for at least a few years of historical IntraDay data (hourly or even more granular) would be ideal.

### 2.3. Data Types

* **Labeled Data:** We'll be primarily working with labeled data, as historical Ether price data comes with timestamps and corresponding prices. These prices will serve as our target variable for training the model.

### 2.4. Data Quality

* **Data Cleaning:** Addressing data quality issues is crucial. We need to:
 * **Handle Missing Values:** Decide on a strategy for dealing with missing data points, such as imputation or removal.
 * **Outlier Detection and Treatment:** Identify and handle outliers that could skew the model's training.
 * **Data Consistency:** Ensure data is in a consistent format and units across all sources.
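
The cleaning steps above could look roughly like the following for 15-minute candles. This is a sketch under assumptions: the helper name `clean_candles` is hypothetical, missing bars are filled by carrying the previous prices forward, and their volume is set to zero. Other imputation choices are equally possible. The sample rows are taken from the first rows of the dataset shown later, with one bar removed and one duplicated on purpose.

```python
import pandas as pd

def clean_candles(raw: pd.DataFrame, freq: str = "15min") -> pd.DataFrame:
    """Parse timestamps, drop duplicate bars and fill missing bars."""
    df = raw.copy()
    df["date"] = pd.to_datetime(df["date"])

    # Consistent ordering and one row per timestamp
    df = df.drop_duplicates(subset="date").sort_values("date").set_index("date")

    # Re-index onto a complete grid so missing bars become explicit NaN rows
    full_index = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    df = df.reindex(full_index)

    # Assumption: carry the last known prices forward, no trading in the gap
    price_cols = ["Open", "High", "Low", "Close"]
    df[price_cols] = df[price_cols].ffill()
    df["Volume"] = df["Volume"].fillna(0.0)
    return df

raw = pd.DataFrame({
    "date": ["2017-08-17 04:00:00", "2017-08-17 04:15:00", "2017-08-17 04:15:00",
             "2017-08-17 04:45:00", "2017-08-17 05:00:00"],
    "Open": [301.13, 298.0, 298.0, 299.6, 301.61],
    "High": [301.13, 300.8, 300.8, 302.57, 302.57],
    "Low": [298.0, 298.0, 298.0, 299.6, 300.95],
    "Close": [298.0, 299.39, 299.39, 301.61, 302.01],
    "Volume": [5.80167, 31.44065, 31.44065, 35.49066, 81.69235],
})
cleaned = clean_candles(raw)
print(cleaned)
```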

### 2.5. Data Relevancy

* **Feature Selection:** We'll need to carefully select features that are relevant to Ether price prediction. This might involve:
 * **Historical Price Data:** Open, High, Low, Close prices.
 * **Trading Volume:** Indicator of market activity.
 * **Technical Indicators:** Moving averages, RSI, MACD, etc.
 * **Market Sentiment:** Data derived from social media, news articles, or dedicated sentiment analysis APIs.

### 2.6. Temporal Considerations

* **Time Series Nature:** Ether price data is inherently temporal, and we need to account for this:
 * **Seasonality:** Analyze for any seasonal patterns in price movements.
 * **Time-Based Features:** Consider incorporating time-based features like day of the week, hour of the day, etc., as potential predictors.
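
Time-based features such as the ones listed above can be derived from the timestamp column with the pandas `.dt` accessor. The helper name `add_time_features` and the two sample timestamps are assumptions made for this sketch.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Add hour-of-day, day-of-week and weekend flag columns."""
    out = df.copy()
    ts = pd.to_datetime(out[date_col])
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek      # Monday = 0, Sunday = 6
    out["is_weekend"] = ts.dt.dayofweek >= 5  # crypto trades on weekends too
    return out

sample = pd.DataFrame({"date": ["2017-08-17 04:00:00", "2017-08-19 23:45:00"]})
features = add_time_features(sample)
print(features)
```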

### 2.7. Legal and Ethical Concerns

* **Data Usage Rights:** Ensure you have the right to use the data from the chosen sources. Check API terms of service and website scraping policies.
* **Privacy:** If incorporating personal or sensitive data (e.g., user data), ensure compliance with privacy regulations and data protection practices.

### 2.8. Sampling Strategy

* **Sampling:** While having more data is generally better, sampling can be useful for initial model development or when dealing with massive datasets. You can consider techniques like random sampling or stratified sampling if appropriate.

### 2.9. Data Privacy

* **Data Anonymization:** If using sensitive data, consider anonymization techniques to protect privacy. However, in this case, we'll primarily be working with publicly available market data, so this might not be a major concern.

### 2.10. Data Collection Tools

* **APIs:** Use libraries like requests in Python to interact with exchange APIs and fetch data.
* **Web Scraping:** Libraries like Beautiful Soup and Scrapy can be used for web scraping. However, exercise caution and respect website robots.txt rules.
* **Data Handling:** Use libraries like pandas for data manipulation and storage.

### 2.11. Data Versioning

* **Version Control:** Implement version control using Git to track changes to your dataset and code. This is essential for reproducibility and collaboration.

### 2.12. Continuous Data Collection

* **Data Pipelines:** Design a data pipeline to regularly fetch and update your dataset with fresh IntraDay Ether prices. This could involve scheduling scripts to run at specific intervals or using real-time data streams if available.

## **3. Data Preprocessing**

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import math
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

In [2]:
# Importing the dataset
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/ether_intraday_prices.csv')

Mounted at /content/drive


In [3]:
# First look at the dataset
df.head()

Unnamed: 0,date,Open,High,Low,Close,Volume
0,2017-08-17 04:00:00,301.13,301.13,298.0,298.0,5.80167
1,2017-08-17 04:15:00,298.0,300.8,298.0,299.39,31.44065
2,2017-08-17 04:30:00,299.39,300.79,299.39,299.6,52.93579
3,2017-08-17 04:45:00,299.6,302.57,299.6,301.61,35.49066
4,2017-08-17 05:00:00,301.61,302.57,300.95,302.01,81.69235


### 3.1. Handling Missing Values

In [4]:
# Missing Values/Null Values Count
print(df.isnull().sum())

date      0
Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64
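
A zero null count only shows that the rows present are complete; it does not show whether any 15-minute bars are absent altogether. A possible additional check, sketched here with a hypothetical helper `find_time_gaps` and made-up timestamps, is to look for steps between consecutive timestamps that are longer than the expected interval.

```python
import pandas as pd

def find_time_gaps(dates, expected: str = "15min") -> pd.DataFrame:
    """Return consecutive timestamp pairs that are further apart than `expected`."""
    ts = pd.to_datetime(pd.Series(dates)).sort_values().reset_index(drop=True)
    gaps = pd.DataFrame({
        "previous": ts.shift(1),
        "current": ts,
        "step": ts.diff(),
    })
    return gaps[gaps["step"] > pd.Timedelta(expected)].reset_index(drop=True)

# Toy timestamps with the 04:30 and 04:45 bars missing
gaps = find_time_gaps([
    "2017-08-17 04:00:00",
    "2017-08-17 04:15:00",
    "2017-08-17 05:00:00",
])
print(gaps)
```

On the real data this could be called as `find_time_gaps(df['date'])`.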


### 3.2. Handling Outliers

**Using Standard Deviation (Symmetric Distribution)**

In [5]:
import numpy as np
import pandas as pd

def find_outliers_sd(data, threshold=3):
  """Return the values lying more than `threshold` standard deviations from the mean."""

  # Calculate the mean and standard deviation
  data_mean = np.mean(data)
  data_std = np.std(data)

  # Identify outliers
  outliers = data[(data < data_mean - threshold * data_std) |
                  (data > data_mean + threshold * data_std)]

  return outliers

# Apply the rule to the 'Close' column of df
outliers = find_outliers_sd(df['Close'])

# Display the outliers
print(outliers)

147243    4653.36
147649    4642.45
147650    4684.19
147651    4688.29
147652    4689.38
           ...   
149931    4648.60
149933    4655.00
149934    4656.21
149935    4660.58
149936    4644.00
Name: Close, Length: 644, dtype: float64


**Using IQR (Skewed Distribution)**

In [6]:
import numpy as np
import pandas as pd

def find_outliers_iqr(data):
  # Calculate quantiles
  q1 = np.quantile(data, 0.25)
  q3 = np.quantile(data, 0.75)

  # Calculate IQR
  iqr = q3 - q1

  # Define upper and lower bounds
  upper_bound = q3 + 1.5 * iqr
  lower_bound = q1 - 1.5 * iqr

  # Identify outliers
  outliers = data[(data < lower_bound) | (data > upper_bound)]

  return outliers

# Apply the rule to the 'Close' column of df
outliers = find_outliers_iqr(df['Close'])

# Display the outliers
print(outliers)

130132    3979.19
130213    3974.73
130214    4009.39
130215    4020.00
130216    4062.43
           ...   
152454    3980.98
152455    3982.34
152456    3980.95
152457    3981.37
152458    3973.78
Name: Close, Length: 6141, dtype: float64
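
The rows flagged by both rules sit in contiguous index ranges at the highest price levels, which suggests that the thresholds are reacting to the long-term price level rather than to isolated anomalies. One alternative, shown here as an assumption and not as the approach used above, is to apply the standard-deviation rule to bar-to-bar log returns instead of raw prices. The helper name `return_outliers` and the synthetic series with an injected spike are illustrative.

```python
import numpy as np
import pandas as pd

def return_outliers(close: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag log returns more than `threshold` standard deviations from their mean."""
    returns = np.log(close).diff().dropna()
    z = (returns - returns.mean()) / returns.std()
    return returns[z.abs() > threshold]

# Synthetic prices: small random moves plus one injected jump at position 500
rng = np.random.default_rng(42)
log_returns = rng.normal(0.0, 0.002, size=1000)
log_returns[500] = 0.05
close = pd.Series(300.0 * np.exp(np.cumsum(log_returns)))

flagged = return_outliers(close)
print(flagged)
```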


**Using Outlier-Insensitive Algorithms, i.e. SVM, KNN, Decision Tree**

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR  # For SVM
from sklearn.neighbors import KNeighborsRegressor  # For KNN
from sklearn.tree import DecisionTreeRegressor  # For Decision Tree
from sklearn.metrics import mean_squared_error, r2_score

X = df[['Open', 'High', 'Low', 'Volume']]  # Features
y = df['Close']  # Target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Support Vector Regression (SVR)
svr_model = SVR(kernel='rbf')  # You can explore other kernels like 'linear', 'poly'
svr_model.fit(X_train, y_train)
svr_predictions = svr_model.predict(X_test)

# 2. K-Nearest Neighbors (KNN)
knn_model = KNeighborsRegressor(n_neighbors=5)  # Experiment with different 'n_neighbors'
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)

# 3. Decision Tree Regression
dt_model = DecisionTreeRegressor(random_state=42)  # You can adjust hyperparameters
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)

# Evaluate models
def evaluate_model(predictions, model_name):
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    print(f"{model_name}:")
    print(f"  Mean Squared Error (MSE): {mse}")
    print(f"  R-squared (R2): {r2}")

evaluate_model(svr_predictions, "Support Vector Regression (SVR)")
evaluate_model(knn_predictions, "K-Nearest Neighbors (KNN)")
evaluate_model(dt_predictions, "Decision Tree Regression")

Support Vector Regression (SVR):
  Mean Squared Error (MSE): 29512.387855455163
  R-squared (R2): 0.979000938527319
K-Nearest Neighbors (KNN):
  Mean Squared Error (MSE): 463.20530405259603
  R-squared (R2): 0.9996704137699087
Decision Tree Regression:
  Mean Squared Error (MSE): 31.58908164538165
  R-squared (R2): 0.9999775233007039
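
`train_test_split` shuffles rows by default, so with time-ordered data the test rows end up interleaved with the training period. One common alternative, sketched here as an assumption rather than as the method used above, is to pass `shuffle=False` so that the most recent 20% of rows form the test set. The demo frame below is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, time-ordered demo data (row order stands in for time)
X_demo = pd.DataFrame({"Open": np.arange(100.0)})
y_demo = pd.Series(np.arange(100.0) + 1.0, name="Close")

# shuffle=False keeps the original order: first 80% train, last 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print("train rows:", X_tr.index.min(), "to", X_tr.index.max())
print("test rows: ", X_te.index.min(), "to", X_te.index.max())
```

With this split every test row is later than every training row, which is closer to how the model would be used in live trading.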


### 3.3. Categorical Encoding

### 3.4. Data Transformation

**Standardisation**

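Standardisation rescales each feature to zero mean and unit variance. It works best when the data is roughly normally distributed, with most values clustered around the mean and outliers lying further away, because the mean and standard deviation are themselves sensitive to extreme values.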

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Standardize the features 'Open', 'High', 'Low', and 'Volume' of df
features_to_standardize = ['Open', 'High', 'Low', 'Volume']
X = df[features_to_standardize]
y = df['Close']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the fitted scaler
X_test_scaled = scaler.transform(X_test)

# Now, X_train_scaled and X_test_scaled contain the standardized features
# You can use these scaled features for training your models

**Normalisation**

This method scales your data to a specific range, typically between 0 and 1, which can be beneficial for certain machine learning algorithms.

In [21]:
from sklearn.preprocessing import MinMaxScaler

def normalize_data(data):

  # Create a MinMaxScaler object
  scaler = MinMaxScaler()

  # Fit the scaler to the data and transform it
  normalized_data = scaler.fit_transform(data)

  # If input was a pandas DataFrame, convert the output back to a DataFrame
  if isinstance(data, pd.DataFrame):
    normalized_data = pd.DataFrame(normalized_data, columns=data.columns, index=data.index)

  return normalized_data


columns_to_normalize = ['Open', 'High', 'Low', 'Volume']
# Note: this overwrites the original columns in df with their normalised values
df[columns_to_normalize] = normalize_data(df[columns_to_normalize])

# To see the output, run the code.
# print(df.head())

**Robust Scaler**

The Robust Scaler is a data transformation technique that is particularly useful when your data contains outliers. Unlike standard scaling (using mean and standard deviation), it is less sensitive to extreme values.

**How it Works:**

The Robust Scaler removes the median and scales the data according to the quantile range (IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile) of the data. This makes it robust to outliers, as the scaling is based on the more stable IQR rather than the potentially skewed mean and standard deviation.
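
A small worked example may make the difference concrete. The five values below are made up, with one deliberate outlier, and the calculation is written out in NumPy rather than through `RobustScaler` so that each step is visible.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is a deliberate outlier

# Standard scaling: the outlier inflates both the mean and the standard deviation
standard = (values - values.mean()) / values.std()

# Robust scaling: median and IQR are barely affected by the outlier
median = np.median(values)
q1, q3 = np.quantile(values, [0.25, 0.75])
robust = (values - median) / (q3 - q1)

print("standard:", np.round(standard, 3))
print("robust:  ", np.round(robust, 3))
```

Under standard scaling the four ordinary values are squeezed into a narrow band, whereas robust scaling keeps them spread out and leaves the outlier clearly separated.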


In [22]:
from sklearn.preprocessing import RobustScaler

def robust_scale_data(data):

  # Create a RobustScaler object
  scaler = RobustScaler()

  # Fit the scaler to the data and transform it
  scaled_data = scaler.fit_transform(data)

  # If input was a pandas DataFrame, convert the output back to a DataFrame
  if isinstance(data, pd.DataFrame):
    scaled_data = pd.DataFrame(scaled_data, columns=data.columns, index=data.index)

  return scaled_data


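columns_to_scale = ['Open', 'High', 'Low', 'Volume']
# Note: these columns were already normalised in the previous cell,
# so this rescales the normalised values rather than the raw prices
df[columns_to_scale] = robust_scale_data(df[columns_to_scale])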

# To see the output, run the code
# print(df.head())

**(Observation - Median) / IQR**

* This method is a form of robust scaling that aims to center and scale your data using the median and the Interquartile Range (IQR), making it less sensitive to outliers.



**Formula:**



> `X_scaled = (X - X_median) / IQR`



In [23]:
def robust_scale_data_custom(data):

    data_median = np.median(data)
    q1 = np.quantile(data, 0.25)
    q3 = np.quantile(data, 0.75)
    iqr = q3 - q1

    scaled_data = (data - data_median) / iqr

    return scaled_data

# Scale the 'Close' column and store the result in a new column
df['Close_robust_scaled'] = robust_scale_data_custom(df['Close'])

**Box-Cox Transformation**

The Box-Cox transformation is a technique for transforming a non-normal dependent variable into a more normal shape. It requires strictly positive values, which prices satisfy.

**Purpose:**

The Box-Cox transformation aims to stabilize variance and make the data more closely resemble a normal distribution. This is often desirable because many statistical methods assume normality for optimal performance.

**Formula:**

The Box-Cox transformation is defined by the following formula:



```
T(y) = (y^λ - 1) / λ   if λ != 0
T(y) = ln(y)          if λ = 0
```
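
The formula can be checked against SciPy by implementing it by hand for a fixed λ. In the sketch below, `boxcox_manual` is a hypothetical helper, the prices are the first five `Close` values from the dataset preview, and the two λ values are arbitrary choices for the comparison.

```python
import numpy as np
from scipy import stats

def boxcox_manual(y, lam):
    """Box-Cox transform for a given lambda, following the formula above."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

prices = np.array([298.0, 299.39, 299.6, 301.61, 302.01])

for lam in (0.0, 0.5):
    manual = boxcox_manual(prices, lam)
    reference = stats.boxcox(prices, lmbda=lam)  # fixed lambda: returns the array only
    print(lam, np.allclose(manual, reference))
```

When `lmbda` is omitted, `stats.boxcox` also estimates the λ that maximises the log-likelihood and returns it alongside the transformed data.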



In [24]:
from scipy import stats

# Perform Box-Cox transformation
transformed_data, lambda_value = stats.boxcox(df['Close'])

# Add the transformed data to the DataFrame
df['transformed_dependent_variable'] = transformed_data

# To see the output, run the code.
# print(df.head())
# print(f"Lambda value: {lambda_value}")

**Gaussian Transformation**

In [25]:

from sklearn.preprocessing import QuantileTransformer

column_to_transform = 'Close'

# Create a QuantileTransformer object with 'normal' output distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)

# Fit and transform the data (the transformer takes a 2-D input and returns a 2-D array)
df[column_to_transform + '_gaussian'] = qt.fit_transform(df[[column_to_transform]]).ravel()

**Logarithmic Transformation**

In [26]:
column_to_transform = 'Close'

# Apply logarithmic transformation using NumPy's log function.
# A small constant is added so that a zero price would not produce log(0) = -inf.
df[column_to_transform + '_log'] = np.log(df[column_to_transform] + 1e-6)

**Inverse Transformation**

In [27]:
# Undo the log transformation (recovers Close plus the 1e-6 offset)
df['Close_log_inverse'] = np.exp(df['Close_log'])

**Square Root Transformation**

In [28]:
df['Close_sqrt'] = np.sqrt(df['Close'])

**Exponential Transformation**

In [29]:
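from sklearn.preprocessing import StandardScaler

# Standardise the 'Close' column first so that exp() does not overflow on raw prices.
# A dedicated scaler is used so the scaler fitted on X_train earlier is not refitted.
close_scaler = StandardScaler()
df['Close_scaled'] = close_scaler.fit_transform(df[['Close']]).ravel()

# Apply exponential transformation to the scaled column
df['Close_scaled_exp'] = np.exp(df['Close_scaled'])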

### 3.5. Handling Imbalanced Dataset

**Understanding Imbalanced Datasets**

An imbalanced dataset occurs when one class (or category) has significantly more instances than another. In your case, if you're trying to predict price movements (e.g., up or down), you might find that one direction is much more common than the other in your historical data. This imbalance can lead to biased models that perform poorly on the minority class.
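
To see whether such an imbalance exists, the price series first has to be turned into class labels. The sketch below assumes a simple up/down label based on the next bar's close; the helper name `direction_labels` and the toy prices are illustrative.

```python
import pandas as pd

def direction_labels(close: pd.Series) -> pd.Series:
    """Label each bar 1 if the next bar closes higher, else 0."""
    future = close.shift(-1)
    labels = (future > close).astype(int)
    return labels.iloc[:-1]  # the last bar has no next bar to compare with

# Toy closing prices, for demonstration only
close = pd.Series([300.0, 301.0, 300.5, 302.0, 303.0, 302.5])
labels = direction_labels(close)

# Share of each class; values far from 0.5 indicate imbalance
balance = labels.value_counts(normalize=True)
print(balance)
```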

**Cross-Validation**

In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression  # Example model
from sklearn.metrics import mean_squared_error  # Example metric

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Create your model
model = LinearRegression()  # Example: using Linear Regression

# 3. Perform cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

# 4. Evaluate the results
# Convert negative MSE scores to positive
mse_scores = -scores
print("Cross-validation MSE scores:", mse_scores)
print("Average MSE:", mse_scores.mean())

# 5. Train the final model on the entire training set
model.fit(X_train, y_train)

# 6. Evaluate the final model on the test set
y_pred = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred)
print("Test MSE:", mse_test)

Cross-validation MSE scores: [12.82658177 14.84659755 13.80088738 12.46887154 13.66572614]
Average MSE: 13.521732876544888
Test MSE: 13.499787319128627
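
With `cv=5`, `cross_val_score` uses ordinary K-fold splits, so some folds are validated on data that is older than part of their training data. For time-ordered prices, scikit-learn's `TimeSeriesSplit` is one alternative in which each fold trains on the past and validates on the period that follows. The sketch below runs on a synthetic random walk and is an assumption about how this could be set up, not the configuration used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic random-walk prices: predict the next close from the current one
rng = np.random.default_rng(0)
close = 300 + rng.normal(0, 1, 500).cumsum()
X_demo = close[:-1].reshape(-1, 1)
y_demo = close[1:]

# Each split trains on an expanding window and validates on the next block
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=tscv, scoring="neg_mean_squared_error"
)
mse_scores = -scores  # scikit-learn reports negated MSE

print("Time-series CV MSE scores:", mse_scores)
print("Average MSE:", mse_scores.mean())
```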


**Under Sampling**

In [39]:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

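# Discretize the target variable 'y' into classes so that a class-based sampler can be applied.
# Note: qcut creates (nearly) equal-sized bins, so these classes are already roughly balanced.
num_classes = 3  # Choose the desired number of classes
y_classes = pd.qcut(y, q=num_classes, labels=False)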

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

# 2. Create a RandomUnderSampler object
undersampler = RandomUnderSampler(random_state=42)  # You can adjust the random_state

# 3. Apply undersampling to the training data
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)

# 4. Now you can train your model using the resampled data
# ... (e.g., model.fit(X_train_resampled, y_train_resampled))

**Over Sampling**

In [38]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Discretize the target variable 'y' (e.g., using qcut)
num_classes = 3  # Choose the desired number of classes
y_classes = pd.qcut(y, q=num_classes, labels=False)

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

# 2. Create a RandomOverSampler object (duplicates minority-class rows at random;
#    SMOTE, which synthesises new rows instead, is covered in the next section)
oversampler = RandomOverSampler(random_state=42)

# 3. Apply oversampling to the training data
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)

# 4. Now you can train your model using the resampled data
# ... (e.g., model.fit(X_train_resampled, y_train_resampled))

**Synthetic Minority over Sampling Technique (SMOTE)**

In [37]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

num_classes = 3  # Choose the desired number of classes
y_classes = pd.qcut(y, q=num_classes, labels=False)

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

# 2. Create a SMOTE object
smote = SMOTE(random_state=42)  # You can adjust the random_state and other parameters

# 3. Apply SMOTE to the training data
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# 4. Now you can train your model using the resampled data
# ... (e.g., model.fit(X_train_resampled, y_train_resampled))

**Tree-Based Algorithm**

In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Create a Decision Tree Regressor object
tree_regressor = DecisionTreeRegressor(random_state=42)  # You can adjust hyperparameters

# 3. Train the model
tree_regressor.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = tree_regressor.predict(X_test)

# 5. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 31.58908164538165
R-squared: 0.9999775233007039


**Overall Interpretation:**

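The combination of a relatively low MSE (31.59) and a very high R-squared (0.999977) shows that the Decision Tree Regressor reproduces `Close` almost exactly on the test split. Note, however, that the features are the Open, High, Low and Volume of the same bar, and `Close` always lies between `Low` and `High`, so this score measures how well the model reconstructs the close of a known bar rather than how well it forecasts a future price.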
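
The features used above come from the same bar as the target. A forecasting setup would instead predict the next bar's close from information available at the current bar, and compare the result against a naive "no change" baseline. The following sketch shows one way to do this on a synthetic random walk; the column names `lag_1` and `target`, the tree depth, and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic random-walk closing prices, for demonstration only
rng = np.random.default_rng(1)
n = 2000
close = pd.Series(300 + rng.normal(0, 1, n).cumsum(), name="Close")

frame = pd.DataFrame({"Close": close})
frame["lag_1"] = frame["Close"].shift(1)     # previous bar's close
frame["target"] = frame["Close"].shift(-1)   # next bar's close (what we forecast)
frame = frame.dropna()

# Chronological split: train on the first 80%, test on the last 20%
split = int(len(frame) * 0.8)
train, test = frame.iloc[:split], frame.iloc[split:]
feature_cols = ["Close", "lag_1"]

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(train[feature_cols], train["target"])

model_mse = mean_squared_error(test["target"], model.predict(test[feature_cols]))
naive_mse = mean_squared_error(test["target"], test["Close"])  # "no change" baseline

print("Model MSE:", model_mse)
print("Naive baseline MSE:", naive_mse)
```

A model is only useful for trading if it beats the naive baseline on data that lies after the training period.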

###  3.6. Data Reduction

**Dimensionality Reduction**

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Standardize the features (important for PCA)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# 4. Now you can train your model using the reduced features
# ... (e.g., model.fit(X_train_pca, y_train))
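
After fitting, `pca.n_components_` and `pca.explained_variance_ratio_` show how many components were kept and how much variance each one carries. The sketch below uses synthetic stand-ins for Open, High, Low and Volume (three highly correlated price columns and one independent volume column), so the exact numbers are not representative of the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features: three near-identical price columns plus independent volume
rng = np.random.default_rng(0)
n = 1000
base = 300 + rng.normal(0, 1, n).cumsum()
X_demo = np.column_stack([
    base,                          # stand-in for Open
    base + rng.uniform(0, 2, n),   # stand-in for High
    base - rng.uniform(0, 2, n),   # stand-in for Low
    rng.lognormal(3, 1, n),        # stand-in for Volume
])

X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=0.95).fit(X_scaled)  # keep enough components for 95% variance

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Because the price columns move almost in lockstep, they collapse into a single component, which is why PCA can be effective on OHLC data.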

**Numerosity Reduction (replacing the original data with a smaller representation)**

In [42]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# ... (Assuming X and y are your features and target variable)

# 1. Fit a linear regression model
regressor = LinearRegression()
regressor.fit(X, y)

# 2. Store the model coefficients and intercept
coefficients = regressor.coef_
intercept = regressor.intercept_

# 3. Use the model parameters to represent the data
# ... (You can now use the coefficients and intercept to make predictions
#      or to reconstruct an approximation of the original data)
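
Step 3 can be made concrete: once the coefficients and intercept are stored, an approximation of the target can be rebuilt from the features with a single matrix product. The sketch below uses synthetic data and hypothetical variable names (`X_demo`, `y_demo`, `y_approx`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and a target that is roughly linear in them
rng = np.random.default_rng(0)
X_demo = rng.uniform(290, 310, size=(200, 3))
y_demo = X_demo @ np.array([0.2, 0.5, 0.3]) + 1.0 + rng.normal(0, 0.1, 200)

reg = LinearRegression().fit(X_demo, y_demo)
coefficients, intercept = reg.coef_, reg.intercept_

# Rebuild an approximation of y from the stored parameters alone
y_approx = X_demo @ coefficients + intercept

print("Stored parameters:", coefficients.size + 1)
print("Max reconstruction error:", np.abs(y_approx - y_demo).max())
```

Here four numbers stand in for 200 target values; the reconstruction is only as good as the linear fit.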