<a href="https://colab.research.google.com/github/MDaniels-AI-Projects/Bitcoin-Price-Predictor/blob/main/Bit_Coin_Price_Predictor_Basic_RF_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Acquire BTC Data and Generate Core Features
First, I'll set up my environment and parameters. I am using a full five-year history for training, with the end date fixed at the day the notebook was run. Then, I'll download the Bitcoin (BTC) price data and create my core endogenous features: Returns, Moving Averages (MA_5, MA_20), Volatility (Std_20), and the 7-Day Lagged Return (Return_Lag_7).

In [1]:
import yfinance as yf
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np # Import numpy for completeness

# Define parameters based on user's requirements
TICKER = 'BTC-USD'
START_DATE = '2020-10-31'
END_DATE = '2025-10-31' # End date, set to the run date; yf.download treats `end` as exclusive
PREDICTION_HORIZON = 7

# 1. Download BTC data
downloaded_data = yf.download(TICKER, start=START_DATE, end=END_DATE)

# Explicitly select the 'Close' price as a single-column DataFrame
prices_df = downloaded_data[['Close']].copy()

prices_df.columns = [TICKER]
print(f"Initial {TICKER} data downloaded from {START_DATE} to {END_DATE}.")

# 2. Calculate Daily Returns
prices_df['Returns'] = prices_df[TICKER].pct_change()

# 3. Create Endogenous Features
prices_df['MA_5'] = prices_df[TICKER].rolling(window=5).mean()
prices_df['MA_20'] = prices_df[TICKER].rolling(window=20).mean()
prices_df['Std_20'] = prices_df['Returns'].rolling(window=20).std()
prices_df['Return_Lag_7'] = prices_df['Returns'].shift(7)

print("\n--- BTC Feature Generation Complete ---")
print(f"Total data points for BTC features (before NaN clean): {prices_df.shape[0]}")
print("Features created: MA_5, MA_20, Std_20, Returns, Return_Lag_7")

[*********************100%***********************]  1 of 1 completed

Initial BTC-USD data downloaded from 2020-10-31 to 2025-10-31.

--- BTC Feature Generation Complete ---
Total data points for BTC features (before NaN clean): 1826
Features created: MA_5, MA_20, Std_20, Returns, Return_Lag_7
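
To make the NaN bookkeeping behind these features concrete, here is a minimal sketch on a hypothetical toy series (30 synthetic closes, not real BTC data). It applies the same `pct_change`, `rolling`, and `shift` calls and counts how many leading rows each feature leaves empty, which is what later determines how many rows `dropna()` removes.

```python
import pandas as pd

# Hypothetical toy price series (not real BTC data): 30 daily closes.
dates = pd.date_range("2024-01-01", periods=30, freq="D")
toy = pd.DataFrame({"Close": [100.0 + i for i in range(30)]}, index=dates)

toy["Returns"] = toy["Close"].pct_change()               # 1 leading NaN
toy["MA_5"] = toy["Close"].rolling(window=5).mean()      # 4 leading NaNs
toy["MA_20"] = toy["Close"].rolling(window=20).mean()    # 19 leading NaNs
toy["Std_20"] = toy["Returns"].rolling(window=20).std()  # 20 leading NaNs
toy["Return_Lag_7"] = toy["Returns"].shift(7)            # 8 leading NaNs

nan_counts = toy.isna().sum()
rows_after_dropna = len(toy.dropna())

print(nan_counts)
print(f"Rows surviving dropna(): {rows_after_dropna}")
```

The longest warm-up here comes from `Std_20`, because it is a 20-row window over a series that already starts with one NaN.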





Step 2: Acquire, Lag, and Merge Exogenous Market Data 💰
Next, I will download the five external factors: Gold (GLD), Ethereum (ETH-USD), S&P 500 (SPY), Dollar Index (DX-Y.NYB), and Ten Year Yield (^TNX). Crucially, I will lag these exogenous features by 1 day using .shift(1) to ensure I am only using information that was available at the time of prediction, preventing look-ahead bias.
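
As a rough illustration of what a one-row lag does when assets trade on different calendars, the sketch below uses two hypothetical columns, `CRYPTO` (seven days a week) and `EQUITY` (weekdays only). The forward-fill variant is an assumption of mine, not something the notebook does. It is shown only to contrast how many rows survive `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 2024-01-01 is a Monday.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
exog = pd.DataFrame(
    {"CRYPTO": np.arange(10, dtype=float), "EQUITY": np.arange(10, dtype=float)},
    index=idx,
)
exog.loc[exog.index.dayofweek >= 5, "EQUITY"] = np.nan  # no weekend closes

# Row t now holds the observation from row t-1.
lagged = exog.shift(1)

# Alternative (assumption): carry the last weekday close forward before lagging.
lagged_ffill = exog.ffill().shift(1)

rows_plain = len(lagged.dropna())
rows_ffill = len(lagged_ffill.dropna())
print(rows_plain, rows_ffill)
```

With the plain lag, the Sunday and Monday rows are dropped because the lagged equity value is missing there; this kind of calendar mismatch may account for part of the gap between the downloaded row count and the final DataFrame shape.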

In [2]:
# Define parameters (same as before)
START_DATE = '2020-10-31'
END_DATE = '2025-10-31'

# 1. Download Exogenous Data
EXOGENOUS_TICKERS = ['GLD', 'ETH-USD', 'SPY', 'DX-Y.NYB', '^TNX']
exogenous_data_df = yf.download(EXOGENOUS_TICKERS, start=START_DATE, end=END_DATE)['Close']
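# Select by name rather than overwriting the labels: yf.download may return the
# columns in a different order than requested, and a positional rename would mislabel them.
exogenous_data_df = exogenous_data_df[EXOGENOUS_TICKERS]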

print("Exogenous data downloaded successfully.")

# 🎯 CRITICAL FIX: Lag Exogenous Data by 1 Day 🎯
exogenous_data_df_lagged = exogenous_data_df.shift(1)
print("All exogenous features lagged by 1 day to prevent look-ahead bias.")

# 2. Merge the DataFrames
# We merge prices_df (t features) with exogenous_data_df_lagged (t-1 features)
full_df = prices_df.merge(exogenous_data_df_lagged, how='inner', left_index=True, right_index=True)

# 3. Drop all rows with NaN values (removes initial rolling window NaNs and the first lagged row)
full_df = full_df.dropna().copy()

print("\n--- Full Feature DataFrame Created ---")
print(f"Final DataFrame Shape: {full_df.shape}")
print("Features created and ready (date features REMOVED due to poor performance).")
print(f"Features (X): {list(full_df.columns)}")

[*********************100%***********************]  5 of 5 completed

Exogenous data downloaded successfully.
All exogenous features lagged by 1 day to prevent look-ahead bias.

--- Full Feature DataFrame Created ---
Final DataFrame Shape: (1241, 11)
Features created and ready (date features REMOVED due to poor performance).
Features (X): ['BTC-USD', 'Returns', 'MA_5', 'MA_20', 'Std_20', 'Return_Lag_7', 'GLD', 'ETH-USD', 'SPY', 'DX-Y.NYB', '^TNX']
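
The exogenous column names listed above are only meaningful if each label is attached to the right series. The sketch below uses a hypothetical frame whose columns arrive alphabetically sorted (an assumption about the downloader worth checking with `print(df.columns)`), and compares relabelling by position with selecting by name.

```python
import pandas as pd

requested = ["GLD", "ETH-USD", "SPY", "DX-Y.NYB", "^TNX"]

# Hypothetical frame: columns arrive sorted, and each cell stores the name of
# the ticker the data really belongs to.
arrived = pd.DataFrame({name: [name] for name in sorted(requested)})

relabelled = arrived.copy()
relabelled.columns = requested   # positional relabel: names can drift from data
selected = arrived[requested]    # selection by name: names stay paired with data

mismatched = [c for c in requested if relabelled[c].iloc[0] != c]
print("Mislabelled after positional rename:", mismatched)
```

Under that assumption, three of the five labels end up on the wrong series after a positional rename, while selection by name keeps every pair intact.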





Step 3: Create Target Variable (Y) and Finalise Data Split
Now, I will define the target variable, Y, as the future 7-day return of BTC. Then, I'll split the clean data into features (X) and target (Y), reserving the last 7 labelled rows as a hold-out prediction set. Because the horizon is applied to the rows of the merged DataFrame, "7 days" here means 7 rows, and the most recent 7 rows (whose future return is not yet known) are removed by dropna().
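
The forward-return construction can be checked by hand on a small example. This sketch assumes a hypothetical horizon of 3 rows and a toy price series that grows 10% per row, so every forward return should equal 1.1^3 - 1 = 0.331.

```python
import pandas as pd

HORIZON = 3  # hypothetical short horizon so the toy example stays small
price = pd.Series(
    [100.0, 110.0, 121.0, 133.1, 146.41, 161.051],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# pct_change(h) at row t is P[t] / P[t-h] - 1; shift(-h) moves that value to
# row t-h, so each row ends up holding P[t+h] / P[t] - 1 (the forward return).
target = price.pct_change(HORIZON).shift(-HORIZON)

labelled = target.dropna()          # rows usable for training and validation
unlabelled = target[target.isna()]  # last HORIZON rows: future not yet known

print(target)
```

The `unlabelled` rows are the ones a live forecast would be made for; calling `dropna()` on the full frame discards them along with any other incomplete rows.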

In [3]:
# NOTE: We assume 'full_df' from the Step 2 cell is available.

PREDICTION_HORIZON = 7
TICKER = 'BTC-USD'

print("\nDefining Target and Splitting Data...")

# 1. Create the Target Variable (Y): Future 7-Day Return
full_df['Target_7Day_Return'] = full_df[TICKER].pct_change(PREDICTION_HORIZON).shift(-PREDICTION_HORIZON)
print(f"Target variable 'Target_7Day_Return' (Y) created, shifted back by {PREDICTION_HORIZON} days.")


# 2. Clean and Separate Features (X) and Target (Y)
# dropna() also removes the final PREDICTION_HORIZON rows, whose future return is not yet known.
full_df_cleaned = full_df.dropna().copy()
# Drop the raw price, Daily Returns, and the new Target from the feature set (X).
X = full_df_cleaned.drop(columns=[TICKER, 'Returns', 'Target_7Day_Return'])
Y = full_df_cleaned['Target_7Day_Return']


# 3. Final Split into Train, Validation, and Prediction Sets
PREDICT_DAYS = 7
X_predict = X.iloc[-PREDICT_DAYS:]
X = X.iloc[:-PREDICT_DAYS]
Y = Y.iloc[:-PREDICT_DAYS]

# Split the remaining data (80% Train, 20% Validation)
TRAIN_SPLIT_INDEX = int(len(X) * 0.8)

X_train, X_val = X.iloc[:TRAIN_SPLIT_INDEX], X.iloc[TRAIN_SPLIT_INDEX:]
Y_train, Y_val = Y.iloc[:TRAIN_SPLIT_INDEX], Y.iloc[TRAIN_SPLIT_INDEX:]

# 4. Review Final Data Shapes
print("\n--- Final Data Shapes ---")
print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"Total Prediction Samples (X_predict): {len(X_predict)}")


Defining Target and Splitting Data...
Target variable 'Target_7Day_Return' (Y) created, shifted back by 7 days.

--- Final Data Shapes ---
X_train shape: (981, 9)
X_val shape: (246, 9)
Total Prediction Samples (X_predict): 7
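
Because each target spans 7 future rows, the last few training targets overlap the start of the validation period. One possible way to avoid that overlap is to leave a gap between the two sets. The helper below is a hypothetical sketch (the name `chronological_split` and its `gap` parameter are mine, not part of the notebook), applied to the 1,227 rows implied by the shapes above.

```python
def chronological_split(n_rows, train_frac=0.8, gap=0):
    """Return (train_indices, val_indices) for an ordered dataset.

    `gap` rows just before the validation set are discarded so that a training
    target looking `gap` rows ahead cannot reach into the validation period.
    """
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be between 0 and 1")
    split = int(n_rows * train_frac)
    train_end = max(split - gap, 0)
    return list(range(train_end)), list(range(split, n_rows))


# 981 training rows + 246 validation rows = 1227 rows before any gap is applied.
train_idx, val_idx = chronological_split(1227, train_frac=0.8, gap=7)
print(len(train_idx), len(val_idx))
```

In the notebook this would correspond to `X.iloc[train_idx]` and `X.iloc[val_idx]`; the validation set is unchanged and only the last 7 training rows are given up.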


Step 4: Model Training (Random Forest Regressor)
I will now train my Random Forest Regressor using the hyperparameters that performed best in my testing. I'll evaluate the model on the validation set and check the feature importance to ensure the features are contributing meaningfully.

In [4]:
from sklearn.ensemble import RandomForestRegressor

# 1. Initialise the Random Forest Regressor
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

# 2. Train the Model
print("\nTraining Random Forest Regressor on Lagged Features (No Dates)...")
rf_model.fit(X_train, Y_train)
print("Training complete.")

# 3. Evaluate the Model on the Validation Set
Y_val_pred_rf = rf_model.predict(X_val)
rmse_rf = np.sqrt(mean_squared_error(Y_val, Y_val_pred_rf))

# 4. Extract Feature Importances
feature_importances_rf = pd.Series(rf_model.feature_importances_, index=X_train.columns)
top_5_features_rf = feature_importances_rf.sort_values(ascending=False).head(5)

print("\n--- Random Forest Validation Results (Lagged Features) ---")
print(f"New RMSE: {rmse_rf:.6f}")
print("\n--- Top 5 Feature Importances ---")
print(top_5_features_rf)


Training Random Forest Regressor on Lagged Features (No Dates)...
Training complete.

--- Random Forest Validation Results (Lagged Features) ---
New RMSE: 0.071355

--- Top 5 Feature Importances ---
^TNX       0.224828
ETH-USD    0.152131
Std_20     0.118769
MA_5       0.104065
MA_20      0.100474
dtype: float64
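
An RMSE on its own is hard to judge, so it can help to compare it with naive forecasts. The sketch below defines two hypothetical baselines (always predict a zero return, or always predict the training-set mean) and evaluates them on toy numbers that are not results from this notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def baseline_rmses(y_train, y_val):
    """RMSE of two naive forecasts, scored on the validation targets."""
    y_val = np.asarray(y_val, dtype=float)
    train_mean = float(np.mean(np.asarray(y_train, dtype=float)))
    return {
        "zero_return": rmse(y_val, np.zeros_like(y_val)),
        "train_mean": rmse(y_val, np.full_like(y_val, train_mean)),
    }


# Hypothetical toy targets, used only to show the call.
demo = baseline_rmses(y_train=[0.02, -0.01, 0.05], y_val=[0.03, -0.03])
print(demo)
```

In the notebook, `baseline_rmses(Y_train, Y_val)` could be printed next to `rmse_rf`; a model RMSE that is not clearly below both baselines suggests the features add little on the validation period.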


Step 5: Final Prediction and Trading Signal
Finally, I will use my trained Random Forest model to predict the 7-day return for the next week using the X_predict data. I'll translate that predicted return into a BTC price target, using a placeholder for today's market price.

In [5]:
# NOTE: We assume 'rf_model' is the best model trained in the previous cell.

# 1. Ensure the final model is trained (in case of a fresh run)
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, Y_train)

# 2. Generate the 7-day future return predictions
final_7_day_predictions = rf_model.predict(X_predict)
predictions_df = pd.DataFrame({
    'Features_Date': X_predict.index,
    'Predicted_7Day_Return': final_7_day_predictions
})
predictions_df['Forecast_Date'] = predictions_df['Features_Date'] + pd.Timedelta(days=PREDICTION_HORIZON)

# 3. Calculate the Predicted BTC Price Target (Using the LAST prediction)
# Placeholder price for the last feature date: 2025-10-31
last_known_price = 109300.00
last_prediction = final_7_day_predictions[-1]


predicted_price_target = last_known_price * (1 + last_prediction)
predicted_end_date = predictions_df['Forecast_Date'].iloc[-1]


print("\n--- Final 7-Day Forecast (Using Best Model: Random Forest) ---")
print(f"**Base Price for Forecast (2025-10-31):** {last_known_price:,.2f} USD (Placeholder)")
print(f"**Predicted 7-Day Return:** {last_prediction:.4f}")
print(f"**Predicted BTC Price Target for 7 days from today is {predicted_price_target:,.2f} USD")
print("\n")
print(f"**Trading Signal:** Buy/Hold if the Predicted Return ({last_prediction:.4f}) is greater than your required profit threshold.")


--- Final 7-Day Forecast (Using Best Model: Random Forest) ---
**Base Price for Forecast (2025-10-31):** 109,300.00 USD (Placeholder)
**Predicted 7-Day Return:** -0.0378
**Predicted BTC Price Target for 7 days from today is 105,164.78 USD


**Trading Signal:** Buy/Hold if the Predicted Return (-0.0378) is greater than your required profit threshold.
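
The price-target arithmetic and the threshold rule described in the trading signal can be written as two small functions. This is a sketch only: the function names and the "no trade" label are hypothetical, and the example uses the rounded return printed above, so the result differs by a few dollars from the notebook's figure, which comes from the unrounded prediction.

```python
def price_target(base_price, predicted_return):
    """Convert a simple (not log) return into a price level."""
    if base_price <= 0:
        raise ValueError("base_price must be positive")
    return base_price * (1.0 + predicted_return)


def trading_signal(predicted_return, threshold=0.0):
    """Hypothetical rule: 'buy/hold' only when the forecast beats the threshold."""
    return "buy/hold" if predicted_return > threshold else "no trade"


target = price_target(109300.00, -0.0378)
print(f"{target:,.2f} USD")
print(trading_signal(-0.0378, threshold=0.02))
```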


In [6]:
# --- Input Cell for LAST KNOWN PRICE ---

# The while/try/except loop is used to ensure the user enters a valid, positive number.
while True:
    try:
        # Prompt the user to enter the base price for the prediction calculation
        # The float() function converts the input text into a number with a decimal point.
        last_known_price = float(input("Enter the last known Bitcoin price in USD.. No commas. (e.g., 100000.00) :$"))

        # Validation check for a positive price
        if last_known_price <= 0:
            print("The price must be a positive number. Please try again.")
        else:
            # Input is valid and positive, so we exit the loop
            break

    except ValueError:
        # Catch non-numeric input (e.g., user types 'text')
        print("Invalid input. Please enter a numerical value.")

# The 'last_known_price' variable is now ready for use in subsequent cells
# e.g., predicted_price_target = last_known_price * (1 + last_prediction)

Enter the last known Bitcoin price in USD.. No commas. (e.g., 100000.00) :$110487.80
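
The validation in the input loop can also be kept in a plain function, which makes it testable without typing at a prompt. The sketch below is a hypothetical variant (`parse_price` is my name, not the notebook's) that additionally tolerates a leading `$` and thousands commas, and rejects `nan` and `inf`, both of which `float()` accepts and a `<= 0` check alone lets through.

```python
def parse_price(text):
    """Parse a user-supplied USD price; tolerate '$', spaces and thousands commas."""
    cleaned = text.strip().lstrip("$").replace(",", "").strip()
    value = float(cleaned)  # raises ValueError for non-numeric text
    # `value > 0` is False for NaN, so NaN is rejected here as well.
    if not (value > 0) or value == float("inf"):
        raise ValueError("price must be a positive, finite number")
    return value


print(parse_price("110,487.80"))
```

Inside the loop, `last_known_price = parse_price(input(...))` would replace the direct `float(...)` call, with the existing `except ValueError` branch handling every rejected input.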


In [11]:
# NOTE: We assume 'rf_model' is the best model trained in the previous cell.

# 1. Ensure the final model is trained (in case of a fresh run)
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, Y_train)

# ... (Training the model and preparing X_predict remains the same)

# 2. Generate the 7-day future return predictions
final_7_day_predictions = rf_model.predict(X_predict)

# --- 🚨 THIS IS THE LINE THAT GETS YOUR MODEL'S ACTUAL PREDICTION 🚨 ---
# X_predict holds 7 rows, so [0] selects the prediction for the earliest feature date.
# Note: In [5] used [-1], the prediction for the latest feature date.
last_prediction = final_7_day_predictions[0]
# ---------------------------------------------------------------------

# 3. Calculate the Predicted BTC Price Target (using the prediction selected above)
predicted_price_target = last_known_price * (1 + last_prediction)
# Forecast date for the last row of predictions_df (its feature date + 7 calendar days)
predicted_end_date = predictions_df['Forecast_Date'].iloc[-1]


print("\n--- Final 7-Day Forecast (Using Random Forest) ---")
print(f"**Base Price for Forecast:** {last_known_price:,.2f} USD")
# This now prints the dynamic value from your model:
print(f"**Predicted 7-Day Return:** {last_prediction:.4f}")
print(f"**Predicted BTC Price Target (7 days from today):** {predicted_price_target:,.2f} USD")
print("\n")

# Report the direction implied by the sign of the predicted return
if last_prediction > 0:
    print(f"**The Predicted Return ({last_prediction:.4f}) suggests a price increase over the next {PREDICTION_HORIZON} days.**")
else:
    print(f"**The Predicted Return ({last_prediction:.4f}) suggests a price decrease over the next {PREDICTION_HORIZON} days.**")


--- Final 7-Day Forecast (Using Random Forest) ---
**Base Price for Forecast:** 110,487.80 USD
**Predicted 7-Day Return:** -0.0354
**Predicted BTC Price Target (7 days from today):** 106,575.90 USD


**The Predicted Return (-0.0354) suggests a price decrease over the next 7 days.**
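
When several feature dates are predicted at once, picking the forecast by date rather than by a bare position makes it explicit which row drives the price target. The sketch below uses hypothetical predictions for seven consecutive feature dates; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical predictions, indexed by the date of the features they were made from.
feature_dates = pd.date_range("2024-01-01", periods=7, freq="D")
predicted = pd.Series(
    [0.01, 0.02, -0.01, 0.00, 0.03, -0.02, 0.04],
    index=feature_dates,
    name="Predicted_7Day_Return",
)

latest_date = predicted.index.max()
latest_return = predicted.loc[latest_date]  # forecast from the most recent features
earliest_return = predicted.iloc[0]         # what a bare [0] would select instead

forecast_date = latest_date + pd.Timedelta(days=7)
print(latest_date.date(), latest_return, forecast_date.date())
```

In the notebook, the same idea could be applied by wrapping `final_7_day_predictions` in a Series indexed by `X_predict.index`, so the return and its forecast date always come from the same row.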
