# Module 1 No Code/Low Code Assignment: Stock Price Prediction with LLM (Using Gemini in Colab)

This assignment explores how large language models (LLMs) can perform complex tasks that previously required coding. Instead of writing code by hand, we will interact with Google's Gemini in Google Colab (click the Gemini icon in the top-right corner) to:

- Load and analyze stock price data.
- Engineer relevant features for prediction.
- Build and evaluate models such as regression and random forest.
- Discuss potential trading strategies based on predictions.

**Note**: While this notebook contains example code, students are encouraged to use LLMs to generate code snippets and explanations, managing interactions within the LLM's context window.

## Task Breakdown

### Step 1: Data Exploration
- **Prompt the LLM**: Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 using Yahoo Finance.
- **Key Considerations**: Ensure the model retains context while summarizing key statistics.

### Step 2: Feature Engineering
- **Prompt the LLM**: Generate meaningful features such as moving averages, technical indicators, and volatility measures.
- **Managing Context**: If the LLM forgets prior outputs, remind it by reintroducing key statistics.

### Step 3: Model Building
- **Prompt the LLM**: Train a model (regression for price prediction, random forest for trend classification).
- **Split Ratio Consideration**: Explain the importance of train-test split (e.g., 70/30) and verify the LLM respects it.

### Step 4: Model Evaluation & Prediction
- **Prompt the LLM**: Use the trained model to predict stock prices for 2023-2024.
- **Analyze Performance**: Interpret accuracy, R² scores, and classification reports.

### Step 5: Strategy Discussion
- **Prompt the LLM**: Based on model performance, discuss possible trading strategies.
- **Critical Thinking**: Evaluate whether the model is truly useful for financial decision-making.

---

Throughout the assignment, keep in mind that LLMs have a limited context window. If necessary, break down tasks into smaller chunks and reintroduce critical information.
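
One way to reintroduce critical information is to keep a short, reusable "context reminder" and paste it in front of each new prompt. The sketch below is only an illustration: the helper `build_context_reminder`, its arguments, and the example feature names are hypothetical and are not part of Colab or Gemini.

```python
def build_context_reminder(ticker, start, end, stats, features=()):
    """Return a short preamble that restates facts from earlier steps."""
    lines = [
        f"Context from earlier steps: daily {ticker} prices from {start} to {end}.",
        "Closing-price statistics:",
    ]
    lines += [f"- {name}: {value:.2f}" for name, value in stats.items()]
    if features:
        lines.append("Features already engineered: " + ", ".join(features))
    return "\n".join(lines)


# Illustrative values; replace them with the statistics your own run produced.
stats = {"mean": 127.633102, "std": 30.421462, "min": 54.378582, "max": 178.645660}

reminder = build_context_reminder(
    "AAPL", "2020-01-01", "2023-01-01", stats,
    features=["SMA_50", "SMA_200", "RSI", "Rolling_Std_Dev"],  # hypothetical names
)

# Prepend the reminder to the next task so the LLM does not rely on memory alone.
prompt = reminder + "\n\nTask: Train a regression model with a 70/30 chronological split."
print(prompt)
```

Keeping the reminder short matters: it should restate only what the next step needs, so it does not itself consume much of the context window.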



## Using Gemini in Colab

To use Gemini directly in your browser within Google Colab:

1. **Create a new code cell** by clicking the `+ Code` button.
2. In the newly created cell, you will see a placeholder text: **"Start coding or generate with AI"**.
3. Click on **"Generate"** and enter your prompt, such as:
   > "Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 using Yahoo Finance and summarize its key statistics."
4. Gemini will generate code based on your prompt. You can then review, modify, and run it within the notebook.
5. If the model loses context, you may need to remind it of previous responses or break down tasks into smaller steps.

This method allows you to interactively generate and refine code without manually writing every line.


In [1]:
# prompt: Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 using Yahoo Finance and summarize its key statistics.

!pip install yfinance

import yfinance as yf
import pandas as pd

# Download historical data for AAPL
data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")

# Display summary statistics
print(data.describe())




  data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")
[*********************100%***********************]  1 of 1 completed

Price        Close        High         Low        Open        Volume
Ticker        AAPL        AAPL        AAPL        AAPL          AAPL
count   756.000000  756.000000  756.000000  756.000000  7.560000e+02
mean    127.633102  129.181519  125.971433  127.555152  1.120920e+08
std      30.421462   30.698898   30.139549   30.449562  5.602586e+07
min      54.378582   55.379535   51.528416   55.277744  3.519590e+07
25%     112.709526  114.204939  111.251818  112.889664  7.636470e+07
50%     133.010773  134.143878  131.050657  132.832284  9.493580e+07
75%     148.221516  149.388592  146.276870  147.588770  1.296327e+08
max     178.645660  179.558442  175.809046  179.254175  4.265100e+08





## Now try Steps 2 through 5 yourself

The sample prompts and LLM responses are listed below for your reference.

In [4]:
# prompt: Generate meaningful features such as moving averages, technical indicators, and volatility measures. Ensure the model retains context while summarizing key statistics.

# Moving Averages
data['SMA_50'] = data['Close'].rolling(window=50).mean()
data['SMA_200'] = data['Close'].rolling(window=200).mean()

# Technical Indicators (Example: RSI)
# Calculate price differences
delta = data['Close'].diff()
# Get positive and negative deltas
gain = delta.clip(lower=0)
loss = -delta.clip(upper=0)
# Calculate exponential moving average of gains and losses
avg_gain = gain.ewm(com=14-1, adjust=False).mean()
avg_loss = loss.ewm(com=14-1, adjust=False).mean()
# Calculate Relative Strength (RS)
rs = avg_gain / avg_loss
# Calculate Relative Strength Index (RSI)
data['RSI'] = 100 - (100 / (1 + rs))

# Volatility Measures (Example: Rolling Standard Deviation)
data['Rolling_Std_Dev'] = data['Close'].rolling(window=20).std()

# Drop rows with NaN values created by rolling calculations
data.dropna(inplace=True)

# Display the first few rows with the new features
print(data.head())

# Summarize key statistics of the new features
print(data[['SMA_50', 'SMA_200', 'RSI', 'Rolling_Std_Dev']].describe())

Price            Close        High         Low        Open     Volume  \
Ticker            AAPL        AAPL        AAPL        AAPL       AAPL   
Date                                                                    
2020-10-15  117.550018  118.027189  115.057037  115.612115  112559200   
2020-10-16  115.904259  118.368034  115.699757  118.105098  115393800   
2020-10-19  112.943840  117.267603  112.632217  116.819646  120639300   
2020-10-20  114.433807  115.865327  112.603018  113.158096  124423700   
2020-10-21  113.810524  115.602352  113.401513  113.615755   89946000   

Price           SMA_50    SMA_200        RSI Rolling_Std_Dev  
Ticker                                                        
Date                                                          
2020-10-15  113.858163  86.063326  56.824164        4.631934  
2020-10-16  113.961827  86.279743  54.018943        4.343460  
2020-10-19  114.056628  86.484888  49.303890        4.171826  
2020-10-20  114.149774  86.694618  51

In [14]:
# prompt: Adopt the first four kinds of features and train a regression model to predict tomorrow's stock price. Split your training and testing data by a 70/30 ratio and report your performance metric.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

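# Define features and target.
# Note: this list covers prices, volume, moving averages, and RSI (the first four kinds),
# and also includes the volatility measure Rolling_Std_Dev.
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_50', 'SMA_200', 'RSI', 'Rolling_Std_Dev']
target = 'Tomorrow_Close'  # Next trading day's closing price (column created below)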

# Create the target variable (tomorrow's closing price)
data['Tomorrow_Close'] = data['Close'].shift(-1)

# Drop the last row as it will have a NaN for 'Tomorrow_Close'
data.dropna(inplace=True)

# Prepare data for modeling
X = data[features]
y = data['Tomorrow_Close']

# Split data into training and testing sets (70/30 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False) # Use shuffle=False for time series data

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

Mean Squared Error: 12.593873631839594
R-squared Score: 0.8786445775906475
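
An R-squared of about 0.88 on price levels is easier to interpret next to a naive baseline. The sketch below is not part of the original notebook: it uses a synthetic random walk (an assumption, not AAPL data) and a hand-written `r2` helper to show how a "tomorrow equals today" forecast scores on price levels versus on daily changes.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic random-walk prices (an assumption for illustration, not AAPL data).
rng = np.random.default_rng(0)
close = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=500))

today, tomorrow = close[:-1], close[1:]

# Persistence baseline on price levels: predict tomorrow's close = today's close.
level_r2 = r2(tomorrow, today)

# The same baseline on daily changes: predict "no change" for every day.
change = tomorrow - today
change_r2 = r2(change, np.zeros_like(change))

print(f"R-squared on price levels:  {level_r2:.3f}")
print(f"R-squared on daily changes: {change_r2:.3f}")
```

On price levels the baseline tends to score close to 1 even though it has no forecasting skill, while on daily changes it cannot exceed 0. Comparing a model against this kind of baseline is one way to judge whether it adds anything beyond "tomorrow looks like today".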


In [17]:
# prompt: Use the trained model to predict stock prices for 2023-2024.
# Analyze Performance: Interpret accuracy, R² scores, and classification reports

# Use the trained model to predict stock prices for 2023-2024.
# Download data for the prediction period (2023-2024)
prediction_data = yf.download("AAPL", start="2023-01-01", end="2025-01-01") # End a bit beyond to ensure 2024 data

# Re-calculate the same features for the prediction data
prediction_data['SMA_50'] = prediction_data['Close'].rolling(window=50).mean()
prediction_data['SMA_200'] = prediction_data['Close'].rolling(window=200).mean()

delta_pred = prediction_data['Close'].diff()
gain_pred = delta_pred.clip(lower=0)
loss_pred = -delta_pred.clip(upper=0)
avg_gain_pred = gain_pred.ewm(com=14-1, adjust=False).mean()
avg_loss_pred = loss_pred.ewm(com=14-1, adjust=False).mean()
rs_pred = avg_gain_pred / avg_loss_pred
prediction_data['RSI'] = 100 - (100 / (1 + rs_pred))

prediction_data['Rolling_Std_Dev'] = prediction_data['Close'].rolling(window=20).std()

# Drop rows with NaN values (due to rolling calculations)
prediction_data.dropna(inplace=True)

# Prepare features for prediction
X_predict = prediction_data[features]

# Make predictions for the 2023-2024 period
predictions_2023_2024 = model.predict(X_predict)

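# Add predictions to the prediction_data DataFrame.
# Each value is the model's forecast of the NEXT trading day's close,
# so it should be compared with the 'Close' in the following row, not the same row.
prediction_data['Predicted_Close'] = predictions_2023_2024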

# Display the predictions
print("\nStock Price Predictions for 2023-2024:")
print(prediction_data[['Close', 'Predicted_Close']].head())
print(prediction_data[['Close', 'Predicted_Close']].tail())

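# Analyze Performance: Interpret accuracy, R² scores, and classification reports
# For regression, we look at MSE and R-squared.
# Note: `mse` and `r2` were computed earlier on the 30% hold-out split of the 2020-2022 data,
# not on the 2023-2024 predictions made in this cell.
print("\nModel Performance on the Test Set:")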
print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

# Interpretation:
# MSE (Mean Squared Error): This measures the average squared difference between the actual and predicted values. A lower MSE indicates a better fit. The value of `mse` will give you an idea of the typical error in your predictions in squared price units.
# R-squared Score: This represents the proportion of the variance in the dependent variable (stock price) that is predictable from the independent variables (features). An R-squared of 1 indicates a perfect fit, while 0 indicates that the model does not explain any of the variance. Your `r2` value will tell you how much of the stock price movement is captured by your model.

# If you were building a classification model (e.g., predicting price trend - up/down), you would use classification reports (precision, recall, f1-score, accuracy).
# Since this is a regression task predicting price, we don't use classification reports.

# Discussion on Performance:
# Based on the `mse` and `r2` values, you can assess how well the model performed on the unseen test data.
# - A low MSE and an R-squared close to 1 would suggest a good performing model.
# - A high MSE and an R-squared close to 0 or negative would indicate a poor model.
# The context of the stock market means perfect prediction is impossible. Even a model with a decent R-squared score might not be profitable for trading due to transaction costs and unpredictable events.

# Note: Predicting future stock prices accurately is very challenging. This linear regression model serves as a basic example. More sophisticated models and features would be needed for a potentially useful trading strategy.

  prediction_data = yf.download("AAPL", start="2023-01-01", end="2025-01-01") # End a bit beyond to ensure 2024 data
[*********************100%***********************]  1 of 1 completed



Stock Price Predictions for 2023-2024:
Price            Close Predicted_Close
Ticker            AAPL                
Date                                  
2023-10-18  174.331970      174.798993
2023-10-19  173.955261      174.512852
2023-10-20  171.397385      171.238199
2023-10-23  171.516357      171.570006
2023-10-24  171.952560      171.345358
Price            Close Predicted_Close
Ticker            AAPL                
Date                                  
2024-12-24  257.578674      255.958454
2024-12-26  258.396667      257.001081
2024-12-27  254.974930      252.448383
2024-12-30  251.593094      250.045959
2024-12-31  249.817383      248.159776

Model Performance on the Test Set:
Mean Squared Error: 12.593873631839594
R-squared Score: 0.8786445775906475


In [None]:
# prompt: Based on model performance, discuss possible trading strategies.
# Critical Thinking: Evaluate whether the model is truly useful for financial decision-making.

# Trading strategies based on a regression model predicting stock prices could include:

# 1.  **Buy Low, Sell High based on Prediction:** If the model predicts a significant increase in the price tomorrow compared to today's closing price, a simple strategy would be to buy the stock today and sell it tomorrow. Conversely, if a significant decrease is predicted, one might consider selling or shorting the stock.

# 2.  **Trading Around the Predicted Price:** Instead of a simple buy/sell based on the direction, one could use the predicted price as a target. If the current price is significantly below the predicted price, buy with the expectation it will move towards the prediction. If it's above, sell with the expectation it will move down.

# 3.  **Using Predictions in Conjunction with Technical Indicators:** The predicted price could be used as another input or confirmation signal for existing trading strategies based on moving averages, RSI, or volatility. For example, only buy if the model predicts an increase AND the stock is above its 50-day moving average.

# 4.  **Threshold-Based Trading:** Define a threshold for the predicted price change (e.g., predict a 1% increase). Only execute a trade if the predicted change exceeds this threshold to avoid trading on minor, potentially noisy predictions.

# **Critical Thinking: Evaluating the Model's Usefulness for Financial Decision-Making**

# Based on the output of this simple linear regression model, it's highly unlikely to be truly useful for financial decision-making and profitable trading for several reasons:

# 1.  **Model Simplicity:** A linear regression model is too simplistic to capture the complex, non-linear dynamics of the stock market. Stock prices are influenced by a vast array of factors beyond the technical indicators used here, including macroeconomic news, company-specific events, market sentiment, and unpredictable global events.
# 2.  **Misleading R-squared:** The R-squared of about 0.88 reported above looks high, but for price levels it mostly reflects that tomorrow's close is usually near today's close, which is itself one of the features. A high R-squared on price levels therefore says little about whether the model can predict the direction or size of the next day's move.
# 3.  **Look-Ahead Bias Risk:** While the code attempts to predict "Tomorrow_Close", if any of the engineered features (like moving averages or RSI) for a given day unknowingly incorporate data from future days *relative to the prediction point for that day*, it would introduce look-ahead bias, making the model appear better than it is on historical data but fail in real-world predictions. The way the features are calculated here (`.rolling()` and `.ewm()` on the historical `data` *before* the train-test split and before calculating `Tomorrow_Close`) seems appropriate, but it's a critical point to verify when building such models.
# 4.  **Transaction Costs:** Even if the model had some predictive power, transaction costs (brokerage fees, spreads) can quickly erode any small profits generated by frequent trading based on daily predictions.
# 5.  **Market Efficiency:** The efficient market hypothesis suggests that current stock prices already reflect all available information. Consistently predicting future prices using publicly available historical data and simple technical indicators is extremely difficult.
# 6.  **Overfitting Risk:** Without proper cross-validation or a longer test horizon, there's a risk that the model performs reasonably well on the specific hold-out period used here but fails on truly unseen future data.
# 7.  **Lack of Risk Management:** The discussion of strategies doesn't include any form of risk management (e.g., stop-loss orders, position sizing), which is crucial for any real trading strategy.

# In conclusion, while the linear regression model provides an example of how to approach stock price prediction, its simplicity and the inherent unpredictability
# of the stock market make it insufficient for real-world financial decision-making.
# It serves as a foundational exercise, highlighting the need for more sophisticated models, a wider range of features (including fundamental and alternative data),
# robust validation techniques, and rigorous risk management for any potentially viable trading strategy.
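
The overfitting point above mentions cross-validation. For time series, one common option is walk-forward validation, where every training window ends before its test window begins. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic data; the data, the coefficients, and the choice of five folds are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic features and target (an assumption; stands in for the engineered data).
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=n)

tscv = TimeSeriesSplit(n_splits=5)
scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Every training index precedes every test index, so no future rows leak in.
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    score = r2_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(score)
    print(f"Fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows, R2={score:.3f}")

print(f"Mean R2 across folds: {np.mean(scores):.3f}")
```

A large spread between folds is a warning sign that a single 70/30 split may give an overly optimistic or pessimistic picture.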

# Sample Prompts and LLM Responses
---

## Step 1: Data Exploration

### Prompt:
> Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 from Yahoo Finance, keep it in memory and summarize its key statistics.

### LLM Response:
- **Highest Closing Price:** \$182.01 on January 3, 2022
- **Lowest Closing Price:** \$56.01 on March 23, 2020
- **Average Closing Price:** \$126.69
- **Total Trading Days:** 756

---

## Step 2: Feature Engineering

### Prompt:
> Suggest some features that might help improve a stock price prediction model.

### LLM Response:
1. **Historical Prices**: Include past closing, opening, high, and low prices.
2. **Volume**: Daily trading volumes.
3. **Moving Averages**: Simple and exponential moving averages (e.g., 5-day, 10-day, 50-day).
4. **Technical Indicators**: RSI, MACD, Bollinger Bands, etc.
5. **Volatility**: Historical volatility, Average True Range (ATR).
6. **Sentiment Analysis**: News articles, social media sentiment.
7. **Economic Indicators**: Interest rates, inflation rates, GDP growth.
8. **Time-based Features**: Day of the week, month, year.
9. **Sector Performance**: Comparison with industry and sector indices.
10. **External Events**: Major geopolitical events, earnings reports.
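
A few of the suggested technical indicators can be computed with pandas alone. The sketch below is an illustration on a synthetic price series (an assumption, not AAPL data); the window lengths 12/26/9 for MACD and 20 periods with 2 standard deviations for Bollinger Bands are conventional defaults, not values taken from the LLM response.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (an assumption for illustration).
rng = np.random.default_rng(42)
close = pd.Series(100.0 + np.cumsum(rng.normal(0.0, 1.0, 120)), name="Close")

# MACD: 12-period EMA minus 26-period EMA, with a 9-period EMA as the signal line.
ema_fast = close.ewm(span=12, adjust=False).mean()
ema_slow = close.ewm(span=26, adjust=False).mean()
macd = ema_fast - ema_slow
macd_signal = macd.ewm(span=9, adjust=False).mean()

# Bollinger Bands: 20-period SMA plus/minus 2 rolling standard deviations.
sma_20 = close.rolling(window=20).mean()
std_20 = close.rolling(window=20).std()
upper_band = sma_20 + 2 * std_20
lower_band = sma_20 - 2 * std_20

indicators = pd.DataFrame({
    "Close": close,
    "MACD": macd,
    "MACD_Signal": macd_signal,
    "BB_Upper": upper_band,
    "BB_Lower": lower_band,
})

# The first 19 rows lack a full 20-period window, so the band columns are NaN there.
print(indicators.dropna().tail())
```

All of these use only past and current values, so they can be added as features without introducing look-ahead bias.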

---

## Step 3.1: Model Building - Linear Regression

### Prompt:
> Adopt the first four kinds of features and train a regression model to predict tomorrow's stock price. Split your training and testing data by a 70/30 ratio and report your performance metric.

### LLM Response:
- **Mean Squared Error (MSE):** 859.04
- **R-squared (R²) Score:** -0.009

These metrics indicate that the model is not performing well, as an R² score close to zero or negative suggests that the model does not explain the variability in the target variable better than a simple mean of the target variable. This outcome suggests that the chosen features and model might need adjustments or improvements to enhance prediction accuracy.
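
For reference, the R² score discussed here is conventionally defined as follows. This is the standard coefficient of determination, stated as background rather than taken from the LLM response, where $y_i$ are the actual values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the actual values:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

Predicting $\bar{y}$ for every sample makes the numerator equal to the denominator, so $R^2 = 0$. Any model whose squared error is larger than that of the mean predictor gets a negative score, which is the situation the reported value of -0.009 describes.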

---
(Predicting the exact price may be too hard, but what about predicting ups and downs?)
## Step 3.2: Model Building - Random Forest Classification

### Prompt:
> Create a binary target variable based on historical prices (1 if the closing price increases tomorrow, 0 otherwise), and train a Random Forest model using the same train/test split ratio. Report feature importance and performance metrics.

### LLM Response:
- **Accuracy:** 71.82%
- **Confusion Matrix:** [[84, 26], [36, 74]]
- **Classification Report:**

                  precision    recall  f1-score   support

               0       0.70      0.76      0.73       110
               1       0.74      0.67      0.70       110

        accuracy                           0.72       220
       macro avg       0.72      0.72      0.72       220
    weighted avg       0.72      0.72      0.72       220



- **Feature Importance:** The most important feature for predicting whether the price will increase or decrease the next day is the closing price, followed by the 5-day moving average (MA_5). The model achieved an accuracy of approximately 71.82%, indicating a reasonable performance for this classification task.

        Feature    Importance
        Open       0.0900
        High       0.0883
        Low        0.0980
        Close      0.3087
        Volume     0.0982
        MA_5       0.1242
        MA_10      0.0974
        MA_50      0.0951
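
The prompt in this step can be turned into code roughly as follows. This is a minimal sketch on synthetic prices, not the code the LLM produced: the data, the reduced feature set, and the hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic closing prices (an assumption for illustration, not AAPL data).
rng = np.random.default_rng(0)
close = pd.Series(100.0 + np.cumsum(rng.normal(0.0, 1.0, 400)))

df = pd.DataFrame({"Close": close})
df["MA_5"] = close.rolling(5).mean()
df["MA_10"] = close.rolling(10).mean()

# Binary target: 1 if tomorrow's close is higher than today's, 0 otherwise.
df["Target"] = (df["Close"].shift(-1) > df["Close"]).astype(int)

# The last row has no "tomorrow", and the first rows lack full moving averages.
df = df.iloc[:-1].dropna()

feature_cols = ["Close", "MA_5", "MA_10"]
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["Target"], test_size=0.3, shuffle=False  # keep time order
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
importances = pd.Series(clf.feature_importances_, index=feature_cols)

print(f"Accuracy: {accuracy:.2%}")
print(importances.sort_values(ascending=False))
```

Because the synthetic series is a random walk, accuracy near 50% is the expected outcome here. That makes the sketch a useful sanity check: a much higher score on shuffled data would hint at leakage rather than skill.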

---

## Step 4: Model Evaluation & Prediction
### Prompt:
> Use this model to predict stock price increase/decrease for period starting from 2023-01-01 to 2024-01-01. Report your performance metrics.

### LLM Response:
The Random Forest classifier's performance on the new test data from 2023-01-01 to 2024-01-01 is as follows:
- **Accuracy:** 60.85%
- **Confusion Matrix:** [[105, 0], [83, 24]]
- **Classification Report:**

                  precision    recall  f1-score   support

               0       0.56      1.00      0.72       105
               1       1.00      0.22      0.37       107

        accuracy                           0.61       212
       macro avg       0.78      0.61      0.54       212
    weighted avg       0.78      0.61      0.54       212
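
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand, assuming the scikit-learn convention that rows are actual classes and columns are predicted classes (this matches the supports of 105 and 107 reported above).

```python
# Confusion matrix from the response: rows = actual class, columns = predicted class.
cm = [[105, 0],
      [83, 24]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

precision, recall, f1 = {}, {}, {}
for k in (0, 1):
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as class k
    actual_k = cm[k][0] + cm[k][1]      # row sum: everything that truly is class k
    precision[k] = cm[k][k] / predicted_k
    recall[k] = cm[k][k] / actual_k
    f1[k] = 2 * precision[k] * recall[k] / (precision[k] + recall[k])

print(f"accuracy = {accuracy:.4f}")
for k in (0, 1):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f} f1={f1[k]:.2f}")
```

The model predicted class 1 only 24 times and was right every time (precision 1.00), but it missed 83 of the 107 actual up days (recall 0.22). Its class 0 predictions were right only 105 times out of 188 (precision 0.56).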

---
## Step 5: Strategy Discussion
### Prompt:
> What do you think about the performance of the model and what trading strategies could I build with this model?

### LLM Response:
- The model shows a moderate performance with an accuracy of approximately 60.85%. However, the precision and recall metrics indicate that while the model is good at identifying days when the stock price does not increase (class 0), it struggles significantly to identify days when the stock price does increase (class 1). This suggests a need for further feature engineering or model tuning to improve the prediction performance for this specific classification task.
- **Potential Trading Strategies**: Given the model's performance, potential trading strategies could be formulated as follows:

#### Conservative Strategy:

- **Focus**: Only trade on days when the model predicts a price increase with high confidence.
- **Action**: Buy stocks when the model predicts an increase (class 1).
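- **Risk Management**: On this test set the model's precision is high for the increase class (class 1: 1.00) but its recall is only 0.22, so buy signals will be rare. Stay out of the market when the model predicts no increase, keeping in mind that precision for class 0 is only 0.56, so many of those days will still be up days.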

#### Contrarian Strategy:

- **Focus**: Take advantage of the model's high recall for no price increase predictions (class 0).
- **Action**: Sell short or avoid buying stocks on days when the model predicts no increase.
- **Risk Management**: This strategy assumes that the model’s predictions of no price increase are reliable, yet their precision on this test set is only 0.56. It requires a careful assessment of market conditions to avoid false signals, especially before selling short.

#### Hybrid Strategy:

- **Combine**: Use both conservative and contrarian strategies based on confidence thresholds.
- **Action**: Set a higher threshold for buying (only buy when the predicted probability of price increase is significantly high) and a lower threshold for selling or staying out of the market.
- **Risk Management**: Adjust thresholds dynamically based on recent model performance and market volatility.
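
The hybrid strategy's two thresholds can be sketched as a simple mapping from predicted probabilities to signals. The function name, the signal labels, and the threshold values 0.65 and 0.45 below are hypothetical choices for illustration; the probabilities would come from something like a classifier's `predict_proba` output for class 1.

```python
import numpy as np


def signals_from_probabilities(p_up, buy_threshold=0.65, exit_threshold=0.45):
    """Map predicted probabilities of a price increase to trading signals."""
    p_up = np.asarray(p_up, dtype=float)
    signals = np.full(p_up.shape, "hold", dtype=object)
    signals[p_up >= buy_threshold] = "buy"         # only act on confident "up" calls
    signals[p_up <= exit_threshold] = "stay_out"   # step aside on likely "no increase"
    return signals


# Hypothetical probabilities for five consecutive days.
p_up = [0.80, 0.55, 0.30, 0.65, 0.45]
signals = signals_from_probabilities(p_up).tolist()
print(signals)  # ['buy', 'hold', 'stay_out', 'buy', 'stay_out']
```

Leaving a gap between the two thresholds creates a "hold" zone, which reduces the number of trades triggered by predictions that are close to a coin flip.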


---
## Conclusion

Through this assignment, we have explored how LLMs can assist in complex machine learning tasks without requiring extensive coding knowledge. Key takeaways:

- **Context Management**: LLMs have limited memory, so breaking tasks into smaller steps and reinforcing context is crucial.
- **Feature Engineering with LLMs**: The model can suggest features but needs guidance on financial relevance.
- **Modeling with LLMs**: LLMs can write and refine code, but human judgment is necessary to validate results.
- **Real-world Limitations**: Stock price prediction remains challenging; models should be used critically and not solely relied upon for trading decisions.

Next Steps:
- Experiment with different LLMs and compare outputs.
- Try refining feature engineering techniques.
- Test alternative modeling approaches, such as deep learning or ensemble methods.

