<a href="https://colab.research.google.com/github/ChinthanChinnappa/GENAI/blob/main/M1_Assignment_Stock_Price_Prediction_with_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 1 No Code/Low Code Assignment: Stock Price Prediction with LLM (Using Gemini in Colab)

This assignment explores how large language models (LLMs) can perform complex tasks that previously required coding. Instead of writing code by hand, we will interact with Google's Gemini in Google Colab (click the Gemini icon in the top-right corner) to:

- Load and analyze stock price data.
- Engineer relevant features for prediction.
- Build and evaluate models such as regression and random forest.
- Discuss potential trading strategies based on predictions.

**Note**: While this notebook contains example code, students are encouraged to use LLMs to generate code snippets and explanations, managing interactions within the LLM's context window.

## Task Breakdown

### Step 1: Data Exploration
- **Prompt the LLM**: Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 using Yahoo Finance.
- **Key Considerations**: Ensure the model retains context while summarizing key statistics.

### Step 2: Feature Engineering
- **Prompt the LLM**: Generate meaningful features such as moving averages, technical indicators, and volatility measures.
- **Managing Context**: If the LLM forgets prior outputs, remind it by reintroducing key statistics.

### Step 3: Model Building
- **Prompt the LLM**: Train a model (regression for price prediction, random forest for trend classification).
- **Split Ratio Consideration**: Explain the importance of train-test split (e.g., 70/30) and verify the LLM respects it.
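
One way to verify that a split respects the requested ratio is to compute the sizes directly. The sketch below also keeps the rows in time order, so the test set is the most recent 30%. This is an alternative shown for illustration, not what the example cells in this notebook do: they call scikit-learn's `train_test_split`, which shuffles rows by default. The helper name `chronological_split` is hypothetical.

```python
def chronological_split(rows, test_fraction=0.3):
    """Split an ordered sequence so the test set is the most recent part."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]


# 707 is the row count this notebook reports after feature engineering.
days = list(range(707))
train, test = chronological_split(days)

print(len(train), len(test))
print(f"test share: {len(test) / len(days):.3f}")
```

For a pandas DataFrame, the same cut point can be applied with `data.iloc[:cut]` and `data.iloc[cut:]`.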

### Step 4: Model Evaluation & Prediction
- **Prompt the LLM**: Use the trained model to predict stock prices for 2023-2024.
- **Analyze Performance**: Interpret accuracy, R² scores, and classification reports.

### Step 5: Strategy Discussion
- **Prompt the LLM**: Based on model performance, discuss possible trading strategies.
- **Critical Thinking**: Evaluate whether the model is truly useful for financial decision-making.

---

Throughout the assignment, keep in mind that LLMs have a limited context window. If necessary, break down tasks into smaller chunks and reintroduce critical information.
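
One way to reintroduce critical information is to keep a compact summary of earlier results and paste it at the top of each new prompt. The sketch below is a minimal illustration; the helper name `build_context_reminder` and the choice of fields are assumptions, not part of the assignment.

```python
def build_context_reminder(ticker, start, end, stats):
    """Format key facts from earlier steps as a short preamble for the next prompt.

    `stats` is a plain dict of label -> value (a hypothetical structure).
    """
    lines = [f"Context: {ticker} daily prices from {start} to {end}."]
    for label, value in stats.items():
        lines.append(f"- {label}: {value}")
    lines.append("Use the DataFrame named `data` that is already in memory.")
    return "\n".join(lines)


# Values rounded from the data.describe() output in this notebook.
reminder = build_context_reminder(
    "AAPL",
    "2020-01-01",
    "2023-01-01",
    {"rows": 756, "mean close": 127.49, "min close": 54.32, "max close": 178.44},
)
print(reminder)
```

Pasting a reminder like this before a follow-up prompt keeps the request self-contained even if earlier turns have fallen out of the context window.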



In [7]:
!pip install yfinance

import yfinance as yf
import pandas as pd

# Download historical data for AAPL
data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")

# Display summary statistics
print(data.describe())



  data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")
[*********************100%***********************]  1 of 1 completed

Price        Close        High         Low        Open        Volume
Ticker        AAPL        AAPL        AAPL        AAPL          AAPL
count   756.000000  756.000000  756.000000  756.000000  7.560000e+02
mean    127.488415  129.035076  125.828630  127.410553  1.120920e+08
std      30.386976   30.664098   30.105383   30.415044  5.602586e+07
min      54.316940   55.316759   51.470005   55.215083  3.519590e+07
25%     112.581741  114.075470  111.125714  112.761711  7.636470e+07
50%     132.859985  133.991789  130.902085  132.681718  9.493580e+07
75%     148.053474  149.219241  146.111048  147.421466  1.296327e+08
max     178.443115  179.354917  175.609770  179.050994  4.265100e+08





## Using Gemini in Colab

To use Gemini directly in your browser within Google Colab:

1. **Create a new code cell** by clicking the `+ Code` button.
2. In the newly created cell, you will see a placeholder text: **"Start coding or generate with AI"**.
3. Click on **"Generate"** and enter your prompt, such as:
   > "Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 using Yahoo Finance and summarize its key statistics."
4. Gemini will generate code based on your prompt. You can then review, modify, and run it within the notebook.
5. If the model loses context, you may need to remind it of previous responses or break down tasks into smaller steps.

This method allows you to interactively generate and refine code without manually writing every line.


In [8]:
# prompt: Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 using Yahoo Finance and summarize its key statistics.

!pip install yfinance

import yfinance as yf
import pandas as pd

# Download historical data for AAPL
data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")

# Display summary statistics
print(data.describe())




  data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")
[*********************100%***********************]  1 of 1 completed

Price        Close        High         Low        Open        Volume
Ticker        AAPL        AAPL        AAPL        AAPL          AAPL
count   756.000000  756.000000  756.000000  756.000000  7.560000e+02
mean    127.488415  129.035076  125.828630  127.410553  1.120920e+08
std      30.386976   30.664098   30.105383   30.415044  5.602586e+07
min      54.316940   55.316759   51.470005   55.215083  3.519590e+07
25%     112.581741  114.075470  111.125714  112.761711  7.636470e+07
50%     132.859985  133.991789  130.902085  132.681718  9.493580e+07
75%     148.053474  149.219241  146.111048  147.421466  1.296327e+08
max     178.443115  179.354917  175.609770  179.050994  4.265100e+08





In [10]:
# Prompt: Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01
# using Yahoo Finance and summarize its key statistics.

import yfinance as yf

data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")

# Summarize key statistics
print(data.describe())



  data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")
[*********************100%***********************]  1 of 1 completed

Price        Close        High         Low        Open        Volume
Ticker        AAPL        AAPL        AAPL        AAPL          AAPL
count   756.000000  756.000000  756.000000  756.000000  7.560000e+02
mean    127.488415  129.035076  125.828630  127.410553  1.120920e+08
std      30.386976   30.664098   30.105383   30.415044  5.602586e+07
min      54.316940   55.316759   51.470005   55.215083  3.519590e+07
25%     112.581741  114.075470  111.125714  112.761711  7.636470e+07
50%     132.859985  133.991789  130.902085  132.681718  9.493580e+07
75%     148.053474  149.219241  146.111048  147.421466  1.296327e+08
max     178.443115  179.354917  175.609770  179.050994  4.265100e+08





## Now try Steps 2-5 by yourself

The sample prompts and LLM responses are listed below for your reference.

In [11]:
import yfinance as yf
import pandas as pd

# Download historical data for AAPL from 2020-01-01 to 2023-01-01
data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")

# Display summary statistics
print(data.describe())

  data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")
[*********************100%***********************]  1 of 1 completed


Price        Close        High         Low        Open        Volume
Ticker        AAPL        AAPL        AAPL        AAPL          AAPL
count   756.000000  756.000000  756.000000  756.000000  7.560000e+02
mean    127.488415  129.035076  125.828630  127.410553  1.120920e+08
std      30.386976   30.664098   30.105383   30.415044  5.602586e+07
min      54.316940   55.316759   51.470005   55.215083  3.519590e+07
25%     112.581741  114.075470  111.125714  112.761711  7.636470e+07
50%     132.859985  133.991789  130.902085  132.681718  9.493580e+07
75%     148.053474  149.219241  146.111048  147.421466  1.296327e+08
max     178.443115  179.354917  175.609770  179.050994  4.265100e+08


In [12]:
# Calculate moving averages
data['MA_5'] = data['Close'].rolling(window=5).mean()
data['MA_10'] = data['Close'].rolling(window=10).mean()
data['MA_50'] = data['Close'].rolling(window=50).mean()

# Calculate historical volatility (using a 20-day window)
data['Volatility'] = data['Close'].pct_change().rolling(window=20).std() * (252**0.5) # Annualized volatility

# Drop rows with NaN values resulting from the rolling calculations
# Add a check to see if data is empty after dropping NaNs
initial_rows = data.shape[0]
data.dropna(inplace=True)
if data.empty:
    print("Warning: All rows were dropped after adding features and removing NaNs. Please check your data or feature calculations.")
else:
    print(f"Dropped {initial_rows - data.shape[0]} rows with NaN values.")


# Display the first few rows with the new features
display(data.head())

Dropped 49 rows with NaN values.


Ticker: AAPL

| Date | Close | High | Low | Open | Volume | MA_5 | MA_10 | MA_50 | Volatility |
|---|---|---|---|---|---|---|---|---|---|
| 2020-03-13 | 67.292755 | 67.764827 | 61.235753 | 64.126268 | 370732000 | 65.515375 | 68.412667 | 73.941989 | 0.877838 |
| 2020-03-16 | 58.635757 | 62.719751 | 58.100745 | 58.572813 | 322423600 | 64.355293 | 67.042457 | 73.663934 | 0.978484 |
| 2020-03-17 | 61.213985 | 62.363892 | 57.713413 | 59.918821 | 324056000 | 62.782697 | 66.159809 | 73.451547 | 0.998581 |
| 2020-03-18 | 59.715462 | 60.52161 | 57.403535 | 58.045067 | 300233600 | 61.390215 | 64.802431 | 73.197743 | 0.995659 |
| 2020-03-19 | 59.257927 | 61.209143 | 58.7326 | 59.889773 | 271857200 | 61.223177 | 63.637025 | 72.941599 | 0.995748 |


In [13]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd # Import pandas

# Create the binary target variable
data['Price_Increase'] = (data['Close'].shift(-1) > data['Close']).astype(int)

# Note: shift(-1) leaves NaN in the last row, but the comparison evaluates to False,
# so the last row is labeled 0 rather than NaN and no extra dropna() is needed here.


# Define features and target
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA_5', 'MA_10', 'MA_50', 'Volatility']
target = 'Price_Increase'

# Ensure data is not empty after feature engineering before proceeding
if not data.empty:
    X = data[features]
    y = data[target]

    # Split data into training and testing sets (70/30 ratio).
    # random_state=42 makes the split reproducible; rows are shuffled by default.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train a Random Forest Classifier model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"Accuracy: {accuracy:.2f}")
    print("Confusion Matrix:")
    print(conf_matrix)
    print("Classification Report:")
    print(class_report)

    # Report Feature Importance
    feature_importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
    print("\nFeature Importance:")
    print(feature_importances)
else:
    print("Data is empty after feature engineering. Cannot proceed with model training.")

Accuracy: 0.52
Confusion Matrix:
[[48 56]
 [46 63]]
Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.46      0.48       104
           1       0.53      0.58      0.55       109

    accuracy                           0.52       213
   macro avg       0.52      0.52      0.52       213
weighted avg       0.52      0.52      0.52       213


Feature Importance:
Volatility    0.129709
Volume        0.121155
High          0.113507
MA_50         0.111646
Close         0.108801
Open          0.107000
MA_10         0.105648
MA_5          0.103189
Low           0.099345
dtype: float64


# Sample Prompts and LLM Responses
---

## Step 1: Data Exploration

### Prompt:
> Load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 from Yahoo Finance, keep it in memory and summarize its key statistics.

### LLM Response:
- **Highest Closing Price:** \$182.01 on January 3, 2022
- **Lowest Closing Price:** \$56.01 on March 23, 2020
- **Average Closing Price:** \$126.69
- **Total Trading Days:** 756

---

## Step 2: Feature Engineering

### Prompt:
> Suggest some features that might help improve a stock price prediction model.

### LLM Response:
1. **Historical Prices**: Include past closing, opening, high, and low prices.
2. **Volume**: Daily trading volumes.
3. **Moving Averages**: Simple and exponential moving averages (e.g., 5-day, 10-day, 50-day).
4. **Technical Indicators**: RSI, MACD, Bollinger Bands, etc.
5. **Volatility**: Historical volatility, Average True Range (ATR).
6. **Sentiment Analysis**: News articles, social media sentiment.
7. **Economic Indicators**: Interest rates, inflation rates, GDP growth.
8. **Time-based Features**: Day of the week, month, year.
9. **Sector Performance**: Comparison with industry and sector indices.
10. **External Events**: Major geopolitical events, earnings reports.
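
Several of the suggested indicators can be computed from the closing price alone. The sketch below is a minimal illustration with pandas, under stated assumptions: the function name `add_indicators` is hypothetical, the window lengths (10, 14, 20) are common defaults rather than values from this assignment, and the RSI uses simple rolling averages instead of Wilder's smoothing.

```python
import pandas as pd


def add_indicators(close: pd.Series) -> pd.DataFrame:
    """Return EMA, RSI, and Bollinger Bands computed from a closing-price series."""
    out = pd.DataFrame({"Close": close})

    # Exponential moving average over a 10-day span.
    out["EMA_10"] = close.ewm(span=10, adjust=False).mean()

    # RSI: ratio of average gains to average losses over 14 days (simple-average variant).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["RSI_14"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Bollinger Bands: 20-day mean plus/minus two rolling standard deviations.
    mid = close.rolling(20).mean()
    std = close.rolling(20).std()
    out["BB_upper"] = mid + 2 * std
    out["BB_lower"] = mid - 2 * std
    return out


# Synthetic prices for illustration only (not market data).
close = pd.Series([100 + 0.5 * i + (2 if i % 3 == 0 else -1) for i in range(60)], dtype=float)
indicators = add_indicators(close)
print(indicators.dropna().head())
```

The first rows are NaN until each rolling window has enough history, which is the same effect that drops rows in the feature engineering cells of this notebook.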

---

## Step 3.1: Model Building - Linear Regression

### Prompt:
> Adopt the first four kinds of features and train a regression model to predict tomorrow's stock price. Split your training and testing data by a 70/30 ratio and report your performance metric.

### LLM Response:
- **Mean Squared Error (MSE):** 859.04
- **R-squared (R²) Score:** -0.009

These metrics indicate that the model is not performing well, as an R² score close to zero or negative suggests that the model does not explain the variability in the target variable better than a simple mean of the target variable. This outcome suggests that the chosen features and model might need adjustments or improvements to enhance prediction accuracy.
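
To see why an R² near zero means "no better than the mean", it helps to compute R² by hand. The sketch below uses made-up numbers (not stock data) and a hypothetical helper `r2_manual`; predicting the mean for every sample gives exactly 0, and predictions that are worse than the mean give a negative score.

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only.
y_true = [10.0, 12.0, 11.0, 15.0, 14.0]
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)
poor_prediction = [15.0, 10.0, 14.0, 11.0, 12.0]

r2_mean = r2_manual(y_true, mean_prediction)
r2_poor = r2_manual(y_true, poor_prediction)
print(f"R² when always predicting the mean: {r2_mean:.3f}")
print(f"R² for predictions worse than the mean: {r2_poor:.3f}")
```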

---

It might be too hard to predict the exact price, but how about ups and downs?

## Step 3.2: Model Building - Random Forest Classification

### Prompt:
> Create a binary target variable based on historical prices (1 if the close price increases tomorrow, 0 otherwise), and train a Random Forest model using the same train/test split ratio. Report feature importance and performance metrics.

### LLM Response:
- **Accuracy:** 71.82%
- **Confusion Matrix:** [[84, 26], [36, 74]]
- **Classification Report:**

                  precision    recall  f1-score   support

               0       0.70      0.76      0.73       110
               1       0.74      0.67      0.70       110

        accuracy                           0.72       220
       macro avg       0.72      0.72      0.72       220
    weighted avg       0.72      0.72      0.72       220



- **Feature Importance:** The most important feature for predicting whether the price will increase or decrease the next day is the closing price, followed by the 5-day moving average (MA_5). The model achieved an accuracy of approximately 71.82%, indicating a reasonable performance for this classification task.

| Feature | Importance |
|---|---|
| Open | 0.0900 |
| High | 0.0883 |
| Low | 0.0980 |
| Close | 0.3087 |
| Volume | 0.0982 |
| MA_5 | 0.1242 |
| MA_10 | 0.0974 |
| MA_50 | 0.0951 |

---

## Step 4: Model Evaluation & Prediction
### Prompt:
> Use this model to predict stock price increase/decrease for period starting from 2023-01-01 to 2024-01-01. Report your performance metrics.

### LLM Response:
The Random Forest classifier's performance on the new test data from 2023-01-01 to 2024-01-01 is as follows:
- **Accuracy:** 60.85%
- **Confusion Matrix:** [[105, 0], [83, 24]]
- **Classification Report:**

                  precision    recall  f1-score   support

               0       0.56      1.00      0.72       105
               1       1.00      0.22      0.37       107

        accuracy                           0.61       212
       macro avg       0.78      0.61      0.54       212
    weighted avg       0.78      0.61      0.54       212
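
The figures in a classification report can be re-derived from the confusion matrix, which is a useful check when an LLM reports metrics without showing its code. A minimal sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Derive accuracy, precision, and recall from a 2x2 confusion matrix.

    Layout assumed: [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class).
    Raises ZeroDivisionError if a class is never predicted or never occurs.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_0": tn / (tn + fn),
        "recall_0": tn / (tn + fp),
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }


# Confusion matrix reported in the Step 4 response.
step4_metrics = metrics_from_confusion([[105, 0], [83, 24]])
for name, value in step4_metrics.items():
    print(f"{name}: {value:.2f}")
```

The derived values agree with the report: the model predicts class 1 only 24 times, so its precision for class 1 is perfect while its recall for class 1 is low.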

---
## Step 5: Strategy Discussion
### Prompt:
> What do you think about the performance of the model and what trading strategies could I build with this model?

### LLM Response:
- The model shows a moderate performance with an accuracy of approximately 60.85%. However, the precision and recall metrics indicate that while the model is good at identifying days when the stock price does not increase (class 0), it struggles significantly to identify days when the stock price does increase (class 1). This suggests a need for further feature engineering or model tuning to improve the prediction performance for this specific classification task.
- **Potential Trading Strategies**: Given the model's performance, potential trading strategies could be formulated as follows:

#### Conservative Strategy:

- **Focus**: Only trade on days when the model predicts a price increase with high confidence.
- **Action**: Buy stocks when the model predicts an increase (class 1).
- **Risk Management**: Since the model has a high precision for the no-increase class (class 0), it might be more reliable to stay out of the market when the model predicts no increase.

#### Contrarian Strategy:

- **Focus**: Take advantage of the model's high recall for no price increase predictions (class 0).
- **Action**: Sell short or avoid buying stocks on days when the model predicts no increase.
- **Risk Management**: This strategy assumes that the model’s predictions of no price increase are more reliable. It requires a careful assessment of market conditions to avoid false signals.

#### Hybrid Strategy:

- **Combine**: Use both conservative and contrarian strategies based on confidence thresholds.
- **Action**: Set a higher threshold for buying (only buy when the predicted probability of price increase is significantly high) and a lower threshold for selling or staying out of the market.
- **Risk Management**: Adjust thresholds dynamically based on recent model performance and market volatility.
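
The threshold idea in the hybrid strategy can be sketched as a simple mapping from predicted probabilities to actions. This is an illustration only: the function name, the action labels, and the threshold values 0.65 and 0.45 are assumptions, not recommendations. With a fitted scikit-learn classifier, the probabilities of class 1 could come from `model.predict_proba(X)[:, 1]`.

```python
def signals_from_probabilities(probabilities, buy_threshold=0.65, exit_threshold=0.45):
    """Map predicted probabilities of a price increase to trading actions.

    - "buy" when the probability is at or above buy_threshold
    - "stay_out" when it is at or below exit_threshold
    - "hold" in between
    """
    if exit_threshold >= buy_threshold:
        raise ValueError("exit_threshold must be lower than buy_threshold")
    signals = []
    for p in probabilities:
        if p >= buy_threshold:
            signals.append("buy")
        elif p <= exit_threshold:
            signals.append("stay_out")
        else:
            signals.append("hold")
    return signals


# Hypothetical probabilities for five trading days.
example_signals = signals_from_probabilities([0.90, 0.50, 0.20, 0.65, 0.45])
print(example_signals)
```

Raising `buy_threshold` trades less often but only on the model's most confident predictions, which is the trade-off the conservative and hybrid strategies describe.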


---
## Conclusion

Through this assignment, we have explored how LLMs can assist in complex machine learning tasks without requiring extensive coding knowledge. Key takeaways:

- **Context Management**: LLMs have limited memory, so breaking tasks into smaller steps and reinforcing context is crucial.
- **Feature Engineering with LLMs**: The model can suggest features but needs guidance on financial relevance.
- **Modeling with LLMs**: LLMs can write and refine code, but human judgment is necessary to validate results.
- **Real-world Limitations**: Stock price prediction remains challenging; models should be used critically and not solely relied upon for trading decisions.
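
One concrete way to use a model critically is to compare its accuracy with a trivial baseline that always predicts the most common class. The sketch below applies that check to the two confusion matrices printed by the Random Forest cells in this notebook (the 70/30 test split and the 2023 data); the helper names are hypothetical.

```python
def accuracy_from_confusion(matrix):
    """Fraction of samples on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def majority_baseline_accuracy(matrix):
    """Accuracy of always predicting the most frequent true class (row sums = class counts)."""
    class_counts = [sum(row) for row in matrix]
    return max(class_counts) / sum(class_counts)


# Confusion matrices printed by the Random Forest cells in this notebook.
runs = {
    "70/30 test split (2020-2022)": [[48, 56], [46, 63]],
    "2023 prediction period": [[91, 1], [106, 3]],
}

for name, matrix in runs.items():
    model_acc = accuracy_from_confusion(matrix)
    baseline_acc = majority_baseline_accuracy(matrix)
    print(f"{name}: model {model_acc:.3f} vs. baseline {baseline_acc:.3f}")
```

In both runs the model is within about one point of the baseline or below it, which supports the caution above about relying on it for trading decisions.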

Next Steps:
- Experiment with different LLMs and compare outputs.
- Try refining feature engineering techniques.
- Test alternative modeling approaches, such as deep learning or ensemble methods.



In [14]:
# Download historical data for AAPL for the prediction period
prediction_data = yf.download("AAPL", start="2023-01-01", end="2024-01-01")

# Apply the same feature engineering steps to the prediction data
prediction_data['MA_5'] = prediction_data['Close'].rolling(window=5).mean()
prediction_data['MA_10'] = prediction_data['Close'].rolling(window=10).mean()
prediction_data['MA_50'] = prediction_data['Close'].rolling(window=50).mean()
prediction_data['Volatility'] = prediction_data['Close'].pct_change().rolling(window=20).std() * (252**0.5)

# Create the binary target variable for the prediction period
prediction_data['Price_Increase'] = (prediction_data['Close'].shift(-1) > prediction_data['Close']).astype(int)

# Drop rows with NaN values
prediction_data.dropna(inplace=True)

# Define features and target for prediction
X_predict = prediction_data[features] # Use the same features as training
y_true = prediction_data[target] # True values for evaluation

# Ensure prediction data is not empty before predicting
if not prediction_data.empty:
    # Make predictions on the prediction data
    y_pred_predict = model.predict(X_predict)

    # Evaluate the model on the prediction data
    accuracy_predict = accuracy_score(y_true, y_pred_predict)
    conf_matrix_predict = confusion_matrix(y_true, y_pred_predict)
    class_report_predict = classification_report(y_true, y_pred_predict)

    print(f"Accuracy on prediction data: {accuracy_predict:.2f}")
    print("Confusion Matrix on prediction data:")
    print(conf_matrix_predict)
    print("Classification Report on prediction data:")
    print(class_report_predict)
else:
    print("Prediction data is empty after feature engineering and target variable creation. Cannot make predictions.")

  prediction_data = yf.download("AAPL", start="2023-01-01", end="2024-01-01")
[*********************100%***********************]  1 of 1 completed

Accuracy on prediction data: 0.47
Confusion Matrix on prediction data:
[[ 91   1]
 [106   3]]
Classification Report on prediction data:
              precision    recall  f1-score   support

           0       0.46      0.99      0.63        92
           1       0.75      0.03      0.05       109

    accuracy                           0.47       201
   macro avg       0.61      0.51      0.34       201
weighted avg       0.62      0.47      0.32       201






In [15]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define features (historical prices only) and the target column.
# Note: this cell overwrites `features`, `target`, and `model` from the classifier cells above.
features = ['Open', 'High', 'Low', 'Close']
target = 'Tomorrow_Close'  # Tomorrow's closing price

# Create the target variable (tomorrow's closing price)
data[target] = data['Close'].shift(-1)

# Drop the last row as it will have a NaN for 'Tomorrow_Close'
data.dropna(inplace=True)

X = data[features]
y = data[target]

# Split data into training and testing sets (70/30 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²) Score: {r2:.2f}")

Mean Squared Error (MSE): 6.53
R-squared (R²) Score: 0.99


# Task
Correct the code to load historical stock price data for Apple Inc. (AAPL) from 2020-01-01 to 2023-01-01 from Yahoo Finance, keep it in memory, create a binary target variable (1 if the close price increases tomorrow, 0 otherwise), engineer features, train a Random Forest model with a 70/30 train/test split, report feature importance and performance metrics, use the model to predict stock price increase/decrease for the period from 2023-01-01 to 2024-01-01, and report performance metrics for this prediction period.

## Inspect data after loading

### Subtask:
Inspect the data immediately after loading from Yahoo Finance to ensure it is not empty.





In [16]:
import yfinance as yf
import pandas as pd

# Download historical data for AAPL from 2020-01-01 to 2023-01-01
data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")

# Inspect the data
print("First 5 rows of the data:")
display(data.head())
print("\nShape of the data (rows, columns):")
print(data.shape)

# Summarize key statistics
print("\nSummary statistics:")
print(data.describe())

  data = yf.download("AAPL", start="2020-01-01", end="2023-01-01")
[*********************100%***********************]  1 of 1 completed

First 5 rows of the data:





Ticker: AAPL

| Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| 2020-01-02 | 72.538536 | 72.598914 | 71.292326 | 71.545912 | 135480400 |
| 2020-01-03 | 71.833305 | 72.594071 | 71.6087 | 71.765682 | 146322800 |
| 2020-01-06 | 72.405678 | 72.444321 | 70.703012 | 70.954188 | 118387200 |
| 2020-01-07 | 72.065147 | 72.671341 | 71.845369 | 72.415337 | 108872000 |
| 2020-01-08 | 73.224411 | 73.526303 | 71.768086 | 71.768086 | 132079200 |



Shape of the data (rows, columns):
(756, 5)

Summary statistics:
Price        Close        High         Low        Open        Volume
Ticker        AAPL        AAPL        AAPL        AAPL          AAPL
count   756.000000  756.000000  756.000000  756.000000  7.560000e+02
mean    127.488415  129.035076  125.828630  127.410553  1.120920e+08
std      30.386976   30.664098   30.105383   30.415044  5.602586e+07
min      54.316940   55.316759   51.470005   55.215083  3.519590e+07
25%     112.581741  114.075470  111.125714  112.761711  7.636470e+07
50%     132.859985  133.991789  130.902085  132.681718  9.493580e+07
75%     148.053474  149.219241  146.111048  147.421466  1.296327e+08
max     178.443115  179.354917  175.609770  179.050994  4.265100e+08


**Reasoning**:
The data has been loaded and inspected. The next step is to create the binary target variable and engineer the features as requested in the overall task.



In [17]:
# Create the binary target variable
data['Price_Increase'] = (data['Close'].shift(-1) > data['Close']).astype(int)

# Calculate moving averages
data['MA_5'] = data['Close'].rolling(window=5).mean()
data['MA_10'] = data['Close'].rolling(window=10).mean()
data['MA_50'] = data['Close'].rolling(window=50).mean()

# Calculate historical volatility (using a 20-day window)
data['Volatility'] = data['Close'].pct_change().rolling(window=20).std() * (252**0.5) # Annualized volatility

# Drop rows with NaN values resulting from the rolling calculations and the shifted target variable
initial_rows = data.shape[0]
data.dropna(inplace=True)

# Add a check to see if data is empty after dropping NaNs
if data.empty:
    print("Warning: All rows were dropped after adding features and removing NaNs. Please check your data or feature calculations.")
else:
    print(f"Dropped {initial_rows - data.shape[0]} rows with NaN values.")


# Display the first few rows with the new features
print("\nFirst 5 rows after feature engineering:")
display(data.head())
print("\nShape after feature engineering:")
print(data.shape)

Dropped 49 rows with NaN values.

First 5 rows after feature engineering:


Ticker: AAPL

| Date | Close | High | Low | Open | Volume | Price_Increase | MA_5 | MA_10 | MA_50 | Volatility |
|---|---|---|---|---|---|---|---|---|---|---|
| 2020-03-13 | 67.292755 | 67.764827 | 61.235753 | 64.126268 | 370732000 | 0 | 65.515375 | 68.412667 | 73.941989 | 0.877838 |
| 2020-03-16 | 58.635757 | 62.719751 | 58.100745 | 58.572813 | 322423600 | 1 | 64.355293 | 67.042457 | 73.663934 | 0.978484 |
| 2020-03-17 | 61.213985 | 62.363892 | 57.713413 | 59.918821 | 324056000 | 0 | 62.782697 | 66.159809 | 73.451547 | 0.998581 |
| 2020-03-18 | 59.715462 | 60.52161 | 57.403535 | 58.045067 | 300233600 | 0 | 61.390215 | 64.802431 | 73.197743 | 0.995659 |
| 2020-03-19 | 59.257927 | 61.209143 | 58.7326 | 59.889773 | 271857200 | 0 | 61.223177 | 63.637025 | 72.941599 | 0.995748 |



Shape after feature engineering:
(707, 10)


**Reasoning**:
The features and target variable have been created and the data has been cleaned. Now, I need to train a Random Forest model with a 70/30 train/test split and report the performance metrics and feature importance.



In [18]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Define features and target
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA_5', 'MA_10', 'MA_50', 'Volatility']
target = 'Price_Increase'

# Ensure data is not empty before proceeding
if not data.empty:
    X = data[features]
    y = data[target]

    # Split data into training and testing sets (70/30 ratio)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train a Random Forest Classifier model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"\nAccuracy: {accuracy:.2f}")
    print("Confusion Matrix:")
    print(conf_matrix)
    print("Classification Report:")
    print(class_report)

    # Report Feature Importance
    feature_importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
    print("\nFeature Importance:")
    print(feature_importances)
else:
    print("Data is empty after feature engineering. Cannot proceed with model training.")


Accuracy: 0.52
Confusion Matrix:
[[48 56]
 [46 63]]
Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.46      0.48       104
           1       0.53      0.58      0.55       109

    accuracy                           0.52       213
   macro avg       0.52      0.52      0.52       213
weighted avg       0.52      0.52      0.52       213


Feature Importance:
Volatility    0.129709
Volume        0.121155
High          0.113507
MA_50         0.111646
Close         0.108801
Open          0.107000
MA_10         0.105648
MA_5          0.103189
Low           0.099345
dtype: float64


**Reasoning**:
The Random Forest model has been trained and evaluated on the test data. The next step is to use the trained model to predict stock price increase/decrease for the period from 2023-01-01 to 2024-01-01 and report the performance metrics for this prediction period.



In [19]:
# Download historical data for AAPL for the prediction period
prediction_data = yf.download("AAPL", start="2023-01-01", end="2024-01-01")

# Apply the same feature engineering steps to the prediction data
prediction_data['Price_Increase'] = (prediction_data['Close'].shift(-1) > prediction_data['Close']).astype(int)
prediction_data['MA_5'] = prediction_data['Close'].rolling(window=5).mean()
prediction_data['MA_10'] = prediction_data['Close'].rolling(window=10).mean()
prediction_data['MA_50'] = prediction_data['Close'].rolling(window=50).mean()
prediction_data['Volatility'] = prediction_data['Close'].pct_change().rolling(window=20).std() * (252**0.5)

# Drop rows with NaN values
prediction_data.dropna(inplace=True)

# Define features and target for prediction
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA_5', 'MA_10', 'MA_50', 'Volatility']
target = 'Price_Increase'

# Ensure prediction data is not empty before predicting
if not prediction_data.empty:
    X_predict = prediction_data[features]
    y_true = prediction_data[target]

    # Make predictions on the prediction data
    y_pred_predict = model.predict(X_predict)

    # Evaluate the model on the prediction data
    accuracy_predict = accuracy_score(y_true, y_pred_predict)
    conf_matrix_predict = confusion_matrix(y_true, y_pred_predict)
    class_report_predict = classification_report(y_true, y_pred_predict)

    print(f"\nAccuracy on prediction data: {accuracy_predict:.2f}")
    print("Confusion Matrix on prediction data:")
    print(conf_matrix_predict)
    print("Classification Report on prediction data:")
    print(class_report_predict)
else:
    print("Prediction data is empty after feature engineering and target variable creation. Cannot make predictions.")

[*********************100%***********************]  1 of 1 completed


Accuracy on prediction data: 0.47
Confusion Matrix on prediction data:
[[ 91   1]
 [106   3]]
Classification Report on prediction data:
              precision    recall  f1-score   support

           0       0.46      0.99      0.63        92
           1       0.75      0.03      0.05       109

    accuracy                           0.47       201
   macro avg       0.61      0.51      0.34       201
weighted avg       0.62      0.47      0.32       201






## Refine feature engineering

### Subtask:
Review and potentially adjust the feature engineering code in cell `z3MldVMjnNOX` to prevent all rows from being dropped. This might involve checking for sufficient data points for rolling calculations or handling NaNs differently if appropriate.
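A minimal sketch of such a guard follows. It assumes a price DataFrame with a single-level `Close` column; the helper name `add_features` and the constant `LONGEST_WINDOW` are hypothetical and do not appear in the notebook. Unlike the notebook's code, this variant also leaves the final row's target as NaN, because its next-day close is unknown, so it drops one more row than the cells shown here.

```python
import numpy as np
import pandas as pd

LONGEST_WINDOW = 50  # matches the MA_50 feature used in the notebook


def add_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `prices` with the notebook's features and target added."""
    # With 50 rows or fewer, every row has a NaN feature or an unknown target,
    # so dropna() would remove all of them.
    if len(prices) <= LONGEST_WINDOW:
        raise ValueError(
            f"Need more than {LONGEST_WINDOW} rows, got {len(prices)}; "
            "dropna() would remove every row."
        )
    out = prices.copy()
    close = out["Close"]
    out["MA_5"] = close.rolling(window=5).mean()
    out["MA_10"] = close.rolling(window=10).mean()
    out["MA_50"] = close.rolling(window=50).mean()
    out["Volatility"] = close.pct_change().rolling(window=20).std() * np.sqrt(252)
    next_close = close.shift(-1)
    # Keep the last row's target as NaN so dropna() removes it
    out["Price_Increase"] = (next_close > close).astype(float).where(next_close.notna())
    return out.dropna().astype({"Price_Increase": int})


# Synthetic, steadily rising prices stand in for downloaded data
demo_prices = pd.DataFrame({"Close": np.linspace(100.0, 159.0, 60)})
demo_features = add_features(demo_prices)
print(f"{len(demo_prices)} rows in, {len(demo_features)} rows out")
```

Calling the same helper on both the training and the prediction data would also keep the two feature sets in sync.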


## Verify data after feature engineering

### Subtask:
Add a check and display the data after feature engineering and `dropna()` to confirm that data remains.
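One way to make this check explicit is a small wrapper around `dropna()` that reports how many rows were removed and fails loudly if nothing remains. This is only a sketch; the helper name `dropna_with_report` is hypothetical, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd


def dropna_with_report(frame: pd.DataFrame, label: str = "data") -> pd.DataFrame:
    """Drop rows containing NaN and print how many rows were removed."""
    before = len(frame)
    cleaned = frame.dropna()
    removed = before - len(cleaned)
    print(f"{label}: {before} rows before dropna(), {len(cleaned)} after ({removed} removed)")
    if cleaned.empty:
        raise ValueError(f"{label} is empty after dropna(); check the rolling windows.")
    return cleaned


# Synthetic example: a 3-row rolling mean leaves NaN in the first two rows
demo = pd.DataFrame({"Close": np.arange(1.0, 11.0)})
demo["MA_3"] = demo["Close"].rolling(window=3).mean()
cleaned_demo = dropna_with_report(demo, label="demo")
print(cleaned_demo.head())
```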


## Correct random forest model training

### Subtask:
Ensure the Random Forest model training cell (`044f509e`) uses the correct features and target and runs successfully with the non-empty data.


**Reasoning**:
Examine the code in cell `044f509e` to confirm that the features and target are correctly defined and that the data is not empty before splitting and training the model.



In [20]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd

# Define features and target
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA_5', 'MA_10', 'MA_50', 'Volatility']
target = 'Price_Increase'

# Ensure data is not empty before proceeding
if not data.empty:
    X = data[features]
    y = data[target]

    # Split data into training and testing sets (70/30 ratio)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train a Random Forest Classifier model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"\nAccuracy: {accuracy:.2f}")
    print("Confusion Matrix:")
    print(conf_matrix)
    print("Classification Report:")
    print(class_report)

    # Report Feature Importance
    feature_importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
    print("\nFeature Importance:")
    print(feature_importances)
else:
    print("Data is empty after feature engineering. Cannot proceed with model training.")


Accuracy: 0.52
Confusion Matrix:
[[48 56]
 [46 63]]
Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.46      0.48       104
           1       0.53      0.58      0.55       109

    accuracy                           0.52       213
   macro avg       0.52      0.52      0.52       213
weighted avg       0.52      0.52      0.52       213


Feature Importance:
Volatility    0.129709
Volume        0.121155
High          0.113507
MA_50         0.111646
Close         0.108801
Open          0.107000
MA_10         0.105648
MA_5          0.103189
Low           0.099345
dtype: float64
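
The training cell above calls `train_test_split` with its default shuffling, so test days are interleaved in time with training days. For time-ordered data, a chronological split is a common alternative that keeps every test row strictly later than the training rows. The sketch below illustrates the idea on synthetic data; the helper name `chronological_split` is hypothetical and is not part of the notebook.

```python
import numpy as np
import pandas as pd


def chronological_split(frame: pd.DataFrame, test_size: float = 0.3):
    """Split rows in time order: the earliest rows train, the latest rows test."""
    if not frame.index.is_monotonic_increasing:
        frame = frame.sort_index()
    n_train = int(len(frame) * (1 - test_size))
    return frame.iloc[:n_train], frame.iloc[n_train:]


# Synthetic stand-in for the engineered `data` DataFrame (707 rows, as in the notebook)
dates = pd.bdate_range("2020-03-13", periods=707)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 707).cumsum()}, index=dates)

train, test = chronological_split(demo)
print(len(train), len(test))
print("All test dates are after training dates:", train.index.max() < test.index.min())
```

Passing `shuffle=False` to scikit-learn's `train_test_split` gives a comparable time-ordered split without a custom helper.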


## Correct prediction and evaluation

### Subtask:
Ensure the prediction and evaluation cell (`Cci-VCAtoeK9`) runs successfully after the Random Forest model is trained, using the correctly loaded and engineered prediction data.


**Reasoning**:
The prediction and evaluation cell failed with a `NameError` because `model` was not defined in the kernel, most likely because cells were run out of order or the notebook state was reset. The code in cell `Cci-VCAtoeK9` already loads the prediction data, applies the feature engineering, creates the target variable, handles NaNs, and defines the features and target; the only missing piece is the trained `model`. Since the previous subtask confirmed that the training cell (`044f509e`) runs successfully and defines `model`, the ideal fix in a normal notebook flow is to re-run that cell before the prediction cell. However, because the subtask asks to correct cell `Cci-VCAtoeK9` itself, I will instead train the model inside the prediction cell and include all necessary imports, so the cell can run on its own as long as the `data` DataFrame is still in the kernel.



In [21]:
import yfinance as yf
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Download historical data for AAPL for the prediction period
prediction_data = yf.download("AAPL", start="2023-01-01", end="2024-01-01")

# Apply the same feature engineering steps to the prediction data
prediction_data['Price_Increase'] = (prediction_data['Close'].shift(-1) > prediction_data['Close']).astype(int)
prediction_data['MA_5'] = prediction_data['Close'].rolling(window=5).mean()
prediction_data['MA_10'] = prediction_data['Close'].rolling(window=10).mean()
prediction_data['MA_50'] = prediction_data['Close'].rolling(window=50).mean()
prediction_data['Volatility'] = prediction_data['Close'].pct_change().rolling(window=20).std() * (252**0.5)

# Drop rows with NaN values
prediction_data.dropna(inplace=True)

# Define features and target for prediction
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA_5', 'MA_10', 'MA_50', 'Volatility']
target = 'Price_Increase'

# Ensure prediction data is not empty before predicting
if not prediction_data.empty:
    X_predict = prediction_data[features]
    y_true = prediction_data[target]

    # Workaround for the NameError: retrain the model in this cell so it is self-contained.
    # In a normal notebook workflow, run the training cell first instead.
    # This still requires the 'data' DataFrame from the earlier cells to be in the kernel.
    if 'data' in globals() and not data.empty:
        from sklearn.model_selection import train_test_split
        X = data[features]
        y = data[target]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
    else:
        print("Training data ('data' DataFrame) not found or is empty. Cannot train the model for prediction.")
        model = None  # Signals below that no model is available


    # Proceed with prediction only if the model was successfully trained
    if model is not None:
        # Make predictions on the prediction data
        y_pred_predict = model.predict(X_predict)

        # Evaluate the model on the prediction data
        accuracy_predict = accuracy_score(y_true, y_pred_predict)
        conf_matrix_predict = confusion_matrix(y_true, y_pred_predict)
        class_report_predict = classification_report(y_true, y_pred_predict)

        print(f"\nAccuracy on prediction data: {accuracy_predict:.2f}")
        print("Confusion Matrix on prediction data:")
        print(conf_matrix_predict)
        print("Classification Report on prediction data:")
        print(class_report_predict)
    else:
        print("Model not trained successfully. Cannot make predictions.")

else:
    print("Prediction data is empty after feature engineering and target variable creation. Cannot make predictions.")


[*********************100%***********************]  1 of 1 completed



Accuracy on prediction data: 0.47
Confusion Matrix on prediction data:
[[ 91   1]
 [106   3]]
Classification Report on prediction data:
              precision    recall  f1-score   support

           0       0.46      0.99      0.63        92
           1       0.75      0.03      0.05       109

    accuracy                           0.47       201
   macro avg       0.61      0.51      0.34       201
weighted avg       0.62      0.47      0.32       201
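
The workaround above retrains the model inside the prediction cell. An alternative, sketched below, is to keep training in its own cell and have the prediction cell fail early with a clear message when no fitted model is available. The helper name `get_trained_model` is hypothetical; `check_is_fitted` and `NotFittedError` are scikit-learn utilities.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def get_trained_model(namespace: dict, name: str = "model"):
    """Return the fitted estimator stored under `name`, or raise a clear error."""
    if name not in namespace:
        raise RuntimeError(f"'{name}' is not defined; run the training cell first.")
    candidate = namespace[name]
    try:
        check_is_fitted(candidate)
    except NotFittedError as exc:
        raise RuntimeError(f"'{name}' exists but has not been fitted.") from exc
    return candidate


# Tiny synthetic demonstration: a fitted model passes, an unfitted one does not
fitted = RandomForestClassifier(n_estimators=5, random_state=0)
fitted.fit([[0.0], [1.0], [0.0], [1.0]], [0, 1, 0, 1])
print(get_trained_model({"model": fitted}) is fitted)

try:
    get_trained_model({"model": RandomForestClassifier()})
except RuntimeError as err:
    print(err)
```

In a notebook cell, `get_trained_model(globals())` would look up the `model` variable created by the training cell.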



## Summary:

### Data Analysis Key Findings

*   The initial data loading for AAPL from 2020-01-01 to 2023-01-01 retrieved 756 rows and 5 columns.
*   After feature engineering (moving averages, volatility) and creating the binary target ('Price\_Increase'), 49 rows with NaN values were removed, resulting in 707 rows for model training.
*   A Random Forest classifier trained on the 2020-2023 data achieved an accuracy of 0.52 on the test set, barely better than chance. 'Volatility', 'Volume', 'High', and 'MA\_50' ranked highest in feature importance, but all nine importances fall in a narrow band (about 0.10 to 0.13), so no single feature stands out as a strong predictor.
*   When the trained model predicted stock price movement for 2023-01-01 to 2024-01-01, the accuracy was 0.47. The model predicted class 0 (no price increase) for 197 of the 201 days, which gives high recall (0.99) but low precision (0.46) for class 0 and very low recall (0.03) for class 1 (price increase). The 0.75 precision for class 1 rests on only four predictions, three of them correct, so it says little about the model's skill.
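
The per-class figures in the last finding can be recomputed directly from the confusion matrix reported for the prediction period. The short calculation below uses scikit-learn's layout (rows are true classes, columns are predicted classes) and adds one comparison that is not in the notebook: the accuracy of a naive rule that always predicts an increase.

```python
# Confusion matrix reported for the 2023 prediction period
# (rows = true class, columns = predicted class)
conf_matrix = [[91, 1],
               [106, 3]]

tn, fp = conf_matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = conf_matrix[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)            # of the days predicted "increase", share correct
recall_1 = tp / (tp + fn)               # of the actual increases, share found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
share_predicted_0 = (tn + fn) / total   # how often the model predicts class 0
always_up_accuracy = (fn + tp) / total  # naive rule: always predict class 1

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"share of class 0 predictions={share_predicted_0:.2f}")
print(f"always-predict-increase accuracy={always_up_accuracy:.2f}")
```

The naive rule scores about 0.54 on this period (109 of 201 days saw an increase), which is higher than the model's 0.47.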

### Insights or Next Steps

*   The model's accuracy (0.52 on the test set, 0.47 on the prediction period) is close to what random guessing would give, which suggests limited predictive power for AAPL price direction with the current features and model. Future steps could explore additional features (e.g., other technical indicators, sentiment) or different models (e.g., time series models, deep learning).
*   The significant shift in model behavior between the training (2020-2023) and prediction (2023-2024) periods, particularly the strong bias towards predicting class 0 in the latter, suggests non-stationarity or concept drift. One plausible contributor is that the features include raw price levels, and tree-based models cannot extrapolate beyond the feature ranges seen during training. Analyzing market conditions and AAPL's performance in these periods could provide context.
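
One way to probe the drift hypothesis is to replace raw price levels with scale-free features, so that a higher overall price level in a later period does not by itself push inputs outside the training range. The sketch below is an assumption-laden illustration, not the notebook's method: the helper `add_relative_features` and the feature names `Return_1d`, `Close_to_MA_10`, `Close_to_MA_50`, and `Range_pct` are hypothetical, and the input is synthetic.

```python
import numpy as np
import pandas as pd


def add_relative_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Build features that do not depend on the absolute price level."""
    close = prices["Close"]
    out = pd.DataFrame(index=prices.index)
    out["Return_1d"] = close.pct_change()
    out["Close_to_MA_10"] = close / close.rolling(window=10).mean() - 1
    out["Close_to_MA_50"] = close / close.rolling(window=50).mean() - 1
    out["Range_pct"] = (prices["High"] - prices["Low"]) / close
    out["Volatility"] = close.pct_change().rolling(window=20).std() * np.sqrt(252)
    return out.dropna()


# Synthetic prices: a random walk with a small daily high/low band around the close
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum())
demo = pd.DataFrame({"Close": close, "High": close * 1.01, "Low": close * 0.99})

relative = add_relative_features(demo)
doubled = add_relative_features(demo * 2.0)  # same series at twice the price level
print(relative.tail(3))
print("Unchanged when prices double:", np.allclose(relative, doubled))
```

Whether such features actually improve out-of-period accuracy would still need to be tested on the real data.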
