*Tech Titans* : Stock Price Prediction Using Linear Regression and Random Forest  
_Srikant barik_

# 1. Getting the required packages

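We first install the required packages with pip:
- `numpy` and `pandas`
- `matplotlib`
- `yfinance` (the Yahoo Finance data downloader)

The notebook also imports `seaborn` and `scikit-learn`; install them the same way if they are missing.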

In [None]:
# installation of packages
# uncomment below if packages are not already installed in your local environment

# !pip install numpy pandas
# !pip install matplotlib
# !pip install yfinance

The `!` before `pip` runs the line as a shell command from inside the notebook. If you install from a command-line interface (CLI) on your local machine, omit the exclamation mark (`!`).

In [None]:
# Data Analysis
import numpy as np
import pandas as pd


# Financial Data Retrieval
import yfinance as yf

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Model Training
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Evaluation Metrics
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, r2_score, mean_absolute_error, median_absolute_error

In [None]:
import warnings
import itertools
warnings.filterwarnings("ignore")

# 2. Getting the dataset

- We fetch the dataset from Yahoo Finance: https://finance.yahoo.com/lookup/
  1. Open the link above ⤴️
  2. Copy the ticker, e.g. `TSLA` for Tesla

In [None]:
import yfinance as yf

# Get stock ticker from: https://finance.yahoo.com/lookup/
# Ex: TSLA

# ticker = input("Enter the stock ticker: ")
ticker = 'VBL.NS'

# Time Period
S = '2023-01-01'
E = '2024-12-07'

df = yf.download(ticker, start = S, end = E)

In [None]:
# Print the first 5 rows of the dataset
df.head()

In the above dataset, `Date` is in yyyy-mm-dd format.

Next, we save the data to a CSV file.

In [None]:
# Save the DataFrame to a CSV file named after the ticker.
# Keep the index: it holds the dates, which index=False would discard.
file_name = f"{ticker}.csv"
df.to_csv(file_name)
print("Data saved to " + file_name)

In [None]:
# Check that the .csv file is now in the working directory
!ls

## Getting to know the dataset

In [None]:
# Type of the dataset.
df.dtypes

As shown above, all columns are numeric; the date is stored as the DataFrame index rather than as a column. Every feature has the expected data type.

In [None]:
# Summary Statistics
df.describe().round(2)

In [None]:
# Getting the basic information about the dataset:
df.info()

In [None]:
df.shape

# 3. Basic formatting of the dataset

### Missing Values

In [None]:
# Count the missing values in each column:
df.isnull().sum()

We have only a small amount of missing data, so we can simply drop the rows that contain missing values.

In [None]:
df = df.dropna()

# 4. Feature Engineering

Features added:
1. Moving averages (10, 50 and 200 days)
2. Relative Strength Index (RSI)
3. Daily percentage change in price

In [None]:
# Moving Averages 10, 50, 200
df['MA10'] = df['Close'].rolling(window=10).mean()
df['MA50'] = df['Close'].rolling(window=50).mean()
df['MA200'] = df['Close'].rolling(window=200).mean()

# Relative Strength Index (14-day window, simple rolling means of gains and losses)
delta = df['Close'].diff()
avg_gain = delta.clip(lower=0).rolling(window=14).mean()
avg_loss = delta.clip(upper=0).abs().rolling(window=14).mean()
df['RSI'] = 100 - (100 / (1 + avg_gain / avg_loss))

# Change in Price Percentage Daily.
df['daily_change'] = df['Close'].pct_change()

# Dropping rows with NaN values due to rolling calculations
df = df.dropna()


In [None]:
df.head()

In [None]:
df.tail()

# 5. Splitting the dataset

**REMEMBER:** Unlike ordinary tabular data, time-series data must not be shuffled. If we shuffle, the model can see future observations during training, which causes data leakage.

We'll split the data into train and test sets (80/20), keeping the chronological order.
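
To make the no-shuffle point concrete, here is a minimal sketch on a small synthetic series (the `demo` frame and its values are illustrative assumptions, not the stock data). It checks that with `shuffle=False` every training date precedes every test date.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic daily series standing in for real price data (assumption for illustration).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
demo = pd.DataFrame({"Close": np.arange(10, dtype=float)}, index=dates)

# shuffle=False keeps the rows in their original order:
# the first 80% become the training set, the last 20% the test set.
demo_train, demo_test = train_test_split(demo, test_size=0.2, shuffle=False)

# Every training date is earlier than every test date, so no future rows leak into training.
print(len(demo_train), len(demo_test))
print(demo_train.index.max() < demo_test.index.min())
```

With the default `shuffle=True`, the same call would mix early and late dates across both sets.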

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training (80%) and testing (20%) sets
df_train, df_test = train_test_split(df, test_size=0.2, shuffle = False)

In [None]:
df_train.tail()

In [None]:
df_test.head()

We see above that splitting is done successfully without random shuffling.

In [None]:
len(df_train), len(df_test), len(df_train) + len(df_test), len(df)

The lengths above are consistent: the train and test sizes add up to the size of the full DataFrame.

# Exploratory Data Analysis (EDA) on training dataset

In [None]:
df_train.head()

In [None]:
df_train.isnull().sum()

Because rows with missing values were already dropped, the counts above are zero and no further cleaning is needed.

Let's plot the `Volume` column over time.

In [None]:
df_train.Volume.plot(figsize=(18, 6), color='purple')
plt.title(ticker + ' Volume Over Time')

### What is the trend in the adjusted closing price (`Adj Close`) over time?

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Plotting adjusted closing price:
df_train['Adj Close'].plot(ax=axes[0])
axes[0].set_xlabel('')
axes[0].set_ylabel('Adjusted Close',fontsize=14)

# Creating a line plot for Volumes:
df_train['Volume'].plot(ax=axes[1], color = 'green')
axes[1].set_ylabel('volumes', fontsize=14)
axes[1].set_xlabel('')  # to get nothing in x axis as labels

plt.suptitle(ticker + ' Adjusted Closing Price and Volumes')
plt.tight_layout()
plt.show()



In [None]:
plt.figure(figsize=(18, 6))

# Plot the columns using matplotlib
plt.plot(df_train.index, df_train['Close'], label="Close")
plt.plot(df_train.index, df_train['Adj Close'], label="Adj Close")

# Add a legend, axis labels, and title
plt.legend()
plt.xlabel("Year")
plt.ylabel("Value")
plt.title(ticker + ": Plot of close and adjusted close Over Time")


# Show the plot
plt.show()




The figure above shows that close and adjusted close follow the same trend.

Next, we'll compare `Open` and `Close` in the same plot.

In [None]:
plt.figure(figsize=(18, 6))
# Plot the columns using matplotlib
plt.plot(df_train.index, df_train['Open'], label="Open", color="green")
plt.plot(df_train.index, df_train['Close'], label="Close", color="orange")

# Add a legend, axis labels, and title
plt.legend()
plt.xlabel("Year")
plt.ylabel("Value")
plt.title(ticker + ": Plot of open and close Value Over Time")

# Show the plot
plt.show()

At this scale, open and close are almost indistinguishable to the naked eye.

Now we'll plot open, close, high and low in the same plot.

In [None]:
plt.figure(figsize=(25, 6))

# Plot the columns using matplotlib
plt.plot(df_train.index, df_train['Open'], label="open")
plt.plot(df_train.index, df_train['Close'], label="close")
plt.plot(df_train.index, df_train['High'], label="high")
plt.plot(df_train.index, df_train['Low'], label="low")

# Add a legend, axis labels, and title
plt.legend()
plt.xlabel("Year")
plt.ylabel("Value")
plt.title(ticker + ": OHLC")

# Show the plot
plt.show()

- The high series (in green) is occasionally visible; otherwise all four series follow the same trend at this scale.

### Moving averages on the training data

Moving averages are used to smooth the data. The 50-day and 200-day moving averages are common among traders and investors.

In [None]:
df_train.head()

In [None]:
df_train.tail()

In [None]:
df_train['Adj Close'].plot(figsize = (20,5))
df_train['MA10'].plot()
df_train['MA50'].plot()
df_train['MA200'].plot()
plt.xlabel('Year')
plt.ylabel('Adj Close', fontsize=12)
plt.title('Moving Averages of ' + ticker)
plt.legend()
plt.show()

Observations:

- Moving averages have a smoothing effect on `Adj Close`.
- The 200-day moving average is smoother than the 50-day moving average.
- A longer moving-average window has a larger lag.
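
The smoothing effect can be quantified with a small sketch. It uses a synthetic random-walk price series (an assumption for illustration, not the stock data) and compares the standard deviation of day-to-day changes for the raw series and for two rolling means.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (illustrative assumption, not real market data).
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 300).cumsum())

# Rolling means with a short and a long window.
ma10 = prices.rolling(window=10).mean()
ma50 = prices.rolling(window=50).mean()

# A window of size w yields NaN for the first w - 1 rows.
nan_10 = int(ma10.isna().sum())
nan_50 = int(ma50.isna().sum())

# "Roughness" measured as the standard deviation of day-to-day changes.
rough_price = prices.diff().std()
rough_10 = ma10.diff().std()
rough_50 = ma50.diff().std()

print(nan_10, nan_50)
print(rough_50 < rough_10 < rough_price)
```

The leading NaN rows are also why the feature-engineering step drops rows after computing the rolling columns.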

### Relative Strength Index (RSI)

RSI is a momentum indicator used to measure the speed and change of price movements. It ranges from 0 to 100 and helps identify overbought or oversold conditions in a stock.

RSI is calculated from the average gain and average loss over a specified period (usually 14 days):

1. **Calculate the average gain and average loss** over the past 14 days.
2. **Calculate the relative strength (RS)**:

   $$RS = \frac{\text{Average Gain}}{\text{Average Loss}}$$

3. **Calculate the RSI**:

   $$RSI = 100 - \frac{100}{1 + RS}$$

- **RSI > 70**: Stock may be overbought (potential for a price drop).
- **RSI < 30**: Stock may be oversold (potential for a price increase).

In simple terms: RSI helps to show if a stock is priced too high or too low based on its recent performance.
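
As a worked example, the sketch below computes RSI on five made-up closing prices with a 4-day window. It uses the same simple rolling-mean form as the feature-engineering cell earlier in this notebook; the helper name `simple_rsi` and the prices are illustrative assumptions. Note that Wilder's original RSI uses exponential-style smoothing instead of simple means, so charting tools may show slightly different values.

```python
import pandas as pd

def simple_rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI using simple rolling means of gains and losses (hypothetical helper)."""
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(window=window).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(window=window).mean()
    rs = avg_gain / avg_loss          # relative strength = average gain / average loss
    return 100 - 100 / (1 + rs)

# Made-up prices: daily changes are +2, -1, +3, -1.
close = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0])
rsi = simple_rsi(close, window=4)

# Average gain = (2 + 0 + 3 + 0) / 4 = 1.25, average loss = (0 + 1 + 0 + 1) / 4 = 0.5,
# so RS = 2.5 and RSI = 100 - 100 / 3.5.
print(round(rsi.iloc[-1], 2))  # 71.43
```

Only the last row has a value here, because the rolling window needs four valid daily changes.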

In [None]:
import matplotlib.pyplot as plt

# Plotting the RSI
df_train['RSI'].plot(figsize=(20,5), color='purple', label='RSI')

# Adding horizontal lines for overbought and oversold levels (70 and 30)
plt.axhline(70, color='red', linestyle='--', label='Overbought (70)')
plt.axhline(30, color='green', linestyle='--', label='Oversold (30)')

# Adding labels and title
plt.xlabel('Year')
plt.ylabel('RSI', fontsize=12)
plt.title('Relative Strength Index (RSI) with Overbought/Oversold Levels of ' + ticker)
plt.legend()
plt.show()


### Daily Average Return (DAR)

The daily average return for a stock is the average gain or loss that the stock has experienced per trading day over a given period, usually expressed as a percentage of the stock's price. It is calculated by summing the daily returns over the period and dividing by the number of trading days in that period.

For example, if a stock had an adjusted close price of Rs. 100 on Monday and Rs. 102 on Tuesday, its daily return for Tuesday would be:

(102 - 100) / 100 = 0.02 = 2%

To calculate the average daily return over a period of, say, 30 trading days, you would sum the daily returns for each of the 30 days, then divide the sum by 30.
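
The calculation can be sketched in a few lines. The prices below are made up for illustration (starting with the Rs. 100 to Rs. 102 move from the example above) and give daily returns of +2%, -2% and +5%.

```python
import pandas as pd

# Made-up adjusted close prices (illustrative assumption).
adj_close = pd.Series([100.0, 102.0, 99.96, 104.958])

# pct_change() gives (today - yesterday) / yesterday; the first row has no previous day.
daily_returns = adj_close.pct_change().dropna()

# Average daily return = sum of daily returns / number of daily returns.
avg_daily_return = daily_returns.sum() / len(daily_returns)

print(daily_returns.round(4).tolist())
print(f"{avg_daily_return:.2%}")
```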

In [None]:
# Plot the daily returns computed earlier with pct_change() (pct means percent)
df_train['daily_change'].plot(figsize=(14,5),linestyle='--',marker='o')
plt.xlabel('Year')
plt.ylabel('Daily Average Return')
plt.title(ticker + ": Daily Average Return")
plt.show()

In [None]:
# Adding Year, Month, and Weekday to df_train
df_train['Year'] = df_train.index.year
df_train['Month'] = df_train.index.month_name()
df_train['Weekday'] = df_train.index.day_name()


In [None]:
df_train.head()

### Correlation matrix

In [None]:
df_train_corr = df_train.copy()
df_train_corr.head()

In [None]:
df_train_corr = df_train_corr.drop(['MA50','MA200','daily_change','Year','Month','Weekday'],axis=1)
df_train_corr.head()

The correlation heatmap of these columns (plotted in the next cell) is not very insightful, since most pairs have a correlation coefficient close to 1. A more informative view is how the price differences correlate with volume.

In [None]:
# Calculate the correlation matrix
corrM = df_train_corr.corr()

# Set up the figure size
plt.figure(figsize=(14, 7))

# Create a mask to hide the upper triangle of the correlation matrix
mask = np.zeros_like(corrM, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Plot the heatmap
sns.heatmap(corrM, annot=True, fmt=".2f", cmap="coolwarm", mask=mask, cbar=True, linewidths=0.5)

# Show the plot
plt.title("Correlation Heatmap", fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
df_train_corr_diff = df_train_corr.copy()
df_train_corr_diff.head()

Observations:

- The correlation of price differences with volume is more useful than the previous heatmap.
- Volume traded correlates most strongly with high-low. This makes sense: the larger the price swing within a day, the more activity there is in the market.
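
The mechanics of correlating price differences with volume can be sketched as follows. The data is synthetic, and the volume is deliberately generated from the daily range, so the result only demonstrates the computation; it says nothing about the real stock. All column values and the `demo` and `diffs` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic OHLCV data (assumption): volume is constructed to depend on the daily range.
rng = np.random.default_rng(42)
n = 250
close = 100 + rng.normal(0, 1, n).cumsum()
open_ = close + rng.normal(0, 0.5, n)
day_range = rng.uniform(0.5, 3.0, n)
high = np.maximum(open_, close) + day_range / 2
low = np.minimum(open_, close) - day_range / 2
volume = 1_000_000 * day_range + rng.normal(0, 100_000, n)

demo = pd.DataFrame(
    {"Open": open_, "High": high, "Low": low, "Close": close, "Volume": volume}
)

# Build the difference features and correlate each with Volume.
diffs = pd.DataFrame({
    "high-low": demo["High"] - demo["Low"],
    "open-close": demo["Open"] - demo["Close"],
    "Volume": demo["Volume"],
})
corr_with_volume = diffs.corr()["Volume"].drop("Volume")
print(corr_with_volume.round(2))
```

On the real data, the same pattern would apply to `df_train_corr_diff` with its `Open`, `High`, `Low`, `Close` and `Volume` columns.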

### Visualization: Adj Close and Monthly mean

In [None]:
# Visualizing the Daily Adjusted Close and the Monthly Average resampled data
fig, ax = plt.subplots(figsize=(22, 8))
ax.plot(df_train['Adj Close'],marker='.', linestyle='-', linewidth=0.5, label='Daily')
ax.plot(df_train['Adj Close'].resample('M').mean(),marker='o', markersize=8, linestyle='-', label='Monthly Mean Resample')
ax.set_ylabel('Adjusted Close')
ax.legend()
plt.show()

* The daily series and the monthly mean follow the same macroscopic trend.
* No clear trend or seasonality is visible.

# 1. Linear Regression model

In [None]:
df_train = df_train.sort_index()
df_train.head()


In [None]:
# Prepare the features and target
# Encode Weekday as an integer (Monday=0 ... Sunday=6), the same convention as
# Timestamp.weekday(), which the forecasting cells use for future dates
df_train['Weekday'] = df_train.index.weekday
X = df_train[['MA10', 'MA50', 'MA200', 'RSI', 'daily_change', 'Weekday']]
y = df_train['Adj Close']
df_train.head()

## Split and Fit LinearRegression

In [None]:
# Split the data chronologically (no shuffling, to avoid leaking future data into training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

In [None]:
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Backtest: Make predictions on test set
y_pred = model.predict(X_test)

## Back-Test Actual v/s Predicted

Ordinary k-fold cross-validation assigns observations to folds at random, which jumbles the temporal order of the data. That is acceptable for independent tabular observations, but it is not true of time series data, where the time dimension of the observations means that we cannot randomly split them into groups.

We can use backtesting for time series instead.

In backtesting we create multiple train-test splits that respect the temporal order of the data. For example, with a dataset spanning January to December, each split trains on an earlier stretch of months and tests on the months that immediately follow.

In [None]:
# Plot of actual vs predicted values
plt.figure(figsize=(20, 6))

y_test_1d = y_test.values.ravel()
y_pred_1d = y_pred.ravel()

# Plot actual vs predicted values
sns.lineplot(x=range(len(y_test_1d)), y=y_test_1d, label='Actual', marker='o', linewidth=2, color='blue')
sns.lineplot(x=range(len(y_test_1d)), y=y_pred_1d, label='Predicted', marker='x', linewidth=2, color='orange')
plt.fill_between(range(len(y_test_1d)), y_test_1d, y_pred_1d, color='gray', alpha=0.2, label='Error')
plt.axhline(y=y_test_1d.mean(), color='red', linestyle='--', linewidth=1.5, label='Mean Actual Value')

plt.legend(fontsize=12)
plt.title('Actual vs Predicted Prices', fontsize=16, fontweight='bold')
plt.xlabel('Test Set Samples', fontsize=14)
plt.ylabel('Adj Close', fontsize=14)

plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Display the plot
plt.show()


Having seen why ordinary cross-validation does not suit time series, we can use scikit-learn's `TimeSeriesSplit` for our dataset. In time series, evaluating models on past data is called **backtesting**. In some fields, such as meteorology, the term **hindcasting** is used for the same idea.
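
A minimal sketch of such a backtest with `TimeSeriesSplit` is shown here. It runs on synthetic data with a linear trend (an assumption for illustration; `X_demo`, `y_demo` and `fold_mae` are hypothetical names), fitting a fresh `LinearRegression` on each expanding training window and scoring it on the block that follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic ordered data: a linear trend plus noise (illustrative assumption).
rng = np.random.default_rng(0)
n = 120
X_demo = np.arange(n, dtype=float).reshape(-1, 1)
y_demo = 0.5 * X_demo.ravel() + rng.normal(0, 1, n)

tscv = TimeSeriesSplit(n_splits=5)
fold_mae = []
for train_idx, test_idx in tscv.split(X_demo):
    # Training indices always come before test indices, so the temporal order is respected.
    assert train_idx.max() < test_idx.min()
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], fold_pred))

print([round(m, 2) for m in fold_mae])
```

Applied to this notebook, `X_demo` and `y_demo` would be replaced by the feature matrix `X` and target `y`, with rows selected via `.iloc`.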

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score

# A confusion matrix needs class labels, but y_test and y_pred are continuous prices.
# As an illustrative choice (an assumption, not part of the regression itself), convert
# both to binary classes: 1 if the price is above the median training price, else 0.
threshold = y_train.median()
y_test_classes = (np.asarray(y_test) > threshold).astype(int)
y_pred_classes = (np.asarray(y_pred) > threshold).astype(int)

# Generate confusion matrix
cm = confusion_matrix(y_test_classes, y_pred_classes, labels=[0, 1])
accuracy = accuracy_score(y_test_classes, y_pred_classes)

# Display the confusion matrix
print("Confusion Matrix:")
print(cm)
print(f"Accuracy: {accuracy:.2f}")

# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["At or below median", "Above median"])
disp.plot(cmap="Blues", values_format="d")
plt.title("Confusion Matrix")
plt.show()


## Forecasting

In [None]:
# Forecast: Predict future prices
future_data = pd.DataFrame({
    'MA10': [230],  # Replace with future values
    'MA50': [225],
    'MA200': [220],
    'RSI': [50],
    'daily_change': [0.02],
    'Weekday': [2]  # Wednesday, assuming the Monday=0 encoding of Timestamp.weekday()
})
future_prediction = model.predict(future_data)
print(f"Forecasted Adj Close: {future_prediction[0]}")

In [None]:
forecast_days = 10
historical_data = df_train['Adj Close']

# Create a sequence of future dates for plotting
last_timestamp = historical_data.index[-1]

future_predictions = []
future_dates = []

for i in range(1, forecast_days + 1):
    future_data = pd.DataFrame({
        'MA10': [historical_data.iloc[-1] * 1.01],  # Example: Assume future values grow by 1%
        'MA50': [historical_data.iloc[-1] * 1.02],  # Example: Assume future values grow by 2%
        'MA200': [historical_data.iloc[-1] * 1.03],  # Example: Assume future values grow by 3%
        'RSI': [50],  # Assuming RSI remains constant
        'daily_change': [0.02],  # Assuming daily change remains constant
        'Weekday': [(historical_data.index[-1].weekday() + i) % 7]  # Cycle through weekdays
    })

    # Predict future price
    future_prediction = model.predict(future_data)
    future_predictions.append(future_prediction[0])

    # Get the future date
    future_date = last_timestamp + pd.DateOffset(days=i)
    future_dates.append(future_date)


plt.figure(figsize=(20, 6))
plt.plot(historical_data.index, historical_data, label='Historical Data', color='blue', linestyle='-', linewidth=2)
plt.plot(future_dates, future_predictions, label='Forecasted Future Prices', color='orange', linestyle='-', marker='x', markersize=8)

plt.title('Historical Prices with Forecasted Future Prices', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Adj Close', fontsize=14)

plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(fontsize=12)

plt.show()

# Display the forecasted future prices for reference
print("\n--- Forecasted Future Prices ---")
for date, price in zip(future_dates, future_predictions):
    print(f"{date.strftime('%Y-%m-%d')}: {price:.2f}")


## Calculating evaluation metrics

- Mean Absolute Error (MAE)
- r2 Score
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Error (RMSE)
- Symmetric Mean Absolute Percentage Error (SMAPE)
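
Before applying these metrics to the model, it may help to see each one computed by hand on three made-up values (the arrays below are illustrative assumptions, not model output).

```python
import numpy as np

# Made-up actual and predicted values (illustrative assumption).
y_true_demo = np.array([100.0, 200.0, 400.0])
y_hat_demo = np.array([110.0, 180.0, 400.0])

err = y_true_demo - y_hat_demo  # errors: -10, 20, 0

# MAE: mean of absolute errors.
demo_mae = np.mean(np.abs(err))

# RMSE: square root of the mean squared error (penalises large errors more).
demo_rmse = np.sqrt(np.mean(err ** 2))

# MAPE: mean absolute error relative to the actual value, in percent.
demo_mape = np.mean(np.abs(err / y_true_demo)) * 100

# SMAPE: absolute error relative to the mean of |actual| and |predicted|, in percent.
demo_smape = np.mean(2 * np.abs(err) / (np.abs(y_true_demo) + np.abs(y_hat_demo))) * 100

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
demo_r2 = 1 - np.sum(err ** 2) / np.sum((y_true_demo - y_true_demo.mean()) ** 2)

print(f"MAE={demo_mae:.2f} RMSE={demo_rmse:.2f} MAPE={demo_mape:.2f}% "
      f"SMAPE={demo_smape:.2f}% R2={demo_r2:.4f}")
```

MAE and RMSE are in the units of the target (here, price), while MAPE and SMAPE are percentages and R2 is unitless.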

In [None]:
# Calculate the MAE
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.2f' % mae)

In [None]:
# Calculate the R2 score
r2 = r2_score(y_test, y_pred)
print("R2 score: %.2f" % r2)

In [None]:
# Calculate the MAPE
mape = (abs((y_test - y_pred) / y_test)).mean() * 100
print("MAPE: %.2f%%"% mape)

In [None]:
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %.2f"% rmse)

In [None]:
# Calculate SMAPE
smape = (100/len(y_test)) * np.sum(2 * np.abs(y_pred - y_test) / (np.abs(y_test) + np.abs(y_pred)))
print("SMAPE: %.2f%%" % smape)

# 2. Random Forest Model

In [None]:
df_train = df_train.sort_index()
df_train.head()

In [None]:
# Sort and prepare the data
df_train = df_train.sort_index()
# Weekday as an integer (Monday=0 ... Sunday=6), matching Timestamp.weekday() in the forecast loop
df_train['Weekday'] = df_train.index.weekday
X = df_train[['MA10', 'MA50', 'MA200', 'RSI', 'daily_change', 'Weekday']]
y = df_train['Adj Close']


## Split and Fit Random Forest

In [None]:
# Split the data chronologically into training and testing sets (no shuffling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)


In [None]:
# Initialize the Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)


In [None]:
# Train the model on the training data
rf_model.fit(X_train, y_train)


In [None]:
# Backtest: Make predictions on the test set
y_pred = rf_model.predict(X_test)

In [None]:
# Optionally, display the feature importance
importances = rf_model.feature_importances_
print("Feature importances:", dict(zip(X.columns, importances)))

In [None]:
# Plotting
plt.figure(figsize=(15, 10))

# 1. Actual vs Predicted Plot
plt.subplot(2,2,1)
plt.scatter(y_test, y_pred, color='blue', alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linewidth=2)
plt.title('Actual vs Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')

# 2. Predicted vs. Actual Values Line Plot
plt.subplot(2,2,2)
plt.plot(y_test.values, label='Actual', color='blue', marker='o')
plt.plot(y_pred, label='Predicted', color='red', marker='o')
plt.title('Actual vs Predicted Values Over Time')
plt.xlabel('Sample Index')
plt.ylabel('Adj Close')
plt.legend()

plt.tight_layout()
plt.show()


In [None]:
# Plotting the actual vs predicted and forecasted future prices
forecast_days = 10
historical_data = df_train['Adj Close']
last_timestamp = historical_data.index[-1]

future_predictions = []
future_dates = []

# Generate future data and make predictions
for i in range(1, forecast_days + 1):
    future_data = pd.DataFrame({
        'MA10': [historical_data.iloc[-1] * 1.01],  # Example: Assume future values grow by 1%
        'MA50': [historical_data.iloc[-1] * 1.02],  # Example: Assume future values grow by 2%
        'MA200': [historical_data.iloc[-1] * 1.03],  # Example: Assume future values grow by 3%
        'RSI': [50],  # Assuming RSI remains constant
        'daily_change': [0.02],  # Assuming daily change remains constant
        'Weekday': [(historical_data.index[-1].weekday() + i) % 7]  # Cycle through weekdays
    })

    # Predict future price
    future_prediction = rf_model.predict(future_data)
    future_predictions.append(future_prediction[0])

    # Get the future date
    future_date = last_timestamp + pd.DateOffset(days=i)
    future_dates.append(future_date)

# Plot historical and forecasted data
plt.figure(figsize=(20, 6))
plt.plot(historical_data.index, historical_data, label='Historical Data', color='purple', linestyle='-', linewidth=2)
plt.plot(future_dates, future_predictions, label='Forecasted Future Prices', color='orange', linestyle='-', marker='x', markersize=8)

plt.title('Historical Prices with Forecasted Future Prices', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=14)
plt.ylabel('Adj Close', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(fontsize=12)

plt.show()

# Display the forecasted future prices for reference
print("\n--- Forecasted Future Prices ---")
for date, price in zip(future_dates, future_predictions):
    print(f"{date.strftime('%Y-%m-%d')}: {price:.2f}")

## Calculating evaluation metrics

- Mean Absolute Error (MAE)
- r2 Score
- Root Mean Squared Error (RMSE)

In [None]:
# Calculate the MAE
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.2f' % mae)

In [None]:
# Calculate the R2 score
r2 = r2_score(y_test, y_pred)
print("R2 score: %.2f" % r2)

In [None]:
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %.2f"% rmse)

