<a href="https://colab.research.google.com/github/RajeshkumarA/Springboard_assignments/blob/main/Pre_processing_Work_and_Model_Capstone_Project_III_Rajesh_Ananthula.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stock Price Prediction Project: Preprocessing and Modeling

This notebook focuses on the data preprocessing and model building phases of a stock price prediction project. We will prepare the historical stock data, engineer relevant features, handle missing values, scale the data, split it into training and testing sets, select and train machine learning models, and evaluate their performance.

**Project Goal:** To develop a machine learning model that predicts the top 10 most promising stocks each trading day based on historical market data and technical indicators.

## 1. Preparation

Install the required libraries.

In [11]:
%pip install yahooquery pandas numpy scikit-learn matplotlib seaborn



## 2. Data Collection

We will use the `yahooquery` library to download historical stock data for a selected set of stocks.

In [12]:
from yahooquery import Ticker
from datetime import date, timedelta

# 1. Define a list of stock tickers
tickers_list = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'FB'] # Using a few popular tech stocks as examples

# 2. Specify the start and end dates for the historical data (last 10 years)
end_date = date.today()
start_date = end_date - timedelta(days=10*365) # Approximately 10 years

# 3. Use the yahooquery.Ticker class to initialize a Ticker object
tickers = Ticker(tickers_list)

# 4. Use the history() method to download historical data
historical_data = tickers.history(start=start_date, end=end_date)

# 5. Store the downloaded historical data in a pandas DataFrame
df_historical_data = historical_data

# Display the first few rows of the DataFrame
print("Historical Data Head:")
display(df_historical_data.head())

Historical Data Head:


Unnamed: 0_level_0,Unnamed: 1_level_0,open,high,low,close,volume,adjclose,dividends,splits
symbol,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AAPL,2015-09-02,27.557501,28.084999,27.282499,28.084999,247555200,25.245667,0.0,0.0
AAPL,2015-09-03,28.122499,28.195,27.51,27.592501,212935600,24.802961,0.0,0.0
AAPL,2015-09-04,27.2425,27.612499,27.127501,27.317499,199985200,24.555756,0.0,0.0
AAPL,2015-09-08,27.9375,28.139999,27.58,28.077499,219374400,25.238924,0.0,0.0
AAPL,2015-09-09,28.440001,28.504999,27.442499,27.5375,340043200,24.753523,0.0,0.0


## 3. Feature Engineering

Engineer relevant features from the historical data using common financial analysis techniques.

In [13]:
import pandas as pd

# Ensure the index is sorted for rolling calculations
df_historical_data.sort_index(inplace=True)

# Calculate daily price change and daily price range
df_historical_data['daily_price_change'] = df_historical_data['close'] - df_historical_data['open']
df_historical_data['daily_price_range'] = df_historical_data['high'] - df_historical_data['low']

# Calculate moving averages
df_historical_data['ma_50'] = df_historical_data.groupby('symbol')['adjclose'].transform(lambda x: x.rolling(window=50).mean())
df_historical_data['ma_200'] = df_historical_data.groupby('symbol')['adjclose'].transform(lambda x: x.rolling(window=200).mean())

# Calculate Relative Strength Index (RSI)
def calculate_rsi(data, window=14):
    diff = data.diff(1)
    gain = diff.where(diff > 0, 0)
    loss = -diff.where(diff < 0, 0)

    avg_gain = gain.rolling(window=window, min_periods=1).mean()
    avg_loss = loss.rolling(window=window, min_periods=1).mean()

    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

df_historical_data['rsi_14'] = df_historical_data.groupby('symbol')['adjclose'].transform(lambda x: calculate_rsi(x, window=14))

# Calculate the difference between short-term and long-term moving averages
df_historical_data['ma_50_200_diff'] = df_historical_data['ma_50'] - df_historical_data['ma_200']

# Define the target variable (next day's percentage change in adjusted close price)
df_historical_data['target'] = df_historical_data.groupby('symbol')['adjclose'].pct_change(periods=-1).shift(1) * 100

# Display the first few rows with the new features and target
print("\nData with Engineered Features and Target Head:")
display(df_historical_data.head())


Data with Engineered Features and Target Head:


Unnamed: 0_level_0,Unnamed: 1_level_0,open,high,low,close,volume,adjclose,dividends,splits,daily_price_change,daily_price_range,ma_50,ma_200,rsi_14,ma_50_200_diff,target
symbol,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
AAPL,2015-09-02,27.557501,28.084999,27.282499,28.084999,247555200,25.245667,0.0,0.0,0.527498,0.8025,,,,,
AAPL,2015-09-03,28.122499,28.195,27.51,27.592501,212935600,24.802961,0.0,0.0,-0.529999,0.684999,,,0.0,,1.784888
AAPL,2015-09-04,27.2425,27.612499,27.127501,27.317499,199985200,24.555756,0.0,0.0,0.074999,0.484999,,,0.0,,1.006712
AAPL,2015-09-08,27.9375,28.139999,27.58,28.077499,219374400,25.238924,0.0,0.0,0.139999,0.559999,,,49.754476,,-2.706805
AAPL,2015-09-09,28.440001,28.504999,27.442499,27.5375,340043200,24.753523,0.0,0.0,-0.9025,1.0625,,,36.759516,,1.960938


## 4. Data Preprocessing

Clean and preprocess the data for the machine learning model. This includes handling missing values, defining features and the target, splitting the data, and scaling the features.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Handle missing values that resulted from feature engineering (e.g., initial rolling windows)
# We'll fill NaN values using forward fill, which is common for time series data.
df_historical_data.fillna(method='ffill', inplace=True)

# Drop any remaining rows with NaN values, which might occur at the very beginning of the data for each symbol
df_historical_data.dropna(inplace=True)

# Select features and target variable
features = ['open', 'high', 'low', 'close', 'volume', 'daily_price_change', 'daily_price_range', 'ma_50', 'ma_200', 'rsi_14', 'ma_50_200_diff']
target = 'target'

X = df_historical_data[features]
y = df_historical_data[target]

# Split the preprocessed data into training and testing sets using a time-based split
# Determine the split point (e.g., 80% for training, 20% for testing)
split_ratio = 0.8
split_index = int(len(df_historical_data) * split_ratio)

# Get the date at the split index to ensure a clean time-based split
split_date = df_historical_data.index.get_level_values('date')[split_index]

# Split data based on the split date
X_train = X.loc[X.index.get_level_values('date') < split_date]
X_test = X.loc[X.index.get_level_values('date') >= split_date]
y_train = y.loc[y.index.get_level_values('date') < split_date]
y_test = y.loc[y.index.get_level_values('date') >= split_date]


print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# Apply scaling to the selected features
scaler = StandardScaler()

# Fit the scaler only on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert the scaled arrays back to DataFrames with original index and columns
X_train_scaled = pd.DataFrame(X_train_scaled, index=X_train.index, columns=features)
X_test_scaled = pd.DataFrame(X_test_scaled, index=X_test.index, columns=features)

print("\nScaled Training Data Head:")
display(X_train_scaled.head())

print("\nScaled Testing Data Head:")
display(X_test_scaled.head())

Shape of X_train: (1933, 11)
Shape of X_test: (7966, 11)
Shape of y_train: (1933,)
Shape of y_test: (7966,)

Scaled Training Data Head:


  df_historical_data.fillna(method='ffill', inplace=True)


Unnamed: 0_level_0,Unnamed: 1_level_0,open,high,low,close,volume,daily_price_change,daily_price_range,ma_50,ma_200,rsi_14,ma_50_200_diff
symbol,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
AAPL,2016-06-17,-1.623658,-1.642467,-1.630088,-1.650766,3.948274,-0.774744,-0.887277,-0.685478,-0.857469,-1.951097,-1.097742
AAPL,2016-06-20,-1.636458,-1.644111,-1.635689,-1.655512,1.660472,-0.551461,-0.759814,-0.686761,-0.857698,-1.91064,-1.106211
AAPL,2016-06-21,-1.658343,-1.648633,-1.642949,-1.638797,1.758061,0.519153,-0.672606,-0.687998,-0.857886,-1.278404,-1.114901
AAPL,2016-06-22,-1.631297,-1.637533,-1.62905,-1.646226,1.213789,-0.436955,-0.759814,-0.689412,-0.858064,-1.184069,-1.125421
AAPL,2016-06-23,-1.637697,-1.649867,-1.631125,-1.634877,1.473663,0.055409,-1.095237,-0.690929,-0.858278,-1.040596,-1.136345



Scaled Testing Data Head:


Unnamed: 0_level_0,Unnamed: 1_level_0,open,high,low,close,volume,daily_price_change,daily_price_range,ma_50,ma_200,rsi_14,ma_50_200_diff
symbol,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
AAPL,2017-10-13,-0.382632,-0.396179,-0.362417,-0.378378,0.110597,0.112669,-1.209285,-0.374035,-0.730413,1.160739,-0.173795
AAPL,2017-10-16,-0.358476,-0.340267,-0.336694,-0.318742,0.775296,1.097404,-0.216426,-0.373618,-0.729739,1.195257,-0.180554
AAPL,2017-10-17,-0.319661,-0.322384,-0.303919,-0.306567,0.334514,0.358848,-0.692731,-0.373386,-0.729042,1.136769,-0.189417
AAPL,2017-10-18,-0.306448,-0.325673,-0.296243,-0.321218,0.108876,-0.414058,-1.048278,-0.373355,-0.72836,1.230804,-0.199926
AAPL,2017-10-19,-0.382219,-0.40029,-0.391251,-0.39922,2.36345,-0.477035,-0.410977,-0.373807,-0.727734,-0.056807,-0.214123


## 5. Model Selection and Training

Select and train machine learning models suitable for regression on the prepared scaled data.

In [15]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor

# Choose at least two appropriate machine learning models
# Linear Regression: A simple baseline model.
# Ridge Regression: A linear model with L2 regularization, can help prevent overfitting.
# RandomForestRegressor: An ensemble model that can capture non-linear relationships.

# Instantiate the chosen models with appropriate hyperparameters
linear_reg_model = LinearRegression()
ridge_model = Ridge(alpha=1.0) # alpha is the regularization strength
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42) # n_estimators is the number of trees

# Train each model using the scaled training data and the training target
print("Training Linear Regression Model...")
linear_reg_model.fit(X_train_scaled, y_train)
print("Linear Regression Model Trained.")

print("\nTraining Ridge Regression Model...")
ridge_model.fit(X_train_scaled, y_train)
print("Ridge Regression Model Trained.")

print("\nTraining Random Forest Regressor Model...")
random_forest_model.fit(X_train_scaled, y_train)
print("Random Forest Regressor Model Trained.")

# The trained models are stored in the variables: linear_reg_model, ridge_model, random_forest_model

Training Linear Regression Model...
Linear Regression Model Trained.

Training Ridge Regression Model...
Ridge Regression Model Trained.

Training Random Forest Regressor Model...
Random Forest Regressor Model Trained.


## Summary of Completed Steps:

1.  **Preparation:** Installed the necessary libraries (`yahooquery`, `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`).
2.  **Data Collection:** Downloaded historical stock data for a list of tickers using `yahooquery`.
3.  **Feature Engineering:** Engineered relevant features from the historical data, such as daily price change, daily price range, moving averages (50-day and 200-day), Relative Strength Index (RSI), and the difference between moving averages. We also defined the target variable as the next day's percentage change in adjusted close price.
4.  **Data Preprocessing:** Handled missing values by forward filling and dropping any remaining NaNs. We then split the data into training and testing sets based on a time-based split and scaled the numerical features using `StandardScaler`.
5.  **Model Selection and Training:** Selected and trained three regression models: Linear Regression, Ridge Regression, and Random Forest Regressor on the scaled training data.