# Stock Market Analysis and Prediction

### Table of Contents

This roadmap covers the full workflow: understanding stock market data, applying Data Science techniques to financial analysis, and optimizing and backtesting investment strategies.

1. [Introduction](#1-Introduction)
2. [Data Collection and Preprocessing](#2-Data-collection-and-preprocessing)
   - [Key Libraries](#key-libraries)
   - [Stock Data Retrieval with APIs](#stock-data-retrieval-with-apis)
   - [Data Cleaning and Formatting](#data-cleaning-and-formatting)
3. [Feature Engineering for Machine Learning](#3-Feature-engineering-for-machine-learning)
   - [Technical Indicators](#technical-indicators)
   - [Fundamental Metrics](#fundamental-metrics)
4. [Selected Features for Technical Indicators](#4-Selected-features-for-technical-indicators)
   - [Simple Moving Average (SMA)](#simple-moving-average-sma)
   - [Moving Average Convergence Divergence (MACD)](#moving-average-convergence-divergence-macd)
   - [Relative Strength Index (RSI)](#relative-strength-index-rsi)
   - [Bollinger Bands](#bollinger-bands)
   - [On-Balance Volume (OBV)](#on-balance-volume-obv)
5. [Selected Features for Fundamental Metrics](#5-Selected-features-for-fundamental-metrics)
   - [Price-to-Earnings Ratio (P/E)](#price-to-earnings-ratio-pe)
   - [Revenue Growth](#revenue-growth)
   - [Debt-to-Equity Ratio](#debt-to-equity-ratio)
   - [Return on Equity (ROE)](#return-on-equity-roe)
6. [Exploratory Data Analysis (EDA)](#3-exploratory-data-analysis-eda)
   - [Statistical Summaries](#statistical-summaries)
   - [Visualizing Trends](#visualizing-trends)
7. [Financial Metrics](#4-financial-metrics)
   - [Performance Metrics](#performance-metrics)
   - [Risk and Volatility Analysis](#risk-and-volatility-analysis)
8. [Machine Learning Applications](#5-machine-learning-applications)
   - [Predictive Modeling](#predictive-modeling)
   - [Classification Tasks](#classification-tasks)
   - [Clustering Analysis](#clustering-analysis)
9. [Portfolio Optimization](#6-portfolio-optimization)
   - [Markowitz Mean-Variance Optimization](#markowitz-mean-variance-optimization)
   - [Black-Litterman Allocation](#black-litterman-allocation)
   - [Reinforcement Learning Approaches](#reinforcement-learning-approaches)
10. [Backtesting Investment Strategies](#7-backtesting-investment-strategies)
11. [Insights and Conclusions](#8-insights-and-conclusions)
12. [Future Work](#9-future-work)



# 1-Introduction
____

<h4 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Project Overview</h4>

The stock market is a dynamic and complex environment where investors face significant challenges, including identifying profitable opportunities, managing risks, and optimizing portfolio allocation. This project focuses on the analysis of stock data from four leading technology companies: Apple (AAPL), Microsoft (MSFT), Google (GOOGL), and Amazon (AMZN). These companies were chosen for their market leadership, innovation, and global influence.

The goal of the project is to uncover valuable insights from historical stock data, leverage machine learning models to predict future stock prices, and evaluate investment strategies through portfolio optimization and backtesting. By integrating data analytics and advanced financial modeling techniques, this project aims to support better decision-making in the fast-paced stock market.


<h4 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Project Objectives</h4>

The project aims to achieve the following objectives:

1. **Analyze Historical Data:**  
   Identify patterns and trends in historical stock data to gain a deeper understanding of market behavior.

2. **Predict Stock Prices:**  
   Use machine learning models to forecast stock prices and classify potential movements, enhancing predictive accuracy.

3. **Optimize Investment Strategies:**  
   Implement portfolio optimization techniques and conduct backtesting to evaluate the effectiveness of various investment strategies.

<h4 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Problem Statement</h4>

Investors face three critical challenges in the stock market:

1. **Identifying Profitable Opportunities:**  
   Navigating the vast array of data to pinpoint stocks with high potential for returns.

2. **Managing Risks:**  
   Balancing the need for returns with effective risk mitigation strategies.

3. **Optimizing Portfolio Allocation:**  
   Allocating assets efficiently to maximize returns while maintaining an acceptable level of risk.



# 2-Data-collection-and-preprocessing
____

<h3 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Key Libraries</h3>

This project relies on Python libraries for financial analysis, portfolio optimization, and technical analysis. They handle data retrieval, manipulation, visualization, and modeling throughout the notebook. The key libraries and their roles are:


- **`yfinance`**: Provides access to historical stock price data, financial statements, and other key metrics for a wide range of stocks, retrieved directly from Yahoo Finance.

- **`QuantStats`**: Specializes in quantitative finance, with tools for analyzing investment strategies, backtesting, and evaluating portfolio performance. It provides a suite of functions for detailed financial analysis and visualization of key metrics.

- **`PyPortfolioOpt`**: Focuses on portfolio optimization, enabling users to construct optimal portfolios based on criteria such as risk, return, and constraints. It supports mean-variance optimization and Black-Litterman models.

- **`TA-Lib`**: The Technical Analysis Library offers a wide range of technical indicators for analyzing stock price data, including moving averages, RSI, MACD, Bollinger Bands, and other commonly used indicators.

- **`Plotly`**: Offers interactive visualization capabilities for exploring stock data, including interactive charts and dashboards.

Other commonly used libraries: 

- **`pandas`**: Essential for data manipulation and analysis. It provides the data structures and functions used to clean, transform, and analyze the stock data.

- **`NumPy`**: A fundamental library for numerical computing, with support for large, multi-dimensional arrays and matrices and a collection of mathematical functions that operate on them.

- **`Matplotlib` and `Seaborn`**: Used together for data visualization, producing the plots, charts, and graphs that show trends, patterns, and relationships in the stock data.

- **`scikit-learn`**: A machine learning library with tools for building predictive models, evaluating performance, and tuning parameters. It includes functions for regression, classification, clustering, and model evaluation.

Combined with Python's data science ecosystem, these libraries cover the full workflow of financial analysis and stock market prediction. The subsequent sections cover collecting, preprocessing, and analyzing stock data to derive actionable insights for investors.

In [1]:
# Data Handling and Statistical Analysis
import pandas as pd
from pandas_datareader import data
import numpy as np
from scipy import stats
import skimpy as sp
pd.set_option('display.max_columns', None)


# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Interactive visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)  # Enable Plotly offline


# Financial Data and Analysis
import ta
import talib
import quantstats as qs
import yfinance as yf
from pypfopt.efficient_frontier import EfficientFrontier
from pypfopt import risk_models, expected_returns
from pypfopt import black_litterman, BlackLittermanModel



# Machine Learning and Optimization
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
# HalvingGridSearchCV is experimental and must be enabled before it is imported
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV
from sklearn.linear_model import SGDClassifier


import warnings 
warnings.filterwarnings('ignore')

<h3 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Data Retrieval with API</h3>

To initiate the project, two key data sources are required for the selected companies:

- Historical stock returns 
- Historical stock price data 

These two data sources provide the foundation for the analysis: exploring stock trends, calculating financial metrics, and building predictive models. The `yfinance` library is used to retrieve historical stock price data, and the `QuantStats` library is used to obtain the corresponding stock returns.
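The two inputs are closely related: a daily simple return is the percentage change between consecutive closing prices. The sketch below illustrates this on a small made-up price series using pandas' `pct_change`. The prices and the helper name `simple_returns` are illustrative assumptions, and the sketch does not claim to reproduce how `QuantStats` computes its returns internally.

```python
import pandas as pd


def simple_returns(prices: pd.Series) -> pd.Series:
    """Daily simple returns: r_t = P_t / P_{t-1} - 1."""
    return prices.pct_change().dropna()


# Hypothetical closing prices, for illustration only
prices = pd.Series(
    [100.0, 102.0, 99.96],
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
    name="close",
)

returns = simple_returns(prices)
print(returns.round(4))
```

The first observation has no previous price, so it is dropped rather than reported as a return.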


In [2]:
# Define the time window and stock tickers
start = '2010-01-01'
end = '2021-12-31'
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']

# Loop through tickers to download and save data
for ticker in tickers:
    # Download historical prices (yfinance treats `end` as exclusive)
    prices = yf.download(ticker, start=start, end=end)
    prices.to_csv(f'{ticker.lower()}_price.csv')

    # Download returns and restrict them to the same time window
    returns = qs.utils.download_returns(ticker).loc[start:end]
    returns.to_csv(f'{ticker.lower()}_returns.csv')


[*********************100%***********************]  1 of 1 completed


In [3]:
# Loading the Stock Returns
aapl_returns = pd.read_csv('aapl_returns.csv', index_col=0, parse_dates=True) 
msft_returns = pd.read_csv('msft_returns.csv', index_col=0, parse_dates=True) 
googl_returns = pd.read_csv('googl_returns.csv', index_col=0, parse_dates=True) 
amzn_returns = pd.read_csv('amzn_returns.csv', index_col=0, parse_dates=True) 

In [4]:
# Loading the Stock historical prices:
aapl_price = pd.read_csv('aapl_price.csv', index_col=0, parse_dates=True) 
msft_price = pd.read_csv('msft_price.csv', index_col=0, parse_dates=True) 
googl_price = pd.read_csv('googl_price.csv', index_col=0, parse_dates=True) 
amzn_price = pd.read_csv('amzn_price.csv', index_col=0, parse_dates=True) 

In [5]:
# Display the first 5 rows of the apple data returns 
aapl_returns.head()

| Date | AAPL |
|:-----|:-----|
| 2010-01-04 00:00:00+00:00 | 0.015565 |
| 2010-01-05 00:00:00+00:00 | 0.001729 |
| 2010-01-06 00:00:00+00:00 | -0.015906 |
| 2010-01-07 00:00:00+00:00 | -0.001849 |
| 2010-01-08 00:00:00+00:00 | 0.006648 |


In [6]:
# Display the first 5 rows of the microsoft data returns
msft_returns.head()

| Date | MSFT |
|:-----|:-----|
| 2010-01-04 00:00:00+00:00 | 0.01542 |
| 2010-01-05 00:00:00+00:00 | 0.000323 |
| 2010-01-06 00:00:00+00:00 | -0.006137 |
| 2010-01-07 00:00:00+00:00 | -0.0104 |
| 2010-01-08 00:00:00+00:00 | 0.006897 |


In [7]:
# Display the first 5 rows of the google data returns
googl_returns.head()

| Date | GOOGL |
|:-----|:------|
| 2010-01-04 00:00:00+00:00 | 0.01092 |
| 2010-01-05 00:00:00+00:00 | -0.004404 |
| 2010-01-06 00:00:00+00:00 | -0.025209 |
| 2010-01-07 00:00:00+00:00 | -0.02328 |
| 2010-01-08 00:00:00+00:00 | 0.013331 |


In [8]:
# Display the first 5 rows of the amazon data returns
amzn_returns.head()

| Date | AMZN |
|:-----|:-----|
| 2010-01-04 00:00:00+00:00 | -0.004609 |
| 2010-01-05 00:00:00+00:00 | 0.0059 |
| 2010-01-06 00:00:00+00:00 | -0.018116 |
| 2010-01-07 00:00:00+00:00 | -0.017013 |
| 2010-01-08 00:00:00+00:00 | 0.027077 |


In [9]:
# Display the first 5 rows of the apple data price
aapl_price.head()

| Price | Adj Close | Close | High | Low | Open | Volume |
|:------|:----------|:------|:-----|:----|:-----|:-------|
| Ticker | AAPL | AAPL | AAPL | AAPL | AAPL | AAPL |
| Date | | | | | | |
| 2010-01-04 00:00:00+00:00 | 6.447412014007568 | 7.643214225769043 | 7.660714149475098 | 7.585000038146973 | 7.622499942779541 | 493729600 |
| 2010-01-05 00:00:00+00:00 | 6.458559513092041 | 7.656428813934326 | 7.699643135070801 | 7.6160712242126465 | 7.664286136627197 | 601904800 |
| 2010-01-06 00:00:00+00:00 | 6.3558268547058105 | 7.534643173217773 | 7.68678617477417 | 7.526785850524902 | 7.656428813934326 | 552160000 |


In [10]:
# Display the first 5 rows of the microsoft data price
msft_price.head()

| Price | Adj Close | Close | High | Low | Open | Volume |
|:------|:----------|:------|:-----|:----|:-----|:-------|
| Ticker | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT |
| Date | | | | | | |
| 2010-01-04 00:00:00+00:00 | 23.300683975219727 | 30.950000762939453 | 31.100000381469727 | 30.59000015258789 | 30.6200008392334 | 38409100 |
| 2010-01-05 00:00:00+00:00 | 23.30821418762207 | 30.959999084472656 | 31.100000381469727 | 30.639999389648438 | 30.850000381469727 | 49749600 |
| 2010-01-06 00:00:00+00:00 | 23.16516876220703 | 30.770000457763672 | 31.079999923706055 | 30.520000457763672 | 30.8799991607666 | 58182400 |


In [11]:
# Display the first 5 rows of the google data price
googl_price.head()

| Price | Adj Close | Close | High | Low | Open | Volume |
|:------|:----------|:------|:-----|:----|:-----|:-------|
| Ticker | GOOGL | GOOGL | GOOGL | GOOGL | GOOGL | GOOGL |
| Date | | | | | | |
| 2010-01-04 00:00:00+00:00 | 15.62778091430664 | 15.684433937072754 | 15.753503799438477 | 15.621622085571289 | 15.689438819885254 | 78169752 |
| 2010-01-05 00:00:00+00:00 | 15.558961868286133 | 15.615365028381348 | 15.711711883544922 | 15.554054260253906 | 15.695195198059082 | 120067812 |
| 2010-01-06 00:00:00+00:00 | 15.166741371154785 | 15.221721649169922 | 15.662161827087402 | 15.174174308776855 | 15.662161827087402 | 158988852 |


In [12]:
# Display the first 5 rows of the amazon data price
amzn_price.head()

| Price | Adj Close | Close | High | Low | Open | Volume |
|:------|:----------|:------|:-----|:----|:-----|:-------|
| Ticker | AMZN | AMZN | AMZN | AMZN | AMZN | AMZN |
| Date | | | | | | |
| 2010-01-04 00:00:00+00:00 | 6.695000171661377 | 6.695000171661377 | 6.83050012588501 | 6.6570000648498535 | 6.8125 | 151998000 |
| 2010-01-05 00:00:00+00:00 | 6.734499931335449 | 6.734499931335449 | 6.77400016784668 | 6.5904998779296875 | 6.671500205993652 | 177038000 |
| 2010-01-06 00:00:00+00:00 | 6.612500190734863 | 6.612500190734863 | 6.736499786376953 | 6.582499980926514 | 6.730000019073486 | 143576000 |


Following the retrieval of the necessary data, the next step is to clean and format it to ensure consistency and accuracy. This process includes removing time zone information, handling missing values, converting data types, renaming columns, and normalizing data for analysis. The subsequent sections cover the data preprocessing steps and feature engineering techniques used to prepare the stock data for analysis.

<h3 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Data Cleaning and Formatting </h3>

This section focuses on preparing the dataset to ensure consistency, accuracy, and compatibility for subsequent analysis. The following steps will be implemented to achieve a clean and well-structured dataset:

- **Convert Time Zones**: Adjust or remove time zone information to standardize the timestamps across all records.
- **Handle Missing Values**: Address missing data using appropriate techniques or removal based on the context.
- **Standardize Column Headers**: Ensure column names follow a consistent format, such as lowercase with underscores, for better readability and compatibility.
- **Convert Data Types**: Modify data types as needed to align with the intended usage, for example converting strings to datetime or categorical data to numeric values.
- **Rename Columns**: Rename headers to improve clarity and ensure they accurately represent the underlying data.
- **Normalize Data**: Scale or normalize numeric data to bring all features onto a comparable scale for analysis or modeling.
- **Remove Redundant Data**: Drop unnecessary columns or rows that do not add value to the analysis or are duplicates.

After completing these steps, the cleaned dataset will be stored in a Pandas DataFrame, enabling easy manipulation, visualization, and exploration for further analysis.
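As a rough illustration of how these steps fit together, the sketch below applies them to a tiny made-up table. The helper names `clean_prices` and `min_max_scale`, the sample values, and the choice of min-max scaling for the normalization step are assumptions made for this example; the notebook's own cells perform the steps one at a time.

```python
import pandas as pd


def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of a raw price table."""
    out = df.copy()
    # Standardize headers: trim, lowercase, spaces -> underscores
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Drop rows with missing values and rows with a repeated timestamp
    out = out.dropna()
    out = out[~out.index.duplicated(keep="first")]
    # Convert every remaining column to float
    out = out.astype(float)
    # Remove time zone information (timestamps are converted to UTC first)
    out.index = out.index.tz_convert(None)
    return out


def min_max_scale(s: pd.Series) -> pd.Series:
    """Scale a series to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())


# Hypothetical raw data, for illustration only
idx = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"], utc=True)
raw = pd.DataFrame(
    {" Close": ["10.0", "12.0", "14.0"], "Volume ": ["100", None, "300"]},
    index=idx,
)

clean = clean_prices(raw)
clean["close_scaled"] = min_max_scale(clean["close"])
print(clean)
```

Working on a copy keeps the raw table intact, which makes it easier to re-run the cleaning with different choices.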


#### **Historical Stock Returns Data**

In [13]:
# Remove time zone information from the index (timestamps are converted to UTC first)
aapl_returns.index = aapl_returns.index.tz_convert(None)
msft_returns.index = msft_returns.index.tz_convert(None)
googl_returns.index = googl_returns.index.tz_convert(None)
amzn_returns.index = amzn_returns.index.tz_convert(None)

In [14]:
# Rename the columns to reflect the stock returns data
aapl_returns.columns = ['aapl_returns']
googl_returns.columns = ['googl_returns']
msft_returns.columns = ['msft_returns']
amzn_returns.columns = ['amzn_returns']

In [15]:
# Display the first few rows of aapl_returns
aapl_returns.head()

| Date | aapl_returns |
|:-----|:-------------|
| 2010-01-04 | 0.015565 |
| 2010-01-05 | 0.001729 |
| 2010-01-06 | -0.015906 |
| 2010-01-07 | -0.001849 |
| 2010-01-08 | 0.006648 |


In [16]:
googl_returns.head()

| Date | googl_returns |
|:-----|:--------------|
| 2010-01-04 | 0.01092 |
| 2010-01-05 | -0.004404 |
| 2010-01-06 | -0.025209 |
| 2010-01-07 | -0.02328 |
| 2010-01-08 | 0.013331 |


In [17]:
msft_returns.head()

| Date | msft_returns |
|:-----|:-------------|
| 2010-01-04 | 0.01542 |
| 2010-01-05 | 0.000323 |
| 2010-01-06 | -0.006137 |
| 2010-01-07 | -0.0104 |
| 2010-01-08 | 0.006897 |


In [18]:
amzn_returns.head()

| Date | amzn_returns |
|:-----|:-------------|
| 2010-01-04 | -0.004609 |
| 2010-01-05 | 0.0059 |
| 2010-01-06 | -0.018116 |
| 2010-01-07 | -0.017013 |
| 2010-01-08 | 0.027077 |


#### **Historical Stock Prices Data**


In [19]:
# Drop Redundant Data: Remove the Adj Close column.
aapl_price = aapl_price.drop(['Adj Close'], axis=1)
msft_price = msft_price.drop(['Adj Close'], axis=1)
googl_price = googl_price.drop(['Adj Close'], axis=1)
amzn_price = amzn_price.drop(['Adj Close'], axis=1)

In [20]:
# Index Management: Reset the index for uniformity.
aapl_price.reset_index(inplace=True)
msft_price.reset_index(inplace=True)
googl_price.reset_index(inplace=True)
amzn_price.reset_index(inplace=True)

In [21]:
# convert the columns headers to lower case
def clean_columns_headers(df):
    df.columns = df.columns.str.strip()
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace(' ', '_')
    
    return df

# Clean the columns headers
aapl_price = clean_columns_headers(aapl_price)
msft_price = clean_columns_headers(msft_price)
googl_price = clean_columns_headers(googl_price)
amzn_price = clean_columns_headers(amzn_price)

In [22]:
# Drop rows with missing values, then drop the leftover header row.
# dropna() removes the empty 'Date' row; the row at index 0 is the 'Ticker' row.
def drop_rows(df):
    df.dropna(inplace=True)
    df.drop(index=0, inplace=True)
    return df

aapl_price = drop_rows(aapl_price)
msft_price = drop_rows(msft_price)
googl_price = drop_rows(googl_price)
amzn_price = drop_rows(amzn_price)
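The rows dropped here are the extra `Ticker` and `Date` header lines visible in the raw price tables. As an alternative sketch, pandas can read such a file with a two-level column header, so that those lines never become data rows. The CSV text below is a shortened, made-up stand-in for the saved price files, and this approach is offered as an option rather than as the notebook's method.

```python
import io

import pandas as pd

# Hypothetical CSV text mimicking the three header rows of the saved files
csv_text = """Price,Adj Close,Close
Ticker,AAPL,AAPL
Date,,
2010-01-04 00:00:00+00:00,6.45,7.64
2010-01-05 00:00:00+00:00,6.46,7.66
"""

# Treat the first two rows as a two-level column header; the 'Date' row
# is then read as the index name rather than as data
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0, parse_dates=True)

# Keep only the first header level ('Adj Close', 'Close')
df.columns = df.columns.get_level_values(0)
print(df)
```

With this reading, the numeric columns are parsed as floats directly, so the later string-to-float conversion would not be needed.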

In [23]:
# Rename the first column to 'date'
aapl_price.rename(columns={'price':'date'}, inplace=True)
msft_price.rename(columns={'price':'date'}, inplace=True)
googl_price.rename(columns={'price':'date'}, inplace=True)
amzn_price.rename(columns={'price':'date'}, inplace=True)

In [24]:
# Convert price and volume columns to float and parse the date column
def convert_data_types(df):
    numeric_cols = ['open', 'high', 'low', 'close', 'volume']
    df[numeric_cols] = df[numeric_cols].astype(float)
    df['date'] = pd.to_datetime(df['date'])
    return df


# Apply the type conversion to each price table
aapl_price = convert_data_types(aapl_price)
msft_price = convert_data_types(msft_price)
googl_price = convert_data_types(googl_price)
amzn_price = convert_data_types(amzn_price)

In [25]:
# Setting the date as the index
aapl_price.set_index('date', inplace=True)
msft_price.set_index('date', inplace=True)
googl_price.set_index('date', inplace=True)
amzn_price.set_index('date', inplace=True)

In [26]:
# Remove time zone information from the index (timestamps are converted to UTC first)
aapl_price.index = aapl_price.index.tz_convert(None)
msft_price.index = msft_price.index.tz_convert(None)
googl_price.index = googl_price.index.tz_convert(None)
amzn_price.index = amzn_price.index.tz_convert(None)

In [27]:
# view the data to check the changes to the data
aapl_price.head()

| date | close | high | low | open | volume |
|:-----|:------|:-----|:----|:-----|:-------|
| 2010-01-04 | 7.643214 | 7.660714 | 7.585 | 7.6225 | 493729600.0 |
| 2010-01-05 | 7.656429 | 7.699643 | 7.616071 | 7.664286 | 601904800.0 |
| 2010-01-06 | 7.534643 | 7.686786 | 7.526786 | 7.656429 | 552160000.0 |
| 2010-01-07 | 7.520714 | 7.571429 | 7.466071 | 7.5625 | 477131200.0 |
| 2010-01-08 | 7.570714 | 7.571429 | 7.466429 | 7.510714 | 447610800.0 |


In [28]:
# view the data to check the changes to the data
msft_price.head()

| date | close | high | low | open | volume |
|:-----|:------|:-----|:----|:-----|:-------|
| 2010-01-04 | 30.950001 | 31.1 | 30.59 | 30.620001 | 38409100.0 |
| 2010-01-05 | 30.959999 | 31.1 | 30.639999 | 30.85 | 49749600.0 |
| 2010-01-06 | 30.77 | 31.08 | 30.52 | 30.879999 | 58182400.0 |
| 2010-01-07 | 30.450001 | 30.700001 | 30.190001 | 30.629999 | 50559700.0 |
| 2010-01-08 | 30.66 | 30.879999 | 30.24 | 30.280001 | 51197400.0 |


In [29]:
# view the data to check the changes to the data
amzn_price.head()

| date | close | high | low | open | volume |
|:-----|:------|:-----|:----|:-----|:-------|
| 2010-01-04 | 6.695 | 6.8305 | 6.657 | 6.8125 | 151998000.0 |
| 2010-01-05 | 6.7345 | 6.774 | 6.5905 | 6.6715 | 177038000.0 |
| 2010-01-06 | 6.6125 | 6.7365 | 6.5825 | 6.73 | 143576000.0 |
| 2010-01-07 | 6.5 | 6.616 | 6.44 | 6.6005 | 220604000.0 |
| 2010-01-08 | 6.676 | 6.684 | 6.4515 | 6.528 | 196610000.0 |


In [30]:
# view the data to check the changes to the data
googl_price.head()

| date | close | high | low | open | volume |
|:-----|:------|:-----|:----|:-----|:-------|
| 2010-01-04 | 15.684434 | 15.753504 | 15.621622 | 15.689439 | 78169752.0 |
| 2010-01-05 | 15.615365 | 15.711712 | 15.554054 | 15.695195 | 120067812.0 |
| 2010-01-06 | 15.221722 | 15.662162 | 15.174174 | 15.662162 | 158988852.0 |
| 2010-01-07 | 14.867367 | 15.265265 | 14.831081 | 15.25025 | 256315428.0 |
| 2010-01-08 | 15.065566 | 15.096346 | 14.742492 | 14.814815 | 188783028.0 |


We now have the `daily returns` and the `historical prices` of the four stocks. The next step is to extract other relevant financial metrics and perform exploratory data analysis to gain insight into stock market trends.

# 3-Feature-engineering-for-machine-learning
____

Feature engineering is a critical step in building robust machine learning models for stock price analysis. It involves creating relevant and informative features from raw data to enhance the predictive performance of the model. In this analysis, the focus is on generating features from `technical indicators` and `fundamental metrics`. These features capture essential aspects of stock market behavior, offering insights into trends, volatility, and financial health.


<h3 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Technical Indicators</h3>

Technical indicators are mathematical calculations derived from historical price, volume, or other market data to provide insights into market trends, momentum, volatility, and volume. These indicators are essential for understanding market behavior and identifying potential opportunities for trading or investment. There are 4 major categories of technical indicators:


1.	**Trend Indicators**:    
These indicators identify the direction of market movements and help traders determine whether a market is in an uptrend, downtrend, or consolidating. Examples include:      
* `Moving Averages (MA)`: Smooths out price data to identify trends over time.   
* `Exponential Moving Average (EMA)`: Gives more weight to recent prices for faster responses to price changes.   
* `MACD (Moving Average Convergence Divergence)`: Highlights changes in momentum and trend direction.   
* `Parabolic SAR`: Indicates potential reversals in market trends.    	

2.	**Momentum Indicators**:   
These indicators measure the speed and strength of price movements, helping traders identify overbought or oversold conditions. Examples include:   
* `Relative Strength Index (RSI)`: Measures the speed and magnitude of recent price changes to evaluate overbought or oversold conditions.   
* `Stochastic Oscillator`: Compares a security's closing price to its price range over a specific period.       
* `Rate of Change (ROC)`: Measures the percentage change in price between the current price and a past price.   

3.	**Volatility Indicators**:   
These indicators quantify the degree of price fluctuations in the market, helping traders assess risk and potential price movements. Examples include:        
* `Bollinger Bands`: Consist of a moving average and two standard deviation bands to identify price volatility.   
* `Average True Range (ATR)`: Measures market volatility by calculating the average range between price highs and lows.   
* `Keltner Channels`: Similar to Bollinger Bands, but use average true range to set channel boundaries.   

4.	**Volume Indicators**:   
These indicators analyze trading volume to assess the strength of price movements and identify potential reversals. Examples include:      
* `On-Balance Volume (OBV)`: Keeps a running total of volume, added on up days and subtracted on down days, to gauge buying and selling pressure.   
* `Accumulation/Distribution Line`: Combines price and volume data to assess the flow of money in and out of a security.   
* `Chaikin Money Flow (CMF)`: Measures the buying and selling pressure for a security.   
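To make the four categories concrete, the sketch below computes one indicator from each with plain pandas on a tiny made-up OHLCV table: EMA (trend), ROC (momentum), ATR (volatility), and OBV (volume). The function names, window lengths, and sample values are assumptions for illustration. The ATR here uses a simple rolling mean of the true range, which is a simplification of the smoothed average used in many charting packages, and OBV is started at zero by convention.

```python
import numpy as np
import pandas as pd


def ema(close: pd.Series, span: int) -> pd.Series:
    """Trend: exponential moving average."""
    return close.ewm(span=span, adjust=False).mean()


def roc(close: pd.Series, periods: int) -> pd.Series:
    """Momentum: percentage change versus the price `periods` rows ago."""
    return close.pct_change(periods=periods) * 100


def atr(high: pd.Series, low: pd.Series, close: pd.Series, window: int) -> pd.Series:
    """Volatility: rolling mean of the true range (simple-average variant)."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)
    return true_range.rolling(window).mean()


def obv(close: pd.Series, volume: pd.Series) -> pd.Series:
    """Volume: add volume on up days, subtract it on down days."""
    direction = np.sign(close.diff()).fillna(0)
    return (direction * volume).cumsum()


# Hypothetical OHLCV data, for illustration only
df = pd.DataFrame({
    "high": [11.0, 12.0, 12.5, 12.0],
    "low": [9.0, 10.5, 11.0, 10.0],
    "close": [10.0, 12.0, 11.0, 11.0],
    "volume": [100.0, 200.0, 150.0, 50.0],
})
df["ema_3"] = ema(df["close"], span=3)
df["roc_1"] = roc(df["close"], periods=1)
df["atr_2"] = atr(df["high"], df["low"], df["close"], window=2)
df["obv"] = obv(df["close"], df["volume"])
print(df)
```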


<h3 style="font-family: 'Cambria', Georgia, serif; font-weight: bold; margin-bottom: 20px; text-align: left; letter-spacing: 1px;">Fundamental Metrics </h3>   

In addition to technical indicators, fundamental metrics provide valuable insights into a company's financial health, performance, and valuation. These metrics are derived from financial statements, earnings reports, and other fundamental data sources, offering a comprehensive view of a company's operations and prospects. Incorporating fundamental metrics into the feature set is intended to improve the predictive power of the machine learning models and to give a deeper understanding of the factors driving stock price movements. The key categories of fundamental metrics are:   


1.	**Valuation Metrics**:    
These metrics assess the relative value of a company's stock by comparing its market price to fundamental indicators such as earnings, book value, and cash flow. Examples include:   
*  `Price-to-Earnings (P/E) Ratio`: Compares a company's stock price to its earnings per share to evaluate valuation.   
* `Price-to-Book (P/B) Ratio`: Compares a company's stock price to its book value per share to assess valuation.   
* `Price-to-Sales (P/S) Ratio`: Compares a company's stock price to its revenue per share to evaluate valuation.   

2.	**Profitability Metrics**:    
These metrics measure a company's ability to generate profits and manage costs effectively. Examples include:
* `Return on Equity (ROE)`: Measures a company's profitability by evaluating its return on shareholders' equity.   
* `Net Profit Margin`: Measures the percentage of revenue that translates into profit after accounting for expenses.   
* `Operating Margin`: Measures the percentage of revenue that translates into profit after accounting for operating expenses.   

3.	**Growth Metrics**:   
These metrics assess a company's growth prospects and potential for future expansion. Examples include:
* `Revenue Growth Rate`: Measures the percentage increase in a company's revenue over a specific period.   
* `Earnings Growth Rate`: Measures the percentage increase in a company's earnings over a specific period.   
* `Dividend Yield`: Measures the percentage of dividends paid relative to a company's stock price.   

4.	**Financial Health Metrics**:    
These metrics evaluate a company's financial stability, liquidity, and debt levels. Examples include:
* `Debt-to-Equity Ratio`: Measures a company's debt relative to its equity to assess financial leverage.   
* `Current Ratio`: Measures a company's ability to cover short-term liabilities with its short-term assets.   
* `Interest Coverage Ratio`: Measures a company's ability to pay interest on its debt with its earnings.   
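Most of these metrics are simple ratios of figures taken from financial statements. The sketch below works through one metric from each category using made-up numbers. The `Fundamentals` class, its field names, and the sample values are hypothetical, and the debt-to-equity ratio is computed here from total debt, which is one of several definitions in use.

```python
from dataclasses import dataclass


@dataclass
class Fundamentals:
    """Hypothetical inputs; the field names are illustrative assumptions."""
    price: float                # share price
    eps: float                  # earnings per share
    net_income: float
    shareholders_equity: float
    total_debt: float
    revenue: float              # current period
    prior_revenue: float        # previous period


def pe_ratio(f: Fundamentals) -> float:
    """Valuation: price divided by earnings per share."""
    return f.price / f.eps


def return_on_equity(f: Fundamentals) -> float:
    """Profitability: net income divided by shareholders' equity."""
    return f.net_income / f.shareholders_equity


def revenue_growth(f: Fundamentals) -> float:
    """Growth: relative change in revenue versus the previous period."""
    return f.revenue / f.prior_revenue - 1


def debt_to_equity(f: Fundamentals) -> float:
    """Financial health: total debt divided by shareholders' equity."""
    return f.total_debt / f.shareholders_equity


# Made-up figures, for illustration only
sample = Fundamentals(
    price=150.0, eps=6.0, net_income=50.0, shareholders_equity=200.0,
    total_debt=100.0, revenue=330.0, prior_revenue=300.0,
)
print(pe_ratio(sample), return_on_equity(sample),
      revenue_growth(sample), debt_to_equity(sample))
```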
	

# 4-Selected-features-for-technical-indicators
___

Given the range of available features described above, the choice of features is crucial for building accurate and robust machine learning models for stock price analysis. The selected features should capture essential aspects of stock market dynamics, including trends, momentum, volatility, and financial health. Incorporating them into the machine learning models is intended to improve predictive power and provide insight into stock price movements. The following features have been selected, and the choices are supported by academic and industry research:

| Indicator Type | Indicator Name | Abbreviation | Description |
|:--------------|:---------------|:-------------|:------------|
| Trend | Moving Average | MA | Smooths price data to show trend direction |
| Trend | Moving Average Convergence Divergence | MACD | Shows relationship between two moving averages |
| Momentum | Relative Strength Index | RSI | Measures speed and magnitude of price changes |
| Volatility | Bollinger Bands | BB | Shows price volatility with standard deviation bands |
| Volume | On-Balance Volume | OBV | Relates volume to price changes |

1. **`Moving Average (MA):`**  
According to research by Dr. P. H. Zope, published in the International Journal of Research Publication and Reviews (Vol. 4, No. 6, June 2023), the moving average is a widely recognized tool in time series analysis for identifying long-term trends by smoothing out short-term price fluctuations. This technique helps filter market noise, giving a clearer view of underlying price trends. The study compares three key methods, Simple Moving Average (SMA), Weighted Moving Average (WMA), and Exponential Moving Average (EMA), highlighting their effectiveness in forecasting stock prices. Among these, the moving average based on a 20-day period is particularly noted for its utility in stock price prediction due to its balance of responsiveness and trend stability. Thus, the 20-day moving average is selected as a key feature for trend analysis.

<font color = red> **Reference** : Zope, P. H. (2023). ["Stock Price Prediction using Moving Average Time Series"](https://ijrpr.com/uploads/V4ISSUE6/IJRPR14104.pdf) International Journal of Research Publication and Reviews, Vol. 4, No. 6. </font>


In [31]:
# Moving Average (SMA) of 20 days
aapl_price['sma_20'] = aapl_price['close'].rolling(window=20).mean()
msft_price['sma_20'] = msft_price['close'].rolling(window=20).mean()
googl_price['sma_20'] = googl_price['close'].rolling(window=20).mean()
amzn_price['sma_20'] = amzn_price['close'].rolling(window=20).mean()
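
To make the difference between the three averaging methods concrete, here is a minimal plain-Python sketch. The function names (`sma`, `wma`, `ema`), the linear weighting used for the WMA, the choice to seed the EMA with the first price, and the sample prices are all illustrative assumptions; none of them come from the cited study or from the pandas code above.

```python
def sma(prices, window):
    """Simple moving average: unweighted mean of the last `window` prices."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)  # not enough history yet (pandas would show NaN)
        else:
            out.append(sum(prices[i + 1 - window : i + 1]) / window)
    return out


def wma(prices, window):
    """Linearly weighted moving average: the newest price gets the largest weight."""
    weights = list(range(1, window + 1))  # assumed weighting scheme: 1, 2, ..., window
    total = sum(weights)
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = prices[i + 1 - window : i + 1]
            out.append(sum(w * p for w, p in zip(weights, chunk)) / total)
    return out


def ema(prices, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with the first price."""
    alpha = 2 / (span + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # made-up prices for illustration
print(sma(closes, 3))
print(wma(closes, 3))
print(ema(closes, 3))
```

On a rising series like this one, the WMA sits above the SMA because it gives recent prices more weight, which is the responsiveness trade-off the paragraph above describes.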

2. **`Moving Average Convergence Divergence (MACD)`:**  
The Moving Average Convergence Divergence (MACD) is a widely used momentum indicator in technical analysis, designed to identify changes in the strength, direction, momentum, and duration of a trend in a stock's price. It does this by taking the difference between a short-term and a long-term exponential moving average (EMA) of closing prices. The standard MACD parameters are:
        
- Fast Period (12): This parameter represents the short-term EMA, capturing recent price movements to reflect the latest market sentiment.   
- Slow Period (26): This denotes the long-term EMA, smoothing out price fluctuations to highlight the overarching trend.    
- Signal Period (9): This is the EMA of the MACD line itself, serving as a trigger for buy or sell signals based on its crossover with the MACD line.    

The MACD indicator, as discussed in the International Journal of Engineering Research and Technology (IJERT), is a powerful tool for identifying trend reversals and momentum shifts in stock prices. By comparing two moving averages, the MACD provides insights into the strength and direction of price movements, enabling traders to make informed decisions. The study highlights the MACD's effectiveness in predicting stock price movements and its utility in technical analysis. Thus, the MACD is selected as a key feature for trend analysis.

<font color = red> **Reference**: ["A Comparative Study of the MACD-based Trading Strategies: Evidence from the US Stock Market"](https://arxiv.org/abs/2206.12282) International Journal of Engineering Research and Technology, Vol. 10, No. 6, June 2022. </font>



In [32]:
# Moving Average Convergence Divergence (MACD)


# Define the parameters 
fastperiod = 12
slowperiod = 26
signalperiod = 9

# macd for aapl
aapl_price['macd'], aapl_price['macd_signal'], aapl_price['macd_hist'] = talib.MACD(aapl_price['close'], 
    fastperiod=fastperiod, 
    slowperiod=slowperiod, 
    signalperiod=signalperiod
)

# macd for msft
msft_price['macd'], msft_price['macd_signal'], msft_price['macd_hist'] = talib.MACD(msft_price['close'], 
    fastperiod=fastperiod, 
    slowperiod=slowperiod, 
    signalperiod=signalperiod
)

# macd for googl
googl_price['macd'], googl_price['macd_signal'], googl_price['macd_hist'] = talib.MACD(googl_price['close'], 
    fastperiod=fastperiod, 
    slowperiod=slowperiod, 
    signalperiod=signalperiod
)

# macd for amzn
amzn_price['macd'], amzn_price['macd_signal'], amzn_price['macd_hist'] = talib.MACD(amzn_price['close'], 
    fastperiod=fastperiod, 
    slowperiod=slowperiod, 
    signalperiod=signalperiod
)
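
The three MACD outputs can be traced step by step with a small plain-Python sketch. This is only an illustration of the arithmetic described above, not a reimplementation of `talib.MACD`: the helper names and the sample series are assumptions, the EMAs here are seeded with the first value, and TA-Lib handles the warm-up rows differently (it leaves them empty, as the outputs further down show), so early values are not expected to match.

```python
def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with the first value."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) as equal-length lists."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    # MACD line: short-term EMA minus long-term EMA
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    # Signal line: EMA of the MACD line itself
    signal_line = ema(macd_line, signal)
    # Histogram: distance between the MACD line and its signal line
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


closes = [100.0 + day for day in range(60)]  # made-up, steadily rising prices
macd_line, signal_line, histogram = macd(closes)
print(macd_line[-1] > 0)  # in an uptrend the fast EMA stays above the slow EMA
```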


3. **`Relative Strength Index (RSI)`:**  
The Relative Strength Index (RSI) is a popular momentum oscillator that measures the speed and magnitude of recent price movements in order to identify overbought or oversold conditions in a stock. It is calculated from the average gain and average loss over a specified period, typically 14 days. The RSI ranges from 0 to 100; values above 70 are conventionally read as overbought and values below 30 as oversold.

As discussed in the American Journal of Engineering Research (AJER), the RSI's ability to flag potential reversals and trend changes makes it a valuable source of information about market sentiment and momentum in stock price analysis.

<font color = red> **Reference**: ["A Study on Technical Indicators in Stock Price Movement Prediction"](https://ajer.org/papers/v5(12)/Z05120207212.pdf) American Journal of Engineering Research (AJER), Vol. 5, Issue 12. </font>


In [33]:
# Relative Strength Index (RSI)
aapl_price['RSI'] = talib.RSI(aapl_price['close'], timeperiod=14)
msft_price['RSI'] = talib.RSI(msft_price['close'], timeperiod=14)
googl_price['RSI'] = talib.RSI(googl_price['close'], timeperiod=14)
amzn_price['RSI'] = talib.RSI(amzn_price['close'], timeperiod=14)
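
The "average gain and average loss" calculation can be sketched in plain Python as follows. The sketch assumes Wilder's smoothing for the running averages, which is the usual convention for RSI but is not confirmed by the text above; the function names, the treatment of a window with no losses (RSI set to 100), and the sample prices are likewise illustrative choices.

```python
def _rsi_from_averages(avg_gain, avg_loss):
    """Convert average gain/loss into an RSI value between 0 and 100."""
    if avg_loss == 0:
        return 100.0  # no losses in the window; convention chosen for this sketch
    rs = avg_gain / avg_loss  # relative strength
    return 100 - 100 / (1 + rs)


def rsi(closes, period=14):
    """RSI with Wilder-style smoothing; the first `period` positions hold None."""
    if len(closes) <= period:
        return [None] * len(closes)
    changes = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(change, 0.0) for change in changes]
    losses = [max(-change, 0.0) for change in changes]

    # Seed with plain averages over the first `period` changes
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    out = [None] * period
    out.append(_rsi_from_averages(avg_gain, avg_loss))

    # Afterwards, blend each new gain/loss into the running averages
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
        out.append(_rsi_from_averages(avg_gain, avg_loss))
    return out


print(rsi([10, 11, 10, 11], period=2))        # alternating up/down moves
print(rsi([1, 2, 3, 4, 5], period=2)[-1])     # only gains
print(rsi([5, 4, 3, 2, 1], period=2)[-1])     # only losses
```

A series with only gains pushes the value to the top of the range and a series with only losses to the bottom, which is what the 70/30 thresholds are measured against.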

4. **`Bollinger Bands`:**  
The Bollinger Bands indicator, developed by John Bollinger, is a popular tool for measuring market volatility and identifying potential overbought or oversold conditions. It consists of a moving average (typically a 20-day SMA) and two bands placed a fixed number of standard deviations (typically two) above and below that average. The bands widen when volatility rises and narrow when it falls, which helps in spotting extreme price movements and potential trend reversals.

According to AJER, these bands are particularly useful for identifying overbought or oversold conditions in highly volatile markets.  

<font color = red> **Reference**: ["Technical Indicators and Their Effectiveness in Trading Strategies"](https://www.ajer.org/papers/v5%2812%29/Z05120207212.pdf) American Journal of Engineering Research (AJER), Vol. 5, Issue 12. </font>


In [34]:
# Bollinger Bands

# Bollinger Bands for AAPL
aapl_price['upper_band'], aapl_price['middle_band'], aapl_price['lower_band'] = talib.BBANDS(
    aapl_price['close'], 
    timeperiod=20, 
    nbdevup=2, 
    nbdevdn=2, 
    matype=0
)

# Bollinger Bands for MSFT
msft_price['upper_band'], msft_price['middle_band'], msft_price['lower_band'] = talib.BBANDS(
    msft_price['close'], 
    timeperiod=20, 
    nbdevup=2, 
    nbdevdn=2, 
    matype=0
)

# Bollinger Bands for GOOGL
googl_price['upper_band'], googl_price['middle_band'], googl_price['lower_band'] = talib.BBANDS(
    googl_price['close'], 
    timeperiod=20, 
    nbdevup=2, 
    nbdevdn=2, 
    matype=0
)

# Bollinger Bands for AMZN
amzn_price['upper_band'], amzn_price['middle_band'], amzn_price['lower_band'] = talib.BBANDS(
    amzn_price['close'], 
    timeperiod=20, 
    nbdevup=2, 
    nbdevdn=2, 
    matype=0
)
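
The band construction itself is short enough to write out by hand. The sketch below is a plain-Python illustration under stated assumptions: it uses the population standard deviation (`statistics.pstdev`), which to my understanding is what TA-Lib's `BBANDS` uses, whereas pandas' `rolling().std()` defaults to the sample standard deviation. The function name and sample prices are made up for the example.

```python
from statistics import pstdev


def bollinger_bands(closes, window=20, num_std=2.0):
    """Return (upper, middle, lower) lists; warm-up positions hold None."""
    upper, middle, lower = [], [], []
    for i in range(len(closes)):
        if i + 1 < window:
            # Not enough history for a full window yet
            upper.append(None)
            middle.append(None)
            lower.append(None)
            continue
        chunk = closes[i + 1 - window : i + 1]
        mid = sum(chunk) / window            # middle band: simple moving average
        width = num_std * pstdev(chunk)      # band half-width: k population std devs
        upper.append(mid + width)
        middle.append(mid)
        lower.append(mid - width)
    return upper, middle, lower


closes = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up prices
upper, middle, lower = bollinger_bands(closes, window=5)
print(upper[-1], middle[-1], lower[-1])
```

When prices do not move at all within the window, the standard deviation is zero and the three bands coincide, which is the limiting case of the "contracting" behavior described above.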


5. **`On-Balance Volume (OBV)`:**  

The On-Balance Volume (OBV) indicator, developed by Joseph Granville, is a volume-based indicator that tracks cumulative trading volume to predict price movements. It is designed to identify buying and selling pressure in the market by analyzing the flow of money into and out of a security. The OBV indicator is calculated by adding the trading volume on days when the price closes higher and subtracting the volume on days when the price closes lower. By evaluating the relationship between volume and price movements, the OBV indicator provides early signals of potential trend reversals and market sentiment.

As demonstrated by AJER, OBV evaluates the flow of money by tracking cumulative trading volume. This indicator provides early signals of potential trend reversals by measuring buying and selling pressure.  
   
<font color = red> **Reference**: ["Volume-Based Indicators in Market Predictions"](https://www.ajer.org/papers/v5%2812%29/Z05120207212.pdf) American Journal of Engineering Research (AJER), Vol. 5, Issue 12. </font>

In [35]:
# On Balance Volume (OBV)

aapl_price['OBV'] = talib.OBV(aapl_price['close'], aapl_price['volume'])
msft_price['OBV'] = talib.OBV(msft_price['close'], msft_price['volume'])
googl_price['OBV'] = talib.OBV(googl_price['close'], googl_price['volume'])
amzn_price['OBV'] = talib.OBV(amzn_price['close'], amzn_price['volume'])
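
The add-or-subtract rule can be checked by hand on the first five AAPL rows. The sketch below is a plain-Python illustration, not TA-Lib's implementation; the function name is hypothetical, and starting the running total at the first day's volume is inferred from the OBV column in the output below rather than from TA-Lib's documentation.

```python
def on_balance_volume(closes, volumes):
    """Cumulative volume: added on up days, subtracted on down days."""
    obv = [volumes[0]]  # assumed start: first value equals the first day's volume
    for i in range(1, len(closes)):
        if closes[i] > closes[i - 1]:
            obv.append(obv[-1] + volumes[i])   # close rose: add the day's volume
        elif closes[i] < closes[i - 1]:
            obv.append(obv[-1] - volumes[i])   # close fell: subtract the day's volume
        else:
            obv.append(obv[-1])                # unchanged close: carry forward
    return obv


# First five AAPL closes and volumes, copied from the notebook output
closes = [7.643214, 7.656429, 7.534643, 7.520714, 7.570714]
volumes = [493729600, 601904800, 552160000, 477131200, 447610800]
print(on_balance_volume(closes, volumes))
```

The resulting values agree with the `OBV` column shown for AAPL, up to the rounding pandas applies when displaying large floats.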


In [36]:
# Display the first few rows of the AAPL data
aapl_price.head()

| date | close | high | low | open | volume | sma_20 | macd | macd_signal | macd_hist | RSI | upper_band | middle_band | lower_band | OBV |
|:-----|:------|:-----|:----|:-----|:-------|:-------|:-----|:------------|:----------|:----|:-----------|:------------|:-----------|:----|
| 2010-01-04 | 7.643214 | 7.660714 | 7.585 | 7.6225 | 493729600.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 493729600.0 |
| 2010-01-05 | 7.656429 | 7.699643 | 7.616071 | 7.664286 | 601904800.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1095634000.0 |
| 2010-01-06 | 7.534643 | 7.686786 | 7.526786 | 7.656429 | 552160000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 543474400.0 |
| 2010-01-07 | 7.520714 | 7.571429 | 7.466071 | 7.5625 | 477131200.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 66343200.0 |
| 2010-01-08 | 7.570714 | 7.571429 | 7.466429 | 7.510714 | 447610800.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 513954000.0 |


In [37]:
# Display the first few rows of the MSFT data
msft_price.head()

| date | close | high | low | open | volume | sma_20 | macd | macd_signal | macd_hist | RSI | upper_band | middle_band | lower_band | OBV |
|:-----|:------|:-----|:----|:-----|:-------|:-------|:-----|:------------|:----------|:----|:-----------|:------------|:-----------|:----|
| 2010-01-04 | 30.950001 | 31.1 | 30.59 | 30.620001 | 38409100.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38409100.0 |
| 2010-01-05 | 30.959999 | 31.1 | 30.639999 | 30.85 | 49749600.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 88158700.0 |
| 2010-01-06 | 30.77 | 31.08 | 30.52 | 30.879999 | 58182400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29976300.0 |
| 2010-01-07 | 30.450001 | 30.700001 | 30.190001 | 30.629999 | 50559700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -20583400.0 |
| 2010-01-08 | 30.66 | 30.879999 | 30.24 | 30.280001 | 51197400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30614000.0 |


In [38]:
# Display the first few rows of the GOOGL data
googl_price.head()

| date | close | high | low | open | volume | sma_20 | macd | macd_signal | macd_hist | RSI | upper_band | middle_band | lower_band | OBV |
|:-----|:------|:-----|:----|:-----|:-------|:-------|:-----|:------------|:----------|:----|:-----------|:------------|:-----------|:----|
| 2010-01-04 | 15.684434 | 15.753504 | 15.621622 | 15.689439 | 78169752.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 78169752.0 |
| 2010-01-05 | 15.615365 | 15.711712 | 15.554054 | 15.695195 | 120067812.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -41898060.0 |
| 2010-01-06 | 15.221722 | 15.662162 | 15.174174 | 15.662162 | 158988852.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -200886912.0 |
| 2010-01-07 | 14.867367 | 15.265265 | 14.831081 | 15.25025 | 256315428.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -457202340.0 |
| 2010-01-08 | 15.065566 | 15.096346 | 14.742492 | 14.814815 | 188783028.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -268419312.0 |


In [39]:
# Display the first few rows of the AMZN data
amzn_price.head()

| date | close | high | low | open | volume | sma_20 | macd | macd_signal | macd_hist | RSI | upper_band | middle_band | lower_band | OBV |
|:-----|:------|:-----|:----|:-----|:-------|:-------|:-----|:------------|:----------|:----|:-----------|:------------|:-----------|:----|
| 2010-01-04 | 6.695 | 6.8305 | 6.657 | 6.8125 | 151998000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 151998000.0 |
| 2010-01-05 | 6.7345 | 6.774 | 6.5905 | 6.6715 | 177038000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 329036000.0 |
| 2010-01-06 | 6.6125 | 6.7365 | 6.5825 | 6.73 | 143576000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 185460000.0 |
| 2010-01-07 | 6.5 | 6.616 | 6.44 | 6.6005 | 220604000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -35144000.0 |
| 2010-01-08 | 6.676 | 6.684 | 6.4515 | 6.528 | 196610000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 161466000.0 |


# 5-Selected-features-for-fundamental-metrics
___

Fundamental metrics analyze a company’s financial health, valuation, and growth potential. These metrics complement technical indicators by offering insights into a company’s intrinsic value and profitability. The focus is on selecting fundamental metrics that capture key aspects of a company’s operations, performance, and valuation. The following fundamental metrics have been selected for stock price analysis:

| Metric Type         | Metric Name                | Abbreviation | Description                                                         |
|:--------------------|:---------------------------|:-------------|:--------------------------------------------------------------------|
| Valuation           | Price-to-Earnings Ratio   | P/E          | Compares stock price to earnings per share to assess valuation     |
| Growth              | Revenue Growth Rate       | RGR          | Measures the percentage increase in revenue over a specific period |
| Financial Health    | Debt-to-Equity Ratio      | D/E          | Assesses financial leverage by comparing debt to shareholders' equity |
| Profitability       | Return on Equity          | ROE          | Measures profitability relative to shareholders' equity            |
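
The four ratios in the table reduce to simple arithmetic on figures from a company's financial statements. The sketch below spells that arithmetic out; the function names and every input number are hypothetical and chosen only to make the calculation easy to follow. It also uses period-end equity for ROE, whereas some definitions use average equity over the period.

```python
def pe_ratio(share_price, earnings_per_share):
    """Price-to-Earnings: how much investors pay per unit of earnings."""
    return share_price / earnings_per_share


def revenue_growth_rate(current_revenue, prior_revenue):
    """Fractional change in revenue between two periods (0.20 means 20%)."""
    return (current_revenue - prior_revenue) / prior_revenue


def debt_to_equity(total_debt, shareholders_equity):
    """Financial leverage: debt per unit of shareholders' equity."""
    return total_debt / shareholders_equity


def return_on_equity(net_income, shareholders_equity):
    """Profit generated per unit of shareholders' equity."""
    return net_income / shareholders_equity


# Hypothetical company, figures in arbitrary currency units
print(pe_ratio(share_price=150.0, earnings_per_share=6.0))
print(revenue_growth_rate(current_revenue=120.0, prior_revenue=100.0))
print(debt_to_equity(total_debt=50.0, shareholders_equity=200.0))
print(return_on_equity(net_income=30.0, shareholders_equity=200.0))
```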


In [40]:
import yfinance as yf
import pandas as pd

# Download the full available price history of the S&P 500 index
sp500 = yf.Ticker("^GSPC")
sp500 = sp500.history(period="max")
sp500

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|:-----|:-----|:-----|:----|:------|:-------|:----------|:-------------|
| 1927-12-30 00:00:00-05:00 | 17.660000 | 17.660000 | 17.660000 | 17.660000 | 0 | 0.0 | 0.0 |
| 1928-01-03 00:00:00-05:00 | 17.760000 | 17.760000 | 17.760000 | 17.760000 | 0 | 0.0 | 0.0 |
| 1928-01-04 00:00:00-05:00 | 17.719999 | 17.719999 | 17.719999 | 17.719999 | 0 | 0.0 | 0.0 |
| 1928-01-05 00:00:00-05:00 | 17.549999 | 17.549999 | 17.549999 | 17.549999 | 0 | 0.0 | 0.0 |
| 1928-01-06 00:00:00-05:00 | 17.660000 | 17.660000 | 17.660000 | 17.660000 | 0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-01-03 00:00:00-05:00 | 5891.069824 | 5949.339844 | 5888.660156 | 5942.470215 | 3667340000 | 0.0 | 0.0 |
| 2025-01-06 00:00:00-05:00 | 5982.810059 | 6021.040039 | 5960.009766 | 5975.379883 | 4940120000 | 0.0 | 0.0 |
| 2025-01-07 00:00:00-05:00 | 5993.259766 | 6000.680176 | 5890.680176 | 5909.029785 | 4517330000 | 0.0 | 0.0 |
| 2025-01-08 00:00:00-05:00 | 5910.660156 | 5927.890137 | 5874.779785 | 5918.250000 | 4441740000 | 0.0 | 0.0 |


In [41]:
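import yfinance as yf
import pandas as pd

tickers = ["AAPL", "AMZN", "GOOGL", "MSFT"]

# Collect one row of fundamental metrics per ticker
data = []
for ticker in tickers:
    stock = yf.Ticker(ticker)
    stats = stock.info  # dict of summary statistics for the ticker

    # .get() returns None when a field is missing for a ticker
    metrics = {
        "Ticker": ticker,
        "P/E": stats.get("trailingPE"),
        "ROE": stats.get("returnOnEquity"),
        # Note: the D/E values returned here appear to be expressed in percent
        "D/E": stats.get("debtToEquity"),
        "RGR": stats.get("revenueGrowth")
    }
    data.append(metrics)

df = pd.DataFrame(data)
print(df)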


  Ticker        P/E      ROE      D/E    RGR
0   AAPL  38.891624  1.57413  209.059  0.061
1   AMZN  46.682304  0.22558   61.175  0.110
2  GOOGL  25.435760  0.32101    9.324  0.151
3   MSFT  34.595380  0.35604   33.657  0.160
