# Stock Market Analysis and Prediction: Leveraging Data Science for Insights

By following this roadmap, readers can gain a comprehensive understanding of the stock market, learn how to leverage data science for financial analysis, and optimize investment strategies. Let's get started!

### Table of Contents

1. [Introduction](#1-introduction)
2. [Data Collection and Preprocessing](#2-data-collection-and-preprocessing)
   - [Key Libraries](#key-libraries)
   - [Data Retrieval with APIs](#data-retrieval-with-apis)
   - [Data Cleaning and Formatting](#data-cleaning-and-formatting)
   - [Feature Engineering for Machine Learning](#feature-engineering-for-machine-learning)
   - [Selected Features for Stock Price Analysis](#selected-features-for-stock-price-analysis)
3. [Exploratory Data Analysis (EDA)](#3-exploratory-data-analysis-eda)
   - [Statistical Summaries](#statistical-summaries)
   - [Visualizing Trends](#visualizing-trends)
4. [Financial Metrics](#4-financial-metrics)
   - [Performance Metrics](#performance-metrics)
   - [Risk and Volatility Analysis](#risk-and-volatility-analysis)
5. [Machine Learning Applications](#5-machine-learning-applications)
   - [Predictive Modeling](#predictive-modeling)
   - [Classification Tasks](#classification-tasks)
   - [Clustering Analysis](#clustering-analysis)
6. [Portfolio Optimization](#6-portfolio-optimization)
   - [Markowitz Mean-Variance Optimization](#markowitz-mean-variance-optimization)
   - [Black-Litterman Allocation](#black-litterman-allocation)
   - [Reinforcement Learning Approaches](#reinforcement-learning-approaches)
7. [Backtesting Investment Strategies](#7-backtesting-investment-strategies)
8. [Insights and Conclusions](#8-insights-and-conclusions)
9. [Future Work](#9-future-work)



## Introduction
____


The stock market presents a complex and dynamic environment. Investors face numerous challenges, including identifying profitable opportunities, managing risk, and optimizing portfolio allocation. This project focuses on analyzing stock data for four prominent technology companies: Apple (AAPL), Microsoft (MSFT), Google (GOOGL), and Amazon (AMZN). These companies were selected for their market leadership, innovation, and global influence.

**Objective:**

- Analyze historical stock data to identify patterns and trends.

- Use machine learning models to predict stock prices and classify stock movements.

- Evaluate investment strategies through portfolio optimization and backtesting.

**Problem Statement:**

- Identifying profitable investment opportunities.

- Managing risk effectively.

- Optimizing asset allocation.


## Data Collection and Preprocessing
____

### Key Libraries

The success of this project hinges on leveraging powerful Python libraries that enable financial analysis, portfolio optimization, and technical analysis. These libraries form the backbone of the notebook, facilitating data retrieval, manipulation, visualization, and modeling. Below is an overview of the key libraries used and their specific contributions to the project:


- **`yfinance`** 
  A popular library that provides access to historical stock price data, financial statements, and other key metrics for a wide range of stocks. It is a valuable resource for extracting stock data directly from Yahoo Finance for analysis.

- **`Quantstats`** 
  This library specializes in quantitative finance, offering tools for analyzing investment strategies, backtesting, and evaluating portfolio performance. It provides a comprehensive suite of functions for detailed financial analysis and visualization of key metrics.


- **`PyPortfolioOpt`**
  This library focuses on portfolio optimization, enabling users to construct optimal portfolios based on various criteria such as risk, return, and constraints. It is a powerful tool for optimizing investment strategies, including mean-variance optimization and Black-Litterman models.

- **`TA-Lib`** 
  TA-Lib (the Technical Analysis Library) offers a wide range of technical indicators for analyzing stock price data. It includes functions for calculating moving averages, RSI, MACD, Bollinger Bands, and other commonly used technical indicators.

- **`Plotly`**
  This library offers interactive visualization capabilities, allowing users to create dynamic and engaging plots for exploring stock data. It provides tools for creating interactive charts, dashboards, and visualizations.

Other commonly used libraries: 

- **`Pandas`**
  This library is essential for data manipulation and analysis, allowing us to handle and preprocess stock data efficiently. It provides powerful data structures and functions for cleaning, transforming, and analyzing financial data.

- **`Numpy`**
  A fundamental library for numerical computing, Numpy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

- **`Matplotlib and Seaborn`**
  This combination of libraries is used for data visualization, enabling the creation of informative plots, charts, and graphs to visualize trends, patterns, and relationships in the stock data.

- **`Scikit-Learn`**
  A machine learning library that provides a wide range of tools for building predictive models, evaluating performance, and optimizing parameters. It includes functions for regression, classification, clustering, and model evaluation.

By combining these libraries with Python's robust data science capabilities, we can unlock the full potential of financial analysis and stock market prediction. The subsequent sections will delve into the process of collecting, preprocessing, and analyzing stock data to derive actionable insights for investors.

In [32]:
# Data Handling and Statistical Analysis
import pandas as pd
from pandas_datareader import data
import numpy as np
from scipy import stats
import skimpy as sp
pd.set_option('display.max_columns', None)


# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Interactive visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)  # Enable Plotly offline


# Financial Data and Analysis
import ta
import talib
import quantstats as qs
import yfinance as yf
from pypfopt.efficient_frontier import EfficientFrontier
from pypfopt import risk_models, expected_returns
from pypfopt import black_litterman, BlackLittermanModel



# Machine Learning and Optimization
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # must precede the HalvingGridSearchCV import
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV
from sklearn.linear_model import SGDClassifier

### Data Retrieval with APIs

To begin our analysis, we retrieve historical stock price and returns data for four prominent technology companies:

- Apple: AAPL

- Microsoft: MSFT

- Google (Alphabet): GOOGL

- Amazon: AMZN

We use the `Quantstats` (qs) and `yfinance` (yf) libraries to retrieve data from Yahoo Finance. The data cover the period from January 1, 2010, to December 31, 2021, providing over a decade of historical stock performance. They include daily stock prices, trading volume, and other relevant metrics that serve as the foundation for our analysis.

These companies were selected for their significant market capitalization, technological innovation, and widespread global influence, which make them representative of the technology sector and attractive for investment analysis. Let's begin by downloading the stock data.


In [33]:
# Define the time window and stock tickers
start = '2010-01-01'
end = '2021-12-31'  # note: yfinance treats `end` as exclusive
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']

# Loop through tickers to download and save data
for ticker in tickers:
    # Download historical prices (named `prices` to avoid shadowing the
    # `data` module imported from pandas_datareader)
    prices = yf.download(ticker, start=start, end=end)
    prices.to_csv(f'{ticker.lower()}_price.csv')

    # Download daily returns and restrict them to the same window
    returns = qs.utils.download_returns(ticker).loc[start:end]
    returns.to_csv(f'{ticker.lower()}_returns.csv')




In [34]:
# Loading the Stock Returns
aapl_returns = pd.read_csv('aapl_returns.csv', index_col=0, parse_dates=True) 
msft_returns = pd.read_csv('msft_returns.csv', index_col=0, parse_dates=True) 
googl_returns = pd.read_csv('googl_returns.csv', index_col=0, parse_dates=True) 
amzn_returns = pd.read_csv('amzn_returns.csv', index_col=0, parse_dates=True) 

In [35]:
# Loading the Stock historical prices:
aapl_price = pd.read_csv('aapl_price.csv', index_col=0, parse_dates=True) 
msft_price = pd.read_csv('msft_price.csv', index_col=0, parse_dates=True) 
googl_price = pd.read_csv('googl_price.csv', index_col=0, parse_dates=True) 
amzn_price = pd.read_csv('amzn_price.csv', index_col=0, parse_dates=True) 

In [36]:
# Display the first 5 rows of the Stock Returns
aapl_returns.head()

| Date | AAPL |
|---|---|
| 2010-01-04 00:00:00+00:00 | 0.015565 |
| 2010-01-05 00:00:00+00:00 | 0.001729 |
| 2010-01-06 00:00:00+00:00 | -0.015906 |
| 2010-01-07 00:00:00+00:00 | -0.001849 |
| 2010-01-08 00:00:00+00:00 | 0.006648 |


In [37]:
# Display the first 5 rows of the Stock Prices
aapl_price.head()

| Price | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| Ticker | AAPL | AAPL | AAPL | AAPL | AAPL | AAPL |
| Date | | | | | | |
| 2010-01-04 00:00:00+00:00 | 6.447412014007568 | 7.643214225769043 | 7.660714149475098 | 7.585000038146973 | 7.622499942779541 | 493729600 |
| 2010-01-05 00:00:00+00:00 | 6.458558559417725 | 7.656428813934326 | 7.699643135070801 | 7.6160712242126465 | 7.664286136627197 | 601904800 |
| 2010-01-06 00:00:00+00:00 | 6.355826377868652 | 7.534643173217773 | 7.68678617477417 | 7.526785850524902 | 7.656428813934326 | 552160000 |


### Data Cleaning and Formatting

This section cleans and formats the stock data to ensure consistency, accuracy, and compatibility with the subsequent analysis. The process involves handling missing values, standardizing column headers, and converting data types. The cleaned data are stored in Pandas DataFrames for easy manipulation and exploration.

The data cleaning steps are:

* Remove missing values.

* Convert timezone-aware timestamps to timezone-naive format.

* Standardize column headers and data types.

**STOCK RETURNS DATA CLEANING**

When working with financial data, it's crucial to be aware of timezones. Because the dates in our stock returns data are in the UTC timezone, we convert the `timezone-aware DatetimeIndex` to a `timezone-naive DatetimeIndex`.

A timezone-naive DatetimeIndex does not have any timezone information associated with it. This conversion is essential for consistency and compatibility with various financial analysis tools and libraries.

Thereafter, the single data column is renamed to `returns` for consistency and clarity across the four stocks. The cleaned data are stored in Pandas DataFrames for easy manipulation and exploration.

In [38]:
# converting time zone to none
aapl_returns.index = aapl_returns.index.tz_convert(None)
msft_returns.index = msft_returns.index.tz_convert(None)
googl_returns.index = googl_returns.index.tz_convert(None)
amzn_returns.index = amzn_returns.index.tz_convert(None)
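
To see what `tz_convert(None)` does in isolation, here is a minimal sketch on a toy index. The dates and values are illustrative and the `toy_returns` name is hypothetical. The call converts the timestamps to UTC and then drops the timezone information; `tz_localize(None)` would instead keep the local wall-clock time, which makes no difference for data already in UTC.

```python
import pandas as pd

# Toy timezone-aware index (illustrative values only)
idx = pd.date_range("2010-01-04", periods=2, tz="UTC")
toy_returns = pd.DataFrame({"returns": [0.015565, 0.001729]}, index=idx)

print(toy_returns.index.tz)   # UTC

# Convert to UTC and drop the timezone information
toy_returns.index = toy_returns.index.tz_convert(None)

print(toy_returns.index.tz)   # None
```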

In [39]:
# Rename the columns
aapl_returns.columns = ['returns'] 
msft_returns.columns = ['returns']
googl_returns.columns = ['returns']
amzn_returns.columns = ['returns']

In [40]:
# Display the first few rows of aapl_returns
aapl_returns.head()

| Date | returns |
|---|---|
| 2010-01-04 | 0.015565 |
| 2010-01-05 | 0.001729 |
| 2010-01-06 | -0.015906 |
| 2010-01-07 | -0.001849 |
| 2010-01-08 | 0.006648 |


**STOCK PRICES DATA CLEANING**

The stock prices data will be cleaned and standardized to ensure consistency and compatibility for analysis. The cleaning steps include:

- Drop Redundant Data: Remove the Adj Close column.

- Index Management: Reset the index for uniformity.

- Standardize Column Headers: Convert all column names to lowercase.

- Handle Missing Data: Drop rows with missing values and exclude the first row.

- Rename Columns: Rename the first column to date.

- Date Formatting: Convert the date column to datetime format.

- Ensure Consistent Data Types: Convert all data columns to float.

- Set Index: Set the date column as the DataFrame index.

- Time Zone Adjustment: Remove any time zone information from the data.

The cleaned data will be stored in a Pandas DataFrame, ready for seamless manipulation and exploration.
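
The cells that follow apply these steps one at a time to each of the four DataFrames. As a compact alternative, the sketch below folds the same steps into a single helper. The function name `clean_price_frame` and the toy `raw` frame are hypothetical; the frame only mimics the layout of the raw price table shown earlier, with rounded values. Unlike the notebook cells, the helper removes the `Ticker` row by its label, not by its position.

```python
import pandas as pd

def clean_price_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to one raw price DataFrame."""
    df = raw.drop(columns=["Adj Close"]).reset_index()
    # Standardize headers: strip, lowercase, spaces -> underscores
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    df = df.rename(columns={"price": "date"})
    # Drop the empty 'Date' header row, then the 'Ticker' label row
    df = df.dropna()
    df = df[df["date"] != "Ticker"].copy()
    # Parse dates as UTC, then drop the timezone information
    df["date"] = pd.to_datetime(df["date"], utc=True).dt.tz_convert(None)
    numeric_cols = ["open", "high", "low", "close", "volume"]
    df[numeric_cols] = df[numeric_cols].astype(float)
    return df.set_index("date")

# Hypothetical raw frame mimicking the layout of the saved CSV (rounded values)
raw = pd.DataFrame(
    {
        "Adj Close": ["AAPL", None, "6.447", "6.459"],
        "Close": ["AAPL", None, "7.643", "7.656"],
        "High": ["AAPL", None, "7.661", "7.700"],
        "Low": ["AAPL", None, "7.585", "7.616"],
        "Open": ["AAPL", None, "7.622", "7.664"],
        "Volume": ["AAPL", None, "493729600", "601904800"],
    },
    index=pd.Index(
        ["Ticker", "Date", "2010-01-04 00:00:00+00:00", "2010-01-05 00:00:00+00:00"],
        name="Price",
    ),
)

cleaned = clean_price_frame(raw)
print(cleaned)
```

With a helper like this, the four stocks could be cleaned in one pass, for example with a dictionary comprehension over the raw DataFrames.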


In [None]:
# Drop Redundant Data: Remove the Adj Close column.
aapl_price = aapl_price.drop(['Adj Close'], axis=1)
msft_price = msft_price.drop(['Adj Close'], axis=1)
googl_price = googl_price.drop(['Adj Close'], axis=1)
amzn_price = amzn_price.drop(['Adj Close'], axis=1)

In [42]:
# Index Management: Reset the index for uniformity.
aapl_price.reset_index(inplace=True)
msft_price.reset_index(inplace=True)
googl_price.reset_index(inplace=True)
amzn_price.reset_index(inplace=True)

In [43]:
# convert the columns headers to lower case
def clean_columns_headers(df):
    df.columns = df.columns.str.strip()
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace(' ', '_')
    
    return df

# Clean the columns headers
aapl_price = clean_columns_headers(aapl_price)
msft_price = clean_columns_headers(msft_price)
googl_price = clean_columns_headers(googl_price)
amzn_price = clean_columns_headers(amzn_price)

In [44]:
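# Drop rows with missing values, then drop the first row
def drop_rows(df):
    # dropna removes the empty 'Date' header row left over from the CSV's multi-level header
    df.dropna(inplace=True)
    # the row labeled 0 holds the repeated ticker symbol, not price data
    df.drop(index=0, inplace=True)
    return df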

aapl_price = drop_rows(aapl_price)
msft_price = drop_rows(msft_price)
googl_price = drop_rows(googl_price)
amzn_price = drop_rows(amzn_price)

In [45]:
# Rename the first column to 'date'
aapl_price.rename(columns={'price':'date'}, inplace=True)
msft_price.rename(columns={'price':'date'}, inplace=True)
googl_price.rename(columns={'price':'date'}, inplace=True)
amzn_price.rename(columns={'price':'date'}, inplace=True)

In [29]:
# view the data to check the changes
aapl_price.head()

| | date | close | high | low | open | volume |
|---|---|---|---|---|---|---|
| 2 | 2010-01-04 00:00:00+00:00 | 7.643214225769043 | 7.660714149475098 | 7.585000038146973 | 7.622499942779541 | 493729600 |
| 3 | 2010-01-05 00:00:00+00:00 | 7.656428813934326 | 7.699643135070801 | 7.6160712242126465 | 7.664286136627197 | 601904800 |
| 4 | 2010-01-06 00:00:00+00:00 | 7.534643173217773 | 7.68678617477417 | 7.526785850524902 | 7.656428813934326 | 552160000 |
| 5 | 2010-01-07 00:00:00+00:00 | 7.520713806152344 | 7.5714287757873535 | 7.466071128845215 | 7.5625 | 477131200 |
| 6 | 2010-01-08 00:00:00+00:00 | 7.570713996887207 | 7.5714287757873535 | 7.466429233551025 | 7.510714054107666 | 447610800 |


In [22]:
# Convert the price and volume columns to float and the date column to datetime
def convert_data_types(df):
    numeric_cols = ['open', 'high', 'low', 'close', 'volume']
    df[numeric_cols] = df[numeric_cols].astype(float)
    df['date'] = pd.to_datetime(df['date'])
    return df


# Apply the conversion to each DataFrame
aapl_price = convert_data_types(aapl_price)
msft_price = convert_data_types(msft_price)
googl_price = convert_data_types(googl_price)
amzn_price = convert_data_types(amzn_price)

In [23]:
# Setting the date as the index
aapl_price.set_index('date', inplace=True)
msft_price.set_index('date', inplace=True)
googl_price.set_index('date', inplace=True)
amzn_price.set_index('date', inplace=True)

In [24]:
# convert the time zone to none
aapl_price.index = aapl_price.index.tz_convert(None)
msft_price.index = msft_price.index.tz_convert(None)
googl_price.index = googl_price.index.tz_convert(None)
amzn_price.index = amzn_price.index.tz_convert(None)

In [25]:
# view the data to check the changes
aapl_price.head()

| date | close | high | low | open | volume |
|---|---|---|---|---|---|
| 2010-01-04 | 7.643214 | 7.660714 | 7.585 | 7.6225 | 493729600.0 |
| 2010-01-05 | 7.656429 | 7.699643 | 7.616071 | 7.664286 | 601904800.0 |
| 2010-01-06 | 7.534643 | 7.686786 | 7.526786 | 7.656429 | 552160000.0 |
| 2010-01-07 | 7.520714 | 7.571429 | 7.466071 | 7.5625 | 477131200.0 |
| 2010-01-08 | 7.570714 | 7.571429 | 7.466429 | 7.510714 | 447610800.0 |


Thus, we have the `daily returns` and the `historical prices` of the four stocks. We can now proceed to extract other relevant financial metrics and perform exploratory data analysis to gain insights into the stock market trends.
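
As a quick sanity check on how the two datasets relate, the sketch below recomputes simple daily returns from the first three `Adj Close` values in the raw AAPL price table shown earlier. The results (about 0.001729 and -0.015906) agree with the corresponding rows of the returns table. This suggests the downloaded returns are percentage changes of the adjusted close, though that is an observation from these rows and not a statement about Quantstats internals. Since `Adj Close` is dropped during cleaning, returns recomputed from the cleaned `close` column would not necessarily match.

```python
import pandas as pd

# First three adjusted closes for AAPL, copied from the raw price table
adj_close = pd.Series(
    [6.447412014007568, 6.458558559417725, 6.355826377868652],
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"]),
    name="adj_close",
)

# Simple daily return: r_t = P_t / P_{t-1} - 1 (first value is NaN)
daily_returns = adj_close.pct_change()
print(daily_returns.round(6))
```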

## Feature Engineering for Machine Learning
____

Feature engineering is a critical step in building robust machine learning models for stock price analysis. It involves creating relevant and informative features from raw data to enhance the predictive performance of the model. In this analysis, we focus on generating features from `technical indicators` and `fundamental metrics`. These features capture essential aspects of stock market behavior, offering insights into trends, volatility, and financial health.

####  Technical Indicators

Technical indicators are mathematical calculations derived from historical price, volume, or other market data to provide insights into market trends, momentum, volatility, and volume. These indicators are essential for understanding market behavior and identifying potential opportunities for trading or investment.

* Categories of Technical Indicators
	1.	**Trend Indicators**: These indicators identify the direction of market movements and help traders determine whether a market is in an uptrend, downtrend, or consolidating. Examples include:
		- `Moving Averages (MA)`: Smooths out price data to identify trends over time.
		- `Exponential Moving Average (EMA)`: Gives more weight to recent prices for faster responses to price changes.
		- `MACD (Moving Average Convergence Divergence)`: Highlights changes in momentum and trend direction.
		- `Parabolic SAR`: Indicates potential reversals in market trends.	

	2.	**Momentum Indicators**: These indicators measure the speed and strength of price movements, helping traders identify overbought or oversold conditions. Examples include:
		- `Relative Strength Index (RSI)`: Measures the magnitude of recent price changes to evaluate overbought or oversold conditions.
		- `Stochastic Oscillator`: Compares a security's closing price to its price range over a specific period.
		- `Rate of Change (ROC)`: Measures the percentage change in price between the current price and a past price.

	3.	**Volatility Indicators**: These indicators quantify the degree of price fluctuations in the market, helping traders assess risk and potential price movements. Examples include:
		- `Bollinger Bands`: Consist of a moving average and two standard deviation bands to identify price volatility.
		- `Average True Range (ATR)`: Measures market volatility by averaging the true range, which extends each period's high-low range to account for gaps from the previous close.
		- `Keltner Channels`: Similar to Bollinger Bands, but use average true range to set channel boundaries.

	4.	**Volume Indicators**: These indicators analyze trading volume to assess the strength of price movements and identify potential reversals. Examples include:
		- `On-Balance Volume (OBV)`: Tracks cumulative volume to predict price movements.
		- `Accumulation/Distribution Line`: Combines price and volume data to assess the flow of money in and out of a security.
		- `Chaikin Money Flow (CMF)`: Measures the buying and selling pressure for a security.
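
As a rough illustration of how two of the indicators listed above can be computed with plain pandas, the sketch below implements Bollinger Bands and On-Balance Volume. The function names and the toy series are hypothetical. Conventions vary between libraries: this sketch uses the population standard deviation (`ddof=0`) for the bands and starts OBV at zero, both of which are assumptions here.

```python
import numpy as np
import pandas as pd

def bollinger_bands(close: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.DataFrame:
    """Middle band = rolling mean; outer bands = mean +/- num_std * rolling std."""
    middle = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)  # population std dev (assumed convention)
    return pd.DataFrame(
        {
            "bb_middle": middle,
            "bb_upper": middle + num_std * std,
            "bb_lower": middle - num_std * std,
        }
    )

def on_balance_volume(close: pd.Series, volume: pd.Series) -> pd.Series:
    """Add volume on up days, subtract it on down days, ignore unchanged days."""
    direction = np.sign(close.diff()).fillna(0)  # +1, -1 or 0; first day counts as 0
    return (direction * volume).cumsum()

# Toy data (illustrative only)
close = pd.Series([10.0, 11.0, 10.0, 10.0, 12.0])
volume = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])

print(on_balance_volume(close, volume).tolist())  # [0.0, 200.0, -100.0, -100.0, 400.0]
bands = bollinger_bands(close, window=3)
print(bands)
```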


#### Fundamental Metrics

Fundamental metrics are quantitative data points derived from a company's financial statements, earnings reports, and other fundamental data sources. They provide insights into a company's financial health, performance, valuation, and growth prospects, helping investors assess its intrinsic value, evaluate its competitive position, and identify potential investment opportunities. By incorporating fundamental metrics into our feature set alongside technical indicators, we aim to enhance the predictive power of our machine learning models and gain a deeper understanding of the factors driving stock price movements.


* Categories of Fundamental Metrics
	1.	**Valuation Metrics**: These metrics assess the relative value of a company's stock by comparing its market price to fundamental indicators such as earnings, book value, and cash flow. Examples include:
		- `Price-to-Earnings (P/E) Ratio`: Compares a company's stock price to its earnings per share to evaluate valuation.
		- `Price-to-Book (P/B) Ratio`: Compares a company's stock price to its book value per share to assess valuation.
		- `Price-to-Sales (P/S) Ratio`: Compares a company's stock price to its revenue per share to evaluate valuation.

	2.	**Profitability Metrics**: These metrics measure a company's ability to generate profits and manage costs effectively. Examples include:
		- `Return on Equity (ROE)`: Measures a company's profitability by evaluating its return on shareholders' equity.
		- `Net Profit Margin`: Measures the percentage of revenue that translates into profit after accounting for expenses.
		- `Operating Margin`: Measures the percentage of revenue that translates into profit after accounting for operating expenses.

	3.	**Growth Metrics**: These metrics assess a company's growth prospects and potential for future expansion. Examples include:
		- `Revenue Growth Rate`: Measures the percentage increase in a company's revenue over a specific period.
		- `Earnings Growth Rate`: Measures the percentage increase in a company's earnings over a specific period.
		- `Dividend Yield`: Measures the percentage of dividends paid relative to a company's stock price.

	4.	**Financial Health Metrics**: These metrics evaluate a company's financial stability, liquidity, and debt levels. Examples include:
		- `Debt-to-Equity Ratio`: Measures a company's debt relative to its equity to assess financial leverage.
		- `Current Ratio`: Measures a company's ability to cover short-term liabilities with its short-term assets.
		- `Interest Coverage Ratio`: Measures a company's ability to pay interest on its debt with its earnings.
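
To make a few of the definitions above concrete, the sketch below computes four of the ratios from a handful of inputs. All figures are made up for illustration, and the `Financials` class and `fundamental_ratios` function are hypothetical names. Exact definitions vary by data provider (for example, whether debt-to-equity uses total debt or total liabilities), so treat these formulas as one common convention.

```python
from dataclasses import dataclass

@dataclass
class Financials:
    price: float                 # share price
    eps: float                   # earnings per share
    net_income: float
    shareholders_equity: float
    total_debt: float
    current_assets: float
    current_liabilities: float

def fundamental_ratios(f: Financials) -> dict:
    """Compute a few of the ratios described above (one common convention)."""
    return {
        "pe_ratio": f.price / f.eps,
        "roe": f.net_income / f.shareholders_equity,
        "debt_to_equity": f.total_debt / f.shareholders_equity,
        "current_ratio": f.current_assets / f.current_liabilities,
    }

# Made-up figures, in consistent units
example = Financials(
    price=150.0, eps=6.0, net_income=20.0, shareholders_equity=80.0,
    total_debt=40.0, current_assets=60.0, current_liabilities=30.0,
)
ratios = fundamental_ratios(example)
print(ratios)
# {'pe_ratio': 25.0, 'roe': 0.25, 'debt_to_equity': 0.5, 'current_ratio': 2.0}
```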
		


## Selected Features for Stock Price Analysis
___

Following the above insights, we have selected a streamlined set of features for stock price analysis. These features were chosen for their relevance, their predictive power, and to minimize multicollinearity. Supported by academic and industry research, they capture essential aspects of market trends, momentum, volatility, and financial health. Incorporating them into our machine learning models is intended to support robust predictive modeling and investment decision-making.


### Technical Indicators

**Trend Indicators**

1. **Moving Average (MA):**  
   According to research conducted by Pennsylvania State University, the moving average is a widely recognized tool for identifying long-term trends by smoothing out short-term fluctuations in price data. It helps to filter market noise, enabling a clearer view of the underlying trend.  
   *Citation:* Honors Thesis, "The Impact of Moving Averages on Predicting Stock Prices," Pennsylvania State University. ([honors.libraries.psu.edu](https://honors.libraries.psu.edu/files/final_submissions/5994))

2. **MACD (Moving Average Convergence Divergence):**  
   As highlighted in a study by Pennsylvania State University, MACD is a momentum-based indicator that effectively detects changes in trend direction. It provides actionable insights for traders by identifying bullish or bearish momentum shifts.  
   *Citation:* Honors Thesis, "The Role of MACD in Financial Market Analysis," Pennsylvania State University. ([honors.libraries.psu.edu](https://honors.libraries.psu.edu/files/final_submissions/5994))

**Momentum Indicators**

1. **Relative Strength Index (RSI):**  
   As discussed in the American Journal of Engineering Research (AJER), RSI is designed to evaluate the speed and magnitude of recent price movements, identifying overbought or oversold conditions. Its ability to detect potential reversals makes it invaluable in stock price analysis.  
   *Citation:* "Application of Technical Analysis in Stock Markets," American Journal of Engineering Research (AJER), Vol. 5, Issue 12. ([ajer.org](https://www.ajer.org/papers/v5%2812%29/Z05120207212.pdf))

**Volatility Indicators**

1. **Bollinger Bands:**  
   According to AJER, Bollinger Bands measure market volatility by using a moving average and standard deviation bands. These bands are particularly useful for identifying overbought or oversold conditions in highly volatile markets.  
   *Citation:* "Technical Indicators and Their Effectiveness in Trading Strategies," American Journal of Engineering Research (AJER), Vol. 5, Issue 12. ([ajer.org](https://www.ajer.org/papers/v5%2812%29/Z05120207212.pdf))

**Volume Indicators**

1. **On-Balance Volume (OBV):**  
   As demonstrated by AJER, OBV evaluates the flow of money by tracking cumulative trading volume. This indicator provides early signals of potential trend reversals by measuring buying and selling pressure.  
   *Citation:* "Volume-Based Indicators in Market Predictions," American Journal of Engineering Research (AJER), Vol. 5, Issue 12. ([ajer.org](https://www.ajer.org/papers/v5%2812%29/Z05120207212.pdf))

### Fundamental Metrics

Fundamental metrics analyze a company’s financial health, valuation, and growth potential. These metrics complement technical indicators by offering insights into a company’s intrinsic value and profitability.

**Valuation Metrics**

1. **Price-to-Earnings (P/E) Ratio:**  
   As described by Investopedia, the P/E ratio compares a company’s stock price to its earnings per share. It is a critical measure of valuation, helping investors identify overvalued or undervalued stocks.  
   *Citation:* "Five Must-Have Metrics for Value Investors," Investopedia. ([investopedia.com](https://www.investopedia.com/articles/fundamental-analysis/09/five-must-have-metrics-value-investors.asp))

**Profitability Metrics**

1. **Return on Equity (ROE):**  
   According to research published in the European Research Studies Journal, ROE measures how effectively a company uses shareholders’ equity to generate profits. A high ROE indicates efficient use of resources and strong financial performance.  
   *Citation:* "The Influence of Fundamental Analysis on Stock Prices: The Case of Food and Beverage Industries," European Research Studies Journal. ([ersj.eu](https://ersj.eu/journal/1063/download/The%2BInfluence%2Bof%2BFundamental%2BAnalysis%2Bon%2BStock%2BPrices%2BThe%2BCase%2Bof%2BFood%2Band%2BBeverage%2BIndustries.pdf))

**Growth Metrics**

1. **Revenue Growth Rate:**  
   As outlined in the European Research Studies Journal, revenue growth reflects a company’s ability to expand its business over time. It is a key indicator of future profitability and competitiveness in the market.  
   *Citation:* "The Influence of Fundamental Analysis on Stock Prices: The Case of Food and Beverage Industries," European Research Studies Journal. ([ersj.eu](https://ersj.eu/journal/1063/download/The%2BInfluence%2Bof%2BFundamental%2BAnalysis%2Bon%2BStock%2BPrices%2BThe%2BCase%2Bof%2BFood%2Band%2BBeverage%2BIndustries.pdf))

## Conclusion

The selected features strike a balance between simplicity and relevance, ensuring diverse and non-redundant insights. Supported by domain knowledge and empirical evidence, these features are integral to developing robust machine learning models for stock price analysis.


### Trend Indicator Measures

- **Moving Averages (MA):**    
A moving average is a widely used technical indicator that smooths out price data to identify trends over time. It calculates the average price of a security over a specified period, providing a clearer picture of the underlying trend. Moving averages are commonly used to identify support and resistance levels, trend direction, and potential entry or exit points for trades.

In [46]:
# Simple Moving Average (SMA)
aapl_price['sma_20'] = aapl_price['close'].rolling(window=20).mean()
msft_price['sma_20'] = msft_price['close'].rolling(window=20).mean()
googl_price['sma_20'] = googl_price['close'].rolling(window=20).mean()
amzn_price['sma_20'] = amzn_price['close'].rolling(window=20).mean()

- **Exponential Moving Average (EMA):**   
The exponential moving average is a type of moving average that gives more weight to recent prices, making it more responsive to price changes. It is calculated by applying a smoothing factor to the previous period's EMA and the current price. The EMA reacts faster to price movements than the simple moving average, making it popular among traders looking for timely signals.

In [47]:
# Exponential Moving Average (EMA)
aapl_price['ema_20'] = aapl_price['close'].ewm(span=20, adjust=False).mean()
msft_price['ema_20'] = msft_price['close'].ewm(span=20, adjust=False).mean()
googl_price['ema_20'] = googl_price['close'].ewm(span=20, adjust=False).mean()
amzn_price['ema_20'] = amzn_price['close'].ewm(span=20, adjust=False).mean()
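
The recursion described above can be written out explicitly. The sketch below is a minimal illustration, assuming the common smoothing factor `alpha = 2 / (span + 1)` and seeding the EMA with the first price, which is what `ewm(span=..., adjust=False)` does. The helper name `ema_recursive` is hypothetical, and the input values are the first five AAPL closes from the cleaned price table, rounded to six decimals.

```python
import numpy as np
import pandas as pd

def ema_recursive(values: np.ndarray, span: int) -> np.ndarray:
    """EMA via the recursion EMA_t = alpha * price_t + (1 - alpha) * EMA_{t-1}."""
    alpha = 2.0 / (span + 1)   # smoothing factor
    ema = np.empty(len(values))
    ema[0] = values[0]         # seed with the first price
    for t in range(1, len(values)):
        ema[t] = alpha * values[t] + (1 - alpha) * ema[t - 1]
    return ema

# First five AAPL closing prices (rounded)
close = pd.Series([7.643214, 7.656429, 7.534643, 7.520714, 7.570714])

manual = ema_recursive(close.to_numpy(), span=20)
pandas_ema = close.ewm(span=20, adjust=False).mean().to_numpy()

print(np.allclose(manual, pandas_ema))  # True
print(manual.round(6))
```

Because the EMA is seeded with the first observation, it has no leading `NaN` values, unlike the 20-day simple moving average, which needs a full window before producing output.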

- **MACD (Moving Average Convergence Divergence)**  
The Moving Average Convergence Divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security's price. It consists of the MACD line, signal line, and histogram. The MACD line is calculated by subtracting the 26-period EMA from the 12-period EMA, while the signal line is the 9-period EMA of the MACD line. The histogram represents the difference between the MACD line and the signal line. The MACD is used to identify changes in trend direction, momentum, and potential buy or sell signals.

In [48]:
# Define the parameters for the MACD calculation
fastperiod = 12
slowperiod = 26
signalperiod = 9

aapl_price['macd'], aapl_price['macd_signal'], _ = talib.MACD(aapl_price['close'], fastperiod, slowperiod, signalperiod)
msft_price['macd'], msft_price['macd_signal'], _ = talib.MACD(msft_price['close'], fastperiod, slowperiod, signalperiod)
googl_price['macd'], googl_price['macd_signal'], _ = talib.MACD(googl_price['close'], fastperiod, slowperiod, signalperiod)
amzn_price['macd'], amzn_price['macd_signal'], _ = talib.MACD(amzn_price['close'], fastperiod, slowperiod, signalperiod)
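
For readers without TA-Lib installed, the MACD definition above can be approximated with pandas alone. This is a sketch, not a drop-in replacement: the function name `macd_pandas` is hypothetical, the prices are synthetic, and the results will not match `talib.MACD` exactly in the early rows, because TA-Lib leaves the warm-up period as `NaN` and initializes its averages differently.

```python
import pandas as pd

def macd_pandas(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line, signal line and histogram built from pandas EMAs."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow                          # fast EMA minus slow EMA
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    histogram = macd_line - signal_line
    return pd.DataFrame(
        {"macd": macd_line, "macd_signal": signal_line, "macd_hist": histogram}
    )

# Synthetic, steadily rising prices (illustrative only)
close = pd.Series([100.0 + 0.5 * i for i in range(60)])
result = macd_pandas(close)
print(result.tail(3))
```

On a steadily rising series the fast EMA sits above the slow EMA, so the MACD line is positive, which matches the interpretation of MACD as a trend-following momentum measure.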

- **Parabolic SAR (SAR):**   
The parabolic SAR (stop and reverse) is a trend-following indicator that provides potential entry and exit points for trades. It appears as dots above or below the price chart, indicating the direction of the trend. When the dots are below the price, it suggests an uptrend, while dots above the price indicate a downtrend. The parabolic SAR is used to set trailing stop-loss orders and identify potential trend reversals.

In [49]:
aapl_price['sar'] = talib.SAR(aapl_price['high'], aapl_price['low'], acceleration=0.02, maximum=0.2)
msft_price['sar'] = talib.SAR(msft_price['high'], msft_price['low'], acceleration=0.02, maximum=0.2)
googl_price['sar'] = talib.SAR(googl_price['high'], googl_price['low'], acceleration=0.02, maximum=0.2)
amzn_price['sar'] = talib.SAR(amzn_price['high'], amzn_price['low'], acceleration=0.02, maximum=0.2)

In [50]:
aapl_price.head()

Unnamed: 0,date,close,high,low,open,volume,sma_20,ema_20,macd,macd_signal,sar
2,2010-01-04 00:00:00+00:00,7.643214225769043,7.660714149475098,7.585000038146973,7.622499942779541,493729600,,7.643214,,,
3,2010-01-05 00:00:00+00:00,7.656428813934326,7.699643135070801,7.6160712242126465,7.664286136627197,601904800,,7.644473,,,7.585
4,2010-01-06 00:00:00+00:00,7.534643173217773,7.68678617477417,7.526785850524902,7.656428813934326,552160000,,7.634013,,,7.699643
5,2010-01-07 00:00:00+00:00,7.520713806152344,7.5714287757873535,7.466071128845215,7.5625,477131200,,7.623222,,,7.699643
6,2010-01-08 00:00:00+00:00,7.570713996887207,7.5714287757873535,7.466429233551025,7.510714054107666,447610800,,7.618222,,,7.6903


### Momentum Indicators

* **Relative Strength Index (RSI):**
The Relative Strength Index (RSI) is a momentum oscillator that measures the speed and change of price movements. It ranges from 0 to 100 and is used to identify overbought or oversold conditions in a security. A high RSI value (above 70) indicates overbought conditions, while a low RSI value (below 30) suggests oversold conditions. The RSI is used to assess the strength of price movements and potential trend reversals.

In [51]:
aapl_price['RSI'] = talib.RSI(aapl_price['close'], timeperiod=14)
msft_price['RSI'] = talib.RSI(msft_price['close'], timeperiod=14)
googl_price['RSI'] = talib.RSI(googl_price['close'], timeperiod=14)
amzn_price['RSI'] = talib.RSI(amzn_price['close'], timeperiod=14)
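
To make the RSI calculation concrete, here is a minimal pandas sketch using Wilder-style smoothing (an exponential average with `alpha = 1 / period`). The function name `rsi_wilder` and the synthetic series are assumptions for illustration. The early values may differ slightly from `talib.RSI`, since this sketch seeds the average with the first observation rather than a simple average of the first 14 changes.

```python
import pandas as pd


def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # upward moves, 0 on down days
    loss = (-delta).clip(lower=0)     # downward moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss          # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)


# Synthetic series (illustrative only): one strictly rising, one strictly falling
rising = pd.Series([float(i) for i in range(1, 31)])
falling = pd.Series([float(i) for i in range(30, 0, -1)])
rsi_up = rsi_wilder(rising)
rsi_down = rsi_wilder(falling)
print(rsi_up.iloc[-1], rsi_down.iloc[-1])
```

A series with only gains pins the RSI at the top of its range and one with only losses pins it at the bottom, which is why readings above 70 or below 30 are treated as stretched.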

* **Stochastic Oscillator:**   
The Stochastic Oscillator is a momentum indicator that compares a security's closing price to its price range over a specific period. It consists of two lines, %K and %D, which fluctuate between 0 and 100. The %K line represents the current price relative to the price range, while the %D line is a moving average of the %K line. The Stochastic Oscillator is used to identify overbought or oversold conditions and potential trend reversals.

In [52]:
# Calculate Stochastic Oscillator for AAPL
aapl_price['slowk'], aapl_price['slowd'] = talib.STOCH(aapl_price['high'], aapl_price['low'], aapl_price['close'], 
                                                       fastk_period=14, slowk_period=3, slowk_matype=0,
                                                       slowd_period=3, slowd_matype=0)

# Calculate Stochastic Oscillator for MSFT
msft_price['slowk'], msft_price['slowd'] = talib.STOCH(msft_price['high'], msft_price['low'], msft_price['close'], 
                                                       fastk_period=14, slowk_period=3, slowk_matype=0,
                                                       slowd_period=3, slowd_matype=0)

# Calculate Stochastic Oscillator for GOOGL
googl_price['slowk'], googl_price['slowd'] = talib.STOCH(googl_price['high'], googl_price['low'], googl_price['close'], 
                                                         fastk_period=14, slowk_period=3, slowk_matype=0,
                                                         slowd_period=3, slowd_matype=0)

# Calculate Stochastic Oscillator for AMZN
amzn_price['slowk'], amzn_price['slowd'] = talib.STOCH(amzn_price['high'], amzn_price['low'], amzn_price['close'], 
                                                       fastk_period=14, slowk_period=3, slowk_matype=0,
                                                       slowd_period=3, slowd_matype=0)
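
The relationship between %K and %D can be sketched directly in pandas. This is an illustration under the same parameters as the cell above (14-period range, 3-period simple smoothing for both lines); the function name `stochastic_slow` and the synthetic prices are assumptions. Unlike `talib.STOCH`, this sketch does not align the first valid row of the two lines, so %K becomes available two rows before %D.

```python
import pandas as pd


def stochastic_slow(high, low, close, fastk_period=14, slowk_period=3, slowd_period=3):
    """Slow stochastic: smoothed %K and its moving average %D."""
    lowest_low = low.rolling(fastk_period).min()
    highest_high = high.rolling(fastk_period).max()
    # Raw %K: where the close sits inside the recent high-low range, scaled to 0-100
    fast_k = 100 * (close - lowest_low) / (highest_high - lowest_low)
    slow_k = fast_k.rolling(slowk_period).mean()   # smoothed %K
    slow_d = slow_k.rolling(slowd_period).mean()   # %D: moving average of %K
    return slow_k, slow_d


# Synthetic uptrend where every close is the day's high (illustrative only)
high = pd.Series([float(i + 1) for i in range(30)])
low = pd.Series([float(i - 1) for i in range(30)])
close = high.copy()
slow_k, slow_d = stochastic_slow(high, low, close)
print(slow_k.iloc[-1], slow_d.iloc[-1])
```

Because each close equals the highest high of the window, both lines sit at the top of the range, the textbook picture of an overbought reading.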


* **Rate of Change (ROC):**   
The Rate of Change (ROC) is a momentum oscillator that measures the percentage change in price between the current price and a past price. It calculates the rate of change over a specified period, providing insights into the speed and direction of price movements. The ROC is used to identify trends, momentum shifts, and potential buy or sell signals.

In [53]:
# Calculate Rate of Change (ROC) for each stock
aapl_price['ROC'] = talib.ROC(aapl_price['close'], timeperiod=10)
msft_price['ROC'] = talib.ROC(msft_price['close'], timeperiod=10)
googl_price['ROC'] = talib.ROC(googl_price['close'], timeperiod=10)
amzn_price['ROC'] = talib.ROC(amzn_price['close'], timeperiod=10)
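
The percentage-change definition of ROC is short enough to write out by hand. The sketch below uses a small made-up price series and a 2-period lookback (both assumptions chosen so the arithmetic is easy to follow) and shows that the same result comes from pandas' `pct_change`.

```python
import pandas as pd

# Made-up closing prices (illustrative only)
close = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])
period = 2

# ROC = (current close / close `period` rows ago - 1) * 100
roc_manual = (close / close.shift(period) - 1) * 100

# The same quantity using the built-in percentage change
roc_pct = close.pct_change(periods=period) * 100

print(roc_manual)
```

The first `period` rows are NaN because there is no earlier price to compare against, which matches the 10 missing values that a 10-period ROC produces.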

### Volatility Indicators

* **Bollinger Bands:**   
Bollinger Bands consist of a moving average (the middle band) and two outer bands plotted a fixed number of standard deviations above and below it; the code below uses a 20-day simple moving average and 2 standard deviations. The bands widen when volatility rises and narrow when it falls. Bollinger Bands are used to identify overbought or oversold conditions and potential trend reversals.

In [54]:
aapl_price['upper_band'], aapl_price['middle_band'], aapl_price['lower_band'] = talib.BBANDS(aapl_price['close'], 
                                                                                             timeperiod=20, 
                                                                                             nbdevup=2, 
                                                                                             nbdevdn=2, 
                                                                                             matype=0)

msft_price['upper_band'], msft_price['middle_band'], msft_price['lower_band'] = talib.BBANDS(msft_price['close'], 
                                                                                             timeperiod=20, 
                                                                                             nbdevup=2, 
                                                                                             nbdevdn=2, 
                                                                                             matype=0)

googl_price['upper_band'], googl_price['middle_band'], googl_price['lower_band'] = talib.BBANDS(googl_price['close'], 
                                                                                               timeperiod=20, 
                                                                                               nbdevup=2, 
                                                                                               nbdevdn=2, 
                                                                                               matype=0)

amzn_price['upper_band'], amzn_price['middle_band'], amzn_price['lower_band'] = talib.BBANDS(amzn_price['close'], 
                                                                                             timeperiod=20, 
                                                                                             nbdevup=2, 
                                                                                             nbdevdn=2, 
                                                                                             matype=0)
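
For readers without TA-Lib installed, the bands can be approximated with rolling statistics. This is a sketch, not the library's implementation: the function name `bollinger_bands` is an assumption, and it uses the population standard deviation (`ddof=0`), which is my understanding of what TA-Lib computes; pandas defaults to the sample standard deviation.

```python
import pandas as pd


def bollinger_bands(close: pd.Series, window: int = 20, num_std: float = 2.0):
    """Return (upper, middle, lower) bands around a simple moving average."""
    middle = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)   # population std, see note above
    upper = middle + num_std * std
    lower = middle - num_std * std
    return upper, middle, lower


# Synthetic prices 1..25 with a short window so the numbers are easy to verify
demo_close = pd.Series([float(i) for i in range(1, 26)])
upper, middle, lower = bollinger_bands(demo_close, window=5, num_std=2.0)
print(upper.iloc[-1], middle.iloc[-1], lower.iloc[-1])
```

The first `window - 1` rows are NaN, consistent with the 19 missing values a 20-day window leaves in each band column.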


* **Average True Range (ATR):**  
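The Average True Range (ATR) is a volatility indicator that averages the true range over a specified period (14 days in the code below). The true range of a day is the largest of three values: the current high minus the current low, the absolute difference between the current high and the previous close, and the absolute difference between the current low and the previous close. Including the previous close means that overnight gaps are counted as volatility. The ATR is used to set stop-loss levels, determine position size, and assess market volatility.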




In [55]:
# Calculate Average True Range (ATR) for each stock
aapl_price['ATR'] = talib.ATR(aapl_price['high'], aapl_price['low'], aapl_price['close'], timeperiod=14)
msft_price['ATR'] = talib.ATR(msft_price['high'], msft_price['low'], msft_price['close'], timeperiod=14)
googl_price['ATR'] = talib.ATR(googl_price['high'], googl_price['low'], googl_price['close'], timeperiod=14)
amzn_price['ATR'] = talib.ATR(amzn_price['high'], amzn_price['low'], amzn_price['close'], timeperiod=14)
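
The difference between a plain high-low range and the true range shows up when prices gap. The sketch below computes the true range and a Wilder-style average in pandas; the function names and the two-day example are assumptions for illustration, and the smoothing is seeded differently from `talib.ATR`, so early values will not match it exactly.

```python
import pandas as pd


def true_range(high: pd.Series, low: pd.Series, close: pd.Series) -> pd.Series:
    """Largest of high-low, |high - previous close|, |low - previous close|."""
    prev_close = close.shift(1)
    candidates = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    )
    # On the first row there is no previous close, so only high-low is used
    return candidates.max(axis=1)


def atr_wilder(high, low, close, period: int = 14) -> pd.Series:
    """Wilder-style exponential average of the true range."""
    return true_range(high, low, close).ewm(alpha=1 / period, adjust=False).mean()


# Two made-up days with a large gap up between them (illustrative only)
high = pd.Series([10.0, 15.0])
low = pd.Series([9.0, 14.0])
close = pd.Series([9.5, 14.5])
tr = true_range(high, low, close)
atr = atr_wilder(high, low, close)
print(tr.tolist())
```

On the second day the high-low range is only 1.0, but the true range is 5.5 because the gap from the previous close of 9.5 to the high of 15.0 is included.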


* **Keltner Channels:**  
Keltner Channels are volatility-based envelopes drawn around a moving average, with upper and lower bands placed a multiple of the Average True Range (ATR) away from it. The common modern formulation uses an exponential moving average (EMA) as the middle line; the code below instead uses a 20-day simple moving average, with bands set at twice the 14-day ATR above and below it. The bands widen and narrow with price volatility. Keltner Channels are used to identify overbought or oversold conditions, trend direction, and potential entry or exit points for trades.

In [56]:
# Calculate Keltner Channels for AAPL
aapl_price['Keltner_middle'] = aapl_price['close'].rolling(window=20).mean()
aapl_price['Keltner_upper'] = aapl_price['Keltner_middle'] + (2 * aapl_price['ATR'])
aapl_price['Keltner_lower'] = aapl_price['Keltner_middle'] - (2 * aapl_price['ATR'])

# Calculate Keltner Channels for MSFT
msft_price['Keltner_middle'] = msft_price['close'].rolling(window=20).mean()
msft_price['Keltner_upper'] = msft_price['Keltner_middle'] + (2 * msft_price['ATR'])
msft_price['Keltner_lower'] = msft_price['Keltner_middle'] - (2 * msft_price['ATR'])

# Calculate Keltner Channels for GOOGL
googl_price['Keltner_middle'] = googl_price['close'].rolling(window=20).mean()
googl_price['Keltner_upper'] = googl_price['Keltner_middle'] + (2 * googl_price['ATR'])
googl_price['Keltner_lower'] = googl_price['Keltner_middle'] - (2 * googl_price['ATR'])

# Calculate Keltner Channels for AMZN
amzn_price['Keltner_middle'] = amzn_price['close'].rolling(window=20).mean()
amzn_price['Keltner_upper'] = amzn_price['Keltner_middle'] + (2 * amzn_price['ATR'])
amzn_price['Keltner_lower'] = amzn_price['Keltner_middle'] - (2 * amzn_price['ATR'])
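
The description above names an EMA as the centre line, while the cell above computes it with `rolling(window=20).mean()`, a simple moving average. For comparison, an EMA-based variant could look like the sketch below. The function name `keltner_channels` and the constant test inputs are assumptions, and switching to this variant would change the missing-value counts, since an EMA with `adjust=False` has no warm-up NaNs of its own.

```python
import pandas as pd


def keltner_channels(close: pd.Series, atr: pd.Series, ema_span: int = 20, multiplier: float = 2.0):
    """Return (upper, middle, lower) with an EMA middle line and ATR-based bands."""
    middle = close.ewm(span=ema_span, adjust=False).mean()
    upper = middle + multiplier * atr
    lower = middle - multiplier * atr
    return upper, middle, lower


# Constant inputs make the expected band values obvious (illustrative only)
flat_close = pd.Series([50.0] * 30)
flat_atr = pd.Series([1.5] * 30)
k_upper, k_middle, k_lower = keltner_channels(flat_close, flat_atr)
print(k_upper.iloc[-1], k_middle.iloc[-1], k_lower.iloc[-1])
```

With the real data, `atr` would be the existing `ATR` column, for example `keltner_channels(aapl_price['close'], aapl_price['ATR'])`.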


### Volume Indicators

* **On-Balance Volume (OBV):**   
On-Balance Volume (OBV) is a cumulative volume indicator: each day's volume is added to the running total when the close is above the previous close and subtracted when it is below. The resulting line reflects the balance of buying and selling pressure. OBV is used to confirm price trends, identify potential reversals, and assess the flow of money in and out of a security.

In [57]:
aapl_price['OBV'] = talib.OBV(aapl_price['close'], aapl_price['volume'])
msft_price['OBV'] = talib.OBV(msft_price['close'], msft_price['volume'])
googl_price['OBV'] = talib.OBV(googl_price['close'], googl_price['volume'])
amzn_price['OBV'] = talib.OBV(amzn_price['close'], amzn_price['volume'])
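
The add-or-subtract rule can be checked by hand against the first five AAPL rows shown in this notebook. The sketch below is a pandas illustration, with `on_balance_volume` as an assumed helper name; it seeds the running total with the first day's volume, which is what the `OBV` column in the notebook output appears to do.

```python
import numpy as np
import pandas as pd


def on_balance_volume(close: pd.Series, volume: pd.Series) -> pd.Series:
    """Cumulative volume, signed by the direction of the close-to-close move."""
    direction = np.sign(close.diff())   # +1 up day, -1 down day, 0 unchanged
    direction.iloc[0] = 1.0             # assumption: count the first day's volume as positive
    return (direction * volume).cumsum()


# First five AAPL rows from the output shown earlier (2010-01-04 to 2010-01-08)
close = pd.Series([7.643214, 7.656429, 7.534643, 7.520714, 7.570714])
volume = pd.Series([493729600.0, 601904800.0, 552160000.0, 477131200.0, 447610800.0])
obv = on_balance_volume(close, volume)
print(obv.tolist())
```

The second close is higher than the first, so its volume is added; the next two closes are lower, so their volumes are subtracted.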


* **Accumulation/Distribution Line:**   
The Accumulation/Distribution Line is a volume indicator that combines price and volume data to assess the flow of money in and out of a security. Each day's volume is weighted by the money flow multiplier, ((close - low) - (high - close)) / (high - low), which ranges from -1 (close at the low) to +1 (close at the high), and the weighted volumes are accumulated into a running total. The Accumulation/Distribution Line is used to confirm price trends, identify potential reversals, and assess the strength of price movements.

In [58]:
# 'AD' is the daily money flow volume: the money flow multiplier times volume
aapl_price['AD'] = ((aapl_price['close'] - aapl_price['low']) - (aapl_price['high'] - aapl_price['close'])) / (aapl_price['high'] - aapl_price['low']) * aapl_price['volume']
# 'AD_line' is the running total of the daily values
aapl_price['AD_line'] = aapl_price['AD'].cumsum()

msft_price['AD'] = ((msft_price['close'] - msft_price['low']) - (msft_price['high'] - msft_price['close'])) / (msft_price['high'] - msft_price['low']) * msft_price['volume']
msft_price['AD_line'] = msft_price['AD'].cumsum()

googl_price['AD'] = ((googl_price['close'] - googl_price['low']) - (googl_price['high'] - googl_price['close'])) / (googl_price['high'] - googl_price['low']) * googl_price['volume']
googl_price['AD_line'] = googl_price['AD'].cumsum()

amzn_price['AD'] = ((amzn_price['close'] - amzn_price['low']) - (amzn_price['high'] - amzn_price['close'])) / (amzn_price['high'] - amzn_price['low']) * amzn_price['volume']
amzn_price['AD_line'] = amzn_price['AD'].cumsum()
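
The formula divides by `high - low`, which is zero on any day where the high equals the low. A defensive variant might treat such days as contributing no money flow. The sketch below is one possible approach, not the notebook's method; the helper names and the three-day example are assumptions.

```python
import numpy as np
import pandas as pd


def money_flow_multiplier(high: pd.Series, low: pd.Series, close: pd.Series) -> pd.Series:
    """Close location within the day's range, from -1 (at the low) to +1 (at the high)."""
    price_range = (high - low).replace(0, np.nan)   # avoid dividing by zero
    mfm = ((close - low) - (high - close)) / price_range
    # Assumption: a zero-range day adds nothing. Note this also zeroes rows with missing prices.
    return mfm.fillna(0.0)


def accumulation_distribution(high, low, close, volume) -> pd.Series:
    """Running total of money flow multiplier times volume."""
    return (money_flow_multiplier(high, low, close) * volume).cumsum()


# Made-up days: close at the high, close at the low, then a zero-range day
high = pd.Series([10.0, 10.0, 5.0])
low = pd.Series([8.0, 8.0, 5.0])
close = pd.Series([10.0, 8.0, 5.0])
volume = pd.Series([100.0, 200.0, 300.0])
ad_line = accumulation_distribution(high, low, close, volume)
print(ad_line.tolist())
```

The first day adds its full volume, the second subtracts its full volume, and the third leaves the line unchanged.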



* **Chaikin Money Flow (CMF):**   
Chaikin Money Flow (CMF) is a volume indicator that measures buying and selling pressure over a fixed window. It divides the sum of money flow volume (the money flow multiplier times volume) over the window, 20 days in the code below, by the sum of volume over the same window, so it ranges from -1 to +1. Positive values suggest buying pressure and negative values suggest selling pressure. The CMF is used to confirm price trends, identify potential reversals, and assess the strength of price movements.

In [56]:
# Calculate Chaikin Money Flow (CMF) for AAPL
aapl_price['MF_multiplier'] = ((aapl_price['close'] - aapl_price['low']) - (aapl_price['high'] - aapl_price['close'])) / (aapl_price['high'] - aapl_price['low'])
aapl_price['MF_volume'] = aapl_price['MF_multiplier'] * aapl_price['volume']
aapl_price['CMF'] = aapl_price['MF_volume'].rolling(window=20).sum() / aapl_price['volume'].rolling(window=20).sum()

# Calculate Chaikin Money Flow (CMF) for MSFT
msft_price['MF_multiplier'] = ((msft_price['close'] - msft_price['low']) - (msft_price['high'] - msft_price['close'])) / (msft_price['high'] - msft_price['low'])
msft_price['MF_volume'] = msft_price['MF_multiplier'] * msft_price['volume']
msft_price['CMF'] = msft_price['MF_volume'].rolling(window=20).sum() / msft_price['volume'].rolling(window=20).sum()

# Calculate Chaikin Money Flow (CMF) for GOOGL
googl_price['MF_multiplier'] = ((googl_price['close'] - googl_price['low']) - (googl_price['high'] - googl_price['close'])) / (googl_price['high'] - googl_price['low'])
googl_price['MF_volume'] = googl_price['MF_multiplier'] * googl_price['volume']
googl_price['CMF'] = googl_price['MF_volume'].rolling(window=20).sum() / googl_price['volume'].rolling(window=20).sum()

# Calculate Chaikin Money Flow (CMF) for AMZN
amzn_price['MF_multiplier'] = ((amzn_price['close'] - amzn_price['low']) - (amzn_price['high'] - amzn_price['close'])) / (amzn_price['high'] - amzn_price['low'])
amzn_price['MF_volume'] = amzn_price['MF_multiplier'] * amzn_price['volume']
amzn_price['CMF'] = amzn_price['MF_volume'].rolling(window=20).sum() / amzn_price['volume'].rolling(window=20).sum()
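
The four blocks above repeat the same three steps, so the calculation could be folded into one helper and applied to each frame. The sketch below assumes a DataFrame with `high`, `low`, `close`, and `volume` columns; the function name `chaikin_money_flow` and the three-row example with a 2-day window are illustrative assumptions.

```python
import pandas as pd


def chaikin_money_flow(df: pd.DataFrame, window: int = 20) -> pd.Series:
    """Sum of money flow volume over the window divided by the sum of volume."""
    multiplier = ((df["close"] - df["low"]) - (df["high"] - df["close"])) / (df["high"] - df["low"])
    money_flow_volume = multiplier * df["volume"]
    return money_flow_volume.rolling(window).sum() / df["volume"].rolling(window).sum()


# Made-up data with a 2-day window so the result can be checked by hand
demo = pd.DataFrame(
    {
        "high": [10.0, 10.0, 10.0],
        "low": [8.0, 8.0, 8.0],
        "close": [10.0, 9.0, 8.0],       # close at high, at midpoint, at low
        "volume": [100.0, 100.0, 200.0],
    }
)
cmf = chaikin_money_flow(demo, window=2)
print(cmf.tolist())
```

With the notebook's frames this could be used as `aapl_price['CMF'] = chaikin_money_flow(aapl_price)`, and likewise for the other three tickers.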


In [57]:
aapl_price.head()

date,close,high,low,open,volume,sma_20,ema_20,macd,macd_signal,sar,RSI,slowk,slowd,ROC,upper_band,middle_band,lower_band,ATR,Keltner_middle,Keltner_upper,Keltner_lower,OBV,AD,AD_line,MF_multiplier,MF_volume,CMF
2010-01-04,7.643214,7.660714,7.585,7.6225,493729600.0,,7.643214,,,,,,,,,,,,,,,493729600.0,265496600.0,265496600.0,0.537737,265496600.0,
2010-01-05,7.656429,7.699643,7.616071,7.664286,601904800.0,,7.644473,,,7.585,,,,,,,,,,,,1095634000.0,-20574860.0,244921700.0,-0.034183,-20574860.0,
2010-01-06,7.534643,7.686786,7.526786,7.656429,552160000.0,,7.634013,,,7.699643,,,,,,,,,,,,543474400.0,-497928900.0,-253007200.0,-0.901784,-497928900.0,
2010-01-07,7.520714,7.571429,7.466071,7.5625,477131200.0,,7.623222,,,7.699643,,,,,,,,,,,,66343200.0,17787340.0,-235219800.0,0.03728,17787340.0,
2010-01-08,7.570714,7.571429,7.466429,7.510714,447610800.0,,7.618222,,,7.6903,,,,,,,,,,,,513954000.0,441516600.0,206296800.0,0.986385,441516600.0,


In [58]:
aapl_price.isnull().sum()

close              0
high               0
low                0
open               0
volume             0
sma_20            19
ema_20             0
macd              33
macd_signal       33
sar                1
RSI               14
slowk             17
slowd             17
ROC               10
upper_band        19
middle_band       19
lower_band        19
ATR               14
Keltner_middle    19
Keltner_upper     19
Keltner_lower     19
OBV                0
AD                 0
AD_line            0
MF_multiplier      0
MF_volume          0
CMF               19
dtype: int64

#### Fundamental Metrics

In addition to technical indicators, fundamental metrics provide insight into a company's financial health, performance, valuation, and growth prospects. They are quantitative data points derived from financial statements, earnings reports, and other fundamental data sources. Investors use them to assess a company's intrinsic value, evaluate its competitive position, and identify potential investment opportunities. Adding them to our feature set gives the machine learning models information that price and volume alone do not capture, which may improve predictive power and helps explain the factors driving stock price movements.


* Categories of Fundamental Metrics
	1.	**Valuation Metrics**: These metrics assess the relative value of a company's stock by comparing its market price to fundamental indicators such as earnings, book value, and sales. Examples include:
		- `Price-to-Earnings (P/E) Ratio`: Stock price divided by earnings per share.
		- `Price-to-Book (P/B) Ratio`: Stock price divided by book value per share.
		- `Price-to-Sales (P/S) Ratio`: Stock price divided by revenue per share.

	2.	**Profitability Metrics**: These metrics measure a company's ability to generate profits and manage costs effectively. Examples include:
		- `Return on Equity (ROE)`: Net income divided by shareholders' equity.
		- `Net Profit Margin`: The percentage of revenue left as net income after all expenses, including interest and taxes.
		- `Operating Margin`: The percentage of revenue left as operating income after operating expenses.

	3.	**Growth Metrics**: These metrics assess a company's growth prospects and potential for future expansion. Examples include:
		- `Revenue Growth Rate`: The percentage change in a company's revenue over a specific period.
		- `Earnings Growth Rate`: The percentage change in a company's earnings over a specific period.
		- `Dividend Yield`: Annual dividends per share as a percentage of the stock price (an income measure, typically read alongside the growth rates above).

	4.	**Financial Health Metrics**: These metrics evaluate a company's financial stability, liquidity, and debt levels. Examples include:
		- `Debt-to-Equity Ratio`: Total debt divided by shareholders' equity, a measure of financial leverage.
		- `Current Ratio`: Current assets divided by current liabilities, a measure of the ability to cover short-term obligations.
		- `Interest Coverage Ratio`: Earnings before interest and taxes divided by interest expense, a measure of the ability to service debt.
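
To show how several of the ratios listed above follow from a handful of statement figures, here is a small sketch. Everything in it is hypothetical: the `Fundamentals` dataclass, its field names, and the numbers are made up for illustration and do not describe any of the four companies. Operating income is used as a stand-in for earnings before interest and taxes.

```python
from dataclasses import dataclass


@dataclass
class Fundamentals:
    """Hypothetical container for the inputs the ratios need."""
    price: float
    shares_outstanding: float
    net_income: float
    revenue: float
    operating_income: float
    total_equity: float
    total_debt: float
    current_assets: float
    current_liabilities: float
    interest_expense: float
    annual_dividend_per_share: float


def fundamental_ratios(f: Fundamentals) -> dict:
    """Compute valuation, profitability, and financial health ratios."""
    eps = f.net_income / f.shares_outstanding
    book_value_per_share = f.total_equity / f.shares_outstanding
    sales_per_share = f.revenue / f.shares_outstanding
    return {
        "pe_ratio": f.price / eps,
        "pb_ratio": f.price / book_value_per_share,
        "ps_ratio": f.price / sales_per_share,
        "roe": f.net_income / f.total_equity,
        "net_profit_margin": f.net_income / f.revenue,
        "operating_margin": f.operating_income / f.revenue,
        "dividend_yield": f.annual_dividend_per_share / f.price,
        "debt_to_equity": f.total_debt / f.total_equity,
        "current_ratio": f.current_assets / f.current_liabilities,
        # Assumption: operating income approximates EBIT
        "interest_coverage": f.operating_income / f.interest_expense,
    }


# Made-up figures chosen so the ratios are easy to check by hand
example = Fundamentals(
    price=50.0, shares_outstanding=100.0, net_income=500.0, revenue=2000.0,
    operating_income=700.0, total_equity=2500.0, total_debt=1000.0,
    current_assets=900.0, current_liabilities=600.0, interest_expense=70.0,
    annual_dividend_per_share=1.0,
)
ratios = fundamental_ratios(example)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Growth rates are omitted because they need figures from two periods; they follow the same pattern as a percentage change between the earlier and later values.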

____