# COGS 108 - Data Checkpoint

# Names

- Auritro Dutta
- Jacquelyn Garcia
- Prabhmeet Gujral
- Ethan Heath
- Aniruddh Krovvidi

# Research Question

Can historical financial data be combined with Environmental, Social, and Governance (ESG) criteria in a Machine Learning model to predict future stock prices of companies that meet high ESG standards, and thereby support data-driven, socially responsible investing? The model will be trained on the following features:

* Financial Indicators: The model will use traditional financial metrics such as historical stock prices, financial ratios (e.g., P/E ratio, debt-to-equity ratio), and indicators of volatility.

* ESG Scores: ESG data will be categorized into Environmental, Social, and Governance scores, which include sub-factors like carbon footprint (Environmental), employee welfare (Social), and board diversity (Governance).


## Background and Prior Work

#### Introduction to ESG Investing and Stock Price Prediction
Environmental, Social, and Governance (ESG) investing has gained significant traction in recent years as investors seek to align their financial goals with their values. ESG investing involves considering a company's environmental impact, social responsibilities, and governance practices alongside traditional financial metrics. The underlying hypothesis is that companies with high ESG ratings not only contribute positively to society but also exhibit more stable and potentially superior financial performance. Our project aims to leverage Machine Learning to predict stock prices by incorporating ESG scores with traditional financial indicators. This dual-focus approach aims to provide insights into whether ESG factors enhance the predictive power of financial models, thereby supporting socially responsible investment decisions.

#### Prior Work on ESG and Financial Performance
Previous studies have explored the relationship between ESG factors and financial performance, providing a foundation for our research. A notable study by Friede, Busch, and Bassen (2015) conducted a meta-analysis of over 2,000 empirical studies and found that the majority of these studies reported a positive relationship between ESG factors and corporate financial performance. This comprehensive review suggests that ESG criteria can be financially beneficial and supports the hypothesis that ESG-compliant companies may exhibit favorable stock performance.<a name="#fn1"></a>[<sup>1</sup>](#note-1)

Further, a study by Khan, Serafeim, and Yoon (2016) published in The Accounting Review examined how material ESG issues (those likely to affect a company's financial condition) are linked to stock price performance. They found that firms with good performance on material sustainability issues outperform those with poor performance, suggesting that ESG factors, when material, can provide valuable insights for investors.<a name="#fn2"></a>[<sup>2</sup>](#note-2)

In the realm of Machine Learning, there have been several attempts to predict stock prices using various algorithms. For instance, the use of LSTM neural networks for financial market prediction has been well-documented. A study by Fischer and Krauss (2018) used LSTM networks to predict the directional movements of S&P 500 constituent stocks and found that LSTM models outperformed memory-free methods such as random forests and logistic regression, which they attribute to the networks' ability to capture temporal dependencies in financial data.<a name="#fn3"></a>[<sup>3</sup>](#note-3) This study highlights the potential of neural networks to improve the accuracy of stock predictions.

#### In-Depth Study Analysis
An in-depth analysis of the intersection of ESG investing and stock price prediction reveals a growing body of work focused on integrating ESG factors into financial models. One such study by Henisz, Koller, and Nuttall (2019) in the McKinsey Quarterly emphasized the increasing importance of ESG factors in driving long-term financial performance. Their research suggested that ESG issues are often linked to critical factors such as regulatory compliance, operational efficiencies, and brand reputation, which can significantly impact stock prices.<a name="#fn4"></a>[<sup>4</sup>](#note-4) This is promising for those looking to make sustainable investments.

Another significant study by Bolton and Kacperczyk (2020) investigated the relationship between carbon emissions and stock returns. They found that firms with higher carbon emissions tend to earn higher stock returns, which the authors interpret as investors demanding compensation for carbon risk. This indicates that environmental factors are priced by the market and aligns with the broader hypothesis that ESG factors, particularly environmental issues, influence investor behavior and stock price trends.<a name="#fn5"></a>[<sup>5</sup>](#note-5)

Furthermore, a paper by Albuquerque, Koskinen, and Zhang (2019) published in Management Science examined how corporate social responsibility (CSR) activities influence firm risk and value. They found that firms engaging in CSR activities generally have lower systematic risk and higher firm value, supporting the integration of social factors into investment models.<a name="#fn6"></a>[<sup>6</sup>](#note-6)

#### Relevant References
1. <a name="#note-1"></a>[^](#fn1) Friede, G., Busch, T., & Bassen, A. (2015). ESG and financial performance: aggregated evidence from more than 2000 empirical studies. Journal of Sustainable Finance & Investment, 5(4), 210-233. 
https://www.tandfonline.com/doi/full/10.1080/20430795.2015.1118917#d1e255

2. <a name="#note-2"></a>[^](#fn2) Khan, M., Serafeim, G., & Yoon, A. (2016). Corporate sustainability: First evidence on materiality. The Accounting Review, 91(6), 1697-1724. 
https://dash.harvard.edu/bitstream/handle/1/14369106/15-073.pdf;jsessionid=5212220466676E63E99E26EF77D83571?sequence=1

3. <a name="#note-3"></a>[^](#fn3) Fischer, T., & Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2), 654-669. 
https://www.sciencedirect.com/science/article/pii/S0377221717310652

4. <a name="#note-4"></a>[^](#fn4) Henisz, W., Koller, T., & Nuttall, R. (2019). Five ways that ESG creates value. McKinsey Quarterly. 
https://info.fiduciary-trust.com/hubfs/Fiduciary_Insights/McKinsey_Five_Ways_that_ESG_Creates_Value.pdf

5. <a name="#note-5"></a>[^](#fn5) Bolton, P., & Kacperczyk, M. (2020). Do investors care about carbon risk? Journal of Financial Economics, 142(2), 517-549. 
https://www.sciencedirect.com/science/article/pii/S0304405X21001902

6. <a name="#note-6"></a>[^](#fn6) Albuquerque, R., Koskinen, Y., & Zhang, C. (2019). Corporate Social Responsibility and Firm Risk: Theory and Empirical Evidence. Management Science, 65(10), 4451-4469.
https://pubsonline.informs.org/doi/epdf/10.1287/mnsc.2018.3043

By building on these studies, our project will integrate ESG scores with traditional financial indicators in a machine learning model to predict stock prices, aiming to validate and extend the understanding of ESG's role in financial performance. We will explore whether ESG factors provide additional predictive power beyond traditional metrics, potentially leading to more robust and socially responsible investment strategies.

# Hypothesis


We hypothesize that incorporating ESG factors alongside traditional financial indicators will improve the accuracy of stock price predictions and, in doing so, help investors identify sustainable investments.

Specifically, we believe that companies with higher ESG scores will demonstrate more stable and potentially superior stock performance compared to those with lower ESG scores. 

This hypothesis is based on existing literature that indicates a positive correlation between ESG compliance and financial performance, suggesting that ESG factors provide valuable insights into a company's long-term stability and growth potential. By integrating ESG scores with a machine learning model, we expect to capture additional dimensions of company performance that are not fully reflected in traditional financial metrics, leading to more accurate and socially responsible investment predictions.

# Data

For meaningful analysis, we will consider data from publicly traded companies across various sectors and spanning multiple years, so that we capture long-term trends rather than isolated industry cases. Our financial variables include daily closing stock prices over a fixed period, P/E ratios, dividend yield, and market capitalization. We will also use ESG scores, together with ESG risk ratings, to gauge the sustainability level associated with each stock.

Organizing this data by company and date in a structured database would allow for easy querying and updating during ongoing analysis. Financial data, including historical stock prices and ratios, can be sourced from Yahoo Finance, which offers consistent financial metrics through public APIs; alternative providers include Alpha Vantage and Quandl. ESG scores often require more specialized sources: agencies like MSCI and Refinitiv provide detailed ESG datasets, though access may be restricted and coverage varies with company size, geography, and disclosure practices. Public filings and government databases, such as those of the EPA and SEC in the U.S., can also supplement ESG data.

By combining these sources, we can create an ideal dataset that captures both financial and ESG factors. This will allow for easier investigation into the impact of ESG factors in financial performance prediction, even with certain limitations.

Potential datasets that will be useful to our project include:

- Yahoo Finance for historical stock prices and financial metrics.
- MSCI or Refinitiv for ESG ratings and scores.
- Government databases for environmental and social indicators.
- Other financial data providers that might offer public APIs for stock data.

While these sources may not provide the ideal dataset in its entirety, they each offer valuable components that could contribute to a comprehensive analysis of ESG’s impact on stock performance.

### Data Overview
Dataset #1
- Dataset Name: Public Company ESG Ratings
- Link to the dataset: https://www.kaggle.com/datasets/alistairking/public-company-esg-ratings-dataset
- Number of observations: 722
- Number of variables: 21

Dataset #2
- Dataset Name: S&P 500 ESG Risk Ratings
- Link to the dataset: https://www.kaggle.com/datasets/pritish509/s-and-p-500-esg-risk-ratings
- Number of observations: 503
- Number of variables: 15

The first dataset provides scores on the major ESG metrics, while the second provides risk ratings for the same three categories. The most important variables are the ESG scores in the first dataset and the ESG risk scores in the second; all are numeric. Additionally, Yahoo Finance will supply financial metrics for the same stocks. Together, these datasets will help us evaluate the financial health and sustainability level of each stock. The wrangling required to make these datasets usable, along with the methods used to merge them, is described below.

## Dataset #1: Public Company ESG Ratings

In [1]:
import pandas as pd

esg_data = pd.read_csv('Data/ESGData.csv')
esg_data.shape

(722, 21)

In [2]:
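# Keep only the identifier and score columns; .copy() avoids a SettingWithCopyWarning on later edits
esg_data = esg_data[['ticker', 'name', 'environment_score', 'social_score', 'governance_score', 'total_score']].copy()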
print(esg_data.isnull().sum())

ticker               0
name                 0
environment_score    0
social_score         0
governance_score     0
total_score          0
dtype: int64


In [3]:
esg_data.head(6)

Unnamed: 0,ticker,name,environment_score,social_score,governance_score,total_score
0,dis,Walt Disney Co,510,316,321,1147
1,gm,General Motors Co,510,303,255,1068
2,gww,WW Grainger Inc,255,385,240,880
3,mhk,Mohawk Industries Inc,570,298,303,1171
4,lyv,Live Nation Entertainment Inc,492,310,250,1052
5,lvs,Las Vegas Sands Corp,547,318,313,1178


In [4]:
esg_data['ticker'] = esg_data['ticker'].str.upper()
esg_data.head(6)

Unnamed: 0,ticker,name,environment_score,social_score,governance_score,total_score
0,DIS,Walt Disney Co,510,316,321,1147
1,GM,General Motors Co,510,303,255,1068
2,GWW,WW Grainger Inc,255,385,240,880
3,MHK,Mohawk Industries Inc,570,298,303,1171
4,LYV,Live Nation Entertainment Inc,492,310,250,1052
5,LVS,Las Vegas Sands Corp,547,318,313,1178


This dataset provides Environmental, Social, and Governance (ESG) ratings for publicly traded companies, with important variables including **environment_score**, **social_score**, **governance_score**, and **total_score**. Each of these scores represents an assessment of the company's practices in various ESG categories. For instance, **environment_score** reflects how well the company manages its environmental impact, while **governance_score** assesses the quality of its corporate governance practices. These scores are numerical, which allows us to use them as metrics for quantifying a company’s sustainability and ethical performance. In terms of wrangling, we simplified the dataset by dropping unnecessary columns and ensuring that all ticker symbols were converted to uppercase to facilitate merging with other datasets.

## Dataset #2: S&P500 ESG Risk Ratings

In [5]:
risk_ratings = pd.read_csv('Data/RiskRatings.csv')
risk_ratings.shape

(503, 15)

In [6]:
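# Keep only the identifier and risk-score columns; .copy() avoids a SettingWithCopyWarning on later edits
risk_ratings = risk_ratings[['Symbol', 'Name', 'Total ESG Risk score', 'Environment Risk Score', 'Governance Risk Score', 'Social Risk Score']].copy()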

In [7]:
risk_ratings.rename(columns={'Symbol': 'ticker'}, inplace=True)
print(risk_ratings.isnull().sum())

ticker                     0
Name                       0
Total ESG Risk score      73
Environment Risk Score    73
Governance Risk Score     73
Social Risk Score         73
dtype: int64


In [8]:
risk_ratings.dropna(inplace=True)
risk_ratings.head(6)

Unnamed: 0,ticker,Name,Total ESG Risk score,Environment Risk Score,Governance Risk Score,Social Risk Score
1,EMN,Eastman Chemical Company,25.3,12.8,6.6,5.8
2,DPZ,Domino's Pizza Inc.,29.2,10.6,6.3,12.2
4,DVA,Davita Inc.,22.6,0.1,8.4,14.1
5,DRI,"Darden Restaurants, Inc.",27.5,7.9,4.6,15.0
6,ZTS,Zoetis Inc.,18.8,3.2,8.7,6.8
7,ZBH,"Zimmer Biomet Holdings, Inc.",26.0,3.6,7.9,14.5


The S&P500 ESG Risk Ratings dataset contains risk scores for companies based on their environmental, social, and governance practices, with fields such as **Total ESG Risk score**, **Environment Risk Score**, **Governance Risk Score**, and **Social Risk Score**. The Total ESG Risk score quantifies a company's overall risk related to ESG factors, and the individual **Environment**, **Social**, and **Governance** risk scores give more detailed insight into specific areas of risk. Each risk-score column has 73 missing values, which likely reflect unavailable or incomplete assessments of a company's ESG risk; we dropped the affected rows.

## Combining Datasets

In [9]:
merged_data = pd.merge(esg_data, risk_ratings, on='ticker')

merged_data.head(6)

Unnamed: 0,ticker,name,environment_score,social_score,governance_score,total_score,Name,Total ESG Risk score,Environment Risk Score,Governance Risk Score,Social Risk Score
0,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0
1,GM,General Motors Co,510,303,255,1068,General Motors Company,28.5,9.8,7.2,11.5
2,GWW,WW Grainger Inc,255,385,240,880,"W.W. Grainger, Inc.",16.0,4.5,5.9,5.6
3,MHK,Mohawk Industries Inc,570,298,303,1171,"Mohawk Industries, Inc.",14.1,5.8,4.2,4.2
4,LVS,Las Vegas Sands Corp,547,318,313,1178,Las Vegas Sands Corp.,18.7,2.3,8.7,7.7
5,CLX,Clorox Co,560,350,345,1255,Clorox Company,21.6,7.7,5.3,8.6


In [10]:
import yfinance as yf

# Define the tickers of interest
tickers = merged_data['ticker'].unique().tolist()

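# Download historical stock price data
# auto_adjust=False keeps a separate 'Adj Close' column, which the cells below rely on
# (some yfinance releases adjust prices by default and omit that column)
stock_data = yf.download(tickers, start='2019-01-01', end='2022-12-31', auto_adjust=False)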

[*********************100%***********************]  367 of 367 completed


In [11]:
stock_data.head(6)

Price,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,...,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume
Ticker,A,AAL,AAPL,ABBV,ABT,ACGL,ACN,ADBE,ADI,ADM,...,WMT,WRB,WY,WYNN,XEL,XOM,XYL,YUM,ZBH,ZTS
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2019-01-02 00:00:00+00:00,63.061295,31.963158,37.708595,68.155113,62.657753,24.904034,129.003647,224.570007,77.160187,34.588242,...,24458100,1251788,7442000,4174400,4476100,16727200,990900,1743400,1186663,2665600
2019-01-03 00:00:00+00:00,60.738132,29.581665,33.952545,65.909515,59.700691,24.514164,124.599243,215.699997,72.499313,34.436195,...,24831900,1270013,9788300,2885100,5287600,13866100,1243100,2680300,1201083,2390900
2019-01-04 00:00:00+00:00,62.840508,31.530159,35.401951,68.032906,61.404617,25.094212,129.444107,226.190002,74.259499,35.230171,...,24087300,1783350,5843900,3007200,5535600,16043600,970900,2142700,1627812,3383500
2019-01-07 00:00:00+00:00,64.174858,32.425678,35.323154,69.025864,62.324181,25.037159,129.893707,229.259995,74.726471,35.407543,...,23369100,1006088,6562200,3143800,4208100,10844200,1341600,2561100,927309,2360800
2019-01-08 00:00:00+00:00,65.115685,31.904114,35.996525,69.346664,61.575916,25.132248,133.178711,232.679993,76.549507,36.066372,...,21602700,1399613,6232800,2255700,3128300,11439000,912800,2604700,3356770,2250300
2019-01-09 00:00:00+00:00,66.478828,32.888199,36.607803,67.070518,62.603668,25.455555,133.848511,235.429993,78.489288,36.378887,...,18816900,1671300,3926900,2193000,3625600,13473500,1164400,2279700,1558699,3158400


In [12]:
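# Note: .info returns a current snapshot, so each ticker gets one value per field,
# and that value is repeated across every date once merged with the price history
financial_indicators = []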
for ticker in tickers:
    stock_info = yf.Ticker(ticker).info
    financial_indicators.append({
        'ticker': ticker,
        'market_cap': stock_info.get('marketCap'),
        'pe_ratio': stock_info.get('trailingPE'),
        'dividend_yield': stock_info.get('dividendYield')
    })

financial_indicators_df = pd.DataFrame(financial_indicators)

In [13]:
financial_indicators_df.head(6)

Unnamed: 0,ticker,market_cap,pe_ratio,dividend_yield
0,DIS,209435213824,42.518383,0.0082
1,GM,64359587840,6.246531,0.0082
2,GWW,58764095488,32.68283,0.0068
3,MHK,8759343104,15.680225,
4,LVS,36309303296,24.79208,0.016
5,CLX,20956123136,58.78472,0.0288


In [14]:
# Adjusted close prices as a dataframe with 'Date' and 'ticker'
adj_close = stock_data['Adj Close']
adj_close = adj_close.stack().reset_index()
adj_close.columns = ['Date', 'ticker', 'adj_close']

adj_close.head(6)

Unnamed: 0,Date,ticker,adj_close
0,2019-01-02 00:00:00+00:00,A,63.061295
1,2019-01-02 00:00:00+00:00,AAL,31.963158
2,2019-01-02 00:00:00+00:00,AAPL,37.708595
3,2019-01-02 00:00:00+00:00,ABBV,68.155113
4,2019-01-02 00:00:00+00:00,ABT,62.657753
5,2019-01-02 00:00:00+00:00,ACGL,24.904034


In [15]:
# Merge ESG and stock price data on 'ticker'; each company's ESG row repeats for every trading date
prefinal_data = pd.merge(merged_data, adj_close, on='ticker', how='inner')

prefinal_data.head(6)

Unnamed: 0,ticker,name,environment_score,social_score,governance_score,total_score,Name,Total ESG Risk score,Environment Risk Score,Governance Risk Score,Social Risk Score,Date,adj_close
0,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0,2019-01-02 00:00:00+00:00,106.811844
1,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0,2019-01-03 00:00:00+00:00,104.224129
2,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0,2019-01-04 00:00:00+00:00,107.439171
3,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0,2019-01-07 00:00:00+00:00,108.370354
4,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0,2019-01-08 00:00:00+00:00,109.213318
5,DIS,Walt Disney Co,510,316,321,1147,The Walt Disney Company,15.7,0.0,6.7,9.0,2019-01-09 00:00:00+00:00,110.43856


In [16]:
# Merge with financial indicators
final_data = pd.merge(prefinal_data, financial_indicators_df, on='ticker', how='left')

In [17]:
# Ensure dates are aligned and drop rows with missing values
final_data['Date'] = pd.to_datetime(final_data['Date'])
final_data.dropna(inplace=True)

final_data.drop('name', axis = 1, inplace=True)
final_data = final_data[['ticker', 'Name', 'environment_score', 'social_score', 'governance_score', 
                         'total_score', 'Total ESG Risk score', 'Environment Risk Score', 'Governance Risk Score', 
                         'Social Risk Score', 'adj_close', 'market_cap', 'pe_ratio', 'dividend_yield', 'Date']]
final_data.head(6)

Unnamed: 0,ticker,Name,environment_score,social_score,governance_score,total_score,Total ESG Risk score,Environment Risk Score,Governance Risk Score,Social Risk Score,adj_close,market_cap,pe_ratio,dividend_yield,Date
0,DIS,The Walt Disney Company,510,316,321,1147,15.7,0.0,6.7,9.0,106.811844,209435213824,42.518383,0.0082,2019-01-02 00:00:00+00:00
1,DIS,The Walt Disney Company,510,316,321,1147,15.7,0.0,6.7,9.0,104.224129,209435213824,42.518383,0.0082,2019-01-03 00:00:00+00:00
2,DIS,The Walt Disney Company,510,316,321,1147,15.7,0.0,6.7,9.0,107.439171,209435213824,42.518383,0.0082,2019-01-04 00:00:00+00:00
3,DIS,The Walt Disney Company,510,316,321,1147,15.7,0.0,6.7,9.0,108.370354,209435213824,42.518383,0.0082,2019-01-07 00:00:00+00:00
4,DIS,The Walt Disney Company,510,316,321,1147,15.7,0.0,6.7,9.0,109.213318,209435213824,42.518383,0.0082,2019-01-08 00:00:00+00:00
5,DIS,The Walt Disney Company,510,316,321,1147,15.7,0.0,6.7,9.0,110.43856,209435213824,42.518383,0.0082,2019-01-09 00:00:00+00:00


To merge these datasets, we used the ticker symbol as the common key, aligning companies based on their unique stock symbols. We performed an inner join, ensuring that only companies appearing in both datasets are included. This allows us to create a more comprehensive dataset that incorporates both ESG performance metrics (from Dataset #1) and ESG risk scores (from Dataset #2), giving a fuller picture of how ESG practices correlate with financial and risk metrics.
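
As a hedged illustration of the inner join described above, the sketch below uses two tiny made-up frames (the tickers and values are placeholders, not rows from either Kaggle file) to show how `pd.merge` with `indicator=True` can report which tickers an inner join would drop. The `validate="one_to_one"` argument is an extra safeguard we assume is appropriate here, since each company should appear once per dataset.

```python
import pandas as pd

# Toy stand-ins for esg_data and risk_ratings; values are illustrative only.
esg_toy = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"],
                        "total_score": [1100, 900, 1200]})
risk_toy = pd.DataFrame({"ticker": ["BBB", "CCC", "DDD"],
                         "Total ESG Risk score": [20.1, 15.3, 30.0]})

# An outer join with indicator=True labels each row by where its key was found.
# validate="one_to_one" raises a MergeError if a ticker repeats on either side.
audit = pd.merge(esg_toy, risk_toy, on="ticker", how="outer",
                 indicator=True, validate="one_to_one")

# Tickers an inner join would keep versus drop.
kept = audit.loc[audit["_merge"] == "both", "ticker"].tolist()
dropped = audit.loc[audit["_merge"] != "both", "ticker"].tolist()
print("kept:", kept)
print("dropped:", dropped)
```

Running the same audit on the real frames would show how many companies fall out of the analysis because they appear in only one source.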

Preprocessing included ensuring that data types were compatible for merging (e.g., converting dates to a datetime format and aligning ticker symbols). We also removed unnecessary or redundant columns and dropped rows with missing values to maintain data integrity and facilitate analysis. This combined dataset was further merged with financial indicators and historical stock prices to support our goal of examining whether ESG scores correlate with stock price performance.
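
Because the financial indicators are one value per company, dropping rows with any missing value removes every date for a company that lacks a single field (for example, a stock with no dividend yield). The sketch below, on a made-up two-ticker panel whose column names mirror the notebook's, shows one way to check which tickers disappear and which column is responsible.

```python
import numpy as np
import pandas as pd

# Toy panel: two tickers, three dates each; 'dividend_yield' is missing for YYY.
# Column names mirror the notebook's, but the numbers are made up.
toy = pd.DataFrame({
    "ticker": ["XXX"] * 3 + ["YYY"] * 3,
    "Date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"] * 2),
    "adj_close": [10.0, 10.5, 10.2, 50.0, 51.0, 49.5],
    "dividend_yield": [0.02] * 3 + [np.nan] * 3,
})

tickers_before = set(toy["ticker"])
cleaned = toy.dropna()
tickers_after = set(cleaned["ticker"])

# Tickers that vanished entirely because one per-company field was missing.
lost_tickers = sorted(tickers_before - tickers_after)
# Per-column count of missing values, to see which field caused the loss.
missing_by_column = toy.isna().sum()
print(lost_tickers)
print(missing_by_column[missing_by_column > 0])
```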

## Exploratory Data Analysis

We begin with feature engineering, a crucial step in the machine learning pipeline. This step can enhance the predictive power of models by creating new features from raw data. It also gives us an opportunity to make our models more robust by incorporating domain knowledge into the data. Outlined below are the features we added and how they can help improve our models:

- Lagged Features:
Lagged features capture temporal dependencies and trends in stock prices, providing context about past performance that is often predictive of future trends.

- Moving Averages:
Moving averages smooth out short-term fluctuations and highlight longer-term trends in stock prices. They help identify upward or downward trends over different time horizons.

- Volatility:
Volatility features measure the variability of stock prices, providing insight into the risk associated with a stock. Higher volatility indicates higher risk, and periods of high volatility are often followed by significant price changes.

- Composite ESG Score:
ESG scores reflect a company’s performance on environmental, social, and governance factors, which are increasingly relevant for investors. A composite score provides a holistic view of the company’s sustainability practices. These factors can affect a company’s reputation, regulatory compliance, and operational efficiency, thereby impacting financial performance and stock prices.

Overall, these features give the models a richer dataset to learn from, which we expect to improve the accuracy and reliability of their predictions.
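
One caveat with rolling features is that a window ending on the current day contains the current day's price, which is also the prediction target. A minimal sketch of an alternative, assuming we want every feature to use prior days only, shifts the price by one row within each ticker before rolling. The helper name `add_past_only_features` and the `_past` column suffix are hypothetical, and the toy prices are made up.

```python
import pandas as pd

def add_past_only_features(df, windows=(7, 30)):
    """Hypothetical helper: rolling statistics computed from prior rows only."""
    out = df.sort_values(["ticker", "Date"]).copy()
    # Shift by one row within each ticker so today's price never enters its own window.
    prev = out.groupby("ticker")["adj_close"].shift(1)
    for w in windows:
        grouped = prev.groupby(out["ticker"])
        out[f"ma_{w}_past"] = grouped.transform(lambda s: s.rolling(w).mean())
        out[f"volatility_{w}_past"] = grouped.transform(lambda s: s.rolling(w).std())
    return out

# Toy example with a 2-row window; the last price is an outlier on purpose.
toy = pd.DataFrame({
    "ticker": ["XXX"] * 4,
    "Date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07"]),
    "adj_close": [10.0, 12.0, 14.0, 100.0],
})
features = add_past_only_features(toy, windows=(2,))
print(features[["Date", "adj_close", "ma_2_past"]])
```

In the toy frame, the moving average on the last row is the mean of the two preceding prices and is unaffected by that day's outlier. Note that windows count rows, so a window of 7 spans seven trading days rather than one calendar week.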

In [18]:
# Create lagged features for stock prices
final_data['adj_close_lag_1'] = final_data.groupby('ticker')['adj_close'].shift(1)
final_data['adj_close_lag_7'] = final_data.groupby('ticker')['adj_close'].shift(7)
final_data['adj_close_lag_30'] = final_data.groupby('ticker')['adj_close'].shift(30)

# Calculate moving averages over 7- and 30-row windows (rows are trading days);
# each window ends on, and includes, the current day
final_data['ma_7'] = final_data.groupby('ticker')['adj_close'].transform(lambda x: x.rolling(window=7).mean())
final_data['ma_30'] = final_data.groupby('ticker')['adj_close'].transform(lambda x: x.rolling(window=30).mean())

# Calculate volatility (rolling standard deviation over the same 7- and 30-row windows)
final_data['volatility_7'] = final_data.groupby('ticker')['adj_close'].transform(lambda x: x.rolling(window=7).std())
final_data['volatility_30'] = final_data.groupby('ticker')['adj_close'].transform(lambda x: x.rolling(window=30).std())

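# Develop composite ESG scores
# Equal weighting for simplicity; adjust weights as needed.
# In the sample rows shown earlier, total_score is the sum of the three component scores,
# so with equal weights this composite is total_score / 3.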
final_data['composite_esg_score'] = final_data[['environment_score', 'social_score', 'governance_score']].mean(axis=1)
final_data['composite_esg_risk_score'] = final_data[['Environment Risk Score', 'Social Risk Score', 'Governance Risk Score']].mean(axis=1)

# Drop rows with NaN values resulting from lagged features and rolling calculations
final_data.dropna(inplace=True)

# Verify the newly created features
final_data.head(6)

# Save the final_data to a new CSV file for future analysis
final_data.to_csv('final_data_with_features.csv', index=False)

We begin by splitting the dataset into training and testing sets and scaling the features so that all variables are on a comparable scale.

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop columns that are not features or targets
features = final_data.drop(columns=['ticker', 'Name', 'Date', 'adj_close'])
target = final_data['adj_close']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
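
`train_test_split` shuffles rows by default, so neighboring trading days of the same ticker can land on both sides of the split. A hedged alternative for time-ordered data is to split on a date cutoff and fit the scaler on the training period only. The sketch below uses a random toy panel in place of `final_data`; the cutoff date and the single feature column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy panel standing in for final_data; values are random placeholders.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2019-01-01", "2022-12-30")
toy = pd.DataFrame({
    "Date": np.tile(dates, 2),
    "ticker": np.repeat(["XXX", "YYY"], len(dates)),
    "adj_close_lag_1": rng.normal(100, 10, 2 * len(dates)),
})
toy["adj_close"] = toy["adj_close_lag_1"] + rng.normal(0, 1, len(toy))

# Assumed cutoff: train on 2019-2021, test on 2022.
cutoff = pd.Timestamp("2022-01-01")
train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]

feature_cols = ["adj_close_lag_1"]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train[feature_cols])  # fit on the training period only
X_test_scaled = scaler.transform(test[feature_cols])
y_train, y_test = train["adj_close"], test["adj_close"]

print(train["Date"].max(), test["Date"].min())
```

If the `Date` column is timezone-aware, as in the output shown earlier, the cutoff would need a matching timezone, for example `pd.Timestamp("2022-01-01", tz="UTC")`.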

We will now experiment with various machine learning algorithms such as Linear Regression, Ridge Regression, Random Forest Regressor, and Gradient Boosting Regressor.

In [38]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [21]:
def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    mae = mean_absolute_error(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    return mae, mse, r2
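
For reference, the three metrics returned by `evaluate_model` have the standard definitions below. The symbols are ours: $y_i$ is the actual adjusted close for test row $i$, $\hat{y}_i$ is the prediction, $\bar{y}$ is the mean of the actual values over the test set, and $n$ is the number of test rows.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
```

Because the target is a raw price pooled across companies, $\bar{y}$ is a pooled mean, so part of the denominator of $R^2$ reflects differences in price level between companies rather than day-to-day movement. MAE and MSE are in dollars and squared dollars respectively, so higher-priced stocks contribute larger errors.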

In [26]:
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
lr_mae, lr_mse, lr_r2 = evaluate_model(lr_model, X_test_scaled, y_test)
print("Linear Regression: MAE =", lr_mae, "MSE =", lr_mse, "R² =", lr_r2)

Linear Regression: MAE = 1.6112009765048643 MSE = 15.524749840299213 R² = 0.9992564386095931


In [None]:
# Random Forest Regressor (reduced number of estimators to speed up the process for now)
rf_model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)
rf_model.fit(X_train_scaled, y_train)
rf_mae, rf_mse, rf_r2 = evaluate_model(rf_model, X_test_scaled, y_test)
print("Random Forest: MAE =", rf_mae, "MSE =", rf_mse, "R² =", rf_r2)

Random Forest: MAE = 1.65261506576951 MSE = 17.551030739604613 R² = 0.99915938942952


In [None]:
# Gradient Boosting Regressor (reduced number of estimators to speed up the process for now)
gb_model = GradientBoostingRegressor(n_estimators=10, random_state=42)
gb_model.fit(X_train_scaled, y_train)
gb_mae, gb_mse, gb_r2 = evaluate_model(gb_model, X_test_scaled, y_test)
print("Gradient Boosting: MAE =", gb_mae, "MSE =", gb_mse, "R² =", gb_r2)

Gradient Boosting: MAE = 27.90746393537541 MSE = 2942.4268201254336 R² = 0.8590718046957799


In [39]:
# Ridge Regression
ridge = Ridge()
ridge.fit(X_train_scaled, y_train)
ridge_mae, ridge_mse, ridge_r2 = evaluate_model(ridge, X_test_scaled, y_test)
print("Ridge Regression: MAE =", ridge_mae, "MSE =", ridge_mse, "R² =", ridge_r2)

Ridge Regression: MAE = 1.6113099253278742 MSE = 15.526318490735045 R² = 0.9992563634787273


For complex models like Random Forest and Gradient Boosting, we'll perform hyperparameter tuning using RandomizedSearchCV to find the best parameters.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Using a smaller subset of the data for hyperparameter tuning
subset_size = 1000  # Adjust this value as needed
X_train_subset = X_train_scaled[:subset_size]
y_train_subset = y_train[:subset_size]

# Random Forest hyperparameter tuning with reduced search space and subset of data
rf_params = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5]
}

rf_grid = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    rf_params,
    n_iter=10,  # Up to 10 sampled settings (this grid has only 8 combinations)
    cv=3,  # 3-fold cross-validation
    scoring='neg_mean_squared_error',
    n_jobs=-1,  # Utilize all available cores
    random_state=42
)
rf_grid.fit(X_train_subset, y_train_subset)
best_rf = rf_grid.best_estimator_

# Gradient Boosting hyperparameter tuning with reduced search space and subset of data
gb_params = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'min_samples_split': [2, 5]
}

gb_grid = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    gb_params,
    n_iter=10,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42
)
gb_grid.fit(X_train_subset, y_train_subset)
best_gb = gb_grid.best_estimator_

# Evaluate the tuned models (refit on the 1,000-row subset only) on the full test set
rf_mae, rf_mse, rf_r2 = evaluate_model(best_rf, X_test_scaled, y_test)
gb_mae, gb_mse, gb_r2 = evaluate_model(best_gb, X_test_scaled, y_test)

print("Random Forest: MAE =", rf_mae, "MSE =", rf_mse, "R² =", rf_r2)
print("Gradient Boosting: MAE =", gb_mae, "MSE =", gb_mse, "R² =", gb_r2)

Random Forest: MAE = 3.1016929920595016 MSE = 492.8126877290189 R² = 0.9763966253197379
Gradient Boosting: MAE = 3.368309710448231 MSE = 438.62721170952017 R² = 0.978991850898467


Finally, we combine all the evaluation metrics for the different models into a table for better analysis and to inform future direction.

In [41]:
# Compile the results into a DataFrame for comparison
results = pd.DataFrame({
    'Model': ['Random Forest', 'Gradient Boosting', 'Linear Regression', 'Ridge Regression'],
    'MAE': [rf_mae, gb_mae, lr_mae, ridge_mae],
    'MSE': [rf_mse, gb_mse, lr_mse, ridge_mse],
    'R²': [rf_r2, gb_r2, lr_r2, ridge_r2]
})

# Display the comparison
results

|   | Model | MAE | MSE | R² |
|---|---|---|---|---|
| 0 | Random Forest | 3.101693 | 492.812688 | 0.976397 |
| 1 | Gradient Boosting | 3.36831 | 438.627212 | 0.978992 |
| 2 | Linear Regression | 1.611201 | 15.52475 | 0.999256 |
| 3 | Ridge Regression | 1.61131 | 15.526318 | 0.999256 |


**Analysis** 
- Random Forest:
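The Random Forest model has an MAE of about 3.10, which is lower than Gradient Boosting's, but it also has the highest MSE of the four models (492.81). An MSE this large relative to the MAE points to a small number of large errors. The R² score of 0.976397 suggests that the model explains approximately 97.64% of the variance in the target variable.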

- Gradient Boosting:
The Gradient Boosting model has slightly higher MAE compared to Random Forest but a lower MSE. The R² score of 0.978992 indicates that this model explains approximately 97.90% of the variance in the target variable, making it slightly better than Random Forest in terms of variance explanation.

- Linear Regression:
Linear Regression has the lowest MAE and MSE among all models, indicating that it performs extremely well on this dataset. The R² score of 0.999256 suggests that the model explains 99.93% of the variance in the target variable, which is exceptionally high.

- Ridge Regression:
Ridge Regression shows similar performance to Linear Regression with very close MAE, MSE, and identical R² scores. This suggests that adding regularization in Ridge Regression does not significantly change the performance compared to Linear Regression on this dataset.
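
To make the MAE and MSE columns easier to compare, the sketch below converts each reported MSE to RMSE (which is in the same units as MAE) and computes the RMSE/MAE ratio. The input numbers are copied from the table above; the helper name `rmse_and_ratio` is our own. As a rough reference point, normally distributed errors give a ratio of about 1.25, so much larger ratios hint that a few large errors dominate the MSE.

```python
import math

# (MAE, MSE) pairs copied from the comparison table above
reported = {
    "Random Forest": (3.101693, 492.812688),
    "Gradient Boosting": (3.368310, 438.627212),
    "Linear Regression": (1.611201, 15.524750),
    "Ridge Regression": (1.611310, 15.526318),
}


def rmse_and_ratio(mae, mse):
    """Return the RMSE and the RMSE/MAE ratio for one model."""
    rmse = math.sqrt(mse)
    return rmse, rmse / mae


summary = {name: rmse_and_ratio(mae, mse) for name, (mae, mse) in reported.items()}

for name, (rmse, ratio) in summary.items():
    print(f"{name}: RMSE = {rmse:.2f}, RMSE/MAE = {ratio:.2f}")
```

The ratio is noticeably higher for the two tree-based models than for the linear models, which is consistent with the tree-based models making occasional large misses. Inspecting the residuals directly would be needed to confirm this.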

**Conclusion**

- Linear Regression and Ridge Regression: These models perform the best, with the lowest MAE and MSE and R² scores above 0.999. This indicates that a simple linear relationship explains the data very well. Ridge Regression's near-identical performance suggests that regularization adds little here, which is consistent with multicollinearity not significantly hurting the linear fit.
- Gradient Boosting: While it performs well, it does not outperform the simpler Linear and Ridge Regression models. This might be due to the dataset being well-suited for linear models.
- Random Forest: This model also performs well but is outperformed by the linear models on every metric and by Gradient Boosting on MSE and R². It does achieve a lower MAE than Gradient Boosting.
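
The multicollinearity point can also be checked directly rather than inferred from Ridge's performance. A minimal sketch, assuming a plain NumPy feature matrix, computes a variance inflation factor (VIF) for each feature by regressing it on the others. The data here is synthetic and the function `variance_inflation_factors` is our own helper, not part of the notebook; a common rule of thumb treats VIF values above roughly 5 to 10 as a sign of strong collinearity.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic stand-in features (NOT the project's data)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)  # nearly a copy of `a`, so highly collinear
X_demo = np.column_stack([a, b, c])

vifs = variance_inflation_factors(X_demo)
for name, vif in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {vif:.1f}")
```

Applied to the scaled training features, the same function would show whether any financial or ESG columns are close to redundant.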

# Ethics & Privacy

Our project aims to predict stock performance using ESG and financial data, but we recognize ethical considerations surrounding bias, privacy, and the equitable representation of companies. One key ethical concern is the assumption that high ESG scores universally translate to better financial performance. This assumption can inherently favor larger companies that have the resources to meet high ESG standards, while smaller firms may lack the means to fully report or attain these scores. Therefore, we will diversify our dataset by including companies of varying sizes and across different industries to reduce this potential bias, and we will transparently communicate this limitation in our results.

We will use publicly available data from reputable sources such as Yahoo Finance and MSCI while respecting their terms of use. We will also account for reporting bias, since companies are likely to be selective in the ESG data that they choose to report. This is particularly relevant for companies with robust resources, which may skew the dataset toward well-established firms. Our analysis will include balancing techniques where possible and clear communication about potentially skewed findings.

Bias Detection and Mitigation:
- Before Analysis: We will assess data diversity by evaluating the range of company sizes, industries, and geographical locations to ensure our dataset reflects a balanced representation.
- During Analysis: We will conduct feature analysis to evaluate the weights assigned to financial versus ESG factors, ensuring both are equitably represented.
- Post-Analysis: When reporting results, we will explicitly address any biases or limitations observed, including transparency on whether ESG factors have influenced predictions in unintended ways.
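
The "Before Analysis" check above could be sketched roughly as follows. This is only an illustration: the tickers, sectors, market caps, column names, size cut-offs, and the 50% threshold are all made-up assumptions, not values from our dataset.

```python
import pandas as pd

# Toy rows standing in for the merged dataset; every value here is a
# placeholder, not real project data.
companies = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "sector": ["Tech", "Tech", "Energy", "Energy", "Health", "Tech"],
    "market_cap_usd_b": [850.0, 1.2, 45.0, 0.8, 12.0, 300.0],
})

# Bucket companies by market cap in billions of USD (cut-offs are an assumption)
bins = [0, 2, 10, float("inf")]
labels = ["small", "mid", "large"]
companies["size_bucket"] = pd.cut(
    companies["market_cap_usd_b"], bins=bins, labels=labels
)

# Share of the sample in each size bucket and each sector
size_share = companies["size_bucket"].value_counts(normalize=True)
sector_share = companies["sector"].value_counts(normalize=True)

# Flag any group that makes up more than half of the sample
overrepresented_sizes = size_share[size_share > 0.5].index.tolist()
overrepresented_sectors = sector_share[sector_share > 0.5].index.tolist()

print(size_share)
print(sector_share)
print("Overrepresented sizes:", overrepresented_sizes)
print("Overrepresented sectors:", overrepresented_sectors)
```

Running a summary like this before modeling would show whether the sample leans toward large, well-established firms, which is the bias described above.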

We recognize that ESG scores are not absolute indicators of financial success, and we will communicate that our model’s predictions are limited by the current availability and quality of ESG data. We aim to provide responsible insights by acknowledging the potential limitations of ESG scores for financial forecasting, and we will structure our analysis and reporting to transparently communicate any ethical concerns.

# Team Expectations 


Communication:
* Use Discord as our primary communication channel. Team members should respond to messages within 24 hours.
* Meet weekly virtually to discuss project progress, next steps, and obstacles.
* Communicate openly and respectfully. When providing feedback, suggestions should aim to be constructive.
* It is unacceptable to not contribute to group assignments and be unresponsive on the group chat consistently. If any team member is not pulling their weight and contributing equally, they will be informed by the rest of the team to begin doing so.

Task Delegation:
* Use GitHub to manage and view task assignments and progress. Push to the repo consistently.
* Auritro: Responsible for gathering datasets from sources such as Yahoo Finance, MSCI, and Kaggle. Will identify the required financial and ESG data, merge datasets, and organize the data for analysis. Will ensure that deadlines are being met, coordinate communication, and use GitHub to track the team's progress.
* Jacquelyn: Will assist with data cleaning, including handling missing values and outlier analysis, and ensure that the data is properly prepared for analysis. Will draft the report sections on the interpretation of results, such as the impact of ESG factors on the predictions, and summarize findings.
* Prabhmeet: Will conduct the Exploratory Data Analysis (EDA), creating visuals that explore relationships between the variables (such as financial indicators and ESG scores) and summarizing key insights. Will ensure that the dataset is balanced. Will write about ethical considerations, including biases and mitigation strategies in ESG data. Will assist in preparing the ethics and privacy sections.
* Ethan: Will develop the Machine Learning models for stock price prediction, starting with basic models and later transitioning to LSTMs or other neural networks. Will ensure proper model evaluation through metrics like RMSE and R-squared. Will prepare presentation slides summarizing results, focusing on the significance of incorporating ESG data in financial predictions.
* Aniruddh: Will test the models' performance and perform hyperparameter tuning. Will check for overfitting and ensure that ESG scores are being properly integrated into the model. Will assist with the documentation on GitHub, including notebooks, final code, and an effective README file for project reproduction.

Decision Making Process:
* Use a majority vote for general decisions.
* Document decisions in our Discord server.

Supporting Teammates in Difficulty:
* Anyone who is facing challenges should notify the rest of the group as soon as possible so we can work together to either reallocate tasks or provide support.
* Communicate issues privately first. If we cannot come to a general agreement, we can escalate to the professor.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/24  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 10/28  |  10 AM |  Do background research on topic | Review background information; identify ideal datasets; discuss ethical considerations and finalize project proposal draft. | 
| 10/30  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Review available datasets; Start preprocessing data; Assign group members to lead each specific part of the project.   |
| 11/8  | 6 PM  | Import & Wrangle Data; EDA | Review EDA findings; finalize data cleaning; develop an analysis plan and identify key features of modeling.  |
| 11/16  | 12 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis and the progress; address any challenges encountered in the project; Complete project check-in; finalize analysis plan for the Machine Learning model. |
| 11/24  | 12 PM  | Complete analysis; Draft results/conclusion/discussion | Review analysis and results; discuss the final interpretation of results; begin writing the conclusion and ethical considerations. |
| 12/1  | Before 11:59 PM  | Final edits | Turn in Final Project & Group Project Surveys |