# ESG Stock Selection

This project builds simple classification models to predict whether a U.S. stock will be added to an ESG portfolio. It is a binary classification problem: each stock has a target variable equal to one if the stock is added to the ESG portfolio (its overall ESG score is greater than or equal to five) and zero otherwise.

This project is a simulation to help students understand the machine learning process widely used in the financial industry to help select assets. The dataset contains 746 U.S. stocks that were preprocessed and merged from two sources:
1. https://www.kaggle.com/datasets/finintelligence/nasdaq-financial-fundamentals 
2. https://www.kaggle.com/datasets/debashish311601/esg-scores-and-ratings?resource=download


The dataset contains outdated fundamental data and has not been entirely verified. Hence, using this dataset for personal academic assignments is not recommended. The information is not intended as financial advice and shall not be understood or construed as financial advice.

The process is inspired by the paper 'Heterogeneous Ensemble for ESG Ratings Prediction' by Krappel, Bogun and Borth (2021). They collected fundamental data and built ML models to predict ESG scores. https://arxiv.org/abs/2109.10085

This Jupyter notebook will outline the following processes: 
    
1. Import the data
2. Data Analysis
3. Basic Data Transformation
4. Prepare Data for Machine Learning Model
5. Basic Machine Learning Models and Evaluate Performance
6. Key takeaways

## 1. Import the data

In [None]:
# import the libraries that we will use for this project
# Pandas is a popular Python library used in data analysis and manipulation https://pandas.pydata.org/
# NumPy is another library for working with arrays and matrices https://numpy.org/
# You will see more use cases of Pandas and NumPy in the next semester.
 
import pandas as pd
import numpy as np

In [None]:
# declare a variable called FILE_NAME in capital letters

# By convention, we use capital letters to separate constants from other variables
# and to let the reader know that we do not want to reassign the value of this variable.
# However, this convention does not actually prevent reassignment.

FILE_NAME = 'US_Stock_ESG_and_Fundamental_seminar.csv'

In [None]:
# Read the csv file to create a DataFrame
# A DataFrame is a two-dimensional data structure that has columns and rows.

df = pd.read_csv(<>, index_col=0) #use the first column as the index
df.<> #show the first 5 rows of the DataFrame

In [None]:
# show the size of dataset (number of rows, number of columns)
df.shape

## 2. Data Analysis

In [None]:
# Giving a summary of the dataframe
# including the index dtype and columns, non-null values and memory usage.
df.info()

There are many built-in data types available in Python. You can find an overview on this site: https://www.w3schools.com/python/python_datatypes.asp

In [None]:
# generate descriptive statistics for numeric series
df.describe()

In [None]:
# generate descriptive statistics for object (i.e., non-numeric) series
df.describe(exclude=[np.number])  

In [None]:
# List the columns in DataFrame
df.columns

## 3. Basic Data Transformation

If you check the result from df.info(), some columns have object as the data type instead of a numerical type such as float or integer (int).

In [None]:
# Let's check the first row of column 'Assets'
# we can see that type(var) gives the string data type

print(df.Assets[0], <>(df.Assets[0]))

We want these values to be numerical data, i.e., int.

In [None]:
# Create a function that help us to convert values in the column

def convert_to_numerical(row):
    if isinstance(row, float): #if the data is None or NaN i.e., missing data
        return 0
    else:
        return int(row.replace(',','')) #replace comma symbol and convert string value to integer

In [None]:
# Using lambda to apply the function across all rows
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

df['Assets'] = df['Assets'].apply(lambda x: convert_to_numerical(x))
df['Cash and Cash Equivalents, at Carrying Value'] = df['Cash and Cash Equivalents, at Carrying Value'].apply(lambda x: convert_to_numerical(x))
df['Final Revenue'] = df['Final Revenue'].apply(lambda x: convert_to_numerical(x))
df['Gross Profit'] = df['Gross Profit'].apply(lambda x: convert_to_numerical(x))
df['Income from Continuing Operations before Taxes'] = df['Income from Continuing Operations before Taxes'].apply(lambda x: convert_to_numerical(x))
df['Operating Income (Loss)'] = df['Operating Income (Loss)'].apply(lambda x: convert_to_numerical(x))
df['Total Equity'] = df['Total Equity'].apply(lambda x: convert_to_numerical(x))
df['Total Liabilities and Equity'] = df['Total Liabilities and Equity'].apply(lambda x: convert_to_numerical(x))
df['Net Income (Loss)'] = df['Net Income (Loss)'].apply(lambda x: convert_to_numerical(x))
df['Cash and Cash Equivalents, Period Increase (Decrease)'] = df['Cash and Cash Equivalents, Period Increase (Decrease)'].apply(lambda x: convert_to_numerical(x))
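
The same conversion can also be written with pandas' vectorised string methods instead of a row-by-row function. The snippet below is a minimal sketch on a small made-up DataFrame (the values are illustrative and not taken from the seminar dataset). Like `convert_to_numerical`, it assumes that treating missing values as 0 is acceptable.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking comma-formatted numbers stored as strings
toy = pd.DataFrame({
    'Assets': ['1,200', '3,450,000', np.nan],
    'Gross Profit': ['-15', '2,000', '7'],
})

for col in ['Assets', 'Gross Profit']:
    toy[col] = (
        toy[col]
        .str.replace(',', '', regex=False)  # remove thousands separators
        .pipe(pd.to_numeric)                # convert strings to numbers, NaN stays NaN
        .fillna(0)                          # same choice as convert_to_numerical: missing -> 0
        .astype('int64')
    )

print(toy.dtypes)
print(toy)
```

Looping over a list of column names also avoids repeating one assignment line per column.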

In [None]:
# run .describe() to check the data types again
df.<>

In [None]:
df.describe(exclude=[np.number])  

Machine learning models can only take numerical input, so we cannot directly use the text values of Country, Sector and Subsector in the ML model.

In [None]:
# Example values of Country, Sector and Subsector column
# these are strings
print(df.Country[0], type(df.Country[0]))
print(df.Sector[0], type(df.Sector[0]))
print(df.Subsector[0], type(df.Subsector[0]))

We can use the technique called Label Encoding to convert the labels into a numeric form that the machine can read. https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python

Label encoding is available in scikit-learn, an ML library for Python.
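
To make the idea concrete, here is a minimal sketch of label encoding on a toy Series. The sector names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy sector column (hypothetical values)
sectors = pd.Series(['Technology', 'Energy', 'Technology', 'Health Care'])

encoder = LabelEncoder()
codes = encoder.fit_transform(sectors)

# classes_ holds the sorted unique labels; the position of each label is its code
print(encoder.classes_.tolist())   # ['Energy', 'Health Care', 'Technology']
print(codes.tolist())              # [2, 0, 2, 1]

# inverse_transform maps the codes back to the original labels
print(encoder.inverse_transform(codes).tolist())
```

Note that the integer codes follow alphabetical order, so they imply an ordering between categories that has no real meaning.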

In [None]:
# conda install -c intel scikit-learn
from sklearn import preprocessing

In [None]:
# Encode target labels with value between 0 and n_classes-1. 
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html 
le = preprocessing.LabelEncoder()

# create new columns from the result of LabelEncoder
df['country_label'] = le.fit_transform(df['Country'])
df['sector_label'] = le.fit_transform(df['Sector'])
df['subsector_label'] = le.fit_transform(df['Subsector'])

In [None]:
df[['country_label', 'sector_label', 'subsector_label']].describe()

Tip: Another way to handle categorical data is one-hot encoding. https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/?ref=lbp
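
As a rough illustration of one-hot encoding, the sketch below uses `pd.get_dummies` on a toy column (hypothetical values, not the seminar data). Each category becomes its own 0/1 column, so no artificial ordering is introduced.

```python
import pandas as pd

# Toy sector column (hypothetical values)
toy = pd.DataFrame({'Sector': ['Technology', 'Energy', 'Technology', 'Health Care']})

# One binary column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(toy['Sector'], prefix='sector', dtype=int)
print(one_hot)
```

The trade-off is that a column with many categories, such as Subsector, produces many new columns.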

#### Question: Should we keep 'country_label' column?

Answer: No. All stocks are in the US and have the same 'US' value in the Country column, so this column carries no information for training the ML model.

In [None]:
# Return a Series containing counts of unique values.
# https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html 

df['Country'].value_counts()
# df.Country.value_counts()
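
The same check can be applied to every column at once. The following is a small sketch on a toy DataFrame (hypothetical values) that finds columns holding a single distinct value, which is the situation described for 'Country'.

```python
import pandas as pd

# Toy data (hypothetical values): 'Country' is constant, 'Sector' is not
toy = pd.DataFrame({
    'Country': ['US', 'US', 'US'],
    'Sector': ['Technology', 'Energy', 'Technology'],
})

# nunique() counts distinct values; a count of 1 means the column is constant
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_columns)

# Dropping constant columns removes features that cannot help the model
toy_reduced = toy.drop(columns=constant_columns)
print(toy_reduced.columns.tolist())
```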

In [None]:
# drop unused columns - Ticker, Company Name, Country, country_label, Sector, Subsector
dropped_columns = ['Ticker', 'Company Name', 'Country', 'country_label', 'Sector', 'Subsector']

df.drop(dropped_columns, axis=1, inplace=True)
df.shape

## 4. Prepare Data for Machine Learning Model

#### Target variable vs Predictor variable

- Target variable is the variable whose value is predicted by the model.
- Predictor variable is the variable used to predict the target variable.

In this project, the target variable is the 'Target' column in the DataFrame, which contains a binary value of zero or one. A value of one means that the stock has an overall ESG score greater than or equal to the median, so the asset is added to the ESG portfolio.

In [None]:
# use .value_counts() to see how many stocks are classified into zero and one value in Target column
df.Target.<>

#### Split the data

The next step is to split the dataset into training and testing datasets.
- The training dataset is used to train and fit the ML model.
- The testing dataset is used to evaluate the performance of the ML model.
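
The sketch below shows how such a split behaves on synthetic data (not the ESG dataset). The 75:25 ratio, the seed and the use of `stratify` are illustrative choices for this example rather than the settings of the seminar.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 made-up features and a balanced binary target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 2)), columns=['f1', 'f2'])
y_toy = pd.Series([0, 1] * 50, name='Target')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,     # 25% of the rows go to the testing dataset
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y_toy,     # keep the 0/1 proportion similar in both datasets
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

Without `random_state`, each run produces a different split, so accuracy figures will vary from run to run.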

In [None]:
from sklearn.model_selection import train_test_split

# we don't need the 'Target' column to be included in the predictor variables
X = df.drop('Target', axis=1)
y = df.Target

# train:test ratio is 80:20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<>)

In [None]:
print('The size of the training dataset: ', X_train.<>)
print('The size of the testing dataset: ', X_test.<>)

In [None]:
print('Target count of training dataset: ', y_train.value_counts().to_dict())
print('Target count of testing dataset: ', y_test.<>)

## 5. Basic Machine Learning Models

In this project, we will use three basic machine learning models.

1. Logistic Regression
2. Decision Tree
3. Random Forest

These are supervised learning models: they learn from a labelled training dataset and then predict the target value for new instances. They are commonly used for classification problems. Some models, such as decision tree and random forest, can also be used to predict a continuous value (i.e., a regression problem), such as a house price.

You will cover more details in the next semester.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import tree #decision tree
from sklearn.ensemble import RandomForestClassifier

There are many metrics that we can use to measure the performance of our prediction model. We will use accuracy, the fraction of predictions that our model got right.

In [None]:
from sklearn.metrics import accuracy_score

def print_score(y_true, y_pred):
    print('Accuracy: ', accuracy_score(y_true, y_pred))
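
As a quick sanity check of what accuracy means, the following sketch computes it by hand on made-up labels and compares the result with `accuracy_score`. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions: 3 of the 5 predictions are correct
toy_true = np.array([1, 0, 1, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 1])

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = (toy_true == toy_pred).mean()
library_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy, library_accuracy)
```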

### Logistic Regression

Logistic regression passes a linear function $f(\mathbf{x})$ through the sigmoid function: $p(\mathbf{x}) = \dfrac{1}{1 + \exp(-f(\mathbf{x}))}$, where $f(\mathbf{x})$ is a linear regression. The result estimates the probability that an instance belongs to a particular class. If the probability is greater than 50%, the model predicts that the instance belongs to that class.
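
The calculation can be traced by hand with NumPy. In this minimal sketch the weights, intercept and feature values are hypothetical numbers chosen for illustration. They are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept for two features (not fitted to any data)
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

f_x = w @ x + b                    # linear part f(x) = w.x + b
p_x = sigmoid(f_x)                 # estimated probability of class 1
predicted_class = int(p_x > 0.5)   # apply the 50% threshold

print(f_x, p_x, predicted_class)
```

A linear score of zero maps to a probability of exactly 0.5, so the sign of f(x) decides the predicted class.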

In [None]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred = clf.predict(X_test)
print_score(y_test, y_pred)

### Decision Tree

A decision tree is a model that recursively divides the instances into smaller groups using decision nodes and leaves. The model uses an impurity function (a loss function) to choose how to split each decision node. The process continues splitting until the impurity is minimised. You can find the mathematical formulas on this website: https://scikit-learn.org/stable/modules/tree.html
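
To illustrate what an impurity function measures, here is a small sketch of Gini impurity, the default splitting criterion of scikit-learn's `DecisionTreeClassifier`. The helper `gini_impurity` and the toy labels are written for this example only.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A parent node with three instances of each class
parent = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the two classes perfectly
left, right = parent[:3], parent[3:]

print(gini_impurity(parent))                      # mixed node: high impurity
print(gini_impurity(left), gini_impurity(right))  # pure nodes: zero impurity
```

The tree prefers splits whose child nodes have lower impurity than the parent node.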

In [None]:
clf = tree.DecisionTreeClassifier().fit(<>)
y_pred = clf.predict(<>)

print_score(<>)

### Random Forest

A random forest is an ensemble that fits a number of decision tree classifiers on various sub-samples of the dataset, trained via the bagging method. It uses averaging to improve predictive accuracy and control over-fitting. You can read more in this article, which has a clear example and visualisation: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
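
The bagging idea can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally. For instance, scikit-learn averages the trees' predicted probabilities, whereas this sketch counts hard votes. The number of trees and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the ESG dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features='sqrt' considers a random subset of features at each split
    t = DecisionTreeClassifier(max_features='sqrt',
                               random_state=int(rng.integers(1_000_000)))
    trees.append(t.fit(X_demo[idx], y_demo[idx]))

# Each tree votes; the ensemble predicts the majority class
votes = np.stack([t.predict(X_demo) for t in trees])     # shape (25, 300)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

print('Training accuracy of the voted ensemble:', (ensemble_pred == y_demo).mean())
```

Because each tree sees a different bootstrap sample, their individual errors differ, and combining them tends to reduce over-fitting.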

In [None]:
clf = RandomForestClassifier().fit(<>)
y_pred = clf.predict(<>)

print_score(<>)

#### Question: Given the accuracy, which model is the best one?

Answer:

## 6. Key takeaways

In this session, you have learned the following:

- How the industry combines different data sources to select stocks.
- The nature of the structured dataset that could be processed in the form of DataFrame.
- Basic data processing using Python programming language.
- Basic machine learning models using Python and the scikit-learn library.
- Basic machine learning performance evaluation using accuracy as the main metric.

There is future work that you are encouraged to investigate further if you are interested in this topic:

- How could you explain the source of E, S, G and overall ESG score? There is a lack of transparency in ESG ratings, and we do not know how the agency generates these ratings.
- Further transformation of data. You can create a new column of financial ratios using the fundamental data.
- The models have not been validated and do not have optimal hyperparameters. You could hold out part of the data as a validation dataset and use it to tune the hyperparameters.
- You can use other performance metrics, such as F1 score, recall and precision.
- There are other classification models such as Naive Bayes, support vector machine, and advanced deep learning models.
- You could find more data to improve the prediction result, such as sentiment from news and other ESG rating sources. 
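
As a starting point for the additional metrics mentioned in the list above, the sketch below computes precision, recall and F1 score on made-up labels (not model output from this notebook).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: 3 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(toy_true, toy_pred)  # TP / (TP + FP)
recall = recall_score(toy_true, toy_pred)        # TP / (TP + FN)
f1 = f1_score(toy_true, toy_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)
```

These metrics are more informative than accuracy when one class is much rarer than the other.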