# ESG Stock Selection

This project builds simple classification models to predict whether a U.S. stock will be added to an ESG portfolio. It is a binary classification problem: each stock has a target variable that equals one if the stock is added to the ESG portfolio (overall ESG score greater than or equal to five) and zero otherwise.
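The target definition can be sketched as a simple threshold. The `overall_esg_score` column and its values are hypothetical: the dataset used in this notebook ships only the final `Target` column and does not state how the overall score is computed.

```python
import pandas as pd

# Hypothetical overall ESG scores (illustrative values, not from the dataset)
scores = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC", "DDD"],
    "overall_esg_score": [4.9, 5.0, 7.2, 3.1],
})

# Target is 1 when the overall score is at least 5, otherwise 0
scores["Target"] = (scores["overall_esg_score"] >= 5).astype(int)
print(scores)
```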

This project is a simulation to help students understand the machine learning process widely used in the financial industry to help select assets. The dataset contains 746 U.S. stocks that were preprocessed and merged from two sources:
1. https://www.kaggle.com/datasets/finintelligence/nasdaq-financial-fundamentals 
2. https://www.kaggle.com/datasets/debashish311601/esg-scores-and-ratings?resource=download


The dataset contains outdated fundamental data and has not been entirely verified. Hence, using this dataset for personal academic assignments is not recommended. The information is not intended as financial advice and shall not be understood or construed as financial advice.

This Jupyter notebook outlines the following steps:
    
1. Import the data
2. Data Analysis
3. Basic Data Transformation
4. Prepare Data for Machine Learning Model
5. Basic Machine Learning Models
6. Evaluate Performance

## 1. Import the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
FILE_NAME = 'US_Stock_ESG_and_Fundamental_seminar.csv'

In [3]:
# Read the CSV file into a DataFrame.
# A DataFrame is a two-dimensional data structure with rows and columns.

df = pd.read_csv(FILE_NAME, index_col=0) #use the first column as the index
df.head(5) 

Unnamed: 0,Ticker,Company Name,Country,Sector,Subsector,Environmental SCORE,Social SCORE,Governance SCORE,Assets,"Cash and Cash Equivalents, at Carrying Value",Final Revenue,Gross Profit,Income from Continuing Operations before Taxes,Operating Income (Loss),Total Equity,Total Liabilities and Equity,Net Income (Loss),"Cash and Cash Equivalents, Period Increase (Decrease)",Target
0,FLWS,"1-800-FLOWERSCOM, INC",US,Retail - Consumer Discretionary,Internet & Direct Marketing Retail,1.2,6.7,3.8,536570000,61696000,234207000.0,96721000.0,-14620000.0,-13236000.0,249186000.0,536570000,,,1
1,SRCE,1ST SOURCE CORPORATION,US,Banks,Regional Banks,0.0,2.6,4.7,5245610000,85227000,,,21236000.0,,649973000.0,5245610000,13818000.0,,0
2,TWOU,"2U, INC",US,Software & Services,Education Services,4.7,4.7,6.3,236718000,186710000,47444000.0,,,-3446000.0,196623000.0,236718000,-3380000.0,2981000.0,1
3,AAON,"AAON, INC",US,Building Products,Building Products,5.0,7.0,5.6,236669000,17248000,85422000.0,25731000.0,25731000.0,16826000.0,,236669000,11551000.0,9340000.0,1
4,ABMD,"ABIOMED, INC",US,Health Care Equipment & Supplies,Health Care Equipment,6.5,7.6,4.9,423931000,48231000,93957000.0,,,19813000.0,368775000.0,423931000,10998000.0,,1


In [4]:
# show the size of dataset (number of rows, number of columns)
df.shape

(746, 19)

## 2. Data Analysis

In [5]:
df.info()

TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

In [6]:
df.describe()

Unnamed: 0,Environmental SCORE,Social SCORE,Governance SCORE,Target
count,746.0,746.0,746.0,746.0
mean,4.486729,4.417158,5.175201,0.474531
std,2.437189,1.325557,1.022506,0.499686
min,0.0,0.4,1.1,0.0
25%,2.7,3.5,4.6,0.0
50%,4.5,4.2,5.3,0.0
75%,6.3,5.3,5.9,1.0
max,10.0,9.6,7.6,1.0


In [7]:
df.describe(exclude=[np.number])  

Unnamed: 0,Ticker,Company Name,Country,Sector,Subsector,Assets,"Cash and Cash Equivalents, at Carrying Value",Final Revenue,Gross Profit,Income from Continuing Operations before Taxes,Operating Income (Loss),Total Equity,Total Liabilities and Equity,Net Income (Loss),"Cash and Cash Equivalents, Period Increase (Decrease)"
count,746,746,746,746,746,707,690,622,371,504,618,674,706,673,533
unique,717,746,1,61,107,681,658,584,358,489,587,653,680,639,506
top,CHTR,"PAYPAL HOLDINGS, INC",US,Banks,Biotechnology,40524000000,1278000000,0,447091000,194910000,302000000,14756000000,40524000000,-188000000,1273000000
freq,4,1,746,92,87,4,4,13,2,2,4,2,4,4,4


In [8]:
df.columns

Index(['Ticker', 'Company Name', 'Country', 'Sector', 'Subsector',
       'Environmental SCORE', 'Social SCORE', 'Governance SCORE', 'Assets',
       'Cash and Cash Equivalents, at Carrying Value', 'Final Revenue',
       'Gross Profit', 'Income from Continuing Operations before Taxes',
       'Operating Income (Loss)', 'Total Equity',
       'Total Liabilities and Equity', 'Net Income (Loss)',
       'Cash and Cash Equivalents, Period Increase (Decrease)', 'Target'],
      dtype='object')

## 3. Basic Data Transformation

In [None]:
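def convert_to_numerical(value):
    """Convert a comma-formatted number string to int; missing values become 0."""
    # pandas stores missing entries as float NaN, so any float is treated as missing
    if isinstance(value, float):
        return 0
    # remove thousands separators before parsing, e.g. '1,000' -> 1000
    return int(value.replace(',', ''))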

In [None]:
# columns stored as text that need to be converted to numbers
numeric_columns = [
    'Assets',
    'Cash and Cash Equivalents, at Carrying Value',
    'Final Revenue',
    'Gross Profit',
    'Income from Continuing Operations before Taxes',
    'Operating Income (Loss)',
    'Total Equity',
    'Total Liabilities and Equity',
    'Net Income (Loss)',
    'Cash and Cash Equivalents, Period Increase (Decrease)',
]

for column in numeric_columns:
    df[column] = df[column].apply(convert_to_numerical)
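As an alternative sketch, the same conversion can be done column-wise with pandas string methods. The sample values below are made up, and filling missing values with 0 simply mirrors `convert_to_numerical`; whether 0 is a sensible stand-in for missing fundamentals is a modeling assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample resembling a comma-formatted numeric column with a missing value
sample = pd.Series(["536,570,000", np.nan, "-14,620,000"])

# Strip thousands separators, parse as numbers, and map missing entries to 0
converted = (
    pd.to_numeric(sample.str.replace(",", "", regex=False), errors="coerce")
    .fillna(0)
    .astype("int64")
)
print(converted.tolist())
```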

In [None]:
df.describe()

In [None]:
df.describe(exclude=[np.number])  

In [None]:
from sklearn import preprocessing

# encode each categorical text column as integer labels
le = preprocessing.LabelEncoder()

df['country_label'] = le.fit_transform(df['Country'])
df['sector_label'] = le.fit_transform(df['Sector'])
df['subsector_label'] = le.fit_transform(df['Subsector'])

In [None]:
df.describe()

# Should we keep 'country_label' column?
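One way to approach this question is to count the distinct values in each column: a column with a single value cannot help a model separate the classes. The toy frame below is illustrative; in the notebook's data, `Country` has one unique value (`US`), so its encoded label is constant as well.

```python
import pandas as pd

# Toy frame: 'country_label' is constant, 'sector_label' varies
toy = pd.DataFrame({
    "country_label": [0, 0, 0, 0],
    "sector_label": [3, 1, 3, 2],
})

# Columns with only one distinct value carry no information for a classifier
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_columns)
```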




In [None]:
# drop columns not used as features: identifiers, raw text categories, and the constant country label
dropped_columns = ['Ticker', 'Company Name', 'Country', 'country_label', 'Sector', 'Subsector']

df.drop(dropped_columns, axis=1, inplace=True)
df.shape

## 4. Prepare Data for Machine Learning Model

In [None]:
Why do we need to split the dataset?
What is the training dataset?
What is the testing dataset?
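A small sketch may help with these questions: the model is fitted on the training set only, and the held-out testing set estimates how it performs on data it has not seen. The toy data and the `stratify` option are illustrative assumptions, not part of the notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, balanced binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% for testing; stratify keeps the class ratio similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```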

In [None]:
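from sklearn.model_selection import train_test_split

# features are every column except the target
X = df.drop('Target', axis=1)
y = df['Target']

# hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)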

## 5. Basic Machine Learning Models

In [None]:
What is accuracy?
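Accuracy is the fraction of predictions that match the true labels. A minimal sketch with made-up labels, checking the manual calculation against scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1])

# accuracy = number of correct predictions / total number of predictions
manual_accuracy = (y_true_demo == y_pred_demo).mean()
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, sklearn_accuracy)
```

Three of the five predictions match here, so both values are 0.6.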

In [None]:
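from sklearn.metrics import accuracy_score

def print_score(y_true, y_pred):
    # accuracy: share of predictions that match the true labels
    print('Accuracy: ', accuracy_score(y_true, y_pred))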

In [None]:
from sklearn.linear_model import LogisticRegression

# Logistic Regression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred = clf.predict(X_test)
print_score(y_test, y_pred)

In [None]:
from sklearn import tree

# decision tree
clf = tree.DecisionTreeClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print_score(y_test, y_pred)

In [None]:
from sklearn.ensemble import RandomForestClassifier

# random forest
clf = RandomForestClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print_score(y_test, y_pred)

In [None]:
Which model is the best?
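One hedged way to answer this is to score every model on the same held-out split and compare the results. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the notebook's features, and the `max_iter` and `random_state` settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the ESG dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the same training data and score it on the same test data
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

best_model = max(scores, key=scores.get)
print(scores, best_model)
```

A single split can be noisy on a dataset of this size, so small differences between models should not be over-interpreted.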

## 6. Evaluate Performance

In [None]:
There are other performance metrics, such as recall and F1 score.
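As a minimal sketch with made-up labels, precision, recall, and F1 can be computed with scikit-learn as follows. Recall is the share of actual positives that the model finds, precision is the share of predicted positives that are correct, and F1 is their harmonic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of the two
print(precision, recall, f1)
```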

In [None]:
Additional Activity

- Create a new column: e.g., calculate a financial ratio and store it in a new column
- Validate the model (e.g., with a train/validation/test split) and find the best parameters
- Use different metrics, e.g., F1 score
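These activities could be sketched as follows. Everything here is an assumption for illustration: the toy values, the `equity_ratio` feature (Total Equity divided by Assets), the synthetic data used for tuning, and the parameter grid.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# 1) New column: a financial ratio built from two existing columns (toy values)
toy = pd.DataFrame({"Total Equity": [50.0, 20.0, 0.0], "Assets": [100.0, 80.0, 0.0]})
# Treat zero assets as missing to avoid dividing by zero
toy["equity_ratio"] = toy["Total Equity"] / toy["Assets"].replace(0, np.nan)

# 2) Tune parameters with cross-validation on the training set, scored by F1
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3)
search.fit(X_tr, y_tr)

# The test set is used only once, after the parameters have been chosen
test_f1 = search.score(X_te, y_te)
print(search.best_params_, test_f1)
```

Cross-validation inside `GridSearchCV` plays the role of the validation set, so the test set stays untouched until the final check.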