This is the first analysis of the available data. Here we explore 3 datasets:
- esg_scores_history_rated.csv
- companies_all.csv
- environmental_data_history_all.csv
We are mainly interested in solving the investor/company matching problem. To do so, we considered 2 approaches, which will become clear later: one based on clustering and another based on ESG forecasting.
First of all, clone the repository:
$ git clone https://github.com/Hackganization/
In order to run the notebook, you must install Jupyter Notebook, which is included in the Anaconda distribution available at this link: https://www.anaconda.com/products/individual
Once the software is installed, make sure you have matplotlib, numpy, pandas, scikit-learn (sklearn) and statsmodels installed.
To do so, simply run
$ conda install <package>
for each of the libraries above (note that sklearn is installed under the name scikit-learn).
Now that you have the dependencies, open the notebook by running
$ jupyter notebook
and click on the desired notebook. Then click the Cell menu at the top of the notebook and select Run All.
Let's start by looking at companies_all.csv. It contains data about each company, such as id, country, region, industry segment and more:
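As a minimal sketch of this first inspection (assuming the file sits next to the notebook; the exact column names may differ from those shown in the prose), the data can be loaded with pandas:

```python
import pandas as pd

# Load the companies dataset and take a first look at its columns and types
companies = pd.read_csv("companies_all.csv")
print(companies.head())
print(companies.dtypes)
```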
Next, let's see the frequencies of each industry segment:
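A short sketch of how these frequencies can be computed and plotted (assuming the segment column is called `industry_segment`):

```python
import pandas as pd
import matplotlib.pyplot as plt

companies = pd.read_csv("companies_all.csv")

# Frequency of each industry segment ("industry_segment" is an assumed column name)
segment_counts = companies["industry_segment"].value_counts()
print(segment_counts)

# A bar plot makes the imbalance between segments easy to see
segment_counts.plot(kind="bar")
plt.ylabel("Number of companies")
plt.tight_layout()
plt.show()
```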
We found that the distribution of companies across industry segments is very imbalanced, which could introduce bias if a prediction model were used. Our first idea was to cluster the companies, but we switched to time series forecasting, for reasons that will become clear later.
Now, let's look at the ESG data in esg_scores_history_rated.csv:
It contains fields such as company_id, industry_segment, assessment_year, parent_aspect, score_weight and score_value. We will use parent_aspect to aggregate the data and obtain a score for each ESG dimension, but first let's look at how these scores are distributed:
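A minimal sketch of this aggregation, assuming each dimension score is the score_weight-weighted average of score_value within a company, assessment year and parent_aspect (the weighting scheme here is an illustrative assumption, not necessarily the exact rule used in the notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt

esg = pd.read_csv("esg_scores_history_rated.csv")

# Weighted average of score_value per company / year / ESG dimension
esg["weighted"] = esg["score_value"] * esg["score_weight"]
grouped = esg.groupby(["company_id", "assessment_year", "parent_aspect"])
dimension_scores = (
    (grouped["weighted"].sum() / grouped["score_weight"].sum())
    .rename("dimension_score")
    .reset_index()
)

# Distribution of scores per ESG dimension
print(dimension_scores.groupby("parent_aspect")["dimension_score"].describe())
dimension_scores.boxplot(column="dimension_score", by="parent_aspect")
plt.show()
```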
The distributions look similar, but for the environmental dimension the median is significantly lower than the others. This could be an indication that companies are still not very concerned about environmental issues.
Finally, we build a model to predict the next ESG score given the historical data. This can be useful for an investor looking for a prediction of ESG performance in the coming years. To do so, we used the SimpleExpSmoothing model from the statsmodels library. It is a simple and fast model, so it scales easily to large datasets. For more on the theory, see: Forecasting: Principles and Practice. Once these future ESG values are predicted, including the global score, they are used to rank the companies for the potential investor according to their preferences. The final step is to aggregate the predicted ESG scores with the company details to serve the application, resulting in the following Amostra_das_empresas.csv dataset:
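A minimal sketch of the forecasting step for a single company, using statsmodels' SimpleExpSmoothing (the score values below are illustrative placeholders; in the notebook the series would come from the aggregated dimension scores described above):

```python
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Historical ESG scores for one company, indexed by assessment year
# (illustrative values, not real data)
history = pd.Series(
    [52.0, 55.5, 58.1, 57.4, 60.2],
    index=pd.Index([2016, 2017, 2018, 2019, 2020], name="assessment_year"),
)

# Fit simple exponential smoothing and forecast the next assessment
model = SimpleExpSmoothing(history, initialization_method="estimated").fit()
next_score = model.forecast(1)
print(next_score)
```

Repeating this per company (and per ESG dimension) yields the predicted scores that are then merged with the company details for ranking.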
It is important to note that, since this is a proof of concept, only a sample of 4 industry segments with 5 companies each was used.
- Python
- Docker
- Pandas
- Flask
- Statsmodels