### Colab Setup

In [None]:
! pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install(refresh_install=True, force_browser = True)

In [None]:
import pandas as pd
from pyspark.sql.functions import *
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F
import warnings
warnings.filterwarnings('ignore')

# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

# Analysis
During this analysis, we continue with the `cf_industries_pages` dataframe.

In this case, we don't know which page contains what information. We can use `FinanceBertForSequenceClassification` models to do text classification, here at the page level.

To check the Responsibility Report pages, we have two specific models: `"finclf_augmented_esg"` and `"finclf_esg"`.

Let's download the dataset and create a generic classification pipeline.

In [21]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/cf_industries_pages.parquet.zip

--2023-03-27 14:44:53--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/cf_industries_pages.parquet.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 141300 (138K) [application/zip]
Saving to: ‘cf_industries_pages.parquet.zip’


2023-03-27 14:44:53 (36.5 MB/s) - ‘cf_industries_pages.parquet.zip’ saved [141300/141300]



In [None]:
!unzip cf_industries_pages.parquet.zip -d cf_industries_pages.parquet

In [6]:
def generic_clf_pipeline(model_name):
  document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

  tokenizer = nlp.Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

  sequenceClassifier = finance.BertForSequenceClassification.pretrained(model_name, "en", "finance/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("class")

  pipeline = nlp.Pipeline(stages=[
      document_assembler, 
      tokenizer,
      sequenceClassifier
  ])

  empty_data = spark.createDataFrame([[""]]).toDF("text")

  model = pipeline.fit(empty_data)

  return model

In [23]:
data = spark.read.parquet("./cf_industries_pages.parquet")

data.show()

+--------+--------------------+
|page_num|                text|
+--------+--------------------+
|       1|CF Industries
202...|
|       2|A Message from Ou...|
|       3|OUR BUSINESS AND ...|
|       4|ACCOUNTABILITY AN...|
|       5|This report detai...|
|       6|At CF Industries,...|
|       7|AMMONIA’S ROLE IN...|
|       8|(1) Other segment...|
|       9|“We operate advan...|
|      10|DIMENSIONS & KEY ...|
|      11|1. Energy, Emissi...|
|      12|Our four distinct...|
|      13|Key Issues 
 
1) ...|
|      14|14) Product Desig...|
|      15|Our intensive wor...|
|      16|ESG Goals
ENERGY,...|
|      17|We are excited ab...|
|      18| ▶ Supplier Scree...|
|      19|Energy, 
Emission...|
|      20|CF Industries bel...|
+--------+--------------------+
only showing top 20 rows



In [8]:
from pyspark.sql.functions import *
# We remove repeating text from all pages
data = data.withColumn('text', regexp_replace('text', 'Message from CEO About this Report About the Company Energy, Emissions & Climate ChangeOur Workplace & Communities Food Security & Product Stewardship Ethics & Governance Coalitions, Partnerships & Policy EngagementApproach to ESG & SustainabilityReporting & Data', ''))

# Collapse line breaks into single spaces
data = data.withColumn('text', regexp_replace('text', '\n', ' '))

# Start each ▶ bullet on its own line again
data = data.withColumn('text', regexp_replace('text', '▶', '\n▶'))

data.show(truncate = 200)

+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|page_num|                                                                                                                                                                                                    text|
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|       1|                                                                                                                                     CF Industries 2021  ESG Report 1CF INDUSTRIES  |  2021 ESG REPORT  |
|       2|A Message from Our CEO To our Stakeholders: CF Industries made substantial progress in 2021  across the environmental, social, and governance 
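
To see what the three replacements do without starting Spark, here is a minimal plain-Python sketch of the same cleanup using the standard `re` module. The `clean_page` helper, the shortened `NAV_TEXT` constant, and the sample string are illustrative assumptions and not part of the notebook. `re.escape` is used so the menu text is matched literally rather than as a regular expression.

```python
import re

# Shortened stand-in for the repeated navigation text (assumption for illustration)
NAV_TEXT = "Message from CEO About this Report About the Company"

def clean_page(text: str) -> str:
    """Apply the same three replacements as the Spark cell, in the same order."""
    text = re.sub(re.escape(NAV_TEXT), "", text)  # drop the repeated menu text
    text = text.replace("\n", " ")                # collapse line breaks into spaces
    text = text.replace("▶", "\n▶")               # start each bullet on a new line
    return text

# Hypothetical page text, made up for this sketch
sample = (
    "ESG Goals\nENERGY\n▶ Reduce emissions\n▶ Save water "
    "Message from CEO About this Report About the Company"
)
cleaned = clean_page(sample)
print(cleaned)
```

The order matters: line breaks are removed first, so the only newlines left in the cleaned text are the ones placed in front of each `▶` bullet.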

## Using Text Classification to find Relevant Parts of the Document: `Environmental`, `Social`, `Governance`
To check the Responsibility Report pages, we have two specific models: `"finclf_esg"` and `"finclf_augmented_esg"`.

First, we use the `finclf_esg` classification model to classify pages into three categories: `Environmental`, `Social`, `Governance`.

In [9]:
model_name = "finclf_esg"
result = generic_clf_pipeline(model_name).transform(data)

finclf_esg download started this may take some time.
[OK!]


In [10]:
result = result.select("page_num", "text", "class.result")

In [11]:
result.show(20, truncate = 80)

+--------+--------------------------------------------------------------------------------+---------------+
|page_num|                                                                            text|         result|
+--------+--------------------------------------------------------------------------------+---------------+
|       1|             CF Industries 2021  ESG Report 1CF INDUSTRIES  |  2021 ESG REPORT  |   [Governance]|
|       2|A Message from Our CEO To our Stakeholders: CF Industries made substantial pr...|[Environmental]|
|       3|OUR BUSINESS AND STRATEGY At our core, CF Industries is a producer of ammonia...|[Environmental]|
|       4|ACCOUNTABILITY AND TRANSPARENCY Our strategy of accelerating the world’s tran...|[Environmental]|
|       5|This report details CF Industries’ progress and opportunities within key envi...|[Environmental]|
|       6|At CF Industries, our mission is to provide clean energy to feed and fuel  th...|[Environmental]|
|       7|AMMONIA’S ROLE IN 

In [12]:
result.write.mode("overwrite").parquet("finclf_esg_result.parquet")
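
A quick sanity check on the classification is to tally how many pages fall under each label. The sketch below does this in plain Python on `(page_num, labels)` pairs, such as those you might obtain after collecting the `page_num` and `result` columns to the driver. The `count_labels` helper is hypothetical, and the sample rows are the first six rows of the output shown above.

```python
from collections import Counter

def count_labels(rows):
    """Count pages per predicted label; each row is (page_num, [labels])."""
    counts = Counter()
    for _page_num, labels in rows:
        counts.update(labels)
    return counts

# First six rows of the `finclf_esg` output shown above
rows = [
    (1, ["Governance"]),
    (2, ["Environmental"]),
    (3, ["Environmental"]),
    (4, ["Environmental"]),
    (5, ["Environmental"]),
    (6, ["Environmental"]),
]
label_counts = count_labels(rows)
print(label_counts.most_common())
```

A heavily skewed tally, as in these first pages, is a hint to inspect a few predictions by hand before relying on the labels.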

## Using Text Classification to find Relevant Parts of the Document: ESG Text Classification
Second, we use the `finclf_augmented_esg` model. This model classifies the pages of Responsibility and ESG reports into 26 categories:

`Business_Ethics`, `Data_Security`, `Access_And_Affordability`, `Business_Model_Resilience`, `Competitive_Behavior`, `Critical_Incident_Risk_Management`, `Customer_Welfare`, `Director_Removal`, `Employee_Engagement_Inclusion_And_Diversity`, `Employee_Health_And_Safety`, `Human_Rights_And_Community_Relations`, `Labor_Practices`, `Management_Of_Legal_And_Regulatory_Framework`, `Physical_Impacts_Of_Climate_Change`, `Product_Quality_And_Safety`, `Product_Design_And_Lifecycle_Management`, `Selling_Practices_And_Product_Labeling`, `Supply_Chain_Management`, `Systemic_Risk_Management`, `Waste_And_Hazardous_Materials_Management`, `Water_And_Wastewater_Management`, `Air_Quality`, `Customer_Privacy`, `Ecological_Impacts`, `Energy_Management`, `GHG_Emissions`

In [15]:
model_name = "finclf_augmented_esg"
result_esg = generic_clf_pipeline(model_name).transform(data)

finclf_augmented_esg download started this may take some time.
[OK!]


In [16]:
result_esg.show()

+--------+--------------------+--------------------+--------------------+--------------------+
|page_num|                text|            document|               token|               class|
+--------+--------------------+--------------------+--------------------+--------------------+
|       1|CF Industries 202...|[{document, 0, 66...|[{token, 0, 1, CF...|[{category, 0, 66...|
|       2|A Message from Ou...|[{document, 0, 32...|[{token, 0, 0, A,...|[{category, 0, 32...|
|       3|OUR BUSINESS AND ...|[{document, 0, 43...|[{token, 0, 2, OU...|[{category, 0, 43...|
|       4|ACCOUNTABILITY AN...|[{document, 0, 51...|[{token, 0, 13, A...|[{category, 0, 51...|
|       5|This report detai...|[{document, 0, 14...|[{token, 0, 3, Th...|[{category, 0, 14...|
|       6|At CF Industries,...|[{document, 0, 11...|[{token, 0, 1, At...|[{category, 0, 11...|
|       7|AMMONIA’S ROLE IN...|[{document, 0, 34...|[{token, 0, 8, AM...|[{category, 0, 34...|
|       8|(1) Other segment...|[{document, 0, 27..

In [17]:
result_esg = result_esg.select("page_num", "text", "class.result")

In [18]:
result_esg.show(truncate = 80)

+--------+--------------------------------------------------------------------------------+----------------------------------------------+
|page_num|                                                                            text|                                        result|
+--------+--------------------------------------------------------------------------------+----------------------------------------------+
|       1|             CF Industries 2021  ESG Report 1CF INDUSTRIES  |  2021 ESG REPORT  |[Management_Of_Legal_And_Regulatory_Framework]|
|       2|A Message from Our CEO To our Stakeholders: CF Industries made substantial pr...|    [Waste_And_Hazardous_Materials_Management]|
|       3|OUR BUSINESS AND STRATEGY At our core, CF Industries is a producer of ammonia...|    [Waste_And_Hazardous_Materials_Management]|
|       4|ACCOUNTABILITY AND TRANSPARENCY Our strategy of accelerating the world’s tran...|                               [GHG_Emissions]|
|       5|This report detai

In [19]:
result_esg.write.mode("overwrite").parquet("finclf_augmented_esg_result.parquet")
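
With fine-grained labels, a natural next step is to pull out only the pages about a topic of interest. The following plain-Python sketch assumes the predictions are available as `(page_num, labels)` pairs. `pages_with_label` is a hypothetical helper, and the sample rows are the first four rows of the output shown above.

```python
def pages_with_label(rows, label):
    """Return the page numbers whose predicted labels include `label`."""
    return [page_num for page_num, labels in rows if label in labels]

# First four rows of the `finclf_augmented_esg` output shown above
rows = [
    (1, ["Management_Of_Legal_And_Regulatory_Framework"]),
    (2, ["Waste_And_Hazardous_Materials_Management"]),
    (3, ["Waste_And_Hazardous_Materials_Management"]),
    (4, ["GHG_Emissions"]),
]
ghg_pages = pages_with_label(rows, "GHG_Emissions")
waste_pages = pages_with_label(rows, "Waste_And_Hazardous_Materials_Management")
print(ghg_pages, waste_pages)
```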

# Continue your analysis
Make sure you understand the contents of a Responsibility Report and what information can be extracted from it. In the following notebooks, we will extract all of that information using Finance NLP.

# You are ready to proceed to the 03 Named Entity Recognition notebook!

In the next and following notebooks, you will use Finance NLP to extract information, more specifically:
- `NER`: to extract up to 20 quantifiable entities, including KPIs, from the Responsibility and ESG Reports of companies.
- `Table Extract`: to get tables from RR reports.
- `Table Understanding`: to understand the tables using a Financial Question-Answering model.