### Colab Setup

In [None]:
! pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import *
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install(refresh_install=True, visual=True, force_browser = True)

In [3]:
from johnsnowlabs import *
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F
import pandas as pd

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start(visual = True)  

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.3.2, 💊Spark-Healthcare==4.3.2, 🕶Spark-OCR==4.3.3, running on ⚡ PySpark==3.1.2


## Read the PDF file into a DataFrame and display it

In [6]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/CF_Industries.pdf

--2023-03-27 13:45:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/CF_Industries.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12674821 (12M) [application/octet-stream]
Saving to: ‘CF_Industries.pdf’


2023-03-27 13:45:15 (86.2 MB/s) - ‘CF_Industries.pdf’ saved [12674821/12674821]



In [7]:
pdf_path = './CF_Industries.pdf'

pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

visual.display_pdf(pdf_example_df, limit = 20)


## Extract text from the pdf files

If your PDF files contain selectable text, use the following code to extract the raw text.

In [8]:
# For text-based PDFs, extract the embedded text page by page
pdf_to_text = visual.PdfToText()\
    .setInputCol("content")\
    .setOutputCol("text")\
    .setSplitPage(True)\
    .setExtractCoordinates(True)\
    .setStoreSplittedPdf(True)

pipeline = PipelineModel(stages=[
    pdf_to_text
])



## Run pipeline and show results

In [9]:
result = pipeline.transform(pdf_example_df).cache()
result.show(5)

+--------------------+--------------------+--------+--------------------+--------------------+----------------+---------------+--------------------+---------+-------+
|                path|    modificationTime|  length|                text|           positions|height_dimension|width_dimension|             content|exception|pagenum|
+--------------------+--------------------+--------+--------------------+--------------------+----------------+---------------+--------------------+---------+-------+
|file:/content/CF_...|2023-03-27 13:45:...|12674821|CF Industries
202...|[{[{C, 0, 56.0, 1...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      0|
|file:/content/CF_...|2023-03-27 13:45:...|12674821|A Message from Ou...|[{[{A, 1, 55.0, 6...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      1|
|file:/content/CF_...|2023-03-27 13:45:...|12674821|OUR BUSINESS AND ...|[{[{O, 2, 56.0, 5...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      2
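The `content` column shown above holds the raw file bytes loaded by the `binaryFile` reader, and its printed prefix `25 50 44 46 2D` is the hexadecimal form of the PDF header. The short sketch below decodes that prefix in plain Python; the helper `looks_like_pdf` is a hypothetical name introduced only for illustration and is not part of the library.

```python
# Leading bytes of the `content` column, copied from the table output above
shown_prefix = "25 50 44 46 2D"

# Convert the space-separated hex pairs into raw bytes
header = bytes.fromhex(shown_prefix.replace(" ", ""))
print(header)  # b'%PDF-'


def looks_like_pdf(content: bytes) -> bool:
    """Hypothetical helper: True if the payload starts with the PDF magic header."""
    return content.startswith(b"%PDF-")


print(looks_like_pdf(header + b"1.7\n"))  # True
```

A check like this could be used to skip non-PDF files before running the pipeline, assuming the input folder may contain mixed file types.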

## Display text using pandas dataframe

In [10]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

result.select("text").toPandas().head().style.set_properties(**{'white-space': 'pre-wrap', 'text-align': 'left'})

Unnamed: 0,text
0,CF Industries 2021 ESG Report 1CF INDUSTRIES | 2021 ESG REPORT
1,"A Message from Our CEO To our Stakeholders: CF Industries made substantial progress in 2021 across the environmental, social, and governance (ESG) dimensions that we believe are critical to the long-term success of our Company. These advances have been driven by outstanding work from the CF Industries team, guided by a strategy of decarbonizing our manufacturing processes and supporting the transition to a clean energy economy, underpinned by continued strong financial performance. I am pleased to highlight some of our most significant accomplishments in 2021 related to our ESG objectives:  ▶ First and foremost, we operated safely. Our full year recordable incident rate was 0.32 incidents per 200,000 work hours, which is significantly better than industry averages. This is especially impressive as we had our highest level of maintenance activities ever during the year, including completing seven ammonia plant turnarounds.  ▶ We began construction on North America’s first commercial scale green ammonia production at our Donaldsonville, Louisiana, manufacturing complex. Once complete in 2023, we will be able to produce 20,000 tons of carbon-free ammonia per year.  ▶ Our Board of Directors approved two projects that, when complete, will reduce our emissions by up to 2.5 million tons annually. This will enable the pro- duction of up to 1.25 million tons of net zero-carbon ammonia. byproduct captured and permanently sequestered – that will help decarbonize other in- dustries.  ▶ We exceeded our expectations on certain ESG goals we had set in 2020, such as identifying decarboniza- tion projects across our network, increasing repre- sentation of females and persons of color in senior leadership roles, and advancing nutrient steward- ship in the farming community.  ▶ We also established a new goal during the year to re- duce our Scope 3 greenhouse gas emissions by 10% by 2030. As part of this commitment, we published our Scope 3 emissions for the first time.  
▶ We voluntarily made our EEO-1 report publicly avail- able on our website. The EEO-1 report is a required annual private disclosure to the Equal Employment Opportunity Commission (EEOC) that details the Company’s US employment data by gender and sev- eral racial and ethnic categories. ▶ We established an Inclusion Council, our first Inclu- sion Resource Group and produced our first-ever Inclusion, Diversity and Equity Report. Message from CEO About this Report About the Company ESG & Sustainability Energy, Emissions, & Climate Change Our Workplace & Communities Food Security & Product Stewardship Ethics & Governance Coalitions, Partnerships, & Policy Engagement Reporting & Data We believe that these accomplishments have energized our entire Company as we continue to transform our business to enable us to capture the growth opportunity of the coming clean energy economy. This is important work for CF Industries and the world as our success in decarbonizing our manufacturing network can help drive the success of other decarbonization efforts. TCFD Governance B  fr l ce & Approach to ESG & Sustainability e orti ata 2CF INDUSTRIES | 2021 ESG REPORT"
2,"OUR BUSINESS AND STRATEGY At our core, CF Industries is a producer of ammonia. For decades, we have used the Haber-Bosch process to fix atmospheric nitrogen with hydrogen from natural gas to produce anhydrous ammonia, whose chemical composition is NH3. Over this same time frame, we have made a business of selling ammonia and other derivative fertilizer products such as urea and urea ammonium nitrate (UAN) for the energy that the nitrogen component of ammonia provides to plants, which increases crop yields. Humankind’s ability to produce nitrogen fertilizer has had an undeniably positive effect on the world. Along with advancements in seed technology and farming practices, the use of nitrogen fertilizer and other nutrients dramatically increased food production in the second half of the 1900s. Food security and quality of life around the world improved as well. The annual rate of people dying due to a famine globally per decade declined nearly 99 percent from the 1960s to the 2010s. At the same time, fertilizer allows more food to be grown on fewer acres. This reduces the amount of land cleared for agriculture, preserving carbon sequestering forests and important wildlife ecosystems as well as enhancing biodiversity. Ammonia and its derivative fertilizer products are commodities. Given this, our first order of business is to focus on achieving the lowest delivered cost per ton. At CF Industries, we do this by leveraging our unique capabilities – an outstanding operational capability and disciplined capital and corporate stewardship. In practice, this means access to low feedstock costs, high asset utilization and productivity, an extensive multimode distribution network to lower logistics costs, maximizing margins by optimizing customer locations and product type, and a constant focus on cost management. The clear benefits of ammonia and its derivative fertilizers in providing energy to crops to increase their yield come with a trade-off. 
While our plants are among the most efficient in the global industry, the ammonia production process is nevertheless energy- intensive and therefore results in significant carbon emissions. We regularly engage stakeholders about this challenge. At the same time, some energy-intensive industries looking to reduce their own carbon footprints have identified ammonia as a potential source of clean energy. This is due to ammonia’s hydrogen component, which is widely viewed as a scalable source of clean energy. Ammonia can be a clean fuel itself as it does not release any carbon when used as an energy source. Ammonia also could serve as a medium to store and transport hydrogen. In essence, these industries are exploring purchasing ammonia for the hydrogen value of the molecule – a value that is enhanced if the ammonia production process is decarbonized. Our need to reduce our carbon footprint and the need to decarbonize the global economy intersect. As a result, our strategy is to leverage our unique capabilities to accelerate the world’s transition to clean energy. This approach builds on our existing business. We will still provide energy to agriculture end-users in the form of ammonia and derivative nitrogen fertilizers. But we are also pursuing the growth opportunities available from providing clean energy for power generation in the form of ammonia. Alongside these efforts, we will decarbonize our ammonia production network, including producing ammonia from carbon-free sources (green ammonia) and ammonia produced conventionally with the CO2 byproduct captured and permanently sequestered (blue ammonia). These are more than promises: we have committed $385 million in capital through 2025 to advance these initiatives. “Our need to reduce our carbon footprint and the need to decarbonize the global economy intersect. 
As a result, our strategy is to leverage our unique capabilities to accelerate the world’s transition to clean energy.” Message from CEO About this Report About the Company Energy, Emissions & Climate Change Our Workplace & Communities Food Security & Product Stewardship Ethics & Governance Coalitions, Partnerships & Policy Engagement Approach to ESG & Sustainability Reporting & Data 3CF INDUSTRIES | 2021 ESG REPORT"
3,"ACCOUNTABILITY AND TRANSPARENCY Our strategy of accelerating the world’s transition to clean energy is linked to a comprehensive set of ESG goals. These commitments include a dramatic reduction in carbon emissions across our global network to achieve net-zero carbon emissions by 2050 and an intermediate goal of a 25% reduction in emissions intensity by 2030. Our ESG goals also encompass other issues important to CF Industries and its stakeholders, including inclusion, diversity & equity (ID&E), safety, food security, nutrient management, and community involvement. Our complete list of ESG goals appears later in this report and can also be found at sustainability.cfindustries. com. Given the critical importance of these efforts to the Company, shareholders, and stakeholders, our Board of Directors aligns executive compensation directly to ESG objectives. Management also benefits from the oversight of the Board, including two committees with a focus on specific ESG-relevant areas: one that has oversight over ID&E matters and employee well-being initiatives and another that oversees the Company’s clean energy initiatives and progress toward net-zero carbon emissions, community involvement efforts and overall accountability for meeting the Company’s ESG objectives. We communicate our performance in these areas and others through our annual ESG and sustainability reporting, which include our submissions under the Global Reporting Initiative (GRI), Sustainability Accounting Standards Board (SASB) framework and the Task Force on Climate-related Financial Disclosures (TCFD). Additionally, we remain committed to make the UN Global Compact and its principles part of the strategy, culture, and day-to- day operations of our company and to engage in collaborative projects that advance the UN Sustainable Development Goals (SDGs). 
A PLATFORM TO DELIVER ON OUR STRATEGIC OBJECTIVES Our strategic evolution is occurring at a time when our business is demonstrating strong returns from our investments in operational excellence and growth over the previous decade. During 2021, we experienced strong global nitrogen demand, less nitrogen supply due to lower global industry operating rates and favorable energy spreads that increased the Company’s margin opportunities. These dynamics became much more pronounced in the second half of the year and, in particular, during the fourth quarter of 2021 when global nitrogen prices and energy spreads reached record highs. For 2021, the Company reported net earnings attributable to common stockholders of $917 million, or $4.24 per diluted share. Adjusted EBITDA1 was just over $2.7 billion. Net cash from operations was approximately $2.9 billion and free cash flow2 was approximately $2.2 billion, both Company records. As we entered 2022, global nitrogen industry fundamentals remain very favorable. The need to replenish global grain stocks and increased economic activity should continue to support robust global demand. High energy prices in Europe and Asia, as well as nitrogen export restrictions through at least the middle of 2022 from key-producing countries such as China, Egypt, Russia and Turkey, are expected to challenge supply availability. As a result, we believe CF Industries is well-positioned for the years ahead, providing us a strong platform from which to pursue our clean energy initiatives and take the steps necessary to achieve our ESG goals.  President and Chief Executive Officer Tony Will A CLEAN ENERGY FUTURE Few companies are fortunate enough to find a business as vital to the health and well-being of the world as helping feed the crops that feed the world. Even fewer have the opportunity to have a second act that can be just as impactful: providing clean energy to help the world decarbonize. 
As you will see in the following pages, we are focused on realizing the promise of what CF Industries can offer the world. When we do this, we believe we will create value for all our stakeholders. Thank you for your interest in CF Industries. We look forward to working with you as we advance our shared commitment to a more sustainable world. 1 EBITDA is defined as net earnings attributable to common stockholders plus interest expense-net, income taxes, and depreciation and amortization. Adjusted EBITDA is defined as EBITDA adjusted with the selected items included in EBITDA. See “Reporting and Data” in this report for a reconciliation of EBITDA and adjusted EBITDA to the most directly comparable GAAP measures. 2 Free cash flow is defined as net cash from operating activities less capital expenditures and distributions to noncontrolling interest. See “Reporting and Data” in this report for a reconciliation of free cash flow to the most directly comparable GAAP measure. Message from CEO About this Report About the Company Energy, Emissions & Climate Change Our Workplace & Communities Food Security & Product Stewardship Ethics & Governance Coalitions, Partnerships & Policy Engagement Approach to ESG & Sustainability Reporting & Data 4CF INDUSTRIES | 2021 ESG REPORT"
4,"This report details CF Industries’ progress and opportunities within key environmental, social, and governance areas from January 1, 2021, to December 31, 2021. It covers the operations under Company control in North America and the United Kingdom. This report serves as an annual United Nations Global Compact Communication on Progress. It has been written in accordance with the Global Reporting Initiative (GRI) Standards (Comprehensive option) and includes a Sustainability Accounting Standards Board (SASB) Index with industry-specific disclosures. This is the fourth time the Company has issued a report disclosing on all GRI Standards, the third time CF Industries has issued a SASB Index and the second time the Company is reporting in line with Task Force on Climate-related Financial Disclosures (TCFD) guidelines. GRI, SASB, and TCFD Indices are referenced throughout this report, and can be found at its conclusion. Additionally, the indices are available at www.sustainability.cfindustries.com. For financial information on CF Industries, please see our annual report. About this Report Message from CEO About this Report About the Company Energy, Emissions & Climate Change Our Workplace & Communities Food Security & Product Stewardship Ethics & Governance Coalitions, Partnerships & Policy Engagement Approach to ESG & Sustainability Reporting & Data 5CF INDUSTRIES | 2021 ESG REPORT"
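The extracted text above keeps the line-break hyphenation of the original layout, for example `pro- duction` and `decarboniza- tion`. The following is a minimal cleanup sketch, not part of the original notebook: it assumes that a hyphen followed by whitespace and a lowercase letter marks a word split across lines. The function name `rejoin_split_words` is hypothetical.

```python
import re

# A hyphen that follows a word character, then whitespace (space or newline),
# then a lowercase continuation of the word
_SPLIT_WORD = re.compile(r"(?<=\w)-\s+(?=[a-z])")


def rejoin_split_words(text: str) -> str:
    """Remove the hyphen and whitespace left behind by line-break hyphenation."""
    return _SPLIT_WORD.sub("", text)


sample = "This will enable the pro- duction of up to 1.25 million tons"
print(rejoin_split_words(sample))
# This will enable the production of up to 1.25 million tons
```

The heuristic cannot tell a split word from a genuine compound that happened to break at its hyphen: `energy- intensive` in the text above would become `energyintensive`. A dictionary lookup on the joined word would be one way to reduce such errors.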


## Extract text from the **scanned** pdf files

If your PDF files consist of scanned images, use the following code to extract the raw text with OCR.

In [11]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/environment_policy.pdf

--2023-03-27 13:47:29--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/environment_policy.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155570 (152K) [application/octet-stream]
Saving to: ‘environment_policy.pdf’


2023-03-27 13:47:29 (6.18 MB/s) - ‘environment_policy.pdf’ saved [155570/155570]



In [12]:
pdf_path = './environment_policy.pdf'

scanned_pdf_df = spark.read.format("binaryFile").load(pdf_path).cache()

visual.display_pdf(scanned_pdf_df)


In [13]:
# For scanned PDFs, first convert each page to an image
pdf_to_image = visual.PdfToImage()\
    .setInputCol("content")\
    .setOutputCol("image")\
    .setKeepInput(True)

# Run OCR
ocr = visual.ImageToText()\
    .setInputCol("image")\
    .setOutputCol("text")\
    .setConfidenceThreshold(60)

pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr
])

In [14]:
scanned_result = pipeline.transform(scanned_pdf_df).cache()
scanned_result.show(5)

+--------------------+--------------------+------+--------------------+--------------------+-----------+-------+-----------+-----------------+---------+--------------------+--------------------+
|                path|    modificationTime|length|             content|               image|total_pages|pagenum|documentnum|       confidence|exception|                text|           positions|
+--------------------+--------------------+------+--------------------+--------------------+-----------+-------+-----------+-----------------+---------+--------------------+--------------------+
|file:/content/env...|2023-03-27 13:47:...|155570|[25 50 44 46 2D 3...|{file:/content/en...|          1|      0|          0|80.80283081054688|         |TLScontact Enviro...|[{[{TLScontact En...|
+--------------------+--------------------+------+--------------------+--------------------+-----------+-------+-----------+-----------------+---------+--------------------+--------------------+



In [15]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

scanned_result.select("text").toPandas().head().style.set_properties(**{'white-space': 'pre-wrap', 'text-align': 'left'})

Unnamed: 0,text
0,"TLScontact Environmental Responsibility Policy  TLScontact has a Global Corporate Social Responsibility committee whichis Spon cbiity soateny and overseeing itsworldwide social responsibility this includes developing policies and ensuring compliance with Teleper- formance and the UN Global Compact principles, to which we have been a signatorysince 2011. Our Corporate Social Responsibility strategy ncludes our approach to environmental protection. This approach is centred around our Citizen of the Planet initiative which tracks, manages, and promotes green practices in an effort to reduce our impact on the environment. We also monitor, record, and report the environmental impact of our opera- tions in line with our responsibilities as a signatory of the UN Global Compact. To minimise the environmental impactof our activities we * Encourage work from home initiatives for non-operations employees and conduct online meetings whenever possible to reduce carbon emissions associated with travel to the workplace. * Promote best practices in energy usage, water and paper consumption, recycling, and donation of used equipment, etc. + Participate in local initiatives to support environmental engagement activities (Citizen of the Planet initiative) and raise emmironmental awareness among employees. + Aim to procure ethically (and locally whenever possible), ensuring that those within oursupply chain actresponsibly towards the environment they operate in. * Consider energy &environmental buildi (such as LEED and BREEAM) as part of the for our premises. Our leadership is committed to integrating our Corporate Social Responsbility principles in our core functions. Currently. + 4496 ofour suppliers have ISO 14001 certification; + 35% of our locations are LEED (or equivalent) certified; + 4496 of our locations have electricity consumption reduction programmes in place; and * 50% of our locations have paper consumption reduction programmes in place. 
Inadditior: * We monitor our carbon footprintand gather data about ourenerpy use (water, paper, ar travel and fuel consumption). As part of the Teleperformance Group we engage inactities aimed at monitoring and measuring the impact of our business operations on the environmentwe operate in and taking steps to reduce our carbon footprint Forexample: * In2016, we introduced scanning ofsupporting documents for visa applications, tohelp move the visa application process away from a purely paper-based system. Since that time, over 204 million pages of supporting documentation have been scanned, instead of printed. This has resulted in an estimated saving of 850,000 tons of carbon. * Wehave moved to digital transmission of CCTV data from our Visa Application Centres to our governmentclients, instead of shipping data on DVDs. This has allowed a 75% decrease in annual physical shipments of CCTV data, from 260 to 52 ateach site. * We host our IT systems in best-in-class, energy-efficient, low carbon data centre infrastructure backed up by renewable energy generation. The aim of our data hosting provider is to achieve 100% renewable energy powering its data centres by 2040. * We provide guidance, visual reminders and promote environmental awareness among our employees to munimise unnecessary water use and reduce wastage. * We take measures aimed at reducing the impact of components and parts of end-oFife electronics and IT equipment by cooperating with certified waste management companies to collect and dispose appropriately ofour e-waste. We also donate unused equipment whenever possible. * Ournew sites include separate waste collection, e-waste management, sustainability certifications, installed water saving devices, and installed energy saving sensors. Our company strategy is to raise environmental awareness amongst our staff and ensure that we continue to reduce the environmental impact of our activities. 
We will continue to monitor activity whilst leveraging on best practices and industry developments to ensure that we remainat the forefrontofenvironmental sustainability. TLScontact"


## Continue with `CF Industries`

In [16]:
result.show(3)

+--------------------+--------------------+--------+--------------------+--------------------+----------------+---------------+--------------------+---------+-------+
|                path|    modificationTime|  length|                text|           positions|height_dimension|width_dimension|             content|exception|pagenum|
+--------------------+--------------------+--------+--------------------+--------------------+----------------+---------------+--------------------+---------+-------+
|file:/content/CF_...|2023-03-27 13:45:...|12674821|CF Industries
202...|[{[{C, 0, 56.0, 1...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      0|
|file:/content/CF_...|2023-03-27 13:45:...|12674821|A Message from Ou...|[{[{A, 1, 55.0, 6...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      1|
|file:/content/CF_...|2023-03-27 13:45:...|12674821|OUR BUSINESS AND ...|[{[{O, 2, 56.0, 5...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      2

In [17]:
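# Return a new DataFrame with all existing columns plus a 1-based page_num.
# Note: monotonically_increasing_id() is unique and increasing, but only
# consecutive within a single partition.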
result = result.withColumn("page_num", F.monotonically_increasing_id()+1)

result.show(3)

+--------------------+--------------------+--------+--------------------+--------------------+----------------+---------------+--------------------+---------+-------+--------+
|                path|    modificationTime|  length|                text|           positions|height_dimension|width_dimension|             content|exception|pagenum|page_num|
+--------------------+--------------------+--------+--------------------+--------------------+----------------+---------------+--------------------+---------+-------+--------+
|file:/content/CF_...|2023-03-27 13:45:...|12674821|CF Industries
202...|[{[{C, 0, 56.0, 1...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      0|       1|
|file:/content/CF_...|2023-03-27 13:45:...|12674821|A Message from Ou...|[{[{A, 1, 55.0, 6...|           768.0|         1366.0|[25 50 44 46 2D 3...|     null|      1|       2|
|file:/content/CF_...|2023-03-27 13:45:...|12674821|OUR BUSINESS AND ...|[{[{O, 2, 56.0, 5...|           768.0|         
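`F.monotonically_increasing_id()` produces the consecutive values 1, 2, 3 here, but Spark only guarantees that the IDs are unique and increasing. According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below emulates that layout in plain Python to show what happens once the rows span more than one partition. The function name and the partition sizes are hypothetical.

```python
def emulated_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented bit layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Two hypothetical partitions holding three and two rows
partition_sizes = [3, 2]
ids = [
    emulated_monotonic_id(p, r) + 1  # +1 mirrors the notebook's 1-based page_num
    for p, size in enumerate(partition_sizes)
    for r in range(size)
]
print(ids)  # [1, 2, 3, 8589934593, 8589934594]
```

If the DataFrame may span several partitions, deriving the page number from the 0-based `pagenum` column that is already present in the output, for example `F.col("pagenum") + 1`, avoids these jumps.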

In [18]:
result = result.select("page_num", "text")

result.show(10, truncate = 80)

+--------+--------------------------------------------------------------------------------+
|page_num|                                                                            text|
+--------+--------------------------------------------------------------------------------+
|       1|             CF Industries
2021  ESG Report
1CF INDUSTRIES  |  2021 ESG REPORT 
|
|       2|A Message from Our CEO
To our Stakeholders:
CF Industries made substantial pr...|
|       3|OUR BUSINESS AND STRATEGY
At our core, CF Industries is a producer of ammonia...|
|       4|ACCOUNTABILITY AND TRANSPARENCY
Our strategy of accelerating the world’s tran...|
|       5|This report details CF Industries’ progress and opportunities within key envi...|
|       6|At CF Industries, our mission is to provide clean energy to feed and fuel 
th...|
|       7|AMMONIA’S ROLE IN SOCIETY 
CF Industries is a leading global manufacturer of ...|
|       8|(1) Other segment products include DEF, urea liquor, nitric acid, aqua

In this DataFrame, each row corresponds to one page of the PDF file.

In [19]:
result.write.mode("overwrite").parquet("cf_industries_pages.parquet")

## You can proceed to 03.Responsibility_Reports_Analysis!