**Elijah Taber**

**6/1/2024**

<center><h1>Hybrid NLP Text Summarization Model</h1></center>

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np
import re
import warnings
import sys
import spacy

nlp = spacy.load('en_core_web_sm')
warnings.filterwarnings("ignore")
sys.path.insert(0, './CorpusCondenser')

# Custom source code
from CorpusCondenser.pdf_processor import pdf_text_extractor
from CorpusCondenser.ts_corpus_cleaner import preprocess_text
from CorpusCondenser.ts_feature_engineering import feature_engineering_pipeline
from CorpusCondenser.hybrid_ts_model import ExtractiveModel, AbstractiveModel, HybridSummarizationModel

# Extract text from PDF to get a raw and very messy corpus
raw_corpus = pdf_text_extractor(
    "IPCC_AR6_WGIII_TechnicalSummary.pdf", 
    start_page=7, 
    end_page=101
)
raw_corpus[:500] # first 500 characters



'53\nTS\nTechnical Summary\nTS.1\t\nIntroduction\nThe Working Group III (WGIII) contribution to the IPCC’s Sixth \nAssessment Report (AR6) assesses the current state of knowledge on the \nscientific, technological, environmental, economic and social aspects of \nclimate change mitigation. It builds on previous IPCC reports, including \nthe WGIII contribution to the IPCC’s Fifth Assessment Report\xa0(AR5) \nand the three Special Reports of the Sixth Assessment cycle on: Global \nWarming of 1.5°C (SR1.5); Climate'

## Preprocessing

In [2]:
# Preprocess the text by calling the custom preprocess_text function on the raw_corpus
cleaned_corpus = preprocess_text(raw_corpus)
cleaned_corpus[:500]

'ts technical summary ts1 introduction the working group iii wgiii contribution to the ipccs assessment report ar6 assesses the current state of knowledge on the scientific technological environmental economic and social aspects of climate change mitigation it builds on previous ipcc reports including the wgiii contribution to the ipccs assessment report ar5 and the special reports of the assessment cycle on global warming of 15c sr15 climate change and land srccl and the ocean and cryosphere in '

In [3]:
# Total token count in the cleaned corpus
tokens = cleaned_corpus.split()  # split text into tokens based on whitespace
corpus_length = len(tokens)
print(f"Total tokens in cleaned corpus: {corpus_length}")

Total tokens in cleaned corpus: 56703


## Feature Engineering
After running the feature engineering pipeline, the following cells examine each feature in turn to gain better insight into the corpus and to determine what will be passed to the text summarization model.

In [4]:
# Create a Doc object by applying the NLP model to the cleaned corpus
doc = nlp(cleaned_corpus)

# Extract sentences from the Doc object and store them in a list
sentences = [sent.text for sent in doc.sents]

In [5]:
# Perform feature engineering on the extracted sentences
features = feature_engineering_pipeline(sentences)

#### Term Frequency-Inverse Document Frequency

In [6]:
print("TF-IDF Scores:")
for term, score in features['tfidf'].items():
    print(f"{term}: {score}")

TF-IDF Scores:
1000c: 0.0002045827883931368
100c: 0.0002045827883931368
100year: 0.0008183311535725472
12a: 0.0002045827883931368
12sm123: 0.0002045827883931368
152c: 0.0002045827883931368
15c: 0.019026199320561724
15c2c: 0.0002045827883931368
160130tco2: 0.0004091655767862736
17sm1: 0.0004091655767862736
1800: 0.0002045827883931368
1836: 0.0002045827883931368
1850: 0.0012274967303588207
1859: 0.0002045827883931368
1870: 0.0006137483651794104
18cma1: 0.0002045827883931368
1900: 0.0002045827883931368
1924: 0.0002045827883931368
1925: 0.0004091655767862736
1935: 0.0002045827883931368
1950: 0.0002045827883931368
1957: 0.0002045827883931368
1970: 0.0008183311535725472
1973: 0.0002045827883931368
1974: 0.0004091655767862736
1976: 0.0002045827883931368
1978: 0.0002045827883931368
1980: 0.0006137483651794104
1982: 0.0002045827883931368
1984: 0.0002045827883931368
1986: 0.0002045827883931368
1988: 0.0002045827883931368
1989: 0.0002045827883931368
1990: 0.006955814805366651
1991: 0.000204582788
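
The TF-IDF implementation inside `feature_engineering_pipeline` is not shown in this notebook, so the following is only a minimal sketch of how corpus-level term scores of this kind could be computed. It treats each sentence as a document, uses a smoothed IDF, and averages each term's TF-IDF over all sentences. The function name `mean_tfidf`, the averaging step, and the toy sentences are assumptions made for illustration, not the pipeline's actual behavior.

```python
import math
from collections import Counter


def mean_tfidf(sentences):
    """Return {term: mean TF-IDF across sentences} (illustrative sketch only)."""
    tokenized = [s.split() for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: number of sentences that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)  # relative frequency within the sentence
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
            totals[term] += tf * idf
    # Average over all sentences so scores are comparable across terms
    return {term: total / n_docs for term, total in totals.items()}


# Toy sentences (hypothetical, not taken from the corpus)
example = [
    "emissions pathways limit warming",
    "mitigation pathways reduce emissions",
    "energy demand",
]
scores = mean_tfidf(example)
for term in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(f"{term}: {scores[term]:.4f}")
```

Under this scheme a term that dominates a short sentence can outscore a term that is merely frequent across the corpus, which is one reason the raw scores above are best read alongside the other features rather than on their own.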

#### Sentence Embeddings

In [7]:
print("Sentence Embeddings:")
print(features['embeddings'])

Sentence Embeddings:
[[-0.08103456 -0.01988939  0.0321498  ... -0.10128707 -0.06748783
   0.00316652]
 [-0.04303985  0.11716386  0.03577056 ... -0.08079035 -0.00558874
  -0.0124907 ]
 [-0.01299914  0.01365421  0.0395098  ... -0.16419211 -0.0674914
  -0.00517126]
 ...
 [-0.0358007   0.02999223  0.076254   ... -0.09917153 -0.00238417
   0.04931219]
 [-0.00344681 -0.02927948  0.01357593 ... -0.00384746 -0.02688603
   0.07399292]
 [-0.01902553  0.04625924  0.05450704 ... -0.09856773 -0.08637608
  -0.01362228]]
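
Each row of the array above is the embedding of one sentence. A common way to use such vectors is to compare sentences by cosine similarity. The sketch below shows that comparison on made-up three-dimensional vectors; the real embeddings have many more dimensions, and whether the summarization model uses cosine similarity at all is an assumption here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical), standing in for rows of features['embeddings']
toy_embeddings = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
sim_close = cosine_similarity(toy_embeddings[0], toy_embeddings[1])
sim_far = cosine_similarity(toy_embeddings[0], toy_embeddings[2])
print(f"similar pair: {sim_close:.3f}, dissimilar pair: {sim_far:.3f}")
```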


#### Topic Modeling

In [8]:
print("Topic Modeling:")
print(features['topics'])

Topic Modeling:
{'distribution': array([5.37722992e-06, 9.99978491e-01, 5.37722992e-06, 5.37722992e-06,
       5.37722992e-06]), 'topics': {'Topic 1': ['1000c', 'obsolete', 'crosscutting', 'occupation', 'occupancy', 'occasionally', 'occasion', 'obtained', 'crossreferences', 'crowdin'], 'Topic 2': ['emissions', 'mitigation', 'energy', 'high', 'confidence', 'pathways', 'climate', 'ghg', 'global', 'co2'], 'Topic 3': ['1000c', 'obsolete', 'crosscutting', 'occupation', 'occupancy', 'occasionally', 'occasion', 'obtained', 'crossreferences', 'crowdin'], 'Topic 4': ['1000c', 'obsolete', 'crosscutting', 'occupation', 'occupancy', 'occasionally', 'occasion', 'obtained', 'crossreferences', 'crowdin'], 'Topic 5': ['1000c', 'obsolete', 'crosscutting', 'occupation', 'occupancy', 'occasionally', 'occasion', 'obtained', 'crossreferences', 'crowdin']}}


Taking a look at Topic 2:

'Topic 2': ['emissions', 'mitigation', 'energy', 'high', 'confidence', 'pathways', 'climate', 'ghg', 'global', 'co2']

*Note: these are only the top ten words.*

It has a weight of **0.999978491** in the topic distribution, meaning that the LDA model assigns almost 100% of the corpus to this topic. Looking at the list, this is a reasonable result: the document is a climate change report, and all of these terms are closely related to climate change. The remaining four topics share an identical word list and carry negligible weight (about 5.4e-06 each), so the model has effectively found a single dominant topic. This is a good sign that the LDA is picking up the central theme of the corpus, and this feature should help the text summarization model focus on the most important topics.
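
Reading the dominant topic off the distribution can be done programmatically. The sketch below copies the five weights from the output above and returns the label of the largest one. It assumes that the keys `'Topic 1'` through `'Topic 5'` correspond, in order, to the entries of the distribution array; that mapping is not confirmed by the pipeline source, and `dominant_topic` is a hypothetical helper.

```python
# Weights copied from the 'distribution' array printed above
distribution = [
    5.37722992e-06,
    9.99978491e-01,
    5.37722992e-06,
    5.37722992e-06,
    5.37722992e-06,
]


def dominant_topic(distribution):
    """Return (1-based topic label, weight) for the largest entry."""
    index = max(range(len(distribution)), key=lambda i: distribution[i])
    return f"Topic {index + 1}", distribution[index]


label, weight = dominant_topic(distribution)
print(label, weight)
```

Under that ordering assumption, the largest weight sits in the second position, which matches the topic whose word list contains the climate-related terms.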

#### Named Entity Recognition

In [9]:
print("Named Entities:")
for entity in features['entities']:
    print(f"Entity: {entity[0]}, Type: {entity[1]}")

Named Entities:
Entity: 15c, Type: DATE
Entity: sr15, Type: PRODUCT
Entity: srocc1, Type: PERSON
Entity: 2014, Type: DATE
Entity: un, Type: ORG
Entity: kyoto, Type: GPE
Entity: paris, Type: GPE
Entity: un, Type: ORG
Entity: ts11, Type: PRODUCT
Entity: 2050, Type: CARDINAL
Entity: 15c, Type: DATE
Entity: 15c, Type: DATE
Entity: 2018, Type: DATE
Entity: 2019, Type: DATE
Entity: 2019, Type: DATE
Entity: srm crossworking group, Type: ORG
Entity: inter alia, Type: LOC
Entity: 2050, Type: DATE
Entity: 2100, Type: CARDINAL
Entity: paris, Type: GPE
Entity: 2015, Type: DATE
Entity: paris, Type: GPE
Entity: un, Type: ORG
Entity: the past decade, Type: DATE
Entity: recent years, Type: DATE
Entity: covid19, Type: PERSON
Entity: europe, Type: LOC
Entity: west central, Type: LOC
Entity: latin america, Type: LOC
Entity: caribbean middle east asia, Type: LOC
Entity: pacific africa, Type: LOC
Entity: middle east, Type: LOC
Entity: 1870, Type: DATE
Entity: 2014, Type: DATE
Entity: 1870, Type: DATE
Entit
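
The entity list contains repeats and some doubtful labels (for example, `covid19` tagged as `PERSON` and `inter alia` as `LOC`), so a tally by entity type is a quick way to review it before relying on it. The sketch below uses a handful of `(text, label)` pairs copied from the output above; the tuple layout matches how the loop indexes `entity[0]` and `entity[1]`, while the variable names are illustrative.

```python
from collections import Counter

# A few (text, label) pairs copied from the printed output above
sample_entities = [
    ("15c", "DATE"), ("sr15", "PRODUCT"), ("srocc1", "PERSON"),
    ("2014", "DATE"), ("un", "ORG"), ("kyoto", "GPE"),
    ("paris", "GPE"), ("un", "ORG"), ("covid19", "PERSON"),
]

# Count how often each entity type occurs
label_counts = Counter(label for _, label in sample_entities)

# Drop exact duplicates so each (text, label) pair is reviewed once
unique_entities = sorted(set(sample_entities))

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
print(f"{len(unique_entities)} unique entities out of {len(sample_entities)}")
```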

#### Keywords

In [10]:
print("Keywords:")
for keyword in features['keywords']:
    print(keyword)

Keywords:
resilient development pathways ts technical summary box ts1 continued 1973 oil crisis 1991 soviet union dissolusion 2008 global financial crisis 2020 covid19 pandemic edgar gcb bp iea 60 cm 1970 1980 1990 2000 2010 2020 aviation industry land transport power residential jan apr jul oct jan apr jul oct jan apr jul oct jan apr jul oct jan apr jul oct total jan apr jul oct
industry sectors figure waste agriculture energy industry lulucf ch4 fgases n2o co2 1990 1995 2018 2015 2010 2005 2000 1990 ghg gtco2eq yr1 ghg gtco2eq yr1 1995 2018 2015 2010 2005 2000 ch4 mtch4 yr1 n2o mtn2o yr1 1990 1995 2018 2015 2010 2005 2000 1990 1995 2018 2015 2010 2005 2000 co2 mt co2 yr1 1990 1995 2018 2015 2010 2005 2000 ts technical summary ts562 food systems realising
land use electricity transport eastern asia 2000 multiplication factors lower range upper range x10 x31 x2 x5 x3 x5 x6 x7 x14 x12 x14 x28 x12 x3 x4 x2 x3 x6 x4 x7 x15 x5 x4 x7 x7 x7 x7 x2 x4 x2 x8 x7 multiplication factors indicate
m

## Hybrid Text Summarization Model

In [11]:
# Initialize the hybrid summarization model, which runs extractive summarization first and abstractive summarization second
extractive_model = ExtractiveModel(top_n=5)  # number of sentences to extract
abstractive_model = AbstractiveModel(
    model_name='facebook/bart-large-cnn',  # pre-trained BART checkpoint fine-tuned on CNN/DailyMail
    framework='pt'  # use the PyTorch framework
)
hybrid_model = HybridSummarizationModel(extractive_model, abstractive_model)  # chain the two models in order

# Generate and display the final summary using the hybrid model, along with a human-interpretable method to clean up the summary
final_summary = hybrid_model.summarize(sentences, features)
print("Final Summary:\n")
print(final_summary)

Final Summary:
 

Modelled pathways that limit warming to 15°C or 2°C involve deep rapid and sustained emissions reductions net CO2 and net greenhouse gas emissions are possible through different mitigation portfolios. The modelled pathway ranges are compared to the emissions from pathways illustrative of high emissions curpol and modact. Ndcs announced prior to The 26th annual Conference of Parties refer to the most recent nationally determined contributions submitted to the unfccc up to the literature cutoff date of this report October 2021. Revised Nationally Determined Contributions were announced by China Japan and the republic of Korea prior to October 2020 but only submitted thereafter.
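
The internals of `ExtractiveModel` are not shown in this notebook. As a rough illustration of the extractive stage that runs before BART, the sketch below scores each sentence, keeps the `top_n` highest-scoring ones, and returns them in their original order so the abstractive stage receives coherent input. The word-frequency score is a simple stand-in and an assumption; the actual model presumably draws on the engineered features passed to `summarize`. The function name `extract_top_sentences` is hypothetical.

```python
from collections import Counter


def extract_top_sentences(sentences, top_n=5):
    """Keep the top_n sentences by average word frequency, in original order."""
    word_freq = Counter(word for s in sentences for word in s.split())

    def score(sentence):
        words = sentence.split()
        if not words:
            return 0.0
        return sum(word_freq[w] for w in words) / len(words)

    # Rank sentence indices by score, best first
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order among the selected sentences
    keep = sorted(ranked[:top_n])
    return [sentences[i] for i in keep]


# Toy sentences (hypothetical, not taken from the corpus)
toy_sentences = [
    "emissions pathways limit warming",
    "the report has many figures",
    "mitigation pathways reduce emissions",
    "energy demand falls",
]
selected = extract_top_sentences(toy_sentences, top_n=2)
print(selected)
```

In a hybrid setup the selected sentences would then be joined and handed to the abstractive model, which keeps the input within the model's length limit while retaining the highest-scoring content.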
