
# Problem Statement:

The goal is to develop an ML/AI-based summarization and data aggregation system that generates insights from the “Energy Data feed” and minimizes manual curation and analysis of the automated feed. The data comes from the following news feeds:


*   International Energy Agency
*   National Energy Technology Laboratory
*   U.S. Energy Information Administration
*   International Renewable Energy Agency

# Crawler

Given a query and a value k, the crawler returns a list of the top k result URLs from each of the data sources.

For now we focus on four news feeds; more can be added as needed. A future goal is to include other types of data sources, such as audio, video, images and text.

In [1]:
data_sources=[
    'https://www.iea.org/search/news?q=%s',
    'https://netl.doe.gov/search/node?keys=%s',
    'https://search.usa.gov/search?affiliate=eia.doe.gov&sort_by=&query=%s',
    'https://www.irena.org/Search?query=%s&contentType=e833bea4-7572-4310-944f-f57c92ab7ead&orderBy=Date'
]

In [2]:
import requests
from bs4 import BeautifulSoup

#Returns the top k result sites from searching the query in iea website
def iea_crawler(site,k):
    r = requests.get(site)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    out=[]
    for article in soup.find_all('article',class_='m-news-listing',limit=k):
        out.append('https://www.iea.org'+article.find('a').get('href'))
    return out

#Returns the top k result sites from searching the query in netl website
def netl_crawler(site,k):
    r = requests.get(site)
    soup = BeautifulSoup(r.content, 'html.parser')

    out=[]
    for content in soup.find_all('div',class_='netlsearch-results',limit=k):
        out.append(content.find('a').get('href'))
    return out

#Returns the top k result sites from searching the query in eia website
def eia_crawler(site,k):
    r = requests.get(site)
    soup = BeautifulSoup(r.content, 'html.parser')

    out=[]
    for content in soup.find_all('div',class_='content-block-item result',limit=k):
        out.append(content.find('a').get('href'))
    return out

#Returns the top k result sites from searching the query in irena website
def irena_crawler(site,k):
    r = requests.get(site)
    soup = BeautifulSoup(r.content, 'html.parser')

    out=[]
    for content in soup.find_all('div',class_='c-Result__content',limit=k):
        out.append('https://www.irena.org'+content.find('a').get('href'))
    return out

In [3]:
#Aggregates the top k result URLs from each of the data sources for a given query and k value.
def return_topk(query,k):
    crawler_list=[]
    crawler_list.extend(iea_crawler(data_sources[0] %query,k))

    crawler_list.extend(netl_crawler(data_sources[1] %query,k))

    crawler_list.extend(eia_crawler(data_sources[2] %query,k))

    crawler_list.extend(irena_crawler(data_sources[3] %query,k))

    return crawler_list
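
The search URLs above are built by substituting the raw query into each `%s` template. As a minimal sketch (not part of the original notebook), the query can be percent-encoded first so that spaces and characters such as `&` cannot change the meaning of the URL. The helper name `build_search_urls` is hypothetical, and only two of the templates are repeated here to keep the block self-contained.

```python
from urllib.parse import quote_plus

# Two templates copied from data_sources, repeated so the sketch runs on its own.
search_templates = [
    'https://www.iea.org/search/news?q=%s',
    'https://netl.doe.gov/search/node?keys=%s',
]

def build_search_urls(query, templates):
    """Hypothetical helper: encode the query once, then fill every template."""
    encoded = quote_plus(query)  # spaces become '+', '&' becomes '%26'
    return [template % encoded for template in templates]

for url in build_search_urls('oil & gas', search_templates):
    print(url)
```

Each crawler could then receive an already encoded URL, whatever characters the query contains.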

# Scraper

Takes a news URL, scrapes the page, then preprocesses it to remove unwanted content such as HTML tags, leaving clean text.

The preprocessing step mainly consists of removing JavaScript code and HTML tags from the unstructured text.

In [4]:
import re

# remove JavaScript code from text
def remove_script_code(data):
    pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # lazily match everything from an opening <script tag to its closing tag
    return re.sub(pattern, '', data, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# collapse every run of whitespace into a single space
def remove_whitespace(text):
    return " ".join(text.split())

# Replaces newline and carriage-return characters with single spaces, dropping empty lines
def condense_newline(text):
    return ' '.join([p for p in re.split('\n|\r', text) if len(p) > 0])

# remove html tags from text
def remove_htmltag(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub(' ', str(text))

# Takes a news feed URL, scrapes the webpage, then processes it to return a clean version of the text
def scrap(site):
    r = requests.get(site)
    soup = BeautifulSoup(r.content,'html.parser')
    return remove_whitespace(condense_newline(remove_htmltag(remove_script_code(str(soup)))))

In [5]:
#Takes a list of news URLs and returns the processed text of each site together in a list
def scrapper_agg(sites):
    texts=[]
    for site in sites:
        texts.append(scrap(site))
    
    return texts
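
To make the preprocessing order concrete, here is a small worked example on an inline HTML string, so it needs no network access. It reuses the same regular expressions as the functions above; the wrapper name `clean_html` and the sample markup are illustrative assumptions, not part of the notebook.

```python
import re

def clean_html(raw):
    """Hypothetical wrapper that applies the notebook's cleaning steps in order."""
    # 1. Drop <script>...</script> blocks, including their JavaScript bodies.
    no_script = re.sub(r'<[ ]*script.*?\/[ ]*script[ ]*>', '', raw,
                       flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    # 2. Replace every remaining tag with a space so adjacent words stay separated.
    no_tags = re.sub(r'<[^>]+>', ' ', no_script)
    # 3. Collapse newlines and repeated whitespace into single spaces.
    return ' '.join(no_tags.split())

sample = ('<html><head><script type="text/javascript">var x = 1;</script></head>\n'
          '<body><h1>Energy  News</h1>\n\n<p>Oil prices rose.</p></body></html>')

print(clean_html(sample))  # Energy News Oil prices rose.
```

The script body disappears entirely, while the heading and paragraph text survive as one whitespace-normalized string.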

# Summarizer

Summarizes the text from a website by combining an extractive and an abstractive summarizer.

In [6]:
!pip install -q -U transformers


In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Extractive Summarizer

Extractive summarization builds a summary by concatenating important sentences or paragraphs from the source, without modelling the meaning of those sentences.

The extractive summarizer here uses a weighting method: each sentence in the original text is assigned a weight based on the frequency of the words it contains, and a threshold is derived from the average of all sentence scores (the code below uses half the average). The sentences whose weights reach the threshold form the summary.

In [8]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

#create a dictionary of stemmed words with their occurrence frequency
def _create_dictionary_table(text_string) -> dict:
   
    #removing stop words
    stop_words = set(stopwords.words("english"))
    
    words = word_tokenize(text_string)
    
    #reducing words to their root form
    stem = PorterStemmer()
    
    #creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table

#calculate the weight score of each sentence in the original text from the frequency of the words it contains
def _calculate_sentence_scores(sentences, frequency_table) -> dict:

    #sentences are keyed by their first 7 characters, so sentences sharing that prefix share one score
    sentence_weight = dict()

    for sentence in sentences:
        key = sentence[:7]
        matched_words = 0
        for word, count in frequency_table.items():
            #substring match of the stemmed word against the lowercased sentence
            if word in sentence.lower():
                matched_words += 1
                sentence_weight[key] = sentence_weight.get(key, 0) + count

        #skip sentences with no matched words to avoid a KeyError or a division by zero
        if matched_words > 0:
            sentence_weight[key] = sentence_weight[key] / matched_words

    return sentence_weight

#calculate the average of all sentence scores in the text, used as the basis for the threshold
def _calculate_average_score(sentence_weight) -> float:

    #an empty score table (for example, empty input text) has no meaningful average
    if not sentence_weight:
        return 0.0

    #getting sentence average value from source text
    return sum(sentence_weight.values()) / len(sentence_weight)

#collects the sentences whose weights are at least the threshold
def _get_article_summary(sentences, sentence_weight, threshold):
    article_summary = ''

    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence

    return article_summary

#returns the extractive summary of an original text
def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #the average sentence score is the baseline for the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary from the sentences that score at least half the average
    article_summary = _get_article_summary(sentences, sentence_scores, .5 * threshold)

    return article_summary
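
The weighting scheme described above can be illustrated on a three-sentence text without NLTK. This is a simplified sketch under stated assumptions: it uses a tiny hand-written stop-word list, regex tokenization and no stemming, and it scores whole-word matches rather than substrings, so its numbers will not match the notebook's functions exactly. All names in it are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}

def split_sentences(text):
    # Naive split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def extractive_summary(text, factor=1.0):
    """Keep sentences whose average word frequency is >= factor * mean score."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    threshold = factor * sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score >= threshold]

text = "Solar power is growing. Solar power costs fell. The weather was nice."
print(extractive_summary(text))
# ['Solar power is growing.', 'Solar power costs fell.']
```

Here the sentence scores are about 1.67, 1.5 and 1.0, so the mean of about 1.39 keeps the two sentences that share the frequent words "solar" and "power". Lowering `factor` to 0.5, as the notebook does, would keep all three.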

## Abstractive Summarizer

Abstractive summarization generates a meaningful summary, possibly using new words that do not appear in the original text.

In [9]:
from transformers import pipeline

pipe=pipeline("summarization",model='t5-small')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [10]:
from nltk.tokenize import sent_tokenize

#Uses the pre-trained t5-small model to summarize the text
#For each text article, it first runs the extractive summarizer, then passes the result to the abstractive summarizer.
def summarize(texts):
    summary=[]
    
    for text in texts:
        summary_results = _run_article_summary(text)
        pipe_out=pipe(summary_results)
        summary.append("\n".join(sent_tokenize(pipe_out[0]['summary_text'])))
    
    return summary
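
The tokenizer warning in the cell output notes that t5-small works with sequences of up to 512 tokens. One possible way to stay under such a limit, which the notebook does not use, is to split a long extractive summary into smaller pieces and summarize each piece separately. The sketch below only shows the splitting step. The helper name `chunk_by_words` and the default of 300 words are assumptions, and a word count is only a rough proxy for the model's token count.

```python
def chunk_by_words(text, max_words=300):
    """Hypothetical helper: split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

long_text = "energy " * 700
chunks = chunk_by_words(long_text.strip())
print([len(chunk.split()) for chunk in chunks])  # [300, 300, 100]
```

Each chunk could then be passed to `pipe` in turn and the partial summaries joined, at the cost of losing context across chunk boundaries.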

# Aggregation

Combines the summaries of all results into the final combined output.

In [11]:
#print the summaries grouped by data source; the slicing assumes every source returned exactly k results
def aggregate_results(query,sites,summaries,k):
    print('The Query is :',query,'\n')
    print(f'The Top {k} Results are :')

    print('\n Results from IEA :')
    for site,summary in zip(sites[:k],summaries[:k]):
        print('According to',site,':')
        print('Summary: '+summary)
        print('\n')

    print('Results from National Energy Technology Laboratory :')
    for site,summary in zip(sites[k:2*k],summaries[k:2*k]):
        print('According to',site,':')
        print('Summary: '+summary)
        print('\n')

    print('Results from U.S. Energy Information Administration :')
    for site,summary in zip(sites[2*k:3*k],summaries[2*k:3*k]):
        print('According to',site,':')
        print('Summary: '+summary)
        print('\n')
    
    print('Results from International Renewable Energy Agency :')
    for site,summary in zip(sites[3*k:],summaries[3*k:]):
        print('According to',site,':')
        print('Summary: '+summary)
        print('\n')

    return
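
The function above attributes results to sources by position, which only works when every crawler returns exactly k URLs. A minimal alternative sketch groups results by the URL's host name instead, so a source that returns fewer results does not shift the others. The name `group_by_source` is hypothetical, and the host-to-name mapping is an assumption based on the result URLs shown in the sample run.

```python
from urllib.parse import urlparse

# Assumed mapping, taken from the host names seen in the sample results.
SOURCE_NAMES = {
    'www.iea.org': 'IEA',
    'netl.doe.gov': 'National Energy Technology Laboratory',
    'www.eia.gov': 'U.S. Energy Information Administration',
    'www.irena.org': 'International Renewable Energy Agency',
}

def group_by_source(sites, summaries):
    """Hypothetical helper: return {source name: [(site, summary), ...]}."""
    grouped = {}
    for site, summary in zip(sites, summaries):
        host = urlparse(site).netloc
        name = SOURCE_NAMES.get(host, host)  # fall back to the raw host name
        grouped.setdefault(name, []).append((site, summary))
    return grouped

demo = group_by_source(
    ['https://www.iea.org/news/a', 'https://netl.doe.gov/node/12228'],
    ['first summary', 'second summary'],
)
for name, items in demo.items():
    print(name, len(items))
```

Returning a dictionary rather than printing also lets the caller decide how to present the grouped summaries.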

# Runs

A demonstration of the system using a sample query and k value.

In [12]:
query='climate change'
k=1

In [13]:
from bs4 import BeautifulSoup

sites=return_topk(query,k)
sites[0]

'https://www.iea.org/news/iea-executive-director-and-china-s-special-envoy-on-climate-change-discuss-global-efforts-to-reach-net-zero-emissions'

In [14]:
sites

['https://www.iea.org/news/iea-executive-director-and-china-s-special-envoy-on-climate-change-discuss-global-efforts-to-reach-net-zero-emissions',
 'https://netl.doe.gov/node/12228',
 'https://www.eia.gov/outlooks/steo/',
 'https://www.irena.org/News/pressreleases/2023/Jan/Renewables-Can-Provide-Nearly-60-Per-Cent-of-Nigerias-Energy-Demand-by-2050']

In [15]:
plain_texts=scrapper_agg(sites)
plain_texts[0][:500]

'IEA Executive Director and China’s Special Envoy on Climate Change discuss global efforts to reach net-zero emissions - News - IEA IEA Close Search Submit IEA Skip navigation Countries Find out about the world, a region, or a country All countries circle-arrow Explore world circle-arrow Member countries Australia Austria Belgium Canada Czech Republic Denmark Estonia Finland France Germany Greece Hungary Ireland Italy Japan Korea Lithuania Luxembourg Mexico New Zealand Norway Poland Portugal Slov'

In [16]:
summaries=summarize(plain_texts)
summaries[0]

Token indices sequence length is longer than the specified maximum sequence length for this model (1651 > 512). Running this sequence through the model will result in indexing errors


'IEA Executive Director and China’s special Envoy on Climate Change Xie Zhenhua discussed global efforts to reach net zero emissions 25 June 2021 . they discussed major energy and climate issues including the findings of the recent global roadmap to net zero by 2050 . he said that the whole of society will need to play a role in helping the world to realise the Paris Agreement goals .'

In [17]:
aggregate_results(query,sites,summaries,k)

The Query is : climate change 

The Top 1 Results are :

 Results from IEA :
According to https://www.iea.org/news/iea-executive-director-and-china-s-special-envoy-on-climate-change-discuss-global-efforts-to-reach-net-zero-emissions :
Summary: IEA Executive Director and China’s special Envoy on Climate Change Xie Zhenhua discussed global efforts to reach net zero emissions 25 June 2021 . they discussed major energy and climate issues including the findings of the recent global roadmap to net zero by 2050 . he said that the whole of society will need to play a role in helping the world to realise the Paris Agreement goals .


Results from National Energy Technology Laboratory :
According to https://netl.doe.gov/node/12228 :
Summary: NETL closer to Achieving Critical Decarbonization Goals Director’s Corner by Brian Anderson, Ph.D. January 02, 2023 . the great american thinker once said, “Act as if what you do makes a difference.
It does.” .
our work has helped bring the nation closer tha