In [1]:
import requests 
from bs4 import BeautifulSoup

In [2]:
url = "https://www.nbr.org/publication/enhancing-clean-energy-cooperation-in-the-indo-pacific/"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP error responses

In [3]:
soup = BeautifulSoup(response.content, "html.parser")

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Enhancing Clean Energy Cooperation in the Indo-Pacific | The National Bureau of Asian Research (NBR)
  </title>
  <link href="https://www.nbr.org/xmlrpc.php" rel="pingback"/>
  <link href="https://www.nbr.org/wp-content/themes/nbr-theme/build/css/main.css" rel="stylesheet" type="text/css"/>
  <link href="https://www.nbr.org/wp-content/themes/nbr-theme/style.css" media="screen" rel="stylesheet" type="text/css"/>
  <script crossorigin="anonymous" defer="" integrity="sha384-3yBLeJ4waqGSAf4A8pjZ13UF7GuhgbdKnBQvIp/TkWoXtQbtwjlIPNjkDRJ46UCn" src="https://pro.fontawesome.com/releases/v5.5.0/js/all.js">
  </script>
  <meta content="max-image-preview:large" name="robots"/>
  <link href="//code.jquery.com" rel="dns-prefetch"/>
  <

In [5]:
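main_content = soup.find('main')
if main_content is None:
    raise ValueError("No <main> element found in the page")
# Remove links and emphasized spans, including the text inside them
for element in main_content.find_all(['a', 'em']):
    element.decompose()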
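Because `decompose()` deletes a tag together with its text, any words that sat inside a link or an emphasized span disappear from the surrounding sentence. The sketch below compares it with `unwrap()`, which drops only the markup. The HTML snippet and variable names are made up for illustration and are not taken from the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the scraped page
html = "<main><p>See the <a href='#'>IPCC report</a> for <em>details</em>.</p></main>"

# decompose() deletes each tag together with the text inside it
removed = BeautifulSoup(html, "html.parser")
for tag in removed.find("main").find_all(["a", "em"]):
    tag.decompose()
text_removed = removed.find("p").get_text()

# unwrap() deletes only the tag and leaves its text in place
kept = BeautifulSoup(html, "html.parser")
for tag in kept.find("main").find_all(["a", "em"]):
    tag.unwrap()
text_kept = kept.find("p").get_text()

print(repr(text_removed))
print(repr(text_kept))
```

Which call fits depends on whether the linked or emphasized words are part of the text to be analyzed.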

In [6]:
text_list = []
# Marker text found in the last paragraph to keep
stop_text = "James Bowen is a Policy Fellow at the Perth USAsia Centre."
for paragraph in main_content.find_all('p'):
    paragraph_text = paragraph.get_text(separator=' ', strip=False)
    text_list.append(paragraph_text)

    # Stop once the paragraph containing the marker has been collected
    if stop_text in paragraph_text:
        break
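The stop-marker loop can be tried on plain strings, without any HTML. This is a minimal sketch; the helper name `collect_until` and the sample strings are hypothetical. It also shows what happens when the marker never appears: nothing is cut off and every paragraph is kept.

```python
def collect_until(paragraphs, stop_text):
    """Collect strings up to and including the first one containing stop_text."""
    collected = []
    for text in paragraphs:
        collected.append(text)
        if stop_text in text:
            break
    return collected


# Made-up sample data for illustration
sample = [
    "Intro paragraph.",
    "Body paragraph.",
    "Jane Doe is a fellow.",
    "Related articles",
]

kept = collect_until(sample, "is a fellow")        # stops at the third item
missing = collect_until(sample, "no such marker")  # marker absent: keeps all items

print(len(kept), len(missing))
```

If the page wording changes so that the marker no longer matches, the loop silently collects trailing paragraphs too, so a check on the last collected item may be worth adding.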

In [7]:
article_text = '\n'.join(text_list)

In [8]:
print(article_text)

James Bowen argues that clean energy cooperation would be a win for both the climate and stability of the long-standing Indo-Pacific order and urges the United States and other advanced regional economies to revive the spirit of common cause that followed past energy crises.
Ensuring a rapid global transition to clean energy systems is the overriding priority of international climate action. Cross-border cooperation in this space is critical, yet it has proved exceedingly difficult at the all-inclusive UN-led level. Smaller avenues of parallel activity could ultimately deliver more meaningful progress. Collaborative efforts that simultaneously allow space for the advancement of national economic and strategic positions are a prominent feature of current Indo-Pacific relations. They merit sustained commitment from the United States and its regional allies and partners, particularly at a time of upheaval in global energy markets.
The Intergovernmental Panel on Climate Change’s April 2022

In [9]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [11]:
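import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Assumption: the NLTK data used below is not installed yet; newer NLTK
# releases need 'punkt_tab' for tokenization, older ones need 'punkt'
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)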

In [12]:
def preprocess_text(text):
    """Return lowercased, stop-word-filtered, lemmatized tokens from text."""
    # Sentence tokenization (not used in the returned value)
    sentences = sent_tokenize(text)

    # Word tokenization on the lowercased text
    words = word_tokenize(text.lower())

    # Keep alphanumeric tokens that are not English stop words
    stop_words = set(stopwords.words('english'))
    words_filtered = [word for word in words if word.isalnum() and word not in stop_words]

    # Lemmatization and stemming; only the lemmatized words are returned
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words_filtered]
    stemmed_words = [stemmer.stem(word) for word in words_filtered]

    # Regular expression for additional cleaning; isalnum() above has already
    # dropped every token containing punctuation, so this changes nothing
    words_cleaned = [re.sub(r'\W+', '', word) for word in lemmatized_words]

    return words_cleaned
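The order of the `isalnum()` filter and the regular-expression cleanup matters. The standard-library sketch below uses a hand-written token list (an assumption standing in for tokenizer output) to show that filtering first discards hyphenated or dotted tokens outright, while cleaning first keeps them in a collapsed form.

```python
import re

# Hand-written tokens standing in for tokenizer output (illustrative only)
tokens = ["clean", "energy", ",", "indo-pacific", "2022", "u.s."]

# Order used in preprocess_text: filter with isalnum(), then strip non-word characters
filtered = [t for t in tokens if t.isalnum()]
cleaned_after_filter = [re.sub(r"\W+", "", t) for t in filtered]

# Alternative order: strip non-word characters first, then drop empty strings
stripped = [re.sub(r"\W+", "", t) for t in tokens]
cleaned_first = [t for t in stripped if t]

print(cleaned_after_filter)
print(cleaned_first)
```

Neither order is right in general. The second keeps terms such as "indo-pacific" as a single merged token, which may or may not suit the later analysis.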

In [13]:
processed_words = preprocess_text(article_text)