## NLP Website Scraping

In [1]:
import nltk
import urllib.request  # urlopen lives in the request submodule, which a bare "import urllib" does not load
import bs4 as bs
import re
from nltk.corpus import stopwords

In [None]:
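# Download the stopword list and the Punkt models that sent_tokenize/word_tokenize rely on
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')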

In [2]:
# Get the data source
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming').read()

In [3]:
# Parsing the data / creating Beautiful Soup object
soup = bs.BeautifulSoup(source, 'lxml')

# Fetching the data
text = "" 
for paragraph in soup.find_all('p'):
    text += paragraph.text

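# Preprocessing the data
text = re.sub(r'\[[0-9]*\]', '', text)  # drop citation markers such as [12]
text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace
text = text.lower()
text = re.sub(r'\d', '', text)          # remove digits
text = re.sub(r'\s+', ' ', text)        # collapse whitespace left behind by digit removal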
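
To see what the preprocessing steps do, here is a minimal sketch that applies the same substitutions to a short made-up string. The helper name `clean_text` and the sample sentence are assumptions for illustration and are not part of the notebook.

```python
import re

def clean_text(raw: str) -> str:
    """Apply the same cleanup steps as the preprocessing cell to a raw string."""
    text = re.sub(r'\[[0-9]*\]', '', raw)  # drop citation markers like [12]
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
    text = text.lower()
    text = re.sub(r'\d', '', text)         # remove digits
    text = re.sub(r'\s+', ' ', text)       # collapse whitespace again
    return text

# Hypothetical sample input, not taken from the scraped page
sample = "Temperatures rose 1.1 degrees[12]\nsince  1850."
cleaned = clean_text(sample)
print(cleaned)
```

Removing digits leaves the surrounding punctuation behind (for example, `1.1` becomes a lone `.`), which is worth keeping in mind before tokenizing.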


## **lxml** 
lxml is a Python library for parsing, processing, and manipulating XML and HTML documents. It is built on top of the libxml2 and libxslt C libraries, which makes it fast and robust. In the cell above, `'lxml'` is passed to `BeautifulSoup` to select lxml as the underlying parser.

In the context of XML and HTML, **parsing** refers to the process of analyzing a document's structure and converting it into a tree-like representation that can be programmatically manipulated. This allows developers to access, extract, or modify specific parts of the document using code.

### **Key Aspects of Parsing**

1. **Input Format**  
   The input to a parser is typically:
   - A string of XML or HTML content.
   - A file containing XML or HTML.
   - A URL pointing to an XML or HTML document.

2. **Output Structure**  
   The output is a **document tree** (or DOM—Document Object Model), where each node represents an element, attribute, or text in the document.
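
The input-to-tree step described above can be sketched with the standard library's `xml.etree.ElementTree`, whose element API `lxml.etree` largely mirrors. The sample document and the `walk` helper are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a string of well-formed XML
doc = "<article><title>Climate</title><p id='intro'>Warming is <b>ongoing</b>.</p></article>"

# Parsing turns the raw string into a tree of Element objects
root = ET.fromstring(doc)

def walk(node, depth=0):
    """Yield (depth, tag, attributes, leading text) for each element in the tree."""
    yield depth, node.tag, dict(node.attrib), (node.text or "").strip()
    for child in node:
        yield from walk(child, depth + 1)

rows = list(walk(root))
for depth, tag, attrib, text in rows:
    print("  " * depth + f"{tag} {attrib} {text!r}")
```

Each element exposes its tag, its attributes, and the text that directly follows its opening tag, so specific parts of the document can be reached by walking the tree rather than searching the raw string.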

---

### **Why Parsing is Important**
Parsing transforms a document from raw text into a structured format that is easier to work with programmatically. For instance:
- **XML Parsing** checks that the document is well-formed under XML's strict syntax rules; a strict parser rejects input that is not.
- **HTML Parsing** is crucial when working with web data, as HTML can often be malformed or inconsistent.

---

### **Types of Parsers**
- **Strict Parsers**: Used for well-formed documents, like XML. These parsers enforce rules.
- **Lenient Parsers**: Used for messy or malformed documents, like some HTML. They attempt to correct errors during parsing.
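
The difference between the two parser types can be illustrated with a rough sketch that uses only the standard library: `xml.etree.ElementTree` as the strict parser and `html.parser.HTMLParser` as a tolerant one. The malformed snippet and the `TextCollector` class are assumptions invented for this example, and `HTMLParser` only tolerates the broken markup rather than repairing it the way lxml's HTML parser does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed markup: no tag is ever closed
broken = "<p>Unclosed paragraph<br><b>bold text"

# Strict: the XML parser refuses input that is not well-formed
try:
    ET.fromstring(broken)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

class TextCollector(HTMLParser):
    """Record the start tags and text seen while parsing, without raising on bad markup."""
    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)

# Lenient: the HTML parser keeps going and still surfaces tags and text
collector = TextCollector()
collector.feed(broken)
collector.close()

print("strict parser accepted input:", strict_ok)
print("tags seen by lenient parser:", collector.tags)
print("text seen by lenient parser:", "".join(collector.chunks))
```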

---

### **Common Python Libraries for Parsing**
1. **`lxml`**: Advanced and efficient for both XML and HTML.
2. **`xml.etree.ElementTree`**: Simpler, built into Python for XML parsing.
3. **`BeautifulSoup`**: A lenient wrapper commonly used for HTML; it delegates the actual parsing to a backend such as `lxml` or Python's built-in `html.parser`.

Parsing is the foundational step in web scraping and data extraction workflows, enabling developers to efficiently process structured or semi-structured data.


In [4]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

In [6]:
type(sentences)

list

In [7]:
sentences

[['present-day',
  'climate',
  'change',
  'includes',
  'both',
  'global',
  'warming—the',
  'ongoing',
  'increase',
  'in',
  'global',
  'average',
  'temperature—and',
  'its',
  'wider',
  'effects',
  'on',
  'earth',
  "'s",
  'climate',
  '.'],
 ['climate',
  'change',
  'in',
  'a',
  'broader',
  'sense',
  'also',
  'includes',
  'previous',
  'long-term',
  'changes',
  'to',
  'earth',
  "'s",
  'climate',
  '.'],
 ['the',
  'current',
  'rise',
  'in',
  'global',
  'temperatures',
  'is',
  'driven',
  'by',
  'human',
  'activities',
  ',',
  'especially',
  'fossil',
  'fuel',
  'burning',
  'since',
  'the',
  'industrial',
  'revolution',
  '.'],
 ['fossil',
  'fuel',
  'use',
  ',',
  'deforestation',
  ',',
  'and',
  'some',
  'agricultural',
  'and',
  'industrial',
  'practices',
  'release',
  'greenhouse',
  'gases',
  '.'],
 ['these',
  'gases',
  'absorb',
  'some',
  'of',
  'the',
  'heat',
  'that',
  'the',
  'earth',
  'radiates',
  'after',
  'it',