A Web Data ETL (Extract, Transform, Load) pipeline is a data engineering process that collects data from online sources, transforms it for analysis, and loads it into a storage target such as a database for reporting and decision-making. In this notebook, the extraction step downloads a web page, the transformation step cleans, filters, and structures its text into word frequencies, and the loading step places the result in a pandas DataFrame.

In [1]:
# Install the third-party dependencies first: pip install requests beautifulsoup4 nltk pandas

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ferzi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
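class WebScraper:
    def __init__(self, url):
        self.url = url

    def extract_article_text(self):
        # Download the page; the timeout keeps a stalled server from hanging the pipeline
        response = requests.get(self.url, timeout=10)
        # Raise an exception on HTTP errors (4xx/5xx) instead of parsing an error page
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        # get_text() returns all text in the HTML, not only the article body
        article_text = soup.get_text()
        return article_text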

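In the code above, the WebScraper class downloads the page at the given URL and returns all of the text in its HTML, which includes navigation menus and reference lists as well as the article body. Next, we clean and preprocess the extracted text so that word frequencies can be computed for analysis.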
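
Since get_text() is called on the whole document, text from reference lists ends up in the word counts, which is likely why tokens such as "isbn" and "retrieved" rank so high in the final output. The sketch below shows one possible way to narrow the extraction to paragraph text. The sample HTML and the extract_paragraph_text helper are hypothetical and not part of the original notebook, and no network request is made.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration; no network request is made.
SAMPLE_HTML = """
<html><body>
  <nav>Main page Contents</nav>
  <p>Canada is a country in North America.</p>
  <p>Its capital is Ottawa.</p>
  <ol class="references"><li>Retrieved 1 January 2020. ISBN placeholder.</li></ol>
</body></html>
"""


def extract_paragraph_text(html):
    """Return only the text found inside <p> tags, joined by spaces."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs)


full_text = BeautifulSoup(SAMPLE_HTML, "html.parser").get_text()
paragraph_text = extract_paragraph_text(SAMPLE_HTML)

print("ISBN" in full_text)       # True: the reference list is part of the full text
print("ISBN" in paragraph_text)  # False: only <p> content is kept
print(paragraph_text)            # Canada is a country in North America. Its capital is Ottawa.
```

Whether paragraph-only extraction is appropriate depends on the page: content held in tables, lists, or headings is dropped as well.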

In [3]:
class TextProcessor:
    def __init__(self, nltk_stopwords):
        self.nltk_stopwords = nltk_stopwords

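    def tokenize_and_clean(self, text):
        # Split on whitespace; punctuation stays attached to the neighboring word
        words = text.split()
        # Keep purely alphabetic tokens that are not stopwords, lowercased
        filtered_words = [
            word.lower()
            for word in words
            if word.isalpha() and word.lower() not in self.nltk_stopwords
        ]
        return filtered_words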

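The TextProcessor class handles text cleaning. It splits the text on whitespace, lowercases each token, and drops stopwords and any token that is not purely alphabetic, a common preprocessing step in text analysis and NLP. Because the alphabetic check is applied to whole tokens, words with attached punctuation (such as "Canada," at the end of a clause) are dropped as well. To use the class, instantiate TextProcessor with a collection of stopwords and call its tokenize_and_clean method to get a list of cleaned words from the input text.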
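
As a small illustration of this behavior, the sketch below applies the same filtering logic to a sample sentence and compares it with a regex-based alternative that keeps words followed by punctuation. The stopword set, the sample sentence, and both function names are assumptions made for the example, so that it runs without downloading NLTK data.

```python
import re

# Small hand-picked stopword set, assumed here so the example runs without NLTK data.
SAMPLE_STOPWORDS = {"is", "a", "in", "the"}


def tokenize_whitespace(text, stop_words):
    """Same filtering logic as TextProcessor.tokenize_and_clean."""
    return [
        word.lower()
        for word in text.split()
        if word.isalpha() and word.lower() not in stop_words
    ]


def tokenize_regex(text, stop_words):
    """Alternative: extract runs of ASCII letters so punctuation does not discard words."""
    return [word for word in re.findall(r"[a-z]+", text.lower()) if word not in stop_words]


sample = "Canada is a country in North America. The capital is Ottawa,"

print(tokenize_whitespace(sample, SAMPLE_STOPWORDS))
# ['canada', 'country', 'north', 'capital']
print(tokenize_regex(sample, SAMPLE_STOPWORDS))
# ['canada', 'country', 'north', 'america', 'capital', 'ottawa']
```

The regex variant has its own trade-offs: it splits contractions such as "don't" into two tokens and ignores letters outside a to z, so it is only one possible refinement.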

In [4]:
class ETLPipeline:
    def __init__(self, url):
        self.url = url
        self.nltk_stopwords = set(stopwords.words("english"))

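    def run(self):
        # Extract: download the page and pull out its text
        scraper = WebScraper(self.url)
        article_text = scraper.extract_article_text()

        # Transform: tokenize, lowercase, and drop stopwords and non-alphabetic tokens
        processor = TextProcessor(self.nltk_stopwords)
        filtered_words = processor.tokenize_and_clean(article_text)

        # Load: count the words and store the counts in a DataFrame, most frequent first
        word_freq = Counter(filtered_words)
        df = pd.DataFrame(word_freq.items(), columns=["Words", "Frequencies"])
        df = df.sort_values(by="Frequencies", ascending=False)
        return df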
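
The run method stops at an in-memory DataFrame. If the word counts should be kept in a database, one option is to write them to SQLite with Python's built-in sqlite3 module. The following is a minimal sketch under that assumption: the word_frequencies table, the load_word_frequencies function, and the sample counts are hypothetical and not part of the original pipeline.

```python
import sqlite3
from collections import Counter


def load_word_frequencies(word_freq, connection):
    """Write (word, frequency) pairs into a word_frequencies table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS word_frequencies ("
        "word TEXT PRIMARY KEY, frequency INTEGER NOT NULL)"
    )
    # INSERT OR REPLACE keeps a rerun from failing on the primary key
    connection.executemany(
        "INSERT OR REPLACE INTO word_frequencies (word, frequency) VALUES (?, ?)",
        word_freq.items(),
    )
    connection.commit()


# Hypothetical counts standing in for Counter(filtered_words)
sample_freq = Counter(["canada", "canada", "canadian", "university", "canada"])

connection = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
load_word_frequencies(sample_freq, connection)

top_rows = connection.execute(
    "SELECT word, frequency FROM word_frequencies "
    "ORDER BY frequency DESC, word LIMIT 2"
).fetchall()
print(top_rows)  # [('canada', 3), ('canadian', 1)]
```

Inside ETLPipeline.run, the same function could be called with word_freq before the DataFrame is built, so the DataFrame remains available for quick inspection.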

In [6]:
if __name__ == "__main__":
    article_url = "https://en.wikipedia.org/wiki/Canada"
    pipeline = ETLPipeline(article_url)
    result_df = pipeline.run()
    print(result_df.head())

           Words  Frequencies
2299        isbn          282
0         canada          268
351     canadian          249
986   university          124
2284   retrieved           94
