## Main Task
Scrape data from a web page and build a data ETL pipeline from the collected data!  
> A pipeline to scrape textual data from any article on the web!

1. The process begins with data extraction, where relevant information is collected from websites, APIs, databases, and other online sources.  
2. Then this raw data is transformed through various operations, including cleaning, filtering, structuring, and aggregating.  
3. After transformation, the data is loaded into a CSV file or database, making it accessible for further analysis, reporting, and decision-making.  
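
The three steps above can be sketched end to end on an in-memory string. This is only a minimal illustration, not the pipeline built in this notebook: the function names `extract`, `transform`, and `load` and the sample sentence are assumptions, and only the standard library is used so the example runs without network access.

```python
import csv
import io
from collections import Counter


def extract(source):
    """Extract step: the 'source' here is already a raw string (a stand-in for a web page)."""
    return source


def transform(raw_text):
    """Transform step: keep alphabetic tokens, lowercase them, and count them."""
    words = [w.lower() for w in raw_text.split() if w.isalpha()]
    return Counter(words)


def load(word_counts, out_file):
    """Load step: write (word, frequency) rows as CSV, most frequent first."""
    writer = csv.writer(out_file)
    writer.writerow(["word", "frequency"])
    for word, freq in word_counts.most_common():
        writer.writerow([word, freq])


raw = extract("Data pipelines move data from a source to a sink")
counts = transform(raw)
buffer = io.StringIO()  # stands in for a CSV file on disk
load(counts, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping `io.StringIO()` for `open("words.csv", "w", newline="")` would write the same rows to a file; the file name is hypothetical.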

In [None]:
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import nltk
nltk.download('stopwords')

1. Extracting text from any article on the web!

In [None]:
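class WebScrapper:
    def __init__(self, url):
        self.url = url
    
    def extract_article_text(self):
        # Fetch the page; the 10-second timeout is an arbitrary choice that keeps a stalled server from hanging the pipeline.
        response = requests.get(self.url, timeout=10)
        # Fail early on HTTP errors (4xx/5xx) instead of parsing an error page.
        response.raise_for_status()
        html_content = response.content
        # Parse the HTML and return all the text found on the page.
        soup = BeautifulSoup(html_content, "html.parser")
        article_text = soup.get_text()
        return article_text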
"""
The class provides a convenient way to extract the text content of an article from a given web page URL.  
By creating an instance of the class and calling the 'extract_article_text' method, we can retrieve the textual data of the article. Note that 'get_text' returns the text of the whole page, so navigation menus and footers come along with the article body.
"""

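To make the extraction idea concrete without a network call, here is a rough sketch that pulls visible text out of an HTML string and skips anything inside `<script>` or `<style>`. It uses the standard library's `html.parser` rather than BeautifulSoup, purely so it runs with no third-party packages; the class `VisibleTextParser`, the helper `visible_text`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Hello</h1><p>First paragraph.</p>
<script>console.log("ignored");</script></body></html>
"""
page_text = visible_text(sample_html)
print(page_text)
```
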
2. As we want to store the frequency of each word in the article, we need to clean and preprocess the data!

In [None]:
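class TextProcessor:
    def __init__(self, nltk_stepwords):
        # Collection of stopwords to drop, e.g. stopwords.words('english') from NLTK.
        self.nltk_stepwords = nltk_stepwords
    
    def tokenize_and_clean(self, text):
        # Split on whitespace; tokens with attached punctuation (e.g. "data,") fail isalpha() and are dropped.
        words = text.split()
        filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in self.nltk_stepwords]
        return filtered_words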
"""
The class provides a way to process text data by tokenizing it into words and cleaning those words by removing non-alphabetic words and stopwords.
"""
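
One consequence of splitting on whitespace and then testing `isalpha()` is that words touching punctuation are discarded entirely. The sketch below compares that behaviour with a regex-based tokenizer that strips the punctuation instead. It is an alternative offered for comparison, not the notebook's method; the function names and the tiny `SAMPLE_STOPWORDS` set are hypothetical stand-ins for NLTK's English list.

```python
import re

# Small hypothetical stopword set; the notebook uses NLTK's list instead.
SAMPLE_STOPWORDS = {"the", "is", "a", "and", "of"}


def tokenize_with_regex(text, stopwords):
    """Lowercase the text, pull out runs of ASCII letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def tokenize_with_split(text, stopwords):
    """Same filtering as tokenize_and_clean, reproduced here for comparison."""
    return [w.lower() for w in text.split() if w.isalpha() and w.lower() not in stopwords]


sentence = "The pipeline is simple, and the pipeline works."
regex_tokens = tokenize_with_regex(sentence, SAMPLE_STOPWORDS)
split_tokens = tokenize_with_split(sentence, SAMPLE_STOPWORDS)
print(regex_tokens)
print(split_tokens)
```

The regex version keeps "simple" and "works", which the split version loses because of the trailing comma and period. The trade-off is that it also breaks contractions such as "don't" into separate fragments.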