
In [2]:
!pip install beautifulsoup4



In [3]:
!pip install nltk



In [4]:
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import nltk

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

**Extracting the text of any web article starts here:**

In [6]:
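class WebScraper:
  """Fetches a web page and returns its text content."""

  def __init__(self, url):
    self.url = url

  def extract_article_text(self):
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(self.url, timeout=30)
    # Fail early on HTTP errors (4xx/5xx) instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # get_text() returns the text of the whole page (menus, footers,
    # citations), not only the article body
    return soup.get_text()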
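
Because `get_text()` is called on the whole document, menu, footer, and citation text ends up in the result alongside the article. One way to narrow the extraction is to keep only paragraph elements. The following is a minimal sketch under that assumption; `extract_paragraph_text` is a hypothetical helper, not part of the notebook, and it takes an HTML string so it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def extract_paragraph_text(html_content):
    """Return the text of all <p> elements, joined by newlines.

    Hypothetical helper: assumes the article body lives in <p> tags,
    which holds for many pages but not all.
    """
    soup = BeautifulSoup(html_content, "html.parser")
    paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
    # Skip empty paragraphs left over from layout markup
    return "\n".join(text for text in paragraphs if text)


sample_html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <p>Pakistan is a country in <b>South Asia</b>.</p>
  <p></p>
  <footer>Privacy policy</footer>
</body></html>
"""
paragraph_text = extract_paragraph_text(sample_html)
print(paragraph_text)
```

Whether `<p>` is the right selector depends on the site; some pages keep article text in other containers, so the selector may need adjusting per source.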


**Clean and preprocess the text extracted from the article, because the pipeline stores the frequency of each word in the article:**

In [7]:
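class TextProcessor:
  def __init__(self, nltk_stopwords):
    self.nltk_stopwords = nltk_stopwords

  def tokenize_and_clean(self, text):
    # Split on whitespace; tokens containing digits or attached punctuation
    # (e.g. "Pakistan,") fail isalpha() and are dropped
    words = text.split()
    # Lowercase the remaining tokens and drop English stopwords
    filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in self.nltk_stopwords]
    return filtered_words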
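
Splitting on whitespace leaves punctuation attached, so a token such as `Pakistan,` fails `isalpha()` and is discarded rather than counted. A regular expression that pulls out runs of letters is one alternative. The sketch below is an assumption about how that could look, not the notebook's method: `tokenize_with_regex` and the small `demo_stopwords` set are hypothetical stand-ins (the notebook uses the NLTK English stopword list), and the pattern matches ASCII letters only.

```python
import re


def tokenize_with_regex(text, stopwords):
    """Lowercase the text, extract runs of ASCII letters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


# Hypothetical stand-in for set(stopwords.words("english"))
demo_stopwords = {"the", "is", "a", "in", "of"}

text = "Pakistan, the fifth-most populous country, is in South Asia."

# Approach used in TextProcessor: whitespace split plus isalpha()
split_tokens = [w.lower() for w in text.split()
                if w.isalpha() and w.lower() not in demo_stopwords]
# Regex approach: punctuation no longer causes words to be dropped
regex_tokens = tokenize_with_regex(text, demo_stopwords)

print(split_tokens)  # ['populous', 'south']
print(regex_tokens)  # ['pakistan', 'fifth', 'most', 'populous', 'country', 'south', 'asia']
```

The trade-off is that hyphenated words and contractions are split into separate tokens, which may or may not be wanted for a frequency count.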

**This class ties the steps together into an ETL (Extract, Transform, Load) pipeline:**

In [8]:
class ETLPipeline:
    def __init__(self, url):
        self.url = url
        self.nltk_stopwords = set(stopwords.words("english"))

    def run(self):
        scraper = WebScraper(self.url)
        article_text = scraper.extract_article_text()

        processor = TextProcessor(self.nltk_stopwords)
        filtered_words = processor.tokenize_and_clean(article_text)

        word_freq = Counter(filtered_words)
        df = pd.DataFrame(word_freq.items(), columns=["Words", "Frequencies"])
        df = df.sort_values(by="Frequencies", ascending=False)
        return df
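
The `run` method ends by returning the sorted DataFrame, so nothing is written to storage yet. As a rough sketch of the counting and of one possible load step, the example below uses only the standard library: `Counter.most_common` returns words in descending order of count, and the `csv` module stands in for whatever destination is intended. `write_frequencies` and the sample word list are hypothetical and not part of the notebook.

```python
import csv
import io
from collections import Counter


def write_frequencies(words, output, top_n=None):
    """Count words and write (word, frequency) rows as CSV to a file-like object."""
    word_freq = Counter(words)
    writer = csv.writer(output)
    writer.writerow(["Words", "Frequencies"])
    # most_common() sorts by count, highest first; top_n=None keeps every word
    rows = word_freq.most_common(top_n)
    writer.writerows(rows)
    return rows


sample_words = ["pakistan", "world", "pakistan", "muslim", "world", "pakistan"]
# In-memory stand-in for a file opened with open(path, "w", newline="")
buffer = io.StringIO()
top_rows = write_frequencies(sample_words, buffer, top_n=2)
print(buffer.getvalue())
```

With the DataFrame already built by `run`, calling `df.to_csv(path, index=False)` would serve the same purpose.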

**Run the pipeline to scrape the text of any article on the web:**

In [13]:
if __name__ == "__main__":
    article_url = "https://en.wikipedia.org/wiki/Pakistan"
    pipeline = ETLPipeline(article_url)
    result_df = pipeline.run()
    print(result_df.head(50))

              Words  Frequencies
0          pakistan          532
3221      retrieved          446
3217           isbn          173
3218       archived          173
3219       original          172
207        december          126
57            world          124
368         january          113
56           muslim          112
199           march          101
876       pakistani           94
2348       february           93
1067          april           90
140             may           85
162        national           76
201          august           75
152         islamic           72
136           south           70
1309        october           66
772       september           66
369           first           66
284         british           62
1118  international           62
252           india           61
3220           july           60
232      population           59
3226              b           59
159           state           59
135         country           59
732       