
# **Text Data ETL Pipeline - Web Scraping**

In [1]:
import requests  # send HTTP requests from Python
from bs4 import BeautifulSoup  # parse HTML and extract information from web pages
from nltk.corpus import stopwords
from collections import Counter  # dict subclass that maps each element to its count
import pandas as pd
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
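class WebScraper:
    def __init__(self, url):
        self.url = url

    def extract_article_text(self):
        # A timeout keeps the call from hanging indefinitely on a slow server.
        response = requests.get(self.url, timeout=30)
        # Raise on 4xx/5xx so an error page is never tokenized as article text.
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        # get_text() returns the text of the whole page, not only the article body.
        return soup.get_text()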

In the above code, the WebScraper class fetches a web page and returns its text. Creating an instance with a URL and calling its extract_article_text method downloads the HTML, parses it with BeautifulSoup, and returns the result of get_text(). Note that get_text() returns the text of the entire page, including navigation menus and footers, not only the article body, so the result may need further filtering before analysis.
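One possible way to narrow the extraction to article paragraphs is sketched below. It is a minimal illustration, not the notebook's method: the helper name `extract_paragraph_text`, the `CHROME_TAGS` list, and the sample HTML are all hypothetical, and it assumes the article body lives in `<p>` elements outside `<nav>`, `<header>`, `<footer>`, and `<aside>`. The markup of any real site may differ, so the selectors would need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical list of containers that usually hold page chrome, not article text.
CHROME_TAGS = ["nav", "header", "footer", "aside"]


def extract_paragraph_text(html):
    """Return the text of <p> elements that are not inside a chrome container."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p")
        if p.find_parent(CHROME_TAGS) is None  # skip paragraphs nested in menus etc.
    ]
    return "\n".join(text for text in paragraphs if text)


# Made-up HTML used only to demonstrate the filter.
sample_html = """
<html><body>
  <nav><p>Menu</p><a href="/stocks">Stocks</a></nav>
  <article><p>Faraday Future shares rose.</p><p>Meme stocks rallied.</p></article>
  <footer><p>Terms of use</p></footer>
</body></html>
"""

print(extract_paragraph_text(sample_html))
```

Working on an HTML string rather than a URL keeps the helper easy to test offline; in the scraper it could be applied to `response.content`.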

In [4]:
class TextProcessor:
    def __init__(self, nltk_stopwords):
        self.nltk_stopwords = nltk_stopwords

    def tokenize_and_clean(self, text):
        words = text.split()
        filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in self.nltk_stopwords]
        return filtered_words

In the above code, the TextProcessor class tokenizes text by splitting it on whitespace, then keeps only the tokens that are purely alphabetic and not in the stopword list, lowercasing each one. This kind of cleaning is a common first step in text analysis and natural language processing tasks. Because the isalpha() check runs on whitespace-delimited tokens, a word with punctuation attached, such as "market," or "stock.", is discarded rather than cleaned. Creating an instance of the TextProcessor class and calling its tokenize_and_clean method returns the list of filtered, lowercase words for a given input text.
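The effect of punctuation on the whitespace-based approach can be seen in a small comparison. The sketch below is illustrative only: `split_tokenize` mirrors the logic of tokenize_and_clean, while `regex_tokenize` is a hypothetical alternative that extracts runs of ASCII letters. The tiny `STOPWORDS` set is a stand-in for the NLTK list so the example runs without downloads.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "a", "is", "and"}


def split_tokenize(text, stop_words):
    """Whitespace split plus isalpha() filter, as in tokenize_and_clean."""
    return [w.lower() for w in text.split()
            if w.isalpha() and w.lower() not in stop_words]


def regex_tokenize(text, stop_words):
    """Extract runs of ASCII letters so attached punctuation does not drop a word."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w for w in words if w not in stop_words]


sample = "The market rallied, and the stock jumped."
print(split_tokenize(sample, STOPWORDS))  # ['market', 'stock']
print(regex_tokenize(sample, STOPWORDS))  # ['market', 'rallied', 'stock', 'jumped']
```

The regex variant has its own trade-offs: it splits contractions such as "don't" into "don" and "t", and it ignores non-ASCII letters, so which approach fits depends on the text being analyzed.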

In [6]:
class ETLPipeline:
    def __init__(self, url):
        self.url = url
        self.nltk_stopwords = set(stopwords.words("english"))

    def run(self):
        scraper = WebScraper(self.url)
        article_text = scraper.extract_article_text()

        processor = TextProcessor(self.nltk_stopwords)
        filtered_words = processor.tokenize_and_clean(article_text)

        word_freq = Counter(filtered_words)
        df = pd.DataFrame(word_freq.items(), columns=["Words", "Frequencies"])
        df = df.sort_values(by="Frequencies", ascending=False)
        return df

In the above code, the ETLPipeline class ties the steps together: it extracts the page text with WebScraper, tokenizes and filters it with TextProcessor, counts word frequencies with Counter, and loads the counts into a DataFrame sorted by frequency in descending order. Creating an instance of the ETLPipeline class and calling its run method performs the complete ETL process and returns a DataFrame of the most frequent words after stopword removal. Because WebScraper calls get_text() on the whole page, some of the top entries in the output below (for example stockstop and menutop) probably come from site navigation rather than from the article itself.
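The counting and ranking step can be illustrated on its own with a short sketch. The helper `top_n_words` and the token list are hypothetical, chosen only for the example; `Counter.most_common` is part of the standard library and returns pairs ordered from most to least frequent, which gives the same ranking as sorting the DataFrame by its frequency column when there are no ties.

```python
from collections import Counter


def top_n_words(tokens, n):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


# Made-up tokens standing in for the output of tokenize_and_clean.
tokens = ["stock", "market", "stock", "news", "market", "stock"]
print(top_n_words(tokens, 2))  # [('stock', 3), ('market', 2)]
```

Building a DataFrame, as the pipeline does, is still the better fit when the counts will be joined, filtered, or exported later.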

In [10]:
#test and run

if __name__ == "__main__":
    article_url = 'https://seekingalpha.com/news/4108959-ffie-gme-amc-faraday-future-intelligent-electric-ev-stock-meme-rally-markets-ibkr'
    pipeline = ETLPipeline(article_url)
    result_df = pipeline.run()
    print(result_df.head(20))

         Words  Frequencies
27       stock           11
36   stockstop            8
0      faraday            6
35    dividend            6
75     etfstop            5
22      market            4
43        news            4
193       data            4
2          top            4
11       alpha            3
167    sosnick            3
201         ai            3
55       value            3
1       future            3
9         free            3
349     please            3
3         nvda            3
6      seeking            3
21     menutop            3
71      stocks            3
