## AI Blog Post Summarization with Hugging Face Transformers & Beautiful Soup Web Scraping
- YouTube Video - https://www.youtube.com/watch?v=JctmnczWg0U
- GitHub Repo - https://github.com/nicknochnack/Longform-Summarization-with-Hugging-Face

### 0. Installing Transformers and Importing Dependencies

In [None]:
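# The notebook also imports bs4 and requests, so install them alongside transformers.
# The pipeline additionally needs a backend such as PyTorch if your environment lacks one.
%pip install transformers beautifulsoup4 requests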

In [None]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests

### 1. Load Summarization Pipeline

In [None]:
summarizer = pipeline("summarization")

### 2. Get Blog Post
We will now use `requests` to fetch the blog page content, then scrape the page using `BeautifulSoup`.

In [None]:
URL = "https://vercel.com/blog/visual-editing"
page_body = requests.get(URL)
page_body.raise_for_status()  # stop early if the request failed

# Scrape the page: collect the title (h1) and paragraph (p) elements
soup = BeautifulSoup(page_body.text, 'html.parser')
results = soup.find_all(['h1', 'p'])

# Merge the text of all elements into one article string
text = [result.text for result in results]
ARTICLE = ' '.join(text)
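
Scraped element text often contains line breaks, runs of spaces, or non-breaking spaces. The sketch below shows one way to tidy the fragments before joining them. `merge_paragraphs` is a hypothetical helper, not part of the original notebook, and the sample strings are made up for illustration.

```python
def merge_paragraphs(texts):
    """Collapse whitespace inside each fragment and join the non-empty ones."""
    # str.split() with no argument splits on any Unicode whitespace,
    # including newlines and non-breaking spaces.
    cleaned = (' '.join(fragment.split()) for fragment in texts)
    return ' '.join(fragment for fragment in cleaned if fragment)


# Made-up fragments standing in for [result.text for result in results]
sample = ["Visual\n  Editing", "", "Edit\xa0content in place."]
print(merge_paragraphs(sample))
```

In the notebook this could be called as `ARTICLE = merge_paragraphs(text)`.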

### 3. Chunk Text
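Now we need to split the article into chunks that are small enough for the model to process. Splitting directly on `.`, `?`, and `!` would discard those characters, and we need them in our results. Instead, we append an `<eos>` marker after each one and split on the marker.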

In [None]:
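# Append an `<eos>` marker after each sentence terminator, then split on it,
# so every sentence keeps its closing punctuation
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')
sentenses = ARTICLE.split('<eos>')
sentenses[:10]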
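
One caveat of the marker approach is that it also splits inside decimals and URLs, so `3.5` becomes `3.` and `5`. As a possible alternative (a sketch, not the notebook's method), a regular expression can split only where the punctuation is followed by whitespace. `split_sentences` is a hypothetical helper name, and abbreviations such as "e.g. " would still be split.

```python
import re


def split_sentences(text):
    """Split after '.', '?' or '!' only when whitespace follows."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


sample = "Version 3.5 is out! See example.com for details. Ready?"
print(split_sentences(sample))
# ['Version 3.5 is out!', 'See example.com for details.', 'Ready?']
```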

In [None]:
MAX_CHUNK = 500  # maximum number of words per chunk
chunks = []

for sentense in sentenses:
    words = sentense.split()
    if not words:
        continue  # skip empty fragments left over from the split
    # Start a new chunk if there is none yet or the current one would overflow
    if not chunks or len(chunks[-1]) + len(words) > MAX_CHUNK:
        chunks.append(words)
    else:
        chunks[-1].extend(words)

# Join each chunk's words back into a single string
chunks = [' '.join(chunk) for chunk in chunks]
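
The same greedy packing can be wrapped in a function so it is easy to try on a small input. This is a minimal sketch: `chunk_sentences` and its `max_words` parameter are hypothetical names, and the limit counts whitespace-separated words rather than model tokens. A single sentence longer than the limit still ends up as its own oversized chunk.

```python
def chunk_sentences(sentences, max_words=500):
    """Greedily pack sentences into chunks of at most max_words words."""
    chunks = []
    current = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # ignore empty fragments
        # Close the current chunk if adding this sentence would overflow it.
        if current and len(current) + len(words) > max_words:
            chunks.append(' '.join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(' '.join(current))
    return chunks


demo = chunk_sentences(
    ["One two three.", "Four five.", "Six seven eight nine."], max_words=5
)
print(demo)
# ['One two three. Four five.', 'Six seven eight nine.']
```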

### 4. Summarize Text

In [None]:
res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)

print(f"Type of summarizer result: {type(res)}")

# Join the per-chunk summaries into a single summary
' '.join([summ['summary_text'] for summ in res])

### 5. Output to Text File

In [None]:
text = " ".join([summ['summary_text'] for summ in res])

# Write as UTF-8 so non-ASCII characters are saved consistently across platforms
with open('BlogSummary.txt', 'w', encoding='utf-8') as f:
    f.write(text)
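
The join-and-write step can be tried without running the model. The sketch below uses stand-in data shaped like the summarizer result used above (a list of dicts with a `summary_text` key). `save_summary` is a hypothetical helper, and the stripping is an assumption that the generated summaries may carry leading or trailing whitespace.

```python
import tempfile
from pathlib import Path


def save_summary(results, path):
    """Join the 'summary_text' fields and write them to path as UTF-8."""
    summary = ' '.join(item['summary_text'].strip() for item in results)
    Path(path).write_text(summary, encoding='utf-8')
    return summary


# Stand-in data, not real model output
fake_res = [{'summary_text': ' First part.'}, {'summary_text': 'Second part. '}]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / 'BlogSummary.txt'
    print(save_summary(fake_res, out))
    print(out.read_text(encoding='utf-8'))
```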